charset problem - UTF-8 2003-02-21 - By Jesus M. Salvo Jr.
Scott Eade wrote:
>Okay, I'll answer my own question:
>1. The character /u2019 will not be converted to a character reference when
>UTF-8 is used
>
Correct.
>(it will use two bytes and will not be displayed correctly in
>applications that do not correctly deal with UTF-8 - e.g. Windows notepad).
>
Notepad _can_ display Unicode characters from files that have been saved as UTF-8, as long as the font you use in Notepad has a glyph for that character. At work, we have lots of files containing Chinese characters that are saved as UTF-8, and I use the SimSun or SimHei font to view them, including XML files in UTF-8.
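
As a side note on the byte count in the quoted text: as far as I can tell, U+2019 takes three bytes in UTF-8 (E2 80 99), not two. A minimal sketch to check this, assuming Java 7 or later for StandardCharsets; the class name Utf8Bytes and the helper hex are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Xerces or any library.
final class Utf8Bytes {
    // Returns the UTF-8 bytes of s as space-separated uppercase hex.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+2019 is the right single quotation mark ("smart" apostrophe).
        System.out.println(hex("\u2019")); // prints E2 80 99
    }
}
```

An application that does not understand UTF-8 will show those three bytes as three separate characters in whatever encoding it assumes.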
When you do a "Save As", you have the option to save a file as UTF-8 (and UTF-16, I think). Notepad also puts a BOM (Byte Order Mark) at the front of the file. You can see this BOM with a hex editor.
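
If you read such a file as raw bytes, the UTF-8 BOM shows up as the three bytes EF BB BF at the very start. Here is a rough sketch of detecting and skipping it before decoding; the class and method names (BomCheck, hasUtf8Bom, decodeUtf8) are hypothetical, and it assumes the whole file is already in a byte array:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class BomCheck {
    // True if data starts with EF BB BF, the UTF-8 encoding of U+FEFF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    // Decodes data as UTF-8, skipping the BOM if one is present.
    static String decodeUtf8(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the first bytes of a file saved by Notepad as UTF-8.
        byte[] saved = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(hasUtf8Bom(saved)); // prints true
        System.out.println(decodeUtf8(saved)); // prints <a/>
    }
}
```

Skipping the BOM by hand like this is only needed when you decode the bytes yourself.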
>2. In the cases where character references are used an editing component is
>causing them to be encoded - the component is not being used in the places
>where the characters are not encoded.
>3. Windows file encodings are a PITA.
>
The default is called windows-1252, in most cases at least (it will of course be different for someone running, say, Windows Thai). It's _not_ the same as iso-8859-1. You can think of windows-1252 as a superset of iso-8859-1: it puts printable characters, such as the "smart quotes", in the 0x80-0x9F range, which iso-8859-1 reserves for control codes.
http://czyborra.com/charsets/iso8859.html
On some websites, what were supposed to be "smart quote" characters appear as question marks or as some other funny character in a non-IE browser. It turns out that the HTTP header for the web page advertised it as iso-8859-1, but the file itself was encoded in windows-1252.
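
To make the mismatch concrete, here is a small sketch that decodes the same bytes both ways. The byte 0x92 is the right single quote in windows-1252, but in iso-8859-1 it maps to the control code U+0092. The class name and sample bytes are made up, and it assumes your JRE ships the windows-1252 charset (standard JDKs do, as far as I know):

```java
import java.nio.charset.Charset;

// Hypothetical demo class for illustration only.
final class SmartQuoteMismatch {
    // Decodes data with the named charset.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "it's" with a smart apostrophe, as a windows-1252 editor would save it.
        byte[] page = {'i', 't', (byte) 0x92, 's'};

        String right = decode(page, "windows-1252");
        String wrong = decode(page, "ISO-8859-1");

        // prints windows-1252: U+2019
        System.out.printf("windows-1252: U+%04X%n", (int) right.charAt(2));
        // prints ISO-8859-1:   U+0092
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) wrong.charAt(2));
    }
}
```

A browser that takes the iso-8859-1 label at face value ends up with U+0092, which has no glyph, hence the question mark or box.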
>4. I know more now than I did before.
>
>Sorry for the noise.
>
>Scott