character encodings with Xerces 2003-05-15 - By Andy Clark
Jonathan Whitall wrote:
> I was wondering if Xerces can convert from one text
> encoding to another specified one on the fly. I have
> some data that is stored in UTF-8 in a database, and I
> want to be able to create text nodes which are in the
> set of Latin-1. If I pass UTF-8 to, say, the creator
> of a text node, can it convert this automatically, or
> do I have to lop off the bytes that I don't want
> manually?
If your application reads the UTF-8 bytes coming from the database and wants to create, for example, DOM text nodes, then it needs to convert the bytes into Java Strings before creating the nodes. This is easy to do in code.
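As a minimal sketch of that conversion, assuming the database hands back a raw byte array: the class name, the helper textNodeFromUtf8, and the sample bytes are made up for illustration, and the empty document comes from the standard JAXP factory rather than from any Xerces-specific class.

```java
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class Utf8ToTextNode {

    // Hypothetical helper: decode UTF-8 bytes into a Java String,
    // then let the DOM create a text node from that String.
    public static Text textNodeFromUtf8(Document doc, byte[] utf8Bytes) {
        String value = new String(utf8Bytes, StandardCharsets.UTF_8);
        return doc.createTextNode(value);
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("name");
        doc.appendChild(root);

        // Stand-in for a database value: "café" encoded as UTF-8 (5 bytes).
        byte[] fromDatabase = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        root.appendChild(textNodeFromUtf8(doc, fromDatabase));

        // The node holds 4 characters, not 5 bytes.
        System.out.println(root.getFirstChild().getNodeValue().length());
    }
}
```

The only encoding-aware step is the String constructor. Once the bytes are decoded, the DOM never sees UTF-8 or Latin-1 at all.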
Don't confuse the input/output encoding of a document with the encoding used to store its characters internally. Internally, Java stores all text as 16-bit Unicode characters (UTF-16 code units), regardless of the encoding the bytes arrived in. Therefore, Xerces does NOT create nodes in UTF-8 or ISO Latin-1 byte sequences; the nodes simply hold Java Strings.
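To make the distinction concrete, here is a small sketch using only the JDK (the class and method names are hypothetical). The same String yields different byte sequences depending on the encoding chosen at output time, and a character that Latin-1 cannot represent is replaced during encoding rather than silently "lopped off".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class InternalVsExternal {

    // Hypothetical helper: how many bytes does this text need in a given encoding?
    public static int byteCount(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"; U+00E9 exists in Latin-1

        System.out.println(text.length());                                  // prints 4
        System.out.println(byteCount(text, StandardCharsets.UTF_8));        // prints 5
        System.out.println(byteCount(text, StandardCharsets.ISO_8859_1));   // prints 4

        // U+20AC (euro sign) is not in ISO-8859-1, so String.getBytes
        // substitutes the charset's replacement byte, '?'.
        String euro = "\u20ac";
        byte[] latin1 = euro.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1)); // prints ?
    }
}
```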
The parser only reads an XML document into an internal representation (e.g. SAX events or a DOM tree). To write the document back to a file (or stream), you would use a serializer configured with the intended output encoding. The Xerces package comes with serializers for this purpose.
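A rough sketch of the output side follows. Note the assumption: instead of the Xerces serializer classes mentioned above, it uses the standard JAXP identity Transformer, which ships with the JDK and accepts an output encoding in the same spirit. The class name and the serialize helper are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SerializeAsLatin1 {

    // Hypothetical helper: write a DOM tree to bytes in the requested encoding.
    public static byte[] serialize(Document doc, String encoding) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, encoding);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("name");
        root.appendChild(doc.createTextNode("caf\u00e9"));
        doc.appendChild(root);

        byte[] latin1 = serialize(doc, "ISO-8859-1");

        // Decode with the same encoding just to display the result.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

The encoding is chosen once, when writing. A serializer will typically emit characters that the target encoding cannot represent as numeric character references, so nothing has to be trimmed by hand when the nodes are created.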
Does this answer your question?
-- Andy Clark * andyc@(protected)