Possible encoding related bug
2003-08-20 - By Sasa Bojanic
Hi,
I think there is an encoding-related bug in Xerces 2.5. When I use the DOM parser to parse a document containing characters that do not belong to the character set of the declared document encoding (e.g. the character ä in a document whose encoding is declared as "us-ascii"), the parser crashes. Here is the code snippet:

    try {
        DOMParser parser = new DOMParser();
        parser.parse(toParse);
    } catch (Exception ex) {
        ex.printStackTrace();
    }

* "toParse" is the path to the following document:

    <?xml version="1.0" encoding="us-ascii"?>
    <Package Id="pkg1">
        <!-- ä -->
        <PackageHeader>
            <XPDLVersion>1.0</XPDLVersion>
            <Vendor>Together</Vendor>
            <Created>2003-08-20 10:00:49</Created>
        </PackageHeader>
    </Package>

The parser crashes because of the ä character, and I get the following stack trace:

    java.io.IOException: Byte "228" is not a member of the (7-bit) ASCII character set.
        at org.apache.xerces.impl.io.ASCIIReader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XML11EntityScanner.skipSpaces(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentScannerImpl$PrologDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at XML.main(XML.java:25)

When I use Xerces 2.4, everything goes fine!

Regards,
Sasa.
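
The mismatch described above can be checked independently of Xerces. The following is a minimal sketch using only the JDK's `java.nio.charset` classes. The class name `EncodingCheck` and the helper `isValidFor` are hypothetical names chosen for illustration, and the strict decoder only approximates what a strict reader does with the declared encoding; it is not the Xerces implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Xerces.
class EncodingCheck {

    /** Returns true if the bytes decode cleanly in the given charset. */
    static boolean isValidFor(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   // Report bad bytes instead of silently replacing them.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The comment from the test document, stored as ISO-8859-1,
        // where the character ä is the single byte 228 (0xE4).
        byte[] comment = "<!-- \u00e4 -->".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("valid as us-ascii:   "
                + isValidFor(comment, StandardCharsets.US_ASCII));
        System.out.println("valid as iso-8859-1: "
                + isValidFor(comment, StandardCharsets.ISO_8859_1));
    }
}
```

If the check fails for the declared encoding, the file on disk does not match its XML declaration. Declaring the encoding the file is actually saved in (for example "iso-8859-1"), or writing the character as a numeric reference where the content allows it, would be the usual ways to bring the two into agreement.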