Slow SAX parsing of large CDATA?

2003-02-11       - By Daniel Rabe

I'm using SAX (Xerces 2.3.0 on Windows XP) to parse an XML file that can
contain large CDATA sections (where large is somewhere between 1 and 5 MB).
The data is Base64-encoded. The code works properly, but when the CDATA is
over 1 MB or so, it's very slow. A 1 MB CDATA section seems to be processed
in several seconds, but once it gets up to about 3 or 4 MB, processing time
goes up to about 10 minutes. It seems like Xerces is building up a huge
buffer of all the data before calling my characters callback. (I'd prefer to
get many characters callbacks so I can stream the data to a file, rather than
accumulating all the data in memory.) This is the stack trace I get while
it's processing. Garbage collection is also very active during this process.
It doesn't seem to matter whether my max heap is set to 128 MB or 256 MB;
the behavior is the same.

at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

Since my data is base64-encoded, I don't really need the CDATA... I can just
treat it like element data. If I do this, I get a characters callback for
each line of the encoded data, and it's wonderfully fast. Unfortunately, the
XML files that I need to process are provided by another vendor and contain
the CDATA.

Has anybody else run into this? Any workarounds, or any way to give Xerces a
hint that I want more frequent characters callbacks?

Thanks,
Daniel Rabe
drabe@(protected)
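
For readers following along, here is a minimal sketch of the kind of handler the post describes wanting: one that writes each characters callback straight to an output sink instead of accumulating the data in memory. The class name StreamingCharactersHandler and the helper parseToString are hypothetical and not taken from the original code. The sketch uses only the standard JAXP/SAX API and does not change how the parser buffers CDATA internally; how many callbacks arrive for one CDATA section is up to the parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: forwards every characters() chunk to a Writer
// as it arrives, so the handler itself never holds the whole payload.
public class StreamingCharactersHandler extends DefaultHandler {
    private final Writer out;
    private int callbacks = 0;

    public StreamingCharactersHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        callbacks++;
        try {
            // Write only the slice the parser handed us; the array may be reused.
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    // Number of characters() callbacks seen so far (useful for diagnosing chunking).
    public int getCallbackCount() {
        return callbacks;
    }

    // Hypothetical convenience method for small inputs: parses the XML and
    // returns all character data. A real use would pass a FileWriter instead.
    public static String parseToString(String xml) {
        StringWriter sink = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new StreamingCharactersHandler(sink));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return sink.toString();
    }
}
```

Logging getCallbackCount() after a parse is one way to check whether a given parser version delivers a large CDATA section in one callback or in many.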
