More on (Re: (XERCESJ-589) Bug with pattern restriction on long strings) 2007-07-08 - By Michael Glavassevich
Hi Geoff,
"Geoff M. Granum" <geoff.granum@(protected)> wrote on 07/05/2007 10:17:42 PM:
> Hello Michael, > > Everything works so far, with the exception of UNION, which technically > works fine but exposes an infinite loop bug which was being broken by a > stack overflow before. > > The regex: (((((boy)|(girl))[0-1][x-z]{2})?)|(man|woman)[0-1]?[y|n])* > With Target: boy0xxwoman1ygirl1xyman > > Loops forever (the ending is invalid, needs [0-1]?[y|n] ). Well, forever > being 'until you run out of memory'. I didn't notice this until I set my > memory back to a reasonable size, as I had the -Xmx flag set to a GB > earlier for some playing. > > So the options are > a. ) Fix it. Being a serious edge case, I'm thinking this isn't really > worth the risk. More bluntly, it would take me far more time than I want > to spend on it to become comfortable with any fix I might produce. > > b. ) Ignore the case (let the system die of OOM *eventually*, recognizing > your CPU will be pegged for a while before this happens) > > c. ) Throw an exception if the stack depth gets to some crazy level (a > million deep, for instance), perhaps adding a System variable to set the > allowable depth. > > d. ) Other?
I would go with option d: open a new JIRA issue [1] to make folks aware that this is a bug. Perhaps someone will fix it one day.
> For reference, using the standard JVM (no flags, 64MB heap space, IIRC), I > can check the DNA string appended to itself ~425 times before reaching a > depth of a million stack elements. DNA_STRING is 2273 characters long. > > I run out of memory (heap space) around 650*DNA_STRING, with a depth of > 1,477,450. (the string being 1,477,450 characters long... big shock).
Cool. That's leaps and bounds better than the current code. I think folks were getting the stack overflow with around 2000 to 3000 characters (without increasing the stack size from the default).
> DNA String runs out of memory very quickly, whereas the above regex is > pretty slow about it because it pushes and pops a huge number of stack > items, *eventually* achieving an overflow. The DNA string pushes items > until it gets an answer, then backs out in the same order. > > Also, and oddly, the 'CAPTURE' option doesn't get hit a single time in the > 7000 or so regex tests provided by the W3C group. Testing it with the full > suite, which takes a painfully long time. > > I'm moving into optimization mode now that the regex tests check out; once > I hear back as to how the group wants the new bug handled, I'll implement > it and post the code for review. > > Oh, did you want the Test Suite code I had to implement? There's a lot of > generics code in it, and it's incredibly hackish, but it's free to whoever > wants it. It is by no means robust nor complete. IntelliJ has an 'Export > to Eclipse' setting for modules, if that interests you.
If your test suite can be back-ported to Java 1.3 perhaps it could be included with the other sanity/unit tests which run off the build.xml 'test' target.
> Cheers, > -- > Geoff M. Granum > Portland, Oregon
Thanks.
[1] http://issues.apache.org/jira/browse/XERCESJ
Michael Glavassevich XML Parser Development IBM Toronto Lab E-mail: mrglavas@(protected) E-mail: mrglavas@(protected)
--------------------------------------------------------------------- To unsubscribe, e-mail: j-dev-unsubscribe@(protected) For additional commands, e-mail: j-dev-help@(protected)
|
|