j-dev@xerces.apache.org

Subject: [jira] Issue Comment Edited: (XERCESJ-970) Large comments are extremely slow to parse
From: "Henning Moll (JIRA)"
Date: Wed, 4 Mar 2009 08:33:59 -0800 PST
    [ https://issues.apache.org/jira/browse/XERCESJ-970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678773#action_12678773 ]

drscott edited comment on XERCESJ-970 at 3/4/09 8:33 AM:
--------------------------------------------------------------

I stumbled upon the same problem, and here is a small example to reproduce it. 
Please add Xerces2 explicitly to the classpath or make sure that you use a 
JRE5. The integrated version of Xerces in JRE6 does NOT show the problem 
(which suggests that the bug has been fixed there):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TTT extends DefaultHandler {

    public static void main(String[] args) throws Exception {
        new TTT().parse();
    }

    // Parses garbage.xml with the default JAXP SAX parser and prints the
    // elapsed wall-clock time in milliseconds.
    public void parse() throws Exception {
        long before = System.currentTimeMillis();
        InputSource source = new InputSource(
                new BufferedInputStream(new FileInputStream("garbage.xml")));
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        parser.parse(source, this);
        System.out.println(System.currentTimeMillis() - before);
    }
}

Create a dummy XML file "garbage.xml" with the following content:
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<blubb>
    ...
    [about 2Gig of CDATA]
    ...
</blubb>

The runtime of the example is very short with this version of the XML file. Now 
change the content to:
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<!--blubb>
    ...
    [about 2Gig of CDATA]
    ...
</blubb-->

The file now consists of a single big comment, but the runtime increases 
extremely: on my system, about 110 ms vs. 50k(!) ms.

The interesting part: The bundled Xerces of JRE5 does show this behaviour. The 
one from JRE6 does NOT.
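
For anyone who wants to repeat this without hand-editing a file, the two 
variants could be generated with something like the sketch below. This is not 
part of the original test; the class name GarbageXmlGenerator, the filler 
text, the second file name and the default size (roughly the 410KB mentioned 
in the issue description) are all assumptions of mine. One deliberate 
difference: a document consisting of nothing but a comment has no root 
element, so the sketch appends an empty <blubb/> after the comment to keep 
the file well-formed. It uses java.nio.file, so it needs Java 7 or later to 
run; the generated files can then be parsed on any JRE.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GarbageXmlGenerator {

    // Hypothetical filler line. It contains no "--", so it is legal inside a comment.
    private static final String FILLER = "    some filler text for the parser to chew on\n";

    /**
     * Writes a document whose filler is either element content or one big comment.
     * approxBytes is a lower bound on the filler size, not an exact file size.
     */
    public static void write(Path target, long approxBytes, boolean asComment)
            throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("<?xml version='1.0' encoding='UTF-8' standalone='no'?>\n");
            out.write(asComment ? "<!--blubb>\n" : "<blubb>\n");
            for (long written = 0; written < approxBytes; written += FILLER.length()) {
                out.write(FILLER);
            }
            // Assumption: add an empty root element after the comment so the
            // comment variant is still a well-formed document.
            out.write(asComment ? "</blubb-->\n<blubb/>\n" : "</blubb>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // Default size is an assumption; pass a byte count as the first argument to change it.
        long size = args.length > 0 ? Long.parseLong(args[0]) : 400L * 1024;
        write(Paths.get("garbage.xml"), size, false);
        write(Paths.get("garbage-comment.xml"), size, true);
    }
}
```

Copy garbage-comment.xml over garbage.xml (or change the file name in TTT) to 
time the comment variant.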

  
> Large comments are extremely slow to parse
> ------------------------------------------
>
>                 Key: XERCESJ-970
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-970
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: XNI
>    Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.6.1, 2.6.2
>         Environment: Windows XP running Java 1.4.2
>            Reporter: Sean Griffin
>            Priority: Minor
>
> Very large comments drastically increase the parsing time for both SAX and 
> DOM implementations.  Running the sax.Counter and dom.Counter samples with a 
> 410KB file where the entire thing is uncommented results in parse times in 
> the 100ms to 300ms range.  However, if I comment out 95% of the file and run 
> the same samples the parse times jump to between 40 and 50 seconds.  I ran 
> the same samples using the Aelfred parser shipped with Saxon 7.9 and, while 
> the file with the large comment was slower than without the comment, it 
> jumped by only 100ms or so.
> I briefly compared the code between the two parsers, and they don't look 
> significantly different when it comes to handling comments.  The only main 
> difference I noticed was around low/high byte character checks.  I suspect it 
> is an inefficiency in the XMLStringBuffer class, but I'm not seeing anything.
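
The comparison described in the issue (the same content once as element text 
and once inside a comment, run through both SAX and DOM) could be 
approximated without the Xerces samples roughly as follows. This is only a 
sketch: the class name CommentTiming, the filler text and the line count are 
my assumptions, it uses nothing but standard JAXP calls, and it measures 
whichever parser JAXP resolves on the classpath. Each timing is a single 
unwarmed run, so only large differences mean anything.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CommentTiming {

    /** Builds a document whose filler sits either in the root element or in a comment. */
    public static String buildDocument(int fillerLines, boolean asComment) {
        StringBuilder sb = new StringBuilder("<?xml version='1.0'?>\n");
        sb.append(asComment ? "<!--\n" : "<root>\n");
        for (int i = 0; i < fillerLines; i++) {
            sb.append("filler text line\n"); // hypothetical content, contains no "--"
        }
        // The comment variant gets an empty root element so it stays well-formed.
        sb.append(asComment ? "-->\n<root/>\n" : "</root>\n");
        return sb.toString();
    }

    /** Parses the document with a fresh SAX parser and returns elapsed milliseconds. */
    public static long timeSaxMillis(String xml) throws Exception {
        long start = System.nanoTime();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
        return (System.nanoTime() - start) / 1000000L;
    }

    /** Parses the document with a fresh DOM builder and returns elapsed milliseconds. */
    public static long timeDomMillis(String xml) throws Exception {
        long start = System.nanoTime();
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (System.nanoTime() - start) / 1000000L;
    }

    public static void main(String[] args) throws Exception {
        int lines = 25000; // assumption: about 425,000 characters of filler
        String plain = buildDocument(lines, false);
        String commented = buildDocument(lines, true);
        System.out.println("SAX, element content: " + timeSaxMillis(plain) + " ms");
        System.out.println("SAX, one big comment: " + timeSaxMillis(commented) + " ms");
        System.out.println("DOM, element content: " + timeDomMillis(plain) + " ms");
        System.out.println("DOM, one big comment: " + timeDomMillis(commented) + " ms");
    }
}
```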

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


