<__NS1:!DOCTYPE #text=" " html="" #text=" " PUBLIC="" #text=" " "-//W3C//DTD="" #text=" " XHTML="" #text=" " 1.0="" #text=" " Transitional//EN"="" #text=" " #text="" xmlns:__NS1="http://www.w3.org/TR/REC-html40"><__NS1:HTML><__NS1:HEAD><__NS1:TITLE>SAX Benchmark - Sax Benchmark Notes<__NS1:STYLE #text=" " type="text/css" #text=" " media="all"> @import url("./style/maven-base.css"); @import url("./style/maven-theme.css");<__NS1:LINK #text=" " rel="stylesheet" #text=" " href="./style/print.css" #text=" " type="text/css" #text=" " media="print"><__NS1:LINK><__NS1:META #text=" " http-equiv="Content-Type" #text=" " content="text/html; charset=ISO-8859-1"><__NS1:META><__NS1:BODY #text=" " class="composite"><__NS1:DIV #text=" " id="banner"><__NS1:A #text=" " href="http://www.apache.org/" #text=" " id="organizationLogo"><__NS1:IMG #text=" " alt="Apache Software Foundation" #text=" " src="http://maven.apache.org/images/jakarta-logo-blue.gif"><__NS1:IMG><__NS1:A #text=" " href="http://maven.apache.org/reference/plugins/examples/" #text=" " id="projectLogo"><__NS1:IMG #text=" " alt="Html SAX Benchmark" #text=" " src="http://maven.apache.org/images/maven.jpg"><__NS1:IMG><__NS1:DIV #text=" " class="clear"><__NS1:HR><__NS1:HR><__NS1:DIV #text=" " id="breadcrumbs"><__NS1:DIV #text=" " class="xleft"> Last published: 13 May 2005 | Doc for 1.0<__NS1:DIV #text=" " class="xright"><__NS1:DIV #text=" " class="clear"><__NS1:HR><__NS1:HR><__NS1:DIV #text=" " id="leftColumn"><__NS1:DIV #text=" " id="navcolumn"><__NS1:DIV #text=" " id="menuProject"><__NS1:H5>Project<__NS1:H5><__NS1:UL><__NS1:LI #text=" " class="none"><__NS1:A #text=" " href="index.html">Overview<__NS1:LI #text=" " class="none"><__NS1:A #text=" " href="results.html">Results<__NS1:DIV #text=" " id="menuProject_Documentation"><__NS1:H5>Project Documentation<__NS1:H5><__NS1:UL><__NS1:LI #text=" " class="none"><__NS1:STRONG><__NS1:A #text=" " href="index.html">About Html SAX Benchmark<__NS1:STRONG><__NS1:LI #text=" " 
class="collapsed"><__NS1:A #text=" " href="project-info.html">Project Info<__NS1:LI #text=" " class="collapsed"><__NS1:A #text=" " href="maven-reports.html">Project Reports<__NS1:LI #text=" " class="none"><__NS1:A #text=" " href="http://maven.apache.org/development-process.html" #text=" " class="externalLink" #text=" " title="External Link">Development Process<__NS1:A #text=" " href="http://maven.apache.org/" #text=" " title="Built by Maven" #text=" " id="poweredBy"><__NS1:IMG #text=" " alt="Built by Maven" #text=" " src="./images/logos/maven-button-1.png"><__NS1:IMG><__NS1:DIV #text=" " id="bodyColumn"><__NS1:DIV #text=" " class="contentBox"><__NS1:DIV #text=" " class="section"><__NS1:A #text=" " name="Introduction"><__NS1:H2>Introduction<__NS1:H2><__NS1:P> The Sax Benchmark project was created to test the performance, correctness and usability of the various HTML SAX parsers available. <__NS1:P><__NS1:P> <__NS1:B>NOTE:<__NS1:B> As with all benchmarks, these need to be taken with an appropriately large dose of salt. This suite is intended as a helpful starting point in evaluating the various libraries out there but if my experience thus far is anything to go by, your mileage will vary. Also, these benchmarks may well contain inaccuracies, bad methodology, bad configuration and general confusion, although none deliberate. If you spot anything that could be improved, please let me know. 
<__NS1:P><__NS1:P> The four SAX parsers under the knife are: <__NS1:UL> <__NS1:LI><__NS1:A #text=" " href="http://www.tagsoup.info/" #text=" " class="externalLink" #text=" " title="External Link">TagSoup 1.0rc2 <__NS1:LI><__NS1:A #text=" " href="http://people.apache.org/~andyc/neko/doc/html/" #text=" " class="externalLink" #text=" " title="External Link">CyberNeko 0.9.4 <__NS1:LI><__NS1:A #text=" " href="http://htmlparser.sourceforge.net/" #text=" " class="externalLink" #text=" " title="External Link">HTMLParser 1.5 <__NS1:LI><__NS1:A #text=" " href="http://hotsax.sourceforge.net/" #text=" " class="externalLink" #text=" " title="External Link">HotSax 0.1.2c <__NS1:P><__NS1:P> There are several different criteria that can be used when evaluating a particular HTML SAX parser, and it is important to keep this in mind when interpreting the results. These criteria could include correctness, speed, memory use and failsafety, and they may well be mutually exclusive. As with any benchmark, the application under review may be tailored to work far better in the real world than in a benchmark. <__NS1:P><__NS1:P> I suggest reading through the notes below before proceeding to the <__NS1:A #text=" " href="results.html">results. <__NS1:P><__NS1:P> In no particular order, my thanks to John Cowan (TagSoup), Andy Clark (CyberNeko), Ed Howland (HotSax) and Derrick Oswald (HTMLParser) for helping improve this benchmark. Any inaccuracies, inconsistencies etc. are mine, though. <__NS1:P><__NS1:DIV #text=" " class="section"><__NS1:A #text=" " name="SAX_Benchmark_Notes_and_Caveats"><__NS1:H2>SAX Benchmark Notes and Caveats<__NS1:H2><__NS1:P> <__NS1:UL> <__NS1:LI><__NS1:B>Number of runs<__NS1:B> - The number of times a benchmark is run is configurable. Several runs are required to get around the 16ms minimum resolution of System.currentTimeMillis() on Windows. <__NS1:LI><__NS1:B>JIT<__NS1:B> - In order to warm up the JIT, the parser is run once without being timed. 
<__NS1:LI><__NS1:B>Removing IO Overhead<__NS1:B> - The source is parsed and loaded into a ByteArrayInputStream to remove IO overhead. <__NS1:LI><__NS1:B>EmptyContentHandler<__NS1:B> - When measuring the parse time, a content handler with empty methods is set on the parser. <__NS1:LI><__NS1:B>Memory benchmark<__NS1:B> - The memory benchmark is unreliable as it depends on a System.gc() call at the start and on no further collection happening before the end of a single parse. That said, the numbers are pretty consistent between runs. <__NS1:LI><__NS1:B>Output<__NS1:B> - The output depends heavily on the HTML serializer that is being used. The benchmark uses a few different HTML serializers, but the number of combinations grows quickly as soon as different configurations are introduced per parser and per serializer. See the section on customizing this benchmark if you're interested in changing configuration settings for the serializers. <__NS1:LI><__NS1:B>Sources<__NS1:B> - The sources for the benchmark include a simple HTML file to test various scenarios such as tag balancing support, and a collection of reasonably popular sites on the internet. Since only the HTML was downloaded and the links weren't rewritten, the sources and output will look wrong when viewed in a browser. I recommend taking a look at the source HTML. <__NS1:P><__NS1:DIV #text=" " class="section"><__NS1:A #text=" " name="Configuration"><__NS1:H2>Configuration<__NS1:H2><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="TagSoup_Parser"><__NS1:H3>TagSoup Parser<__NS1:H3><__NS1:P> TagSoup was run out of the box without any configuration set. 
<__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="HTMLParser"><__NS1:H3>HTMLParser<__NS1:H3><__NS1:P> HTMLParser was run with two configurations: <__NS1:UL> <__NS1:LI><__NS1:B>Default<__NS1:B> - The out of the box configuration <__NS1:LI><__NS1:B>Namespaces<__NS1:B> - Run with namespaces off and namespace prefixes on, as this made some of the serializers better behaved: setFeature("http://xml.org/sax/features/namespaces", false), setFeature("http://xml.org/sax/features/namespace-prefixes", true). <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="CyberNeko_Parser"><__NS1:H3>CyberNeko Parser<__NS1:H3><__NS1:P> There were two configurations tested: <__NS1:UL> <__NS1:LI><__NS1:B>Default<__NS1:B> - The out of the box configuration <__NS1:LI><__NS1:B>NoBalancing<__NS1:B> - The out of the box configuration with tag balancing switched off: setFeature("http://cyberneko.org/html/features/balance-tags", false); <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="HotSax_Parser"><__NS1:H3>HotSax Parser<__NS1:H3><__NS1:P> HotSax was run out of the box without any configuration set. <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="TagSoup_XMLWriter"><__NS1:H3>TagSoup XMLWriter<__NS1:H3><__NS1:P> The TagSoup XMLWriter was run with HTML mode on (setHTMLMode(true)). <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="SerializerFactory"><__NS1:H3>SerializerFactory<__NS1:H3><__NS1:P> The SerializerFactory was run using the "html" option. <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="ToHTMLStream"><__NS1:H3>ToHTMLStream<__NS1:H3><__NS1:P> ToHTMLStream was run out of the box without any configuration set. 
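<__NS1:P><__NS1:P> As a rough illustration of the feature settings listed in this section, the sketch below applies the two standard SAX namespace features to an XMLReader. It is not the benchmark's own code: the class name FeatureConfigSketch and the helper applyFeatures are hypothetical, and the JDK's built-in reader stands in for the HTML parsers so that the example runs without extra libraries. <__NS1:DIV #text=" " class="source"><__NS1:PRE>

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class FeatureConfigSketch {
    // Standard SAX2 feature URIs quoted in the HTMLParser "Namespaces" configuration.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES = "http://xml.org/sax/features/namespace-prefixes";

    // Hypothetical helper: tries to set every feature on the reader and
    // returns false if the reader rejects at least one of them.
    static boolean applyFeatures(XMLReader reader, Map<String, Boolean> features) {
        boolean allApplied = true;
        for (Map.Entry<String, Boolean> entry : features.entrySet()) {
            try {
                reader.setFeature(entry.getKey(), entry.getValue());
            } catch (SAXException e) {
                // SAXNotRecognizedException and SAXNotSupportedException both land here.
                allApplied = false;
            }
        }
        return allApplied;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JDK reader is used only as a stand-in for an HTML SAX parser.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        Map<String, Boolean> features = new LinkedHashMap<>();
        features.put(NAMESPACES, false);        // namespaces off
        features.put(NAMESPACE_PREFIXES, true); // namespace prefixes on

        System.out.println("all features applied: " + applyFeatures(reader, features));
        System.out.println("namespace-prefixes: " + reader.getFeature(NAMESPACE_PREFIXES));
    }
}
```

<__NS1:PRE> A parser-specific feature, such as the CyberNeko balance-tags URI, can be passed through the same helper; a reader that does not recognise it makes applyFeatures return false rather than throwing. 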
<__NS1:P><__NS1:DIV #text=" " class="section"><__NS1:A #text=" " name="Parser_Notes"><__NS1:H2>Parser Notes<__NS1:H2><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="TagSoup"><__NS1:H3>TagSoup<__NS1:H3><__NS1:P> TagSoup's tagline is "Just Keep On Truckin'", a reference to its emphasis on dealing with as wide a variety of HTML as possible. From the TagSoup site: "This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML." <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="CyberNeko"><__NS1:H3>CyberNeko<__NS1:H3><__NS1:P> From CyberNeko's site: "NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags." <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="HTMLParser"><__NS1:H3>HTMLParser<__NS1:H3><__NS1:P> HTMLParser describes itself as "a super-fast real-time parser for real-world HTML". HTMLParser includes a rich set of classes for gathering, filtering and transforming HTML. A standard SAX parser built on the underlying HTMLParser API is included and is a relatively recent addition. I'd recommend looking at the full library, especially if your requirements are broader than just a SAX-compliant XMLReader. 
<__NS1:P><__NS1:P> Here are some notes on the HTMLParser XMLReader: <__NS1:P><__NS1:P> <__NS1:UL> <__NS1:LI><__NS1:B>The HTMLParser XMLReader only works with SystemIds<__NS1:B> - As the tests require an InputStream or Reader (to try to minimise IO overhead affecting the numbers), I subclassed the XMLReader to include this functionality. While I do not believe that this has had any negative effect, please bear this in mind when interpreting the results. Derrick Oswald has said that he will be including this code in a future release of HTMLParser. <__NS1:LI><__NS1:B>The HTMLParser XMLReader puts in #text=" " elements for attribute whitespace<__NS1:B> - Derrick Oswald confirmed that this is to preserve whitespace in the attributes. <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="HotSax"><__NS1:H3>HotSax<__NS1:H3><__NS1:P> HotSax describes itself as "... a small fast SAX2 parser for HTML, XHTML and XML." HotSax is at version 0.1.2c and is at Pre-Alpha. The project was last updated on May 3rd 2002. <__NS1:P><__NS1:P> <__NS1:UL> <__NS1:LI><__NS1:B>HotSax doesn't work with InputStreams<__NS1:B> - HotSax will throw a NullPointerException if you try to use an InputStream but will work fine if you use an InputSource with a Reader. <__NS1:LI><__NS1:B>HotSax doesn't populate the qName for elements<__NS1:B> - HotSax does populate the localName for elements but not the qName. It looks like most of the HTML serializers use the qName to output HTML. If they do use the localName, unfortunately, it looks like they use the localName for attributes as well, in which case: <__NS1:LI><__NS1:B>HotSax doesn't populate the localName for attributes<__NS1:B> - This means that serializers that expect a localName instead of a qName don't output anything. <__NS1:LI><__NS1:B>HotSax throws an ArrayIndexOutOfBoundsException<__NS1:B> - HotSax is the only parser that fails one of the tests as the result of an ArrayIndexOutOfBoundsException. 
<__NS1:P><__NS1:P> Unfortunately, because of the localName / qName issues, it wasn't possible to get any useful HTML generated out of HotSax for comparison with the originals and the other libraries. It shouldn't be too hard to write a custom HTMLSerializer which uses the particular combination of attributes that HotSax produces. <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="Other_HTML_Parsers"><__NS1:H3>Other HTML Parsers<__NS1:H3><__NS1:P> This benchmark only tests HTML SAX parsers. There are a number of other libraries that parse HTML but don't have SAX implementations. A list of some of these can be found at <__NS1:A #text=" " href="http://www.java-source.net" #text=" " class="externalLink" #text=" " title="External Link"> java-source.net. <__NS1:P><__NS1:DIV #text=" " class="section"><__NS1:A #text=" " name="Serializers"><__NS1:H2>Serializers<__NS1:H2><__NS1:P> The results of the parsers tested depend heavily on the libraries outputting the results of the parsing. Unfortunately, while parsers and serializers are tightly coupled in practice, finding the best pairing of parser and serializer is often not an easy task. TagSoup provides its own HTML serializer (XMLWriter). HTMLParser relies on its higher-level APIs to do native HTML serialization, so I had to try out some of the open source alternatives to do serialization using a ContentHandler. CyberNeko comes with an HTMLWriter but it is a Xerces filter instead of a ContentHandler. <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="Serializers_-_Xalan_and_Xerces"><__NS1:H3>Serializers - Xalan and Xerces<__NS1:H3><__NS1:P> Including Xalan and Xerces in the Java 1.4 JDK has been problematic (see <__NS1:A #text=" " href="http://xml.apache.org/xalan-j/faq.html#faq-N100CC" #text=" " class="externalLink" #text=" " title="External Link">the FAQ). 
Over time, Xalan and Xerces have supported all of the following packages: <__NS1:UL> <__NS1:LI>org.apache.xalan.serialize (deprecated) <__NS1:LI>org.apache.xml.serialize (deprecated) <__NS1:LI>org.apache.xml.serializer <__NS1:P><__NS1:P> All of the above packages contain a SerializerFactory, which is how Serializers should be instantiated. <__NS1:P><__NS1:P> Serialization classes include: <__NS1:UL> <__NS1:LI>org.apache.xalan.serialize.SerializerToHTML (deprecated) <__NS1:LI>org.apache.xml.serializer.ToHTMLStream <__NS1:LI>org.apache.xml.serialize.HTMLSerializer (deprecated) <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="Tagsoup_XMLWriter"><__NS1:H3>TagSoup XMLWriter<__NS1:H3><__NS1:P>TagSoup comes with its own HTML serializer, namely XMLWriter. The HTMLMode was set to true for output.<__NS1:P><__NS1:DIV #text=" " class="section"><__NS1:A #text=" " name="Using_SAX_Benchmark"><__NS1:H2>Using SAX Benchmark<__NS1:H2><__NS1:P> The SAX Benchmark project was created to test HTML parsers, but because it uses only the standard SAX APIs it could quite easily be used to benchmark regular SAX libraries. <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="Running"><__NS1:H3>Running<__NS1:H3><__NS1:P> SAX Benchmark uses Maven to build and run itself. Before running the benchmarks, you'll need to compile the application using: <__NS1:DIV #text=" " class="source"><__NS1:PRE> maven jar <__NS1:PRE> You can then run it using: <__NS1:DIV #text=" " class="source"><__NS1:PRE> maven -Dcount=10 saxbenchmark:run <__NS1:PRE> <__NS1:P><__NS1:DIV #text=" " class="subsection"><__NS1:A #text=" " name="Configuration"><__NS1:H3>Configuration<__NS1:H3><__NS1:P> Currently, the suite is not particularly configurable, so to try out your own parsers or writers, you'll need to change the *Supplier classes (or supply your own). 
Specifically: <__NS1:UL> <__NS1:LI>DefaultOutputterFactorySupplier <__NS1:LI>DefaultReaderSupplier <__NS1:P><__NS1:P> The benchmark will, however, pick up any sources that are placed in the "./src/data" directory. So if you have any particular sites you'd like to benchmark, just put the source into that folder and run the benchmark. <__NS1:P><__NS1:DIV #text=" " class="clear"><__NS1:HR><__NS1:HR><__NS1:DIV #text=" " id="footer"><__NS1:DIV #text=" " class="xright">© 2002-2005, Apache Software Foundation<__NS1:DIV #text=" " class="clear"><__NS1:HR><__NS1:HR>