The Sax Benchmark project was created to test the performance, correctness and usability of the various HTML SAX parsers available.
NOTE: As with all benchmarks, these need to be taken with an appropriately large dose of salt. This suite is intended as a helpful starting point in evaluating the various libraries out there but if my experience thus far is anything to go by, your mileage will vary. Also, these benchmarks may well contain inaccuracies, bad methodology, bad configuration and general confusion, although none deliberate. If you spot anything that could be improved, please let me know.
The four SAX parsers under the knife are TagSoup, CyberNeko (NekoHTML), HTMLParser and HotSax.
There are several different criteria that can be used when evaluating a particular HTML SAX parser, and it is important to keep this in mind when interpreting the results. These criteria could include correctness, speed, memory use and fail-safety, and they may well pull against one another. As with any benchmark, a library under review may also be tailored to work far better in the real world than in a benchmark.
I suggest reading through the notes below before proceeding to the results.
In no particular order, my thanks to John Cowan (TagSoup), Andy Clark (CyberNeko), Ed Howland (HotSax) and Derrick Oswald (HTMLParser) for helping improve this benchmark. Any inaccuracies, inconsistencies etc. are mine, though.
There were three configurations tested:
HTMLParser was run with two configurations:
TagSoup's tagline is "Just Keep On Truckin'", a reference to its emphasis on dealing with as wide a variety of HTML as possible. From the TagSoup site: "This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML."
From CyberNeko's site: "NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags."
HTMLParser describes itself as "a super-fast real-time parser for real-world HTML". HTMLParser includes a rich set of classes for gathering, filtering and transforming HTML. A standard SAX parser built on the underlying HTMLParser API is included; it is a relatively recent addition. I'd recommend looking at the full library, especially if your requirements are broader than just a SAX-compliant XMLReader.
Here are some notes on the HTMLParser XMLReader:
HotSax describes itself as "... a small fast SAX2 parser for HTML, XHTML and XML." HotSax is at version 0.1.2c and is in pre-alpha. The project was last updated on May 3rd 2002.
Unfortunately, because of the localName / qName issues, it wasn't possible to get any useful HTML generated out of HotSax for comparison with the originals and the other libraries. It shouldn't be too hard to write a custom HTMLSerializer which uses the particular combination of attributes that HotSax produces.
This benchmark only tests HTML SAX parsers. There are a number of other libraries that parse HTML but don't have SAX implementations. A list of some of these can be found at java-source.net.
The results for the parsers tested depend heavily on the libraries used to output the results of the parsing. Unfortunately, while parsers and serializers are tightly coupled in practice, finding the best pairing of parser and serializer is often not an easy task. TagSoup provides its own HTML serializer (XMLWriter). HTMLParser relies on its higher-level APIs to do native HTML serialization, so I had to try out some of the open source alternatives to do serialization using a ContentHandler. CyberNeko comes with an HTMLWriter, but it is a Xerces filter rather than a ContentHandler.
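To make the idea of "serialization using a ContentHandler" concrete, here is a minimal sketch that uses only standard JAXP classes from the JDK. It is not the serializer setup used by this benchmark. The class name SaxSerializeSketch and the method serialize are made up for illustration, and the JDK's own XML parser stands in for an HTML SAX parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical illustration class; not part of the benchmark sources.
public class SaxSerializeSketch {

    /** Parses the input with the given XMLReader and writes the SAX events out as HTML. */
    public static String serialize(XMLReader reader, String input) throws Exception {
        // An identity TransformerHandler is a ContentHandler that writes to a Result.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        // Any SAX-compliant XMLReader can feed the handler.
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(input)));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in parser: the JDK's XML parser. The XMLReader of an HTML SAX
        // parser would be passed to serialize() in the same way.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        System.out.println(serialize(reader, "<html><body><p>Hello</p></body></html>"));
    }
}
```

Because the handler only sees SAX events, swapping in a different parser changes nothing on the serializer side, which is also why the choice of serializer can influence the measured results.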
Including Xalan and Xerces in the Java 1.4 JDK has been problematic (see the FAQ). Over time, Xalan and Xerces have supported all of the following packages:
All of the above packages contain a SerializerFactory, which is how Serializers should be instantiated.
Serialization classes include:
The SAX Benchmark project was created to test HTML parsers, but because it uses only the standard SAX APIs it could quite easily benchmark regular SAX libraries as well.
SAX Benchmark uses Maven to build and run itself. If you have Maven installed, you can run it using:
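As a rough illustration of why relying only on the standard SAX interfaces makes this possible, the sketch below times repeated parses through a plain org.xml.sax.XMLReader. It is not the benchmark's actual harness: the class SaxTimingSketch and its methods are hypothetical names, and the JDK's built-in XML parser is used as the reader.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of the benchmark sources.
public class SaxTimingSketch {

    /** Counts start-element events so that each parse does observable work. */
    static class CountingHandler extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Parses the document once and returns the number of elements seen. */
    public static int countElements(XMLReader reader, String document) throws Exception {
        CountingHandler handler = new CountingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(document)));
        return handler.elements;
    }

    /** Parses the document count times and returns the mean time per parse in nanoseconds. */
    public static long meanParseNanos(XMLReader reader, String document, int count) throws Exception {
        reader.setContentHandler(new CountingHandler());
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            reader.parse(new InputSource(new StringReader(document)));
        }
        return (System.nanoTime() - start) / count;
    }

    public static void main(String[] args) throws Exception {
        // Any XMLReader works here, whether it parses XML or HTML.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String doc = "<a><b/><c/></a>";
        System.out.println("elements: " + countElements(reader, doc));
        System.out.println("mean ns per parse: " + meanParseNanos(reader, doc, 10));
    }
}
```

A real measurement would also need warm-up runs and larger inputs; this only shows that the timing code never has to know which parser sits behind the XMLReader.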
maven -Dcount=10 saxbenchmark
Currently, the suite is not particularly configurable, so to try out your own parsers or writers, you'll need to change the *Supplier classes (or supply your own). Specifically:
The benchmark picks up any sources placed in the "./src/data" directory. So if you have particular sites you'd like to benchmark, just put their HTML source into that folder and run the benchmark.
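For readers who want to check what would be picked up, the following sketch lists the regular files directly under a data directory. It is an assumption about how such discovery could work, not a copy of the benchmark's own code, and the class name DataDirSketch and method listSources are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of the benchmark sources.
public class DataDirSketch {

    /** Returns the regular files directly inside dataDir, sorted by path. */
    public static List<Path> listSources(Path dataDir) throws IOException {
        try (Stream<Path> entries = Files.list(dataDir)) {
            return entries
                    .filter(Files::isRegularFile) // skip subdirectories
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the directory named in the text; another path can be passed as an argument.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "src/data");
        if (!Files.isDirectory(dataDir)) {
            System.out.println("No such directory: " + dataDir);
            return;
        }
        for (Path source : listSources(dataDir)) {
            System.out.println(source);
        }
    }
}
```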