This afternoon I attended the XML session. The first speaker was Eric Perkins, who spoke on XML Screamer, an integrated, high-performance XML parser/validator. This paper has been nominated for the best paper award. (See the paper.)
XML parsers are slow. Many people think that the human readability of XML is what makes it slow. How fast should we be able to go? Reading through an input file should take about 10 cycles/byte (on a 1 GHz processor). Xerces-C manages 6 Mbytes/sec/GHz and Expat 12 Mbytes/sec/GHz. What's happening with all the other cycles?
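To put those throughput figures on the same footing as the 10 cycles/byte target, it helps to convert them to cycles per byte. Here's a back-of-the-envelope sketch of my own (the function name is made up, and I'm assuming 1 Mbyte means 10^6 bytes):

```python
def cycles_per_byte(mbytes_per_sec_per_ghz):
    """Convert throughput normalized to a 1 GHz clock into cycles per byte."""
    cycles_per_sec = 1e9                          # 1 GHz
    bytes_per_sec = mbytes_per_sec_per_ghz * 1e6  # assumes 1 Mbyte = 10**6 bytes
    return cycles_per_sec / bytes_per_sec

for name, rate in [("Xerces-C", 6), ("Expat", 12)]:
    print(f"{name}: about {cycles_per_byte(rate):.0f} cycles/byte")
# Xerces-C: about 167 cycles/byte
# Expat: about 83 cycles/byte
```

Under that assumption, both parsers spend roughly 8 to 17 times the 10 cycles/byte that a plain read through the input should cost.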
Eric walks through the steps required to parse a file. There are a dozen steps, including a lot of UTF-8 to UTF-16 conversions. This is because schemas are typically in UTF-16, so every comparison against the schema requires a conversion.
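To make the conversion cost concrete, here's a toy sketch of two ways to check an element name from the input against a name from the schema. This is my own illustration, not code from the talk: the sample document, the constant names, and the function names are all hypothetical, and I'm assuming the input arrives as UTF-8.

```python
# Hypothetical data: input arrives as UTF-8, schema name is stored as UTF-16.
DOC = b"<price>42</price>"
SCHEMA_NAME_UTF16 = "price".encode("utf-16-le")
SCHEMA_NAME_UTF8 = b"price"  # same name, encoded once in the input's encoding

def matches_with_conversion(doc, start, end):
    # Layered approach: transcode the tag name to UTF-16 on every comparison.
    name_utf16 = doc[start:end].decode("utf-8").encode("utf-16-le")
    return name_utf16 == SCHEMA_NAME_UTF16

def matches_without_conversion(doc, start, end):
    # Direct approach: compare the raw input bytes, no per-tag transcoding.
    return doc[start:end] == SCHEMA_NAME_UTF8

# The tag name "price" occupies bytes 1..6 of DOC.
print(matches_with_conversion(DOC, 1, 6), matches_without_conversion(DOC, 1, 6))
# True True
```

Both give the same answer, but the first one allocates and transcodes a string for every tag it looks at.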
XML Screamer takes a schema and a desired output API and produces a custom parser in C or Java for that combination. Screamer optimizes across layers, avoids intermediate forms, and avoids format conversion.
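To get a feel for what "compile a schema into a custom parser" means, here's a deliberately tiny sketch. It is not how XML Screamer works internally, just my own toy under heavy assumptions: the "schema" is a root element with a fixed sequence of integer-valued children, there are no attributes or whitespace, and the generated parser is Python rather than C or Java. All names are hypothetical.

```python
def generate_parser(root, fields):
    """Emit Python source for a parser specialized to one fixed schema."""
    lines = ["def parse(data):", "    pos = 0", "    out = {}"]

    def expect(literal):
        # Tag names are baked into the generated code as UTF-8 byte literals,
        # so the parser compares raw input bytes with no conversion.
        token = literal.encode("utf-8")
        message = f"expected {literal}"
        lines.append(f"    if not data.startswith({token!r}, pos):")
        lines.append(f"        raise ValueError({message!r})")
        lines.append(f"    pos += {len(token)}")

    expect(f"<{root}>")
    for field in fields:
        expect(f"<{field}>")
        lines.append("    end = data.index(b'<', pos)")
        # int() rejects non-integer content, so validation happens while parsing.
        lines.append(f"    out[{field!r}] = int(data[pos:end])")
        lines.append("    pos = end")
        expect(f"</{field}>")
    expect(f"</{root}>")
    lines.append("    return out")
    return "\n".join(lines)

source = generate_parser("point", ["x", "y"])
namespace = {}
exec(source, namespace)  # compile the generated, schema-specific parser
parse_point = namespace["parse"]
print(parse_point(b"<point><x>3</x><y>4</y></point>"))
# {'x': 3, 'y': 4}
```

The generated function goes straight from input bytes to the output structure, checking structure and types as it goes, with no generic token stream or tree in between.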
XML Screamer is 1.9 times faster than Expat and 3.8 times faster than Xerces for non-validating tasks. For business object creation (non-validating), it is 2.9 and 5.9 times as fast as Expat and Xerces, respectively. For validating tasks, those figures rise to 5.5 and 11.6. This is getting to within 20-40% of the raw character scan rate.
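As a quick sanity check on those figures, dividing the Xerces speedup by the Expat speedup for each task gives the implied Expat-to-Xerces ratio. A small sketch (my own arithmetic on the numbers as I noted them; the dictionary keys and function name are made up):

```python
# Reported XML Screamer speedups: (vs Expat, vs Xerces)
speedups = {
    "non-validating parse": (1.9, 3.8),
    "business objects": (2.9, 5.9),
    "validating": (5.5, 11.6),
}

def implied_expat_over_xerces(vs_expat, vs_xerces):
    # (Screamer / Xerces) / (Screamer / Expat) = Expat / Xerces
    return vs_xerces / vs_expat

for task, (vs_expat, vs_xerces) in speedups.items():
    ratio = implied_expat_over_xerces(vs_expat, vs_xerces)
    print(f"{task}: Expat is about {ratio:.1f}x Xerces")
# non-validating parse: Expat is about 2.0x Xerces
# business objects: Expat is about 2.0x Xerces
# validating: Expat is about 2.1x Xerces
```

That lines up with the raw throughput numbers from earlier in the talk, where Expat at 12 Mbytes/Sec/GHz is twice Xerces-C at 6.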
A few conclusions:
- XML stacks are designed in layers, but parsers don't need to be built that way.
- Good API design is crucial to performance. Some APIs require string conversions, creation, or even buffer manipulations.
- Schema compilation means that compiled artifacts must be deployed with each application. This proves to be a significant drawback.
- This system is a prototype and IBM has no plans to release it.