Nux 1.0

nux.xom.binary
Class BinaryXMLCodec

java.lang.Object
  extended by nux.xom.binary.BinaryXMLCodec

public class BinaryXMLCodec
extends Object

Serializes (encodes) and deserializes (decodes) XOM XML documents to and from an efficient and compact custom binary XML data format (termed bnux format), without loss or change of any information. Serialization and deserialization are much faster than with the standard textual XML format, and the resulting binary data is more compact than textual XML.

Applicability

The overall goal of the bnux algorithm is to maximize serialization and deserialization (parsing) performance without requiring any schema description. Serialization and deserialization speed are roughly balanced against each other; neither side is particularly favoured over the other. Another beneficial effect of the algorithm is that a considerable degree of XML data redundancy is eliminated, but compression is more a welcome side-effect than a primary goal in itself. The algorithm is primarily intended for tightly coupled high-performance systems exchanging large volumes of XML data over networks, as well as for compact main memory caches and for short-term storage as BLOBs in backend databases or files (e.g. "session" data with limited duration). In the case of BLOB storage, selecting matching BLOBs can be sped up by maintaining a simple metaindex side table for the most frequent access patterns. See the performance results below.
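The metaindex side table mentioned above can be pictured roughly as follows. This is only a sketch under stated assumptions: the class BlobMetaIndex, its methods and the key strings are hypothetical and not part of Nux, and the in-memory maps merely stand in for a database table. The point is that matching BLOBs are selected through the index, so non-matching bnux BLOBs never need to be deserialized.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the Nux API.
final class BlobMetaIndex {

    // blob id -> opaque bnux bytes (treated as a black box)
    private final Map<Long, byte[]> blobs = new HashMap<>();

    // indexed key -> ids of the blobs registered under that key
    private final Map<String, Set<Long>> idsByKey = new HashMap<>();

    private long nextId = 1;

    /** Stores a blob and records the keys under which it should later be found. */
    long put(byte[] blob, String... keys) {
        long id = nextId++;
        blobs.put(id, blob);
        for (String key : keys) {
            idsByKey.computeIfAbsent(key, k -> new LinkedHashSet<>()).add(id);
        }
        return id;
    }

    /** Returns the blobs registered under the key, without touching any other blob. */
    List<byte[]> find(String key) {
        List<byte[]> result = new ArrayList<>();
        for (long id : idsByKey.getOrDefault(key, Collections.emptySet())) {
            result.add(blobs.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        BlobMetaIndex index = new BlobMetaIndex();
        index.put(new byte[] {1}, "user:alice", "type:session");
        index.put(new byte[] {2}, "user:bob", "type:session");
        System.out.println("session blobs: " + index.find("type:session").size());
    }
}
```

The keys would typically be extracted from the XOM document once, at the time it is serialized and stored.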

While the Java API is considered stable, the bnux data format should be considered a black box: its internals are under-documented and may change without notice from release to release in backwards-incompatible ways. It is unlikely that support for reading data written with older Nux versions will ever be available. bnux is an exchange format but not an interoperability format.

This approach is expressly not intended as a replacement for standard textual XML in loosely coupled systems where maximum long-term interoperability is the overarching concern. It is also expressly not intended for long-term data storage. If you store data in bnux format there's every chance you won't be able to read it back a year or two from now, or even earlier. Finally, it is probably unwise to use this class if your application's performance requirements are not particularly stringent, or profiling indicates that the bottleneck is not related to XML serialization/deserialization anyway.

The bnux serialization algorithm is a three-pass batch algorithm, hence buffer-oriented, not stream-oriented. It has a throughput profile with short critical paths, rather than a low latency profile with long critical paths, rendering it ideal for large numbers of small to medium-sized XML documents, and impractical for individual documents that do not fit into main memory. The bnux deserialization algorithm is a single pass algorithm, and could in theory be streamed through a NodeFactory, but the current implementation does not do so.

Faithfully Preserving XML

Any and all arbitrary XOM XML documents are supported, and no schema is required. A XOM document that is serialized and subsequently deserialized by this class is exactly the same as the original document, preserving "as is" all names and data for elements, namespaces, additional namespace declarations, attributes, texts, document type, comments, processing instructions, whitespace, Unicode characters, etc. As a result, the W3C XML Infoset and the W3C Canonical XML representation are guaranteed to be preserved. In particular, the following always holds:
 java.util.Arrays.equals(XOMUtil.toCanonicalXML(doc), XOMUtil
 		.toCanonicalXML(deserialize(serialize(doc))));
 

Optional ZLIB Compression

The bnux algorithm considerably compresses XML data with little CPU consumption, by its very design. However, bnux also has an option to further compress/decompress its output/input with the ZLIB compression algorithm. ZLIB is based on the DEFLATE algorithm (LZ77 matching plus Huffman coding), which is also used by the popular gzip (see e.g. Deflater). ZLIB compression is rather CPU intensive, but it typically yields strong compression factors, in particular for documents containing mostly narrative text (e.g. the Bible). For example, strong compression may be desirable over low-bandwidth networks or when bnux data is known to be accessed rather infrequently. On the other hand, ZLIB compression probably kills performance in the presence of high-bandwidth networks such as ESnet, Internet2/Abilene or 10 Gigabit Ethernet/InfiniBand LANs, even with high-end CPUs. CPU drain is also a scalability problem in the presence of large numbers of concurrent connections. An option ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
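To get a feeling for what the levels 0, 1 and 9 mean, the following sketch applies the JDK's java.util.zip.Deflater directly to some repetitive XML-like text. It does not use or reproduce the bnux codec; the class name ZlibLevelDemo and the sample data are assumptions made purely for illustration, and the sizes and timings it prints will differ from what bnux achieves on real documents.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration of ZLIB levels with the plain JDK; not the bnux implementation.
final class ZlibLevelDemo {

    /** Compresses data with java.util.zip.Deflater at the given level (0..9). */
    static byte[] compress(byte[] data, int level) {
        if (level < 0 || level > 9) {
            throw new IllegalArgumentException("compression level must be in 0..9: " + level);
        }
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Reverses compress(); truncated or corrupt input is reported as IllegalArgumentException. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated ZLIB data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt ZLIB data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("<item id=\"").append(i).append("\">some narrative text</item>");
        }
        byte[] xml = sb.toString().getBytes(StandardCharsets.UTF_8);
        for (int level : new int[] {0, 1, 9}) {
            long start = System.nanoTime();
            byte[] z = compress(xml, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println("level " + level + ": " + xml.length + " -> " + z.length
                    + " bytes in " + micros + " us");
        }
    }
}
```

At level 0 the Deflater only wraps the input in ZLIB framing, so the output is slightly larger than the input; higher levels spend more CPU time looking for redundancy.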

Reliability

This class has been successfully tested against many thousands of extremely weird and unique test documents, including the W3C XML conformance test suite, and no bugs are known.

Serialization employs no error checking at all, since malformed XOM input documents are impossible to produce given XOM's design: XOM strictly enforces wellformedness anyway. Deserialization employs some limited error checking, throwing exceptions for any improper API usage, non-bnux input data, data format version mismatch, or general binary data corruption. Beyond this, deserialization relies on XOM's hard-wired wellformedness checks, just like serialization does. Barring one of the above catastrophic situations, the bnux algorithm will always correctly and faithfully reconstruct the exact same well-formed XOM document.

The implementation has one notable limitation, which is deemed irrelevant for all practical purposes: XML data such as texts, element names, attribute values, etc. MUST NOT contain a NIL character (char NIL = (char) 0x00). An attempt to serialize a document containing a NIL character will be rejected with an IllegalArgumentException. This limitation is not conceptually inherent; it merely allows for a more efficient implementation. Any other arbitrary characters are fine, including Unicode surrogates.
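An application that wants to detect the NIL restriction early, before handing data to the codec, could use a check along the following lines. This is a minimal sketch: the class NilCheck and the method requireNoNil are hypothetical helpers, not part of Nux, and the codec's own check may be implemented differently. Note that U+0000 is not a legal XML 1.0 character in the first place, which is one reason the limitation rarely matters in practice.

```java
// Hypothetical pre-check; the codec performs its own validation independently.
final class NilCheck {

    /** The NIL character that XML data must not contain. */
    static final char NIL = (char) 0x00;

    /** Throws IllegalArgumentException if the given text, name or value contains NIL. */
    static void requireNoNil(String xmlData) {
        int index = xmlData.indexOf(NIL);
        if (index >= 0) {
            throw new IllegalArgumentException("NIL character (0x00) found at index " + index);
        }
    }

    public static void main(String[] args) {
        requireNoNil("plain text");             // passes
        requireNoNil("G clef: \uD834\uDD1E");   // a surrogate pair is fine
        try {
            requireNoNil("bad" + NIL + "data"); // contains NIL, hence rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```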

Example Usage:

 // parse standard textual XML, convert to binary format, round-trip it and compare results
 Document doc = new Builder().build(new File("/tmp/test.xml"));
 BinaryXMLCodec codec = new BinaryXMLCodec();
 byte[] bnuxDoc = codec.serialize(doc, 0);
 Document doc2 = codec.deserialize(bnuxDoc);
 boolean isEqual = java.util.Arrays.equals(XOMUtil.toCanonicalXML(doc), XOMUtil
 		.toCanonicalXML(doc2));
 System.out.println("isEqual = " + isEqual);
 System.out.println(doc2.toXML());
 
 // write binary XML document to file
 OutputStream out = new FileOutputStream("/tmp/test.xml.bnux");
 out.write(bnuxDoc);
 out.close();
 
 // read binary XML document from file
 bnuxDoc = XOMUtil.toByteArray(new FileInputStream("/tmp/test.xml.bnux"));
 Document doc3 = codec.deserialize(bnuxDoc);
 System.out.println(doc3.toXML());
 

Performance

This class has been carefully profiled and optimized. Preliminary performance results over a wide range of real-world documents are given below.

Contrasting bnux BinaryXMLCodec with the XOM Builder, Serializer and toXML():

For meaningful comparison, MB/s and compression factors are always given in relation to the original standard textual XML file size.
Example Interpretation: Note that in contrast to other algorithms, bnux includes XOM tree building and walking, hence measures delivering data to and from actual XML applications, rather than merely to and from a low-level SAX event stream (which is considerably cheaper and deemed less useful).

The deserialization speedup is further multiplied when DTD or schema validation is used while parsing standard textual XML.

This class heavily relies on advanced Java compiler optimizations, which take considerable time to warm up. Hence, for comparative benchmarks, use a server-class VM and make sure to repeat runs for at least 30 seconds.
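A warm-up-aware measurement loop might look like the sketch below. It is not the BinaryXMLTest class and makes several assumptions: the class WarmupBenchmark and its methods are hypothetical, the task is a placeholder that merely copies a byte array (in a real run it would call serialize or deserialize), and one MB is taken as 1024 * 1024 bytes. Throughput is computed relative to the size of the original textual XML, as described above.

```java
import java.util.Arrays;

// Hypothetical benchmark harness; replace the placeholder task with real codec calls.
final class WarmupBenchmark {

    static volatile int sink; // keeps the JIT from discarding the placeholder work

    /** Runs the task repeatedly for roughly the given time; returns the iteration count. */
    static long runFor(Runnable task, long millis) {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        long iterations = 0;
        do {
            task.run();
            iterations++;
        } while (System.nanoTime() - deadline < 0);
        return iterations;
    }

    /** Throughput in MB/s, relative to the size of the original textual XML. */
    static double megabytesPerSecond(long iterations, long textualXmlBytes, long elapsedNanos) {
        double megabytes = iterations * (double) textualXmlBytes / (1024.0 * 1024.0);
        return megabytes / (elapsedNanos / 1e9);
    }

    /** Warms up first (results discarded), then measures. */
    static double measure(Runnable task, long textualXmlBytes, long warmupMillis, long measureMillis) {
        runFor(task, warmupMillis);
        long start = System.nanoTime();
        long iterations = runFor(task, measureMillis);
        long elapsed = System.nanoTime() - start;
        return megabytesPerSecond(iterations, textualXmlBytes, elapsed);
    }

    public static void main(String[] args) {
        byte[] textualXml = new byte[64 * 1024]; // placeholder for the original XML file
        Runnable task = () -> sink = Arrays.copyOf(textualXml, textualXml.length).length;
        // Short durations keep this demo quick; for real comparisons use at least 30000 ms.
        double mbPerSec = measure(task, textualXml.length, 200, 200);
        System.out.println("throughput: " + mbPerSec + " MB/s");
    }
}
```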

Further, you will probably want to eliminate drastic XOM hotspots by applying two small backwards-compatible patches to the XOM source code base. These patches 1) allow selectively disabling the Verifier sanity checks for PCDATA via a System property, and 2) maintain an internal String instead of a UTF-8 encoded byte array in Text, which eliminates the expensive character conversions implied for each access to a Text object. This increases performance at the expense of memory footprint. The measurements above report numbers using these patches, both for xom and bnux. If you're curious about the whereabouts of bottlenecks, run java with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).

Use class BinaryXMLTest to reproduce results, verify correctness or to evaluate performance for your own datasets.

Author:
whoschek.AT.lbl.DOT.gov, $Author: hoschek3 $

Constructor Summary
BinaryXMLCodec()
          Constructs an instance of this class.
 
Method Summary
 Document deserialize(byte[] bnuxDocument)
          Returns the XOM document obtained by deserializing the given binary XML document (bnux document).
 byte[] serialize(Document document, int zlibCompressionLevel)
          Returns the binary XML document (bnux document) obtained by serializing the given XOM document.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BinaryXMLCodec

public BinaryXMLCodec()
Constructs an instance of this class. An instance can be reused serially, but is not thread-safe, just like a Builder.
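One common way to combine serial reuse with multi-threaded applications is to keep one instance per thread. The sketch below shows the pattern with java.lang.ThreadLocal. To stay self-contained it uses a hypothetical stand-in class StubCodec instead of BinaryXMLCodec; the class PerThreadCodec and its methods are likewise assumptions for illustration. In real code the ThreadLocal would simply hold BinaryXMLCodec instances.

```java
// Hypothetical illustration of per-thread instances; StubCodec stands in for BinaryXMLCodec.
final class PerThreadCodec {

    /** Stand-in for a serially reusable object that is not thread-safe. */
    static final class StubCodec {
        private final StringBuilder scratch = new StringBuilder(); // reused mutable state

        String encode(String input) {
            scratch.setLength(0);
            return scratch.append('[').append(input).append(']').toString();
        }
    }

    // Each thread lazily gets its own instance and reuses it for all later calls.
    private static final ThreadLocal<StubCodec> CODEC = ThreadLocal.withInitial(StubCodec::new);

    static StubCodec codec() {
        return CODEC.get();
    }

    /** Returns the instance seen by a freshly started thread (for demonstration). */
    static StubCodec codecOfNewThread() {
        StubCodec[] holder = new StubCodec[1];
        Thread worker = new Thread(() -> holder[0] = codec());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return holder[0];
    }

    public static void main(String[] args) {
        System.out.println("same thread, same instance: " + (codec() == codec()));
        System.out.println("other thread, other instance: " + (codecOfNewThread() != codec()));
    }
}
```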

Method Detail

deserialize

public Document deserialize(byte[] bnuxDocument)
                     throws BinaryParsingException
Returns the XOM document obtained by deserializing the given binary XML document (bnux document).

If the document is in ZLIB compressed bnux format, it will be auto-detected and decompressed before applying deserialization.
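How bnux performs this auto-detection internally is not documented here, and the following is not a description of it. As a general illustration only, a raw ZLIB stream (RFC 1950) can be recognised by its two-byte header: the low four bits of the first byte name the compression method (8 for deflate), and the two bytes read as a 16-bit big-endian number are a multiple of 31. The class ZlibHeaderCheck and the method looksLikeZlib are hypothetical names.

```java
// Hypothetical heuristic based on RFC 1950; not the bnux detection logic.
final class ZlibHeaderCheck {

    /** Returns true if the data starts with a plausible ZLIB (RFC 1950) header. */
    static boolean looksLikeZlib(byte[] data) {
        if (data == null || data.length < 2) {
            return false;
        }
        int cmf = data[0] & 0xFF; // compression method and flags
        int flg = data[1] & 0xFF; // flags, including the FCHECK bits
        boolean deflateMethod = (cmf & 0x0F) == 8;       // CM = 8 means "deflate"
        boolean validWindow = (cmf >> 4) <= 7;           // CINFO above 7 is not allowed
        boolean checkOk = ((cmf << 8) | flg) % 31 == 0;  // header must be a multiple of 31
        return deflateMethod && validWindow && checkOk;
    }

    public static void main(String[] args) {
        byte[] zlibHeader = {0x78, (byte) 0x9C}; // header commonly written by java.util.zip.Deflater
        byte[] textualXml = {'<', '?', 'x', 'm', 'l'};
        System.out.println("zlib header: " + looksLikeZlib(zlibHeader));
        System.out.println("textual XML: " + looksLikeZlib(textualXml));
    }
}
```

Such a check is only a heuristic, since arbitrary data can start with the same two bytes; a format with its own header or magic number can make the decision unambiguous.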

Parameters:
bnuxDocument - the bnux document to deserialize.
Returns:
the new XOM document obtained from deserialization.
Throws:
BinaryParsingException - if the bnux document is unreadable or corrupt for some reason

serialize

public byte[] serialize(Document document,
                        int zlibCompressionLevel)
Returns the binary XML document (bnux document) obtained by serializing the given XOM document.

An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.

Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.

Parameters:
document - the XOM document to serialize
zlibCompressionLevel - a number in the range 0..9
Returns:
the bnux document obtained from serialization.
Throws:
IllegalArgumentException - if the compression level is out of range.
IllegalArgumentException - if the input document contains a NIL character (char NIL = (char) 0x00).
