Nux 1.0
java.lang.Object
  nux.xom.binary.BinaryXMLCodec
Serializes (encodes) and deserializes (decodes) XOM XML documents to and from an efficient and compact custom binary XML data format (termed bnux format), without loss or change of any information. Serialization and deserialization are much faster than with the standard textual XML format, and the resulting binary data is more compact than textual XML.
While the Java API is considered stable, the bnux data format should be considered a black box: its internals are under-documented and may change without notice from release to release in backwards-incompatible ways. It is unlikely that support for reading data written with older Nux versions will ever be available. bnux is an exchange format but not an interoperability format.
This approach is expressly not intended as a replacement for standard textual XML in loosely coupled systems where maximum long-term interoperability is the overarching concern. It is also expressly not intended for long-term data storage. If you store data in bnux format there's every chance you won't be able to read it back a year or two from now, or even earlier. Finally, it is probably unwise to use this class if your application's performance requirements are not particularly stringent, or profiling indicates that the bottleneck is not related to XML serialization/deserialization anyway.
The bnux serialization algorithm is a three-pass batch algorithm, hence
buffer-oriented, not stream-oriented. It has a throughput profile with short
critical paths, rather than a low latency profile with long critical paths,
rendering it ideal for large numbers of small to medium-sized XML documents,
and impractical for individual documents that do not fit into main memory.
The bnux deserialization algorithm is a single pass algorithm, and could in
theory be streamed through a NodeFactory, but the current implementation
does not do so.
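Round-tripping a document through serialization and deserialization preserves its canonical XML, so the following expression evaluates to true:
java.util.Arrays.equals(XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(deserialize(serialize(doc))));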
bnux output can optionally be compressed further with ZLIB, the DEFLATE-based
compression also used by gzip (e.g. Deflater). ZLIB compression
is rather CPU intensive, but it typically yields strong compression factors,
in particular for documents containing mostly narrative text (e.g. the
Bible). For example, strong compression may be desirable over low-bandwidth
networks or when bnux data is known to be accessed rather infrequently. On
the other hand, ZLIB compression probably kills performance in the presence
of high-bandwidth networks such as ESnet, Internet2/Abilene or 10 Gigabit Ethernet/InfiniBand LANs,
even with high-end CPUs. CPU drain is also a scalability problem in the
presence of large numbers of concurrent connections. An option ranging from 0
(no ZLIB compression; best performance) to 1 (little ZLIB compression;
reduced performance) to 9 (strongest ZLIB compression; worst performance)
allows one to configure the CPU/memory consumption trade-off.
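The level trade-off described above can be observed with the JDK's own java.util.zip.Deflater. The following is a minimal standalone sketch, not the bnux implementation: the class name ZlibLevelDemo, its helper methods and the sample payload are hypothetical and exist only for illustration. Note also that Deflater level 0 still emits a ZLIB wrapper around stored blocks, whereas bnux level 0 is documented as applying no ZLIB compression at all.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of Nux or XOM.
final class ZlibLevelDemo {

    /** Compresses data with java.util.zip.Deflater at the given level (0..9). */
    static byte[] compress(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(data);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Decompresses ZLIB data; corrupt or truncated input is reported as an unchecked exception. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported ZLIB data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt ZLIB data", e);
        } finally {
            inflater.end();
        }
    }

    /** Builds a repetitive, narrative-like sample payload (made-up data). */
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("<verse>In the beginning</verse>\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sample();
        for (int level : new int[] {0, 1, 9}) {
            long start = System.nanoTime();
            byte[] packed = compress(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            boolean ok = Arrays.equals(data, decompress(packed));
            System.out.println("level " + level + ": " + packed.length
                    + " bytes, " + micros + " us, round-trip ok = " + ok);
        }
    }
}
```

Running main prints the compressed size and elapsed time per level; timings from a single cold run are only indicative.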
Serialization employs no error checking at all, since malformed XOM input documents are impossible to produce given XOM's design: XOM strictly enforces wellformedness anyway. Deserialization employs some limited error checking, throwing exceptions for any improper API usage, non-bnux input data, data format version mismatch, or general binary data corruption. Beyond this, deserialization relies on XOM's hard-wired wellformedness checks, just like serialization does. Barring one of the above catastrophic situations, the bnux algorithm will always correctly and faithfully reconstruct the exact same well-formed XOM document.
The implementation has one notable limitation, which is deemed irrelevant for
all practical purposes: XML data such as texts, element names, attribute
values, etc. MUST NOT contain a NIL character (char NIL = (char) 0x00). An
attempt to serialize a document containing a NIL character will be rejected
with an IllegalArgumentException. This limitation is not conceptually
inherent; it merely allows for a more efficient implementation.
Any other arbitrary characters are fine, including Unicode surrogates.
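If strings reach a document from untrusted sources, it may help to check them for NIL characters before serialization is attempted. The sketch below uses only the JDK; the class NilCheck and its methods are hypothetical helpers for illustration and are not part of Nux or XOM.

```java
// Hypothetical helper; not part of Nux or XOM.
final class NilCheck {

    /** The character that the limitation above forbids in XML data. */
    static final char NIL = (char) 0x00;

    /** Returns the index of the first NIL character in s, or -1 if there is none. */
    static int indexOfNil(String s) {
        return s.indexOf(NIL);
    }

    /** Returns s unchanged, or throws if it contains a NIL character. */
    static String requireNoNil(String s) {
        int i = indexOfNil(s);
        if (i >= 0) {
            throw new IllegalArgumentException("NIL character at index " + i);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(indexOfNil("plain text"));                        // -1
        System.out.println(indexOfNil("bad" + NIL + "text"));                // 3
        System.out.println(indexOfNil("\uD83D\uDE00 surrogates are fine")); // -1
    }
}
```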
// parse standard textual XML, convert to binary format, round-trip it and compare results
Document doc = new Builder().build(new File("/tmp/test.xml"));
BinaryXMLCodec codec = new BinaryXMLCodec();
byte[] bnuxDoc = codec.serialize(doc, 0);
Document doc2 = codec.deserialize(bnuxDoc);
boolean isEqual = java.util.Arrays.equals(
    XOMUtil.toCanonicalXML(doc), XOMUtil.toCanonicalXML(doc2));
System.out.println("isEqual = " + isEqual);
System.out.println(doc2.toXML());

// write binary XML document to file
OutputStream out = new FileOutputStream("/tmp/test.xml.bnux");
out.write(bnuxDoc);
out.close();

// read binary XML document from file
bnuxDoc = XOMUtil.toByteArray(new FileInputStream("/tmp/test.xml.bnux"));
Document doc3 = codec.deserialize(bnuxDoc);
System.out.println(doc3.toXML());
Contrasting bnux BinaryXMLCodec with the XOM Builder, Serializer and toXML():
The deserialization speedup is further multiplied when DTDs or schema validation is used while parsing standard textual XML.
This class heavily relies on advanced Java compiler optimizations, which take considerable time to warm up. Hence, for comparative benchmarks, use a server-class VM and make sure to repeat runs for at least 30 seconds.
Further, you will probably want to eliminate drastic XOM hotspots by applying
two small backwards-compatible patches to the XOM source code base. These
patches 1) allow the Verifier sanity checks for PCDATA to be selectively
disabled via a System property, and 2) maintain an internal String instead of
a UTF-8 encoded byte array in Text, which eliminates the expensive character
conversions implied for each access to a Text object. This increases
performance at the expense of memory footprint. The measurements above report
numbers using these patches, both for xom and bnux. If you're curious about
the whereabouts of bottlenecks, run java with the non-perturbing '-server
-agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and
correlate its hotspot trailer with its call stack headers (see hprof tracing).
Use class BinaryXMLTest to reproduce results, verify correctness or to
evaluate performance for your own datasets.
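The warm-up advice given earlier (repeat runs for at least 30 seconds on a server-class VM) can be sketched as a small time-boxed loop. This is a JDK-only illustration, not the BinaryXMLTest class: WarmupBenchmark, runFor and the stand-in workload are hypothetical, and the workload would be replaced with the serialize/deserialize calls to be measured.

```java
// Hypothetical harness; not part of Nux or XOM.
final class WarmupBenchmark {

    /** Sink that keeps the JIT from discarding the stand-in workload. */
    static volatile long sink;

    /**
     * Runs the task repeatedly until at least minMillis have elapsed
     * and returns the number of completed iterations.
     */
    static long runFor(long minMillis, Runnable task) {
        long durationNanos = minMillis * 1_000_000L;
        long start = System.nanoTime();
        long iterations = 0;
        do {
            task.run();
            iterations++;
        } while (System.nanoTime() - start < durationNanos);
        return iterations;
    }

    public static void main(String[] args) {
        // Duration per pass in milliseconds; defaults to the 30 seconds suggested above.
        long millis = args.length > 0 ? Long.parseLong(args[0]) : 30_000L;

        // Stand-in workload (assumption); replace with the codec calls to measure.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 1000; i++) {
                sb.append(i);
            }
            sink += sb.length();
        };

        runFor(millis, task); // warm-up pass, result discarded

        long start = System.nanoTime();
        long iterations = runFor(millis, task); // measured pass
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d iterations in %.1f s = %.0f iterations/s%n",
                iterations, seconds, iterations / seconds);
    }
}
```

Discarding the first pass keeps compilation time out of the reported throughput; whether one warm-up pass suffices depends on the VM and the workload.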
Constructor Summary
BinaryXMLCodec() | Constructs an instance of this class.
Method Summary
Document | deserialize(byte[] bnuxDocument) | Returns the XOM document obtained by deserializing the given binary XML document (bnux document).
byte[] | serialize(Document document, int zlibCompressionLevel) | Returns the binary XML document (bnux document) obtained by serializing the given XOM document.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
public BinaryXMLCodec()
Constructs an instance of this class.
Method Detail
public Document deserialize(byte[] bnuxDocument) throws BinaryParsingException
If the document is in ZLIB compressed bnux format, it will be auto-detected and decompressed before applying deserialization.
Parameters:
bnuxDocument - the bnux document to deserialize.
Throws:
BinaryParsingException - if the bnux document is unreadable or corrupt for some reason

public byte[] serialize(Document document, int zlibCompressionLevel)
An optional zlib compression level ranging from 0 (no ZLIB compression; best performance) to 1 (little ZLIB compression; reduced performance) to 9 (strongest ZLIB compression; worst performance) allows one to configure the CPU/memory consumption trade-off.
Unless there is a good reason to the contrary, you should always use level 0: the bnux algorithm typically already precompresses considerably.
Parameters:
document - the XOM document to serialize
zlibCompressionLevel - a number in the range 0..9
Throws:
IllegalArgumentException - if the compression level is out of range.
IllegalArgumentException - if the input document contains a NIL character (char NIL = (char) 0x00).