The dbXML 2.0 Document Table Storage Model (DTSM)
Introduction
The dbXML 2.0 Document Table Storage Model will replace the
dbXML Compressed DOM in order to provide a cleanly streamed system. It addresses
several limitations of the original dbXML Compressed DOM, including:
- It only supported 8-bit encodings.
- It had variable-length records, requiring recursive seeks from the document
root to retrieve nested elements.
- It required recursive traversal to produce SAX events.
- It stored string values in-line, requiring restructuring when values
changed.
- It only supported string values, requiring coercion to perform numeric,
date, and other comparisons.
Some of the features of the new format include:
- Full Unicode support.
- Fixed-length records that provide quick seeks, streaming capabilities,
and sequential access.
- Easy support for SAX, DOM, and text models.
- Efficient string modifications.
- Support for schema-based value representation in binary form.
An Example
This example assumes that dbXML (or another processor of the
DTSM) is maintaining the following Symbol Table for the document or the collection
that the document belongs to. Symbol tables are used in dbXML to perform indexing
and other pre-digested operations against the documents.
Symbol Table (From External Source)
symId | value | namespaceURI
0 | myElem | n/a
1 | myAttr | n/a
The following is a simple XML document that will be converted
into the DTSM format. It contains a few of the more common components of an
XML document, including elements, attributes, text, and comments.
The Document
<?xml version="1.0" encoding="UTF-8"?>
<myElem myAttr="myVal">
myText
<!-- comment -->
</myElem>
When converted to the DTSM format, the resulting document would
be broken out into two independent chunks. The first is a set of records, containing
integer fields. These records represent the hierarchical structure of the document.
The second is a value table, where the textual values of the document are stored,
possibly converted into a schema-based binary representation. Element and attribute
names are placed under symbol table control.
The document table includes end records for the document itself,
for elements, and, as the example below shows, for attributes. The remaining
portions of this document (text and comments) are treated as atomic, and do not
require this type of nesting indication.
Document Table
Obj ID | Sym/Val ID
1 Doc | -1
2 Elem | 0 Sym - myElem
3 Attr | 1 Sym - myAttr
10 Text | 0 Val - myVal
-3 | 1
10 Text | 1 Val - myText
9 Comment | 2 Val - comment
-2 | 0
-1 | -1
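The nesting that the start and end records express can be checked with a small stack. The following is a minimal sketch, not the dbXML implementation: the names Entry, EXAMPLE, opens_scope and dtsm_is_balanced are hypothetical, and it assumes that only Document (1), Element (2) and Attribute (3) records open a scope, as in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One document table record; field names follow the DTSM structures. */
typedef struct {
    int8_t  objId;     /* > 0: start or atomic record, < 0: end record */
    int32_t lookupId;  /* symbol ID or value table index */
} Entry;

/* The nine records of the example document table. */
const Entry EXAMPLE[9] = {
    {1, -1}, {2, 0}, {3, 1}, {10, 0}, {-3, 1},
    {10, 1}, {9, 2}, {-2, 0}, {-1, -1}
};

enum { MAX_DEPTH = 64 };  /* arbitrary nesting limit for this sketch */

/* Assumption: only Document (1), Element (2) and Attribute (3) open a scope. */
static int opens_scope(int8_t objId) {
    return objId >= 1 && objId <= 3;
}

/* Returns 1 if every end record closes the most recently opened scope
 * and nothing is left open at the end; returns 0 otherwise. */
int dtsm_is_balanced(const Entry *entries, size_t count) {
    int8_t stack[MAX_DEPTH];
    size_t depth = 0;

    assert(entries != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int8_t id = entries[i].objId;
        if (id < 0) {
            /* End record: it must match the scope on top of the stack. */
            if (depth == 0 || stack[depth - 1] != -id) return 0;
            depth--;
        } else if (opens_scope(id)) {
            if (depth == MAX_DEPTH) return 0;
            stack[depth++] = id;
        }
        /* Any other positive ID is atomic and leaves the depth unchanged. */
    }
    return depth == 0;
}
```

Because the records are visited strictly in order, the same loop shape could drive SAX-style start and end events without any recursion.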
The value table will typically store only string values, but
if a document is under schema control, the DTSM processor may opt to convert
those values into binary representations for efficient comparisons.
Value Table
offset | length | value
0 | 5 | myVal
5 | 14 | ...myText...
19 | 7 | comment
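Resolving a value ID is a matter of reading its (offset, length) pair and slicing the data region. The sketch below illustrates this under stated assumptions: ValueEntry, EXAMPLE_DATA, EXAMPLE_VALUES and dtsm_copy_value are hypothetical names, values are treated as raw bytes, and the whitespace around myText is invented to fill the 14 bytes that the table reports, since the source only shows it as dots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int32_t offset;  /* byte offset into the document data */
    int32_t length;  /* length in bytes */
} ValueEntry;

/* Example data region (26 bytes). The whitespace around "myText" is an
 * assumption chosen to fill the 14 bytes that the value table reports. */
const char EXAMPLE_DATA[] = "myVal" "\n    myText\n  " "comment";
const ValueEntry EXAMPLE_VALUES[3] = { {0, 5}, {5, 14}, {19, 7} };

/* Copies value `id` into `out` as a NUL-terminated string.
 * Returns the value length, or -1 if `id` is out of range or the
 * value does not fit into `capacity` bytes (including the terminator). */
int dtsm_copy_value(const char *data, const ValueEntry *values,
                    int32_t valueCount, int32_t id,
                    char *out, size_t capacity) {
    assert(data != NULL && values != NULL && out != NULL);
    if (id < 0 || id >= valueCount) return -1;
    if (values[id].length < 0 || (size_t)values[id].length >= capacity) return -1;

    memcpy(out, data + values[id].offset, (size_t)values[id].length);
    out[values[id].length] = '\0';
    return values[id].length;
}
```

Since the document table refers to values only by index, a changed string can in principle be appended to the data region and its entry repointed, leaving the document table untouched.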
The following table totals the size of the document and value
tables. The original document was around 102 bytes, so in this case the result
is only a slight reduction. For larger documents, however, isolating element
and attribute names, extracting values, and possibly converting them to binary
representations is expected to reduce the size significantly.
Document Table: | 45 bytes
Value Table (entries and data): | 50 bytes
Total | 95 bytes
Overall Stream Layout
The overall stream layout based on this document can be represented by the
following table.
Region Name | Region Size | Region Content
Entry Count | 4 bytes | 9
Value Count | 4 bytes | 3
Document Table | 45 bytes | (1,-1)(2,0)(3,1)(10,0)(-3,1)(10,1)(9,2)(-2,0)(-1,-1)
Value Table | 24 bytes | (0,5)(5,14)(19,7)
Document Data | 26 bytes | myVal...myText...comment
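Because every record has a fixed size, the start of each region and of each record follows from the two counts alone. The following sketch works through that arithmetic. It assumes that the regions are stored back to back, in the order listed, with no padding, and that records are 5 and 8 bytes wide as the region sizes above imply; the constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    DTSM_HEADER_SIZE = 8,  /* entry count + value count, 4 bytes each */
    DTSM_ENTRY_SIZE  = 5,  /* 1-byte objId + 4-byte lookupId */
    DTSM_VALUE_SIZE  = 8   /* 4-byte offset + 4-byte length */
};

/* Byte offset of document table record `index` from the start of the stream. */
int64_t dtsm_entry_offset(int32_t index) {
    assert(index >= 0);
    return DTSM_HEADER_SIZE + (int64_t)index * DTSM_ENTRY_SIZE;
}

/* The value table starts right after the last document table record. */
int64_t dtsm_value_table_offset(int32_t entryCount) {
    return dtsm_entry_offset(entryCount);
}

/* The document data starts right after the last value table entry. */
int64_t dtsm_data_offset(int32_t entryCount, int32_t valueCount) {
    assert(valueCount >= 0);
    return dtsm_value_table_offset(entryCount)
         + (int64_t)valueCount * DTSM_VALUE_SIZE;
}

/* Total stream size, including the two 4-byte counts. */
int64_t dtsm_stream_size(int32_t entryCount, int32_t valueCount,
                         int32_t dataLength) {
    assert(dataLength >= 0);
    return dtsm_data_offset(entryCount, valueCount) + dataLength;
}
```

For the nine records, three values and 26 data bytes of the example, this gives a value table at byte 53, document data at byte 77, and a stream of 103 bytes: the 95 bytes totalled earlier plus the two 4-byte counts.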
DTSM Data Types and Definitions
DTSM relies on several data types and identifiers, which an
implementation must follow in order to function properly.
XML Object IDs and XSD Value Types
The following is a table of object type identifiers. These will
loosely map to DOM Node Types, except that they will be one-based, and end
records will be represented using negative values (e.g., -1 for end of document).
XML Object IDs
ID | Object
1 / -1 | Start/End Document
2 / -2 | Start/End Element
3 / -3 | Start/End Attribute
4 / -4 | Processing Instruction
5 / -5 | Notation Node
6 / -6 | Entity
7 / -7 | Document Type
8 | Entity Reference
9 | Comment
10 | Text
11 | CDATA
XSD Value Types
ID | Type
12 | string
13 | boolean
14 | decimal
15 | float
16 | double
17 | duration
18 | dateTime
19 | time
20 | date
21 | gYearMonth
22 | gYear
23 | gMonthDay
24 | gDay
25 | gMonth
26 | hexBinary
27 | base64Binary
28 | anyURI
29 | QName
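A processor will usually need to classify an identifier before acting on a record. The sketch below shows one way to do so with a lookup array taken from the table above. All function and array names are hypothetical, and treating IDs 1 through 7 as the only ones with an end form is read off the table rather than stated by it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Names indexed by identifier; index 0 is unused. */
static const char *const DTSM_NAMES[30] = {
    NULL, "Document", "Element", "Attribute", "Processing Instruction",
    "Notation Node", "Entity", "Document Type", "Entity Reference",
    "Comment", "Text", "CDATA",
    "string", "boolean", "decimal", "float", "double", "duration",
    "dateTime", "time", "date", "gYearMonth", "gYear", "gMonthDay",
    "gDay", "gMonth", "hexBinary", "base64Binary", "anyURI", "QName"
};

/* Negative identifiers mark end records. */
int dtsm_is_end_record(int8_t objId) { return objId < 0; }

/* Identifiers 12 through 29 denote XSD value types. */
int dtsm_is_value_type(int8_t objId) { return objId >= 12 && objId <= 29; }

/* Returns the name for an identifier, or NULL if it is not in the table.
 * Assumption: only IDs 1 through 7 have a negative (end) form. */
const char *dtsm_name(int8_t objId) {
    int id = objId < 0 ? -objId : objId;
    if (id < 1 || id > 29) return NULL;
    if (objId < 0 && id > 7) return NULL;
    return DTSM_NAMES[id];
}

/* Maps an XSD type name such as "dateTime" to its identifier, or 0 if unknown. */
int dtsm_value_type_id(const char *xsdName) {
    assert(xsdName != NULL);
    for (int id = 12; id <= 29; id++) {
        if (strcmp(DTSM_NAMES[id], xsdName) == 0) return id;
    }
    return 0;
}
```

For example, dtsm_name(-2) yields "Element" for an end-of-element record, and dtsm_value_type_id("dateTime") yields 18.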
Data Structures
The following C structures describe the on-disk and in-memory
layout of the DTSM image. DTSM, as implemented by dbXML, will be written in Java,
but the goal is that these structures should be easily digested by various
languages, and on various platforms.
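Structures
#include <stdint.h>

#pragma pack(push, 1)  /* no padding: 5 and 8 bytes per record, as in the example */
struct DocumentTableEntry {
    int8_t  objId;     /* XML object ID; negative for end records */
    int32_t lookupId;  /* symbol ID or value table index, depending on objId */
};
struct ValueTableEntry {
    int32_t offset;    /* byte offset into the document data */
    int32_t length;    /* length in bytes */
};
#pragma pack(pop)

/* In-memory view of a whole DTSM image. */
struct DocumentTable {
    int32_t entryCount;
    int32_t valueCount;
    struct DocumentTableEntry *entries;  /* entryCount records */
    struct ValueTableEntry    *values;   /* valueCount records */
    uint8_t                   *data;     /* document data (value bytes) */
};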
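Since every DocumentTableEntry occupies the same number of bytes, a single record can be read or written in place without scanning the records before it. The following is a sketch under stated assumptions: the byte order is not specified in this document, so big-endian integers are assumed here (the order that Java data streams use), and the helper names dtsm_put_entry and dtsm_get_entry are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    HEADER_SIZE = 8,  /* entryCount + valueCount */
    ENTRY_SIZE  = 5   /* 1-byte objId + 4-byte lookupId */
};

typedef struct {
    int8_t  objId;
    int32_t lookupId;
} DocumentTableEntry;

/* Writes record `index` into a stream image.
 * Assumption: multi-byte integers are stored big-endian. */
void dtsm_put_entry(uint8_t *stream, size_t index, DocumentTableEntry e) {
    uint8_t *p;
    uint32_t v = (uint32_t)e.lookupId;

    assert(stream != NULL);
    p = stream + HEADER_SIZE + index * ENTRY_SIZE;
    p[0] = (uint8_t)e.objId;
    p[1] = (uint8_t)(v >> 24);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 8);
    p[4] = (uint8_t)v;
}

/* Reads record `index` directly from its fixed position in the stream. */
DocumentTableEntry dtsm_get_entry(const uint8_t *stream, size_t index) {
    const uint8_t *p;
    DocumentTableEntry e;
    uint32_t v;

    assert(stream != NULL);
    p = stream + HEADER_SIZE + index * ENTRY_SIZE;
    v = ((uint32_t)p[1] << 24) | ((uint32_t)p[2] << 16)
      | ((uint32_t)p[3] << 8)  |  (uint32_t)p[4];
    e.objId = (int8_t)p[0];
    e.lookupId = (int32_t)v;
    return e;
}
```

Decoding byte by byte, rather than casting the buffer to a struct pointer, keeps the sketch independent of the host's byte order and alignment rules, which matters if the same image is to be read on several platforms.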