The dbXML 2.0 Document Table Storage Model (DTSM)
Tom Bradford - bradford@dbxmlgroup.com
Last updated: August 24, 2002

Introduction
The dbXML 2.0 Document Table Storage Model will replace the dbXML Compressed DOM in order to provide a cleanly streamed system. It addresses several limitations of the original dbXML Compressed DOM, including:
Some of the features of the new format include:
 
An Example
This example assumes that dbXML (or another processor of the DTSM) is maintaining the following Symbol Table for the document or the collection that the document belongs to. Symbol tables are used in dbXML to perform indexing and other pre-digested operations against the documents.
Symbol Table (From External Source)
symId   value    namespaceURI
0       myElem   n/a
1       myAttr   n/a
The following is a simple XML document that will be converted into the DTSM format. It contains a few of the more common components of an XML document, including elements, attributes, text, and comments.
The Document
<?xml version="1.0" encoding="UTF-8"?>
<myElem myAttr="myVal"> 
   myText 
   <!-- comment -->
</myElem>
When converted to the DTSM format, the resulting document would be broken out into two independent chunks. The first is a set of records, containing integer fields. These records represent the hierarchical structure of the document. The second is a value table, where the textual values of the document are stored, possibly converted into a schema-based binary representation. Element and attribute names are placed under symbol table control.
The document table includes end records for the document itself, for elements, and, as the example below shows, for attributes. All other portions of the document are treated as atomic and do not require this type of nesting indication.
Document Table
Obj ID Sym/Val ID
1 Doc -1
   2 Elem 0 Sym - myElem
      3 Attr 1 Sym - myAttr
         10 Text 0 Val - myVal
      -3 1
      10 Text 1 Val - myText
      9 Comment 2 Val - comment
   -2 0
-1 -1
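The nesting shown above can be recovered from the records alone. The following is a minimal sketch, assuming each record is an (objId, lookupId) pair, that only documents, elements, and attributes open a container, and that a negative objId closes the most recently opened one. The class and method names are hypothetical and are not part of dbXML.

```java
import java.util.ArrayList;
import java.util.List;

final class DocumentTableWalk {
    // Records from the example document table, as {objId, lookupId} pairs.
    static final int[][] RECORDS = {
        {1, -1}, {2, 0}, {3, 1}, {10, 0}, {-3, 1},
        {10, 1}, {9, 2}, {-2, 0}, {-1, -1}
    };

    // Builds an indentation string of three spaces per nesting level.
    static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("   ");
        }
        return sb.toString();
    }

    // Returns one indented line per record, mirroring the table layout.
    static List<String> render(int[][] records) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        for (int[] r : records) {
            int objId = r[0];
            if (objId < 0) {
                depth--;   // end record: step back out before printing
            }
            lines.add(indent(depth) + objId + " " + r[1]);
            if (objId >= 1 && objId <= 3) {
                depth++;   // assumption: document, element, attribute open a container
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : render(RECORDS)) {
            System.out.println(line);
        }
    }
}
```

Because the end records carry the nesting, a reader can track depth with a single counter and never needs to look ahead in the stream.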
The value tables will typically only store string values, but if a document is under schema control, the DTSM processor may opt to convert those values into binary representations for efficient comparisons.
Value Table
offset length value
0 5 myVal
5 14 ...myText...
19 7 comment
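Resolving a value ID is then a matter of slicing the data region. The sketch below assumes the three values are stored back to back as UTF-8, and that the whitespace around myText is a newline plus three spaces on each side (which totals the 14 bytes listed above, but is a guess at the exact content). The names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class ValueLookup {
    // Value table from the example: {offset, length}, indexed by value ID.
    static final int[][] VALUES = { {0, 5}, {5, 14}, {19, 7} };

    // Assumed document data: the three values concatenated (26 bytes).
    static final byte[] DATA =
        "myVal\n   myText\n   comment".getBytes(StandardCharsets.UTF_8);

    // Resolves a value ID to its text by slicing the data region.
    static String value(int valId) {
        int[] entry = VALUES[valId];
        return new String(DATA, entry[0], entry[1], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        for (int id = 0; id < VALUES.length; id++) {
            System.out.println(id + " -> [" + value(id) + "]");
        }
    }
}
```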
The following table totals the size of the document and value tables. The original document was around 102 bytes, so in this case the result is only a slight reduction. For larger documents, isolating element and attribute names, extracting values, and possibly converting them to binary representations is expected to reduce the size significantly.
Document Table: 45 bytes
Value Table: 50 bytes (24 bytes of offset/length entries plus 26 bytes of data)
Total 95 bytes
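The total can be reproduced from the record sizes. This sketch assumes a 1-byte object ID, 4-byte integers, nine document table records, three value table records, and 26 bytes of value data, as in the example. The two 4-byte count fields of the stream are not included. The names are hypothetical.

```java
final class SizeEstimate {
    static final int DOC_ENTRY_BYTES = 1 + 4;    // byte objId + int lookupId
    static final int VALUE_ENTRY_BYTES = 4 + 4;  // int offset + int length

    // Size of the document table records.
    static int documentTableBytes(int entries) {
        return entries * DOC_ENTRY_BYTES;
    }

    // Size of the value table records plus the value data they point into.
    static int valueBytes(int values, int dataBytes) {
        return values * VALUE_ENTRY_BYTES + dataBytes;
    }

    public static void main(String[] args) {
        int doc = documentTableBytes(9);
        int val = valueBytes(3, 26);
        System.out.println(doc + " + " + val + " = " + (doc + val));
    }
}
```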
 
Overall Stream Layout
The overall stream layout based on this document can be represented by the following table.
Region Name      Region Size   Region Content
Entry Count      4 bytes       9
Value Count      4 bytes       3
Document Table   45 bytes      (1,-1)(2,0)(3,1)(10,0)(-3,1)(10,1)(9,2)(-2,0)(-1,-1)
Value Table      24 bytes      (0,5)(5,14)(19,7)
Document Data    26 bytes      myVal...myText...comment
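A writer for this layout might look like the following sketch. It makes several assumptions: integers are written big-endian (the byte order is not stated here), the entry count is simply the number of document table records supplied, and the whitespace around myText is a newline plus three spaces on each side. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class StreamWriter {
    // Writes the five regions in order: counts, document table, value table, data.
    static byte[] write(int[][] entries, int[][] values, byte[] data) {
        int size = 4 + 4 + entries.length * 5 + values.length * 8 + data.length;
        ByteBuffer buf = ByteBuffer.allocate(size);  // big-endian by default (assumption)
        buf.putInt(entries.length);
        buf.putInt(values.length);
        for (int[] e : entries) {
            buf.put((byte) e[0]);   // objId, 1 byte
            buf.putInt(e[1]);       // lookupId, 4 bytes
        }
        for (int[] v : values) {
            buf.putInt(v[0]);       // offset
            buf.putInt(v[1]);       // length
        }
        buf.put(data);
        return buf.array();
    }

    // Serializes the example document from this section.
    static byte[] example() {
        int[][] entries = {
            {1, -1}, {2, 0}, {3, 1}, {10, 0}, {-3, 1},
            {10, 1}, {9, 2}, {-2, 0}, {-1, -1}
        };
        int[][] values = { {0, 5}, {5, 14}, {19, 7} };
        byte[] data = "myVal\n   myText\n   comment".getBytes(StandardCharsets.UTF_8);
        return write(entries, values, data);
    }

    public static void main(String[] args) {
        System.out.println(example().length + " bytes");
    }
}
```

Under these assumptions the whole stream is 8 + 45 + 24 + 26 = 103 bytes, since the two count fields add 8 bytes to the table and data regions.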
 
DTSM Data Types and Definitions
DTSM relies on several data types and identifiers, which are defined in this section.
 
XML Object IDs and XSD Value Types
The following is a table of object type identifiers. These will loosely map to DOM Node Types, except that they will be one-based, and end records will be represented using negative values (e.g., -1 for end of document).
XML Object IDs
1 / -1   Start/End Document
2 / -2   Start/End Element
3 / -3   Start/End Attribute
4 / -4   Processing Instruction
5 / -5   Notation Node
6 / -6   Entity
7 / -7   Document Type
8        Entity Reference
9        Comment
10       Text
11       CDATA
XSD Value Types
12       string
13       boolean
14       decimal
15       float
16       double
17       duration
18       dateTime
19       time
20       date
21       gYearMonth
22       gYear
23       gMonthDay
24       gDay
25       gMonth
26       hexBinary
27       base64Binary
28       anyURI
29       QName
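A processor will typically need to classify an identifier read from a record. The following is a small sketch of such helpers, using the numbering in the table above. The class, constant, and method names are hypothetical.

```java
final class ObjectIds {
    static final int DOCUMENT = 1;
    static final int ELEMENT = 2;
    static final int ATTRIBUTE = 3;
    static final int COMMENT = 9;
    static final int TEXT = 10;
    static final int FIRST_XSD_TYPE = 12;  // string
    static final int LAST_XSD_TYPE = 29;   // QName

    // End records use the negated ID of the object they close.
    static boolean isEndRecord(int objId) {
        return objId < 0;
    }

    // Maps a start or end ID to the underlying object type.
    static int baseType(int objId) {
        return Math.abs(objId);
    }

    // True for IDs in the XSD value type range (12 through 29).
    static boolean isXsdValueType(int objId) {
        return objId >= FIRST_XSD_TYPE && objId <= LAST_XSD_TYPE;
    }

    public static void main(String[] args) {
        System.out.println(isEndRecord(-2) + " " + baseType(-2) + " " + isXsdValueType(16));
    }
}
```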
  
Data Structures
These C-style structures describe the on-disk and in-memory representation of the DTSM image. DTSM, as implemented by dbXML, will be written in Java, but the goal is for these structures to be easily consumed from various languages and on various platforms.
Structures
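struct DocumentTable {
   int entryCount;                 // number of records in entries
   int valueCount;                 // number of records in values
   DocumentTableEntry[] entries;   // document structure, in document order
   ValueTableEntry[] values;       // offset/length pairs, indexed by value ID
   byte[] data;                    // concatenated value bytes
}

struct DocumentTableEntry {
   byte objId;     // object type ID; negative for end records
   int lookupId;   // symbol ID or value ID; -1 for the document record
}

struct ValueTableEntry {
   int offset;     // start of the value within data
   int length;     // length of the value in bytes
}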
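As an illustration of how these structures could be consumed from Java, the sketch below reads them from a byte buffer. It assumes 4-byte big-endian integers, a 1-byte objId with no padding, and that everything after the value table is value data. None of these class or method names come from dbXML.

```java
import java.nio.ByteBuffer;

final class DocumentTableReader {
    static final class Entry {
        final byte objId;
        final int lookupId;
        Entry(byte objId, int lookupId) { this.objId = objId; this.lookupId = lookupId; }
    }

    static final class Value {
        final int offset;
        final int length;
        Value(int offset, int length) { this.offset = offset; this.length = length; }
    }

    static final class Table {
        Entry[] entries;
        Value[] values;
        byte[] data;
    }

    // Reads counts, document table, value table, then the remaining bytes as data.
    static Table read(ByteBuffer buf) {
        int entryCount = buf.getInt();
        int valueCount = buf.getInt();
        Table table = new Table();
        table.entries = new Entry[entryCount];
        for (int i = 0; i < entryCount; i++) {
            byte objId = buf.get();
            int lookupId = buf.getInt();
            table.entries[i] = new Entry(objId, lookupId);
        }
        table.values = new Value[valueCount];
        for (int i = 0; i < valueCount; i++) {
            int offset = buf.getInt();
            int length = buf.getInt();
            table.values[i] = new Value(offset, length);
        }
        table.data = new byte[buf.remaining()];  // assumption: data runs to end of buffer
        buf.get(table.data);
        return table;
    }
}
```

Reading field by field, rather than mapping memory onto a struct, sidesteps alignment and padding differences between platforms.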