360 likes | 496 Views
Analysis and Evaluation of a Native XML Database. Kenneth H. Wenker, Ph.D. May 19, 2003 Dr. C. Edward Chow, Advisor University of Colorado, Colorado Springs. Outline of the Presentation. 1. The nature of this masters project 2. XML and databases 3. NeoCore XML database architecture
E N D
Analysis and Evaluation of a Native XML Database Kenneth H. Wenker, Ph.D. May 19, 2003 Dr. C. Edward Chow, Advisor University of Colorado, Colorado Springs
Outline of the Presentation 1. The nature of this masters project 2. XML and databases 3. NeoCore XML database architecture 4. Testing architecture 5. Testing results 6. Conclusion
Goals of the Master Project • To analyze and explain some significant architectural features of the NeoCore XML database, specifically those that have been patented and are therefore public • To test the performance of those features as they have been implemented in v. 2.6 for Windows • To create benchmarks for evaluating the XML database performance
What is an XML Database? • information = data + context • THE CHALLENGE: with XML, not only the data, but also the context, can vary. XML is inherently extensible. • A traditional RMDBS is not inherently extensible. • An “XML-enabled” database is an RMDBS which has been modified in an attempt to handle XML extensibility. • XML-enabled databases need to deal with both storing and reconstructing the tag structure. • A “native XML database” is specifically designed to process XML documents efficiently • Tamino: must provide some sort of structure definition • dbXML: weak doing updates and inserts
USES FOR AN XML DATABASE • e-commerce—since we are frequently using XML to transport data, it would be efficient to store it that way, provided we can store it, manipulate it, and retrieve it easily. • When the tag structures are frequently and unpredictably changing, as in, for example, DNA research. • When the tag structures are more important than the data, as in (perhaps) a next-generation search engine.
XML Database Benchmarks • No authoritative group has recognized any benchmark for XML databases • There are five academic benchmarks • The Michigan Benchmark (Univ. of Michigan) • XOO7 (Nat’l Univ. of Singapore) • XBench (Univ. of Waterloo, Canada) • XMach1 (Univ. of Leipzig) • XMark (National Research Institute for Mathematics and Computer Science in the Netherlands) • The Michigan Benchmark is designed to allow developers to tune databases. The others are usable for this project. • None of them provides an adequate test of extensibility, in my opinion. The extensibility they allow is within the bounds of predefined schema.
XOO7 BENCHMARK • Document Structure: <Module> <Manual>Text of Configurable Length</Manual> <ComplexAssembly> <ComplexAssembly> <BaseAssembly> <CompositePart> <Document>Text of Configurable Length</Document> <Connection> <AtomicPart> • Typical Configuration File: NumAssmPerAssm 3NumCompPerAssm 3NumCompPerModule 50NumAssmLevels 5NumAtomicPerComp 200NumConnPerAtomic 3DocumentSize 1000ManualSize 3800
CLIENT SERVER Console (runs on client’s browser) Used for administration Used for manual input NeoCore XMS HTTP DB User Application API C++, Java, VB, .COM, or create your own HTTP HTTP CLIENT/SERVER OVERVIEW
Icon Gene-rator Set of Indices Map Store Tag Dictionary Data Dictionary Figure 2-2. Overview of NeoCore Database Architecture OVERVIEW OF NEOCORE STORAGE ARCHITECTURE
THE “ICON” • It is a transform of a string of indefinite length to a n-bit binary number. • Similar to a hash or CRC code. • The width of the icon is typically 64 bits. The first part of the icon, depending on the number of rows in the index, is used to specify which row of an index is to be accessed. The rest is used as a “confirmer” to be explained shortly. • NeoCore uses a patented technique to create the icon: to create a 64-bit icon you would access four 256-row tables and XOR the resultant values. The technique allows processing 32 bits at a time from the input string.
AN EXAMPLE: 1. The XML Document <my-phone-book> <entry> <name>me</name> <nmbr>6198654227</nmbr> </entry> </my-phone-book>
AN EXAMPLE: 2. The Data Dictionary me 6198654227
AN EXAMPLE: 3. The Tag Dictionary my-phone-book>entry>name> entry>name> name> my-phone-book>entry>nmbr> entry>nmbr> nmbr>
AN EXAMPLE: 5. An Empty Data Core Index 1000 0111 0110 0101 0100 0011 0010 0001 0000
AN EXAMPLE: 6. The Data Core Index, 1 entry 1000 0111 0110 0101 0100 0011 0010 0001 0000
AN EXAMPLE: 7. The Data Core Index, 2 entries 1000 0111 0110 0101 0100 0011 0010 0001 0000
Dupe Index Map Store Data Dictionary Icon Gen-erator Index “Smith” “Smith” Figure 6-2. Pointers in the Duplicate Index for a Specific Data Item A TYPICAL QUERY for $a in document(“myPhoneBook.xml”)//[last=“Smith”] return count($a) “last>Smith”
KEY ARCHITECTURAL INNOVATIONS • Separate Tag Store and Data Store • No-string Map Store • Multi-Table Icon Generator • 1.5 Look-ups Core Indices • Use of Duplicate Indices
TESTING PHILOSOPHY • Interested only in the core database, not in the broader XMS • For this round of testing, did not evaluate the ability to handle huge amounts of data • A good test of the NeoCore architecture requires testing it with several different kinds of XML documents, as shown on the next page.
CLIENT SERVER Console NeoCore XMS Docu- ment Gene- ration Query Gene- ration HTTP KornShell Tools DB OS File System NeoCore API Query Tool HTTP Store Tool TESTING ARCHITECTURE Client and server were run on the same Dell 350 platform, using Windows XP Professional, a 3.0 GHz CPU, and 1.50 GB of RAM.
NINE TESTS 1. Disk space used 2. Flattening & restoring 3. Storage speed 4. Query accuracy 5. Query speed 6. Full core-index check 7. Extensibility test 8. Insertion test 9. Non-indexed string searches
DISK STORAGE USED Original XOO7 Benchmark Mean Results: approx 3.0:1 NeoCore XOO7 Benchmark Results: approx 2.5:1
Storage Time • Mean time to store a “small3.config” XOO7 Document was slightly over 21 seconds. • The XOO7 team reported the time to prepare or convert the XML document so that it could be stored. The mean time for a “small3.config” document was about 5 minutes. They did not report how much time it subsequently took to store the document. • Our testing: Dell 350, Windows XP Professional, 3GHz CPU, 1.5GB RAM. XOO7 original testing: SunOS 5.7 Unix, 333MHz, 256MB RAM
FINDINGS--STRENGTHS • Handles extensibility well—no difference between extensibility documents and XOO7 documents • Indexed searches very fast (if enough RAM)—almost 300 times faster than the original XOO7 findings for some queries (although some of that difference is due to difference in platform capabilities) • Disk requirements about the same as needed by the XOO7 team just for their conversion documents • Efficiency remains even when the core indices are full—effective collision management
FINDINGS--WEAKNESSES • In some cases, we do not get out exactly what was put in: • White-Space inaccuracies • Some dropped comments • Some DOCTYPE information is dropped or enclosed in database-created “<prolog>” tags • Duplicate indices can be a bottleneck • Core indices do not scale easily if expansion is needed
FUTURE RESEARCH NEEDS • Conduct this testing on a more heavy-duty platform using more robust data sets. • Test future releases—3.0 to have significant changes. • Analyze the patent for new duplicate indices once it is published on the USPTO site. • Publish an article on the (currently unimplemented) non-indexed substring search algorithm. • Do the same testing with NeoCore competitors, if they will support it (Tamino would not). • Publish related articles in IT journals.
Conclusion • On each key performance indicator the NeoCore XMS is as good or better than the initial results reported by the XOO7 team. • It handles the extensibility of XML well—its primary strength. • To realize full performance, run with enough accessible RAM to hold all indices and still have plenty of room for buffers. • The NeoCore database remains a work in progress: significant improvements are being planned for future releases. • Acceptance likely to depend more on the broader management system than on the database at its heart.