
Design Principles of Semantic Binary Database Management Systems

Design Principles of Semantic Binary Database Management Systems. Dmitry Vasilevsky Dissertation Defense March 31, 2004. Publications.


Presentation Transcript


  1. Design Principles of Semantic Binary Database Management Systems Dmitry Vasilevsky Dissertation Defense March 31, 2004

  2. Publications • Naphtali Rishe, Artyom Shaposhnikov, Alexander Vaschillo, Dmitry Vasilevsky, Shu-Ching Chen. "High Performance Lempel-Ziv compression using optimized longest string parsing and adaptive Huffman window size". IEEE Data Compression Conference 2000, March 28-30, 2000, Snowbird, Utah. • Naphtali Rishe, Alexander Vaschillo, Dmitry Vasilevsky, Artyom Shaposhnikov, Shu-Ching Chen. "The Architecture for Semantic Data Access to Heterogeneous Information Sources." ISCA 15th International Conference on Computers and Their Applications, New Orleans, Louisiana, March 29-31, 2000. • Naphtali Rishe, Alexander Vaschillo, Dmitry Vasilevsky, Artyom Shaposhnikov, Shu-Ching Chen. "A Benchmarking Technique for DBMS's with Advanced Data Models," ACM SIGMOD ADBIS-DASFAA Symposium on Advances in Databases and Information Systems, September 2000. • Naphtali Rishe, Jun Yuan, Rukshan Athauda, Xiaoling Lu, Xiaobin Ma, Alexander Vaschillo, Artyom Shaposhnikov, Dmitry Vasilevsky, Shu-Ching Chen. "Semantic Access to Databases with Easier Query Facility," VLDB-2000, September 10-14, 2000, Cairo, Egypt.

  3. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  4. Semantic Binary Model (SBM) • An object is the central notion of the semantic model. It can be any real-world entity • Objects are grouped into categories according to common properties • Binary relations connect objects. Example: "Instructor x teaches Student y" is an instance of the relation teaches from category Instructor to category Student

  5. Advantages of SBM • Powerful and intuitive model • Supports complex structured data hierarchies • Sparse data • Supports m:m binary relations and attributes natively • Allows complex ad-hoc queries

  6. SBM implementation • Represented as a set of facts about objects • A fact is a triple (x, R, y), denoted xRy, where R is the ID of the relation teaches, x is the ID of an Instructor object, and y is the ID of a Student object • Implemented as a set of facts lexicographically ordered in a B-tree for efficient querying: x1R1y1, x1R2y2, x2R1y1, …
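The lexicographic ordering on slide 6 is what makes prefix queries cheap. The following is a minimal sketch of that idea, assuming a sorted Python list as a stand-in for the B-tree; the object names and the helper `facts_with_prefix` are hypothetical and do not come from the presentation.

```python
from bisect import bisect_left

# Hypothetical stand-in for the B-tree: a sorted list of (x, R, y) facts.
facts = sorted([
    ("x1", "teaches", "y1"),
    ("x1", "teaches", "y2"),
    ("x1", "advises", "y3"),
    ("x2", "teaches", "y1"),
])

def facts_with_prefix(sorted_facts, prefix):
    """Return all facts whose leading components equal `prefix`."""
    # A shorter tuple sorts before any longer tuple it is a prefix of,
    # so bisect_left lands on the first matching fact.
    start = bisect_left(sorted_facts, prefix)
    result = []
    for fact in sorted_facts[start:]:
        if fact[:len(prefix)] != prefix:
            break  # sorted order: no later fact can match
        result.append(fact)
    return result

# "Whom does x1 teach?" becomes a single contiguous range scan.
students_of_x1 = [y for (_, _, y) in facts_with_prefix(facts, ("x1", "teaches"))]
```

A real B-tree would locate the start of the range in logarithmic time in the same way and then read neighbouring leaf entries.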

  7. Problematic applications • Application that collects data from sensors A, B, C, and D • Relational (one row per reading): ID1 A1 B1 C1 D1; ID2 A2 B2 C2 D2; ID3 A3 B3 C3 D3; … • Semantic (one fact per attribute value): ID1 A A1; ID1 B B1; ID1 C C1; ID1 D D1; ID2 A A2; ID2 B B2; ID2 C C2; ID2 D D2; ID3 A A3; ID3 B B3; … • Semantic inverse facts: A⁻¹ A1 ID1; A⁻¹ A2 ID2; … B⁻¹ B1 ID1; B⁻¹ B2 ID2; … C⁻¹ C1 ID1; C⁻¹ C2 ID2; … D⁻¹ D1 ID1; D⁻¹ D2 ID2; …

  8. Problem areas in Semantic DB • Performance on "relational" type applications (Example: TPC-D) • Excessive database size on "relational" type applications • Inability to tune the engine for different applications • Many improvements have been researched; a design is needed that combines them

  9. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  10. Proposed DBMS design • Kernel API at fact composition level • Data is generalized as binary; conversion is handled above the Kernel API, permitting pluggable data types

  11. Proposed kernel design • Physical representation is handled below the Kernel API, permitting different types of storage structures

  12. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  13. Storage structure: Fact • Original storage structure (xRy, xRv) • Very general and good for sparse data • Space overhead: to store every v, we need to store x, R, and v • Performance overhead: need to parse several facts to retrieve several attributes

  14. Storage structure: Record • Fact structure aR’v1v2…vN: fixed-size attributes in a positional record • Object ID and relation ID are stored only once for multiple attributes • Inapplicable to m:m relations and variable-size data • Other variations possible

  15. Fact and Record • Fact: ID1 A A1; ID1 B B1; ID1 C C1; ID1 D D1; ID2 A A2; ID2 B B2; ID2 C C2; ID2 D D2; ID3 A A3; ID3 B B3; … • Record: ID1 R’ A1 B1 C1 D1; ID2 R’ A2 B2 C2 D2; ID3 R’ A3 B3 C2 D2; … • Coexistence: ID1 R’ A1 B1 C1 D1; ID1 R X1,1; ID1 R X1,2; ID2 R’ A2 B2 C2 D2; …
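The space argument behind slides 13 to 15 can be made concrete with a short sketch. It assumes a fixed attribute order A, B, C, D and counts stored cells rather than bytes; `to_records` and `ATTRIBUTE_ORDER` are illustrative names, not part of the presented engine.

```python
ATTRIBUTE_ORDER = ("A", "B", "C", "D")  # assumed fixed positional layout

facts = [
    ("ID1", "A", "A1"), ("ID1", "B", "B1"), ("ID1", "C", "C1"), ("ID1", "D", "D1"),
    ("ID2", "A", "A2"), ("ID2", "B", "B2"), ("ID2", "C", "C2"), ("ID2", "D", "D2"),
]

def to_records(facts, record_relation="R'"):
    """Fold per-attribute facts into one positional record per object."""
    by_object = {}
    for obj, rel, value in facts:
        by_object.setdefault(obj, {})[rel] = value
    return [
        (obj, record_relation, *(attrs[name] for name in ATTRIBUTE_ORDER))
        for obj, attrs in sorted(by_object.items())
    ]

records = to_records(facts)

# Each fact repeats the object ID and a relation ID; a record stores them once.
fact_cells = sum(len(f) for f in facts)
record_cells = sum(len(r) for r in records)
```

For this toy input the fact form stores 24 cells and the record form 12, which illustrates why records suit dense, fixed-size attributes.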

  16. Index structures • User data: R⁻¹vx • Categorization information: Cx • Sets to represent: the set of objects belonging to a category, C = (0, 1, 3, 4, 5, …), or the set of objects with value of a total Boolean attribute = True • Fact: C 0; C 1; C 3; C 4; C 5; … • Membership vector over object IDs 0 1 2 3 4 5 …: True True False True True True … • Bitmap: 11011…

  17. Storage structure: Bitmap • Represents set of objects • One bit per object: 1 – object belongs to set, 0 – otherwise • Bitmap size is proportional to domain size. Very compact for dense data • Perfect for categorization information • Suitable for Flag, Boolean, and some other attribute types • Very fast set operations

  18. Set operations on bitmaps • Intersection, Union, and Subtraction • C: set of objects belonging to category C • B: set of objects with value of total Boolean attribute = True • C ∩ B: set of objects belonging to C with value of B = True • C: 11011011 11101100 … AND B: 11110111 01110111 … gives C ∩ B: 11010011 01100100 … • Intel SSE Extensions: 128 bits at a time on a 32-bit CPU
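The bitwise example on slide 18 can be reproduced directly. This sketch uses Python integers as bitmaps, which is only an illustration of the set semantics; it says nothing about the SSE-based implementation mentioned on the slide, and the helper names are hypothetical.

```python
def bitmap_from_bits(bits: str) -> int:
    """Parse a bit string such as '11011011 11101100' into an integer bitmap."""
    return int(bits.replace(" ", ""), 2)

def to_bits(bitmap: int, width: int = 16) -> str:
    """Format a bitmap as space-separated bytes, most significant bit first."""
    s = format(bitmap, f"0{width}b")
    return " ".join(s[i:i + 8] for i in range(0, width, 8))

C = bitmap_from_bits("11011011 11101100")  # objects in category C
B = bitmap_from_bits("11110111 01110111")  # objects with Boolean attribute = True

intersection = C & B   # in C and B is True
union = C | B          # in C or B is True
subtraction = C & ~B   # in C but B is not True

print(to_bits(intersection))  # 11010011 01100100
```

Each operation is a single machine-level instruction per word, which is why the slide calls set operations on bitmaps very fast.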

  19. Bitmap efficiency

  20. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  21. Bitmap optimizations • Bitmap for categorization information • Size is proportional to the number of objects in the database • If the database has many categories, each bitmap contains many 0-s • How to avoid storing 0-s for objects that cannot belong to the category under consideration?

  22. Bitmap control structure • Bitmap: 00000000 00000000 … 00000000 (all 0-s) | 11011011 11011111 … 11011111 (mixed) | 11111111 11111111 … 11111111 (all 1-s) | … • Control structure (array of 4-8-byte entries): All 0-s; Pointer; All 1-s; … • Bitmap data (16-128K blocks): only the mixed block 11011011 11011111 … 11011111 is stored, and the Pointer entry refers to it

  23. Bitmap compression • Introduced a control structure to the bitmap. Replaced blocks of 0-s and blocks of 1-s with special array entries • Very simple and efficient compression • How to ensure that there are blocks of 0-s and 1-s? • We have control over object ID distribution!
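The control structure from slides 22 and 23 can be sketched as follows. This is a toy version under stated assumptions: blocks are 8 bits instead of the 16-128K blocks on the slide, bitmaps are bit strings, and a "pointer" is just an index into a list of stored blocks. The names `compress`, `contains`, `ALL_ZEROS`, and `ALL_ONES` are hypothetical.

```python
BLOCK_BITS = 8  # tiny block size for illustration only

ALL_ZEROS, ALL_ONES = "all-0", "all-1"  # special control entries

def compress(bits: str):
    """Split a bit string into blocks; store only mixed blocks.

    Returns (control, data): control has one entry per block, either a
    special marker or an index ("pointer") into data.
    """
    control, data = [], []
    for i in range(0, len(bits), BLOCK_BITS):
        block = bits[i:i + BLOCK_BITS]
        if set(block) == {"0"}:
            control.append(ALL_ZEROS)
        elif set(block) == {"1"} and len(block) == BLOCK_BITS:
            control.append(ALL_ONES)
        else:
            control.append(len(data))  # pointer to the stored block
            data.append(block)
    return control, data

def contains(control, data, object_id: int) -> bool:
    """Test set membership without decompressing the bitmap."""
    entry = control[object_id // BLOCK_BITS]
    if entry == ALL_ZEROS:
        return False
    if entry == ALL_ONES:
        return True
    return data[entry][object_id % BLOCK_BITS] == "1"

control, data = compress("00000000" + "11011011" + "11111111")
```

Here only one of the three blocks is physically stored; the other two cost a single control entry each.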

  24. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  25. Object ID generation • Originally random generation or sequential generation • Simple, but doesn’t provide any benefits • The proposed algorithm generates object IDs in blocks. Blocks are given to categories • Every created object receives an ID from its category

  26. Clustered ID distribution • Central ID generator hands out ID blocks to categories • Bitmap of Abstract Object: All 1-s; All 1-s; All 1-s; … • Bitmap of Category C1: All 1-s; All 0-s; All 0-s; … • Bitmap of Category C2: All 0-s; All 1-s; All 0-s; … • Bitmap of Category C3: All 0-s; All 0-s; All 1-s; …

  27. Clustered distribution benefits • Wanted to avoid storing 0-s, but got more, almost total compression! • Still need bitmap data for last bitmap block, subcategories, deleted objects • New objects will re-use IDs • Object migration may spoil picture • Object migration is specific to limited number of applications
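A minimal sketch of block-based ID generation as described on slides 25 to 27, assuming a central generator that hands out fixed-size contiguous blocks and categories that draw IDs from their current block. The class names and the block size are hypothetical; ID reuse after deletion and object migration are not modelled.

```python
class CentralIdGenerator:
    """Hands out contiguous, non-overlapping blocks of object IDs."""

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.next_block = 0

    def allocate_block(self) -> range:
        start = self.next_block * self.block_size
        self.next_block += 1
        return range(start, start + self.block_size)


class Category:
    """Assigns IDs to new objects from blocks owned by this category."""

    def __init__(self, name: str, generator: CentralIdGenerator):
        self.name = name
        self.generator = generator
        self.blocks = []          # blocks owned so far
        self._free = iter(())     # unused IDs in the current block

    def new_object_id(self) -> int:
        try:
            return next(self._free)
        except StopIteration:
            # Current block exhausted: request a fresh one.
            block = self.generator.allocate_block()
            self.blocks.append(block)
            self._free = iter(block)
            return next(self._free)


generator = CentralIdGenerator(block_size=4)
c1 = Category("C1", generator)
c2 = Category("C2", generator)

c1_ids = [c1.new_object_id() for _ in range(4)]  # fills block 0
c2_ids = [c2.new_object_id() for _ in range(2)]  # starts block 1
c1_more = c1.new_object_id()                     # C1 moves on to block 2
```

Because every block belongs to exactly one category, a full block is all 1-s in that category's bitmap and all 0-s in the others, so only partially filled blocks need stored bitmap data.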

  28. Object ID Encoding • Fixed-size object IDs don’t require encoding but waste too much space • A variable-length encoding E must satisfy certain properties to be valid: 1. ∀a,b ∈ [0..N]: a < b ⇒ E(a) < E(b) 2. ∀a,b ∈ [0..N] with E(a) < E(b), and ∀ strings α, β: E(a)α < E(b)β 3. ∀a,b ∈ [0..N]: if ∃ strings α, β such that E(a)α = E(b)β, then a = b • Example: Original 0005, 0012, 0051, …; Encoded 12, 5, 51, …; violates 1
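The counterexample on slide 28 appears to show what happens when a naive variable-length encoding is compared lexicographically. The sketch below assumes the "encoded" column is obtained by dropping leading zeros; that reading is an inference from the slide, not something it states.

```python
ids = [5, 12, 51]

# Fixed-size form: zero-padded to four digits, as in the "Original" column.
fixed = [f"{i:04d}" for i in ids]

# Assumed naive variable-length form: leading zeros dropped.
stripped = [str(i) for i in ids]

# Fixed-size strings sort in the same order as the numbers.
fixed_keeps_order = sorted(fixed) == fixed

# The stripped strings do not: "12" sorts before "5".
stripped_sorted = sorted(stripped)
stripped_keeps_order = stripped_sorted == stripped
```

Since facts are kept in lexicographic order, an encoding that breaks the order of IDs would also break range scans over them.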

  29. Encoding efficiency • Interval-based approach to encoding is not suitable for Object IDs. All object IDs are equally likely; and algorithm is slow • Variable encoding saves space for small object IDs (like schema objects) • Encoding/decoding should be fast

  30. Proposed object ID encoding Mapped length encoding • 1 byte: 00 010011 (64 IDs) • 2 bytes: 01 010011 00110010 (16320 IDs) • 4 bytes: 10 010011 00110010 00110010 00110010 (~1 billion IDs) • 8 bytes: 11 010011 00110010 00110010 00110010 … 00110010 (~4.6 · 10^18 IDs) • Object space is 2^62 ≈ 4611 Peta objects • Very simple encoding/decoding, about 10 Intel assembly instructions • Copying an 8-byte ID requires 4 instructions

  31. Mapped length encoding • Use first bits to encode length of ID • Use remaining bits and bytes to place integer directly • Use 1-, 2-, 4-, and 8-byte encoded IDs • Use short 1- and 2-byte IDs for objects in the schema, because they are present in every fact
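Slides 30 and 31 describe the encoding closely enough for a sketch: the first two bits select a length of 1, 2, 4, or 8 bytes and the remaining bits hold the integer directly. The version below assumes big-endian byte order and always picks the shortest form that fits; both choices, and the names `encode_id` and `decode_id`, are assumptions rather than details from the presentation.

```python
# (length in bytes, 2-bit tag), shortest first; list index equals the tag.
_LENGTHS = [(1, 0b00), (2, 0b01), (4, 0b10), (8, 0b11)]

def encode_id(value: int) -> bytes:
    """Encode an object ID using the shortest mapped-length form that fits."""
    if value < 0:
        raise ValueError("object IDs are non-negative")
    for length, tag in _LENGTHS:
        payload_bits = 8 * length - 2
        if value < (1 << payload_bits):
            return ((tag << payload_bits) | value).to_bytes(length, "big")
    raise ValueError("object ID does not fit in 62 bits")

def decode_id(buf: bytes, offset: int = 0):
    """Decode one ID starting at `offset`; return (value, next_offset)."""
    tag = buf[offset] >> 6                 # first two bits give the length
    length = _LENGTHS[tag][0]
    payload_bits = 8 * length - 2
    raw = int.from_bytes(buf[offset:offset + length], "big")
    return raw & ((1 << payload_bits) - 1), offset + length

# The 1-byte example from slide 30: tag 00, payload 010011 (decimal 19).
assert encode_id(19) == bytes([0b00010011])

# Concatenated IDs can be decoded without separators.
fact = encode_id(19) + encode_id(1000) + encode_id(5_000_000)
x, pos = decode_id(fact)
r, pos = decode_id(fact, pos)
y, pos = decode_id(fact, pos)
```

With shortest-form selection, the 2-byte form is used only for values 64 through 16383, which is 16320 values and matches the count on slide 30. Byte-wise comparison of the encoded forms also follows numeric order, because a longer form always carries a larger tag in its first byte.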

  32. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  33. Why extend SBM? • Original semantic binary model treats attributes of objects equally • Good under assumption that query based on every attribute is equally likely • Semantic database is optimized for ad-hoc queries • However, many applications (OLTP) have pre-defined set of queries

  34. Proposed extensions • Removing of reverse facts • Ordered categories • Ordered relations • FIFO and LIFO ordering

  35. Removing reverse facts • Reverse facts (R⁻¹vx) allow fast execution of queries ?Rv and ?R[v1, v2] • Reverse facts introduce redundancy and double the storage space • When the application's query set doesn’t include queries on R, these facts can be removed

  36. Reverse facts example • Application collects sensor data • Semantic schema: Measurement (TakenAt: DateTime, total; V1: 1..1024, total; V2: 1..1024, total; V3: 1..1024, total; V4: 1..1024, total) • Original implementation, direct facts: id1 TakenAt 04/1/1; id1 V1 5; id1 V2 128; id1 V3 100; id1 V4 120; id2 TakenAt 04/1/2; id2 V1 6; id2 V2 120; id2 V3 101; id2 V4 1 • Original implementation, reverse facts: TakenAt 04/1/1 id1; TakenAt 04/1/2 id2; V1 5 id1; V1 6 id2; V2 120 id2; V2 128 id1; V3 100 id1; V3 101 id2; V4 1 id2; V4 120 id1 • Record representation with reverse facts removed: id1 R 04/1/1 5 128 100 120; id2 R 04/1/2 6 120 101 1; TakenAt 04/1/1 id1; TakenAt 04/1/2 id2

  37. Ordered Categories • Order based on object IDs is not useful for the end user • Order on combination of attributes (Last Name, First Name) • Search on combination of attributes (First, Last, Middle) • Fact structure Cv1v2…vNx

  38. Ordered category example • Application that only searches for people by Last and First name • Classical facts: C id1; C id2; C id3; C id4 • Classical query: get objects of category C; get objects with first name Sarah; get objects with last name Connor; intersect the three sets • Ordered facts: C Anderson Thomas id2; C Connor John id3; C Connor Sarah id4; C Smith John id1 • Ordered query: get facts with prefix “C Connor Sarah”
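The two query plans on slide 38 can be compared side by side. This sketch uses the four people from the slide, Python sets for the classical plan, and a sorted list of `(C, last, first, id)` tuples in place of the B-tree for the ordered plan; the helper `prefix_scan` is a hypothetical name.

```python
from bisect import bisect_left

people = {
    "id1": ("Smith", "John"),
    "id2": ("Anderson", "Thomas"),
    "id3": ("Connor", "John"),
    "id4": ("Connor", "Sarah"),
}

# Classical plan: build three sets and intersect them.
category_c = set(people)
first_is_sarah = {oid for oid, (_, first) in people.items() if first == "Sarah"}
last_is_connor = {oid for oid, (last, _) in people.items() if last == "Connor"}
classical_result = category_c & first_is_sarah & last_is_connor

# Ordered plan: facts of the form C v1 v2 x, kept in sorted order.
ordered_facts = sorted(("C", last, first, oid) for oid, (last, first) in people.items())

def prefix_scan(sorted_facts, prefix):
    """Return the contiguous run of facts starting with `prefix`."""
    start = bisect_left(sorted_facts, prefix)
    run = []
    for fact in sorted_facts[start:]:
        if fact[:len(prefix)] != prefix:
            break
        run.append(fact)
    return run

ordered_result = {fact[-1] for fact in prefix_scan(ordered_facts, ("C", "Connor", "Sarah"))}
all_connors = [fact[-1] for fact in prefix_scan(ordered_facts, ("C", "Connor"))]
```

Both plans return the same object, but the ordered category answers the query with one range scan, and a shorter prefix gives the results already sorted by first name.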

  39. Ordered relations • Both on range and on domain • Fact structure is aRv1v2…vNb • Example query: give me all customers of this store ordered by (Last Name, First Name) • Relation R from category C1 to category C2, ordered by name: Anderson Thomas; Connor John; Connor Sarah • Facts: id1 R Anderson Thomas id2; id1 R Connor John id3; id1 R Connor Sarah id4

  40. Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability

  41. XML Semantic Definition Language • Language that defines a database schema in a readable format • Language to export/import schema and data from/to the database • A semantic database needs to be interoperable with other software on the market • The proposed XSDL language is an XML representation of a semantic database

  42. XSDL – Representing schema • Represents schema in an interoperable standard format • Preserves all information from schema • Future-proof by using standard XML features – containment and references • Microsoft applied for US patent on similar representation in 2003

  43. Schema representation example • Schema: Person (First Name: string, total; Last Name: string, total; Birth Year: 1900..2100) • XSDL:
      <Database name="University">
        <Schema>
          <Category Name="Person" Type="Abstract">
            <Attribute Name="First Name" Range="String" IsTotal="True" />
            <Attribute Name="Last Name" Range="String" IsTotal="True" />
            <Attribute Name="Birth Year" Range="BirthYears" />
          </Category>
          <Category Name="BirthYears" Type="Concrete">
            <Integer LowerBound="1900" UpperBound="2100" />
          </Category>
        </Schema>
      </Database>

  44. Containment allows extensions • Schema: Person (First Name: string, total; Last Name: string, total; Birth Year: 1900..2100) • Old tool renames category: Person becomes Personal Data (First Name: string, total; Last Name: string, total; Birth Year: 1900..2100) • Old format:
      .cA STUDENT
      .aT STUDENT first-name string, total
      .aT STUDENT last-name string, total
      .aT STUDENT birth-year 1900..2100
      .tC DISPLAY STUDENT 5 120 Solid
     • XSDL:
      <Category Name="Person" Type="Abstract">
        <Attribute Name="First Name" Range="String" IsTotal="True" />
        <Attribute Name="Last Name" Range="String" IsTotal="True" />
        <Attribute Name="Birth Year" Range="BirthYears" />
        <DisplayTool X="5" Y="120" Border="Solid" />
      </Category>

  45. XSDL – representing data • XSDL represents data in interoperable standard format – XML • XSDL fully preserves logical structure of the database • Supports several options for convenience and application compatibility • Custom schemas may be supported by applying XSLT in XML domain • Future work may analyze native support for custom schemas

  46. Data representation example • Schema: Disk (Title: string, total; Author: string); Track (Number: 1..99, total; Title: string; Author: string); relation Of (m:1) from Track to Disk • XSDL:
      <Track>
        <Object ID="20">
          <Number>1</Number>
          <Title>Song about a Crow</Title>
          <Of>1</Of>
        </Object>
        <Object ID="21">
          <Number>2</Number>
          <Title>Habla español</Title>
          <Of>1</Of>
        </Object>
      </Track>
      <Disk>
        <Object ID="1">
          <Title>Live in Seattle</Title>
          <Author>Anna Stovall</Author>
        </Object>
      </Disk>
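Because the exported data is plain XML, any standard parser can consume it. The sketch below reads the slide 46 example with Python's `xml.etree.ElementTree` and resolves the `Of` references from tracks to disks. It assumes well-formed `Number` elements and a single wrapping root element, here called `Data`; the root name is hypothetical, since the slide does not show the enclosing element.

```python
import xml.etree.ElementTree as ET

# Slide 46 data, wrapped in a hypothetical <Data> root so it parses as one document.
XSDL_DATA = """
<Data>
  <Track>
    <Object ID="20">
      <Number>1</Number>
      <Title>Song about a Crow</Title>
      <Of>1</Of>
    </Object>
    <Object ID="21">
      <Number>2</Number>
      <Title>Habla español</Title>
      <Of>1</Of>
    </Object>
  </Track>
  <Disk>
    <Object ID="1">
      <Title>Live in Seattle</Title>
      <Author>Anna Stovall</Author>
    </Object>
  </Disk>
</Data>
"""

root = ET.fromstring(XSDL_DATA)

# Map each Disk object ID to its title.
disks = {obj.get("ID"): obj.findtext("Title") for obj in root.find("Disk")}

# Follow the m:1 relation "Of" from each track to its disk.
tracks = [
    (int(obj.findtext("Number")), obj.findtext("Title"), disks[obj.findtext("Of")])
    for obj in root.find("Track")
]
```

Each track carries the object ID of its disk in `Of`, so the relation is resolved with a dictionary lookup rather than by nesting tracks inside disks.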

  47. Demo – interoperability Consuming unloaded data in applications

  48. Main contributions • Novel engine design that allows tuning performance • New data structures and algorithms • Way of combining existing data structures and new ones • XML representation of Semantic Database facilitates interoperability • 4 publications in scientific conferences

  49. Q & A Session

  50. Florida International University School of Computer Science High Performance Database Research Center
