Design Principles of Semantic Binary Database Management Systems. Dmitry Vasilevsky. Dissertation Defense, March 31, 2004.
Publications
• Naphtali Rishe, Artyom Shaposhnikov, Alexander Vaschillo, Dmitry Vasilevsky, Shu-Ching Chen. "High Performance Lempel-Ziv compression using optimized longest string parsing and adaptive Huffman window size." IEEE Data Compression Conference 2000, March 28-30, 2000, Snowbird, Utah.
• Naphtali Rishe, Alexander Vaschillo, Dmitry Vasilevsky, Artyom Shaposhnikov, Shu-Ching Chen. "The Architecture for Semantic Data Access to Heterogeneous Information Sources." ISCA 15th International Conference on Computers and Their Applications, New Orleans, Louisiana, March 29-31, 2000.
• Naphtali Rishe, Alexander Vaschillo, Dmitry Vasilevsky, Artyom Shaposhnikov, Shu-Ching Chen. "A Benchmarking Technique for DBMS's with Advanced Data Models." ACM SIGMOD ADBIS-DASFAA Symposium on Advances in Databases and Information Systems, September 2000.
• Naphtali Rishe, Jun Yuan, Rukshan Athauda, Xiaoling Lu, Xiaobin Ma, Alexander Vaschillo, Artyom Shaposhnikov, Dmitry Vasilevsky, Shu-Ching Chen. "Semantic Access to Databases with Easier Query Facility." VLDB-2000, September 10-14, 2000, Cairo, Egypt.
Main Topics • Semantic Binary Model Overview • Database Engine Design Overview • Fact, Record, Bitmap data structures • Optimization of Bitmaps • Object ID distribution and encoding • Semantic data model extensions • XML representation and interoperability
Semantic Binary Model (SBM)
• The object is the central notion of the semantic model. It can be any real-world entity
• Objects are grouped into categories according to common properties
• Binary relations connect objects, e.g. instructor x teaches student y
[Diagram: category Instructor linked to category Student by the relation teaches]
Advantages of SBM • Powerful and intuitive model • Supports complex structured data hierarchies • Sparse data • Supports m:m binary relations and attributes natively • Allows complex ad-hoc queries
SBM implementation
• Represented as a set of facts about objects
• R is the ID of the relation teaches, x is the ID of an Instructor object, y is the ID of a Student object
• A fact is the triple (x, R, y), denoted xRy
• Implemented as a set of facts lexicographically ordered in a B-Tree for efficient querying:
  x1R1y1
  x1R2y2
  x2R1y1
  …
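The ordered-fact idea can be illustrated with a minimal Python sketch. It assumes integer IDs and uses a sorted list with `bisect` as a stand-in for the B-Tree; the class name `FactStore` and its methods are hypothetical, not part of the presented system.

```python
import bisect


class FactStore:
    """Toy fact store: a sorted list stands in for the B-Tree (assumption)."""

    def __init__(self):
        self._facts = []  # sorted list of (x, R, y) tuples

    def add(self, x, r, y):
        fact = (x, r, y)
        i = bisect.bisect_left(self._facts, fact)
        # Facts form a set, so duplicates are ignored.
        if i == len(self._facts) or self._facts[i] != fact:
            self._facts.insert(i, fact)

    def query(self, x, r=None):
        """Return all facts with prefix (x) or (x, R) by scanning one range."""
        prefix = (x,) if r is None else (x, r)
        start = bisect.bisect_left(self._facts, prefix)
        result = []
        for fact in self._facts[start:]:
            if fact[:len(prefix)] != prefix:
                break
            result.append(fact)
        return result


store = FactStore()
store.add(2, 10, 7)
store.add(1, 11, 8)
store.add(1, 10, 7)
print(store.query(1))      # every fact about object 1
print(store.query(1, 11))  # facts of object 1 for relation 11
```

Because the facts are kept in lexicographic order, all facts about one object (or one object and relation) are adjacent, so a query is a single seek followed by a sequential scan.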
Problematic applications
Application that collects data from sensors A, B, C, and D

Relational (one row per measurement):
  ID1 A1 B1 C1 D1
  ID2 A2 B2 C2 D2
  ID3 A3 B3 C3 D3
  …

Semantic (one fact per attribute value):
  ID1 A A1
  ID1 B B1
  ID1 C C1
  ID1 D D1
  ID2 A A2
  ID2 B B2
  ID2 C C2
  ID2 D D2
  ID3 A A3
  ID3 B B3
  …

Semantic, inverse facts (A-1 denotes the inverse of relation A):
  A-1 A1 ID1
  A-1 A2 ID2
  …
  B-1 B1 ID1
  B-1 B2 ID2
  …
  C-1 C1 ID1
  C-1 C2 ID2
  …
  D-1 D1 ID1
  D-1 D2 ID2
  …
Problem areas in Semantic DB • Performance on “relational” type applications (Example: TPC-D) • Excessive database size on “relational” type applications • Inability to tune the engine for different applications • Many improvements have been researched; a design that combines them is needed
Proposed DBMS design • Kernel API at the fact composition level • Data is generalized as binary; conversion is handled above the Kernel API, permitting pluggable data types
Proposed kernel design • Physical representation is handled below Kernel API permitting different types of storage structures
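One way to picture this layering is the sketch below. It is only an illustration under assumptions: the class names (`Int32Type`, `StringType`, `Kernel`) and method names are hypothetical, and the byte encodings are chosen here for the example, not taken from the presented engine. Data types above the kernel convert values to binary; the kernel stores and orders opaque byte strings and is free to choose any storage structure underneath.

```python
import struct


class Int32Type:
    """Pluggable type (hypothetical): signed 32-bit integers."""

    def to_binary(self, value):
        # Bias by 2**31 so that byte order matches numeric order.
        return struct.pack(">I", value + 2**31)

    def from_binary(self, data):
        return struct.unpack(">I", data)[0] - 2**31


class StringType:
    """Pluggable type (hypothetical): UTF-8 strings."""

    def to_binary(self, value):
        return value.encode("utf-8")

    def from_binary(self, data):
        return data.decode("utf-8")


class Kernel:
    """Sees only binary facts; knows nothing about data types."""

    def __init__(self):
        self._facts = set()  # storage structure is the kernel's own choice

    def add_fact(self, x, r, v):
        self._facts.add((x, r, v))

    def facts(self):
        return sorted(self._facts)


int_type = Int32Type()
kernel = Kernel()
for reading in (5, -3, 120):
    kernel.add_fact(b"\x01", b"\x0a", int_type.to_binary(reading))
print([int_type.from_binary(v) for _, _, v in kernel.facts()])
```

Adding a new data type then means adding a converter above the kernel, with no change to the kernel or its storage structures.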
Storage structure: Fact • Original storage structure (xRy, xRv) • Very general and good for sparse data • Space overhead: to store every v, we need to store x, R, and v • Performance overhead: need to parse several facts to retrieve several attributes
Storage structure: Record • Fact structure aR’v1v2…vN: fixed-size attributes in a positional record • Object ID and relation ID are stored only once for multiple attributes • Inapplicable to m:m relations and variable-size data • Other variations possible
Fact and Record

Fact:
  ID1 A A1
  ID1 B B1
  ID1 C C1
  ID1 D D1
  ID2 A A2
  ID2 B B2
  ID2 C C2
  ID2 D D2
  ID3 A A3
  ID3 B B3
  …

Record:
  ID1 R’ A1 B1 C1 D1
  ID2 R’ A2 B2 C2 D2
  ID3 R’ A3 B3 C3 D3
  …

Coexistence:
  ID1 R’ A1 B1 C1 D1
  ID1 R X1,1
  ID1 R X1,2
  ID2 R’ A2 B2 C2 D2
  …
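The difference between the two structures can be sketched in Python as follows. This is an illustration under assumptions: the 16-bit attribute width, the `RECORD_FORMAT` layout, and the helper names are hypothetical choices made for the example.

```python
import struct

# Assumed layout: four fixed-size attributes, each an unsigned 16-bit integer.
RECORD_FORMAT = ">4H"


def as_facts(obj_id, relation_ids, values):
    """Fact structure: one (x, R, v) triple per attribute value."""
    return [(obj_id, r, v) for r, v in zip(relation_ids, values)]


def as_record(obj_id, record_relation_id, values):
    """Record structure: object ID and relation ID stored once, values by position."""
    return (obj_id, record_relation_id, struct.pack(RECORD_FORMAT, *values))


def read_attribute(record, position):
    """Fetch one attribute from a record by its position."""
    return struct.unpack(RECORD_FORMAT, record[2])[position]


facts = as_facts(1, [10, 11, 12, 13], [5, 128, 100, 120])
record = as_record(1, 20, [5, 128, 100, 120])
print(len(facts), "facts versus 1 record with", len(record[2]), "bytes of values")
print(read_attribute(record, 1))
```

The record repeats neither the object ID nor a relation ID per value, which is where the space saving comes from; the price is that every attribute must have a fixed size and a single value.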
Index structures
• User data: R-1vx
• Categorization information: Cx
• Sets to represent: the set of objects with value of a total Boolean attribute = True; the set of objects belonging to a category
Example: category C = (0, 1, 3, 4, 5, …)
  Facts: C 0, C 1, C 3, C 4, C 5, …
  Membership vector (objects 0 1 2 3 4 5 …): True True False True True True …
  Bitmap: 11011…
Storage structure: Bitmap • Represents set of objects • One bit per object: 1 – object belongs to set, 0 – otherwise • Bitmap size is proportional to domain size. Very compact for dense data • Perfect for categorization information • Suitable for Flag, Boolean, and some other attribute types • Very fast set operations
Set operations on bitmaps
• Intersection, Union, and Subtraction
  C: set of objects belonging to category C
  B: set of objects with value of total Boolean attribute = True
  C ∩ B: set of objects belonging to C with value of B = True
      C:     11011011 11101100 …
  AND B:     11110111 01110111 …
      C ∩ B: 11010011 01100100 …
• Intel SSE extensions: 128 bits at a time on a 32-bit CPU
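A small Python sketch of these set operations, using arbitrary-precision integers as bitmaps (bit i set means object i belongs to the set). The helper names are hypothetical, and plain Python integers stand in for the word-at-a-time or SSE operations a real engine would use.

```python
def bitmap_from_ids(ids):
    """Build a bitmap with one bit per object ID."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits


def ids_from_bitmap(bits):
    """List the object IDs whose bit is set."""
    ids = []
    i = 0
    while bits:
        if bits & 1:
            ids.append(i)
        bits >>= 1
        i += 1
    return ids


c = bitmap_from_ids([0, 1, 3, 4, 5])  # objects in category C
b = bitmap_from_ids([1, 2, 3, 5])     # objects with Boolean attribute B = True

print(ids_from_bitmap(c & b))   # intersection
print(ids_from_bitmap(c | b))   # union
print(ids_from_bitmap(c & ~b))  # subtraction: in C but not in B
```

Each operation is a single bitwise instruction per machine word, which is why set operations on bitmaps are so fast.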
Bitmap optimizations • Bitmap for categorization information • Size is proportional to the number of objects in the database • If the database has many categories, the bitmaps contain many 0-s • How to avoid storing 0-s for objects that cannot belong to the category in question?
Bitmap control structure
Bitmap (by block):
  All 0-s: 00000000 00000000 … 00000000
  Mixed:   11011011 11011111 … 11011111
  All 1-s: 11111111 11111111 … 11111111
  …
Control structure (array of 4-8-byte entries): All 0-s | Pointer | All 1-s | …
Bitmap data (16-128K blocks, stored for mixed blocks only): 11011011 11011111 … 11011111 …
Bitmap compression • Introduced a control structure to the bitmap. Replaced blocks of 0-s and blocks of 1-s with special array entries • Very simple and efficient compression • How to ensure that there are blocks of 0-s and 1-s? • We have control over object ID distribution!
Object ID generation • Originally, IDs were generated randomly or sequentially • Simple, but doesn’t provide any benefits • The proposed algorithm generates object IDs in blocks. Blocks are given to categories • Every created object receives an ID from its category
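Clustered ID distribution
The central ID generator hands out ID blocks to categories. Resulting bitmaps, block by block:
  Abstract Object: All 1-s | All 1-s | All 1-s | …
  Category C1:     All 1-s | All 0-s | All 0-s | …
  Category C2:     All 0-s | All 1-s | All 0-s | …
  Category C3:     All 0-s | All 0-s | All 1-s | …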
Clustered distribution benefits • Wanted to avoid storing 0-s, but got more, almost total compression! • Still need bitmap data for last bitmap block, subcategories, deleted objects • New objects will re-use IDs • Object migration may spoil picture • Object migration is specific to limited number of applications
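A minimal sketch of block-based ID generation, assuming a fixed block size and a simple free list for re-using the IDs of deleted objects. The class names, the block size, and the free-list policy are assumptions made for illustration.

```python
class CentralIdGenerator:
    """Hands out consecutive ID blocks (block size is an assumed parameter)."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.next_block = 0

    def allocate_block(self):
        start = self.next_block * self.block_size
        self.next_block += 1
        return start


class Category:
    """Draws object IDs from blocks owned by this category."""

    def __init__(self, name, generator):
        self.name = name
        self.generator = generator
        self.next_id = None
        self.block_end = None
        self.free_ids = []  # IDs of deleted objects, re-used first

    def new_object_id(self):
        if self.free_ids:
            return self.free_ids.pop()
        if self.next_id is None or self.next_id == self.block_end:
            # Current block is exhausted: ask the central generator for another.
            self.next_id = self.generator.allocate_block()
            self.block_end = self.next_id + self.generator.block_size
        object_id = self.next_id
        self.next_id += 1
        return object_id

    def delete(self, object_id):
        self.free_ids.append(object_id)


generator = CentralIdGenerator(block_size=4)
c1 = Category("C1", generator)
c2 = Category("C2", generator)
print([c1.new_object_id() for _ in range(4)])  # fills C1's first block
print(c2.new_object_id())                      # C2 starts a block of its own
print(c1.new_object_id())                      # C1 moves on to a new block
```

With this distribution every full block belongs to exactly one category, so a category bitmap consists almost entirely of all-1 and all-0 blocks.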
Object ID Encoding
• Fixed-size object IDs don’t require encoding but waste too much space
• A variable-length encoding φ must satisfy certain properties to be valid:
  1. ∀ a, b ∈ [0..N]: a < b ⇒ φ(a) < φ(b)
  2. ∀ a, b ∈ [0..N]: φ(a) < φ(b) ⇒ ∀ strings α, β: φ(a)α < φ(b)β
  3. ∀ a, b ∈ [0..N]: if ∃ strings α, β such that φ(a)α = φ(b)β, then a = b
Example:
  Original: 0005, 0012, 0051, …
  Encoded:  5, 12, 51, … (lexicographic order: 12, 5, 51, …)
  Violates 1
Encoding efficiency • An interval-based approach to encoding is not suitable for object IDs: all object IDs are equally likely, and the algorithm is slow • Variable-length encoding saves space for small object IDs (like schema objects) • Encoding/decoding should be fast
Proposed object ID encoding
Mapped length encoding
  1 byte:  00 010011 (64 IDs)
  2 bytes: 01 010011 00110010 (16320 IDs)
  4 bytes: 10 010011 00110010 00110010 00110010 (~1 billion IDs)
  8 bytes: 11 010011 00110010 00110010 00110010 … 00110010 (~4.6 · 10^18 IDs)
• Object space is 2^62 ≈ 4611 Peta objects
• Very simple encoding/decoding, about 10 Intel assembly instructions
• Copying an 8-byte ID requires 4 instructions
Mapped length encoding • Use first bits to encode length of ID • Use remaining bits and bytes to place integer directly • Use 1-, 2-, 4-, and 8-byte encoded IDs • Use short 1- and 2-byte IDs for objects in the schema, because they are present in every fact
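The scheme can be sketched in Python as follows. Two assumptions are made here: bytes are laid out big-endian, and every ID uses the shortest encoding that fits it (which matches the 16320 = 16384 - 64 count given for 2-byte IDs). The function names are hypothetical.

```python
# (size in bytes, 2-bit length tag, exclusive upper bound of representable IDs)
_SIZES = [(1, 0b00, 1 << 6), (2, 0b01, 1 << 14), (4, 0b10, 1 << 30), (8, 0b11, 1 << 62)]


def encode_id(n):
    """Encode an object ID with the shortest form that fits it."""
    if n < 0:
        raise ValueError("object IDs are non-negative")
    for size, tag, limit in _SIZES:
        if n < limit:
            # First two bits hold the length tag; the rest is the integer itself.
            return ((tag << (8 * size - 2)) | n).to_bytes(size, "big")
    raise ValueError("object ID does not fit in 62 bits")


def decode_id(data):
    """Decode one ID from the start of data; return (value, bytes consumed)."""
    size = (1, 2, 4, 8)[data[0] >> 6]
    mask = (1 << (8 * size - 2)) - 1
    return int.from_bytes(data[:size], "big") & mask, size


for object_id in (19, 64, 16384, 1 << 30):
    encoded = encode_id(object_id)
    print(object_id, encoded.hex(), decode_id(encoded))
```

Under these assumptions the byte strings compare in the same order as the IDs, and because the length is known from the first byte, no encoded ID is a prefix of another, so facts built from concatenated IDs still sort correctly.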
Why extend SBM? • Original semantic binary model treats attributes of objects equally • Good under assumption that query based on every attribute is equally likely • Semantic database is optimized for ad-hoc queries • However, many applications (OLTP) have pre-defined set of queries
Proposed extensions • Removal of reverse facts • Ordered categories • Ordered relations • FIFO and LIFO ordering
Removing reverse facts • Reverse facts (R-1vx) allow fast execution of queries ?Rv and ?R[v1, v2] • Reverse facts introduce redundancy and double storage space • When application query set doesn’t include queries on R, these facts can be removed
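What reverse facts buy, and what they cost, can be seen in a short Python sketch. A sorted list again stands in for the B-Tree, and the function names are hypothetical.

```python
import bisect


def build_reverse(facts):
    """Turn direct facts (x, R, v) into reverse facts (R, v, x), kept sorted."""
    return sorted((r, v, x) for x, r, v in facts)


def find_objects(reverse, r, low, high=None):
    """Answer ?Rv (exact value) or ?R[low, high] (range) with one ordered scan."""
    if high is None:
        high = low
    start = bisect.bisect_left(reverse, (r, low))
    result = []
    for relation, value, x in reverse[start:]:
        if relation != r or value > high:
            break
        result.append(x)
    return result


facts = [
    ("id1", "V1", 5), ("id1", "V2", 128),
    ("id2", "V1", 6), ("id2", "V2", 120),
]
reverse = build_reverse(facts)
print(find_objects(reverse, "V1", 6))         # ?Rv
print(find_objects(reverse, "V2", 100, 125))  # ?R[v1, v2]
print(len(facts) + len(reverse), "stored facts for", len(facts), "values")
```

If the application never asks such queries for a relation, the corresponding reverse facts only double the storage and can be dropped.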
Reverse facts example
• Application collects sensor data

Semantic schema:
  Measurement
    TakenAt: DateTime, total
    V1: 1..1024, total
    V2: 1..1024, total
    V3: 1..1024, total
    V4: 1..1024, total

Original implementation, direct facts:
  id1 TakenAt 04/1/1
  id1 V1 5
  id1 V2 128
  id1 V3 100
  id1 V4 120
  id2 TakenAt 04/1/2
  id2 V1 6
  id2 V2 120
  id2 V3 101
  id2 V4 1

Original implementation, reverse facts:
  TakenAt 04/1/1 id1
  TakenAt 04/1/2 id2
  V1 5 id1
  V1 6 id2
  V2 120 id2
  V2 128 id1
  V3 100 id1
  V3 101 id2
  V4 1 id2
  V4 120 id1

Record representation, reverse facts removed:
  id1 R 04/1/1 5 128 100 120
  id2 R 04/1/2 6 120 101 1
  TakenAt 04/1/1 id1
  TakenAt 04/1/2 id2
Ordered Categories • Order based on object IDs is not useful for the end user • Order on combination of attributes (Last Name, First Name) • Search on combination of attributes (First, Last, Middle) • Fact structure Cv1v2…vNx
Ordered category example
• Application that only searches for people by Last and First name

Classical:
  C id1
  C id2
  C id3
  C id4
  Query: get objects of category C; get objects with first name Sarah; get objects with last name Connor; intersect the three sets

Ordered:
  C Anderson Thomas id2
  C Connor John id3
  C Connor Sarah id4
  C Smith John id1
  Query: get facts with prefix “C Connor Sarah”
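The two query plans can be contrasted in a short Python sketch built on the slide's example data. The sets and the sorted list stand in for the engine's index structures, and the function names are hypothetical.

```python
import bisect

# Classical: category membership plus separate per-attribute object sets.
category_c = {"id1", "id2", "id3", "id4"}
first_name = {"John": {"id1", "id3"}, "Thomas": {"id2"}, "Sarah": {"id4"}}
last_name = {"Smith": {"id1"}, "Anderson": {"id2"}, "Connor": {"id3", "id4"}}


def classical_search(last, first):
    """Three lookups followed by an intersection of three sets."""
    return category_c & first_name.get(first, set()) & last_name.get(last, set())


# Ordered category: facts C v1 v2 x, kept sorted by (Last Name, First Name).
ordered_c = sorted([
    ("C", "Smith", "John", "id1"),
    ("C", "Anderson", "Thomas", "id2"),
    ("C", "Connor", "John", "id3"),
    ("C", "Connor", "Sarah", "id4"),
])


def ordered_search(prefix):
    """One seek to the prefix, then a sequential scan while it matches."""
    start = bisect.bisect_left(ordered_c, prefix)
    result = []
    for fact in ordered_c[start:]:
        if fact[:len(prefix)] != prefix:
            break
        result.append(fact[-1])
    return result


print(classical_search("Connor", "Sarah"))
print(ordered_search(("C", "Connor", "Sarah")))
print(ordered_search(("C", "Connor")))  # everyone named Connor, ordered by first name
```

The ordered form answers the pre-defined query with a single range scan and returns results already sorted by name, at the price of serving only queries that start with the leading attributes.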
Ordered relations
• Ordering both on range and on domain
• Fact structure is aRv1v2…vNb
• Example query: give me all customers of this store ordered by (Last Name, First Name)
[Diagram: category C1 linked by relation R to category C2, whose objects are Anderson Thomas, Connor John, Connor Sarah]
Facts:
  id1 R Anderson Thomas id2
  id1 R Connor John id3
  id1 R Connor Sarah id4
XML Semantic Definition Language • Language that defines the database schema in a readable format • Language to export/import schema and data from/to the database • A semantic database needs to be interoperable with other software on the market • The proposed XSDL language is an XML representation of a semantic database
XSDL – Representing schema • Represents schema in an interoperable standard format • Preserves all information from schema • Future-proof by using standard XML features – containment and references • Microsoft applied for US patent on similar representation in 2003
Schema representation example

Person
  First Name: string, total
  Last Name: string, total
  Birth Year: 1900..2100

<Database name="University">
  <Schema>
    <Category Name="Person" Type="Abstract">
      <Attribute Name="First Name" Range="String" IsTotal="True" />
      <Attribute Name="Last Name" Range="String" IsTotal="True" />
      <Attribute Name="Birth Year" Range="BirthYears" />
    </Category>
    <Category Name="BirthYears" Type="Concrete">
      <Integer LowerBound="1900" UpperBound="2100" />
    </Category>
  </Schema>
</Database>
Containment allows extensions

Person
  First Name: string, total
  Last Name: string, total
  Birth Year: 1900..2100

Personal Data
  First Name: string, total
  Last Name: string, total
  Birth Year: 1900..2100

Old tool renames category

  .cA STUDENT
  .aT STUDENT first-name string, total
  .aT STUDENT last-name string, total
  .aT STUDENT birth-year 1900..2100
  .tC DISPLAY STUDENT 5 120 Solid

<Category Name="Person" Type="Abstract">
  <Attribute Name="First Name" Range="String" IsTotal="True" />
  <Attribute Name="Last Name" Range="String" IsTotal="True" />
  <Attribute Name="Birth Year" Range="BirthYears" />
  <DisplayTool X="5" Y="120" Border="Solid" />
</Category>
XSDL – representing data • XSDL represents data in interoperable standard format – XML • XSDL fully preserves logical structure of the database • Supports several options for convenience and application compatibility • Custom schemas may be supported by applying XSLT in XML domain • Future work may analyze native support for custom schemas
Data representation example

Disk
  Title: string, total
  Author: string

Track
  Number: 1..99, total
  Title: string
  Author: string
  Of (m:1) Disk

<Track>
  <Object ID="20">
    <Number>1</Number>
    <Title>Song about a Crow</Title>
    <Of>1</Of>
  </Object>
  <Object ID="21">
    <Number>2</Number>
    <Title>Habla español</Title>
    <Of>1</Of>
  </Object>
</Track>
<Disk>
  <Object ID="1">
    <Title>Live in Seattle</Title>
    <Author>Anna Stovall</Author>
  </Object>
</Disk>
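As a hedged illustration of the interoperability argument, the sketch below reads such data with Python's standard `xml.etree.ElementTree` and follows the `Of` references from tracks to disks. It assumes well-formed XML (each `<Number>` closed by `</Number>`) and wraps the two fragments in a `<Data>` root element, which is an assumption made here because XML requires a single root; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <Data> wrapper is an assumption; the element names follow the slide.
XSDL_DATA = """
<Data>
  <Track>
    <Object ID="20"><Number>1</Number><Title>Song about a Crow</Title><Of>1</Of></Object>
    <Object ID="21"><Number>2</Number><Title>Habla español</Title><Of>1</Of></Object>
  </Track>
  <Disk>
    <Object ID="1"><Title>Live in Seattle</Title><Author>Anna Stovall</Author></Object>
  </Disk>
</Data>
"""


def load_tracks(xml_text):
    """Return (track number, track title, disk title) for every track."""
    root = ET.fromstring(xml_text.strip())
    # Index disks by object ID so the m:1 relation Of can be resolved.
    disk_titles = {obj.get("ID"): obj.findtext("Title") for obj in root.find("Disk")}
    tracks = []
    for obj in root.find("Track"):
        number = int(obj.findtext("Number"))
        tracks.append((number, obj.findtext("Title"), disk_titles[obj.findtext("Of")]))
    return tracks


for number, title, disk in load_tracks(XSDL_DATA):
    print(number, title, "on", disk)
```

Any XML-aware tool can consume the unloaded data this way without knowing anything about the semantic database engine.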
Demo – interoperability Consuming unloaded data in applications
Main contributions • Novel engine design that allows tuning performance • New data structures and algorithms • Way of combining existing data structures and new ones • XML representation of Semantic Database facilitates interoperability • 4 publications in scientific conferences
Florida International University School of Computer Science High Performance Database Research Center