Indexing Data Relationships

Michael J. Franklin
University of California, Berkeley & RightOrder Inc.

Overview
• Data relationships can be complex.
  • Hierarchical views: XML, LDAP, …
  • Semistructure & dynamic schema
• Approach: Encode paths as tagged strings
  • “raw” paths encode structure
  • “refined” paths accelerate lookups
• Index strings in a highly-compact structure.
• Live on top of, next to or inside DBMS.
• Benefits
  • Performance, Scalability + Adaptivity
  • Leverages mature DBMS technology

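Raw paths w/Designators
[Figure: Invoice as a tree, with a designator letter on each element]
• Designators: a = Invoice, b = Buyer, p = Seller, c = Itemlist, d = Name, g = Address, e = Item
• Buyer: Name “ABC Corp.”, Address “123 ABC Way”
• Seller: Name “Goods Inc.”, Address “17 Main St.”
• Itemlist: Items widget, thingy, jobber
• Resulting raw path keys:
  • abdABC Corp.
  • abg123 ABC Way
  • apdGoods Inc.
  • apg17 Main St.
  • acewidget
  • acethingy
  • acejobber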
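The encoding on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list representation of the invoice and the names `DESIGNATORS` and `raw_paths` are hypothetical, while the designator letters and values are the ones shown on the slide.

```python
# Designator letters as shown on the slide; the table name is hypothetical.
DESIGNATORS = {
    "Invoice": "a", "Buyer": "b", "Seller": "p", "Itemlist": "c",
    "Name": "d", "Address": "g", "Item": "e",
}

def raw_paths(tag, node, prefix=""):
    """Yield one designated string per leaf value found under `node`."""
    prefix += DESIGNATORS[tag]
    if isinstance(node, str):
        # Leaf: the key is the designator path followed by the value.
        yield prefix + node
    else:
        # Interior element: recurse into each (tag, child) pair in order.
        for child_tag, child in node:
            yield from raw_paths(child_tag, child, prefix)

# The invoice from the slide, as an assumed nested (tag, children) structure.
invoice = [
    ("Buyer", [("Name", "ABC Corp."), ("Address", "123 ABC Way")]),
    ("Seller", [("Name", "Goods Inc."), ("Address", "17 Main St.")]),
    ("Itemlist", [("Item", "widget"), ("Item", "thingy"), ("Item", "jobber")]),
]

keys = list(raw_paths("Invoice", invoice))
for key in keys:
    print(key)
```

Each printed key corresponds to one root-to-leaf path of the tree, so the structure of the document is carried entirely by the key prefix.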

Refined paths
• Optimize specific access paths
  • tXY: “Find invoices where X sold to Y”
  • fXYZ: “Find invoices where X bought Y and Z”
  • hXYZ: “Find invoices where a buyer bought X, Y and Z”
• Example keys:
  • tABC Corp. Goods Inc.
  • tXYZ Corp. Acme Inc.
  • fABC Corp. jobber widget
  • fXYZ Corp. drill hammer
  • hjobber thingy widget
  • hdrill hammer nail
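A refined path turns a whole query pattern into a single key, so answering the query becomes a prefix lookup. The sketch below is a stand-in that uses a sorted list and `bisect` rather than the Index Fabric itself. The `|` separator, the helper names `refined_key` and `prefix_lookup`, and the use of U+FFFF as an upper bound are assumptions; the slide separates values with spaces.

```python
import bisect

SEP = "|"  # assumed separator; keeps value boundaries unambiguous

def refined_key(designator, *values):
    """Build one refined-path key: designator followed by the joined values."""
    return designator + SEP.join(values)

def prefix_lookup(sorted_keys, prefix):
    """Return all keys starting with `prefix` (keys must not contain U+FFFF)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

# Example keys from the slide, rebuilt with the assumed separator.
keys = sorted([
    refined_key("t", "ABC Corp.", "Goods Inc."),
    refined_key("t", "XYZ Corp.", "Acme Inc."),
    refined_key("f", "ABC Corp.", "jobber", "widget"),
    refined_key("f", "XYZ Corp.", "drill", "hammer"),
    refined_key("h", "jobber", "thingy", "widget"),
    refined_key("h", "drill", "hammer", "nail"),
])

# All "f" keys whose first value is ABC Corp.
abc_purchases = prefix_lookup(keys, refined_key("f", "ABC Corp.") + SEP)
# All keys of the "t" refined path.
all_t = prefix_lookup(keys, "t")
print(abc_purchases, all_t)
```

Because each refined path has its own designator, keys for different query patterns can share one index without colliding.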

Index Fabric
• An index structure for long strings.
  • Provides fast lookups
  • Handles long strings
  • Ideal substrate for designated keys
• Based on Patricia tries
  • Highly compressed string representation
  • Cost in index independent of string length
  • But, need to balance.

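Patricia tries
[Figure: Patricia trie over the keys grass, corn, cow, greenbeans, greentea. The root tests position 0 (g, c). Under g, a node tests position 2 (a leads to grass, e leads on), and below e a node tests position 5 (b leads to greenbeans, t leads to greentea). Under c, a node tests position 2 (r leads to corn, w leads to cow).]
• Indexes first point of difference between keys
D. R. Morrison. “PATRICIA – Practical algorithm to retrieve information coded in alphanumeric.” J. ACM, 15 (1968) pp. 514-534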
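The idea of indexing only the first point of difference can be sketched as follows. This is a minimal in-memory, character-level illustration, not the Index Fabric implementation: the class and function names are hypothetical, and the `"\0"` terminator is an assumption added so that one key may be a prefix of another.

```python
END = "\0"  # assumed terminator; keys must not contain it

def char_at(key, pos):
    """Character of `key` at `pos`, or the terminator past the end."""
    return key[pos] if pos < len(key) else END

class Node:
    """Internal node: tests a single character position of the search key."""
    def __init__(self, pos):
        self.pos = pos
        self.children = {}  # character -> Node, or a leaf (the full key as str)

class PatriciaTrie:
    def __init__(self):
        self.root = None

    def _closest_leaf(self, key):
        """Follow the key's characters to some leaf, guessing when a branch is missing."""
        node = self.root
        while isinstance(node, Node):
            child = node.children.get(char_at(key, node.pos))
            node = child if child is not None else next(iter(node.children.values()))
        return node

    def insert(self, key):
        if self.root is None:
            self.root = key
            return
        other = self._closest_leaf(key)
        if other == key:
            return
        d = 0  # first position where key and its closest stored key differ
        while char_at(key, d) == char_at(other, d):
            d += 1
        parent, node = None, self.root
        while isinstance(node, Node) and node.pos < d:
            parent, node = node, node.children[char_at(key, node.pos)]
        if isinstance(node, Node) and node.pos == d:
            node.children[char_at(key, d)] = key  # new branch on an existing node
            return
        branch = Node(d)  # new node that tests only the differing position
        branch.children[char_at(other, d)] = node
        branch.children[char_at(key, d)] = key
        if parent is None:
            self.root = branch
        else:
            parent.children[char_at(key, parent.pos)] = branch

    def __contains__(self, key):
        node = self.root
        while isinstance(node, Node):
            node = node.children.get(char_at(key, node.pos))
        return node == key  # final full comparison against the stored key

trie = PatriciaTrie()
for word in ["grass", "corn", "cow", "greenbeans", "greentea"]:
    trie.insert(word)
print("greentea" in trie, "cash" in trie)
```

A lookup inspects only the positions stored in the nodes on its path, so a final comparison against the stored key is needed to confirm a match.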

Multiple Hierarchical Views
[Figure: two views of the same relationship, one with a (cow) above b (corn) and one with b (corn) above a (cow)]
• Can store multiple permutations of relationships
  • Find animals and the plants they eat
  • Find plants and the animals that eat them
• Represent as a new set of keys
• Store data once using “permutation records”
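One way to read this slide is that each relationship is stored once and reached through one key per ordering. The sketch below illustrates that reading only; the record layout, the key format, the sample data, and the names `permutation_keys` and `find` are all assumptions, and a plain dictionary stands in for the index.

```python
# Made-up sample data: each record is stored exactly once, under a record id.
records = {
    1: {"animal": "cow", "plant": "corn"},
    2: {"animal": "cat", "plant": "wheat"},
}

def permutation_keys(rec):
    """Return one key per ordering of the relationship (a = animal, b = plant)."""
    a = "a " + rec["animal"]
    b = "b " + rec["plant"]
    return [a + " " + b, b + " " + a]

# Both orderings point at the same record id rather than at a copy of the data.
index = {}
for rid, rec in records.items():
    for key in permutation_keys(rec):
        index[key] = rid

def find(prefix):
    """Record ids of all keys starting with `prefix`, in key order."""
    return [index[k] for k in sorted(index) if k.startswith(prefix)]

print(find("a cow "))   # animals first: which plants does the cow eat?
print(find("b corn "))  # plants first: which animals eat corn?
```

Both lookups resolve to the same record id, which is the point of keeping the data in one place and permuting only the keys.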

Example
[Figure, two animation frames: a trie over keys built from a-designated values (cow, cat) and b-designated values (corn, wheat); node labels 0, 1, 2, 4, 5, 6]

Balancing Patricia tries
[Figure: Patricia trie over the keys grass, corn, cow, greenbeans, greentea]
• Step 1: divide trie into blocks
• Step 2: build another layer
[Figure: Layer 1 (edges g, e) built above Layer 0 (the blocked trie)]

Balancing Patricia tries
• Search for “cash”
[Figure, three animation frames: the search for “cash” proceeds from Layer 1 down to Layer 0 of the trie over grass, corn, cow, greenbeans, greentea; the label greenbeans appears next to Layer 1]

Balancing Patricia tries
[Figure: a search descending through stacked layers to the data; labels: Search, Layer 3, Layer 2, Layer 1, Layer 0, Data]

Performance
• Number of layers is small
  • Fixed (small) space per key
  • High branching factor per block
  • Bushy, shallow tree
• Example:
  • 8 KB blocks
  • 32-bit pointers + 2 bytes for keys/structure
  • = 1000+ pointers per block
  • = 3 layers for 1 billion pointers to data (1000³)
• Upper layers are tiny (10 megabytes), in RAM
  • Only layer 0 on disk
  • Usually one index I/O per key lookup
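The arithmetic in the example can be checked with a short calculation. It is a back-of-the-envelope sketch that assumes every block is completely full and that each entry costs exactly 4 bytes of pointer plus 2 bytes of key/structure; the function name `fabric_layers` is hypothetical.

```python
def fabric_layers(num_keys, block_bytes=8192, pointer_bytes=4, overhead_bytes=2):
    """Return (entries per block, number of blocks in each layer, bottom layer first)."""
    fanout = block_bytes // (pointer_bytes + overhead_bytes)
    blocks_per_layer = []
    entries = num_keys
    while True:
        blocks = -(-entries // fanout)  # ceiling division
        blocks_per_layer.append(blocks)
        if blocks == 1:
            break
        entries = blocks  # the next layer up holds one pointer per block below
    return fanout, blocks_per_layer

fanout, layers = fabric_layers(1_000_000_000)
upper_bytes = sum(layers[1:]) * 8192  # everything above layer 0

print(fanout)       # entries per 8 KB block
print(len(layers))  # number of layers
print(upper_bytes)  # bytes needed to keep the upper layers in RAM
```

Under these assumptions the upper layers fit within the roughly 10 megabytes quoted on the slide, which is why only layer 0 needs to be read from disk.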

Find publications by co-authors
[Chart: 10,000 queries, comparing Index Fabric Refined Paths, Index Fabric Raw Paths, RDBMS STORED, and RDBMS Edge mapping; ratio annotations 2.5 : 1, 5 : 1, 25 : 1]

Find publications by co-authors
[Chart: 10,000 queries, comparing Index Fabric Refined Paths, Index Fabric Raw Paths, RDBMS STORED, and RDBMS Edge mapping; ratio annotations 2.1 : 1, 4 : 1, 20 : 1]

Conclusion
• Index arbitrary relationships
  • Encode as designated strings
  • Relationships and structures can be complex
  • Index many data access paths
  • No need for DTD or pre-defined schema
• Index Fabric
  • Special data structure for long keys
  • High performance key lookups
  • Supports designator encoding

For more information
• technology@rightorder.com
• www.rightorder.com
