High-dimensional indexing techniques

Kesheng John Wu, Ekow Otoo, Arie Shoshani

The big picture
[Diagram labels: request interpreter, data mining, large dataset, distributed storage, file, MPI-IO, grid]

The big picture
[Diagram: Logical request → Request interpreter (LBNL) → Qualified objects → Request planning/execution (PPDG) → Sub-task schedule → Execution services (MPI-IO, … grid)]

Problem statement
• Main objective: map a logical request to qualified objects
• A logical request:
  • 20001015<=eventTime & 200<energy<300 …
• Objects:
  • set of object IDs;
  • set of files containing the objects;
  • offsets within the files, …
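To make the mapping concrete, here is a minimal sketch that answers the example request by brute force and returns the set of object IDs. The column names, the sample values, and the function `qualified_objects` are hypothetical; an index exists precisely to avoid scanning every value like this.

```python
# Hypothetical column-wise data: the position in each list is the object ID.
event_time = [20001014, 20001015, 20001016, 20001020]
energy = [250.0, 199.0, 275.5, 300.0]


def qualified_objects(event_time, energy):
    """Return IDs of objects with 20001015 <= eventTime and 200 < energy < 300."""
    return [
        oid
        for oid, (t, e) in enumerate(zip(event_time, energy))
        if 20001015 <= t and 200 < e < 300
    ]


ids = qualified_objects(event_time, energy)
print(ids)
```

Only the IDs are produced here; translating them into files and offsets would be a separate lookup step.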

Requirements & Status
• General requirements
  • Users request data in terms of their scientific domain, not file names or offsets in files
  • Each object may be described by hundreds of attributes
  • Each request is expressed as range predicates on a handful of attributes (partial range query)
• Status
  • Initially motivated by a HENP experiment: STAR
  • Software originally developed under GC and currently in use at BNL

Large high-dimensional datasets
• Number of attributes / columns: 200 – 500
• Number of objects / events: 10^8 – 10^9
• File containing one attribute: 400MB – 4GB
• Total size over all attributes: 80GB – 2TB
[Figure: table with one row per object ID (0, 1, 2, … up to 10^8 – 10^9) and one column per attribute (A1, A2, A3, A4, …), labeled "Curse of dimensionality"]
• Goal: develop an index that
  • reads as little as possible from disk
  • minimizes computation in memory
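The sizes above are consistent with one fixed-width value per object per attribute. The sketch below reproduces them, assuming 4 bytes per value; that width is an assumption inferred from the numbers, not something the slides state.

```python
BYTES_PER_VALUE = 4  # assumption: each attribute value is stored in 4 bytes


def sizes(num_objects, num_attributes):
    """Return (bytes per single-attribute file, total bytes over all attributes)."""
    per_attribute = num_objects * BYTES_PER_VALUE
    return per_attribute, per_attribute * num_attributes


small = sizes(10**8, 200)  # low end of both ranges
large = sizes(10**9, 500)  # high end of both ranges
print(small, large)
```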

Well-known indexing methods
• B-tree based indices
  • One or a small number of attributes
  • Index size may be up to 3 times the data size
• R-tree based indices
  • Small number of attributes, say, < 10
• UB-tree
  • Uses space-filling curves to map high-dimensional data to one dimension
  • One range query is mapped into many queries on the B-tree based index
• Even sequential scan
  • Better than B-tree and R-tree if dimension > 10
  • But simply reading all the data and comparing takes too long
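The UB-tree point can be illustrated with a small sketch of a Z-order (bit-interleaving) space-filling curve. The helper names `z_order` and `key_runs` are hypothetical. The example shows a 2 x 2 query box turning into four separate one-dimensional key ranges.

```python
def z_order(coords, bits):
    """Interleave the low `bits` bits of each coordinate into one integer key."""
    key = 0
    for bit in range(bits - 1, -1, -1):  # most significant bit first
        for c in coords:
            key = (key << 1) | ((c >> bit) & 1)
    return key


def key_runs(keys):
    """Count the maximal runs of consecutive integers in a non-empty key set."""
    ordered = sorted(keys)
    runs = 1
    for a, b in zip(ordered, ordered[1:]):
        if b != a + 1:
            runs += 1
    return runs


# Query box 1 <= x <= 2, 1 <= y <= 2 on a 4 x 4 grid (2 bits per dimension).
box_keys = [z_order((x, y), 2) for x in (1, 2) for y in (1, 2)]
print(sorted(box_keys), key_runs(box_keys))
```

Each run corresponds to one range lookup on the underlying one-dimensional index, which is why a single box query can fan out into many queries.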

Another class of indexes: Bitmap index

Data value | b0 b1 b2 b3 b4 b5
           | =0 =1 =2 =3 =4 =5
0          | 1  0  0  0  0  0
1          | 0  1  0  0  0  0
5          | 0  0  0  0  0  1
3          | 0  0  0  1  0  0
1          | 0  1  0  0  0  0
2          | 0  0  1  0  0  0
0          | 1  0  0  0  0  0
4          | 0  0  0  0  1  0
1          | 0  1  0  0  0  0

• Example queries on the attribute, say, A
  • One-sided range query: A < 2
    • b0 OR b1
  • Two-sided range query: 2<A<5
    • b3 OR b4
• Basic steps of building a bitmap index
  • Binning
  • Encoding
  • Compressing
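The example on this slide can be reproduced with a minimal sketch. Each bitmap is held in a Python integer, with bit r set when row r has the corresponding value; the helper names `build_bitmaps` and `rows` are hypothetical, and no binning or compression is applied.

```python
values = [0, 1, 5, 3, 1, 2, 0, 4, 1]  # the data values from the slide


def build_bitmaps(values, cardinality):
    """Equality-encoded index: bitmaps[v] has bit r set iff values[r] == v."""
    bitmaps = [0] * cardinality
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row
    return bitmaps


def rows(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


b = build_bitmaps(values, 6)
one_sided = b[0] | b[1]  # A < 2
two_sided = b[3] | b[4]  # 2 < A < 5
print(rows(one_sided), rows(two_sided))
```

Both queries are answered with a single bitwise OR, without touching the data values themselves.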

How many bins?
[Figure: binned Range(x) and Range(y) axes with the edge bins marked]
• More bins → fewer objects in edge bins
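The effect of the bin count can be sketched in one dimension. The code assumes equal-width bins and a query of the form low <= v < high, and counts the objects in bins that the query covers only partly. These are the objects a bitmap alone cannot resolve. The function names and the sample data are hypothetical.

```python
def bin_of(v, vmin, vmax, num_bins):
    """Index of the equal-width bin containing v (the last bin includes vmax)."""
    width = (vmax - vmin) / num_bins
    return min(int((v - vmin) / width), num_bins - 1)


def count_edge_objects(values, low, high, vmin, vmax, num_bins):
    """Count objects in bins that overlap [low, high) without lying inside it."""
    width = (vmax - vmin) / num_bins
    edge = 0
    for v in values:
        i = bin_of(v, vmin, vmax, num_bins)
        start, end = vmin + i * width, vmin + (i + 1) * width
        overlaps = start < high and end > low
        inside = low <= start and end <= high
        if overlaps and not inside:
            edge += 1
    return edge


values = list(range(100))  # hypothetical data: 100 objects, values 0..99
for num_bins in (4, 20, 100):
    print(num_bins, count_edge_objects(values, 23, 67, 0, 100, num_bins))
```

More bins shrink the edge bins, at the cost of more bitmaps to store and combine.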

How to encode
[Figure: equality encoding, range encoding, and interval encoding illustrated over bins 0 – 6]
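The three encodings can be sketched for the seven bins 0 to 6. The definitions below are assumptions taken from the common descriptions of these schemes, not from the slides. Equality encoding keeps one bitmap per bin. Range encoding keeps cumulative bitmaps for "bin <= i". Interval encoding keeps about half as many bitmaps, each covering a window of m + 1 consecutive bins with m = floor(C/2) - 1.

```python
from math import ceil


def equality_encode(bins, c):
    """E[i][r] = 1 iff row r falls in bin i (c bitmaps)."""
    return [[int(b == i) for b in bins] for i in range(c)]


def range_encode(bins, c):
    """R[i][r] = 1 iff row r falls in a bin <= i (the all-ones bitmap is omitted)."""
    return [[int(b <= i) for b in bins] for i in range(c - 1)]


def interval_encode(bins, c):
    """I[j][r] = 1 iff j <= bin <= j + m, with m = c // 2 - 1 (ceil(c / 2) bitmaps)."""
    m = c // 2 - 1
    return [[int(j <= b <= j + m) for b in bins] for j in range(ceil(c / 2))]


bins = list(range(7))  # one hypothetical row per bin, for illustration
E = equality_encode(bins, 7)
R = range_encode(bins, 7)
I = interval_encode(bins, 7)

# One-sided query "bin <= 2": one bitmap under range encoding,
# an OR of three bitmaps under equality encoding.
or_of_equality = [a | b | c for a, b, c in zip(E[0], E[1], E[2])]
print(R[2] == or_of_equality, len(E), len(R), len(I))
```

With these definitions, the two-sided query "1 <= bin <= 3" is answered by the single bitmap `I[1]`.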

Advantages of bitmap indices
• Fast operations
  • The most common operations are the bitwise logical operations
  • They are well supported by hardware
• Easy to compress, potentially small index size
• Each individual bitmap is small and frequently used ones can be cached in memory
• Efficient for read-mostly data: data produced from scientific experiments can be appended in large groups
• Available in most major commercial DBMS

Why our own bitmap index
• Early tests showed that we can do an order of magnitude better than ORACLE (using equality encoding)
• Vertical partition: allows one to read only the data of the attributes involved in a query
• New compression method
  • Best known: Byte-aligned Bitmap Code (BBC)
  • Developed 2 word-aligned schemes: WAH, WBC
• Different encoding schemes under compression
  • Equality encoding: used in ORACLE and others
  • Range encoding: one-sided range queries
  • Interval encoding: two-sided range queries
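The word-aligned idea can be illustrated with a simplified sketch. It assumes the commonly described WAH layout: 32-bit words, where a literal word (top bit 0) carries 31 bitmap bits and a fill word (top bit 1) carries a fill bit plus a count of identical 31-bit groups. The function names are hypothetical, and this is not the authors' implementation.

```python
GROUP = 31                   # bitmap bits carried by one literal word
MAX_COUNT = (1 << 30) - 1    # largest run length a fill word can hold


def wah_encode(bits):
    """Compress a list of 0/1 values into 32-bit words (simplified WAH-style)."""
    padded = bits + [0] * (-len(bits) % GROUP)  # pad to a whole number of groups
    words = []
    for start in range(0, len(padded), GROUP):
        value = int("".join(map(str, padded[start:start + GROUP])), 2)
        if value == 0 or value == (1 << GROUP) - 1:
            fill_bit = 1 if value else 0
            prev = words[-1] if words else None
            same_fill = (
                prev is not None
                and prev >> 31 == 1
                and (prev >> 30) & 1 == fill_bit
                and (prev & MAX_COUNT) < MAX_COUNT
            )
            if same_fill:
                words[-1] = prev + 1  # extend the previous fill by one group
            else:
                words.append((1 << 31) | (fill_bit << 30) | 1)
        else:
            words.append(value)  # literal word, top bit is 0
    return words


def wah_decode(words, nbits):
    """Expand the words back into the first nbits bits."""
    bits = []
    for w in words:
        if w >> 31:  # fill word
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_COUNT)))
        else:        # literal word
            bits.extend(int(ch) for ch in format(w, "031b"))
    return bits[:nbits]


bitmap = [1] + [0] * 200 + [1, 1, 0, 1] + [1] * 100  # 305 bits, 10 groups
encoded = wah_encode(bitmap)
print(len(encoded), wah_decode(encoded, len(bitmap)) == bitmap)
```

Because every word is either one whole literal group or a run of whole groups, logical operations can proceed word by word without bit-level unpacking.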

Information about the test machines
• Hardware and system
  • Sun Enterprise 450 (UltraSPARC II, 400 MHz)
  • 4GB RAM
  • VERITAS volume manager (striped disk)
• Real application data from STAR
  • More than 2 million objects
  • Picked 12 attributes with varying distributions
• Measures:
  • Logical operation time without IO
  • Logical operation time with IO
  • Query processing time

Logical operation time (no IO)

Logical operation time (including IO)

New compression schemes
• Overall, use about 50% more space than BBC
• On average, 12 times faster than BBC
• Faster than the uncompressed scheme in more cases:
  • New schemes are faster than the uncompressed scheme when the compression ratio is less than 0.3
  • BBC is faster than the uncompressed scheme when the compression ratio is less than 0.03

Sizes of bitmap indices
• Conclusion:
  • Equality encoding is the most space efficient
  • Compression gain is at least a factor of 2.5

Average query processing time
• Conclusion:
  • Interval and range encoding are the best
  • For these cases, there is practically no penalty for compression

Interval encoding is better overall
• Sequential scan time: 0.557 sec

Summary
• Better compression scheme
  • 50% more space, but 10-12 times faster
• Among the different encoding schemes
  • Interval encoding is better than equality encoding and range encoding
• The number of bins determines the bitmap index size and the operation efficiency. For example:
  • 10% of data size => 3 x speed of sequential scan
  • 20% of data size => 6 x speed of sequential scan
• Equality encoding is currently used in the STAR experiment. The next version will include interval encoding.

Future work
• Support NULL values and categorical values
• On-line update: add new data and update the index without interrupting request processing
• Recovery mechanism for robustness
• Potential new applications: climate, astrophysics, biology
• Study different non-uniform binning strategies
• Integrate with a conventional database system: to better handle metadata, to provide a more versatile front-end
