
IO-Efficient Faceted Search

Talk at Dagstuhl Seminar “Data Structures”, February 20th, 2008. Holger Bast, Max-Planck-Institute for Informatics, Saarbrücken. Joint work with: Omid Amini, Hubert Chan, Andreas Karrenbauer.


Presentation Transcript


  1. IO-Efficient Faceted Search
     • Talk at Dagstuhl Seminar “Data Structures”, February 20th, 2008
     • Holger Bast, Max-Planck-Institute for Informatics, Saarbrücken
     • joint work with: Omid Amini, Hubert Chan, Andreas Karrenbauer

  2. Faceted Search
     • Data
       • Set of n objects
         • for example, scientific papers
       • Each object has a number of labels; labels are organized into categories (the facets)
         • for example, year:1990, author:Kurt Mehlhorn, author:Robert Tarjan, venue:JACM
     • Query
       • Given: set I ⊆ {1,…,n} of object ids (matching docs)
       • Compute: multi-set of labels of these objects (all their labels)
     • Objective: space-efficient and IO-efficient
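The query on this slide can be illustrated with a minimal in-memory sketch. The toy data and the names `labels_of` and `facet_counts` are assumptions made for illustration only; the sketch ignores the space and IO objectives and just shows what "multi-set of labels" means.

```python
from collections import Counter

# Hypothetical toy data: object id -> list of "facet:value" labels.
labels_of = {
    1: ["year:1990", "author:Kurt Mehlhorn", "venue:JACM"],
    2: ["year:1990", "author:Robert Tarjan", "venue:JACM"],
    3: ["year:1991", "author:Kurt Mehlhorn"],
}

def facet_counts(matching_ids, labels_of):
    """Return the multi-set of labels of the matching objects as a Counter."""
    counts = Counter()
    for obj_id in matching_ids:
        # Add every label of this object; repeated labels increase the count.
        counts.update(labels_of[obj_id])
    return counts

# Query I = {1, 2}: both matching objects carry year:1990 and venue:JACM.
result = facet_counts({1, 2}, labels_of)
```

In a faceted-search interface these counts would typically be shown next to each facet value, so that the user can refine the query.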

  3. IO-Efficiency
     • RAM Model
       • count the number of operations (operation = arithmetic or access to a single memory cell)
       • important ingredient of time complexity analysis, but …
       • by itself completely inadequate for running time prediction on modern computers, no matter whether the data is in cache or in main memory (modern = for about 20 years)
     • Disk: 100 disk seeks take about half a second
       • in that time one can read 200 MB of contiguous data (if stored compressed)
     • Main memory: 100 non-local accesses ≈ reading a 10 KB data block

  4. IO-Efficiency
     • RAM Model: count the number of operations (see slide 3)
     • IO / External Memory Model
       • count the number of block accesses to the data (one block access = read / write B consecutive bytes)
       • ignore everything else
       • good predictor if computation is negligible

  5. Abstract Problem Formulation
     • Precomputation:
       • given n elements a_1,…,a_n
       • organize in array of size N ≥ n
     • Query:
       • given I = {i_1,…,i_m} ⊆ {1,…,n}
       • return elements a_{i_1},…,a_{i_m} using as few IOs as possible
     • Extreme solutions:
       • space: n, #IOs: min{n / B, |I|} (optimal space)
       • space: B ∙ (n choose B), #IOs: |I| / B (optimal #IOs)
     • Example (figure): n = 8, N = 24, B = 4, I = {1, 6, 8}; get a_1, a_6, a_8 with 1 IO
     • How much space is needed for which IO-efficiency?
     • Called an indexability problem in: Hellerstein et al., PODS’97 / JACM’02
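To make the space-optimal extreme concrete, here is a small sketch that counts IOs when each element is stored exactly once, in order, in aligned blocks of B cells. The aligned-block layout and the function name `ios_trivial_layout` are assumptions for illustration; the slide itself only states the bound min{n / B, |I|}.

```python
def ios_trivial_layout(query_ids, block_size):
    """IOs needed when a_1..a_n are stored once, in order, in aligned blocks.

    Assumption: element a_i lives in block (i - 1) // block_size, and reading
    one block costs one IO, so the cost is the number of distinct blocks hit.
    """
    return len({(i - 1) // block_size for i in query_ids})

# Parameters from the slide's example: n = 8, B = 4, I = {1, 6, 8}.
n, B = 8, 4
I = {1, 6, 8}
ios = ios_trivial_layout(I, B)  # a_1 is in block 0; a_6 and a_8 are in block 1
```

With this layout the example query costs 2 IOs, whereas the slide's array of size N = 24 stores elements redundantly so that the same query can be answered with a single IO.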

  6. A first simple result
     • Theorem:
       • if we want < |I| IOs for every query I
       • we need ≥ n² / (4∙B) space
     • Proof:
       • construct graph G with n vertices; edge {i, j} iff a_i and a_j can be read in one IO
         ⇒ m ≤ 2B ∙ N (more edges ⇒ more space)
       • every I = {i, j} can be read with < |I| = 2, that is, one IO, hence edge {i, j} exists
         ⇒ m ≥ (n choose 2) ≈ n² / 2 (better IO-efficiency ⇒ more edges)
     • Example (figure): n = 4, N = 8, B = 2, vertices a_1, a_2, a_3, a_4
     • The short queries alone make the problem hard

  7. Restrict to large queries
     • Theorem:
       • if we want < |I| IOs for all queries with |I| ≥ M
       • we need ≥ n² / (4∙B∙M) space
     • Proof sketch:
       1. construct graph G as before
          ⇒ m ≤ 2B ∙ N (more edges ⇒ more space)
       2. consider arbitrary I with |I| ≥ M
          ⇒ I is not independent in G (otherwise |I| IOs would be necessary)
          ⇒ the maximum independent set is of size MIS < M (IO-inefficient query ⇒ independent set)
       3. Turán’s theorem implies m ≥ (n choose 2) / MIS (no large independent sets ⇒ more edges)
     • Example (figure): n = 4, N = 8, B = 2, vertices a_1, a_2, a_3, a_4
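The slide leaves the final step of the proof sketch implicit. One way to combine the two bounds on the number of edges m, using only the inequalities stated on the slide, is the following (the rounding of n(n − 1) to n² in the last step is an approximation, not an exact inequality):

```latex
\begin{align*}
m &\le 2B \cdot N
  && \text{(step 1: edges come from the array of size } N\text{)} \\
m &\ge \binom{n}{2} \Big/ \mathrm{MIS} \;>\; \binom{n}{2} \Big/ M
  && \text{(step 3, using } \mathrm{MIS} < M \text{ from step 2)} \\
\Longrightarrow\quad
N &\ge \frac{m}{2B} \;>\; \frac{n(n-1)}{4 \cdot B \cdot M}
  \;\approx\; \frac{n^2}{4 \cdot B \cdot M}.
\end{align*}
```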

  8. Turán numbers (extremal set theory)
     • Definition: for n ≥ k ≥ r,
       T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
       • For r = 2: minimal number of edges in an n-vertex graph where all independent sets have size < k
     • Turán’s theorem:
       • lim_{n→∞} T(n, k, r) / (n choose r) exists
       • exact value of the limit unknown for any k > r ≥ 3
     • Lower bound:
       • T(n, k, r) ≥ (r / k)^(r−1) ∙ (n choose r)
     • Paul (Pál) Turán, *1910 in Budapest, †1976 in Budapest, Erdős number 1

  9. Near-Optimal IO
     • Theorem:
       • if we want ≤ c ∙ |I| / B IOs for all queries with |I| ≥ M
       • we need ≥ n^r / (4∙B∙M)^(r−1) space, where r = B/c
     • Proof sketch:
       • construct hyper-graph G with n vertices; edge {i_1,…,i_r} if the corresponding r elements can be read in one IO
         (as before: more edges ⇒ more space)
       • as before: large queries IO-efficient ⇒ no large independent sets
       • as before: no large independent sets ⇒ many edges
         (need version of Turán’s theorem for hyper-graphs)

  10. Near-Optimal IO (same theorem and proof sketch as slide 9)
     • there is hope for M linear in n

  11. Fixed set of linear-size queries
     • Fixed set of queries ℐ = {I_1,…,I_ℓ}, with index size |ℐ| = Σ_{I ∈ ℐ} |I|
       • assume each I_i is a random M-subset of {1,…,n}
     • Goal: ≤ c ∙ |I| / B IOs per query I and ≤ ε ∙ |ℐ| space
     • Algorithm, special case
       • Pick random pair i, j
       • If I_i and I_j have B elements in common, remove them from each of the sets, and add them to the precomputed array
       • Repeat until total volume left is ≤ ε’ ∙ |ℐ|
       • Cover remainder of each query separately
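The special-case algorithm can be sketched roughly as follows. This is an illustrative reading of the slide, not the authors' implementation: the function name `extract_shared_blocks`, the choice of which B common elements to take, the fixed random seed, and the `max_tries` cap (added so the loop always terminates) are all assumptions.

```python
import random

def extract_shared_blocks(queries, B, eps_prime, max_tries=10_000, seed=0):
    """Move groups of B elements shared by two queries into common blocks."""
    rng = random.Random(seed)
    remaining = [set(q) for q in queries]     # what is left of each query
    total = sum(len(q) for q in remaining)    # index size: sum of query sizes
    blocks = []                               # precomputed array, one list per block
    covered_by = [[] for _ in queries]        # block numbers each query must read
    for _ in range(max_tries):
        # Stop once the total volume left is at most eps_prime * index size.
        if sum(len(q) for q in remaining) <= eps_prime * total:
            break
        i, j = rng.sample(range(len(remaining)), 2)   # random pair of queries
        common = remaining[i] & remaining[j]
        if len(common) >= B:
            block = sorted(common)[:B]        # assumption: take the B smallest
            blocks.append(block)
            for k in (i, j):
                remaining[k] -= set(block)
                covered_by[k].append(len(blocks) - 1)
    return blocks, covered_by, remaining

# Hypothetical example: three overlapping queries, block size 4.
queries = [set(range(0, 12)), set(range(4, 16)), set(range(8, 20))]
blocks, covered_by, remaining = extract_shared_blocks(queries, B=4, eps_prime=0.5)
```

Each query is then answered by reading the blocks listed in `covered_by` plus whatever blocks are used to store its `remaining` elements separately, which corresponds to the last step on the slide.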

  12. Fixed set of linear-size queries
     • Fixed set of queries ℐ = {I_1,…,I_ℓ}, with index size |ℐ| = Σ_{I ∈ ℐ} |I|
       • assume each I_i is a random M-subset of {1,…,n}
     • Goal: ≤ c ∙ |I| / B IOs per query I and ≤ ε ∙ |ℐ| space
     • Algorithm, general case
       • Pick random k-tuple i_1,…,i_k
       • If each I_i has B / c elements from a common block of size B, remove them from each of the sets, and add them to the precomputed array
         ⇒ non-trivial to check this!
       • Repeat until total volume left is ≤ ε’ ∙ |ℐ|
       • Cover remainder of each query separately

  13. Main Theorem
     • For I from ℐ = {I_1,…,I_ℓ}, each |I_i| = M and random:
       • #IOs: ≤ c ∙ |I| / B
       • space: ≈ log(n/M) / log(c∙n/B) ∙ |ℐ|
     • For example:
       • c = 1, M = B ⇒ space: 100% ∙ |ℐ|
       • c = 1, M = n ⇒ space: 0% ∙ |ℐ|
       • c = 1, M = n/10, n/B = 1000 ⇒ space: 33% ∙ |ℐ|
       • c = 2, M = n/20, n/B = 1000 ⇒ space: 40% ∙ |ℐ|
     • Proof:
       • well …
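The example percentages on this slide can be reproduced by evaluating the space formula directly. The helper name `space_fraction` and its parameterization by the ratios n/M and n/B are assumptions for illustration; for the first two examples the slide does not give n/B, so 1000 is used here as an arbitrary value.

```python
import math

def space_fraction(c, n_over_M, n_over_B):
    """Approximate space as a fraction of the index size:
    log(n/M) / log(c * n/B)."""
    return math.log(n_over_M) / math.log(c * n_over_B)

# c = 1, M = B: n/M equals n/B, so the fraction is 1 (100%).
full = space_fraction(1, 1000, 1000)
# c = 1, M = n: n/M = 1 and log(1) = 0, so the fraction is 0 (0%).
none = space_fraction(1, 1, 1000)
# c = 1, M = n/10, n/B = 1000: log(10) / log(1000) = 1/3 (about 33%).
third = space_fraction(1, 10, 1000)
# c = 2, M = n/20, n/B = 1000: log(20) / log(2000), about 0.39 (roughly 40%).
two_fifths = space_fraction(2, 20, 1000)
```

The base of the logarithm cancels in the ratio, so the natural logarithm used here gives the same values as any other base.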
