IO-Efficient Faceted Search. Talk at Dagstuhl Seminar "Data Structures", February 20th, 2008. Holger Bast, Max-Planck-Institute for Informatics, Saarbrücken. Joint work with: Omid Amini, Hubert Chan, Andreas Karrenbauer.
Faceted Search
• Data
  • Set of n objects (for example, scientific papers)
  • Each object has a number of labels; labels are organized into categories (the facets)
  • For example: year:1990, author:Kurt Mehlhorn, author:Robert Tarjan, venue:JACM
• Query
  • Given: a set I ⊆ {1,…,n} of object ids (the matching documents)
  • Compute: the multiset of labels of these objects (all their labels)
• Objective: space-efficient and IO-efficient
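The query on this slide can be sketched in a few lines of Python. The toy data set and the function name `facet_counts` are made up for illustration; the sketch ignores space and IO efficiency and only shows what has to be computed.

```python
from collections import Counter

# Hypothetical toy data set: object id -> list of "facet:value" labels.
labels = {
    1: ["year:1990", "author:Kurt Mehlhorn", "venue:JACM"],
    2: ["year:1990", "author:Robert Tarjan", "venue:JACM"],
    3: ["year:1991", "author:Kurt Mehlhorn", "author:Robert Tarjan"],
}

def facet_counts(matching_ids, labels):
    """Return the multiset of labels of the matching objects as a Counter."""
    counts = Counter()
    for obj_id in matching_ids:
        counts.update(labels[obj_id])
    return counts

# Query I = {1, 3}: all labels of objects 1 and 3, with multiplicity.
counts = facet_counts({1, 3}, labels)
```

Grouping the resulting counts by the part before the colon gives the per-facet breakdown that a faceted search interface would display.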
IO-Efficiency
• RAM Model
  • count the number of operations (operation = arithmetic or access to a single memory cell)
  • an important ingredient of time complexity analysis, but …
  • by itself completely inadequate for predicting running time on modern computers, no matter whether the data is in cache or in main memory (modern = roughly the last 20 years)
• Disk: 100 seeks take about half a second; in that time one can read 200 MB of contiguous data (if stored compressed)
• Main memory: 100 non-local accesses take about as long as reading a 10 KB block of contiguous data
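A back-of-the-envelope reading of the figures on this slide, as a sketch. The disk numbers are taken from the slide; the main-memory line assumes that the last bullet means "100 non-local accesses cost about as much as reading 10 KB contiguously", which is an interpretation.

```python
# Figures quoted on the slide.
seeks = 100
total_seek_time_s = 0.5
contiguous_mb_in_that_time = 200

# Cost of a single seek, and how much contiguous data one seek "costs".
time_per_seek_ms = total_seek_time_s / seeks * 1000   # about 5 ms per seek
mb_per_seek = contiguous_mb_in_that_time / seeks      # 2 MB of contiguous data per seek

# Main memory (assumed reading of the slide): 100 non-local accesses ~ 10 KB.
bytes_per_nonlocal_access = 10 * 1024 / 100           # about 100 bytes per access
```

Under these figures, a layout that saves one seek is worth reading roughly 2 MB of extra contiguous data, which is why counting block accesses predicts running time better than counting operations.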
IO-Efficiency (continued)
• IO / External Memory Model
  • count the number of block accesses to the data (one block access = read or write B consecutive bytes)
  • ignore everything else
  • a good predictor of running time if computation is negligible
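In this model the cost of reading a given set of array positions is the minimum number of length-B windows that cover them. A minimal sketch, assuming a block access may start at any offset (the function name `min_block_accesses` is illustrative):

```python
def min_block_accesses(positions, B):
    """Minimum number of reads of B consecutive cells that cover all positions.

    Greedy interval cover: always start the next read at the smallest
    position that is not covered yet.
    """
    ios = 0
    covered_until = None  # last position covered by the most recent read
    for p in sorted(set(positions)):
        if covered_until is None or p > covered_until:
            ios += 1
            covered_until = p + B - 1
    return ios
```

For example, positions 0, 5 and 7 with B = 4 need two block accesses: one covering 0..3 and one covering 5..8.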
Abstract Problem Formulation
• Precomputation
  • given n elements a_1,…,a_n
  • organize them in an array of size N ≥ n (elements may be stored more than once)
• Query
  • given I = {i_1,…,i_m} ⊆ {1,…,n}
  • return the elements a_{i_1},…,a_{i_m} using as few IOs as possible
• Extreme solutions
  • space: n, #IOs: min{n / B, |I|} (optimal space)
  • space: B ∙ (n choose B), #IOs: |I| / B (optimal #IOs)
• Example (figure): n = 8, N = 24, B = 4, I = {1, 6, 8}; get a_1, a_6, a_8 with 1 IO
• In between the extremes: how much space is needed for which IO-efficiency?
• Called an indexability problem in: Hellerstein et al., PODS'97 / JACM'02
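The two extreme solutions can be compared numerically. The sketch below assumes, for the space-optimal layout, that the elements are stored in id order and read in aligned blocks of size B; the function names are illustrative.

```python
from math import ceil, comb

def space_optimal(n, B, query):
    """Each element stored once, in id order: space n, one IO per touched block."""
    touched_blocks = {(i - 1) // B for i in query}  # ids are 1-based
    return n, len(touched_blocks)

def io_optimal(n, B, query):
    """Every B-subset stored as its own block: space B * (n choose B)."""
    return B * comb(n, B), ceil(len(query) / B)

# The example from the slide: n = 8, B = 4, I = {1, 6, 8}.
print(space_optimal(8, 4, {1, 6, 8}))  # (space, IOs) for the first extreme
print(io_optimal(8, 4, {1, 6, 8}))     # (space, IOs) for the second extreme
```

For the slide's example the first extreme uses 8 cells and 2 IOs, the second 280 cells and 1 IO; the figure's layout with N = 24 sits in between and still answers this query with 1 IO.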
A first simple result
• Theorem
  • if we want < |I| IOs for every query I (with |I| ≥ 2)
  • then we need ≥ n^2 / (4∙B) space
• Proof
  • construct a graph G with n vertices and an edge {i, j} iff a_i and a_j can be read in one IO
  • G has m ≤ 2B ∙ N edges (more edges ⇒ more space)
  • every I = {i, j} can be read with < |I| = 2 IOs, that is, with one IO, hence the edge {i, j} exists, so m ≥ (n choose 2) ≈ n^2 / 2 (better IO-efficiency ⇒ more edges)
  • combining both bounds: 2B ∙ N ≥ n^2 / 2, hence N ≥ n^2 / (4∙B)
• Example (figure): n = 4, N = 8, B = 2, elements a_1, a_2, a_3, a_4
• The short queries alone make the problem hard
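The graph from the proof can be built directly from an array layout. The sketch below uses a layout of my own choosing for n = 4, N = 8, B = 2 (the slide's figure has these parameters, but its exact layout is not recoverable here) and checks the two edge bounds used in the proof.

```python
from math import comb

def co_readable_pairs(layout, B):
    """Pairs of distinct elements that lie within one window of B consecutive cells."""
    edges = set()
    for p, x in enumerate(layout):
        for q in range(p + 1, min(p + B, len(layout))):
            y = layout[q]
            if x is not None and y is not None and x != y:
                edges.add(frozenset((x, y)))
    return edges

# Hypothetical layout: element ids stored in an array of size N = 8.
layout = [1, 2, 3, 4, 1, 3, 2, 4]
n, B, N = 4, 2, len(layout)
edges = co_readable_pairs(layout, B)

all_pairs_in_one_io = len(edges) == comb(n, 2)  # every 2-element query needs 1 IO
upper_bound_holds = len(edges) <= 2 * B * N     # m <= 2B * N
```

With this layout all six pairs are adjacent somewhere in the array, so every query of size 2 is answered with a single IO, at the price of storing every element twice.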
Restrict to large queries
• Theorem
  • if we want < |I| IOs for all queries with |I| ≥ M
  • then we need ≥ n^2 / (4∙B∙M) space
• Proof sketch
  1. Construct the graph G as before: m ≤ 2B ∙ N (more edges ⇒ more space)
  2. Consider an arbitrary I with |I| ≥ M: I is not independent in G (otherwise |I| IOs would be necessary), so the maximum independent set has size MIS < M (IO-inefficient query ⇔ independent set)
  3. Turán's theorem implies m ≥ (n choose 2) / MIS (no large independent sets ⇒ more edges)
  • combining: 2B ∙ N ≥ m ≥ (n choose 2) / M ≈ n^2 / (2∙M), hence N ≥ n^2 / (4∙B∙M)
• Example (figure): n = 4, N = 8, B = 2, elements a_1, a_2, a_3, a_4
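Step 2 of the proof sketch can be checked by brute force on a tiny instance. The sketch below assumes the trivial layout [1, 2, 3, 4] with B = 2, whose co-readable graph is the path 1-2-3-4; the function name is illustrative and the search is exponential, so it is only meant for very small n.

```python
from itertools import combinations

def max_independent_set_size(n, edges):
    """Size of a largest set of vertices 1..n containing no edge (brute force)."""
    for size in range(n, 0, -1):
        for cand in combinations(range(1, n + 1), size):
            if not any(frozenset(pair) in edges for pair in combinations(cand, 2)):
                return size
    return 0

# Co-readable graph of the layout [1, 2, 3, 4] with B = 2: a path.
edges = {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))}
mis = max_independent_set_size(4, edges)
```

Here the largest independent set has size 2, so every query with |I| ≥ 3 contains a co-readable pair and can be answered with fewer than |I| IOs, even though no element is stored twice.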
Turán numbers (extremal set theory)
• Definition: for n ≥ k ≥ r, T(n, k, r) = the minimal number of r-subsets of {1,…,n} such that every k-subset of {1,…,n} contains one of the r-subsets
  • For r = 2: the minimal number of edges in an n-vertex graph in which all independent sets have size < k
• Turán's theorem
  • lim_{n→∞} T(n, k, r) / (n choose r) exists
  • the exact value of the limit is unknown for r ≥ 3 (and k > r)
• Lower bound
  • T(n, k, r) ≥ (r / k)^(r-1) ∙ (n choose r)
• Paul (Pál) Turán: born 1910 in Budapest, died 1976 in Budapest; Erdős number 1
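For very small parameters the definition of T(n, k, r) can be evaluated by exhaustive search. This is a direct, exponential-time transcription of the definition, meant only to make it concrete; the function name `turan_number` is illustrative.

```python
from itertools import combinations

def turan_number(n, k, r):
    """Smallest family of r-subsets of {1..n} hitting every k-subset (brute force)."""
    ground = range(1, n + 1)
    r_subsets = [frozenset(s) for s in combinations(ground, r)]
    k_subsets = [frozenset(s) for s in combinations(ground, k)]
    # Try families of increasing size; the first one that works is minimal.
    for size in range(len(r_subsets) + 1):
        for family in combinations(r_subsets, size):
            if all(any(e <= big for e in family) for big in k_subsets):
                return size
    return None  # unreachable for n >= k >= r

print(turan_number(4, 3, 2))  # graphs on 4 vertices without an independent triple
print(turan_number(5, 3, 2))  # graphs on 5 vertices without an independent triple
```

For r = 2 the optimal families are disjoint unions of cliques: two disjoint edges for n = 4, k = 3, and a triangle plus an edge for n = 5, k = 3.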
Near-Optimal IO
• Theorem
  • if we want ≤ c ∙ |I| / B IOs for all queries with |I| ≥ M
  • then we need ≥ n^r / (4∙B∙M)^(r-1) space, where r = B / c
• Proof sketch
  • construct a hypergraph G with n vertices and an edge {i_1,…,i_r} if the corresponding r elements can be read in one IO
  • as before: more edges ⇒ more space
  • as before: large queries IO-efficient ⇒ no large independent sets
  • as before: no large independent sets ⇒ many edges (this needs a version of Turán's theorem for hypergraphs)
Near-Optimal IO (continued)
• For M linear in n, the bound n^r / (4∙B∙M)^(r-1) grows only linearly in n, so there is hope for M linear in n
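Evaluating the lower bound for a few parameter choices shows why large queries are the hopeful case. A small sketch, assuming c divides B so that r = B / c is an integer; the function name and the sample values are illustrative.

```python
def space_lower_bound(n, B, M, c):
    """The bound n^r / (4*B*M)^(r-1) from the theorem, with r = B / c."""
    r = B // c  # assumption: c divides B
    return n**r / (4 * B * M) ** (r - 1)

n, B, c = 1000, 4, 2          # hypothetical sample parameters, r = 2
small_queries = space_lower_bound(n, B, M=10, c=c)
linear_queries = space_lower_bound(n, B, M=n // 10, c=c)
```

With M = 10 the bound is 6250 cells, well above n = 1000; with M = n / 10 it drops to 625, below n, so the lower bound no longer rules out a linear-space solution.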
Fixed set of linear-size queries
• Fixed set of queries ℐ = {I_1,…,I_ℓ}; index size |ℐ| = Σ_{I ∈ ℐ} |I|; assume each I_i is a random M-subset of {1,…,n}
• Goal: ≤ c ∙ |I| / B IOs for each query I ∈ ℐ, and ≤ ε ∙ |ℐ| space
• Algorithm, special case
  • Pick a random pair i, j
  • If I_i and I_j have B elements in common, remove them from both sets and add them to the precomputed array
  • Repeat until the total volume left is ≤ ε' ∙ |ℐ|
  • Cover the remainder of each query separately
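The special-case algorithm can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: the function name, the `max_attempts` cut-off, the fixed seed, and the choice of which B common elements to take are all assumptions added to make the sketch runnable.

```python
import random

def build_shared_blocks(queries, B, eps_rest, max_attempts=10_000, seed=0):
    """Greedily share blocks of B common elements between random pairs of queries.

    Returns (array, used): array is a list of blocks (tuples of element ids),
    used[q] lists the indices of the blocks that query q has to read.
    """
    rng = random.Random(seed)
    rest = [set(q) for q in queries]          # elements of each query not yet stored
    total = sum(len(q) for q in queries)      # index size
    array = []
    used = [[] for _ in queries]
    for _ in range(max_attempts):
        if sum(len(s) for s in rest) <= eps_rest * total:
            break                             # remaining volume is small enough
        i, j = rng.sample(range(len(rest)), 2)
        common = sorted(rest[i] & rest[j])
        if len(common) >= B:
            block = tuple(common[:B])         # one block, read by both queries
            rest[i] -= set(block)
            rest[j] -= set(block)
            array.append(block)
            used[i].append(len(array) - 1)
            used[j].append(len(array) - 1)
    for q, s in enumerate(rest):              # cover each remainder separately
        left = sorted(s)
        for start in range(0, len(left), B):
            array.append(tuple(left[start:start + B]))
            used[q].append(len(array) - 1)
    return array, used
```

Every shared block is stored once but serves two queries with a full block of B useful elements, which is where the space saving over storing each query separately comes from.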
Fixed set of linear-size queries (continued)
• Algorithm, general case
  • Pick a random k-tuple i_1,…,i_k
  • If each of I_{i_1},…,I_{i_k} has B / c elements from a common block of size B, remove them from each of the sets and add the block to the precomputed array (checking this condition is non-trivial!)
  • Repeat until the total volume left is ≤ ε' ∙ |ℐ|
  • Cover the remainder of each query separately
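To see why the check in the general case is non-trivial, here is a brute-force version that simply tries every candidate block. It is exponential in B and only illustrates the condition on tiny inputs; it is not the method used in the talk, and the function name is made up.

```python
from itertools import combinations

def find_common_block(sets, B, c):
    """Return a block of (at most) B elements containing at least B / c
    elements of every given set, or None if no such block exists."""
    need = -(-B // c)  # ceil(B / c)
    universe = sorted(set().union(*sets))
    for block in combinations(universe, min(B, len(universe))):
        chosen = set(block)
        if all(len(chosen & s) >= need for s in sets):
            return chosen
    return None

# Three overlapping sets: a block of size 4 can hold 2 elements of each.
overlapping = find_common_block([{1, 2, 3}, {3, 4, 5}, {5, 6, 1}], B=4, c=2)
# Three disjoint sets: 3 * 2 = 6 elements cannot fit into a block of size 4.
disjoint = find_common_block([{1, 2}, {3, 4}, {5, 6}], B=4, c=2)
```

The number of candidate blocks grows like (|universe| choose B), so a practical implementation would need a smarter search or a heuristic.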
Main Theorem
• For I from ℐ = {I_1,…,I_ℓ}, each I_i random with |I_i| = M:
  • #IOs: ≤ c ∙ |I| / B
  • space: ≈ log(n / M) / log(c ∙ n / B) ∙ |ℐ|
• For example
  • c = 1, M = B: space 100% ∙ |ℐ|
  • c = 1, M = n: space 0% ∙ |ℐ|
  • c = 1, M = n/10, n/B = 1000: space 33% ∙ |ℐ|
  • c = 2, M = n/20, n/B = 1000: space 40% ∙ |ℐ|
• Proof: well …
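The example percentages follow directly from the space formula. A short sketch that evaluates it; the function name and the parametrization by the ratios n / M and n / B are choices made here for convenience.

```python
from math import log

def space_fraction(n_over_M, n_over_B, c):
    """Approximate space as a fraction of the index size:
    log(n / M) / log(c * n / B)."""
    return log(n_over_M) / log(c * n_over_B)

examples = {
    "c=1, M=B":                space_fraction(1000, 1000, 1),  # n/M = n/B
    "c=1, M=n":                space_fraction(1, 1000, 1),
    "c=1, M=n/10, n/B=1000":   space_fraction(10, 1000, 1),
    "c=2, M=n/20, n/B=1000":   space_fraction(20, 1000, 2),
}
```

The four values come out as 1, 0, one third, and about 0.39, matching the slide's 100%, 0%, 33% and (rounded) 40%.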