Ranking in DB
Laks V.S. Lakshmanan
Dept. of CS, UBC
Why ranking in query answering? 1/3
• Multimedia data, fuzzy querying: e.g., “find top 2 red objects with a soft texture”.
• [Figure: per-attribute scores are combined into an overall score.]
Why ranking? 2/3
• IR: “find top 5 documents relevant to ‘computational’, ‘neuroscience’ and ‘brain theory’”.
• IR systems maintain full-text indexes: inverted lists of docs w.r.t. each keyword.
• Same Q/A paradigm as before.
Why ranking? 3/3
• Data streams, e.g., of network flow data: “find 10 users with the max. BW consumption and max. #packets communicated”.
• In a social net, find 5 items tagged as most relevant to “lawn mowing” by the user’s friends.
• etc.
• Fagin et al.: pioneering papers in PODS 1996 and 2001, TODS 2003. Has burgeoned into a field now.
• Focus on the middleware algorithm which, given a score combination function, computes the top-K answers by probing different subsystems (or ranked lists).
Computational model
• Naïve method: score every object, sort, output the top K.
• How to compute top-K efficiently?
• Access methods:
  • Sorted access (sequential access) [SA].
  • Random access [RA].
• Different optimization metrics:
  • Overall running time of the algorithm.
  • cost(SA) < cost(RA): minimize RAs.
  • RA not possible#: avoid RAs.
  • Combined optimization.
• Has led to a variety of algorithms.
• Memory vs. disk model.
#: typical in IR systems.
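As a baseline for the access model above, here is a minimal sketch of the naïve method together with a simple access-cost count. The names `naive_top_k`, `access_cost`, `TOY_LISTS` and the toy scores are illustrative assumptions, not taken from the slides; the combination function t defaults to sum.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One ranked list: (object, score) pairs in descending score order.
RankedList = List[Tuple[str, float]]


def naive_top_k(
    lists: Sequence[RankedList],
    k: int,
    t: Callable[[Sequence[float]], float] = sum,
) -> Tuple[List[Tuple[str, float]], int]:
    """Read every list to the end under SA, combine scores with t, sort, cut at k.

    Returns the top-k (object, overall score) pairs and the number of SAs made.
    """
    m = len(lists)
    scores: Dict[str, List[float]] = {}
    n_sa = 0
    for i, ranked in enumerate(lists):
        for obj, s in ranked:
            n_sa += 1
            # An object missing from a list keeps score 0.0 there (assumption).
            scores.setdefault(obj, [0.0] * m)[i] = s
    overall = [(obj, t(vals)) for obj, vals in scores.items()]
    overall.sort(key=lambda p: (-p[1], p[0]))  # descending score, ties by name
    return overall[:k], n_sa


def access_cost(n_sa: int, n_ra: int, c_sa: float = 1.0, c_ra: float = 1.0) -> float:
    """Total access cost: (#SAs * cost per SA) + (#RAs * cost per RA)."""
    return n_sa * c_sa + n_ra * c_ra


# Hypothetical toy input: 2 lists over 4 objects.
TOY_LISTS = [
    [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.3)],
    [("b", 0.9), ("c", 0.7), ("a", 0.6), ("d", 0.1)],
]
naive_top, naive_sa = naive_top_k(TOY_LISTS, k=2)
```

The naïve method always pays one SA per list entry and no RAs; the algorithms on the following slides try to stop long before the lists are exhausted.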
Fagin’s Algorithm (FA)
• m lists, each sorted by descending score.
• Access (SA) all lists in parallel.
• Remember each object seen under SA; H = set of objects seen (under SA) in all m lists.
• Repeat until |H| >= K.
• For each object seen, do RA on the other lists as needed to fetch its “missing” scores, and compute its overall score t(x) = t(x1, …, xm). Store (obj, score) in set Y.
• Sort Y in descending order of scores, breaking ties arbitrarily, and output the top K.
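The steps of FA can be sketched as follows. This is a minimal in-memory illustration, not the authors' implementation: random access is simulated with a dictionary per list, every object is assumed to appear in every list, and the function name and toy data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

RankedList = List[Tuple[str, float]]  # (object, score), descending score


def fagin_algorithm(
    lists: Sequence[RankedList],
    k: int,
    t: Callable[[Sequence[float]], float] = sum,
) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k (object, score) pairs, depth reached under SA)."""
    m = len(lists)
    lookup = [dict(ranked) for ranked in lists]  # stands in for RA
    seen_in: Dict[str, Set[int]] = {}            # object -> lists where seen under SA
    in_all: Set[str] = set()                     # H: objects seen in all m lists
    depth = 0
    max_depth = max(len(ranked) for ranked in lists)

    # Phase 1: parallel SA until K objects have been seen in every list.
    while len(in_all) < k and depth < max_depth:
        for i, ranked in enumerate(lists):
            if depth < len(ranked):
                obj, _ = ranked[depth]
                seen_in.setdefault(obj, set()).add(i)
                if len(seen_in[obj]) == m:
                    in_all.add(obj)
        depth += 1

    # Phase 2: RA for the missing scores of every object seen, then sort Y.
    y = [(obj, t([lookup[i].get(obj, 0.0) for i in range(m)])) for obj in seen_in]
    y.sort(key=lambda p: (-p[1], p[0]))
    return y[:k], depth


# Hypothetical toy input: 2 lists over 4 objects.
FA_LISTS = [
    [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.3)],
    [("b", 0.9), ("c", 0.7), ("a", 0.6), ("d", 0.1)],
]
fa_top, fa_depth = fagin_algorithm(FA_LISTS, k=1)
```

On this toy input with K = 1, the sketch stops at depth 2, when "b" has been seen in both lists, and never reads "d". Note that Y holds every object seen, so the buffer can grow well beyond K.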
Example of FA (K = 4; overall score = sum of the 4 list scores)
[Figure: four lists over objects A to J, each sorted by descending score; successive frames advance the SA cursors and annotate each object seen with its overall score.]
List entries as extracted (column layout lost):
C(0.95) E(1.00) J(1.00) B(0.90) C(0.95) J(0.80) G(0.95) D(0.70) G(0.85) E(0.85) H(0.90) H(0.80) H(0.65) B(0.85) E(0.75) G(0.75) G(0.60) D(0.80) B(0.55) C(0.70) B(0.75) I(0.70) D(0.65) F(0.60) I(0.50) A(0.65) E(0.45) I(0.55) A(0.60) A(0.50) D(0.40) J(0.55) F(0.40) F(0.45) A(0.30) J(0.30) F(0.50) I(0.30)
Answers seen in >= 1 list, i.e., Y (unsorted), with overall scores:
  B 3.05   C 3.40   D 2.55   E 3.05   G 3.15   H 3.30   I 2.05   J 2.65
Answers seen (under SA) in all 4 lists, i.e., H: H, G, B, C. |H| = 4.
FA Example concluded
• A, F: not seen in any list. Yet we are sure they can’t make it into the top 4. Why?
• Based on where the cursors are now, what’s the max. possible score for A, F?
• What assumptions are being made about t()?
• FA is shown to be optimal with very high probability [Fagin: PODS 1996].
• But it can be beaten by other algorithms on specific inputs.
• What about buffer size?
Threshold Algorithm (TA)
• Do parallel SA on all m lists.
• For each new object x, fetch its scores from the other lists (by RA) and compute its overall score.
  • If |Buffer| < K, add (x, score(x)) to Buffer;
  • Else if score(x) <= K-th score in Buffer, toss x;
  • Else replace the bottom of Buffer with (x, score(x)).
• Threshold := t(worst score seen on L1, …, worst score seen on Lm), i.e., t applied to the scores at the current cursor positions.
• Stop when Buffer holds K objects and Threshold <= K-th score in Buffer.
• Output the top-K objects & scores (in Buffer).
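A minimal sketch of TA under the same simplifying assumptions as before (in-memory lists of equal length, every object present in every list, RA simulated by dictionary lookup, t defaulting to sum). The function name and toy data are hypothetical; the buffer is kept as a min-heap so that its bottom entry is the K-th score.

```python
import heapq
from typing import Callable, List, Sequence, Set, Tuple

RankedList = List[Tuple[str, float]]  # (object, score), descending score


def threshold_algorithm(
    lists: Sequence[RankedList],
    k: int,
    t: Callable[[Sequence[float]], float] = sum,
) -> Tuple[List[Tuple[str, float]], int]:
    """Return (top-k (object, score) pairs, depth at which TA stopped)."""
    m = len(lists)
    n = len(lists[0])                            # assumption: equal-length lists
    lookup = [dict(ranked) for ranked in lists]  # stands in for RA
    buffer: List[Tuple[float, str]] = []         # min-heap, at most k entries
    seen: Set[str] = set()

    for depth in range(n):
        bottoms = []                             # scores under the cursors
        for i, ranked in enumerate(lists):
            obj, s = ranked[depth]
            bottoms.append(s)
            if obj in seen:
                continue
            seen.add(obj)
            score = t([lookup[j][obj] for j in range(m)])
            if len(buffer) < k:
                heapq.heappush(buffer, (score, obj))
            elif score > buffer[0][0]:           # beats the current K-th score
                heapq.heapreplace(buffer, (score, obj))
            # otherwise score <= K-th score: toss the object
        threshold = t(bottoms)
        if len(buffer) == k and threshold <= buffer[0][0]:
            break

    top = sorted(((o, s) for s, o in buffer), key=lambda p: (-p[1], p[0]))
    return top, depth + 1


# Hypothetical toy input: 2 lists over 4 objects.
TA_LISTS = [
    [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.3)],
    [("b", 0.9), ("c", 0.7), ("a", 0.6), ("d", 0.1)],
]
ta_top, ta_depth = threshold_algorithm(TA_LISTS, k=1)
```

Unlike FA, the buffer never holds more than K entries. On the toy input with K = 1 the threshold drops from 1.8 at depth 1 to 1.5 at depth 2, which is below the buffered score of "b", so the sketch stops there.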
TA Example (same four lists as in the FA example; K = 4)
[Figure: the four lists with a threshold bar moving down one depth per frame; x1..x4 are the scores under the bar, T = x1 + x2 + x3 + x4; X marks an object tossed from the buffer.]
Depth   x1     x2     x3     x4     T
1       0.95   1.00   0.95   1.00   3.90
2       0.90   0.95   0.80   0.95   3.60
3       0.85   0.85   0.70   0.90   3.30
4       0.80   0.80   0.65   0.85   3.10
5       0.75   0.75   0.60   0.80   2.90  ==> can stop! (T <= 4th score in buffer, 3.05)
Overall scores computed: B 3.05 (X), C 3.40, D 2.55 (X), E 3.05, G 3.15, H 3.30, J 2.65 (X).
Buffer at termination: C 3.40, H 3.30, G 3.15, E 3.05.
TA Remarks
• What properties do we require of t() for TA to be correct?
• How large does the buffer ever get with TA? What happened with FA?
• Performance guarantee of TA (instance optimality):
  • 𝒟: class of DBs; 𝒜: class of algorithms. A ∈ 𝒜 is instance optimal provided ∀B ∈ 𝒜, ∀D ∈ 𝒟: cost(A, D) <= c·cost(B, D) + c’, for some fixed constants c, c’.
  • c = optimality ratio.
• TA is instance optimal over algo’s not making wild guesses.
No Random Access Algorithm (NRA)
• What if cost(RA) > cost(SA), or RA isn’t allowed at all?
• Do SA on all lists in parallel. At depth d:
  • Maintain the bottom scores b1, …, bm (the last score seen on each list at depth d).
  • Let x be any object seen so far, say in lists {1, …, i}, with scores x1, …, xi.
  • Best(x) = t(x1, …, xi, bi+1, …, bm). (An unseen object has Best = t(b1, …, bm).)
  • Worst(x) = t(x1, …, xi, 0, …, 0).
  • TopK contains the K objects with max Worst scores at depth d. Break ties using Best. M = K-th Worst score in TopK.
  • Object y is viable if Best(y) > M.
• Stop when TopK contains >= K distinct objects and no object outside TopK is viable. Return TopK.
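The bookkeeping above can be sketched as follows, assuming in-memory lists of equal length in which every object appears in every list, and t = sum by default. The function name `nra` and the toy data are hypothetical. The sketch recomputes all bounds at every depth for clarity; a real implementation would update them incrementally.

```python
from typing import Callable, Dict, List, Sequence, Tuple

RankedList = List[Tuple[str, float]]  # (object, score), descending score


def nra(
    lists: Sequence[RankedList],
    k: int,
    t: Callable[[Sequence[float]], float] = sum,
) -> Tuple[List[Tuple[str, float, float]], int]:
    """Return (top-k as (object, Worst, Best) triples, depth at which NRA stopped)."""
    m, n = len(lists), len(lists[0])             # assumption: equal-length lists
    known: Dict[str, Dict[int, float]] = {}      # object -> {list index: score}

    for depth in range(n):
        bottoms = []                             # last score seen on each list
        for i, ranked in enumerate(lists):
            obj, s = ranked[depth]
            known.setdefault(obj, {})[i] = s
            bottoms.append(s)

        # Worst: unknown scores count as 0. Best: unknown scores count as the bottom.
        worst = {o: t([sc.get(i, 0.0) for i in range(m)]) for o, sc in known.items()}
        best = {o: t([sc.get(i, bottoms[i]) for i in range(m)]) for o, sc in known.items()}

        order = sorted(known, key=lambda o: (-worst[o], -best[o], o))
        top_k, rest = order[:k], order[k:]
        if len(top_k) < k:
            continue
        m_k = worst[top_k[-1]]                   # M: K-th Worst score in TopK
        unseen_viable = t(bottoms) > m_k         # could an unseen object still qualify?
        if not unseen_viable and all(best[o] <= m_k for o in rest):
            return [(o, worst[o], best[o]) for o in top_k], depth + 1

    # Lists exhausted (e.g. fewer than k objects): return what we have.
    return [(o, worst[o], best[o]) for o in top_k], n


# Hypothetical toy input: 2 lists over 4 objects.
NRA_LISTS = [
    [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.3)],
    [("b", 0.9), ("c", 0.7), ("a", 0.6), ("d", 0.1)],
]
nra_top, nra_depth = nra(NRA_LISTS, k=1)
```

Because no RA is made, the result carries score intervals rather than exact scores: NRA guarantees the right set of top-K objects, but Worst and Best of a returned object need not coincide.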
NRA Example (same four lists as in the FA example); each object seen is annotated with [Worst, Best]
[Figure: the four lists with the SA cursors at depth 1, then at depth 2.]
Depth 1: C [0.95, 3.90]   E [1.00, 3.90]   H [0.95, 3.90]   J [1.00, 3.90]
Depth 2: B [0.90, 3.60]   C [1.90, 3.75]   E [1.00, 3.65]   G [0.95, 3.60]   H [0.95, 3.65]   J [1.80, 3.65]
NRA Features
• What sort of t() do we need to assume for NRA to work correctly?
• How large can the buffers get?
• How does the amount of bookkeeping compare with TA?
• NRA is instance optimal over algo’s not making RAs.
Combined optimization
• What if we are told that cost(RA) is a known constant multiple of cost(SA)?
• Can we find algo’s better than NRA and TA in this case?
• Combined algorithm = CA. (See Fagin et al.’s paper for details.)
Worrying about I/O cost
• Based on Bast et al., VLDB 2006.
• Inverted lists of (itemID, score) entries in desc. score order, as usual, but on disk.
• Each list is split into blocks: within a block, entries are sorted by itemID; across blocks, still in desc. score order.
• Inverted Block Index (IBI) Algorithm.
• What is an IBI?
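The block layout described above can be illustrated with a short sketch. It builds the blocks in memory only; on-disk storage, the block size and the helper name `build_block_index` are assumptions made for illustration. The postings are those of List 2 in the motivating example that follows, with the numeric part of each document name used as itemID.

```python
from typing import List, Tuple

Posting = Tuple[int, float]  # (itemID, score)


def build_block_index(postings: List[Posting], block_size: int) -> List[List[Posting]]:
    """Split a list into score-ordered blocks, each sorted internally by itemID."""
    # Global order: descending score (ties broken by itemID for determinism).
    by_score = sorted(postings, key=lambda p: (-p[1], p[0]))
    blocks = []
    for start in range(0, len(by_score), block_size):
        block = by_score[start:start + block_size]
        # Inside a block the score order is given up in favour of itemID order.
        blocks.append(sorted(block, key=lambda p: p[0]))
    return blocks


LIST_2 = [(25, 0.7), (38, 0.5), (14, 0.5), (83, 0.5), (17, 0.2)]
ibi_blocks = build_block_index(LIST_2, block_size=2)
```

Every score in a block is at least as large as every score in the next block, so reading whole blocks in order still yields a valid upper bound (the smallest score of the last block read) for all entries not yet read.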
A Motivating Example
List 1          List 2          List 3
Doc17 : 0.8     Doc25 : 0.7     Doc83 : 0.9
Doc78 : 0.2     Doc38 : 0.5     Doc17 : 0.7
  ·             Doc14 : 0.5     Doc61 : 0.3
  ·             Doc83 : 0.5       ·
  ·               ·               ·
  ·             Doc17 : 0.2       ·
Round 1 (SA on 1,2,3): Doc17 : [0.8 , 2.4]   Doc25 : [0.7 , 2.4]   Doc83 : [0.9 , 2.4]   unseen: ≤ 2.4
Round 2 (SA on 1,2,3): Doc17 : [1.5 , 2.0]   Doc25 : [0.7 , 1.6]   Doc83 : [0.9 , 1.6]   unseen: ≤ 1.4
Round 3 (SA on 2,2,3!): Doc17 : [1.5 , 2.0]   Doc83 : [1.4 , 1.6]   unseen: ≤ 1.0   Note deviation from round-robin.
Round 4 (RA for Doc17): Doc17 : 1.7   all others < 1.7   done!
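The intervals in rounds 1 to 3 can be recomputed with a few lines of code. The sketch below replays the SA schedule of the example on the visible list prefixes, with t = sum; the helper name `replay` and the data layout are assumptions. Values are rounded to two decimals only to avoid floating-point noise.

```python
from typing import Dict, List, Sequence, Tuple

# Visible prefixes of the three lists in the example.
LISTS: Dict[int, List[Tuple[str, float]]] = {
    1: [("Doc17", 0.8), ("Doc78", 0.2)],
    2: [("Doc25", 0.7), ("Doc38", 0.5), ("Doc14", 0.5), ("Doc83", 0.5)],
    3: [("Doc83", 0.9), ("Doc17", 0.7), ("Doc61", 0.3)],
}
# Lists accessed in rounds 1, 2 and 3; round 3 deviates from round-robin.
SCHEDULE = [(1, 2, 3), (1, 2, 3), (2, 2, 3)]


def replay(lists: Dict[int, List[Tuple[str, float]]], schedule: Sequence[Sequence[int]]):
    """Return, per round, ({doc: (worst, best)}, best score of any unseen doc)."""
    pos = {i: 0 for i in lists}                  # cursor position per list
    high: Dict[int, float] = {}                  # score at the cursor per list
    known: Dict[str, Dict[int, float]] = {}      # doc -> {list: score}
    rounds = []
    for accesses in schedule:
        for i in accesses:
            doc, s = lists[i][pos[i]]
            pos[i] += 1
            high[i] = s
            known.setdefault(doc, {})[i] = s
        bounds = {}
        for doc, sc in known.items():
            worst = sum(sc.values())
            best = worst + sum(h for i, h in high.items() if i not in sc)
            bounds[doc] = (round(worst, 2), round(best, 2))
        rounds.append((bounds, round(sum(high.values()), 2)))
    return rounds


rounds = replay(LISTS, SCHEDULE)
# Round 4: one RA fetches Doc17's score in List 2 (0.2), giving its exact total.
doc17_exact = round(0.8 + 0.2 + 0.7, 2)
```

The replay tracks every document read, whereas the example lists only those still in contention: Doc25, for instance, has Best = 1.2 after round 3, below Doc17's Worst of 1.5, which is why it no longer appears.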
IBI Algorithm
• Same setting as NRA/CA, except use IBI.
• Maintain two lists: Top-K items (T = d1, …, dk) and StillHaveAShot (SHASH) items (S = dk+1, …, dk+q).
• pos_i = current cursor position on list Li.
• high_i = score in Li at the current cursor position (upper-bounds the score of unseen items in Li).
• For items d in S:
  • E(d) = attributes whose scores are known.
  • E~(d) = attributes whose scores are unknown.
  • Worst(d) = total score from E(d).
  • Best(d) = Worst(d) + Σ{high_i | i ∈ E~(d)}. (Exactly as in Fagin et al.)
IBI Algorithm (contd.)
• In each round, compute:
  • min-k = min{Worst(d) | d ∈ T}.
  • bestscore that any unseen doc can have = sum of all high_i’s.
  • For dj ∈ S: def_j = (min-k) - Worst(dj). [Denotes the deficit below the qualification level for top-k.]
• T sorted in desc. Worst(); S sorted in desc. Best(). [Sorting on (score, itemID) for fast processing.]
• Invariant: min-k >= max{Worst(d) | d ∈ S}.
• Termination: when min-k >= max{Best(d) | d ∈ S}.
• Can remove an object from S whenever its Best <= min-k. Stop when S = {}.
• Early termination AND minimal bookkeeping are BOTH important for performance.
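One round of this bookkeeping can be sketched as below. The function name `round_state` and its return shape are hypothetical, and the Worst/Best values are taken as given rather than maintained incrementally. As an added assumption, the `done` flag also requires that no unseen document can exceed min-k, using the sum of the high_i's. The sample values are those of rounds 3 and 4 of the motivating example (K = 1).

```python
from typing import Dict, List, Sequence, Tuple


def round_state(
    worst: Dict[str, float],
    best: Dict[str, float],
    high: Sequence[float],
    k: int,
) -> Tuple[List[str], List[str], float, Dict[str, float], bool]:
    """Return (T, S, min-k, deficits of items in S, done flag) for one round."""
    ranked = sorted(worst, key=lambda d: (-worst[d], d))   # desc. Worst, ties by id
    top = ranked[:k]                                       # T
    min_k = worst[top[-1]] if len(top) == k else 0.0
    # S keeps only candidates that can still overtake; the rest are pruned.
    shash = [d for d in ranked[k:] if best[d] > min_k]
    shash.sort(key=lambda d: (-best[d], d))                # desc. Best
    deficit = {d: min_k - worst[d] for d in shash}
    unseen_best = sum(high)                                # best score of any unseen doc
    done = len(top) == k and not shash and unseen_best <= min_k
    return top, shash, min_k, deficit, done


# State after round 3 of the motivating example.
HIGH = [0.2, 0.5, 0.3]
state3 = round_state(
    worst={"Doc17": 1.5, "Doc83": 1.4, "Doc25": 0.7},
    best={"Doc17": 2.0, "Doc83": 1.6, "Doc25": 1.2},
    high=HIGH,
    k=1,
)
# State after round 4: the RA has resolved Doc17 to its exact score 1.7.
state4 = round_state(
    worst={"Doc17": 1.7, "Doc83": 1.4},
    best={"Doc17": 1.7, "Doc83": 1.6},
    high=HIGH,
    k=1,
)
```

After round 3, Doc83 remains in S with a deficit of about 0.1 while Doc25 is pruned; after the RA in round 4, S is empty and the round reports termination.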
More on IBI Framework
• Instead of scheduling SAs round-robin (RR), schedule them differently for different lists, based on expected score reductions at future cursor positions (a Knapsack problem).
• Do SA*RA*: a phase of SAs followed by a phase of RAs.
• Order RAs based on estimated Prob[dj can get into the top-k answers].