Using the Crowd for Top-K or Group-By Queries • Susan Davidson (U. of Pennsylvania), Sanjeev Khanna (U. of Pennsylvania), Tova Milo (Tel Aviv U.), Sudeepa Roy (U. of Washington) • ICDT 2013
Example: Using Wisdom of Crowd in DB Queries • Database of football players' pictures • Multiple photos of each player at different ages • Q. Group the photos of individual players • Q. Find their most recent photos
How a DBMS Thinks • Q. Group the photos of individual players: Group-By queries! Use the “Name” attribute • Q. Find their most recent photos: Max/Top-k queries! Use the “Date” attribute • What if the name/date is missing? • Image processing? Photo forensics?
Ask the “Crowd”!
How the Crowd Thinks - 1 • Q. Group the photos of individual players
How the Crowd Thinks - 2 • Q. Find their most recent photos
Crowd Sourcing • Using human intelligence to do tasks that are hard to automate • A recent topic of interest in the database and other research communities • Many crowdsourcing platforms
This talk: Using Wisdom of Crowd in Top-K and Group-By Queries • A possible query (Top-k/max combined with Group By / Clustering): SELECT TOP 1 R.picture FROM SoccerPlayerPhotoTable AS R GROUP BY R.player ORDER BY R.date DESC • Fixed but unknown attributes: R.player, R.date
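To make the intent of the query concrete, here is a minimal Python sketch of what it computes when the attributes are known. The toy rows, file names, and the helper `most_recent_per_player` are hypothetical and for illustration only; in the talk's setting the player and date columns are unknown and must be inferred from crowd comparisons.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy table of (picture, player, date) rows.
# In the crowd setting, player and date would NOT be available.
photos = [
    ("p1.jpg", "Maradona", date(1986, 6, 22)),
    ("p2.jpg", "Maradona", date(1994, 6, 21)),
    ("p3.jpg", "Pele", date(1970, 6, 21)),
    ("p4.jpg", "Pele", date(1958, 6, 29)),
]

def most_recent_per_player(rows):
    """Group rows by player (Group-By), then keep the latest photo (Top-1)."""
    groups = defaultdict(list)
    for picture, player, taken in rows:
        groups[player].append((taken, picture))
    # max() on (date, picture) tuples picks the most recent date.
    return {player: max(items)[1] for player, items in groups.items()}

result = most_recent_per_player(photos)
```

The grouping step corresponds to the clustering problem and the `max` step to the max/top-k problem treated in the rest of the talk.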
Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering) • Combining the two is easy
Elements: Type and Value • #Elements = n (in the example, n = 16) • Two attributes: Type and Value • Type: e.g. Name = “Maradona”; used in Group-By; #Types = J (J clusters; in the example, J = 4) • Value: e.g. date when the photo was taken; used in Top-K • Unknown, but a “ground truth” exists for Types and Values
Type and Value Comparisons • DB queries vs. comparisons (questions to the crowd) • Type comparisons: Type(x) = Type(y)? (e.g. “Same person?”) • Value comparisons: Value(x) > Value(y)? (e.g. “Is the first photo older?”); assumes same type • Answer is Boolean • We cannot ask “What is Type(x)/Value(x)?” • But answers are not always correct: the crowd makes mistakes • Next: error model
Constant Error Model • Type comparisons: Type(x) = Type(y)? • Value comparisons: Value(x) > Value(y)? • Constant error model: standard model; probability of wrong answer ≤ ½ - ε, ε > 0 • “Is the first photo older?” Wrong: w.p. ≤ ½ - ε; Correct: w.p. ≥ ½ + ε • “Same person?” Wrong: w.p. ≤ ½ - ε; Correct: w.p. ≥ ½ + ε
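A simulated crowd can help when experimenting with this model. The sketch below is an illustration, not part of the talk: the factory `make_constant_error_oracle` and the toy data are hypothetical names, and it assumes every answer is wrong with probability exactly ½ - ε, whereas the model only bounds the error from above.

```python
import random

def make_constant_error_oracle(values, types, eps, rng):
    """Return (ask_value, ask_type); each answer is flipped w.p. 1/2 - eps.

    Assumption: the error probability equals the bound, independently
    for every question.
    """
    p_wrong = 0.5 - eps

    def noisy(truth):
        return (not truth) if rng.random() < p_wrong else truth

    def ask_value(x, y):
        # "Value(x) > Value(y)?"
        return noisy(values[x] > values[y])

    def ask_type(x, y):
        # "Type(x) = Type(y)?"
        return noisy(types[x] == types[y])

    return ask_value, ask_type

rng = random.Random(0)
values = {"a": 1990, "b": 2005}          # hypothetical photo dates
types = {"a": "Maradona", "b": "Maradona"}
ask_value, ask_type = make_constant_error_oracle(values, types, eps=0.3, rng=rng)

# The true answer to "Value(a) > Value(b)?" is False, so every True is an error.
trials = 10_000
error_rate = sum(ask_value("a", "b") for _ in range(trials)) / trials
```

With `eps=0.3` the empirical `error_rate` should land near 0.2, matching the ½ - ε bound.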
Variable Error Model (this paper) • For value comparisons only: Value(x) > Value(y)? • Error probability ≤ 1/f(Δ) - ε, where Δ = distance of x, y in sorted order • “Is the first photo older?” is harder for small Δ and easier for large Δ • f is strictly monotone: x1 > x2 ⇒ f(x1) > f(x2) • f(Δ) ≥ 2, so the error probability is ≤ ½ - ε • e.g. f(Δ) = e^Δ, f(Δ) = Δ + 1, f(Δ) = log Δ + 2 • e.g. error probability when f(Δ) = e^Δ: ≤ 1/e for Δ = 1, ≤ 1/e² for Δ = 2 • [Figure: elements x1, x2, x3, x4 in sorted order, with Δ = 2 marked]
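The effect of the distance on the error bound can be tabulated directly. The following sketch is illustrative only: the names `F_CHOICES`, `error_bound`, and `make_variable_error_oracle` are hypothetical, the logarithm is assumed to be natural, the ε slack is ignored, and the simulated oracle errs with probability exactly 1/f(distance) rather than at most that.

```python
import math
import random

# The three example functions from the slide (natural log is an assumption).
F_CHOICES = {
    "exp": lambda d: math.exp(d),
    "linear": lambda d: d + 1,
    "log": lambda d: math.log(d) + 2,
}

def error_bound(f, delta):
    """Upper bound 1/f(delta) on the error for two elements at distance delta."""
    if delta < 1:
        raise ValueError("distance in sorted order must be at least 1")
    return 1.0 / f(delta)

def make_variable_error_oracle(rank, f, rng):
    """rank maps each element to its position in sorted order (ground truth)."""
    def ask_value(x, y):
        delta = abs(rank[x] - rank[y])   # x and y must be distinct elements
        truth = rank[x] > rank[y]
        wrong = rng.random() < error_bound(f, delta)
        return (not truth) if wrong else truth
    return ask_value

# Error bounds for distances 1..4 under each choice of f.
table = {name: [error_bound(f, d) for d in range(1, 5)]
         for name, f in F_CHOICES.items()}
```

All three functions give a bound of at most ½ at distance 1, and the exponential choice decays fastest, which is why far-apart elements are cheap to compare reliably.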
Framework at a Glance • Crowd-sourced DBMS: the internal computation (Top-K / Group-By) sends comparisons to the crowd, e.g. Value(a) > Value(b)? and Value(b) > Value(c)?, and receives Yes/No answers • Comparison error: the crowd’s answer may be wrong with some probability • Cost model: asking questions costs money; additive; count #comparisons • We still want the correct answer w.h.p.
Our Goal • Minimize the total #comparisons while outputting the correct answer (exact top-k or clusters) w.p. 1 – δ (given constant δ > 0)
Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering)
Problem Statement: Max/Top-k • Input: n elements, k • Same type • Only value comparisons are used • Output: all top-k elements (w.p. 1 – δ) • [Figure: elements x3, x1, x2, x4, with k = 2] • We focus on Max: the algorithm for top-k builds on the algorithm for max
Max: Our Results • Required confidence: 1 – δ (δ = constant); Δ = distance in sorted order • o(n) term: asymptotically smaller than c·n for every c > 0 • Further: f(Δ) exponential: n + O(1) • f(Δ) linear: n + O(log log n) • f(Δ) logarithmic: n + O(n^(1/(1+δ)))
Review: Tournament Tree for Max • [Figure: tournament tree with n elements as leaves and comparisons at internal nodes; levels labeled #comparisons = c·1, c·3, c·5 from the lowest internal level to the root; the max, 10, wins at the root] • Exact comparisons: binary tree structure is not necessary • Noisy comparisons: repeat comparison + majority vote • Constant error model: Θ(n) algorithm (Feige et al. ’94) • Our goal: total no. of comparisons = n + o(n) • Cannot repeat even twice in most of the internal nodes
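The "repeat comparison + majority vote" idea can be sketched as a simple knockout tournament. This is a minimal illustration with a fixed number of repetitions per match, not the algorithm of Feige et al.; the names `majority_compare` and `tournament_max` and the 0.2 error rate are assumptions made for the example.

```python
import random

def majority_compare(ask, x, y, reps):
    """Ask "x > y?" reps times (reps should be odd); return the majority answer."""
    yes = sum(1 for _ in range(reps) if ask(x, y))
    return 2 * yes > reps

def tournament_max(items, ask, reps=1):
    """Knockout tournament; returns (winner, number of questions asked)."""
    current = list(items)
    if not current:
        raise ValueError("items must be non-empty")
    asked = 0
    while len(current) > 1:
        nxt = []
        for i in range(0, len(current) - 1, 2):
            x, y = current[i], current[i + 1]
            nxt.append(x if majority_compare(ask, x, y, reps) else y)
            asked += reps
        if len(current) % 2:
            nxt.append(current[-1])      # odd element out advances unopposed
        current = nxt
    return current[0], asked

# Exact comparisons: 3 matches, each repeated 3 times.
exact = lambda x, y: x > y
winner, asked = tournament_max([9, 5, 10, 2], exact, reps=3)

# Noisy comparisons: each answer is flipped w.p. 0.2 (hypothetical rate).
rng = random.Random(0)
noisy = lambda x, y: (x > y) != (rng.random() < 0.2)
noisy_winner, noisy_asked = tournament_max([9, 5, 10, 2], noisy, reps=15)
```

Every match costs `reps` questions, so the total is `reps * (n - 1)`; this constant-factor overhead is exactly what an n + o(n) bound cannot afford at most internal nodes.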
Main Steps for Max • Goal: n + o(n) • Binary tournament tree • Key idea: start with a random permutation of elements at the leaves • Lower levels (n nodes): just 1 comparison at each internal node; no majority vote • Upper levels (Y nodes): use the algorithm of Feige et al.; Θ(Y) cost; Y = o(n) for any f; Y = O(1) for f(Δ) = e^Δ • Max does not lose in the lower levels w.h.p. • Intuition: Max does not meet 2nd Max in the lower levels w.h.p. • Total no. of comparisons = n + Θ(Y) = n + o(n)
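The two-phase structure can be sketched as follows. This is a simplified illustration under stated assumptions: the upper levels use a plain repeated majority vote as a stand-in for the algorithm of Feige et al., and the function name `two_phase_max` and the parameters `upper_size` and `reps` are hypothetical, not values from the paper.

```python
import random

def two_phase_max(items, ask, upper_size, reps, rng):
    """Random permutation, then a knockout tournament in two phases.

    Lower levels (more than upper_size elements remain): 1 question per match.
    Upper levels: majority of `reps` questions per match (stand-in for
    Feige et al.'s algorithm).
    Returns (winner, number of questions asked).
    """
    players = list(items)
    if not players:
        raise ValueError("items must be non-empty")
    rng.shuffle(players)                 # key idea: random leaf order
    asked = 0
    while len(players) > 1:
        r = 1 if len(players) > upper_size else reps
        nxt = []
        for i in range(0, len(players) - 1, 2):
            x, y = players[i], players[i + 1]
            yes = sum(1 for _ in range(r) if ask(x, y))
            asked += r
            nxt.append(x if 2 * yes > r else y)
        if len(players) % 2:
            nxt.append(players[-1])      # odd element out advances unopposed
        players = nxt
    return players[0], asked

exact = lambda x, y: x > y
winner, asked = two_phase_max(range(16), exact, upper_size=4, reps=5,
                              rng=random.Random(0))
```

With 16 elements, the lower levels cost 8 + 4 questions and the upper levels cost 5 × 3, so the repetition overhead is paid only on the few matches near the root.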
Extension to Top-K • Corollary: if k = o(n), n + o(n) comparisons suffice to find top-k w.h.p. • x1, …, xk do not meet in the lower levels
Outline of this talk • Our Model • Max and Top-K Queries • Group-By Queries (Clustering) • Simple Clustering (using types only) • Clustering with Correlated Values and Types
Simple Clustering • Group n elements into J clusters • Only type comparisons • Constant error model • Õ(nJ) comparisons are sufficient (Õ hides the log factor for comparison error) • J can be unknown • Not surprising • We show an Ω(nJ) lower bound: even if there is no comparison error, even for randomized algorithms, even when J is known
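One natural way to reach the nJ upper bound is to compare each element against one representative per existing cluster. The sketch below assumes exact answers; under the constant error model each question would be repeated with a majority vote, which is where a log factor would enter. The name `cluster_by_type` and the toy data are hypothetical.

```python
def cluster_by_type(items, same_type):
    """Greedy clustering with one representative per cluster.

    same_type(x, y) answers "Type(x) = Type(y)?" (assumed exact here).
    Returns (clusters, number of questions asked). J need not be known.
    """
    clusters = []                        # clusters[i][0] is the representative
    asked = 0
    for x in items:
        for cluster in clusters:
            asked += 1
            if same_type(x, cluster[0]):
                cluster.append(x)
                break
        else:
            clusters.append([x])         # x starts a new cluster
    return clusters, asked

# Hypothetical ground truth, used only to simulate the crowd's answers.
types = {"a": 1, "b": 2, "c": 1, "d": 3, "e": 2}
clusters, asked = cluster_by_type(
    ["a", "b", "c", "d", "e"], lambda x, y: types[x] == types[y])
```

Each element is compared with at most J representatives, so the number of questions never exceeds n·J.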
Clustering with Correlated Types and Values • Example: apartments in a building; Type = #bedrooms (1-br, 2-br, 3-br), Value = rent • [Figure: elements ordered from low to high values; Type 1, Type 3 and Type 2 each form one contiguous block] • Full correlation: elements of the same type form contiguous blocks in the sorted order on values • For no correlation, we gave a lower bound of Ω(nJ) • Assume no error • Õ(n log J) value and type comparisons suffice to find the clusters w.h.p.
Algorithm for Full Correlation • [Figure: blocks of Type 1, Type 2, Type 3; J = 3, s = 13] • Repeat (outer loop), until each block has one type: • Inner loop: find median and partition; if ≥ 2 types in a block, repeat; if same type in a block, keep only one element; until #elements ≤ s/2 • After the inner loop: 4J blocks, of which at most J contain ≥ 2 types, so ≤ s/2 elements remain • Linear cost in total in each inner iteration • #Inner iterations = O(log J) • Cost: C(n) ≤ C(n/2) + O(n log J) • C(n) = O(n log J)
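The median-splitting idea can be illustrated with a simplified recursive sketch. It is not the two-loop procedure from the slide: it assumes exact answers and full correlation, tests whether a block is pure by comparing the types of its smallest and largest elements, and uses Python's `sorted` as a stand-in for linear-time median selection. The names `cluster_full_correlation`, `value`, and `same_type` are hypothetical.

```python
def cluster_full_correlation(items, value, same_type):
    """Cluster elements whose types form contiguous runs in value order.

    value(x): sortable key standing in for exact value comparisons.
    same_type(x, y): exact answer to "Type(x) = Type(y)?".
    Returns clusters in increasing order of value.
    """
    clusters = []

    def add_block(block):
        # A run of one type may have been cut by a split; merge it back.
        if clusters and same_type(clusters[-1][0], block[0]):
            clusters[-1].extend(block)
        else:
            clusters.append(list(block))

    def solve(block):
        lo = min(block, key=value)
        hi = max(block, key=value)
        if same_type(lo, hi):            # contiguity makes the whole block pure
            add_block(block)
            return
        ordered = sorted(block, key=value)   # stand-in for median selection
        mid = len(ordered) // 2
        solve(ordered[:mid])             # lower half first keeps value order
        solve(ordered[mid:])

    if items:
        solve(list(items))
    return clusters

# Hypothetical apartments as (rent, bedrooms) pairs.
apartments = [(1200, 2), (800, 1), (2500, 3), (900, 1),
              (2000, 3), (1300, 2), (950, 1), (2100, 3)]
clusters = cluster_full_correlation(
    apartments, value=lambda a: a[0], same_type=lambda a, b: a[1] == b[1])
```

Only blocks that straddle a type boundary are split further, and there are fewer than J such blocks per level, which is the intuition behind a cost that depends on log J rather than log n.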
Extension to Partial Correlation • Example: hotels in a city; Type = star rating, Value = avg. price • [Figure: blocks Type 1, Type 3, Type 1, Type 2 in value order; ≤ α changes in the same type] • Partial correlation: elements of the same type form almost contiguous blocks • Õ(n log(αJ) + αJ) value and type comparisons suffice w.h.p.
Related Work • Max/Top-K: • [Guo et al. ’12]: which element has the maximum likelihood of being max, and which future queries are most effective • [Venetis et al. ’12]: efficient heuristics that tune latency, cost, and quality to find max • [Feige et al. ’94]: tight bounds for Max/Top-K/Sorting in the constant error model • Other error models and objectives exist in the theory literature • Clustering: • [Gomes et al. ’11]: machine learning approach to define clusters • [Parameswaran et al. ’12, Wang et al. ’12]: filtering data, entity resolution • Other work: • Crowd-sourced DBs: CrowdDB, Deco, Qurk, MoDaS… • Crowd sourcing for data cleansing, data integration, entity resolution, data analytics, …
Conclusion • We studied Max/Top-k and Group-By queries in a crowd-sourced setting • Top-K: proposed a variable error model, which allows fewer comparisons • Group-By: the obvious simple algorithm is the best possible; correlation reduces #comparisons • Future work: • Other objectives: latency, #rounds, or reducing error when a budget on #comparisons is given • Other cost functions: e.g. asking more similar queries in a round costs less
Thank You • Questions?