When is A=B? Donald Kossmann, Systems Group, ETH Zurich, http://systems.ethz.ch. Acknowledgments. Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)
When is A=B? Donald Kossmann Systems Group, ETH Zurich http://systems.ethz.ch
Insanity: doing the same thing over and over again and expecting different results. (A. Einstein) Reality: We all are insane! When do you start believing that your paper is not worth publishing?
Speculations on IT Trends • Big Data: Automating Experience • Logic -> Statistics • Open World Semantics • Hybrid Systems: Get best of humans & machines • to err is human • Systems • DNA, Quantum: trade energy for precision • Distributed systems: design for failure • Intel’s SCC: non-cache-coherent processors
Speculations on IT Trends • Big Data: Automating Experience • Logic -> Statistics • Open World Semantics • Hybrid Human & Machine Systems • to err is human • Systems • DNA HW: trade energy consumption for precision • Distributed systems: design for failure Computers are becoming insane!
Implications • We need to model insanity • (too crazy for this talk) • (will use Mechanical Turk to simulate craziness) • We need to revisit algos & complexity theory • focus of this talk
Traditional Complexity Theory • Cost is a function of input • Example: sorting in O(N * log N) [Diagram: input → Algo/Problem → cost]
“Modern” Complexity Theory • Cost is a function of input, quality, error rate • Example: sorting is O(???) [Diagram: input, quality, error → Algo/Problem → cost]
Alternative Complexity Theory • Quality is a function of input, budget, error rate • Example: sorting is O(???) [Diagram: input, budget, error → Algo/Problem → quality]
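To make "quality as a function of budget and error rate" concrete for a single comparison, here is a minimal sketch. It assumes a simple voting model that is not on the slides: every vote is independent, wrong with probability `error`, and the majority of `budget` votes (an odd number) decides. The helper name `majority_quality` is hypothetical.

```python
from math import comb


def majority_quality(budget: int, error: float) -> float:
    """Probability that a majority of `budget` independent votes is correct.

    Assumption (not from the slides): each vote is wrong with probability
    `error`, independently of all other votes.
    """
    if budget < 1 or budget % 2 == 0:
        raise ValueError("budget must be a positive odd number of votes")
    p_right = 1.0 - error
    needed = budget // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(budget, k) * p_right**k * error ** (budget - k)
        for k in range(needed, budget + 1)
    )


# Quality for growing budgets at a 20% per-vote error rate.
for votes in (1, 3, 5):
    print(votes, round(majority_quality(votes, 0.2), 5))
```

Under this model, spending more votes buys higher quality with diminishing returns, which is the kind of budget/quality tradeoff the slide asks complexity theory to capture.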
Agenda • Case Study: Entity Resolution, Joins • when is A=B? • Case Study: Sorting • when is A<B?
Problem Statement • You are the director of the Louvre • you have gazillions of unknown paintings • you have a bunch of students that guess: p(A) = p(B)? • You would like to group the paintings by painter • minimize cost (work of students) • minimize errors (#paintings in wrong room) • Assumption: There is a ground truth! • (Many problems have no ground truth; e.g., grouping the best paintings.)
Naïve Algorithm • Step 1: select two random paintings • Step 2: ask students to compare them • Step 3: goto Step 1 until done • How can we do better???
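Votes Graph • Is A = B? [Diagram: nodes A, B, C, D]
Votes Graph • Is A = B? YES! [Diagram: nodes A, B, C, D]
Votes Graph [Diagram: nodes A, B, C, D]
Votes Graph • Is B = C? • Is A = D? [Diagram: nodes A, B, C, D]
Votes Graph • Is B = C? YES! • Is A = D? NO! [Diagram: nodes A, B, C, D]
Votes Graph • Is B = C? ??? [Diagram: nodes A, B, C, D]
Votes Graph • Is B = C? YES! [Diagram: nodes A, B, C, D; edge weights 50, 30, -1, -100]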
Decision Functions • Input: Votes graph (with weights) two nodes • Output: Yes, No, Do-not-know • Desired Properties: • Consistency: do not invent anything • Convergence: do not always punt • Reflexivity, Symmetry, Transitivity, Anti-transitivity
Min-Max Function • Compute pScore, nScore • take all positive, negative paths • score of path: minimum of weights of edges (AND) • pScore = maximum of score of all positive paths (OR) • nScore = maximum of score of all negative paths (OR) • Make decision based on quorum (e.g., q=3) • Yes: pScore – nScore > q • No: nScore – pScore > q • Do-not-know: otherwise
Min/Max with Conflicts • Is B = C? YES • pScore = 30 • nScore = 1 • Is A = D? NO • pScore = 0 • nScore = 30 [Diagram: nodes A, B, C, D; edge weights 50, 30, -1, -100]
Naïve Algorithm V2.0 • Step 1: select two random paintings, p1, p2 • Step 2: if (MinMax(p1,p2) == Do-not-know) ask students to compare them else return MinMax(p1, p2) • Step 3: goto Step 1 until done
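The loop of V2.0 might look like the sketch below. Several things are assumptions for illustration: "until done" is read as "every pair is decided or the question budget is spent", pairs are drawn only from those still undecided so the loop always terminates, and the stand-in decision function looks only at direct votes on a pair (a Min/Max function as described on the slides could be passed in instead). The names `naive_v2`, `ask_crowd`, `record_vote` and the toy painter labels are hypothetical.

```python
import random
from itertools import combinations


def naive_v2(items, decide, ask_crowd, record_vote, budget, rng):
    """Pick random pairs; ask the crowd only when the decision function punts."""
    pairs = list(combinations(items, 2))
    decided = {}
    asked = 0
    while asked < budget:
        open_pairs = [p for p in pairs if p not in decided]
        if not open_pairs:
            break  # every pair has a verdict
        a, b = rng.choice(open_pairs)  # Step 1: random pair
        verdict = decide(a, b)  # Step 2: consult the votes first
        if verdict == "do-not-know":
            record_vote(a, b, ask_crowd(a, b))  # pay for one more vote
            asked += 1
        else:
            decided[(a, b)] = verdict
    return decided, asked


# Toy setup: ground truth plus an error-free simulated crowd (assumption).
painter = {"p1": "painter_1", "p2": "painter_1", "p3": "painter_2", "p4": "painter_2"}
votes = {}  # (a, b) -> net vote count, +1 per 'same', -1 per 'different'
Q = 3  # quorum


def record_vote(a, b, same):
    votes[(a, b)] = votes.get((a, b), 0) + (1 if same else -1)


def decide(a, b):
    net = votes.get((a, b), 0)
    if net > Q:
        return "yes"
    if -net > Q:
        return "no"
    return "do-not-know"


def ask_crowd(a, b):
    return painter[a] == painter[b]


decided, asked = naive_v2(
    sorted(painter), decide, ask_crowd, record_vote, budget=100, rng=random.Random(0)
)
print(asked, decided)
```

Because this stand-in ignores transitivity, every one of the six pairs needs four direct votes to pass the quorum. A decision function that exploits paths in the votes graph is what lets the algorithm skip questions.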
Min/Max and Transitivity? • A = E? Do-not-know • pScore = 3 • nScore = 2 • D = E? YES • pScore = 3 • nScore = 0 • A = D? YES • pScore = 5 • nScore = 2 [Diagram: nodes A, B, C, D, E; edge weights 5, 5, 5, 3, -2]
When is A=E? • Compute “A=E”: Need at least 5 votes for success. • Compute “D=E”: In best case, only 2 more votes needed. [Diagram: nodes A, B, C, D, E; edge weights 5, 5, 5, 3, -2]
When is A=E? • Crowdsource A=E: Need at least 5 votes for success. • Crowdsource D=E: In best case, only 2 votes needed. • Many more surprises like that!!! [Diagram: nodes A, B, C, D, E; edge weights 5, 5, 5, 3, -2]
Related Work & Alternatives • R. Fagin, E. Wimmer: A formula for incorporating weights into scoring rules. 2000. • M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011. • Huge body of work on ER in DB, II communities. • Other decision function: MinCuts!
Summary • Getting A=B right more important than algorithm • Naïve algo with Min/Max >> Correlation Clustering • Result of A=B depends on C, D, … • sounds trivial, but has nasty implications • need a decision function: new cost/precision tradeoffs • Some trad. algos (e.g., CC) do not work • Complexity: Still unknown! • interesting future work
Agenda • Case Study: Entity Resolution, Joins • when is A=B? • Case Study: Sorting • when is A<B?
Revisit Sorting Algos • How do traditional sorting algorithms behave? • Quicksort • Bubblesort • Look at new sorting algorithms based on graphs • PageRank • Min/Max • Schulze method • Focus on Quicksort vs. Bubblesort here • Just give a glimpse of what can happen
Quicksort: Effect of built-in transitivity • Sort the following sequence: Neutral, Painful, Good, Excellent, Bad • Use “Good” as pivot element for partitioning • Fumble the “Painful < Good” comparison • Result: Excellent, Painful, Good, Neutral, Bad • One bad comparison propagates to three misclassifications • quality of result can become arbitrarily bad • difficult to extend QSort algo with safety net.
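The example can be replayed with a small sketch. It assumes the ranking Painful < Bad < Neutral < Good < Excellent, a descending sort, and the middle element as pivot (which makes "Good" the first pivot for this sequence). The helpers `make_better`, `quicksort_desc` and `misordered_pairs` are hypothetical names.

```python
# Assumed ground-truth ranking (higher is better).
RANK = {"Painful": 0, "Bad": 1, "Neutral": 2, "Good": 3, "Excellent": 4}


def make_better(fumbled=frozenset()):
    """Comparator 'a is better than b' that answers wrongly on fumbled pairs."""

    def better(a, b):
        truth = RANK[a] > RANK[b]
        return (not truth) if frozenset((a, b)) in fumbled else truth

    return better


def quicksort_desc(items, better):
    """Quicksort, best first, middle element as pivot, one comparison per item."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    pivot = items[mid]
    rest = items[:mid] + items[mid + 1:]
    higher, lower = [], []
    for x in rest:
        (higher if better(x, pivot) else lower).append(x)
    # Items in different partitions are never compared with each other again.
    return quicksort_desc(higher, better) + [pivot] + quicksort_desc(lower, better)


def misordered_pairs(result):
    """Count pairs that appear in the wrong relative order."""
    return sum(
        1
        for i in range(len(result))
        for j in range(i + 1, len(result))
        if RANK[result[i]] < RANK[result[j]]
    )


seq = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
clean = quicksort_desc(seq, make_better())
fumbled = quicksort_desc(seq, make_better({frozenset(("Painful", "Good"))}))
print(clean, misordered_pairs(clean))
# ['Excellent', 'Good', 'Neutral', 'Bad', 'Painful'] 0
print(fumbled, misordered_pairs(fumbled))
# ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad'] 3
```

The single wrong answer puts "Painful" on the wrong side of the pivot. Quicksort then relies on transitivity and never compares it with "Neutral" or "Bad", so one error yields three misordered pairs.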
Results (20% error, uniform) [Figure: quality (%) versus cost (number of iterations of algorithm)]
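A quality-versus-cost comparison of this kind could be explored with a small simulation harness such as the one below. It is not the authors' experimental setup. The assumptions are: each comparison is flipped independently with probability `error`, quality is the fraction of correctly ordered pairs, bubblesort cost is the number of passes, and all values are distinct. All function names are hypothetical, and no output is quoted because it depends on the random seed.

```python
import random


def noisy_less(a, b, error, rng):
    """'a < b', answered wrongly with probability `error` (uniform error model)."""
    truth = a < b
    return (not truth) if rng.random() < error else truth


def quicksort(xs, less):
    """Plain quicksort; each element is compared with the pivot exactly once."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    left, right = [], []
    for x in rest:
        (left if less(x, pivot) else right).append(x)
    return quicksort(left, less) + [pivot] + quicksort(right, less)


def bubble_passes(xs, less, passes):
    """Run a fixed number of bubblesort passes; more passes cost more."""
    xs = list(xs)
    for _ in range(passes):
        for i in range(len(xs) - 1):
            if less(xs[i + 1], xs[i]):
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs


def quality(xs):
    """Fraction of pairs in ascending order (1.0 means perfectly sorted)."""
    n = len(xs)
    good = sum(1 for i in range(n) for j in range(i + 1, n) if xs[i] < xs[j])
    return good / (n * (n - 1) // 2)


rng = random.Random(0)
data = list(range(30))
rng.shuffle(data)


def less(a, b):
    return noisy_less(a, b, 0.2, rng)  # 20% error rate


print("quicksort", quality(quicksort(data, less)))
for passes in (1, 5, 30):
    print("bubblesort", passes, quality(bubble_passes(data, less, passes)))
```

The design point is that bubblesort exposes a cost knob (the number of passes), whereas a single quicksort run fixes its comparisons up front and cannot revisit a bad partitioning decision.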
Summary • Some algos implicitly exploit transitivity • difficult to control cost/quality tradeoff • might result in a poor result for specific application • QuickSort >> Bubblesort no longer true • depends on error and quality expectation • there are better and worse ways to exploit transitivity depending on budget and error behavior • confirms observations of “A=B” study
Related Work on Sorting • Ludwig Busse et al.: The information content in sorting algorithms. 2012. • M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011. • Qurk (MIT) & Deco (Stanford) projects. 2011-2013. • …
Conclusion & Future Work • Computers are becoming insane • because they automate more of the insane world • because we are hitting the limits of trad. computing • consequence: quality becomes a major metric • Adding “quality” has dramatic implications • need to revisit algorithms to become fault-tolerant • need to revisit complexity: totally open • need to revisit debugging and testing: totally open