
When is A=B?

Donald Kossmann, Systems Group, ETH Zurich, http://systems.ethz.ch


Presentation Transcript


  1. When is A=B? Donald Kossmann, Systems Group, ETH Zurich, http://systems.ethz.ch

  2. Acknowledgments

  3. Insanity: doing the same thing over and over again and expecting different results. (A. Einstein)

  4. Insanity: doing the same thing over and over again and expecting different results. (A. Einstein) • Reality: We all are insane! • When do you start believing that your paper is not worth publishing?

  5. Speculations on IT Trends • Big Data: Automating Experience • Logic -> Statistics • Open World Semantics • Hybrid Systems: Get best of humans & machines • to err is human • Systems • DNA, Quantum: trade energy for precision • Distributed systems: design for failure • Intel’s SCC: non-cache-coherent processors

  6. Speculations on IT Trends • Big Data: Automating Experience • Logic -> Statistics • Open World Semantics • Hybrid Human & Machine Systems • to err is human • Systems • DNA HW: trade energy consumption for precision • Distributed systems: design for failure Computers are becoming insane!

  7. Implications • We need to model insanity • (too crazy for this talk) • (will use Mechanical Turk to simulate craziness) • We need to revisit algos & complexity theory • focus of this talk

  8. Traditional Complexity Theory • Cost is a function of input • Example: sorting in O(N * log N) • [Diagram: input -> Algo/Problem -> cost]

  9. “Modern” Complexity Theory • Cost is a function of input, quality, error rate • Example: sorting is O(???) • [Diagram: input, quality, error -> Algo/Problem -> cost]

  10. Alternative Complexity Theory • Quality is a function of input, budget, error rate • Example: sorting is O(???) • [Diagram: input, budget, error -> Algo/Problem -> quality]

  11. Agenda • Case Study: Entity Resolution, Joins • when is A=B? • Case Study: Sorting • when is A<B?

  12. Problem Statement • You are the director of the Louvre • you have gazillions of unknown paintings • you have a bunch of students that guess: p(A) = p(B)? • You would like to group the paintings by painter • minimize cost (work of students) • minimize errors (#paintings in wrong room) • Assumption: There is a ground truth! • (Many problems have no ground truth; e.g., grouping the best paintings.)

  13. Naïve Algorithm • Step 1: select two random paintings • Step 2: ask students to compare them • Step 3: goto Step 1 until done • How can we do better???
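The three steps of the naïve algorithm can be sketched as follows. This is only an illustration: the simulated student, the `error_rate` parameter, the fixed `budget` that stands in for "until done", and all function names are assumptions that do not appear on the slides.

```python
import random


def noisy_student(truth, a, b, error_rate, rng):
    """Simulated student (assumption): answers "same painter?" and errs with probability error_rate."""
    same = truth[a] == truth[b]
    return (not same) if rng.random() < error_rate else same


def naive_grouping(truth, budget, error_rate=0.2, seed=0):
    """Ask about random pairs until the budget is spent; return net votes per pair."""
    rng = random.Random(seed)
    paintings = sorted(truth)
    votes = {}  # frozenset({a, b}) -> (#yes votes - #no votes)
    for _ in range(budget):
        a, b = rng.sample(paintings, 2)                      # Step 1: two random paintings
        answer = noisy_student(truth, a, b, error_rate, rng)  # Step 2: ask a student
        key = frozenset((a, b))
        votes[key] = votes.get(key, 0) + (1 if answer else -1)
    return votes                                              # Step 3: repeat until done


# Hypothetical ground truth: painting -> painter
TRUTH = {"p1": "X", "p2": "X", "p3": "Y", "p4": "Y"}
votes = naive_grouping(TRUTH, budget=50, error_rate=0.2)
```

Every question is spent on a random pair, even when earlier votes already make the answer obvious, which is the inefficiency the following slides address.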

  14. Votes Graph A B • Is A = B? C D

  15. Votes Graph A B • Is A = B? YES! C D

  16. Votes Graph A B C D

  17. Votes Graph A B • Is B = C? • Is A = D? C D

  18. Votes Graph A B • Is B = C? YES! • Is A = D? NO! C D

  19. Votes Graph A B • Is B = C? ??? C D

  20. Votes Graph • Is B = C? YES! • [Graph: nodes A, B, C, D; edge weights 50, 30, -1, -100]

  21. Decision Functions • Input: votes graph (with weights), two nodes • Output: Yes, No, Do-not-know • Desired Properties: • Consistency: do not invent anything • Convergence: do not always punt • Reflexivity, Symmetry, Transitivity, Anti-transitivity

  22. Min-Max Function • Compute pScore, nScore • take all positive, negative paths • score of path: minimum of weights of edges (AND) • pScore = maximum of score of all positive paths (OR) • nScore = maximum of score of all negative paths (OR) • Make decision based on quorum (e.g., q=3) • Yes: pScore – nScore > q • No: nScore – pScore > q • Do-not-know: otherwise

  23. Min/Max with Conflicts • Is B = C? YES • pScore = 30 • nScore = 1 • Is A = D? NO • pScore = 0 • nScore = 30 • [Graph: nodes A, B, C, D; edge weights 50, 30, -1, -100]
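A minimal sketch of the Min-Max function from slide 22, applied to the scores of slide 23. Two things here are assumptions rather than content of the slides: a "negative path" is taken to mean a path with exactly one negative edge (paths with two or more negative edges are ignored), and the edge layout in `EDGES` is just one assignment of the weights 50, 30, -1, -100 that reproduces the stated scores, since the transcript does not preserve the drawing. All names are hypothetical.

```python
def simple_paths(edges, src, dst):
    """Yield the edge weights of every simple path from src to dst."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def walk(node, visited, weights):
        if node == dst:
            yield weights
            return
        for nxt, w in adj.get(node, []):
            if nxt not in visited:
                yield from walk(nxt, visited | {nxt}, weights + [w])

    yield from walk(src, {src}, [])


def min_max(edges, a, b, q=3):
    """Return (decision, pScore, nScore) for the question "a = b?"."""
    if a == b:
        return ("Yes", 0, 0)  # reflexivity, no votes needed
    p_score = n_score = 0
    for weights in simple_paths(edges, a, b):
        negatives = sum(1 for w in weights if w < 0)
        score = min(abs(w) for w in weights)   # AND: weakest edge on the path
        if negatives == 0:
            p_score = max(p_score, score)      # OR over positive paths
        elif negatives == 1:                   # assumption: exactly one negative edge
            n_score = max(n_score, score)      # OR over negative paths
    if p_score - n_score > q:
        return ("Yes", p_score, n_score)
    if n_score - p_score > q:
        return ("No", p_score, n_score)
    return ("Do-not-know", p_score, n_score)


# Assumed edge layout; only the four weights come from the slide.
EDGES = {("A", "B"): 50, ("A", "C"): 30, ("B", "C"): -1, ("C", "D"): -100}
print(min_max(EDGES, "B", "C"))  # ('Yes', 30, 1)
print(min_max(EDGES, "A", "D"))  # ('No', 0, 30)
```

Enumerating all simple paths is exponential in the worst case; it keeps the sketch short but would not scale to a large votes graph.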

  24. Naïve Algorithm V2.0 • Step 1: select two random paintings, p1, p2 • Step 2: if (MinMax(p1,p2) == Do-not-know) ask students to compare them else return MinMax(p1, p2) • Step 3: goto Step 1 until done
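The control flow of V2.0 can be sketched as a loop that only spends a question when the decision function punts. To keep the sketch short, `direct_edge_decision` is a deliberately simple stand-in for MinMax that looks at the direct edge only; it, the `perfect_crowd` oracle, the fixed number of `rounds` in place of "until done", and all names are assumptions for illustration.

```python
import random


def direct_edge_decision(votes, a, b, q=1):
    """Stand-in for MinMax (assumption): decide from the direct edge weight only."""
    weight = votes.get(frozenset((a, b)), 0)
    if weight > q:
        return "Yes"
    if -weight > q:
        return "No"
    return "Do-not-know"


def naive_v2(paintings, decide, ask_crowd, rounds, seed=0):
    """Random pairs as before, but ask the crowd only on Do-not-know."""
    rng = random.Random(seed)
    votes, answers, asked = {}, {}, 0
    for _ in range(rounds):
        p1, p2 = rng.sample(paintings, 2)        # Step 1
        key = frozenset((p1, p2))
        decision = decide(votes, p1, p2)         # Step 2
        if decision == "Do-not-know":
            votes[key] = votes.get(key, 0) + (1 if ask_crowd(p1, p2) else -1)
            asked += 1
        else:
            answers[key] = decision              # answered from existing votes
    return answers, asked


# Hypothetical ground truth and an error-free crowd, for demonstration only.
PAINTER = {"p1": "X", "p2": "X", "p3": "Y"}


def perfect_crowd(a, b):
    return PAINTER[a] == PAINTER[b]


answers, asked = naive_v2(sorted(PAINTER), direct_edge_decision, perfect_crowd, rounds=200)
```

Any function with the same `(votes, a, b)` signature can be passed as `decide`, so a path-based Min-Max implementation can replace the stand-in without changing the loop.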

  25. Min/Max and Transitivity? • A = E? Do-not-know • pScore = 3 • nScore = 2 • D = E? YES • pScore = 3 • nScore = 0 • A = D? YES • pScore = 5 • nScore = 2 • [Graph: nodes A, B, C, D, E; edge weights 5, 5, 5, 3, -2]

  26. When is A=E? • Compute “A=E”: Need at least 5 votes for success. • Compute “D=E”: In best case, only 2 more votes needed. • [Graph: nodes A, B, C, D, E; edge weights 5, 5, 5, 3, -2]

  27. When is A=E? • Crowdsource A=E: Need at least 5 votes for success. • Crowdsource D=E: In best case, only 2 votes needed. • Many more surprises like that!!! • [Graph: nodes A, B, C, D, E; edge weights 5, 5, 5, 3, -2]

  28. Related Work & Alternatives • R. Fagin, E. Wimmer: A formula for incorporating weights into scoring rules. 2000. • M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011. • Huge body of work on ER in DB, II communities. • Other decision function: MinCuts!

  29. Summary • Getting A=B right more important than algorithm • Naïve algo with Min/Max >> Correlation Clustering • Result of A=B depends on C, D, … • sounds trivial, but has nasty implications • need a decision function: new cost/precision tradeoffs • Some trad. algos (e.g., CC) do not work • Complexity: Still unknown! • interesting future work

  30. Agenda • Case Study: Entity Resolution, Joins • when is A=B? • Case Study: Sorting • when is A<B?

  31. Revisit Sorting Algos • How do traditional sorting algorithms behave • Quicksort • Bubblesort • Look at new sorting algorithms based on graph • PageRank • Min/Max • Schulze method • Focus on Quicksort vs. Bubblesort here • Just give a glimpse of what can happen

  32. Quicksort: Effect of built-in transitivity • Sort the following sequence: Neutral, Painful, Good, Excellent, Bad • Use “Good” as pivot element for partitioning • Fumble the “Painful < Good” comparison • Result: Excellent, Painful, Good, Neutral, Bad • One bad comparison propagates to three misclassifications • quality of result can become arbitrarily bad • difficult to extend QSort algo with safety net.
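The example above can be replayed with a small sketch. The ranking Painful < Bad < Neutral < Good < Excellent, the choice of the first element as pivot in the recursive calls, and all names are assumptions; the slide only fixes the input sequence, the first pivot, and the fumbled comparison.

```python
# Assumed true order, worst to best.
RANK = {"Painful": 0, "Bad": 1, "Neutral": 2, "Good": 3, "Excellent": 4}


def make_less_than(fumbled):
    """Comparator that answers wrongly for the (a, b) pairs listed in fumbled."""
    def less_than(a, b):
        truth = RANK[a] < RANK[b]
        return (not truth) if (a, b) in fumbled else truth
    return less_than


def quicksort(items, less_than, pivot=None):
    """Plain quicksort; each element is compared to the pivot exactly once."""
    if len(items) <= 1:
        return list(items)
    if pivot is None:
        pivot = items[0]
    rest = [x for x in items if x != pivot]
    smaller = [x for x in rest if less_than(x, pivot)]
    larger = [x for x in rest if not less_than(x, pivot)]
    return quicksort(smaller, less_than) + [pivot] + quicksort(larger, less_than)


def misordered_pairs(result):
    """Count pairs that appear in the wrong relative order."""
    return sum(1 for i in range(len(result)) for j in range(i + 1, len(result))
               if RANK[result[i]] > RANK[result[j]])


sequence = ["Neutral", "Painful", "Good", "Excellent", "Bad"]
fumble = make_less_than({("Painful", "Good")})
result = quicksort(sequence, fumble, pivot="Good")
print(result[::-1])              # ['Excellent', 'Painful', 'Good', 'Neutral', 'Bad']
print(misordered_pairs(result))  # 3
```

Painful lands on the wrong side of the pivot and is never compared with Neutral or Bad again, so the single wrong answer also misorders it against those two.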

  33. Results (20% error, uniform) • [Plot: quality (%) vs. cost (number of iterations of algorithm)]
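An experiment of this kind can be sketched as below. It does not reproduce the plot from the talk: the quality metric (fraction of correctly ordered pairs), the use of Bubblesort passes as the budget, the input size, and all names are assumptions, and the printed numbers depend on the seed.

```python
import random
from itertools import combinations


def noisy_less_than(a, b, error_rate, rng):
    """Comparison that returns the wrong answer with probability error_rate."""
    truth = a < b
    return (not truth) if rng.random() < error_rate else truth


def quality(seq):
    """Assumed metric: fraction of pairs in correct order (items must be distinct)."""
    pairs = list(combinations(range(len(seq)), 2))
    return sum(1 for i, j in pairs if seq[i] < seq[j]) / len(pairs)


def bubble_passes(seq, passes, error_rate, rng):
    """Run a fixed number of Bubblesort passes; more passes cost more comparisons."""
    seq = list(seq)
    for _ in range(passes):
        for i in range(len(seq) - 1):
            if noisy_less_than(seq[i + 1], seq[i], error_rate, rng):
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq


def quicksort(seq, error_rate, rng):
    """Quicksort with noisy comparisons; each pair is compared at most once."""
    if len(seq) <= 1:
        return list(seq)
    pivot, rest = seq[0], seq[1:]
    smaller, larger = [], []
    for x in rest:
        (smaller if noisy_less_than(x, pivot, error_rate, rng) else larger).append(x)
    return quicksort(smaller, error_rate, rng) + [pivot] + quicksort(larger, error_rate, rng)


rng = random.Random(42)
data = list(range(30))
rng.shuffle(data)
print("quicksort:", quality(quicksort(data, 0.2, rng)))
for passes in (5, 30, 100):
    print("bubblesort,", passes, "passes:", quality(bubble_passes(data, passes, 0.2, rng)))
```

Quicksort has no budget knob in this sketch, whereas the number of Bubblesort passes can be raised or lowered, which is one way to explore a cost/quality tradeoff.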

  34. Summary • Some algos implicitly exploit transitivity • difficult to control cost/quality tradeoff • might result in a poor result for specific application • QuickSort >> Bubblesort no longer true • depends on error and quality expectation • there are better and worse ways to exploit transitivity depending on budget and error behavior • confirms observations of “A=B” study

  35. Related Work on Sorting • Ludwig Busse et al.: The information content in sorting algorithms. 2012. • M. Schulze: A new monotonic, clone-independent, reversal symmetric, and condorcet-consistent single winner election method. 2011. • Qurk (MIT) & Deco (Stanford) projects. 2011-2013. • …

  36. Conclusion & Future Work • Computers are becoming insane • because they automate more of the insane world • because we are hitting the limits of trad. computing • consequence: quality becomes a major metric • Adding “quality” has dramatic implications • need to revisit algorithms to become fault-tolerant • need to revisit complexity: totally open • need to revisit debugging and testing: totally open
