
Exploring Autocomplete and Sorting Techniques in Computer Science

Learn about Autocomplete systems, Sorting algorithms, and Comparators in Compsci 201 during the Spring 2019 semester. Discover trade-offs, binary searches, and optimization strategies in handling large datasets efficiently.


Presentation Transcript


  1. Compsci 201, Autocomplete, Sorts, Comparators Owen Astrachan ola@cs.duke.edu http://bit.ly/201spring19 March 27, 2019 Compsci 201, Spring 2019, Autocomplete, Sorting, Comparators

  2. S is for … • Stack • Last in, First Out, source of overflow! • Software • Joys and sorrows, eating the world • Sorting • From slow to quick to tim to …

  3. Is this a Meme? A Gif? Recursion? • Laziness does not exist • https://medium.com/@devonprice/laziness-does-not-exist-3af27e312d01

  4. Plan for LWoM • Autocomplete Overview • Using priority queues and interfaces • Comparators for Sorting and Priority Queues • Alternative to .compareTo/Comparable • Sorting from Algorithms to APIs • Exposure and utility: knowing and doing

  5. Autocomplete • 70,000 queries/second, thousands of computers, 0.2 seconds to answer query, … • Fall 2018 and Spring 2019

  6. Autocomplete • 70,000 queries/second, thousands of computers, 0.2 seconds to answer query, … • Fall 2018 and Spring 2019

  7. Geolocating Heaven

  8. Tradeoffs in Autocomplete • Like search in Google, we want the best or top or most-weighty matches • Each term is a (word,weight) pair • Sort by weight, we want only top 10 or top k • Don't sort everything if we only need 10 • PriorityQueues can help, so can binary search • Don't sort 10^9 items if 10^3 match "duke b"

  9. BruteAutocomplete • There are N terms (word,weight) • M of these match a prefix, e.g., "auto" • We want the top k of these M matches • N is millions, M is thousands/hundreds, k is 10 • Naïve: find matching M terms of N, sort them, choose the heaviest k: N + M log M + k • Where does M log M term come from? • Where does N term come from?
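
The naïve N + M log M + k approach can be sketched as follows. This is a minimal illustration, not the assignment's BruteAutocomplete: the `Term` class here is a hypothetical stand-in with only a word and a weight, and `topMatches` is a name chosen for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BruteSketch {
    // Hypothetical stand-in for the assignment's Term class.
    static class Term {
        private final String word;
        private final double weight;
        Term(String word, double weight) { this.word = word; this.weight = weight; }
        String getWord() { return word; }
        double getWeight() { return weight; }
    }

    // Scan all N terms, sort the M matches, keep the heaviest k.
    static List<Term> topMatches(List<Term> terms, String prefix, int k) {
        List<Term> matches = new ArrayList<>();
        for (Term t : terms) {                       // the N term: every term is examined
            if (t.getWord().startsWith(prefix)) {
                matches.add(t);
            }
        }
        // the M log M term: sort only the matches, heaviest first
        matches.sort(Comparator.comparingDouble(Term::getWeight).reversed());
        // the k term: copy out the first k (or fewer)
        return new ArrayList<>(matches.subList(0, Math.min(k, matches.size())));
    }
}
```

The scan answers "where does N come from?" and the sort answers "where does M log M come from?".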

  10. Binary Search beats Naïve Brute • In Brute don't use M log M use M log k • Use a priority queue: see code in Git repo • Who cares? Does this make a difference? • We can use binary search, after sorting once • Find first and last of M matching terms: log N • Changes time to log N + M log k + k

  11. Pièce de résistance! • Brute: N + M log k • Binary: log N + M log k • data/alexa.txt contains one million queries • Does N versus log N matter? • Use a HashMap of prefix to matching terms • O(1) at the expense of storage • APIs are your friend! Good in practice too
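
One way to picture the HashMap idea is the sketch below. It is an assumption-laden illustration rather than the course's HashListAutocomplete: `Term` is a hypothetical stand-in, `MAX_PREFIX` is an invented cap on stored prefix length, and the method names are chosen for this example. Each prefix maps to its matching terms, already sorted heaviest first, so a query is a single lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashSketch {
    // Hypothetical stand-in for the assignment's Term class.
    static class Term {
        private final String word;
        private final double weight;
        Term(String word, double weight) { this.word = word; this.weight = weight; }
        String getWord() { return word; }
        double getWeight() { return weight; }
    }

    static final int MAX_PREFIX = 10;   // assumed cap, trades storage for coverage

    // Build once: every prefix (up to MAX_PREFIX chars) maps to its matching terms.
    static Map<String, List<Term>> build(List<Term> terms) {
        Map<String, List<Term>> map = new HashMap<>();
        for (Term t : terms) {
            int longest = Math.min(MAX_PREFIX, t.getWord().length());
            for (int len = 0; len <= longest; len++) {
                String prefix = t.getWord().substring(0, len);
                map.computeIfAbsent(prefix, p -> new ArrayList<>()).add(t);
            }
        }
        for (List<Term> list : map.values()) {       // sort each list once, heaviest first
            list.sort(Comparator.comparingDouble(Term::getWeight).reversed());
        }
        return map;
    }

    // Query: one hash lookup, then copy the first k entries.
    static List<Term> topMatches(Map<String, List<Term>> map, String prefix, int k) {
        List<Term> all = map.getOrDefault(prefix, Collections.emptyList());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

The storage cost is visible in `build`: each term is stored once per prefix. In this sketch a query prefix longer than `MAX_PREFIX` finds nothing.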

  12. Binary Search in Autocomplete • Given "beenie" and prefix of 3, find M matches • Find first "bee.." and last "bee..": O(log N) • Cannot use API: doesn't get first/last match • O(log N) to find first and last, O(M log k) for top k
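
A sketch of the first/last search, assuming a list already sorted under the comparator. The names `firstIndexOf`, `lastIndexOf`, and `prefixOrder` are chosen for this example and plain strings stand in for terms. The library's `Collections.binarySearch` makes no guarantee about which of several equal elements it finds, which is why the search keeps going after a match.

```java
import java.util.Comparator;
import java.util.List;

class BinarySearchSketch {
    // Compares only the first r characters of each string.
    static Comparator<String> prefixOrder(int r) {
        return (a, b) -> a.substring(0, Math.min(r, a.length()))
                          .compareTo(b.substring(0, Math.min(r, b.length())));
    }

    // Index of the first element equal to target under comp, or -1.
    static <T> int firstIndexOf(List<T> list, T target, Comparator<T> comp) {
        int low = 0, high = list.size() - 1, found = -1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            int cmp = comp.compare(list.get(mid), target);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                found = mid;        // a match, but keep looking to the left
                high = mid - 1;
            }
        }
        return found;
    }

    // Index of the last element equal to target under comp, or -1.
    static <T> int lastIndexOf(List<T> list, T target, Comparator<T> comp) {
        int low = 0, high = list.size() - 1, found = -1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            int cmp = comp.compare(list.get(mid), target);
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                found = mid;        // a match, but keep looking to the right
                low = mid + 1;
            }
        }
        return found;
    }
}
```

With the sorted list bat, bee, beenie, beer, bell, cat and `prefixOrder(3)`, searching for "beenie" gives first index 1 and last index 3, so the M = 3 matches are the slice between them.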

  13. Tradeoffs Summarized • Brute: O(N + M log k) since uses priority queue • Binary: O(log N + M log k) • Requires sorting once: O(N log N) • Amortize this cost over many queries, ignore? • If we make Q queries: • Brute: Q x N • Binary: Q x log N – recoup sorting cost? • Hash: Q – can't really do better!

  14. Something Old, Something New • Sort list of Terms by weight? Or reverse order? • Call v.getWeight() and w.getWeight() • Before Java 8 and after Java 8

        // Before Java 8
        public static class WeightOrder implements Comparator<Term>
        public static class ReverseWeightOrder implements …
        Collections.sort(list, new Term.ReverseWeightOrder());

        // After Java 8
        Collections.sort(list, Comparator.comparing(Term::getWeight).reversed());

  15. The Comparator Interface • When you can't access/change a class • Supply separate Comparator object • Return integer similar to .compareTo return value • int compare(Term a, Term b) • Compare by weight or prefix or … • return < 0 if a < b • return > 0 if a > b • return == 0 if a == b
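
Both styles can be seen side by side in the sketch below. `Term` is again a hypothetical stand-in, and `ReverseWeightOrder` is written here as an illustration of a separate comparator class, not copied from the assignment. The Java 8 version uses `Comparator.comparingDouble`, a sibling of `Comparator.comparing` that avoids boxing the weight.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ComparatorSketch {
    // Hypothetical stand-in for the assignment's Term class.
    static class Term {
        private final String word;
        private final double weight;
        Term(String word, double weight) { this.word = word; this.weight = weight; }
        String getWord() { return word; }
        double getWeight() { return weight; }
    }

    // Before Java 8: a separate class supplies the ordering.
    static class ReverseWeightOrder implements Comparator<Term> {
        @Override
        public int compare(Term a, Term b) {
            // negative when a is heavier, so heavier terms sort first
            return Double.compare(b.getWeight(), a.getWeight());
        }
    }

    static List<Term> sortOldStyle(List<Term> terms) {
        List<Term> copy = new ArrayList<>(terms);
        Collections.sort(copy, new ReverseWeightOrder());
        return copy;
    }

    // After Java 8: build the same ordering from a method reference.
    static List<Term> sortNewStyle(List<Term> terms) {
        List<Term> copy = new ArrayList<>(terms);
        Collections.sort(copy, Comparator.comparingDouble(Term::getWeight).reversed());
        return copy;
    }
}
```

Neither version touches the `Term` class itself, which is the point of a Comparator: the ordering lives outside the class being ordered.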

  16. Old Ideas meet new APIs • Remnants of old versions of assignments linger • If sorting M matching terms in binary search, need ReverseWeightOrder: O(M log M) • If using Priority Queue, don't need: O(M log k) • When M is 10^6 and k is 10^1, there's a difference!! • Old and new in HashListAutocomplete

  17. WOTO http://bit.ly/201spring19-march27-1

  18. Alan Turing • 2:46 marathon • 15:20 three mile • Enigma machine and WWII • Entscheidungsproblem Sometimes it is the people no one can imagine anything of who do the things no one can imagine

  19. PriorityQueues top to bottom • All operations are O(log N) where N is the size of the PQ • This is for add and remove; can peek in O(1) • Details after midterm • Always remove the smallest element, minPQ • Can change by providing a Comparator • Shortest-path, e.g., Google Maps. Best-first search in games • Best element removed from queue, not first

  20. PriorityQueues top to bottom • How can we sort elements using Priority Queue? • Add all elements to pq, then remove them • Every operation is O(log N), so this sort? • O(N log N) – basis for heap sort • https://coursework.cs.duke.edu/201spring19/sorting
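
Sorting with a priority queue can be sketched in a few lines. This is an illustration using `java.util.PriorityQueue`, not the code from the course repository; the name `pqSort` is chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class PQSortSketch {
    // N adds and N removes at O(log N) each: O(N log N) overall.
    static <T extends Comparable<T>> List<T> pqSort(List<T> list) {
        PriorityQueue<T> pq = new PriorityQueue<>();   // min PQ by natural order
        pq.addAll(list);
        List<T> sorted = new ArrayList<>();
        while (!pq.isEmpty()) {
            sorted.add(pq.remove());                   // always the smallest remaining
        }
        return sorted;
    }
}
```

Because each removal yields the smallest remaining element, the output list comes out in increasing order.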

  21. Finding top M of N • Sort all and get first (or last) M • O(N log N) to sort, then O(M), typically N >> M • Code below doesn't alter list parameter • Why is comp.reversed() used? https://coursework.cs.duke.edu/201spring19/sorting
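
The slide's code did not survive in this transcript. A minimal sketch of the sort-everything approach, written for illustration and not taken from the course repository, might look like this; the name `topM` and its signature are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TopMSortSketch {
    // Sort a copy largest-first, then take the first m: O(N log N) + O(M).
    static <T> List<T> topM(List<T> list, int m, Comparator<T> comp) {
        List<T> copy = new ArrayList<>(list);    // copying leaves the parameter unaltered
        copy.sort(comp.reversed());              // reversed: largest elements come first
        return new ArrayList<>(copy.subList(0, Math.min(m, copy.size())));
    }
}
```

In this sketch `comp.reversed()` is what puts the largest elements at the front, so the top M are simply the first M after sorting.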

  22. Finding top M of N • Can do this in O(N log M) using priority queue • Not intuitive? largest M using min PQ?

  23. Details for M of N • Keep only M elements in the priority queue • Every time one removed? It's the smallest • When done? Top M remain, removed smallest! • First element removed? Smallest, so … • Why is LinkedList used? O(1) add to front
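
The details above can be sketched as follows. This is an illustrative version, not the repository's code; `topM` and its signature are assumptions. The min PQ never holds more than M + 1 elements, so each add or remove costs O(log M).

```java
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.PriorityQueue;

class TopMPQSketch {
    // Largest m elements of list, largest first, in O(N log M).
    static <T> List<T> topM(List<T> list, int m, Comparator<T> comp) {
        PriorityQueue<T> pq = new PriorityQueue<>(comp);   // min PQ under comp
        for (T item : list) {
            pq.add(item);
            if (pq.size() > m) {
                pq.remove();            // evict the smallest; the top m survive
            }
        }
        LinkedList<T> result = new LinkedList<>();
        while (!pq.isEmpty()) {
            result.addFirst(pq.remove());   // smallest comes out first, so it ends up last
        }
        return result;
    }
}
```

Removing from the min PQ yields the survivors smallest first, so adding each to the front of a LinkedList in O(1) produces a largest-first result.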

  24. Sorting: 201 in a • Algorithms: traditionally foundation of compsci • Study, analyze, develop, use • APIs: tested, proven, configurable • Algorithms encapsulated and usable • You should know how to write your own sort • You should know how to call library sort

  25. Sorting: From Theory to Practice • Why do we study more than one algorithm? • Paradigms of trade-offs and algorithmic design • Know your history • How do you use and configure library sort? • http://www.sorting-algorithms.com/ Not Yogi

  26. Simple, O(n^2) sorts • Selection sort: n^2 comparisons, n swaps • Find min, swap to front, increment front, repeat • Insertion sort: n^2 comparisons, no swap, shift • stable, fast on sorted data, slide into place • Bubble sort: n^2 everything, slow* • Catchy name, but slow and ugly* *this isn't everyone's opinion, but it should be • Shell sort: quasi-insertion, fast in practice • Not quadratic with some tweaks

  27. Case Study: SelectionSort • https://coursework.cs.duke.edu/201spring19/sorting Canonical O(n^2) algorithm/code

  28. Case Study: SelectionSort • Invariant: on jth pass, [0,j) is in final sorted order • Nested loop re-establishes invariant • Diagram: [0,j) is in final order, the rest is unexamined

        public void sort(List<T> list) {
            for (int j = 0; j < list.size() - 1; j++) {
                int min = j;
                for (int k = j + 1; k < list.size(); k++) {
                    if (list.get(k).compareTo(list.get(min)) < 0) {
                        min = k;
                    }
                }
                swap(list, min, j);
            }
        }

  29. Reminder: Loop Invariant • Statement: true each time loop begins to execute • During loop execution it may become false • The loop re-establishes the invariant • Typically stated in terms of loop index • Pictures can help reason about code/solution • Helps to reason formally and informally about the code you’re writing • Can I explain the invariant to someone?

  30. Bubblesort isn't much code • Swap adjacent elements when out of order • From beginning to end, then end-1, end-2, … • After n passes, last n-elements in place
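
The slide's code is not in this transcript, so here is a minimal sketch of the description above, written for illustration; it uses `Collections.swap` from the standard library in place of the course's own swap helper.

```java
import java.util.Collections;
import java.util.List;

class BubbleSketch {
    // After each pass the largest remaining element has bubbled to position end.
    static <T extends Comparable<T>> void bubbleSort(List<T> list) {
        for (int end = list.size() - 1; end > 0; end--) {
            for (int k = 0; k < end; k++) {
                if (list.get(k).compareTo(list.get(k + 1)) > 0) {
                    Collections.swap(list, k, k + 1);   // adjacent pair out of order
                }
            }
        }
    }
}
```

The outer loop shrinks `end` by one each pass, which is the "end, then end-1, end-2" pattern in the bullets.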

  31. Timing of n^2 and other sorts • https://coursework.cs.duke.edu/201spring19/sorting

  32. More efficient O(n log n) sorts • Divide and conquer sorts: • Quick sort: fast in practice, O(n^2) worst case • Merge sort: stable, fast, extra storage • Timsort: http://en.wikipedia.org/wiki/Timsort • Other sorts: • Heap sort: priority queue sorting • Radix sort: uses digits/characters (no compare) • O(n log n) is optimal for comparing • But, Radix is O(n) ??

  33. Stable, Stability • Stable: respect order of equal keys when sorting • First sort by shape, then by color: Stable! • Triangle < Square < Circle; Yellow < Green < Red
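
The shape-then-color example can be tried with the sketch below. The `Item`, `Shape`, and `Color` types are invented for this illustration; enum constants compare in declaration order, which matches the orderings on the slide. `List.sort` on an `ArrayList` is a stable merge sort variant, so items of equal color keep their shape order from the first sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class StableSketch {
    enum Shape { TRIANGLE, SQUARE, CIRCLE }   // Triangle < Square < Circle
    enum Color { YELLOW, GREEN, RED }         // Yellow < Green < Red

    // Hypothetical item type for this example.
    static class Item {
        private final Shape shape;
        private final Color color;
        Item(Shape shape, Color color) { this.shape = shape; this.color = color; }
        Shape getShape() { return shape; }
        Color getColor() { return color; }
        @Override public String toString() { return color + " " + shape; }
    }

    // Two stable sorts in a row: the result is ordered by color, ties by shape.
    static List<Item> byShapeThenColor(List<Item> items) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(Comparator.comparing(Item::getShape));
        copy.sort(Comparator.comparing(Item::getColor));
        return copy;
    }
}
```

With an unstable sort, the second pass could scramble the shape order among items of the same color.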

  34. Quicksort: fast in practice • Developed by Tony Hoare around 1960 and published in 1961-62; at the time he didn't understand recursion • Canonical T(n) = 2T(n/2) + O(n), but • Worst case is O(n^2), bad pivot. Shuffle first? • Diagram: [ <= X | X | > X ], with X at the pivot index

        void doQuick(List<T> list, int first, int last) {
            if (first >= last) return;
            int piv = pivot(list, first, last);
            doQuick(list, first, piv - 1);
            doQuick(list, piv + 1, last);
        }

  35. Pivot is O(n) • Invariant: [first,p] <= list.get(first) • Invariant: (p,k) > list.get(first) • Diagram: [first,p] is <= [first], (p,k) is > [first], and elements from k on are still unexamined (???)

        private int pivot(List<T> list, int first, int last) {
            T piv = list.get(first);
            int p = first;
            for (int k = first + 1; k <= last; k++) {
                if (list.get(k).compareTo(piv) <= 0) {
                    p++;
                    swap(list, k, p);
                }
            }
            swap(list, p, first);
            return p;
        }

  36. https://en.wikipedia.org/wiki/Timsort • Stable, O(n log n) in average and worst, O(n) best! • In practice lots of data is "close" to sorted • Invented by Tim Peters for Python, now in Java • Replaced merge sort which is also stable • Engineered to be correct, fast, useful in practice • Theory and explanation not so simple https://www.youtube.com/watch?v=NVIjHj-lrT4

  37. Summary of O(n log n) sorts • Timsort: hybrid of merge and insertion? • Fast in real world: Python, Java 7+, Android • What’s the best O(n log n) sort to call? • The one in the library you have access to • Arrays.sort or Collections.sort • Changing how you sort: • .compareTo() or .compare()

  38. sortingwoto or ginooorsttw • http://bit.ly/201spring19-march27-2

  39. Brian Fox GNU Bash Shell (developer) First employee at Free Software Foundation First online banking system at Wells Fargo There’s nothing that I am better at than everyone else, except being me. There’s no secret to being me. Follow your interests and work hard at them. Then you will play bass better, program better, cook better, ride motorcycles better, or anything else that you really want to do. https://lifehacker.com/im-brian-fox-author-of-the-bash-shell-and-this-is-how-1820510600

  40. Recursion Idiom for Trees • Use root = caller(root, val); • Modifies Tree at root and returns the result • Recursive calls work similarly • root.left = caller(root.left, val); • Process the subtree, use the result • Where do we see this? Insertion into search tree • We also see it in tree tighten APT Compsci 201, Spring 2019, Trees and Recurrences

  41. Tightening a Tree • https://www2.cs.duke.edu/csed/newapt/treetighten.html • Call: tree = tighten(tree) • A node has two children (13, 8)? no change • A node has no children (5,7,25)? no change • One child? remove
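
One possible reading of these rules, using the tree = tighten(tree) idiom, is sketched below. The `TreeNode` class is a hypothetical stand-in with `info`, `left`, and `right` fields, and the method is an illustration of the idiom rather than a verified APT solution.

```java
class TightenSketch {
    // Hypothetical node type for this example.
    static class TreeNode {
        int info;
        TreeNode left, right;
        TreeNode(int info, TreeNode left, TreeNode right) {
            this.info = info;
            this.left = left;
            this.right = right;
        }
    }

    // Returns the root of the tightened tree; callers must use the result.
    static TreeNode tighten(TreeNode t) {
        if (t == null) return null;                 // base case
        t.left = tighten(t.left);                   // process subtrees first,
        t.right = tighten(t.right);                 // and keep what comes back
        if (t.left == null && t.right != null) return t.right;   // one child: drop t
        if (t.left != null && t.right == null) return t.left;    // one child: drop t
        return t;                                   // zero or two children: no change
    }
}
```

Because subtrees are tightened before the node is examined, a chain of one-child nodes collapses all the way down to the node at its end.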

  42. APT Tree Practice Redux • https://www2.cs.duke.edu/csed/newapt/trimtree.html • It's a search tree, so we can use range search • Left subtree <= root. Right subtree > root • With no duplicates, < and not <= • Range [2,15] whole tree • What about …

  43. Trimming the Tree • https://www2.cs.duke.edu/csed/newapt/trimtree.html • Range [2,15] both subtrees • What about [3,7] or [5,13] or [11,17] • If root in range? • Process subtrees, return root • t.left = trim(t.left, …) • t.right = trim(t.right, …) • If root not in range? Return result of one call! • return trim(t.left, ..) or …

  44. Toward All Green • Identify base-case: null every time • Sometimes leaf? In this case not needed • Make sure you use result of recursive call • Trim returns a tree, what do we do? • t.left = trim(t.left,low,high); • return t • Compare to simply return trim(t.left); • Don't include node when outside range?
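
Putting the trim bullets together gives a sketch like the one below. `TreeNode` is a hypothetical stand-in with `info`, `left`, and `right`, the `inorder` helper exists only to inspect the result, and the code is an illustration of the idiom rather than a verified APT solution.

```java
import java.util.List;

class TrimSketch {
    // Hypothetical node type for this example.
    static class TreeNode {
        int info;
        TreeNode left, right;
        TreeNode(int info, TreeNode left, TreeNode right) {
            this.info = info;
            this.left = left;
            this.right = right;
        }
    }

    // Keeps only values in [low, high]; relies on search-tree ordering.
    static TreeNode trim(TreeNode t, int low, int high) {
        if (t == null) return null;                               // base case: null every time
        if (t.info < low) return trim(t.right, low, high);        // root too small: only right can match
        if (t.info > high) return trim(t.left, low, high);        // root too large: only left can match
        t.left = trim(t.left, low, high);                         // root in range: use both results
        t.right = trim(t.right, low, high);
        return t;
    }

    // Helper for checking results: appends values in sorted order.
    static void inorder(TreeNode t, List<Integer> out) {
        if (t == null) return;
        inorder(t.left, out);
        out.add(t.info);
        inorder(t.right, out);
    }
}
```

When the root is out of range the node is skipped by returning the result of a single recursive call; when it is in range the results are assigned back to `t.left` and `t.right` before returning `t`.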

  45. Bubble Sort, A Personal Odyssey

  46. Steve and Rachel, Duke 1997

  47. 11/08/77

  48. 17 Nov 75 Not needed Can be tightened considerably

  49. Jim Gray (Turing 1998) • Bubble sort is a good argument for analyzing algorithm performance. It is a perfectly correct algorithm. But its performance is among the worst imaginable. So, it crisply shows the difference between correct algorithms and good algorithms. (italics ola’s)

  50. Brian Reid (Hopper Award 1982) Feah. I love bubble sort, and I grow weary of people who have nothing better to do than to preach about it. Universities are good places to keep such people, so that they don't scare the general public. (continued)
