Learn about Autocomplete systems, Sorting algorithms, and Comparators in Compsci 201 during the Spring 2019 semester. Discover trade-offs, binary searches, and optimization strategies in handling large datasets efficiently.
Compsci 201, Autocomplete, Sorts, Comparators Owen Astrachan ola@cs.duke.edu http://bit.ly/201spring19 March 27, 2019 Compsci 201, Spring 2019, Autocomplete, Sorting, Comparators
S is for …
• Stack
• Last in, First Out, source of overflow!
• Software
• Joys and sorrows, eating the world
• Sorting
• From slow to quick to tim to …
Is this a Meme? A Gif? Recursion?
• Laziness does not exist
• https://medium.com/@devonprice/laziness-does-not-exist-3af27e312d01
Plan for LWoM
• Autocomplete Overview
• Using priority queues and interfaces
• Comparators for Sorting and Priority Queues
• Alternative to .compareTo/Comparable
• Sorting from Algorithms to APIs
• Exposure and utility: knowing and doing
Autocomplete
• 70,000 queries/second, thousands of computers, 0.2 seconds to answer query, …
• Fall 2018 and Spring 2019
Geolocating Heaven
Tradeoffs in Autocomplete
• Like search in Google, we want the best or top or most-weighty matches
• Each term is a (word, weight) pair
• Sort by weight, we want only top 10 or top k
• Don't sort everything if we only need 10
• PriorityQueues can help, so can binary search
• Don't sort 10^9 items if 10^3 match "duke b"
BruteAutocomplete
• There are N terms (word, weight)
• M of these match a prefix, e.g., "auto"
• We want the top k of these M matches
• N is millions, M is thousands/hundreds, k is 10
• Naïve: find matching M terms of N, sort them, choose the heaviest k: N + M log M + k
• Where does the M log M term come from?
• Where does the N term come from?
Binary Search beats Naïve Brute
• In Brute, don't use M log M; use M log k
• Use a priority queue: see code in Git repo
• Who cares? Does this make a difference?
• We can use binary search, after sorting once
• Find first and last of M matching terms: log N
• Changes time to log N + M log k + k
Pièce de résistance!
• Brute: N + M log k
• Binary: log N + M log k
• data/alexa.txt contains one million queries
• Does N versus log N matter?
• Use a HashMap of prefix to matching terms
• O(1) at the expense of storage
• APIs are your friend! Good in practice too
Binary Search in Autocomplete
• Given "beenie" and prefix of 3, find M matches
• Find first "bee.." and last "bee..": O(log N)
• Cannot use the library binary search API: it doesn't guarantee the first/last match
• O(log N) to find first and last, O(M log k) for top k
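The first/last search described on this slide can be sketched as below. This is a minimal illustration, not the assignment's code: the class `PrefixSearch` and the helpers `prefixOrder`, `firstIndexOf`, and `lastIndexOf` are hypothetical names, and plain strings stand in for (word, weight) terms.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PrefixSearch {
    // Hypothetical helper: orders strings by their first r characters only.
    static Comparator<String> prefixOrder(int r) {
        return (a, b) -> {
            String x = a.substring(0, Math.min(r, a.length()));
            String y = b.substring(0, Math.min(r, b.length()));
            return x.compareTo(y);
        };
    }

    // Smallest index whose element equals target under comp, or -1 if none.
    static <T> int firstIndexOf(List<T> list, T target, Comparator<T> comp) {
        int low = 0, high = list.size() - 1, found = -1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            int c = comp.compare(list.get(mid), target);
            if (c < 0) {
                low = mid + 1;
            } else {
                if (c == 0) found = mid;   // remember match, keep looking left
                high = mid - 1;
            }
        }
        return found;
    }

    // Largest index whose element equals target under comp, or -1 if none.
    static <T> int lastIndexOf(List<T> list, T target, Comparator<T> comp) {
        int low = 0, high = list.size() - 1, found = -1;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            int c = comp.compare(list.get(mid), target);
            if (c > 0) {
                high = mid - 1;
            } else {
                if (c == 0) found = mid;   // remember match, keep looking right
                low = mid + 1;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // The list must already be sorted; that is the one-time O(N log N) cost.
        List<String> words = Arrays.asList("apple", "bee", "beech", "beenie", "beer", "cat");
        Comparator<String> comp = prefixOrder(3);
        System.out.println(firstIndexOf(words, "beenie", comp)); // 1
        System.out.println(lastIndexOf(words, "beenie", comp));  // 4
    }
}
```

Both searches halve the range on every step, so each is O(log N); the M matches are then the elements between the two indexes.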
Tradeoffs Summarized
• Brute: O(N + M log k) since it uses a priority queue
• Binary: O(log N + M log k)
• Requires sorting once: O(N log N)
• Amortize this cost over many queries, ignore?
• If we make Q queries:
• Brute: Q x N
• Binary: Q x log N, recoup sorting cost?
• Hash: Q, can't really do better!
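The "Hash" row above trades storage for time. A minimal sketch of the idea, assuming every prefix of every word is stored: `HashPrefixSketch`, its nested `Term` stand-in, and `topMatches` are hypothetical names for illustration, not the course's HashListAutocomplete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashPrefixSketch {
    // Hypothetical stand-in for the assignment's Term class.
    static class Term {
        final String word;
        final double weight;
        Term(String word, double weight) { this.word = word; this.weight = weight; }
    }

    // Maps each prefix to its matching terms, heaviest first.
    private final Map<String, List<Term>> map = new HashMap<>();

    HashPrefixSketch(List<Term> terms) {
        for (Term t : terms) {
            // Store the term under every prefix, including the empty prefix.
            for (int len = 0; len <= t.word.length(); len++) {
                map.computeIfAbsent(t.word.substring(0, len), p -> new ArrayList<>()).add(t);
            }
        }
        // Sort each list once, up front, so queries do no sorting.
        for (List<Term> list : map.values()) {
            list.sort(Comparator.comparingDouble((Term t) -> t.weight).reversed());
        }
    }

    // One hash lookup plus copying at most k terms.
    List<Term> topMatches(String prefix, int k) {
        List<Term> all = map.getOrDefault(prefix, Collections.emptyList());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        HashPrefixSketch h = new HashPrefixSketch(Arrays.asList(
            new Term("duke", 5), new Term("dune", 9), new Term("dog", 1), new Term("cat", 3)));
        for (Term t : h.topMatches("d", 2)) {
            System.out.println(t.word + " " + t.weight);
        }
    }
}
```

The storage cost is visible in the constructor: a word of length L appears in L + 1 lists.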
Something Old, Something New
• Sort list of Terms by weight? Or reverse order?
• Call v.getWeight(), w.getWeight()
• Before Java 8 and after Java 8
    public static class WeightOrder implements Comparator<Term>
    public static class ReverseWeightOrder implements …
    Collections.sort(list, new Term.ReverseWeightOrder());
    Collections.sort(list, Comparator.comparing(Term::getWeight).reversed());
The Comparator Interface
• When you can't access/change a class
• Supply separate Comparator object
• Return integer similar to .compareTo return value
• int compare(Term a, Term b)
• Compare by weight or prefix or …
• return < 0 if a < b
• return > 0 if a > b
• return == 0 if a == b
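A small sketch of both styles of comparator may help here. The nested `Term` below is a simplified, hypothetical stand-in for the assignment's class (only `getWord` and `getWeight`), and `wordsSorted` is a helper invented for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ComparatorSketch {
    // Hypothetical, simplified stand-in for the assignment's Term class.
    static class Term {
        private final String word;
        private final double weight;
        Term(String word, double weight) { this.word = word; this.weight = weight; }
        String getWord() { return word; }
        double getWeight() { return weight; }
    }

    // Pre-Java 8 style: a named class implementing Comparator.
    static class WeightOrder implements Comparator<Term> {
        @Override
        public int compare(Term a, Term b) {
            // Negative if a is lighter, positive if heavier, zero if equal.
            return Double.compare(a.getWeight(), b.getWeight());
        }
    }

    // Sorts a copy with the given comparator and returns the words in order.
    static List<String> wordsSorted(List<Term> terms, Comparator<Term> comp) {
        List<Term> copy = new ArrayList<>(terms);
        Collections.sort(copy, comp);
        List<String> words = new ArrayList<>();
        for (Term t : copy) {
            words.add(t.getWord());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Term> terms = Arrays.asList(new Term("a", 3), new Term("b", 1), new Term("c", 2));
        // Java 8 style: build the comparator from a method reference.
        Comparator<Term> heaviestFirst = Comparator.comparing(Term::getWeight).reversed();
        System.out.println(wordsSorted(terms, new WeightOrder())); // [b, c, a]
        System.out.println(wordsSorted(terms, heaviestFirst));     // [a, c, b]
    }
}
```

`Double.compare` is used rather than subtracting weights because a `double` difference cannot be returned as an `int` without losing information.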
Old Ideas meet new APIs
• Remnants of old versions of assignments linger
• If sorting M matching terms in binary search, need ReverseWeightOrder: O(M log M)
• If using Priority Queue, don't need: O(M log k)
• When M is 10^6 and k is 10^1, there's a difference!!
• Old and new in HashListAutocomplete
WOTO http://bit.ly/201spring19-march27-1
Alan Turing
• 2:46 marathon
• 15:20 three mile
• Enigma machine and WWII
• Entscheidungsproblem
"Sometimes it is the people no one can imagine anything of who do the things no one can imagine"
PriorityQueues top to bottom
• add and remove are O(log N), where N is the size of the PQ
• peek is O(1)
• Details after midterm
• Always remove the smallest element, minPQ
• Can change by providing a Comparator
• Shortest-path, e.g., Google Maps. Best-first search in games
• Best element removed from queue, not first
PriorityQueues top to bottom
• How can we sort elements using a Priority Queue?
• Add all elements to pq, then remove them
• Every operation is O(log N), so this sort?
• O(N log N), the basis for heap sort
• https://coursework.cs.duke.edu/201spring19/sorting
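The add-all-then-remove-all idea can be sketched as follows. The class and method names (`PQSort`, `sorted`) are made up for this illustration; `java.util.PriorityQueue` is the standard library min-heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class PQSort {
    // Returns a new sorted list; the parameter is not modified.
    static <T extends Comparable<T>> List<T> sorted(List<T> list) {
        PriorityQueue<T> pq = new PriorityQueue<>();
        for (T item : list) {
            pq.add(item);               // N adds, each O(log N)
        }
        List<T> result = new ArrayList<>();
        while (!pq.isEmpty()) {
            result.add(pq.remove());    // N removes, smallest first, each O(log N)
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList(5, 1, 4, 1, 3))); // [1, 1, 3, 4, 5]
    }
}
```

Unlike an in-place heap sort, this sketch uses O(N) extra storage for the queue and the result.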
Finding top M of N
• Sort all and get first (or last) M
• O(N log N) to sort, then O(M), typically N >> M
• Code on the slide (not captured here) doesn't alter the list parameter
• Why is comp.reversed() used?
• https://coursework.cs.duke.edu/201spring19/sorting
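One way the sort-everything approach might look is sketched below. It is a guess at the shape of such code, not the course repository's version; `TopMSort` and `topM` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class TopMSort {
    // Returns the m largest elements under comp, largest first.
    static <T> List<T> topM(List<T> list, int m, Comparator<T> comp) {
        List<T> copy = new ArrayList<>(list);       // copy, so the parameter is untouched
        Collections.sort(copy, comp.reversed());    // reversed: largest elements come first
        return new ArrayList<>(copy.subList(0, Math.min(m, copy.size())));
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 9, 1, 7);
        System.out.println(topM(data, 2, Comparator.<Integer>naturalOrder())); // [9, 7]
        System.out.println(data); // [3, 9, 1, 7]
    }
}
```

Reversing the comparator puts the largest elements at the front, so the top M are simply the first M of the sorted copy.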
Finding top M of N
• Can do this in O(N log M) using a priority queue
• Not intuitive? Largest M using a min PQ?
Details for M of N
• Keep only M elements in the priority queue
• Every time one is removed? It's the smallest
• When done? Top M remain, removed smallest!
• First element removed? Smallest, so …
• Why is LinkedList used? O(1) add to front
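The bounded min-PQ technique from these bullets can be sketched like this. `TopMPQ` and `topM` are hypothetical names; the details of the course's own code may differ.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.PriorityQueue;

class TopMPQ {
    // Returns the m largest elements under comp, largest first, in O(N log M).
    static <T> List<T> topM(List<T> list, int m, Comparator<T> comp) {
        PriorityQueue<T> pq = new PriorityQueue<>(comp);   // min PQ under comp
        for (T item : list) {
            pq.add(item);
            if (pq.size() > m) {
                pq.remove();            // evict the smallest; the largest m survive
            }
        }
        LinkedList<T> result = new LinkedList<>();
        while (!pq.isEmpty()) {
            // Elements leave smallest first, so add each to the front: O(1).
            result.addFirst(pq.remove());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 9, 1, 7, 5);
        System.out.println(topM(data, 2, Comparator.<Integer>naturalOrder())); // [9, 7]
    }
}
```

The queue never holds more than m + 1 elements, which is where the log M factor comes from.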
Sorting: 201 in a
• Algorithms: traditionally foundation of compsci
• Study, analyze, develop, use
• APIs: tested, proven, configurable
• Algorithms encapsulated and usable
• You should know how to write your own sort
• You should know how to call library sort
Sorting: From Theory to Practice
• Why do we study more than one algorithm?
• Paradigms of trade-offs and algorithmic design
• Know your history
• How do you use and configure library sort?
• http://www.sorting-algorithms.com/
[Image caption: Not Yogi]
Simple, O(n^2) sorts
• Selection sort: n^2 comparisons, n swaps
• Find min, swap to front, increment front, repeat
• Insertion sort: n^2 comparisons, no swaps, shifts instead
• Stable, fast on sorted data, slide into place
• Bubble sort: n^2 everything, slow*
• Catchy name, but slow and ugly*
*this isn't everyone's opinion, but it should be
• Shell sort: quasi-insertion, fast in practice
• Not quadratic with some tweaks
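Insertion sort's "shift, then slide into place" step can be sketched as below. This is a plain textbook version written for illustration; the class name `InsertionSortSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InsertionSortSketch {
    // Sorts the list in place. Invariant: before pass j, [0, j) is sorted.
    static <T extends Comparable<T>> void insertionSort(List<T> list) {
        for (int j = 1; j < list.size(); j++) {
            T item = list.get(j);
            int k = j;
            // Shift larger elements one slot right; no swaps needed.
            while (k > 0 && list.get(k - 1).compareTo(item) > 0) {
                list.set(k, list.get(k - 1));
                k--;
            }
            list.set(k, item);  // slide the item into its place
        }
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>(Arrays.asList(4, 2, 5, 1, 3));
        insertionSort(data);
        System.out.println(data); // [1, 2, 3, 4, 5]
    }
}
```

On already-sorted data the inner loop exits immediately, giving one comparison per element. Using a strict `> 0` test keeps equal elements in their original order, which is why the sort is stable.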
Case Study: SelectionSort
• https://coursework.cs.duke.edu/201spring19/sorting
Canonical O(n^2) algorithm/code
Case Study: SelectionSort
• Invariant: on jth pass, [0,j) is in final sorted order
• Nested loop re-establishes invariant
[Figure: list split at index j into "Final order" (left) and "unexamined" (right)]
    public void sort(List<T> list) {
        for (int j = 0; j < list.size() - 1; j++) {
            int min = j;   // index of smallest element seen in [j, k)
            for (int k = j + 1; k < list.size(); k++) {
                if (list.get(k).compareTo(list.get(min)) < 0) {
                    min = k;
                }
            }
            swap(list, min, j);   // smallest of the rest goes to index j
        }
    }
Reminder: Loop Invariant
• Statement: true each time loop begins to execute
• During loop execution it may become false
• The loop re-establishes the invariant
• Typically stated in terms of loop index
• Pictures can help reason about code/solution
• Helps to reason formally and informally about the code you’re writing
• Can I explain the invariant to someone?
Bubblesort isn't much code
• Swap adjacent elements when out of order
• From beginning to end, then end-1, end-2, …
• After n passes, last n elements in place
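A sketch of the passes described above, written for illustration only; `BubbleSortSketch` is a hypothetical name, and `Collections.swap` is the standard library helper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BubbleSortSketch {
    // Sorts the list in place. After each pass, one more element at the
    // end of the list is in its final position.
    static <T extends Comparable<T>> void bubbleSort(List<T> list) {
        for (int end = list.size() - 1; end > 0; end--) {
            for (int k = 0; k < end; k++) {
                // Swap adjacent elements when they are out of order.
                if (list.get(k).compareTo(list.get(k + 1)) > 0) {
                    Collections.swap(list, k, k + 1);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>(Arrays.asList(4, 2, 5, 1, 3));
        bubbleSort(data);
        System.out.println(data); // [1, 2, 3, 4, 5]
    }
}
```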
Timing of n^2 and other sorts
• https://coursework.cs.duke.edu/201spring19/sorting
More efficient O(n log n) sorts
• Divide and conquer sorts:
• Quick sort: fast in practice, O(n^2) worst case
• Merge sort: stable, fast, extra storage
• Timsort: http://en.wikipedia.org/wiki/Timsort
• Other sorts:
• Heap sort: priority queue sorting
• Radix sort: uses digits/characters (no compare)
• O(n log n) is optimal for comparison-based sorting
• But, Radix is O(n) ??
Stable, Stability
• Stable: respect order of equal keys when sorting
• First sort by shape, then by color: Stable!
• Triangle < Square < Circle; Yellow < Green < Red
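The shape-then-color example can be tried out with the sketch below. The `Item` class and the two enums are hypothetical, declared in the slide's order so that enum comparison matches Triangle < Square < Circle and Yellow < Green < Red. `List.sort` on objects is documented as stable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StableDemo {
    enum Shape { TRIANGLE, SQUARE, CIRCLE }   // enum order: Triangle < Square < Circle
    enum Color { YELLOW, GREEN, RED }         // enum order: Yellow < Green < Red

    // Hypothetical item type with two independent sort keys.
    static class Item {
        final Shape shape;
        final Color color;
        Item(Shape shape, Color color) { this.shape = shape; this.color = color; }
        @Override
        public String toString() { return color + " " + shape; }
    }

    // Two stable sorts in a row: the result is ordered by color, and items
    // of the same color keep the shape order from the first sort.
    static List<Item> byShapeThenColor(List<Item> items) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(Comparator.comparing((Item it) -> it.shape));
        copy.sort(Comparator.comparing((Item it) -> it.color));
        return copy;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
            new Item(Shape.CIRCLE, Color.RED), new Item(Shape.TRIANGLE, Color.RED),
            new Item(Shape.SQUARE, Color.YELLOW), new Item(Shape.CIRCLE, Color.YELLOW),
            new Item(Shape.TRIANGLE, Color.GREEN));
        System.out.println(byShapeThenColor(items));
    }
}
```

With an unstable second sort, the two red items (or the two yellow ones) could come out in either shape order.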
Quicksort: fast in practice
[Figure: list partitioned as "<= X", X at the pivot index, "> X"]
• Published in 1962 by Tony Hoare, who at first didn't understand recursion
• Canonical T(n) = 2T(n/2) + O(n), but
• Worst case is O(n^2), bad pivot. Shuffle first?
    void doQuick(List<T> list, int first, int last) {
        if (first >= last) return;             // zero or one element: sorted
        int piv = pivot(list, first, last);    // partition, get pivot's final index
        doQuick(list, first, piv - 1);
        doQuick(list, piv + 1, last);
    }
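Pivot is O(n)
[Figure: list partitioned as "<= X", X at the pivot index, "> X"; during the loop, [first, p] holds elements <= [first], (p, k) holds elements > [first], and elements from k on are still unexamined (???)]
• Invariant: [first,p] <= list.get(first)
• Invariant: (p,k) > list.get(first)
    private int pivot(List<T> list, int first, int last) {
        T piv = list.get(first);   // pivot value is the first element
        int p = first;             // last index of the "<= piv" region
        for (int k = first + 1; k <= last; k++) {
            if (list.get(k).compareTo(piv) <= 0) {
                p++;
                swap(list, k, p);
            }
        }
        swap(list, p, first);      // move pivot between the two regions
        return p;
    }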
https://en.wikipedia.org/wiki/Timsort
• Stable, O(n log n) in average and worst case, O(n) best case!
• In practice lots of data is "close" to sorted
• Invented by Tim Peters for Python, now in Java
• Replaced merge sort, which is also stable
• Engineered to be correct, fast, useful in practice
• Theory and explanation not so simple
• https://www.youtube.com/watch?v=NVIjHj-lrT4
Summary of O(n log n) sorts
• Timsort: hybrid of merge and insertion?
• Fast in real world: Python, Java 7+, Android
• What’s the best O(n log n) sort to call?
• The one in the library you have access to
• Arrays.sort or Collections.sort
• Changing how you sort:
• .compareTo() or .compare()
sortingwoto or ginooorsttw
• http://bit.ly/201spring19-march27-2
Brian Fox
• GNU Bash Shell (developer)
• First employee at Free Software Foundation
• First online banking system at Wells Fargo
"There’s nothing that I am better at than everyone else, except being me. There’s no secret to being me. Follow your interests and work hard at them. Then you will play bass better, program better, cook better, ride motorcycles better, or anything else that you really want to do."
https://lifehacker.com/im-brian-fox-author-of-the-bash-shell-and-this-is-how-1820510600
Recursion Idiom for Trees
• Use root = caller(root, val);
• Modifies the tree at root and returns the result
• Recursive calls work similarly
• root.left = caller(root.left, val);
• Process the subtree, use the result
• Where do we see this? Insertion into search tree
• We also see it in tree tighten APT
Compsci 201, Spring 2019, Trees and Recurrences
Tightening a Tree
• https://www2.cs.duke.edu/csed/newapt/treetighten.html
• Call: tree = tighten(tree)
• A node has two children (13, 8)? No change
• A node has no children (5, 7, 25)? No change
• One child? Remove
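The `tree = tighten(tree)` idiom might be realized as in the sketch below. It follows the three cases listed on the slide; the `TreeNode` class here is a minimal stand-in with assumed field names (`info`, `left`, `right`), and this is one possible solution rather than the reference one.

```java
class TightenSketch {
    // Minimal stand-in for the APT's TreeNode; field names are assumed.
    static class TreeNode {
        int info;
        TreeNode left, right;
        TreeNode(int info, TreeNode left, TreeNode right) {
            this.info = info;
            this.left = left;
            this.right = right;
        }
    }

    // Returns the root of the tightened tree: every node with exactly
    // one child is removed and replaced by its (tightened) subtree.
    static TreeNode tighten(TreeNode t) {
        if (t == null) return null;
        t.left = tighten(t.left);       // process the subtree, use the result
        t.right = tighten(t.right);
        if (t.left == null && t.right != null) return t.right;  // one child: remove t
        if (t.right == null && t.left != null) return t.left;   // one child: remove t
        return t;                       // zero or two children: no change
    }

    public static void main(String[] args) {
        // Chain 1 -> 2 -> 3: both 1 and 2 have one child, so only 3 remains.
        TreeNode chain = new TreeNode(1, new TreeNode(2, null, new TreeNode(3, null, null)), null);
        System.out.println(tighten(chain).info); // 3
    }
}
```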
APT Tree Practice Redux
• https://www2.cs.duke.edu/csed/newapt/trimtree.html
• It's a search tree, so we can use range search
• Left subtree <= root. Right subtree > root
• With no duplicates, < and not <=
• Range [2,15] whole tree
• What about …
Trimming the Tree
• https://www2.cs.duke.edu/csed/newapt/trimtree.html
• Range [2,15] both subtrees
• What about [3,7] or [5,13] or [11,17]
• If root in range?
• Process subtrees, return root
• t.left = trim(t.left, …)
• t.right = trim(t.right, …)
• If root not in range? Return result of one call!
• return trim(t.left, ..) or …
Toward All Green
• Identify base-case: null every time
• Sometimes leaf? In this case not needed
• Make sure you use result of recursive call
• Trim returns a tree, what do we do?
• t.left = trim(t.left,low,high);
• return t
• Compare to simply return trim(t.left);
• Don't include node when outside range?
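Putting the trimming bullets together gives something like the sketch below. It assumes a search tree without duplicates and an inclusive range [low, high]; `TreeNode` is a minimal stand-in with assumed field names, and `inorder` is a helper added only to display the result.

```java
import java.util.ArrayList;
import java.util.List;

class TrimSketch {
    // Minimal stand-in for the APT's TreeNode; field names are assumed.
    static class TreeNode {
        int info;
        TreeNode left, right;
        TreeNode(int info, TreeNode left, TreeNode right) {
            this.info = info;
            this.left = left;
            this.right = right;
        }
    }

    // Returns the search tree containing only values in [low, high].
    static TreeNode trim(TreeNode t, int low, int high) {
        if (t == null) return null;                            // base case
        // Root outside the range: return the result of one recursive call.
        if (t.info < low) return trim(t.right, low, high);     // t and its left subtree are too small
        if (t.info > high) return trim(t.left, low, high);     // t and its right subtree are too large
        // Root in range: process both subtrees and use the results.
        t.left = trim(t.left, low, high);
        t.right = trim(t.right, low, high);
        return t;
    }

    // Helper for inspection: collects values in sorted order.
    static void inorder(TreeNode t, List<Integer> out) {
        if (t == null) return;
        inorder(t.left, out);
        out.add(t.info);
        inorder(t.right, out);
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode(8,
            new TreeNode(3, new TreeNode(1, null, null), new TreeNode(6, null, null)),
            new TreeNode(10, null, new TreeNode(14, null, null)));
        List<Integer> values = new ArrayList<>();
        inorder(trim(root, 5, 13), values);
        System.out.println(values); // [6, 8, 10]
    }
}
```

Writing `return trim(t.left, low, high)` without assigning the result anywhere is exactly what drops the out-of-range node from the tree.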
Bubble Sort, A Personal Odyssey
Steve and Rachel, Duke 1997
11/08/77
17 Nov 75 [image annotations: "Not needed", "Can be tightened considerably"]
Jim Gray (Turing 1998)
• Bubble sort is a good argument for analyzing algorithm performance. It is a perfectly correct algorithm. But its performance is among the worst imaginable. So, it crisply shows the difference between correct algorithms and good algorithms. (italics ola’s)
Brian Reid (Hopper Award 1982) Feah. I love bubble sort, and I grow weary of people who have nothing better to do than to preach about it. Universities are good places to keep such people, so that they don't scare the general public. (continued)