

  1. Topic 1011: Topics in Computer Science Dr J Frost (jfrost@tiffin.kingston.sch.uk) Last modified: 2nd November 2013

  2. A note on these Computer Science slides These slides are intended to give just an introduction to two key topics in Computer Science: algorithms and data structures. Unlike the other topics in the Riemann Zeta Club, they're not intended to give the deeper knowledge required for solving difficult problems. The main intention is to provide an initial base of Computer Science knowledge, which may help you in your university interviews. In addition to these slides, it's highly recommended that you study the following Riemann Zeta slides to deal with more specific Computer-Science-ey questions: Logic, Combinatorics, and the Pigeonhole Principle.

  3. Slide Guidance Any box with a ? can be clicked to reveal the answer (this works particularly well with interactive whiteboards!). Make sure you’re viewing the slides in slideshow mode. ?  For multiple choice questions (e.g. SMC), click your choice to reveal the answer (try below!) Question: The capital of Spain is:  A: London  B: Paris  C: Madrid

  4. Contents
  • Time and Space Complexity
  • Big O Notation
  • Sets and Lists
  • Binary Search
  • Sorted vs Unsorted Lists
  • Hash Tables
  • Recursive Algorithms
  • Sorting Algorithms: Bubble Sort, Merge Sort, Bogosort

  5. Time and Space Complexity Suppose we had a list of unordered numbers, which the computer can only view one at a time: 1 4 2 9 3 7. Suppose we want to check if the number 8 is in the list. If the size of the problem is n (i.e. there are n cards in the list), then in the worst case, how much time will it take to check whether some number is in there? And given that the list is stored on a disc (rather than in memory), how much memory (i.e. space) do we need for our algorithm?
  (Worst Case) Time Complexity: There are n items to check, and each takes some constant amount of time to check, so we know the time will be at most some constant times n.
  Space Complexity: We only need one slot of memory for the number we're checking against the list, and one slot of memory for the current item in the list we're looking at. So the space needed is constant and, importantly, is not dependent on the size n of our list.
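  As a concrete illustration, here is a minimal linear search in Python (the function and variable names are my own, purely for illustration):

    def contains(items, target):
        # Linear search: worst-case O(n) time, O(1) extra space.
        for item in items:        # view one item at a time
            if item == target:
                return True
        return False

    print(contains([1, 4, 2, 9, 3, 7], 8))  # prints False: 8 is not in the list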

  6. Big O notation So the time and space complexity of an algorithm gives us a measure of how 'complex' the algorithm is, in terms of the time it'll take and the space required to do its handiwork. In mathematics, Big O notation is used to measure how some expression grows. Suppose for example we have the function y = 2x³ + 10x² + 3. We can see that as x becomes larger, the 10x² and 3 terms become inconsequential, because the 2x³ term dominates. Since 10x² ≤ 10x³ and 3 ≤ 3x³ for all x ≥ 1, we have y ≤ 15x³ for all x ≥ 1. We're not interested in the scaling of 15, since this doesn't tell us anything about the growth of the function. So we say that y = O(x³), i.e. y grows cubically.

  7. Big O notation Formally, if f(x) = O(g(x)), then there is some constant c such that f(x) ≤ c·g(x) for all sufficiently large x. So technically we could also say that y = O(x⁴), because the big-O just provides an upper bound to the growth. But we would want to keep this upper bound as low as possible, so it would be more useful to say that y = O(x³). While big-O notation has been around since the late nineteenth century (particularly in number theory), in the 1950s it started to be used to describe the complexity of algorithms. Returning to our problem of finding a number in an unordered list (1 4 2 9 3 7), we can now express our time and space complexity using big-O notation (in terms of the list size n). Remember that the constant scaling doesn't matter in big-O notation, so '1' is used to mean constant time/space. Time Complexity: O(n). Space Complexity: O(1).

  8. Big O notation We'll see some more examples of algorithms and their complexity in a second, but let's see how we might describe algorithms based on their complexity…
  If the time complexity is O(1), we say the algorithm runs in constant time.
  O(log n): logarithmic time.
  O(n): linear time.
  O(n²): quadratic time.
  O(nᵏ) for some constant k: polynomial time.
  O(kⁿ) for some constant k > 1: exponential time.

  9. Sets and lists A data structure is, unsurprisingly, some way of structuring data, whether as a tree, a set, a list, a table, etc. There are two main ways of representing a collection of items: lists and sets.
  Example: Lists: <4, -2, 3, 6, 3>. Sets: {4, -2, 3, 6}.
  Does ordering of items matter? Lists: Yes. Sets: No; {1, 2, 3} and {3, 2, 1} are the same set.
  Duplicates allowed? Lists: Yes. Sets: No.
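  A quick Python sketch of the difference (Python's list keeps order and duplicates; its set does not):

    items_list = [4, -2, 3, 6, 3]
    items_set = {4, -2, 3, 6, 3}

    print(items_list)               # [4, -2, 3, 6, 3] – order and duplicates kept
    print(items_set)                # e.g. {3, 4, 6, -2} – duplicate dropped, order not meaningful
    print({1, 2, 3} == {3, 2, 1})   # True – the same set
    print([1, 2, 3] == [3, 2, 1])   # False – different lists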

  10. Binary Search 1 3 4 7 9 12 15 20 Suppose we have either a set or list where the items are in ascending order. We want to determine if the number 14 is in the list. Previously, when the items were unordered, we had to scan through the whole list (a simple algorithm where the time complexity was O(n)). But can we do better? More specifically, seeing if an item is within an unsorted list is known as a linear search (because we have to check every item, taking time linear in n!).

  11. Binary Search 1 3 4 7 9 12 15 20 Looking to see if 14 is in our list/set. At the start of a binary search, the number could be anywhere. A sensible thing to do is to look at the number just after the centre. That way, we can narrow down our search by half in one step. In this case 14 > 9, so we know that if the number is in the collection, it must be in the second half of it.

  12. Binary Search 1 3 4 7 9 12 15 20 Looking to see if 14 is in our list/set. Now we look halfway across what we have left to check. The number just after the halfway point is 15. Since 14 < 15, if 14 is in our collection, it must be to the left of this point.

  13. Binary Search 1 3 4 7 9 12 15 20 Looking to see if 14 is in our list/set. Now we'd compare our number 14 against the 12. Since 14 > 12, and there is nothing left to check between 12 and 15, we now know that 14 is not in the collection of items.

  14. Binary Search 1 3 4 7 9 12 15 20 We can see that on each step, we halve the number of items that need to be searched. The number of steps (i.e. the time complexity) in terms of the number of items n must therefore be:
  Time Complexity: O(log n). This makes sense when you think about it. If n = 16, then log₂ 16 = 4, i.e. we can halve 16 four times until we get to 1, so only 4 steps are needed. You might be wondering why we wrote log n instead of log₂ n. This is because changing the base of a log only scales it by a constant, and as we saw, big-O notation doesn't care about constant scaling. So the base is irrelevant.
  Space Complexity: O(1). We only ever look at one number at a time, so we only need a constant amount of memory.
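  Here is a sketch of an iterative binary search in Python (names are illustrative):

    def binary_search(sorted_items, target):
        # Returns True if target is in sorted_items: O(log n) time, O(1) space.
        lo, hi = 0, len(sorted_items) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if sorted_items[mid] == target:
                return True
            elif sorted_items[mid] < target:
                lo = mid + 1      # target can only be in the right half
            else:
                hi = mid - 1      # target can only be in the left half
        return False

    print(binary_search([1, 3, 4, 7, 9, 12, 15, 20], 14))  # False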

  15. Sorted vs unsorted lists Keeping our list sorted, or leaving it unsorted, has advantages either way. We've already seen that keeping the list sorted makes it much quicker to see if the list contains an item or not. What is the time complexity of the best algorithm to do each of these tasks?
  Seeing if the list contains a particular value: Sorted: O(log n), using a binary search. Unsorted: O(n), using a linear search.
  Adding an item to the list: Unsorted: O(1). We can just stick the item on the end! Sorted: We find the correct position to insert in O(log n) time using a binary search. If we have some easy way to splice in the new item somewhere in the middle of the list, without having to move the items after it up to make space, then we're done. However, if we do have to move up the items after it (e.g. the values are stored in an 'array'), then it takes O(n) time to shift the items up, hence it's O(n) time overall.
  Merging two lists (of size n and m respectively, where n ≥ m): Unsorted: O(1). Easy again: just have the end of the first list somehow link to the start of the second list so that they're joined together. Sorted: O(mn). Start with the largest list, with its n items. Then insert each of the m items from the second list into it. Each insert operation costs O(n) time (from above), and there are m items to add.
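  As an aside, Python's standard-library bisect module implements exactly this 'binary search, then splice' insertion into a sorted list (finding the position is O(log n), but shifting the array elements makes the insert O(n) overall):

    import bisect

    sorted_items = [1, 3, 4, 7, 9, 12, 15, 20]
    bisect.insort(sorted_items, 14)   # O(log n) search + O(n) shift
    print(sorted_items)               # [1, 3, 4, 7, 9, 12, 14, 15, 20]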

  16. Sorted vs unsorted lists Summarising:
  Seeing if the list contains a particular value: Sorted: O(log n). Unsorted: O(n).
  Adding an item to the list: Sorted: O(n). Unsorted: O(1).
  Merging two lists (of size n and m respectively, where n ≥ m): Sorted: O(mn). Unsorted: O(1).
  We can see that the advantage of keeping the list unsorted is that it's much quicker to insert new items into the list. However, it's much slower to find/retrieve an item in the list, because we can't exploit binary search. So it's a trade-off.

  17. Hash Table Hash Tables are structures which allow us to do certain operations on collections much more quickly: e.g. inserting a value into the collection, and retrieving it! Imagine we had 10 'buckets', labelled 0 to 9, to put new values into. Suppose we had a rule which decided what bucket to put a value x into: find the remainder when x is divided by 10 (i.e. x mod 10).

  18. Hash Table We can use our "mod 10" hash function to insert new values into our hash table: e.g. the values 2, 31, 67, 42, 19, 112, 55, 57, 29, 33, 69 and 4 go into buckets 2, 1, 7, 2, 9, 2, 5, 7, 9, 3, 9 and 4 respectively.

  19. Hash Table The great thing about a hash table is that if we want to check if some value is contained within it, we only need to check within the bucket it corresponds to. e.g. Is 65 in our hash table? Using the same hash function, we'd just check "Bucket 5". At this point, we might just do a linear search of the items in the bucket to see if any of them match 65. Bucket 5 contains only 55, so we'd conclude that 65 isn't part of our collection of numbers.

  20. Hash Table Suppose we've put n items in a hash table with k buckets:
  Seeing if some number is contained in our collection: O(n/k). But this holds only if our chosen hash function distributes items fairly evenly across buckets. If our data tended to have 1 as the last digit, "mod 10" would be a bad hash function, because all the items would end up in the same bucket. The result would be that if we wanted to then check if 71 was in our collection, we'd end up having to check every item still! Using "mod p", where p is a prime, reduces this problem.
  Inserting a new item into the hash table structure: O(1). Presuming the hash function takes a constant amount of time to evaluate, we just stick the new item at the top of the correct bucket. We could always keep the buckets sorted, in which case insertion would take O(log(n/k)) time.
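  A minimal bucket-based hash table sketch in Python (the class and method names are my own; Python's built-in dict and set use a more sophisticated scheme):

    class HashTable:
        def __init__(self, num_buckets=11):   # a prime number of buckets
            self.buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, value):
            return self.buckets[value % len(self.buckets)]  # the "mod p" hash function

        def insert(self, value):
            self._bucket(value).append(value)    # O(1): stick it in its bucket

        def contains(self, value):
            return value in self._bucket(value)  # linear search of one bucket: O(n/k)

    table = HashTable()
    for v in [2, 31, 67, 42, 19, 112, 55, 57, 29, 33, 69, 4]:
        table.insert(v)
    print(table.contains(55))  # True
    print(table.contains(65))  # False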

  21. Recursive Algorithms The Towers of Hanoi is a classic game in which the aim is to get the 'tower' (composed of varying sized discs) from the first peg to the last peg. There's one 'spare peg' available. The only rule is that a larger disc can never be on top of a smaller disc, i.e. on any peg the discs must be in decreasing size order from bottom to top. There are two questions we might ask: 1. For n discs, what is the minimum number of moves required to win? 2. Is there an algorithm which generates the sequence of moves required?

  22. Recursive Algorithms We can answer both questions at the same time. Suppose HANOI(START, SPARE, GOAL, n) is a function which generates a sequence of moves for n discs, where START is the start peg, SPARE is the spare peg and GOAL is the goal peg. Then we can define the algorithm as follows:

  23. Recursive Algorithms Recursively solve the problem of moving n-1 discs from the start peg to the spare peg, i.e. HANOI(START, GOAL, SPARE, n-1) (notice that we've made the original goal peg the new spare peg and vice versa). It's quite common to define a function in terms of itself but with smaller arguments. It's recommended you first look at some of the examples in the Recurrence Relations section of the RZC Combinatorics slides to get your head around this.

  24. Recursive Algorithms Next move the 1 remaining disc (or whatever disc is at the top of the peg) from the start to goal peg. i.e. MOVE(START,GOAL)

  25. Recursive Algorithms Finally, recursively solve the problem of moving n-1 discs from the spare peg to the goal peg, i.e. HANOI(SPARE, START, GOAL, n-1). Notice here that the original start peg is now the spare peg, and the spare peg is now the start peg.

  26. Recursive Algorithms Putting this together, we have the algorithm: FUNCTION HANOI(START, SPARE, GOAL, n) = HANOI(START, GOAL, SPARE, n-1), MOVE(START, GOAL), HANOI(SPARE, START, GOAL, n-1) But just like recurrences in maths, we need a ‘base case’, to say what happens when we only have to solve the problem when n=1 (i.e. we have one disc): FUNCTION HANOI(START, SPARE, GOAL, 1) = MOVE(START, GOAL)
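  A direct Python translation of this (a sketch; MOVE is represented here by a simple print):

    def move(start, goal):
        print(f"MOVE({start}, {goal})")

    def hanoi(start, spare, goal, n):
        if n == 1:                            # base case: one disc moves directly
            move(start, goal)
            return
        hanoi(start, goal, spare, n - 1)      # move n-1 discs onto the spare peg
        move(start, goal)                     # move the largest disc to the goal
        hanoi(spare, start, goal, n - 1)      # move the n-1 discs onto the goal

    hanoi("A", "B", "C", 3)   # prints the seven moves listed on the next slide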

  27. Recursive Algorithms A B C We can see this algorithm in action. If the 3 pegs are A, B and C, and we have 3 discs, then we want to execute HANOI(A, B, C, 3) to get our moves: HANOI(A, B, C, 3) = HANOI(A, C, B, 2), MOVE(A, C), HANOI(B, A, C, 2) = HANOI(A, B, C, 1), MOVE(A, B), HANOI(C, A, B, 1), MOVE(A, C), HANOI(B, C, A, 1), MOVE(B, C), HANOI(A, B, C, 1) = MOVE(A, C), MOVE(A, B), MOVE(C, B), MOVE(A, C), MOVE(B, A), MOVE(B, C), MOVE(A, C)

  28. Recursive Algorithms The same approach applies when counting the minimum number of moves. Let F(n) be the number of moves required to move n discs to the target peg. We require F(n-1) moves to move n-1 discs from the start to the spare peg. We require 1 move to move the remaining disc to the goal peg. We require F(n-1) moves to move n-1 discs from the spare to the goal peg. This gives us the recurrence relation F(n) = 2F(n-1) + 1. And our base case is F(1) = 1, since it only requires 1 move to move 1 disc. But just writing out the first few terms in this sequence, it's easy to spot that the position-to-term formula is F(n) = 2ⁿ - 1.
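  A quick check in Python that the recurrence matches this closed form:

    def f(n):
        return 1 if n == 1 else 2 * f(n - 1) + 1   # F(n) = 2F(n-1) + 1, F(1) = 1

    print([f(n) for n in range(1, 6)])        # [1, 3, 7, 15, 31]
    print([2**n - 1 for n in range(1, 6)])    # [1, 3, 7, 15, 31] – matches 2ⁿ - 1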

  29. Sorting Algorithms One very fundamental algorithm in Computer Science is sorting a collection of items so that they are in order (whether in numerical order, or some order we’ve defined). We’ll look at the main well-known algorithms, and look at their time complexity. 2 19 31 42 55 67 112

  30. Bubble Sort 31 19 55 42 2 112 67 This looks at each adjacent pair of numbers in turn, starting with the 1st and 2nd, then the 2nd and 3rd, and swaps them if they're in the wrong order. At the end of the first 'pass'*, we can guarantee that the largest number will be at the end of the list. We then repeat the process, but we can now ignore the last number (because it's in the correct position). This continues, until eventually on the last pass, we only need to compare the first two items.
  * A 'pass' in an algorithm means that we've looked through all the values (or some subset of them) within this stage. You can think of a pass as someone checking your university personal statement and making corrections, before you give this updated draft to another person for an additional 'pass'.
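  A sketch of bubble sort in Python:

    def bubble_sort(items):
        n = len(items)
        for pass_end in range(n - 1, 0, -1):   # each pass bubbles the largest remaining item to the end
            for i in range(pass_end):
                if items[i] > items[i + 1]:    # adjacent pair in the wrong order?
                    items[i], items[i + 1] = items[i + 1], items[i]
        return items

    print(bubble_sort([31, 19, 55, 42, 2, 112, 67]))  # [2, 19, 31, 42, 55, 67, 112]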

  31. Bubble Sort 31 19 55 42 2 112 67 Time Complexity? O(n²). The first pass requires n-1 comparisons, the next pass requires n-2 comparisons, and so on, giving us the sum of an arithmetic sequence. So the exact number of comparisons is ½n(n-1). This is growth quadratic in n, i.e. O(n²).

  32. Merge Sort First treat each individual value as an individual list (with 1 item in it!). Then we repeatedly merge each pair of lists, until we only have 1 big fat list:
  <31> <19> <55> <42> <2> <112> <67> <4>
  <19, 31> <42, 55> <2, 112> <4, 67>
  <19, 31, 42, 55> <2, 4, 67, 112>
  <2, 4, 19, 31, 42, 55, 67, 112>
  We'll go into more detail on this 'merge' operation on the next slide.

  33. Merge Sort At each point in the algorithm, we know each smaller list will be in order. Merging two sorted lists, e.g. <19, 31, 55, 112> and <2, 4, 42, 67>, can be done quite quickly. General gist: Start with a marker at the beginning of each list. Compare the two elements at the markers. The lowest value gets put in the new merged list, and the marker for the list that value came from moves up one. Then repeat!

  34. Merge Sort Time Complexity? O(n log n). Each merging phase requires n steps, because when merging each pair of lists, each step puts one element into the new list, so across the whole phase n elements get placed. There are log₂ n phases because, similarly to the binary search, each phase halves the number of mini-lists.
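  A compact merge sort sketch in Python. The slides describe the bottom-up view; this recursive top-down version performs the same merging work:

    def merge(left, right):
        # Merge two sorted lists using the two-marker method: O(len(left) + len(right)).
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:            # take the smaller front element
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]   # append whatever remains

    def merge_sort(items):
        if len(items) <= 1:                    # a 1-item list is already sorted
            return items
        mid = len(items) // 2
        return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))

    print(merge_sort([31, 19, 55, 42, 2, 112, 67, 4]))
    # [2, 4, 19, 31, 42, 55, 67, 112]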

  35. Bogosort The Bogosort, also known as 'Stupid Sort', is intentionally a 'joke' sorting algorithm, but provides some educational value. It simply goes like this: 1. Put all the elements of the list in a completely random order. 2. Check if the elements are in order. If so, you're done. If not, then go back to Step 1. We can describe time complexity in different ways: the 'worst-case behaviour' (i.e. the longest amount of time the algorithm can possibly take) and the 'average-case behaviour' (i.e. how long we expect the algorithm to take on average).
  Worst Case Time Complexity: Unbounded. The algorithm theoretically may never terminate, because the order may be wrong every time.
  Average Case Time Complexity: O(n × n!). There are n! possible ways the items can be ordered. Presuming no duplicates in the list, there's a 1 in n! chance that the list is in the correct order, so we 'expect' to have to repeat Step 1 n! times. Each check in Step 2 requires checking all the elements, which is O(n) time. It might be worth checking out the Geometric Distribution in the RZC Probability slides.
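  A sketch of bogosort in Python (for amusement only; don't run it on long lists):

    import random

    def is_sorted(items):
        return all(items[i] <= items[i + 1] for i in range(len(items) - 1))

    def bogosort(items):
        while not is_sorted(items):   # expect ~n! shuffles, each check costing O(n)
            random.shuffle(items)     # Step 1: put the list in a completely random order
        return items

    print(bogosort([3, 1, 2]))  # [1, 2, 3] (eventually!)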
