CS 130 A: Data Structures and Algorithms

CS 130 A: Data Structures and Algorithms • Course webpage: www.cs.ucsb.edu/~suri/cs130a/cs130a • Email: suri@cs.ucsb.edu • Office Hours: 11-12 Wed

CS 130A: Prerequisites • First upper division course • More in-depth coverage of data structures and algorithms • Prerequisites • CS 16: stacks, queues, lists, binary search trees, … • CS 40: functions, recurrence equations, proofs, … • Programming competence assumed • C, C++, and UNIX • Refresh your coding and debugging skills • Use TAs

Text Book • Data Structures & Algorithm Analysis in C++ by Mark Allen Weiss • Supplemental material from Introduction to Algorithms, by Cormen, Leiserson, Rivest, Stein [MIT book] • Lecture material primarily based on my notes • Lecture notes available on my webpage • See web page for lecture updates and assignments.

CS 130 A: Grade Composition • 2 Midterm exams (30% total) • 2 Programming assignments (30% total) • 1 Final exam (40%) • Homework assignments • They will not be graded: they are to help you practice problem solving and prepare for exams • Solving homework problems is key to understanding. • Solutions will be made available, so you can self-assess your understanding and work with TAs to correct your mistakes. • Attend all lectures! • Schedule is tentative. • Midterm/exam dates may change unexpectedly

Some Advice and Caution • Posted schedule of lectures, assignments, exams is tentative • Reviews unplanned • Unexpected events may change dates of midterms • No makeup exams, no extensions. • Attend all lectures. • Read lecture notes (material) before coming to class.

Teaching Assistants • Semih Yavuz (syavuz@cs.ucsb.edu) • Discussion: Wed 6:30-7:20 (GIRV 1119) • TA hours: TBA (Trailer 936) • Bay-Yuan Hsu (soulhsu@cs.ucsb.edu) • Discussion: Tues 6:30-7:20 (GIRV 1119) • TA hours: TBA (Trailer 936)

Discussion Sections • No discussion section this week • Discussion Format • No new material discussed • It is meant as a help session • Use them to go over homework assignments • Programming pointers • But TAs are not there to help you write or debug code

What the course is about • The course is primarily about Data Structures • Algorithms covered in small part (20%) • CS 130B is the main algorithms course • Data structures will be motivated by applications although we won’t discuss them in any detail

What the course is about • This is a Theory course, not programming/systems • Primary focus on concepts, design, analysis, proofs • Includes 2 coding assignments, but no programming taught • C++, Unix competence expected • My teaching philosophy for 130A • Discovery and insights. Big picture. • Best understood in abstract form, with pen-paper • Alternative Style: learn by coding. (If coding is your thing, feel free to program the data structures.) • Exams on conceptual understanding, not coding details. • Homework exercises model for exam questions.

Course Outline • Introduction and Algorithm Analysis (Ch. 2) • Hash Tables: dictionary data structure (Ch. 5, CLRS) • Heaps: priority queue data structures (Ch. 6) • Balanced Search Trees: general search structures (Ch. 4.1-4.5) • Union-Find data structure (Ch. 8.1–8.5, Notes) • Graphs: Representations and basic algorithms • Topological Sort (Ch. 9.1-9.2) • Minimum spanning trees (Ch. 9.5) • Shortest-path algorithms (Ch. 9.3.2) • B-Trees: External-Memory data structures (CLRS, Ch. 4.7) • kD-Trees: Multi-Dimensional data structures (Notes, Ch. 12.6) • Misc.: Streaming data, randomization (Notes)

What are your goals? • A step towards the BS degree • Just a required CS course • Becoming a well-rounded computer scientist • Intellectual (theory) aspects of CS • Clever ideas • Interview questions at elite software companies

My goals • Algorithms is my research expertise • A lively and enormously active area of research • Broad impact on almost every area of CS • My personal mission: • transmit some of the knowledge and enthusiasm • Win the best teacher award • Weekly Jokes • Send me your jokes!

Why Study Algorithms and Data Structures? • Intellectual Pursuit

Why Study Algorithms and Data Structures? • To become a better computer scientist

Why Study Algorithms and Data Structures? • World domination

Algorithms are Everywhere • Search Engines • GPS navigation • Self-Driving Cars • E-commerce • Banking • Medical diagnosis • Robotics • Algorithmic trading • and so on …

Emergence of Computational Thinking • Computational X • Physics: simulate big bang, analyze LHC data, quantum computing • Biology: model life, brain, design drugs • Chemistry: simulate complex chemical reactions • Mathematics: non-linear systems, dynamics • Engineering: nano materials, communication systems, robotics • Economics: macro-economics, banking networks, auctions • Aeronautics: new designs, structural integrity • Social Sciences, Political Science, Law ….

Emergence of Computational Thinking

Modern World of Computing • Age of Big Data, birth of Data Science • Digitization, communication, sensing, imaging… • Entertainment, science, maps, health, environmental, banking… • Volume, variety, velocity, variability • What all happens in 1 Internet minute?

Intelligent Computational Systems

Why Data Structures? • Data is just the raw material for information, analytics, business intelligence, advertising, etc. • Computationally efficient ways of analyzing, storing, searching, modeling data • For the purpose of this course, the need for efficient data structures comes down to: • Linear search does not scale for querying large databases • N^2 processing or N^2 storage is infeasible • Smart data structures offer an intelligent tradeoff: • Perform near-linear preprocessing so that queries can be answered in much better than linear time
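The preprocessing tradeoff in the last bullet can be illustrated with a minimal C++ sketch. It assumes the data is a list of integer keys; `linearContains` and `SortedIndex` are hypothetical names chosen for this example, and sorting followed by binary search is only one of many possible preprocessing schemes.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Naive query: scans every element, so each query costs O(N).
bool linearContains(const std::vector<int>& data, int key) {
    return std::find(data.begin(), data.end(), key) != data.end();
}

// Hypothetical helper: pays O(N log N) once to sort a private copy,
// after which each membership query costs O(log N).
class SortedIndex {
public:
    explicit SortedIndex(std::vector<int> data) : sorted_(std::move(data)) {
        std::sort(sorted_.begin(), sorted_.end());  // near-linear preprocessing
    }

    bool contains(int key) const {
        return std::binary_search(sorted_.begin(), sorted_.end(), key);
    }

private:
    std::vector<int> sorted_;
};
```

With Q queries over N keys, the naive approach costs O(QN), while the sketch costs O(N log N + Q log N), which pays off as soon as there are more than a handful of queries.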

2 Motivating Applications • Imagine you are in charge of maintaining a corporate network (or a major website such as Amazon) • High speed, high traffic volume, lots of users. • Expected to perform with near perfect reliability, but is also under constant attack from malicious hackers • Monitoring what is going through the network is complex: • Why is it slow? • Which machines have become compromised? • Which applications are eating up too much bandwidth etc.

IP Network Monitoring • Any monitoring software/engine must be extremely light weight and not add to the network load • These algorithms need smart data structures to track important statistics in real time

IP Network Monitoring • Consider a simple (toy) example • Is some IP address sending a lot of data to my network? • Which IP address sent the most data in last 1 minute? • How many different IP addresses in last 5 minutes? • Have I seen this IP address in the last 5 minutes? • IP address format: 192.168.0.0 • IPv4 has 32 bits, IPv6 has 128 bits • You wouldn’t want to maintain a table of all IP addresses to see how much traffic each is sending. • These are data structure problems, where obvious/naïve solutions are no good, and require creative/clever ideas.

Microprocessor Profiling • Modern microprocessors run at GHz or higher speeds • Yet they do an incredible amount of optimization for instruction scheduling, branch prediction etc • Profiling or monitoring code tracks performance bottlenecks, and looks for anomalies. • Compute memory access statistics • Correlations across resources etc • Toy examples: • Which memory locations used the most in the last 1 sec? • Usage map over sliding time window • Need for highly efficient dynamic data structures

A Puzzle • Most Frequent Item • You are shown a sequence of N positive integers • Identify the one that occurs most frequently • Example: 4, 1, 3, 3, 2, 6, 3, 9, 3, 4, 1, 12, 19, 3, 1, 9 • However, your algorithm has access to only O(1) memory • “Streaming data” • Not stored, just seen once in the order it arrives • The order of arrival is arbitrary, with no pattern • What data structure will solve this problem?

A Puzzle: Most Frequent Item • Items can be source IP addresses at a router • The most frequent IP address can be useful to monitor suspicious traffic source • More generally, find the top K frequent items • Targeted advertising • Amazon, Google, eBay, Alibaba may track items bought most frequently by various demographics

Another Puzzle • The Majority Item • You are shown a sequence of N positive integers • Identify the one that occurs more than N/2 times • A: 4, 1, 3, 3, 2, 6, 3, 9, 3, 4, 1, 12, 19, 3, 1, 9, 1 • B: 4, 1, 3, 3, 2, 3, 3, 9, 3, 4, 1, 3, 19, 3, 3, 9, 3 • Sequence A has no majority, but B has one (item 3) • Again, your algorithm has access to only O(1) memory • What data structure will solve this problem?

Solving the Majority Puzzle • Use two variables C (candidate) and M (multiplicity). • When next item, say, X arrives • if C is undefined (null) or M = 0, set C = X and M = 1; • else if X = C, set M = M+1; • else set M = M-1; • Claim: At the end of the sequence, C is the only possible candidate for majority. • Note that the sequence may not have any majority. • But if you know there is a majority, C must be it.

Solving the Majority Puzzle • Proof of Correctness. • Suppose item Z is the majority item. • Whenever C = Z, counter M is incremented. • Whenever Z occurs but C has a different item, Z causes M to decrement. • Each decrement is “charged” to that non-Z item • Each non-Z item can only counteract one occurrence of Z • Since there are fewer than N/2 non-Z items, they cannot cancel all occurrences of Z. • So, in the end, Z must be stored as C, with a non-zero M value.

Solving the Majority Puzzle • False Positives in Majority Puzzle. • What happens if the sequence does not have a majority? • C may contain an arbitrary item, with non-zero M. • Strictly, a second pass through the sequence is necessary to “confirm” that C is the majority. • But in our application, it suffices to just “tag” a malicious IP address, and to monitor it for the next few minutes.
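The candidate/multiplicity scheme and the confirming second pass can be sketched in C++ as below. This is an illustration under stated assumptions, not the lecture's code: the function names are hypothetical, the stream is passed as a `std::vector<int>` purely for convenience (the first pass reads each item once and keeps only two variables), and a candidate whose multiplicity has dropped to 0 is treated as undefined.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Pass 1: single scan with O(1) state, candidate c and multiplicity m.
// Returns the only item that could be a majority, or nothing if m ends at 0.
std::optional<int> majorityCandidate(const std::vector<int>& stream) {
    int c = 0;          // candidate; meaningful only while m > 0
    std::size_t m = 0;  // multiplicity; 0 means "no candidate"
    for (int x : stream) {
        if (m == 0) {
            c = x;      // adopt the arriving item as the new candidate
            m = 1;
        } else if (x == c) {
            ++m;
        } else {
            --m;        // x cancels one occurrence of the candidate
        }
    }
    if (m == 0) return std::nullopt;
    return c;
}

// Pass 2: count the candidate to rule out a false positive.
std::optional<int> confirmedMajority(const std::vector<int>& stream) {
    std::optional<int> c = majorityCandidate(stream);
    if (!c) return std::nullopt;
    std::size_t count = 0;
    for (int x : stream) {
        if (x == *c) ++count;
    }
    if (2 * count > stream.size()) return c;  // strictly more than N/2
    return std::nullopt;
}
```

On sequence B from the puzzle slide the first pass ends with candidate 3, which the second pass confirms (9 of 17 items). On sequence A the first pass ends with candidate 1, which occurs only 4 times, so the second pass rejects it.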

Generalizing the Majority Problem • Identify k items, each appearing more than N/(k+1) times. • Note that simple majority is the case of k = 1.

Generalizing the Majority Problem • Find k items, each appearing more than N/(k+1) times. • Use k candidate-multiplicity tuples (C1, M1), …, (Ck, Mk). • When next item, say, X arrives • if X = Cj for some j, set Mj = Mj+1 • if X is different from all Cj, but some tuple i is free (Mi = 0), then set Ci = X and Mi = 1 • else decrement all counters Mj = Mj-1; • Verify for yourselves that this algorithm is correct.
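The k-tuple update rule can be sketched in C++ as follows. This is a minimal illustration with assumptions: `frequentCandidates` is a hypothetical name, a tuple counts as free when its multiplicity is 0 (modeled here by erasing it from a map that never holds more than k entries), and the stream is a `std::vector<int>` for convenience. Like the k = 1 case, the result is only a candidate set: every item appearing more than N/(k+1) times is in it, but a second pass is needed to discard false positives.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Returns the surviving candidates (sorted) after one pass with k counters.
std::vector<int> frequentCandidates(const std::vector<int>& stream,
                                    std::size_t k) {
    assert(k > 0);
    std::unordered_map<int, std::size_t> counters;  // at most k live tuples
    for (int x : stream) {
        auto it = counters.find(x);
        if (it != counters.end()) {
            ++it->second;                 // X = Cj: increment Mj
        } else if (counters.size() < k) {
            counters.emplace(x, 1);       // a free tuple: Ci = X, Mi = 1
        } else {
            // No match and no free tuple: decrement every counter,
            // freeing the tuples whose multiplicity reaches 0.
            for (auto jt = counters.begin(); jt != counters.end();) {
                if (--jt->second == 0) {
                    jt = counters.erase(jt);
                } else {
                    ++jt;
                }
            }
        }
    }
    std::vector<int> result;
    for (const auto& entry : counters) result.push_back(entry.first);
    std::sort(result.begin(), result.end());
    return result;
}
```

For the stream 1, 1, 1, 2, 2, 2, 3, 4 with k = 2, the threshold is 8/3, and the sketch returns the candidates 1 and 2, each of which occurs 3 times.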

Back to the Most Frequent Item Puzzle • You are shown a sequence of N positive integers • Identify most frequently occurring item • Example: 4, 1, 3, 3, 2, 6, 3, 9, 3, 4, 1, 12, 19, 3, 1, 9 • What algorithm and data structure will help?

An Impossibility Result • Cannot be done! • Computing the MFI requires Θ(N) space. • An adversary based argument: • The first half of the sequence has all distinct items • An algorithm using less space fails to remember at least one item, say, X. • In the second half, all items will be distinct, except X will occur again (twice in total), becoming the MFI.

Lessons for Data Structure Design • Puzzles such as Majority and Most Frequent Items teach us two important lessons: • To solve a problem, we should understand its structure • Correctness is intertwined with design/efficiency • Problems with superficial resemblance can have very different complexity • Do not blindly apply a data structure or algorithm without understanding the nature of the problem

Performance Bottleneck: algorithm or data structure?

Course Objectives • Focus: systematic design and analysis of data structures (and some algorithms) • Algorithm: method for solving a problem. • Data structure: method to store information. • Guiding principles: abstraction and formal analysis • Abstraction: Formulate fundamental problem in a general form so it applies to a variety of applications • Analysis: A (mathematically) rigorous methodology to compare two objects (data structures or algorithms) • In particular, we will worry about "always correct"-ness, and worst-case bounds on time and memory (space).

130a: Design and Analysis • Foundations of Algorithm Analysis and Data Structures. • Data Structures • How to efficiently store, access, manage data • Data structures affect an algorithm’s performance • Algorithm Design and Analysis: • How to predict an algorithm’s performance • How well an algorithm scales up • How to compare different algorithms for a problem

Asymptotic Complexity Analysis

Complexity and Tractability Assume the computer does 1 billion ops per sec.

N^2 is bad, Exponential is horrible
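To make these growth rates concrete, the following C++ sketch converts operation counts into wall-clock time under the slide's assumption of 1 billion operations per second. All function and constant names are hypothetical, and a year is taken as 365 days; the numbers are back-of-the-envelope estimates, not measurements.

```cpp
#include <cmath>

constexpr double kOpsPerSecond = 1e9;  // assumption from the slide
constexpr double kSecondsPerYear = 365.0 * 24.0 * 3600.0;

// Operation counts for a few common growth rates.
double opsLinear(double n) { return n; }
double opsNLogN(double n) { return n * std::log2(n); }
double opsQuadratic(double n) { return n * n; }
double opsCubic(double n) { return n * n * n; }
double opsExponential(double n) { return std::pow(2.0, n); }

// Time needed to execute the given number of operations.
double secondsFor(double operations) { return operations / kOpsPerSecond; }
double yearsFor(double operations) {
    return secondsFor(operations) / kSecondsPerYear;
}
```

For n = 10^6, `secondsFor(opsNLogN(1e6))` is about 0.02 seconds, `secondsFor(opsQuadratic(1e6))` is about 1000 seconds (roughly 17 minutes), and `yearsFor(opsCubic(1e6))` is about 31.7 years. An exponential algorithm is out of reach even for n = 100, where 2^100 operations would take on the order of 10^13 years.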

Graph Problems Often face Combinatorial Explosion

Quick Review of Algorithm Analysis • Two algorithms for computing the Factorial • Which one is better? • int factorial (int n) { if (n <= 1) return 1; else return n * factorial(n-1); } • int factorial (int n) { if (n <= 1) return 1; else { int fact = 1; for (int k = 2; k <= n; k++) fact *= k; return fact; } }

A More Challenging Algorithm to Analyze • int main () { int x = 3; for ( ; ; ) { for (int a = 1; a <= x; a++) for (int b = 1; b <= x; b++) for (int c = 1; c <= x; c++) for (int i = 3; i <= x; i++) if (pow(a,i) + pow(b,i) == pow(c,i)) exit(0); x++; } }

Max Subsequence Problem • Given a sequence of integers A1, A2, …, An, find the maximum possible sum of a contiguous subsequence Ai, …, Aj. • Numbers can be negative. • You want a contiguous chunk with the largest sum. • Example: 4, 3, -8, 2, 6, -4, 2, 8, 6, -5, 8, -2, 7, -9, 4, -1, 5 • While not a data structure problem, it is an excellent pedagogical exercise for design, correctness proof, and runtime analysis of algorithms

Max Subsequence Problem • Given a sequence of integers A1, A2, …, An, find the maximum possible sum of a contiguous subsequence Ai, …, Aj. • Example: 4, 3, -8, 2, 6, -4, 2, 8, 6, -5, 8, -2, 7, -9, 4, -1, 5 • We will discuss 4 different algorithms, of time complexity O(n^3), O(n^2), O(n log n), and O(n). • With n = 10^6, Algorithm 1 may take > 10 years; Algorithm 4 will take a fraction of a second!

Algorithm 1 for Max Subsequence Sum • Given A_1, …, A_n, find the maximum value of A_i + A_{i+1} + ··· + A_j • Return 0 if the max value is negative

Algorithm 1 for Max Subsequence Sum • Given A_1, …, A_n, find the maximum value of A_i + A_{i+1} + ··· + A_j • Return 0 if the max value is negative • int maxSum = 0; for( int i = 0; i < a.size( ); i++ ) for( int j = i; j < a.size( ); j++ ) { int thisSum = 0; for( int k = i; k <= j; k++ ) thisSum += a[ k ]; if( thisSum > maxSum ) maxSum = thisSum; } return maxSum; • Time complexity: O(n^3)
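A self-contained C++ version of Algorithm 1 is sketched below, wrapped in a function so it can be compiled and called. The function names and the `std::vector<int>` parameter are assumptions made for this example. The second function is one plausible way to reach the O(n^2) bound mentioned earlier, by extending the running sum instead of recomputing it; it is offered as an assumption and may differ from the quadratic algorithm presented in lecture.

```cpp
#include <cstddef>
#include <vector>

// Algorithm 1: try every pair (i, j) and re-add a[i..j] from scratch.
// Three nested loops give O(n^3) time.
int maxSubSumCubic(const std::vector<int>& a) {
    int maxSum = 0;  // the empty subsequence has sum 0
    for (std::size_t i = 0; i < a.size(); i++) {
        for (std::size_t j = i; j < a.size(); j++) {
            int thisSum = 0;
            for (std::size_t k = i; k <= j; k++) {
                thisSum += a[k];
            }
            if (thisSum > maxSum) maxSum = thisSum;
        }
    }
    return maxSum;
}

// Assumed quadratic variant: for a fixed i, the sum of a[i..j] is the
// sum of a[i..j-1] plus a[j], so the innermost loop is unnecessary.
int maxSubSumQuadratic(const std::vector<int>& a) {
    int maxSum = 0;
    for (std::size_t i = 0; i < a.size(); i++) {
        int thisSum = 0;
        for (std::size_t j = i; j < a.size(); j++) {
            thisSum += a[j];
            if (thisSum > maxSum) maxSum = thisSum;
        }
    }
    return maxSum;
}
```

On the example sequence 4, 3, -8, 2, 6, -4, 2, 8, 6, -5, 8, -2, 7, -9, 4, -1, 5, both functions return 28, the sum of the chunk 2, 6, -4, 2, 8, 6, -5, 8, -2, 7. On an all-negative input both return 0.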
