690 likes | 876 Views
Sublinear Algorihms for Big Data. Lecture 1. Grigory Yaroslavtsev http://grigory.us. Part 0: Introduction. Disclaimers Logistics Materials …. Name. Correct: Grigory Gregory (easiest and highly recommended!) Also correct: Dr. Yaroslavtsev (I bet it’s difficult to pronounce)
E N D
SublinearAlgorihms for Big Data Lecture 1 GrigoryYaroslavtsev http://grigory.us
Part 0: Introduction • Disclaimers • Logistics • Materials • …
Name Correct: • Grigory • Gregory (easiest and highly recommended!) Also correct: • Dr. Yaroslavtsev (I bet it’s difficult to pronounce) Wrong: • Prof. Yaroslavtsev (Not any easier)
Disclaimers • A lot of Math!
Disclaimers • No programming!
Disclaimers • 10-15 times longer than “FuerzaBruta”, soccer game, milonga…
Big Data • Data • Programming and Systems • Algorithms • Probability and Statistics
Sublinear Algorithms = size of the data, we want , not • Sublinear Time • Queries • Samples • Sublinear Space • Data Streams • Sketching • Distributed Algorithms • Local and distributed computations • MapReduce-style algorithms
Why is it useful? • Algorithms for big data used by big companies (ultra-fast (randomized algorithms for approximate decision making) • Networking applications (counting and detecting patterns in small space) • Distributed computations (small sketches to reduce communication overheads) • Aggregate Knowledge: startup doing streaming algorithms, acquired for $150M • Today: Applications to soccer
Course Materials • Will be posted at the class homepage: http://grigory.us/big-data.html • Related and further reading: • Sublinear Algorithms (MIT) by Indyk, Rubinfeld • Algorithms for Big Data (Harvard) by Nelson • Data Stream Algorithms (University of Massachusetts) by McGregor • Sublinear Algorithms (Penn State) by Raskhodnikova
Course Overview • Lecture 1 • Lecture 2 • Lecture 3 • Lecture 4 • Lecture 5 3 hours = 3 x (45-50 min lecture + 10-15 min break).
Puzzles You see a sequence of values , arriving one by one: • (Easy, “Find a missing player”) • If all are different and have values between and , which value is missing? • You have space • Example: • There are 11 soccer players with numbers 1, …, 11. • You see 10 of them one by one, which one is missing? You can only remember a single number.
Puzzle #1 You see a sequence of values , arriving one by one: • (Easy, “Find a missing player”) • If all are different and have values between and , which value is missing? • You have space • Example: • There are 11 soccer players with numbers 1, …, 11. • You see 10 of them one by one, which one is missing? You can only remember a single number.
Puzzle #2 You see a sequence of values , arriving one by one: • (Harder, “Keep a random team”) • How can you maintain a uniformly random sample of values out of those you have seen so far? • You can store exactly items at any time • Example: • You want to have a team of 11 players randomly chosen from the set you have seen. • Players arrive one at a time and you have to decide whether to keep them or not.
Puzzle #3 You see a sequence of values , arriving one by one: • (Very hard, “Count the number of players”) • What is the total number of values up to error • You have space and can be completely wrong with some small probability
Puzzles You see a sequence of values , arriving one by one: • (Easy, “Find a missing player”) • If all are different and have values between and , which value is missing? • You have space • (Harder, “Keep a random team”) • How can you maintain a uniformly random sample of values out of those you have seen so far? • You can store exactly items at any time • (Very hard, “Count the number of players”) • What is the total number of values up to error • You have space and can be completely wrong with some small probability
Part 1: Probability 101 “The bigger the data the better you should know your Probability” • Basic Spanish: Hola, Gracias, Bueno, Por favor, Bebida, Comida, Jamon, Queso, Gringo, Chica, Amigo, … • Basic Probability: • Probability, events, random variables • Expectation, variance / standard deviation • Conditional probability, independence, pairwise independence, mutual independence
Expectation • = random variable with values • Expectation 𝔼 • Properties (linearity): • Useful fact: if all and integer then 𝔼
Expectation • Example: dice has values with probability 1/6 [Value]
Variance • Variance • is some fixed value (a constant) • Corollary:
Variance • Example (Variance of a fair dice): [ =[ +
Independence • Two random variables and areindependent if and only if (iff) for every : • Variables are mutually independentiff • Variables are pairwise independentiff for all pairs i,j
Independence: Example Independent or not? • Event Argentina wins the World Cup • Event Messi becomes the best striker Independent or not? • Event Argentina wins against Netherlands in the semifinals • Event Germany wins against Brazil in the semifinals
Independence: Example • Ratings of mortgage securities • AAA = 1% probability of default (over X years) • AA = 2% probability of default • A = 5% probability of default • B = 10% probability of default • C = 50% probability of default • D = 100% probability of default • You are a portfolio holder with 1000 AAA securities? • Are they all independent? • Is the probability of default
Conditional Probabilities • For two events and : • If two random variables (r.vs) are independent (by definition) (by independence)
Union Bound For any events : • Pro: Works even for dependent variables! • Con: Sometimes very loose, especially for mutually independent events
Union Bound: Example Events “Argentina wins the World Cup” and “Messi becomes the best striker” are not independent, but: Pr[“Argentina wins the World Cup” or “Messi becomes the best striker”] Pr[“Argentina wins the World Cup”] + Pr[“Messi becomes the best striker”]
Independence and Linearity of Expectation/Variance • Linearity of expectation (even for dependent variables!): • Linearity of variance (only for pairwise independent variables!)
Part 2: Inequalities • Markov inequality • Chebyshev inequality • Chernoff bound
Markov’s Inequality • For every • Proof (by contradiction) (by definition) (pick only some i’s) >(by assumption )
Markov’s Inequality • For every • Corollary : For every • Pro: always works! • Cons: • Not very precise • Doesn’t work for the lower tail:
Markov Inequality: Example Markov 1: For every • Example:
Markov Inequality: Example Markov 2: For every • Example:
Markov Inequality: Example Markov 2: For every • :
Markov + Union Bound: Example Markov 2: For every • Example: 1.75
Chebyshev’s Inequality • For every : • Proof: (by squaring) (by Markov’s inequality)
Chebyshev’s Inequality • For every : • Corollary (): For every :
Chebyshev: Example • Roll a dice 10 times: = Average value over 10 rolls • = value of the i-th roll, • Variance ( by linearity for independentr.vs):