
  1. Sublinear Algorithms for Big Data. Lecture 1. Grigory Yaroslavtsev, http://grigory.us

  2. Part 0: Introduction • Disclaimers • Logistics • Materials • …

  3. Name Correct: • Grigory • Gregory (easiest and highly recommended!) Also correct: • Dr. Yaroslavtsev (I bet it’s difficult to pronounce) Wrong: • Prof. Yaroslavtsev (Not any easier)

  4. Disclaimers • A lot of Math!

  5. Disclaimers • No programming!

  6. Disclaimers • 10-15 times longer than “Fuerza Bruta”, a soccer game, a milonga…

  7. Big Data • Data • Programming and Systems • Algorithms • Probability and Statistics

  8. Sublinear Algorithms • n = size of the data; we want o(n), not O(n) • Sublinear Time • Queries • Samples • Sublinear Space • Data Streams • Sketching • Distributed Algorithms • Local and distributed computations • MapReduce-style algorithms

  9. Why is it useful? • Algorithms for big data used by big companies (ultra-fast randomized algorithms for approximate decision making) • Networking applications (counting and detecting patterns in small space) • Distributed computations (small sketches to reduce communication overheads) • Aggregate Knowledge: a startup doing streaming algorithms, acquired for $150M • Today: applications to soccer

  10. Course Materials • Will be posted at the class homepage: http://grigory.us/big-data.html • Related and further reading: • Sublinear Algorithms (MIT) by Indyk, Rubinfeld • Algorithms for Big Data (Harvard) by Nelson • Data Stream Algorithms (University of Massachusetts) by McGregor • Sublinear Algorithms (Penn State) by Raskhodnikova

  11. Course Overview • Lecture 1 • Lecture 2 • Lecture 3 • Lecture 4 • Lecture 5 3 hours = 3 x (45-50 min lecture + 10-15 min break).

  12. Puzzles • You see a sequence of n values x_1, …, x_n, arriving one by one: • (Easy, “Find a missing player”) • If all x_i are different and have values between 1 and n+1, which value is missing? • You have O(log n) space • Example: • There are 11 soccer players with numbers 1, …, 11. • You see 10 of them one by one; which one is missing? You can only remember a single number.

  13. 1

  14. 8

  15. 5

  16. 11

  17. 3

  18. 9

  19. 2

  20. 6

  21. 7

  22. 4

  23. Which number was missing?

  24. Puzzle #1 • You see a sequence of n values x_1, …, x_n, arriving one by one: • (Easy, “Find a missing player”) • If all x_i are different and have values between 1 and n+1, which value is missing? • You have O(log n) space • Example: • There are 11 soccer players with numbers 1, …, 11. • You see 10 of them one by one; which one is missing? You can only remember a single number.
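The standard trick here is to keep a single running sum: the missing value is the known total 1 + 2 + … + (n+1) minus the sum of the values seen, so on the sequence above (1, 8, 5, 11, 3, 9, 2, 6, 7, 4) it is 66 - 56 = 10. A minimal sketch in Python (the function name is just for illustration):

    def find_missing(stream, n_plus_1):
        # One pass, remembering only a single number (the running sum).
        total = n_plus_1 * (n_plus_1 + 1) // 2   # 1 + 2 + ... + (n+1)
        seen = 0
        for x in stream:
            seen += x
        return total - seen

    print(find_missing([1, 8, 5, 11, 3, 9, 2, 6, 7, 4], 11))  # -> 10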

  25. Puzzle #2 • You see a sequence of n values x_1, …, x_n, arriving one by one: • (Harder, “Keep a random team”) • How can you maintain a uniformly random sample S of s values out of those you have seen so far? • You can store exactly s items at any time • Example: • You want to have a team of 11 players randomly chosen from the set you have seen. • Players arrive one at a time and you have to decide whether to keep them or not.
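The classic solution to this kind of puzzle is reservoir sampling: keep the first s items, then let the i-th item enter with probability s/i, replacing a uniformly random slot. A minimal sketch in Python (the function name and toy stream are just for illustration):

    import random

    def reservoir_sample(stream, s):
        # Maintains a uniform random sample of s items while storing only s items.
        sample = []
        for i, x in enumerate(stream, start=1):
            if i <= s:
                sample.append(x)            # keep the first s items
            else:
                j = random.randint(1, i)    # uniform in 1..i
                if j <= s:                  # happens with probability s/i
                    sample[j - 1] = x       # replace a uniformly random slot
        return sample

    team = reservoir_sample(range(1, 101), 11)  # an 11-player team out of 100 candidates
    print(sorted(team))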

  26. Puzzle #3 • You see a sequence of n values x_1, …, x_n, arriving one by one: • (Very hard, “Count the number of players”) • What is the total number of values, up to a small multiplicative error (1 ± ε)? • You have O(log log n) space and can be completely wrong with some small probability
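The usual tool for approximate counting in this little space is Morris's counter: store only an exponent and increase it with probability 2^(-exponent). A minimal single-counter sketch in Python (in practice several independent counters are averaged to reduce the variance):

    import random

    class MorrisCounter:
        # Stores only the exponent, i.e. about log log n bits for a count of n.
        def __init__(self):
            self.exponent = 0

        def increment(self):
            if random.random() < 2.0 ** (-self.exponent):
                self.exponent += 1

        def estimate(self):
            return 2 ** self.exponent - 1    # unbiased estimate of the true count

    c = MorrisCounter()
    for _ in range(10_000):
        c.increment()
    print(c.estimate())  # a noisy estimate of 10,000; averaging counters sharpens it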

  27. Puzzles • You see a sequence of n values x_1, …, x_n, arriving one by one: • (Easy, “Find a missing player”) • If all x_i are different and have values between 1 and n+1, which value is missing? • You have O(log n) space • (Harder, “Keep a random team”) • How can you maintain a uniformly random sample S of s values out of those you have seen so far? • You can store exactly s items at any time • (Very hard, “Count the number of players”) • What is the total number of values, up to a small multiplicative error (1 ± ε)? • You have O(log log n) space and can be completely wrong with some small probability

  28. Part 1: Probability 101 “The bigger the data the better you should know your Probability” • Basic Spanish: Hola, Gracias, Bueno, Por favor, Bebida, Comida, Jamon, Queso, Gringo, Chica, Amigo, … • Basic Probability: • Probability, events, random variables • Expectation, variance / standard deviation • Conditional probability, independence, pairwise independence, mutual independence

  29. Expectation • X = random variable with values x_1, …, x_n • Expectation E[X] = Σ_i x_i · Pr[X = x_i] • Properties (linearity): E[X + Y] = E[X] + E[Y], E[cX] = c · E[X] • Useful fact: if all x_i ≥ 0 and integer then E[X] = Σ_{i ≥ 1} Pr[X ≥ i]
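As a quick check of the last identity for a fair die: Σ_{i=1..6} Pr[X ≥ i] = 6/6 + 5/6 + 4/6 + 3/6 + 2/6 + 1/6 = 21/6 = 3.5, which is exactly E[X].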

  30. Expectation • Example: a die takes values 1, 2, …, 6, each with probability 1/6 • E[Value] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5

  31. Variance • Variance Var[X] = E[(X - E[X])^2] • E[X] is some fixed value (a constant) • Corollary: Var[X] = E[X^2] - E[X]^2

  32. Variance • Example (variance of a fair die): Var[X] = E[X^2] - E[X]^2 • E[X^2] = (1 + 4 + 9 + 16 + 25 + 36)/6 = 91/6 • Var[X] = 91/6 - 3.5^2 = 35/12 ≈ 2.92
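A minimal Python check of these two numbers, using exact fractions:

    from fractions import Fraction

    p = Fraction(1, 6)                    # fair die: each face has probability 1/6
    values = range(1, 7)

    E = sum(p * v for v in values)        # expectation E[X]
    E2 = sum(p * v * v for v in values)   # E[X^2]
    var = E2 - E ** 2                     # Var[X] = E[X^2] - E[X]^2

    print(E, float(E))                    # 7/2 = 3.5
    print(var, float(var))                # 35/12 ≈ 2.92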

  33. Independence • Two random variables X and Y are independent if and only if (iff) for every a, b: Pr[X = a, Y = b] = Pr[X = a] · Pr[Y = b] • Variables X_1, …, X_n are mutually independent iff Pr[X_1 = a_1, …, X_n = a_n] = Pr[X_1 = a_1] ⋯ Pr[X_n = a_n] • Variables X_1, …, X_n are pairwise independent iff for all pairs i, j: Pr[X_i = a_i, X_j = a_j] = Pr[X_i = a_i] · Pr[X_j = a_j]
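Pairwise independence is strictly weaker than mutual independence. A minimal sketch of the classic separating example, assuming two independent fair bits and their XOR:

    from itertools import product

    # X, Y are independent fair bits; Z = X XOR Y. Each of the 4 outcomes has probability 1/4.
    outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]

    def pr(event):
        return sum(1 for o in outcomes if event(o)) / len(outcomes)

    # Pairwise independent: Pr[X = 1, Z = 1] = 1/4 = Pr[X = 1] * Pr[Z = 1]
    print(pr(lambda o: o[0] == 1 and o[2] == 1),
          pr(lambda o: o[0] == 1) * pr(lambda o: o[2] == 1))

    # Not mutually independent: Pr[X = 1, Y = 1, Z = 1] = 0, but the product of marginals is 1/8
    print(pr(lambda o: o == (1, 1, 1)),
          pr(lambda o: o[0] == 1) * pr(lambda o: o[1] == 1) * pr(lambda o: o[2] == 1))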

  34. Independence: Example Independent or not? • Event Argentina wins the World Cup • Event Messi becomes the best striker Independent or not? • Event Argentina wins against Netherlands in the semifinals • Event Germany wins against Brazil in the semifinals

  35. Independence: Example • Ratings of mortgage securities • AAA = 1% probability of default (over X years) • AA = 2% probability of default • A = 5% probability of default • B = 10% probability of default • C = 50% probability of default • D = 100% probability of default • You are a portfolio holder with 1000 AAA securities • Are they all independent? • Is the probability that all of them default equal to 0.01^1000?

  36. Conditional Probabilities • For two events E_1 and E_2: Pr[E_2 | E_1] = Pr[E_1 and E_2] / Pr[E_1] • If two random variables (r.v.s) X_1, X_2 are independent: Pr[X_2 = a_2 | X_1 = a_1] = Pr[X_1 = a_1, X_2 = a_2] / Pr[X_1 = a_1] (by definition) = Pr[X_1 = a_1] · Pr[X_2 = a_2] / Pr[X_1 = a_1] (by independence) = Pr[X_2 = a_2]
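For a fair die this reads, for example: Pr[Value = 6 | Value is even] = Pr[Value = 6 and even] / Pr[even] = (1/6) / (1/2) = 1/3.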

  37. Union Bound • For any events E_1, …, E_k: Pr[E_1 or E_2 or … or E_k] ≤ Pr[E_1] + Pr[E_2] + … + Pr[E_k] • Pro: works even for dependent events! • Con: sometimes very loose, especially for mutually independent events

  38. Union Bound: Example • Events “Argentina wins the World Cup” and “Messi becomes the best striker” are not independent, but: Pr[“Argentina wins the World Cup” or “Messi becomes the best striker”] ≤ Pr[“Argentina wins the World Cup”] + Pr[“Messi becomes the best striker”]

  39. Independence and Linearity of Expectation/Variance • Linearity of expectation (even for dependent variables!): E[Σ_i X_i] = Σ_i E[X_i] • Linearity of variance (only for pairwise independent variables!): Var[Σ_i X_i] = Σ_i Var[X_i]
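For example, for two independent fair dice X and Y, Var[X + Y] = Var[X] + Var[Y] = 35/12 + 35/12 ≈ 5.83; but for the same die counted twice, Var[X + X] = Var[2X] = 4 · Var[X] ≈ 11.67, since X is certainly not independent of itself.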

  40. Part 2: Inequalities • Markov inequality • Chebyshev inequality • Chernoff bound

  41. Markov’s Inequality • For every c > 0 and X ≥ 0: Pr[X ≥ c · E[X]] ≤ 1/c • Proof (by contradiction): suppose Pr[X ≥ c · E[X]] > 1/c. Then E[X] = Σ_i x_i · Pr[X = x_i] (by definition) ≥ Σ_{i: x_i ≥ c·E[X]} x_i · Pr[X = x_i] (pick only some i’s) ≥ c · E[X] · Pr[X ≥ c · E[X]] > c · E[X] · (1/c) = E[X] (by assumption), a contradiction.

  42. Markov’s Inequality • For every c > 0 and X ≥ 0: Pr[X ≥ c · E[X]] ≤ 1/c • Corollary: for every c > 0: Pr[X ≥ c] ≤ E[X]/c • Pro: always works (as long as X ≥ 0)! • Cons: • Not very precise • Doesn’t work for the lower tail (bounding Pr[X ≤ c])
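For instance, for a single fair die with E[Value] = 3.5, the corollary gives Pr[Value ≥ 6] ≤ 3.5/6 ≈ 0.58, while the true probability is 1/6 ≈ 0.17, which shows how loose the bound can be.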

  43. Markov Inequality: Example • Markov 1: for every c > 0: Pr[X ≥ c · E[X]] ≤ 1/c

  44. Markov Inequality: Example • Markov 2: for every c > 0 and X ≥ 0: Pr[X ≥ c] ≤ E[X]/c

  45. Markov Inequality: Example • Markov 2: for every c > 0 and X ≥ 0: Pr[X ≥ c] ≤ E[X]/c

  46. Markov + Union Bound: Example • Markov 2: for every c > 0 and X ≥ 0: Pr[X ≥ c] ≤ E[X]/c • Example: 1.75

  47. Chebyshev’s Inequality • For every c > 0: Pr[|X - E[X]| ≥ c] ≤ Var[X]/c^2 • Proof: Pr[|X - E[X]| ≥ c] = Pr[(X - E[X])^2 ≥ c^2] (by squaring) ≤ E[(X - E[X])^2]/c^2 (by Markov’s inequality) = Var[X]/c^2

  48. Chebyshev’s Inequality • For every c > 0: Pr[|X - E[X]| ≥ c] ≤ Var[X]/c^2 • Corollary (with σ[X] = √Var[X], the standard deviation): for every c > 0: Pr[|X - E[X]| ≥ c · σ[X]] ≤ 1/c^2
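For a single fair die, Var[Value] = 35/12 ≈ 2.92, so Chebyshev gives Pr[|Value - 3.5| ≥ 2.5] ≤ 2.92/2.5^2 ≈ 0.47, while the true probability (Value ∈ {1, 6}) is 2/6 ≈ 0.33.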

  49. Chebyshev: Example

  50. Chebyshev: Example • Roll a die 10 times: Z = average value over the 10 rolls • Z_i = value of the i-th roll, Z = (1/10) · Σ_i Z_i, E[Z] = 3.5 • Variance (by linearity for independent r.v.s): Var[Z] = (1/10)^2 · Σ_i Var[Z_i] = Var[Z_1]/10 = 35/120 ≈ 0.29
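A minimal simulation sketch in Python of this example, comparing the empirical tail of Z with the Chebyshev bound (the threshold and number of trials are chosen arbitrarily for illustration):

    import random

    TRIALS = 100_000
    C = 1.0                            # deviation threshold
    VAR_Z = (35 / 12) / 10             # Var[Z] = Var[one roll] / 10 ≈ 0.29

    def avg_of_rolls(k=10):
        # average of k independent fair die rolls
        return sum(random.randint(1, 6) for _ in range(k)) / k

    tail = sum(abs(avg_of_rolls() - 3.5) >= C for _ in range(TRIALS)) / TRIALS
    print("empirical Pr[|Z - 3.5| >= 1]:", tail)          # well below the bound
    print("Chebyshev bound Var[Z]/C^2:", VAR_Z / C ** 2)  # ≈ 0.29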
