From Load Balancing to Data Streams: Randomized Algorithms in Computer Science
Thomas Sauerwald, Max Planck Institute for Informatics, Saarbrücken, Germany
January 28th, 2013
I. INTRODUCING MYSELF
Short CV
• 2005: Diploma in Mathematics, Paderborn
• 2008: PhD in Computer Science, Paderborn
• 2009-2010: Postdoc in Berkeley and Vancouver
• Since 2010: Researcher at the Max Planck Institute for Informatics
• Since 2012: Research Group Leader at the Cluster of Excellence “Efficient Algorithms for Massive Graphs”
Research Interests
• Randomized Algorithms
• Discrete Mathematics
• Graph Theory
• Networks
What are Efficient Algorithms?
Answers in 2000s
• 2.2 billion internet users, 100 billion web pages, 1 billion Facebook accounts
• There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS, IMs.
• IP network traffic: up to 1 billion packets per hour per router. Each ISP has many (hundreds of) routers!
• GenBank (database for DNA sequences) contains 200 million entries with 300 billion DNA bases
What if the algorithm takes n^4 steps and n = 10 billion?
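To put the closing question in numbers, here is a back-of-the-envelope calculation. The machine speed of 10^9 steps per second is an assumption chosen for illustration; it does not come from the talk.

```python
# Rough cost of an n^4-step algorithm versus a linear one on n = 10 billion items.
# ASSUMPTION: the machine executes 10**9 elementary steps per second.
n = 10 * 10**9
steps_per_second = 10**9            # hypothetical machine speed
seconds_per_year = 365 * 24 * 3600

quartic_steps = n**4                # 10**40 steps
quartic_years = quartic_steps // steps_per_second // seconds_per_year
linear_seconds = n // steps_per_second

print(f"n^4 algorithm: about {quartic_years:.1e} years")
print(f"linear algorithm: {linear_seconds} seconds")
```

Under this assumption the quartic algorithm needs more than 10^23 years, while a linear-time algorithm finishes in seconds.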
What are Efficient Algorithms?
Answers in 2000s
Massive Networks & Huge Data Sets
• Super-linear algorithms are too slow
• Storing the whole network/data set is not possible
• Randomized and distributed algorithms are needed; resort to approximate solutions
Randomized Algorithms on Massive Graphs
• Load Balancing
• Massive Graphs
• Counting in Data Streams
Discrete Load Balancing
[Figure panels: Round #1, Round #2]
Discrepancy
Discrepancy = Max-Load – Min-Load
• Nodes can only communicate with neighbors
• Nodes do not know the network structure
• Goal: Find a local protocol which
  • Achieves low discrepancy
  • Requires small runtime
Matching Model vs. Diffusion
In this talk we focus on the matching model.
Load Balancing Protocol
The protocol was already proposed in 1998, but without theoretical analysis...
Protocol
• For every round t = 1, 2, …
  • Generate a random matching
  • Matched vertices average their load
Randomized Rounding
• Two matched vertices with loads 4 and 7 end up with loads 6 and 5 w.p. 1/2, or with loads 5 and 6 w.p. 1/2
Random Matching
• Every vertex is active or passive with probability 1/2
• Every active vertex sends a proposal to a randomly chosen neighbor
• Every passive vertex that receives exactly one proposal is included in the matching (with the sender)
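The protocol on this slide can be sketched in a few lines. This is a minimal simulation under stated assumptions, not the implementation from the talk: the cycle graph, the initial loads, the seed, and the function names `random_matching`, `balance_round` and `discrepancy` are all chosen here for illustration, and proposals sent to active vertices are assumed to be ignored.

```python
import random

def random_matching(adj, rng):
    """Active vertices propose to a random neighbor; a passive vertex that
    receives exactly one proposal is matched with the sender."""
    active = {v: rng.random() < 0.5 for v in adj}
    proposals = {}                       # passive receiver -> list of senders
    for v in adj:
        if active[v] and adj[v]:
            u = rng.choice(adj[v])
            if not active[u]:            # ASSUMPTION: active receivers ignore proposals
                proposals.setdefault(u, []).append(v)
    return [(senders[0], u) for u, senders in proposals.items() if len(senders) == 1]

def balance_round(load, adj, rng):
    """One round: matched vertices average their load with randomized rounding."""
    for u, v in random_matching(adj, rng):
        total = load[u] + load[v]
        low, high = total // 2, total - total // 2
        # The excess token (if any) ends up at either endpoint w.p. 1/2.
        if rng.random() < 0.5:
            load[u], load[v] = low, high
        else:
            load[u], load[v] = high, low

def discrepancy(load):
    return max(load.values()) - min(load.values())

rng = random.Random(0)
n = 16
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}   # cycle, an arbitrary example
load = {v: rng.randint(0, 100) for v in range(n)}
initial_total = sum(load.values())
initial_disc = discrepancy(load)
for t in range(200):
    balance_round(load, adj, rng)
print(initial_disc, "->", discrepancy(load))
```

Tokens are only moved between matched vertices, so the total load is conserved, and averaging never raises the maximum or lowers the minimum, so the discrepancy cannot increase from round to round.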
Deterministic vs. Randomized Rounding
• Randomized Rounding: loads 4 and 7 become 6 and 5 w.p. 1/2, or 5 and 6 w.p. 1/2
• Deterministic Rounding: loads 4 and 7 become 5 and 6
The Problem with Deterministic Rounding Problem: Discrepancy cannot be reduced below the diameter!
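A small example makes the obstacle concrete. The sketch below assumes that deterministic rounding leaves the excess token with the heavier vertex (as in the 4, 7 → 5, 6 example); the path graph and the name `deterministic_average` are chosen here for illustration. If vertex i of a path holds i tokens, every edge sees a load difference of exactly 1, so no averaging step changes anything and the discrepancy stays equal to the diameter.

```python
def deterministic_average(a, b):
    """Average two loads; the excess token stays with the heavier vertex
    (ASSUMED rounding rule, matching the 4, 7 -> 5, 6 example)."""
    total = a + b
    low, high = total // 2, total - total // 2
    return (low, high) if a <= b else (high, low)

n = 8
load = list(range(n))            # vertex i on a path holds i tokens
diameter = n - 1

# Averaging along any edge (u, u+1) of the path leaves both loads unchanged.
for u in range(n - 1):
    assert deterministic_average(load[u], load[u + 1]) == (load[u], load[u + 1])

stuck_discrepancy = max(load) - min(load)
print(stuck_discrepancy, diameter)   # prints "7 7"
```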
Parameters
• K: discrepancy of the initial load vector
• λ: second largest eigenvalue of the adjacency matrix, which determines the spectral expansion of the graph
• n: number of nodes in the graph G
The smaller λ, the better connected the network
Progress
In the continuous case, the time to reach constant discrepancy is: [...]
This is at least the diameter of the graph!
Parameters:
• K: discrepancy of the initial load vector
• λ: second largest eigenvalue, which determines the spectral expansion of the graph
Deterministic Rounding (Rabani, Sinclair, Wanka, FOCS 1998)
For any graph, the discrepancy is [...] after [...] rounds.
Randomized Rounding (Sun, S., FOCS 2012)
For any graph, the discrepancy is constant after [...] rounds.
Continuous Case
• Linear system
• Corresponds to random walks
• Well-understood
Discrete Case
• Non-linear system
• Perturbation of a linear system
• More realistic
Much more difficult to analyze!
Theorem: For any initial load vector with discrepancy at most K, the discrepancy is at most [...] after [...] rounds.
How well do we perform in practice?
Randomized Rounding (Sun, S., FOCS 2012)
For any graph, the discrepancy is constant after [...] rounds.
This constant is very large (much larger than 1000)!
How does the protocol perform in practice?
Study of a Concrete Network: Hypercube
• Network is a d-dimensional hypercube with 2^d nodes
• Every node has a random number of tasks between 1 and 100,000
• Run protocol for exactly d rounds
• For every initial load distribution
  • 1 run with deterministic rounding
  • 10 runs with randomized rounding
Deterministic & Randomized Rounding (Herlihy, Tirthapura, JPDC 2006)
Randomized Rounding balances the load vector almost perfectly, in contrast to Deterministic Rounding.
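An experiment of this kind can be sketched as follows. Several details are assumptions made here, not facts from the talk: round i is taken to match every node with its neighbor across dimension i (which would explain running for exactly d rounds), deterministic rounding is taken to leave the excess token with the heavier node, and d = 10, the seed and the function name `run_hypercube` are arbitrary.

```python
import random

def run_hypercube(d, initial, randomized, rng):
    """Balance once along each dimension 0..d-1; return the final load vector."""
    load = list(initial)
    for i in range(d):
        for u in range(2 ** d):
            v = u ^ (1 << i)                  # neighbor of u across dimension i
            if u < v:                         # handle each edge once
                total = load[u] + load[v]
                low, high = total // 2, total - total // 2
                if randomized:
                    high_at_u = rng.random() < 0.5
                else:
                    high_at_u = load[u] > load[v]   # excess stays with heavier node
                load[u], load[v] = (high, low) if high_at_u else (low, high)
    return load

def discrepancy(load):
    return max(load) - min(load)

rng = random.Random(1)
d = 10
initial = [rng.randint(1, 100_000) for _ in range(2 ** d)]
det_load = run_hypercube(d, initial, randomized=False, rng=rng)
rand_loads = [run_hypercube(d, initial, randomized=True, rng=rng) for _ in range(10)]

print("initial discrepancy:", discrepancy(initial))
print("deterministic rounding:", discrepancy(det_load))
print("randomized rounding (10 runs):", [discrepancy(l) for l in rand_loads])
```

In this dimension-by-dimension variant the final discrepancy is at most d for either rounding rule, so the interesting comparison is how far below d each rule ends up.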
Experimental Results
[Plot: discrepancy (y-axis) against dimension (x-axis), one curve for deterministic rounding and one for randomized rounding]
Initial discrepancy of the load vector is approx. 100,000!
Load Balancing: Summary
Main Result
For any graph, the discrepancy is constant after [...] rounds.
Further Research
• Can we prove a discrepancy bound of 2? (recall experimental results!)
• Processors with different speeds/jobs with different weights
Information Gathering: Streaming Algorithms
• Fundamental problem of gathering information: too much information to store
• Need to process data as it arrives: one pass, small space, data stream model.
• Approximate answers, since exact computation is not possible.
Finding Missing Elements
Input: [...]
Problem: [...]
• Naive algorithm would need [...] bits of space!
• Can be implemented in [...] space!
• What if not one but two numbers are missing?
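The input and problem statements did not survive on this slide, so the sketch below assumes the classic setup: the stream contains every number from 1 to n exactly once, in arbitrary order, except for the missing ones. Under that assumption a single running sum (O(log n) bits) finds one missing number, whereas a bit vector would need n bits. For two missing numbers, one standard approach (not necessarily the one presented in the talk) additionally tracks the sum of squares. The function names are chosen here for illustration.

```python
import random
from math import isqrt

def find_missing(stream, n):
    """One pass, one counter: return the number from 1..n absent from the stream."""
    remaining = n * (n + 1) // 2              # sum of 1..n
    for x in stream:
        remaining -= x
    return remaining

def find_two_missing(stream, n):
    """One pass, two counters: return both numbers from 1..n absent from the stream."""
    s = n * (n + 1) // 2                      # will become a + b
    q = n * (n + 1) * (2 * n + 1) // 6        # will become a^2 + b^2
    for x in stream:
        s -= x
        q -= x * x
    diff = isqrt(2 * q - s * s)               # (a - b)^2 = 2(a^2 + b^2) - (a + b)^2
    return (s - diff) // 2, (s + diff) // 2

n = 1000
numbers = [x for x in range(1, n + 1) if x != 437]
random.Random(3).shuffle(numbers)
print(find_missing(iter(numbers), n))         # prints 437

numbers = [x for x in range(1, 21) if x not in (3, 10)]
print(find_two_missing(iter(numbers), 20))    # prints (3, 10)
```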
Problem & Motivation
• Internet traffic analysis, monitoring database queries
• Let n be the number of items (e.g., IP addresses)
• Goal: Approximate F[i], the number of times address i occurs in the stream
Cormode, Muthukrishnan, JALG 2004
An Important Tool: Hash Function
• U is the universe (very large set)
• S is the memory (small set)
• Seek a random function from U to S such that [...]
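The required property of the hash function is not legible on this slide. A common choice in this setting is a universal family, where two distinct keys collide with probability at most 1/|S|; the sketch below assumes that property and uses the standard construction h(x) = ((a·x + b) mod p) mod m. The prime `P`, the name `make_hash`, the keys and the seed are all assumptions made here for illustration.

```python
import random

P = 2**61 - 1   # a Mersenne prime; ASSUMPTION: every key in U is an integer below P

def make_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m at random; m plays the role of |S|."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

rng = random.Random(42)
m = 16
x, y = 123456789, 987654321          # two arbitrary distinct keys
trials = 20_000
collisions = 0
for _ in range(trials):
    h = make_hash(m, rng)            # fresh random function each trial
    if h(x) == h(y):
        collisions += 1
print("empirical collision rate:", collisions / trials, "bound 1/m:", 1 / m)
```

The function is described by just two numbers a and b, so storing it costs far less than a truly random table over U.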
Count Min Sketch: Example
How many [...]?
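Since the worked example on this slide was a figure, here is a minimal sketch of the data structure in the style of Cormode and Muthukrishnan. The parameter choices (width ⌈e/ε⌉, depth ⌈ln(1/δ)⌉), the linear hash family, the integer keys and the synthetic stream are assumptions made here, not details taken from the slide.

```python
import math
import random
from collections import Counter

P = 2**61 - 1   # prime modulus for the hash family; keys are assumed to be integers < P

class CountMinSketch:
    def __init__(self, eps, delta, rng):
        self.width = math.ceil(math.e / eps)          # counters per row
        self.depth = math.ceil(math.log(1 / delta))   # number of rows / hash functions
        self.table = [[0] * self.width for _ in range(self.depth)]
        # One independently drawn hash function (a, b) per row.
        self.hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(self.depth)]

    def _cell(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % P) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._cell(row, item)] += count

    def estimate(self, item):
        # Every row overestimates F[item]; the minimum is the tightest of them.
        return min(self.table[row][self._cell(row, item)] for row in range(self.depth))

rng = random.Random(0)
stream = [rng.randint(0, 999) for _ in range(9000)] + [7] * 1000   # item 7 is frequent
rng.shuffle(stream)
sketch = CountMinSketch(eps=0.01, delta=0.01, rng=rng)
for item in stream:
    sketch.add(item)
exact = Counter(stream)
print("item 7: exact", exact[7], "estimate", sketch.estimate(7))
```

With non-negative updates the estimate never falls below the true count; the error is one-sided and comes only from other items hashing to the same cells.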
Analysis of Count Min Sketch
This step requires that the random hash functions are independent!
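The derivation itself is not legible on this slide. The standard argument for the Count Min Sketch runs as follows; the notation (w columns, d rows, N for the total count, ε and δ for the error and failure parameters) is assumed here and may differ from the talk. The last line is the step that uses independence of the hash functions across rows.

```latex
% ASSUMED notation: w columns, d rows, N = \sum_k F[k] with all F[k] \ge 0,
% hash functions h_1, \dots, h_d : U \to \{1, \dots, w\}, row estimates \hat F_j[i].
\begin{align*}
\hat F_j[i] &= F[i] + X_{j,i},
  \qquad X_{j,i} = \sum_{k \neq i} \mathbf{1}[h_j(k) = h_j(i)] \, F[k] \;\ge\; 0, \\
\mathbb{E}[X_{j,i}] &\le \frac{N}{w} \le \frac{\varepsilon N}{e}
  \qquad \text{for } w = \lceil e/\varepsilon \rceil, \\
\Pr[X_{j,i} > \varepsilon N] &\le \frac{1}{e}
  \qquad \text{(Markov's inequality)}, \\
\Pr\Big[\min_{1 \le j \le d} \hat F_j[i] > F[i] + \varepsilon N\Big]
  &= \prod_{j=1}^{d} \Pr[X_{j,i} > \varepsilon N] \le e^{-d} \le \delta
  \qquad \text{for } d = \lceil \ln(1/\delta) \rceil.
\end{align*}
```

The additive error εN is small relative to F[i] only when item i is frequent.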
Count Min Sketch: Summary
Cormode, Muthukrishnan, JALG 2004
Approximation only useful for elements that appear very often!
Streaming Algorithms: Counting Subgraphs
Problem
Given a massive graph G and a small graph H, count the number of occurrences of H in G.
• Analyze network connectivity and transitivity
  • Watts and Strogatz, Nature ’98
  • Boril et al., ESA ’07
• Discover structural information of biological networks
  • R. Milo et al., Science ’02
  • Wong et al., Briefs in Bioinformatics ’11
• Optimize graph databases
  • X. Yan et al., SIGMOD ’04
• Community detection on social networks
  • Bordino et al., ICDM ’08
Streaming Algorithms: Counting Subgraphs
Problem
Given a massive graph G and a small graph H, count the number of occurrences of H in G.
Conditions
• Storing the whole graph G is infeasible.
• Edges of G are added or deleted over time.
Goal: Design algorithms such that
• working space is sub-linear in [...].
• For any [...], the output Z and exact answer Z* satisfy [...] with probability at least 2/3.
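To make the problem statement concrete, the sketch below computes the exact answer Z* for one particular pattern, H = triangle, while edges are inserted and deleted. It is only a baseline chosen here for illustration and is not the streaming algorithm from the talk: it stores the whole graph, which is exactly what the conditions above rule out for massive inputs. The class name `ExactTriangleCounter` is hypothetical.

```python
from collections import defaultdict

class ExactTriangleCounter:
    """Exact count of triangles under edge insertions and deletions.
    Baseline only: it keeps every edge in memory."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = 0

    def insert(self, u, v):
        if v in self.adj[u]:
            return                                        # edge already present
        # Each common neighbor of u and v closes one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete(self, u, v):
        if v not in self.adj[u]:
            return                                        # edge not present
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Each common neighbor of u and v loses one triangle.
        self.triangles -= len(self.adj[u] & self.adj[v])

counter = ExactTriangleCounter()
for u, v in [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]:   # complete graph on 4 nodes
    counter.insert(u, v)
print(counter.triangles)    # prints 4
counter.delete(0, 1)
print(counter.triangles)    # prints 2
```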
Streaming Algorithms: Counting Subgraphs
Kane, Mehlhorn, S., Sun, ICALP ’12
There is an algorithm to approximate the number of occurrences of H in G by using [...] bits of space.
Further Research
• Improve the space complexity
• Concentration theory on complex-valued polynomials
Randomized Algorithms in a nutshell
Algorithm Design: Randomization yields algorithms which are:
• Natural
• Easy-to-implement
• Robust
• Elegant
• Efficient in terms of time and space
• …
Complexity Theory: Randomization as a costly resource:
• How to generate perfect random bits?
• Derandomization
• Pseudorandomness
• Are Turing machines with random access more powerful?
• …
Thank you!