Bahman Bahmani bahman@stanford.edu Algorithm Design Meets Big Data
Outline • Fundamental Tradeoffs • Drug Interaction Example [Adapted from Ullman’s slides, 2012] • Technique I: Grouping • Similarity Search [Bahmani et al., 2012] • Technique II: Partitioning • Triangle Counting [Afrati et al., 2013; Suri et al., 2011] • Technique III: Filtering • Community Detection [Bahmani et al., 2012] • Conclusion
Drug interaction problem • 3000 drugs, 1M patients, 20 years • 1MB of data per drug • Goal: Find drug interactions
MapReduce algorithm • Map • Input: <i; Ri> • Output: <{i,j}; Ri> for any other drug j • Reduce • Input: <{i,j}; [Ri,Rj]> • Output: Interaction between drugs i,j
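The map/reduce steps above can be sketched as a single-machine simulation in Python. The `interaction` function is a toy stand-in (record overlap), not the slides' actual statistical test:

```python
from collections import defaultdict

def interaction(r1, r2):
    # Toy stand-in for the real statistical test on two drugs' records.
    return len(set(r1) & set(r2))

def map_drug(i, record, num_drugs):
    # Map: replicate drug i's record to the reducer for every pair {i, j}.
    for j in range(num_drugs):
        if j != i:
            yield (frozenset({i, j}), record)

def run(records):
    shuffle = defaultdict(list)
    for i, rec in enumerate(records):
        for key, value in map_drug(i, rec, len(records)):
            shuffle[key].append(value)
    # Reduce: each key {i, j} holds exactly the two records R_i and R_j.
    return {tuple(sorted(k)): interaction(*v) for k, v in shuffle.items()}
```

Note that every record is emitted once per other drug, which is exactly the replication the next slides identify as the bottleneck.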
Example • [Diagram: mappers for drugs 1–3 each replicate their drug's data to the reducers for pairs {1,2}, {1,3}, {2,3}]
Example • [Diagram: reducer for {1,2} receives drug 1 and drug 2 data; reducer for {1,3} receives drug 1 and drug 3 data; reducer for {2,3} receives drug 2 and drug 3 data]
Total work 4.5M pairs × 100 msec/pair ≈ 125 hours Less than 1 hour using ten 16-core nodes
All good, right? Network communication = 3000 drugs × 2999 pairs/drug × 1MB/pair ≈ 9TB Over 20 hours' worth of network traffic on a 1Gbps Ethernet
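A quick back-of-envelope check of the compute and communication numbers on these slides (d = 3000 drugs, 1MB records, 100 ms per comparison, a 1 Gbps link):

```python
# Reproduce the slides' arithmetic for the naive all-pairs algorithm.
d, record_mb = 3000, 1
pairs = d * (d - 1) // 2                       # number of drug pairs to compare
compute_hours = pairs * 0.1 / 3600             # 100 ms per comparison
traffic_tb = d * (d - 1) * record_mb / 1e6     # each record replicated d-1 times
transfer_hours = d * (d - 1) * record_mb * 8e6 / 1e9 / 3600  # at 1 Gbps
print(pairs, compute_hours, traffic_tb, transfer_hours)
```

The pair count is ~4.5M, compute is ~125 hours, shuffle traffic is ~9TB, and moving it over 1 Gbps takes ~20 hours, matching the slides.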
Improved algorithm • Group drugs • Example: 30 groups of size 100 each • G(i) = group of drug i
Improved algorithm • Map • Input: <i; Ri> • Output: <{G(i), G’}; (i, Ri)> for any other group G’ • Reduce • Input: <{G,G’}; 200 drug records in groups G,G’> • Output: All pairwise interactions between G,G’
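A minimal single-machine sketch of the grouped algorithm. The round-robin group assignment `G` is a toy choice, and within-group pairs are handled under a per-group singleton key to avoid comparing them at multiple cross-group reducers, a detail the slides leave implicit:

```python
from collections import defaultdict
from itertools import combinations

def run_grouped(records, n_groups):
    G = lambda i: i % n_groups      # toy round-robin group assignment
    shuffle = defaultdict(list)
    # Map: replicate (i, R_i) once per *other group* rather than per other drug.
    for i, rec in enumerate(records):
        for g2 in range(n_groups):
            if g2 != G(i):
                shuffle[frozenset({G(i), g2})].append((i, rec))
        # Within-group pairs are handled once, under the group's own key.
        shuffle[(G(i),)].append((i, rec))
    # Reduce: compare all pairs at this reducer; skip same-group pairs at
    # cross-group reducers (the singleton-key reducer covers them).
    results = {}
    for key, drugs in shuffle.items():
        for (i, ri), (j, rj) in combinations(drugs, 2):
            if len(key) == 2 and G(i) == G(j):
                continue
            results[(min(i, j), max(i, j))] = len(set(ri) & set(rj))
    return results
```

Every pair is still compared exactly once, but each record is now replicated n-1 times instead of d-1 times.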
Total work • Same as before • Each pair compared once
All good now? 3000 drugs × 29 replications × 1MB = 87GB Less than 15 minutes on 1Gbps Ethernet
Algorithm’s tradeoff • Assume n groups • #key-value pairs emitted by map = 3000×(n−1) • #input records per reducer = 2×3000/n The more parallelism, the more communication
Reducer size • Maximum number of inputs a reducer can have, denoted as λ
Analyzing the drug interaction problem • Assume d drugs • Each drug needs to go to (d−1)/(λ−1) reducers • needs to meet d−1 other drugs • meets λ−1 other drugs at each reducer • Minimum communication = d(d−1)/(λ−1) ≈ d²/λ
How well does our algorithm trade off? • With n groups • Reducer size = λ = 2d/n • Communication = d×(n−1) ≈ d×n = d × 2d/λ = 2d²/λ • Tradeoff within a factor 2 of ANY algorithm
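Plugging the slides' numbers (d = 3000, n = 30) into these formulas confirms the factor-2 claim:

```python
# Tradeoff for the grouped algorithm vs. the lower bound from the
# previous slide, using the slides' parameters.
d, n = 3000, 30
lam = 2 * d // n                          # reducer size: two groups of d/n drugs
comm = d * (n - 1)                        # records shuffled: each drug sent to n-1 reducers
lower_bound = d * (d - 1) // (lam - 1)    # minimum communication for this reducer size
print(lam, comm, lower_bound)
```

Here λ = 200 and communication is 87,000 records, which sits between the lower bound and twice the lower bound.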
Fundamental MapReduce tradeoff • Increase in reducer size (λ) • Increases reducer time, space complexities • Drug interaction: time ~ λ², space ~ λ • Decreases total communication • Drug interaction: communication ~ 1/λ
Technique I: Grouping • Decrease key resolution • Lower communication at the cost of parallelism • How to group may not always be trivial
Similarity Search • Near-duplicate detection • Document topic classification • Collaborative filtering • Similar images • Scene completion
Many-to-Many Similarity Search • N data objects and M query objects • Goal: For each query object, find the most similar data object
Candidate generation • M=10⁷, N=10⁶ • 10¹³ pairs to check • 1μsec per pair • 116 days' worth of computation
Locality Sensitive Hashing: Big idea • Hash functions likely to map • Similar objects to same bucket • Dissimilar objects to different buckets
[Diagram: points hashed into a row of numbered LSH buckets]
MapReduce implementation • Map • For data point p emit <Bucket(p) ; p> • For query point q • Generate offsets qi • Emit <Bucket(qi); q> for each offset qi
MapReduce implementation • Reduce • Input: <v; p1,…,pt,q1,…,qs> • Output: {(pi,qj)| 1≤i≤t, 1≤j≤s}
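A minimal single-machine sketch of this candidate-generation scheme. The 1-D interval `bucket` function is a toy stand-in for a real locality-sensitive hash, and the fixed offset list stands in for the query's generated offsets:

```python
from collections import defaultdict

def bucket(point, width=1.0):
    # Toy 1-D LSH: points in the same unit interval share a bucket.
    return int(point // width)

def candidates(data, queries, offsets=(-1, 0, 1)):
    shuffle = defaultdict(lambda: ([], []))
    # Map: each data point goes to its own bucket; each query is
    # replicated to its bucket and the neighboring "offset" buckets.
    for p in data:
        shuffle[bucket(p)][0].append(p)
    for q in queries:
        for off in offsets:
            shuffle[bucket(q) + off][1].append(q)
    # Reduce: pair up data and query points that landed in the same bucket.
    pairs = set()
    for ps, qs in shuffle.values():
        pairs.update((p, q) for p in ps for q in qs)
    return pairs
```

Each query is emitted once per offset, which is exactly the replication cost the next slide flags.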
All good, right? • Too many offsets required for good accuracy • Too much network communication
[Diagram: a query's many offset buckets, illustrating the replication cost]
MapReduce implementation • G another LSH • Map • For data point p emit <G(Bucket(p)); (Bucket(p),p)> • For query point q • Generate offsets qi • Emit <G(Bucket(qi)); q> for all distinct keys
MapReduce implementation • Reduce • Input: <v; [(Bucket(p1),p1), …, (Bucket(pt),pt),q1,…,qs]> • Index pi at Bucket(pi) (1≤i≤t) • Re-compute all offsets of qj and their buckets (1≤j≤s) • Output: candidate pairs
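The two-level scheme can be sketched the same way. `bucket` and `G` are toy hashes; the saving is that each data point is emitted once (tagged with its primary bucket) and each query once per *distinct* G-key, with offset buckets re-derived inside the reducer:

```python
from collections import defaultdict

def bucket(p):
    # Toy primary LSH for 1-D points: unit-interval bucketing.
    return int(p)

def G(b, m=4):
    # Toy second-level hash grouping many primary buckets under one key.
    return b % m

def candidates_grouped(data, queries, offsets=(-1, 0, 1)):
    shuffle = defaultdict(lambda: ([], []))
    # Map: each data point is emitted ONCE, tagged with its primary bucket.
    for p in data:
        shuffle[G(bucket(p))][0].append((bucket(p), p))
    # Each query is emitted once per distinct G-key among its offsets.
    for q in queries:
        for key in {G(bucket(q) + off) for off in offsets}:
            shuffle[key][1].append(q)
    pairs = set()
    # Reduce: index data by primary bucket, then recompute each query's
    # offset buckets locally and look them up.
    for ps, qs in shuffle.values():
        index = defaultdict(list)
        for b, p in ps:
            index[b].append(p)
        for q in qs:
            for off in offsets:
                for p in index[bucket(q) + off]:
                    pairs.add((p, q))
    return pairs
```

The candidate pairs are unchanged, but shuffle volume drops whenever several offsets collide under G.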
Experiments • [Charts: shuffle size and runtime comparisons]
Technique II: Partitioning • Divide and conquer • Avoid “Curse of the Last Reducer”
Graph Clustering Coefficient • G = (V,E) undirected graph • CC(v) = Fraction of v’s neighbor pairs which are neighbors themselves • Δ(v) = Number of triangles incident on v
Graph Clustering Coefficient • Spamming activity in large-scale web graphs • Content quality in social networks • …
MapReduce algorithm • Map • Input: <v; [u1,…,ud]> • Output: • <{ui,uj}; v> (1≤i<j≤d) • <{v,ui}; $> (1≤i≤d)
MapReduce algorithm • Reduce • Input: <{u,w}; [v1,…,vT, $?]> • Output: If $ part of input, emit <vi; 1/3> (1≤i≤T) • [Diagram: nodes v1,…,vT each adjacent to both u and w]
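A single-machine sketch of this map/reduce pair. Following the slides' 1/3 convention, each detected triangle credits 1/3 to the node that emitted the closing pair, so summing credits over all nodes yields the total triangle count:

```python
from collections import defaultdict
from itertools import combinations

def triangle_credits(adj):
    # adj: {node: set of neighbors} for an undirected graph.
    shuffle = defaultdict(list)
    # Map: node v emits itself under every pair of its neighbors, plus a
    # '$' marker under each incident edge so reducers can recognize edges.
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            shuffle[(u, w)].append(v)
        for u in nbrs:
            shuffle[tuple(sorted((v, u)))].append('$')
    # Reduce: if the pair {u, w} is a real edge ('$' present), every v
    # listed with it closes a triangle; credit v with 1/3.
    credits = defaultdict(float)
    for pair, vals in shuffle.items():
        if '$' in vals:
            for v in vals:
                if v != '$':
                    credits[v] += 1 / 3
    return credits
```

For a single triangle the three corners each receive 1/3, summing to one triangle.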
All good, right? • For each node v, Map emits all neighbor pairs • ~ dv²/2 key-value pairs • If dv=50M, even outputting 100M pairs/second, takes ~20 weeks!
Partitioning algorithm • Partition nodes: p partitions V1,…,Vp • ρ(v) = partition of node v • Solve the problem independently on each subgraph Vi ∪ Vj ∪ Vk (1≤i≤j≤k≤p) • Add up results from different subgraphs
Partitioning algorithm • Map • Input: <v; [u1,…,ud]> • Output: for each ut (1≤t≤d) emit <{ρ(v), ρ(ut), k}; {v,ut}> for every 1≤k≤p
Partitioning algorithm • Reduce • Input: <{i,j,k}; [(u1,v1), …, (ur,vr)]> (1≤i≤j≤k≤p) • Output: For each node x in the input emit (x; Δijk(x)) Final results computed as: Δ(x) = Sum of all Δijk(x) (1≤i≤j≤k≤p)
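A single-machine sketch of the partitioning scheme. It deviates slightly from the slides' notation: it uses triples of *distinct* partitions i<j<k and reweights each triangle by the number of subgraphs it appears in, rather than summing raw per-subgraph counts; the round-robin partitioner is a toy choice:

```python
from itertools import combinations
from math import comb

def count_triangles_partitioned(edges, p=3):
    nodes = sorted({v for e in edges for v in e})
    rho = {v: i % p for i, v in enumerate(nodes)}   # toy partitioner
    adj_all = {v: set() for v in nodes}
    for u, v in edges:
        adj_all[u].add(v)
        adj_all[v].add(u)
    total = 0.0
    # One "reducer" per triple of distinct partitions i<j<k: it sees only
    # the subgraph induced on V_i ∪ V_j ∪ V_k and counts its triangles.
    for trip in combinations(range(p), 3):
        allowed = {v for v in nodes if rho[v] in trip}
        adj = {v: adj_all[v] & allowed for v in allowed}
        for u in allowed:
            for v, w in combinations(sorted(adj[u]), 2):
                if w in adj[v] and u < v:   # count each triangle once per subgraph
                    # A triangle spanning t distinct partitions appears in
                    # comb(p - t, 3 - t) subgraphs; weight it accordingly.
                    t = len({rho[u], rho[v], rho[w]})
                    total += 1 / comb(p - t, 3 - t)
    return total
```

The weighting handles the same double-counting that the slides' Δijk(x) bookkeeping addresses: a triangle inside one or two partitions shows up in many subgraphs.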
Example • 1B edges, p=100 partitions • Communication • Each edge sent to 100 reducers • Total communication = 100B edges • Parallelism • #subgraphs ~ 170,000 • Size of each subgraph < 600K edges
MapReduce tradeoffs • [Diagram: three-way tradeoff between approximation, communication, and parallelism]
Technique III: Filtering • Quickly decrease the size of the data in a distributed fashion… • … while maintaining the important features of the data • Solve the small instance on a single machine
Densest subgraph • G=(V,E) undirected graph • Find a subset S of V with highest density: ρ(S) = |E(S)|/|S| • [Example graph: |S|=6, |E(S)|=11, so ρ(S) ≈ 1.83]
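The density objective is simple to state in code; a minimal helper (hypothetical, not from the slides):

```python
def density(S, edges):
    # rho(S) = |E(S)| / |S|: edges with both endpoints in S, per node of S.
    S = set(S)
    return sum(1 for u, v in edges if u in S and v in S) / len(S)
```

For a 4-cycle with one diagonal, the whole vertex set has density 5/4, while the triangle formed by the diagonal has density 1.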
Densest subgraph • Functional modules in protein-protein interaction networks • Communities in social networks • Web spam detection • …
Exact solution • Linear Programming • Hard to scale