Bahman Bahmani bahman@stanford.edu Algorithm Design Meets Big Data
Outline • Fundamental Tradeoffs • Drug Interaction Example [Adapted from Ullman’s slides, 2012] • Technique I: Grouping • Similarity Search [Bahmani et al., 2012] • Technique II: Partitioning • Triangle Counting [Afrati et al., 2013; Suri et al., 2011] • Technique III: Filtering • Community Detection [Bahmani et al., 2012] • Conclusion
Drug interaction problem • 3000 drugs, 1M patients, 20 years • 1MB of data per drug • Goal: Find drug interactions
MapReduce algorithm • Map • Input: <i; Ri> • Output: <{i,j}; Ri> for any other drug j • Reduce • Input: <{i,j}; [Ri,Rj]> • Output: Interaction between drugs i,j
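The map/reduce steps above can be sketched as a single-machine simulation in Python. The `interaction` function is a toy stand-in (record overlap), not the slides' actual statistical test:

```python
from collections import defaultdict

def interaction(r1, r2):
    # Toy stand-in for the real statistical test on two drugs' records.
    return len(set(r1) & set(r2))

def map_drug(i, record, num_drugs):
    # Map: replicate drug i's record to the reducer for every pair {i, j}.
    for j in range(num_drugs):
        if j != i:
            yield (frozenset({i, j}), record)

def run(records):
    shuffle = defaultdict(list)
    for i, rec in enumerate(records):
        for key, value in map_drug(i, rec, len(records)):
            shuffle[key].append(value)
    # Reduce: each key {i, j} holds exactly the two records R_i and R_j.
    return {tuple(sorted(k)): interaction(*v) for k, v in shuffle.items()}
```

Note that every record is emitted once per other drug, which is exactly the replication the next slides identify as the bottleneck.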
Example • [Diagram: mappers for drugs 1–3 each replicate their drug's data to the reducers for pairs {1,2}, {1,3}, {2,3}]
Example • [Diagram: reducer for {1,2} receives drug 1 and drug 2 data; reducer for {1,3} receives drug 1 and drug 3 data; reducer for {2,3} receives drug 2 and drug 3 data]
Total work 4.5M pairs × 100 msec/pair ≈ 125 hours Less than 1 hour using ten 16-core nodes
All good, right? Network communication = 3000 drugs × 2999 pairs/drug × 1MB/pair ≈ 9TB Over 20 hours' worth of network traffic on a 1Gbps Ethernet
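A quick back-of-envelope check of the compute and communication numbers on these slides (d = 3000 drugs, 1MB records, 100 ms per comparison, a 1 Gbps link):

```python
# Reproduce the slides' arithmetic for the naive all-pairs algorithm.
d, record_mb = 3000, 1
pairs = d * (d - 1) // 2                       # number of drug pairs to compare
compute_hours = pairs * 0.1 / 3600             # 100 ms per comparison
traffic_tb = d * (d - 1) * record_mb / 1e6     # each record replicated d-1 times
transfer_hours = d * (d - 1) * record_mb * 8e6 / 1e9 / 3600  # at 1 Gbps
print(pairs, compute_hours, traffic_tb, transfer_hours)
```

The pair count is ~4.5M, compute is ~125 hours, shuffle traffic is ~9TB, and moving it over 1 Gbps takes ~20 hours, matching the slides.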
Improved algorithm • Group drugs • Example: 30 groups of size 100 each • G(i) = group of drug i
Improved algorithm • Map • Input: <i; Ri> • Output: <{G(i), G’}; (i, Ri)> for any other group G’ • Reduce • Input: <{G,G’}; 200 drug records in groups G,G’> • Output: All pairwise interactions between G,G’
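A minimal single-machine sketch of the grouped algorithm. The round-robin group assignment `G` is a toy choice, and within-group pairs are handled under a per-group singleton key to avoid comparing them at multiple cross-group reducers, a detail the slides leave implicit:

```python
from collections import defaultdict
from itertools import combinations

def run_grouped(records, n_groups):
    G = lambda i: i % n_groups      # toy round-robin group assignment
    shuffle = defaultdict(list)
    # Map: replicate (i, R_i) once per *other group* rather than per other drug.
    for i, rec in enumerate(records):
        for g2 in range(n_groups):
            if g2 != G(i):
                shuffle[frozenset({G(i), g2})].append((i, rec))
        # Within-group pairs are handled once, under the group's own key.
        shuffle[(G(i),)].append((i, rec))
    # Reduce: compare all pairs at this reducer; skip same-group pairs at
    # cross-group reducers (the singleton-key reducer covers them).
    results = {}
    for key, drugs in shuffle.items():
        for (i, ri), (j, rj) in combinations(drugs, 2):
            if len(key) == 2 and G(i) == G(j):
                continue
            results[(min(i, j), max(i, j))] = len(set(ri) & set(rj))
    return results
```

Every pair is still compared exactly once, but each record is now replicated n-1 times instead of d-1 times.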
Total work • Same as before • Each pair compared once
All good now? 3000 drugs × 29 replications × 1MB = 87GB Less than 15 minutes on 1Gbps Ethernet
Algorithm’s tradeoff • Assume n groups • #key-value pairs emitted by map = 3000×(n−1) • #input records per reducer = 2×3000/n The more parallelism, the more communication
Reducer size • Maximum number of inputs a reducer can have, denoted as λ
Analyzing the drug interaction problem • Assume d drugs • Each drug needs to go to (d−1)/(λ−1) reducers • needs to meet d−1 other drugs • meets λ−1 other drugs at each reducer • Minimum communication = d(d−1)/(λ−1) ≈ d²/λ
How well does our algorithm trade off? • With n groups • Reducer size = λ = 2d/n • Communication = d×(n−1) ≈ d×n = d × 2d/λ = 2d²/λ • Tradeoff within a factor 2 of ANY algorithm
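Plugging the slides' numbers (d = 3000, n = 30) into these formulas confirms the factor-2 claim:

```python
# Tradeoff for the grouped algorithm vs. the lower bound from the
# previous slide, using the slides' parameters.
d, n = 3000, 30
lam = 2 * d // n                          # reducer size: two groups of d/n drugs
comm = d * (n - 1)                        # records shuffled: each drug sent to n-1 reducers
lower_bound = d * (d - 1) // (lam - 1)    # minimum communication for this reducer size
print(lam, comm, lower_bound)
```

Here λ = 200 and communication is 87,000 records, which sits between the lower bound and twice the lower bound.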
Fundamental MapReduce tradeoff • Increase in reducer size (λ) • Increases reducer time, space complexities • Drug interaction: time ~ λ², space ~ λ • Decreases total communication • Drug interaction: communication ~ 1/λ
Technique I: Grouping • Decrease key resolution • Lower communication at the cost of parallelism • How to group may not always be trivial
Similarity Search • Near-duplicate detection • Document topic classification • Collaborative filtering • Similar images • Scene completion
Many-to-Many Similarity Search • N data objects and M query objects • Goal: For each query object, find the most similar data object
Candidate generation • M=10⁷, N=10⁶ • 10¹³ pairs to check • 1μsec per pair • 116 days' worth of computation
Locality Sensitive Hashing: Big idea • Hash functions likely to map • Similar objects to same bucket • Dissimilar objects to different buckets
[Diagram: points hashed into a row of numbered LSH buckets]
MapReduce implementation • Map • For data point p emit <Bucket(p) ; p> • For query point q • Generate offsets qi • Emit <Bucket(qi); q> for each offset qi
MapReduce implementation • Reduce • Input: <v; p1,…,pt,q1,…,qs> • Output: {(pi,qj)| 1≤i≤t, 1≤j≤s}
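A minimal single-machine sketch of this candidate-generation scheme. The 1-D interval `bucket` function is a toy stand-in for a real locality-sensitive hash, and the fixed offset list stands in for the query's generated offsets:

```python
from collections import defaultdict

def bucket(point, width=1.0):
    # Toy 1-D LSH: points in the same unit interval share a bucket.
    return int(point // width)

def candidates(data, queries, offsets=(-1, 0, 1)):
    shuffle = defaultdict(lambda: ([], []))
    # Map: each data point goes to its own bucket; each query is
    # replicated to its bucket and the neighboring "offset" buckets.
    for p in data:
        shuffle[bucket(p)][0].append(p)
    for q in queries:
        for off in offsets:
            shuffle[bucket(q) + off][1].append(q)
    # Reduce: pair up data and query points that landed in the same bucket.
    pairs = set()
    for ps, qs in shuffle.values():
        pairs.update((p, q) for p in ps for q in qs)
    return pairs
```

Each query is emitted once per offset, which is exactly the replication cost the next slide flags.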
All good, right? • Too many offsets required for good accuracy • Too much network communication
[Diagram: a query's many offset buckets, illustrating the replication cost]
MapReduce implementation • G another LSH • Map • For data point p emit <G(Bucket(p)); (Bucket(p),p)> • For query point q • Generate offsets qi • Emit <G(Bucket(qi)); q> for all distinct keys
MapReduce implementation • Reduce • Input: <v; [(Bucket(p1),p1), …, (Bucket(pt),pt),q1,…,qs]> • Index pi at Bucket(pi) (1≤i≤t) • Re-compute all offsets of qj and their buckets (1≤j≤s) • Output: candidate pairs
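The two-level scheme can be sketched the same way. `bucket` and `G` are toy hashes; the saving is that each data point is emitted once (tagged with its primary bucket) and each query once per *distinct* G-key, with offset buckets re-derived inside the reducer:

```python
from collections import defaultdict

def bucket(p):
    # Toy primary LSH for 1-D points: unit-interval bucketing.
    return int(p)

def G(b, m=4):
    # Toy second-level hash grouping many primary buckets under one key.
    return b % m

def candidates_grouped(data, queries, offsets=(-1, 0, 1)):
    shuffle = defaultdict(lambda: ([], []))
    # Map: each data point is emitted ONCE, tagged with its primary bucket.
    for p in data:
        shuffle[G(bucket(p))][0].append((bucket(p), p))
    # Each query is emitted once per distinct G-key among its offsets.
    for q in queries:
        for key in {G(bucket(q) + off) for off in offsets}:
            shuffle[key][1].append(q)
    pairs = set()
    # Reduce: index data by primary bucket, then recompute each query's
    # offset buckets locally and look them up.
    for ps, qs in shuffle.values():
        index = defaultdict(list)
        for b, p in ps:
            index[b].append(p)
        for q in qs:
            for off in offsets:
                for p in index[bucket(q) + off]:
                    pairs.add((p, q))
    return pairs
```

The candidate pairs are unchanged, but shuffle volume drops whenever several offsets collide under G.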
Experiments • [Charts: shuffle size and runtime comparisons]
Technique II: Partitioning • Divide and conquer • Avoid “Curse of the Last Reducer”
Graph Clustering Coefficient • G = (V,E) undirected graph • CC(v) = Fraction of v’s neighbor pairs which are neighbors themselves • Δ(v) = Number of triangles incident on v
Graph Clustering Coefficient • Spamming activity in large-scale web graphs • Content quality in social networks • …
MapReduce algorithm • Map • Input: <v; [u1,…,ud]> • Output: • <{ui,uj}; v> (1≤i<j≤d) • <{v,ui}; $> (1≤i≤d)
MapReduce algorithm • Reduce • Input: <{u,w}; [v1,…,vT, $?]> • Output: If $ part of input, emit <vi; 1/3> (1≤i≤T) • [Diagram: nodes v1,…,vT each adjacent to both u and w]
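A single-machine sketch of this map/reduce pair. Following the slides' 1/3 convention, each detected triangle credits 1/3 to the node that emitted the closing pair, so summing credits over all nodes yields the total triangle count:

```python
from collections import defaultdict
from itertools import combinations

def triangle_credits(adj):
    # adj: {node: set of neighbors} for an undirected graph.
    shuffle = defaultdict(list)
    # Map: node v emits itself under every pair of its neighbors, plus a
    # '$' marker under each incident edge so reducers can recognize edges.
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            shuffle[(u, w)].append(v)
        for u in nbrs:
            shuffle[tuple(sorted((v, u)))].append('$')
    # Reduce: if the pair {u, w} is a real edge ('$' present), every v
    # listed with it closes a triangle; credit v with 1/3.
    credits = defaultdict(float)
    for pair, vals in shuffle.items():
        if '$' in vals:
            for v in vals:
                if v != '$':
                    credits[v] += 1 / 3
    return credits
```

For a single triangle the three corners each receive 1/3, summing to one triangle.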
All good, right? • For each node v, Map emits all neighbor pairs • ~ dv²/2 key-value pairs • If dv=50M, even outputting 100M pairs/second, takes ~20 weeks!
Partitioning algorithm • Partition nodes: p partitions V1,…,Vp • ρ(v) = partition of node v • Solve the problem independently on each subgraph Vi ∪ Vj ∪ Vk (1≤i≤j≤k≤p) • Add up results from different subgraphs
Partitioning algorithm • Map • Input: <v; [u1,…,ud]> • Output: for each ut (1≤t≤d) emit <{ρ(v), ρ(ut), k}; {v,ut}> for every 1≤k≤p
Partitioning algorithm • Reduce • Input: <{i,j,k}; [(u1,v1), …, (ur,vr)]> (1≤i≤j≤k≤p) • Output: For each node x in the input emit (x; Δijk(x)) Final results computed as: Δ(x) = Sum of all Δijk(x) (1≤i≤j≤k≤p)
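A single-machine sketch of the partitioning scheme. It deviates slightly from the slides' notation: it uses triples of *distinct* partitions i<j<k and reweights each triangle by the number of subgraphs it appears in, rather than summing raw per-subgraph counts; the round-robin partitioner is a toy choice:

```python
from itertools import combinations
from math import comb

def count_triangles_partitioned(edges, p=3):
    nodes = sorted({v for e in edges for v in e})
    rho = {v: i % p for i, v in enumerate(nodes)}   # toy partitioner
    adj_all = {v: set() for v in nodes}
    for u, v in edges:
        adj_all[u].add(v)
        adj_all[v].add(u)
    total = 0.0
    # One "reducer" per triple of distinct partitions i<j<k: it sees only
    # the subgraph induced on V_i ∪ V_j ∪ V_k and counts its triangles.
    for trip in combinations(range(p), 3):
        allowed = {v for v in nodes if rho[v] in trip}
        adj = {v: adj_all[v] & allowed for v in allowed}
        for u in allowed:
            for v, w in combinations(sorted(adj[u]), 2):
                if w in adj[v] and u < v:   # count each triangle once per subgraph
                    # A triangle spanning t distinct partitions appears in
                    # comb(p - t, 3 - t) subgraphs; weight it accordingly.
                    t = len({rho[u], rho[v], rho[w]})
                    total += 1 / comb(p - t, 3 - t)
    return total
```

The weighting handles the same double-counting that the slides' Δijk(x) bookkeeping addresses: a triangle inside one or two partitions shows up in many subgraphs.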
Example • 1B edges, p=100 partitions • Communication • Each edge sent to 100 reducers • Total communication = 100B edges • Parallelism • #subgraphs ~ 170,000 • Size of each subgraph < 600K edges
MapReduce tradeoffs • [Diagram: three-way tradeoff between approximation, communication, and parallelism]
Technique III: Filtering • Quickly decrease the size of the data in a distributed fashion… • … while maintaining the important features of the data • Solve the small instance on a single machine
Densest subgraph • G=(V,E) undirected graph • Find a subset S of V with highest density: ρ(S) = |E(S)|/|S| • [Example graph: |S|=6, |E(S)|=11, so ρ(S) ≈ 1.83]
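The density objective is simple to state in code; a minimal helper (hypothetical, not from the slides):

```python
def density(S, edges):
    # rho(S) = |E(S)| / |S|: edges with both endpoints in S, per node of S.
    S = set(S)
    return sum(1 for u, v in edges if u in S and v in S) / len(S)
```

For a 4-cycle with one diagonal, the whole vertex set has density 5/4, while the triangle formed by the diagonal has density 1.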
Densest subgraph • Functional modules in protein-protein interaction networks • Communities in social networks • Web spam detection • …
Exact solution • Linear Programming • Hard to scale