A Secure Clustering Algorithm for Distributed Data Streams

Geetha Jagannathan, Rutgers University. Joint work with Krishnan Pillaipakkamnatt and D. Umano

Outline • The problem • Prior results • Clustering data streams • Experimental results and comparison • A privacy-preserving protocol • Conclusion

The problem • Alice and Bob each have a data stream, defined on the same attributes (a horizontal partition). • They wish to compute a clustering on the combined data.


Clustering on joint data Alice’s Data k = 4

Clustering on joint data Bob’s Data k = 4

Clustering on joint data Combined Data k = 4

Trusted third party [Figure: Alice and Bob, each receiving a k-clustering from a trusted third party]

Privacy requirements • Parties are semi-honest • The protocol must be as private as using a trusted third party • It reveals nothing but the final output • In this case, the k cluster centers

Prior results • PPDM (privacy-preserving data mining) protocols convert distributed data mining algorithms into private ones • The k-means algorithm is the basis for many clustering protocols [VC03, JKM05, JW05, BO07] • These protocols “leak” intermediate information • [JPW05] presents a leak-free clustering protocol based on a new clustering algorithm.

Our Contributions • A leak-free privacy-preserving protocol for distributed data streams. • A data stream clustering algorithm • Better than k-means (on average) • Comparable performance with BIRCH on many data sets, but with lower memory needs.

Data Stream Algorithms • Data arrives in “stream” fashion: d1, d2, …, dn, … (the “end” of the stream is not known ahead of time). • Data is too large to fit entirely in memory. • Data can be accessed only in the order that it arrives. • Each data item can only be “read” once.

The clustering algorithm • “Incrementally agglomerative”: It merges intermediate clusters without waiting for all the data to be available. • Runs in time linear in n.

Overview of clustering algorithm [Figure: k = 5; output expected after n = 25 data points; level-0, level-1 and level-2 clusterings feeding into the output]

Clustering Algorithm Outline • The algorithm maintains a list of k-clusterings (each clustering is on some partial data). • In each iteration: • Input the next k data points as a level-0 clustering. • If two clusterings at level i are in the list, “merge” them into a level-(i + 1) k-clustering.

Clustering algorithm outline • If output is needed after some n points have been read, all k-clusterings are “merged” into a single k-clustering.
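The outline above can be sketched as a binary-counter style structure: at most one pending k-clustering is kept per level, and two clusterings at the same level are merged into one at the next level. This is a minimal sketch, not the authors' implementation. The class name `StreamClusterer`, the `(center, weight)` tuple representation, and the handling of leftover points that do not yet fill a level-0 clustering are all assumptions. The merge step is passed in as a function, since the slides define it separately.

```python
class StreamClusterer:
    """Keeps at most one pending k-clustering per level (hypothetical helper)."""

    def __init__(self, k, merge):
        self.k = k
        self.merge = merge   # callable: (list of clusters, k) -> list of k clusters
        self.levels = {}     # level -> pending k-clustering at that level
        self.buffer = []     # points not yet forming a full level-0 clustering

    def add(self, point):
        # Each incoming point becomes a singleton cluster (center, weight=1).
        self.buffer.append((tuple(point), 1))
        if len(self.buffer) < self.k:
            return
        # k new points form a level-0 clustering.
        clustering, level = self.buffer, 0
        self.buffer = []
        # Two clusterings at level i are merged into one at level i + 1.
        while level in self.levels:
            clustering = self.merge(self.levels.pop(level) + clustering, self.k)
            level += 1
        self.levels[level] = clustering

    def output(self):
        # On demand, merge every pending clustering into a single k-clustering.
        # Including leftover buffered points here is an assumption.
        pending = [c for lvl in sorted(self.levels) for c in self.levels[lvl]]
        pending += self.buffer
        if len(pending) <= self.k:
            return pending
        return self.merge(pending, self.k)
```

With k = 5, the occupied levels after 5, 10, 15, 20 and 25 points are {0}, {1}, {0, 1}, {2} and {0, 2}, which mirrors how a binary counter increments.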

“Merging” clusterings • Have a set S of clusters, with |S| > k. • Need a set S' of k clusters. • S' = S • Repeat • Compute merge error for every pair of clusters • Take the union of the pair with lowest error • Until |S'| = k

$\mathrm{Error}(C_1 \cup C_2) = C_1.\mathrm{weight} \cdot C_2.\mathrm{weight} \cdot \mathrm{dist}(C_1, C_2)^2$
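The merge loop and the error formula can be sketched as follows. This is an illustrative sketch under assumptions the slides do not state: each cluster is summarized by a center and a weight, dist is the Euclidean distance between centers, and the union of two clusters takes the weighted mean of their centers. The names `Cluster`, `merge_error`, `union` and `merge_clusters` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Cluster:
    center: tuple   # assumed summary: centroid of the cluster's points
    weight: int     # number of points the cluster represents


def merge_error(a, b):
    # weight * weight * squared distance between centers, as on the slide.
    sq_dist = sum((x - y) ** 2 for x, y in zip(a.center, b.center))
    return a.weight * b.weight * sq_dist


def union(a, b):
    # Assumption: the merged center is the weighted mean of the two centers.
    w = a.weight + b.weight
    center = tuple((a.weight * x + b.weight * y) / w
                   for x, y in zip(a.center, b.center))
    return Cluster(center, w)


def merge_clusters(clusters, k):
    s = list(clusters)
    while len(s) > k:
        # Compute the merge error for every pair and pick the lowest.
        i, j = min(combinations(range(len(s)), 2),
                   key=lambda p: merge_error(s[p[0]], s[p[1]]))
        merged = union(s[i], s[j])
        s = [c for idx, c in enumerate(s) if idx not in (i, j)] + [merged]
    return s
```

Recomputing every pairwise error in each round keeps the sketch close to the slide's wording. Because |S| is small (a few multiples of k), the quadratic cost per round depends on k and not on the stream length.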

Sample results (offset grid)

Sample results (vs. k-means)

Sample results (vs. BIRCH)

Realistic Data (Network Intrusion)

The Secure Protocol • Input: Alice owns data stream $D_1$; Bob owns data stream $D_2$ • Output: k clusters on $D_1^m \cup D_2^n$ • Alice computes $O(k \log(m/k))$ cluster centers and Bob computes $O(k \log(n/k))$ cluster centers • Alice and Bob securely share their cluster centers • They securely merge clusters
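The slides do not say how the cluster centers are shared. One common building block in semi-honest two-party protocols is additive secret sharing, sketched below purely as an illustration. The modulus `P`, the fixed-point `SCALE`, and the function names are assumptions, and the secure merge step itself is not covered here.

```python
import secrets

P = 2**61 - 1    # hypothetical prime modulus for the share arithmetic
SCALE = 10**6    # hypothetical fixed-point scale for real-valued coordinates


def encode(x):
    # Map a real coordinate to an element of Z_P (negatives wrap around).
    return round(x * SCALE) % P


def decode(v):
    # Values in the upper half of Z_P are interpreted as negative numbers.
    if v > P // 2:
        v -= P
    return v / SCALE


def share(center):
    """Split a center into two additive shares; each share alone looks random."""
    share_a, share_b = [], []
    for x in center:
        r = secrets.randbelow(P)
        share_a.append(r)
        share_b.append((encode(x) - r) % P)
    return share_a, share_b


def reconstruct(share_a, share_b):
    # Adding the two shares coordinate-wise recovers the encoded center.
    return tuple(decode((a + b) % P) for a, b in zip(share_a, share_b))
```

In such a scheme, neither party learns a center from its own share. Only a computation that combines both shares, such as the final output step, reveals values.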

Sample Run (Distributed non-private protocol)

Complexity • Communication complexity: $O((k \log(mn/k^2))^2)$ • Non-private setting (one party sends the intermediate clusters to the other) • Communication complexity: $O(k \log(m/k))$
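One plausible reading of where these bounds come from, assuming each party holds at most one pending k-clustering per level and that the secure merge touches every pair of clusters held by the two parties (the slides state neither assumption explicitly):

```latex
\begin{align*}
\text{levels after } m \text{ points}
  &\le \lfloor \log_2 (m/k) \rfloor + 1, \\
\text{clusters held by Alice}
  &\le k \left( \lfloor \log_2 (m/k) \rfloor + 1 \right)
   = O\!\left( k \log \tfrac{m}{k} \right), \\
\text{clusters held jointly}
  &= O\!\left( k \log \tfrac{m}{k} + k \log \tfrac{n}{k} \right)
   = O\!\left( k \log \tfrac{mn}{k^2} \right), \\
\text{cluster pairs}
  &= O\!\left( \left( k \log \tfrac{mn}{k^2} \right)^2 \right).
\end{align*}
```

Under this reading, the non-private bound counts the clusters one party sends, while the private bound counts the pairs whose merge error has to be evaluated jointly.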
