Distributed Model-Based Learning PhD student: Zhang, Xiaofeng
I. Model-Based Learning • Methods used in Data Clustering • Dimension reduction • 1. Linear methods: SVD, PCA, kernel PCA, etc. • 2. Pairwise distance methods: Multidimensional scaling (MDS), etc. • 3. Topographic maps: Elastic net, SOM, generative topographic mapping (GTM), etc. • 4. Manifold learning: LLE, etc.
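As a concrete instance of the first family, here is a minimal PCA-via-SVD sketch (plain numpy; the function name and interface are illustrative, not from the slides):

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components,
    computed from the SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)                       # center each attribute
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # k-dimensional embedding

X = np.random.rand(100, 10)
Y = pca_project(X, 2)                             # 2-D view for visualization
```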
Characteristics: • Cope with incomplete data • Better at explaining the data • Visualization • GTM as an example • A Gaussian distribution over the dataset
Collaborative Filtering using GTM: Dataset: movie data Ratings on movies in [0~1] Each color represents a class of movie Visualized in a 2-D plane Romance vs. Action Blue: Action Pink: Romance
Centralized GTM in CF: • Centralized dataset • Large scale, billions of records • Expensive to maintain • Distributed requirements • Security concerns: banks, government, military • Privacy sensitivity: banks, commercial sites, personal sites • Scalability • Expensive to centralize • Real-time huge data streams • Distributed learning of statistical models is thus an important issue
II. Related Work • Distributed Information Retrieval • Globally building a P2P network • Locally routing a query • Globally matching the query to a distributed dataset
Distributed Data Mining • Partitioning of the dataset • Horizontal or homogeneous • Attributes are the same across partitions • Vertical or heterogeneous • Attributes are different across partitions • Approaches: • Distributed KNN • Density-based • Distributed Bayesian networks • For example: a global virtual table is built for vertical partitions
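A toy illustration of the two partitioning schemes (numpy arrays stand in for the distributed tables; purely illustrative):

```python
import numpy as np

# Toy table: rows are records, columns are attributes.
data = np.arange(12).reshape(4, 3)

# Horizontal (homogeneous) partition: each site holds different
# records but the same attributes.
site_a, site_b = data[:2], data[2:]

# Vertical (heterogeneous) partition: each site holds different
# attributes of the same records; joining them on a shared record
# key is what the global virtual table reconstructs.
site_x, site_y = data[:, :1], data[:, 1:]
```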
Approaches to distributed learning: • Mediator-based • Agent-based • Grid-based • Middleware-based • Density-based • Model-based
III. Our Approach • Sparse local data • An underlying global model • Problem overview • Three local models • Globally merge the local models • Merge again or not?
A related approach • Artificial data • A Gaussian mixture model over the global dataset • MCMC sampling to learn the local models • Learn the global model from the averaged local models • Privacy cost distribution: a Gaussian distribution
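To make the artificial-data idea concrete, here is a plain ancestral sampler for a local Gaussian mixture (not the slide's MCMC scheme, and all parameter names are assumptions):

```python
import numpy as np

def sample_artificial(means, covs, weights, n, rng):
    """Draw n artificial points from a local Gaussian mixture model,
    the kind of surrogate data that can be shared in place of the
    privacy-sensitive records."""
    comp = rng.choice(len(weights), size=n, p=weights)   # pick components
    return np.array([rng.multivariate_normal(means[k], covs[k])
                     for k in comp])

rng = np.random.default_rng(0)
fake = sample_artificial([np.zeros(2), np.ones(2) * 3],
                         [np.eye(2), np.eye(2)],
                         [0.4, 0.6], 200, rng)
```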
Density-based merging approach • The combined global model: p(xt) = Σi=1..K αi pi(xt) • K: the number of components • pi(xt): a Gaussian component • αi: the weight of component i, satisfying Σi=1..K αi = 1
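A minimal sketch of evaluating the combined global density, assuming the merged components are given as means, covariances, and weights (names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def combined_density(x, means, covs, weights):
    """p(xt) = sum_i alpha_i * p_i(xt): the merged global Gaussian
    mixture; the weights are assumed to sum to one."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for m, c, w in zip(means, covs, weights))

means = [np.zeros(2), np.ones(2) * 3]        # two merged local components
covs = [np.eye(2), np.eye(2)]
weights = [0.4, 0.6]                         # alpha_i, sum to 1
print(combined_density(np.array([1.0, 1.0]), means, covs, weights))
```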
Merging criteria • Q = argmax(Lij) + argmin(Cosij) • Lij: likelihood measure between models i and j • Cosij: privacy cost between the two models • Two considerations: • Privacy cost • The likelihood that data generated by one model fits the other
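One plausible reading of the criterion as pair selection, with the likelihood and privacy-cost measures left as caller-supplied assumptions (the slide does not define them):

```python
import itertools

def select_merge_pair(models, likelihood, privacy_cost, lam=1.0):
    """Pick the model pair (i, j) that trades off a high cross-
    likelihood L_ij against a low privacy cost Cos_ij; lam is a
    hypothetical balance parameter."""
    def score(pair):
        i, j = pair
        return (likelihood(models[i], models[j])
                - lam * privacy_cost(models[i], models[j]))
    return max(itertools.combinations(range(len(models)), 2), key=score)
```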
Steps: • Learn models locally • Merge according to the likelihood and privacy control • Merging stops when no clusters are density-connected • Learn the parameters of the global GMM (K, αi, etc.)
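Putting the steps together, a greedy merging loop under the same assumptions (density_connected and merge are hypothetical callables standing in for the tests described on the slides):

```python
def merge_local_models(models, likelihood, privacy_cost,
                       density_connected, merge):
    """Repeatedly merge the best pair chosen by select_merge_pair
    (sketched above) until no two components are density-connected;
    what remains are the components of the global GMM."""
    while len(models) > 1:
        i, j = select_merge_pair(models, likelihood, privacy_cost)
        if not density_connected(models[i], models[j]):
            break                              # stopping rule from the slide
        merged = merge(models[i], models[j])
        models = [m for k, m in enumerate(models) if k not in (i, j)]
        models.append(merged)
    return models
```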
Hierarchical Approach • Six local models • Merge according to the similarity measure • Each level can be controlled by the privacy cost • Bottom-up learning of a hierarchical model • After a global model is learned, changing the privacy control level changes the model
Model selection • Simij = Dist(Cost(Di), Cost(Dj)) < Const • Cost(Di): transform the dataset using the cost function • Dist(x, y): the distance between the two transformed datasets • Merge if the distance is smaller than the threshold • Steps: • 1. Learn a local model from each local dataset. • 2. Based on the predefined privacy control function, merge local models to form a hierarchical global model. • 3. Relabel the local models according to the changed privacy level.
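A direct transcription of the similarity test, with the cost and distance functions as explicit assumptions (the instantiation below, per-attribute means compared by Euclidean distance, is hypothetical):

```python
import numpy as np

def similar_enough(Di, Dj, cost, dist, const):
    """Sim_ij: merge models i and j when the distance between their
    cost-transformed datasets falls below the threshold."""
    return dist(cost(Di), cost(Dj)) < const

merge_ok = similar_enough(
    np.random.rand(100, 5), np.random.rand(80, 5),
    cost=lambda D: D.mean(axis=0),        # assumed cost transform
    dist=lambda x, y: float(np.linalg.norm(x - y)),
    const=0.5,                            # assumed threshold Const
)
```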
Privacy Control by Data Sampling • Previously, privacy was controlled through the cost function • Here we instead control the privacy-sensitive dataset itself • D1' = D1 U Oa21(D2) U Oa31(D3) U Oa41(D4) • D2' = Oa12(D1) U D2 U Oa32(D3) U Oa42(D4) • … • Oa12: a sampling operator over the dataset • New local datasets are reconstructed by sampling from the other local datasets at some privacy control level
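A simple reading of the sampling operator, drawing a privacy-controlled fraction of another site's records (the fraction a is the assumed privacy control level):

```python
import numpy as np

def privacy_sample(D, a, rng):
    """O_a(D): sample a fraction a of the rows of D without
    replacement, one plausible form of the operator."""
    idx = rng.choice(len(D), size=int(a * len(D)), replace=False)
    return D[idx]

rng = np.random.default_rng(0)
D1, D2, D3, D4 = (rng.random((50, 3)) for _ in range(4))
# D1' = D1 U O_a21(D2) U O_a31(D3) U O_a41(D4)
D1_prime = np.vstack([D1,
                      privacy_sample(D2, 0.2, rng),
                      privacy_sample(D3, 0.2, rng),
                      privacy_sample(D4, 0.2, rng)])
```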
P2P Approach • Local small-world networks • A global model per small world • Local network information stored in each node • Trust propagation to connected nodes • Knowledge passed to connected small worlds
Algorithms: • 1. Learn a global model for each small world of local nodes. • 2. Pass global information back to each node in this small world. • 3. Node i passes its trust relationship to its connected outer small-world nodes at a certain trust value. • 4. The connected nodes merge their local models with the new knowledge. • 5. Update the connected global model's knowledge, and propagate it to all the local models in its small world. • 6. Sum all the knowledge L3 collected, update G2, and repeat steps 3-6 until the loop criterion is satisfied: the iteration limit is reached or the global model changes little.
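A skeleton of this loop, with every helper an assumption standing in for a step on the slide:

```python
def p2p_learning(small_worlds, learn_global, push_back, bridge_knowledge,
                 merge_with_trust, change, max_iter=10, tol=1e-3):
    """Learn a global model per small world, push it back to member
    nodes, let bridge nodes propagate trust-weighted knowledge to
    neighbouring small worlds, and repeat until the models change
    little or the iteration budget is spent."""
    globals_ = [learn_global(sw) for sw in small_worlds]          # step 1
    for _ in range(max_iter):                                     # steps 3-6
        for sw, g in zip(small_worlds, globals_):
            push_back(sw, g)                                      # steps 2, 5
        new_globals = [merge_with_trust(g, bridge_knowledge(sw))  # steps 3-4
                       for sw, g in zip(small_worlds, globals_)]
        converged = max(change(g, ng)
                        for g, ng in zip(globals_, new_globals)) < tol
        globals_ = new_globals                                    # step 6
        if converged:
            break
    return globals_
```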
IV. Model Evaluation • Effectiveness criteria • Precision • How accurate the model is • Recall • How much of the relevant data the model covers
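The two effectiveness measures, written out over sets of record ids (a simple stand-in; the slide does not fix the representation):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of the model's assignments that are right.
    Recall: fraction of the truly relevant data the model covers."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))  # (0.5, 0.666...)
```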
Efficiency criteria • Communication cost • Assuming the same bandwidth, the cost is proportional only to the partition size • Maximum data transferred • Overhead • Compare the three approaches with the centralized approach • Complexity • Computational complexity
V. Experimental Issues • Another representation for the dataset • Site vectors instead of document vectors • Pick out meaningful representatives of the local models • LLE vs. GTM, etc. • Change the privacy distribution to control the shape of the global model