Support Cluster Machine
Paper from ICML 2007, read by Haiqin Yang, 2007-10-18
The paper, Support Cluster Machine, by Bin Li, Mingmin Chi, Jianping Fan, and Xiangyang Xue, was published at ICML 2007.
Outline • Background and Motivation • Support Cluster Machine - SCM • Kernel in SCM • Experiments • An Interesting Application: Privacy-preserving Data Mining • Discussions
Background and Motivation
Approaches to the large-scale classification problem:
• Decomposition methods: Osuna et al., 1997; Joachims, 1999; Platt, 1999; Collobert & Bengio, 2001; Keerthi et al., 2001
• Incremental algorithms: Cauwenberghs & Poggio, 2000; Fung & Mangasarian, 2002; Laskov et al., 2006
• Parallel techniques: Collobert et al., 2001; Graf et al., 2004
• Approximate formulations: Fung & Mangasarian, 2001; Lee & Mangasarian, 2001
• Choosing representatives: Active learning (Schohn & Cohn, 2003); Cluster-Based SVM (Yu et al., 2003); Core Vector Machine, CVM (Tsang et al., 2005); Clustering SVM (Boley & Cao, 2004)
Support Cluster Machine (SCM)
• Given training samples (slide equation not preserved)
• Procedure (slide figure not preserved)
SCM Solution
• Dual representation
• Decision function
(slide equations not preserved)
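The solution slide's equations are images that did not survive extraction. As a hedged sketch (symbols are assumed, not copied from the paper), the SCM solution has the familiar soft-margin SVM form, with the training units being cluster distributions $p_i$ carrying labels $y_i$ and a kernel $K$ defined between distributions:

```latex
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{m}\alpha_i
  \;-\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j\, y_i y_j\, K(p_i,p_j)
\qquad \text{s.t.}\quad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{m}\alpha_i y_i = 0
```

with decision function $f(x)=\operatorname{sign}\big(\sum_{i}\alpha_i y_i K(p_i,x)+b\big)$, where a test vector $x$ enters the kernel as a point-mass distribution. The paper's exact per-cluster weighting of the box constraint may differ from this generic form.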
Kernel
• Probability product kernel between cluster distributions
• Under the Gaussian assumption, the kernel admits a closed form
(slide equations not preserved)
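The closed form under the Gaussian assumption follows from a standard identity: the probability product kernel (exponent 1) between two Gaussians is $\int \mathcal{N}(x;\mu_1,\Sigma_1)\,\mathcal{N}(x;\mu_2,\Sigma_2)\,dx = \mathcal{N}(\mu_1-\mu_2;\,0,\,\Sigma_1+\Sigma_2)$. A minimal sketch (the function name is mine, not the paper's):

```python
import numpy as np

def gaussian_product_kernel(mu1, cov1, mu2, cov2):
    """Probability product kernel (exponent 1) between two Gaussians:
    K = integral N(x; mu1, cov1) * N(x; mu2, cov2) dx
      = N(mu1 - mu2; 0, cov1 + cov2)   (standard Gaussian identity)."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    S = np.asarray(cov1, dtype=float) + np.asarray(cov2, dtype=float)
    d = mu1.shape[0]
    diff = mu1 - mu2
    # Normalizer and quadratic form of the Gaussian density at mu1 - mu2
    norm = (2.0 * np.pi) ** (-d / 2.0) * np.linalg.det(S) ** -0.5
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(S, diff))
```

Evaluating the kernel between a Gaussian cluster and a single test vector reduces to setting the second covariance to zero, i.e., the cluster's Gaussian density at that point.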
Kernel
• Property I, and its implication for the decision function
• Property II
(slide equations not preserved)
Experiments
• Datasets: Toydata; MNIST, handwritten digit ('0'-'9') classification; Adult, a privacy-preserving dataset
• Clustering algorithms: Threshold Order Dependent (TOD); EM algorithm
• Classification methods: libSVM, SVMTorch, SVMlight, CVM (Core Vector Machine), SCM
• Model selection
• CPU: 3.0 GHz
Toydata • Samples: 2,500 samples per class, generated from a mixture of Gaussians • Clustering algorithm: TOD • Clustering results: 25 positive clusters, 25 negative clusters
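The slides do not give the mixture parameters, so the sketch below uses assumed means and covariances purely to illustrate drawing 2,500 samples for one class from a Gaussian mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_mixture(means, covs, weights, n, rng):
    """Draw n samples from a Gaussian mixture by first picking a
    component per sample, then sampling from that component."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])

# One class of 2,500 samples; means/covariances are illustrative only
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
X_pos = sample_gaussian_mixture(means, covs, [0.5, 0.5], 2500, rng)
```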
MNIST • Data description • 10 classes: handwritten digits '0'-'9' • Training samples: 60,000, about 6,000 per class • Testing samples: 10,000 • Construct 45 binary (pairwise) classifiers • Results • 25 clusters for the EM algorithm
MNIST • Test results for TOD algorithm
Privacy-preserving Data Mining • Inter-Enterprise data mining • Problem: Two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information. • Horizontally partitioned • Records (users) split across companies • Example: Credit card fraud detection model • Vertically partitioned • Attributes split across companies • Example: Associations across websites
Privacy-preserving Data Mining • Randomization approach (diagram): true records (e.g., Age 30, Salary 70K) pass through a randomizer that releases distorted records (e.g., Age 65, Salary 20K); the distributions of Age, Salary, etc. are reconstructed from the distorted data and fed to the data-mining algorithms to build the model
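A minimal sketch of the value-distortion idea in the diagram: each record is released with additive zero-mean noise (the data and noise level here are assumed for illustration), so individual values are hidden while aggregate statistics remain estimable:

```python
import numpy as np

rng = np.random.default_rng(42)

def randomize(values, noise_std, rng):
    """Release value + zero-mean Gaussian noise instead of the true
    value; individual records are distorted, but zero-mean noise leaves
    aggregate statistics (e.g., the mean) estimable."""
    return values + rng.normal(0.0, noise_std, size=values.shape)

# Illustrative "Age" column from one party
ages = rng.integers(20, 70, size=10_000).astype(float)
released = randomize(ages, noise_std=10.0, rng=rng)
# The miner sees only `released`, never the true ages
```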
Privacy-preserving Dataset: Adult • Data description • Training samples: 30,162 • Testing samples: 15,060 • Percentage of positive samples: 24.78% • Procedure • Horizontally partition the data into three subsets (parties) • Cluster each subset with the TOD algorithm • Obtain three positive and three negative GMMs • Combine them into one positive and one negative GMM with modified priors • Classify with SCM
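The "modified priors" step can be sketched as pooling the parties' Gaussian components and rescaling each party's mixture weights by its share of the total sample count (the function and data layout are my assumptions, not the paper's code):

```python
def combine_gmms(gmms, counts):
    """Merge per-party GMMs into one GMM.

    gmms:   list of (means, covs, priors) triples, one per party,
            where each element is a list over that party's components
    counts: number of training samples each party contributed

    Each party's priors are rescaled by its sample share, so the
    combined priors again sum to 1.
    """
    total = sum(counts)
    means, covs, priors = [], [], []
    for (m, c, p), n in zip(gmms, counts):
        means.extend(m)
        covs.extend(c)
        priors.extend(pi * n / total for pi in p)
    return means, covs, priors
```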
Privacy-preserving Dataset: Adult • Partition results • Experimental results
Discussions • Problems addressed • Large-scale problems: downsample by clustering, then classify • Privacy-preserving problems: individual information is hidden • Differences from other methods • Training units are generative models; testing units are vectors • Training units retain complete statistical information • Only one parameter for model selection • Easy implementation • Generalization ability is unclear, whereas for the RBF kernel in SVM a larger width leads to a lower VC dimension
Discussions • Advantages of using priors and covariances