Data mining @ Mahout Reporter: terry
What is Mahout? • Mahout's goal is to build scalable machine learning libraries. • What’s the meaning of “scalable”? • Scalable to reasonably large data sets. • Scalable to support your business case. • Scalable community • Notes: • The core libraries are highly optimized to allow for good performance also for non-distributed algorithms. • https://cwiki.apache.org/confluence/display/MAHOUT/Overview
How does it help us? • Currently Mahout mainly supports four use cases: • Recommendation mining • takes users' behavior and from that tries to find items users might like. • Clustering • takes e.g. text documents and groups them into groups of topically related documents. • Classification • learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. • Frequent item-set mining • takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
Classification in Mahout • Logistic Regression (SGD-Stochastic Gradient Descent) • A model that predicts the probability of an event occurring by fitting data to a logistic curve (logit function). • SGD • An online learning algorithm • Does on-line evaluation using cross validation • An evolutionary system optimizes the learning hyper-parameters http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en
Classification in Mahout • Logistic Regression (SGD-Stochastic Gradient Descent)
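Logistic regression trained by SGD can be sketched in a few lines. This is a minimal illustration only, not Mahout's implementation: the function names, the toy data, the learning rate and the number of epochs are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_sgd(samples, labels, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights [bias, w_1, ..., w_n], updating after every single example."""
    rng = random.Random(seed)
    weights = [0.0] * (len(samples[0]) + 1)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            x = [1.0] + list(samples[i])  # prepend 1.0 for the bias term
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            error = labels[i] - p
            # Gradient step on the log-likelihood of this one example.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights

def predict_proba(weights, sample):
    """Estimated probability that the sample belongs to class 1."""
    x = [1.0] + list(sample)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical one-feature data: small values are class 0, large values class 1.
toy_x = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
toy_y = [0, 0, 0, 1, 1, 1]
model = train_logistic_sgd(toy_x, toy_y)
print(predict_proba(model, [0.5]), predict_proba(model, [4.0]))
```

Because each update uses one example, the same loop works on a stream of data, which is what makes the method an online learner.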
Classification in Mahout • SGD-Stochastic Gradient Descent • An optimization algorithm • Learning rate
Classification in Mahout • SGD-Stochastic Gradient Descent
Classification in Mahout • SGD-Stochastic Gradient Descent • Straight line: • Input points: • Applications: LMS (least mean squares) and backpropagation
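The straight-line example can be sketched as an LMS fit: SGD adjusts the intercept and slope after each input point, scaled by the learning rate. The input points below are hypothetical (they lie exactly on y = 1 + 2x), and the learning rate and epoch count are assumptions chosen so that the toy run converges.

```python
import random

def fit_line_sgd(points, learning_rate=0.01, epochs=2000, seed=0):
    """Fit y = w0 + w1 * x by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    pts = list(points)
    for _ in range(epochs):
        rng.shuffle(pts)
        for x, y in pts:
            error = y - (w0 + w1 * x)         # residual for this single point
            w0 += learning_rate * error       # step along the gradient w.r.t. w0
            w1 += learning_rate * error * x   # step along the gradient w.r.t. w1
    return w0, w1

# Hypothetical input points on the line y = 1 + 2x.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
intercept, slope = fit_line_sgd(points)
print(intercept, slope)
```

A learning rate that is too large makes the updates diverge, and one that is too small makes convergence slow, which is why the slides single out the learning rate.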
Classification outside Mahout • AdaBoost • Training data • Weak classifiers
Classification outside Mahout • AdaBoost • Discrete AdaBoost
Classification outside Mahout • AdaBoost • Real AdaBoost
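A minimal sketch of Discrete AdaBoost follows, assuming labels in {-1, +1} and one-dimensional threshold stumps as the weak classifiers. The slides' own formulas did not survive, so the toy data, the stump learner and the choice of three rounds are all assumptions made for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: returns polarity above the threshold, -polarity otherwise."""
    return polarity if x > threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train an ensemble of (alpha, threshold, polarity) triples."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can separate.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([classify(ensemble, x) for x in xs])
```

Each stump alone misclassifies several points, yet the weighted vote of three stumps labels every training point correctly on this toy set.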
Classification in Mahout • Bayesian • Traditional Naive Bayes: Simple & Naïve • A simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions • The evidence (denominator) is constant for a given input
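Classification in Mahout • Bayesian • Independence assumption: $p(F_i \mid C, F_j) = p(F_i \mid C)$ for $i \neq j$ • Then: $p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$, where $Z$ is the constant evidence term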
Classification in Mahout • Bayesian • Parameter estimation • MAP (maximum a posteriori) • Class prior: the fraction of class C in the training set
Classification in Mahout • Bayesian • Example (Sex Classification) • Parameter estimation: the probability distribution of every feature in every class • The class priors:
Classification in Mahout • Bayesian • Example (Sex Classification) • Evidence
Classification in Mahout • Bayesian • Example (Sex Classification)
Classification in Mahout • Bayesian • Example (Sex Classification)
Classification in Mahout • Bayesian • Example (Sex Classification) • Post(male): × • Post(female): √ (the sample is classified as female)
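The sex classification example can be sketched with a Gaussian Naive Bayes model. The slides' data tables did not survive, so the features (height, weight, foot size) and every number below are illustrative assumptions; modelling each feature with a per-class Gaussian is also an assumption.

```python
import math

def mean_and_variance(values):
    """Sample mean and unbiased sample variance of one feature column."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, var

def gaussian_log_pdf(x, mean, variance):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)

def fit(training):
    """training maps a class label to a list of feature tuples."""
    total = sum(len(rows) for rows in training.values())
    model = {}
    for label, rows in training.items():
        stats = [mean_and_variance(column) for column in zip(*rows)]
        model[label] = (len(rows) / total, stats)  # (class prior, per-feature stats)
    return model

def log_posteriors(model, sample):
    """Unnormalized log posterior per class: log prior plus log likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (m, var) in zip(sample, stats):
            score += gaussian_log_pdf(x, m, var)
        scores[label] = score
    return scores

def classify(model, sample):
    """Pick the class with the largest posterior; the evidence cancels out."""
    scores = log_posteriors(model, sample)
    return max(scores, key=scores.get)

# Illustrative training data: (height in feet, weight in lbs, foot size in inches).
training = {
    "male": [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}
model = fit(training)
print(classify(model, (6.00, 130, 8)))
```

The evidence term is the same for both classes, so comparing the unnormalized posteriors is enough to choose a class.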
Classification in Mahout • Bayesian • Extension • Random Naïve Bayes • Random Tree + Naïve Bayes • Bayesian network • Conditional dependencies • Directed acyclic graph (DAG) • Nodes (variables) and edges (conditional dependencies)
Classification in Mahout • Support Vector Machine (BLC) • Each object is considered as a point in an n-dimensional feature space • Each point is labeled with ‘0’ or ‘1’ • Find a hyperplane that separates the objects • Linear separation in low dimensions leads to mistakes • Curse of dimensionality • Fewer features vs. free parameters • Impose structural constraints
Classification in Mahout • Support Vector Machine • Linear Classifier • A point separates a line • A line separates a plane • A plane separates a volume
Classification in Mahout • Support Vector Machine • Maximum-margin hyperplane • The best line: red line • The worst line: green line • The medium line: blue line • The distance from the plane to the nearest point on each side is maximized
Classification in Mahout • Support Vector Machine • Linear SVM • Find Maximum Margin hyperplane Hyperplanes:
Classification in Mahout • Support Vector Machine • Linear SVM
Classification in Mahout • Support Vector Machine • Linear SVM ∝ Lagrange multipliers
Classification in Mahout • Support Vector Machine • Linear SVM Wrong? Maybe!
Classification in Mahout • Support Vector Machine • Linear SVM generalized
Classification in Mahout • Support Vector Machine • Linear SVM Support Vectors!
Classification in Mahout • Support Vector Machine • Linear SVM • Soft margin
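A soft-margin linear SVM can be sketched by minimizing the regularized hinge loss with stochastic sub-gradient steps. This swaps in a different solver from the Lagrange-multiplier derivation on the slides, and it assumes labels in {-1, +1} rather than ‘0’ and ‘1’. The data, regularization strength and learning rate are hypothetical.

```python
import random

def train_linear_svm(samples, labels, regularization=0.01,
                     learning_rate=0.01, epochs=1000, seed=0):
    """Minimize (regularization / 2) * |w|^2 + hinge loss by sub-gradient steps."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Point is inside the margin or misclassified: hinge term is active.
                w = [wi + learning_rate * (y * xi - regularization * wi)
                     for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Point is safely outside the margin: only the regularizer acts.
                w = [wi - learning_rate * regularization * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical linearly separable 2-D data.
samples = [(2.0, 2.0), (3.0, 3.0), (3.0, 1.0), (-2.0, -2.0), (-3.0, -3.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(w, b, [predict(w, b, x) for x in samples])
```

Only points with margin below 1 change the hyperplane's direction, which mirrors the role of the support vectors; the regularization constant controls how much margin violation is tolerated.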
Classification in Mahout • Support Vector Machine • Non-Linear SVM • Dot product (No!) • Kernel function • Low dimensions → high dimensions
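Classification in Mahout • Support Vector Machine • Non-Linear SVM • Common kernels: • Polynomial (homogeneous): $k(x_i, x_j) = (x_i \cdot x_j)^d$ • Gaussian radial basis function: $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ with $\gamma > 0$ • Hyperbolic tangent: $k(x_i, x_j) = \tanh(\kappa\, x_i \cdot x_j + c)$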
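The three common kernels can be written directly, using their standard textbook forms. The parameter names and default values (`degree`, `gamma`, `kappa`, `c`) are assumptions, as is the explicit feature map `phi`, which is included only to show that a kernel equals a dot product in a higher-dimensional space.

```python
import math

def dot(x, y):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel: (x . y) ** degree."""
    return dot(x, y) ** degree

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian radial basis function kernel, gamma > 0."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def tanh_kernel(x, y, kappa=1.0, c=-1.0):
    """Hyperbolic tangent kernel."""
    return math.tanh(kappa * dot(x, y) + c)

def phi(x):
    """Explicit degree-2 feature map for 2-D input: 2 dimensions become 3."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

a, b = (1.0, 2.0), (3.0, 0.5)
# The kernel computes the dot product in the 3-D feature space
# without ever constructing phi(a) or phi(b).
print(polynomial_kernel(a, b), dot(phi(a), phi(b)))
print(rbf_kernel(a, b), tanh_kernel(a, b))
```

Replacing every dot product in the linear SVM with such a kernel gives a non-linear decision boundary in the original space while the optimization stays the same.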
Clustering in Mahout • K-Means Clustering • Partition n observations into k clusters • Observations: $(x_1, \dots, x_n)$ • Clusters: $S = \{S_1, \dots, S_k\}$ • $\mu_i$ is the mean of points in $S_i$
Clustering in Mahout • K-Means Clustering • Step1: Assignment step • Step2: Update step • Repeat until the assignments no longer change!
Clustering in Mahout • K-Means Clustering a. The result may depend on the initial clusters b. It is usually fast, so run it several times with different initial conditions
Clustering in Mahout • K-Means Clustering K=2? Or k=3? Which is better?
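The assignment and update steps, together with the advice to run several times, can be sketched as below. This is a plain-Python illustration, not Mahout's distributed implementation; the data points, seeds and function names are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=0, max_iterations=100):
    """Return (centers, assignment, cost) for one random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as initial centers
    assignment = None
    for _ in range(max_iterations):
        # Step 1 (assignment): each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments no longer change
        assignment = new_assignment
        # Step 2 (update): each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, cost

def best_of_runs(points, k, runs=5):
    """Run k-means with several seeds and keep the lowest-cost result."""
    return min((k_means(points, k, seed=s) for s in range(runs)), key=lambda r: r[2])

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment, cost = best_of_runs(points, 2)
print(assignment, cost)
# Comparing the within-cluster cost for different k is one way to approach "k=2 or k=3?".
print(best_of_runs(points, 3)[2])
```

The cost is the within-cluster sum of squares. It tends to fall as k grows, so a lower value alone does not prove that a larger k is better.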
Clustering in Mahout • Fuzzy Clustering • Every point has a degree of belonging to each cluster • Most popular: FCM (Fuzzy C-Means) • Fuzzy (graded) membership vs. hard (determined) membership • Data set: • Returns:
Clustering in Mahout • Fuzzy Clustering • FCM • Step1: Initialization • Number of clusters • Exponential weight • Termination criterion • Partition matrix
Clustering in Mahout • Fuzzy Clustering • FCM • Step2: calculate the cluster centers • Step3: calculate the partition matrix
Clustering in Mahout • Fuzzy Clustering • FCM • Step4: calculate the change in the partition matrix and stop once it falls below the termination criterion
Clustering in Mahout • Fuzzy Clustering • FCM • Example
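The four FCM steps can be sketched as follows, using the standard update rules (centers weighted by membership raised to the exponential weight m, memberships from inverse distance ratios). The slides' own formulas did not survive, so these rules, the parameter names and the data are assumptions.

```python
import random

def fuzzy_c_means(points, n_clusters, m=2.0, epsilon=1e-5,
                  max_iterations=300, seed=0):
    """Return (centers, partition matrix u), where u[i][j] is the degree
    to which point i belongs to cluster j and each row sums to 1."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Step 1: initialize a random partition matrix with normalized rows.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(max_iterations):
        # Step 2: cluster centers are means weighted by u ** m.
        centers = []
        for j in range(n_clusters):
            wts = [u[i][j] ** m for i in range(len(points))]
            total = sum(wts)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(wts, points)) / total
                for d in range(dim)))
        # Step 3: recompute the partition matrix from distances to the centers.
        new_u = []
        for p in points:
            dists = [max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                     for c in centers]
            new_u.append([
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0))
                          for k in range(n_clusters))
                for j in range(n_clusters)])
        # Step 4: stop when the partition matrix barely changes.
        change = max(abs(a - b) for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change < epsilon:
            break
    return centers, u

# Hypothetical data: two groups plus one point halfway between them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (5.5, 5.5)]
centers, u = fuzzy_c_means(points, 2)
for p, row in zip(points, u):
    print(p, [round(v, 3) for v in row])
```

Points inside a group get a membership close to 1 for one cluster, while the in-between point keeps a split membership, which is the difference from hard assignment.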
Clustering in Mahout • Spectral Clustering • Makes use of the spectrum of the similarity matrix • Input: data points and the number of clusters k • Similarity matrix: • Similarity graph:
Clustering in Mahout • Spectral Clustering • Different similarity graphs • ε-neighborhood graph • ε is a threshold • K-nearest neighbor graph • Directed graph or undirected graph • The fully connected graph
Clustering in Mahout • Spectral Clustering • Unnormalized Laplacian matrix: Property:
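The slide's formulas are missing, so the sketch below assumes the standard definition of the unnormalized Laplacian, L = D - W, where W is the similarity matrix and D the diagonal degree matrix. It builds an ε-neighborhood graph on hypothetical points and checks two well-known properties numerically: f^T L f equals half the weighted sum of squared differences, and L maps the indicator vector of a connected component to zero.

```python
def epsilon_neighborhood_graph(points, epsilon):
    """W[i][j] = 1 when two distinct points are within distance epsilon."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            if i != j and dist <= epsilon:
                w[i][j] = 1.0
    return w

def unnormalized_laplacian(w):
    """L = D - W, with D the diagonal matrix of row sums of W."""
    n = len(w)
    return [[(sum(w[i]) if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]

def quadratic_form(lap, f):
    """f^T L f."""
    n = len(f)
    return sum(f[i] * lap[i][j] * f[j] for i in range(n) for j in range(n))

def pairwise_form(w, f):
    """Half the sum over all i, j of w_ij * (f_i - f_j) ** 2."""
    n = len(f)
    return 0.5 * sum(w[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))

def mat_vec(lap, f):
    """Matrix-vector product L f."""
    return [sum(row[j] * f[j] for j in range(len(f))) for row in lap]

# Hypothetical points forming two connected components for epsilon = 1.5.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
w = epsilon_neighborhood_graph(points, 1.5)
lap = unnormalized_laplacian(w)
f = [1.0, 2.0, 4.0, -1.0, 3.0]
print(quadratic_form(lap, f), pairwise_form(w, f))
print(mat_vec(lap, [1.0, 1.0, 1.0, 0.0, 0.0]))
```

The second check is why the eigenvectors for the smallest eigenvalues of L carry the cluster structure: each connected component contributes an eigenvector with eigenvalue 0.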
Clustering in Mahout • Spectral Clustering • steps:
Clustering in Mahout • Expectation maximization • Description • Observations: $X$ • Latent data or missing values: $Z$ • Unknown parameters: $\theta$ • Likelihood function: $L(\theta; X, Z) = p(X, Z \mid \theta)$ • MLE!
Clustering in Mahout • Expectation maximization • Step1: Expectation step • Step2: Maximization step
Clustering in Mahout • Expectation maximization • Hard-EM • Initialize the parameter θ • Compute the best value for z • Derive a better θ • Soft-EM • Determine the probability of each possible value for z
Clustering in Mahout • Expectation maximization • Example (Gaussian Mixture) • Input: • Latent variables: determine which component each observation comes from • Soft EM: • Parameters: • Likelihood function:
Clustering in Mahout • Expectation maximization • Example (Gaussian Mixture)
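Soft EM for a Gaussian mixture can be sketched in one dimension with two components. The data, the initialization (means at the smallest and largest observation) and the fixed iteration count are assumptions made for the example; the slides' own formulas are not recoverable.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def em_gaussian_mixture(data, iterations=100):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    means = [min(data), max(data)]       # assumed initialization
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability of each latent component value for each observation.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                1e-6)  # floor keeps a component from collapsing onto one point
    return weights, means, variances

# Hypothetical observations drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_gaussian_mixture(data)
print(weights, means, variances)
```

Hard EM would replace the E-step responsibilities with a single best component per observation. The soft version keeps the full probabilities, which is what lets each observation contribute to both components.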