Data mining @ Mahout

Reporter: terry

What is Mahout? • Mahout's goal is to build scalable machine learning libraries. • What’s the meaning of “scalable”? • Scalable to reasonably large data sets. • Scalable to support your business case. • Scalable community • Notes: • The core libraries are highly optimized to allow for good performance also for non-distributed algorithms. • https://cwiki.apache.org/confluence/display/MAHOUT/Overview

How does it help us? • Currently Mahout supports mainly four use cases: • Recommendation mining • takes users' behavior and from that tries to find items users might like. • Clustering • takes e.g. text documents and groups them into groups of topically related documents. • Classification • learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. • Frequent item-set mining • takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.

Classification in Mahout • Logistic Regression (SGD-Stochastic Gradient Descent) • A model that predicts the probability of an event by fitting data to a logistic curve (logit function). • SGD • An online learning algorithm • Performs on-line evaluation using cross-validation • An evolutionary system for hyper-parameter optimization during learning http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en
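To make the idea of online logistic regression concrete, here is a minimal plain-Python sketch. It is not Mahout's API; the function names, the learning rate, the number of epochs, and the toy data are all assumptions chosen for illustration. Each training example triggers one weight update, which is what makes the method "online".

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(samples, labels, learning_rate=0.1, epochs=200, seed=0):
    """Online logistic regression: one weight update per training example."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in random order
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = y - p  # gradient of the log-likelihood w.r.t. the logit
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy data (made up): the label is 1 when the single feature is large.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_sgd(X, y)
print(predict_proba(w, b, [0.0]), predict_proba(w, b, [5.0]))
```

The predicted probability for a small feature value should fall below 0.5 and for a large one above 0.5; the exact numbers depend on the assumed learning rate and epoch count.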

Classification in Mahout • Logistic Regression (SGD-Stochastic Gradient Descent)

Classification in Mahout • SGD-Stochastic Gradient Descent • An optimization algorithm • Learning rate

Classification in Mahout • SGD-Stochastic Gradient Descent

Classification in Mahout • SGD-Stochastic Gradient Descent • Example: fitting a straight line to input points • Applications: LMS (least mean squares) and backpropagation
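The straight-line example can be sketched as follows. This is an illustrative LMS-style update in plain Python, assuming a squared-error loss; the function name, learning rate, epoch count, and the noise-free points on y = 2x + 1 are assumptions, since the original equations did not survive in the slides.

```python
import random


def fit_line_sgd(points, learning_rate=0.01, epochs=2000, seed=0):
    """Fit y = slope * x + intercept, updating after every single point."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    pts = list(points)
    for _ in range(epochs):
        rng.shuffle(pts)
        for x, y in pts:
            residual = y - (slope * x + intercept)
            # Step against the gradient of 0.5 * residual**2 (the LMS rule).
            slope += learning_rate * residual * x
            intercept += learning_rate * residual
    return slope, intercept


# Assumed toy input: noise-free points on the line y = 2x + 1.
points = [(float(x), 2.0 * x + 1.0) for x in range(5)]
slope, intercept = fit_line_sgd(points)
print(slope, intercept)
```

With noise-free data the estimates approach the true slope and intercept; a larger learning rate converges faster but can diverge, which is why the learning rate is the key tuning knob.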

Classification outside Mahout • AdaBoost • Training data • Weak classifiers

Classification outside Mahout • AdaBoost • Discrete AdaBoost

Classification outside Mahout • AdaBoost • Real AdaBoost
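A minimal sketch of Discrete AdaBoost, assuming one-dimensional inputs, labels in {+1, -1}, and decision stumps as the weak classifiers. The slides do not say which weak learner was used, so the stump, the function names, and the toy data are assumptions.

```python
import math


def stump_predict(threshold, polarity, x):
    # A decision stump: one threshold on a single feature.
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(threshold, polarity, x) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this weak classifier
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Made-up data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this data three weighted stumps are enough to classify every training point correctly, although each stump alone makes at least three mistakes.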

Classification in Mahout • Bayesian • Traditional Naive Bayes: Simple & Naïve • A simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions • The denominator of Bayes' theorem (the evidence) is constant for a given input

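Classification in Mahout • Bayesian • Independence assumption: $p(F_i \mid C, F_j) = p(F_i \mid C)$ for $i \neq j$ • Then: $p(C \mid F_1, \dots, F_n) \propto p(C) \prod_{i=1}^{n} p(F_i \mid C)$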

Classification in Mahout • Bayesian • Parameter estimation • MAP (maximum a posteriori) • Class prior: the fraction of class C in the training set

Classification in Mahout • Bayesian • Example (sex classification) • Parameter estimation: the probability distribution of every feature in every class • The class priors

Classification in Mahout • Bayesian • Example (sex classification) • Evidence

Classification in Mahout • Bayesian • Example (sex classification)

Classification in Mahout • Bayesian • Example (sex classification) • Post(male) × • Post(female) √
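The example can be sketched with a small Gaussian naive Bayes classifier. The slide's actual numbers were lost, so the training rows below (height in feet, weight in pounds, foot size in inches) are illustrative toy values, and the assumption that each feature is normally distributed within a class is mine, not the slides'.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit(rows, labels):
    """Estimate the class priors and a per-class Gaussian for every feature."""
    model = {}
    for cls in set(labels):
        cls_rows = [r for r, l in zip(rows, labels) if l == cls]
        columns = list(zip(*cls_rows))
        model[cls] = {
            "prior": len(cls_rows) / len(rows),
            "params": [(mean(c), sample_variance(c)) for c in columns],
        }
    return model


def posterior_numerators(model, row):
    # prior * product of feature likelihoods; the evidence is the same for
    # every class, so it can be ignored when comparing classes.
    scores = {}
    for cls, info in model.items():
        score = info["prior"]
        for x, (mu, var) in zip(row, info["params"]):
            score *= gaussian_pdf(x, mu, var)
        scores[cls] = score
    return scores


def classify(model, row):
    scores = posterior_numerators(model, row)
    return max(scores, key=scores.get)


# Illustrative toy data: (height ft, weight lb, foot size in).
rows = [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10),
        (5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)]
labels = ["male"] * 4 + ["female"] * 4
model = fit(rows, labels)
print(classify(model, (6.0, 130, 8)))
```

For the sample (6.0, 130, 8) the female posterior numerator is larger than the male one, so the sample is classified as female even though the height alone would suggest otherwise.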

Classification in Mahout • Bayesian • Extensions • Random Naïve Bayes • Random Tree + Naïve Bayes • Bayesian network • Conditional dependencies • Directed acyclic graph (DAG) • Nodes (variables) and edges (conditional dependencies)

Classification in Mahout • Support Vector Machine (BLC) • Each object is treated as a point in an n-dimensional feature space • Each point is labeled ‘0’ or ‘1’ • Find a hyperplane that separates the objects • Linear separation in low dimensions leads to mistakes • Curse of dimensionality • Fewer features vs. free parameters • Impose structural constraints

Classification in Mahout • Support Vector Machine • Linear classifier • A point separates a line • A line separates a plane • A plane separates a volume

Classification in Mahout • Support Vector Machine • Maximum-margin hyperplane • The best line: red line • The worst line: green line • The medium line: blue line • The distance from the hyperplane to the nearest point on each side is maximized

Classification in Mahout • Support Vector Machine • Linear SVM • Find the maximum-margin hyperplane • Hyperplanes:

Classification in Mahout • Support Vector Machine • Linear SVM

Classification in Mahout • Support Vector Machine • Linear SVM ∝ Lagrange multipliers

Classification in Mahout • Support Vector Machine • Linear SVM • Wrong? Maybe!

Classification in Mahout • Support Vector Machine • Linear SVM generalized

Classification in Mahout • Support Vector Machine • Linear SVM Support Vectors!

Classification in Mahout • Support Vector Machine • Linear SVM • Soft margin
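A hedged sketch of a soft-margin linear SVM, trained here by batch subgradient descent on the regularized hinge loss rather than through the Lagrangian dual that the slides refer to. The labels are +1/-1 instead of the slide's '0'/'1', and the function names, regularization strength, learning rate, and toy points are assumptions.

```python
def decision_value(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(points, labels, reg=0.01, learning_rate=0.1, epochs=500):
    """Minimize (reg / 2) * ||w||^2 + mean(max(0, 1 - y * (w.x + b)))."""
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            # Only points inside the margin (or misclassified) contribute.
            if y * decision_value(w, b, x) < 1:
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


# Made-up, linearly separable points, symmetric about the origin.
points = [(-1.0, -1.0), (-2.0, -1.0), (-1.0, -2.0),
          (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
print(w, b, [predict(w, b, p) for p in points])
```

The points that end up on or inside the margin are the ones that keep contributing to the gradient, which mirrors the role of the support vectors; a larger `reg` widens the margin at the cost of more margin violations.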

Classification in Mahout • Support Vector Machine • Non-Linear SVM • Dot product (no!) • Kernel function • Low dimensions → high dimensions

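Classification in Mahout • Support Vector Machine • Non-Linear SVM • Common kernels: • Polynomial (homogeneous): $k(x_i, x_j) = (x_i \cdot x_j)^d$ • Gaussian radial basis function: $k(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ for $\gamma > 0$ • Hyperbolic tangent: $k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c)$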
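The three kernels can be written in a few lines of plain Python. The parameter defaults (`degree`, `gamma`, `kappa`, `c`) are arbitrary assumptions. The sketch also checks the kernel trick for the degree-2 homogeneous polynomial kernel in two dimensions: the kernel value equals an ordinary dot product after the explicit feature map `phi`, so the higher-dimensional space never has to be built.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2):
    # Homogeneous polynomial kernel: (x . y)^d
    return dot(x, y) ** degree


def rbf_kernel(x, y, gamma=0.5):
    # Gaussian radial basis function: exp(-gamma * ||x - y||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def tanh_kernel(x, y, kappa=0.1, c=-1.0):
    # Hyperbolic tangent kernel: tanh(kappa * x . y + c)
    return math.tanh(kappa * dot(x, y) + c)


def phi(x):
    # Explicit feature map for the degree-2 homogeneous kernel in 2-D:
    # (x1, x2) -> (x1^2, sqrt(2) * x1 * x2, x2^2)
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, y = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(x, y), dot(phi(x), phi(y)))
print(rbf_kernel(x, y), tanh_kernel(x, y))
```

Both values on the first printed line agree up to rounding, which is the point of the kernel function: the low-dimensional computation stands in for a dot product in the higher-dimensional space.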

Clustering in Mahout • K-Means Clustering • Partition n observations into k clusters • Observations: $(x_1, x_2, \dots, x_n)$ • Clusters: $S = \{S_1, S_2, \dots, S_k\}$ • $\mu_i$ is the mean of points in $S_i$

Clustering in Mahout • K-Means Clustering • Step 1: Assignment step • Step 2: Update step • Repeat until the assignments no longer change
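The two steps can be sketched as follows. This is a plain-Python illustration, not Mahout's implementation; the function names, the explicit `initial_centers` argument, and the toy points are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iterations=100):
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iterations):
        # Step 1 (assignment): attach every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the assignment no longer changes
        assignment = new_assignment
        # Step 2 (update): move every center to the mean of its points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = kmeans(points, initial_centers=[points[0], points[3]])
print(centers, assignment)
```

Because the outcome depends on `initial_centers`, a common practice is to call the function several times with different starting centers and keep the run with the smallest total within-cluster distance.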

Clustering in Mahout • K-Means Clustering • The result may depend on the initial clusters • It is usually fast, so run it several times with different initial conditions

Clustering in Mahout • K-Means Clustering • k = 2 or k = 3? Which is better?

Clustering in Mahout • Fuzzy Clustering • Every point has a degree of membership in each cluster • Most popular: FCM (Fuzzy C-Means) • Fuzzy membership vs. hard (determined) membership • Input: a data set • Returns: the cluster centers and a partition matrix

Clustering in Mahout • Fuzzy Clustering • FCM • Step 1: Initialization • Number of clusters • Exponential weight • Termination criterion • Partition matrix

Clustering in Mahout • Fuzzy Clustering • FCM • Step 2: Calculate the cluster centers • Step 3: Calculate the partition matrix

Clustering in Mahout • Fuzzy Clustering • FCM • Step 4: Calculate the change in the partition matrix

Clustering in Mahout • Fuzzy Clustering • FCM • Example
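Since the slide's example did not survive, here is a small stand-in: a plain-Python sketch of the four FCM steps using the standard update formulas. The function name, the exponential weight `m = 2`, the termination threshold `epsilon`, the random seed, and the toy points are all assumptions.

```python
import random


def fuzzy_c_means(points, n_clusters, m=2.0, epsilon=1e-5,
                  max_iterations=100, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Step 1: random partition matrix; every row sums to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = []
    for _ in range(max_iterations):
        # Step 2: cluster centers as membership-weighted means.
        centers = []
        for j in range(n_clusters):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)))
        # Step 3: new partition matrix from the distances to the centers.
        new_u = []
        for p in points:
            dists = [max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12)
                     for c in centers]
            new_u.append([
                1.0 / sum((dists[j] / dists[k]) ** (2.0 / (m - 1.0))
                          for k in range(n_clusters))
                for j in range(n_clusters)])
        # Step 4: stop when the partition matrix barely changes.
        change = max(abs(a - b) for ra, rb in zip(new_u, u)
                     for a, b in zip(ra, rb))
        u = new_u
        if change < epsilon:
            break
    return centers, u


# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, u = fuzzy_c_means(points, n_clusters=2)
dominant = [row.index(max(row)) for row in u]
print(centers, dominant)
```

Unlike k-means, every row of `u` holds a degree of membership for each cluster; taking the largest entry per row recovers a hard assignment when one is needed.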

Clustering in Mahout • Spectral Clustering • Makes use of the spectrum of the similarity matrix • Input: the data points and the number of clusters k • Similarity matrix • Similarity graph

Clustering in Mahout • Spectral Clustering • Different similarity graphs • ε-neighborhood graph • ε is a threshold • k-nearest neighbor graph • Directed graph or undirected graph • The fully connected graph

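Clustering in Mahout • Spectral Clustering • Unnormalized Laplacian matrix: $L = D - W$, where $W$ is the weighted adjacency matrix of the similarity graph and $D$ is the diagonal degree matrix • Property: $L$ is symmetric and positive semi-definite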

Clustering in Mahout • Spectral Clustering • steps:
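The individual steps were lost from the slide, so the following is only a sketch of the usual unnormalized procedure, simplified to k = 2: build a fully connected Gaussian similarity graph, form the Laplacian as degree matrix minus similarity matrix, find the eigenvector of the second-smallest eigenvalue, and split the points by its sign. For general k the usual approach runs k-means on the first k eigenvectors instead. The power iteration used to find the eigenvector, the `sigma` parameter, and the toy points are assumptions made to keep the example free of external libraries.

```python
import math
import random


def gaussian_similarity(p, q, sigma=1.0):
    squared = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-squared / (2 * sigma ** 2))


def unnormalized_laplacian(points, sigma=1.0):
    n = len(points)
    w = [[0.0 if i == j else gaussian_similarity(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    # Laplacian = degree matrix - similarity matrix
    return [[(degree[i] if i == j else 0.0) - w[i][j] for j in range(n)]
            for i in range(n)]


def second_eigenvector(lap, iterations=200, seed=0):
    """Power iteration on (c*I - L) with the constant vector projected out."""
    n = len(lap)
    c = 2.0 * max(lap[i][i] for i in range(n))  # upper bound on eigenvalues
    rng = random.Random(seed)
    v = [rng.random() - 0.5 for _ in range(n)]
    for _ in range(iterations):
        mean_v = sum(v) / n
        v = [x - mean_v for x in v]  # remove the trivial constant eigenvector
        v = [c * v[i] - sum(lap[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def spectral_bipartition(points, sigma=1.0):
    vector = second_eigenvector(unnormalized_laplacian(points, sigma))
    return [0 if x < 0 else 1 for x in vector]


# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
cluster_labels = spectral_bipartition(points)
print(cluster_labels)
```

Which group receives label 0 is arbitrary, because an eigenvector is only defined up to sign; only the grouping itself is meaningful.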

Clustering in Mahout • Expectation maximization • Description • Observations: $x$ • Latent data or missing values: $z$ • Unknown parameters: $\theta$ • Likelihood function: $L(\theta; x, z) = p(x, z \mid \theta)$ • MLE!

Clustering in Mahout • Expectation maximization • Step 1: Expectation step • Step 2: Maximization step

Clustering in Mahout • Expectation maximization • Hard-EM • Initialize the parameters • Compute the best value for z • Derive a better estimate of the parameters • Soft-EM • Determine the probability of each possible value for z

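Clustering in Mahout • Expectation maximization • Example (Gaussian mixture) • Input: the observations • Latent variables: determine the component from which each observation originates • Soft EM • Parameters: the mixing weights, means, and variances of the components • Likelihood function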

Clustering in Mahout • Expectation maximization • Example( Gaussian Mixture)
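A minimal soft-EM sketch for a mixture of two one-dimensional Gaussians. The slide's own derivation was lost, so the initialization (means at the smallest and largest observation), the variance floor, the iteration count, and the toy data are assumptions.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gaussian_mixture(data, iterations=50):
    n = len(data)
    # Initialization (assumed): means at the extremes, shared variance.
    mus = [min(data), max(data)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E step: probability that each observation came from each component.
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, mus[k], variances[k])
                     for k in range(2)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M step: re-estimate the parameters from the soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk,
                1e-6)  # floor keeps a component from collapsing
    return weights, mus, variances


# Made-up observations drawn around 1.0 and 5.0.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, mus, variances = em_gaussian_mixture(data)
print(weights, mus, variances)
```

Replacing the soft responsibilities with a 0/1 assignment to the most probable component would turn this into the hard-EM variant, which for Gaussians with equal fixed variance behaves like k-means.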
