Data Mining in Practice: Techniques and Practical Applications. Junling Hu, May 14, 2013.
What is data mining? • Mining patterns from data • Is it statistics? • Functional form? • Computation speed concern? • Data size • Variable size • Is it machine learning? • Big data issue • New methods: network mining
Examples of data mining • Frequently bought together • Movie recommendation
More examples of data mining • Keyword suggestions • Genome & disease mining • Heart monitoring
Overview of data mining • Frequent pattern mining • Machine learning: supervised, unsupervised • Stream mining • Recommender system • Graph mining • Unstructured data: text, audio, image and video • Big data technology
Frequent Pattern Mining • Diaper and Beer • Product assortment • Click behavior • Machine breakdown?
The case of Amazon • Count frequency of co-occurrence • Efficient algorithm
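The "count frequency of co-occurrence" step can be sketched as below. This is a naive pair count over hypothetical toy baskets, not Amazon's actual method; efficient algorithms such as Apriori avoid enumerating every pair by pruning infrequent candidates first. The function name `count_pairs` and the basket contents are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_pairs(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy baskets, not data from the talk
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "milk"],
]

pair_counts = count_pairs(baskets)
print(pair_counts.most_common(2))
```

Pairs with the highest counts are the candidates for a "frequently bought together" suggestion.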
Machine Learning • Supervised • Unsupervised (clustering)
Binary classification [Figure: data points, each with input features and an output class]
Classification (1) • Decision tree
Classification (2): Neural network • Perceptron • Multi-layer neural network
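As a rough illustration of the perceptron, the sketch below trains a single unit with the classic update rule (add `lr * y * x` to the weights whenever a point is misclassified). The toy data set (logical AND with labels +1/-1), the function names, and the learning rate are assumptions chosen for the example, not material from the slides.

```python
def train_perceptron(data, epochs=20, lr=1.0):
    """Train weights w and bias b with the perceptron update rule."""
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in data:  # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy data: logical AND, which is linearly separable
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print([predict(w, b, x) for x, _ in and_data])
```

A single perceptron only finds linear boundaries; a multi-layer network stacks such units to represent non-linear ones.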
Support Vector Machine (SVM) • Search for a separating hyperplane • Maximize margin
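One standard way to write "search for a separating hyperplane, maximize margin" is the hard-margin formulation below. The symbols ($w$ for the normal vector, $b$ for the offset, $x_i$ and $y_i \in \{-1, +1\}$ for the training points and labels) are notation assumed here, not taken from the slides.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the margin width is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ is the same as maximizing the margin.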
Perceived advantage of SVM • Transform data into higher dimension
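A small sketch of why moving to a higher dimension helps: XOR cannot be separated by a line in two dimensions, but adding the product of the two inputs as a third coordinate makes it separable by a plane. The feature map `lift` and the hand-picked weights are hypothetical choices for illustration; in practice an SVM usually gets this effect implicitly through a kernel function rather than an explicit map.

```python
def lift(x1, x2):
    """Hypothetical feature map: add the product x1*x2 as a third coordinate."""
    return (x1, x2, x1 * x2)


def linear_classifier(z, w=(1.0, 1.0, -2.0), b=-0.5):
    """Classify by the sign of a hand-picked hyperplane in the lifted 3-D space."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else 0


# XOR: not linearly separable in the original 2-D space
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predictions = [linear_classifier(lift(*x)) for x, _ in xor_data]
print(predictions)
```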
Applications of SVM: Spam Filter • Input features: • Transmission: IP address (167.12.24.555), sender URL (one-spam.com) • Email header: From (“admin@one-spam.com”), To (“undisclosed”), cc • Email body: # of paragraphs, # of words • Email structure: # of attachments, # of links
Logistic regression • Advantage: Simple functional form • Can be parallelized • Large scale
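The "simple functional form" is the sigmoid of a linear score, p = 1 / (1 + exp(-(w·x + b))). The sketch below fits it with plain stochastic gradient descent on hypothetical two-feature data; the function names, learning rate, and data are assumptions for illustration only. Each update touches one example at a time, which is part of why the method parallelizes and scales well.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of the positive class under the logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sgd_step(w, b, x, y, lr=0.1):
    """One stochastic gradient step on the log loss for one example (y is 0 or 1)."""
    error = predict_proba(w, b, x) - y
    w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    b = b - lr * error
    return w, b


# Hypothetical training data: (features, label)
train = [([2.0, 1.0], 1), ([1.5, 0.5], 1), ([-1.0, -0.5], 0), ([-2.0, -1.0], 0)]

w, b = [0.0, 0.0], 0.0
for _ in range(200):
    for x, y in train:
        w, b = sgd_step(w, b, x, y)

print([round(predict_proba(w, b, x), 3) for x, _ in train])
```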
Applications of logistic regression • Click prediction • Search ranking (web pages, products) • Online advertising • Recommendation • The model • Output: click/no click • Input features: page content, search keyword, user information
Regression • Linear regression • Non-linear regression • Application: • Stock price prediction • Credit scoring • Employment forecast
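For the linear case with a single input, the least-squares line has a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below applies it to made-up numbers; `fit_line` and the data are illustrative assumptions, not an example from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```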
Semi-supervised learning • Application: • Speech dialog system
Unsupervised learning: Clustering • No labeled data • Methods • K-means
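K-means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A minimal sketch follows, assuming 2-D points and caller-supplied starting centroids so the run is deterministic; the helper names and the toy points are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with a fixed number of iterations."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean (keep it if the cluster is empty)
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical unlabeled points forming two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
```

Results depend on the starting centroids, so practical implementations usually run several random restarts.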
Applications of Clustering • Malware detection • Document clustering: Topic detection
Graphs in our life • Social network: friend recommendation • Molecular compound: drug discovery
Graph and its matrix representation • Adjacency matrix [Figure: a graph with nodes 1 to 6 and its adjacency matrix]
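To make the matrix representation concrete, the sketch below builds an adjacency matrix for an undirected six-node graph. The edge list is hypothetical, since the edges of the graph in the original figure are not recoverable; entry `A[i][j]` is 1 when nodes `i+1` and `j+1` are connected.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph on nodes 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1
    return A


# Hypothetical edges among nodes 1..6 (not the edges from the slide's figure)
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6)]
A = adjacency_matrix(6, edges)

# Row sums give each node's degree
degrees = [sum(row) for row in A]
for row in A:
    print(row)
```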
The web graph [Figure: Pages 1, 2, and 3 connected by hyperlinks, each labeled with its anchor text]
PageRank as a steady state • Transition matrix $P$ • PageRank is a probability vector $\pi$ (written as a row vector) such that $\pi = \pi P$
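The steady state can be approximated by power iteration: start from a uniform probability vector and repeatedly multiply by the transition matrix until the vector stops changing. The sketch below assumes a row-stochastic matrix for a hypothetical three-page web (page 1 links to pages 2 and 3, page 2 links to page 3, page 3 links to page 1). It omits the damping factor that practical PageRank adds to guarantee convergence on arbitrary graphs.

```python
def pagerank(P, tol=1e-12, max_iter=1000):
    """Power iteration for the stationary row vector pi with pi = pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical transition matrix: row i holds the probabilities of moving from page i+1
P = [
    [0.0, 0.5, 0.5],  # page 1 links to pages 2 and 3
    [0.0, 0.0, 1.0],  # page 2 links to page 3
    [1.0, 0.0, 0.0],  # page 3 links to page 1
]

ranks = pagerank(P)
print([round(r, 4) for r in ranks])
```

For this small graph the stationary vector can be checked by hand: pages 1 and 3 each hold probability 0.4 and page 2 holds 0.2.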
Discover influencers on Twitter • The Twitter graph • Node • Link (following) • A PageRank approach: TwitterRank [Figure: nodes 1 to 5 connected by "following" links]
Facebook graph search • Entity graph • Natural language search • “Restaurants liked by my friends”
Prediction Problems • Rating Prediction • Given how a user rated other items, predict the user’s rating for a given item • Top-N Recommendation • Given the list of items liked by a user, recommend new items that the user might like
Explicit vs. Implicit Feedback Data • Explicit feedback • Ratings and reviews • Implicit feedback (user behavior) • Purchase behavior: Recency, frequency, … • Browsing behavior: # of visits, time of visit, time of staying, clicks
Collaborative Filtering • Hypotheses • User/Item Similarities • Similar users purchase similar items • Similar items are purchased by similar users • Matching characteristics • A match exists between the user’s and the item’s characteristics
User-User similarity • User’s movie rating
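One common way to measure user-user similarity from movie ratings is cosine similarity over the movies both users have rated. The sketch below is one such choice under stated assumptions, not necessarily the measure used in the talk; the user names, movie names, and ratings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two users' ratings, restricted to co-rated items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = math.sqrt(sum(a[item] ** 2 for item in common))
    norm_b = math.sqrt(sum(b[item] ** 2 for item in common))
    return dot / (norm_a * norm_b)


# Hypothetical movie ratings on a 1-5 scale
ratings = {
    "alice": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "bob": {"movie_a": 4, "movie_b": 5, "movie_c": 1},
    "carol": {"movie_a": 1, "movie_b": 2, "movie_c": 5},
}

sim_alice_bob = cosine_similarity(ratings["alice"], ratings["bob"])
sim_alice_carol = cosine_similarity(ratings["alice"], ratings["carol"])
print(round(sim_alice_bob, 3), round(sim_alice_carol, 3))
```

A user-based recommender would then predict a user's rating for an unseen movie from the ratings of the most similar users.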
Application of item-item similarity • Amazon
Application of Latent Factor Model • GetJar
Application in LinkedIn • Ranking-based model
Thanks and Contact • Co-author: Patricia Hoffman • Contact: junlinghu@gmail.com • Twitter: @junling_tech