ENEE 759D | ENEE 459D | CMSC 858Z. 5. Machine Learning. Prof. Tudor Dumitraș, Assistant Professor, ECE, University of Maryland, College Park. http://ter.ps/759d https://www.facebook.com/SDSAtUMD
Today’s Lecture • Where we’ve been • Big Data • Statistics • MapReduce • Interpretation of results • Where we’re going today • Machine learning • Where we’re going next • Part 2 of course: Security and InSecurity in the Real World • 2 readings each lecture
Machine Learning • Supervised learning: have inputs and associated outputs • Learn relationships between them using available training data (also called “labeled data” or “ground truth”) • Predict future values • Classification: the output (learned attribute) is categorical • Regression: the output (learned attribute) is numeric • Unsupervised learning: have only inputs • Learn “latent” labels • Clustering: identify natural groups in the data • “Systems that automatically learn programs from data” – P. Domingos, CACM 2012
Rules • Weather and golf • Want to decide when to play • Create rules based on attributes • Example with 1 attribute: if (outlook == “rainy”) then play = “no” else play = “yes” • Errors: 6/14 • Can refine the rule by adding conditions on other attributes • Create a decision tree
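This one-attribute rule takes only a few lines of R. The data frame below is an assumption: it reconstructs the classic 14-day weather toy data from the counts quoted on the next slide (2/5 “yes” for sunny, 4/4 for overcast, 3/5 for rainy).

```r
# Reconstructed toy weather data (an assumption based on the slide's per-outlook counts)
weather <- data.frame(
  outlook = factor(c("sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                     "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy")),
  play    = factor(c("no", "no", "yes", "yes", "yes", "no", "yes",
                     "no", "yes", "yes", "yes", "yes", "yes", "no"))
)

# One-attribute rule: if outlook is rainy, predict "no"; otherwise predict "yes"
predicted <- ifelse(weather$outlook == "rainy", "no", "yes")
sum(predicted != weather$play)  # 6 errors out of 14, matching the slide
```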
Entropy • Which attribute do we choose at each level? • Consider two sequences of coin flips • How much information do we get after flipping each coin once? • We want some function “Information” that is additive for independent events: Information(p1 · p2) = Information(p1) + Information(p2), which is satisfied by Information(p) = −log2 p • Expected Information = “Entropy”: H = −Σ pi log2 pi • Examples • Flipping a fair coin: H = 1 bit • In learning the outcome of the coin flip we learned 1 bit of information • Rolling a fair die: H = log2 6 ≈ 2.58 bits • A die is more unpredictable than a coin • Rolling a weighted die with p1..5 = 0.1, p6 = 0.5: H ≈ 2.16 bits • A weighted die is less unpredictable than a fair die
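A minimal R sketch of the three entropy examples, computing H = −Σ pi log2 pi directly:

```r
# Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability events
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))

entropy(c(0.5, 0.5))          # fair coin:    1 bit
entropy(rep(1/6, 6))          # fair die:     log2(6) ~ 2.58 bits
entropy(c(rep(0.1, 5), 0.5))  # weighted die: ~2.16 bits (less than the fair die)
```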
Decision Tree • Weather and golf • At each level, choose the attribute with the highest information gain • The one that reduces the unpredictability the most • Before: 9/14 “yes” outcomes => H=0.94 • Outlook: H=0.69 • 4/4 “yes” for overcast (H=0) • 3/5 “yes” for rainy (H=0.97) • 2/5 “yes” for sunny (H=0.97) • Temperature: H=0.91 • Humidity: H=0.94 • Windy: H=0.87 • Outlook provides the highest information gain: 0.94 – 0.69 = 0.25
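The gain for Outlook can be checked numerically. This sketch reuses the weather data frame and the entropy() function from the earlier snippets:

```r
# Entropy of a class-label vector
class_entropy <- function(y) entropy(as.vector(table(y)) / length(y))

h_before <- class_entropy(weather$play)  # 9/14 "yes" => ~0.94 bits

# Entropy after splitting on outlook: per-branch entropies weighted by branch size
branches <- split(weather$play, weather$outlook)
h_after  <- sum(sapply(branches, function(y)
  length(y) / nrow(weather) * class_entropy(y)))  # ~0.69 bits

h_before - h_after  # information gain of outlook: ~0.25 bits
```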
Resulting Decision Tree • Putting the decision tree together • Choose the attribute with the highest Information Gain • Create branches for each value of the attribute • Discretize continuous attributes (choose the partition with the highest gain) • R package: rpart • Not a perfect classification (still makes some incorrect decisions)
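A minimal rpart sketch on the toy weather data. The control settings are assumptions needed to force splits on such a small (14-row) data set; rpart's defaults would refuse to split it:

```r
library(rpart)

fit <- rpart(play ~ ., data = weather, method = "class",
             parms = list(split = "information"),           # entropy-based splits
             control = rpart.control(minsplit = 2, cp = 0)) # assumed: allow tiny nodes
print(fit)            # text view of the learned tree
plot(fit); text(fit)  # draw the tree
```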
Overfitting • Low error on training data and high error on test data • “If the knowledge and data we have are not sufficient to completely determine the correct classifier, […] we run the risk of just hallucinating a classifier that […] simply encodes random quirks in the data.” – P. Domingos, CACM’12 • Some algorithms can prune the tree to avoid overfitting • [Figure: examples of underfitting vs. overfitting]
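With rpart, one common way to prune is cost-complexity pruning (a sketch, continuing the fit from the previous snippet): grow the full tree, then cut back to the complexity parameter with the lowest cross-validated error.

```r
printcp(fit)  # table of complexity parameter (cp) vs. cross-validated error

# Prune back to the cp with the lowest cross-validated error ("xerror")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```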
Confusion Matrix • How to determine if the classifier does a good job? • You need a training set (ground truth) and a testing set • Or you can split your ground truth into two data sets • Even better: K-fold cross-validation • Partition the samples into K folds; train on K–1 folds and test on the held-out fold, repeating K times • You can make a mistake in two different ways: false positives (type 1 errors) and false negatives (type 2 errors)
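A base-R sketch of K-fold cross-validation on the toy weather data. K = 5 and the seed are arbitrary choices; with only 14 rows the folds are tiny, so this is purely illustrative of the structure:

```r
set.seed(42)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(weather)))  # assign each row to a fold

for (i in 1:k) {
  train <- weather[folds != i, ]                       # K-1 folds for training
  test  <- weather[folds == i, ]                       # held-out fold for testing
  model <- rpart(play ~ ., data = train, method = "class",
                 control = rpart.control(minsplit = 2, cp = 0))
  pred  <- predict(model, test, type = "class")
  print(table(predicted = pred, actual = test$play))   # per-fold confusion matrix
}
```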
Evaluating Results • Is it better to have low FPs or low FNs? • There is usually a trade-off between FPs and FNs • Reducing type 1 errors causes more type 2 errors, and vice-versa • Sensitivity = TP / (TP+FN) • Ability to identify true positives • Also called the true positive rate • Specificity = TN / (FP+TN) • Ability to correctly identify true negatives • Also called the true negative rate • Can plot a Receiver Operating Characteristic (ROC) curve • R package: ROCR • [Figure: ROC curve, TP rate (sensitivity) vs. FP rate (1 – specificity), evaluating keystroke dynamics; Killourhy & Maxion, DSN’09]
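A ROCR sketch, assuming a trained classifier with probabilistic outputs and a held-out test set (here, the rpart model and test split from the cross-validation snippet above):

```r
library(ROCR)

scores <- predict(model, test)[, "yes"]  # P(play = "yes") from the rpart model
pred   <- prediction(scores, test$play)  # ROCR treats the second level ("yes") as positive
perf   <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf)                               # ROC curve: TP rate vs. FP rate
performance(pred, "auc")@y.values[[1]]   # area under the ROC curve
```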
Unsupervised Learning • Agglomerative hierarchical clustering (R: hclust) • No ground truth; goal is to identify patterns that describe the data • Start from individual points and progressively merge nearby clusters • Distance metric (e.g. Euclidean, rank correlation, Gower) • Linkage: how to aggregate pairwise point distances into cluster distances • Average? Minimum (single)? Maximum (complete)? Variance decrease (Ward)? • Choose classification or clustering features carefully • [Figure: dendrogram of 1970 cars (features: MPG, weight, drive ratio)]
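A minimal hclust sketch. The built-in mtcars data is used as a stand-in for the slide's car data set, since it has the same three features (mpg, weight, drive ratio); the Ward linkage is one choice among the options listed above:

```r
features <- scale(mtcars[, c("mpg", "wt", "drat")])  # standardize the three features
d  <- dist(features, method = "euclidean")           # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")                  # Ward (variance-decrease) linkage
plot(hc, cex = 0.6)                                  # draw the dendrogram
```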
Additional Machine Learning Resources • Classification • We saw: decision trees • Other classifiers: naïve Bayes, Support Vector Machines (SVM) • Natural language processing • Text mining (R package: tm) • Sentiment analysis (annotated English wordlist: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010) • Clustering • We saw: hierarchical clustering • Other clustering techniques: k-means, k-medoids, time series clustering • Dimensionality reduction: principal component analysis (PCA) • Machine learning tools • For R: http://cran.r-project.org/web/views/MachineLearning.html • For Hadoop: Mahout (http://mahout.apache.org/)
Project Peer-Reviews • Pilot project reports • Reports due today • Discuss hypothesis (security problem and data analyzed to solve it) • Feasibility study • Report data volume, velocity, variety and quality • Post report on Piazza • Pilot project peer reviews • Review at least 2 project reports from other students • Use skills learned from paper reviews • Peer reviews are a part of your grade • Post reviews on Piazza (as follow-ups to report posts) by Monday
Review of Lecture • What did we learn? • Classification • Clustering • What’s next? • Paper discussion: ‘Sex, Lies and Cyber-crime Surveys’ • Next lecture: start of part 2 of course – 2 readings / lecture • Deadline reminders • Pilot project reports due today • Pilot project reviews due Monday • Group project proposals due Monday, 09/30