Machine Learning

Machine Learning Getting Started Andrew Loree

Got a question? Andrew Loree www.andyloree.com andy@andyloree.com @LowOnDiskSpace

Goals Outcome: • What is Machine Learning (ML)? • Understand the ML process • Base knowledge of types of ML & algorithms • Learning path for starting to use ML

What is Machine Learning? • Using data to find patterns and based upon those patterns predict the future • When is a prediction a guess? When it is not based upon “sufficient” observation, experience or scientific reasoning • Example questions: • How long until a production server is out of disk space? • Is this email spam? • Customer retention, product recommendations, marketing campaigns, fraud detection, credit worthiness, …

Is Machine Learning… • …just Statistics? • …just Calculus/Matrix Algebra/Optimization Mathematics? • …just Computer Science/Engineering? • …just applied “domain knowledge”? • The answer is all of the above and somewhere in between • Philosophical questions are best left to the philosophers

Is Machine Learning… • …just Artificial Intelligence? • …just Deep Learning? Timeline: Artificial Intelligence ~1950s ↓ Machine Learning (boom) ~1980s ↓ Deep Learning (boom) ~2010

Does the Machine really Learn? • Pattern recognition comes from learning and past experience, and we use it every day • Which of these charges on my credit card were fraudulent? • When is there enough data, and when do you have too much? Enter ML

Types of Questions (about Data) • Descriptive - how many of X did I sell? • Associative - is there an association between temperature and sales? (hypothesis) • Comparative - how many of X sold versus Y? • Predictive - using associations and comparisons to predict sales of X? Machine Learning can answer predictive questions

Framing your questions the ML way In order of importance: 1. Are you asking the right question? - ML is not magic, desired outcomes must be definable - Days until full? Will the customer leave? Fraudulent charge? 2. Do you think you have the right data? - Prediction cannot overcome a lack of data - Data insight (domain knowledge) is critical to success 3. What result is good enough? - 50% accuracy, 70%, 99%? No false positives allowed? - Wait, what is accuracy?

Machine Learning Ethics • Bias: Confirmation,… • Perspective: Recommendations lead to more purchases (seller) vs. lead to higher debt (buyer) • Moral dilemmas

The Machine (Supervised) Learning Process • Training Data (contains patterns) • One (or more) ML algorithms learn the patterns • A model is generated, used to predict against new data
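
The three steps above (training data, an algorithm that learns a pattern, a model used on new data) can be sketched in a few lines. This is only an illustration: the message lengths and labels are made up, and the "algorithm" is a one-rule decision stump chosen for brevity, not the method used in the demo. The names `train_stump` and `predict` are hypothetical.

```python
# Toy training data: (message_length, label). Values are invented for illustration.
training = [
    (25, "ham"), (31, "ham"), (48, "ham"),
    (120, "spam"), (143, "spam"), (155, "spam"),
]

def train_stump(rows):
    """Algorithm: pick the length threshold that misclassifies the fewest rows."""
    best_threshold, best_errors = None, len(rows) + 1
    for threshold, _ in rows:
        errors = sum(
            1
            for length, label in rows
            if ("spam" if length >= threshold else "ham") != label
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold  # the "model" is just this one number

def predict(model, length):
    """Apply the learned threshold to new, unseen data."""
    return "spam" if length >= model else "ham"

model = train_stump(training)
print(model, predict(model, 30), predict(model, 150))
```

Real algorithms learn far richer patterns, but the shape is the same: fit on labeled data, then predict on data the model has not seen.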

The Machine Learning Process: Data • Can come from multiple sources: BigData stores, flat files, DBMS,… • Rarely in the right format • Do you have the right “features”? * • Preprocessing is almost always required – usually the hardest part

The Machine Learning Process: Algorithm • Which algorithm is the “right” one?* • How do you compare one algorithm’s results to another’s?

The Machine Learning Process: Model • In most cases, the entire process is repeated many times • How stable are our results? • Rinse and repeat the process • Model Management – consuming and operationalizing models is a separate, but very critical topic

Machine Learning: Terminology Training Data/Set – prepared data ready to be used to create a model Three main ML categories: • Supervised learning - The outcome or value of interest is labeled in the training data • Unsupervised learning - Organizes data in a way that describes its structure (clustering) • Reinforcement learning - Makes a choice, measures how “good” that choice was, and modifies the strategy going forward

Machine Learning Model Types Regression – supervised learning problems, fitting data to a line or curve How long until I run out of disk space?
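
As a rough sketch of the disk-space question, the code below fits a straight line to daily usage with ordinary least squares and extrapolates to the day the disk fills. The usage numbers, the 500 GB capacity, and the function name `fit_line` are assumptions made for the example; real usage is rarely this linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: day number and GB used on that day.
days = [0, 1, 2, 3, 4]
used_gb = [400, 410, 420, 430, 440]
capacity_gb = 500  # assumed disk size

slope, intercept = fit_line(days, used_gb)
day_full = (capacity_gb - intercept) / slope   # day on which the line hits capacity
days_remaining = day_full - days[-1]           # counted from the last observation
print(slope, intercept, days_remaining)
```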

Machine Learning Model Types Classification – supervised learning problems, capturing data in two (or more) classes Is it spam or ham?

Machine Learning Model Types Clustering – unsupervised learning problems, when we don’t know the defined classes Market research from surveys to generate market segments
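
One common clustering technique is k-means, shown here as a minimal one-dimensional sketch. The slides do not name a specific clustering algorithm, so k-means is an assumption, as are the survey spend values and the deterministic choice of starting centers.

```python
def kmeans_1d(values, k=2, iterations=10):
    """Minimal 1-D k-means. Assumes k >= 2 and spreads the starting
    centers evenly between the smallest and largest value."""
    lo, hi = min(values), max(values)
    centers = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend reported by six survey respondents.
spend = [10, 12, 14, 80, 85, 90]
centers, segments = kmeans_1d(spend, k=2)
print(centers, segments)
```

No labels are supplied; the two "market segments" emerge from the data alone.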

Machine Learning: Terminology Feature - individual measurable property – prepared data. The combination of features for an observation is commonly called a “feature vector” Target Value (or Class) – our desired outcome of prediction; with supervised learning, the value is in the training data

Text (SMS) Spam Which of these five messages are spam? SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate. I HAVE A DATE ON SUNDAY WITH WILL!! Fine if that's the way u feel. That's the way its gota b U GOIN OUT 2NITE?

Text Spam: Features What makes these two messages spam? SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate. What if you cannot use the message text itself? What are the “features” that are common to spam messages? • Length of the message? • Number of numeric strings? • Number of web links? • Number of currency symbols? • Number of punctuation marks? • Others?
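
The candidate features listed above can be computed with a few lines of standard-library Python. This is a sketch, not the feature set used in the demo: the function name `extract_features`, the regular expressions, and the set of currency symbols are all assumptions.

```python
import re
import string

def extract_features(message):
    """Turn one SMS message into a small dictionary of numeric features."""
    return {
        "length": len(message),
        # Runs of digits, e.g. "087187272008" or "750".
        "numeric_strings": len(re.findall(r"\d+", message)),
        # Very rough link detector: http(s):// or www. prefixes only.
        "links": len(re.findall(r"(?:https?://|www\.)\S+", message, flags=re.IGNORECASE)),
        # Assumed symbol set; extend as needed for the data at hand.
        "currency_symbols": sum(message.count(symbol) for symbol in "£$€"),
        # ASCII punctuation only (string.punctuation).
        "punctuation": sum(1 for ch in message if ch in string.punctuation),
    }

spam_example = (
    "This is the 2nd time we have tried 2 contact u. U have won the £750 "
    "Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute."
)
ham_example = "U GOIN OUT 2NITE?"

print(extract_features(spam_example))
print(extract_features(ham_example))
```

Each message becomes a feature vector that an algorithm can use without ever seeing the raw text.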

Supervised Learning Example: Text Spam • Collection of SMS messages for mobile phone spam research • Contains a “training set” of 5,574 messages, marked either SPAM or HAM • Given just a message, how can we determine if the message is spam or ham? • Who doesn’t have domain knowledge of “spam” and texting? References: UC Irvine ML Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection Contributions to the Study of SMS Spam Filtering: New Collection and Results: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

Text Spam: Demo • Weka Explorer • Load training data set • Try a couple of algorithms using our features with cross-validation • Compare results • Azure ML Studio • Show the same solution

ML Algorithms • Way too many to list • Commonly used: • Decision Trees • Random Forest • Support Vector Machines (SVM) • k-Nearest Neighbor - KNN • Linear Regression • Logistic Regression

ML Algorithms: Decision Trees • Supervised learning, classification • Weka implements a particular algorithm named C4.5 (called J48)

ML Algorithms: Random Forest • Supervised learning, classification • Multiple decision trees

ML Algorithms: Support Vector Machines • Supervised learning, classification • Separation by “hyperplane” • In Weka, SGD (with its default hinge loss) trains a linear SVM; SMO is Weka’s other SVM implementation

ML Algorithms: k-nearest Neighbors • Supervised learning, classification or regression • k is the number of nearest neighbors (by distance) that vote on the prediction • Choose an odd k to avoid ties in two-class problems • Called IBk in Weka
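
A minimal k-nearest-neighbors classifier, assuming Euclidean distance and a simple majority vote (Weka's IBk offers more options than this). The feature vectors, here message length and number of digits, are invented for the example, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    neighbors = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: (message length, number of digits).
training = [
    ((20, 0), "ham"), ((35, 1), "ham"), ((28, 0), "ham"),
    ((140, 12), "spam"), ((155, 9), "spam"), ((130, 15), "spam"),
]

print(knn_predict(training, (150, 10), k=3))
print(knn_predict(training, (30, 0), k=3))
```

With two classes and k = 3, a tie is impossible, which is the reason for choosing an odd k.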

ML Algorithms: Linear Regression • Supervised learning, regression • Continuous values

ML Algorithms: Logistic Regression • Supervised learning, discrete (binary) values – (yes or no, A or B) • S-curve to fit against data
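
The S-curve is the logistic (sigmoid) function, which squeezes any number into a probability between 0 and 1. The sketch below only evaluates the curve; the weight and bias are hand-picked assumptions, not values fitted to data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """Probability of the positive class for a single feature x."""
    return sigmoid(weight * x + bias)

# Hypothetical, hand-picked parameters: probability crosses 0.5 at length 80.
weight, bias = 0.1, -8.0

for length in (20, 80, 150):
    p = predict_probability(length, weight, bias)
    label = "spam" if p >= 0.5 else "ham"
    print(length, round(p, 3), label)
```

Training a logistic regression model means finding the weight and bias that make these probabilities match the labeled data as closely as possible.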

ML Algorithms: Cheat Sheet https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet

ML Algorithms: Considerations • Not all algorithms are the same • Accuracy • Other practical measures: • Training time • Memory requirements • Scalability https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

ML Algorithms: Testing • Different ways to “slice and dice” your training set data • Entire set • Percentage of set • Cross-validation - divide the set into subsets – generally best option
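
Cross-validation can be sketched as follows: split the data into k folds, then let each fold take one turn as the test set while the rest is used for training. The function name `k_fold_splits` is hypothetical, and this version does not shuffle; in practice the rows are usually shuffled (or stratified by class) first.

```python
def k_fold_splits(rows, k):
    """Yield (train, test) pairs; every row lands in exactly one test fold."""
    folds = [rows[i::k] for i in range(k)]  # deal rows out round-robin
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test

rows = list(range(10))  # stand-in for ten labeled messages
splits = list(k_fold_splits(rows, 5))
for train, test in splits:
    print(len(train), len(test), test)
```

Averaging the score over all k test folds gives a steadier estimate than a single train/test split.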

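ML Algorithms: Evaluating Results • Confusion Matrix – counts of correct and incorrect predictions for each class • Accuracy – share of all predictions that are correct • Precision – share of items predicted as a class that truly belong to it; worth checking per class in non-binary classification • Lots of others, some specific to problem type (recall, F-measure,…)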
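
For a two-class problem, the metrics named above all come from the four cells of the confusion matrix. The sketch below uses the standard definitions; the labels and predictions are made up, and treating "spam" as the positive class is an assumption.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted spam, was spam
        elif p == positive:
            fp += 1          # predicted spam, was ham
        elif a == positive:
            fn += 1          # predicted ham, was spam
        else:
            tn += 1          # predicted ham, was ham
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels and predictions for eight messages.
actual    = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = confusion_counts(actual, predicted)
results = metrics(*counts)
print(counts, results)
```

Here accuracy is 0.75, while precision and recall are both 2/3, a reminder that accuracy alone can flatter a model when one class dominates.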

ML Algorithms: Pitfalls • Underfitting – when close enough isn’t close enough

ML Algorithms: Pitfalls • Overfitting – memorization

ML Algorithms: Pitfalls • Data Leakage – do NOT use your prediction value as input to the model • Sampling Bias – poor choices for training set data, e.g. predicting item sales for an entire store chain from a single store’s data • Predict Random Outcomes – fair and unfair coin flips, dependent outcomes

ML Algorithms: Text Processing • Think of “search” on top of machine learning • The classic problems of linguistics also challenge machine learning to an extent: • Tokenization – word breaking • Stemming (and lemmatization) – walk, walking, walked, walks → walk • Domain-specific dictionaries – company jargon, acronyms, emojis,… • Language used – not everyone writes the Queen’s English • Semantic search – understanding “meaning” – may be a better option for generating features
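
Tokenization and stemming can be illustrated with a deliberately crude sketch. The suffix list and the names `tokenize` and `crude_stem` are assumptions for this example; real stemmers (Porter, for instance) and lemmatizers handle far more cases than stripping three suffixes.

```python
import re

def tokenize(text):
    """Lowercase the text and break it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Tiny, hand-picked suffix list; checked in this order.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = tokenize("Walk, walking, walked, walks")
stems = [crude_stem(token) for token in tokens]
print(tokens)
print(stems)
```

After stemming, all four forms count as the same word, which shrinks the vocabulary an algorithm has to learn from.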

ML Toolkits, Platforms & Libraries • Toolkits/Platforms • WEKA • R • Parts of Python SciPy • Microsoft Cognitive Toolkit (CNTK) • Libraries • scikit-learn (Python) • JSAT • Accord.NET Framework • APIs • Azure ML • MLlib • PredictionIO • Operationalize • SQL Server Machine Learning Services/Machine Learning Server
