This comprehensive guide by Andrew Loree covers the fundamentals of Machine Learning (ML), types of ML algorithms, and the process of using data to predict the future. Explore how ML is more than just statistics or artificial intelligence (AI) and learn the ethical considerations in ML. Discover the importance of framing questions and understanding the ML process, from training data to model management. Gain insight into different types of ML categories, model types such as regression, classification, and clustering, and essential terminology like features and target values. Begin your journey into ML and start leveraging its capabilities effectively.
Machine Learning Getting Started Andrew Loree
Got a question? Andrew Loree www.andyloree.com andy@andyloree.com @LowOnDiskSpace
Goals Outcome: • What is Machine Learning (ML)? • Understand the ML process • Base knowledge of types of ML & algorithms • Learning path for starting to use ML
What is Machine Learning? • Using data to find patterns and based upon those patterns predict the future • When is a prediction a guess? When it is not based upon “sufficient” observation, experience or scientific reasoning • Example questions: • How long until a production server is out of disk space? • Is this email spam? • Customer retention, product recommendations, marketing campaigns, fraud detection, credit worthiness, …
Is Machine Learning… • …just Statistics? • …just Calculus/Matrix Algebra/Optimization Mathematics? • …just Computer Science/Engineering? • …just applied “domain knowledge”? • Answer: all of the above, and somewhere in between • Philosophical questions are best left to the philosophers
Is Machine Learning… • …just Artificial Intelligence? • …just Deep Learning? Artificial Intelligence (~1950s) ↓ Machine Learning (boom, ~1980s) ↓ Deep Learning (boom, ~2010)
Does the Machine really Learn? • Pattern recognition comes from learning and past experience, and we use it every day • Which of these charges on my credit card were fraudulent? When is there enough data and when do you have too much? Enter ML
Types of Questions (about Data) • Descriptive - how many of X did I sell? • Associative - is there an association between temperature and sales? (hypothesis) • Comparative - how many X sell versus Y? • Predictive - using associations and comparatives to predict sales of X? Machine Learning can answer predictive questions
Framing your questions the ML way In order of importance: 1. Are you asking the right question? - ML is not magic, desired outcomes must be definable - Days until full? Will customer leave? Fraudulent charge? 2. Do you think you have the right data? - Prediction cannot overcome lack of data - Data insight (domain knowledge) is critical to success 3. What result is good enough? - 50% accuracy, 70%, 99%? No false positives allowed? - Wait, what is accuracy?
Machine Learning Ethics • Bias: Confirmation,… • Perspective: Recommendations lead to more purchases (seller) vs. lead to higher debt (buyer) • Moral dilemmas
The Machine (Supervised) Learning Process • Training Data (contains patterns) • One (or more) ML algorithms learn the patterns • A model is generated, used to predict against new data
The Machine Learning Process: Data • Can be multiple sources, BigData stores, flat files, DBMS,… • Rarely in the right format • Do you have the right “features”? * • Preprocessing almost always required – usually the hardest part
The Machine Learning Process: Algorithm • Which algorithm is the “right” one?* • How do you compare one algorithm's results to another's?
The Machine Learning Process: Model • In most cases, the entire process is repeated many times • How stable are our results? • Rinse and repeat the process • Model Management – consuming and operationalizing models is a separate, but very critical topic
Machine Learning: Terminology Training Data/Set – Prepared (training) data ready to use to create a model Three main ML categories: • Supervised learning - The outcome category or value of interest is present (labeled) in the training data • Unsupervised learning - Organizes data in a way that describes its structure (clustering) • Reinforcement learning - Makes a choice, measures how “good” that choice was, and modifies the strategy going forward
Machine Learning Model Types Regression – supervised learning problems, fitting data to a line or curve How long until I run out of disk space?
Machine Learning Model Types Classification – supervised learning problems, capturing data in two (or more) classes Is it spam or ham?
Machine Learning Model Types Clustering – unsupervised learning problems, when we don’t know the defined classes Market research from surveys to generate market segments
Machine Learning: Terminology Feature - individual measurable property – prepared data. A combination of features for an observation is commonly called a “feature vector” Target Value (or Class) – our desired outcome of prediction; with supervised learning, the value is in the training data
Text (SMS) Spam Which of these five messages are spam? SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate. I HAVE A DATE ON SUNDAY WITH WILL!! Fine if that's the way u feel. That's the way its gota b U GOIN OUT 2NITE?
Text Spam: Features What makes these two messages spam? SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate. What if you cannot use the message text itself? What are the “features” that are common to spam messages? • Length of the message? • Number of numeric strings? • Number of web links? • Number of currency symbols? • Number of punctuation marks? • Others?
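The candidate features on this slide can be computed directly from the raw text. The following is a minimal sketch, not the feature set used in the demo: the function name `extract_features`, the regular expressions, and the choice of currency symbols are all assumptions made for illustration.

```python
import re
import string

def extract_features(message: str) -> dict:
    """Turn one SMS message into a small dictionary of numeric features."""
    return {
        # Total number of characters in the message
        "length": len(message),
        # Runs of consecutive digits, e.g. "87575" or "2"
        "numeric_strings": len(re.findall(r"\d+", message)),
        # Very rough link detector: anything starting with http(s):// or www.
        "web_links": len(
            re.findall(r"(?:https?://|www\.)\S+", message, flags=re.IGNORECASE)
        ),
        # Assumed set of currency symbols; extend for other locales
        "currency_symbols": sum(message.count(symbol) for symbol in "£$€"),
        # ASCII punctuation only (string.punctuation does not include "£")
        "punctuation": sum(1 for ch in message if ch in string.punctuation),
    }

spam_example = "U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1!"
ham_example = "U GOIN OUT 2NITE?"

print(extract_features(spam_example))
print(extract_features(ham_example))
```

Each dictionary is one "feature vector" in the terminology above; the SPAM or HAM label would be stored alongside it as the target value.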
Supervised Learning Example: Text Spam • Collection of SMS messages for mobile phone spam research • Contains a “training set” of 5,574 messages, marked either SPAM or HAM • Given just a message, how can we determine if the message is spam or ham? • Who doesn’t have domain knowledge of “spam” and texting? References: UC Irvine ML Repository: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection Contributions to the Study of SMS Spam Filtering: New Collection and Results: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
Text Spam: Demo • Weka Explorer • Load training data set • Try a couple of algorithms using our features with cross-validation • Compare results • Azure ML Studio • Show same solution
ML Algorithms • Way too many to list • Commonly used: • Decision Trees • Random Forest • Support Vector Machines (SVM) • k-Nearest Neighbor - KNN • Linear Regression • Logistic Regression
ML Algorithms: Decision Trees • Supervised learning, classification • Weka implements a particular algorithm named C4.5 (called J48)
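To give a feel for what a decision tree learns, here is a sketch of a single-split tree (a "decision stump") on one numeric feature. This is deliberately much simpler than C4.5/J48: it tries each observed value as a threshold and keeps the one with the fewest training errors. The function names and the message-length data are hypothetical.

```python
def train_stump(values, labels, positive="spam"):
    """Learn the rule 'predict positive when value > threshold'.

    Tries every observed value as the threshold and keeps the one
    that misclassifies the fewest training rows.
    """
    best_threshold, best_errors = None, len(values) + 1
    for threshold in sorted(set(values)):
        errors = sum(
            (value > threshold) != (label == positive)
            for value, label in zip(values, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def predict_stump(threshold, value, positive="spam", negative="ham"):
    """Apply the learned one-split rule to a new observation."""
    return positive if value > threshold else negative

# Hypothetical training data: message length and its label
lengths = [20, 35, 50, 140, 150, 160]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

threshold = train_stump(lengths, labels)
print(threshold, predict_stump(threshold, 145), predict_stump(threshold, 30))
```

A full decision tree repeats this kind of split recursively on the resulting subsets, choosing among many features at each step.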
ML Algorithms: Random Forest • Supervised learning, classification • Multiple decision trees
ML Algorithms: Support Vector Machines • Supervised learning, classification • Separation by “hyperplane”, Weka version named SGD
ML Algorithms: k-nearest Neighbors • Supervised learning, classification or regression • k is the number of nearest neighbors (by distance) used to make the prediction • Choose an odd number to avoid ties • Called IBk in Weka
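A minimal k-nearest-neighbors classifier fits in a few lines. This sketch assumes Euclidean distance and a simple majority vote; it is not how Weka's IBk is implemented, and the feature vectors (message length, count of numeric strings) are made-up illustration data. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k closest training rows.

    `training` is a list of (feature_vector, label) pairs.
    """
    # Sort the training rows by Euclidean distance to the query point
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    # Count the labels of the k nearest rows and return the most common one
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors: (message length, number of numeric strings)
training = [
    ((150, 5), "spam"), ((160, 4), "spam"), ((140, 6), "spam"),
    ((20, 0), "ham"), ((35, 1), "ham"), ((50, 0), "ham"),
]

print(knn_predict(training, (145, 4), k=3))
print(knn_predict(training, (30, 0), k=3))
```

Because distance is computed on raw values, a feature with a large range (such as length) dominates the result; scaling features first is a common remedy.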
ML Algorithms: Linear Regression • Supervised learning, regression • Continuous values
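The earlier disk-space question ("How long until I run out of disk space?") is a natural fit for a straight-line model. Below is a sketch of ordinary least squares for one feature, written out by hand; the daily usage readings and the 100 GB capacity are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: day number and disk space used (GB)
days = [0, 1, 2, 3, 4]
used_gb = [50.0, 52.0, 54.0, 56.0, 58.0]
capacity_gb = 100.0

slope, intercept = fit_line(days, used_gb)
# Solve capacity = slope * day + intercept for day
day_full = (capacity_gb - intercept) / slope
print(f"Growth: {slope:.1f} GB/day, projected full on day {day_full:.0f}")
```

Real usage data is rarely this clean, so the projected date is only as good as the assumption that growth stays linear.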
ML Algorithms: Logistic Regression • Supervised learning, discrete (binary) values – (yes or no, A or B) • S-curve to fit against data
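The "S-curve" is the logistic (sigmoid) function, which squashes any number into the range 0 to 1 so it can be read as a probability. The sketch below only shows prediction with an already-fitted model; the weights and bias are hypothetical values chosen by hand, not learned from the SMS data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_is_spam(features, weights, bias, threshold=0.5):
    """Turn the probability into a yes/no answer using a cut-off."""
    return predict_probability(features, weights, bias) >= threshold

# Hypothetical model: features are (message length, number of numeric strings)
weights = [0.02, 0.8]
bias = -4.0

print(predict_probability((150, 4), weights, bias))
print(predict_is_spam((150, 4), weights, bias))
print(predict_is_spam((20, 0), weights, bias))
```

Raising the threshold above 0.5 trades missed spam for fewer false positives, which ties back to the "what result is good enough?" question.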
ML Algorithms: Cheat Sheet https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-cheat-sheet
ML Algorithms: Considerations • Not all algorithms are the same • Accuracy • Other practical measures: • Training time • Memory requirements • Scalability https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice
ML Algorithms: Testing • Different ways to “slice and dice” your training set data • Entire set • Percentage of set • Cross-validation - divide the set into subsets – generally best option
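Cross-validation can be sketched as a generator of index splits: each subset ("fold") takes one turn as the test set while the rest is used for training. This is an illustrative version without shuffling or stratification, which toolkits usually offer; the name `k_fold_splits` is an assumption.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)
```

Every row is used for testing exactly once, so the averaged score reflects the whole training set rather than one lucky (or unlucky) split.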
ML Algorithms: Evaluating Results • Confusion Matrix • Accuracy – share of all predictions that are correct (% of overall) • Precision – share of predicted members of a class that truly belong to it; more important for non-binary classifications • Lots of others, some specific to the problem type (recall, F-measure,…)
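For a two-class problem such as spam versus ham, the confusion matrix has four cells, and the usual measures are simple ratios of them. The sketch below uses the standard definitions; the actual and predicted labels are invented for illustration.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (true_pos, false_pos, false_neg, true_neg) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted spam, actually ham
        elif a == positive:
            fn += 1  # predicted ham, actually spam
        else:
            tn += 1
    return tp, fp, fn, tn

def evaluate(actual, predicted, positive="spam"):
    """Compute accuracy, precision, recall and F-measure from the counts."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }

# Hypothetical results for eight messages
actual = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

print(confusion_counts(actual, predicted))
print(evaluate(actual, predicted))
```

Accuracy alone can mislead when classes are unbalanced: labeling every message as ham would still score well on a data set that is mostly ham.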
ML Algorithms: Pitfalls • Underfitting – when close enough isn’t close enough
ML Algorithms: Pitfalls • Overfitting – memorization
ML Algorithms: Pitfalls • Data Leakage – do NOT use your prediction value as input to the model • Sampling Bias – poor choices for training set data, e.g. predicting item sales for an entire store chain from a single store’s data • Predicting Random Outcomes – fair and unfair coin flips, dependent outcomes
ML Algorithms: Text Processing • Think of “search” on top of machine learning • The common problems of classical linguistics also challenge machine learning to an extent: • Tokenization – word breaking • Stemming (and lemmatization) – walk, walking, walked, walks → walk • Domain specific dictionaries – company jargon, acronyms, emojis,… • Language used – not everyone writes the Queen’s English • Semantic search – understand “meaning” – may be a better option to generate processing features
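Tokenization and stemming can be illustrated with a few lines of code. This is a deliberately crude sketch: the tokenizer only handles lowercase ASCII letters, digits and apostrophes, and the suffix-stripping "stemmer" is a toy stand-in for real algorithms such as Porter stemming, which handle many more cases.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def naive_stem(word):
    """Strip a few common English suffixes; keeps at least 3 characters."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = tokenize("Walk, walking, walked, walks")
stems = [naive_stem(token) for token in tokens]
print(tokens)
print(stems)
```

Counting stems rather than raw tokens lets "walking" and "walked" contribute to the same feature, at the cost of occasional wrong merges.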
ML Toolkits, Platforms & Libraries • Toolkit/Platforms • WEKA • R • Parts of Python SciPy • Microsoft Cognitive Toolkit (CNTK) • Libraries • Scikit-learn (Python) • JSAT • Accord.NET Framework • APIs • Azure ML • MLlib • PredictionIO • Operationalize • SQL Server Machine Learning Services/Machine Learning Server