This course provides an introduction to machine learning, covering various algorithms and techniques. Topics include symbolic and connectionist models, support vector machines, and reinforcement learning.
CS760 – Machine Learning • Course Instructor: David Page • email: dpage@cs.wisc.edu • office: MSC 6743 (University & Charter) • hours: TBA • Teaching Assistant: Daniel Wong • email: wong@cs.wisc.edu • office: TBA • hours: TBA CS 760 – Machine Learning (UW-Madison)
Textbooks & Reading Assignment • Machine Learning (Tom Mitchell) • Selected on-line readings • Read in Mitchell (posted on class web page) • Preface • Chapter 1 • Sections 2.1 and 2.2 • Chapter 8 CS 760 – Machine Learning (UW-Madison)
Monday, Wednesday, and Friday? • We’ll meet 30 times this term (this count may or may not include the exam) • We’ll meet on FRIDAY this and next week, in order to cover material for HW 1 (plus I have some business travel this term) • Default: we WILL meet on Friday unless I announce otherwise CS 760 – Machine Learning (UW-Madison)
Course "Style" • Primarily algorithmic & experimental • Some theory, both mathematical & conceptual (much on statistics) • "Hands on" experience, interactive lectures/discussions • Broad survey of many ML subfields, including • "symbolic" (rules, decision trees, ILP) • "connectionist" (neural nets) • support vector machines, nearest-neighbors • theoretical ("COLT") • statistical ("Bayes rule") • reinforcement learning, genetic algorithms CS 760 – Machine Learning (UW-Madison)
"MS vs. PhD" Aspects • MS'ish topics • mature, ready for practical application • first 2/3 – ¾ of semester • Naive Bayes, Nearest-Neighbors, Decision Trees, Neural Nets, Suport Vector Machines, ensembles, experimental methodology (10-fold cross validation, t-tests) • PhD'ish topics • inductive logic programming, statistical relational learning, reinforcement learning, SVMs, use of prior knowledge • Other machine learning material covered in Bioinformatics CS 576/776, Jerry Zhu’s CS 838 CS 760 – Machine Learning (UW-Madison)
Two Major Goals • to understand what a learning system should do • to understand how (and how well) existing systems work • Issues in algorithm design • Choosing algorithms for applications CS 760 – Machine Learning (UW-Madison)
Background Assumed • Languages • Java (see CS 368 tutorial online) • AI Topics • Search • FOPC • Unification • Formal Deduction • Math • Calculus (partial derivatives) • Simple prob & stats • No previous ML experience assumed(so some overlap with CS 540) CS 760 – Machine Learning (UW-Madison)
Requirements • Bi-weekly programming HW's • "hands on" experience valuable • HW0 – build a dataset • HW1 – simple ML algo's and exper. methodology • HW2 – decision trees (?) • HW3 – neural nets (?) • HW4 – reinforcement learning (in a simulated world) • "Midterm" exam (in class, about 90% through semester) • Find project of your choosing • during last 4-5 weeks of class CS 760 – Machine Learning (UW-Madison)
Grading • HW's 35% • "Midterm" 40% • Project 20% • Quality Discussion 5% CS 760 – Machine Learning (UW-Madison)
Late HW's Policy • HW's due @ 4pm • you have 5 late days to use over the semester • (Fri 4pm → Mon 4pm is 1 late "day") • SAVE UP late days! • extensions only for extreme cases • Penalty points after late days exhausted • Can't be more than ONE WEEK late CS 760 – Machine Learning (UW-Madison)
Academic Misconduct (also on course homepage) All examinations, programming assignments, and written homeworks must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (see the Academic Misconduct Guide for Students). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss with your peers, the TAs or the instructor ideas, approaches and techniques broadly, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act. CS 760 – Machine Learning (UW-Madison)
What Do You Think Learning Means? CS 760 – Machine Learning (UW-Madison)
What is Learning? “Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time.” - Herbert Simon “Learning is making useful changes in our minds.” - Marvin Minsky CS 760 – Machine Learning (UW-Madison)
Today’s Topics • Memorization as Learning • Feature Space • Supervised ML • K-NN (K-Nearest Neighbor) CS 760 – Machine Learning (UW-Madison)
Memorization (Rote Learning) • Employed by first machine learning systems, in 1950s • Samuel’s Checkers program • Michie’s MENACE: Matchbox Educable Noughts and Crosses Engine • Prior to these, some people believed computers could not improve at a task with experience CS 760 – Machine Learning (UW-Madison)
Rote Learning is Limited • Memorize I/O pairs and perform exact matching with new inputs • If computer has not seen precise case before, it cannot apply its experience • Want computer to “generalize” from prior experience CS 760 – Machine Learning (UW-Madison)
Some Settings in Which Learning May Help • Given an input, what is appropriate response (output/action)? • Game playing – board state/move • Autonomous robots (e.g., driving a vehicle) -- world state/action • Video game characters – state/action • Medical decision support – symptoms/ treatment • Scientific discovery – data/hypothesis • Data mining – database/regularity CS 760 – Machine Learning (UW-Madison)
Not in Mitchell’s textbook (covered in CS 776) Broad Paradigms of Machine Learning • Inducing Functions from I/O Pairs • Decision trees (e.g., Quinlan’s C4.5 [1993]) • Connectionism / neural networks (e.g., backprop) • Nearest-neighbor methods • Genetic algorithms • SVM’s • Learning without Feedback/Teacher • Conceptual clustering • Self-organizing systems • Discovery systems CS 760 – Machine Learning (UW-Madison)
IID (Completion of Lec #2) • We are assuming examples are IID: independent and identically distributed • E.g., we are ignoring temporal dependencies (covered in time-series learning) • E.g., we assume the learner has no say in which examples it gets (covered in active learning) CS 760 – Machine Learning (UW-Madison)
Supervised Learning Task Overview • Real World → Feature Space: feature selection (usually done by humans) – HW 0 • Feature Space → Concepts/Classes/Decisions: classification rule construction (done by learning algorithm) – HW 1-3 CS 760 – Machine Learning (UW-Madison)
Supervised Learning Task Overview (cont.) • Note: mappings on previous slide are not necessarily 1-to-1 • Bad for first mapping? • Good for the second (in fact, it’s the goal!) CS 760 – Machine Learning (UW-Madison)
Empirical Learning: Task Definition • Given • A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of the negative examples (i.e., non-members) • Produce • A description that covers (includes) all/most of the positive examples and none/few of the negative examples (and, hopefully, properly categorizes most future examples!) Note: one can easily extend this definition to handle more than two classes The Key Point! CS 760 – Machine Learning (UW-Madison)
Example (Figure: a set of positive example figures, a set of negative example figures, and one unlabeled symbol – how does it classify?) • Concept • Solid Red Circle in a (Regular?) Polygon • What about? • Figures on left side of page • Figures drawn before 5pm 2/2/89 <etc> CS 760 – Machine Learning (UW-Madison)
Concept Learning Learning systems differ in how they represent concepts. From the same Training Examples: • Backpropagation → Neural Net • C4.5, CART → Decision Tree • AQ, FOIL → Rules, e.g. Φ ← X ∧ Y, Φ ← Z • SVMs → e.g. If 5x1 + 9x2 – 3x3 > 12 Then + CS 760 – Machine Learning (UW-Madison)
Feature Space If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space. (Figure: a 3-D feature space with axes Size, Color, and Weight; an unlabeled example such as <Big, Gray, 2500> is a single point in this space.) A "concept" is then a (possibly disjoint) volume in this space. CS 760 – Machine Learning (UW-Madison)
Learning from Labeled Examples • Most common and successful form of ML (Venn diagram: a feature space containing points labeled + and -) • Examples – points in a multi-dimensional "feature space" • Concepts – "function" that labels every point in feature space • (as +, -, and possibly ?) CS 760 – Machine Learning (UW-Madison)
Brief Review (Instances) • Conjunctive Concept ("and") • Color(?obj1, red) ∧ Size(?obj1, large) • Disjunctive Concept ("or") • Color(?obj2, blue) ∨ Size(?obj2, small) • More formally, a "concept" is of the form ∀x ∀y ∀z F(x, y, z) → Member(x, Class1) CS 760 – Machine Learning (UW-Madison)
Empirical Learning and Venn Diagrams (Venn diagram: a feature space full of + and - points; the + points fall inside two regions labeled A and B, the - points fall outside them) • Concept = A or B (disjunctive concept) • Examples = labeled points in feature space • Concept = a label for a set of points CS 760 – Machine Learning (UW-Madison)
Aspects of an ML System • "Language" for representing classified examples (HW 0) • "Language" for representing "Concepts" • Technique for producing concept "consistent" with the training examples • Technique for classifying new instance (other HW's) Each of these limits the expressiveness/efficiency of the supervised learning algorithm. CS 760 – Machine Learning (UW-Madison)
Nearest-Neighbor Algorithms (aka exemplar models, instance-based learning (IBL), case-based learning) • Learning ≈ memorize training examples • Problem solving = find most similar example in memory; output its category (Venn diagram: stored + and - examples partition the feature space into regions, each labeled by its nearest stored example – "Voronoi Diagrams", pg 233; a query point ? gets the label of its region) CS 760 – Machine Learning (UW-Madison)
Simple Example: 1-NN (1-NN ≡ one nearest neighbor) Training Set • Ex 1: a=0, b=0, c=1 + • Ex 2: a=0, b=0, c=0 - • Ex 3: a=1, b=1, c=1 - Test Example • a=0, b=1, c=0 ? • "Hamming Distance" • Ex 1 = 2 • Ex 2 = 1 • Ex 3 = 2 So output - CS 760 – Machine Learning (UW-Madison)
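A minimal sketch of the 1-NN computation above, written in Java (the course's assumed language), using Hamming distance over the three Boolean features; the class and variable names are illustrative, not course-provided code.

public class OneNearestNeighbor {
    // Training examples over features (a, b, c) with their class labels.
    static int[][] trainX = { {0, 0, 1}, {0, 0, 0}, {1, 1, 1} };
    static String[] trainY = { "+", "-", "-" };

    // Hamming distance: number of features on which two examples disagree.
    static int hamming(int[] x, int[] y) {
        int d = 0;
        for (int i = 0; i < x.length; i++)
            if (x[i] != y[i]) d++;
        return d;
    }

    // 1-NN: output the label of the single closest training example.
    static String classify(int[] query) {
        int best = 0;
        for (int i = 1; i < trainX.length; i++)
            if (hamming(trainX[i], query) < hamming(trainX[best], query)) best = i;
        return trainY[best];
    }

    public static void main(String[] args) {
        int[] test = {0, 1, 0};              // a=0, b=1, c=0
        System.out.println(classify(test));  // prints "-": Ex 2 is closest (distance 1)
    }
}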
Sample Experimental Results (see UCI archive for more) Simple algorithm works quite well! CS 760 – Machine Learning (UW-Madison)
K-NN Algorithm Collect K nearest neighbors, select majority classification (or somehow combine their classes) • What should K be? • It probably is problem dependent • Can use tuning sets (later) to select a good setting for K (Figure: plot of tuning-set error rate versus K = 1, 2, 3, 4, 5 – shouldn't really "connect the dots"; why?) CS 760 – Machine Learning (UW-Madison)
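A hedged Java sketch of the above: classify by majority vote among the K nearest training examples, and choose K by minimizing error on a separate tuning set. The data layout, helper names, and the odd-K tie-breaking choice are illustrative assumptions, not course-supplied code.

import java.util.Arrays;
import java.util.Comparator;

public class KNearestNeighbors {
    // k-NN over real-valued feature vectors with +1 / -1 class labels.
    static int knnClassify(double[][] trainX, int[] trainY, double[] query, int k) {
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by Euclidean distance to the query.
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(trainX[i], query)));
        int vote = 0;
        for (int j = 0; j < k; j++) vote += trainY[idx[j]];   // sum of +1/-1 labels
        return vote >= 0 ? +1 : -1;                           // majority class (ties -> +1)
    }

    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }

    // Choose K by minimizing error on a held-out tuning set (never the training set).
    static int chooseK(double[][] trainX, int[] trainY,
                       double[][] tuneX, int[] tuneY, int maxK) {
        int bestK = 1, bestErrors = Integer.MAX_VALUE;
        for (int k = 1; k <= maxK; k += 2) {          // odd K avoids many ties
            int errors = 0;
            for (int i = 0; i < tuneX.length; i++)
                if (knnClassify(trainX, trainY, tuneX[i], k) != tuneY[i]) errors++;
            if (errors < bestErrors) { bestErrors = errors; bestK = k; }
        }
        return bestK;
    }
}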
Data Representation • Creating a dataset of fixed-length feature vectors • Be sure to include – on separate 8x11 sheet – a photo and a brief bio • HW0 out on-line • Due next Friday CS 760 – Machine Learning (UW-Madison)
HW0 – Create Your Own Dataset (repeated from lecture #1) • Think about before next class • Read HW0 (on-line) • Google to find: • UCI archive (or UCI KDD archive) • UCI ML archive (UCI ML repository) • More links in HW0’s web page CS 760 – Machine Learning (UW-Madison)
HW0 – Your "Personal Concept" • Step 1: Choose a Boolean (true/false) concept • Subjective judgment (can’t articulate) • Books I like/dislike • Movies I like/dislike • www pages I like/dislike • "time will tell" concepts • Stocks to buy • Medical treatment • at time t, predict outcome at time (t + ∆t) • Sensory interpretation • Face recognition (see textbook) • Handwritten digit recognition • Sound recognition • Hard-to-Program Functions CS 760 – Machine Learning (UW-Madison)
Some Real-World Examples • Car Steering (Pomerleau, Thrun): digitized camera image → learned function → steering angle • Medical Diagnosis (Quinlan): medical record (e.g., age=13, sex=M, wgt=18) → learned function → sick vs. healthy • DNA Categorization • TV-pilot rating • Chemical-plant control • Backgammon playing CS 760 – Machine Learning (UW-Madison)
HW0 – Your "Personal Concept" • Step 2: Choosing a feature space • We will use fixed-length feature vectors • Choose N features (this defines a space) • Each feature has Vi possible values • Each example is represented by a vector of N feature values (i.e., is a point in the feature space), e.g.: <red, 50, round> for the features (color, weight, shape) • Feature Types (in HW0 we will use a subset – see next slide) • Boolean • Nominal • Ordered • Hierarchical • Step 3: Collect examples ("I/O" pairs) CS 760 – Machine Learning (UW-Madison)
Standard Feature Types for representing training examples – a source of "domain knowledge" • Nominal • No relationship among possible values e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz) • Linear (or Ordered) • Possible values of the feature are totally ordered e.g., size ∈ {small, medium, large} ← discrete; weight ∈ [0…500] ← continuous • Hierarchical • Possible values are partially ordered in an ISA hierarchy e.g., for shape: closed → {polygon, continuous}; polygon → {square, triangle}; continuous → {circle, ellipse} CS 760 – Machine Learning (UW-Madison)
Our Feature Types(for CS 760 HW’s) • Discrete • tokens (char strings, w/o quote marks and spaces) • Continuous • numbers (int’s or float’s) • If only a few possible values (e.g., 0 & 1) use discrete • i.e., merge nominal and discrete-ordered (or convert discrete-ordered into 1,2,…) • We will ignore hierarchical info and only use the leaf values (common approach) CS 760 – Machine Learning (UW-Madison)
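As an illustration of this two-type scheme (not the HW skeleton), here is a minimal Java sketch in which each feature is declared as either a discrete token or a continuous number, and each example is a fixed-length array of values plus a class label; all class and field names are hypothetical (Java 16+ record syntax).

public class FeatureVectorExample {
    enum FeatureType { DISCRETE, CONTINUOUS }

    // A feature is a name, a type, and (for discrete features) its legal token values.
    record Feature(String name, FeatureType type, String[] legalValues) {}

    // A labeled example: one value per feature (stored as a string), plus a class label.
    record Example(String[] values, String label) {}

    public static void main(String[] args) {
        Feature[] features = {
            new Feature("color",  FeatureType.DISCRETE,   new String[] {"red", "blue", "green"}),
            new Feature("weight", FeatureType.CONTINUOUS, null),
            new Feature("shape",  FeatureType.DISCRETE,   new String[] {"square", "circle"})
        };
        // An example in the style of <red, 50, round> from the earlier slide, with a label.
        Example ex = new Example(new String[] {"red", "50", "circle"}, "positive");

        // Continuous values get parsed to numbers; discrete values stay as tokens.
        for (int i = 0; i < features.length; i++) {
            if (features[i].type() == FeatureType.CONTINUOUS)
                System.out.println(features[i].name() + " = " + Double.parseDouble(ex.values()[i]));
            else
                System.out.println(features[i].name() + " = " + ex.values()[i]);
        }
        System.out.println("label = " + ex.label());
    }
}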
Example Hierarchy (KDD* Journal, Vol 5, No. 1-2, 2001, page 17) (Figure: a product hierarchy – Product → 99 Product Classes (e.g., Pct Foods, Tea) → 2302 Product Subclasses (e.g., Dried Cat Food, Canned Cat Food) → ~30k Products (e.g., Friskies Liver, 250g)) • Structure of one feature! • "the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper." – from the editors’ intro to the special issue (on applications) of the KDD journal, Vol 5, 2001 * Officially, "Data Mining and Knowledge Discovery", Kluwer Publishers CS 760 – Machine Learning (UW-Madison)
HW0: Creating Your Dataset Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for target function (category) (Schema sketch: Studio – Name, Country, List of movies; Actor – Name, Year of birth, Gender, Oscar nominations, List of movies; Director/Producer – Name, Year of birth, List of movies; Movie – Title, Genre, Year, Opening Wkend BO receipts, List of actors/actresses, Release season. Relationships: Studio Made Movie; Director Directed Movie; Actor Acted in Movie; Producer Produced Movie.) CS 760 – Machine Learning (UW-Madison)
HW0: Sample DB Choose a Boolean or binary-valued target function (category) • Opening weekend box-office receipts > $2 million • Movie is drama? (action, sci-fi,…) • Movies I like/dislike (e.g. Tivo) CS 760 – Machine Learning (UW-Madison)
HW0: Representing as a Fixed-Length Feature Vector <discuss on chalkboard> Note: some advanced ML approaches do not require such "feature mashing" (e.g., ILP) CS 760 – Machine Learning (UW-Madison)
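A hedged Java sketch of what such "feature mashing" might look like for the IMDB example: set-valued fields are collapsed into counts or flags so that every movie becomes a fixed-length vector. The particular features, field names, and the $2-million target from the previous slide are used purely for illustration.

public class MovieFeatureMashing {
    // A raw IMDB-style record with set-valued fields (not fixed length).
    record Movie(String title, String genre, int year,
                 String[] actors, String director, double openingReceiptsMillions) {}

    // "Mash" the variable-length record into a fixed-length feature vector.
    // The particular features chosen here are hypothetical, just to show the idea.
    static String[] toFeatureVector(Movie m) {
        return new String[] {
            m.genre(),                              // nominal feature
            String.valueOf(m.year()),               // ordered/continuous feature
            String.valueOf(m.actors().length),      // set of actors -> a count feature
            String.valueOf(m.director() != null)    // presence flag
        };
    }

    // Example Boolean target from the earlier slide: opening weekend receipts > $2 million.
    static String label(Movie m) {
        return m.openingReceiptsMillions() > 2.0 ? "positive" : "negative";
    }

    public static void main(String[] args) {
        Movie m = new Movie("Some Movie", "drama", 2001,
                            new String[] {"Actor A", "Actor B"}, "Director C", 3.5);
        System.out.println(String.join(", ", toFeatureVector(m)) + " -> " + label(m));
    }
}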
IMDB@umass David Jensen’s group at UMass uses Naïve Bayes and other ML algo’s on the IMDB • Opening weekend box-office receipts > $2 million • 25 attributes • Accuracy = 83.3% • Default accuracy = 56% (default algo?) • Movie is drama? • 12 attributes • Accuracy = 71.9% • Default accuracy = 51% http://kdl.cs.umass.edu/proximity/about.html CS 760 – Machine Learning (UW-Madison)
First Algorithm in Detail • K-Nearest Neighbors / Instance-Based Learning (k-NN/IBL) • Distance functions • Kernel functions • Feature selection (applies to all ML algo’s) • IBL Summary Chapter 8 of Mitchell CS 760 – Machine Learning (UW-Madison)
Some Common Jargon (Discrete vs. Real Outputs) • Classification • Learning a discrete-valued function • Regression • Learning a real-valued function IBL easily extended to regression tasks (and to multi-category classification) CS 760 – Machine Learning (UW-Madison)
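To make the regression extension concrete, here is a minimal Java sketch that reuses the distance-and-sort pattern from the earlier k-NN sketch but averages the K neighbors' real-valued outputs instead of taking a majority vote; names and the Euclidean distance choice are illustrative.

import java.util.Arrays;
import java.util.Comparator;

public class KnnRegression {
    // k-NN regression: predict the mean target value of the K closest training examples.
    static double predict(double[][] trainX, double[] trainY, double[] query, int k) {
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(trainX[i], query)));
        double sum = 0;
        for (int j = 0; j < k; j++) sum += trainY[idx[j]];
        return sum / k;                       // average of the K neighbors' outputs
    }

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(s);
    }
}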
Variations on a Theme (From Aha, Kibler and Albert in ML Journal) • IB1 – keep all examples • IB2 – keep next instance if incorrectly classified by using previous instances • Uses less storage (good) • Order dependent (bad) • Sensitive to noisy data (bad) CS 760 – Machine Learning (UW-Madison)
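A minimal sketch of the IB2 rule just described, assuming a 1-NN lookup like the earlier sketches: a new example is stored only if the examples kept so far misclassify it (class and method names are illustrative).

import java.util.ArrayList;
import java.util.List;

public class IB2 {
    record Example(double[] x, String label) {}

    // IB2: process examples in order; store one only when the current memory gets it wrong.
    static List<Example> ib2(List<Example> stream) {
        List<Example> memory = new ArrayList<>();
        for (Example e : stream) {
            if (memory.isEmpty() || !nearestLabel(memory, e.x()).equals(e.label()))
                memory.add(e);                 // misclassified -> keep it
        }
        return memory;                         // typically much smaller than the full stream
    }

    // 1-NN lookup over the currently stored examples (squared Euclidean distance).
    static String nearestLabel(List<Example> memory, double[] query) {
        Example best = memory.get(0);
        double bestD = Double.MAX_VALUE;
        for (Example e : memory) {
            double d = 0;
            for (int i = 0; i < query.length; i++)
                d += (e.x()[i] - query[i]) * (e.x()[i] - query[i]);
            if (d < bestD) { bestD = d; best = e; }
        }
        return best.label();
    }
}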
Variations on a Theme (cont.) • IB3 – extend IB2 to more intelligently decide which examples to keep (see article) • Better handling of noisy data • Another Idea - cluster groups, keep example from each (median/centroid) • Less storage, faster lookup CS 760 – Machine Learning (UW-Madison)
Distance Functions • Key issue in IBL (instance-based learning) • One approach: assign weights to each feature CS 760 – Machine Learning (UW-Madison)
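For instance, a per-feature weighted Euclidean distance is one simple way to realize the "assign weights to each feature" idea; the weights would come from a tuning procedure or domain knowledge, and the values below are placeholders.

public class WeightedDistance {
    // Weighted Euclidean distance: important features get larger weights,
    // irrelevant features can be (nearly) switched off with weights near zero.
    static double weightedEuclidean(double[] x, double[] y, double[] w) {
        double sum = 0;
        for (int i = 0; i < x.length; i++)
            sum += w[i] * (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 5.0, 0.0};
        double[] b = {2.0, 5.0, 9.0};
        double[] weights = {1.0, 1.0, 0.01};   // placeholder weights: third feature mostly ignored
        System.out.println(weightedEuclidean(a, b, weights));
    }
}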