CS194-10 Fall 2011: Introduction to Machine Learning
Machine Learning: An Overview
People
• Avital Steinitz, 2nd-year CS PhD student
• Stuart Russell, 30th-year CS PhD student
• Mert Pilanci, 2nd-year EE PhD student
CS 194-10 Fall 2011, Stuart Russell
Administrative details
• Web page
• Newsgroup
Course outline
• Overview of machine learning (today)
• Classical supervised learning
  • Linear regression, perceptrons, neural nets, SVMs, decision trees, nearest neighbors, and all that
  • A little bit of theory, a lot of applications
• Learning probabilistic models
  • Probabilistic classifiers (logistic regression, etc.)
  • Unsupervised learning, density estimation, EM
  • Bayes net learning
  • Time series models
• Dimensionality reduction
• Gaussian process models
• Language models
• Bandits and other exciting topics
Lecture outline
• Goal: Provide a framework for understanding all the detailed content to come, and why it matters
• Learning: why and how
• Supervised learning
  • Classical: finding simple, accurate hypotheses
  • Probabilistic: finding likely hypotheses
  • Bayesian: updating belief in hypotheses
• Data and applications
• Expressiveness and cumulative learning
• CTBT
Learning is… a computational process for improving performance based on experience
Learning: Why?
• "The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion…" [William James, 1890]
• Learning is essential for unknown environments, i.e., when the designer lacks omniscience
Learning: Why?
• "Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets." [Alan Turing, 1950]
• Learning is useful as a system construction method, i.e., expose the system to reality rather than trying to write it down
Learning: How?
Structure of a learning agent
Design of learning element
• Key questions:
  • What is the agent design that will implement the desired performance?
  • Which piece of the agent system is to be improved, and how is that piece represented?
  • What data are available relevant to that piece? (In particular, do we know the right answers?)
  • What knowledge is already available?
Examples
• Supervised learning: correct answers for each training instance
• Reinforcement learning: reward sequence, no correct answers
• Unsupervised learning: "just make sense of the data"
Supervised learning
• To learn an unknown target function f
• Input: a training set of labeled examples (xj, yj) where yj = f(xj)
  • E.g., xj is an image, f(xj) is the label "giraffe"
  • E.g., xj is a seismic signal, f(xj) is the label "explosion"
• Output: hypothesis h that is "close" to f, i.e., predicts well on unseen examples (the "test set")
• Many possible hypothesis families for h
  • Linear models, logistic regression, neural networks, decision trees, examples (nearest-neighbor), grammars, kernelized separators, etc.
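A minimal sketch of this setup, using the nearest-neighbor family from the list above: the hypothesis simply memorizes the training set (xj, yj) and predicts the label of the closest stored example. The features, labels, and distance function here are illustrative, not from the course.

```python
# Supervised learning sketch: a 1-nearest-neighbor classifier.
# Training data and features below are made up for illustration.

def nearest_neighbor_predict(train, x):
    """Return the label y_j of the training example x_j closest to x.

    train: list of (x_j, y_j) pairs, each x_j a tuple of numeric features
    x: a feature tuple to classify
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(train, key=lambda pair: sq_dist(pair[0], x))
    return label

# Toy training set: two features per example, labels "giraffe"/"llama"
train = [((5.0, 2.0), "giraffe"), ((5.5, 2.2), "giraffe"),
         ((1.5, 1.0), "llama"), ((1.2, 0.9), "llama")]

print(nearest_neighbor_predict(train, (5.2, 2.1)))  # near the giraffe cluster
```

Note that h is never written down explicitly here; the training set itself, plus the distance function, is the hypothesis.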
Example: object recognition
• Training examples: images x with labels f(x) = giraffe, giraffe, giraffe, llama, llama, llama
• Test: given a new image X, predict f(X) = ?
Example: curve fitting
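The curve-fitting example can be sketched in a few lines: fit a low-degree polynomial to (x, y) pairs by least squares and use it to predict at unseen x. The data here are synthetic (a known quadratic with no noise), chosen only to make the fit easy to check.

```python
# Curve fitting as supervised learning: least-squares polynomial fit.
# Synthetic, noiseless data from a known quadratic, for illustration only.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x**2 - 3.0 * x + 1.0           # targets from f(x) = 2x^2 - 3x + 1

coeffs = np.polyfit(x, y, deg=2)          # least-squares fit, highest power first
h = np.poly1d(coeffs)                     # the learned hypothesis h(x)

print(np.round(coeffs, 6))                # recovers [2, -3, 1] on noiseless data
print(h(5.0))                             # prediction at an unseen point
```

With noisy data and a higher degree, the same code overfits, which is exactly the fit-vs.-complexity trade-off raised on the next slide.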
Basic questions
• Which hypothesis space H to choose?
• How to measure degree of fit?
• How to trade off degree of fit vs. complexity?
  • "Ockham's razor"
• How do we find a good h?
• How do we know if a good h will predict well?
Philosophy of Science (Physics)
• Which hypothesis space H to choose?
  • Deterministic hypotheses, usually mathematical formulas and/or logical sentences; implicit relevance determination
• How to measure degree of fit?
  • Ideally, h will be consistent with the data
• How to trade off degree of fit vs. complexity?
  • Theory must be correct up to "experimental error"
• How do we find a good h?
  • Intuition, imagination, inspiration (invent new terms!!)
• How do we know if a good h will predict well?
  • Hume's Problem of Induction: most philosophers give up
Kolmogorov complexity (also MDL, MML)
• Which hypothesis space H to choose?
  • All Turing machines (or programs for a UTM)
• How to measure degree of fit?
  • Fit is perfect (the program has to output the data exactly)
• How to trade off degree of fit vs. complexity?
  • Minimize the size of the program
• How do we find a good h?
  • Undecidable (unless we bound the time complexity of h)
• How do we know if a good h will predict well?
  • (recent theory borrowed from PAC learning)
Classical stats/ML: Minimize a loss function
• Which hypothesis space H to choose?
  • E.g., linear combinations of features: hw(x) = wTx
• How to measure degree of fit?
  • Loss function, e.g., squared error Σj (yj − wTxj)²
• How to trade off degree of fit vs. complexity?
  • Regularization: complexity penalty, e.g., ||w||²
• How do we find a good h?
  • Optimization (closed-form, numerical); discrete search
• How do we know if a good h will predict well?
  • Try it and see (cross-validation, bootstrap, etc.)
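This recipe, squared-error loss plus an L2 complexity penalty, has a closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy (ridge regression). A sketch with synthetic data, where the true weights are known so the fit can be checked:

```python
# Classical recipe sketch: minimize ||y - Xw||^2 + lam * ||w||^2 in closed form.
# Data are synthetic; w_true and lam are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)   # targets with a little noise

lam = 0.1                                     # regularization strength
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(np.round(w_hat, 2))                     # close to w_true for small lam
```

Larger λ shrinks w_hat toward zero, trading degree of fit for lower complexity; λ itself is typically chosen by the "try it and see" step (cross-validation).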
Probabilistic: Max. likelihood, max. a posteriori
• Which hypothesis space H to choose?
  • Probability model P(y | x, h), e.g., Y ~ N(wTx, σ²)
• How to measure degree of fit?
  • Data likelihood Πj P(yj | xj, h)
• How to trade off degree of fit vs. complexity?
  • Regularization or prior: argmaxh P(h) Πj P(yj | xj, h)  (MAP)
• How do we find a good h?
  • Optimization (closed-form, numerical); discrete search
• How do we know if a good h will predict well?
  • Empirical process theory (generalizes Chebyshev, CLT, PAC…); key assumption is (i)id
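The simplest instance of maximum likelihood: for a Gaussian model Y ~ N(μ, σ²) with no inputs x, maximizing Πj P(yj | h) gives the sample mean for μ and the (biased, divide-by-n) sample variance for σ². The data below are made up to keep the arithmetic checkable.

```python
# Maximum-likelihood estimation sketch for Y ~ N(mu, sigma^2).
# MLE: mu_hat = sample mean; var_hat = sum of squared deviations / n.
# The data list is illustrative.

data = [2.1, 1.9, 2.0, 2.2, 1.8]
n = len(data)

mu_hat = sum(data) / n
var_hat = sum((y - mu_hat) ** 2 for y in data) / n

print(mu_hat)    # ≈ 2.0
print(var_hat)   # ≈ 0.02
```

With inputs x and the linear-Gaussian model Y ~ N(wTx, σ²) on the slide, the same maximization over w reduces to least squares, linking this slide to the previous one.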
Bayesian: Computing posterior over H
• Which hypothesis space H to choose?
  • All hypotheses with nonzero a priori probability
• How to measure degree of fit?
  • Data probability, as for MLE/MAP
• How to trade off degree of fit vs. complexity?
  • Use prior, as for MAP
• How do we find a good h?
  • Don't! Bayes predictor P(y|x,D) = Σh P(y|x,h) P(h|D), where P(h|D) ∝ P(D|h) P(h)
• How do we know if a good h will predict well?
  • Silly question! Bayesian prediction is optimal!!
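On a tiny discrete hypothesis space the Bayes predictor can be computed exactly. A sketch with two made-up coin-bias hypotheses and a uniform prior: weight each hypothesis's prediction by its posterior, rather than committing to a single h.

```python
# Bayes predictor sketch: P(heads | D) = sum_h P(heads | h) P(h | D),
# with P(h | D) proportional to P(D | h) P(h). All numbers are illustrative.

hypotheses = {"fair": 0.5, "biased": 0.9}   # h -> P(heads | h)
prior = {"fair": 0.5, "biased": 0.5}        # uniform prior over H

data = ["H", "H", "H", "T"]                 # observed flips D

def likelihood(p_heads, flips):
    out = 1.0
    for f in flips:
        out *= p_heads if f == "H" else 1.0 - p_heads
    return out

# Posterior P(h | D): normalize P(D | h) P(h) over hypotheses
unnorm = {h: likelihood(p, data) * prior[h] for h, p in hypotheses.items()}
z = sum(unnorm.values())
posterior = {h: u / z for h, u in unnorm.items()}

# Bayes-optimal prediction for the next flip: average over hypotheses
p_next_heads = sum(hypotheses[h] * posterior[h] for h in hypotheses)
print(round(p_next_heads, 3))
```

The prediction lies between the two hypotheses' predictions, pulled toward the one the data favor; with more data the posterior concentrates and the Bayes predictor approaches the MAP prediction.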
Neon sculpture at Autonomy Corp.
Lots of data
• Web: estimated Google index 45 billion pages
• Clickstream data: 10-100 TB/day
• Transaction data: 5-50 TB/day
• Satellite image feeds: ~1 TB/day/satellite
• Sensor networks/arrays
  • CERN Large Hadron Collider: ~100 petabytes/day
• Biological data: 1-10 TB/day/sequencer
• TV: 2 TB/day/channel; YouTube: 4 TB/day uploaded
• Digitized telephony: ~100 petabytes/day
Real data are messy
[Figure: arterial blood pressure traces (high/low/mean), 1 s timescale]
Application: satellite image analysis
Application: Discovering DNA motifs
...TTGGAACAACCATGCACGGTTGATTCGTGCCTGTGACCGCGCGCCTCACACGGAAGACGCAGCCACCGGTTGTGATG
TCATAGGGAATTCCCCATGTCGTGAATAATGCCTCGAATGATGAGTAATAGTAAAACGCAGGGGAGGTTCTTCAGTAGTA
TCAATATGAGACACATACAAACGGGCGTACCTACCGCAGCTCAAAGCTGGGTGCATTTTTGCCAAGTGCCTTACTGTTAT
CTTAGGACGGAAATCCACTATAAGATTATAGAAAGGAAGGCGGGCCGAGCGAATCGATTCAATTAAGTTATGTCACAAGG
GTGCTATAGCCTATTCCTAAGATTTGTACGTGCGTATGACTGGAATTAATAACCCCTCCCTGCACTGACCTTGACTGAAT
AACTGTGATACGACGCAAACTGAACGCTGCGGGTCCTTTATGACCACGGATCACGACCGCTTAAGACCTGAGTTGGAGTT
GATACATCCGGCAGGCAGCCAAATCTTTTGTAGTTGAGACGGATTGCTAAGTGTGTTAACTAAGACTGGTATTTCCACTA
GGACCACGCTTACATCAGGTCCCAAGTGGACAACGAGTCCGTAGTATTGTCCACGAGAGGTCTCCTGATTACATCTTGAA
GTTTGCGACGTGTTATGCGGATGAAACAGGCGGTTCTCATACGGTGGGGCTGGTAAACGAGTTCCGGTCGCGGAGATAAC
TGTTGTGATTGGCACTGAAGTGCGAGGTCTTAAACAGGCCGGGTGTACTAACCCAAAGACCGGCCCAGCGTCAGTGA...
Application: User website behavior from clickstream data (from P. Smyth, UCI)
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
Session sequences:
User 1: 2 3 2 2 3 3 3 1 1 1 3 1 3 3 3 3
User 2: 3 3 3 1 1 1
User 3: 7 7 7 7 7 7 7 7
User 4: 1 5 1 1 1 5 1 5 1 1 1 1 1 1
User 5: 5 1 1 5 …
…
Application: social network analysis
• HP Labs email data: 500 users, 20k connections, evolving over time
Application: spam filtering
• 200 billion spam messages sent per day
• Asymmetric cost of false positive/false negative
• Weak label: discarded without reading
• Strong label ("this is spam") hard to come by
• Standard iid assumption violated: spammers alter spam generators to evade or subvert spam filters ("adversarial learning" task)
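A classic baseline for this task (not specified on the slide, included here as a hedged sketch) is a naive Bayes word-count classifier: score each message under per-class word models and pick the class with higher posterior. Vocabulary, counts, priors, and messages below are all made up.

```python
# Naive Bayes spam-filter sketch with add-alpha smoothing.
# All counts and the class prior are hypothetical illustration values.
import math

# Per-class word counts from a hypothetical labeled corpus
spam_counts = {"free": 20, "money": 15, "meeting": 1}
ham_counts = {"free": 2, "money": 3, "meeting": 25}
p_spam = 0.4                                   # assumed class prior P(spam)

def log_score(counts, prior, words, alpha=1.0):
    """log P(class) + sum_i log P(word_i | class), with add-alpha smoothing."""
    total = sum(counts.values())
    vocab = len(counts)
    score = math.log(prior)
    for w in words:
        score += math.log((counts.get(w, 0) + alpha) / (total + alpha * vocab))
    return score

def classify(words):
    s = log_score(spam_counts, p_spam, words)
    h = log_score(ham_counts, 1.0 - p_spam, words)
    return "spam" if s > h else "ham"

print(classify(["free", "money"]))
print(classify(["meeting"]))
```

The iid-violation bullet above is exactly where this baseline breaks: spammers shift the word distribution over time, so the fixed counts go stale and the model must be retrained or made adaptive.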
Learning: prior knowledge + data → knowledge
• Crucial open problem: weak intermediate forms of knowledge that support future generalizations