CS194-10 Fall 2011 Introduction to Machine Learning Machine Learning: An Overview


  1. CS194-10 Fall 2011: Introduction to Machine Learning. Machine Learning: An Overview

  2. People Avital Steinitz 2nd year CS PhD student Stuart Russell 30th-year CS PhD student Mert Pilanci 2nd year EE PhD student CS 194-10 Fall 2011, Stuart Russell

  3. Administrative details • Web page • Newsgroup

  4. Course outline • Overview of machine learning (today) • Classical supervised learning • Linear regression, perceptrons, neural nets, SVMs, decision trees, nearest neighbors, and all that • A little bit of theory, a lot of applications • Learning probabilistic models • Probabilistic classifiers (logistic regression, etc.) • Unsupervised learning, density estimation, EM • Bayes net learning • Time series models • Dimensionality reduction • Gaussian process models • Language models • Bandits and other exciting topics

  5. Lecture outline • Goal: Provide a framework for understanding all the detailed content to come, and why it matters • Learning: why and how • Supervised learning • Classical: finding simple, accurate hypotheses • Probabilistic: finding likely hypotheses • Bayesian: updating belief in hypotheses • Data and applications • Expressiveness and cumulative learning • CTBT (Comprehensive Nuclear-Test-Ban Treaty monitoring)

  6. Learning is… a computational process for improving performance based on experience

  9. Learning: Why? • The baby, assailed by eyes, ears, nose, skin, and entrails at once, feels it all as one great blooming, buzzing confusion … • [William James, 1890] • Learning is essential for unknown environments, i.e., when the designer lacks omniscience

  10. Learning: Why? • Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain. Presumably the child brain is something like a notebook as one buys it from the stationer's. Rather little mechanism, and lots of blank sheets. • [Alan Turing, 1950] • Learning is useful as a system construction method, i.e., expose the system to reality rather than trying to write it down

  11.–14. Learning: How? [figure slides]

  15. Structure of a learning agent

  16. Design of learning element • Key questions: • What is the agent design that will implement the desired performance? • Which piece of the agent system is to be improved, and how is that piece represented? • What data are available relevant to that piece? (In particular, do we know the right answers?) • What knowledge is already available?

  17. Examples • Supervised learning: correct answers for each training instance • Reinforcement learning: reward sequence, no correct answers • Unsupervised learning: “just make sense of the data”

  18. Supervised learning • To learn an unknown target function f • Input: a training set of labeled examples (xj, yj) where yj = f(xj) • E.g., xj is an image, f(xj) is the label “giraffe” • E.g., xj is a seismic signal, f(xj) is the label “explosion” • Output: hypothesis h that is “close” to f, i.e., predicts well on unseen examples (“test set”) • Many possible hypothesis families for h • Linear models, logistic regression, neural networks, decision trees, examples (nearest-neighbor), grammars, kernelized separators, etc.
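A minimal sketch of this setup, using the simplest hypothesis family on the list (nearest-neighbor) and made-up 1-D "seismic" data; the numbers and labels are illustrative, not from the slides:

```python
def nearest_neighbor_fit(examples):
    """'Training' for 1-NN just memorizes the labeled examples."""
    return list(examples)

def nearest_neighbor_predict(h, x):
    """Predict the label of the closest stored example (1-D inputs here)."""
    return min(h, key=lambda ex: abs(ex[0] - x))[1]

# Training set of (x_j, y_j) pairs with y_j = f(x_j); values are invented.
train = [(1.0, "explosion"), (1.2, "explosion"), (5.0, "earthquake")]
h = nearest_neighbor_fit(train)

print(nearest_neighbor_predict(h, 0.9))   # nearest stored example is labeled "explosion"
print(nearest_neighbor_predict(h, 4.8))   # nearest stored example is labeled "earthquake"
```

The point of the sketch: the learner never sees f itself, only the labeled pairs, and is judged by how well h predicts on inputs outside the training set.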

  21. Example: object recognition • Training pairs x → f(x): giraffe, giraffe, giraffe, llama, llama, llama
  22. Example: object recognition • Query: x = [new image], f(x) = ?

  23.–27. Example: curve fitting [figure slides]
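A curve-fitting sketch in the spirit of these slides: polynomials of increasing degree fit to noisy samples of an assumed target sin(2πx). Training error can only go down as the hypothesis class grows, which is exactly the fit-vs.-complexity tension the next slide raises. The target function and noise level are invented for illustration:

```python
# Fit polynomials of increasing degree to noisy samples of sin(2*pi*x)
# and compare training error: richer hypothesis spaces always fit the
# training data at least as well.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)              # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_err:.4f}")
```

Training error alone therefore cannot choose among these models; that is where held-out test error, regularization, or priors come in.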

  28. Basic questions • Which hypothesis space H to choose? • How to measure degree of fit? • How to trade off degree of fit vs. complexity? • “Ockham’s razor” • How do we find a good h? • How do we know if a good h will predict well?

  29. Philosophy of Science (Physics) • Which hypothesis space H to choose? • Deterministic hypotheses, usually mathematical formulas and/or logical sentences; implicit relevance determination • How to measure degree of fit? • Ideally, h will be consistent with data • How to trade off degree of fit vs. complexity? • Theory must be correct up to “experimental error” • How do we find a good h? • Intuition, imagination, inspiration (invent new terms!!) • How do we know if a good h will predict well? • Hume’s Problem of Induction: most philosophers give up

  30. Kolmogorov complexity (also MDL, MML) • Which hypothesis space H to choose? • All Turing machines (or programs for a UTM) • How to measure degree of fit? • Fit is perfect (program has to output data exactly) • How to trade off degree of fit vs. complexity? • Minimize size of program • How do we find a good h? • Undecidable (unless we bound time complexity of h) • How do we know if a good h will predict well? • (recent theory borrowed from PAC learning)

  31. Classical stats/ML: Minimize loss function • Which hypothesis space H to choose? • E.g., linear combinations of features: hw(x) = wTx • How to measure degree of fit? • Loss function, e.g., squared error Σj (yj – wTxj)2 • How to trade off degree of fit vs. complexity? • Regularization: complexity penalty, e.g., ||w||2 • How do we find a good h? • Optimization (closed-form, numerical); discrete search • How do we know if a good h will predict well? • Try it and see (cross-validation, bootstrap, etc.)
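The recipe on this slide can be sketched end to end. A minimal version assuming the linear hypothesis hw(x) = wTx and the squared-error-plus-L2 objective above, on made-up data; the closed form w = (XᵀX + λI)⁻¹Xᵀy is standard ridge regression:

```python
# Linear hypothesis + squared-error loss + L2 penalty, solved in closed form.
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Return w minimizing ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Noisy samples of y = 2*x + 1, with a constant bias feature appended.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
X = np.column_stack([x, np.ones_like(x)])
y = 2 * x + 1 + rng.normal(scale=0.05, size=20)

w = ridge_fit(X, y)
print(w)  # should land close to [2, 1]
```

The regularization weight λ here is the complexity knob: larger λ shrinks w toward zero, trading degree of fit for simplicity; "try it and see" means picking λ by cross-validation rather than by theory.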

  32. Probabilistic: Max. likelihood, max. a posteriori • Which hypothesis space H to choose? • Probability model P(y | x,h), e.g., Y ~ N(wTx, σ2) • How to measure degree of fit? • Data likelihood Πj P(yj | xj,h) • How to trade off degree of fit vs. complexity? • Regularization or prior: argmaxh P(h) Πj P(yj | xj,h) (MAP) • How do we find a good h? • Optimization (closed-form, numerical); discrete search • How do we know if a good h will predict well? • Empirical process theory (generalizes Chebyshev, CLT, PAC…) • Key assumption is that data are i.i.d.
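One connection worth making explicit: under the Gaussian model Y ~ N(wTx, σ²) on this slide, ranking hypotheses by data likelihood is the same as ranking them by squared error, which is how the probabilistic view recovers the classical loss. A small self-contained check with made-up data and two candidate weight vectors:

```python
# Under Y ~ N(w^T x, sigma^2), higher log-likelihood <=> lower squared error.
import math

def log_likelihood(w, xs, ys, sigma=1.0):
    """Sum_j log P(y_j | x_j, w) under the Gaussian noise model."""
    total = 0.0
    for x, y in zip(xs, ys):
        mean = sum(wi * xi for wi, xi in zip(w, x))
        total += (-0.5 * math.log(2 * math.pi * sigma**2)
                  - (y - mean) ** 2 / (2 * sigma**2))
    return total

def squared_error(w, xs, ys):
    return sum((y - sum(wi * xi for wi, xi in zip(w, x))) ** 2
               for x, y in zip(xs, ys))

xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]   # each x carries a bias feature
ys = [0.1, 1.1, 1.9]                         # roughly y = second feature

w_good, w_bad = (0.0, 1.0), (2.0, -1.0)
print(squared_error(w_good, xs, ys), squared_error(w_bad, xs, ys))
print(log_likelihood(w_good, xs, ys), log_likelihood(w_bad, xs, ys))
```

The log-likelihood is a constant minus the squared error divided by 2σ², so maximizing one minimizes the other; adding a Gaussian prior on w turns the same argument into the MAP/regularization correspondence.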

  33. Bayesian: Computing posterior over H • Which hypothesis space H to choose? • All hypotheses with nonzero a priori probability • How to measure degree of fit? • Data probability, as for MLE/MAP • How to trade off degree of fit vs. complexity? • Use prior, as for MAP • How do we find a good h? • Don’t! Bayes predictor P(y|x,D) = Σh P(y|x,h) P(h|D), with posterior P(h|D) ∝ P(D|h) P(h) • How do we know if a good h will predict well? • Silly question! Bayesian prediction is optimal!!
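The Bayes predictor above can be computed exactly when H is finite. A sketch with an assumed toy hypothesis space (candidate biases of a coin) and invented data, averaging over all hypotheses instead of choosing one:

```python
# Bayes prediction over a finite H: P(y|x,D) = sum_h P(y|x,h) P(h|D).

# Hypothesis space: possible values of P(heads), with a uniform prior.
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = {h: 1 / len(hypotheses) for h in hypotheses}

# Observed data D: 8 heads, 2 tails (invented).
heads, tails = 8, 2

# Posterior P(h|D) proportional to P(D|h) P(h), then normalized.
unnorm = {h: (h ** heads) * ((1 - h) ** tails) * prior[h] for h in hypotheses}
z = sum(unnorm.values())
posterior = {h: p / z for h, p in unnorm.items()}

# Bayes predictor for the next flip: average P(heads|h) under the posterior.
p_heads = sum(h * posterior[h] for h in hypotheses)
print(f"P(next flip is heads | D) = {p_heads:.3f}")
```

No single h is ever committed to; hypotheses that explain the data poorly simply get negligible posterior weight, which is how the prior and the fit trade off automatically.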

  35. Neon sculpture at Autonomy Corp.


  37. Lots of data • Web: estimated Google index 45 billion pages • Clickstream data: 10-100 TB/day • Transaction data: 5-50 TB/day • Satellite image feeds: ~1 TB/day/satellite • Sensor networks/arrays • CERN Large Hadron Collider ~100 petabytes/day • Biological data: 1-10 TB/day/sequencer • TV: 2 TB/day/channel; YouTube 4 TB/day uploaded • Digitized telephony: ~100 petabytes/day


  39. Real data are messy

  40. [Figure: arterial blood pressure trace (high/low/mean), 1 s scale]

  41. Application: satellite image analysis

  42. Application: Discovering DNA motifs ...TTGGAACAACCATGCACGGTTGATTCGTGCCTGTGACCGCGCGCCTCACACGGAAGACGCAGCCACCGGTTGTGATG TCATAGGGAATTCCCCATGTCGTGAATAATGCCTCGAATGATGAGTAATAGTAAAACGCAGGGGAGGTTCTTCAGTAGTA TCAATATGAGACACATACAAACGGGCGTACCTACCGCAGCTCAAAGCTGGGTGCATTTTTGCCAAGTGCCTTACTGTTAT CTTAGGACGGAAATCCACTATAAGATTATAGAAAGGAAGGCGGGCCGAGCGAATCGATTCAATTAAGTTATGTCACAAGG GTGCTATAGCCTATTCCTAAGATTTGTACGTGCGTATGACTGGAATTAATAACCCCTCCCTGCACTGACCTTGACTGAAT AACTGTGATACGACGCAAACTGAACGCTGCGGGTCCTTTATGACCACGGATCACGACCGCTTAAGACCTGAGTTGGAGTT GATACATCCGGCAGGCAGCCAAATCTTTTGTAGTTGAGACGGATTGCTAAGTGTGTTAACTAAGACTGGTATTTCCACTA GGACCACGCTTACATCAGGTCCCAAGTGGACAACGAGTCCGTAGTATTGTCCACGAGAGGTCTCCTGATTACATCTTGAA GTTTGCGACGTGTTATGCGGATGAAACAGGCGGTTCTCATACGGTGGGGCTGGTAAACGAGTTCCGGTCGCGGAGATAAC TGTTGTGATTGGCACTGAAGTGCGAGGTCTTAAACAGGCCGGGTGTACTAACCCAAAGACCGGCCCAGCGTCAGTGA...

  44. Application: User website behavior from clickstream data (from P. Smyth, UCI) 128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -, 128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -, 128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 
382, 414, 200, 0, POST, /spt/main.html, -,
User 1: 2 3 2 2 3 3 3 1 1 1 3 1 3 3 3 3
User 2: 3 3 3 1 1 1
User 3: 7 7 7 7 7 7 7 7
User 4: 1 5 1 1 1 5 1 5 1 1 1 1 1 1
User 5: 5 1 1 5 …

  45. Application: social network analysis • HP Labs email data: 500 users, 20k connections evolving over time

  46. Application: spam filtering • 200 billion spam messages sent per day • Asymmetric cost of false positive/false negative • Weak label: discarded without reading • Strong label (“this is spam”) hard to come by • Standard iid assumption violated: spammers alter spam generators to evade or subvert spam filters (“adversarial learning” task)
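The asymmetric-cost point can be made concrete with a toy decision rule: given a probabilistic spam score, the threshold for junking a message should come from the two error costs rather than defaulting to 0.5. The cost values below are invented for illustration, not from the slide:

```python
# Cost-sensitive decision rule for spam filtering (toy costs).
COST_FP = 10.0   # cost of junking a legitimate message (false positive)
COST_FN = 1.0    # cost of letting spam through (false negative)

def classify(p_spam):
    """Junk the message only when the expected cost of keeping it
    exceeds the expected cost of junking it:
      keep -> p_spam * COST_FN        (spam slips through)
      junk -> (1 - p_spam) * COST_FP  (legitimate mail lost)
    """
    return "spam" if p_spam * COST_FN > (1 - p_spam) * COST_FP else "ham"

# Equivalent threshold: junk when p_spam > COST_FP / (COST_FP + COST_FN) ~ 0.91.
print(classify(0.60))   # probably spam, but junking is too risky here
print(classify(0.95))   # confident enough to junk
```

This handles the cost asymmetry; the adversarial aspect on the slide is harder, since spammers shift the input distribution precisely to push scores below whatever threshold is in use.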

  47. Learning [diagram: data → learning → knowledge]

  48.–50. Learning [diagram: data + prior knowledge → learning → knowledge] • Crucial open problem: weak intermediate forms of knowledge that support future generalizations
