The Discipline and Future of Machine Learning Tom M. Mitchell E. Fredkin Professor and Department Head March 2007
The Discipline of Machine Learning The defining question: • How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? A process learns with respect to <T,P,E> if it • Improves its performance P • at task T • through experience E
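A toy instantiation of the <T,P,E> definition (illustrative only; the task, performance measure, and experience stream here are invented, not from the talk):

```python
# T = predict the next symbol in a stream, P = prediction accuracy,
# E = the symbols observed so far. Accuracy improves with experience.
class MajorityPredictor:
    def __init__(self):
        self.counts = {}

    def predict(self):                 # task T
        return max(self.counts, key=self.counts.get) if self.counts else "?"

    def learn(self, symbol):           # experience E
        self.counts[symbol] = self.counts.get(symbol, 0) + 1

p = MajorityPredictor()
hits = total = 0
for s in "HHTHHHTHHH":
    hits += p.predict() == s           # performance P: running accuracy
    total += 1
    p.learn(s)
print(hits / total)                    # 0.7: beats chance once E accumulates
```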
Machine Learning - Practice [Application areas: speech recognition, mining databases, object recognition, control learning, extracting facts from text] • Reinforcement learning • Supervised learning • Bayesian networks • Hidden Markov models • Unsupervised clustering • Explanation-based learning • ....
Machine Learning - Theory PAC Learning Theory (for supervised concept learning) relates the number of examples m, the representational complexity |H|, the error rate ε, and the failure probability δ: m ≥ (1/ε)(ln |H| + ln (1/δ)) • … also relating: • # of mistakes during learning • convergence rate • asymptotic performance • bias, variance • VC dimension • Other theories for • Reinforcement skill learning • Semi-supervised learning • Active student querying • …
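As a concrete reading of that bound (a sketch for consistent learners over a finite hypothesis class; the numbers are illustrative):

```python
from math import log, ceil

def pac_sample_bound(h_size: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for a consistent learner over a finite class H
    to reach true error <= epsilon with probability >= 1 - delta:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((log(h_size) + log(1 / delta)) / epsilon)

# e.g. |H| = 2**20 hypotheses, 5% error, 95% confidence:
print(pac_sample_bound(2**20, epsilon=0.05, delta=0.05))  # -> 338
```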
The Discipline of Machine Learning Machine Learning: • How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes? Computer Science: • How can we build machines that solve problems, and which problems are inherently tractable/intractable? Statistics: • What can be learned from data with a set of modeling assumptions, while taking into account the data-collection process?
[Diagram: Machine learning at the intersection of Computer science, Statistics, Animal learning (Cognitive science, Psychology, Neuroscience), Adaptive Control Theory and Robotics, Economics, and Evolution]
ML and CS • Machine learning already the preferred approach to • Speech recognition, Natural language processing • Computer vision • Medical outcomes analysis • Many robot control problems • … • The ML niche will grow • Why? All software → ML software
ML and Empirical Sciences • Empirical science is a learning process, subject to automation and to study • improve performance P (accuracy) • at task T (predict which gene knockouts will impact the aromatic amino acid (AAA) pathway, and how) • with experience E (active experimentation) [Figure: which protein ORFs influence which enzymes in the AAA pathway] Functional genomic hypothesis generation and experimentation by a robot scientist, King et al., Nature, 427(6971), 247-252
Our current state: • The problem of tabula-rasa function approximation is solved (in an 80-20 sense): • Given: • Class of hypotheses H = {h : X → Y} • Labeled examples {<xi, f(xi)>} • Determine: • The h from H that best approximates f • It's time to move on • Enrich the function approximation problem definition • Use function approximation as a building block • Work on new problems
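A minimal sketch of that "solved" setting, as empirical risk minimization over a small invented hypothesis class (illustrative, not from the slides):

```python
def erm(H, examples):
    """Return the h in H that best fits labeled examples (x, f(x))."""
    def empirical_error(h):
        return sum(h(x) != y for x, y in examples) / len(examples)
    return min(H, key=empirical_error)

# H: threshold classifiers on the integers; target f: "is x >= 5?"
H = [lambda x, t=t: x >= t for t in range(10)]
data = [(x, x >= 5) for x in range(10)]
best = erm(H, data)            # recovers the threshold t = 5 exactly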
Some Current Research Questions • When/how can unlabeled data be useful in function approximation? • How can assumed sparsity of relevant features be exploited in high dimensional nonparametric learning? • How can information learned from one task be transferred to simplify learning another? • What algorithms can learn control strategies from delayed rewards and other inputs? • What are the best “active learning” strategies for different learning problems? • To what degree can one preserve data privacy while obtaining the benefits of data mining?
A Quick Look Back Evolutionary and revolutionary changes. What might lead to the next revolution? [Timeline figure, 1960-2000: Samuel's checker learner; Winston's symbolic concept learner; Perceptrons; theories of perceptron capacity and learnability; theories of grammar induction; Version Spaces; rule learning; decision tree learning; explanation-based learning; architectures for learning and problem solving; neural networks; PAC learning theory; statistical perspective on learning; reinforcement learning; Bayes nets; HMMs; SVMs; non-parametric methods; dimensionality reduction; semi-supervised learning; transfer learning; applications in speech, robot control, large scale datamining, privacy preserving data mining]
Use Machine Learning to help understand Human Learning (and vice versa)
Models of Learning Processes Machine Learning: • # of examples • Error rate • Reinforcement learning • Explanations • Learning from examples • Complexity of learner's representation • Probability of success • Prior probabilities • Loss functions Human Learning: • # of examples • Error rate • Reinforcement learning • Explanations • Human supervision (lectures; questions, homeworks) • Attention, motivation • Skills vs. Principles • Implicit vs. Explicit learning • Memory, retention, forgetting • Hebbian learning, consolidation
Reinforcement Learning [Sutton and Barto 1981; Samuel 1957] Learn the value function V(s) = E[r_t + γ r_{t+1} + γ² r_{t+2} + …], where r_t is the observed immediate reward and V is the learned estimate of the sum of discounted future rewards.
Reinforcement Learning in ML To learn V, use each transition to generate a training signal: V(s_t) ← r_t + γ V(s_{t+1}) [Figure: chain S0 → S1 → S2 → S3 with intermediate rewards 0, final reward r = 100, and γ = .9, giving V(S3) = 100, V(S2) = 90, V(S1) = 81, V(S0) = 72.9]
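A minimal sketch of that training signal as a TD(0) update on the chain above (the learning rate and sweep count are illustrative choices):

```python
# TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), replayed on the
# slide's chain S0 -> S1 -> S2 -> S3 -> end, reward 100 on the last step.
gamma, alpha = 0.9, 0.1
V = {s: 0.0 for s in ["S0", "S1", "S2", "S3", "end"]}
episode = [("S0", 0, "S1"), ("S1", 0, "S2"), ("S2", 0, "S3"), ("S3", 100, "end")]

for _ in range(2000):                  # replay the episode until V converges
    for s, r, s_next in episode:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

print({s: round(v, 1) for s, v in V.items() if s != "end"})
# -> {'S0': 72.9, 'S1': 81.0, 'S2': 90.0, 'S3': 100.0}
```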
Reinforcement Learning in ML • Variants of RL have been used for a variety of practical control learning problems • Temporal Difference learning • Q learning • Learning MDPs, POMDPs • Theoretical results too • Assured convergence to optimal V(s) under certain conditions • Assured convergence for Q(s,a) under certain conditions
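Since the slide names Q learning among the variants, here is a hedged sketch of the tabular version; the TwoCellWorld environment is an invented stand-in, not from the talk:

```python
import random

class TwoCellWorld:
    """States 0 and 1; action 1 moves right, action 0 stays put.
    Reaching state 1 pays reward 100 and ends the episode."""
    actions = (0, 1)

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(self.s, a)
        done = self.s == 1
        return self.s, (100 if done else 0), done

def q_learning(env, episodes=2000, alpha=0.2, gamma=0.9, epsilon=0.1):
    Q = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda a: Q.get((s, a), 0.0))
            s2, r, done = env.step(a)
            # Q-learning update: bootstrap from the best next action
            best_next = max(Q.get((s2, a2), 0.0) for a2 in env.actions)
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + alpha * (r + gamma * best_next - q)
            s = s2
    return Q

random.seed(0)
Q = q_learning(TwoCellWorld())
print(round(Q[(0, 1)]), round(Q[(0, 0)]))  # ~100 and ~90: moving right wins
```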
Dopamine As Reward Signal [Figure: dopamine neuron firing over time t; Schultz et al., Science, 1997]
RL Models for Human Learning [Seymour et al., Nature 2004]
Human and Machine Learning Additional overlaps: • Learning of perceptual representations • Dimensionality reduction methods, low level percepts • Lewicki et al.: optimal sparse codes of natural scenes yield Gabor filters found in primate visual cortex. Similar result for auditory cortex. • Learning with redundant sensory input • Co-Training methods, sensory redundancy hypothesis in development • De Sa & Ballard; Coen: co-clustering voice/video yields phonemes • Mitchell & Perfetti: co-training in second language learning • Learning and explanations • Explanation-based learning, teaching concepts & skills, chunking • VanLehn et al.: explanation-based learning accounts for some human learning behaviors • Chi: students learn best when forced to explain • Newell; Anderson: chunking/knowledge-compilation models
Never-Ending Learning Current machine learning systems: • Learn one function • Are shut down after they learn it • Start from scratch when programmed to learn the next function Let’s study and construct learning processes that: • Learn many different things • Formulate their own next learning task • Use what they have already learned to help learn the next thing
Example: Never-ending learning robot Imagine a robot with three goals: (1) avoid collisions, (2) recharge when battery low, and (3) find and collect trash What is stopping us from giving it some trash examples, then letting it learn for a year? What must it start with to formulate and solve relevant learning subtasks? • Learn to recognize trash in scene • Learn where to search for trash, and when • Learn how close to get to find out whether trash is there • Learn to manipulate trash • Transfer what it learned about paper trash to help with bottle trash • Discover relevant subcategories of trash (e.g., plastic versus glass bottles), and of other objects in the environment
Core Questions for Never-Ending Learning Agent • What function or fact to learn next? • Self-reflection on performance, credit assignment • What representation for this target function or fact? • Choice of input-output representation for target function • E.g., "classify whether it's trash" • How to obtain (which type of) training experience? • Primarily self-supervised, but occasional teacher input • Guided by what prior knowledge? • Transfer learning, but transfer between what? • Can X → PaperTrash help learn X → PlasticTrash? • Can State(t) × Action(t) → State(t+1) help learn X → PlasticTrash?
Example: Never-ending language learner [Carlson, Cohen, Fahlman, Hong, Nyberg, Wang, ...] Read the Web project: Create a 24x7 web agent that each day: • Extracts more facts from the web into a structured database • Learns to extract facts better than yesterday Starting point: • Ontology of hundreds of categories and relations, and 6-10 training examples of each • Never-ending learning architecture • State-of-the-art language processing primitives • Learning mechanisms • Top level task: Populate a database of these categories and relations by reading the web, and improve continually
Q: how can it obtain useful training experience (i.e., self-supervise)? A: redundancy
Bootstrapping: Learning to extract named entities Is "Pittsburgh" a location? "I arrived in Pittsburgh on Saturday." provides two redundant views: x1: "I arrived in _________ on Saturday." (the surrounding context) x2: "Pittsburgh" (the entity itself)
Bootstrap learning to extract named entities [Riloff and Jones, 1999], [Collins and Singer, 1999], ... Initialization (seed locations): Australia, Canada, China, England, France, Germany, Japan, Mexico, Switzerland, United_states, … Learned contextual patterns: "locations in ?x", "operations in ?x", "republic of ?x", ... After iterations: South Africa, United Kingdom, Warrenton, Far_East, Oregon, Lexington, Europe, U.S._A., Eastern Canada, Blair, Southwestern_states, Texas, States, Singapore, Thailand, Maine, production_control, northern_Los, New_Zealand, eastern_Europe, Americas, Michigan, New_Hampshire, Hungary, south_america, district, Latin_America, Florida, ... (note the errors that creep in, e.g. production_control, northern_Los)
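A toy sketch of this bootstrapping loop (the matching and scoring logic here is invented for illustration; real systems such as Riloff and Jones's score candidates far more carefully to limit the semantic drift visible above):

```python
import re
from collections import Counter

def bootstrap(sentences, seeds, rounds=5, k_pat=3, k_ent=10):
    """Alternately (1) trust contexts that co-occur with known entities,
    (2) trust entities those contexts extract."""
    entities, patterns = set(seeds), set()
    for _ in range(rounds):
        pat_counts, ent_counts = Counter(), Counter()
        for s in sentences:                  # contexts of known entities
            for e in entities:
                if e in s:
                    pat_counts[s.replace(e, "?x")] += 1
        patterns |= {p for p, _ in pat_counts.most_common(k_pat)}
        for s in sentences:                  # entities the patterns match
            for p in patterns:
                m = re.fullmatch(re.escape(p).replace(r"\?x", r"(\w+)"), s)
                if m:
                    ent_counts[m.group(1)] += 1
        entities |= {e for e, _ in ent_counts.most_common(k_ent)}
    return entities, patterns

sents = ["I arrived in Pittsburgh on Saturday.", "I arrived in Boston on Saturday."]
print(sorted(bootstrap(sents, {"Pittsburgh"})[0]))  # -> ['Boston', 'Pittsburgh']
```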
Co-Training Idea: Train Classifier1 and Classifier2 to: 1. Correctly classify labeled examples 2. Agree on classification of unlabeled examples [Figure: for "I flew to New York today.", Classifier1 sees the context "I flew to ____ today" and produces Answer1; Classifier2 sees the entity "New York" and produces Answer2]
Co-Training Theory [Blum & Mitchell 98; Dasgupta 04, ...] Final accuracy depends on the # of labeled examples, the # of unlabeled examples, the number of redundant inputs, and the conditional dependence among inputs. Want inputs less dependent, increased number of redundant inputs, … Disagreement over unlabeled examples can bound true error.
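A minimal co-training sketch of the Blum & Mitchell idea, not their exact algorithm: it assumes numeric feature vectors for each view and uses scikit-learn's GaussianNB as a stand-in base learner.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(L1, L2, y, U1, U2, rounds=10):
    """L1/L2: labeled examples under views 1 and 2; y: their labels;
    U1/U2: the two views of the unlabeled pool."""
    L1, L2, y = list(L1), list(L2), list(y)
    pool = list(range(len(U1)))
    for _ in range(rounds):
        c1 = GaussianNB().fit(L1, y)          # view-1 classifier
        c2 = GaussianNB().fit(L2, y)          # view-2 classifier
        for clf, view in ((c1, U1), (c2, U2)):
            if not pool:
                return c1, c2
            probs = clf.predict_proba([view[i] for i in pool])
            best = int(probs.max(axis=1).argmax())   # most confident pick
            i = pool.pop(best)
            label = clf.classes_[probs[best].argmax()]
            # each view teaches the other: add the example to BOTH sets
            L1.append(U1[i]); L2.append(U2[i]); y.append(label)
    return c1, c2
```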
Example Bootstrap learning algorithms: • Classifying web pages [Blum&Mitchell 98; Slattery 99] • Classifying email [Kiritchenko&Matwin 01; Chan et al. 04] • Named entity extraction [Collins&Singer 99; Jones 05] • Wrapper induction [Muslea et al., 01; Mohapatra et al. 04] • Word sense disambiguation [Yarowsky 96] • Discovering new word senses [Pantel&Lin 02] • Synonym discovery [Lin et al., 03] • Relation extraction [Brin et al.; Yangarber et al. 00] • Statistical parsing [Sarkar 01]
Q: how can it choose next learning task? A: self-reflect on where it is failing, then formulate learning task to repair failure
Some strategies for generating new tasks • Collect more data from web • To learn about specific entities (e.g., "Rolling Stones") • To learn meaning of particular language (e.g., "will attend") • To locate easy-to-extract facts (e.g., web pages with lists) • Learn regularities from the populated KB • "Most LTI office names are of the form 'NSH dddd'" • Explore specializations of ontological categories • What distinguishes events occurring on CMU campus from those occurring elsewhere? Can this be predicted? What subsets of events warrant becoming categories? • Explore specializations of language structures • Which 'location' entities share surrounding language (e.g., "the city of ?x")? Do they share other properties?
Some Types of Knowledge to Learn • Linguistic regularities • {"spoon", "fork", "chopsticks"} occur often in "eat with my ___" • They're instances of ontology class "eating implement" • HTML layout regularities • HTML lists often contain items of the same class • Web site regularities • University departments often have a page listing all faculty • Regularities over extracted facts • "Professors typically have more publications than their advisees" • "Professors typically received their BS degree before their advisees" • Temporal stability • Birthdays don't change. Stock prices do.
Research Issues • What target knowledge representation? • How can the initial ontology be extended? • What types of self-reflection are required? • Can one learn language without non-linguistic knowledge? • How can we manage the mapping between text tokens and the non-text entities they describe? • What curriculum for staging the learning? • What active learning methods?
More Revolutionary Research Directions • Can we design new kinds of computer programming languages with explicit learning primitives? • Can we build robot scientists? • What are the fundamental tradeoffs between computational efficiency and statistical efficiency? • How can we build systems that learn from instruction, dialogs and problem sets, in addition to labeled examples? • How can we unify machine learning theories and models with those from other fields studying adaptation, e.g., adaptive control theory, economics, evolution?
Summary • Machine Learning research is (should be more) connected to understanding all learning processes • Field is ripe for new revolutionary directions: • Computational models for human learning • Never-ending learners • <your idea here>