An introduction to machine learning and probabilistic graphical models Kevin Murphy MIT AI Lab Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003
Overview • Supervised learning • Unsupervised learning • Graphical models • Learning relational models Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
Supervised learning Learn to approximate a function F(x1, x2, x3) -> t from a training set of (x, t) pairs
Supervised learning Training data -> Learner -> Hypothesis; Hypothesis + Testing data -> Prediction
Key issue: generalization Can’t just memorize the training set (overfitting)
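To make the pipeline concrete, here is a minimal sketch (assuming scikit-learn, which the talk does not mention): fit a hypothesis on training data and check generalization on held-out test data; an unconstrained decision tree that memorizes the training set illustrates overfitting.

```python
# A sketch of the supervised-learning pipeline, assuming scikit-learn (not part
# of the talk): train on one split, measure generalization on the other.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier().fit(X_train, y_train)                # can memorize (overfit)
shallow = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # constrained hypothesis

for name, model in [("deep tree", deep), ("depth-3 tree", shallow)]:
    print(name, "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```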
Hypothesis spaces • Decision trees • Neural networks • K-nearest neighbors • Naïve Bayes classifier • Support vector machines (SVMs) • Boosted decision stumps • …
Perceptron (neural net with no hidden layers) Linearly separable data
Margin The linear separator with the largest margin is the best one to pick
Kernel trick The kernel implicitly maps from 2D (x1, x2) to 3D (z1, z2, z3), making the problem linearly separable
Support Vector Machines (SVMs) • Two key ideas: • Large margins • Kernel trick
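As a rough illustration of the kernel trick (a toy example, not from the talk), the explicit degree-2 feature map φ(x1, x2) = (x1², √2·x1x2, x2²) sends concentric 2D classes into 3D, where a linear rule separates them; its inner product is the polynomial kernel (x·x')².

```python
# Sketch of the kernel-trick idea: 2-D points that are not linearly separable
# (inner cluster vs. outer ring) become separable by a plane after an explicit
# degree-2 feature map. Data and thresholds are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])  # radii of the two classes
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)

# explicit map phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2); phi(x).phi(x') = (x.x')^2
Z = np.column_stack([X[:, 0] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1],
                     X[:, 1] ** 2])

# In feature space the classes are split by the plane z1 + z3 = const,
# i.e. by thresholding the squared radius: linear in Z, non-linear in X.
radius_sq = Z[:, 0] + Z[:, 2]
accuracy = np.mean((radius_sq > 1.5 ** 2).astype(int) == y)
print(f"linear rule in feature space: accuracy = {accuracy:.2f}")
```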
Boosting Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations Boosting maximizes the margin
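A hedged sketch of boosting decision stumps, assuming scikit-learn is available (the dataset and parameters are arbitrary): AdaBoost combines many depth-1 trees into a weighted vote, which typically beats any single stump.

```python
# A sketch of boosting decision stumps, assuming scikit-learn is available.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)        # one weak learner
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # weighted combination
                             n_estimators=100).fit(X_tr, y_tr)

print("single stump  :", stump.score(X_te, y_te))
print("boosted stumps:", boosted.score(X_te, y_te))
```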
Supervised learning success stories • Face detection • Steering an autonomous car across the US • Detecting credit card fraud • Medical diagnosis • …
Unsupervised learning • What if there are no output labels?
K-means clustering • Guess number of clusters, K • Guess initial cluster centers, μ1, μ2 • Assign data points xi to nearest cluster center • Re-compute cluster centers based on assignments • Reiterate
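A bare-bones NumPy sketch of exactly these steps (toy data and a random initialization, not from the talk):

```python
# Minimal K-means: guess K, guess centers, then alternate assignment and re-estimation.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]          # initial guess for the centers
    for _ in range(n_iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # re-compute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                  # converged
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, K=2)
print(centers)
```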
AutoClass (Cheeseman et al, 1986) • EM algorithm for mixtures of Gaussians • “Soft” version of K-means • Uses Bayesian criterion to select K • Discovered new types of stars from spectral data • Discovered new classes of proteins and introns from DNA/protein sequence databases
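AutoClass itself is not sketched here, but an analogous recipe with scikit-learn is EM for a mixture of Gaussians ("soft" K-means) plus a model-selection score to choose K; note that AutoClass used a fully Bayesian criterion rather than the BIC used below.

```python
# Not AutoClass, but an analogous sketch: EM for a mixture of Gaussians with
# BIC used to choose the number of components K. Assumes scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2)),
               rng.normal([0, 5], 1, (100, 2))])        # three synthetic clusters

scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in range(1, 7)}
best_k = min(scores, key=scores.get)                    # BIC: lower is better
print("chosen K =", best_k)

# "soft" assignments: each point gets a responsibility for every cluster
responsibilities = GaussianMixture(best_k, random_state=0).fit(X).predict_proba(X)
```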
Principal Component Analysis (PCA) PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest. PCA seeks a projection that best represents the data in a least-squares sense.
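A minimal PCA sketch via the SVD of the centered data (an assumed implementation, equivalent in spirit to the description above):

```python
# PCA sketch: center the data, take the top right singular vectors
# (directions of greatest scatter), and project onto them.
import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]          # principal directions
    return X_centered @ components.T        # low-dimensional projection

X = np.random.randn(200, 10)
Z = pca(X, n_components=2)
print(Z.shape)                              # (200, 2)
```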
Discovering rules (data mining) Find the most frequent patterns (association rules) Num in household = 1 ^ num children = 0 => language = English Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
Unsupervised learning: summary • Clustering • Hierarchical clustering • Linear dimensionality reduction (PCA) • Non-linear dimensionality reduction • Learning rules
Discovering networks From data visualization to causal discovery
Networks in biology • Most processes in the cell are controlled by networks of interacting molecules: • Metabolic networks • Signal transduction networks • Regulatory networks • Networks can be modeled at multiple levels of detail/realism • Molecular level • Concentration level • Qualitative level (in order of decreasing detail)
Molecular level: Lysis-Lysogeny circuit in Lambda phage Arkin et al. (1998), Genetics 149(4):1633-48 • 5 genes, 67 parameters based on 50 years of research • Stochastic simulation required supercomputer
Concentration level: metabolic pathways (figure: gene network g1–g5 with edge weights wij) • Usually modeled with differential equations
Probabilistic graphical models • Supports graph-based modeling at various levels of detail • Models can be learned from noisy, partial data • Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations… • But can also model deterministic, causal processes. "The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell "Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace
Graphical models: outline • What are graphical models? • Inference • Structure learning
Simple probabilistic model: linear regression Y = α + βX + noise (a deterministic, functional relationship plus noise)
Simple probabilistic model: linear regression Y = α + βX + noise “Learning” = estimating the parameters α, β, σ from (x, y) pairs: α is the empirical mean (intercept), β can be estimated by least squares, and σ² is the residual variance
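A small worked example of the estimation step, using ordinary least squares on synthetic (x, y) pairs (the data and true parameters are made up for illustration):

```python
# Least-squares estimates of alpha, beta and the residual variance sigma^2
# for the model Y = alpha + beta*X + noise, from (x, y) pairs.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)    # synthetic data: true alpha=2, beta=0.5

A = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (alpha + beta * x)
sigma2 = residuals.var()                     # residual variance
print(alpha, beta, sigma2)
```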
Piecewise linear regression Latent “switch” variable – hidden process at work
Probabilistic graphical model for piecewise linear regression • Hidden variable Q chooses which set of parameters to use for predicting the output Y • The value of Q depends on the value of the input X • This is an example of “mixtures of experts” Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; can be solved with EM (c.f. K-means)
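The full model makes Q's distribution depend on X; as a simpler, hedged sketch of the EM idea, the code below fits a mixture of two regression lines (fixed mixing weights rather than an input-dependent gate), alternating soft assignments of points to lines (E-step) with weighted least-squares refits (M-step). All data and initial values are made up.

```python
# EM for a mixture of two regression lines (a simplification of mixtures of experts:
# the hidden switch Q has fixed mixing weights instead of depending on X).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.where(x < 0, 1.0 + 0.2 * x, -1.0 + 2.0 * x) + rng.normal(0, 0.3, 200)

K = 2
Phi = np.column_stack([np.ones_like(x), x])      # features for (intercept, slope)
theta = np.array([[0.0, 0.0], [0.0, 1.0]])       # initial guesses for the two lines
pi_k = np.full(K, 1.0 / K)                       # mixing weights P(Q=k)
sigma2 = 1.0                                     # shared noise variance

for _ in range(100):
    # E-step: responsibility of each line for each data point
    resid = y[:, None] - Phi @ theta.T                               # (N, K)
    logp = np.log(pi_k) - 0.5 * resid ** 2 / sigma2
    logp -= logp.max(axis=1, keepdims=True)
    R = np.exp(logp)
    R /= R.sum(axis=1, keepdims=True)
    # M-step: weighted least squares per line, then mixing weights and noise
    for k in range(K):
        A = Phi * R[:, k, None]
        theta[k] = np.linalg.solve(Phi.T @ A, A.T @ y)
    pi_k = R.mean(axis=0)
    sigma2 = np.sum(R * (y[:, None] - Phi @ theta.T) ** 2) / len(y)

print(theta)   # roughly the two (intercept, slope) pairs; EM can also stop at a local optimum
```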
Classes of graphical models Graphical models are a subclass of probabilistic models; they divide into directed models (Bayes nets, DBNs) and undirected models (MRFs)
Bayesian Networks Compact representation of probability distributions via conditional independence
Qualitative part: a directed acyclic graph (DAG); nodes are random variables, edges denote direct influence (example network: Burglary, Earthquake, Radio, Alarm, Call)
Quantitative part: a set of conditional probability distributions, e.g. for the family of Alarm:
  E   B  | P(A=1|E,B)  P(A=0|E,B)
  e   b  |   0.9          0.1
  e  ¬b  |   0.2          0.8
  ¬e  b  |   0.9          0.1
  ¬e ¬b  |   0.01         0.99
Together they define a unique distribution in a factored form
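To make "a unique distribution in a factored form" concrete, here is a hand-written sketch of the joint P(B,E,A,R,C) = P(B) P(E) P(A|E,B) P(R|E) P(C|A); the P(A|E,B) table matches the slide, while the remaining numbers are invented placeholders.

```python
# Factored joint for the Burglary/Earthquake network. Only P(A|E,B) comes from
# the slide; the priors and the Radio/Call tables are made-up placeholders.

def bern(p_true, value):
    """P(X = value) for a binary variable with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true

P_B = 0.01                                            # assumed prior on Burglary
P_E = 0.02                                            # assumed prior on Earthquake
P_A = {(True, True): 0.9, (True, False): 0.2,         # P(Alarm=True | E, B), from the slide
       (False, True): 0.9, (False, False): 0.01}      # keyed by (E, B)
P_R = {True: 0.95, False: 0.001}                      # P(Radio=True | E), assumed
P_C = {True: 0.7, False: 0.05}                        # P(Call=True | A), assumed

def joint(b, e, a, r, c):
    return (bern(P_B, b) * bern(P_E, e) *
            bern(P_A[(e, b)], a) * bern(P_R[e], r) * bern(P_C[a], c))

# A full table over 5 binary variables needs 2**5 - 1 = 31 free numbers;
# the factored form above uses only 1 + 1 + 4 + 2 + 2 = 10.
print(joint(b=True, e=False, a=True, r=False, c=True))
```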
Example: “ICU Alarm” network Domain: monitoring intensive-care patients • 37 variables (MINVOLSET, KINKEDTUBE, PULMEMBOLUS, INTUBATION, VENTMACH, ..., HRBP, BP) • 509 parameters, instead of ≈ 2^54 for the full joint
Success stories for graphical models • Multiple sequence alignment • Forensic analysis • Medical and fault diagnosis • Speech recognition • Visual tracking • Channel coding at Shannon limit • Genetic pedigree analysis • …
Graphical models: outline • What are graphical models? ✓ • Inference • Structure learning
Probabilistic Inference • Posterior probabilities: the probability of any event given any evidence, P(X|E) • e.g., querying nodes of the Burglary/Earthquake network given observed Radio and Call
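Using the joint() sketch from the Bayesian-network slide above, posterior probabilities can be computed by brute-force enumeration; this is only feasible because the network is tiny, which is exactly why dedicated inference algorithms matter.

```python
# Inference by enumeration over the 5 binary variables of the joint() sketch above:
# sum the joint over all assignments consistent with the evidence, then normalize.
from itertools import product

def query(target, evidence):
    """P(target = True | evidence), with evidence given as a dict like {'c': True}."""
    names = ['b', 'e', 'a', 'r', 'c']
    num = den = 0.0
    for values in product([True, False], repeat=5):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint(**assignment)
        den += p
        if assignment[target]:
            num += p
    return num / den

print(query('b', {'c': True}))     # posterior probability of a burglary given a call
```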
Viterbi decoding Compute the most probable explanation (MPE) of the observed data Hidden Markov Model (HMM): hidden states X1, X2, X3 emit observations Y1, Y2, Y3 (speech example: “Tomato”)
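A compact Viterbi sketch for a discrete HMM (the transition, emission, and prior numbers below are toy values, not from the talk): dynamic programming over log-probabilities, followed by a backtrace, yields the most probable hidden state sequence.

```python
# Viterbi decoding for a discrete HMM with toy parameters.
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices; pi[i]=P(X1=i); A[i,j]=P(X_{t+1}=j|X_t=i);
    B[i,k]=P(Y_t=k|X_t=i). Returns the most probable hidden state sequence."""
    n_states, T = len(pi), len(obs)
    logp = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    logp[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for j in range(n_states):
            scores = logp[t - 1] + np.log(A[:, j])
            back[t, j] = scores.argmax()
            logp[t, j] = scores.max() + np.log(B[j, obs[t]])
    path = [logp[-1].argmax()]                 # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))
```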
Inference: computational issues Easy -> Hard: chains, trees, grids, dense loopy graphs (such as the ICU Alarm network) Many different inference algorithms exist, both exact and approximate
Bayesian inference • Bayesian probability treats parameters as random variables • Learning / parameter estimation is replaced by probabilistic inference of P(θ|D) • Example: Bayesian linear regression; the parameters are θ = (α, β, σ) • Parameters are tied (shared) across repetitions of the data (X1, Y1), ..., (Xn, Yn)
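A minimal sketch of the Bayesian linear regression example, assuming a Gaussian prior on the weights and a known noise variance (choices not specified in the talk); the posterior P(θ|D) is then Gaussian and available in closed form, so learning really is just inference.

```python
# Bayesian linear regression sketch under assumed choices: Gaussian prior on
# theta = (alpha, beta) and known noise variance, giving a Gaussian posterior.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 30)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 30)       # synthetic data

Phi = np.column_stack([np.ones_like(x), x])      # features for (alpha, beta)
sigma2, tau2 = 0.5 ** 2, 10.0                    # noise variance, prior variance

# posterior covariance and mean for theta = (alpha, beta)
post_cov = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(2) / tau2)
post_mean = post_cov @ Phi.T @ y / sigma2
print("posterior mean:", post_mean)              # approaches (1, 2) as data grows
```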
Bayesian inference • + Elegant – no distinction between parameters and other hidden variables • + Can use priors to learn from small data sets (c.f., one-shot learning by humans) • - Math can get hairy • - Often computationally intractable
Graphical models: outline • What are graphical models? ✓ • Inference ✓ • Structure learning
Why Struggle for Accurate Structure? Truth: Earthquake and Burglary are parents of Alarm Set, which is a parent of Sound • Missing an arc: cannot be compensated for by fitting parameters; wrong assumptions about domain structure • Adding an arc: increases the number of parameters to be estimated; wrong assumptions about domain structure
Score-based Learning Define a scoring function that evaluates how well a candidate structure over E, B, A matches the data (records of <E,B,A>: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>), then search for a structure that maximizes the score
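One possible scoring function is sketched below (BIC, an assumed choice; the talk does not commit to a particular score): the log-likelihood of the data under the structure's factored form, minus a penalty for the number of parameters.

```python
# BIC scoring sketch for candidate structures over binary variables.
import numpy as np

def bic_score(data, structure):
    """data: (N, n) array of 0/1 values; structure: dict child -> list of parent columns."""
    N, n = data.shape
    total = 0.0
    for child, parents in structure.items():
        k = len(parents)
        # encode each row's joint parent configuration as an integer
        configs = data[:, parents] @ (2 ** np.arange(k)) if k else np.zeros(N, dtype=int)
        for cfg in np.unique(configs):
            rows = data[configs == cfg]
            n1 = rows[:, child].sum()
            n0 = len(rows) - n1
            for c in (n0, n1):                       # ML multinomial log-likelihood
                if c > 0:
                    total += c * np.log(c / len(rows))
        total -= 0.5 * np.log(N) * (2 ** k)          # one free parameter per parent config
    return total

# Toy data over (E, B, A) where A is a noisy OR of E and B
rng = np.random.default_rng(0)
E, B = rng.integers(0, 2, 500), rng.integers(0, 2, 500)
A = ((E | B) ^ (rng.random(500) < 0.1)).astype(int)
data = np.column_stack([E, B, A])

print("A <- E, B :", bic_score(data, {0: [], 1: [], 2: [0, 1]}))
print("A <- E    :", bic_score(data, {0: [], 1: [], 2: [0]}))
print("A alone   :", bic_score(data, {0: [], 1: [], 2: []}))
```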
Learning Trees • Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree • If some of the variables are hidden, the problem becomes hard again, but can use EM to fit mixtures of trees
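A sketch of the standard construction behind this result (Chow-Liu style, with assumed toy data): weight each pair of variables by its empirical mutual information and take a maximum-weight spanning tree.

```python
# Chow-Liu-style tree learning sketch: pairwise mutual information as edge
# weights, then a max-weight spanning tree found with a small Prim-style loop.
import numpy as np

def mutual_information(xi, xj):
    """Plug-in mutual information between two binary columns (in nats)."""
    joint = np.histogram2d(xi, xj, bins=[2, 2])[0] / len(xi)
    pi_, pj = joint.sum(axis=1, keepdims=True), joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (pi_ @ pj)[mask]))

rng = np.random.default_rng(0)
X = rng.integers(0, 2, (500, 4))
X[:, 1] = X[:, 0] ^ (rng.random(500) < 0.1)   # variable 1 is a noisy copy of 0
X[:, 3] = X[:, 2] ^ (rng.random(500) < 0.1)   # variable 3 is a noisy copy of 2

n = X.shape[1]
W = np.zeros((n, n))                          # pairwise MI, upper triangle only
for i in range(n):
    for j in range(i + 1, n):
        W[i, j] = mutual_information(X[:, i], X[:, j])

# Max-weight spanning tree: repeatedly add the heaviest edge crossing the cut
in_tree, edges = {0}, []
while len(in_tree) < n:
    w, i, j = max((W[min(i, j), max(i, j)], i, j)
                  for i in in_tree for j in range(n) if j not in in_tree)
    edges.append((i, j))
    in_tree.add(j)
print("learned tree edges:", edges)           # expect (0, 1) and (2, 3) among them
```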
Heuristic Search • Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search • Define a search space: • search states are possible structures • operators make small changes to structure • Traverse the space looking for high-scoring structures • Search techniques: • Greedy hill-climbing • Best-first search • Simulated annealing • ...
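A generic greedy hill-climbing skeleton of the kind described above; the structure representation, score, and neighbor operators in the demo are toy placeholders rather than actual Bayes-net structure-learning code.

```python
# Greedy hill-climbing skeleton: move to the best-scoring neighbor until no
# neighbor improves the score (a local maximum).
def hill_climb(initial, score, neighbors, max_steps=1000):
    current, current_score = initial, score(initial)
    for _ in range(max_steps):
        scored = [(score(s), s) for s in neighbors(current)]
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:          # local maximum: stop
            break
        current, current_score = best, best_score
    return current, current_score

# Toy demo: "structures" are tuples of 0/1 edge indicators, neighbors flip one
# edge, and the score rewards matching a hidden target structure.
target = (1, 0, 1, 1, 0, 0)
score = lambda s: -sum(a != b for a, b in zip(s, target))
neighbors = lambda s: [s[:i] + (1 - s[i],) + s[i + 1:] for i in range(len(s))]
print(hill_climb((0,) * 6, score, neighbors))    # recovers the target with score 0
```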