
An introduction to machine learning and probabilistic graphical models



  1. An introduction to machine learning and probabilistic graphical models Kevin Murphy MIT AI Lab Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003

  2. Overview • Supervised learning • Unsupervised learning • Graphical models • Learning relational models Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides

  3. Supervised learning Learn to approximate a function f(x1, x2, x3) -> t from a training set of (x, t) pairs (e.g., examples labeled yes / no)

  4. Supervised learning Training data -> Learner -> Hypothesis; Hypothesis + testing data -> Prediction

  5. Key issue: generalization We can’t just memorize the training set (overfitting); the learned hypothesis must also label unseen examples correctly

  6. Hypothesis spaces • Decision trees • Neural networks • K-nearest neighbors • Naïve Bayes classifier • Support vector machines (SVMs) • Boosted decision stumps • …

  7. Perceptron (neural net with no hidden layers) Linearly separable data
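
A minimal numpy sketch of the perceptron learning rule (my illustration, not code from the talk); it assumes labels in {-1, +1} and nudges the separating hyperplane whenever a point is misclassified.

```python
import numpy as np

def perceptron(X, y, n_epochs=20):
    """Perceptron learning rule. X: (n, d) inputs; y: labels in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified: nudge the hyperplane
                w += yi * xi
                b += yi
    return w, b
```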

  8. Which separating hyperplane?

  9. The linear separator with the largest margin is the best one to pick

  10. What if the data is not linearly separable?

  11. Kernel trick A kernel implicitly maps the data from 2D to 3D (e.g., (x1, x2) -> (z1, z2, z3)), making the problem linearly separable
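
To make the kernel idea concrete, the sketch below (my own illustration, not from the slides) applies the explicit quadratic feature map (x1, x2) -> (x1^2, sqrt(2)·x1·x2, x2^2), whose inner product is the polynomial kernel, to points labeled by a circle; in the mapped 3-D space a plane separates the classes exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circle labels: not linearly separable in 2-D

# Explicit quadratic feature map (z1, z2, z3).
Z = np.column_stack([X[:, 0] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1],
                     X[:, 1] ** 2])

# In the mapped 3-D space the plane z1 + z3 = 1 separates the two classes exactly.
print(np.mean(((Z[:, 0] + Z[:, 2]) > 1).astype(int) == y))   # prints 1.0
```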

  12. Support Vector Machines (SVMs) • Two key ideas: • Large margins • Kernel trick
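
A hedged end-to-end example of both ideas, using scikit-learn's SVC (a library choice, not something named in the talk): an RBF kernel plus margin maximization on data that is not linearly separable in the original 2-D space.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable in 2-D

clf = SVC(kernel='rbf', C=1.0).fit(X, y)            # kernel trick + max-margin fit
print(clf.score(X, y))                               # training accuracy
```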

  13. Boosting Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations Boosting maximizes the margin
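
One way to try this in practice is scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision tree (a stump); this is an illustrative sketch, not the specific boosting variant discussed in the talk.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + np.sin(3 * X[:, 1]) > 0).astype(int)   # a mildly non-linear boundary

# Weighted combination of 50 decision stumps.
clf = AdaBoostClassifier(n_estimators=50).fit(X, y)
print(clf.score(X, y))
```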

  14. Supervised learning success stories • Face detection • Steering an autonomous car across the US • Detecting credit card fraud • Medical diagnosis • …

  15. Unsupervised learning • What if there are no output labels?

  16. K-means clustering • Guess the number of clusters, K • Guess initial cluster centers μ1, μ2 • Assign data points xi to the nearest cluster center • Re-compute cluster centers based on the assignments • Reiterate
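
A minimal numpy sketch of this loop (illustrative only; it does not handle empty clusters or test for convergence).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: assign points to the nearest center, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]          # initial guesses
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)                           # nearest-center assignment
        centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
    return centers, assign
```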

  17. AutoClass (Cheeseman et al, 1986) • EM algorithm for mixtures of Gaussians • “Soft” version of K-means • Uses Bayesian criterion to select K • Discovered new types of stars from spectral data • Discovered new classes of proteins and introns from DNA/protein sequence databases
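
As a rough stand-in for AutoClass, the sketch below fits a mixture of Gaussians with EM via scikit-learn's GaussianMixture; unlike AutoClass it does not do the Bayesian selection of K, which must be supplied.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, size=(100, 2)) for m in ([0, 0], [5, 5], [0, 5])])

gmm = GaussianMixture(n_components=3).fit(X)   # EM for a mixture of Gaussians
resp = gmm.predict_proba(X)                    # "soft" responsibilities, unlike K-means' hard assignments
```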

  18. Hierarchical clustering

  19. Principal Component Analysis (PCA) PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest. PCA seeks a projection that best represents the data in a least-squares sense.
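
A minimal PCA sketch via the SVD of the centered data matrix (one way it might be implemented, not code from the talk).

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples, n_features) onto its top-k principal directions."""
    Xc = X - X.mean(axis=0)                            # center the cloud
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                                # directions of greatest scatter
    return Xc @ components.T                           # least-squares-best k-D representation
```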

  20. Discovering nonlinear manifolds

  21. Combining supervised and unsupervised learning

  22. Discovering rules (data mining) Find the most frequent patterns (association rules), e.g.: Num in household = 1 ^ num children = 0 => language = English Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
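
A toy sketch of how the support and confidence of such a rule could be computed; the transactions below are made-up placeholders echoing the slide's attributes, not real data.

```python
transactions = [
    {"num_household=1", "num_children=0", "language=English"},
    {"num_children=0", "language=English", "married=false"},
    {"num_household=1", "num_children=0", "language=Spanish"},
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"num_household=1", "num_children=0"}, {"language=English"}))
```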

  23. Unsupervised learning: summary • Clustering • Hierarchical clustering • Linear dimensionality reduction (PCA) • Non-linear dimensionality reduction • Learning rules

  24. Discovering networks ? From data visualization to causal discovery

  25. Networks in biology • Most processes in the cell are controlled by networks of interacting molecules: • Metabolic networks • Signal transduction networks • Regulatory networks • Networks can be modeled at multiple levels of detail / realism (in decreasing detail): • Molecular level • Concentration level • Qualitative level

  26. Molecular level: Lysis-Lysogeny circuit in Lambda phage Arkin et al. (1998), Genetics 149(4):1633-48 • 5 genes, 67 parameters based on 50 years of research • Stochastic simulation required supercomputer

  27. Concentration level: metabolic pathways • Usually modeled with differential equations (e.g., a network of genes g1, ..., g5 coupled by interaction weights such as w12, w23, w55)

  28. Qualitative level: Boolean Networks

  29. Probabilistic graphical models • Supports graph-based modeling at various levels of detail • Models can be learned from noisy, partial data • Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations… • But can also model deterministic, causal processes. "The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell "Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace

  30. Graphical models: outline • What are graphical models? • Inference • Structure learning

  31. Simple probabilistic model: linear regression Deterministic (functional) relationship plus noise: Y = α + β X + noise

  32. Simple probabilistic model: linear regression Y = α + β X + noise “Learning” = estimating the parameters α, β, σ from (x, y) pairs: α̂ and β̂ can be estimated by least squares, and σ̂ is the residual variance
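
A minimal numpy sketch of this estimation step: fit α and β by least squares and take σ² as the variance of the residuals (my illustration of the procedure, not the talk's code).

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares estimates of alpha, beta and the residual variance sigma^2."""
    X = np.column_stack([np.ones_like(x), x])            # design matrix [1, x]
    (alpha, beta), *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - (alpha + beta * x)
    return alpha, beta, residuals.var()
```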

  33. Piecewise linear regression Latent “switch” variable – hidden process at work

  34. Probabilistic graphical model for piecewise linear regression (nodes: input X, hidden switch Q, output Y) • The hidden variable Q chooses which set of parameters to use for predicting Y. • The value of Q depends on the value of the input X. • This is an example of a “mixture of experts”. Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (c.f., K-means)
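
A rough sketch of the EM idea for this model, assuming a mixture of two linear regressions with Gaussian noise; the initialization and iteration count are arbitrary choices for illustration, not the talk's algorithmic details.

```python
import numpy as np

def em_piecewise(x, y, n_iter=50, seed=0):
    """EM for a mixture of two linear regressions; Q is the hidden switch."""
    rng = np.random.default_rng(seed)
    resp = rng.uniform(size=(len(x), 2))
    resp /= resp.sum(axis=1, keepdims=True)           # random soft assignments to start
    X = np.column_stack([np.ones_like(x), x])
    W = np.zeros((2, 2))                               # rows: experts; cols: (alpha, beta)
    sigma2 = np.full(2, np.var(y))
    mix = np.full(2, 0.5)
    for _ in range(n_iter):
        for k in range(2):                             # M-step: weighted least squares
            w = resp[:, k]
            W[k] = np.linalg.lstsq(X * np.sqrt(w)[:, None],
                                   y * np.sqrt(w), rcond=None)[0]
            res = y - X @ W[k]
            sigma2[k] = (w * res ** 2).sum() / w.sum()
            mix[k] = w.mean()
        for k in range(2):                             # E-step: new responsibilities
            res = y - X @ W[k]
            resp[:, k] = mix[k] * np.exp(-0.5 * res ** 2 / sigma2[k]) / np.sqrt(sigma2[k])
        resp /= resp.sum(axis=1, keepdims=True)
    return W, resp
```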

  35. Classes of graphical models Probabilistic models ⊃ graphical models, which divide into directed models (Bayes nets, DBNs) and undirected models (MRFs)

  36. Bayesian Networks Compact representation of probability distributions via conditional independence Family of Alarm: Burglary and Earthquake are parents of Alarm; Earthquake is the parent of Radio; Alarm is the parent of Call Qualitative part: a directed acyclic graph (DAG); nodes are random variables, edges are direct influences Quantitative part: a set of conditional probability distributions, e.g. P(A | E, B):
      E    B    P(a)   P(¬a)
      e    b    0.9    0.1
      e    ¬b   0.2    0.8
      ¬e   b    0.9    0.1
      ¬e   ¬b   0.01   0.99
  Together they define a unique distribution in a factored form
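
To show the factored form concretely, the sketch below encodes the network in plain Python dictionaries. P(A | E, B) uses the table above; the remaining CPDs (P(E), P(B), P(R | E), P(C | A)) are illustrative numbers I made up, since the slide does not give them. It also computes a posterior by brute-force enumeration, previewing the inference section.

```python
from itertools import product

# CPT from the slide: P(A=1 | E, B).
P_A = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01}
# The remaining CPDs are illustrative placeholders, NOT numbers from the talk.
P_E = {1: 0.01, 0: 0.99}          # P(Earthquake)
P_B = {1: 0.02, 0: 0.98}          # P(Burglary)
P_R = {1: 0.8, 0: 0.001}          # P(Radio=1 | E)
P_C = {1: 0.7, 0: 0.05}           # P(Call=1  | A)

def bern(p1, v):
    return p1 if v == 1 else 1.0 - p1

def joint(e, b, a, r, c):
    """Factored form: P(E,B,A,R,C) = P(E) P(B) P(A|E,B) P(R|E) P(C|A)."""
    return P_E[e] * P_B[b] * bern(P_A[(e, b)], a) * bern(P_R[e], r) * bern(P_C[a], c)

def posterior_burglary(call=1):
    """P(Burglary=1 | Call) by brute-force enumeration over the hidden variables."""
    num = sum(joint(e, 1, a, r, call) for e, a, r in product((0, 1), repeat=3))
    den = sum(joint(e, b, a, r, call) for e, b, a, r in product((0, 1), repeat=4))
    return num / den

print(posterior_burglary(call=1))
```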

  37. Example: “ICU Alarm” network Domain: monitoring intensive-care patients (variables such as MINVOLSET, VENTMACH, VENTLUNG, PVSAT, ARTCO2, HR, BP, ...) • 37 variables • 509 parameters, instead of the roughly 2^54 entries needed for the full joint distribution

  38. Success stories for graphical models • Multiple sequence alignment • Forensic analysis • Medical and fault diagnosis • Speech recognition • Visual tracking • Channel coding at Shannon limit • Genetic pedigree analysis • …

  39. Graphical models: outline • What are graphical models? ✓ • Inference • Structure learning

  40. Probabilistic Inference • Posterior probabilities: the probability of any event given any evidence, P(X | E) • Example (Burglary / Earthquake / Alarm network): the posterior over Burglary given evidence on Radio and Call

  41. Viterbi decoding Compute the most probable explanation (MPE) of the observed data in a Hidden Markov Model (HMM): hidden states X1, X2, X3 generate observations Y1, Y2, Y3 (e.g., decoding the spoken word “Tomato”)
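
A compact numpy implementation of Viterbi decoding under the usual HMM parameterization (initial distribution, transition matrix, emission matrix); this is a generic sketch, not the speech model behind the “Tomato” example.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden-state path of an HMM.
    pi: (K,) initial distribution; A: (K, K) transitions; B: (K, M) emissions;
    obs: sequence of observation indices."""
    T, K = len(obs), len(pi)
    logd = np.zeros((T, K))                    # best log-prob of paths ending in each state
    back = np.zeros((T, K), dtype=int)         # backpointers
    logd[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = logd[t - 1][:, None] + np.log(A)            # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        logd[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```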

  42. Inference: computational issues Easy: chains, trees. Hard: grids, dense loopy graphs (e.g., the ICU Alarm network)

  43. Inference: computational issues (continued) Easy: chains, trees. Hard: grids, dense loopy graphs. There are many different inference algorithms, both exact and approximate

  44. Bayesian inference • Bayesian probability treats parameters as random variables • Learning / parameter estimation is replaced by probabilistic inference: computing P(θ | D) • Example: Bayesian linear regression; the parameters are θ = (α, β, σ) • The parameters are tied (shared) across repetitions of the data (X1, Y1), ..., (Xn, Yn)
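
A minimal sketch of the Bayesian view for this example, assuming a zero-mean Gaussian prior on (α, β) and a known noise variance, so that the posterior P(θ | D) is Gaussian with a closed-form mean and covariance (a standard textbook formula, not code from the talk).

```python
import numpy as np

def blr_posterior(x, y, sigma2=1.0, prior_var=10.0):
    """Posterior over theta = (alpha, beta) under a zero-mean Gaussian prior
    and known noise variance sigma2; returns the posterior mean and covariance."""
    X = np.column_stack([np.ones_like(x), x])         # design matrix [1, x]
    S0_inv = np.eye(2) / prior_var                    # prior precision
    SN = np.linalg.inv(S0_inv + X.T @ X / sigma2)     # posterior covariance
    mN = SN @ (X.T @ y / sigma2)                      # posterior mean
    return mN, SN
```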

  45. Bayesian inference • + Elegant – no distinction between parameters and other hidden variables • + Can use priors to learn from small data sets (c.f., one-shot learning by humans) • - Math can get hairy • - Often computationally intractable

  46. Graphical models: outline • What are graphical models? ✓ • Inference ✓ • Structure learning

  47. Why struggle for accurate structure? Truth: Earthquake and Burglary are parents of Alarm Set, which is the parent of Sound • Missing an arc: cannot be compensated for by fitting parameters; wrong assumptions about domain structure • Adding an arc: increases the number of parameters to be estimated; wrong assumptions about domain structure

  48. Score-based Learning • Define a scoring function that evaluates how well a candidate structure matches the data (e.g., a table of samples of E, B, A: <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, ..., <N,Y,Y>) • Search for a structure that maximizes the score
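
One common scoring function is BIC (maximum log-likelihood minus a complexity penalty); the talk does not commit to a particular score, so this is only an illustrative choice. The sketch scores a candidate DAG over binary variables from count statistics.

```python
import numpy as np
from itertools import product

def bic_score(data, parents):
    """BIC score of a DAG on binary data.
    data: (N, n) array of 0/1 samples; parents: dict mapping node -> list of parents."""
    N = data.shape[0]
    score = 0.0
    for x, pa in parents.items():
        configs = list(product((0, 1), repeat=len(pa)))       # parent configurations
        for cfg in configs:
            mask = np.ones(N, dtype=bool)
            for p, v in zip(pa, cfg):
                mask &= data[:, p] == v
            n_cfg = int(mask.sum())
            if n_cfg == 0:
                continue
            n1 = int((data[mask, x] == 1).sum())
            for cnt in (n1, n_cfg - n1):                      # maximum-likelihood log-prob
                if cnt > 0:
                    score += cnt * np.log(cnt / n_cfg)
        score -= 0.5 * np.log(N) * len(configs)               # one free parameter per config
    return score

# Example: score the structure E -> A <- B, with (E, B, A) in columns 0, 1, 2:
# bic_score(samples, {0: [], 1: [], 2: [0, 1]})
```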

  49. Learning Trees • Can find the optimal tree structure in O(n^2 log n) time: just find the max-weight spanning tree • If some of the variables are hidden, the problem becomes hard again, but can use EM to fit mixtures of trees
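
This is essentially the Chow-Liu procedure: weight each pair of variables by empirical mutual information and take a maximum-weight spanning tree. Below is a small sketch for binary data using Prim's algorithm (illustrative only; the talk does not specify an implementation).

```python
import numpy as np

def mutual_info(x, y):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(x == a) * np.mean(y == b)))
    return mi

def chow_liu_edges(data):
    """Edges of the maximum-weight spanning tree under pairwise mutual information."""
    n = data.shape[1]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = w[j, i] = mutual_info(data[:, i], data[:, j])
    in_tree, edges = {0}, []                      # Prim's algorithm, starting from node 0
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: w[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```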

  50. Heuristic Search • Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search • Define a search space: • search states are possible structures • operators make small changes to structure • Traverse the space looking for high-scoring structures • Search techniques: • Greedy hill-climbing • Best-first search • Simulated annealing • ...
