Statistical Machine Learning and Computational Biology • Michael I. Jordan • University of California, Berkeley • November 5, 2007
Statistical Modeling in Biology • Motivated by underlying stochastic phenomena • thermodynamics • recombination • mutation • environment • Motivated by our ignorance • evolution of molecular function • protein folding • molecular concentrations • incomplete fossil record • Motivated by the need to fuse disparate sources of data
Outline • Graphical models • phylogenomics • Nonparametric Bayesian models • protein backbone modeling • multi-population haplotypes • Sparse regression • protein folding
Probabilistic Graphical Models • Given a graph G = (V, E), where each node v ∈ V is associated with a random variable Xv • The joint distribution on (X1, X2, …, XN) factors according to the "parent-of" relation defined by the edges E: • p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x1) p(x5 | x4) p(x6 | x2, x5)
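The factorization on this slide can be made concrete with a small sketch (mine, not the speaker's): six binary variables with illustrative, made-up conditional probability tables, multiplied together exactly as the factorization dictates.

```python
import itertools

# Illustrative (made-up) binary conditional probability tables,
# keyed by (child_value, parent_value, ...).
p1 = {(0,): 0.6, (1,): 0.4}                        # p(x1)
p2 = {(0, 0): 0.7, (1, 0): 0.3,
      (0, 1): 0.2, (1, 1): 0.8}                    # p(x2 | x1)
p3 = {(0, 0): 0.9, (1, 0): 0.1,
      (0, 1): 0.4, (1, 1): 0.6}                    # p(x3 | x2)
p4 = {(0, 0): 0.5, (1, 0): 0.5,
      (0, 1): 0.1, (1, 1): 0.9}                    # p(x4 | x1)
p5 = {(0, 0): 0.8, (1, 0): 0.2,
      (0, 1): 0.3, (1, 1): 0.7}                    # p(x5 | x4)
p6 = {(0, 0, 0): 0.9, (1, 0, 0): 0.1,
      (0, 0, 1): 0.5, (1, 0, 1): 0.5,
      (0, 1, 0): 0.4, (1, 1, 0): 0.6,
      (0, 1, 1): 0.2, (1, 1, 1): 0.8}              # p(x6 | x2, x5)

def joint(x1, x2, x3, x4, x5, x6):
    """The factored joint: a product of one factor per node."""
    return (p1[(x1,)] * p2[(x2, x1)] * p3[(x3, x2)] *
            p4[(x4, x1)] * p5[(x5, x4)] * p6[(x6, x2, x5)])

# Because each factor is a proper conditional distribution, the
# product is automatically a proper joint distribution.
total = sum(joint(*xs) for xs in itertools.product((0, 1), repeat=6))
```

The sanity check at the end is the point: no global normalization is needed, which is what makes the factored representation tractable.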
Inference • Conditioning • Marginalization • Posterior probabilities
Inference Algorithms • Exact algorithms • sum-product • junction tree • Sampling algorithms • Metropolis-Hastings • Gibbs sampling • Variational algorithms • mean-field • Bethe, Kikuchi • convex relaxations
Hidden Markov Models • Widely used in computational biology to parse strings of various kinds (nucleotides, markers, amino acids) • Sum-product algorithm yields the posterior marginals of the hidden states and the likelihood of the observed string
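As a minimal sketch (not from the talk), here is the forward half of sum-product for a toy two-state HMM over nucleotides; the state labels and all probabilities are hypothetical.

```python
STATES = ("gene", "intergenic")    # hypothetical state labels

init  = {"gene": 0.5, "intergenic": 0.5}
trans = {"gene":       {"gene": 0.9, "intergenic": 0.1},
         "intergenic": {"gene": 0.1, "intergenic": 0.9}}
emit  = {"gene":       {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
         "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward_likelihood(seq):
    """p(seq) = sum over all hidden state paths, via the forward recursion."""
    # alpha[s] = p(observations so far, current state = s)
    alpha = {s: init[s] * emit[s][seq[0]] for s in STATES}
    for obs in seq[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[r] * trans[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())
```

The recursion sums over exponentially many state paths in time linear in the sequence length; a backward pass (omitted here) would give the per-position posterior marginals.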
Phylogenies • The shaded nodes represent the observed nucleotides at a given site for a set of organisms • Site independence model (note the plate) • The unshaded nodes represent putative ancestral nucleotides • Computing the likelihood involves summing over the unshaded nodes
Hidden Markov Phylogeny • This yields a gene finder that exploits evolutionary constraints • Evolutionary rate is state-dependent • (edges from state to nodes in phylogeny are omitted for simplicity) • Based on sequence data from 12-15 primate species, we obtain a nucleotide sensitivity of 100%, with a specificity of 89% • GENSCAN yields a sensitivity of 45%, with a specificity of 34%
Annotating new genomes Species: Aspergillus nidulans (Fungal organism) >Q8X1T6 (hypothetical protein) MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPLIFSMARKNNVSLPSPSSNPAYTSV ETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAELAWTYFKRAHAEGVHHTEVFFDPQV HMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCFLKHLPLESAQRLYDTALNEGDLGL DGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSINLTAHAGEEGDASYIAAALDMGAT RIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLKCVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILECYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNELLWRIDECVKRF What molecular function does protein Q8X1T6 have? Images courtesy of Broad Institute, MIT
Annotation Transfer BLAST Search: Q8X1T6 (Aspergillus nidulans) • Species Name Molecular Function Score E-value • Schizosaccharomyces pomb adenosine deaminase 390 e-107 • Gibberella zeae hypothetical protein FG01567.1 345 7e-94 • Saccharomyces cerevisiae adenine deaminase 308 1e-82 • Wolinella succinogenes putative adenosine deaminase 268 1e-70 • Rhodospirillum rubrum adenosine deaminase 266 6e-70 • Azotobacter vinelandii adenosine deaminase 260 4e-68 • Streptomyces coelicolor probable adenosine deaminase 254 2e-68 • Caulobacter crescentus CB1 adenosine deaminase 253 5e-66 • Streptomyces avermitilis putative adenosine deaminase 251 2e-65 • Ralstonia solanacearum adenosine deaminase 251 2e-65 • environmental sequence unknown 246 5e-64 • Pseudomonas aeruginosa probable adenosine deaminase 245 1e-63 • Pseudomonas aeruginosa adenosine deaminase 245 1e-63 • environmental sequence unknown 244 3e-63 • Pseudomonas fluorescens adenosine deaminase 243 7e-63 • Pseudomonas putida KT2440 adenosine deaminase 243 7e-63
Methodology to System: SIFTER • Inputs: a set of homologous proteins (a Pfam family), a gene tree reconciled against the species tree, and Gene Ontology annotations • Output: predicted molecular functions (e.g., adenine deaminase vs. adenosine deaminase) propagated through the reconciled tree
Functional diversity problem • 1887 Pfam-A families with more than two experimentally characterized functions • (figure: breakdown of those families by number of distinct functions: 2-5, 6-10, 11-20, 21-50, >51)
Available methods for comparison • Sequence similarity methods • BLAST [Altschul 1990]: sequence similarity search, transfer annotation from sequence with most significant similarity • Runs against the largest curated protein database in the world • GOtcha [Martin 2004]: BLAST search on seven genomes with GO functional annotations • GOtcha runs use all available annotations • GOtcha-exp runs use only available experimental annotations • Sequence similarity plus bootstrap orthology • Orthostrapper [Storm 2002]: transfer annotation when query protein is in statistically supported orthologous cluster with annotated protein
AMP/adenosine deaminase • 251 member proteins in Pfam v. 18.0 • 13 proteins with experimental evidence in GOA • 20 proteins with experimental annotations from manual literature search • 129 proteins with electronic annotations from GOA • Molecular function: remove amine group from base of substrate • Alignment from Pfam family seed alignment • Phylogeny built with PAUP* parsimony, BLOSUM50 matrix Mouse adenosine deaminase, courtesy PDB
AMP/adenosine deaminase SIFTER Errors Leave-one-out cross-validation: 93.9% accuracy (31 of 33) BLAST: 66.7% accuracy (22 of 33)
AMP/adenosine deaminase Note: x-axis is on log scale Multifunction families: can choose numerical cutoff for posterior probability prediction using this type of plot
Sulfotransferases: ROC curve • SIFTER (no truncation): 70.0% accuracy (21 of 30) • BLAST: 50.0% accuracy (15 of 30) Note: x-axis is on log scale
Nudix Protein Family • 3703 proteins in the family • 97 proteins with molecular functions characterized • 66 different candidate molecular functions
Nudix: SIFTER vs BLAST • SIFTER truncation level 1: 47.4% accuracy (46 of 97) • BLAST: 34.0% accuracy (33 of 97); only 23.3% of terms appeared at all in search results
Trade specificity for accuracy • Leave-one-out cross-validation, truncation at 1 (66 candidate functions): 47.4% accuracy • Leave-one-out cross-validation, truncation at 1-2 (15 candidate functions): 78.4% accuracy
Fungal genomes Euascomycota Hemiascomycota Archeascomycota Basidiomycota Zygomycota Work with Jason Stajich; Images courtesy of Broad Institute
Fungal Genomes Methods • Gene finding in all 46 genomes • hmmsearch for all 427,324 genes • Aligned hits with hmmalign to 2,883 Pfam v. 20 families • Built trees using PAUP* maximum parsimony for 2,883 Pfam v. 20 families; reconciled with Forester • BLASTed each protein against Swiss-Prot/TrEMBL for exact match; used ID to search for GOA annotations • Ran SIFTER with (a) experimental annotations only and (b) experimental and electronic annotations
Clustering • There are many, many methodologies for clustering • Heuristic methods • hierarchical clustering • M estimation • K means • spectral clustering • Model-based methods • finite mixture models • Dirichlet process mixture models
Nonparametric Bayesian Clustering • Dirichlet process mixture models are a nonparametric Bayesian approach to clustering • They have the major advantage that we don’t have to assume that we know the number of clusters a priori
Chinese Restaurant Process (CRP) • Customers sit down in a Chinese restaurant with an infinite number of tables • first customer sits at the first table • the nth subsequent customer sits at a table drawn from the following distribution: P(occupied table k) ∝ nk, P(next unoccupied table) ∝ α • where nk is the number of occupants of table k and α > 0 is a concentration parameter
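The seating rule translates directly into a short simulation; a sketch (mine, not the talk's):

```python
import random

def crp(n, alpha, rng=None):
    """Simulate table assignments for n customers under a CRP(alpha)."""
    rng = rng or random.Random(0)
    counts = []        # counts[k] = current occupants of table k
    assignments = []
    for i in range(n):
        # Customer i+1 arrives with i customers already seated.
        # P(table k) = counts[k] / (i + alpha); P(new table) = alpha / (i + alpha)
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                assignments.append(k)
                break
        else:               # no occupied table was chosen: open a new one
            counts.append(1)
            assignments.append(len(counts) - 1)
    return assignments, counts

assignments, counts = crp(100, alpha=1.0)
```

Note the "rich get richer" dynamic: a table's probability of attracting the next customer is proportional to its occupancy, while the number of occupied tables grows only slowly (roughly as α log n).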
The CRP and Mixture Models • The customers around a table form a cluster • associate a mixture component with each table • the first customer at a table draws a component parameter θ from the prior • e.g., for Gaussian mixtures, draw the component mean from a Gaussian prior • It turns out that the (marginal) distribution that this induces on the θ's is exchangeable
Dirichlet Process • Exchangeability implies (by De Finetti's theorem) an underlying stochastic process; that process is known as a Dirichlet process • (figure: a draw from a Dirichlet process, a discrete random measure on [0, 1])
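One standard construction of a Dirichlet process draw, not shown on the slide, is stick-breaking: repeatedly break off a Beta(1, α) fraction of the remaining stick to get the component weights. A sketch using only the standard library (the Beta(1, α) draw uses its inverse CDF):

```python
import random

def stick_breaking(alpha, n_atoms, rng=None):
    """First n_atoms weights of a stick-breaking (GEM) draw for DP(alpha)."""
    rng = rng or random.Random(0)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        # Beta(1, alpha) via inverse CDF: if U ~ Uniform(0,1),
        # then 1 - U**(1/alpha) ~ Beta(1, alpha).
        beta = 1.0 - rng.random() ** (1.0 / alpha)
        weights.append(remaining * beta)   # break off a fraction of the stick
        remaining *= (1.0 - beta)          # what is left to break next time
    return weights

w = stick_breaking(alpha=2.0, n_atoms=50)
```

Each weight is paired with an atom drawn i.i.d. from the base measure G0; the full (infinite) weight sequence sums to 1, so any finite truncation sums to strictly less than 1.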
Dirichlet Process Mixture Models • Given observations x1, …, xn, we model each with a latent factor θi: xi | θi ~ F(θi) • We put a Dirichlet process prior on the θi: θi | G ~ G, where G ~ DP(α, G0)
Connection to the Chinese Restaurant • The marginal distribution on the theta’s obtained by marginalizing out the Dirichlet process is the Chinese restaurant process • Let’s now consider how to build on these ideas and solve the multiple clustering problem
Multiple Clustering Problems • In many statistical estimation problems, we have not one data analysis problem, but rather groups of related problems • Naive approaches either treat the problems separately, lump them together, or merge them in some ad hoc way; in statistics we have a better sense of how to proceed: • shrinkage • empirical Bayes • hierarchical Bayes • Does this multiple group problem arise in clustering? • I'll argue "yes!" • If so, how do we "shrink" in clustering?
Multiple Data Analysis Problems • Consider a set of data which is subdivided into m groups, where group i is characterized by a Gaussian distribution with unknown mean μi • Maximum likelihood estimates of the μi are obtained independently, one per group • This often isn't what we want (on theoretical and practical grounds)
Hierarchical Bayesian Models • Multiple Gaussian distributions linked by a shared hyperparameter • Yields shrinkage estimators for the group means μi
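A minimal sketch of the shrinkage this produces (mine, not from the talk): group means μi ~ N(μ0, τ²) and observations yij ~ N(μi, σ²), with μ0, σ², τ² treated as known for simplicity. The posterior mean of each μi is then a precision-weighted blend of that group's sample mean and the shared prior mean.

```python
def shrinkage_estimates(groups, mu0, sigma2, tau2):
    """Posterior mean of each group mean mu_i (known variances).

    groups: list of lists of observations, one list per group.
    Small groups are pulled strongly toward mu0; large groups barely move.
    """
    estimates = []
    for ys in groups:
        n = len(ys)
        ybar = sum(ys) / n
        # Precision weighting: data precision n/sigma2 vs prior precision 1/tau2.
        w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)
        estimates.append(w * ybar + (1.0 - w) * mu0)
    return estimates

groups = [[1.0, 1.2, 0.8], [3.0, 2.8], [10.0]]
est = shrinkage_estimates(groups, mu0=2.0, sigma2=1.0, tau2=1.0)
```

Each estimate lands strictly between the group's own sample mean and the shared mean μ0, which is exactly the "shrinkage" the slide refers to; a full hierarchical treatment would also infer μ0, σ², τ² from the data.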
Protein Backbone Modeling • An important contribution to the energy of a protein structure is the set of angles linking neighboring amino acids • For each amino acid, it turns out that two angles suffice, traditionally called φ and ψ • A plot of φ and ψ angles across some ensemble of amino acids is called a Ramachandran plot
A Ramachandran Plot • This can be (usefully) approached as a mixture modeling problem • Doing so is much better than the “state-of-the-art,” in which the plot is binned into a three-by-three grid
Ramachandran Plots • But that plot is an overlay of 400 different plots, one for each combination of 20 amino acids on the left and 20 amino acids on the right • Shouldn’t we be treating this as a multiple clustering problem?
Haplotype Modeling • A haplotype is the pattern of alleles along a single chromosome • Data comes in the form of genotypes, which lose the information as to which allele is associated with which member of a pair of homologous chromosomes • Need to restore haplotypes from genotypes • A genotype is well modeled as a mixture model, where a mixture component is a pair of haplotypes (the real difficulty is that we don't know how many mixture components there are)
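To make the lost phase information concrete, here is a small sketch (mine, not from the talk) that enumerates the haplotype pairs consistent with an unphased genotype; with h heterozygous sites there are 2^(h-1) distinct unordered pairs (for h ≥ 1), which is why the mixture over haplotype pairs grows so quickly.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    Sites are coded 0 (homozygous reference), 2 (homozygous alternate),
    or 1 (heterozygous -- phase unknown).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]     # 0 -> 0, 2 -> 1; het sites set below
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(base), list(base)
        for site, b in zip(het_sites, bits):
            h1[site], h2[site] = b, 1 - b   # opposite alleles at het sites
        # Sort so (h1, h2) and (h2, h1) count as the same unordered pair.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs

pairs = compatible_haplotype_pairs((0, 1, 1, 2))   # two heterozygous sites
```

Statistical phasing methods resolve this ambiguity by favoring haplotypes that recur across the sample, which is precisely what the mixture formulation captures.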
Multiple Population Haplotype Modeling • When we have multiple populations (e.g., ethnic groups) we have multiple mixture models • How should we analyze these data (which are now available, e.g., from the HapMap project)? • Analyze them separately? Lump them together?
Hidden Markov Models • An HMM is a discrete state space model • The discrete state can be viewed as a cluster indicator • We thus have a set of clustering problems, one for each value of the previous state (i.e., for each row of the transition matrix)
Solving the Multiple Clustering Problem • It's natural to take a hierarchical Bayesian approach • It's natural to take a nonparametric Bayesian approach in which the number of clusters is not known a priori • How do we do this?
Hierarchical Bayesian Models • Multiple Gaussian distributions linked by a shared hyperparameter • Yields shrinkage estimators for the group means μi
Hierarchical DP Mixture Model? • Let us try to model each group of data with a Dirichlet process mixture model • let the groups share an underlying hyperparameter • But each group's random measure is generated independently • different groups cannot share the same components when the base measure G0 is continuous: the atoms ("spikes") do not match up