Machine Learning for High-Throughput Biological Data

These notes are adapted from the KDD 2006 tutorial notes written by David Page, Dept. of Biostatistics and Medical Informatics and Dept. of Computer Sciences, University of Wisconsin-Madison. http://www.biostat.wisc.edu/~page/PageKDD2006.ppt

Some Data Types We’ll Discuss • Gene expression microarray • Single-nucleotide polymorphisms • Mass spectrometry proteomics and metabolomics • Protein-protein interactions (from co-immunoprecipitation) • High-throughput screening of potential drug molecules

image from the DOE Human Genome Program http://www.ornl.gov/hgmis

How Microarrays Work [Figure: labeled sample (RNA) hybridizing to probes (DNA) on the GeneChip surface]

Two Views of Microarray Data • View 1: data points are genes, represented by expression levels across different samples (i.e., features = samples). Goal: categorize new genes • View 2: data points are samples (e.g., patients), represented by expression levels of different genes (i.e., features = genes). Goal: categorize new samples
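
The two views are transposes of the same expression matrix. A minimal sketch, assuming a toy matrix with genes as rows and samples as columns; every name and value below is invented for illustration:

```python
# Toy expression matrix: rows are genes, columns are samples.
# All names and values are made up for illustration.
genes = ["geneA", "geneB", "geneC"]
samples = ["patient1", "patient2"]
expression = [
    [2.1, 0.3],  # geneA across the two samples
    [1.8, 0.4],  # geneB
    [0.2, 3.5],  # geneC
]

# View 1: data points are genes, features are samples.
gene_view = dict(zip(genes, expression))

# View 2: data points are samples, features are genes (the transpose).
sample_view = dict(zip(samples, (list(col) for col in zip(*expression))))

print(gene_view["geneA"])       # expression of geneA in each sample
print(sample_view["patient1"])  # expression of each gene in patient1
```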

Two Ways to View The Data

Data Points are Genes

Data Points are Samples

Supervision: Add Class Values

Supervised Learning Task • Given: a set of microarray experiments, each done with mRNA from a different patient (same cell type from every patient). A patient’s expression values for each gene constitute the features, and the patient’s disease constitutes the class • Do: learn a model that accurately predicts class based on features

Location in Task Space

Leukemia (Golub et al., 1999) • Classes: Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML) • Approach: weighted voting (essentially naïve Bayes) • Cross-validated accuracy: of 34 samples, declined to predict 5, correct on the other 29
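
A rough sketch of a weighted-voting classifier of this general kind. The signal-to-noise weight, the midpoint vote, and the 0.3 strength cutoff for declining to predict are assumptions made for illustration, and the toy data is invented; this is not a reproduction of the published pipeline.

```python
from statistics import mean, stdev

def train_weighted_voting(X, y):
    """Return one (weight, midpoint) pair per gene for 0/1 class labels.

    X is a list of samples (lists of expression values); y holds the labels.
    Each class needs at least two samples with non-identical values.
    """
    model = []
    for g in range(len(X[0])):
        ones = [x[g] for x, label in zip(X, y) if label == 1]
        zeros = [x[g] for x, label in zip(X, y) if label == 0]
        # Signal-to-noise weight: large when class means differ and spread is small.
        weight = (mean(ones) - mean(zeros)) / (stdev(ones) + stdev(zeros))
        midpoint = (mean(ones) + mean(zeros)) / 2
        model.append((weight, midpoint))
    return model

def predict(model, x, min_strength=0.3):
    """Return 1, 0, or None (declining to predict when the vote is too close)."""
    votes = [w * (value - mid) for (w, mid), value in zip(model, x)]
    for_one = sum(v for v in votes if v > 0)
    for_zero = -sum(v for v in votes if v < 0)
    total = for_one + for_zero
    if total == 0:
        return None
    strength = abs(for_one - for_zero) / total
    if strength < min_strength:
        return None  # assumed cutoff; too close to call
    return 1 if for_one > for_zero else 0

# Invented toy data: gene 0 is high in class 1, gene 1 is high in class 0.
X = [[5.0, 1.0], [6.0, 2.0], [1.0, 5.0], [2.0, 6.0]]
y = [1, 1, 0, 0]
model = train_weighted_voting(X, y)
print(predict(model, [5.5, 1.5]))  # clear class-1 profile
print(predict(model, [4.0, 4.1]))  # ambiguous profile
```

Returning None for a weak vote mirrors the "declined to predict" outcome reported on the slide.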

Cancer vs. Normal • Relatively easy to predict accurately, because so much goes “haywire” in cancer cells • Primary barrier is noise in the data: impure RNA, cross-hybridization, etc. • Studies include breast, colon, prostate, lymphoma, and multiple myeloma

X-Val Accuracies for Multiple Myeloma (74 MM vs. 31 Normal)

More MM (300), Benign Condition MGUS (Hardin et al., 2004)

ROC Curves: Cancer vs. Normal

ROC: Cancer vs. Benign (MGUS)
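
For readers who have not built an ROC curve by hand, here is a small sketch of how the points and the area under the curve can be computed from classifier scores. The scores and labels are invented and do not come from the myeloma studies.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Invented example: higher score means "more likely cancer" (label 1).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```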

Work by Statisticians Outside of Standard Classification/Clustering • Methods to better convert Affymetrix’s low-level intensity measurements into expression levels, e.g., work by Speed, Wong, Irizarry • Methods to find differentially expressed genes between two samples, e.g., work by Newton and Kendziorski • But the following is most related…

Ranking Genes by Significance • Some biologists don’t want one predictive model, but a rank-ordered list of genes to explore further (with estimated significance) • For each gene we have a set of expression levels under our conditions, say cancer vs. normal • We can do a t-test to see if the mean expression levels are different under the two conditions: p-value • Multiple comparisons problem: if we repeat this test for 30,000 genes, some will pop up as significant by chance alone • Could do a Bonferroni correction (multiply p-values by 30,000), but this is drastic and might eliminate every gene
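
The arithmetic behind the multiple comparisons problem can be sketched as follows. The 0.05 significance level and the toy expression values are assumptions for illustration; turning the t statistic into a p-value additionally needs the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """t statistic for a difference in means, without assuming equal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

n_genes = 30_000
alpha = 0.05  # assumed significance level

# If no gene truly differs, about alpha * n_genes still pass an uncorrected test.
expected_false_positives = alpha * n_genes

# After correction, a gene must reach this raw p-value to be called significant.
per_gene_threshold = alpha / n_genes

print(welch_t([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]))  # invented expression levels
print(expected_false_positives, per_gene_threshold)
print(bonferroni([0.00001, 0.01, 0.5]))
```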

False Discovery Rate (FDR) [Storey and Tibshirani, 2001] • Addresses multiple comparisons but is less extreme than Bonferroni • Replaces p-value by q-value: fraction of genes with this p-value or lower that really don’t have different means in the two classes (false discoveries) • Publicly available in R as part of the Bioconductor package • Recommendation: Use this in addition to your supervised data mining… your collaborators will want to see it
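
To give a feel for FDR-style adjustment, the sketch below uses the Benjamini-Hochberg step-up procedure. This is a swapped-in, simpler relative of the q-value method cited on the slide (it does not estimate the proportion of true nulls), and the p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.001, 0.008, 0.039, 0.041, 0.6]  # invented
bh = benjamini_hochberg(p_values)
print(bh)
```

On these five values a Bonferroni correction would multiply every p-value by 5, so the third gene would go to 0.195, while the adjustment above leaves it near 0.05.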

FDR Highlights Difficulties Getting Insight into Cancer vs. Normal

Using Benign Condition Instead of Normal Helps Somewhat

Question to Anticipate • You’ve run a supervised data mining algorithm on your collaborator’s data, and you present an estimate of accuracy or an ROC curve (from X-val) • How did you adjust this for the multiple comparisons problem? • Answer: you don’t need to because you commit to a single predictive model before ever looking at the test data for a fold—this is only one comparison

Prognosis and Treatment • Features same as for diagnosis • Rather than disease state, class value becomes life expectancy with a given treatment (or positive response vs. no response to given treatment)

Breast Cancer Prognosis (Van’t Veer et al., 2002) • Classes: good prognosis (no metastasis within five years of initial diagnosis) vs. poor prognosis • Algorithm: ensemble of voters • Results: 83% cross-validated accuracy on 78 cases

A Lesson • Previous work selected features to use in ensemble by looking at the entire data set • Should have repeated feature selection on each cross-val fold • Authors also chose ensemble size by seeing which size gave highest cross-val result • Authors corrected this in web supplement; accuracy went from 83% to 73% • Remember to “tune parameters” separately for each cross-val fold!
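
The lesson can be made concrete with a sketch in which feature selection is repeated inside every fold and sees only that fold's training samples. The mean-difference gene ranking, the nearest-centroid classifier, and the toy data are all assumptions chosen to keep the example short; they are not the method of the study discussed.

```python
from statistics import mean

def select_genes(X, y, n_genes):
    """Rank genes by absolute difference in class means, using only the X, y given."""
    def score(g):
        ones = [x[g] for x, label in zip(X, y) if label == 1]
        zeros = [x[g] for x, label in zip(X, y) if label == 0]
        return abs(mean(ones) - mean(zeros))
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_genes]

def nearest_centroid(X, y, genes, x_new):
    """Predict the class whose mean profile on the chosen genes is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [x for x, l in zip(X, y) if l == label]
        centroid = [mean([r[g] for r in rows]) for g in genes]
        dist = sum((x_new[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_val_accuracy(X, y, n_folds, n_genes):
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        # Feature selection is redone here and never sees the held-out samples.
        genes = select_genes(X_train, y_train, n_genes)
        for i in test_idx:
            correct += nearest_centroid(X_train, y_train, genes, X[i]) == y[i]
    return correct / len(X)

# Invented data: only gene 0 separates the classes.
X = [[5.0, 1.0, 0.2], [0.5, 1.1, 0.1], [5.2, 0.9, 0.3],
     [0.4, 1.0, 0.2], [4.9, 1.2, 0.1], [0.6, 0.8, 0.3]]
y = [1, 0, 1, 0, 1, 0]
print(cross_val_accuracy(X, y, n_folds=3, n_genes=1))
```

Any other tuned choice, such as ensemble size, would be picked inside the same loop from the training part of the fold.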

Prognosis with Specific Therapy (Rosenwald et al., 2002) • Data set contains gene-expression patterns for 160 patients with diffuse large B-cell lymphoma, receiving anthracycline chemotherapy • Class label is five-year survival • One test-train split 80/80 • True positive rate: 60% • False negative rate: 39%

Some Future Directions • Using gene-chip data to select therapy: predict which therapy gives best prognosis for patient • Combining gene expression data with clinical data such as lab results, medical and family history: multiple relational tables, may benefit from relational learning

Unsupervised Learning Task • Given: a set of microarray experiments under different conditions • Do: cluster the genes, where a gene is described by its expression levels in different experiments
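
One simple way to group genes with similar expression profiles is sketched below. It is a greedy "leader" scheme built on Pearson correlation, chosen for brevity and not taken from any of the cited papers; the 0.9 cutoff and the profiles are invented.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def leader_clusters(profiles, min_corr=0.9):
    """Assign each gene to the first cluster whose leader it correlates with."""
    clusters = []  # each cluster is a list of gene names; the first is the leader
    for name, profile in profiles.items():
        for cluster in clusters:
            if pearson(profiles[cluster[0]], profile) >= min_corr:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no leader was similar enough
    return clusters

# Invented profiles: g1 and g2 rise across experiments, g3 and g4 fall.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.1, 4.0, 6.2, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.1, 4.0, 2.2],
}
print(leader_clusters(profiles))
```

Clustering the experiments instead of the genes uses the same code on the transposed data.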

Example [Figure: heat map with axes Genes and Experiments (Samples); green = up-regulated, red = down-regulated]

Visualizing Gene Clusters (e.g., Sharan and Shamir, 2000) [Figure: normalized expression plotted against time (10-minute intervals) for Gene Cluster 1 (size=20) and Gene Cluster 2 (size=43)]

Unsupervised Learning Task 2 • Given: a set of microarray experiments (samples) corresponding to different conditions or patients • Do: cluster the experiments

Examples • Cluster samples from mice subjected to a variety of toxic compounds (Thomas et al., 2001) • Cluster samples from cancer patients, potentially to discover different subtypes of a cancer • Cluster samples taken at different time points

Some Biological Pathways • Regulatory pathways • Nodes are labeled by genes • Arcs denote influence on transcription • G1 codes for P1, P1 inhibits G2’s transcription • Metabolic pathways • Nodes are metabolites, large biomolecules (e.g., sugars, lipids, proteins, and modified proteins) • Arcs from biochemical reaction inputs to outputs • Arcs labeled by enzymes (themselves proteins)

Metabolic Pathway Example: Krebs Cycle (TCA Cycle, Citric Acid Cycle) [Figure: the cycle Oxaloacetate + Acetyl CoA → Citrate → cis-Aconitate → Isocitrate → α-Ketoglutarate → Succinyl-CoA → Succinate → Fumarate → Malate → Oxaloacetate, with arcs labeled by enzymes (citrate synthase, aconitase, IDH, α-KGDH, succinate thiokinase, fumarase, MDH) and by cofactors (H2O, HSCoA, NAD+/NADH, CO2, GDP + Pi/GTP, FAD/FADH2)]

Regulatory Pathway (KEGG)

Using Microarray Data Only • Regulatory pathways • Nodes are labeled by genes • Arcs denote influence on transcription • G1 codes for P1, P1 inhibits G2’s transcription • Metabolic pathways • Nodes are metabolites, large biomolecules (e.g., sugars, lipids, proteins, and modified proteins) • Arcs from biochemical reaction inputs to outputs • Arcs labeled by enzymes (themselves proteins)

Supervised Learning Task 2 • Given: a set of microarray experiments for the same organism under different conditions • Do: learn a graphical model that accurately predicts expression of some genes in terms of others

Some Approaches to Learning Regulatory Networks • Bayes Net Learning (started with Friedman & Halpern, 1999, we’ll see more) • Boolean Networks (Akutsu, Kuhara, Maruyama & Miyano, 1998; Ideker, Thorsson & Karp, 2002) • Related Graphical Approaches (Tanay & Shamir, 2001; Chrisman, Langley, Baay & Pohorille, 2003)
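
As a picture of what a Boolean network model looks like once it has been learned, the sketch below simulates a tiny hand-written one with synchronous updates. The rule for G2 follows the "P1 inhibits G2" example used elsewhere in these notes; G3 and its rule are hypothetical, and no learning is performed here.

```python
def step(state, rules):
    """Apply every rule to the current state at once (synchronous update)."""
    return {gene: rule(state) for gene, rule in rules.items()}

rules = {
    "G1": lambda s: s["G1"],              # treated as an external input; keeps its value
    "G2": lambda s: not s["G1"],          # the product of G1 inhibits G2
    "G3": lambda s: s["G1"] and s["G2"],  # hypothetical gene that needs both
}

state = {"G1": True, "G2": True, "G3": False}
trajectory = [state]
for _ in range(3):
    state = step(state, rules)
    trajectory.append(state)

for t, s in enumerate(trajectory):
    print(t, s)
```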

Bayesian Network (BN) [Figure: a data table with columns geneA and geneB over experiments Expt1 to Expt4, next to a two-node network with a parent node and a child node, annotated with probability tables for P(geneA) and P(geneB)] Note: direction of arrow indicates dependence, not causality
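
For a two-gene network, the probability tables can be estimated by counting. The sketch assumes expression has already been discretized to 0/1 and uses four invented experiments; these are not the values from the slide.

```python
from collections import Counter

def estimate_cpt(data, parent, child):
    """Estimate P(child = 1 | parent = v) for v in {0, 1} by counting."""
    pair_counts = Counter((row[parent], row[child]) for row in data)
    cpt = {}
    for v in (0, 1):
        total = pair_counts[(v, 0)] + pair_counts[(v, 1)]
        cpt[v] = pair_counts[(v, 1)] / total if total else None
    return cpt

# Invented, already-discretized data (1 = expressed, 0 = not expressed).
data = [
    {"geneA": 1, "geneB": 1},
    {"geneA": 1, "geneB": 1},
    {"geneA": 0, "geneB": 0},
    {"geneA": 0, "geneB": 1},
]

p_geneA = sum(row["geneA"] for row in data) / len(data)
b_given_a = estimate_cpt(data, "geneA", "geneB")  # arrow geneA -> geneB
a_given_b = estimate_cpt(data, "geneB", "geneA")  # arrow geneB -> geneA
print(p_geneA, b_given_a, a_given_b)
```

Both arrow directions can be fitted to the same data, which is one way to see that the arrow encodes dependence and not causality.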

Problem: Not Causality • A is a good predictor of B. But is A regulating B? • Ground truth might be something else [Figure: alternative graphs over A, B, and a third gene C] • Or a more complicated variant

Approaches to Get Causality • Use “knock-outs” (Pe’er, Regev, Elidan and Friedman, 2001). But not available in most organisms. • Use time-series data and Dynamic Bayesian Networks (Ong, Glasner and Page, 2002). But even less data typically. • Use other data sources, e.g., sequences upstream of genes, where transcription regulators may bind. (Segal, Barash, Simon, Friedman and Koller, 2002; Noto and Craven, 2005)

A Dynamic Bayes Net [Figure: nodes gene 1, gene 2, gene 3, …, gene N, each drawn twice]

Problem: Not Enough Data Points to Construct Large Network • Fortunate to get 100s of chips • But have 1000s of genes • E. coli: ~4000 • Yeast: ~6000 • Human: ~30,000 • Want to learn causal graphical model over 1000s of variables with 100s of examples (settings of the variables)
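
A back-of-the-envelope count shows the size of the mismatch. The sketch counts candidate directed arcs between distinct genes using the gene counts above; counting arcs is only one assumed way to gauge the size of the search space.

```python
gene_counts = {"E. coli": 4_000, "Yeast": 6_000, "Human": 30_000}

# Each ordered pair of distinct genes is one candidate regulatory arc.
candidate_arcs = {name: n * (n - 1) for name, n in gene_counts.items()}

for name, arcs in candidate_arcs.items():
    print(f"{name}: {arcs:,} candidate arcs")
```

Even for E. coli there are millions of candidate arcs to choose among, against only hundreds of chips.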

Advance: Module Networks [Segal, Pe’er, Regev, Koller & Friedman, 2005] • Cluster genes by similarity over expression experiments • All genes in a cluster are “tied together”: same parents and CPDs • Learn structure subject to this tying together of genes • Iteratively re-form clusters and re-learn network, in an EM-like fashion

Problem: Data are Continuous but Models are Discrete • Gene chips provide a real-valued mRNA measurement • Boolean networks and most practical Bayes net learning algorithms assume discrete variables • May lose valuable information by discretizing
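
The information loss is easy to see in a small sketch. It assumes the measurements are log ratios against a reference and uses an arbitrary cutoff of 1.0; both are assumptions for illustration.

```python
def discretize(log_ratio, threshold=1.0):
    """Map a real-valued measurement to 'up', 'down', or 'same'."""
    if log_ratio >= threshold:
        return "up"
    if log_ratio <= -threshold:
        return "down"
    return "same"

values = [0.1, 0.9, 1.1, 4.0, -2.5]  # invented log ratios
levels = [discretize(v) for v in values]
print(list(zip(values, levels)))
```

Here 0.9 and 1.1 receive different labels although they are nearly equal, while 1.1 and 4.0 receive the same label although they differ greatly.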

Advance: Use of Dynamic Bayes Nets with Continuous Variables [Segal, Pe’er, Regev, Koller & Friedman, 2005] • Real-valued expression measurements used instead of discretized values (up, down, same) • Assume linear influence of parents on children (Michaelis-Menten assumption) • Work so far constructed the network from literature and learned parameters

Problem: Much Missing Information • mRNA from gene 1 doesn’t directly alter level of mRNA from gene 2 • Rather, the protein product from gene 1 may alter level of mRNA from gene 2 (e.g., transcription factor) • Activation of transcription factor might not occur by making more of it, but just by phosphorylating it (post-translational modification)
