Motif Discovery: Algorithm and Application

Dan Scanfeld, Hong Xue, Sumeet Gupta, Varun Aggarwal

Objective: Motif discovery and use for deriving biological information
• Get sequences bound and unbound by the TF Nanog in human ES cells
• Find a motif (Nanog) using a state-of-the-art motif finding algorithm
• Genome-wide functional analysis using the motif to find biological patterns

Why Nanog: Relevance to ES Cells
• Activates certain genes essential for cell growth
• Represses a key set of genes needed for an embryo to develop
• This key set of repressed genes activates entire networks for generating many different specialized cells and tissues
[Figure: 1 genome, 1 cell; >200 phenotypes, 10^13 cells]


Location Analysis (ChIP-chip) in Human ES Cells (Boyer et al., Cell 122: 947-956)
• Crosslink
• Fragment
• Enrich for Nanog
• Differentially label
• Agilent 44k arrays (set of 10)

ChIP-chip Data Analysis
• Obtain intensities using GenePix
• Subtract negative control
• Perform median normalization (set-normalized)
• Probe-set p-value: p = 0.005
• Sequences (500 bp) from the May 2004 genome release
[Figure: enrichment ratio vs. chromosomal position, with thresholds P <= 0.001, P <= 0.005, and P <= 0.01; IP signal vs. WCE signal]


Motif Finding Algorithm (MacIsaac et al., 2006)
• Use structural prior (database, MacIsaac et al.)
• Refinement: Expectation-Maximization (ZOOPS)
• Score of found motifs: classification on unseen data
• Significance testing on score: use of empirical p-value

Refinement: Expectation-Maximization
Differences from EM in Lab 1:
• Use of structural prior (beta = strength of prior)
• ZOOPS (zero or one occurrence per sequence) model
• 5th-order Markov model for the background, trained over unbound sequences
• SVM for hypothesis testing
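
The background model on this slide can be illustrated with a minimal sketch. It assumes sequences over A/C/G/T, add-one pseudocounts, and a uniform probability for the first few bases and for unseen contexts; the function names `train_background` and `background_log_prob` are hypothetical, and none of these choices is confirmed by the slides.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_background(unbound_seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from unbound sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for seq in unbound_seqs:
        for i in range(order, len(seq)):
            window = seq[i - order:i + 1]
            # Skip windows containing ambiguous characters such as N.
            if all(ch in BASES for ch in window):
                counts[window[:-1]][window[-1]] += 1.0
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: c[b] / total for b in BASES}
    return model

def background_log_prob(seq, model, order=5):
    """log P(seq | B). Assumption: bases without a full or known
    context are scored with a uniform probability of 0.25."""
    logp = 0.0
    for i, base in enumerate(seq):
        context = seq[i - order:i] if i >= order else None
        probs = model.get(context)
        p = probs.get(base, 0.25) if probs else 0.25
        logp += math.log(p)
    return logp
```

The run described in this deck uses `order=5`; a smaller order is convenient for toy inputs.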

ZOOPS Model (Bailey & Elkan 1994)
• B: background model; M: motif model
• Λ: fraction of bound sequences (mixture model parameter)
• Sequences are drawn from the distribution
$$P(S) = \Lambda \, P(S \mid M) + (1 - \Lambda) \, P(S \mid B)$$
• Hidden variable for EM: $Z_{ij} \in \{0, 1\}$, where $Z_{ij} = 1$ if position j in sequence i is bound by the TF and 0 if not
• E-step:
$$P(Z_{ij} = 1) = \frac{\Lambda \, P(S_i \text{ bound at } j \mid M)}{(1 - \Lambda) \, P(S_i \mid B) + \Lambda \sum_{j'} P(S_i \text{ bound at } j' \mid M)}$$
The left-hand side is the posterior P(M bound at j | S_i); the denominator is P(S_i).
• M-step (same as before):
  - Updating M (motif model): for each position p of the motif model and each base b (A, C, G, or T), with $\mathrm{Base}_{ij}$ the base at position j of the ith sequence,
$$\mathrm{PWM}(p, b) = \sum_i \sum_j P(Z_{i,\,j-p+1} = 1) \, [\mathrm{Base}_{ij} = b] + \text{pseudocounts}$$
    and then normalize over b.
  - Updating the background model: the background is not updated.
  - Updating Λ:
$$\Lambda = \frac{\sum_i \sum_j P(Z_{ij} = 1)}{\text{number of sequences}}$$

Hypothesis Testing
• Get motifs from EM (background B + EM motif M)
• Use 2 sets of bound and unbound sequences (train and test)
• Train a linear SVM on the train set
• Find the classification error on the test set: Error = misclassifications / total samples
• Score = 1 - error
• Classifier input: P(S|M) / P(S|B); output: bound (B) or unbound (UB)
[Diagram: train set (B, UB) used to train the classifier; test set (B, UB) used to test the classifier]
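
The scoring step can be illustrated with a small sketch. Note the substitution: instead of a linear SVM, it fits a single threshold on the one-dimensional likelihood-ratio feature, which is the simplest linear classifier for that input. The function names and the toy numbers are hypothetical.

```python
def train_threshold(features, labels):
    """Pick the cut on a 1-D feature that minimizes training errors.

    features: likelihood ratios P(S|M)/P(S|B), one per sequence.
    labels:   True for bound, False for unbound.
    Stand-in for the linear SVM described on the slide.
    """
    best_cut, best_errors = None, len(labels) + 1
    for cut in sorted(set(features)):
        errors = sum((f >= cut) != y for f, y in zip(features, labels))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

def motif_score(cut, features, labels):
    """Score = 1 - (misclassifications / total samples) on held-out data."""
    errors = sum((f >= cut) != y for f, y in zip(features, labels))
    return 1.0 - errors / len(labels)

# Toy usage with made-up likelihood ratios.
cut = train_threshold([0.1, 0.4, 2.0, 3.5], [False, False, True, True])
test_score = motif_score(cut, [0.2, 2.5, 1.0, 3.0], [False, True, True, True])
```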

Expectation-Maximization: When to Stop? Will It Overtrain?
• When to stop: rules of thumb (when the likelihood increases very slowly)
  - Second derivative is negative for a given number of iterations
  - Euclidean distance is less than a given value
• Overtraining to the given sequences
  - EM maximizes the likelihood of the motif in the given sequences and disregards its likelihood in unbound sequences
  - Find the test classification error at each EM step using SVMs
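
The two rules of thumb can be sketched as below. The slides do not say whether the negative second derivatives must be consecutive, or what the Euclidean distance is measured between; this sketch assumes consecutive iterations and a distance between successive PWMs, and the function names are hypothetical.

```python
import math

def second_difference_stop(log_likelihoods, min_iters=60, patience=5):
    """Return True once at least `min_iters` iterations have run and the
    discrete second derivative of the log-likelihood has been negative
    for the last `patience` iterations (assumed to be consecutive)."""
    if len(log_likelihoods) < max(min_iters, patience + 2):
        return False
    recent = log_likelihoods[-(patience + 2):]
    second_diffs = [recent[k + 2] - 2 * recent[k + 1] + recent[k]
                    for k in range(patience)]
    return all(d < 0 for d in second_diffs)

def pwm_distance(pwm_a, pwm_b):
    """Euclidean distance between two PWMs given as lists of
    base -> probability dicts (assumed interpretation of the rule)."""
    return math.sqrt(sum((col_a[b] - col_b[b]) ** 2
                         for col_a, col_b in zip(pwm_a, pwm_b)
                         for b in "ACGT"))
```

The defaults `min_iters=60` and `patience=5` mirror the run settings listed later in the deck.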

Expectation-Maximization: A Different Methodology
• 4 sets of data: bound (for EM); bound and unbound (train SVM); bound and unbound (test SVM); bound and unbound (validation)
• At each EM iteration, train the SVM and find the test error
• Use two kinds of motifs:
  - Best-test-error motif
  - EM last-iteration motif
• Choose the 10 best hypotheses
• Use a larger validation set
[Diagram: Initial Points; SVM & Error at each EM iteration; Final Motif]

Expectation-Maximization: Details of Run
• Transcription factor: Nanog
• Beta = [0 0.2 0.35 0.5 0.6 0.7 1] (strength of prior)
• 5 motifs per beta, found by masking motifs
• Motif length: 8
• 25 bound sequences for EM
• 500 base pairs in each sequence
• 150 total train sequences (SVM) (low: noisy)
• 150 total test sequences (SVM) (low: noisy)
• 500 total validation sequences
• c = [1e-3, 0.05, 100.0] (SVM: budget for misclassifications)
• EM for a minimum of 60 iterations; stop when the second derivative is negative for five iterations

Expectation-Maximization: Representative Score Graphs During EM Iterations
[Figure: four panels, for beta = 0.0, 0.35, 0.6, and 0.7. X-axis: EM iteration; Y-axis: score of motif]

Expectation-Maximization: Test and Validation Error of Refined Motifs
[Figure: two panels, test classification score and validation classification score. X-axis: beta value; Y-axis: score of motif. Markers: * = end-of-iteration EM result; o = best-of-iteration]

Expectation-Maximization: When Is It the Best-of-Iteration?
[Figure: iteration vs. runs. Series: total iterations; iterations for best-of-iteration]

Expectation-Maximization: Results
• 6 of the 7 top-ranking motifs were best-of-iteration and 1 was end-of-iteration (6 out of the top 10 as well)
• Best motif: validation error over a set of 500
• Score: 61.2%, Error: 38.8%

Position   1         2         3         4         5         6         7         8
A          0.003392  0.764554  0.995187  0.072268  0.063644  0.459349  0.000033  0.088069
C          0.268216  0.050266  0.000149  0.000022  0.303880  0.003363  0.472214  0.201074
G          0.039865  0.000023  0.002015  0.205620  0.105970  0.537248  0.446827  0.228689
T          0.688527  0.185157  0.002648  0.722090  0.526506  0.000040  0.080927  0.482167
Consensus  T         A         A         T         T         A or G    C or G    T
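
As a quick check, the consensus can be read off the reported PWM programmatically. The sketch below assumes a cutoff of 0.4: any base with probability at or above it is listed at that position. The cutoff and the function name `consensus` are illustrative choices, not something stated in the slides.

```python
# PWM values copied from the Results slide (rows: base, columns: position 1-8).
PWM = {
    "A": [0.003392, 0.764554, 0.995187, 0.072268, 0.063644, 0.459349, 0.000033, 0.088069],
    "C": [0.268216, 0.050266, 0.000149, 0.000022, 0.303880, 0.003363, 0.472214, 0.201074],
    "G": [0.039865, 0.000023, 0.002015, 0.205620, 0.105970, 0.537248, 0.446827, 0.228689],
    "T": [0.688527, 0.185157, 0.002648, 0.722090, 0.526506, 0.000040, 0.080927, 0.482167],
}

def consensus(pwm, threshold=0.4):
    """List, per position, the bases whose probability is >= threshold.

    The 0.4 threshold is an assumption chosen for illustration.
    """
    width = len(pwm["A"])
    result = []
    for p in range(width):
        strong = [b for b in "ACGT" if pwm[b][p] >= threshold]
        result.append("/".join(strong) if strong else "N")
    return result

print(consensus(PWM))
# → ['T', 'A', 'A', 'T', 'T', 'A/G', 'C/G', 'T']
```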

Assumptions and Caveats
• Random baseline: end-of-run motif in EM
• Low number of sequences for test error
• Bound sets may not actually be bound; better to use highly probable sequences as bound
• All runs (including beta = 0) used the structural prior as the starting point


GSEA (Subramanian et al 2005) • Gene Set Enrichment Analysis (GSEA) determines whether an a priori defined set of genes shows statistically significant differences between two biological states.

GSEA Output
• Enrichment plot
• Gene list
• Gene set information

GSEA Ranked List
• Set of promoter sequences for every human gene
• 2000 bp upstream and 200 bp downstream of the transcription initiation site
• Score each promoter for likelihood of the motif
• Input this ranked list into GSEA
• Search for gene sets enriched in the ranked list
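
The promoter-scoring and ranking steps can be sketched as follows. The slides do not specify the scoring function; this sketch assumes the best log-odds window against a uniform background, promoter sequences containing only A/C/G/T, and hypothetical gene names and function names. Extracting the promoter windows and running GSEA itself are not shown.

```python
import math

def best_window_score(promoter, pwm, bg=0.25):
    """Highest log-odds score of the motif over all windows of a promoter.

    pwm: list of dicts (one per motif position) mapping base -> probability.
    Assumption: uniform background probability `bg` for every base.
    """
    width = len(pwm)
    best = float("-inf")
    for j in range(len(promoter) - width + 1):
        score = sum(math.log(pwm[p][promoter[j + p]] / bg)
                    for p in range(width))
        best = max(best, score)
    return best

def rank_genes(promoters, pwm):
    """Return gene names sorted from highest to lowest motif score."""
    scores = {gene: best_window_score(seq, pwm)
              for gene, seq in promoters.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: a 4-bp TAAT-like motif and three made-up promoters.
toy_pwm = [{b: (0.85 if b == c else 0.05) for b in "ACGT"} for c in "TAAT"]
toy_promoters = {
    "geneA": "GCGCTAATGC",
    "geneB": "GCGCGCGCGC",
    "geneC": "GCTAACGCGC",
}
ranked = rank_genes(toy_promoters, toy_pwm)
```

The resulting ordered gene list is the kind of ranked input that GSEA's pre-ranked mode expects, typically together with the per-gene scores.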

Results
• Human embryonic stem cell genes OCT4, NANOG, STELLAR, and GDF3 are expressed in both seminoma and breast carcinoma (Ezeh et al., 2006)
• Breast cancer gene set found at p-value 0.008

Implementation Details
• Young Lab error model for ChIP-chip data analysis
• Motif finding algorithm in MATLAB
• Implemented Markov model
• Implemented ZOOPS model
• Integrated SVM Toolbox (by S. R. Gunn) with code
• Used structural prior from MacIsaac et al., 2006
• Used GSEA software for functional analysis

Future Directions
• Algorithm
• Better use of classification error
• Maximize likelihood in bound sequences and minimize likelihood in unbound sequences (multi-objective optimization using GAs)
• Biological information: distance from transcription site, conservation
• Integrating expression data
• Cross-species motif search and functional analysis, maybe using GO terms
• Scoring
• Sequence length

Acknowledgments
• Fraenkel Lab
• Young Lab
• Kenzie D. MacIsaac
• Dr. David Gifford (CSAIL)
• Dr. Richard Young (WIBR)
• Dr. Tommi Jaakkola (CSAIL)
