MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers
Pengyu Hong, 10/06/2005
Motivation

• Understand transcriptional regulation
• Model transcriptional regulatory networks

[Figure: a TF bound to binding sites upstream of gene X; regulators, genes, binding sites, and the resulting mRNA transcript]
Motivation

Previous work on motif finding:
• AlignACE (Hughes et al. 2000)
• ANN-Spec (Workman et al. 2000)
• BioProspector (Liu et al. 2001)
• Consensus (Hertz et al. 1999)
• Gibbs Motif Sampler (Lawrence et al. 1993)
• LogicMotif (Keles et al. 2004)
• MDScan (Liu et al. 2002)
• MEME (Bailey and Elkan 1995)
• Motif Regressor (Conlon et al. 2003)
• …
Motivation

A widely used model – the motif weight matrix (Stormo et al. 1982):

Pos    1      2      3      4      5      6      7      8
A    0.19   1.11  -0.17   1.65  -2.65  -2.66  -1.98   0.92
C   -0.14  -0.49   1.89  -1.81   1.70   2.32   2.14  -2.07
G   -1.39   0.25  -1.22  -1.07  -2.07  -2.07  -2.07   1.13
T    0.86  -1.39  -2.65  -2.65   0.41  -2.65  -1.16  -1.80

Score of the site AACATCCG = 0.19 + 1.11 + 1.89 + 1.65 + 0.41 + 2.32 + 2.14 + 1.13 = 10.84, compared against a threshold.

A sequence is a target if it contains a binding site (score > threshold). Computational << Molecular.
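A minimal sketch of this scoring rule in Python, using the weight matrix above; the threshold value passed to `is_target` is an arbitrary placeholder, not one from the slides.

```python
# Weight matrix from the slide: per-base score at each of 8 positions.
WEIGHTS = {
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}
MOTIF_LEN = 8

def score_site(site: str) -> float:
    """Sum the per-position weights for one 8-bp site."""
    return sum(WEIGHTS[base][pos] for pos, base in enumerate(site))

def is_target(sequence: str, threshold: float) -> bool:
    """A sequence is a target if any window scores above the threshold."""
    return any(
        score_site(sequence[i:i + MOTIF_LEN]) > threshold
        for i in range(len(sequence) - MOTIF_LEN + 1)
    )

print(score_site("AACATCCG"))          # 10.84, as on the slide
print(is_target("GGAACATCCGTT", 8.0))  # True (threshold is a placeholder)
```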
Motivation

Non-linear binding effects, e.g., different binding modes:

Preferred binding:
• Mode 1: …CACCCATACAT…
• Mode 2: …CATCCGTACAT…

Non-preferred binding:
• Mode 3: …CACCCGTACAT…
• Mode 4: …CATCCATACAT…

Consensus: …CA[C/T]CC[A/G]TACAT…  The two degenerate positions are correlated: C pairs with A and T pairs with G in the preferred modes, a dependency that a single weight matrix, which scores positions independently, cannot capture, as the derivation below shows.
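Why an additive weight matrix cannot separate these modes: write s3(·) and s6(·) for the matrix scores at the two degenerate positions (notation assumed here, not from the slide); all shared positions contribute the same constant to every variant. Additivity then forces

```latex
\underbrace{\bigl[s_3(C)+s_6(A)\bigr]}_{\text{mode 1}}
+ \underbrace{\bigl[s_3(T)+s_6(G)\bigr]}_{\text{mode 2}}
=
\underbrace{\bigl[s_3(C)+s_6(G)\bigr]}_{\text{mode 3}}
+ \underbrace{\bigl[s_3(T)+s_6(A)\bigr]}_{\text{mode 4}}
```

so if both preferred modes score above a threshold t, the left side exceeds 2t, hence so does the right side, and at least one non-preferred mode also scores above t. Separating the preferred from the non-preferred modes therefore requires a non-linear model.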
Modeling

Model a TF-DNA binding classifier as an ensemble model:

F(S) = Σ_m α_m h_m(S)

where F is the ensemble model, α_m is the weight of the mth base classifier, and h_m is the mth base classifier.
Modeling

The mth base classifier h_m(S_i) is built on a sequence scoring function q_m(S_i), where f_m(s_ik) is a site scoring function (weight matrix + threshold) applied to the kth site of sequence S_i. The sequence scoring function considers (a) the number of matching sites and (b) the degree of matching.
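The slides do not show the exact functional forms, so the following is only a plausible sketch, reusing the WEIGHTS dictionary from the earlier snippet: q_m accumulates how far each site's weight-matrix score exceeds the classifier's threshold (capturing both the count and the degree of matches), and h_m squashes that into a confidence-rated vote. The function names and the tanh squashing are illustrative assumptions, not MotifBooster's definitions.

```python
import math

# f_m: weight-matrix score of one candidate site.
def site_score(weights, site):
    return sum(weights[base][pos] for pos, base in enumerate(site))

# q_m: aggregate over all sites in the sequence; reflects both how many
# sites clear the threshold and by how much (one plausible form only).
def sequence_score(weights, threshold, seq, motif_len=8):
    total = 0.0
    for i in range(len(seq) - motif_len + 1):
        excess = site_score(weights, seq[i:i + motif_len]) - threshold
        if excess > 0:
            total += excess
    return total

# h_m: map the nonnegative q_m onto a confidence-rated vote in [-1, 1):
# no matching site -> -1; strong or multiple matches -> close to +1.
def base_classifier(weights, threshold, seq):
    return 2.0 * math.tanh(sequence_score(weights, threshold, seq)) - 1.0

# Example (WEIGHTS from the earlier sketch):
# base_classifier(WEIGHTS, 8.0, "GGAACATCCGTT") -> approx. 0.99
```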
Training – Boosting

Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models:
(a) Decide the number of base classifiers.
(b) Learn the parameters of each base classifier and its weight.
Why Boosting?

Boosting is a Newton-like technique that iteratively adds base classifiers to minimize an upper bound on the training error (Schapire et al. 1998).

[Figure: margin of training samples vs. generalization error and training error]
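The bound being minimized is not written out on the slide; the standard exponential-loss bound from the boosting literature, in the notation of the earlier slides, is:

```latex
\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\, y_i \neq \operatorname{sign} F(S_i) \,\right]
\;\le\;
\frac{1}{N}\sum_{i=1}^{N} \exp\!\left(-\, y_i\, F(S_i)\right),
\qquad F(S) = \sum_{m} \alpha_m h_m(S)
```

Each boosting round greedily reduces the right-hand side, which is why adding base classifiers drives the training error down.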
Challenges

• Positive sequences are targets of a TF; negative sequences are non-targets.
• Sequences are labeled, but the sites within the sequences are not.
• Positives and negatives cannot be well separated by the (linear) weight matrix model.
• Number of negative sequences >> number of positive sequences.
Boosting – Initialization

• Set sample weights so that the total weight of the positive samples equals the total weight of the negative samples, as sketched below.
• Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W0.

[Figure: positive and negative training samples]
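A minimal sketch of such an initialization, assuming each class's total weight is set to 1/2 and split uniformly within the class; the uniform within-class split is an assumption, since the slide only constrains the class totals.

```python
def init_weights(labels):
    """Initial sample weights: total positive weight == total negative weight.
    labels: list of +1 (positive sequence) / -1 (negative sequence)."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]

weights = init_weights([+1, +1, -1, -1, -1, -1])
print(weights)  # positives get 0.25 each, negatives 0.125 each; totals match
```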
Boosting – Train a base classifier (BC)

• Use the seed matrix W0 to initialize the mth base classifier qm() and let m = 1.
• Refine αm and the parameters of qm() to minimize a weighted loss over the training sequences (shown as a formula on the slide; see the CRB form below), where yi is the label of Si and di^m is the weight of Si in the mth round.
• Negative information is explicitly used to train qm() and αm.

[Figure: positive and negative samples with base classifier BC 1]
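The minimized expression itself did not survive extraction. In unmodified confidence-rated boosting (Schapire & Singer 1999), the mth round minimizes the normalizer below; read this as the generic CRB objective that MotifBooster modifies, not its exact objective:

```latex
Z_m \;=\; \sum_{i} d_i^{m}\, \exp\!\left(-\, \alpha_m\, y_i\, q_m(S_i)\right)
```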
Boosting – Adjust sample weights

Adjust the sample weights, giving higher weight to previously misclassified samples (the update formula was shown on the slide; the standard CRB form is given below):
• yi is the label of Si.
• di^m is the weight of Si in the mth round.
• di^{m+1} is the new weight of Si.

[Figure: reweighted positive and negative samples with BC 1]
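The update formula also did not survive extraction; the standard CRB weight update consistent with the description is:

```latex
d_i^{m+1} \;=\; \frac{d_i^{m}\, \exp\!\left(-\, \alpha_m\, y_i\, q_m(S_i)\right)}{Z_m}
```

Samples that qm misclassifies (yi·qm(Si) < 0) thus gain weight in the next round.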
Boosting – Add a new base classifier

[Figure: positive and negative samples with base classifiers BC 1 and BC 2]
Boosting – Add a new base classifier

[Figure: combined decision boundary over the positive and negative samples]
Boosting – Adjust sample weights again

[Figure: reweighted samples and the current decision boundary]
Boosting – Add one more base classifier

[Figure: samples with base classifier BC 3]
Boosting – Add one more base classifier

[Figure: updated decision boundary]
Boosting – Stopping criterion

Stop if the training result is perfect or the performance on the internal validation sequences drops.

[Figure: final decision boundary over the positive and negative samples]
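A skeleton of the overall training loop implied by these slides; `train_base_classifier`, `evaluate`, and `reweight` are hypothetical stand-ins for the steps above and must be supplied by the caller, and `init_weights` is the earlier sketch.

```python
def boost(train_data, valid_data, train_base_classifier, evaluate, reweight,
          max_rounds=50):
    """Skeleton of the boosting loop sketched on the slides."""
    weights = init_weights([y for _, y in train_data])
    ensemble = []                                   # list of (alpha_m, q_m)
    best_valid = float("-inf")
    for m in range(max_rounds):
        alpha_q = train_base_classifier(train_data, weights)  # minimize Z_m
        ensemble.append(alpha_q)
        if evaluate(ensemble, train_data) == 1.0:   # perfect on training set
            break
        valid_acc = evaluate(ensemble, valid_data)
        if valid_acc < best_valid:                  # validation drops: stop
            ensemble.pop()                          # discard the round that hurt
            break
        best_valid = valid_acc
        weights = reweight(weights, alpha_q, train_data)  # CRB-style update
    return ensemble
```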
Results

Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002)

• Positive sequences: binding p-value < 0.001; only TFs with at least 25 positive sequences were kept.
• Negative sequences: binding p-value ≥ 0.05 and binding ratio ≤ 1.

This yielded 40 TFs.
Results

Leave-one-out test results: boosted models vs. seed weight matrices.

[Figure: bar chart; vertical axis: improvement in specificity; horizontal axis: TFs]
Results – Capturing position correlation: RAP1

[Figure: motif logos for the seed weight matrix, base classifiers 1–3, and the boosted model]
Results – Capturing position correlation: REB1

[Figure: motif logos for the seed weight matrix, base classifiers 1–2, and the boosted model]