MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers
Pengyu Hong, 10/06/2005
Motivation

• Understand transcriptional regulation
• Model transcriptional regulatory networks

[Figure: a TF bound to binding sites upstream of gene X; regulators, genes, binding sites, and the resulting mRNA transcript]
Motivation

Previous work on motif finding:
• AlignACE (Hughes et al. 2000)
• ANN-Spec (Workman et al. 2000)
• BioProspector (Liu et al. 2001)
• Consensus (Hertz et al. 1999)
• Gibbs Motif Sampler (Lawrence et al. 1993)
• LogicMotif (Keles et al. 2004)
• MDScan (Liu et al. 2002)
• MEME (Bailey and Elkan 1995)
• Motif Regressor (Conlon et al. 2003)
• …
Motivation

A widely used model – the motif weight matrix (Stormo et al. 1982):

Pos    1      2      3      4      5      6      7      8
A    0.19   1.11  -0.17   1.65  -2.65  -2.66  -1.98   0.92
C   -0.14  -0.49   1.89  -1.81   1.70   2.32   2.14  -2.07
G   -1.39   0.25  -1.22  -1.07  -2.07  -2.07  -2.07   1.13
T    0.86  -1.39  -2.65  -2.65   0.41  -2.65  -1.16  -1.80

Score of the site AACATCCG = 0.19 + 1.11 + 1.89 + 1.65 + 0.41 + 2.32 + 2.14 + 1.13 = 10.84, compared against a threshold.

A sequence is a target if it contains a binding site (score > threshold). Computational << Molecular.
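A minimal sketch of this scoring rule in Python, using the weight matrix above; the threshold value passed to `is_target` is an arbitrary placeholder, not one from the slides.

```python
# Weight matrix from the slide: per-base score at each of 8 positions.
WEIGHTS = {
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}
MOTIF_LEN = 8

def score_site(site: str) -> float:
    """Sum the per-position weights for one 8-bp site."""
    return sum(WEIGHTS[base][pos] for pos, base in enumerate(site))

def is_target(sequence: str, threshold: float) -> bool:
    """A sequence is a target if any window scores above the threshold."""
    return any(
        score_site(sequence[i:i + MOTIF_LEN]) > threshold
        for i in range(len(sequence) - MOTIF_LEN + 1)
    )

print(score_site("AACATCCG"))          # 10.84, as on the slide
print(is_target("GGAACATCCGTT", 8.0))  # True (threshold is a placeholder)
```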
Motivation

Non-linear binding effects, e.g., different binding modes:

Preferred binding:
• Mode 1: …CACCCATACAT…
• Mode 2: …CATCCGTACAT…

Non-preferred binding:
• Mode 3: …CACCCGTACAT…
• Mode 4: …CATCCATACAT…

Consensus: …CA[C/T]CC[A/G]TACAT…  The two degenerate positions are correlated: C pairs with A and T pairs with G in the preferred modes, a dependency that a single weight matrix, which scores positions independently, cannot capture, as the derivation below shows.
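Why an additive weight matrix cannot separate these modes: write s3(·) and s6(·) for the matrix scores at the two degenerate positions (notation assumed here, not from the slide); all shared positions contribute the same constant to every variant. Additivity then forces

```latex
\underbrace{\bigl[s_3(C)+s_6(A)\bigr]}_{\text{mode 1}}
+ \underbrace{\bigl[s_3(T)+s_6(G)\bigr]}_{\text{mode 2}}
=
\underbrace{\bigl[s_3(C)+s_6(G)\bigr]}_{\text{mode 3}}
+ \underbrace{\bigl[s_3(T)+s_6(A)\bigr]}_{\text{mode 4}}
```

so if both preferred modes score above a threshold t, the left side exceeds 2t, hence so does the right side, and at least one non-preferred mode also scores above t. Separating the preferred from the non-preferred modes therefore requires a non-linear model.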
Modeling

Model a TF-DNA binding classifier as an ensemble model:

F(S) = Σ_m α_m h_m(S)

where F is the ensemble model, α_m is the weight of the mth base classifier, and h_m is the mth base classifier.
Modeling

The mth base classifier h_m(S_i) is built on a sequence scoring function q_m(S_i), where f_m(s_ik) is a site scoring function (weight matrix + threshold) applied to the kth site of sequence S_i. The sequence scoring function considers (a) the number of matching sites and (b) the degree of matching.
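The slides do not show the exact functional forms, so the following is only a plausible sketch, reusing the WEIGHTS dictionary from the earlier snippet: q_m accumulates how far each site's weight-matrix score exceeds the classifier's threshold (capturing both the count and the degree of matches), and h_m squashes that into a confidence-rated vote. The function names and the tanh squashing are illustrative assumptions, not MotifBooster's definitions.

```python
import math

# f_m: weight-matrix score of one candidate site.
def site_score(weights, site):
    return sum(weights[base][pos] for pos, base in enumerate(site))

# q_m: aggregate over all sites in the sequence; reflects both how many
# sites clear the threshold and by how much (one plausible form only).
def sequence_score(weights, threshold, seq, motif_len=8):
    total = 0.0
    for i in range(len(seq) - motif_len + 1):
        excess = site_score(weights, seq[i:i + motif_len]) - threshold
        if excess > 0:
            total += excess
    return total

# h_m: map the nonnegative q_m onto a confidence-rated vote in [-1, 1):
# no matching site -> -1; strong or multiple matches -> close to +1.
def base_classifier(weights, threshold, seq):
    return 2.0 * math.tanh(sequence_score(weights, threshold, seq)) - 1.0

# Example (WEIGHTS from the earlier sketch):
# base_classifier(WEIGHTS, 8.0, "GGAACATCCGTT") -> approx. 0.99
```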
Training – Boosting

Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models:
(a) Decide the number of base classifiers.
(b) Learn the parameters of each base classifier and its weight.
Why Boosting?

Boosting is a Newton-like technique that iteratively adds base classifiers to minimize an upper bound on the training error (Schapire et al. 1998).

[Figure: margin of training samples vs. generalization error and training error]
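The bound being minimized is not written out on the slide; the standard exponential-loss bound from the boosting literature, in the notation of the earlier slides, is:

```latex
\frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\!\left[\, y_i \neq \operatorname{sign} F(S_i) \,\right]
\;\le\;
\frac{1}{N}\sum_{i=1}^{N} \exp\!\left(-\, y_i\, F(S_i)\right),
\qquad F(S) = \sum_{m} \alpha_m h_m(S)
```

Each boosting round greedily reduces the right-hand side, which is why adding base classifiers drives the training error down.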
Challenges

• Positive sequences are targets of a TF; negative sequences are non-targets.
• Sequences are labeled, but the sites within the sequences are not.
• Positives and negatives cannot be well separated by the (linear) weight matrix model.
• Number of negative sequences >> number of positive sequences.
Boosting – Initialization

• Set sample weights so that the total weight of the positive samples equals the total weight of the negative samples, as sketched below.
• Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W0.

[Figure: positive and negative training samples]
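A minimal sketch of such an initialization, assuming each class's total weight is set to 1/2 and split uniformly within the class; the uniform within-class split is an assumption, since the slide only constrains the class totals.

```python
def init_weights(labels):
    """Initial sample weights: total positive weight == total negative weight.
    labels: list of +1 (positive sequence) / -1 (negative sequence)."""
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = len(labels) - n_pos
    return [0.5 / n_pos if y > 0 else 0.5 / n_neg for y in labels]

weights = init_weights([+1, +1, -1, -1, -1, -1])
print(weights)  # positives get 0.25 each, negatives 0.125 each; totals match
```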
Boosting – Train a base classifier (BC)

• Use the seed matrix W0 to initialize the mth base classifier qm() and let m = 1.
• Refine αm and the parameters of qm() to minimize a weighted loss over the training sequences (shown as a formula on the slide; see the CRB form below), where yi is the label of Si and di^m is the weight of Si in the mth round.
• Negative information is explicitly used to train qm() and αm.

[Figure: positive and negative samples with base classifier BC 1]
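The minimized expression itself did not survive extraction. In unmodified confidence-rated boosting (Schapire & Singer 1999), the mth round minimizes the normalizer below; read this as the generic CRB objective that MotifBooster modifies, not its exact objective:

```latex
Z_m \;=\; \sum_{i} d_i^{m}\, \exp\!\left(-\, \alpha_m\, y_i\, q_m(S_i)\right)
```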
Boosting – Adjust sample weights

Adjust the sample weights, giving higher weight to previously misclassified samples (the update formula was shown on the slide; the standard CRB form is given below):
• yi is the label of Si.
• di^m is the weight of Si in the mth round.
• di^{m+1} is the new weight of Si.

[Figure: reweighted positive and negative samples with BC 1]
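The update formula also did not survive extraction; the standard CRB weight update consistent with the description is:

```latex
d_i^{m+1} \;=\; \frac{d_i^{m}\, \exp\!\left(-\, \alpha_m\, y_i\, q_m(S_i)\right)}{Z_m}
```

Samples that qm misclassifies (yi·qm(Si) < 0) thus gain weight in the next round.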
Boosting – Add a new base classifier

[Figure: positive and negative samples with base classifiers BC 1 and BC 2]
Boosting – Add a new base classifier

[Figure: combined decision boundary over the positive and negative samples]
Boosting – Adjust sample weights again

[Figure: reweighted samples and the current decision boundary]
Boosting – Add one more base classifier

[Figure: samples with base classifier BC 3]
Boosting – Add one more base classifier

[Figure: updated decision boundary]
Boosting – Stopping criterion

Stop if the training result is perfect or the performance on the internal validation sequences drops.

[Figure: final decision boundary over the positive and negative samples]
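A skeleton of the overall training loop implied by these slides; `train_base_classifier`, `evaluate`, and `reweight` are hypothetical stand-ins for the steps above and must be supplied by the caller, and `init_weights` is the earlier sketch.

```python
def boost(train_data, valid_data, train_base_classifier, evaluate, reweight,
          max_rounds=50):
    """Skeleton of the boosting loop sketched on the slides."""
    weights = init_weights([y for _, y in train_data])
    ensemble = []                                   # list of (alpha_m, q_m)
    best_valid = float("-inf")
    for m in range(max_rounds):
        alpha_q = train_base_classifier(train_data, weights)  # minimize Z_m
        ensemble.append(alpha_q)
        if evaluate(ensemble, train_data) == 1.0:   # perfect on training set
            break
        valid_acc = evaluate(ensemble, valid_data)
        if valid_acc < best_valid:                  # validation drops: stop
            ensemble.pop()                          # discard the round that hurt
            break
        best_valid = valid_acc
        weights = reweight(weights, alpha_q, train_data)  # CRB-style update
    return ensemble
```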
Results

Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002)

• Positive sequences: binding p-value < 0.001; only TFs with at least 25 positive sequences were kept.
• Negative sequences: binding p-value ≥ 0.05 and binding ratio ≤ 1.

This yielded 40 TFs.
Results

Leave-one-out test results: boosted models vs. seed weight matrices.

[Figure: bar chart; vertical axis: improvement in specificity; horizontal axis: TFs]
Results – Capturing position correlation: RAP1

[Figure: motif logos for the seed weight matrix, base classifiers 1–3, and the boosted model]
Results – Capturing position correlation: REB1

[Figure: motif logos for the seed weight matrix, base classifiers 1–2, and the boosted model]