
Promoter prediction assessment

ENCODE Workshop 2005 at Sanger Institute. Promoter prediction assessment, by Vladimir B Bajic. Predictors: ENCODE participants (3): 7-80-8 (McPromoter1), 7-81-8 (McPromoter2), 41-108-8 (Fprom); additional predictors beyond ENCODE participants (4), out of competition.




  1. ENCODE Workshop 2005 at Sanger Institute: Promoter prediction assessment, by Vladimir B Bajic

  2. Predictors
     • ENCODE participants (3):
       • 7-80-8 (McPromoter1)
       • 7-81-8 (McPromoter2)
       • 41-108-8 (Fprom)
     • Additional predictors, beyond ENCODE participants (4) (out of competition):
       • DBTSS (reference experimental dataset of capped flcDNA)
       • FirstEF
       • Dragon Gene Start Finder
       • Dragon Promoter Finder

  3. Goals
     • How good are promoter predictors?
     • Does performance change on this dataset?
     • Implications for future developments

  4. Data (1)
     • Category “Known genes with CDS” (category = 2)
     • 1061 annotated transcripts
     • 1009 -> 994 unique starts of transcripts (TSSs)
     • 319 unique TSSs in ENCODE ‘training’ set (13 regions)
     • 675 unique TSSs in ENCODE test set
     • Length of ENCODE regions: 29,998,060 bp
     • Length of ‘training’ regions: 8,538,447 bp
     • Length of testing regions: 21,459,613 bp

  5. Data (2)

  6. Method for counting TP and FP
     • All hits to ‘orange’ count as FPs
     • Only one hit within A, B, or C counts as a TP for each unique TSS position (3 hits within C count as only 1 TP)
     • For each hit, only the minimum distance over all TSSs counts
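
The counting rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used in the assessment: it assumes that regions A, B and C together cover every position within `max_dist` of an annotated TSS, that ‘orange’ is everything farther away, and that strand is ignored. The function name `count_tp_fp` is hypothetical, and the default of 1000 nt is the distance limit quoted on the Comments slide.

```python
from bisect import bisect_left


def count_tp_fp(predictions, tss_positions, max_dist=1000):
    """Count TP, FP and FN for predicted TSS positions (hypothetical helper).

    Assumptions (not taken from the slides): a prediction is a single
    coordinate, strand is ignored, and any prediction farther than
    max_dist from every annotated TSS falls in the 'orange' zone.
    """
    tss = sorted(set(tss_positions))      # unique TSS positions only
    best = {}                             # TSS -> smallest distance of any hit
    fp = 0
    for p in predictions:
        if not tss:
            fp += 1
            continue
        i = bisect_left(tss, p)
        # The nearest TSS is the left or the right neighbour of p.
        candidates = tss[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda t: abs(t - p))
        d = abs(nearest - p)              # minimum distance over all TSSs
        if d > max_dist:
            fp += 1                       # hit in the 'orange' zone
        elif nearest not in best or d < best[nearest]:
            best[nearest] = d             # several hits near one TSS: 1 TP
    tp = len(best)
    fn = len(tss) - tp
    return tp, fp, fn, sorted(best.values())


# Toy example: three hits near the TSS at 120 count as a single TP.
tp, fp, fn, distances = count_tp_fp([100, 150, 180, 5000, 9000], [120, 8500])
```

Keeping only the closest hit per TSS is a design choice made here so that the returned distances can feed a positional-error measure; the slide itself does not say which of several hits is retained.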

  7. Results
     • Different measures of success
     • Test ENCODE regions
     • Also: comparison with other participants (test + all regions)

  8. Se, ppv, AE (average positional error)
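
For readers unfamiliar with the three measures named on this slide, here is a minimal Python sketch. Se and ppv follow their standard definitions, TP/(TP+FN) and TP/(TP+FP). Treating AE as the mean distance between each true-positive prediction and its matched TSS is an assumption made here, since the transcript does not define it, and the function name `se_ppv_ae` is hypothetical.

```python
def se_ppv_ae(tp, fp, fn, tp_distances):
    """Return (Se, ppv, AE) from counts and per-TP distances in nt.

    AE is computed as the mean of tp_distances; this reading of
    "average positional error" is an assumption, not the slide's definition.
    """
    nan = float("nan")
    se = tp / (tp + fn) if (tp + fn) else nan      # sensitivity
    ppv = tp / (tp + fp) if (tp + fp) else nan     # positive predictive value
    ae = sum(tp_distances) / len(tp_distances) if tp_distances else nan
    return se, ppv, ae


# Toy numbers, not results from the assessment.
se, ppv, ae = se_ppv_ae(tp=2, fp=1, fn=0, tp_distances=[20, 500])
```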

  9. DIP1, DIP2, CC, ASM

  10. Comments
     • Compared to the previous whole human genome analysis, we now use a stricter distance constraint: max allowed distance 1000 nt (vs. 2000 nt previously)
     • Previously: Se [0.4 – 0.8], ppv [0.25 – 0.67]
     • Now, for experimental DBTSS data:
       • Positional error ~100 nt, Se 0.61, ppv 0.93
     • Computational promoter prediction (CPP) (using a single genome, no transcripts):
       • Positional error 200-300 nt (2-3 fold larger than DBTSS) (positive surprise)
       • Se [0.32 – 0.62] (negative surprise but expected; reason: low G+C content of some of the test regions)
       • ppv >80% (in some cases >90%) (positive surprise)
     • Bearing in mind the type of information used for ab initio promoter finding, we see no dramatic difference in 5’ end prediction between methods of class 1 and class 3 and CPP (positive surprise); however, Se and ppv are better with methods of class 1 and class 3, for obvious reasons.

  11. Future developments
     • Combine TSS predictors with gene finding programs or transcript info (positive effects of this are visible in Fprom, 20-76-4 and 20-76-5, since in these cases the TSS search space is effectively restricted)
     • This, however, requires retuning of TSS predictors and some change in their design philosophy
     • Expected performance should be similar to or better than that of class 1 and class 3 systems, as TSS finding systems should be more specialized for 5’ end signals
     • More emphasis should be given to the positional accuracy of TSS predictors
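
One way to picture the restricted search space mentioned on this slide is a filter that keeps only TSS predictions lying near a gene start proposed by a gene finder. The sketch below is purely illustrative: the function name `restrict_to_gene_starts`, the window sizes, and the forward-strand-only treatment are all assumptions and do not describe how Fprom or any other assessed program works.

```python
def restrict_to_gene_starts(predictions, gene_starts,
                            upstream=2000, downstream=500):
    """Keep TSS predictions inside a window around a predicted gene start.

    Hypothetical helper: window sizes are arbitrary illustration values,
    and all genes are assumed to lie on the forward strand.
    """
    kept = []
    for p in predictions:
        # A prediction survives if it falls in the window of any gene start.
        if any(g - upstream <= p <= g + downstream for g in gene_starts):
            kept.append(p)
    return kept


# Toy example: only the prediction at 2500 lies within the window
# [1000, 3500] around the predicted gene start at 3000.
kept = restrict_to_gene_starts([100, 2500, 9000], [3000])
```

Discarding predictions far from any predicted gene start removes candidates that would otherwise count as false positives, which is the effect the slide attributes to a restricted search space; any real combination would also need the retuning the slide calls for.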

  12. Thank you for your time. You may wake up now.
