
CSTminer

ENCODE Gene Prediction Workshop - 2005. CSTminer. G. Pesole (F. Mignone). CSTminer and CPS computation I. CSTminer compares evolutionarily related sequences and identifies Conserved Sequence Tags (CSTs).


Presentation Transcript


  1. ENCODE Gene Prediction Workshop - 2005. CSTminer. G. Pesole (F. Mignone)

  2. CSTminer and CPS computation I
     CSTminer:
     • compares evolutionarily related sequences
     • identifies Conserved Sequence Tags (CSTs)
     • assigns a Coding Potential Score (CPS) by quantifying the peculiar evolutionary dynamics of coding sequences at both the nucleotide and amino acid levels
     [Figure: homologous sequences, BLAST-like alignment, HSP]

  3. Definition of CPS cutoff I
     [Plot: % CSTs against CPS]
     • 24,600 CSTs (≥ 5%): average CPS = 8.32 (± 0.99)
     • 184,046 CSTs (≥ 5%): average CPS = 5.43 (± 0.79)

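  4. Definition of CPS cutoff II
     • Fewer than 1% of coding CSTs have CPS ≤ 6.41
     • Fewer than 1% of non-coding CSTs have CPS ≥ 7.66
     [Plot: CPS axis with cutoffs at 6.41 and 7.66; regions labelled H-COD, L-COD and Non coding]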
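The two cutoffs on slide 4 can be read as a three-way labelling rule. The sketch below is one hedged reading, not CSTminer's own code: it assumes H-COD means CPS at or above 7.66, Non coding means CPS at or below 6.41, and L-COD is the band in between. The function name `classify_cps` and the treatment of the exact boundary values are assumptions made for illustration.

```python
# Cutoffs quoted on slide 4.
NON_CODING_MAX = 6.41  # fewer than 1% of coding CSTs score at or below this
H_COD_MIN = 7.66       # fewer than 1% of non-coding CSTs score at or above this


def classify_cps(cps: float) -> str:
    """Label a CST from its CPS (assumed reading of the slide-4 plot)."""
    if cps >= H_COD_MIN:
        return "H-COD"
    if cps > NON_CODING_MAX:
        # Assumption: the band between the two cutoffs is the L-COD class.
        return "L-COD"
    return "non-coding"


if __name__ == "__main__":
    for score in (8.32, 7.0, 5.43):
        print(score, classify_cps(score))
```

Under this reading, the two average CPS values quoted on slide 3 (8.32 and 5.43) fall in the H-COD and non-coding classes respectively.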

  5. Prediction of “novel” human genes by comparison with the mouse syntenic regions of Chr15, Chr21 and Chr22
     [Figure panels: H. sapiens Chr 22, H. sapiens Chr 21, H. sapiens Chr 15]

  6. CST annotation
     [Figure: L-COD and H-COD CSTs]

  7. CST annotation
     [Diagram: a known gene with Exon1, Exon2 and Exon3; CSTs labelled intergenic, intronic or exonic]
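The three labels on slide 7 can be illustrated with a simple interval test against one known gene. This is a minimal sketch under stated assumptions: coordinates are inclusive (start, end) pairs on one strand-agnostic axis, any overlap with an exon counts as exonic, and the names `annotate_cst` and `overlaps` are hypothetical. The slides do not say how CSTminer handles CSTs that straddle an exon boundary or a gene boundary.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def annotate_cst(cst: Interval, gene: Interval, exons: List[Interval]) -> str:
    """Label a CST relative to a single known gene and its exons."""
    if not overlaps(cst, gene):
        return "intergenic"
    if any(overlaps(cst, exon) for exon in exons):
        # Assumption: any overlap with an exon is enough to call it exonic.
        return "exonic"
    return "intronic"


if __name__ == "__main__":
    gene = (1000, 5000)
    exons = [(1000, 1200), (2000, 2300), (4500, 5000)]
    for cst in [(100, 300), (1100, 1150), (3000, 3200)]:
        print(cst, annotate_cst(cst, gene, exons))
```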

  8. Genome annotation of coding CSTs: 984 coding CSTs in intergenic regions, 423 CSTs in intronic regions

  9. Clustered intergenic/coding CSTs may represent novel genes
     • ≥ 4 clustered coding CSTs (>90% genes)
     • Typical gene (average L: 57 kbp)

  10. Cluster definition
      • Step I: precluster definition [Diagram: CSTstart_i, CSTstart_i+1, … along the genomic sequence, grouped into preclusters (pc)]
      • Step II: cluster building [Diagram: preclusters (pc) along the genomic sequence]
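The slides name the two clustering steps but do not state their rules, so the sketch below does not reproduce CSTminer's precluster/cluster procedure. It substitutes a plain gap-based grouping of CST start coordinates, followed by the "≥ 4 clustered coding CSTs" size filter from slide 9. The function name `cluster_cst_starts` and the `max_gap` parameter are hypothetical, and the gap value in the example is arbitrary.

```python
from typing import List


def cluster_cst_starts(starts: List[int], max_gap: int,
                       min_size: int = 4) -> List[List[int]]:
    """Group CST start positions whose consecutive gaps are <= max_gap.

    Stand-in technique: single-pass gap clustering on sorted coordinates.
    Groups with fewer than min_size members are discarded.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(starts):
        # Start a new group when the gap to the previous CST is too large.
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) >= min_size]


if __name__ == "__main__":
    starts = [100, 900, 1500, 2100, 50000, 50500, 120000]
    print(cluster_cst_starts(starts, max_gap=1000))
```

With these example coordinates only the first four CSTs form a group large enough to be kept; the pair near 50,000 and the singleton at 120,000 are dropped by the size filter.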

  11. Clustered intergenic/coding CSTs: supporting features
      -> 301 clustered CSTs (out of 984 intergenic CSTs)
      -> 25 clusters
      • 20/25 Genscan/twinscan
      • 20/25 RefSeq, Trembl, Unigene
      • 18/25 ESTs
      • 19/25 Mouse ensembl genes
      • 11/25 Human ensembl genes (new release)
      • 4 unsupported clusters

  12. CST cluster 15P1 corresponds to a newly annotated gene

  13. Intronic CSTs may represent novel splicing isoforms
      [Figure: CST 22_E_936]

  14. Conclusion
      What CSTminer does:
      • Detects coding-conserved regions
      • With CST clustering, it is possible to detect coding gene regions
      • May support any other kind of gene prediction
      • May identify splice variants
      What CSTminer doesn’t do:
      - Doesn’t detect gene structure and exon boundaries
      - May merge proximal genes
      What we can do next:
      - Improve cutoff definition
      - Multi-species comparison
      - Improve clustering definition
      Pros:
      - No annotation or known sequences (mRNAs, ESTs…) required
      - Easy to automate
      - No manual work
      - Very fast

  15. CPS computation - II
      for f = 1, 2, 3, -1, -2, -3
      The CPS computation requires a given amount of genetic divergence between aligned CSTs (i.e. Ka > 0 and Ks > 0). But what is the minimum divergence needed to obtain a reliable CPS? … And what is the CPS cutoff that discriminates coding from non-coding?
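Slide 15 indexes the computation by six frames, f = 1, 2, 3, -1, -2, -3, and requires Ka > 0 and Ks > 0. The slides do not give the CPS formula, so the sketch below only shows the surrounding bookkeeping: splitting a nucleotide sequence into codons for each frame, and the divergence check. The names `codons_in_frame` and `cps_is_computable` are hypothetical, and numbering negative frames as offsets into the reverse complement is an assumed convention.

```python
FRAMES = (1, 2, 3, -1, -2, -3)
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def codons_in_frame(seq: str, frame: int) -> list:
    """Complete codons of seq read in one of the six frames."""
    if frame not in FRAMES:
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    # Assumption: negative frames are read on the reverse complement,
    # with the same 0/1/2 offset as the positive frames.
    strand = seq if frame > 0 else reverse_complement(seq)
    offset = abs(frame) - 1
    return [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]


def cps_is_computable(ka: float, ks: float) -> bool:
    """Divergence requirement stated on the slide: Ka > 0 and Ks > 0."""
    return ka > 0 and ks > 0


if __name__ == "__main__":
    for f in FRAMES:
        print(f, codons_in_frame("ATGGCCTAA", f))
```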

  16. Definition of CST minimum divergence
      To assess the minimum divergence for reliable CPS computation and to define optimal cutoff values, we used two benchmark datasets (coding and non-coding).
      [Plot: % of divergence; values shown: 5, 3, 2]
