ENCODE Gene Prediction Workshop - 2005
CSTminer
G. Pesole (F. Mignone)
CSTminer and CPS computation I
CSTminer:
• compares evolutionarily related sequences
• identifies Conserved Sequence Tags (CSTs)
• assigns a Coding Potential Score (CPS) by quantifying the peculiar evolutionary dynamics of coding sequences at both the nucleotide and amino acid level
[Figure: homologous sequences, BLAST-like alignment, HSP]
Definition of CPS cutoff I
[Figure: % CSTs plotted against CPS for two sets of CSTs]
• 24,600 CSTs (≥ 5%): average CPS = 8.32 (± 0.99)
• 184,046 CSTs (≥ 5%): average CPS = 5.43 (± 0.79)
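Definition of CPS cutoff II
• Fewer than 1% of coding CSTs have CPS ≤ 6.41
• Fewer than 1% of non-coding CSTs have CPS ≥ 7.66
[Figure: CPS axis with cutoffs at 6.41 and 7.66, divided into the regions Non coding, L-COD and H-COD]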
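The two cutoffs on this slide can be read as a three-way classification of a CST by its CPS. The sketch below is only an illustration: the function name `classify_cps` is hypothetical. The assignment of the regions (Non coding at or below 6.41, H-COD at or above 7.66, L-COD in between) is inferred from the axis labels and is not stated explicitly in the slides.

```python
# Cutoffs taken from the slide. Treating values exactly on a cutoff as
# belonging to the outer class follows the slide's "≤" and "≥" signs.
NON_CODING_MAX = 6.41
H_COD_MIN = 7.66


def classify_cps(cps: float) -> str:
    """Map a CPS value onto one of the three regions of the CPS axis.

    Assumption: the region between the two cutoffs is the L-COD class.
    """
    if cps <= NON_CODING_MAX:
        return "Non coding"
    if cps >= H_COD_MIN:
        return "H-COD"
    return "L-COD"


# Example values: the two average CPS values from the previous slide
# and one value between the cutoffs.
for value in (5.43, 7.0, 8.32):
    print(value, classify_cps(value))
```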
Prediction of “novel” human genes by comparing mouse syntenic regions of Chr15, Chr21 and Chr22
[Figure panels: H. sapiens Chr 22, H. sapiens Chr 21, H. sapiens Chr 15]
CST annotation
[Figure legend: L-COD, H-COD]
CST annotation
[Figure: a known gene with Exon1, Exon2 and Exon3; CSTs are labelled as intergenic, intronic or exonic according to their position relative to the gene]
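The three labels on this slide (intergenic, intronic, exonic) amount to an interval comparison between a CST and a known gene. A minimal sketch, assuming inclusive (start, end) coordinates on one chromosome and assuming that any overlap with an exon makes a CST exonic. The function names and the example coordinates are hypothetical, and CSTminer's actual annotation rules are not given in the slides.

```python
from typing import List, Tuple

# (start, end), both inclusive; a hypothetical coordinate convention.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def annotate_cst(cst: Interval, gene: Interval, exons: List[Interval]) -> str:
    """Label a CST relative to one known gene and its exons."""
    if any(overlaps(cst, exon) for exon in exons):
        return "exonic"
    if overlaps(cst, gene):
        # Inside the gene span but touching no exon.
        return "intronic"
    return "intergenic"


# Made-up gene with three exons, mirroring the slide's Exon1..Exon3 drawing.
gene = (1000, 5000)
exons = [(1000, 1200), (2000, 2300), (4500, 5000)]
for cst in [(2100, 2200), (3000, 3100), (6000, 6100)]:
    print(cst, annotate_cst(cst, gene, exons))
```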
Genome annotation of Coding CSTs
984 coding CSTs in intergenic regions, 423 CSTs in intronic regions
Clustered intergenic/coding CSTs may represent novel genes
• ≥ 4 clustered coding CSTs (> 90% of genes)
• Typical gene: average length 57 kbp
Cluster Definition
Step I: precluster definition
[Diagram: CSTstart i, CSTstart i+1, … along the genomic sequence, grouped into preclusters (pc)]
Step II: cluster building
[Diagram: clusters drawn along the genomic sequence]
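The slide names the two steps but does not give the grouping rules, so the following is only one possible reading, not the authors' algorithm. Assumptions: Step I groups consecutive CSTs whose start positions are within a hypothetical `max_gap`; Step II merges neighbouring preclusters while the merged span stays within a hypothetical `max_span`, then keeps clusters with at least four CSTs. The minimum of four and the 57 kbp span used in the example come from the "novel genes" slide; their use as clustering parameters is an assumption.

```python
from typing import List


def build_preclusters(starts: List[int], max_gap: int) -> List[List[int]]:
    """Step I (assumed rule): consecutive CST starts no more than
    max_gap apart are placed in the same precluster."""
    preclusters: List[List[int]] = []
    for start in sorted(starts):
        if preclusters and start - preclusters[-1][-1] <= max_gap:
            preclusters[-1].append(start)
        else:
            preclusters.append([start])
    return preclusters


def build_clusters(preclusters: List[List[int]], max_span: int,
                   min_csts: int = 4) -> List[List[int]]:
    """Step II (assumed rule): merge a precluster into the current cluster
    while the merged span stays within max_span, then drop clusters
    holding fewer than min_csts CSTs."""
    clusters: List[List[int]] = []
    for pc in preclusters:
        if clusters and pc[-1] - clusters[-1][0] <= max_span:
            clusters[-1].extend(pc)
        else:
            clusters.append(list(pc))
    return [c for c in clusters if len(c) >= min_csts]


# Made-up CST start coordinates on one genomic sequence.
cst_starts = [100, 900, 1500, 30000, 31000, 200000]
pcs = build_preclusters(cst_starts, max_gap=5000)
print(pcs)
print(build_clusters(pcs, max_span=57000))
```

With these made-up inputs the first five CSTs end up in one cluster and the isolated CST at 200000 is discarded. Different values of `max_gap` and `max_span` would change that outcome.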
Clustered intergenic/coding CSTs: Supporting features
-> 301 clustered CSTs (out of 984 intergenic CSTs)
-> 25 clusters
• 20/25 Genscan/Twinscan
• 20/25 RefSeq, TrEMBL, UniGene
• 18/25 ESTs
• 19/25 mouse Ensembl genes
• 11/25 human Ensembl genes (new release)
• 4 unsupported clusters
CST cluster 15P1 corresponds to a newly annotated gene
Intronic CSTs may represent novel splicing isoforms CST 22_E_936
Conclusion
What CSTminer does:
• Detects coding conserved regions
• With CST clustering, it is possible to detect coding gene regions
• May support any other kind of gene prediction
• May identify splice variants
What CSTminer doesn’t do:
- Doesn’t detect gene structure and exon boundaries
- May merge proximal genes
What we can do next:
- improve cutoff definition
- multi-species comparison
- improve clustering definition
Pros:
- No annotation or known sequences (mRNAs, ESTs…) required
- Easy to automate
- No manual work
- Very fast
CPS computation - II
for f = 1, 2, 3, -1, -2, -3 (the six reading frames)
The CPS computation requires some genetic divergence between aligned CSTs (i.e. Ka > 0 and Ks > 0).
But what is the minimum divergence needed to obtain a reliable CPS?
… and what is the CPS cutoff that discriminates coding from non coding?
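The index f = 1, 2, 3, -1, -2, -3 runs over the six reading frames of a sequence, three on the forward strand and three on the reverse complement. The sketch below only enumerates the codons in each frame. It does not compute the CPS, because the formula itself is not recoverable from the slide text. The function name `reading_frames` is hypothetical.

```python
from typing import Dict, List


def reading_frames(seq: str) -> Dict[int, List[str]]:
    """Return the complete codons read in each of the six frames.

    Positive frames read the given strand; negative frames read its
    reverse complement. Frame |f| starts at offset |f| - 1.
    """
    forward = seq.upper()
    complement = str.maketrans("ACGT", "TGCA")
    reverse = forward.translate(complement)[::-1]

    frames: Dict[int, List[str]] = {}
    for f in (1, 2, 3, -1, -2, -3):
        strand = forward if f > 0 else reverse
        offset = abs(f) - 1
        usable = len(strand) - offset
        end = offset + usable - usable % 3  # drop a trailing partial codon
        frames[f] = [strand[i:i + 3] for i in range(offset, end, 3)]
    return frames


# Made-up nine-base example sequence.
for frame, codons in reading_frames("ATGGCCTAA").items():
    print(frame, codons)
```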
Definition of CST minimum divergence
To assess the minimum divergence for reliable CPS computation and to define optimal cutoff values we used two benchmark datasets (coding and non coding).
[Figure: x-axis "% of divergence", with ticks at 2, 3 and 5]
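A minimum-divergence filter of this kind could be sketched as follows. The slides do not say how divergence is measured, so this example assumes the simplest definition: the percentage of ungapped aligned positions at which the two sequences differ. The 5% threshold matches the "≥ 5%" label on the cutoff slide. The names `percent_divergence` and `MIN_DIVERGENCE` are hypothetical.

```python
MIN_DIVERGENCE = 5.0  # percent; taken from the "≥ 5%" label in the slides


def percent_divergence(a: str, b: str) -> float:
    """Percentage of ungapped aligned positions where a and b differ.

    Assumption: both strings are rows of the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned positions")
    mismatches = sum(1 for x, y in pairs if x != y)
    return 100.0 * mismatches / len(pairs)


# Made-up aligned pair: 20 positions, 2 mismatches.
human = "ATGGCCAAGCTGACCGATGA"
mouse = "ATGGCTAAGCTGACAGATGA"
divergence = percent_divergence(human, mouse)
print(divergence, divergence >= MIN_DIVERGENCE)
```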