C. elegans – “Back To The Future”. Paul Davis (aka Huey) Informatics Meeting
Overview
• C. elegans Gene Prediction
  • Past.
    • Overview of genome project.
    • 1st pass annotation.
  • Present.
    • Script-based list generation.
    • Gene refinement (transcript based).
    • Small peptides.
    • C. briggsae comparison.
    • Large external gene family analysis.
  • Future.
    • Un-annotated overlap between gene predictors.
    • Gene family curation.
    • Multiple species comparison.
• Summary.
Past
• Genome Project
  • C. elegans was the 1st multicellular organism to have its genome published (1998).
  • 97 Mb of sequence made up of
    • 2527 cosmids,
    • 257 YACs,
    • 113 fosmids,
    • 44 PCR products.
  • 5 gaps closed by 2002.
  • Annotated to find 19,099 protein-coding genes.
• 1st pass annotation with Genefinder (Phil Green, WASHU).
• Curators appraised gene predictions clone by clone, as each clone was finished.
Genome View
[Figure: genome view of gene models labelled Confirmed, Partially Confirmed and Predicted.]
• Colour corresponds to strand, not confidence.
Stats for WS141
• Currently 22,436 gene predictions.
• 11,169 “un-touched”
  • + good 1st pass annotation.
  • + re-annotated >50%.
• 2,576 Confirmed status.
  • Unlikely to change.
• 5,624 Partially Confirmed.
  • Potentially modified.
• 2,969 Predicted.
  • Potentially removed or altered.
Present: (re)annotation of a genome
• Painting the Forth Rail Bridge
• Painting by numbers
(Re)annotating a genome
• We adopted a ‘paint by numbers’ approach: all gene models are appraised automatically on a regular basis.
• This generates lists of genes/features to be checked by human annotators.
[Figure: annotation cycle with the stages Appraise, Process and report, Curate, Release and synchronise.]
Script-Based Targeted Annotation
• Create a number of curation lists:
  • Confirmed introns not in gene models.
  • ESTs/mRNAs in introns.
  • Overlapping gene predictions.
  • Predictions overlapping known repeats.
  • Short genes (<150 bp).
  • Short introns (<40 bp).
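Three of these lists (short genes, short introns, overlapping predictions) can be sketched in a few lines of Python. This is an illustration only, not the WormBase build scripts: the `GeneModel` class, the function names and the list keys are hypothetical, "gene length" is assumed to mean summed exon length, and "overlapping" is assumed to mean same-strand overlap of genomic spans.

```python
from dataclasses import dataclass

@dataclass
class GeneModel:
    name: str
    strand: str   # "+" or "-"
    exons: list   # sorted (start, end) pairs, 1-based inclusive

def introns(model):
    """Introns are the gaps between consecutive exons."""
    return [(left_end + 1, right_start - 1)
            for (_, left_end), (right_start, _) in zip(model.exons, model.exons[1:])]

def spliced_length(model):
    return sum(end - start + 1 for start, end in model.exons)

def build_curation_lists(models, min_gene_bp=150, min_intron_bp=40):
    """Return hypothetical curation lists keyed by the reason for review."""
    lists = {"short_genes": [], "short_introns": [], "overlapping": []}
    for m in models:
        if spliced_length(m) < min_gene_bp:
            lists["short_genes"].append(m.name)
        if any(end - start + 1 < min_intron_bp for start, end in introns(m)):
            lists["short_introns"].append(m.name)
    # Sort by start so the inner loop can stop at the first non-overlapping model.
    ordered = sorted(models, key=lambda m: m.exons[0][0])
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.exons[0][0] > a.exons[-1][1]:
                break
            if a.strand == b.strand:
                lists["overlapping"].append((a.name, b.name))
    return lists

# Toy models with made-up names and coordinates.
toy_models = [
    GeneModel("modelA", "+", [(100, 160), (200, 260)]),
    GeneModel("modelB", "+", [(250, 500)]),
    GeneModel("modelC", "-", [(400, 700)]),
]
curation_lists = build_curation_lists(toy_models)
```

Each list is only a set of candidates for a human annotator; as the small-peptide example in this talk shows, a flagged model is not necessarily wrong.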
Transcript-Based Refinements
• Automatic import of transcript data during our build cycle:
  • C. elegans mRNAs/cDNAs.
  • C. elegans ESTs.
  • Nematode ESTs.
• Processed and aligned to the genome.
• This produces data for our curation lists.
Gene Refinement: Fmap View
[Figure: Fmap view with 5’ and 3’ transcript data, a confirmed intron, the old prediction and the refined prediction.]
• EST data points to a 5’ extension and a 3’ extension.
• Identified due to confirmed introns not in a gene model.
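The "confirmed introns not in a gene model" check amounts to a set difference between transcript-confirmed introns and the introns implied by the current gene models. A minimal sketch, assuming introns are compared as exact (start, end, strand) triples; the function name and data layout are hypothetical:

```python
def unexplained_introns(confirmed_introns, gene_models):
    """Return confirmed introns that no gene model contains.

    confirmed_introns: iterable of (start, end, strand) from transcript alignments.
    gene_models: dict of name -> (strand, sorted list of (start, end) exons).
    """
    annotated = set()
    for strand, exons in gene_models.values():
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            annotated.add((left_end + 1, right_start - 1, strand))
    return sorted(set(confirmed_introns) - annotated)

# Toy data with made-up coordinates.
toy_gene_models = {"modelX": ("+", [(100, 200), (300, 400)])}
toy_confirmed = [(201, 299, "+"), (401, 520, "+"), (201, 299, "-")]
to_review = unexplained_introns(toy_confirmed, toy_gene_models)
```

An intron that survives this filter next to an existing model, like the second toy intron, is the kind of evidence that suggests an extension of that model.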
Not All <150 bp Predictions Are Bad
• Small peptides can be real.
• H12D21.1 is a 34 aa peptide that appeared on a curation list.
• Investigated:
  • The prediction had peptide similarity to 2 other C. elegans proteins.
  • The multiple sequence alignment proved interesting.
H12D21.1 + Homols: Fmap View & M.S.A.
[Figure: Fmap view and multiple sequence alignment showing the gene prediction, protein homology blocks and the SignalP cleavage site.]
Expanded Family
[Figure labels: pseudogene, new family members.]
• Used tBlastn to identify other regions in the genome.
• Annotated these ORFs to give 9 additional family members.
• These have been called nspa-1 to nspa-12 (Nematode Specific Peptide family A).
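tBlastn compares a protein query against a nucleotide sequence translated in all six reading frames. The toy sketch below illustrates only the six-frame idea, using exact substring matching of a peptide seed; it is not BLAST (no scoring matrix, no gapped or approximate matches), and the function names are hypothetical. The codon table is the standard genetic code.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_hits(genome, peptide_seed):
    """Return (strand, frame, protein_offset) for every exact seed match."""
    genome = genome.upper()
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide_seed)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(peptide_seed, pos + 1)
    return hits

# Made-up sequence: the peptide MKV is encoded in forward frame 2.
forward_hits = six_frame_hits("CCATGAAAGTTG", "MKV")
reverse_hits = six_frame_hits(reverse_complement("ATGAAAGTT"), "MKV")
```

In real use the search would be run with tBlastn itself; the sketch only shows why a protein query can find family members that have no gene model yet.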
C. briggsae Comparison
• C. elegans vs C. briggsae.
• C. briggsae hybrid gene set analysis (Avril Coghlan).
• Detailed in PLoS Biol 2003 1:166-192, “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
• WormBase has worked to incorporate the ~1300 new genes reported.
Coding Gene Predictions Over Time
[Chart: number of coding genes and of predictions including isoforms (y-axis, 17,500 to 22,500) by release (x-axis, WS21 to WS124). Annotations: "briggsae hybrid gene set"; "Increase in CDS due to 1st round of new genes identified by comparison with briggsae."]
Large Family Analysis
• Worm community members.
• Multiple sequence alignments of some large families.
• 7TM receptor families:
  • 1700 family members.
  • Sub-families have been worked on by multiple worm community members:
    • Hugh Robertson (University of Illinois)
    • Jim Thomas (University of Washington, Seattle)
    • Jack Chen (CSH Laboratories)
Future
• Identify new avenues for gene refinement and identification.
  • Looking at predictor overlaps:
    • (Genefinder/Twinscan overlaps) vs (WormBase gene set).
  • In-house protein family analysis.
  • Multiple species comparisons.
Predictor Overlaps
[Figure labels: Genefinder prediction, Twinscan prediction, new CDS prediction, strong splicing, good briggsae DNA::DNA alignment.]
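The comparison of (Genefinder/Twinscan overlaps) against the WormBase gene set can be read as an interval problem: keep the regions where both predictors agree and no curated gene exists. A minimal sketch under that reading; the function names are hypothetical, intervals are 1-based inclusive (start, end) pairs, and strand is ignored for brevity:

```python
def overlap(a, b):
    """Return the shared (start, end) of two intervals, or None if they are disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None

def unannotated_consensus(genefinder, twinscan, curated):
    """Regions predicted by both gene finders that touch no curated gene."""
    candidates = []
    for g in genefinder:
        for t in twinscan:
            shared = overlap(g, t)
            if shared and not any(overlap(shared, c) for c in curated):
                candidates.append(shared)
    return sorted(candidates)

# Toy intervals with made-up coordinates.
candidates = unannotated_consensus(
    genefinder=[(100, 300), (1000, 1200)],
    twinscan=[(250, 400), (1100, 1300)],
    curated=[(1150, 1500)],
)
```

The nested loop is quadratic, which is fine for a sketch; a genome-scale version would sort the intervals or use an interval tree. Candidate regions would still need supporting evidence such as splicing signals or a C. briggsae alignment before a new CDS is created.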
Gene Family Analysis
• Protein alignments of multiple family members can refine gene predictions.
  • ClustalW
  • BLAST
• Main problems identified:
  • Incorrect splicing.
  • Truncations.
  • Invalid extensions.
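Truncations and invalid extensions tend to show up in a multiple sequence alignment as long runs of columns where one member differs from all the others. The sketch below flags such runs in an already-aligned family (for example, ClustalW output parsed into a dictionary). The function name, the `min_run` threshold and the toy sequences are assumptions for illustration, not the curators' actual procedure.

```python
def flag_alignment_outliers(alignment, min_run=10):
    """Flag members with long runs of lone residues or lone gaps.

    alignment: non-empty dict of name -> aligned sequence (equal lengths, '-' = gap).
    Returns (name, "extra" | "missing", longest_run) tuples.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    flags = []
    for name in names:
        others = [n for n in names if n != name]
        extra = missing = best_extra = best_missing = 0
        for col in range(length):
            mine = alignment[name][col] != "-"
            theirs = [alignment[n][col] != "-" for n in others]
            # "extra": only this member has a residue in the column.
            extra = extra + 1 if mine and not any(theirs) else 0
            # "missing": only this member has a gap in the column.
            missing = missing + 1 if (not mine) and all(theirs) else 0
            best_extra = max(best_extra, extra)
            best_missing = max(best_missing, missing)
        if best_extra >= min_run:
            flags.append((name, "extra", best_extra))
        if best_missing >= min_run:
            flags.append((name, "missing", best_missing))
    return flags

# Toy alignment with made-up names and sequences.
toy_alignment = {
    "memberA": "MKT" + "A" * 12 + "LLVGQ",
    "memberB": "MKT" + "-" * 12 + "LLVGQ",
    "memberC": "MKT" + "-" * 12 + "LL---",
}
flags = flag_alignment_outliers(toy_alignment, min_run=3)
```

A flag only marks a model for inspection: an "extra" run may be a wrongly included exon or a genuine insertion, and a "missing" run may be a truncated prediction or a real deletion.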
Example of a Small Family Analysis
• Problematic alignment:
  • F56H6.9 appears to have 18 aa of extra sequence.
  • E03H4.4 seems to be lacking sequence.
Fmap View of F56H6.9
Example of Problem
• Problematic alignment.
• Alignment following annotation.
Multiple Species Comparison
• More nematode genomes are on their way:
  • C. remanei
    • Shotgun in progress.
    • Blast server available: http://genome.wustl.edu/projects/cremanei/
  • PB2801
    • Shotgun in progress.
  • C. japonica
    • Shotgun in progress.
elegans/briggsae/remanei alignment for nspa-like peptides.
Summary
• Gene (re)annotation has been under way for >7 years.
• New genes are still being discovered.
  • Primarily transcript driven.
  • More work on protein families.
• New strategies for gene prediction and refinement:
  • Using multiple gene predictors.
  • Multi-species comparison.
Acknowledgements
• Genome Sequencing Center, St. Louis
  • Sequencing and finishing teams etc.
  • WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
• Wellcome Trust Sanger Institute
  • Sequencing and finishing teams etc.
  • WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
• AceDB: Ed Griffiths, Roy Storey