C. elegans – “Back To The Future”.



  1. C. elegans – “Back To The Future”. Paul Davis (aka Huey) Informatics Meeting

  2. Overview
     • C. elegans Gene Prediction
     • Past.
       • Overview of genome project.
       • 1st Pass annotation.
     • Present.
       • Script based list generation.
       • Gene Refinement (Transcript Based).
       • Small peptides.
       • C. briggsae comparison.
       • Large external gene family analysis.
     • Future.
       • Un-annotated overlap between gene predictors.
       • Gene Family curation.
       • Multiple species comparison.
     • Summary.

  3. Past
     • Genome Project
       • C. elegans was the 1st multicellular organism to have its genome published (1998).
       • 97 Mb of sequence made up of
         • 2527 cosmids,
         • 257 YACs,
         • 113 fosmids,
         • 44 PCR products.
       • 5 gaps closed by 2002.
       • Annotated to find 19,099 protein coding genes.
     • 1st pass annotation: Genefinder (Phil Green, WASHU).
     • Curators appraised gene predictions on a clone by clone basis as the clones were finished.

  4. Genome View
     [Figure: gene models labelled Confirmed, Partially Confirmed and Predicted. Colour corresponds to strand, not confidence.]

  5. Stats for WS141
     • Currently 22,436 gene predictions.
     • 11,169 “un-touched”
       • + good 1st pass annotation.
       • + re-annotated >50%.
     • 2,576 Confirmed status.
       • Unlikely to change.
     • 5,624 Partially Confirmed.
       • Potentially modified.
     • 2,969 Predicted.
       • Potentially removed or altered.

  6. Present: (re)annotation of a genome
     • Painting the Forth Rail Bridge
     • Painting by numbers

  7. (Re)annotating a genome
     • We adopted a ‘paint by numbers’ approach: all gene models are appraised automatically on a regular basis.
     • The appraisal generates lists of genes/features to be checked by human annotators.
     [Figure: workflow diagram with the labels Appraise, Curate, Process and report, Release and synchronise.]

  8. Script Based Targeted Annotation
     • Create a number of curation lists
       • Confirmed introns not in gene models.
       • ESTs/mRNAs in introns.
       • Overlapping gene predictions.
       • Predictions overlapping known repeats.
       • Short genes (<150 bp).
       • Short introns (<40 bp).
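Three of the curation lists on this slide (short genes, short introns, overlapping predictions) can be illustrated with a minimal sketch. Only the 150 bp and 40 bp thresholds come from the slide. The `GeneModel` class, the `build_curation_lists` function, the use of coding length for the gene-size test, and the restriction of overlaps to the same strand are all assumptions made for illustration, not the scripts actually used.

```python
from dataclasses import dataclass

# Thresholds taken from the slide; all names below are illustrative.
MIN_GENE_BP = 150
MIN_INTRON_BP = 40


@dataclass(frozen=True)
class GeneModel:
    name: str
    chrom: str
    strand: str
    exons: tuple  # ((start, end), ...), 1-based inclusive, sorted by start

    @property
    def coding_length(self):
        # Assumption: "short gene" is judged on summed exon length.
        return sum(end - start + 1 for start, end in self.exons)

    @property
    def introns(self):
        # Gaps between consecutive exons.
        return [(a[1] + 1, b[0] - 1) for a, b in zip(self.exons, self.exons[1:])]

    @property
    def span(self):
        return self.exons[0][0], self.exons[-1][1]


def build_curation_lists(models):
    """Return gene names (or name pairs) that a curator should inspect."""
    lists = {"short_gene": [], "short_intron": [], "overlapping": []}
    for m in models:
        if m.coding_length < MIN_GENE_BP:
            lists["short_gene"].append(m.name)
        if any(end - start + 1 < MIN_INTRON_BP for start, end in m.introns):
            lists["short_intron"].append(m.name)

    # Sweep each chromosome/strand in start order, remembering the model
    # that reaches furthest right; any later model starting inside it is
    # reported as overlapping that model.
    ordered = sorted(models, key=lambda m: (m.chrom, m.strand, m.span[0]))
    reach = None
    for m in ordered:
        same_track = reach is not None and (reach.chrom, reach.strand) == (m.chrom, m.strand)
        if same_track and m.span[0] <= reach.span[1]:
            lists["overlapping"].append((reach.name, m.name))
        if not same_track or m.span[1] > reach.span[1]:
            reach = m
    return lists
```

The sweep pairs each overlapping model with one earlier model rather than listing every overlapping pair, which is enough to put the gene on a list for manual review.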

  9. Transcript Based Refinements
     • Automatic import of transcript data during our build cycle.
       • C. elegans mRNAs/cDNAs.
       • C. elegans ESTs.
       • Nematode ESTs.
     • Processed and aligned to genome.
     • This produces data for our curation lists.

  10. Gene Refinement
      • EST data points to 5’ extension and 3’ extension.
      • Identified due to confirmed introns not in a gene model.
      [Figure: Fmap view with the labels 5’, 3’, Transcript Data, Confirmed intron, Old prediction, Refined Prediction.]
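The "confirmed introns not in a gene model" check can be sketched as a set difference between introns seen in transcript alignments and introns in the current annotation. This is a hedged illustration only: the tuple layout `(chrom, strand, start, end)`, the function name `unmodelled_introns` and the ranking by number of supporting transcripts are assumptions, not a description of the build pipeline.

```python
from collections import Counter


def unmodelled_introns(transcript_introns, model_introns):
    """List transcript-confirmed introns that no gene model contains.

    transcript_introns: iterable of (chrom, strand, start, end), one entry
        per supporting transcript alignment (repeats are expected).
    model_introns: dict mapping gene name -> iterable of intron tuples.
    Returns [(intron, support_count), ...], best supported first.
    """
    annotated = {intron for introns in model_introns.values() for intron in introns}
    support = Counter(transcript_introns)
    missing = [(intron, n) for intron, n in support.items() if intron not in annotated]
    # Most-supported introns first; ties broken by position for stable output.
    return sorted(missing, key=lambda item: (-item[1], item[0]))
```

An intron that is missing from the annotation but sits just outside a predicted gene is the kind of evidence that points to a 5’ or 3’ extension.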

  11. Not all <150bp Predictions are Bad?
      • Small peptides can be real.
      • H12D21.1 is a 34 aa peptide that appeared on a curation list.
      • Investigated.
      • Prediction had peptide similarity to 2 other C. elegans proteins.
      • Multiple sequence alignment proved interesting.

  12. H12D21.1 + Homols: Fmap View & M.S.A.
      [Figure labels: Gene Prediction, Protein Homology Blocks, SignalP cleavage site.]

  13. Expanded Family
      • Used tBlastn to identify other regions in the genome.
      • Annotated these ORFs to give 9 additional family members.
      • These have been called nspa-1 to nspa-12 (Nematode Specific Peptide family A).
      [Figure labels: Pseudogene, New Family Members.]

  14. C. briggsae Comparison
      • C. elegans vs C. briggsae
      • C. briggsae hybrid gene set analysis (Avril Coghlan).
      • Detailed in PLoS Biol 2003 1:166-192, “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
      • WormBase has worked to incorporate the ~1300 new genes reported.

  15. Coding Gene Predictions Over Time.
      [Figure: number of Coding Genes and of Predictions Including Isoforms (y-axis, 17500 to 22500) plotted by Release (x-axis, WS21 to WS124). Annotations: “briggsae hybrid gene set”; “Increase in CDS due to 1st round of new genes identified by comparison with briggsae.”]

  16. Large family analysis
      • Worm Community Members.
      • Multiple sequence alignments of some large families.
      • 7 TM receptor families
        • 1700 family members.
        • Sub families have been worked on by multiple worm community members.
          • Hugh Robertson (University of Illinois)
          • Jim Thomas (University of Washington, Seattle)
          • Jack Chen (CSH Laboratories)

  17. Future
      • Identify new avenues for gene refinement and identification.
        • Looking at predictor overlaps
          • (Genefinder/Twinscan overlaps) vs (WormBase Gene set)
        • In house protein family analysis
        • Multiple species comparisons

  18. Predictor Overlaps.
      [Figure labels: Genefinder Prediction, Twinscan Prediction, New CDS Prediction, Strong Splicing, Good briggsae DNA::DNA Alignment.]
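The comparison of (Genefinder/Twinscan overlaps) against the WormBase gene set could be sketched as follows: keep regions where both predictors agree on the same strand, then drop any that touch a curated gene. The interval layout `(chrom, strand, start, end)` and the function names are hypothetical, and the pairwise loop is a simple quadratic scan chosen for readability rather than speed.

```python
def overlaps(a, b):
    """True if two (chrom, strand, start, end) intervals share any base."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def unannotated_agreements(genefinder, twinscan, curated):
    """Regions predicted by both tools that overlap no curated gene."""
    hits = set()
    for g in genefinder:
        for t in twinscan:
            if not overlaps(g, t):
                continue
            # Region on which the two predictions agree.
            shared = (g[0], g[1], max(g[2], t[2]), min(g[3], t[3]))
            if not any(overlaps(shared, c) for c in curated):
                hits.add(shared)
    return sorted(hits)
```

Each region returned is only a candidate for a new CDS; supporting evidence such as splice-site strength or a C. briggsae DNA alignment would still need to be checked by a curator.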

  19. Gene Family Analysis
      • Protein alignments of multiple family members can refine gene predictions.
        • ClustalW
        • BLAST
      • Main problems identified
        • Incorrect splicing
        • Truncations
        • Invalid extensions
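One way a family alignment can expose an invalid extension or a mis-spliced exon is a run of columns in which only one member has residues. The sketch below scans an already-computed alignment (for example one exported from ClustalW) for such runs. The function name `private_insertions`, the dict-of-gapped-strings input and the 10-residue default threshold are assumptions for illustration; truncations are not handled.

```python
def private_insertions(alignment, min_len=10):
    """Find runs of columns where exactly one sequence is not a gap.

    alignment: dict mapping sequence name -> gapped string ('-' for gaps),
        all of equal length.
    Returns [(name, start_col, end_col), ...] with 0-based, end-exclusive
    column ranges at least min_len columns long.
    """
    if not alignment:
        return []
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("aligned sequences must have equal length")

    flagged = []
    run_owner, run_start = None, 0
    # Iterate one column past the end so the final run is flushed.
    for col in range(length + 1):
        owner = None
        if col < length:
            present = [name for name, seq in alignment.items() if seq[col] != "-"]
            if len(present) == 1:
                owner = present[0]
        if owner != run_owner:
            if run_owner is not None and col - run_start >= min_len:
                flagged.append((run_owner, run_start, col))
            run_owner, run_start = owner, col
    return flagged
```

A flagged run is only a prompt to look at the gene model in its genomic context; a genuine family-specific insertion would look the same to this check.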

  20. Example of a Small Family Analysis.
      • Problematic alignment
        • F56H6.9 appears to have 18 aa of extra sequence.
        • E03H4.4 seems to be lacking sequence.

  21. Fmap View of F56H6.9

  22. Example of Problem.
      • Problematic alignment
      • Alignment following annotation.

  23. Multiple Species Comparison.
      • More nematode genomes are on their way
        • C. remanei
          • shotgun in progress
          • Blast server available http://genome.wustl.edu/projects/cremanei/
        • PB2801
          • shotgun in progress
        • C. japonica
          • shotgun in progress

  24. elegans/briggsae/remanei Alignment for nspa-like peptides.

  25. Summary
      • Gene (Re)annotation >7 years.
      • New genes are still being discovered.
      • Primarily Transcript driven.
      • More work on protein families.
      • New strategies for gene prediction and refinement.
        • Using multiple gene predictors
        • Multi species comparison

  26. Acknowledgements
      • Genome Sequencing Center St. Louis
        • Sequencing and finishing teams etc.
        • WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
      • Wellcome Trust Sanger Institute
        • Sequencing and finishing teams etc.
        • WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
      • AceDB: Ed Griffiths, Roy Storey
