WormBase - and the not so stable genome. Paul Davis, WormBase. Overview. Genome Overview Project. Layout. Data as of 1998. Curated Gene Set and Genome Genome stats. Gene stats. Gene curation. User Community and catering for their needs. The Genome Sequencing Project.
WormBase - and the not so stable genome. Paul Davis, WormBase Informatics Meeting
Overview
• Genome Overview
  • Project.
  • Layout.
  • Data as of 1998.
• Curated Gene Set and Genome
  • Genome stats.
  • Gene stats.
  • Gene curation.
• User Community and catering for their needs.
The Genome Sequencing Project
• Clone-based sequencing venture between the Genome Sequencing Center (St. Louis) and Sanger.
• C. elegans was the first multicellular organism to have its genome published.
• 97 Mb made up of:
  • 2527 cosmids,
  • 257 YACs,
  • 113 fosmids,
  • 44 PCR products.
• 5 major clone gaps.
• Annotated to find 19,099 protein-coding genes.
Clone-Based Strategy
• Random clone library produced (30-40 kb).
• Tiling path selected and sequenced.
• YAC library produced (fragments used to fill gaps).
• Remaining gaps amplified by PCR and sequenced.
C. elegans genome sequence
• Clones are assembled into superlinks based on their overlap tags.
• The 6 chromosomes are split into 17 superlinks, which are maintained by:
  • Genome Sequencing Center, St. Louis
  • Sanger.
Genomic Change Since The 1998 Science Paper.
[Figure: genomic change over time; annotations: "Contiguous genome!" and "Prior to scheduled release cycle".]
Since 1998 Science Publication
• Continue to re-annotate the gene set based on a number of different sources.
  • Transcript data.
  • Comparison to protein databases.
  • Comparison to the C. briggsae data sets.
  • Literature.
• Last gap closed in October 2002.
• 2.42% increase in genome size.
  • Identified from a number of sources.
• Relatively stable genome.
  • Small number of sequencing errors.
  • Small number of repeat errors.
  • List of errors to be validated.
Genomic Change Since Final Gap Closure Oct 2002
[Figure: chart of genomic change by source (sequence updates, repeat assembly issues, 3rd party submission); data labels shown: 5909, 9416, 4419, 8133.]
How errors are identified.
• Gene predictions may have an incorrect structure compared to available experimental data:
  • mRNA,
  • EST.
• Identification is biased towards coding regions.
• A curator may identify a prediction that avoids problems by:
  • Use of incorrect splice donors/acceptors on intron-exon boundaries.
  • Premature truncation of the prediction.
  • Splicing out of internal stop codons.
  • An extra intron to allow for a frame shift.
How errors are identified (cont.)
• WormBase users
  • Identification of a single-copy prediction annotated as a pseudogene that the user, from their own research or observations, believes is not a pseudogene.
  • Or vice versa.
  • Identification of a prediction that does not follow the "family" structure, missing out a motif/domain to avoid a problem region.
• Pseudogenes may be real or may reflect a sequencing error.
• Each case is investigated:
  • Clone in archive,
  • PCR,
  • Comparison to multiple transcript reads.
Example of a sequencing error.
[Figure: before and after views of mRNA and EST alignments against reading frames 1, 2 and 3.]
• Single bp insertion into the genome causing a shift from frame 2 to frame 3.
• Investigated and corrected: base removed, allowing the original predictions to be corrected.
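The effect of a single inserted base on the reading frame can be illustrated with a toy example. This is only a sketch: the sequences, the insertion position and the helper names (`codons`, `remove_base`) are invented for illustration and are not taken from the genome or from WormBase tooling.

```python
def codons(seq, frame=0):
    """Split seq into complete codons, starting at a 0-based frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def remove_base(seq, pos):
    """Return seq with the base at 0-based position pos deleted."""
    return seq[:pos] + seq[pos + 1:]


# Hypothetical correct coding sequence (not real C. elegans data).
reference = "ATGGCTGAAACCTGGTAA"

# Simulate a sequencing error: one extra base inserted at index 7.
erroneous = reference[:7] + "G" + reference[7:]

print(codons(reference))   # codons of the correct sequence
print(codons(erroneous))   # codons downstream of the insertion are shifted

# Removing the spurious base restores the original reading frame.
corrected = remove_base(erroneous, 7)
print(corrected == reference)
```

Codons upstream of the insertion are unchanged, while every codon downstream is read in a different frame until the extra base is removed.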
The Present Situation.
• A contiguous genome sequence.
  • Closing the genome into a contiguous sequence has changed the way it can be analysed and has yielded numerous genes that would probably otherwise still be unknown.
• WormBase
  • WormBase has been running since 2000 and has grown to accommodate new data types, to curate existing data, and to help the worm community access and mine this data.
• The needs of the community.
  • Always evolving.
Number of Gene Predictions in genome.
• Gene predictions: 15.7% increase from 1998.
  • 20,066 CDS (22,858 including splice forms).
• Isoforms.
  • EST/mRNA data,
  • Paper evidence.
• New gene predictions.
  • Gene family homology studies,
  • EST/mRNA data.
• Gene predictions also removed & merged.
Coding Gene Predictions Over Time.
[Figure: chart with two series, "Predictions Including Isoforms" and "Coding Genes"; annotations: "Collaboration to find new genes based on multiple strategies.", "Increase in CDS due to new strategies of gene identification", "Pruning bad gene predictions."]
Genome View
[Figure: genome browser view with gene models labelled Predicted, Partially Confirmed and Confirmed.]
Colour corresponds to strand, not confidence.
Analysis of the gene set.
[Figure: chart of the gene set over time; annotations: "Transcript Builder introduced", "OST (Orfeome Sequence Tags)", "70,000 New ESTs submitted to NDB", "New strategies for gene annotation."]
How Gene Curation is Driven.
• We create a number of curation lists:
  • Confirmed introns not in gene models.
  • ESTs/mRNAs in introns.
  • Overlapping gene predictions.
  • Predictions overlapping known repeats.
  • Short genes (<150 bp).
  • Short introns (<40 bp).
• Mainly in maintenance mode.
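Two of the curation lists above, short genes and short introns, lend themselves to a simple sketch. The code below is hypothetical and is not the WormBase pipeline: it assumes a gene model is a list of exon `(start, end)` pairs in 1-based inclusive coordinates, and it assumes "gene length" means summed exon length; the gene names and coordinates are invented.

```python
SHORT_GENE_BP = 150    # threshold from the slide: short genes <150 bp
SHORT_INTRON_BP = 40   # threshold from the slide: short introns <40 bp


def coding_length(exons):
    """Summed exon length (assumed meaning of gene length here)."""
    return sum(end - start + 1 for start, end in exons)


def intron_lengths(exons):
    """Lengths of the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]


def curation_flags(exons):
    """Return the names of the curation lists this model would appear on."""
    flags = []
    if coding_length(exons) < SHORT_GENE_BP:
        flags.append("short_gene")
    if any(length < SHORT_INTRON_BP for length in intron_lengths(exons)):
        flags.append("short_intron")
    return flags


# Invented gene models for illustration only.
models = {
    "geneA": [(100, 160), (200, 260)],      # 122 bp of exon, 39 bp intron
    "geneB": [(1000, 1300), (1400, 1700)],  # 602 bp of exon, 99 bp intron
}

for name, exons in models.items():
    print(name, curation_flags(exons))
```

The other lists (confirmed introns missing from gene models, transcripts in introns, overlaps with repeats) would need alignment and repeat data that this sketch does not model.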
New Direction for Gene Curation.
• Looking at gene predictor overlaps vs the WormBase gene set.
• Protein family analysis.
• Multiple species comparisons.
• Other transcript data.
  • TEC-RED
  • SAGE
Gene Predictor Overlaps.
• Within WormBase we supply 2 extra gene sets, generated by:
  • Genefinder
  • Twinscan
• A former curator analysed regions where the two predictors overlap but we have no curated gene.
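The overlap analysis described above can be sketched as an interval comparison. This is an illustrative assumption about how such an analysis might be done, not the former curator's actual method: predictions and curated genes are modelled as `(start, end)` pairs on a single strand of one chromosome, and all coordinates are invented.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def candidate_regions(genefinder, twinscan, curated):
    """Spans where both predictors agree on a gene but no curated gene exists."""
    candidates = []
    for gf in genefinder:
        for tw in twinscan:
            if not overlaps(gf, tw):
                continue
            # Combined span covered by the two overlapping predictions.
            span = (min(gf[0], tw[0]), max(gf[1], tw[1]))
            if not any(overlaps(span, gene) for gene in curated):
                candidates.append(span)
    return candidates


# Invented coordinates for illustration only.
genefinder = [(100, 500), (2000, 2600)]
twinscan = [(300, 700), (2500, 3000)]
curated = [(2400, 2900)]

print(candidate_regions(genefinder, twinscan, curated))
```

The pairwise loop is quadratic, which is fine for a sketch; a genome-scale comparison would more likely sort the intervals or use an interval tree.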
Predictor Overlaps.
[Figure: genome browser view showing a Genefinder prediction, a Twinscan prediction and the new CDS prediction; annotations: "Strong Splicing" and "Good briggsae DNA::DNA Alignment".]
C. briggsae Comparison
• C. elegans vs C. briggsae
• C. briggsae hybrid gene set analysis (Avril Coghlan).
• Detailed in PLoS Biol 2003 1:166-192, "The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics."
• WormBase has worked to incorporate the ~1300 new genes reported.
• A number of other nematodes are being sequenced, and this data will be a main focus of our curation efforts in the future.
Coding Gene Predictions Over Time.
[Figure: chart with two series, "Predictions Including Isoforms" and "Coding Genes".]
Our User Community.
• Within the worm community there are different needs from a sequenced genome.
  • Bioinformatics groups wanting stability to perform global analysis.
  • Researchers wanting the latest, accurate sets of gene predictions.
How Are We Catering for Different Needs?
• WormBase 3-week release cycle.
  • Quick turnaround for corrections and data.
  • Good for research groups interested in subsets of genes.
  • Bad for global analysis groups, as sequence changes throw out coordinates.
• Introduction of WormBase "Frozen" release versions.
  • These take place every 10 releases (~6 months).
  • 1st "Frozen" release was May 2003 (WS100).
  • Separate websites (http://ws**0.wormbase.org/).
  • Remain available on the FTP site.
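The frozen-release schedule can be expressed as a small calculation, which may help a group decide which release to pin an analysis to. This sketch assumes the schedule is strictly every 10th release starting at WS100, as stated above; the function names are hypothetical and no WormBase API is involved.

```python
FIRST_FROZEN = 100     # WS100, the first frozen release (May 2003)
FREEZE_INTERVAL = 10   # a frozen release every 10 releases


def is_frozen(release):
    """True if the numbered release falls on the assumed freeze schedule."""
    return (release >= FIRST_FROZEN
            and (release - FIRST_FROZEN) % FREEZE_INTERVAL == 0)


def latest_frozen(release):
    """Most recent frozen release at or before the given release, or None."""
    if release < FIRST_FROZEN:
        return None
    return release - (release - FIRST_FROZEN) % FREEZE_INTERVAL


for ws in (99, 100, 105, 117):
    frozen = latest_frozen(ws)
    label = f"WS{frozen}" if frozen is not None else "none"
    print(f"WS{ws}: frozen={is_frozen(ws)}, latest frozen release={label}")
```

A global analysis would then cite the frozen release it used, so its coordinates stay valid while the live database continues to change every three weeks.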
Genomic Change Since Final Gap Closure Oct 2002
[Figure: chart of genomic change since the final gap closure; data labels shown: 5909, 9416, 4419, 8133.]
Frozen Release Effects
• User benefits.
  • Allows bioinformatics groups to coordinate analyses.
  • Can reference a specific release.
  • Continued availability of release.
  • Stability/insulated from sequence changes.
• Effects on WormBase.
  • Requires more resources.
  • Curation and sequence updates can be processed as they are identified.
• Other database resources that use WormBase are encouraged to use frozen releases.
  • NCBI.
Acknowledgements
• Genome Sequencing Center St. Louis
  • Sequencing and finishing teams etc.
  • WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
• Wellcome Trust Sanger Institute
  • Sequencing and finishing teams etc.
  • WormBase team: Richard Durbin, Anthony Rogers, Michael Han, Mary Ann Tuli, Gary Williams
• AceDB: Ed Griffiths, Roy Storey
The End!