Collective annotation of the Ixodes scapularis genome: VectorBase, MSCs and the tick community. Daniel Lawson, VectorBase BRC6 28th October 2008
Arthropod vectors of human pathogens: Lutzomyia, Phlebotomus, Glossina, Ixodes, Pediculus, Aedes, Culex, Rhodnius, Anopheles
Deer tick Ixodes scapularis • Vector of Lyme disease (spirochete Borrelia burgdorferi) • Estimated genome size of 2.1 Gb • Sequenced strain: Wikel • 12th generation from ticks sourced from New York, Oklahoma & Connecticut • First chelicerate genome to be sequenced
Genome annotation cycle [diagram; labels: Assembly; ESTs, cDNAs; Repeat library (TEs etc.); Other genomes, gene sets; Protein domains; Automatic gene build; Manual annotations; Community annotations]
Generating sequence • Sequencing undertaken by established sequencing centres (e.g. Broad, JCVI) • Initial assembly annotated in collaboration with the sequencing centre(s) • 19,300,000 trace reads generated • Approx. 6x WGS • 570K BAC end sequences • Assembly produced at JCVI • 194K EST sequences
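As a rough consistency check on the sequencing figures, nominal coverage is total sequenced bases divided by genome size. The sketch below inverts that relation to estimate the mean read length implied by the slide's numbers. The function name is illustrative, and the 2.1 Gb genome size is the estimate quoted earlier in the talk, so the result is only approximate.

```python
def implied_mean_read_length(n_reads, coverage, genome_size_bp):
    """Mean read length (bp) implied by coverage = n_reads * read_len / genome_size."""
    return coverage * genome_size_bp / n_reads

reads = 19_300_000      # trace reads (from the slide)
coverage = 6.0          # approximate WGS coverage (from the slide)
genome_size = 2.1e9     # estimated genome size, 2.1 Gb (from the slide)

mean_len = implied_mean_read_length(reads, coverage, genome_size)
print(f"Implied mean read length: {mean_len:.0f} bp")
```

The implied value of roughly 650 bp is plausible for capillary trace reads, which suggests the quoted read count, coverage and genome size are mutually consistent.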
Assembly statistics • This WGS project has the project accession ABJB000000000. The current version of the project (01) has the accession number ABJB010000000, and consists of 1,141,594 scaffolds (ABJB010000001-ABJB011141594). • Released assembly IscaW1 • 570,637 contigs • 369,495 supercontigs • Assembled coverage of 3.8x
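The record count quoted for the WGS project can be recovered from the accession range itself. The sketch below assumes the layout visible in the accessions on the slide (a four-letter project prefix, a two-digit version, then a serial number); the function name is hypothetical and this is not an official parser.

```python
import re

def wgs_range_count(first, last):
    """Count records in an inclusive WGS accession range.

    Assumes accessions shaped like ABJB010000001:
    4-letter prefix, 2-digit version, numeric serial.
    """
    pattern = re.compile(r"^([A-Z]{4})(\d{2})(\d+)$")
    m1, m2 = pattern.match(first), pattern.match(last)
    if not m1 or not m2:
        raise ValueError("unrecognised accession format")
    if m1.group(1, 2) != m2.group(1, 2):
        raise ValueError("accessions belong to different projects or versions")
    return int(m2.group(3)) - int(m1.group(3)) + 1

n = wgs_range_count("ABJB010000001", "ABJB011141594")
print(n)
```

For the range on the slide this gives 1,141,594, matching the stated number of records.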
Preparing for gene build • Repeat masking • Analyses to identify repeat elements • RepeatScout • RECON • Standard tandem-repeat & low-complexity filtering • Collate data sets • Transcripts (cDNA & EST data) • Peptides (taxonomic groupings, inc. Daphnia pulex) • Train gene predictors, mainly Augustus (JCVI)
Annotation plan • First-pass gene prediction • Focused on protein-coding genes (CDSs) • Semi-automated approach • This is not manual curation • Involvement of community where possible • Timely delivery of gene set
Gene Prediction • Each group/centre has its own gene prediction pipeline/protocol • Each group produces a 1st pass 'best guess' set of predictions (0.5 sets, public release) • These sets are merged into a single set (1.0 set, not released) • Quality control activities (1.1 set, public release) • Which is annotated with protein features • ... and released to the wider world
Merging gene predictions [flowchart] • Inputs: Gene set #1, Gene set #2 • Reduce to single predictions per locus • Compare exon/intron structures • Outcome classes: Identical structures; Compatible structures; Different structures; Merge/Split structures; Complex; No Map • Add isoform predictions based on EST/Peptide data • Result: canonical gene set
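The comparison step in this flowchart can be illustrated with a minimal sketch. It is not the pipeline used for the IscaW1 merge: the slide does not define the classes, so the rule used here (two overlapping models are "compatible" when one model's intron set is contained in the other's) is an assumption, and the merge/split and complex classes are not covered. A gene model is represented as a list of (start, end) exon coordinates on one strand, which is also an assumption.

```python
def introns(exons):
    """Introns implied by a sorted list of (start, end) exon intervals."""
    return {(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(exons, exons[1:])}

def classify(model_a, model_b):
    """Classify two single-transcript gene models at a locus (assumed rules)."""
    a, b = sorted(model_a), sorted(model_b)
    if a == b:
        return "identical"
    # Models whose genomic spans do not overlap are not at the same locus.
    if a[-1][1] < b[0][0] or b[-1][1] < a[0][0]:
        return "no overlap"
    ia, ib = introns(a), introns(b)
    # Assumed definition: no conflicting splice structure if one intron
    # set is a subset of the other (e.g. only the terminal exons differ).
    if ia <= ib or ib <= ia:
        return "compatible"
    return "different"

set1 = [(100, 200), (300, 400), (500, 600)]
set2 = [(150, 200), (300, 400), (500, 650)]   # same introns, different ends
set3 = [(100, 200), (320, 400), (500, 600)]   # one splice site differs

print(classify(set1, set1), classify(set1, set2), classify(set1, set3))
```

Comparing intron sets rather than exon boundaries means that differences confined to the outer ends of the first and last exons do not count as a structural conflict, which is one common way of treating predictions that disagree only at their termini.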
Merge annotation comparisons
Examples Isoform-compat Isoform-diff
Examples Merge/Splits Difficult
GBrowse viewer
VectorBase browser
Final gene set (IscaW1.1) • 20,486 protein-coding genes • 48% have a Pfam domain • 40% have supporting EST evidence • 8,138 tRNAs • Over-prediction of Ser (4,425) and Thr (1,527) tRNAs • 301 ncRNAs • Submitted to GenBank last week; release to be coordinated in the next couple of weeks
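The percentages on this slide can be turned into approximate counts, and the scale of the tRNA over-prediction made explicit, with a few lines of arithmetic. The variable names are illustrative; because the slide's percentages are rounded, the derived gene counts are estimates rather than reported values.

```python
genes = 20_486            # protein-coding genes (from the slide)
trnas = 8_138             # predicted tRNAs (from the slide)
ser, thr = 4_425, 1_527   # Ser and Thr tRNA predictions (from the slide)

with_pfam = round(genes * 0.48)   # approximate count with a Pfam domain
with_est = round(genes * 0.40)    # approximate count with EST support
ser_thr_fraction = (ser + thr) / trnas

print(f"~{with_pfam} genes with a Pfam domain")
print(f"~{with_est} genes with EST support")
print(f"Ser + Thr share of tRNA predictions: {ser_thr_fraction:.0%}")
```

Ser and Thr together account for about 73% of all tRNA predictions, which is why the slide flags them as over-predicted.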
Genome annotation cycle [diagram; labels: Assembly; ESTs, cDNAs; Repeat library (TEs etc.); Other genomes, gene sets; Protein domains; Automatic gene build; Manual annotations; Community annotations]
Community annotation [workflow diagram; labels: Researcher; Web submission; GFF3; Community representative; Appraisal; Approval; CHADO; Gene Build] • Total: 13,339 entries • An. gambiae 9,423 • Cx. quinquefasciatus 2,598 • Ae. aegypti 1,281 • Ix. scapularis 37
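GFF3 appears in this workflow as one of the submission formats. As a minimal sketch of what a basic check on a submitted feature line might involve, the code below parses one GFF3 record into its nine standard columns and decodes the attribute column. It is not VectorBase's submission code; the function name and the example record (sequence ID, source and gene name) are made up for illustration.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (minimal validation only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    if record["start"] > record["end"]:
        raise ValueError("start must not exceed end")
    attrs = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # GFF3 escapes reserved characters with URL percent-encoding.
            attrs[unquote(key)] = unquote(value)
    record["attributes"] = attrs
    return record

# Hypothetical record, not taken from any real submission.
example = ("supercont_hypothetical_1\tcommunity\tgene\t1200\t5400\t.\t+\t.\t"
           "ID=gene0001;Name=hypothetical%20gene")
rec = parse_gff3_line(example)
print(rec["type"], rec["start"], rec["end"], rec["attributes"]["Name"])
```

A real submission pipeline would also need to check parent/child relationships between gene, mRNA and exon features, which this sketch leaves out.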
Community annotation track in browser
Lessons • Annotation plan for sequencing and annotation of new genomes is well established (MSC & BRC) • Clearly defined data release strategies (0.5, 1.0 & 1.1) • Monthly conference calls • Face-to-face meeting when merging 0.5 gene predictions • Coordinated release between MSC, VectorBase and GenBank
But we can always improve • Agreement on project/public identifiers at the start of the project • Primarily contigs and supercontigs • Overall nomenclature applied as final step in annotation • More QC before the major milestones • Better communication
Acknowledgements • EMBL-EBI: Ewan Birney, Martin Hammond, Daniel Lawson, Karyn Megy • Harvard: Bill Gelbart, Kathy Campbell • IMBB: Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas • Imperial: Fotis Kafatos, George Christophides, Bob MacCallum, Seth Redmond • Notre Dame: Frank Collins, Greg Madey, Scott Emrich, Ryan Butler, Katie Cybulski, Nate Konopinski, Rob Bruggner (alumni), E.O. Stinson (alumni) • Aedes, Anopheles, Culex, Ixodes: Frank Collins, Neil Lobo, Peter Atkinson, Peter Arensburger, Catherine Hill, Jason Meyer, Dave Severson, Neil Lobo • Colleagues: Sequencers (JCVI & Broad Institute); BRCs (Pathema, ApiDB); Ensembl (Genebuilders, Web, Compara, Core, Outreach)