MAKER Annotation Process Example of Glossina

Karyn Mégy, Dan Hughes. VectorBase, http://www.vectorbase.org


Presentation Transcript


  1. MAKER Annotation Process: Example of Glossina
     • Karyn Mégy, Dan Hughes
     • VectorBase, http://www.vectorbase.org

  2. Annotation: aims and means
     • Aims
       - Preliminary
       - Locus rather than exact position
     • Means
       - Automatic annotation
         · By similarity
         · Ab initio
       - Manual annotation
         · By regions
         · By gene families

  3. Annotation: similarity vs. ab initio
     • Similarity
       - Similarity to known sequences
       - -> finds only known genes
       - -> depends on the available data (quantity, quality)
     • Ab initio
       - Follows a gene "recipe"
       - -> can potentially identify new genes
       - -> over-predicts

  4. Ensembl annotation
     • Raw genome -> masked genome. Masking: RepeatModeler repeats + known repeats/transposons
     • Evidence layers placed on the masked genome, in order:
       - 1: Community annotation
       - 2: Protein, species-specific
       - 3: Transcriptome, species-specific
       - 4: Protein, 'close' species
       - 5: Ab initio
     • Then: add UTRs (ESTs), functional annotation, ncRNA prediction, pseudogene prediction, GenBank submission
     • [Animated diagram: the numbered layers are added to the genome one after another]
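The numbered layers on this slide suggest a priority order. The following is a minimal sketch of one way such layering could work, assuming that a model from a lower-priority layer is kept only where it does not overlap an already accepted model. The `Model` type, the `merge_layers` function, the overlap rule, and the coordinates are all hypothetical illustrations, not the Ensembl implementation.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    """A candidate gene model on one sequence (hypothetical type)."""
    name: str
    start: int
    end: int
    layer: int  # 1 = community annotation ... 5 = ab initio


def overlaps(a: Model, b: Model) -> bool:
    """True if the two closed intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def merge_layers(models: List[Model]) -> List[Model]:
    """Accept models layer by layer; lower-priority layers only fill gaps."""
    accepted: List[Model] = []
    for m in sorted(models, key=lambda m: (m.layer, m.start)):
        if not any(overlaps(m, kept) for kept in accepted):
            accepted.append(m)
    return sorted(accepted, key=lambda m: m.start)


# Made-up coordinates, for illustration only.
candidates = [
    Model("community_1", 100, 500, 1),
    Model("protein_1", 400, 900, 2),       # overlaps community_1, so dropped
    Model("transcript_1", 1000, 1500, 3),
    Model("abinitio_1", 1200, 1800, 5),    # overlaps transcript_1, so dropped
    Model("abinitio_2", 2000, 2600, 5),    # fills an empty region
]
merged_names = [m.name for m in merge_layers(candidates)]
print(merged_names)
```

With these toy inputs, only the community model, the transcript model, and the second ab initio model survive, which mirrors the slide's point that ab initio predictions come last.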

  5. Ensembl annotation
     • Similarity-focused
     • Data-rich organisms
     • Fiddly, time-consuming
     • Rhodnius prolixus experience
     • In the meantime: Heliconius annotation using MAKER

  6. MAKER
     • Aim:
       - Generate gene sets
       - Combine into final gene set
     • Iterative process
     • [Diagram: raw genome + data -> annotated genome]
     • http://www.yandell-lab.org/software/maker.html
     • Cantarel et al. Genome Res. 2008. PMID 18025269

  7. MAKER
     • Aim:
       - Generate gene sets
       - Combine into final gene set
     • Iterative process
     • [Diagram: raw genome + data -> annotated genome]

  8. Intermediate gene sets
     • ESTs
       - from GenBank
       - cleaned and clustered/assembled with CAP3
       - 71,700 contigs
     • Insecta/metazoa proteins
       - from UniProt
       - aligned to the genome with BLAST
       - 690,000 sequences (insecta)
       - 2,200,000 sequences (metazoa)
     • [Diagram labels: Raw data; raw genome -> masked genome. Masking: RepeatModeler repeats + known repeats/transposons]

  9. Intermediate gene sets
     • RNAseq Illumina Yale
       - cleaned
       - aligned to the genome using Tophat/Bowtie
       - 'transfrags' built with Cufflinks
       - 78,000 'transfrags' (on 4 sets -> overlaps)
     • Augustus
       - generated by Martin Swain
       - trained with SOLiD data
       - 16,963 models, high quality
     • [Diagram labels: Raw data; Gene models; raw genome -> masked genome. Masking: RepeatModeler repeats + known repeats/transposons]

  10. Intermediate gene sets
     • Raw data
       - ESTs, aligned to the genome: from GenBank, clustered with CAP3; 71,700 clusters
       - Insecta/metazoa proteins (UniProt): 690,000 sequences (insecta), 2,200,000 sequences (metazoa)
     • Gene models
       - RNAseq Illumina Yale, using Tophat/Cufflinks: 78,000 'transfrags' (on 4 sets -> overlaps)
       - Augustus, trained with SOLiD data: 16,963 models, high QC
     • Ab initio
       - SNAP, trained for Glossina (MAKER)
       - Augustus, trained for Glossina (Martin Swain)
       - GenScan
     • [Diagram: raw genome -> masked genome. Masking: RepeatModeler repeats + known repeats/transposons]

  11. Intermediate gene sets
     • [Diagram labels: Raw data; Gene models; Ab initio; raw genome -> masked genome. Masking: RepeatModeler repeats + known repeats/transposons]

  12. MAKER
     • Provided as input: raw data (ESTs, proteins) and gene models
     • Run software within MAKER: ab initio
     • [Diagram: raw genome -> masked genome. Masking: RepeatModeler repeats + known repeats/transposons]

  13. MAKER – iterative process
     • Round 1:
       - Align ESTs and Insecta proteins to the genome
       - Train SNAP (1): Drosophila HMM, ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)
     • Round 2:
       - Re-train SNAP (2): same as above, but HMM = output of SNAP-1
     • Round 3:
       - Re-train SNAP (3): same as above, but HMM = output of SNAP-2
       - Align Metazoa proteins to the genome
       - Combine final gene set
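The three rounds above form a chain in which each SNAP training run starts from the HMM produced by the previous one. The sketch below only plans that chain as plain data, so the dependency between rounds is explicit. It does not call MAKER or SNAP, and the function name, dictionary keys, and HMM labels are hypothetical.

```python
from typing import Dict, List

# Evidence reused in every round, as listed on the slide.
EVIDENCE = [
    "EST alignments",
    "protein alignments",
    "RNA-seq Illumina Yale",
    "Augustus (SOLiD)",
]


def plan_rounds(n_rounds: int = 3,
                initial_hmm: str = "Drosophila HMM") -> List[Dict]:
    """Describe each round: which HMM SNAP starts from and what is aligned."""
    rounds: List[Dict] = []
    hmm = initial_hmm
    for i in range(1, n_rounds + 1):
        step = {
            "round": i,
            "snap_hmm_in": hmm,               # HMM used to start this round
            "evidence": list(EVIDENCE),
            "snap_hmm_out": f"SNAP-{i} HMM",  # hypothetical label for the result
            "align": [],
            "combine_final_gene_set": False,
        }
        if i == 1:
            # First round: align ESTs and Insecta proteins to the genome.
            step["align"] += ["ESTs", "Insecta proteins"]
        if i == n_rounds:
            # Last round: add Metazoa proteins and build the final gene set.
            step["align"].append("Metazoa proteins")
            step["combine_final_gene_set"] = True
        hmm = step["snap_hmm_out"]            # next round starts from this HMM
        rounds.append(step)
    return rounds


rounds = plan_rounds()
for step in rounds:
    print(step["round"], step["snap_hmm_in"], "->", step["snap_hmm_out"])
```

Keeping the plan separate from execution makes it easy to check that round 2 really starts from the round 1 output before any long-running job is launched.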

  14. Using MAKER for…
     • Heliconius
     • Tsetse fly
     • Salmon louse
     • Centipede

  15. Annex…

  16. Augustus (SOLiD)
     • Glossina trained:
       - ESTs only: 14,739 predictions, 9.8% with similarity to Gl. proteins (1,455 seq., 95% seq. identity)
       - ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Gl. proteins (1,465 seq., 95% seq. identity)
       - Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models
     • Glossina un-trained:
       - 8,581 predictions, 15% with similarity to Gl. proteins (1,299 seq., exact matches)
     • Source: Martin Swain's stats, July 22nd, 2011
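The percentages on this slide can be recomputed from the counts given next to them. The sketch below does only that arithmetic; the `pct` helper and the labels are illustrative names. Recomputed to one decimal place, the first figure comes out at about 9.9% rather than the 9.8% shown, which may simply reflect truncation on the slide.

```python
def pct(hits: int, total: int) -> float:
    """Percentage of predictions with a similarity hit."""
    return 100.0 * hits / total


# (sequences with similarity to Glossina proteins, total predictions)
stats = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "un-trained": (1299, 8581),
}

for label, (hits, total) in stats.items():
    print(f"{label}: {pct(hits, total):.1f}%")
```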

  17. ESTs • Total: 79,292 ESTs

  18. References
     • [1] Lehane et al. Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. 2003.
     • [2] Lehane et al. Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. 2008.
     • [3] Attardo et al. Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. 2006.
     • [4] …, Lehane, M., Hertz-Fowler, C., Berriman, M., … Functional characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. 2009.
     • [5] Hertz-Fowler, C., Aslett, M.A. and Berriman, M. Comprehensive analysis of the transcriptome of the tsetse fly Glossina morsitans morsitans. Unpublished. 2009. ESTs submitted under GenomeProject:9563.

  19. MAKER – final gene set
     • Genes:
       - Final genes: 12,220
     • Raw data:
       - EST-based genes: 23,469
       - Protein-based genes: 416,9591 (redundancy)
     • Gene sets:
       - Illumina-Yale: 70,915 (redundancy)
       - Augustus (SOLiD): 16,155
     • Ab initio:
       - SNAP: 48,464
       - Augustus (MAKER): 14,413 (417,000)
