MAKER Annotation Process: Example of Glossina
Karyn Mégy, Dan Hughes
VectorBase, http://www.vectorbase.org
Annotation: aims and means
• Aims
  - Preliminary
  - Locus rather than exact position
• Means
  - Automatic annotation
    - By similarity
    - Ab initio
  - Manual annotation
    - By regions
    - By gene families
Annotation: similarity vs. ab initio
• Similarity
  - Similarity to known sequences
  - -> finds only known genes
  - -> depends on the available data (quantity, quality)
• Ab initio
  - Follows a gene "recipe"
  - -> can potentially identify new genes
  - -> prone to over-prediction
Ensembl annotation
[Figure: pipeline diagram. Raw genome -> masked genome (masking: RepeatModeler repeats + known repeats/transposons). Evidence sources, numbered 1 to 5: 1 Community annotation; 2 Protein, species specific; 3 Transcriptome, species specific; 4 Protein, 'close' species; 5 Ab initio. Follow-up steps: add UTRs (ESTs), functional annotation, ncRNA prediction, pseudogene prediction, GenBank submission.]
Ensembl annotation
• Similarity-focused
• Data-rich organisms
• Fiddly, time-consuming
• Rhodnius prolixus experience
• In the meantime: Heliconius annotation using MAKER
MAKER
• Aim:
  - Generate gene sets
  - Combine into final gene set
• Iterative process
[Diagram: raw genome + data -> annotated genome]
• http://www.yandell-lab.org/software/maker.html
• Cantarel et al. Genome Res. 2008. PMID 18025269
Intermediate gene sets
Raw data:
• ESTs
  - from GenBank
  - cleaned and clustered/assembled with CAP3
  - 71,700 contigs
• Insecta/metazoa proteins
  - from UniProt
  - aligned to the genome with BLAST
  - 690,000 sequences (insecta)
  - 2,200,000 sequences (metazoa)
[Diagram: raw genome -> masked genome; masking: RepeatModeler repeats + known repeats/transposons]
Intermediate gene sets
Gene models:
• RNA-seq, Illumina (Yale)
  - cleaned
  - aligned to the genome using Tophat/Bowtie
  - 'transfrags' built with Cufflinks
  - 78,000 'transfrags' (across 4 sets, so with overlaps)
• Augustus
  - generated by Martin Swain
  - trained with SOLiD data
  - 16,963 models, high quality
[Diagram: raw genome -> masked genome; masking: RepeatModeler repeats + known repeats/transposons]
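The RNA-seq steps on this slide (align with Tophat/Bowtie, then build transfrags with Cufflinks) can be sketched as two command lines per read set. This is a minimal sketch, assuming TopHat and Cufflinks are installed and a Bowtie index of the genome already exists. The index name, read file names and output directories are hypothetical placeholders, not the ones used for Glossina, and the commands are only assembled here, not executed.

```python
from pathlib import Path


def rnaseq_commands(index_base, reads, out_dir):
    """Build (not run) the TopHat and Cufflinks command lines for one read set.

    index_base, reads and out_dir are hypothetical placeholders.
    """
    out = Path(out_dir)
    tophat_dir = out / "tophat"
    cufflinks_dir = out / "cufflinks"
    # TopHat aligns the reads to the genome through a Bowtie index and
    # writes accepted_hits.bam into its output directory.
    tophat = ["tophat", "-o", str(tophat_dir), index_base, *reads]
    # Cufflinks assembles those alignments into transcript fragments.
    cufflinks = [
        "cufflinks",
        "-o",
        str(cufflinks_dir),
        str(tophat_dir / "accepted_hits.bam"),
    ]
    return [tophat, cufflinks]


# Example with made-up file names for one of the four read sets.
for cmd in rnaseq_commands("glossina_index", ["set1_1.fastq", "set1_2.fastq"], "rnaseq/set1"):
    print(" ".join(cmd))
```

Running this once per read set would give four separate transfrag sets, which is consistent with the overlaps noted on the slide.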
Intermediate gene sets
• Raw data
  - ESTs: from GenBank, clustered with CAP3, aligned to the genome; 71,700 clusters
  - Insecta/metazoa proteins (UniProt): 690,000 sequences (insecta), 2,200,000 sequences (metazoa)
• Gene models
  - RNA-seq, Illumina (Yale): using Tophat/Cufflinks; 78,000 'transfrags' (across 4 sets, so with overlaps)
  - Augustus: trained with SOLiD data; 16,963 models, high quality
• Ab initio
  - SNAP: trained for Glossina (MAKER)
  - Augustus: trained for Glossina (Martin Swain)
  - GenScan
[Diagram: raw genome -> masked genome; masking: RepeatModeler repeats + known repeats/transposons]
MAKER
• Provided as input
  - Raw data: ESTs, proteins
  - Gene models
• Run within MAKER
  - Ab initio software
[Diagram: raw genome -> masked genome; masking: RepeatModeler repeats + known repeats/transposons]
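One way to picture the split between provided inputs and software run within MAKER is as entries in MAKER's control file. The sketch below is an assumption throughout: the option names are assumed to follow the maker_opts.ctl naming of MAKER 2, the file names are hypothetical, and assigning the Illumina transfrags to est_gff and the external Augustus models to pred_gff is one plausible mapping, not something the slides state.

```python
# Hypothetical file names. Option names are assumed to match MAKER 2's
# maker_opts.ctl; the mapping of each data set to an option is illustrative.
maker_inputs = {
    "genome": "glossina_genome.fa",           # raw genome sequence
    "est": "est_contigs.fa",                  # CAP3 EST contigs (raw data)
    "protein": "uniprot_insecta.fa",          # UniProt proteins (raw data)
    "est_gff": "illumina_transfrags.gff3",    # RNA-seq transfrags (gene models)
    "pred_gff": "augustus_solid.gff3",        # external Augustus models
    "rmlib": "repeatmodeler_library.fa",      # repeat library used for masking
    "snaphmm": "drosophila.hmm",              # HMM for SNAP, run within MAKER
}


def render_ctl(options):
    """Render key=value lines in the style of a MAKER control file."""
    return "\n".join(f"{key}={value}" for key, value in options.items())


print(render_ctl(maker_inputs))
```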
MAKER: iterative process
• Round 1:
  - Align ESTs and Insecta proteins to the genome
  - Train SNAP (1): Drosophila HMM; ESTs and protein alignments, RNA-seq Illumina Yale, Augustus (SOLiD)
• Round 2:
  - Re-train SNAP (2): same as above, but HMM = output of SNAP-1
• Round 3:
  - Re-train SNAP (3): same as above, but HMM = output of SNAP-2
  - Align Metazoa proteins to the genome
• Combine final gene set
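The rounds above form a simple bootstrap: each round's SNAP HMM is the output of the previous round's training. The following is a minimal sketch of that bookkeeping only, with hypothetical names (Round, plan_rounds); it does not call MAKER or SNAP, and the string labels stand in for real files.

```python
from dataclasses import dataclass, field


@dataclass
class Round:
    number: int
    snap_hmm: str                      # HMM that SNAP starts from in this round
    new_alignments: list = field(default_factory=list)


def plan_rounds(n_rounds=3, initial_hmm="Drosophila HMM"):
    """Lay out the rounds: round n uses the HMM produced by round n-1."""
    rounds = []
    hmm = initial_hmm
    for number in range(1, n_rounds + 1):
        if number == 1:
            alignments = ["ESTs", "Insecta proteins"]
        elif number == n_rounds:
            alignments = ["Metazoa proteins"]
        else:
            alignments = []
        rounds.append(Round(number, hmm, alignments))
        # The HMM trained in this round seeds the next one.
        hmm = f"SNAP-{number} output"
    return rounds


for r in plan_rounds():
    print(r.number, r.snap_hmm, r.new_alignments)
```

With the defaults, round 1 starts from the Drosophila HMM, round 2 from the SNAP-1 output and round 3 from the SNAP-2 output, matching the slide.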
Using MAKER for…
• Heliconius
• Tsetse fly
• Salmon louse
• Centipede
Augustus (SOLiD)
• Glossina trained:
  - ESTs only: 14,739 predictions, 9.8% with similarity to Glossina proteins (1,455 sequences, 95% sequence identity)
  - ESTs + SOLiD: 14,739 predictions, 9.9% with similarity to Glossina proteins (1,465 sequences, 95% sequence identity)
  - Glossina GenBank proteins: 2,754 protein sequences, 53% matching Augustus models
• Glossina un-trained:
  - 8,581 predictions, 15% with similarity to Glossina proteins (1,299 sequences, exact matches)
Martin Swain's stats, July 22nd, 2011
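The percentages on this slide are the number of predictions with a Glossina protein match divided by the total number of predictions. A short check using only the counts quoted above (the helper name percent is hypothetical) gives values that agree with the slide to within rounding:

```python
def percent(part, total):
    """Share of `part` in `total`, as a percentage."""
    return 100.0 * part / total


# (matching predictions, total predictions), as quoted on the slide.
stats = {
    "trained, ESTs only": (1455, 14739),
    "trained, ESTs + SOLiD": (1465, 14739),
    "un-trained": (1299, 8581),
}

for label, (hits, predictions) in stats.items():
    print(f"{label}: {percent(hits, predictions):.2f}%")
```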
ESTs • Total: 79,292 ESTs
[1] Adult midgut expressed sequence tags from the tsetse fly Glossina morsitans morsitans and expression analysis of putative immune response genes. Genome Biol. 2003. Lehane et al.
[2] Differential expression of fat body genes in Glossina morsitans morsitans following infection with Trypanosoma brucei brucei. Int. J. Parasitol. 2008. Lehane et al.
[3] Analysis of fat body transcriptome from the adult tsetse fly, Glossina morsitans morsitans. Insect Mol. Biol. 2006. Attardo et al.
[4] Functional characterisations of odorant binding proteins and chemosensory proteins in tsetse fly Glossina morsitans morsitans. Unpublished. 2009. …., Lehane, M., Hertz-Fowler, C., Berriman, M., …
[5] Comprehensive analysis of the transcriptome of the tsetse fly Glossina morsitans morsitans. Unpublished. 2009. Hertz-Fowler, C., Aslett, M.A. and Berriman, M.
ESTs submitted under: GenomeProject:9563
MAKER: final gene set
• Genes:
  - Final genes: 12,220
• Raw data:
  - EST-based genes: 23,469
  - Protein-based genes: 416,9591 (redundancy)
• Gene sets:
  - Illumina-Yale: 70,915 (redundancy)
  - Augustus (SOLiD): 16,155
• Ab initio:
  - SNAP: 48,464
  - Augustus (MAKER): 14,413 (417,000)