Genome Annotation

Genome Annotation BCB 660 October 20, 2011

From Carson Holt

Annotations • Automated • Ab initio (based on genomic sequence alone) • Involves comparisons to known proteins (BLAST similarity) • Sequence motifs such as start/stop codons, intron/exon boundaries • Evidence-based (ESTs) • Involves alignment of experimental EST (cDNA) data to a gene prediction • Manual • Manual curation of genes predicted automatically • Check gene structure, presence of conserved domains, match of ESTs to gene prediction • Align to related genes/proteins and look for oddities (missing exons, early stop codons, etc). • Annotation can then be manually edited • May also involve assigning function (based on sequence similarity, conserved domains) via Gene Ontology • Structural: exons, introns, UTRs, splice forms etc. • Functional: process a gene is involved in (metabolism), molecular function (hydrolase), location of expression (expressed in the mitochondria), etc.

Classic strategy • Combine ab initio and evidence-based gene predictors together to come up with a concensus predicted gene set • Ask community to pitch in and manually annotate as many genes as possible • Leads to great variability in quality of different genome annotations, often many versions of official gene sets

NGS and the future of genome annotation • In 2010, 1300 eukaryotic genome projects were underway -- assuming 10,000 genes per genome, that’s 13,000,000 new annotations will be needed -- quality control and maintenance become an issue • Some organizations dedicated to genome annotation (i.e ENSEMBL and VectorBase) but 1300 genomes will not be feasible • Need for high quality, automated annotation pipelines, that are easy to use by small research groups without extensive bioinformatics expertise

MAKER Pipeline: Especially effective for Emerging Eukaryote Model Organisms • Incorporates ab initio and evidence-based gene predictors • Gene predictions are run a first time • Then a small subset of the genome assembly is used to train gene predictors (building genome-specific HMMs) • Then trained gene predictors are run again on whole genome • ** Really nice if you don’t have a basis to start from (e.g. de novo gene prediction)

What does MAKER do? • * Identifies and masks out repeat elements • * Aligns ESTs to the genome • * Aligns proteins to the genome • * Produces ab initio gene predictions • * Synthesizes these data into final annotations • * Produces evidence-based quality values for downstream annotation management

MAKER Steps involved 1. Compute phase RepeatMasker BLAST Exonerate SNAP (and other gene predictors) 2. Filter/cluster phase Identify/remove marginal predictions and alignments based on quality scores/cutoffs, etc Cluster to identify overlapping alignments/predictions– to remove redundancy and assess weight of evidence 3. Polish Realigns BLAST hits to obtain greater precision at exon boundaries (Exonerate) 4. Synthesis Collect evidence for each annotation, using EST evidence Evidences scores plus sequences (genomic, EST, coding, intron) passed to SNAP SNAP then uses this evidence to retrain and alter its internal HMM 5. Annotate Post-processing of SNAP prediction, recombine with evidence to generate complete annotations Output is a gff3 annotation that can be imported into genome browsers

Inputs to MAKER • Genomic sequence • Config files • External executables • Sequence database locations • Compute parameters • Sequence database files (choice of these turns out to be extremely important) • Transposons file (default plus known organism-specific) • Repeatmasker database file (organism-specific, optionsal) • Proteins file (known proteins from related organisms you want to align to the genome) • ESTs/mRNAs file (the evidence)

MAKER Output (Apollo browser)

Genome Annotation