Ensembl Genome Annotation Overview • Steve Searle
Outline • Overview of Ensembl • Outline of gene annotation process • Details of the gene annotation process on high coverage genomes including recent modifications to it
Ensembl • Joint Sanger / EBI project (~40 people) • Aim: Annotate Vertebrate genomes • Multiple teams • Genebuild • Compara • Functional Genomics • Variation • Core API • Mart • Web • Release every 2 months
Types of annotation in EnsEMBL • Raw computes • Gene annotation • Protein coding • Pseudogenes • ncRNAs • Xrefs • Protein annotation • Comparative genomic data • Genomic alignments • Gene homologs • Variation data • Functional genomic data • Probe mapping • ChIP-seq • Regulatory region build
Outline • Overview of Ensembl • Outline of gene annotation process • Details of the gene annotation process on high coverage genomes including recent modifications to it
Ensembl Genebuilding • Ensembl generates automatic annotation on over 30 genomes. • Annotation strategies: • High coverage • Alignments of protein and cDNA sequences • Low coverage • Project gene structures from reference species through genomic alignment • High coverage primate • Alignment of species specific sequences + gene projection • Fly, yeast, worm • Import manual annotation • (worm now done by the WormBase team)
Data used for building on high coverage genomes • Build CDS models using • Uniprot/Swissprot • Uniprot/TrEMBL • Refseq NPs • cDNAs • EMBL/GenBank • Now also use: • IMGT data • Other transcript models incorporated • CCDS models • Subset of Havana manual annotation
Outline of Genebuild Process • [Flowchart; only node labels survive, grouped here by apparent role] • Inputs: Species specific proteins, Other proteins, Species specific cDNAs, Species specific ESTs • Processing steps: Pmatch, Genewise / Exonerate, Exonerate (two further instances), UTR Addition, ClusterMerge, Genebuilder • Intermediate sets: CDS Models, Transcript Models (no CDS), Aligned ESTs, CDS models with UTRs • Outputs: Core EnsEMBL Genes, Pseudogenes, EnsEMBL EST genes, Final gene set
Targetted Genewise overview • Species specific protein sequences: Uniprot/Swissprot, Uniprot/TrEMBL, Refseq (NPs) • pmatch vs assembly • Blast, cluster hits to define region to build model • genewise
Similarity Genewise overview • Fetch protein blast hits • Score threshold • Discard hits overlapping targetted genewises • Blast vs repeatmasked Slice, not just genscan Exons (coverage threshold) • reblast and optionally miniseq • genewise
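The hit filtering named on the Similarity Genewise slide (a score threshold, then discarding hits that overlap targetted genewise models) might look roughly like the sketch below. The `Hit` type, its field names and the threshold value are hypothetical illustrations, not the Ensembl API, and strand is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A protein BLAST hit on the genome (hypothetical structure)."""
    protein_id: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    score: float


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def filter_hits(hits, targeted_regions, min_score):
    """Keep hits that pass the score threshold and do not overlap
    any region already covered by a targetted genewise model."""
    kept = []
    for hit in hits:
        if hit.score < min_score:
            continue  # below the score threshold
        if any(overlaps(hit.start, hit.end, s, e) for s, e in targeted_regions):
            continue  # locus already has a targetted model
        kept.append(hit)
    return kept
```

Only the surviving hits would then be passed on to genewise in this simplified picture.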
Adding UTRs • Genewise: phases, no UTRs • Exonerate: UTRs, no phases • Combined result: translatable gene with UTRs
Outline • Overview of Ensembl • Outline of gene annotation process • Details of the gene annotation process on high coverage genomes including recent modifications to it
New human and mouse gene builds • Modifications to build method • Improving predicted gene structures • Improved alignments of species specific data • Filtering out problematic sequences • chimeric cDNAs • retained introns • incorrect TrEMBL CDS entries • UTR addition modified • Immunoglobulin build • Updated manual annotation set incorporated • Homology data used to improve incomplete models • Reducing overprediction • Improved pseudogene detection • Filtering out problematic sequences • viral proteins • repeats • Reducing underprediction • Homology data used to identify missed orthologs • Updated manual annotation set incorporated
QC of current and previous human and mouse Ensembl gene sets • [Figure panels: Human, Mouse]
[Genebuild flowchart; only node labels survive, grouped here by apparent role] • Inputs: Species specific proteins, Other proteins, Species specific cDNAs, Species specific ESTs, Havana Manual annotation, CCDS Transcripts, Ditag alignments, IMGT Ig Genes • Filters: KillList Filter (three instances) • Processing steps: Pmatch / Exonerate, Genewise (two instances), BestTargetted, Exonerate / cDNAUpdate, Exonerate (three further instances), TranscriptCoalescer, TranscriptConsensus, UTR Addition, Comparative analysis, Genebuilder (two instances), HavanaAdder, SetMerger • Intermediate sets: CDS models, Transcript models, Aligned ESTs, CDS models with UTRs, Core Ensembl Genes with UTR, Preliminary gene set, Fixed partial and Missed orthologs • Outputs: Pseudogenes, EnsEMBL EST genes, Final Gene Set
Species specific protein alignment • Aim: • Improved alignment of species specific proteins • Problems: • Non GT-AG spliced introns were not always correctly predicted • Short terminal exons were sometimes missed • Solution: • Produce multiple models for each input protein • Pmatch with shorter word length • Genewise with different parameters • Exonerate protein2genome model • Choose best model (most similar to original protein)
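The "choose best model" step can be illustrated with a minimal sketch, assuming each candidate model has already been translated to a peptide string. The similarity measure here is Python's `difflib.SequenceMatcher` ratio, swapped in as a simple stand-in; the slides do not say how similarity to the original protein is actually scored, and the method names used as dictionary keys are only labels.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two peptide strings.

    Stand-in metric: a real pipeline would more likely use
    alignment identity and coverage.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def choose_best_model(protein_seq, candidates):
    """Return (method, peptide) for the candidate translation that is
    most similar to the input protein.

    candidates: dict mapping a method label to the translated peptide
    of the model that method produced.
    """
    return max(candidates.items(),
               key=lambda item: similarity(protein_seq, item[1]))
```

For example, if one method truncates a short terminal exon and another recovers the full-length translation, the full-length model wins because its peptide matches the input protein more closely.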
Exonerate cdna2genome model • Combined model aligning cDNA and CDS simultaneously should improve transcript models
Input data filtering • Prevent alignments of ‘bad’ cDNAs and proteins to the genome • KillList database replaced kill list file • More flexible than a text file • Can recover the kill list as it was at any date • Species and analysis specific kills • Versioning • API access
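To make the kill list features concrete (date-based lookup, species and analysis specific kills), here is a minimal in-memory sketch. The `KillEntry` fields and the `is_killed` function are hypothetical and do not reflect the real KillList database schema or API; accessions in any usage would be placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class KillEntry:
    """One kill list record (hypothetical layout)."""
    accession: str
    added: date
    removed: Optional[date] = None   # None means still active
    species: Optional[str] = None    # None means all species
    analysis: Optional[str] = None   # None means all analyses


def is_killed(entries, accession, species, analysis, on):
    """True if the accession was on the kill list for this species and
    analysis on the given date."""
    for e in entries:
        if e.accession != accession:
            continue
        if e.species not in (None, species):
            continue
        if e.analysis not in (None, analysis):
            continue
        if e.added <= on and (e.removed is None or on < e.removed):
            return True
    return False
```

Keeping the added and removed dates on each record is one simple way to answer "what did the kill list look like on date X", which a flat text file cannot do without external version control.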
UTR addition • UTR addition module rewritten to incorporate previously separate stages • LookForBoth • Extends incomplete CDSs into the added UTR • KnownUTR • Makes known cDNA-protein pairings • The rewrite also gives greater flexibility: • Ditag and EST data can be used to score cDNAs • Other changes: • cDNA update pipeline used for cDNA alignments in human and mouse rather than a single exonerate run • 3 rounds of exonerate with increasingly sensitive parameters • Aligns 87% of 231145 cDNAs in human • Chimeric cDNAs and retained intron cDNAs kill listed
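The idea of running several alignment rounds with increasingly sensitive parameters, retrying only what failed in the previous round, can be sketched as below. The `align` callable and the contents of each parameter set are hypothetical placeholders; no real exonerate options are shown.

```python
def align_in_rounds(cdna_ids, rounds, align):
    """Align cDNAs in successive rounds, retrying only the failures.

    rounds: list of parameter dicts, ordered from strictest to most
            sensitive (contents are up to the caller).
    align:  callable (cdna_id, params) -> bool, True if the cDNA
            aligned acceptably with those parameters.
    Returns (aligned, remaining): aligned maps each cDNA id to the
    1-based round in which it aligned; remaining lists the ids that
    never aligned.
    """
    aligned = {}
    remaining = list(cdna_ids)
    for round_no, params in enumerate(rounds, start=1):
        still_unaligned = []
        for cdna_id in remaining:
            if align(cdna_id, params):
                aligned[cdna_id] = round_no
            else:
                still_unaligned.append(cdna_id)
        remaining = still_unaligned
    return aligned, remaining
```

Because most sequences align in the first, cheapest round, the more expensive sensitive settings are only spent on the difficult remainder. For scale, 87% of 231145 is roughly 201,000 cDNAs aligned.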
Pseudogenes and artefacts • Using protein domains to identify Ensembl families of viral origin • In the first run more than 3000 genes were removed across all species • Evidence has been kill listed • Patch applied to release 49 removing 1200 more (and 400 Alus) • Better identification of processed pseudogenes • New biotype ‘retrotransposed’
CCDS • Set of CDS structures identical between Ensembl / Havana and NCBI: • assessed by UCSC • guaranteed to be retained in both sets • removal requires agreement from Hinxton, NCBI and UCSC • Extended to mouse in 2007. • Current set sizes: • 16004 human genes • 16892 mouse genes (up from 12981 in previous set) • New round of comparisons on human in progress
Havana annotation merging • Incorporate Havana transcripts with complete CDS • Redundant transcripts in merged set removed • Merging process pipelined for new human and mouse builds, using updated Vega databases
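Removing redundant transcripts after merging can be sketched as a de-duplication on exon structure. This is an illustration under stated assumptions: transcripts are reduced to lists of (start, end) exon coordinates, two transcripts count as redundant only if their exon structures are identical, and the Havana copy is the one kept. None of these rules is stated on the slide.

```python
def merge_transcripts(havana, ensembl):
    """Merge two transcript sets, dropping exact structural duplicates.

    havana, ensembl: lists of transcripts, each a list of (start, end)
    exon tuples. Havana transcripts are processed first, so when the
    same structure appears in both sets the Havana copy is retained
    (an assumption made for this sketch).
    Returns a list of (source, exon_tuple) pairs.
    """
    seen = set()
    merged = []
    for source, transcripts in (("havana", havana), ("ensembl", ensembl)):
        for exons in transcripts:
            key = tuple(exons)
            if key in seen:
                continue  # identical structure already in the merged set
            seen.add(key)
            merged.append((source, key))
    return merged
```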
Immunoglobulin build • Aim: Improve annotation of Immunoglobulin gene segment clusters • Problem: Ig gene segments cause problems for standard build procedure • cDNAs and proteins do not represent the germline genomic arrangement • Solution: Incorporate IMGT data • Align IMGT Ig segment annotation to genome • Replace overlapping genes built by standard build procedure • Currently only used in human and mouse
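The replacement step (drop standard-build genes that overlap an aligned IMGT segment, then add the IMGT-derived genes) might be sketched as follows. Genes are represented as hypothetical `(chromosome, start, end, name)` tuples and strand is ignored, which is a simplification for illustration only.

```python
def replace_overlapping(standard_genes, ig_genes):
    """Replace standard-build genes that overlap Ig segment genes.

    Each gene is a (chromosome, start, end, name) tuple with closed
    coordinates. Standard genes overlapping any Ig gene on the same
    chromosome are dropped; all Ig genes are added.
    Returns the combined list sorted by chromosome and position.
    """
    def overlaps(a, b):
        return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

    kept = [g for g in standard_genes
            if not any(overlaps(g, ig) for ig in ig_genes)]
    return sorted(kept + list(ig_genes))
```

Standard-build genes elsewhere on the same chromosome, or at the same coordinates on a different chromosome, are left untouched.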
Ig Build • [Figure panels: Rel 38, Rel 47]
Clustering rules: False gene merging • Transcript models are clustered into genes based on exon overlap • Old rule: Any exons overlap • New rule: Only overlap of CDS exons
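The difference between the two clustering rules can be shown with a small sketch. The `Transcript` type and `cluster` function are hypothetical, transcripts are assumed to be on the same strand and chromosome, and clustering is simple single linkage; the real Genebuilder logic is not shown on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    exons: tuple      # ((start, end), ...) for all exons
    cds_exons: tuple  # coding portions only; empty if no CDS


def any_overlap(a, b):
    """True if any closed interval in a overlaps any interval in b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster(transcripts, cds_only):
    """Group transcripts into genes by overlap (single linkage).

    cds_only=False is the old rule (any exon overlap);
    cds_only=True is the new rule (only CDS exon overlap).
    """
    def regions(t):
        return t.cds_exons if cds_only else t.exons

    clusters = []
    for t in transcripts:
        merged, untouched = [t], []
        for c in clusters:
            if any(any_overlap(regions(t), regions(m)) for m in c):
                merged.extend(c)      # t links this cluster into its gene
            else:
                untouched.append(c)
        clusters = untouched + [merged]
    return clusters
```

With two neighbouring transcripts where the long 3' UTR of one runs into the first exon of the other, the old rule merges them into one gene while the new rule keeps them apart. Note that in this sketch a transcript with no CDS never clusters under the new rule.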
Status • Human and mouse gene builds were the main focus for 2007 • They have been the driver for a significant amount of development work • Resulting builds are the best we have created so far • We’ve produced patches to further improve the current sets • Problems which remain • Sequence data which causes problems for prediction • ‘Chimeric’ clones • Fragmentary protein and cDNA sequences • Translated 3' UTR • Gene clusters • CDS overlap as definition of gene leads to some cases that don't agree with manual definition of gene
Acknowledgements • ISG (Systems support) • Guy Coates • Havana • Jen Harrow • Laurens Wilming • NCBI • Kim Pruitt • UCSC • Mark Diekhans • Genebuilders • Val Curwen • Bronwen Aken • Julio Fernandez Banet • Laura Clarke • Amonida Zadissa • Felix Kokocinski • Jan-Hinnerk Vogel • Simon White • Sarah Dyer • Tim Hubbard • Ewan Birney • Michael Schuster • The rest of Ensembl