Ensembl Genome Annotation Overview


  1. Ensembl Genome Annotation Overview Steve Searle

  2. Outline • Overview of Ensembl • Outline of gene annotation process • Details of the gene annotation process on high coverage genomes including recent modifications to it

  3. Ensembl • Joint Sanger / EBI project (~40 people) • Aim: Annotate Vertebrate genomes • Multiple teams • Genebuild • Compara • Functional Genomics • Variation • Core API • Mart • Web • Release every 2 months

  4. Species in Ensembl

  5. Types of annotation in EnsEMBL • Raw computes • Gene annotation • Protein coding • Pseudogenes • ncRNAs • Xrefs • Protein annotation • Comparative genomic data • Genomic alignments • Gene homologs • Variation data • Functional genomic data • Probe mapping • ChipSeq • Regulatory region build

  6. Outline • Overview of Ensembl • Outline of gene annotation process • Details of the gene annotation process on high coverage genomes including recent modifications to it

  7. Ensembl Genebuilding • Ensembl generates automatic annotation on over 30 genomes. • Annotation strategies: • High coverage • Alignments of protein and cDNA sequences. • Low coverage • Project gene structures from reference species through genomic alignment • High coverage primate • Alignment of species specific sequences + gene projection • Fly, yeast, worm • Import manual annotation • (worm now done by wormbase team)

  8. Data used for building on high coverage genomes • Build CDS models using • Uniprot/Swissprot • Uniprot/TrEMBL • Refseq NPs • cDNAs • EMBL/GenBank • Now also use: • IMGT data • Other transcript models incorporated • CCDS models • Subset of Havana manual annotation

  9. Outline of Genebuild Process (flowchart; labels only) • Inputs: Species specific proteins, Other proteins, Species specific cDNAs, Species specific ESTs • Steps: Pmatch, Genewise / Exonerate, Exonerate, UTR Addition, ClusterMerge, Genebuilder • Intermediate products: CDS Models, Transcript Models (no CDS), Aligned ESTs, CDS models with UTRs • Outputs: Core EnsEMBL Genes, Pseudogenes, EnsEMBL EST genes, Final gene set

  10. Targetted Genewise overview • Species specific protein sequences (Uniprot/Swissprot, Uniprot/TrEMBL, Refseq NPs) • pmatch vs assembly • genewise • Blast, cluster hits to define region to build model

  11. Similarity Genewise overview • genewise • Fetch protein blast hits • Score threshold • Discard hits overlapping targetted genewises • Blast vs repeatmasked Slice, not just genscan exons (coverage threshold) • reblast and optionally miniseq

  12. Adding UTRs • Genewise: phases, no UTRs • Exonerate: UTRs, no phases • Result: translatable gene with UTRs

  13. Outline • Overview of Ensembl • Outline of gene annotation process • Details of the gene annotation process on high coverage genomes including recent modifications to it

  14. New human and mouse gene builds • Modifications to build method • Improving predicted gene structures • Improved alignments of species specific data • Filtering out problematic sequences • chimeric cDNAs • retained introns • incorrect TrEMBL CDS entries • UTR addition modified • Immunoglobulin build • Updated manual annotation set incorporated • Homology data used to improve incomplete models • Reducing overprediction • Improved pseudogene detection • Filtering out problematic sequences • viral proteins • repeats • Reducing underprediction • Homology data used to identify missed orthologs • Updated manual annotation set incorporated

  15. QC of current and previous human and mouse Ensembl gene sets (figure; panels: Human, Mouse)

  16. (Flowchart; labels only) • Inputs: Species specific proteins, Other proteins, Species specific cDNAs, Species specific ESTs, Havana Manual annotation, CCDS Transcripts, Ditag alignments, IMGT Ig Genes • Steps: KillList Filter (×3), Pmatch / Exonerate, Genewise, Exonerate, Exonerate / cDNAUpdate, BestTargetted, TranscriptCoalescer, TranscriptConsensus, UTR Addition, Comparative analysis, GeneBuilder, HavanaAdder, SetMerger • Intermediate products: CDS models, Transcript models, Aligned ESTs, CDS models with UTRs, Fixed partial and Missed orthologs, Preliminary gene set • Outputs: Core Ensembl Genes with UTR, Pseudogenes, EnsEMBL EST genes, Final Gene Set

  17. Species specific protein alignment • Aim: • Improved alignment of species specific proteins • Problems: • Non GT-AG spliced introns were not always correctly predicted • Short terminal exons were sometimes missed • Solution: • Produce multiple models for each input protein • Pmatch with shorter word length • Genewise with different parameters • Exonerate protein2genome model • Choose best model (most similar to original protein)
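
The "choose best model" step on this slide can be illustrated with a minimal sketch. It assumes each alignment method yields a translated peptide and that "most similar to original protein" can be approximated by a generic string-similarity ratio; the real pipeline's scoring is not described in the slides, and the method labels and sequences below are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def similarity(a: str, b: str) -> float:
    """Generic similarity in [0, 1]; a stand-in for the pipeline's real measure."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def choose_best_model(input_protein: str, candidates: Dict[str, str]) -> Tuple[str, str]:
    """Return (method, peptide) of the candidate most similar to the input protein.

    `candidates` maps a method label to the peptide translated from that
    method's gene model.
    """
    return max(candidates.items(), key=lambda kv: similarity(input_protein, kv[1]))


# Hypothetical input protein and one model per alignment method.
protein = "MKTAYIAKQR"
models = {
    "pmatch_short_word": "MKTAY",       # terminal exons missed
    "genewise_alt_params": "MKTAYIAK",  # short terminal exon missed
    "exonerate_protein2genome": "MKTAYIAKQR",
}
best_method, best_peptide = choose_best_model(protein, models)
```

Producing several models and selecting afterwards means no single aligner has to handle both non GT-AG introns and short terminal exons well.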

  18. Species specific protein alignment

  19. Exonerate cdna2genome model • Combined model aligning cDNA and CDS simultaneously should improve transcript models

  20. Input data filtering • Prevent alignments of ‘bad’ cDNAs and proteins to the genome • KillList database replaced kill list file • More flexible than a text file • Kill List as it was at any date • Species and analysis specific kills • Versioning • API access
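
To make the listed KillList features concrete, here is a minimal in-memory sketch, not the actual KillList database schema or API. It assumes each entry records when it was added and (optionally) removed, plus optional species and analysis restrictions; all field names and accessions are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class KillEntry:
    accession: str
    added: date
    removed: Optional[date] = None   # None: still on the list
    species: Optional[str] = None    # None: applies to all species
    analysis: Optional[str] = None   # None: applies to all analyses


def is_killed(entries: List[KillEntry], accession: str,
              species: str, analysis: str, on: date) -> bool:
    """True if `accession` was kill listed for this species/analysis on date `on`."""
    for e in entries:
        if e.accession != accession:
            continue
        if e.species is not None and e.species != species:
            continue
        if e.analysis is not None and e.analysis != analysis:
            continue
        if e.added <= on and (e.removed is None or on < e.removed):
            return True
    return False


def filter_inputs(entries: List[KillEntry], accessions: List[str],
                  species: str, analysis: str, on: date) -> List[str]:
    """Keep only the accessions that were not kill listed on date `on`."""
    return [a for a in accessions
            if not is_killed(entries, a, species, analysis, on)]


# Hypothetical kill list and input set.
kill_list = [
    KillEntry("cDNA_A", added=date(2007, 1, 10)),                    # global kill
    KillEntry("cDNA_B", added=date(2007, 3, 1), species="mouse"),    # mouse only
    KillEntry("cDNA_C", added=date(2006, 5, 1), removed=date(2007, 2, 1)),
]
inputs = ["cDNA_A", "cDNA_B", "cDNA_C", "cDNA_D"]
kept_now = filter_inputs(kill_list, inputs, "human", "cdna_alignment", date(2007, 6, 1))
kept_then = filter_inputs(kill_list, inputs, "human", "cdna_alignment", date(2007, 1, 1))
```

Storing add and remove dates instead of deleting rows is what lets the list be reconstructed "as it was at any date", which a flat text file cannot do.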

  21. ‘Chimeric’ cDNA example

  22. UTR addition • UTR addition module rewritten to incorporate previously separate stages • LookForBoth • Expanding out incomplete CDSs into added UTR • KnownUTR • Makes known cDNA protein pairings • and greater flexibility: • Ditag and EST data can be used to score cDNAs • Other changes: • cDNA update pipeline for cDNA alignments in human and mouse rather than single exonerate run • 3 rounds of exonerate with increasingly sensitive parameters • Aligns 87% of 231145 cDNAs in human • Chimeric cDNA and retained intron cDNAs kill listed
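
The "3 rounds of exonerate with increasingly sensitive parameters" idea can be sketched as a loop that only re-aligns what earlier rounds failed to place. This is an illustration under stated assumptions: the aligner is passed in as a function, and the round names and `sensitivity` values are hypothetical, not real exonerate options.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, object]


def align_in_rounds(seq_ids: List[str], rounds: List[Params],
                    align: Callable[[str, Params], bool]
                    ) -> Tuple[Dict[str, str], List[str]]:
    """Run successive alignment rounds; each round sees only unaligned sequences.

    Returns (aligned, unaligned): `aligned` maps a sequence id to the name of
    the round that placed it; `unaligned` lists what no round could place.
    """
    aligned: Dict[str, str] = {}
    remaining = list(seq_ids)
    for params in rounds:
        still_unaligned = []
        for sid in remaining:
            if align(sid, params):
                aligned[sid] = str(params["name"])
            else:
                still_unaligned.append(sid)
        remaining = still_unaligned
        if not remaining:
            break
    return aligned, remaining


# Toy stand-in for an aligner: a sequence aligns once the round is
# sensitive enough for its (made-up) difficulty.
difficulty = {"c1": 1, "c2": 2, "c3": 3, "c4": 5}


def toy_align(seq_id: str, params: Params) -> bool:
    return int(params["sensitivity"]) >= difficulty[seq_id]


rounds = [
    {"name": "strict", "sensitivity": 1},
    {"name": "medium", "sensitivity": 2},
    {"name": "sensitive", "sensitivity": 3},
]
aligned, unaligned = align_in_rounds(list(difficulty), rounds, toy_align)
```

Running the cheap, strict round first keeps the expensive, sensitive parameters for the minority of hard cases. For scale, 87% of 231145 human cDNAs is roughly 201,000 aligned sequences.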

  23. Pseudogenes and artefacts • Using protein domains to identify Ensembl families of viral origin • In first run more than 3000 genes have been removed across all species • Evidence has been kill listed • Patch applied to release 49 removing 1200 more (and 400 Alus) • Better identification of processed pseudogenes • new biotype ‘retrotransposed’

  24. CCDS • Set of CDS structures identical between Ensembl / Havana and NCBI: • assessed by UCSC • guaranteed to be retained in both sets • removal requires agreement from Hinxton, NCBI and UCSC • Extended to mouse in 2007. • Current set sizes: • 16004 human genes • 16892 mouse genes (up from 12981 in previous set) • New round of comparisons on human in progress

  25. Havana annotation merging • Incorporate Havana transcripts with complete CDS • Redundant transcripts in merged set removed • Merging process pipelined for new human and mouse builds, using updated Vega databases
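
A minimal sketch of the merging rules on this slide follows. The slides do not define "redundant", so this assumes it means an identical exon structure, and it assumes the Havana copy is the one kept; chromosome and strand are ignored, and all names and coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Exons = Tuple[Tuple[int, int], ...]


@dataclass(frozen=True)
class Transcript:
    name: str
    source: str          # "havana" or "ensembl"
    exons: Exons         # exon (start, end) pairs in genomic order
    complete_cds: bool


def merge_sets(havana: List[Transcript], ensembl: List[Transcript]) -> List[Transcript]:
    """Merge two transcript sets, dropping duplicates by exon structure.

    Havana transcripts are added first (only those with a complete CDS),
    so a manual transcript wins over an identical automatic one.
    """
    merged: Dict[Exons, Transcript] = {}
    for t in havana:
        if t.complete_cds:
            merged.setdefault(t.exons, t)
    for t in ensembl:
        merged.setdefault(t.exons, t)
    return list(merged.values())


havana_set = [
    Transcript("h1", "havana", ((100, 200), (300, 400)), True),
    Transcript("h2", "havana", ((500, 600),), False),   # incomplete CDS: skipped
]
ensembl_set = [
    Transcript("e1", "ensembl", ((100, 200), (300, 400)), True),  # same as h1
    Transcript("e2", "ensembl", ((700, 800),), True),
]
merged_names = [t.name for t in merge_sets(havana_set, ensembl_set)]
```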

  26. Merged gene

  27. Immunoglobulin build • Aim: Improve annotation of Immunoglobulin gene segment clusters • Problem: Ig gene segments cause problems for standard build procedure • cDNAs and proteins do not represent the germline genomic arrangement • Solution: Incorporate IMGT data • Align IMGT Ig segment annotation to genome • Replace overlapping genes built by standard build procedure • Currently only used in human and mouse
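
The "replace overlapping genes" step can be sketched as a simple interval filter. This assumes the IMGT segments have already been aligned to the genome and that any coordinate overlap on the same chromosome triggers replacement; strand is ignored and the gene names and coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # inclusive
    end: int     # inclusive
    source: str  # "standard" or "imgt"


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def replace_overlapping(standard: List[Gene], ig_segments: List[Gene]) -> List[Gene]:
    """Drop standard-build genes overlapping any Ig segment, then add the segments."""
    kept = [g for g in standard
            if not any(overlaps(g, s) for s in ig_segments)]
    return sorted(kept + ig_segments, key=lambda g: (g.chrom, g.start))


standard_genes = [
    Gene("g1", "chr14", 1000, 2000, "standard"),  # spans two Ig segments
    Gene("g2", "chr14", 5000, 6000, "standard"),
    Gene("g3", "chr2", 1000, 2000, "standard"),
]
imgt_segments = [
    Gene("s1", "chr14", 1500, 1600, "imgt"),
    Gene("s2", "chr14", 1700, 1800, "imgt"),
]
final_names = [g.name for g in replace_overlapping(standard_genes, imgt_segments)]
```

In the toy data, `g1` is a single standard-build model spanning two separate segments, the kind of artefact that arises when rearranged cDNAs are aligned to a germline locus.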

  28. Ig Build (figure; panels: Rel 38, Rel 47)

  29. Clustering rules: False gene merging • Transcript models are clustered into genes based on exon overlap • Old rule: Any exons overlap • New rule: Only overlap of CDS exons
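
The difference between the two rules can be shown with a small sketch. It assumes all transcripts lie on the same strand of one sequence, represents the CDS as a single genomic start/end range, and uses single-linkage clustering; the transcript names and coordinates are hypothetical, and this is not the Genebuilder code itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # inclusive genomic coordinates


@dataclass
class Transcript:
    name: str
    exons: List[Interval]
    cds: Interval  # genomic start and end of the coding region


def cds_exons(t: Transcript) -> List[Interval]:
    """Clip each exon to the coding range, dropping pure-UTR exons."""
    clipped = []
    for start, end in t.exons:
        s, e = max(start, t.cds[0]), min(end, t.cds[1])
        if s <= e:
            clipped.append((s, e))
    return clipped


def any_overlap(a: List[Interval], b: List[Interval]) -> bool:
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def cluster(transcripts: List[Transcript], cds_only: bool) -> List[List[str]]:
    """Group transcripts into genes by overlap (union-find, single linkage)."""
    parent = list(range(len(transcripts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    parts = [cds_exons(t) if cds_only else t.exons for t in transcripts]
    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if any_overlap(parts[i], parts[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[str]] = {}
    for i, t in enumerate(transcripts):
        groups.setdefault(find(i), []).append(t.name)
    return sorted(sorted(g) for g in groups.values())


# t1's 3' UTR (401-450) overlaps t2's 5' UTR (430-459); the CDSs do not overlap.
t1 = Transcript("t1", exons=[(100, 200), (300, 450)], cds=(120, 400))
t2 = Transcript("t2", exons=[(430, 500), (600, 700)], cds=(460, 650))
old_rule = cluster([t1, t2], cds_only=False)
new_rule = cluster([t1, t2], cds_only=True)
```

Under the old rule the UTR overlap merges the two neighbouring transcripts into one gene; under the new rule they remain two genes, which is the false merging this slide addresses.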

  30. Status • Human and mouse gene builds were the main focus for 2007 • They have been the driver for a significant amount of development work • Resulting builds are the best we have created so far • We’ve produced patches to further improve the current sets • Problems which remain • Sequence data which causes problems for prediction • ‘Chimeric’ clones • Fragmentary protein and cDNA sequences • Translated 3' UTR • Gene clusters • CDS overlap as definition of gene leads to some cases that don't agree with manual definition of gene

  31. Acknowledgements • ISG (Systems support) • Guy Coates • Havana • Jen Harrow • Laurens Wilming • NCBI • Kim Pruitt • UCSC • Mark Diekhans • Genebuilders • Val Curwen • Bronwen Aken • Julio Fernandez Banet • Laura Clarke • Amonida Zadissa • Felix Kokocinski • Jan-Hinnerk Vogel • Simon White • Sarah Dyer • Tim Hubbard • Ewan Birney • Michael Schuster • The rest of Ensembl
