1 / 28

Evaluating Genes and Transcripts (“Genebuild”)

Evaluating Genes and Transcripts (“Genebuild”). Outline. Ensembl gene set Ensembl EST genes Ab initio predictions Manual curation (Vega) Ensembl / Havana merged gene set CCDS project. Biological Evidence. All Ensembl gene predictions are based on experimental evidence:.

jacqui
Download Presentation

Evaluating Genes and Transcripts (“Genebuild”)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Evaluating Genes and Transcripts(“Genebuild”)

  2. Outline • Ensembl gene set • Ensembl EST genes • Ab initio predictions • Manual curation (Vega) • Ensembl / Havana merged gene set • CCDS project

  3. Biological Evidence All Ensembl gene predictions are based on experimental evidence: • UniProt/Swiss-Prot A manually curated database and therefore of highest accuracy • NCBI RefSeq A partially manually curated database • UniProt/TrEMBL Automatically annotated translations of EMBL coding sequence (CDS) features • EMBL / GenBank / DDBJ Primary nucleotide sequence repository

  4. The Ensembl Genebuild Genome assembly + Experimental evidence Ensembl Genes + Computer programs

  5. The Ensembl Genebuild A new release of Ensembl doesn’t contain a new genebuild for each species! New genebuilds are only done if there is: • a new genome assembly • a lot of new supporting evidence

  6. Genome Assemblies Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g. • NCBI: human, mouse • Rat Genome Sequencing Consortium: rat • Sanger: zebrafish • Broad Institute: mammals • Baylor College: cow • Washington University: chicken etc. etc.

  7. The Ensembl Genebuild • Targeted build: Align species-specific proteins to the genometo create transcripts • Similarity build: Align proteins from closely related species to locate additional transcripts • Add UTRs using mRNA evidence • Eliminate redundant transcripts and create genes

  8. “Special” cases • Pseudogenes • Non-coding RNA genes • sequences from RFAM and miRBase dbs and covariance models • hand-checked set • Ig Segment Genes (Immunoglobulin and T-cell receptor segments) • sequences from IMGT db and Exonerate

  9. Classification of Transcripts • Ensembl Transcripts and Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries • Genes that map to species-specific protein/mRNA records are classified as known • Genes that do not map to species-specific protein/mRNA records are classified as novel

  10. Names and Descriptions • Transcript names are inferred from mapped transcripts and proteins • Swiss-Prot > RefSeq > TrEMBL ID • Novel transcripts have only Ensembl identifiers • Genes are assigned the official gene symbol if available • HGNC (HUGO) symbol for human genes • Species-specific nomenclature committees (MGI, ZFIN etc.) • Otherwise Swiss-Prot > RefSeq > TrEMBL ID • Gene description is inferred from mapped database entries, the source is always given

  11. UTR coding/UTR Supporting evidence ExonView mRNA peptide peptide mRNA

  12. Supporting evidence ContigView

  13. Configuring the Genebuild Genebuild configured for each species Data availibility • Targeted build most useful in human, mouse • Similarity build most useful in C. intestinalis, mosquito Structural issues • Zebrafish • Many duplications • Genome from different haplotypes • Mosquito • Many single-exon genes • Genes within genes

  14. Low Coverage Genomes Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)

  15. Low Coverage Genomes reference assembly NNNNNN “gene-scaffold”

  16. EST Gene Set • ESTs (Expressed Sequence Tags) are single reads, high chance of sequencing mistakes • EST libraries are regularly contaminated with genomic DNA • Generally ~ 400 bp, so unlikely to cover a whole gene THEREFORE • EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set

  17. EST Gene Set ContigView EST genes ESTs

  18. Ab initio Predictions • Predict translatable transcript structures solely on the basis of genome sequence. • No validation with biological expression information. • GENSCAN for vertebrate genomes • SNAP better for invertebrates • NB: Both programs are over-predicting transcript structures.

  19. Ab initio Predictions ContigView GENSCAN prediction

  20. Automatic Annotation Quick Use unfinished sequence or shotgun assembly Consistent annotation Automatic vs Manual Annotation Manual Annotation • Slow • Need finished sequence • Flexible, can deal with inconsistencies • Most rules have exceptions • Consult publications as well as databases

  21. Annotation that Causes Problems for Ensembl • Multiple variants • UTRs • Pseudogenes • Non-coding genes (ncRNAs) • Overlapping genes, anti-sense genes • Gene duplication events

  22. Manually Curated Gene Sets • FlyBase fruitfly • WormBase C. elegans • SGD yeast • Vega human, zebrafish, mouse, dog

  23. Vega Genome Browser http://vega.sanger.ac.uk

  24. Vega Transcripts Vega transcripts Vega Havana transcripts annotated by the Havana team atSanger Vega External transcripts annotated by other Vega teams

  25. Ensembl / Havana Merge Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set Transcripts: • Ensembl/Havana: gold • Ensembl: red / black • Havana: blue Genes: • Ensembl/Havana: gold • Ensembl: red / black • Havana: blue

  26. Ensembl / Havana Merge Merged Ensembl / Havana gene Merged Ensembl / Havana transcript

  27. CCDS(Consensus Coding Sequences) • Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse • Long term aim is to get to a single gene set for human and mouse • The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)

  28. Q & A Q U E S T I O N S A N S W E R S

More Related