280 likes | 445 Views
Evaluating Genes and Transcripts (“Genebuild”). Outline. Ensembl gene set Ensembl EST genes Ab initio predictions Manual curation (Vega) Ensembl / Havana merged gene set CCDS project. Biological Evidence. All Ensembl gene predictions are based on experimental evidence:.
E N D
Outline • Ensembl gene set • Ensembl EST genes • Ab initio predictions • Manual curation (Vega) • Ensembl / Havana merged gene set • CCDS project
Biological Evidence All Ensembl gene predictions are based on experimental evidence: • UniProt/Swiss-Prot A manually curated database and therefore of highest accuracy • NCBI RefSeq A partially manually curated database • UniProt/TrEMBL Automatically annotated translations of EMBL coding sequence (CDS) features • EMBL / GenBank / DDBJ Primary nucleotide sequence repository
The Ensembl Genebuild Genome assembly + Experimental evidence Ensembl Genes + Computer programs
The Ensembl Genebuild A new release of Ensembl doesn’t contain a new genebuild for each species! New genebuilds are only done if there is: • a new genome assembly • a lot of new supporting evidence
Genome Assemblies Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g. • NCBI: human, mouse • Rat Genome Sequencing Consortium: rat • Sanger: zebrafish • Broad Institute: mammals • Baylor College: cow • Washington University: chicken etc. etc.
The Ensembl Genebuild • Targeted build: Align species-specific proteins to the genometo create transcripts • Similarity build: Align proteins from closely related species to locate additional transcripts • Add UTRs using mRNA evidence • Eliminate redundant transcripts and create genes
“Special” cases • Pseudogenes • Non-coding RNA genes • sequences from RFAM and miRBase dbs and covariance models • hand-checked set • Ig Segment Genes (Immunoglobulin and T-cell receptor segments) • sequences from IMGT db and Exonerate
Classification of Transcripts • Ensembl Transcripts and Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries • Genes that map to species-specific protein/mRNA records are classified as known • Genes that do not map to species-specific protein/mRNA records are classified as novel
Names and Descriptions • Transcript names are inferred from mapped transcripts and proteins • Swiss-Prot > RefSeq > TrEMBL ID • Novel transcripts have only Ensembl identifiers • Genes are assigned the official gene symbol if available • HGNC (HUGO) symbol for human genes • Species-specific nomenclature committees (MGI, ZFIN etc.) • Otherwise Swiss-Prot > RefSeq > TrEMBL ID • Gene description is inferred from mapped database entries, the source is always given
UTR coding/UTR Supporting evidence ExonView mRNA peptide peptide mRNA
Supporting evidence ContigView
Configuring the Genebuild Genebuild configured for each species Data availibility • Targeted build most useful in human, mouse • Similarity build most useful in C. intestinalis, mosquito Structural issues • Zebrafish • Many duplications • Genome from different haplotypes • Mosquito • Many single-exon genes • Genes within genes
Low Coverage Genomes Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
Low Coverage Genomes reference assembly NNNNNN “gene-scaffold”
EST Gene Set • ESTs (Expressed Sequence Tags) are single reads, high chance of sequencing mistakes • EST libraries are regularly contaminated with genomic DNA • Generally ~ 400 bp, so unlikely to cover a whole gene THEREFORE • EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set
EST Gene Set ContigView EST genes ESTs
Ab initio Predictions • Predict translatable transcript structures solely on the basis of genome sequence. • No validation with biological expression information. • GENSCAN for vertebrate genomes • SNAP better for invertebrates • NB: Both programs are over-predicting transcript structures.
Ab initio Predictions ContigView GENSCAN prediction
Automatic Annotation Quick Use unfinished sequence or shotgun assembly Consistent annotation Automatic vs Manual Annotation Manual Annotation • Slow • Need finished sequence • Flexible, can deal with inconsistencies • Most rules have exceptions • Consult publications as well as databases
Annotation that Causes Problems for Ensembl • Multiple variants • UTRs • Pseudogenes • Non-coding genes (ncRNAs) • Overlapping genes, anti-sense genes • Gene duplication events
Manually Curated Gene Sets • FlyBase fruitfly • WormBase C. elegans • SGD yeast • Vega human, zebrafish, mouse, dog
Vega Genome Browser http://vega.sanger.ac.uk
Vega Transcripts Vega transcripts Vega Havana transcripts annotated by the Havana team atSanger Vega External transcripts annotated by other Vega teams
Ensembl / Havana Merge Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set Transcripts: • Ensembl/Havana: gold • Ensembl: red / black • Havana: blue Genes: • Ensembl/Havana: gold • Ensembl: red / black • Havana: blue
Ensembl / Havana Merge Merged Ensembl / Havana gene Merged Ensembl / Havana transcript
CCDS(Consensus Coding Sequences) • Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse • Long term aim is to get to a single gene set for human and mouse • The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
Q & A Q U E S T I O N S A N S W E R S