Learn about the Ensembl Gene Set and how it is determined through the gene annotation pipeline, manual curation, and biological evidence. Explore pseudogenes, ncRNAs, and the CCDS project.
The Ensembl Gene Set: The “Genebuild” 21 April 2008
Outline • The GeneBuild (determining the Ensembl gene set) • What does it mean for the scientist? • ‘annotation pipeline’ vs ‘manual curation’ • Pseudogenes • ncRNAs • The CCDS project
Introduction • What is available? I) Sequence Assemblies from genome sequencing efforts
Gene Sequencing- the Assembly This generates clones, vs new sequencing methods http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html
Ciona intestinalis Shotgun assembly Clones Available • Human: • (Tilepath- used in the assembly)
ContigView: Clones and Contigs Contigs Clones (Plate/well numbers) Ensembl Transcripts
Task: View the tilepath clone in ContigView for the region containing the human BRCA2 gene. Hint: Start with a search for the BRCA2 gene.
The Ensembl Geneset • How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome? Ensembl Geneset Protein Sequence Assembly
Once the Assembly is Imported… • Proteins/mRNAs are aligned. • These have been submitted to databases such as: • UniProt (manually curated) and • RefSeq (partially manually curated)
The Biological Evidence All Ensembl gene predictions are based on experimental evidence: • UniProt/Swiss-Prot • A manually curated database and therefore of highest accuracy • NCBI RefSeq • A partially manually curated database • UniProt/TrEMBL • Automatically annotated translations of EMBL coding sequence (CDS) features • EMBL / GenBank / DDBJ • Primary nucleotide sequence repository
Database Relationship NCBI RefSeq EMBL-Bank DDBJ GenBank Individual Lab’s Submission UniProt Swiss-Prot TrEMBL
EST Genebuild Sequence (Assembly) Manual annotation (HAVANA) EMBL-Bank GenBank DDBJ Proteins (e.g. Swiss-Prot) Ensembl mRNA EST genes
Why do I want to know?… • Ensembl genes may be based on multiple protein/mRNAs • What is an Ensembl gene based on?
Task • Look at the evidence for the human EPO gene. • What was this gene based on? • Hint: Go to Exon Information from the GeneView page
Species-Specific GeneBuilds • Pan troglodytes genes are built by projection from human genes. • Zebrafish has many gene duplications. • Homo sapiens genes must have protein evidence, not just mRNA.
Task • When was the chimpanzee (Pan troglodytes) Genebuild performed? • Can you find information as to how genes were annotated? • Hint: Look on the chimpanzee index page
External Gene Set: VEGA/Havana • Human, zebrafish, mouse and dog • Havana transcripts in blue or gold… • What are Havana transcripts?
Havana and Ensembl match When Havana (manual curation) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the two transcripts are merged and coloured gold.
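The merge rule above can be sketched in a few lines of Python. This is only an illustration, assuming each transcript is described by its strand and a tuple of (start, end) exon coordinates. The `Transcript` class and the `merge_identical` function are hypothetical names invented for this example, not part of the Ensembl code, and the real pipeline may compare further features.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model: a source label plus exon coordinates."""
    source: str                          # "havana", "ensembl" or "merged"
    strand: int                          # +1 or -1
    exons: Tuple[Tuple[int, int], ...]   # (start, end) pairs, sorted by start


def merge_identical(havana: List[Transcript],
                    ensembl: List[Transcript]) -> List[Transcript]:
    """Replace transcripts predicted identically by both sources
    (same strand, same exon coordinates) with one "merged" transcript."""
    havana_keys = {(t.strand, t.exons) for t in havana}
    ensembl_keys = {(t.strand, t.exons) for t in ensembl}
    shared = havana_keys & ensembl_keys

    result = []
    already_merged = set()
    for t in havana + ensembl:
        key = (t.strand, t.exons)
        if key in shared:
            # Emit the merged ("gold") transcript only once.
            if key not in already_merged:
                result.append(Transcript("merged", t.strand, t.exons))
                already_merged.add(key)
        else:
            # Transcripts unique to one source are kept unchanged.
            result.append(t)
    return result


if __name__ == "__main__":
    havana = [Transcript("havana", 1, ((100, 200), (300, 400)))]
    ensembl = [
        Transcript("ensembl", 1, ((100, 200), (300, 400))),  # exact match
        Transcript("ensembl", 1, ((100, 200), (300, 450))),  # differs by 50 bp
    ]
    for t in merge_identical(havana, ensembl):
        print(t.source, t.exons)
```

In this sketch a single differing coordinate is enough to keep two transcripts separate, which mirrors the "base pair for base pair" requirement on the slide.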
Manually-curated gene sets in Ensembl • Vega (Havana) • Homo sapiens, Danio rerio, Mus musculus and Canis familiaris • WormBase • Caenorhabditis elegans • FlyBase • Drosophila melanogaster • SGD • Saccharomyces cerevisiae
What Can Go Wrong? • I) A gap in the assembly • Gene might not be found in Ensembl • II) Fused genes • BLAST hit (Swiss-Prot entry) • Gene might be associated with two names
Outline • The genome sequence • The Genebuild • ‘manual curation’ by Havana • Other: EST gene set Pseudogenes ncRNAs
Expressed Sequence Tags vs ‘cDNA’ • ESTs are annotated separately. Why? • mRNA and cDNA used in the GeneBuild: • Sequenced to a high standard, often complete. • EST: Lower quality sequence. • ‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence. • ESTs are only 500-800 nucleotides long • Low quality fragments: sequence error of ~2%. • BUT ESTs provide useful expression information • Discovery of new genes, especially in diseased organisms • Tissue type • Timing/developmental stage • Samples more transcripts and variants
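To put the figures above in perspective, a short calculation shows how many erroneous bases a single EST is expected to carry. It assumes the ~2% error rate applies uniformly along the read, which is a simplification. The function name `expected_errors` is made up for this example.

```python
def expected_errors(length_nt: int, error_rate: float = 0.02) -> float:
    """Expected number of miscalled bases in a read of the given length,
    assuming a uniform per-base error rate (a simplifying assumption)."""
    return length_nt * error_rate


if __name__ == "__main__":
    # The slide gives 500-800 nucleotides as the typical EST length.
    for length in (500, 800):
        print(f"{length} nt EST: about {expected_errors(length):.0f} errors")
```

Roughly 10 to 16 wrong bases per read helps explain why ESTs are kept in a separate gene set rather than mixed into the main GeneBuild.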
Where Can I See This EST Geneset? ContigView • Choose EST genes • EST track
Pseudogenes: ‘False’ Genes • Unprocessed: produced by gene duplication and rearrangement • Processed: produced by reverse transcription of an mRNA (with poly-A tail, AAAAAA) and re-integration into the genome
ncRNAs (non coding RNAs) • What types are in Ensembl? • tRNA (transfer RNA) • rRNA (ribosomal RNA) • scRNA (small cytoplasmic) • snRNA (small nuclear) • snoRNA (small nucleolar) • miRNA (microRNA)
ncRNAs (2 types) • I) RNA with low sequence homology can be identified through conserved secondary structure (search the genome using Rfam patterns) • II) High sequence conservation (miRNA) • BLAST alignment • ‘RNA fold’ applied to make sure the sequences can fold (hairpin)
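The hairpin requirement for miRNA candidates can be illustrated with a toy check. This is not how a folding program works: real tools score structures with thermodynamic energy models, whereas the sketch below only asks whether some stretch of the sequence can base-pair with a downstream stretch across a short loop. The function `can_form_hairpin` and its parameters `min_stem` and `min_loop` are assumptions made for this example.

```python
# Watson-Crick pairs plus the G-U wobble pair commonly seen in RNA stems.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_form_hairpin(seq: str, min_stem: int = 4, min_loop: int = 3) -> bool:
    """Toy test: True if the sequence contains a perfectly paired stem of at
    least `min_stem` base pairs enclosing a loop of at least `min_loop` bases."""
    seq = seq.upper().replace("T", "U")  # accept DNA or RNA alphabets
    n = len(seq)
    for i in range(n):
        # j is the 3' partner of position i; it must leave room for the
        # stem on both sides plus the minimum loop in between.
        for j in range(i + 2 * min_stem + min_loop - 1, n):
            k = 0
            # Extend the stem inward while bases pair and the loop stays
            # at least min_loop bases long.
            while i + k < j - k - min_loop and (seq[i + k], seq[j - k]) in PAIRS:
                k += 1
            if k >= min_stem:
                return True
    return False


if __name__ == "__main__":
    print(can_form_hairpin("GGGGAAACCCC"))  # stem of G-C pairs around an AAA loop
    print(can_form_hairpin("AAAAAAAAAAA"))  # no complementary bases at all
```

A candidate that fails even this crude test could not form the hairpin precursor expected of a miRNA, which is the intuition behind the folding step on the slide.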
ncRNAs… where can I see them? • Find them in ContigView: • or use BioMart.
Summary – Ensembl Genes • All Ensembl genes are based on biological evidence (protein and mRNA). • One Ensembl gene may come from proteins and mRNAs in various databases. • Havana (manually curated) genes are incorporated into the Ensembl geneset, merged for human. • The CCDS set strives for consensus coding sequences across databases. • Pseudogenes and ncRNAs are annotated, along with a separate EST gene set.
For more on GeneBuild: • Help and Documentation • (About Ensembl) http://www.ensembl.org/info/about/docs/genome_annotation.html