Investigating Genomes with Ensembl Drs. Bert Overduin and Giulietta Spudich
Overview of the day • Introduction and website walk-through • Hands-on exercises (the browser) Tea/Coffee • Introduction to BioMart • Hands-on exercises (BioMart) Lunch • Determining the gene set • Hands-on exercises (gene set) Tea/Coffee • Variations presentation and hands-on
Introducing… • Genome browsing: a comparison • Consensus genes • Ensembl annotation and software • How to find help
Sequencing the genome (figure labels: histone modification, DNase I sensitive site, conserved sequence, gene, SNP)
What can we learn about genomes? • Within one genome: regulatory elements, gene order, chromatin structure… • Through comparative studies: Evolution, conserved regions, rearrangements… Gene quality and prediction.
Genome Browsers Today • Ensembl Genome browser http://www.ensembl.org • NCBI Map Viewer http://www.ncbi.nlm.nih.gov/mapview/ • UCSC Genome Browser http://genome.ucsc.edu
What Distinguishes Ensembl from the UCSC and NCBI Browsers? • The gene set. Automatic annotation based on mRNA and protein information. • Programmatic access via the Perl API (open source) • BioMart • Integration with other databases (DAS) • Comparative analysis (gene trees)
Challenges of genome browsers • Increasing sequence information 198,879,188,987 nt (Aug 2007)
Challenges of genome browsers • Increasing annotation: ENCODE • Pilot project completed in 2007: 1% of human genome • Discovered promoter elements are on either side of the transcription start site
To meet a challenge… Ensembl’s AIM: To provide annotation for the biological community that is freely available and of high quality • Started in 1999 • Joint project between EBI and Sanger • Funded primarily by the Wellcome Trust, additional funding by EMBL, NIH-NIAID, EU, BBSRC and MRC • Team of ca. 40 people, led by Ewan Birney (EBI) and Tim Hubbard (Sanger)
The Ensembl gene set • All Ensembl genes start from a known protein or mRNA • An initial alignment of protein and mRNA to the genome begins the ‘Genebuild’. (Figure labels: sequence assembly, mRNAs, protein, Ensembl gene set)
Have you heard of… • Ensembl – strives for best possible gene set www.ensembl.org • Havana (VEGA) – same goal http://vega.sanger.ac.uk • HGNC – a unique name and symbol for every gene in human http://www.genenames.org/ • UniProt – focus on proteins, and functional information www.uniprot.org
Ensembl vs Havana annotation • All genes at once (Ensembl Genebuild): quick, keeps current; consistent annotation; can apply rules to more species • Gene by gene (Havana/VEGA): flexible, can deal with inconsistencies; consults publications as well as databases; ‘out of the ordinary’ biology; however, slow and expensive
Merging sets • Havana transcripts are incorporated into Ensembl • UniProt proteins are aligned to the genome in the Ensembl genebuild • UniProt imports Ensembl peptides for human • HGNC moved to Hinxton… coordination
Consensus across genome browsers: the CCDS set http://www.ensembl.org/info/about/docs/ccds.html • A protein is deposited into the ‘Consensus CDS protein set’ (CCDS set) if NCBI, UCSC, Havana and Ensembl have all determined the same sequence.
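The CCDS rule above amounts to keeping only the entries on which every group agrees. A minimal Python sketch of that idea, assuming each group's annotation can be reduced to a mapping from gene name to coding sequence and that "the same sequence" means exact string equality; the function name `consensus_cds`, the gene names and the sequences are all hypothetical toy values, not real CCDS data or the real CCDS pipeline:

```python
def consensus_cds(annotations):
    """Return {gene: sequence} for genes on which every source agrees.

    `annotations` maps a source name to a {gene: coding_sequence} dict.
    """
    sources = list(annotations.values())
    if not sources:
        return {}
    # Genes that every source has annotated at all.
    shared = set(sources[0]).intersection(*sources[1:])
    consensus = {}
    for gene in sorted(shared):
        # Collect the distinct sequences proposed for this gene.
        sequences = {source[gene] for source in sources}
        if len(sequences) == 1:  # all sources determined the same sequence
            consensus[gene] = sequences.pop()
    return consensus


# Hypothetical toy annotations (illustrative values only).
toy_annotations = {
    "NCBI": {"geneA": "ATGGCCTAA", "geneB": "ATGAAATAG"},
    "UCSC": {"geneA": "ATGGCCTAA", "geneB": "ATGAAATAG"},
    "Havana": {"geneA": "ATGGCCTAA", "geneB": "ATGAAGTAG"},
    "Ensembl": {"geneA": "ATGGCCTAA", "geneC": "ATGTTTTGA"},
}

ccds_like = consensus_cds(toy_annotations)
print(ccds_like)
```

In this toy input only geneA qualifies: geneB is missing from one source and differs in another, and geneC is annotated by a single source.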
More about Ensembl… • Genome browsing: a comparison • Consensus genes • Ensembl annotation and software • How to find help
Ensembl Genes – biological basis All Ensembl gene predictions are based on proteins and mRNAs in: • UniProt/Swiss-Prot (manually curated) • UniProt/TrEMBL • NCBI RefSeq (manually curated) (Figure labels: protein/mRNA, sequence assembly, Ensembl genes)
Genes and Transcripts in Ensembl • Ensembl known genes or transcripts • Ensembl novel genes or transcripts • Ensembl EST genes or transcripts Non-Ensembl genes: • Imports for yeast, C. elegans, fly, mosquito, Takifugu and Tetraodon
Names in Ensembl • ENSG### Ensembl Gene ID • ENST### Ensembl Transcript ID • ENSP### Ensembl Peptide ID • ENSE### Ensembl Exon ID • For species other than human, a species code is added after ENS: MUS (Mus musculus) for mouse: ENSMUSG###; DAR (Danio rerio) for zebrafish: ENSDARG###, etc.
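The naming scheme on this slide can be read mechanically. A minimal Python sketch, assuming the species code is exactly three capital letters (as in the MUS and DAR examples) and that ### stands for a run of digits; the function name `parse_ensembl_id`, the returned field names and the example ID numbers are illustrative and not part of any Ensembl software:

```python
import re

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# "ENS" + optional species code + feature letter + digits.
# Assumption: species codes are exactly three capital letters.
ID_PATTERN = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d+)$")


def parse_ensembl_id(stable_id):
    """Split an Ensembl-style ID into species code, feature type and number."""
    match = ID_PATTERN.match(stable_id)
    if match is None:
        raise ValueError(f"not an Ensembl-style ID: {stable_id!r}")
    species_code, letter, number = match.groups()
    return {
        "species_code": species_code,  # None means no code, i.e. human
        "feature": FEATURE_TYPES[letter],
        "number": number,
    }


# Made-up ID numbers, used only to exercise the pattern.
print(parse_ensembl_id("ENSG00000012345"))
print(parse_ensembl_id("ENSMUSG00000012345"))
print(parse_ensembl_id("ENSDART00000012345"))
```

An ID without a species code is treated as human here, following the slide's statement that the code is added only for other species.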
Gene Structure in Ensembl Calmodulin Chicken No UTRs Calmodulin Human UTRs annotated
What annotation is available? • Gene/transcript/peptide models (coding and non-coding (ncRNAs)) • IDs in other databases • Mapped cDNAs, peptides, microarray probes, BAC clones etc. • Cytogenetic bands, markers, repeats etc. • Comparative data: • orthologues and paralogues, protein families, whole genome alignments, syntenic regions • Variation data: • Single Nucleotide Polymorphisms (SNPs) • Regulatory data: • “best guess” set of regulatory elements from ENCODE • Data from external sources (DAS)
Specific data sources • Microarrays (Affymetrix, Illumina, Agilent) • GO (Gene Ontology: functional classes) http://www.geneontology.org/ • OMIM (human diseases and phenotypes) http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM • Identifiers in Entrez, UniProt, RefSeq, etc. • PDB, MSD (structural databases) http://www.rcsb.org/pdb/ http://www.ebi.ac.uk/msd/
Interpro Collection of protein data Sequences, Motifs, Structures http://www.ebi.ac.uk/interpro/
How is this information organised? • Ensembl Views (Website) • Ensembl Database (open source) (Perl API, FTP site) • BioMart ‘DataMining tool’
Ensembl – Open Source • Data and software freely available • More than 50 installs worldwide • Academia and industry • Local or available via the web • Mirrors with Ensembl data, e.g. http://ensembl.genome.tugraz.at/index.html http://ensembl.genomics.org.cn/ or user projects with own data
Powered by Ensembl
Help and Information • Use our helpdesk! helpdesk@ensembl.org • View our help pages! (the ‘using Ensembl’ link) • View our animated tutorials http://www.ensembl.org/common/Workshops_Online • Mailing lists: ensembl-announce@ebi.ac.uk • Come visit our blog! http://ensembl.blogspot.com/