
Ensembl

Ensembl. Steve Searle, joint project leader, Ensembl Genebuild team. Outline: Ensembl project overview; core database schema and API; pipeline; genomic annotation; comparative genomics; variation data; Ensembl BioMart datamining db; making the data available.


Presentation Transcript


  1. Ensembl Steve Searle Joint project leader, Ensembl Genebuild team

  2. Outline • Ensembl project overview • Core database schema and API • Pipeline • Genomic annotation • Comparative genomics • Variation data • Ensembl BioMart datamining db • Making the data available

  3. What is Ensembl? Project aims • funded to provide vertebrate genomes to the world • aims to provide high-quality automated genome annotation • aims to be a leading group in genome analysis • all software, data and results freely available

  4. What is Ensembl? Project background • Group split between EBI and Sanger • Mainly Wellcome Trust funded (recently received a new five-year grant for 2006-2011) • Largest dedicated compute in biology in Europe • Developer community > 300 people, including companies

  5. Ensembl - Technical overview • Data storage • MySQL databases (~160 GB in current release) • Core databases - annotation for each species • Variation databases - variation data for some species • Compara - single database containing all comparative genomic data for species in Ensembl • Mart - set of denormalised databases for datamining • Data production • Pipeline systems running automatic annotation on a compute farm of 800 CPUs • Interfaces • Website • Mart (datamining tool) • Apollo • SQL • APIs (both Perl and Java)

  6. Currently 21 organisms in Ensembl

  7. Open source • Object model • standard interface makes it easy for others to build custom applications on top of Ensembl data • Open discussion of design (ensembl-dev@ebi.ac.uk) • Most major pharmaceutical companies and many academics on mailing list • Ensembl installs worldwide • Both public and commercial, e.g. Gramene (CSHL), Ciona-sg (Temasek), Arabidopsis (NASC), Fugu (IMCB)

  8. Outline • Ensembl project overview • Core database and API • Pipeline • Genomic annotation • Comparative genomics • Variation data • Ensembl BioMart datamining db • Making the data available

  9. The Ensembl Core Database • Relational database (MySQL) containing the genomic sequence and annotations on it (genes, alignments, ab initio predictions etc) • Data stored in it throughout analysis process and the website displays features from it • Current schema has 68 tables • Ensembl core API team control changes

  10. Requirements for the schema • Store data for human genome • … and all the other genomes we have • … and all the genomes we might get • Flexible to add more data • Easy to adapt to new genome • Responds fast enough for web site display and pipelined genebuild

  11. System Context (diagram) • Components shown: Ensembl DBs, Mart DB, Perl API, Java API (EnsJ), Pipeline, Apollo, www, MartShell, MartView, other scripts and applications

  12. Sequence Tables (schema diagram; only relationship cardinalities such as 0..n, 0..1, 1 and 1..n survive in this transcript)

  13. Feature Tables • Feature tables describe annotations with positions in sequence. • Each feature is associated with a seq_region and has a start, end, and orientation on the seq_region. • There is no central feature table. There are tables specific to each feature type (DNA/DNA alignments, DNA/Protein alignments, Repeats, Simple features). • Different feature tables have different attributes, but always have a seq_region position.

  14. Other features (schema diagram; only relationship cardinalities survive in this transcript)

  15. Tables for Genes (schema diagram; only relationship cardinalities survive in this transcript)

  16. Other tables • Sets of tables to handle: • Cross references of ensembl features to external database • Markers • QTLs • Regulatory regions and factors • Stable ID archive • Affymetrix probe data • Misc features • Density features • Tables containing meta information about the database • Karyotype bands • Protein annotation • Supporting evidence • Assembly exceptions (haplotypes and PARs)

  17. Ensembl APIs • Programmatic access to Ensembl databases is via three main APIs: • ensembl core API: access to genome database • ensembl compara API: access to compara database • ensembl variation API: access to variation database • All three have the same basic structure: • Data objects to represent biological entities, e.g. Gene, Homology, Variation • DataAdaptor objects to store and retrieve data objects from database • Data production APIs: • ensembl-pipeline: genebuild pipeline • ensembl-analysis: analysis wrapper objects • ensembl-hive: compara pipeline

  18. The Perl Core API • The Perl core API provides a layer of abstraction over the Ensembl core databases. • Written in Object-Oriented Perl. • Can be used to get information into or out of Ensembl databases. • Insulates programmers to some extent from changes to the database schema. • Insulates programmer from coordinate transformations

  19. Data Objects • Information is obtained from the API in the form of Data Objects. • Each object represents some data which is stored in the database. • A Gene object represents a gene, a Transcript object represents a transcript, a Marker Object represents a Marker, etc.

  20. Data Objects – Code Example

          # print out the start, end and strand of a transcript
          print $transcript->start(), '-', $transcript->end(),
                '(', $transcript->strand(), ")\n";

          # print out the stable identifier for an exon
          print $exon->stable_id(), "\n";

          # print out the name of a marker and its primer sequences
          print $marker->display_marker_synonym()->name, "\n";
          print "left primer: ", $marker->left_primer(), "\n";
          print "right primer: ", $marker->right_primer(), "\n";

          # set the start and end of a simple feature
          $simple_feature->start(10);
          $simple_feature->end(100);

  21. Object Adaptors • Object Adaptors are factories for Data Objects. • Data Objects are retrieved from and stored in databases using Object Adaptors. • Each Object Adaptor is responsible for creating objects of only one particular type. • Data Adaptor fetch, store, and remove methods are used to retrieve, save, and delete information in the database. • All the SQL is in the Object Adaptors

  22. Object Adaptors – Code Example

          # fetch a gene by its internal identifier
          $gene = $gene_adaptor->fetch_by_dbID(1234);

          # fetch a gene by its stable identifier
          $gene = $gene_adaptor->fetch_by_stable_id('ENSG0000005038');

          # store a transcript in the database
          $transcript_adaptor->store($transcript);

          # remove an exon from the database
          $exon_adaptor->remove($exon);

          # get all transcripts having a specific interpro domain
          @transcripts = @{$transcript_adaptor->fetch_all_by_domain('IPR000980')};

  23. The DBAdaptor and the Registry • The Database Adaptor is a factory for Object Adaptors • It is used to connect to the database and to obtain Object Adaptors • Registry enables access to multiple databases using information from a config file (important for compara work) • Diagram labels: Data Objects (Gene, Marker, …); Object Adaptors (GeneAdaptor, MarkerAdaptor, …); DBAdaptor; DB

  24. Slices • A Slice Data Object represents an arbitrary region of a genome. • Slices are not directly stored in the database. • A Slice is used to request sequence or features from a specific region in a specific coordinate system. • Diagram labels: chr20, Clone AC022035

  25. Slices – Code Example

          # get the slice adaptor
          $slice_adaptor = $db->get_SliceAdaptor();

          # fetch a slice on a region of chromosome 12
          $slice = $slice_adaptor->fetch_by_region('chromosome', '12', 1e6, 2e6);

          # print out the sequence from this region
          print $slice->seq();

          # get all clones in the database and print out their names
          @slices = @{$slice_adaptor->fetch_all('clone')};
          foreach $slice (@slices) {
              print $slice->seq_region_name(), "\n";
          }

  26. Features • Features are Data Objects with associated genomic locations. • All Features have start, end, strand and slice attributes. • Features are retrieved from Object Adaptors using limiting criteria such as identifiers or regions (slices). • Gene • Transcript • Exon • PredictionTranscript • PredictionExon • DnaAlignFeature • ProteinAlignFeature • SimpleFeature • MarkerFeature • QtlFeature • MiscFeature • KaryotypeBand • RepeatFeature • AssemblyExceptionFeature • DensityFeature

  27. A Complete Code Example

          use Bio::EnsEMBL::DBSQL::DBAdaptor;

          # connect to the public Ensembl database server
          my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
              -host   => 'ensembldb.ensembl.org',
              -dbname => 'homo_sapiens_core_35_35h',
              -user   => 'anonymous');

          my $slice_ad = $db->get_SliceAdaptor();
          my $slice    = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);

          # print position, strand and score of every simple feature on the slice
          foreach my $sf (@{$slice->get_all_SimpleFeatures()}) {
              my $start  = $sf->start();
              my $end    = $sf->end();
              my $strand = $sf->strand();
              my $score  = $sf->score();
              print "$start-$end($strand)$score\n";
          }

  28. A Gene Object Code Example

          #!/usr/bin/perl -w
          use Bio::EnsEMBL::DBSQL::DBAdaptor;
          use strict;

          my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
              -host   => 'ensembldb.ensembl.org',
              -dbname => 'homo_sapiens_core_35_35h',
              -user   => 'anonymous');

          my $slice_ad = $db->get_SliceAdaptor();
          my $slice    = $slice_ad->fetch_by_region('chromosome', 'X', 1e6, 10e6);

          foreach my $gene (@{$slice->get_all_Genes_by_type('ensembl')}) {
              print "Gene ", $gene->stable_id, " ", $gene->start, " ", $gene->end, "\n";
              foreach my $trans (@{$gene->get_all_Transcripts}) {
                  print " Trans ", $trans->stable_id, "\n";
                  # translate the transcript and wrap the peptide at 60 residues per line
                  my $tlnseq = $trans->translate->seq;
                  $tlnseq =~ s/(.{1,60})/$1\n/g;
                  print " ", $tlnseq;
              }
          }
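
The substitution `s/(.{1,60})/$1\n/g` in the gene example above is what breaks the translated sequence into lines of at most 60 residues. The following is a minimal stand-alone sketch of that step. It needs no Ensembl modules; `wrap_sequence` is a hypothetical helper name and the peptide is made up purely to show the line lengths.

```perl
use strict;
use warnings;

# Wrap a sequence string into lines of at most $width characters.
# wrap_sequence is a hypothetical helper, not part of the Ensembl API.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    # The greedy match takes up to $width characters at a time
    # and the replacement appends a newline after each chunk.
    (my $wrapped = $seq) =~ s/(.{1,$width})/$1\n/g;
    return $wrapped;
}

# Made-up 130-residue peptide, used only to show the resulting line lengths.
my $peptide = 'M' . ('A' x 129);
my @lines   = split /\n/, wrap_sequence($peptide);
print scalar(@lines), " lines: ", join(', ', map { length } @lines), "\n";
# prints "3 lines: 60, 60, 10"
```

Because the pattern accepts between 1 and 60 characters, the final short chunk is also matched and terminated with a newline.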

  29. Coordinate Transformations • The API provides the means to convert between any related coordinate systems in the database. • Feature methods transfer, transform, project can be used to move features between coordinate systems. • Slice method project can be used to move slices between coordinate systems.

  30. Feature::transfer • The Feature method transfer moves a feature from one Slice to another. • The Slice may be in the same coordinate system or a different coordinate system. • Diagram labels: Chr20, Chr17, AC099811, ChrX

  31. Feature::transfer – Code Example

          # fetch an exon from the database
          $exon = $exon_adaptor->fetch_by_stable_id('ENSE00001180238');
          print "Exon is on slice: ", $exon->slice()->name(), "\n";
          print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

          # transfer the exon to a small slice just covering it
          $exon_slice = $slice_adaptor->fetch_by_Feature($exon);
          $exon = $exon->transfer($exon_slice);
          print "Exon is on slice: ", $exon->slice()->name(), "\n";
          print "Exon coords: ", $exon->start(), '-', $exon->end(), "\n";

      Sample output:

          Exon is on slice: chromosome:NCBI34:12:1:132078379:1
          Exon coords: 56452706-56452951
          Exon is on slice: chromosome:NCBI34:12:56452706:56452951:1
          Exon coords: 1-246
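
The change from 56452706-56452951 to 1-246 in the sample output above is an offset shift: the new slice starts at the exon's first base, so that base becomes position 1. The sketch below reproduces the arithmetic for the simplest case only, assuming a forward-strand slice on the same seq_region. `to_slice_coords` is a hypothetical helper, not an Ensembl API method, and it handles neither strand flips nor moves between coordinate systems.

```perl
use strict;
use warnings;

# Convert a feature's coordinates on a seq_region into coordinates
# relative to a forward-strand slice on the same seq_region.
# to_slice_coords is a hypothetical helper, not an Ensembl API method.
sub to_slice_coords {
    my ($feat_start, $feat_end, $slice_start) = @_;
    my $offset = $slice_start - 1;    # slice position 1 maps to $slice_start
    return ($feat_start - $offset, $feat_end - $offset);
}

# Exon and slice coordinates taken from the sample output above.
my ($start, $end) = to_slice_coords(56452706, 56452951, 56452706);
print "Exon coords: $start-$end\n";    # prints "Exon coords: 1-246"
```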

  32. Stability of API • Ensembl API changes to meet our needs • Request for greater stability from users • Some methods are now labelled as stable and we guarantee that they will not change for at least 2 years.

  33. Outline • Ensembl project overview • Core database and API • Pipeline • Genomic annotation • Comparative genomics • Variation data • Ensembl BioMart datamining db • Making the data available

  34. Runnables and RunnableDBs • Runnables are Perl objects which wrap analysis programs, e.g. the Blast runnable wraps BLAST. Methods: • run • parse_results: generates Ensembl data objects • output: returns the generated data objects • RunnableDBs are Perl objects which wrap Runnables, allowing them to retrieve input data from and store output data into Ensembl databases. Methods: • fetch_input • write_output

  35. Runnable example

          use Bio::SeqIO;
          use Bio::EnsEMBL::Slice;
          use Bio::EnsEMBL::CoordSystem;
          use Bio::EnsEMBL::Analysis;
          use Bio::EnsEMBL::Analysis::Runnable::Genscan;
          use Bio::EnsEMBL::Analysis::Runnable::Blast;
          use Bio::EnsEMBL::Analysis::Tools::BPliteWrapper;

          # read one sequence from a FASTA file and wrap it in a Slice
          my $seq = Bio::SeqIO->new(
              -file   => "<test.fa",
              -format => 'Fasta')->next_seq;
          my $slice = Bio::EnsEMBL::Slice->new(
              -seq             => $seq->seq,
              -coord_system    => Bio::EnsEMBL::CoordSystem->new(-name => 'contig', -rank => 1),
              -seq_region_name => $seq->display_id,
              -start           => 1,
              -end             => $seq->length);

          # run Genscan on the slice
          my $genscan_runnable = Bio::EnsEMBL::Analysis::Runnable::Genscan->new(
              -query    => $slice,
              -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'genscan'));
          $genscan_runnable->run;

          # blast the translation of each prediction and collect the hits
          my @output;
          foreach my $prediction (@{$genscan_runnable->output}) {
              my $blast_run = Bio::EnsEMBL::Analysis::Runnable::Blast->new(
                  -query    => $prediction->translate,
                  -parser   => Bio::EnsEMBL::Analysis::Tools::BPliteWrapper->new(),
                  -database => 'embl_vertrna',
                  -program  => 'wutblastn',
                  -analysis => Bio::EnsEMBL::Analysis->new(-logic_name => 'vertrna'));
              $blast_run->run;
              push(@output, @{$blast_run->output});
          }

  36. RunnableDB example

          #!/usr/local/ensembl/bin/perl -w
          use strict;
          use Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor;
          use Bio::EnsEMBL::Pipeline::Analysis;

          my $db = Bio::EnsEMBL::Pipeline::DBSQL::DBAdaptor->new(
              -host   => 'localhost',
              -user   => 'root',
              -dbname => 'test_db');

          my $anal   = $db->get_AnalysisAdaptor->fetch_by_logic_name('Uniprot');
          my $rdbstr = "Bio::EnsEMBL::Analysis::RunnableDB::" . $anal->module;

          # load the RunnableDB class named by the analysis before instantiating it
          eval "require $rdbstr" or die "Cannot load $rdbstr: $@";

          my $runobj = $rdbstr->new(
              -db       => $db,
              -input_id => 'contig::AL1347153.1.3517:1:3571:1',
              -analysis => $anal);

          $runobj->fetch_input;
          $runobj->run;
          $runobj->write_output;

  37. Writing Runnables and RunnableDBs • A lot of functionality is implemented in the base classes • At its simplest just requires implementing: • parse_results in the Runnable • get_adaptor in the RunnableDB • fetch_input in the RunnableDB • Other methods which may need overriding: • write_output in the RunnableDB • run_analysis in the Runnable
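
The division of labour described above, where the base class supplies `run` and `output` and a subclass fills in `parse_results`, can be sketched in plain Perl. This is a simplified stand-in for the pattern, not the real `Bio::EnsEMBL::Analysis::Runnable`: the `Sketch::` package names and the toy "analysis" (echoing the query string) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical base class: a simplified stand-in for the pattern described
# above, not the real Bio::EnsEMBL::Analysis::Runnable.
package Sketch::Runnable;

sub new {
    my ($class, %args) = @_;
    return bless { query => $args{query}, output => [] }, $class;
}

# run() is the template method: execute the analysis, then parse its results.
sub run {
    my ($self) = @_;
    my $raw = $self->run_analysis();
    $self->parse_results($raw);
    return $self;
}

# Default analysis step; a real Runnable would call an external program here.
sub run_analysis {
    my ($self) = @_;
    return $self->{query};
}

# Subclasses must provide their own parser.
sub parse_results { die "parse_results must be implemented by the subclass\n" }

# Return the array reference of results collected by parse_results.
sub output { return $_[0]{output} }

# Hypothetical subclass: only parse_results needs to be written.
package Sketch::Runnable::WordCount;
our @ISA = ('Sketch::Runnable');

sub parse_results {
    my ($self, $raw) = @_;
    # Store one "result" per whitespace-separated token of the raw output.
    push @{ $self->{output} }, split ' ', $raw;
}

package main;

my $runnable = Sketch::Runnable::WordCount->new(query => 'hit1 hit2 hit3');
$runnable->run;
print scalar(@{ $runnable->output }), " results\n";    # prints "3 results"
```

Keeping `run` in the base class means every subclass is driven the same way, which is what lets a pipeline call `run` and `output` without knowing which program is wrapped.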

  38. Pipeline Summary

  39. Example pipeline

  40. Current hardware • 8x ES40 Alpha (667 MHz) with 2 TB fibre channel storage • 10x ES45 Alpha (1 GHz) with 5 TB fibre channel storage • 3x Itanium 4 CPU with 1.6 TB storage • 400 HS20 IBM Blades (2x 2.8 or 3.2 GHz PIV + 4 GB memory) + 2 TB clustered SAN filesystem or 600 GB clustered IDE filesystem (both IBM GPFS) • Tru64 UNIX/Linux • 21 MySQL (v 4.1) instances • Most binaries and all sequence databases stored locally (avoids using NFS)

  41. Outline • Ensembl project overview • Core database and API • Pipeline • Genomic annotation • Comparative genomics • Variation data • Ensembl BioMart datamining db • Making the data available

  42. Genome annotation overview • Raw compute: alignments against protein and DNA dbs, and other basic analyses • Automatic gene annotation • Protein coding gene models • Pseudogenes • (some) RNA genes • Alignment of species ESTs and cDNAs • Affymetrix probe mapping • Protein domain annotation • Cross reference generation

  43. The Raw Computes • Repeat Features • RepeatMasker • Dust • TRF • Ab Initio Genes • Genscan (sometimes other programs) • Blast alignments • Blastp against Uniprot • Blastn against EMBL vertebrate RNAs and UniGene Clusters • Other Features • CpG islands • tRNA genes • Transcription start sites using Eponine

  44. Gene Annotation (flow diagram) • Inputs: species-specific proteins, other proteins, species-specific cDNAs, species-specific ESTs • Alignment programs: Genewise, Exonerate • Intermediate sets: Genewise genes, aligned cDNAs, aligned ESTs, blessed gene set (optional), Genewise genes with UTRs, supported ab initio (optional), cDNA genes, preliminary gene set, pseudogenes • Processing steps: ClusterMerge, Genebuilder, Gene Combiner • Outputs: final set + pseudogenes, core Ensembl genes, Ensembl EST genes

  45. ncRNAs • Functional RNAs • Families share conserved secondary structure • Low sequence identity • Ribosome • Spliceosome • tRNAs • miRNA

  46. Difficulties in annotating ncRNAs • Ab initio gene prediction programs such as GENSCAN cannot predict non-coding genes. • BLAST performs poorly at detecting non-coding genes where structure is conserved but sequence identity is low. • Cannot use repeat-masked DNA, as some ncRNAs look very much like repeats (Alu is related to SRP RNA). • Cannot use ESTs, as ncRNAs lack poly-A.

  47. Rfam • Hand-made alignments • Use Infernal to make Covariance Models • Scan models over subset of EMBL to build family alignments

  48. Problems 2 • Infernal does not scale well: • “Covariance model searches are extremely compute intensive… The compute time scales roughly to the 4th power of the length of the RNA, so larger models quickly become infeasible without significant compute resources” • How long would it take to run the human genome? • Rough estimate > a week on the farm • Need to limit the amount of sequence we run Infernal on
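
To get a feel for the quoted "4th power" scaling, the sketch below prints the relative search cost for a few RNA lengths. It assumes cost is exactly proportional to length to the 4th power, which the quote gives only as a rough rule. The lengths are illustrative values rather than real Rfam models, and `relative_cost` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Relative compute cost of a covariance-model search, assuming cost grows
# with the 4th power of RNA length (the rough rule quoted above).
# relative_cost is a hypothetical helper name.
sub relative_cost {
    my ($length, $reference_length) = @_;
    return ($length / $reference_length) ** 4;
}

# Illustrative lengths only, not taken from real Rfam families.
for my $length (100, 200, 400) {
    printf "%4d nt: %gx the cost of a 100 nt model\n",
        $length, relative_cost($length, 100);
}
# prints 1x, 16x and 256x for 100, 200 and 400 nt respectively
```

Each doubling of the model length multiplies the cost by 16, which is why limiting the amount of sequence passed to Infernal matters so much.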

  49. Rfam Scan • Rfam procedure to speed up Infernal on large eukaryotes • Uses BLAST to narrow search: • BLAST is poor at finding ncRNAs with low sequence identity • Rfam families contain sequences from all organisms • More sequence variation = more chance of BLAST making an alignment • In Ensembl: • Separated BLAST and Infernal steps (using Runnables) • Determined filtering for BLAST results to limit time without significant reduction in sensitivity • Now runs in less than 24 hours.

  50. miRNA • Highly conserved across species • Precursor stem loop sequence ~ 70nt • Mature miRNA ~ 21nt • Identified using BLAST genomic vs miRBase precursors • RNAfold used to test for stem loop • Mature sequence identified (only 2 nt changes tolerated) • Start with ~ 290,000 blast hits • End with 222 miRNA • 96% of SE miRNAs + additional 60 • Novel c.f. miRBase: • 1 – chicken, 36 – mouse, 5 – rat
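
The "only 2 nt changes tolerated" filter on the mature sequence can be pictured as a mismatch count between a candidate and a reference of equal length. The sketch below assumes that "changes" means substitutions only (no insertions or deletions), which the slide does not state. The function names and the 21 nt sequences are made up for illustration and are not taken from miRBase or the Ensembl code.

```perl
use strict;
use warnings;

# Count positions at which two equal-length sequences differ.
# Hypothetical helper; assumes substitutions only, no indels.
sub count_mismatches {
    my ($first, $second) = @_;
    die "sequences must have equal length\n"
        unless length($first) == length($second);
    my $mismatches = 0;
    for my $i (0 .. length($first) - 1) {
        $mismatches++ if substr($first, $i, 1) ne substr($second, $i, 1);
    }
    return $mismatches;
}

# Accept a candidate mature sequence if it differs from the reference
# at no more than $max positions (2 by default, as on the slide).
sub is_acceptable_mature {
    my ($reference, $candidate, $max) = @_;
    $max = 2 unless defined $max;
    return count_mismatches($reference, $candidate) <= $max;
}

# Made-up 21 nt sequences, not real miRNAs.
my $reference = 'ACGUACGUACGUACGUACGUA';
for my $candidate ('ACGUACGAACGUACGUACGUC', 'UCGUACGAACGUACGUACGUC') {
    printf "%s: %d mismatches, %s\n",
        $candidate,
        count_mismatches($reference, $candidate),
        is_acceptable_mature($reference, $candidate) ? 'kept' : 'rejected';
}
```

The first candidate differs at two positions and is kept; the second differs at three and is rejected.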
