1 / 41

Ensembl Developers Workshop Core API

Ensembl Developers Workshop Core API. Bert Overduin Edinburgh, 24 February 2009. Outline. The Ensembl Core databases and Perl API Documentation & Help Data Objects, Object Adaptors, Database Adaptors & The Registry Coordinate Systems & Slices Features

galia
Download Presentation

Ensembl Developers Workshop Core API

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EnsemblDevelopers WorkshopCore API Bert Overduin Edinburgh, 24 February 2009

  2. Outline • The Ensembl Core databases and Perl API • Documentation & Help • Data Objects, Object Adaptors, Database Adaptors & The Registry • Coordinate Systems & Slices • Features • Genes, Transcripts, Exons & Translations • External References • Coordinate Mappings

  3. The Ensembl Core databases • The Ensembl Core databases store: genomic sequence assembly information gene, transcript and protein models cDNA and protein alignments cytogenetic bands, markers, repeats, CpG islands etc. external references • homo_sapiens_core_52_36n species group data version assembly version software version

  4. The Ensembl Core Perl API • Used to retrieve data from and store data in the Ensembl Core databases • Written in Object-Oriented Perl • Partly based on and compatible with BioPerl objects (http://www.bioperl.org) • Used by the Ensembl analysis and annotation pipeline and the Ensembl web code • Robust and well-supported • Forms the basis for the other Ensembl APIs

  5. Documentation & Help • Installation instructions, web-browsable version of the POD (Perldoc) and tutorial: http://www.ensembl.org/info/docs/api/core/index.html • Inline Perl POD (Plain Old Documentation) • ensembl-dev mailing list: http://www.ensembl.org/info/about/contact/mailing.html • Ensembl helpdesk: helpdesk@ensembl.org

  6. Data Objects • Data Objects model biological entities, e.g. Genes, Transcripts, Translations, … • Each Data Object encapsulates information from one or a few specific MySQL tables • Data Objects are retrieved from and stored in the database using Objects Adaptors

  7. Object Adaptors • Object Adaptors are Data Object factories • Each Object Adaptor is responsible for creating Data Objects of only one particular type

  8. Database Adaptors • Database Adaptors are Object Adaptor factories • Database Adaptors are used to connect to a single database

  9. The Registry • The Registry is a container for all Database Adaptors • The Registry handles all database connections • The Registry is an Object Adaptor factory • The Registry can be initialised via a configuration file or by automatically discovering databases on a RDBMS instance

  10. Variation Genotype Gene Gene Gene Gene Gene Marker Marker Marker Marker Marker Object Object Object Human Variation DB Mouse Variation DB Mouse Core DB Human Core DB System Architecture GeneAdaptor MarkerAdaptor ObjectAdaptor VariationAdaptor GenotypeAdaptor Core DBAdaptor Variation DBAdaptor Ensembl Registry

  11. Code Example # Obtain the Ensembl Gene IDs for all human genes use Bio::EnsEMBL::Registry; my $registry = 'Bio::EnsEMBL::Registry'; $registry->load_registry_from_db( -host => 'ensembldb.ensembl.org', -user => 'anonymous' ); my $gene_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Gene’ ); my $genes = $gene_adaptor->fetch_all; while ( my $gene = shift @{$genes} ){ print $gene->stable_id, “\n”; }

  12. Code Example OUTPUT: ENSG00000208234 ENSG00000199674 ENSG00000221622 ENSG00000207604 ENSG00000207431 ENSG00000221312 ENSG00000223135 ENSG00000223136 ENSG00000200159 ENSG00000200131 ENSG00000206672 ENSG00000212552 ENSG00000201452 ENSG00000202016 ENSG00000200455 ENSG00000201916 ENSG00000212228 ENSG00000202261 ENSG00000207742 ENSG00000223137 ENSG00000212550 ENSG00000223138 ENSG00000200827 ENSG00000221638 ENSG00000201937 ENSG00000212205 ENSG00000221428 ENSG00000202470 ENSG00000200236 ENSG00000223139 ENSG00000207932 ENSG00000223140 ENSG00000221791 ENSG00000199102 ENSG00000199960 ENSG00000208013 ENSG00000223141 ENSG00000223142 ENSG00000221363 ENSG00000213177 ENSG00000216774 ENSG00000213194 ENSG00000207492 ENSG00000219252 ENSG00000222962 ENSG00000206963 ENSG00000207934 ENSG00000199814 ENSG00000199796 ENSG00000212450 ENSG00000207603 ENSG00000202474 ENSG00000206831 ENSG00000207331 ENSG00000221740 ENSG00000222963 ENSG00000201923 ENSG00000198928 ENSG00000208225 ENSG00000208227 ENSG00000208243 ENSG00000177117 ENSG00000208245 ENSG00000209219 ENSG00000213184 ENSG00000211401 ENSG00000208228 ENSG00000208229 ENSG00000208233 ENSG00000208235

  13. Exercise 1 • Load all databases and print their names. • What is the name of the human core database? There are several solutions possible! Use Perldoc! (http://www.ensembl.org/info/docs/api/core/index.html)

  14. Coordinate Systems • Sequences stored in Ensembl are associated with Coordinate Systems • Coordinate Systems vary from species to species: human: chromosome, supercontig, clone, contig zebrafish: chromosome, scaffold, contig • Sequence information is directly stored in the database for the ‘sequence level’ Coordinate System • The Coordinate System of the highest level in a given region is the ‘top level’ Coordinate System • Features are stored in a single Coordinate System

  15. Coordinate Systems Top level Chromosome Contigs Clones (Tiling path) Sequence level

  16. Slices • A Slice Data Object represents an arbitrary region of a genome • Slices are not directly stored in the database • Slices are used to obtain sequences or features from a specific region in a specific coordinate system

  17. Code Example # Obtain a slice covering the entire human Y chromosome my $slice_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Slice’ ); my $slice = $slice_adaptor->fetch_by_region( ‘chromosome’, ‘Y’ ); printf( “Slice: %s %s %s-%s (%s)\n”, $slice->coord_system_name $slice->seq_region_name $slice->start $slice->end $slice->strand ); OUTPUT: Slice: chromosome Y 1-57772954 (1)

  18. Exercise 2 • Obtain the names of the coordinate systems for rat. • Obtain a slice covering the first 10 MB of chromosome 20 of human and print its sequence. • Obtain a slice covering the human gene with Ensembl Gene ID ‘ENSG00000101266’ with 2 kb of flanking sequence and print its sequence. • Print the name, start, end and strand of the obtained slices as well as their coordinate system. If you want to output your sequences to a file, have a look at BioSeq:IO at http://doc.bioperl.org/releases/bioperl-1.2.3/

  19. Features • Features are Data Objects with a defined location on the genome • All Features have a start, end, strand and slice • The start coordinate of a Feature is always less than its end coordinate, irrespective of the strand on which it is located (exception: insertion features)

  20. Features Some examples of Features: • Gene, Transcript and Exon • ProteinFeature • PredictionTranscript and PredictionExon • DNAAlignFeature and ProteinAlignFeature • RepeatFeature • MarkerFeature • OligoFeature • SimpleFeature • MiscFeature

  21. Code Example # Obtain all markers on human chromosome 1 my $slice_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Slice’ ); my $slice = $slice_adaptor->fetch_by_region( ‘chromosome’, ‘1’ ); my $markers = $slice->get_all_MarkerFeatures; while ( my $marker = shift @{$markers} ){ printf( “%s\t%s\n”, $marker->slice->name, $marker->feature_Slice->name ); } OUTPUT: chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:1237:1488:1 chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:2585:2812:1 chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:4284:5085:1

  22. Exercise 3 • Obtain all the CpG islands on the first 5 Mb of dog chromosome 20. Print the total number of CpG islands and the position and sequence of each CpG island. • Obtain all the protein alignment features on the first 5 Mb of dog chromosome 20. Print for each alignment the name of the aligned protein, the start and end coordinates of the matching region on the protein and on the genome and the name of the analysis resulting in the alignment. Hint: CpG islands are stored as SimpleFeatures with logic_name ‘cpg’.

  23. Genes, Transcripts & Exons • Genes, Transcript and Exons are Feature Data Objects • A Gene is a grouping of Transcripts which share any (partially) overlapping Exons • A Transcript is a set of Exons • Introns are not explicitly defined in the database

  24. Translations • Translations are not Feature Data Objects • Translations define the Untranslated Region (UTR) and Coding Sequence (CDS) composition of Transcripts • Protein sequences are not stored in the database, but computed on the fly using Transcript(!) objects

  25. Exercise 4 • Obtain the gene with Ensembl Gene ID ‘ENSG00000101266’ and its transcripts. Print the total number of exons in the gene and the number of exons in each individual transcript. Why do the found numbers disagree with each other? • Print for each transcript of the above gene the coding sequence and the protein sequence.

  26. External References • External References (Xrefs) are cross references of Ensembl Genes, Transcripts or Translations with identifiers from other databases, e.g. HGNC, WikiGenes, UniProtKB/Swiss-Prot, RefSeq, MIM etc. etc.

  27. Code Example # Obtain external references for Ensembl gene ENSG00000139618 my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000139618' ); my $gene_xrefs = $gene->get_all_DBEntries; print "Xrefs on the gene: \n\n"; while ( my $gene_xref = shift @{$gene_xrefs} ){ printf( "%s: %s\n”, $gene_xref->dbname, $gene_xref->display_id ); } my $all_xrefs = $gene->get_all_DBLinks; print "\nXrefs on the gene, transcript and protein: \n\n"; while ( my $all_xref = shift @{$all_xrefs} ){ printf( "%s: %s\n”, $all_xref->dbname, $all_xref->display_id ); }

  28. Code Example Output: Xrefs on the gene: HGNC: BRCA2 DBASS3: BRCA2 UCSC: uc001uub.1 HGNC_curated_gene: BRCA2 Xrefs on the gene, transcript and protein: shares_CDS_with_OTTT: OTTHUMT00000046000 AFFY_HC_G110: 1503_at AFFY_HG_U95A: 1503_at AFFY_HG_U95Av2: 1503_at AFFY_HC_G110: 1990_g_at AFFY_HG_U95A: 1990_g_at AFFY_HG_U95Av2: 1990_g_at AFFY_HuGeneFL: X95152_rna1_at AFFY_HG_Focus: 214727_at AFFY_HG_U133A: 214727_at AFFY_HG_U133A_2: 214727_at AFFY_HG_U133_Plus_2: 214727_at AFFY_HG_U133A: 208368_s_at AFFY_HG_U133A_2: 208368_s_at AFFY_HG_U133_Plus_2: 208368_s_at AFFY_U133_X3P: g4502450_3p_a_at AFFY_U133_X3P: 208368_3p_s_at AFFY_U133_X3P: Hs.34012.1.S1_3p_at AFFY_HC_G110: 1989_at AFFY_HG_U95A: 1989_at AFFY_HG_U95Av2: 1989_at RefSeq_dna: NM_000059 HGNC: BRCA2 UniGene: Hs.34012 AgilentCGH: A_14_P131744 AgilentCGH: A_14_P109686 AgilentProbe: A_23_P99452 Codelink: GE60169 Illumina_V1: GI_4502450-S Illumina_V2: ILMN_139227 HGNC_curated_transcript: BRCA2-001 CCDS: CCDS9344.1 EntrezGene: BRCA2 MIM_MORBID: 114480 MIM_MORBID: 155720 MIM_MORBID: 227650 MIM_GENE: 600185 MIM_MORBID: 600185 MIM_MORBID: 605724 RefSeq_peptide: NP_000050.2 Uniprot/SPTREMBL: A1YBP1_HUMAN EMBL: DQ897648 protein_id: ABI74674.1 Uniprot/SPTREMBL: B2ZAH0_HUMAN EMBL: EU625579 protein_id: ACD01217.1 EMBL: AL445212 Uniprot/SPTREMBL: Q5TBJ7_HUMAN EMBL: AL137247 protein_id: CAI13195.1 protein_id: CAI40479.1 Uniprot/SPTREMBL: Q8IU64_HUMAN EMBL: AY151039 protein_id: AAN28944.1 EMBL: AF489725 protein_id: AAN61409.1 EMBL: AF489726 protein_id: AAN61410.1 EMBL: AF489727 protein_id: AAN61411.1 EMBL: AF489728 protein_id: AAN61412.1 EMBL: AF489729 protein_id: AAN61413.1 EMBL: AF489730 protein_id: AAN61414.1 EMBL: AF489731 protein_id: AAN61415.1 EMBL: AF489732 protein_id: AAN61416.1 EMBL: AF489733 protein_id: AAN61417.1 EMBL: AF489734 protein_id: AAN61418.1 EMBL: AF489735 protein_id: AAN61419.1 EMBL: AF489736 protein_id: AAN61420.1

  29. Exercise 5 • Obtain the Ensembl gene(s) that correspond(s) to UniProtKB/Swiss-Prot entry BRCA2_HUMAN. Print its Ensembl Gene ID, name and description. • Obtain all external references for the above gene. Print their names and databases.

  30. Coordinate Mappings • The API provides the means to convert between any related coordinate systems in the database • The Feature methods transfer, transform and project and the Slice method project are used to map features between coordinate systems

  31. Transfer • Transfer moves a feature on a slice in a given coordinate system to another slice in the same or another coordinate system • Transfer needs the feature to be defined in the requested coordinate system, i.e. it cannot overlap an undefined region

  32. Transfer Chr 1 Chr 1 Chr 1 Chr Y

  33. Transform • Like transfer, but transform places the feature on a slice that spans the entire sequence that the feature is on in the requested coordinate system

  34. Transform

  35. Project • Project doesn’t move a feature, but it provides a definition of where a feature or slice lies in another coordinate system

  36. Project

  37. Code Example # Project gene ENSG00000155657 to the clone coordinate system my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000155657' ); my $projection = $gene->project( 'clone’ ); foreach my $segment ( @{$projection} ) { my $to_slice = $segment->to_Slice; printf( "%s %s-%s projects to %s %s:%s-%s(%s)\n", $gene->stable_id, $segment->from_start, $segment->from_end, $to_slice->coord_system_name, $to_slice->seq_region_name, $to_slice->start, $to_slice->end, $to_slice->strand ); }

  38. Code Example Output: ENSG00000155657 1-65908 projects to clone AC023270.7:1-65908(-1) ENSG00000155657 65909-241384 projects to clone AC010680.10:1-175476(-1) ENSG00000155657 241385-281434 projects to clone AC009948.3:132579-172628(-1)

  39. Exercise 6 • Obtain a gene located on clone ‘AL049761.11’ and print out its coordinate system and gene coordinates. Then transform the gene to ‘toplevel’ and again print out the coordinate system and gene coordinates.

  40. Other Ensembl Core APIs • Ruby (by Jan Aerts): http://bioruby-annex.rubyforge.org/ • Python (by Jenny Qing Qian): http://code.google.com/p/pygr/wiki/PygrOnEnsembl

  41. Acknowledgements The Ensembl Core Team Glenn Proctor Ian Longden Andreas Kahari Daniel Rios

More Related