The Ensembl Variation API

The Ensembl Variation API. Daniel Rios. March 2006. Variation data. Two human genomes differ by ~1%. Polymorphism: DNA variation where each version of the sequence is present in >1% of the population.


Presentation Transcript


  1. The Ensembl Variation API Daniel Rios March 2006

  2. Variation data
     • Two human genomes differ by ~1%
     • Polymorphism: DNA variation where each version of the sequence is present in >1% of the population
     • About 90% of polymorphisms are SNPs (Single Nucleotide Polymorphisms), variations that involve just one nucleotide
     • ~1 out of every 300 bases in the human genome
     • ~10 million SNPs in the human genome
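The last two figures on this slide agree with each other, as a quick calculation shows. This is a minimal Python sketch and not part of the Ensembl API; the genome length of about 3 billion bases is a round figure assumed here, not a number taken from the slides.

```python
# Rough consistency check of the SNP density quoted on the slide.
# ASSUMPTION: a haploid human genome of about 3 billion bases (round figure).
GENOME_LENGTH_BP = 3_000_000_000
BASES_PER_SNP = 300  # "~1 out of every 300 bases", from the slide

expected_snps = GENOME_LENGTH_BP // BASES_PER_SNP
print(f"Expected number of SNPs: about {expected_snps:,}")
```

Under that assumption the density of one SNP per 300 bases gives the same order of magnitude as the "~10 million SNPs" bullet.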

  3. Variation data in Ensembl
     Data imported from dbSNP:
     • SNPs, in-dels (Variations)
     • Locations for SNPs (Variation features)
     • Alleles
     • Populations
     • Genotypes
     Calculated data:
     • Consequence (synonymous, nonsense, etc.)
     • Linkage disequilibrium information
     • Tagged SNPs
     • Read coverage data

  4. Database overview (1)

  5. Database overview (2)

  6. The Ensembl Variation API
     [Figure: architecture diagram with three layers (Database Layer, API Layer, User Layer). Labels in the diagram: Variation, Compara, Core, Pipeline, EnsMart and BioDas databases; Variation, Compara, Core and Pipeline Perl APIs, martp (Perl), ensj (Java), martj (Java); DAS servers ProServer (Perl), Dazzle and LDAS; Perl applications, Ensembl Web Site, Ensembl Pipeline, Apollo and Java applications.]

  7. The Variation API
     • Used to retrieve data from the Ensembl variation database
     • Ensembl Perl API:
       • Written in object-oriented Perl
       • Foundation for the Ensembl Web interface
     • Ensembl Java API:
       • Written in Java, but similar in layout to the Perl API
       • Development lags behind the Perl API

  8. Object Adaptors
     • Object Adaptors are factories for Data Objects (e.g. the variation adaptor creates Variation objects).
     • Data Objects are retrieved from and stored in the database using Object Adaptors.
     • Each Object Adaptor is responsible for creating objects of only one particular type.

  9. The DBAdaptor
     • The Database Adaptor is a factory for Object Adaptors.
     • It is used to connect to the database and to obtain Object Adaptors.
     [Figure: the DBAdaptor sits on top of the variation database and hands out the VariationAdaptor, PopulationAdaptor, ..., which in turn create Variation, Population, ... objects]
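The two levels of factories described on slides 8 and 9 can be sketched in a few lines. This is a standalone Python illustration of the pattern, not the Ensembl code: `ToyDBAdaptor`, `ToyVariationAdaptor`, the `Variation` class and the in-memory dictionary that stands in for the database are all hypothetical names introduced here.

```python
# Illustrative sketch of the adaptor pattern; every name here is hypothetical.
class Variation:
    """Data object: a plain container for one variation."""

    def __init__(self, name, source):
        self.name = name
        self.source = source


class ToyVariationAdaptor:
    """Object adaptor: creates objects of exactly one type (Variation)."""

    def __init__(self, rows):
        self._rows = rows  # stands in for a database connection

    def fetch_by_name(self, name):
        source = self._rows.get(name)
        return Variation(name, source) if source is not None else None


class ToyDBAdaptor:
    """Database adaptor: holds the connection and hands out object adaptors."""

    def __init__(self, rows):
        self._rows = rows

    def get_variation_adaptor(self):
        return ToyVariationAdaptor(self._rows)


# In-memory "database" holding two made-up rows.
db = ToyDBAdaptor({"toy_snp_1": "dbSNP", "toy_snp_2": "Sanger"})
adaptor = db.get_variation_adaptor()
variation = adaptor.fetch_by_name("toy_snp_2")
missing = adaptor.fetch_by_name("toy_snp_3")
print(variation.name, variation.source)
```

The point of the design is that user code never builds data objects or SQL itself: it asks the database adaptor for an object adaptor, and the object adaptor for objects.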

  10. DBAdaptor - Code
      use strict;
      use warnings;
      use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
      # connect to the database:
      my $dbVar = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
          -host    => 'ensembldb.ensembl.org',
          -dbname  => 'homo_sapiens_variation_36_35i',
          -species => 'human',
          -group   => 'variation',
          -user    => 'anonymous',
      );
      # get a VariationAdaptor
      my $variation_adaptor = $dbVar->get_VariationAdaptor();
      # get a PopulationAdaptor
      my $population_adaptor = $dbVar->get_PopulationAdaptor();

  11. API overview (1) Variation
      • 2 modules to represent a Variation
      • API calls:
        • name, source, five/three_flanking_seq, get_all_validation_states, get_all_Alleles, get_all_Individual_Genotypes, get_all_Population_Genotypes
        • get_all_synonyms - other names by which the variation is known
        • ambig_code - ambiguity code for the Alleles
        • var_class - class of the variation, according to dbSNP

  12. Exercise 1
      • Give me the variation name, type and source for a list of mouse SNP IDs
      Variation rs3674319 is a snp and comes from dbSNP
      Variation rs3676100 is a snp and comes from dbSNP
      Variation rs6268424 is a snp and comes from dbSNP
      Variation rs6269420 is a snp and comes from dbSNP
      No SNP available for:rs618015
      No SNP available for:rs60185
      No SNP available for:rs68085
      Variation CE1 is a snp and comes from Sanger
      Variation CE5623 is a snp and comes from Sanger
      Variation CE3257 is a snp and comes from Sanger
      No SNP available for:rs18015

  13. API overview (2) Allele
      • 2 modules to represent allele information
      • API calls:
        • allele, sample, frequency
        • (e.g. allele A in a North African population with 80% frequency)

  14. Exercise 2
      • Same list as before: give me the allele frequencies and their populations
      SNP name : rs3674319 in population 129S1/SvImJ allele C with frequency 1
      SNP name : rs3674319 in population C57BL/6J allele A with frequency 1
      SNP name : rs3674319 in population WI:MOUSE allele A with frequency 0.5
      SNP name : rs3674319 in population WI:MOUSE allele C with frequency 0.5
      SNP name : rs3676100 in population 129S1/SvImJ allele C with frequency 1
      SNP name : rs3676100 in population C57BL/6J allele A with frequency 1
      SNP name : rs3676100 in population WI:MOUSE allele A with frequency 0.5
      SNP name : rs3676100 in population WI:MOUSE allele C with frequency 0.5
      SNP name : rs6268424 in population C57BL/6J allele G with frequency 1
      SNP name : rs6268424 in population CZECHII/Ei allele C with frequency 1
      SNP name : rs6268424 in population WI:MOUSE allele G with frequency 0.5
      SNP name : rs6268424 in population WI:MOUSE allele C with frequency 0.5

  15. API overview (3)
      [Figure: Population, Individual and Strain shown as specialisations of SAMPLE]
      • Similar concepts are merged into a more general one (Sample)
      • API calls:
        • name, size, description
      • Particular methods are implemented in the subclasses
      • API calls:
        • Individual: get_all_child_Individuals
        • Population: get_all_super_Populations
        • Strain: is_strain
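The idea of a general Sample with specialised subclasses can be sketched as a small class hierarchy. This is a hypothetical Python illustration, not the Ensembl Perl classes: the snake_case method names only mirror the calls listed on the slide, and the sample names and sizes are invented.

```python
# Hypothetical sketch of the Sample hierarchy; not the Ensembl classes.
class Sample:
    """Attributes shared by every kind of sample: name, size, description."""

    def __init__(self, name, size=None, description=""):
        self.name = name
        self.size = size
        self.description = description


class Population(Sample):
    """A population may belong to one or more larger populations."""

    def __init__(self, name, size=None, description="", super_populations=None):
        super().__init__(name, size, description)
        self._super_populations = list(super_populations or [])

    def get_all_super_populations(self):
        return list(self._super_populations)


class Individual(Sample):
    """An individual is a sample of size 1 that may have children."""

    def __init__(self, name, description="", children=None):
        super().__init__(name, size=1, description=description)
        self._children = list(children or [])

    def get_all_child_individuals(self):
        return list(self._children)


# Invented example data.
parent_pop = Population("toy_parent_population", size=90)
sub_pop = Population("toy_sub_population", size=30, super_populations=[parent_pop])
child = Individual("toy_child")
mother = Individual("toy_mother", children=[child])

# Code that only needs name/size/description can treat all samples alike.
samples = [parent_pop, sub_pop, mother, child]
for sample in samples:
    print(sample.name, sample.size)
```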

  16. Exercise 3
      • Return how many SNPs there are in the TSC-CSHL:CEL_asian human population:
      Number of SNPs in population TSC-CSHL:CEL_asian 5288

  17. API overview (4) Variation Feature
      • A VariationFeature is a Variation at a location
      • Some pre-calculated information in VariationFeature: allele_string, variation_name, get_consequence_type, source
      • API calls:
        • coordinates (region, start, end, strand)
        • get_all_TranscriptVariations - consequences of the Variation
        • map_weight - number of times the Variation maps to the genome
        • is_tagged - populations where the Variation is tagged

  18. Exercise 4
      • Return a list of SNPs (snp_name, alleles, chromosome, position) for zebrafish in chromosome 25:
      Variation rs3728028 with alleles G/A in chromosome 25 and position 4153624-4153624
      Variation rs3728027 with alleles T/A in chromosome 25 and position 4153720-4153720
      Variation rs3729090 with alleles A/T in chromosome 25 and position 27556388-27556388

  19. API overview (5) TranscriptVariation
      • Precalculated effect of a Variation in a transcript
      [Figure: a transcript with five exons (exon1 to exon5), a 5' UTR and a 3' UTR]
      ref seq:     ATCG …. ATGTG…. CTCAG….CGTAA …. TGCA
      transcript:  ATCG ATG TGC TCA GCG TAA TGCA
      translation: M C S A *
      Variation C/G in the coding sequence: STOP_GAIN (S/*)
      Variation T/A: 5' UTR
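The STOP_GAIN example on this slide can be reproduced by translating the coding sequence before and after the C/G change. This is a standalone Python sketch using the standard genetic code, not the way Ensembl precalculates consequences; `translate` and `peptide_change` are hypothetical helper names, and the position of the variation within the coding sequence (base 8, the C of codon TCA) is read off the slide's example.

```python
# Standard genetic code, built from the usual TCAG-ordered compact table.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def peptide_change(cds, cds_position, alt_base):
    """Return 'X' if the amino acid is unchanged, else 'ref/alt'.

    cds_position is 1-based within the coding sequence.
    """
    mutated = cds[:cds_position - 1] + alt_base + cds[cds_position:]
    codon_index = (cds_position - 1) // 3
    ref_aa = translate(cds)[codon_index]
    alt_aa = translate(mutated)[codon_index]
    return ref_aa if ref_aa == alt_aa else f"{ref_aa}/{alt_aa}"


# Coding part of the transcript on the slide: ATG TGC TCA GCG TAA
cds = "ATGTGCTCAGCGTAA"
protein = translate(cds)
stop_gain = peptide_change(cds, 8, "G")   # TCA -> TGA
synonymous = peptide_change(cds, 9, "G")  # TCA -> TCG
print(protein, stop_gain, synonymous)
```

The `ref/alt` string has the same shape as the `S/*` label in the figure; a synonymous change returns a single letter.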

  20. Functional consequences
      Columns: Type; Change; Consequence
      • Type: Non-synonymous or nonsense SNPs in coding areas; Change: Alters the function and/or structure of the encoded protein; Consequence: Cause of most monogenic disorders: Hemochromatosis (HFE), Cystic fibrosis (CFTR), Hemophilia (F8)
      • Type: Synonymous SNPs in coding areas; Change: No change in amino acid sequence of the protein; Consequence: May alter splicing
      • Type: Non-coding, promoter or regulatory regions; Consequence: May affect the level, location or timing of gene expression
      • Type: Non-coding; Change: No direct known impact on the phenotype; Consequence: Useful as markers

  21. API overview (6) TranscriptVariation (cont.)
      • API calls:
        • transcript, variation_feature, consequence_type
        • cdna_start, cdna_end
        • translation_start, translation_end, pep_allele_string

  22. Exercise 5
      • Create a function to print CODING SNPs for a certain region
      &check_SNPs(13,qw[50000000 51000000]); # call the function
      • The rsID rs6021695 in position 50134561-50134561 in your region is a NON_SYNONYMOUS_CODING SNP
      • The rsID rs6096740 in position 50138363-50138363 in your region is a SYNONYMOUS_CODING SNP
      • The rsID rs7265957 in position 50138528-50138528 in your region is a SYNONYMOUS_CODING SNP
      • The rsID rs17845354 in position 50202786-50202786 in your region is a SYNONYMOUS_CODING SNP
      • The rsID rs17858201 in position 50202786-50202786 in your region is a SYNONYMOUS_CODING SNP

  23. Exercise 6
      • For a list of human rsIDs, return a list with the following information:
      SNPid Allele SNPConsequence RefSeq SNP_position_in_RefSeq AAChange GeneName
      • rs106,A/G,INTERGENIC,ATAGAGTAGC[A/G]AGATATTTGG,7:24227273-24227273
      • rs204962,T/A,INTERGENIC,TAGTGCTATA[T/A]ATAGTATTAC,22:33188508-33188508
      • rs204968,G/A,INTERGENIC,ATATTTCTTG[G/A]TTTATCTATT,22:33193249-33193249
      • rs204969,T/A,INTERGENIC,TTTTAATAAA[T/A]GCTGACATCT,22:33194654-33194654
      • rs1367827,A/G,INTERGENIC,TTGTGAGAGC[A/G]TGGCTGGAGA,8:124924285-124924285
      • rs1367830,A/G,SYNONYMOUS_CODING,CAAAGTTGAA[A/G]GGCCGTGTTA,X:57358153-57358153,P,ENSG00000165591
      • rs2853516,G/A,NON_SYNONYMOUS_CODING,CATACCCATG[G/A]CCAACCTCAT,MT:3317-3317,A/T,ENSG00000198888,ENSG00000198763,ENSG00000198804,ENSG00000198712
      • rs2854133,A/G,NON_SYNONYMOUS_CODING,TCTACTATGA[A/G]CCCCCCTCCC,MT:3566-3566,T/A,ENSG00000198888,ENSG00000198763,ENSG00000198804,ENSG00000198712,ENSG00000198744,ENSG00000198899
      • rs2854134,C/T,SYNONYMOUS_CODING,ACCCCCTGGT[C/T]AACCTCAACC,MT:3595-3595,V,ENSG00000198888,ENSG00000198763,ENSG00000198804,ENSG00000198712,ENSG00000198744,ENSG00000198899

  24. LD calculation
      • LD values for r2 and D' were calculated by pairwise estimation between SNPs genotyped in the same individuals, within a 100 kb window. An established method was used to obtain the maximum-likelihood estimate of the proportion that each possible haplotype contributed to the double heterozygotes.
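Once haplotype frequencies for a pair of SNPs are available, r2 and D' follow from the textbook definitions. The Python sketch below starts from given haplotype frequencies and therefore skips the maximum-likelihood step for double heterozygotes that the slide mentions; the function name `ld_stats` and the example frequencies are invented for illustration and are not taken from Ensembl.

```python
def ld_stats(hap_freqs):
    """Return (d_prime, r2) for two biallelic SNPs.

    hap_freqs maps the four haplotypes 'AB', 'Ab', 'aB', 'ab' to their
    frequencies (A/a are the alleles of SNP 1, B/b those of SNP 2).
    """
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of allele A
    p_b = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of allele B
    d = hap_freqs["AB"] - p_a * p_b          # raw disequilibrium D

    # D' rescales |D| by the largest value it could take for these allele
    # frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    # r2 is the squared correlation between the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d_prime, r2


# Invented example frequencies.
partial_d_prime, partial_r2 = ld_stats({"AB": 0.4, "Ab": 0.1, "aB": 0.1, "ab": 0.4})
full_d_prime, full_r2 = ld_stats({"AB": 0.5, "Ab": 0.0, "aB": 0.0, "ab": 0.5})
print(round(partial_d_prime, 3), round(partial_r2, 3))
print(round(full_d_prime, 3), round(full_r2, 3))
```

With only genotype data, the haplotype frequencies are not directly observable for individuals heterozygous at both SNPs, which is why an estimation step is needed before these formulas can be applied.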

  25. API overview (7) LDFeatureContainer
      • API calls:
        • get_all_r_square_values, get_all_d_prime_values, get_all_ld_values
        • get_all_populations
        • get_variations

  26. Exercise 7
      • For the human region 6:25_834_000-25_854_000, return SNPs with high LD (r2 > 0.8) in CSHL-HAPMAP:HapMap-CEU and print whether those variations are tagged or not in this population
      • High LD between variations rs9393665-rs6941933
        • Variation rs9393665 is not tagged
        • Variation rs6941933 is not tagged
      • High LD between variations rs9393662-rs9393665
        • Variation rs9393662 is not tagged
        • Variation rs9393665 is not tagged
      • High LD between variations rs9393662-rs6941933
        • Variation rs9393662 is not tagged
        • Variation rs6941933 is not tagged
      • High LD between variations rs4464787-rs9393662
        • Tagged variation: rs4464787
        • Variation rs9393662 is not tagged

  27. Introduction

  28. $1000 genome project
      • Whole human genome sequenced in 2003
      • Next challenge

  29. BARGEN project
      • Project involving Solexa Ltd, ICL and Ensembl
      • Solexa Ltd generates the sequencing data
      • ICL provides statistical tools
      • Ensembl is responsible for storing, managing and visualising the data

  30. Virtual Strain Sequence
      • Apply strain variations to the reference sequence
      ref seq    : G C _ C C G A G T T T A
      Variation  : insertion of A between positions 2 and 3
      Variation  : …….
      strain seq : G C A C C G A G T T T A
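Building a strain sequence from a reference plus a list of differences can be sketched as follows. This is a standalone Python illustration, not the StrainSlice implementation; the tuple format `(kind, position, allele)` and the function name `apply_variations` are assumptions made for the example. Variations are applied from right to left so that an insertion does not shift the coordinates of variations still to be applied.

```python
def apply_variations(reference, variations):
    """Return the strain sequence for an ungapped reference sequence.

    variations is a list of (kind, position, allele) tuples with 1-based
    reference positions:
      ("snp", pos, base)       replaces the base at pos
      ("insertion", pos, seq)  inserts seq between pos and pos + 1
    """
    seq = list(reference)
    # Right-to-left order keeps the remaining reference coordinates valid.
    for kind, pos, allele in sorted(variations, key=lambda v: v[1], reverse=True):
        if kind == "snp":
            seq[pos - 1] = allele
        elif kind == "insertion":
            seq[pos:pos] = list(allele)
        else:
            raise ValueError(f"unknown variation kind: {kind}")
    return "".join(seq)


# Reference from the slide with the alignment gap removed.
reference = "GCCCGAGTTTA"

# The slide's example: insertion of A between positions 2 and 3.
strain = apply_variations(reference, [("insertion", 2, "A")])

# Same insertion combined with a T -> A substitution at position 8.
strain_with_snp = apply_variations(
    reference, [("insertion", 2, "A"), ("snp", 8, "A")]
)
print(strain, strain_with_snp)
```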

  31. Strain variations
      • View slice/strain sequence differences
      reference seq : G C _ C C G A G T T T A
      strain1 seq   : G C A C C G A G A T T A
      strain2 seq   : G C _ C C C A G A T T A
      • View strain transcript differences
      transcript reference : C G A G T T   translation: R V
      transcript strain2   : C C A G A T   translation: P D

  32. Individual variations (1)
      • Apply individual variations to the reference sequence
      ref seq : G C C C G A G T T T A
      G/C Variation in position 5
      T/A Variation in position 8
      John D: G C C C C A G A T T A
      John D: G C C C G A G T T T A
      John D: G C C C C A G T T T A
      John D: G C C C G A G A T T A
      • Conflict with heterozygosity of individuals !!!

  33. Individual variations (2)
      • Solution: use of ambiguity codes
      ref seq : G C C C G A G T T T A
      G/C Variation in position 5
      T/A Variation in position 8
      John D: G C C C S A G W T T A
      • Problems creating peptide sequences
      John D transcript: C C C S A G W T T
      translation: P X X
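The ambiguity-code solution can be illustrated with the standard IUPAC two-base codes. This is a standalone Python sketch, not the IndividualSlice code; `ambiguity_code` and `individual_sequence` are hypothetical helper names, and the genotype dictionary format is an assumption made for the example.

```python
# IUPAC codes for the six possible two-base combinations.
TWO_BASE_CODES = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def ambiguity_code(allele1, allele2):
    """Return the base itself for a homozygote, else the IUPAC code."""
    if allele1 == allele2:
        return allele1
    return TWO_BASE_CODES[frozenset((allele1, allele2))]


def individual_sequence(reference, genotypes):
    """Apply diploid genotypes to a reference sequence.

    genotypes maps 1-based positions to a pair of alleles.
    """
    seq = list(reference)
    for pos, (allele1, allele2) in genotypes.items():
        seq[pos - 1] = ambiguity_code(allele1, allele2)
    return "".join(seq)


# The slide's example: heterozygous G/C at position 5 and T/A at position 8.
john = individual_sequence("GCCCGAGTTTA", {5: ("G", "C"), 8: ("T", "A")})
print(john)
```

A codon that contains an ambiguity code such as S or W no longer maps to a single amino acid in general, which is the peptide problem the slide points out (translation P X X).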

  34. API overview (8)
      • StrainSlice
        • Idea: slice from the reference sequence plus differences
        • Basic behaviour similar to Slice
        • API calls:
          • get_all_differences_Slice
          • get_all_differences_StrainSlice
      • IndividualSlice
        • Idea: similar to StrainSlice
        • Basic behaviour similar to Slice
        • API calls:
          • get_all_differences_Slice
          • get_all_differences_IndividualSlice

  35. API overview (9) Allele Feature
      • An AlleleFeature is the allele of a sample at a location (e.g. strain DBA/2J has an A in position 3:214_000_231)
      • Information calculated "on the fly": variation_name, source, allele_string
      • API calls:
        • coordinates (region, start, end, strand)
        • individual, population - sample where the allele is present
        • variation - Variation object

  36. Exercise 8
      • Get differences between strain DBA/2J and the reference genome in exon ENSMUSE00000581647
      • Reference sequence: CTGCGTAG...GAAAT...AG...
      • Strain sequence: CTGCTTAG...TAAAT...AT...
      • AlleleFeature start-end-allele_string: 112-112-T
      • AlleleFeature start-end-allele_string: 61-61-T
      • AlleleFeature start-end-allele_string: 30-30-T

  37. Read coverage
      • Read coverage calculation
      [Figure: six reads (read1 to read6) aligned to positions 1 to 11 of a region; coverage level 1 between positions 1 and 11, coverage level 6 between positions 3 and 5]

  38. API overview (10) ReadCoverage
      • A ReadCoverage is a region in a sample covered by at least n reads (e.g. DBA/2J has at least 1 read in region 3:214000-215000)
      • API calls:
        • coordinates (region, start, end, strand)
        • sample - sample where the read coverage is present
        • level - minimum number of reads covering the region
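Regions covered by at least n reads can be found with a simple sweep over read start and end positions. This is a standalone Python sketch of that calculation, not the code Ensembl uses; the function name `coverage_regions` is hypothetical, and the six read coordinates are an assumption chosen so that the result matches the summary on the read coverage slide (level 1 between 1 and 11, level 6 between 3 and 5).

```python
from itertools import groupby


def coverage_regions(reads, level):
    """Return inclusive (start, end) regions covered by at least `level` reads.

    reads is a list of inclusive (start, end) coordinates.
    """
    # Each read adds 1 to the depth at its start and removes 1 just past its end.
    events = []
    for start, end in reads:
        events.append((start, 1))
        events.append((end + 1, -1))
    events.sort()

    regions = []
    depth = 0
    region_start = None
    # Handle all events at the same position together before testing the depth.
    for pos, group in groupby(events, key=lambda event: event[0]):
        depth += sum(delta for _, delta in group)
        if depth >= level and region_start is None:
            region_start = pos
        elif depth < level and region_start is not None:
            regions.append((region_start, pos - 1))
            region_start = None
    return regions


# ASSUMPTION: hypothetical read coordinates, chosen to agree with the figure.
reads = [(1, 10), (2, 11), (3, 11), (1, 10), (1, 5), (2, 8)]
level_1 = coverage_regions(reads, 1)
level_6 = coverage_regions(reads, 6)
print(level_1, level_6)
```

Each returned region corresponds to one ReadCoverage-style record: a sample, a coordinate range and the minimum level that holds throughout it.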

  39. Exercise 9
      • For region 1:3012359-3017051 in strain 129X1/SvJ, get me the different regions covered
      • Level 1 has the following regions covered:
        • 3012359-3017051
      • Level 2 has the following regions covered:
        • 3012906-3013686
        • 3013977-3015178
        • 3015357-3015562
        • 3015762-3017031

  40. TSV

  41. Getting More Information
      • Database schema: PDF of the different tables in the variation database: ~/ensembl-variation/schema/database-schema.pdf
      • perldoc: viewer for the inline Variation API documentation. Also online at: http://www.ensembl.org/info/software/Pdoc/ensembl-variation/index.html
      • Perl API Tutorial document: http://www.ensembl.org/info/software/core/core_tutorial.html
      • ensembl-dev mailing list: ensembl-dev@ebi.ac.uk

  42. Acknowledgements
      Ensembl Variation Database Team:
      • Arne Stabenau
      • Yuan Chen
      • Graham McVicker
      The Rest of the Ensembl Team.
      Presentation adapted from an original by Graham McVicker (EBI).
