The Ensembl Variation API Daniel Rios March 2006
Variation data
• Two human genomes differ by ~1%
• Polymorphism: DNA variation where each version of the sequence is present in >1% of the population
• About 90% of polymorphisms are SNPs (Single Nucleotide Polymorphisms): variations that involve just one nucleotide
• ~1 out of every 300 bases in the human genome
• ~10 million SNPs in the human genome
Variation data in Ensembl
Data imported from dbSNP:
• SNPs, in-dels (Variations)
• Locations for SNPs (Variation features)
• Alleles
• Populations
• Genotypes
Calculated data:
• Consequence (synonymous, nonsense etc.)
• Linkage disequilibrium information
• Tagged SNPs
• Read coverage data
Database overview (1)
Database overview (2)
The Ensembl Variation API
[Diagram: three layers (Database Layer, API Layer, User Layer). Labels shown: Variation, Compara, Core, Pipeline and EnsMart databases; Perl APIs (Variation, Compara, Core, Ensembl Pipeline, martp); Java APIs (ensj, martj); BioDas (ProServer, Dazzle, LDAS); Perl Applications, Java Applications, Ensembl Web Site, Apollo.]
The Variation API
• Used to retrieve data from the Ensembl variation database
• Ensembl Perl API:
  • Written in object-oriented Perl
  • Foundation for the Ensembl Web interface
• Ensembl Java API:
  • Written in Java, but similar in layout to the Perl API
  • Development lags behind the Perl API
Object Adaptors
• Object Adaptors are factories for Data Objects (e.g. the variation adaptor creates variation objects).
• Data Objects are retrieved from and stored in the database using Object Adaptors.
• Each Object Adaptor is responsible for creating objects of only one particular type.
The DBAdaptor
• The Database Adaptor is a factory for Object Adaptors.
• It is used to connect to the database and to obtain Object Adaptors.
[Diagram: VariationDB → DBAdaptor → VariationAdaptor, PopulationAdaptor, ... → Variation, Population, …]
DBAdaptor - Code
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;

# connect to the database:
my $dbVar = Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
    -host    => 'ensembldb.ensembl.org',
    -dbname  => 'homo_sapiens_variation_36_35i',
    -species => 'human',
    -group   => 'variation',
    -user    => 'anonymous');

# get a VariationAdaptor
my $variation_adaptor = $dbVar->get_VariationAdaptor();

# get a PopulationAdaptor
my $population_adaptor = $dbVar->get_PopulationAdaptor();
API overview (1) Variation
• 2 modules to represent a Variation
• API calls:
  • name, source, five/three_flanking_seq, get_all_validation_states, get_all_Alleles, get_all_Individual_Genotypes, get_all_Population_Genotypes
  • get_all_synonyms - other names the variation has received
  • ambig_code - ambiguity code for the Alleles
  • var_class - class for the variation, according to dbSNP
Exercise 1
• Give me variation name, type and source for a list of mouse SNP IDs
Variation rs3674319 is a snp and comes from dbSNP
Variation rs3676100 is a snp and comes from dbSNP
Variation rs6268424 is a snp and comes from dbSNP
Variation rs6269420 is a snp and comes from dbSNP
No SNP available for:rs618015
No SNP available for:rs60185
No SNP available for:rs68085
Variation CE1 is a snp and comes from Sanger
Variation CE5623 is a snp and comes from Sanger
Variation CE3257 is a snp and comes from Sanger
No SNP available for:rs18015
API overview (2) Allele
• 2 modules to represent allele information
• API calls:
  • allele, sample, frequency
  • (e.g. allele A in North African with 80% frequency)
Exercise 2
• Same list as before, give me allele frequencies and their populations
SNP name : rs3674319 in population 129S1/SvImJ allele C with frequency 1
SNP name : rs3674319 in population C57BL/6J allele A with frequency 1
SNP name : rs3674319 in population WI:MOUSE allele A with frequency 0.5
SNP name : rs3674319 in population WI:MOUSE allele C with frequency 0.5
SNP name : rs3676100 in population 129S1/SvImJ allele C with frequency 1
SNP name : rs3676100 in population C57BL/6J allele A with frequency 1
SNP name : rs3676100 in population WI:MOUSE allele A with frequency 0.5
SNP name : rs3676100 in population WI:MOUSE allele C with frequency 0.5
SNP name : rs6268424 in population C57BL/6J allele G with frequency 1
SNP name : rs6268424 in population CZECHII/Ei allele C with frequency 1
SNP name : rs6268424 in population WI:MOUSE allele G with frequency 0.5
SNP name : rs6268424 in population WI:MOUSE allele C with frequency 0.5
API overview (3) Sample
• Sample merges similar concepts (Population, Individual, Strain) into one more general idea
  • API calls: name, size, description
• Particular methods are implemented in the subclasses
  • API calls:
    • Individual: get_all_child_Individuals
    • Population: get_all_super_Populations
    • Strain: is_strain
Exercise 3
• Return the number of SNPs in the human population TSC-CSHL:CEL_asian:
Number of SNPs in population TSC-CSHL:CEL_asian 5288
API overview (4) VariationFeature
• VariationFeature is the Variation at a location
• Some pre-calculated information in VariationFeature: allele_string, variation_name, get_consequence_type, source
• API calls:
  • coordinates (region, start, end, strand)
  • get_all_TranscriptVariations - consequences of the Variation
  • map_weight - number of times the Variation maps to the genome
  • is_tagged - populations where the Variation is tagged
Exercise 4
• Return list of SNPs (snp_name, alleles, chromosome, position) for zebrafish in chromosome 25:
Variation rs3728028 with alleles G/A in chromosome 25 and position 4153624-4153624
Variation rs3728027 with alleles T/A in chromosome 25 and position 4153720-4153720
Variation rs3729090 with alleles A/T in chromosome 25 and position 27556388-27556388
API overview (5) TranscriptVariation
• Precalculated effect of a Variation in a transcript
ref seq: ATCG …. ATGTG…. CTCAG….CGTAA …. TGCA
transcript: ATCG (5' UTR) | ATG TGC TCA GCG TAA | TGCA (3' UTR)
translation: M C S A *
[Figure: transcript with exon1 to exon5 and two example variations: Variation C/G in the coding sequence (STOP_GAIN, S/*) and Variation T/A in the 5' UTR]
Functional consequences
Type | Change | Consequence
Non-synonymous or nonsense SNPs in coding areas | Alters the function and/or structure of the encoded protein | Cause of most monogenic disorders: Hemochromatosis (HFE), Cystic fibrosis (CFTR), Hemophilia (F8)
Synonymous SNPs in coding areas | No change in amino acid sequence of the protein | May alter splicing
Non-coding | Promoter or regulatory regions | May affect the level, location or timing of gene expression
Non-coding | No direct known impact on the phenotype | Useful as markers
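The distinction between synonymous, non-synonymous and nonsense changes can be sketched in plain Perl. This is only an illustration of the idea, not how the Ensembl pipeline precalculates TranscriptVariation data: `coding_consequence` is a hypothetical helper, it handles single-base substitutions only, and the `STOP_GAIN` label simply follows the wording used on the TranscriptVariation slide.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in T, C, A, G order.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon_table{"$b1$b2$b3"} = substr($aas, $i++, 1);
        }
    }
}

# Hypothetical helper (not part of the Ensembl API): classify a single-base
# substitution. $cds is the coding sequence, $pos the 1-based position in it,
# $alt the variant base. Returns (consequence, peptide allele string).
sub coding_consequence {
    my ($cds, $pos, $alt) = @_;
    my $codon_start = int(($pos - 1) / 3) * 3;      # 0-based start of the codon
    my $ref_codon   = substr($cds, $codon_start, 3);
    my $alt_codon   = $ref_codon;
    substr($alt_codon, ($pos - 1) % 3, 1) = $alt;   # swap in the variant base
    my ($ref_aa, $alt_aa) = @codon_table{$ref_codon, $alt_codon};
    return ('SYNONYMOUS_CODING', $ref_aa) if $ref_aa eq $alt_aa;
    my $type = $alt_aa eq '*' ? 'STOP_GAIN' : 'NON_SYNONYMOUS_CODING';
    return ($type, "$ref_aa/$alt_aa");
}

# Coding part of the example transcript: ATG TGC TCA GCG TAA (M C S A *).
# The C/G variation changes TCA (S) to TGA (stop).
my ($type, $pep) = coding_consequence('ATGTGCTCAGCGTAA', 8, 'G');
print "$type $pep\n";
```

With the example transcript from the slides, the C/G variation in the third codon is reported as a stop gain with peptide change S/*.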
API overview (6) TranscriptVariation (cont.)
• API calls:
  • transcript, variation_feature, consequence_type
  • cdna_start, cdna_end
  • translation_start, translation_end, pep_allele_string
Exercise 5
• Create a function to print CODING SNPs for a certain region
&check_SNPs(13,qw[50000000 51000000]); # call the function
• The rsID rs6021695 in position 50134561-50134561 in your region is a NON_SYNONYMOUS_CODING SNP
• The rsID rs6096740 in position 50138363-50138363 in your region is a SYNONYMOUS_CODING SNP
• The rsID rs7265957 in position 50138528-50138528 in your region is a SYNONYMOUS_CODING SNP
• The rsID rs17845354 in position 50202786-50202786 in your region is a SYNONYMOUS_CODING SNP
• The rsID rs17858201 in position 50202786-50202786 in your region is a SYNONYMOUS_CODING SNP
Exercise 6
• For a list of human rsIDs, return a list with information:
SNPid Allele SNPConsequence RefSeq SNP_position_in_RefSeq AAChange GeneName
• rs106,A/G,INTERGENIC,ATAGAGTAGC[A/G]AGATATTTGG,7:24227273-24227273
• rs204962,T/A,INTERGENIC,TAGTGCTATA[T/A]ATAGTATTAC,22:33188508-33188508
• rs204968,G/A,INTERGENIC,ATATTTCTTG[G/A]TTTATCTATT,22:33193249-33193249
• rs204969,T/A,INTERGENIC,TTTTAATAAA[T/A]GCTGACATCT,22:33194654-33194654
• rs1367827,A/G,INTERGENIC,TTGTGAGAGC[A/G]TGGCTGGAGA,8:124924285-124924285
• rs1367830,A/G,SYNONYMOUS_CODING,CAAAGTTGAA[A/G]GGCCGTGTTA,X:57358153-57358153,P,ENSG00000165591
• rs2853516,G/A,NON_SYNONYMOUS_CODING,CATACCCATG[G/A]CCAACCTCAT,MT:3317-3317,A/T,ENSG00000198888,ENSG00000198763,ENSG00000198804,ENSG00000198712
• rs2854133,A/G,NON_SYNONYMOUS_CODING,TCTACTATGA[A/G]CCCCCCTCCC,MT:3566-3566,T/A,ENSG00000198888,ENSG00000198763,ENSG00000198804,ENSG00000198712,ENSG00000198744,ENSG00000198899
• rs2854134,C/T,SYNONYMOUS_CODING,ACCCCCTGGT[C/T]AACCTCAACC,MT:3595-3595,V,ENSG00000198888,ENSG00000198763,ENSG00000198804,ENSG00000198712,ENSG00000198744,ENSG00000198899
LD calculation
• LD values for r2 and D' were calculated by pairwise estimation between SNPs genotyped in the same individuals, within a 100 kb window. An established method was used to obtain the maximum-likelihood estimate of the proportion that each possible haplotype contributed to the double heterozygotes.
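Once haplotype frequencies are available, r2 and D' follow from standard formulas. The sketch below assumes the four haplotype frequencies are already known; it does not implement the maximum-likelihood estimation step mentioned above, and `ld_stats` is a hypothetical helper, not an Ensembl API call.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: pairwise LD for two biallelic SNPs (alleles A/a and B/b).
# Input: frequencies of haplotypes AB, Ab, aB, ab (assumed to sum to 1).
# Returns (r2, |D'|), or an empty list if either SNP is monomorphic.
sub ld_stats {
    my ($f_AB, $f_Ab, $f_aB, $f_ab) = @_;
    my $p_A = $f_AB + $f_Ab;                  # frequency of allele A
    my $p_B = $f_AB + $f_aB;                  # frequency of allele B
    my $D   = $f_AB - $p_A * $p_B;            # disequilibrium coefficient
    my $denom = $p_A * (1 - $p_A) * $p_B * (1 - $p_B);
    return if $denom == 0;
    my $r2 = $D ** 2 / $denom;
    # D' scales |D| by its maximum possible value given the allele frequencies.
    my $D_max = $D >= 0
        ? min($p_A * (1 - $p_B), (1 - $p_A) * $p_B)
        : min($p_A * $p_B, (1 - $p_A) * (1 - $p_B));
    my $D_prime = abs($D) / $D_max;
    return ($r2, $D_prime);
}

# Two SNPs whose alleles always travel together: complete LD.
my ($r2, $d_prime) = ld_stats(0.5, 0, 0, 0.5);
printf "r2 = %.2f, D' = %.2f\n", $r2, $d_prime;
```

A pair is usually called "high LD" by thresholding r2, as in the exercise further on that asks for r2 > 0.8.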
API overview (7) LDFeatureContainer
• API calls:
  • get_all_r_square_values, get_all_d_prime_values, get_all_ld_values
  • get_all_populations
  • get_variations
Exercise 7
• For the human region 6:25_834_000-25_854_000, return SNPs with high LD (r2 > 0.8) in CSHL-HAPMAP:HapMap-CEU and print whether those variations are tagged in this population
• High LD between variations rs9393665-rs6941933
  • Variation rs9393665 is not tagged
  • Variation rs6941933 is not tagged
• High LD between variations rs9393662-rs9393665
  • Variation rs9393662 is not tagged
  • Variation rs9393665 is not tagged
• High LD between variations rs9393662-rs6941933
  • Variation rs9393662 is not tagged
  • Variation rs6941933 is not tagged
• High LD between variations rs4464787-rs9393662
  • Tagged variation: rs4464787
  • Variation rs9393662 is not tagged
Introduction
$1000 genome project
• Whole human genome sequenced in 2003
• Next challenge
BARGEN project
• Project involving Solexa Ltd, ICL and Ensembl
• Solexa Ltd generate sequencing data
• ICL provide statistical tools
• Ensembl responsible for storing, managing and visualising
Virtual Strain Sequence
• Apply strain variations to reference sequence
ref seq : G C _ C C G A G T T T A
Variation : insertion A between 2 and 3
Variation : …….
strain seq : G C A C C G A G T T T A
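The idea of a virtual strain sequence can be illustrated with a few lines of plain Perl. This is a minimal sketch, not the StrainSlice implementation: `apply_variations` is a hypothetical helper, differences are assumed not to overlap, and an insertion between positions n and n+1 is written with start = n+1 and end = n.

```perl
use strict;
use warnings;

# Hypothetical helper: build a strain sequence from a reference sequence and a
# list of differences, each { start, end, allele } in 1-based reference
# coordinates. Assumes the differences do not overlap.
sub apply_variations {
    my ($ref, @vars) = @_;
    my $seq = $ref;
    # Work from right to left so that indels do not shift pending coordinates.
    for my $v (sort { $b->{start} <=> $a->{start} } @vars) {
        my $len = $v->{end} - $v->{start} + 1;     # 0 for an insertion
        substr($seq, $v->{start} - 1, $len) = $v->{allele};
    }
    return $seq;
}

my $ref = 'GCCCGAGTTTA';                           # reference without the gap column

# The single insertion shown above: A between positions 2 and 3.
my $strain = apply_variations($ref, { start => 3, end => 2, allele => 'A' });
print "$strain\n";

# A strain with a G->C change at position 5 and a T->A change at position 8.
my $strain2 = apply_variations($ref,
    { start => 5, end => 5, allele => 'C' },
    { start => 8, end => 8, allele => 'A' },
);
print "$strain2\n";
```

Applying differences from the highest coordinate downwards avoids having to track how earlier insertions or deletions shift later positions.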
Strain variations
• View slice/strain sequence differences
reference seq : G C _ C C G A G T T T A
strain1 seq : G C A C C G A G A T T A
strain2 seq : G C _ C C C A G A T T A
• View strain transcript differences
transcript reference : C G A G T T (translate: R V)
transcript strain2 : C C A G A T (translate: P D)
Individual variations (1)
• Apply individual variations to reference sequence
ref seq : G C C C G A G T T T A
G/C Variation in position 5
T/A Variation in position 8
John D: G C C C C A G A T T A
John D: G C C C G A G T T T A
John D: G C C C C A G T T T A
John D: G C C C G A G A T T A
• Conflict with heterozygosity of individuals!
Individual variations (2)
• Solution: use of ambiguity codes
ref seq : G C C C G A G T T T A
G/C Variation in position 5
T/A Variation in position 8
John D: G C C C S A G W T T A
• Problems creating peptide sequences
John D transcript: C C C S A G W T T
translate: P X X
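The ambiguity-code approach can be sketched with the standard IUPAC nucleotide codes. The helpers `ambiguity_code` and `apply_heterozygous` below are hypothetical names used for illustration only; they are not the API's own `ambig_code` implementation, and only single-base alleles are handled.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases covered.
my %iupac = (
    A   => 'A', C   => 'C', G   => 'G', T   => 'T',
    AC  => 'M', AG  => 'R', AT  => 'W', CG  => 'S', CT  => 'Y', GT => 'K',
    ACG => 'V', ACT => 'H', AGT => 'D', CGT => 'B', ACGT => 'N',
);

# Hypothetical helper: ambiguity code for an allele string such as 'G/C'.
# Returns undef for indels or alleles outside A, C, G, T.
sub ambiguity_code {
    my ($allele_string) = @_;
    my %seen = map { (uc($_) => 1) } split m{/}, $allele_string;
    return $iupac{ join '', sort keys %seen };
}

# Hypothetical helper: write ambiguity codes into a reference sequence.
# %alleles_at maps a 1-based position to the allele string observed there.
sub apply_heterozygous {
    my ($ref, %alleles_at) = @_;
    my $seq = $ref;
    for my $pos (keys %alleles_at) {
        my $code = ambiguity_code($alleles_at{$pos});
        substr($seq, $pos - 1, 1) = $code if defined $code;
    }
    return $seq;
}

# G/C at position 5 and T/A at position 8, as in the example above.
print apply_heterozygous('GCCCGAGTTTA', 5 => 'G/C', 8 => 'T/A'), "\n";
```

Because S and W each stand for two possible bases, any codon containing them cannot be translated unambiguously, which is the peptide problem noted above.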
API overview (8)
• StrainSlice
  • Idea: slice from reference sequence plus differences
  • Basic behaviour similar to Slice
  • API calls:
    • get_all_differences_Slice
    • get_all_differences_StrainSlice
• IndividualSlice
  • Idea: similar to the StrainSlice
  • Basic behaviour similar to Slice
  • API calls:
    • get_all_differences_Slice
    • get_all_differences_IndividualSlice
API overview (9) AlleleFeature
• AlleleFeature is the allele of a sample at a location (e.g. strain DBA/2J has an A in position 3:214_000_231)
• Information calculated "on the fly": variation_name, source, allele_string
• API calls:
  • coordinates (region, start, end, strand)
  • individual, population - sample where the allele is present
  • variation - Variation object
Exercise 8
• Get differences between strain DBA/2J and reference genome in exon ENSMUSE00000581647
• Reference sequence: CTGCGTAG...GAAAT...AG...
• Strain sequence: CTGCTTAG...TAAAT...AT...
• AlleleFeature start-end-allele_string: 112-112-T
• AlleleFeature start-end-allele_string: 61-61-T
• AlleleFeature start-end-allele_string: 30-30-T
Read coverage
• Read coverage calculation
[Figure: six reads (read1 to read6) aligned along positions 1 to 11; read5 spans 1-5 and read6 spans 2-8.]
Coverage level 1 between 1 and 11
Coverage level 6 between 3 and 5
API overview (10) ReadCoverage
• ReadCoverage: region in a sample covered by at least n reads (e.g. DBA/2J has at least 1 read in region 3:214000-215000)
• API calls:
  • coordinates (region, start, end, strand)
  • sample - sample where the read coverage is present
  • level - minimum number of reads covering the region
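The notion of a coverage level can be made concrete with a small sweep over read boundaries. This is a sketch only, not how the variation database computes its read coverage: `covered_regions` is a hypothetical helper, and the six read coordinates are made up for illustration (chosen so that every position from 1 to 11 has at least one read and only positions 3 to 5 are covered by all six).

```perl
use strict;
use warnings;

# Hypothetical helper: return the regions ([start, end], 1-based inclusive)
# that are covered by at least $level reads. Each read is [start, end].
sub covered_regions {
    my ($level, @reads) = @_;
    my %delta;                                 # change in depth at each position
    for my $read (@reads) {
        my ($start, $end) = @$read;
        $delta{$start}   += 1;                 # depth rises at the read start
        $delta{$end + 1} -= 1;                 # and drops just past the read end
    }
    my ($depth, $region_start, @regions) = (0);
    for my $pos (sort { $a <=> $b } keys %delta) {
        my $was_covered = $depth >= $level;
        $depth += $delta{$pos};
        my $is_covered = $depth >= $level;
        $region_start = $pos if $is_covered && !$was_covered;
        push @regions, [$region_start, $pos - 1] if $was_covered && !$is_covered;
    }
    return @regions;
}

# Illustrative reads (assumed coordinates, not taken from real data).
my @reads = ([1, 10], [2, 11], [3, 11], [1, 10], [1, 5], [2, 8]);
for my $level (1, 6) {
    my @regions = covered_regions($level, @reads);
    print "Level $level: ", join(', ', map { "$_->[0]-$_->[1]" } @regions), "\n";
}
```

Each level yields its own list of regions, and the regions for a higher level always lie inside those for a lower level.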
Exercise 9
• For region 1:3012359-3017051 in strain 129X1/SvJ, get me the different regions covered
• Level 1 has the following regions covered:
  • 3012359-3017051
• Level 2 has the following regions covered:
  • 3012906-3013686
  • 3013977-3015178
  • 3015357-3015562
  • 3015762-3017031
TSV
Getting More Information
• Database schema - PDF of the different tables in the variation database: ~/ensembl-variation/schema/database-schema.pdf
• perldoc - viewer for the inline Variation API documentation. Also online at: http://www.ensembl.org/info/software/Pdoc/ensembl-variation/index.html
• Perl API tutorial document: http://www.ensembl.org/info/software/core/core_tutorial.html
• ensembl-dev mailing list: ensembl-dev@ebi.ac.uk
Acknowledgements
Ensembl Variation Database Team:
• Arne Stabenau
• Yuan Chen
• Graham McVicker
The Rest of the Ensembl Team.
Presentation adapted from an original by Graham McVicker (EBI).