The R genetics package: Tools for statistical genetics
Gregory R. Warnes
Associate Director, NonClinical Statistics
Pfizer Global R&D, Groton CT
Outline
• Project Goals: Simplify Population Genetic Analysis
• Design Details: Extend R ‘Factor’ objects
• Functions Included
  • Genetic data: Importing & Creation, Manipulation, Information, Annotation, Transformation, Export
  • Statistical Functions: Hardy-Weinberg (Dis-)Equilibrium, Linkage Disequilibrium, Haplotype Imputation, Sample-size tools
• Simple Examples
  • Creating Genotype Objects
• Example Session
• Future Development:
  • Emulate BioConductor Project
  • Large scale SNP analysis
  • Formal Object Class
  • Multi-team collaboration
CT ASA Mini Conference: 2005-03-05
Problem
• At each genetic position within a gene, diploid cells have two alleles.
  • This suggests storing each allele as a separate variable.
• However, most laboratory methods cannot distinguish between A/B and B/A, yielding three observed genotypes at each position: (A/A), (A/B or B/A), (B/B). Consequently, the observed alleles are confounded.
  • This suggests the use of a single genotype variable.
• This duality is not directly handled by standard statistical packages. As a consequence, the need to handle both views creates complexity when manipulating genotype data or including it in statistical analyses.
Initial Project Goals
Simplify statistical analysis using genetic data by providing:
• A genotype object class that appropriately captures the single variable / separate allele duality
• Methods to import and manipulate genotype objects without string manipulation
• Simple tools for including different ‘views’ of genotype variables in standard statistical models
  • Dominant (at least one copy of X)
  • Recessive (both alleles are X)
  • Additive (number of copies of X)
  • Heterozygote Effect (differing alleles)
  • Independent (separate effect for each allele combination: A/A, A/B=B/A, B/B)
• Functions for computing and visualizing common genetic summaries and statistical tests
  • Allele Frequencies
  • Hardy-Weinberg Equilibrium
  • Linkage Disequilibrium
  • Other statistical methods
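The model 'views' listed above can be illustrated with a minimal base R sketch. This is not the package's implementation: `genotype_views` is a hypothetical helper introduced here, and it assumes the two alleles are already available as separate character vectors.

```r
# Hypothetical helper (not part of the genetics package): derive the
# model 'views' of a genotype from two allele vectors using base R only.
genotype_views <- function(a1, a2, allele) {
  # Copies of the target allele per individual; NA propagates for
  # missing genotype calls.
  copies <- (a1 == allele) + (a2 == allele)
  data.frame(
    additive     = copies,        # number of copies of the allele
    dominant     = copies >= 1,   # at least one copy
    recessive    = copies == 2,   # both alleles are the target allele
    heterozygote = a1 != a2       # the two alleles differ
  )
}

a1 <- c("A", "A", "C", "C", NA, "A", "A", "A")
a2 <- c("A", "C", "C", "A", NA, "A", "C", "C")
views <- genotype_views(a1, a2, "A")
print(views)
```

Each column can then enter a standard model formula as an ordinary numeric or logical covariate, which is the point of offering several views of one genotype variable.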
Design Details
• Design:
  • Genotypes are stored in ‘Factor’ objects, with factor levels formatted as ‘A/C’.
  • A translation table is constructed to quickly extract individual allele information.
• Consequences:
  • Can be stored in standard data frames
  • Can be efficiently manipulated (space & time)
  • Permits both biallelic (C/T) and multi-allelic genetic markers (SSLPs)
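The design above can be sketched in base R. This is an illustration of the idea only, under the assumption that the translation table has one row per factor level; the variable names (`calls`, `allele_map`, `alleles`) are made up for this example and do not reflect the package's internal representation.

```r
# Genotype calls stored as an ordinary factor with 'A/C'-style levels.
calls <- c("A/A", "A/C", "C/C", "A/C", NA, "A/A", "A/C", "A/C")
g <- factor(calls)

# Translation table: one row per factor level, one column per allele.
# It is built once from the levels, not from the full data vector.
allele_map <- do.call(rbind, strsplit(levels(g), "/", fixed = TRUE))
rownames(allele_map) <- levels(g)

# Look up both alleles for every observation via the integer level
# codes; a missing genotype yields a row of NA.
alleles <- allele_map[as.integer(g), , drop = FALSE]
rownames(alleles) <- NULL

print(allele_map)
print(alleles)
```

Because the string splitting happens once per level, not once per observation, allele extraction stays cheap even when the factor holds many individuals.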
Genotype Manipulation
• Importing & Creation
  genotype(), as.genotype(), makeGenotypes(), …
  haplotype(), as.haplotype(), makeHaplotypes(), …
• Manipulation
  [] (subsetting), []<- (subset assignment), == (equality)
• Information
  summary() (allele and genotype counts and frequencies), allele.names(), allele() (extract individual alleles), nallele() (number of distinct allele values)
• Annotation
  locus(), gene(), marker(), …
• Transformation
  carrier(), homozygote(), heterozygote(), allele.count()
• Export
  write.marker.file(), write.pedigree.file(), write.pop.file()
Installation
Windows GUI: (screenshot not reproduced)
Command line:
> install.packages("genetics", dependencies=TRUE)
Statistical Functions
• Hardy-Weinberg (Dis-)Equilibrium: D, D', r, r², χ²
  diseq(), diseq.ci() (confidence intervals!)
  HWE.test(), HWE.chisq(), HWE.exact()
• Linkage Disequilibrium: D, D', r, r²
  LD(), LDplot(), LDtable()
• Haplotype Imputation:
  hap(), hapambig(), hapmcmc(), hapenum(), hapshuffle()
• Sample-size tools
  gregorius() (probability of observing a marker of given frequency with specified sample size)
  power.casectrl()
• Utilities
  Bootstrap.ci
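For a biallelic marker, the Hardy-Weinberg quantities named above can be computed by hand. The following base R sketch shows the textbook disequilibrium coefficient D and the Pearson chi-square statistic; it is not the code behind diseq() or HWE.chisq(), and the genotype counts are illustrative values.

```r
# Observed genotype counts (A/A, A/C, C/C); illustrative values only.
obs <- c(AA = 2, AC = 4, CC = 1)
n   <- sum(obs)

# Frequency of allele A: each A/A carries two copies, each A/C one.
p <- unname((2 * obs["AA"] + obs["AC"]) / (2 * n))
q <- 1 - p

# Disequilibrium coefficient: observed minus expected A/A frequency.
D <- unname(obs["AA"] / n) - p^2

# Pearson chi-square against Hardy-Weinberg expectations (1 df).
expected <- n * c(AA = p^2, AC = 2 * p * q, CC = q^2)
chi2     <- sum((obs - expected)^2 / expected)
p_value  <- pchisq(chi2, df = 1, lower.tail = FALSE)

cat(sprintf("p = %.3f, D = %.4f, X^2 = %.4f, p-value = %.3f\n",
            p, D, chi2, p_value))
```

With counts this small the chi-square approximation is unreliable, which is one reason an exact test such as HWE.exact() is offered alongside it.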
Simple Examples: Creating Genotype Objects
A single vector with a character separator:
> g1 <- genotype( c('A/A','A/C','C/C','C/A',
+                   NA,'A/A','A/C','A/C') )
> g3 <- genotype( c('A A','A C','C C','C A',
+                   '','A A','A C','A C'),
+                 sep=' ', remove.spaces=FALSE)
Simple Examples: Creating Genotype Objects
A single vector with a positional separator:
> g2 <- genotype( c('AA','AC','CC','CA','',
+                   'AA','AC','AC'), sep=1 )
Two separate vectors:
> g4 <- genotype(
+     c('A','A','C','C','','A','A','A'),
+     c('A','C','C','A','','A','C','C')
+ )
Simple Examples: Creating Genotype Objects
A data frame or matrix with two columns:
> gm <- cbind(
+     c('A','A','C','C','','A','A','A'),
+     c('A','C','C','A','','A','C','C') )
> gm
     [,1] [,2]
[1,] "A"  "A"
[2,] "A"  "C"
[3,] "C"  "C"
[4,] "C"  "A"
…
> g5 <- genotype( gm )
> g5
[1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
Alleles: A C
Simple Examples: Creating Genotype Objects
Convert 1-column genotype variables read from a file:
> gm1 <- makeGenotypes(
+     read.csv("gm1.csv"))
> gm1
  Age Sex   G1  G2
1  31   M  A/A G/T
2  27   F  A/C G/G
3  35   M  C/C G/T
4  19   M  A/C G/T
5  55   M <NA> G/G
6  34   F  A/A G/G
7  45   F  A/C T/T
8  32   M  A/C G/T
> gm1$G1
[1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
Alleles: A C

Contents of gm1.csv:
Age,Sex,G1,G2
31,M,A/A,G/T
27,F,A/C,G/G
35,M,C/C,G/T
19,M,A/C,G/T
55,M,,G/G
34,F,A/A,G/G
45,F,A/C,T/T
32,M,A/C,G/T
Simple Examples: Creating Genotype Objects
Convert 2-column genotype variables read from a file:
> gm2 <- makeGenotypes(
+     read.csv("gm2.csv"),
+     convert=list(3:4,5:6))
> gm2
  Age Sex G1.1/G1.2 G2.1/G2.2
1  31   M       A/A       G/T
2  27   F       A/C       G/G
3  35   M       C/C       G/T
4  19   M       A/C       G/T
5  55   M      <NA>       G/G
6  34   F       A/A       G/G
7  45   F       A/C       T/T
8  32   M       A/C       G/T

Contents of gm2.csv:
Age,Sex,G1.1,G1.2,G2.1,G2.2
31,M,A,A,G,T
27,F,A,C,G,G
35,M,C,C,T,G
19,M,C,A,G,T
55,M,,,G,G
34,F,A,A,G,G
45,F,A,C,T,T
32,M,A,C,T,G
Simple Examples: Displaying Genotype Information
“Raw”
> g5
[1] "A/A" "A/C" "C/C"
[4] "A/C" NA    "A/A"
[7] "A/C" "A/C"
Alleles: A C

“Summary”
> summary(g5)
Allele Frequency:
   Count Proportion
A      8       0.57
C      6       0.43
NA     2         NA

Genotype Frequency:
    Count Proportion
A/A     2       0.29
A/C     4       0.57
C/C     1       0.14
NA      1         NA
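The counts and proportions in this summary can be reproduced with base R tables. The sketch below is only an illustration of the arithmetic, assuming the alleles of g5 are available as two plain character vectors (`a1` and `a2` are names made up for this example); it is not how summary() is implemented.

```r
# The two alleles of each individual in g5, written out by hand.
a1 <- c("A", "A", "C", "A", NA, "A", "A", "A")
a2 <- c("A", "C", "C", "C", NA, "A", "C", "C")

# Allele table: pool both copies; missing alleles are counted
# separately and excluded from the proportions.
allele_count <- table(c(a1, a2), useNA = "ifany")
allele_prop  <- prop.table(table(c(a1, a2)))

# Genotype table: paste the two alleles, keeping missing calls as NA.
geno       <- ifelse(is.na(a1) | is.na(a2), NA, paste(a1, a2, sep = "/"))
geno_count <- table(geno, useNA = "ifany")
geno_prop  <- prop.table(table(geno))

print(allele_count)
print(round(allele_prop, 2))
print(geno_count)
print(round(geno_prop, 2))
```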
Simple Examples: Extracting allele information
Genotypes (Independent factor levels):
> g5
[1] "A/A" "A/C" "C/C" "A/C"
[5] NA    "A/A" "A/C" "A/C"
Alleles: A C

Allele Counts (Additive Effect):
> allele.count(g5, "A")
[1]  2  1  0  1 NA  2  1  1
attr(,"allele")
[1] "A"

Allele presence (Dominant Effect):
> carrier(g5,'A')
[1]  TRUE  TRUE FALSE  TRUE
[5]    NA  TRUE  TRUE  TRUE

Allele Homozygote (Recessive Effect):
> homozygote(g5,'A')
[1]  TRUE FALSE FALSE FALSE
[5]    NA  TRUE FALSE FALSE

Heterozygote (Heterozygote Advantage Effect):
> heterozygote(g5,'A')
[1] FALSE  TRUE FALSE  TRUE
[5]    NA FALSE  TRUE  TRUE
Simple Examples: Extracting allele information
First allele:
> allele(g5, 1)
[1] "A" "A" "C" "A" NA  "A"
[7] "A" "A"
attr(,"which")
[1] 1
attr(,"allele.names")
[1] "A" "C"

Both alleles:
> allele(g5)
     [,1] [,2]
[1,] "A"  "A"
[2,] "A"  "C"
[3,] "C"  "C"
[4,] "A"  "C"
[5,] NA   NA
[6,] "A"  "A"
[7,] "A"  "C"
[8,] "A"  "C"
attr(,"which")
[1] 1 2
attr(,"allele.names")
[1] "A" "C"
Example Session
Future Development: R GeneticsNG
• Mission: GeneticsNG is a collaborative project to develop a core set of data structures and analytic tools for the management, visualization, and analysis of genetic data. This core will provide sufficient ease of use, stability, features, documentation, and community support to inspire users and developers to use, contribute to, and extend the system.
• Goals:
  • Scalable to whole-genome genetic analysis (>1e5 SNPs)
  • Read/write common genetics data storage formats
  • Port existing open-source genetics codes
    • Current R genetics packages (genetics, haplo.score, gap, …)
    • Other open-source packages…
  • Provide good documentation, including tutorials and training
  • Engage the entire R genetics user/developer community
Future Development: R GeneticsNG
• Current Team
  • Pfizer: Gregory Warnes, Nitin Jain
  • Channing Laboratory (Harvard): Ross Lazarus
  • BMS: Scott D Chasalow, Giovanni Montana
  • Insightful: Michael O'Connell
  • Univ. Chicago: Junsheng Cheng
• Join us!
  • Project Page: http://r-genetics.sf.net/
References
• R Project:
  • http://www.r-project.org
• R genetics package:
  • http://cran.r-project.org/contrib/main/Descriptions/genetics.html
• R-News article:
  • Warnes GR. "The Genetics Package," R News, Volume 3, Issue 1, June 2003.
• R GeneticsNG project:
  • http://r-genetics.sf.net/
• Me:
  • http://www.warnes.net
  • Gregory.R.Warnes@Pfizer.com