R Packages for Genome-Wide Association Studies
Qunyuan Zhang, Division of Statistical Genomics
Statistical Genetics Forum, March 10, 2008
What is R?
• R is a free software environment for statistical computing and graphics.
• Runs on a wide variety of UNIX platforms, Windows and MacOS (interactive or batch mode)
• Free and open source; can be downloaded from cran.r-project.org
• Wide range of packages (base and contributed); novel methods available
• Concise grammar and good structure (functions, data objects, methods and classes)
• Help from manuals and email group
• Slow, time- and memory-consuming (can be overcome by parallel computation and/or integration with C)
• Popular, used by 70~80% of statisticians
Statistical Genetics Packages in R
http://cran.r-project.org/web/views/Genetics.html
• Population Genetics: genetics (basic), Geneland (spatial structures of genetic data), rmetasim (population genetics simulations), hapsim (simulation), popgen (clustering SNP genotype data and SNP simulation), hierfstat (hierarchical F-statistics of genetic data), hwde (modeling genotypic disequilibria), Biodem (biodemographical analysis), kinship (pedigree analysis), adegenet (population structure), ape & apTreeshape (phylogenetic and evolution analyses), ouch (Ornstein-Uhlenbeck models), PHYLOGR (simulation and GLS model), stepwise (recombination breakpoints)
• Linkage and Association: gap (both population and family data, sample size calculations, probability of familial disease aggregation, kinship calculation, linkage and association analyses, haplotype frequencies), tdthap (TDT for haplotypes; haplotype transmission/disequilibrium tests), powerpkg (power analyses for the affected sib pair and the TDT design), hapassoc (likelihood inference of trait associations with haplotypes in GLMs), haplo.ccs (haplotype and covariate relative risks in case-control data by weighted logistic regression), haplo.stats (haplotype analysis for unrelated subjects), ldDesign (experiment design for association and LD studies), LDheatmap (heatmap of pairwise LD), mapLD (LD and haplotype blocks), pbatR (R version of PBAT), GenABEL & SNPassoc for GWAS
• QTL mapping for data from experimental crosses: bqtl (inbred crosses and recombinant inbred lines), qtl (genome-wide scans), qtlDesign (designing QTL experiments & power computations), qtlbim (Bayesian interval QTL mapping)
• Sequence & Array Data Processing: seqinr, Bioconductor packages
GenABEL
Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: an R package for genome-wide association analysis. Bioinformatics. 2007, 23(10):1294-6.
GenABEL: genome-wide SNP association analysis
A package for genome-wide association analysis between quantitative or binary traits and single-nucleotide polymorphisms (SNPs).
Version: 1.3-5
Depends: R (≥ 2.4.0), methods, genetics, haplo.stats, qvalue, MASS
Date: 2008-02-17
Author: Yurii Aulchenko, with contributions from Maksim Struchalin, Stephan Ripke and Toby Johnson
Maintainer: Yurii Aulchenko <i.aoultchenko at erasmusmc.nl>
License: GPL (≥ 2)
In views: Genetics
CRAN checks: GenABEL results
GenABEL: Data Objects
gwaa.data-class
• phdata: phenotypic data (data frame)
• gtdata: genotypic data (snp.data-class)
snp.data-class (snp.data())
• nbytes: number of bytes used to store data on a SNP
• nids: number of people
• male: male code
• idnames: ID names
• nsnps: number of SNPs
• snpnames: list of SNP names
• chromosome: list of chromosomes corresponding to SNPs
• coding: list of nucleotide coding for SNP names
• strand: strands of the SNPs
• map: list of SNPs' positions
• gtps: genotypes (snp.mx-class)
2-bit storage
• 0 00
• 1 01
• 2 10
• 11
• Save 75%
Loading and converting data
• load.gwaa.data(phenofile = "pheno.dat", genofile = "geno.raw")
• convert.snp.text(): from text file (GenABEL default format)
• convert.snp.ped(): from Linkage, Merlin, Mach, and similar files
• convert.snp.mach(): from Mach format
• convert.snp.tped(): from PLINK TPED format
• convert.snp.illumina(): from Illumina/Affymetrix-like format
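The 2-bit storage idea can be sketched in base R. This is only an illustration, not GenABEL's internal code: the function names pack_genotypes() and unpack_genotypes() are hypothetical, the genotype codes 0 to 3 and the bit order within a byte are assumptions, and the 75% saving is read here as relative to storing one byte per genotype.

```r
# Pack genotype codes (0..3, two bits each) four to a byte, then unpack again.
# The code-to-bit mapping and the bit order are assumptions for illustration.
pack_genotypes <- function(codes) {
  stopifnot(all(codes %in% 0:3))
  # Pad with zeros to a multiple of four so that every byte is fully populated.
  padded <- c(codes, rep(0L, (4 - length(codes) %% 4) %% 4))
  m <- matrix(as.integer(padded), nrow = 4)
  # The first genotype of each group of four occupies the two highest bits.
  as.raw(m[1, ] * 64L + m[2, ] * 16L + m[3, ] * 4L + m[4, ])
}

unpack_genotypes <- function(bytes, n) {
  b <- as.integer(bytes)
  m <- rbind(b %/% 64L, (b %/% 16L) %% 4L, (b %/% 4L) %% 4L, b %% 4L)
  as.vector(m)[seq_len(n)]   # drop the padding added by pack_genotypes()
}

codes <- c(1L, 2L, 3L, 0L, 2L)
packed <- pack_genotypes(codes)            # five genotypes fit in two bytes
unpacked <- unpack_genotypes(packed, length(codes))
```

The number of genotypes has to be stored alongside the bytes (as nids is in the slot list above), because the padding is otherwise indistinguishable from real codes.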
GenABEL: Data Manipulation
• snp.subset(): subset data by SNP names or by QC criteria
• add.phdata(): merge extra phenotypic data into the gwaa.data-class
• ztransform(): standard normalization of phenotypes
• rntransform(): rank-normalization of phenotypes
• npsubtreated(): non-parametric adjustment of phenotypes for medicated subjects
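The two phenotype normalizations can be illustrated in base R. This is a sketch of the general techniques, not the code behind ztransform() or rntransform(); the function names and the (rank - 0.5) / n offset are assumptions.

```r
# Standard (z) normalization: centre to mean 0, scale to standard deviation 1.
z_transform <- function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)

# Rank-based normalization: replace each value by the standard-normal quantile
# of its rank. The (rank - 0.5) / n offset is one common choice (an assumption).
rank_normal_transform <- function(y) {
  n <- sum(!is.na(y))
  r <- rank(y, na.last = "keep", ties.method = "average")
  qnorm((r - 0.5) / n)
}

y <- c(2.1, 3.5, 10.0, 4.2, NA, 3.9)   # made-up phenotype with an outlier and a missing value
z <- z_transform(y)
rn <- rank_normal_transform(y)
```

The rank-based version depends only on the ordering of the values, so the outlier at 10.0 has no more influence than any other largest observation.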
GenABEL: QC & Summarization
• summary.snp.data(): summary of SNP data (number of observed genotypes, call rate, allelic frequency, genotypic distribution, P-value of HWE test)
• check.trait(): summary of phenotypic data and outlier check based on a specified p/FDR cut-off
• check.marker(): SNP selection based on call rate, allele frequency and deviation from HWE
• HWE.show(): shows HWE tables, Chi2 and exact HWE P-values
• perid.summary(): call rate and heterozygosity per person
• ibs(): matrix of average IBS for a group of people and a given set of SNPs
• hom(): average homozygosity (inbreeding) for a set of people, across multiple markers
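The per-SNP quantities listed for QC (call rate, allele frequency, HWE test) can be computed by hand as a check. The sketch below uses base R only; snp_qc() is a hypothetical helper, genotypes are assumed to be coded as the number of B alleles, and the HWE test shown is the 1-d.f. Pearson chi-square rather than the exact test.

```r
# Per-SNP QC summary; g holds the number of B alleles (0, 1, 2), NA = not called.
snp_qc <- function(g) {
  obs <- g[!is.na(g)]
  n <- length(obs)
  counts <- c(AA = sum(obs == 0), AB = sum(obs == 1), BB = sum(obs == 2))
  p_b <- (counts[["AB"]] + 2 * counts[["BB"]]) / (2 * n)   # frequency of allele B
  # Expected genotype counts under Hardy-Weinberg equilibrium.
  expected <- n * c((1 - p_b)^2, 2 * p_b * (1 - p_b), p_b^2)
  chi2 <- sum((counts - expected)^2 / expected)            # undefined for monomorphic SNPs
  list(call_rate = n / length(g),
       freq_b = p_b,
       hwe_chi2 = chi2,
       hwe_p = pchisq(chi2, df = 1, lower.tail = FALSE))
}

# Made-up SNP: 36 AA, 48 AB, 16 BB and 5 missing calls.
g <- c(rep(0, 36), rep(1, 48), rep(2, 16), rep(NA, 5))
qc <- snp_qc(g)
```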
GenABEL: SNP Association Scans
• scan.glm(): SNP association test using GLM in R library
  scan.glm("y~x1+x2+…+CRSNP", family = gaussian(), data, snpsubset, idsubset)
  scan.glm("y~x1+x2+…+CRSNP", family = binomial(), data, snpsubset, idsubset)
• scan.glm.2D(): 2-SNP interaction scan
Fast Scan (calls C language)
• ccfast(): case-control association analysis by computing chi-square test from 2x2 (allelic) or 2x3 (genotypic) tables
• emp.ccfast(): genome-wide significance (permutation) for ccfast() scan
• qtscore(): association test (GLM) for a trait (quantitative or categorical)
• emp.qtscore(): genome-wide significance (permutation) for qtscore() scan
• mmscore(): score test for association between a trait and genetic polymorphism, in samples of related individuals (needs stratification variable; scores are computed within strata and then added up)
• egscore(): association test, adjusted for possible stratification by principal components of genomic kinship matrix (SNP correlation matrix)
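To make the 2x2 allelic test and the permutation idea concrete, here is a base-R sketch on simulated data. It is not the C code behind ccfast() or emp.ccfast(); the helper names are hypothetical, and taking the maximum statistic over all SNPs in each permutation is one common way to obtain genome-wide empirical p-values, assumed here for illustration.

```r
# Allelic (2x2) chi-square statistic for one SNP.
# g: number of B alleles per person (0, 1, 2); status: 0 = control, 1 = case.
allelic_chi2 <- function(g, status) {
  keep <- !is.na(g) & !is.na(status)
  g <- g[keep]; status <- status[keep]
  tab <- cbind(A = tapply(2 - g, status, sum),   # A alleles per group
               B = tapply(g, status, sum))       # B alleles per group
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)             # Pearson statistic, 1 d.f., no correction
}

# Empirical p-values: compare each observed statistic with the distribution of
# the maximum statistic across all SNPs under permuted case/control labels.
emp_pvalues <- function(G, status, n_perm = 200) {
  observed <- apply(G, 2, allelic_chi2, status = status)
  max_null <- replicate(n_perm,
                        max(apply(G, 2, allelic_chi2, status = sample(status))))
  sapply(observed, function(x) (1 + sum(max_null >= x)) / (1 + n_perm))
}

set.seed(1)
status <- rep(c(0, 1), each = 100)
G <- matrix(rbinom(200 * 20, 2, 0.3), nrow = 200)          # 20 simulated null SNPs
G[, 1] <- c(rbinom(100, 2, 0.2), rbinom(100, 2, 0.6))      # SNP 1 simulated as associated
chi2 <- apply(G, 2, allelic_chi2, status = status)
p_nominal <- pchisq(chi2, df = 1, lower.tail = FALSE)
p_empirical <- emp_pvalues(G, status)
```

Each permutation reuses the same shuffled labels for every SNP, which keeps the correlation between SNPs intact in the null distribution.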
GenABEL: Haplotype Association Scans
• scan.haplo(): haplotype association test using GLM in R library
• scan.haplo.2D(): 2-haplotype interaction scan
(haplo.stats package required)
• Sliding window strategy
• Posterior probabilities of haplotypes via EM algorithm
• GLM-based score test for haplotype-trait association (Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. 2002. Score tests for association of traits with haplotypes when linkage phase is ambiguous. Am J Hum Genet 70: 425-434.)
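The EM step can be illustrated on the smallest possible case, two biallelic SNPs, where only double heterozygotes have ambiguous phase. This is a minimal base-R sketch under that assumption, not the algorithm implemented in haplo.stats; em_haplotype_freq() is a hypothetical name, and genotypes are coded as the number of "1" alleles at each SNP.

```r
# EM estimate of the four haplotype frequencies (00, 01, 10, 11) for two SNPs.
em_haplotype_freq <- function(g1, g2, max_iter = 200, tol = 1e-10) {
  keep <- !is.na(g1) & !is.na(g2)
  g1 <- g1[keep]; g2 <- g2[keep]
  n <- length(g1)
  dh <- g1 == 1 & g2 == 1                 # double heterozygotes: phase is ambiguous
  freq <- c("00" = 0.25, "01" = 0.25, "10" = 0.25, "11" = 0.25)
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that a double heterozygote carries 11/00
    # rather than 10/01, given the current frequencies.
    w <- freq[["11"]] * freq[["00"]] /
      (freq[["11"]] * freq[["00"]] + freq[["10"]] * freq[["01"]])
    # Expected number of 11 haplotypes per person (known exactly unless dh).
    n11 <- ifelse(dh, w, g1 * g2 / 2)
    # M-step: the other three counts follow from the genotype totals.
    new <- c("00" = sum(2 - g1 - g2 + n11), "01" = sum(g2 - n11),
             "10" = sum(g1 - n11), "11" = sum(n11)) / (2 * n)
    converged <- max(abs(new - freq)) < tol
    freq <- new
    if (converged) break
  }
  freq
}

# Made-up genotypes for four people at two SNPs in strong LD.
g1 <- c(0, 1, 2, 1)
g2 <- c(0, 1, 2, 1)
freq <- em_haplotype_freq(g1, g2)
```

In a sliding-window scan the same estimation would be repeated for each window of adjacent SNPs, with longer windows needing a more general haplotype enumeration than this two-SNP case.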
GenABEL: GWAS results
scan.gwaa-class, returned from scan.glm, scan.haplo, ccfast, qtscore, emp.ccfast, emp.qtscore
Names:
• snpnames: list of names of SNPs tested
• P1df: p-values of 1-d.f. (additive or allelic) test for association
• P2df: p-values of 2-d.f. (genotypic) test for association
• Pc1df: p-values from the 1-d.f. test for association between SNP and trait; the statistic is corrected for possible inflation
• effB: effect of the B allele in allelic test
• effAB: effect of the AB genotype in genotypic test
• effBB: effect of the BB genotype in genotypic test
• map: list of map positions of the SNPs
• chromosome: list of chromosomes the SNPs belong to
• idnames: list of subjects used in analysis
• lambda: inflation factor estimate, as computed using lower portion (say, 90%) of the distribution, and standard error of the estimate
• formula: formula/function used to compute p-values
• family: family of the link function / nature of the test
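The inflation factor and the corrected p-values can be sketched as follows. This is an assumed reading of "computed using lower portion (say, 90%) of the distribution": the sorted observed 1-d.f. chi-square statistics are regressed, without intercept, on their expected quantiles, and the slope is taken as lambda. The function names are hypothetical and GenABEL's own estimator may differ in detail.

```r
# Inflation factor from the lower part of the chi-square distribution.
estimate_lambda <- function(chi2, proportion = 0.9) {
  chi2 <- sort(chi2[!is.na(chi2)])
  n <- length(chi2)
  expected <- qchisq(ppoints(n), df = 1)     # expected quantiles under no association
  keep <- seq_len(floor(proportion * n))     # lower 90% by default
  obs_low <- chi2[keep]
  exp_low <- expected[keep]
  fit <- lm(obs_low ~ 0 + exp_low)           # slope through the origin
  unname(coef(fit))
}

# Corrected 1-d.f. p-values: divide each statistic by lambda before the lookup.
corrected_p <- function(chi2, lambda) {
  pchisq(chi2 / lambda, df = 1, lower.tail = FALSE)
}

# Made-up example: null statistics inflated by a factor of exactly 1.2.
chi2 <- 1.2 * qchisq(ppoints(1000), df = 1)
lambda <- estimate_lambda(chi2)
p_corrected <- corrected_p(chi2, lambda)
```

Restricting the fit to the lower portion keeps truly associated SNPs, which sit in the upper tail, from pulling the estimate upward. A standard error for the slope could be read from summary() of the fitted model.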
GenABEL: Table & Graphic Functions
• descriptives.marker(): table of marker info
• descriptives.trait(): table of trait info
• descriptives.scan(): table of scan results
• plot.scan.gwaa(): plot of scan results
• plot.check.marker(): plot of marker data (QC etc.)
GenABEL: Computer Efficiency
Platform: Intel Xeon 2.8 GHz processor, SuSE Linux 9.2, R 2.4.1
Data: 2000 subjects x 500K chip
• Memory: ~3.2 G
• Loading time: ~4 min
• SNP summary: ~1 min
• Call ccfast: ~0.5 min
• Call qtscore: ~2 min
• Total: < 10 min
• Permutation test (N = 10,000): 73~120 hrs (3~5 days)
SNPassoc
An R package to perform whole genome association studies. Juan R. González, et al. Bioinformatics, 2007, 23(5):654-655.
SNPassoc: SNPs-based whole genome association studies
This package carries out the most common analyses performed in whole genome association studies. These include descriptive statistics and exploratory analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of association based on generalized linear models (for either quantitative or binary traits), and analysis of multiple SNPs (haplotype and epistasis analysis). Permutation tests and related tests (sum statistic and truncated product) are also implemented.
Version: 1.4-9
Depends: R (≥ 2.4.0), haplo.stats, survival, mvtnorm
Date: 2007-Oct-16
Author: Juan R González, Lluís Armengol, Elisabet Guinó, Xavier Solé, and Víctor Moreno
Maintainer: Juan R González <jrgonzalez at imim.es>
License: GPL version 2 or newer
URL: http://www.r-project.org and http://davinci.crg.es/estivill_lab/snpassoc
In views: Genetics
CRAN checks: SNPassoc results
SNPassoc: Data & Summary
• setupSNP(data=snp-pheno.table, info=map.table, colSNPs=, sep = "/", ...)
• summary(): allele frequencies, percentage of missing values, HWE test
SNPassoc: Association Tests
• WGassociation(y~x1+x2, data=, model = (codominant, dominant, recessive, overdominant, log-additive or all), quantitative = , level = 0.95)
• scanWGassociation(): only p values
• association(): only for selected SNPs; can do stratified and GxE interaction analyses
Results
• Summary: a summary table by genes/chromosomes
• WGstats: detailed output (case-control numbers, percentages, odds ratios/mean differences, 95% confidence intervals, P-value for the likelihood ratio test of association, AIC, etc.)
• Pvalues: a table of p-values for each genetic model for each SNP
• Plot: p values in the -log scale for plot.WGassociation()
• Labels: returns the names of the SNPs analyzed
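The five genetic models differ only in how the three genotypes are coded before the GLM is fitted. The base-R sketch below shows one conventional coding and a likelihood-ratio test against the intercept-only model. It is an illustration on simulated data, not SNPassoc's implementation; code_genotype() and model_pvalue() are hypothetical names, and B is assumed to be the allele being counted.

```r
# Recode a genotype (number of B alleles: 0, 1, 2) under a given genetic model.
code_genotype <- function(g, model) {
  switch(model,
    codominant     = factor(g, levels = 0:2, labels = c("AA", "AB", "BB")),
    dominant       = as.integer(g >= 1),   # AB or BB vs. AA
    recessive      = as.integer(g == 2),   # BB vs. AA or AB
    overdominant   = as.integer(g == 1),   # AB vs. AA or BB
    "log-additive" = g,                    # per-allele trend
    stop("unknown model: ", model))
}

# Likelihood-ratio p-value of the SNP term; assumes no missing values in y or g.
model_pvalue <- function(y, g, model, family = binomial()) {
  x <- code_genotype(g, model)
  null <- glm(y ~ 1, family = family)
  full <- glm(y ~ x, family = family)
  anova(null, full, test = "Chisq")[2, "Pr(>Chi)"]
}

set.seed(42)
g <- rbinom(500, 2, 0.3)
y <- rbinom(500, 1, plogis(-0.5 + 0.8 * g))   # simulated log-additive effect
models <- c("codominant", "dominant", "recessive", "overdominant", "log-additive")
pvals <- sapply(models, function(m) model_pvalue(y, g, m))
```

The codominant model spends 2 d.f. on the SNP and the others 1 d.f., which is why a table of p-values across models is usually inspected together rather than picking the smallest one without adjustment.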
SNPassoc: Multiple-SNP Analysis
SNP–SNP Interaction
• interactionPval(): epistasis analysis between all pairs of SNPs (and covariates)
Haplotype Analysis
• haplo.glm(): from the R package haplo.stats; association analysis of haplotypes with a response via GLM
• haplo.interaction(): interactions between haplotypes (and covariates)
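A pairwise epistasis scan amounts to fitting, for every pair of SNPs, a model with and without the product term and comparing the two. The sketch below does this in base R with log-additive coding and a likelihood-ratio test. It is an assumed simplification for illustration, not what interactionPval() computes internally, and the function names are hypothetical.

```r
# Likelihood-ratio test for the interaction between two SNPs (0/1/2 coding).
interaction_pvalue <- function(y, g1, g2, family = binomial()) {
  main <- glm(y ~ g1 + g2, family = family)   # main effects only
  full <- glm(y ~ g1 * g2, family = family)   # main effects plus product term
  anova(main, full, test = "Chisq")[2, "Pr(>Chi)"]
}

# Test every pair of columns of the genotype matrix G.
pairwise_interactions <- function(y, G) {
  pairs <- t(combn(ncol(G), 2))
  data.frame(snp1 = pairs[, 1],
             snp2 = pairs[, 2],
             p = apply(pairs, 1, function(ij)
               interaction_pvalue(y, G[, ij[1]], G[, ij[2]])))
}

set.seed(7)
G <- matrix(rbinom(400 * 4, 2, 0.4), nrow = 400)          # 4 simulated SNPs
y <- rbinom(400, 1, plogis(-1 + 1.0 * G[, 1] * G[, 2]))   # interaction between SNPs 1 and 2
res <- pairwise_interactions(y, G)
```

The number of pairs grows quadratically with the number of SNPs, so a scan of this kind is usually restricted to a pre-selected subset of markers.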
SNPassoc: Computer Efficiency
Data: 1000 subjects x 3000 SNPs
• Import data: 5 min
• setupSNP(): 40 min
• scanWGassociation() (only p values, including permutation test): 30 min
• Memory usage: 750 MB