Genotype Imputation for African Americans and Hispanics in WHI using reference haplotypes from the 1000 Genomes Project • Presented by Qing Duan • Dr. Yun Li group, UNC at Chapel Hill • 09-13-2012
Outline • Imputation • Study samples: WHI African American and Hispanic samples • Reference haplotypes: 1000 Genomes Project (version 3, March 2012 release) • Number of markers in reference haplotypes: ~38M • Post-imputation quality assessment • Evaluation of imputation quality by comparison with actual genotypes from Metabochip genotyping • Estimation of the total number of QC+ markers and the number of QC+ indels
QC on WHI Genotypes • QC was performed within the African American and Hispanic samples separately, for autosomes and chromosome X. • We excluded markers failing any of the following: • Hardy-Weinberg equilibrium: HWE p-value < 1e-6 • Genotype completeness: < 90% • Minor allele frequency (MAF) • Chromosomes 1-22: MAF < 1% • Chromosome X: singleton or monomorphic markers • With thanks to Eric Yi Liu
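The exclusion rules on this slide can be written as one per-marker check. The following is a minimal sketch, not the pipeline actually used: the `MarkerStats` record and its field names are hypothetical, the per-marker statistics are assumed to have been computed already, and treating a value exactly at a threshold as passing is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MarkerStats:
    """Hypothetical per-marker summary; field names are illustrative only."""
    name: str
    chrom: str                # "1".."22" or "X"
    hwe_p: float              # Hardy-Weinberg test p-value
    call_rate: float          # fraction of samples with a non-missing genotype
    minor_allele_count: int   # copies of the minor allele observed
    n_alleles: int            # total non-missing alleles observed


def passes_gwas_qc(m: MarkerStats) -> bool:
    """Return True if the marker survives the three filters on the slide."""
    if m.hwe_p < 1e-6:            # Hardy-Weinberg filter
        return False
    if m.call_rate < 0.90:        # genotype completeness filter
        return False
    if m.n_alleles == 0:          # nothing observed, cannot estimate MAF
        return False
    if m.chrom == "X":
        # Chromosome X: drop monomorphic (0 copies) and singleton (1 copy) markers.
        return m.minor_allele_count >= 2
    # Autosomes: drop markers with MAF below 1%.
    return m.minor_allele_count / m.n_alleles >= 0.01


markers = [
    MarkerStats("rs_a", "1", 0.5, 0.99, 300, 16000),   # passes
    MarkerStats("rs_b", "1", 1e-8, 0.99, 300, 16000),  # fails HWE
    MarkerStats("rs_c", "2", 0.5, 0.85, 300, 16000),   # fails completeness
    MarkerStats("rs_d", "3", 0.5, 0.99, 50, 16000),    # fails MAF (about 0.3%)
    MarkerStats("rs_e", "X", 0.5, 0.99, 1, 12000),     # singleton on X
    MarkerStats("rs_f", "X", 0.5, 0.99, 5, 12000),     # passes on X
]
kept = [m.name for m in markers if passes_gwas_qc(m)]
```

The marker names and counts in the usage lines are made up for illustration and are not WHI data.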
Summary of samples and GWAS QC+ markers • Number of individuals • WHI_AA: 8,421 / WHI_HA: 3,587 • Number of markers Note: imputation of chromosome X is still in progress, so results for chromosome X will be available soon.
Reference Haplotypes • The complete set of 1000G Phase I Integrated Release version 3 haplotypes in vcf format (March 2012 release) • A total of 2184 haplotypes • A total of ~38M markers • including singleton and monomorphic sites • About 1.4M markers are short indels and large deletions, the rest SNPs.
Note on reference haplotypes • A more recent, reduced set of reference haplotypes, with singleton and monomorphic markers removed, is also available. • Number of markers: ~30M • Every marker in the reduced set is included in the complete set of reference haplotypes. • We expect singleton and monomorphic markers to have little influence on imputation quality, because: • Phasing of the reference haplotypes was performed with the singleton and monomorphic markers included • Our previous evaluation shows little effect of singletons on the quality of imputation (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117).
Two-step genotype imputation-- Procedure • Step 1: Pre-phasing (MaCH1) • WHI African American and Hispanic samples were phased separately. • Step 2: Genotype imputation (minimac) • WHI African American and Hispanic samples were imputed separately. • Haplotype-to-haplotype imputation: the pre-phased study haplotypes from step 1 are imputed against the complete set of reference haplotypes from the 1000 Genomes Project.
Two-step genotype imputation-- Computational costs • Phasing and imputation strategy • Split chromosomes into segments • Phase / impute each segment • Ligate segments back to chromosomes
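The split / process / ligate strategy can be sketched on marker indices. This is an illustrative sketch only: the slide does not give segment sizes or overlaps, so `core_size` and `flank` are assumed parameters, and the function names are hypothetical. Each segment carries flanking markers on both sides so that results near the edges are informed by neighbouring markers, and ligation keeps only the core of each segment.

```python
def split_into_segments(n_markers, core_size, flank):
    """Cut marker indices 0..n_markers-1 into cores padded by flanks.

    core_size and flank are assumed tuning parameters, not values from the slides.
    """
    segments = []
    for core_start in range(0, n_markers, core_size):
        core_end = min(core_start + core_size, n_markers)
        segments.append({
            "start": max(0, core_start - flank),      # first marker processed
            "end": min(n_markers, core_end + flank),  # one past the last marker processed
            "core_start": core_start,                 # first marker kept after ligation
            "core_end": core_end,                     # one past the last marker kept
        })
    return segments


def ligate(segments, results):
    """Concatenate the core part of each per-segment result, in order.

    results[i] holds one value per marker in segments[i], flanks included.
    """
    chromosome = []
    for seg, res in zip(segments, results):
        offset = seg["core_start"] - seg["start"]
        length = seg["core_end"] - seg["core_start"]
        chromosome.extend(res[offset:offset + length])
    return chromosome


# Toy run: the "result" for each marker is just its own index.
segs = split_into_segments(n_markers=10, core_size=4, flank=1)
per_segment = [list(range(s["start"], s["end"])) for s in segs]
whole = ligate(segs, per_segment)
```

Plain concatenation of cores is adequate for per-marker output such as imputed dosages. Ligating phased haplotypes additionally requires the phase to be made consistent across the overlapping flanks, which this sketch does not attempt.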
Summary of imputation results -- Before QC Note: Markers whose quality filter is missing in the 1000G reference haplotypes are excluded from imputation. All excluded markers are of type “MERGED_DEL”.
Evaluation of imputation quality-- Introduction • Main idea • Compare imputed dosages with actual genotypes • Quality metric • Dosage r2: squared correlation coefficient between imputed dosages (continuous value ranging between 0 and 2) and actual genotypes (coded as 0, 1 and 2) • True imputation accuracy (range 0 ~ 1) • Rsq: estimated dosage r2 • Estimated imputation accuracy
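Dosage r2 as defined on this slide is the squared Pearson correlation between the imputed dosages and the actual genotypes at one marker. A minimal sketch in plain Python follows; the function name is hypothetical, and returning NaN when either vector has no variation is an assumed convention.

```python
import math


def dosage_r2(dosages, genotypes):
    """Squared Pearson correlation between imputed dosages (0..2, continuous)
    and actual genotypes (0, 1 or 2) for a single marker."""
    n = len(dosages)
    if n != len(genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_d = sum(dosages) / n
    mean_g = sum(genotypes) / n
    cov = sum((d - mean_d) * (g - mean_g) for d, g in zip(dosages, genotypes))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    if var_d == 0 or var_g == 0:
        return math.nan  # correlation is undefined for a constant vector
    return (cov * cov) / (var_d * var_g)


# Toy values, not WHI data: dosages that track the genotypes closely.
r2 = dosage_r2([0.1, 0.9, 1.8, 0.2, 1.1], [0, 1, 2, 0, 1])
```

Because the metric is a correlation, it rewards dosages that rank samples correctly even when they are shrunk toward the mean, which is why it lies between 0 and 1 regardless of allele frequency.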
Evaluation of imputation quality-- Study design • Workflow: actual genotypes (Metabochip) and imputed dosages are compared to calculate dosage r2 • Individuals used in evaluation • 1,962 WHI African American samples • Markers used in evaluation • Markers overlapping between 1000G and Metabochip but not on Affymetrix 6.0 (all 22 autosomes) • Minor allele frequency (MAF) is defined within the 1,962 individuals
Estimation of imputation quality-- Summary • We recommend Rsq QC thresholds of 0.7, 0.6, and 0.3 for the MAF 0.1~0.5%, 0.5~1%, and >1% categories, respectively • The thresholds are chosen such that an average Rsq greater than 0.8 is achieved in each MAF category (Liu, EY, et al., Genetic Epidemiology, 2012, 36:107-117). • Estimation based on imputation quality assessment • Total number of markers passing QC • Total number of indels passing QC
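The MAF-dependent thresholds can be applied with a small lookup. This is a sketch under stated assumptions: the function names are hypothetical, a marker passes when its Rsq is at or above the threshold, a MAF of exactly 0.5% or 1% is placed in the 0.5~1% category, and markers with MAF below 0.1% are treated as not passing because the slide gives no threshold for them.

```python
def rsq_threshold(maf):
    """Return the recommended Rsq cutoff for a MAF given as a fraction,
    or None when the slide specifies no cutoff (MAF below 0.1%)."""
    if maf > 0.01:       # MAF > 1%
        return 0.3
    if maf >= 0.005:     # MAF 0.5~1% (boundary handling is an assumption)
        return 0.6
    if maf >= 0.001:     # MAF 0.1~0.5%
        return 0.7
    return None


def passes_post_imputation_qc(maf, rsq):
    """True if the marker's Rsq meets the cutoff for its MAF category."""
    threshold = rsq_threshold(maf)
    return threshold is not None and rsq >= threshold


# Toy (MAF, Rsq) pairs, not WHI results.
examples = [(0.20, 0.35), (0.007, 0.55), (0.002, 0.75), (0.0005, 0.95)]
flags = [passes_post_imputation_qc(maf, rsq) for maf, rsq in examples]
```

Rarer variants get stricter cutoffs because, at the same Rsq, their true dosage r2 tends to be lower than for common variants.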
Estimation based on imputation quality assessment-- Note • The values are estimates because: • The Rsq cutoffs are estimated • The evaluation is based on markers on Metabochip • The MAF is estimated • MAF of imputed markers is calculated from imputed dosages
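Estimating MAF from imputed dosages amounts to averaging the dosages and folding the resulting allele frequency. A minimal sketch, assuming diploid autosomal markers where each dosage is the expected number of copies (0 to 2) of one fixed allele; the function name is hypothetical.

```python
def maf_from_dosages(dosages):
    """Estimate minor allele frequency from per-sample imputed dosages.

    Assumes each dosage is the expected copy count (0..2) of the same allele
    in a diploid sample.
    """
    if not dosages:
        raise ValueError("no dosages given")
    allele_freq = sum(dosages) / (2 * len(dosages))
    # Fold so the result is the frequency of the rarer allele.
    return min(allele_freq, 1 - allele_freq)


# Toy dosages, not WHI data.
maf = maf_from_dosages([0.0, 0.0, 1.0, 2.0])
```

Since dosages are expectations rather than called genotypes, this estimate can differ from the MAF that direct genotyping would give, which is one reason the marker counts are described as estimates.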
Estimation based on imputation quality assessment-- Note (cont’d) • The values are estimates because: • The QC thresholds for WHI Hispanic samples are estimated • We assumed that WHI Hispanics have Rsq cutoffs in each MAF category similar to those of WHI African Americans • We will perform a similar quality assessment in the Hispanic samples once we have their QC+ Metabochip data • The QC thresholds for indels are estimated • Rsq cutoffs are set based on the evaluation of SNPs. We assumed that indels have Rsq cutoffs in each MAF category similar to those of SNPs
Estimation based on imputation quality assessment-- Total number of markers passing QC Note: Markers include both SNPs and indels
Estimation based on imputation quality assessment-- Number of indels passing QC
Summary • We conducted genotype imputation for 8,421 African American and 3,587 Hispanic samples in the Women’s Health Initiative (WHI) study using reference haplotypes from the 1000 Genomes Project (version 3, March 2012 release) • Summary of imputation results before and after QC