Why I chose:
• First reading results seemed counterintuitive
• Introduction full of references I didn't know
• Useful? Or "gee whizz, so what?" ... Needed to read in detail
• Seemed relevant to our MND study: GWAS + imputation + sequencing
• Nicely laid out for journal club presentation

Localisation success rate = probability that the causal SNP is top ranked within an associated region
It depends on the joint effects of selection based on p-value, tagging and genotyping accuracy
Consider 2 SNPs:
• One causal, from sequencing or imputation: imperfect genotyping accuracy
• One tag, from GWAS: perfect genotyping accuracy
• MAF of both SNPs = 0.12
• Causal SNP OR = 1.25
• Selection at tag SNP based on p-value < 0.05 in 1000 cases & 1000 controls
Association test statistic at causal or genotyped SNP (generates Fig 1-3), annotated terms:
• Call rate at causal SNP
• Correlation between actual genotype at causal and genotyped SNPs
• Correlation between actual and estimated genotype at the causal SNP
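The two-SNP setup above can be explored with a small Monte Carlo sketch. This is not the paper's test statistic: it assumes the z-scores at the tag and causal SNPs are bivariate normal, that genotyping error at the causal SNP is independent of the tag, and that the expected z-score follows a rough large-sample approximation. The function names (`noncentrality`, `localization_success`) and the symbols `r_cg` (tag-causal correlation) and `rho_c` (actual vs estimated genotype correlation at the causal SNP) are illustrative only.

```python
import math
import random


def noncentrality(odds_ratio, maf, n_cases, n_controls):
    """Rough expected z-score of a per-allele test at a perfectly typed causal SNP.

    ASSUMPTION: large-sample approximation to the standard error of the
    log odds ratio; not taken from the paper.
    """
    se = math.sqrt((1 / (2 * n_cases) + 1 / (2 * n_controls)) / (maf * (1 - maf)))
    return math.log(odds_ratio) / se


def localization_success(mu, r_cg, rho_c, select=False, z_crit=1.96,
                         n_sim=100_000, seed=1):
    """Share of simulated regions in which the causal SNP outranks the tag.

    mu     expected z-score at the causal SNP with perfect genotyping
    r_cg   correlation between causal and tag genotypes
    rho_c  correlation between actual and estimated causal genotype
    select keep only regions where the tag reaches two-sided p < 0.05
    """
    rng = random.Random(seed)
    corr = r_cg * rho_c  # ASSUMPTION: genotyping error is independent of the tag
    resid = math.sqrt(1 - corr * corr)
    wins = kept = 0
    for _ in range(n_sim):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_tag = r_cg * mu + e1
        z_causal = rho_c * mu + corr * e1 + resid * e2
        if select and abs(z_tag) <= z_crit:
            continue  # region would not have been followed up
        kept += 1
        wins += abs(z_causal) > abs(z_tag)
    return wins / kept


# Slide parameters: OR = 1.25, MAF = 0.12, 1000 cases and 1000 controls.
mu = noncentrality(1.25, 0.12, 1000, 1000)
for r_cg, rho_c in [(0.5, 1.0), (0.9, 1.0), (0.95, 0.7)]:
    plain = localization_success(mu, r_cg, rho_c)
    selected = localization_success(mu, r_cg, rho_c, select=True)
    print(f"r_CG={r_cg} rho_C={rho_c} no selection={plain:.2f} selection={selected:.2f}")
```

Under these assumptions the sketch should show the same qualitative pattern as the slides: tighter tagging, lower accuracy at the causal SNP, and selection at the tag each lower the share of regions in which the causal SNP is top ranked.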

Figure 1. Tagging effect decreases localization success rates with or without the selection effect.
• A & B: Tight linkage disequilibrium between SNPs can obscure the causal SNP
• C & D: Selection at the tag SNP inflates the association evidence at the tag, increasing the probability that it outranks the causal SNP
Localisation success rate = probability that the causal SNP is top ranked within an associated region
Causal MAF 0.12; correlation causal & non-causal seq SNP 0.9; OR = 1.25; perfect genotyping accuracy; tag MAF 0.12

Fig S8: Tagging effect decreases localization success rates with or without the selection effect. 3 SNPs: 1 tag, 1 causal, 1 non-causal sequencing SNP.
Fig S9: Tagging effect decreases localization success rates with or without the selection effect. 5 SNPs: 1 tag, 1 causal, 3 non-causal sequencing SNPs.
Parameter labels on the slide:
• Causal MAF 0.02; correlation causal & non-causal seq SNP 0.9; OR = 1.5; perfect genotyping accuracy; tag MAF 0.02
• Causal MAF 0.02; correlation causal & non-causal seq SNP 0.9; OR = 1.5; perfect genotyping accuracy; tag MAF 0.02
• Causal MAF 0.12; correlation causal & non-causal seq SNP 0.9; OR = 1.25; perfect genotyping accuracy; tag MAF 0.12

Figure 2. Low genotyping accuracy at causal SNP further reduces localization success rates with or without the selection effect.
Sequencing or imputation error decreases the localization success rate, with or without tag selection.
Causal MAF 0.12; OR = 1.25; tag MAF 0.12; perfect genotyping accuracy for tag SNP

Fig S4: Low genotyping accuracy at causal SNP further reduces localization success rates with or without the selection effect. Rare causal SNP.
Causal MAF 0.02; OR = 1.5; tag MAF 0.02; perfect genotyping accuracy for tag SNP

Fig S5: Low genotyping accuracy at causal SNP further reduces localization success rates with or without the selection effect. Common causal SNP.
Causal MAF 0.25; OR = 1.25; tag MAF 0.25; perfect genotyping accuracy for tag SNP

Figure 3. Counter-intuitively, increasing sample size can reduce the localization success rate.
Well-tagged causal SNPs sequenced with low accuracy are unlikely to be correctly identified even as sample size increases.
Causal MAF 0.12; correlation causal & non-causal seq SNP 0.9; OR = 1.25; perfect genotyping accuracy; tag MAF 0.12
When the causal SNP is less accurately genotyped than one of its highly correlated proxies (i.e. rC < rG and rCG is large), the proxy SNP may capture the association better than the causal SNP. As a result, this proxy SNP will out-rank the causal SNP more than 50% of the time.
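The sample-size effect in Figure 3 can be illustrated with a closed-form sketch. It is a simplification, not the paper's derivation: it assumes the two z-scores are bivariate normal with means rC*mu and rG*rCG*mu and correlation rC*rG*rCG, it compares signed z-scores (ignoring the rare case of opposite signs), and it lets the expected z-score mu grow with the square root of the sample size. `MU_AT_1000` is a hypothetical baseline value and the function names are illustrative.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def p_causal_outranks_proxy(mu, r_c, r_g, r_cg):
    """P(z_causal > z_proxy) under the simplified bivariate normal model.

    r_c   genotyping accuracy (correlation) at the causal SNP
    r_g   genotyping accuracy (correlation) at the proxy SNP
    r_cg  correlation between true causal and proxy genotypes
    Requires r_c * r_g * r_cg < 1 so that the spread is non-zero.
    """
    shift = mu * (r_c - r_g * r_cg)                 # mean of z_causal - z_proxy
    spread = math.sqrt(2 * (1 - r_c * r_g * r_cg))  # sd of z_causal - z_proxy
    return normal_cdf(shift / spread)


MU_AT_1000 = 2.3  # HYPOTHETICAL expected z-score with 1000 cases and 1000 controls


def success_by_sample_size(r_c, r_g, r_cg, sizes=(1000, 2000, 4000, 8000)):
    """Success probability at each sample size, with mu scaling as sqrt(n)."""
    return [
        p_causal_outranks_proxy(MU_AT_1000 * math.sqrt(n / 1000), r_c, r_g, r_cg)
        for n in sizes
    ]


# Causal SNP typed less accurately than a well-correlated proxy: rate falls with n.
print(success_by_sample_size(r_c=0.7, r_g=1.0, r_cg=0.9))
# Causal SNP typed perfectly: rate rises with n.
print(success_by_sample_size(r_c=1.0, r_g=1.0, r_cg=0.9))
```

In this sketch the sign of rC - rG*rCG decides the direction: when it is negative, a larger sample only makes the proxy's advantage more certain, which is the counterintuitive result on the slide.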

Panels: MAF = 0.02, MAF = 0.12, MAF = 0.25
Results so far demonstrate the need to correct for the joint effects of selection, tagging and genotyping accuracy on the localization success rate.
How to correct?

Annotated terms in the revised test statistic (G = tag, S = seq):
• Correlation between genotyped and sequenced in sample when no errors
• Call rates, i.e. missingness
• Joint vs individual
• Test statistic at sequenced SNP
• Is zero if independent samples are used for sequencing and identification of tag SNP
• Estimate of selection bias of genetic effect at tag SNP: a form of winner's curse
• Revised test statistic at sequenced SNP
• Missingness rate
• When low, difference between test statistic and revised test statistic increases
• Correlation between true genotype and sequenced genotype in the sample

G = genotyped, C = causal
rCG = correlation between genotyped and causal SNPs
Selection effect most pronounced when power at the tag SNP is low

• Unconditional expected association at the sequenced SNP
• Distortion due to the tag SNP selection, propagated through correlation
• The higher the correlation between the genotyped and sequenced SNP, the higher the test statistic at the sequenced SNP and the lower its variance
• SNPs in high LD with the tag are more likely to be top-ranked = "tagging effect"

• Counts of missingness: estimate from sample
• Bootstrap resampling at the genome-wide level: incorporates information across the whole genome to account for effects of LD and rank on bias
• Mean posterior genotype (e.g. MACH ratio of variance estimate) or full genotype posterior probabilities (e.g. BEAGLE r2)

Scenario 1: GWAS used for discovery, and sequencing/imputation used for fine-mapping around GWAS "hits" using the same GWAS sample.
• GWAS-focused design based on the WTCCC Type 1 Diabetes
• A significant region is identified by a significant GWAS tag SNP (p < 5 x 10^-7) and followed by fine-mapping with post-GWAS data (sequenced or imputed SNPs) in the region surrounding the tag SNP
• The SNP with the largest test statistic in the region is selected as the best candidate causal SNP
Scenario 2: All GWAS and sequenced/imputed SNPs used for discovery and fine-mapping in the same dataset.
Scenario 3: Discovery and fine-mapping using different datasets.
Scenario 4: Discovery and fine-mapping using different datasets + multiple causal SNPs.
Scenario 5: Discovery and fine-mapping using different datasets + missing data (imperfect call rate).
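The naive Scenario 1 procedure (select the region on the tag p-value, then take the top-ranked SNP) can be sketched as below. The function name, the SNP identifiers and the z-score values are hypothetical, and ranking by absolute value is an assumption that the statistics are signed z-scores.

```python
def naive_top_snp(tag_p_value, region_stats, threshold=5e-7):
    """Return the naive best candidate causal SNP for one region.

    tag_p_value   GWAS p-value at the tag SNP
    region_stats  dict mapping SNP id to its test statistic (z-score assumed)
    threshold     genome-wide threshold used to select the region
    Returns None when the tag SNP does not pass the threshold.
    """
    if tag_p_value >= threshold:
        return None  # region is never followed up
    # ASSUMPTION: statistics are signed z-scores, so rank on absolute value.
    return max(region_stats, key=lambda snp: abs(region_stats[snp]))


# HYPOTHETICAL region: one GWAS tag plus two sequenced/imputed SNPs.
region = {"tag_snp": 5.4, "seq_snp_1": 5.1, "seq_snp_2": 5.9}
print(naive_top_snp(1e-7, region))  # tag passes, so the top-ranked SNP is reported
print(naive_top_snp(1e-3, region))  # tag fails, so no fine-mapping is done
```

This is only the uncorrected ranking; the re-ranking discussed in the slides replaces the naive statistic before the top SNP is chosen.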

Table 2. Parameters and parameter values of the main simulation studies.

Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Scenario 1: GWAS used for discovery, and sequencing/imputation used for fine-mapping around GWAS "hits" using the same GWAS sample.
• Adverse effects of tagging (down table) and genotyping accuracy (across table) are highest when the causal SNP is well tagged (larger r) and less accurately sequenced (low rho), e.g. high density GWAS followed by low density sequencing
• Well-tagged causal SNPs suffer lower localisation success rates because the perfectly genotyped tag captures the association better than the imperfectly sequenced/imputed causal SNP
• No good if tag is causal
• After re-ranking, localisation success rate "similar" to when tag is not causal
• "Minor tradeoff" as GWAS SNP unlikely to be causal

Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Scenario 2: All GWAS and sequenced/imputed SNPs used for discovery and fine-mapping in the same dataset, i.e. significance is not required at the GWAS SNP.
• Impact of sample size; correlation between tag and causal SNP fixed
• Genotyping accuracy alone impacts the localisation success rate
• Big impact of re-ranking when sequencing coverage is low and sample size is large

Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Scenario 3: Discovery and fine-mapping using different datasets.
• Very similar rates to Scenario 2

Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Scenario 4: Discovery and fine-mapping using different datasets (as Scenario 3) + multiple causal SNPs.
• Re-ranking improves localisation success for both causal SNPs

Table 4. Localization success rates for simulation Scenarios 5a.
Scenario 5: Discovery and fine-mapping using different datasets + missing data (imperfect call rate) (across table changed).
• Missing data affect localisation success rates in a similar manner to imperfect genotyping accuracy

Summary from simulation
• GWAS-based region selection or moderate genotype error substantially reduces the probability of correctly identifying the causal SNP
• Proposed re-ranking can recover lost power, increasing localisation success rates by 1.5 to 3 times
• When genotyping accuracy is high, power lost due to tagging is small, so re-ranking has no effect

Figure 4. Naïve test statistics and re-ranking statistics for regions surrounding rs78246868 in the 8q24.21 region for association with prostate cancer risk.
Michaela et al, prostate cancer:
• Consortium, different genotyping platforms
• Imputed to 1000 Genomes
• Fixed-effect meta-analysis
• Cohorts excluded from association analysis if imputation r2 < 0.8
• Report 5 statistically independent regions within the 8q24.21 locus, plus 11q13.3 and 17q24.3
• Selected all SNPs in LD r2 > 0.2 with index SNP
• Didn't exclude studies based on imputation r2
• Only correct for imputation accuracy, i.e. deltaG = 0
• New top SNPs for 8q24.21 and 17q24.3
• 8q24.21: 2 SNPs move from lower ranks to top 10%

Figure 5. Naïve test statistics and re-ranking statistics for regions surrounding rs8071558 in the 17q24.3 region for association with prostate cancer risk.
• 8 SNPs move from lower ranks to top 10%
• SNPs naively ranked in top 10% stay highly ranked
• When most SNPs are well genotyped, re-ranking only makes subtle changes
• One poorly imputed SNP (yellow) moves from rank 245 to 16
• Association driven by one study (rank 10); when removed, SNP rank is 306, changing to 106

DISCUSSION
• Tagging and genotyping accuracy are non-trivial sources of bias that could obscure association evidence at the causal SNP
• Proposed re-ranking is simple to implement and can substantially increase the probability of identifying the causal SNP
• For low coverage sequencing we recommend the re-ranking method
• For imputation and high coverage sequencing we recommend that unfiltered SNPs in associated regions be used with the re-ranking method
• Large changes in rank should be carefully examined for heterogeneity between studies
• Re-ranking is most beneficial when genotyping accuracy is low
• High density genotyping followed by low density sequencing can generate misleading results: don't do it
• Imputation and sequencing software output accurate estimates of rho needed for the re-ranking

DISCUSSION
• Re-ranking is important when study-specific factors exacerbate GWAS-based selection and genotyping error:
• High genetic diversity, so sequence reads are difficult to align
• Low LD among SNPs or lack of population-specific reference panel, so poor imputation
• Low MAF SNPs tend to suffer from both low power and high genotyping error
When genotyping accuracy is very poor, re-ranking may not be able to generate useful results: first consider accuracy thresholds recommended by the genotype calling or imputation algorithm.
Re-ranking only improves localization success when applied to SNPs under the alternative, i.e. SNPs that are themselves causal or in LD with a causal SNP.

• Existing methods that incorporate genotyping uncertainty into tests for association do not completely recover lost power
• This paper considered frequentist and Bayesian methods of incorporating uncertainty
• We anticipate that re-ranking to correct for the adverse effects of selection, tagging and differential genotyping accuracy rates will continue to be important, because cost-effective designs are for low-coverage, large sample sizes
