Genome-wide association studies

Genome-wide association studies Usman Roshan

SNP • Single nucleotide polymorphism • Occurs at a specific position on a specific chromosome

SNP genotype Suppose this is the DNA on chromosome 1 starting from position 1. There is a C/G SNP at position 5, a C/T SNP at position 14, and a G/T SNP at position 21. This person is heterozygous at the first SNP and homozygous at the other two.
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC

SNP genotype representation The example
F: AACACAATTAGTACAATTATGAC
M: AACAGAATTAGTACAATTATGAC
is represented as CG CC GG …
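The representation above can be reproduced with a short sketch. The function names `genotype_at` and `genotypes` are illustrative and not from the slides; the code assumes 1-based SNP positions, as in the example, and writes each genotype with its two alleles in alphabetical order.

```python
# The two chromosome copies from the slide (labelled F and M there).
F = "AACACAATTAGTACAATTATGAC"
M = "AACAGAATTAGTACAATTATGAC"

# SNP positions from the slide, 1-based.
SNP_POSITIONS = [5, 14, 21]


def genotype_at(copy_a: str, copy_b: str, position: int) -> str:
    """Return the two alleles at a 1-based position, sorted alphabetically."""
    alleles = (copy_a[position - 1], copy_b[position - 1])
    return "".join(sorted(alleles))


def genotypes(copy_a: str, copy_b: str, positions: list[int]) -> list[str]:
    """Genotype string for each listed SNP position."""
    return [genotype_at(copy_a, copy_b, p) for p in positions]


print(genotypes(F, M, SNP_POSITIONS))  # ['CG', 'CC', 'GG']
```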

SNP genotype • For several individuals
      A/T  C/T  G/T  …
H0:   AA   TT   GG   …
H1:   AT   CC   GT   …
H2:   AA   CT   GT   …
. . .

SNP genotype encoding • If a SNP has alleles A/B (alphabetically ordered), encode each genotype as the number of copies of B it contains. • Previous example becomes
      A/T  C/T  G/T  …         A/T  C/T  G/T  …
H0:   AA   TT   GG   …          0    2    0   …
H1:   AT   CC   GT   …    =>    1    0    1   …
H2:   AA   CT   GT   …          0    1    1   …
Now we have data in numerical format
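A minimal sketch of this encoding, assuming each SNP is written as "A/B" and each genotype as a two-letter string. The name `encode_genotype` is illustrative, not from the slides.

```python
def encode_genotype(genotype: str, snp: str) -> int:
    """Count copies of the alphabetically later allele of an 'A/B' SNP."""
    first, second = sorted(snp.split("/"))
    if any(allele not in (first, second) for allele in genotype):
        raise ValueError(f"genotype {genotype!r} does not match SNP {snp!r}")
    return genotype.count(second)


# The three SNPs and three individuals from the slide.
snps = ["A/T", "C/T", "G/T"]
individuals = {
    "H0": ["AA", "TT", "GG"],
    "H1": ["AT", "CC", "GT"],
    "H2": ["AA", "CT", "GT"],
}

encoded = {
    name: [encode_genotype(g, s) for g, s in zip(gts, snps)]
    for name, gts in individuals.items()
}
print(encoded)  # {'H0': [0, 2, 0], 'H1': [1, 0, 1], 'H2': [0, 1, 1]}
```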

Genome wide association studies (GWAS) • Aim to identify which regions (or SNPs) in the genome are associated with a disease or another phenotype. • Design: • Identify population structure • Select case subjects (those with disease) • Select control subjects (healthy) • Genotype a million SNPs for each subject • Determine which SNPs are associated with the disease.

Example GWAS
            A/T  C/G  A/G  …
Case 1      AA   CC   AA
Case 2      AT   CG   AA
Case 3      AA   CG   AA
Control 1   TT   GG   GG
Control 2   TT   CC   GG
Control 3   TA   CG   GG

Encoded data
        A/T  C/G  A/G        A/T  C/G  A/G
Case1   AA   CC   AA          0    0    0
Case2   AT   CG   AA          1    1    0
Case3   AA   CG   AA    =>    0    1    0
Con1    TT   GG   GG          2    2    2
Con2    TT   CC   GG          2    0    2
Con3    TA   CG   GG          1    1    2

Ranking SNPs
        SNP1 SNP2 SNP3       SNP1 SNP2 SNP3
        A/T  C/G  A/G        A/T  C/G  A/G
Case1   AA   CC   AA          0    0    0
Case2   AT   CG   AA          1    1    0
Case3   AA   CG   AA    =>    0    1    0
Con1    TT   GG   GG          2    2    2
Con2    TT   CC   GG          2    0    2
Con3    TA   CG   GG          1    1    2
A good ranking strategy would produce SNP3, SNP1, SNP2
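To see why SNP3, SNP1, SNP2 is a sensible order, one simple heuristic is to score each SNP by how far apart the average encoded genotype is in cases and controls. This is only an illustrative sketch and not the ranking method the slides go on to describe; the names `column_mean` and `rank_by_mean_difference` are made up for this example.

```python
# Encoded genotypes from the slide: rows are subjects, columns are SNP1..SNP3.
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]
snp_names = ["SNP1", "SNP2", "SNP3"]


def column_mean(rows, j):
    """Average encoded genotype of SNP j over the given subjects."""
    return sum(row[j] for row in rows) / len(rows)


def rank_by_mean_difference(cases, controls, names):
    """Rank SNPs by |mean(case) - mean(control)|, largest first (a heuristic)."""
    scores = {
        name: abs(column_mean(cases, j) - column_mean(controls, j))
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)


ranking = rank_by_mean_difference(cases, controls, snp_names)
print(ranking)  # ['SNP3', 'SNP1', 'SNP2']
```

On this small table SNP3 separates cases from controls completely (difference 2), SNP1 partly (4/3), and SNP2 hardly at all (1/3).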

Chi-square test • Gold standard is the univariate non-parametric chi-square test with two degrees of freedom. • Search for SNPs that deviate from the independence assumption. • Rank SNPs by p-values

Statistical test of association (P-values) • P-value = probability, under the null hypothesis, of data at least as extreme as the data observed • Example: • Suppose we are given a series of coin tosses • We suspect that a biased coin produced the tosses • We can ask the following question: if the coin were fair, how probable would tosses at least this extreme be? • If this probability is very small, the observed tosses would be unlikely under a fair coin, which is evidence against the fair coin. • In this example the null hypothesis is the fair coin and the alternative hypothesis is the biased coin

Binomial distribution • Bernoulli random variable: • Two outcomes: success or failure • Example: coin toss • Binomial random variable: • Number of successes in a series of independent Bernoulli trials • Example: • Probability of heads = 0.5 • Given four coin tosses what is the probability of three heads? • Possible outcomes: HHHT, HHTH, HTHH, THHH • Each outcome has probability = 0.5^4 • Total probability = 4 * 0.5^4 = 0.25

Binomial distribution • Bernoulli trial probability of success=p, probability of failure = 1-p • Given n independent Bernoulli trials what is the probability of k successes? • Binomial applet: http://www.stat.tamu.edu/~west/applets/binomialdemo.html
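The standard answer is the binomial probability: the number of ways to choose which k of the n trials succeed, times p^k (1-p)^(n-k). A minimal sketch using only the Python standard library follows; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Three heads in four fair tosses: 4 * 0.5^4
print(binomial_pmf(3, 4, 0.5))  # 0.25
```

This reproduces the four-toss example: four equally likely outcomes with three heads, each of probability 0.5^4.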

Hypothesis testing under Binomial hypothesis • Null hypothesis: fair coin (probability of heads = probability of tails = 0.5) • Data: HHHHTHTHHHHHHHTHTHTH (15 heads in 20 tosses) • P-value under null hypothesis = probability that #heads >= 15 in 20 tosses • This probability is 0.021 • Since it is below 0.05 we can reject the null hypothesis at the 0.05 level
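The p-value on this slide can be checked by summing binomial probabilities over the upper tail. This is a small sketch using only the standard library; `upper_tail` is an illustrative name, and the test is one-sided as on the slide.

```python
from math import comb

tosses = "HHHHTHTHHHHHHHTHTHTH"
n = len(tosses)            # number of tosses
heads = tosses.count("H")  # observed number of heads


def upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_value = upper_tail(heads, n)
print(n, heads, round(p_value, 3))  # 20 15 0.021
```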

Chi-square statistic
           #Allele1 (risk)   #Allele2 (wildtype)
Case       c1 (X1)           c2 (X2)
Control    c3 (X3)           c4 (X4)
• Define four random variables Xi, each binomially distributed: Xi ~ B(n, pi), where n = c1+c2+c3+c4 is the total number of subjects and pi is the probability of success of Xi. • Each variable Xi is the count in one cell of the table: case or control subjects crossed with risk or wildtype allele. • The expected value E(Xi) = n*pi since each Xi is binomial.

Chi-square statistic Define the statistic:
$$\chi^2 = \sum_{i=1}^{n} \frac{(c_i - e_i)^2}{e_i}$$
where c_i = observed frequency for the ith outcome, e_i = expected frequency for the ith outcome, and n = total number of outcomes (categories). For large samples the probability distribution of this statistic is given by the chi-square distribution with n-1 degrees of freedom. Proof can be found at http://ocw.mit.edu/NR/rdonlyres/Mathematics/18-443Fall2003/4226DF27-A1D0-4BB8-939A-B2A4167B5480/0/lec23.pdf Great. But how do we use this to get a SNP p-value?

Null hypothesis for case control contingency table
           #Allele1 (risk)   #Allele2 (wildtype)
Case       c1                c2
Control    c3                c4
• We have two random variables: • D: disease status • G: allele type • Null hypothesis: the two variables are independent of each other (unrelated) • Under independence, with n = c1+c2+c3+c4 • P(D,G) = P(D)P(G) • P(D=case) = (c1+c2)/n • P(G=risk) = (c1+c3)/n • Expected values • E(X1) = P(D=case)P(G=risk)n • We can calculate the chi-square statistic for a given SNP and, from it, a p-value: the probability of a statistic at least this large if the SNP were independent of disease status. • SNPs with very small p-values deviate significantly from the independence assumption and are therefore considered important.
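The expected values for all four cells follow the same pattern as E(X1). The sketch below applies it to allele counts taken from SNP1 (A/T) in the small GWAS example: the three cases (AA, AT, AA) carry 5 A and 1 T, and the three controls (TT, TT, TA) carry 1 A and 5 T. Which allele is the "risk" allele is not stated for that example, so the columns are treated simply as first and second allele. The function name `expected_counts` is illustrative.

```python
def expected_counts(c1, c2, c3, c4):
    """Expected 2x2 cell counts when disease status and allele are independent.

    Layout follows the slide: rows are case/control, columns are the two alleles.
    """
    n = c1 + c2 + c3 + c4
    p_case = (c1 + c2) / n      # P(D = case)
    p_allele1 = (c1 + c3) / n   # P(G = first-column allele)
    return [
        [p_case * p_allele1 * n, p_case * (1 - p_allele1) * n],
        [(1 - p_case) * p_allele1 * n, (1 - p_case) * (1 - p_allele1) * n],
    ]


# SNP1 allele counts: cases 5 A and 1 T, controls 1 A and 5 T.
print(expected_counts(5, 1, 1, 5))  # [[3.0, 3.0], [3.0, 3.0]]
```

Under independence every cell would be expected to hold 3 alleles, while the observed cells hold 5 or 1; that gap is what the chi-square statistic measures.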

Chi-square statistic exercise
           #Allele1   #Allele2
Case       15         35
Control    2          48
• Compute expected values and chi-square statistic • Compute chi-square p-value by referring to chi-square distribution
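One way to work the exercise is sketched below with only the standard library. It assumes the usual test of independence for a 2x2 table, which has (2-1)(2-1) = 1 degree of freedom because the row and column totals are estimated from the data; for 1 degree of freedom the chi-square tail probability equals erfc(sqrt(x/2)). The function name `chi_square_2x2` is illustrative.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom), 2x2 table."""
    (c1, c2), (c3, c4) = table
    n = c1 + c2 + c3 + c4
    row_totals = [c1 + c2, c3 + c4]
    col_totals = [c1 + c3, c2 + c4]
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # Tail probability of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2))
    return expected, statistic, p_value


expected, statistic, p_value = chi_square_2x2([[15, 35], [2, 48]])
print(expected)             # [[8.5, 41.5], [8.5, 41.5]]
print(round(statistic, 2))  # 11.98
print(p_value)
```

The p-value comes out well below 0.05, so for these counts independence between allele and disease status would be rejected.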

GWAS problems and applications • Detect causal SNPs • Chi-square • Multivariate approaches • Predict case and control from genotypes • Machine learning algorithms • A simple algorithm based on Euclidean distances
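The slides do not say which Euclidean-distance algorithm is meant. As one possible reading, the sketch below labels a new subject by whichever class mean (case or control) is closer to its encoded genotype vector. The training data is the encoded example from earlier; the two new subjects are hypothetical, and the names `centroid` and `predict` are illustrative.

```python
from math import dist

# Encoded genotypes from the earlier example (columns SNP1..SNP3).
cases = [[0, 0, 0], [1, 1, 0], [0, 1, 0]]
controls = [[2, 2, 2], [2, 0, 2], [1, 1, 2]]


def centroid(rows):
    """Mean vector of a list of equal-length genotype vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(genotype, case_rows, control_rows):
    """Label a genotype by the nearer class mean in Euclidean distance.

    Ties go to "control" (an arbitrary choice for this sketch).
    """
    d_case = dist(genotype, centroid(case_rows))
    d_control = dist(genotype, centroid(control_rows))
    return "case" if d_case < d_control else "control"


# Hypothetical new subjects, not taken from the slides.
print(predict([0, 1, 0], cases, controls))  # case
print(predict([2, 1, 2], cases, controls))  # control
```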
