What is an association study? Define linkage disequilibrium.

What is an association study? Define linkage disequilibrium. Martina Owens 17.12.10

Essay Plan • Association - definition - causes (direct causation, epistatic effect, population stratification, LD) • Linkage Disequilibrium • definition • HapMap project • Genomewide Association studies • rationale (to test the hypothesis that one or more common variants explain part of the genetic risk for a disease) • Overview of GWAS • GWAS in identifying disease susceptibility (WTCCC study) • Limitations of association studies • - large copy number variants (CNVs) in the genomes of healthy people • - common disease-common variant and mutation-selection hypotheses hypothesis • - problem of identifying the actual causal variant • - detects only alleles that are common (>5%) in a population • - large numbers of cases and controls are required • - need for replication studies

Association - definition A tendency of two characters (diseases, marker alleles etc) to occur together at non-random frequencies Statistical statement about the co-occurrence of alleles or phenotypes: allele A is associated with disease D if people who have disease D also have allele A significantly more often (or perhaps left often) than would be predicted from the individual frequencies of D and A in the population. E.g. HLA-DR4 is found in 36% of the population but in 78% of people with rheumatoid arthritis. Thus, HLA-DR4 is associated with RA • Strength of the association is measured by: • Relative risk • - the amount by which having allele A multiplies the baseline risk of developing the disease, D. • - calculated by ascertaining the incidence of D in people with and without allele A. • - a relative risk of 1 means that having allele A confers no additional risk of RA • Odds ratio • - alternative to relative risk • - can be calculated directly from the results of a case control study, without the need for any • information about the population incidence • - odds ratio of 1 mean that the factor has no effect on risk

Association - causes • Direct causation • - having allele A makes you susceptible to disease D. • may not be sufficient to cause D but increases likelihood of developing the • disease. • An epistatic effect • people who have disease D might be more likely to survive and have children of • they have allele A. • Population stratification • allele A and disease D happen to be particularly frequent in a subset of a • population containing several genetically distinct subsets. • Linkage disequilibrium • - the disease-associated allele A marks an ancestral chromosome segment that • carries a sequence variant causing susceptibility to the disease • - association studies depend on LD • Type I error (false positives)

Linkage Disequilibrium (LD) The association of two alleles at separate but linked loci more frequently than would be expected by chance, normally as the result of a particular ancestral haplotype being common in the population studied. Two unrelated individuals with a disease susceptibility allele inherited from an ancient common ancestor will be separated by many meisosis and recombination events – the shared chromosomal segment will therefore be very small. Only alleles at loci tightly linked to the disease susceptibility locus will still be shared. To study LD, a sample of individuals from the chosen population is genotyped for a series of linked polymorphic markers and the data phased (converted into haplotypes) prior to analysis. Computer programs can be used to convert genotypes to haplotypes. The international HapMap project involved typing several million SNPs in individuals from four human populations (CEU, YRI, CHB, JPT) and identified blocks of strong LD of ~5-15kb containing 30-70 SNPs - genotyping ~2-4 tag-SNPs in each block may enable capture most of the genetic variability

GenomeWide Association Studies (GWAS) The ability of SNPs to tag surrounding blocks of ancient DNA (haplotypes) underlies the rationale for GWAS. Used to test the hypothesis that one or more common variants explain part of the genetic risk for a disease Typically case–control design:

The pooling of results obtained in genomewide association studies under the auspices of large consortia is often required for the detection of variants with small effects on the risk of disease.

GWAS in identifying disease susceptibility Nearly 600 genomewide association studies covering 150 distinct diseases and traits have been published, with nearly 800 SNP–trait associations reported as significant (P<5×10−8). Consortia have been established to perform case-control studies with 1000 or more subjects in each arm funded by, for example, the Genetic Association Information Partnership (GAIN) in the US, and the Wellcome Trust Case-Control Consortium (WTCCC) in the UK. Typed 2000 well-phenotyped cases of each of seven diseases: bipolar disorder, coronary heart disease, Crohn disease, hypertension, rheumatoid arthritis, T1DM and T2DM. 3000 healthy controls were also collected.

All samples were typed for >500,000 SNPs using the Affymetrix GeneChip 500K Mapping Array Set Case-control comparisons identified 24 independent association signals at P<5 x 10-7 All signals identified reflect genuine susceptibility effects based on prior findings and replication studies completed Observed association at many previously identified loci, and found compelling evidence that some loci confer risk for more than one of the diseases studied Replication studies are required to establish a definitive relationship with disease.

The importance of appropriately large samples was confirmed by the modest effect sizes observed at most loci identified. • The estimated power of the study to detect risk factors was 80% for a relative risk • of 1.5 (risk of developing the disease). • In a simple comparison between an experimental group and a control group: • A relative risk of 1 means there is no difference in risk between the two groups. • An RR of < 1 means the event is less likely to occur in the experimental group than in the control group. • An RR of > 1 means the event is more likely to occur in the experimental group than in the control group. • Therefore an RR of 1.5 = 1.5 times more likely to develop the disease. • A common story is emerging from this and other similar studies: the ancient • susceptibility factors can be identified by current methods, but the relative risks • they confer are quite small.

Limitations of Association Studies Large copy number variants (CNVs) in the genomes of healthy people - account for a greater number of variable nucleotides than do all the SNPs. - possible that these are susceptibility factors for common disease. - newer generations of SNP genotping assays include assays for common CNVs. The common disease-common variant hypothesis proposes that susceptibility factors have ancient origins - the fact that GWAS are identifying associations that can be replicated indicates that some susceptibility factors are common variants but modest size of their effects leaves open the question of how much of the total susceptibility they will explain. - E.g. GWAS by Easton et al identified 5 new factors in Breast cancer but despite the very large scale of the study (>20,000 patients) these factors together only accounted for 3.6% of the excess familial risk.

Limitations of Association Studies The mutation-selection hypothesis suggests that a heterogeneous collection of recent mutations accounts for most disease susceptibility • - alternative view to common disease-common variant hypothesis • hypothesises that common diseases are common because of mutation-selection balance (i.e. many or most susceptibility factors are deleterious enough to be removed by natural selection and are replaced by new deleterious variants generated by recurrent random mutation) • - DMD: rapid turnover but balanced with high de novo mutation rate • - mild deleterious variants may persist long enough to be present on ancient haplotype blocks and so may be identifiable through association studies.

Limitations of Association Studies A complete account of genetic susceptibility will require contributions from both the common disease-common variant and mutation-selection hypothesis and they should therefore not be seen as being mutually exclusive Under either hypothesis, the problem of distinguishing pathogenic from neutral remains Association studies only flag the chromosomal location of the causative factor to a relatively large region or haplotype block. Therefore there is always a problem of identifying the actual causal variant. Large numbers of cases and controls are required – can be challenging to organise Detects only alleles that are common (>5%) in a population Replication studies are needed

References Strachan and Read 4th edition chapter 15 Hardy & Singleton 2009, NEJM, 360(17): 1759 Feero & Guttmacher 2010, NEJM, 363(2): 166

What is an association study? Define linkage disequilibrium.