Real data and GWAS Case Study

Real data and GWAS Case Study CSCI2820 – Medical Bioinformatics

Outline • Introduction to Biology • Introduction to CS • Data Generation • Data Acquisition and Databases • A closer look: Linkage Disequilibrium • GWAS Case Study

DNA DNA: the chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms.

DNA Organization in the Human Genome Genome facts • The pair of sex chromosomes determines sex. • 2 copies of each autosome • ~3.2 billion base pairs • Around 2.9 billion bases organized into scaffolds • Only about 90% of the genome has been sequenced!

Gene • A gene is the functional and physical unit of heredity passed from parent to offspring. • Genes are pieces of DNA, and most genes contain the information for making a specific protein. http://en.wikipedia.org/wiki/Gene

Central Dogma http://www.dnalc.org/resources/3d/ • gene • Unit of inheritance • Transcribed into mRNA • mRNA • messenger RNA • blueprint for protein • proteins • Essential molecules that are active in practically all cellular processes • Genes – RNA – Proteins • Useful video: http://www.dnalc.org/resources/3d/central-dogma.html
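The gene → mRNA → protein flow can be sketched in a few lines. This is only an illustration: the codon table below is deliberately partial (it lists just the codons used in the example), the input is assumed to be the coding strand, and the function names are hypothetical.

```python
# Partial codon table: an assumption for illustration, not the full genetic code.
CODON_TABLE = {
    "AUG": "M", "CAG": "Q", "AUC": "I", "UUC": "F", "GUG": "V", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T is replaced by U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon; unknown codons become '?'."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGCAGATCTTCGTGAAATAA")
protein = translate(mrna)
print(mrna, protein)
```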

Variation • Single base mutation, indels • Structural Variation • Deletion • Duplication • Translocation • Inversion • Recombination http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna

Recombination

Intro to CS • Algorithm • “a procedure for solving a mathematical problem in a finite number of steps…” • Input -> Computation -> Output • E.g. sorting n numbers • Theory • Analysis of algorithms • Application • For biologists: Mathematica programming! • Mathematica demo http://commons.wikimedia.org/wiki/File:Selection-Sort-Animation.gif
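The "sorting n numbers" example, with the selection sort shown in the linked animation, might look like this in Python (the slides use Mathematica; Python is substituted here purely for illustration):

```python
def selection_sort(values):
    """Return a sorted copy of values using selection sort (O(n^2) comparisons)."""
    items = list(values)          # input: copy so the caller's list is untouched
    n = len(items)
    for i in range(n - 1):        # computation: grow the sorted prefix one slot at a time
        smallest = i
        for j in range(i + 1, n):
            if items[j] < items[smallest]:
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]
    return items                  # output

print(selection_sort([5, 2, 9, 1]))
```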

Bioinformatics • Regardless of your profession, it is important to study both the biological and computational aspects of the problem • Understanding the biology may help computational researchers create more accurate models, more accurate solutions, help identify biases, etc… • Understanding the computation may help biologists compute better results, create a better study design, develop fine-tuned solutions to unresolved problems, etc…

Data Generation • Types of Data • Variation (SNPs, structural) • Genotype • Haplotype • Sequence Reads • Protein Structure • Genes • Technologies • SNP Array • Sequencing
Genotype: {A,C} C {G,T} {C,T} {C,A} {T,A}
Haplotype 1: ACGCCT (complementary strand: TGCGGA)
Haplotype 2: CCTTAA (complementary strand: GGAATT)
Algorithmic Opportunity! Input: Genotypes Output: Haplotypes

Haplotype Phasing • Haplotype phasing: separate an individual’s genotype (the unordered combination of the paired chromosomes) into the maternal and paternal chromosomes (haplotypes)
Genotype: 100121200 (0 and 1 = homozygous for allele 0 or 1, 2 = heterozygous)
Explanation 1: hap 1 = 100111100, hap 2 = 100101000
Explanation 2: hap 1 = 100111000, hap 2 = 100101100
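To see why a genotype can have more than one explanation, the sketch below enumerates every haplotype pair compatible with a genotype string. It assumes the encoding 0/1 = homozygous and 2 = heterozygous, which is consistent with the example above; the function name is hypothetical, and this brute-force enumeration is not a practical phasing algorithm (a genotype with k heterozygous sites has 2^(k-1) explanations).

```python
from itertools import product

def compatible_phasings(genotype: str):
    """List all unordered haplotype pairs that explain a 0/1/2-coded genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    if not het_sites:
        return [(genotype, genotype)]
    phasings = []
    # Fix the first heterozygous site to '1' on hap 1 so swapped pairs are not counted twice.
    for rest in product("10", repeat=len(het_sites) - 1):
        hap1, hap2 = list(genotype), list(genotype)
        for site, bit in zip(het_sites, ("1",) + rest):
            hap1[site] = bit
            hap2[site] = "0" if bit == "1" else "1"
        phasings.append(("".join(hap1), "".join(hap2)))
    return phasings

for hap1, hap2 in compatible_phasings("100121200"):
    print(hap1, hap2)
```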

SNP Arrays SNP array intensity allele calls Probe intensities Allele 0 Allele 1 http://www.sanger.ac.uk/resources/software/illuminus/ http://www-microarrays.u-strasbg.fr/base.php?page=affySNPsE.php

Sanger Sequencing Long reads: ~500-1000bp Low error rates Very slow

High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fragmented, amplified, and fixed onto an array; bases are then added • Single-molecule or 3rd-generation technologies • Source of bias • Error signature • Short reads: ~50-200bp (454 can get up to 1kb) • Generally more error than Sanger • Extremely fast and parallel

NCBI • http://www.ncbi.nlm.nih.gov/

EBI • http://www.ebi.ac.uk/

HapMap • http://hapmap.ncbi.nlm.nih.gov/

GWAS Data • International Multiple Sclerosis Genetics Consortium • MS Data: • 931 Trios (Mother, Father, Affected Child) • ~350k SNPs • Wellcome Trust Case-Control Consortium • Covers many diseases • dbGaP • Repository for association studies

1000 Genomes • Aims to sequence the genomes of 1000 individuals • Many individuals taken from HapMap samples • Data available from 3 pilot studies • High coverage, full genome sequencing of 2 trios • Low coverage, genome sequencing on several individuals • High coverage, exome sequencing on several individuals

Protein Data Bank

PDB File
HEADER    CHROMOSOMAL PROTEIN                     02-JAN-87   1UBQ
TITLE     STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: UBIQUITIN;
COMPND   3 CHAIN: A;
…
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C
ATOM      4  O   MET A   1      27.886  26.463   4.263  1.00  9.62           O
ATOM      5  CB  MET A   1      25.112  24.880   3.649  1.00 13.77           C
ATOM      6  CG  MET A   1      25.353  24.860   5.134  1.00 16.29           C
ATOM      7  SD  MET A   1      23.930  23.959   5.904  1.00 17.17           S
ATOM      8  CE  MET A   1      24.447  23.984   7.620  1.00 16.11           C
ATOM      9  N   GLN A   2      26.335  27.770   3.258  1.00  9.27           N
ATOM     10  CA  GLN A   2      26.850  29.021   3.898  1.00  9.07           C
ATOM     11  C   GLN A   2      26.100  29.253   5.202  1.00  8.72           C
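A rough sketch of reading the ATOM records shown above. It assumes whitespace-separated fields, which holds for these lines; real PDB files are fixed-width, and columns can run together, so a production parser should slice by column position or use an established library. The function names are illustrative.

```python
import math

def parse_atom_line(line: str):
    """Parse one whitespace-separated ATOM record; return None for other lines."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        return None
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "chain": fields[4],
        "residue_number": int(fields[5]),
        "xyz": tuple(float(v) for v in fields[6:9]),   # coordinates in angstroms
        "occupancy": float(fields[9]),
        "b_factor": float(fields[10]),
        "element": fields[11],
    }

def distance(atom_a, atom_b) -> float:
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])

n_atom = parse_atom_line("ATOM 1 N MET A 1 27.340 24.430 2.614 1.00 9.67 N")
ca_atom = parse_atom_line("ATOM 2 CA MET A 1 26.266 25.413 2.842 1.00 10.38 C")
print(round(distance(n_atom, ca_atom), 2))  # N-CA bond length of the first residue
```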

Linkage Disequilibrium • D’ in real data • HLA-DRA: Chromosome 6 bases 32515-32520kb • Surrounding area: 32400-32600kb • LD in different populations • LD in different phasings • LD in different regions of the genome
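For reference, D, D’ and r² for a pair of biallelic SNPs can be computed from phased haplotype counts. The sketch below uses the standard definitions (D = p_AB − p_A·p_B, normalized by its maximum attainable magnitude for D’); the allele labels, function name and example counts are made up for illustration and are not taken from the HLA-DRA data.

```python
def ld_stats(hap_counts):
    """Return (D, D', r^2) from counts of the haplotypes 'AB', 'Ab', 'aB', 'ab'."""
    total = sum(hap_counts.values())
    n_ab_upper = hap_counts.get("AB", 0)
    p_hap = n_ab_upper / total
    p_a = (n_ab_upper + hap_counts.get("Ab", 0)) / total   # frequency of allele A at SNP 1
    p_b = (n_ab_upper + hap_counts.get("aB", 0)) / total   # frequency of allele B at SNP 2
    d = p_hap - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if d_max == 0 or denom == 0:
        return d, None, None        # at least one SNP is monomorphic
    return d, abs(d) / d_max, d * d / denom

d, d_prime, r2 = ld_stats({"AB": 40, "Ab": 10, "aB": 10, "ab": 40})
print(d, d_prime, r2)
```

Because the counts are per haplotype, the result depends on how the genotypes were phased, which is one reason LD can differ between phasings.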

Linkage Disequilibrium heat maps. • The markers are distributed along the x-axis. • Each cell represents a pair of SNPs; the darker the red, the higher the LD between the two markers. • CEU = Utah residents of northern and western European ancestry • YRI = Yoruba in Ibadan, Nigeria (30 trios)

A GWAS Case Study: Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study

The Biology of Multiple Sclerosis • A chronic inflammatory disease of the central nervous system (CNS): the brain and the spinal cord. • A malfunction of the immune system leads to attacks against, and destruction of, the myelin sheath. • Symptoms range from mild muscle weakness to partial or complete paralysis.

Previous Associations • In 1972, the association between multiple sclerosis and the HLA region of the genome was established. • HLA-DRB1 gene on chromosome 6p21 was identified. The human leukocyte antigen system (HLA) is the name of the human major histocompatibility complex (MHC). This group of genes resides on chromosome 6; it includes genes encoding cell-surface antigen-presenting proteins as well as many other genes. The major HLA antigens are essential elements in immune function.

Genome-wide Association Studies (GWAS) • GWAS Goal • Identify patterns of polymorphisms that vary systematically between individuals with different disease states (in particular, healthy and disease) and could therefore represent the effect of risk-enhancing or protective alleles. • Let’s follow the paper Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study

GWAS Workflow

Genotypes • Critical Issues • SNP tagging • Include other types of polymorphism? • microsatellites • copy number variation • How is the data collected? • What types of data? Sequencing? SNP array? Which platform? • MS Study • 334,923 single-nucleotide polymorphisms • 931 trios (screening phase)

Quality Control • Critical Issues • Hardy-Weinberg equilibrium: significant deviation from HW needs to be addressed/scrutinized (tested using the Pearson χ² or Fisher exact test) • Sampling Bias? • Population stratification (substructure) • Genotyping efficiency (missing data)? • Inference of missing data • MS Study • 72 trios removed • Around 150k SNPs not used • STRUCTURE used to remove individuals with non-European ancestry
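The Hardy-Weinberg check can be illustrated with a Pearson χ² test on the three genotype counts of one biallelic SNP. This is a minimal sketch with a hypothetical function name; it uses the 1-degree-of-freedom identity p = erfc(√(χ²/2)) to avoid external dependencies, and for small counts an exact test would be the safer choice.

```python
import math

def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Pearson chi-square test (1 df) of Hardy-Weinberg proportions for one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function for 1 df
    return chi2, p_value

print(hwe_chi_square(25, 50, 25))   # genotype counts exactly in HW proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all: strong deviation
```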

Quality Control MAF: Minor Allele Frequency HW: Hardy Weinberg Equilibrium ME: Mendelian Errors
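Two of these filters are easy to sketch. The genotype encodings below (allele-copy counts for MAF, allele pairs for trios) and the function names are assumptions chosen for illustration; the thresholds used in the MS study are not reproduced here.

```python
from itertools import product

def minor_allele_frequency(allele_counts):
    """MAF from per-individual copies (0, 1 or 2) of one allele; None marks missing calls."""
    called = [g for g in allele_counts if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def is_mendelian_consistent(mother, father, child) -> bool:
    """True if the child's genotype can be formed from one allele of each parent."""
    return any(sorted((m, f)) == sorted(child) for m, f in product(mother, father))

print(minor_allele_frequency([0, 1, 2, 0, 0]))
print(is_mendelian_consistent(("A", "A"), ("A", "C"), ("A", "C")))   # consistent
print(is_mendelian_consistent(("A", "A"), ("A", "A"), ("A", "C")))   # Mendelian error
```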

Population Substructure Example
Individual  Locus 1  Locus 2  Locus 3  Locus 4
1           A,A      A,A      A,C      A,A
2           A,B      A,A      A,B      A,A
3           B,B      A,B      A,A      A,A
4           C,C      D,E      D,E      B,C
5           C,C      C,D      D,D      B,D
6           B,C      E,E      A,E      C,E
7           A,C      D,D      C,D      A,D
{A,B,C,D,E} are labels for the different gene alleles at 4 different loci. These genotypes might suggest that individuals 1,2,3 draw their alleles from a different gene pool than do individuals 4,5,6,7, suggesting the presence of 2 distinct populations.
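One informal way to check the grouping suggested by this example is to count how many alleles each pair of individuals shares across the four loci. The sketch below is a simple allele-sharing score chosen for illustration; it is not the model-based clustering performed by STRUCTURE.

```python
from collections import Counter
from itertools import combinations, product

# Genotypes from the example: individual -> genotype at loci 1..4
GENOTYPES = {
    1: ["AA", "AA", "AC", "AA"], 2: ["AB", "AA", "AB", "AA"],
    3: ["BB", "AB", "AA", "AA"], 4: ["CC", "DE", "DE", "BC"],
    5: ["CC", "CD", "DD", "BD"], 6: ["BC", "EE", "AE", "CE"],
    7: ["AC", "DD", "CD", "AD"],
}

def shared_alleles(i: int, j: int) -> int:
    """Number of alleles shared by two individuals, summed over loci (0 to 2 per locus)."""
    return sum(sum((Counter(a) & Counter(b)).values())
               for a, b in zip(GENOTYPES[i], GENOTYPES[j]))

def mean_sharing(pairs) -> float:
    pairs = list(pairs)
    return sum(shared_alleles(i, j) for i, j in pairs) / len(pairs)

group_a, group_b = [1, 2, 3], [4, 5, 6, 7]
within_a = mean_sharing(combinations(group_a, 2))
within_b = mean_sharing(combinations(group_b, 2))
between = mean_sharing(product(group_a, group_b))
print(within_a, within_b, between)
```

Pairs within each proposed group share more alleles on average than pairs drawn from different groups, which is the pattern the example is pointing at.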

Statistical Analysis • Critical Issues • Inference of phase and missing data • Single SNP test of association • Multi SNP test of association • What if individual SNPs do not contribute additively to disease? • MS Study • TDT • UNPHASED program used for genetic association analysis with missing data and unknown phase
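For trio data, the classic transmission disequilibrium test (TDT) compares how often heterozygous parents transmit each allele to the affected child. The sketch below shows only the basic statistic, (b − c)² / (b + c) with 1 degree of freedom; it does not reproduce the handling of missing data and unknown phase that the UNPHASED program provides, and the counts in the example are invented.

```python
import math

def tdt(transmitted: int, untransmitted: int):
    """TDT chi-square (1 df) from allele transmissions by heterozygous parents.

    transmitted:   times heterozygous parents passed the test allele to the affected child
    untransmitted: times they passed the other allele instead
    """
    b, c = transmitted, untransmitted
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function for 1 df
    return chi2, p_value

chi2, p_value = tdt(60, 40)
print(chi2, p_value)
```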

MS Study Statistics • P values (shown as –log values) for results of transmission disequilibrium testing are plotted across the genome. • The classic HLA-DR risk locus on chromosome 6p21 stands out with strong statistical significance (P < 1×10^−81).

Screening Analysis WTCCC: Wellcome Trust Case Control Consortium NIMH: National Institute of Mental Health IMSGC: International Multiple Sclerosis Genetics Consortium

Rankings, Filter, Results • Critical Issues • Multiple Testing Correction • SNP Arrays • The hope is that by typing a dense set of markers, we will observe markers in direct association with an unobserved causal locus, and in indirect association with disease phenotypes. • Is the common-disease common-variant hypothesis the correct model for this disease? • MS Study • SNPs in loci: HLA-DRA, IL2RA, IL2RA, IL7R
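As a rough illustration of multiple testing correction, a Bonferroni threshold divides the desired family-wise error rate by the number of tests. The sketch assumes α = 0.05 and takes the 334,923 screened SNPs as the number of tests; Bonferroni is used here only as the simplest correction and is not claimed to be the procedure applied in the MS study.

```python
import math

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def neg_log10(p: float) -> float:
    """Scale used when plotting P values across the genome."""
    return -math.log10(p)

N_SNPS = 334_923                       # SNPs genotyped in the screening phase
threshold = bonferroni_threshold(0.05, N_SNPS)
print(threshold, neg_log10(threshold))
print(1e-81 < threshold)               # the HLA-DR signal clears the threshold easily
```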

Analysis • Critical Issues • Alleles of IL2RA and IL7RA and those in the HLA locus are identified as heritable risk factors for multiple sclerosis • Environmental factors? • Where are the associative SNPs found? • MS Study • Association found and LD used to identify markers • More trios and controls recruited for replication (targeted SNPs)

The Biology: IL2RA and IL7RA • Both are important in T-cell mediated immunity • IL2RA • The interleukin-2 receptor (IL-2R) is a heterotrimeric protein expressed on the surface of certain immune cells that binds and responds to a cytokine called interleukin 2. • Linked to two other autoimmune diseases: type 1 diabetes and autoimmune thyroid disease. • IL7RA • The protein encoded by this gene is a receptor for interleukin 7 • Helps to control the activity of a class of immune cells called regulatory T cells. • The IL7RA variant indicates an effect on gene expression, with a change in the ratio of soluble to cell-bound interleukin-7 receptor

Replication Analysis

Odds Ratios • Measure of effect size • Odds of carrying the allele in the case group divided by the odds of carrying it in the control group • Example: 100 cases, 100 controls • 75 cases with allele 0 (25 without) • 25 controls with allele 0 (75 without) • Odds ratio = (75/25)/(25/75) = 9.00; the ratio of proportions, (75/100)/(25/100) = 3.00, is a relative frequency rather than an odds ratio • Very few studies have implicated SNPs with odds ratios > 3
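As a check on the arithmetic for the 100-case, 100-control example, the sketch below computes two quantities that are easy to confuse: the ratio of carrier proportions, and the odds ratio in its standard sense (odds of carrying the allele among cases divided by the odds among controls). The function names are illustrative.

```python
def proportion_ratio(case_with: int, case_total: int,
                     control_with: int, control_total: int) -> float:
    """Proportion of cases carrying the allele divided by the proportion of controls."""
    return (case_with / case_total) / (control_with / control_total)

def odds_ratio(case_with: int, case_total: int,
               control_with: int, control_total: int) -> float:
    """Odds of carrying the allele in cases divided by the odds in controls."""
    case_without = case_total - case_with
    control_without = control_total - control_with
    # Cross-product form of (case_with/case_without) / (control_with/control_without)
    return (case_with * control_without) / (case_without * control_with)

print(proportion_ratio(75, 100, 25, 100))   # 3.0
print(odds_ratio(75, 100, 25, 100))         # 9.0
```

The two measures are close only when the allele is rare in both groups; with the common allele of this example they differ by a factor of three.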

Regional Plots for Associations in IL2RA

Regional Plots for Associations in IL7RA
