Real data and GWAS Case Study CSCI2820 – Medical Bioinformatics
Outline • Introduction to Biology • Introduction to CS • Data Generation • Data Acquisition and Databases • A closer look: Linkage Disequilibrium • GWAS Case Study
DNA DNA: the chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms.
DNA Organization in the Human Genome Genome facts • The pair of sex chromosomes determines sex. • 2 copies of each autosome • ~3.2 billion base pairs • Around 2.9 billion bases organized into scaffolds • Only about 90% of the genome has been sequenced!
Gene • A gene is the functional and physical unit of heredity passed from parent to offspring. • Genes are pieces of DNA, and most genes contain the information for making a specific protein. http://en.wikipedia.org/wiki/Gene
Central Dogma http://www.dnalc.org/resources/3d/ • gene • Unit of inheritance • Transcribed into mRNA • mRNA • messenger RNA • blueprint for protein • proteins • Essential molecules that are active in practically all cellular processes • Genes – RNA – Proteins • Useful video: http://www.dnalc.org/resources/3d/central-dogma.html
Variation • Single base mutation, indels • Structural Variation • Deletion • Duplication • Translocation • Inversion • Recombination http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna
Intro to CS • Algorithm • “a procedure for solving a mathematical problem in a finite number of steps…” • Input -> Computation -> Output • E.g. sorting n numbers • Theory • Analysis of algorithms • Application • For biologists: Mathematica programming! • Mathematica demo http://commons.wikimedia.org/wiki/File:Selection-Sort-Animation.gif
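The "sorting n numbers" example can be made concrete with selection sort, the algorithm shown in the linked animation. The following is a minimal Python sketch (the slides themselves use Mathematica); the function name `selection_sort` is chosen here for illustration.

```python
def selection_sort(values):
    """Return a new list with the items of `values` in ascending order."""
    items = list(values)  # copy so the input (the algorithm's "Input") is unchanged
    for i in range(len(items)):
        # Find the index of the smallest item in the unsorted tail items[i:].
        smallest = i
        for j in range(i + 1, len(items)):
            if items[j] < items[smallest]:
                smallest = j
        # Swap it into position i; items[:i + 1] is now sorted.
        items[i], items[smallest] = items[smallest], items[i]
    return items  # the algorithm's "Output"


if __name__ == "__main__":
    print(selection_sort([5, 2, 9, 1, 7]))
```

The two nested loops make the number of comparisons grow roughly with n², which is the kind of statement "analysis of algorithms" is concerned with.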
Bioinformatics • Regardless of your profession, it is important to study both the biological and computational aspects of the problem • Understanding the biology may help computational researchers create more accurate models, more accurate solutions, help identify biases, etc… • Understanding the computation may help biologists compute better results, create a better study design, develop fine-tuned solutions to unresolved problems, etc…
Data Generation • Types of Data • Variation (SNPs, structural) • Genotype • Haplotype • Sequence Reads • Protein Structure • Genes • Technologies • SNP Array • Sequencing • Example genotype: {A,C} C {G,T} {C,T} {C,A} {T,A} • Corresponding haplotypes: ACGCCT and CCTTAA (complementary strands: TGCGGA and GGAATT) • Algorithmic Opportunity! Input: Genotypes Output: Haplotypes
Haplotype Phasing • Haplotype phasing: separate an individual’s genotype (the unordered pair of alleles at each site of the paired chromosomes) into the maternal and paternal chromosomes (haplotypes) • Example genotype: 100121200 (0 or 1 = homozygous for that allele, 2 = heterozygous) • Explanation 1: hap 1 = 100111100, hap 2 = 100101000 • Explanation 2: hap 1 = 100111000, hap 2 = 100101100
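The ambiguity in the phasing example can be shown by brute force. The sketch below is not a phasing algorithm; it only enumerates every haplotype pair consistent with a genotype, assuming the encoding 0/1 = homozygous and 2 = heterozygous. The function name `compatible_haplotype_pairs` is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs that could explain `genotype`.

    Assumed encoding: '0' or '1' = homozygous for that allele,
    '2' = heterozygous (one copy of each allele).
    """
    het_sites = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    # Try every assignment of alleles to hap 1 at the heterozygous sites;
    # hap 2 must carry the opposite allele at each of those sites.
    for choice in product("01", repeat=len(het_sites)):
        hap1 = list(genotype)
        hap2 = list(genotype)
        for site, allele in zip(het_sites, choice):
            hap1[site] = allele
            hap2[site] = "1" if allele == "0" else "0"
        # Sort the pair so (a, b) and (b, a) count as one explanation.
        pairs.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return sorted(pairs)


if __name__ == "__main__":
    for hap1, hap2 in compatible_haplotype_pairs("100121200"):
        print(hap1, hap2)
```

A genotype with k heterozygous sites has 2^(k-1) possible explanations for k ≥ 1, which is why phasing needs extra information such as family data or population haplotype frequencies.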
SNP Arrays SNP array intensity allele calls Probe intensities Allele 0 Allele 1 http://www.sanger.ac.uk/resources/software/illuminus/ http://www-microarrays.u-strasbg.fr/base.php?page=affySNPsE.php
Sanger Sequencing Long reads: ~500-1000bp Low error rates Very slow
High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fractured, amplified, fixated onto an array, bases are added • Single molecule or 3rd generation technologies Source of bias Error signature Short reads: ~50-200bp (454 can get up to 1kb) Generally more error than Sanger Extremely fast and parallel
NCBI • http://www.ncbi.nlm.nih.gov/
EBI • http://www.ebi.ac.uk/
HapMap • http://hapmap.ncbi.nlm.nih.gov/
GWAS Data • International Multiple Sclerosis Genetics Consortium • MS Data: • 931 Trios (Mother, Father, Affected Child) • ~350k SNPs • Wellcome Trust Case-Control Consortium • Covers many diseases • dbGaP • Repository for association studies
1000 Genomes • Aims to sequence the genomes of 1000 individuals • Many individuals taken from HapMap samples • Data available from 3 pilot studies • High coverage, full genome sequencing of 2 trios • Low coverage, genome sequencing on several individuals • High coverage, exome sequencing on several individuals
PDB File
HEADER    CHROMOSOMAL PROTEIN                     02-JAN-87   1UBQ
TITLE     STRUCTURE OF UBIQUITIN REFINED AT 1.8 ANGSTROMS RESOLUTION
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: UBIQUITIN;
COMPND   3 CHAIN: A;
… … …
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C
ATOM      4  O   MET A   1      27.886  26.463   4.263  1.00  9.62           O
ATOM      5  CB  MET A   1      25.112  24.880   3.649  1.00 13.77           C
ATOM      6  CG  MET A   1      25.353  24.860   5.134  1.00 16.29           C
ATOM      7  SD  MET A   1      23.930  23.959   5.904  1.00 17.17           S
ATOM      8  CE  MET A   1      24.447  23.984   7.620  1.00 16.11           C
ATOM      9  N   GLN A   2      26.335  27.770   3.258  1.00  9.27           N
ATOM     10  CA  GLN A   2      26.850  29.021   3.898  1.00  9.07           C
ATOM     11  C   GLN A   2      26.100  29.253   5.202  1.00  8.72           C
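ATOM records in a PDB file are fixed-width, so fields are read by column position rather than by splitting on whitespace. The sketch below parses a subset of the fields from one ATOM line, using the column ranges of the standard PDB format; the function name `parse_atom_line` and the dictionary keys are chosen here for illustration.

```python
def parse_atom_line(line):
    """Parse selected fields of one fixed-width PDB ATOM record into a dict."""
    return {
        "serial": int(line[6:11]),         # atom serial number
        "name": line[12:16].strip(),       # atom name, e.g. N, CA, CB
        "res_name": line[17:20].strip(),   # residue name, e.g. MET
        "chain": line[21],                 # chain identifier
        "res_seq": int(line[22:26]),       # residue sequence number
        "x": float(line[30:38]),           # orthogonal coordinates, in angstroms
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),    # temperature factor
    }


# First ATOM record of entry 1UBQ, as shown on the slide.
SAMPLE = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"

if __name__ == "__main__":
    print(parse_atom_line(SAMPLE))
```

For real work a maintained parser is usually preferable, since it also handles alternate locations, insertion codes, and multi-model files.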
Linkage Disequilibrium • D’ in real data • HLA-DRA: Chromosome 6 bases 32515-32520kb • Surrounding area: 32400-32600kb • LD in different populations • LD in different phasings • LD in different regions of the genome
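For reference, D’ between two biallelic SNPs can be computed from phased haplotype counts. The sketch below uses the standard definitions D = p(AB) − p(A)p(B) and |D’| = |D| / Dmax, and also returns r². The function name `ld_statistics` and the example counts are made up for illustration and are not taken from the HLA-DRA data.

```python
def ld_statistics(n11, n10, n01, n00):
    """Return (D, |D'|, r^2) for two biallelic SNPs.

    n11 = haplotypes carrying allele 1 at both SNPs, n10 = allele 1 at the
    first SNP only, n01 = allele 1 at the second SNP only, n00 = neither.
    """
    total = n11 + n10 + n01 + n00
    p_a = (n11 + n10) / total   # frequency of allele 1 at SNP 1
    p_b = (n11 + n01) / total   # frequency of allele 1 at SNP 2
    p_ab = n11 / total          # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    # Dmax is the largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical counts for 100 phased haplotypes.
    print(ld_statistics(40, 10, 10, 40))
```

Because the input is haplotype counts, the result depends on how the genotypes were phased, which is one reason LD can differ between phasings of the same data.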
Linkage Disequilibrium heat maps. • The markers are distributed along the x-axis. • Each cell represents a pair of SNPs; the darker the red color, the higher the LD between the two markers. • CEU = Utah residents of northern and western European ancestry • YRI = Yoruba in Ibadan, Nigeria (30 trios)
A GWAS Case Study: Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study
The Biology of Multiple Sclerosis • A chronic inflammatory disease of the central nervous system (CNS), the brain and the spinal cord. • A malfunction of the immune system which leads to attacks against, and destruction of, the myelin sheath. • Symptoms range from mild muscle weakness to partial or complete paralysis.
Previous Associations • In 1972, the association between multiple sclerosis and the HLA region of the genome was established. • HLA-DRB1 gene on chromosome 6p21 was identified. The human leukocyte antigen system (HLA) is the name of the human major histocompatibility complex (MHC). This group of genes resides on chromosome 6 and includes genes encoding cell-surface antigen-presenting proteins, along with many other genes. The major HLA antigens are essential elements in immune function.
Genome-wide Association Studies (GWAS) • GWAS Goal • Identify patterns of polymorphisms that vary systematically between individuals with different disease states (in particular, healthy and diseased) and could therefore represent the effect of risk-enhancing or protective alleles. • Let’s follow the paper Risk Alleles for Multiple Sclerosis Identified by a Genomewide Study
Genotypes • Critical Issues • SNP tagging • Include other versions of polymorphism? • microsatellites • copy number variation • How is the data collected? • What types of data? Sequencing? SNP array? Which platform? • MS Study • 334,923 single-nucleotide polymorphisms • 931 trios (screening phase)
Quality Control • Critical Issues • Hardy-Weinberg equilibrium: significant deviation from HW needs to be addressed/scrutinized (tested using the Pearson χ² or Fisher exact test) • Sampling Bias? • Population stratification (substructure) • Genotyping efficiency (missing data)? • Inference of missing data • MS Study • 72 trios removed • Around 150k SNPs not used • STRUCTURE used to remove individuals with non-European ancestry
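The Hardy-Weinberg check mentioned above can be sketched as a Pearson χ² test on the genotype counts at a single biallelic SNP. This is a minimal illustration, not the procedure used in the MS study; the function name `hwe_chi_square` and the example counts are hypothetical. With one degree of freedom, the χ² tail probability equals erfc(sqrt(χ²/2)), so no external library is needed.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of Hardy-Weinberg equilibrium at one SNP.

    Takes the observed counts of the three genotypes and returns
    (chi-square statistic, p-value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele B
    # Genotype counts expected under HWE: p^2, 2pq, q^2.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts with too few heterozygotes.
    print(hwe_chi_square(30, 40, 30))
```

For SNPs with small expected counts the χ² approximation is poor, which is where the Fisher exact test comes in.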
Quality Control MAF: Minor Allele Frequency HW: Hardy Weinberg Equilibrium ME: Mendelian Errors
Population Substructure Example
Individual  Locus 1  Locus 2  Locus 3  Locus 4
1           A,A      A,A      A,C      A,A
2           A,B      A,A      A,B      A,A
3           B,B      A,B      A,A      A,A
4           C,C      D,E      D,E      B,C
5           C,C      C,D      D,D      B,D
6           B,C      E,E      A,E      C,E
7           A,C      D,D      C,D      A,D
{A,B,C,D,E} are labels for the different gene alleles at 4 different loci. These genotypes might suggest that individuals 1, 2, 3 draw their alleles from a different gene pool than individuals 4, 5, 6, 7, suggesting the presence of 2 distinct populations.
Statistical Analysis • Critical Issues • Inference of phase and missing data • Single SNP test of association • Multi SNP test of association • What if individual SNPs do not contribute additively to disease? • MS Study • TDT • UNPHASED program used for genetic association analysis with missing data and unknown phase
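The basic TDT statistic can be sketched from transmission counts. For heterozygous parents, let b be the number of times the allele of interest was transmitted to the affected child and c the number of times it was not; the statistic (b − c)²/(b + c) is approximately χ² with 1 degree of freedom. This sketch does not reproduce the UNPHASED analysis (which also handles missing data and unknown phase); the function name `tdt` and the counts are hypothetical.

```python
import math


def tdt(transmitted, not_transmitted):
    """Transmission disequilibrium test for one allele.

    Counts come from heterozygous parents of affected children only.
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: allele transmitted 70 times, not transmitted 30 times.
    chi2, p_value = tdt(70, 30)
    print(chi2, p_value, -math.log10(p_value))
```

Because the test compares transmitted with untransmitted alleles within families, it is less affected by population stratification than a case-control comparison.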
MS Study Statistics • P values (shown as −log values) for results of transmission disequilibrium testing are plotted across the genome. • The classic HLA-DR risk locus on chromosome 6p21 stands out with strong statistical significance (P < 1×10^−81).
Screening Analysis WTCCC: Wellcome Trust Case Control Consortium NIMH: National Institute of Mental Health IMSGC: International Multiple Sclerosis Genetics Consortium
Rankings, Filter, Results • Critical Issues • Multiple Testing Correction • SNP Arrays • The hope is that by typing a dense set of markers, we will observe markers in direct association with an unobserved causal locus, and in indirect association with disease phenotypes. • Is the common-disease common-variant the correct model for this disease? • MS Study • SNPs in loci: HLA-DRA, IL2RA (two SNPs), IL7R
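To see why multiple testing correction matters at this scale, consider the Bonferroni correction, one common and conservative choice. The slides do not say which correction the MS study used, so the sketch below is only an illustration. It takes the 334,923 SNPs quoted earlier as the number of tests, although fewer SNPs remained after quality control; the 0.05 level and the function name are likewise assumptions.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold keeping the family-wise error rate <= alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_snps = 334_923  # SNP count from the slides, used here as the number of tests
    threshold = bonferroni_threshold(0.05, n_snps)
    print(f"per-SNP threshold: {threshold:.3g}")
    print(f"-log10(threshold): {-math.log10(threshold):.2f}")
    # A SNP with P = 1e-5 looks striking on its own but does not pass.
    print(1e-5 < threshold)
```

Without a correction, testing this many SNPs at P < 0.05 would be expected to flag thousands of markers by chance alone.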
Analysis • Critical Issues • Alleles of IL2RA and IL7RA and those in the HLA locus are identified as heritable risk factors for multiple sclerosis • Environmental factors? • Where are the associative SNPs found? • MS Study • Association found and LD used to identify markers • More trios and controls recruited for replication (targeted SNPs)
The Biology: IL2RA and IL7RA • Both are important in T-cell mediated immunity • IL2RA • The interleukin-2 receptor (IL-2R) is a heterotrimeric protein expressed on the surface of certain immune cells that binds and responds to a cytokine called interleukin 2. • Linked to two other autoimmune diseases: type 1 diabetes and autoimmune thyroid disease. • IL7RA • The protein encoded by this gene is a receptor for interleukin 7 • Helps to control the activity of a class of immune cells called regulatory T cells. • The IL7RA variant appears to affect gene expression, changing the ratio of soluble to cell-bound interleukin-7 receptor
Odds Ratios • Measure of effect size • Odds of carrying the allele in the case group divided by the odds of carrying it in the control group • Example: 100 cases, 100 controls • 75 cases with allele 0 (25 without) • 25 controls with allele 0 (75 without) • Odds ratio = (75/25)/(25/75) = 9.00 • The ratio of carrier proportions, (75/100)/(25/100) = 3.00, is a different quantity and is not the odds ratio • Very few studies have implicated SNPs with odds ratios > 3
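A minimal sketch of the odds ratio for a 2×2 table, using the standard definition (odds of carrying the allele among cases divided by the odds among controls). With the example counts, 75 of 100 cases and 25 of 100 controls carrying the allele, this definition gives 9, whereas the ratio of the two carrier proportions is 3. The function names are chosen here for illustration.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds ratio of a 2x2 table: (a/b) / (c/d), computed as (a*d) / (b*c)."""
    return (case_with * control_without) / (case_without * control_with)


def proportion_ratio(case_with, case_without, control_with, control_without):
    """Ratio of carrier proportions in cases versus controls (not an odds ratio)."""
    case_prop = case_with / (case_with + case_without)
    control_prop = control_with / (control_with + control_without)
    return case_prop / control_prop


if __name__ == "__main__":
    counts = (75, 25, 25, 75)  # cases with/without allele, controls with/without
    print(odds_ratio(*counts))
    print(proportion_ratio(*counts))
```

The two measures are close when the allele is rare in both groups and diverge as it becomes common, as in this example.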