Class 10. In what ways does DNA sequence matter?

Topics to cover:
What does 23andme report?
  carriage of disease-causing mutations
  disease risks based on SNP analysis
Terminology – gene, trait, mutation, allele, polymorphism, SNP, carrier status, recessive/dominant, penetrance
How they assay ~10^6 SNPs at once
Basis for ascribing disease risk – GWAS, statistical tests of association, p-values, odds ratios, what to take seriously
What additional info. is in complete seq.

What do we mean by DNA sequence?
  ..GACATGGGGCAGGATGAT..TGA..
  ..CTGTACCCCGTCCTACTA..ACT..
What is protein, amino acid?
How is info in DNA sequence converted into protein?
What is RNA, a codon, stop codon?
  GAC ATG GGG CAG GAT GAT .. TGA ..   DNA
  GAC AUG GGG CAG GAU GAU .. UGA ..   mRNA
      met gly gln asp asp .. stop     protein
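A minimal sketch of the DNA -> mRNA -> protein steps on this slide. The function names are illustrative, and it assumes the slide's fragment is the coding strand with nothing between the last GAT and the TGA stop codon. Amino acids are written as one-letter codes (M = met, G = gly, Q = gln, D = asp).

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AMINO)}

def transcribe(coding_dna):
    """mRNA has the same sequence as the coding DNA strand, with U in place of T."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna):
    """Read codons from the first AUG until a stop codon (or the end of the sequence)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = ""
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein

mrna = transcribe("GACATGGGGCAGGATGATTGA")
print(mrna, translate(mrna))
```

The leading GAC is not translated because reading starts at the first AUG.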

Consequences of some seq. changes
  GAC AUG GGG CAG GAU GAU .. UGA ..   wild-type RNA
      met gly gln asp asp .. stop
    a change in the 3rd base of GGG would not change the aa, since GGA, GGC, GGU also encode gly: “synonymous” change
  GAC AUG GGG CAG UAU GAU .. UGA ..   SNP
      met gly gln tyr asp .. stop     1 aa subst., “non-synonymous”
  GAC AUG GGG UAG GAU GAU .. UGA ..   SNP
      met gly stop                    early termination
  GAC AUG GGG CCC CAG GAU GAU .. UGA ..   3b insertion
      met gly pro gln asp asp .. stop     1 aa insertion
  GAC AUG GGG C^GA GGA UGA U..        1b insertion -> frame shift
      met gly arg  gly stop           changes all following aa’s
Frame-shifting “indels” (not a multiple of 3b) can have a large effect
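The slide's variants can be checked with a short sketch. `classify` is a hypothetical helper, not a standard tool, and its rules are deliberately simplistic: it only compares lengths and translated proteins. The wild-type string assumes nothing lies between the last GAU and the UGA stop codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AMINO)}

def translate(mrna):
    """Read codons from the first AUG until a stop codon (or the end of the sequence)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = ""
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein += amino_acid
    return protein

def classify(wild_type, variant):
    """Label a variant RNA relative to the wild type (illustrative rules only)."""
    if (len(variant) - len(wild_type)) % 3 != 0:
        return "frameshift"
    if len(variant) != len(wild_type):
        return "in-frame indel"
    wt_protein, var_protein = translate(wild_type), translate(variant)
    if var_protein == wt_protein:
        return "synonymous"
    if len(var_protein) < len(wt_protein):
        return "early termination"
    return "non-synonymous"

WILD_TYPE = "GACAUGGGGCAGGAUGAUUGA"
for variant in ["GACAUGGGACAGGAUGAUUGA",      # GGG -> GGA
                "GACAUGGGGCAGUAUGAUUGA",      # GAU -> UAU
                "GACAUGGGGUAGGAUGAUUGA",      # CAG -> UAG
                "GACAUGGGGCCCCAGGAUGAUUGA",   # CCC inserted
                "GACAUGGGGCGAGGAUGAUUGA"]:    # single G inserted
    print(classify(WILD_TYPE, variant), translate(variant))
```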

What is a gene? Functional unit – usually encodes a protein
  genome – all the genes/DNA in an organism
Do we get 2 copies of most genes – 1 from each parent?
What is an allele? Sequence variant at some location (locus)
How big is the human genome? ~3x10^9 bp
How much of it codes for proteins? Only ~1.5%, ~20K genes
What does the rest of it do?
  ~3% involved in regulating gene expression
  Up to 50% “selfish” (? parasitic) DNA, repeated seq.
  Most of unknown function – even some highly conserved regions

What is a polymorphism? – sequence variant
How much does DNA seq. vary between individuals?
  single nucleotide polymorphisms (SNPs)   ~0.1%, ~10^6-10^7
  insertions, deletions (indels)           ~10^5
  copy # variants (CNVs)                   ~10^2
What is a mutation? – polymorphism that causes disease, e.g. CF, sickle cell hemoglobin, BRCA

What is a recessive (dominant) mutation?
  recessive – need 2 mutant alleles at same locus for disease
  dominant – 1 mutant allele -> disease
  heterozygote: 2 different alleles (at some gene locus)
  homozygote: maternal and paternal alleles are identical
Some mutations always cause disease (fully penetrant), others only in combination with an environmental trigger
Are carrier states recessive or dominant mutations?
How many carrier states does 23andme report?

What technology does 23andme use for its testing?
Bead array – hybridization – one-base extension method
  Array lots of ~3 µm plastic beads into wells etched in a fiber-optic bundle
  Each bead has many copies of a single short DNA primer; there are ~10^6 different bead types, each with a different primer seq.
  Hybridize your DNA to the array; extend the primer DNA on the bead to copy your hybridized template DNA using DNA polymerase and fluorescent bases, one at a time, to see if A, C, G, or T is the next base in the template
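A toy sketch of the one-base extension read-out, under simplifying assumptions: sequences are written 5'->3', the primer anneals perfectly, and the sequences and function names are made up for illustration. The polymerase adds the complement of the template base just beyond the primer's 3' end, so the incorporated fluorescent base reveals the template base at the SNP.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def single_base_extension(template, primer):
    """Return (template_base, incorporated_base), or None if there is nothing to extend.

    The primer pairs with the template region equal to its reverse complement;
    extension copies the template base immediately 5' of that region.
    """
    site = reverse_complement(primer)
    start = template.find(site)
    if start < 1:  # primer does not anneal, or no template base is left to copy
        return None
    template_base = template[start - 1]
    return template_base, COMPLEMENT[template_base]

primer = "CCTGCCCCA"                                    # hypothetical bead primer
print(single_base_extension("TTGACATGGGGCAGG", primer))  # template carrying A at the SNP
print(single_base_extension("TTGACGTGGGGCAGG", primer))  # template carrying G at the SNP
```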

How do you know which bead, with which seq., is where?
Company sends you map info. How do they know?
Method based on serial hybridization with fluorescent probes, recording which hybridize to which bead, then melting off these probes, but not the bead primer DNA
If there are 10^6 bead types, each with a different seq., it is impractical to do 10^6 hybridizations, so:
  make a green fluor-labeled and a red fluor-labeled version of each probe
  make pools of probes; hybridize each pool to the beads and record bead color, then melt off and repeat with a new pool
  serial color signatures reveal which seq. is on which bead after ~log2(10^6) ≈ 20 hybridizations

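Example – 8 different bead types, # 0-7, decoded with 3 pools
[Figure: colors of probes 0-7 in pool 1, pool 2, and pool 3; a bead with a given hyb. history must have seq. compl. to the probe with the matching color pattern; the circled bead should have seq. compl. to probe 2]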
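The 8-bead example can be sketched as a binary code. The color assignment below (green = 0, red = 1, one binary digit of the probe number per pool) is an assumption chosen for illustration; the slide's figure may assign colors differently.

```python
COLORS = ("green", "red")  # two labels -> one binary digit per pool

def probe_color(probe_id, pool):
    """Color of probe `probe_id` in pool `pool` (0-based), assuming a binary-code design."""
    return COLORS[(probe_id >> pool) & 1]

def decode(history):
    """Recover the probe number from the colors a bead showed in successive pools."""
    return sum(COLORS.index(color) << pool for pool, color in enumerate(history))

N_POOLS = 3
histories = {pid: [probe_color(pid, pool) for pool in range(N_POOLS)] for pid in range(8)}
for pid, history in histories.items():
    print(pid, history, "->", decode(history))
```

Each of the 8 probes gets a distinct 3-color history, so 3 hybridizations are enough to identify every bead.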

In general, with k probe colors, S pools, and n bead types, need to do S = log_k(n) hybridizations
  note S goes up slowly, ~ln(n)
Parity check – make more pools than you need. Now there are lots of unused code numbers.
Choose pool combinations such that the most likely errors -> beads with unused codes, so you know these beads were mistyped and you can disregard them
Example: with 4 pools (pool 4 added as a parity pool), 2^4 = 16 codes; the 8 extras are used to spot errors
  Note all parity codes sum to 0 (mod 2), but any single error would lead to sum = 1
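A small sketch of both ideas on this slide: how many pools are needed, and how one extra parity pool flags single read errors. It assumes 2 colors coded as bits 0/1; the function names are illustrative.

```python
def pools_needed(n_bead_types, k_colors=2):
    """Smallest S with k_colors**S >= n_bead_types (integer arithmetic, no rounding issues)."""
    s, capacity = 0, 1
    while capacity < n_bead_types:
        capacity *= k_colors
        s += 1
    return s

def add_parity(bits):
    """Append one bit so that the whole code sums to 0 mod 2."""
    return bits + [sum(bits) % 2]

def parity_ok(code):
    return sum(code) % 2 == 0

# 8 bead types -> 3 data pools + 1 parity pool; only 8 of the 16 possible codes are valid.
valid_codes = [add_parity([(pid >> pool) & 1 for pool in range(3)]) for pid in range(8)]

# Flip one bit of a valid code: the parity check fails, so the bead can be discarded.
corrupted = valid_codes[2].copy()
corrupted[1] ^= 1
print(pools_needed(8), pools_needed(10**6), parity_ok(valid_codes[2]), parity_ok(corrupted))
```

A single parity bit detects any one misread color but not two simultaneous errors in the same bead.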

A polymorphism can be associated with a disease without causing it. How is that possible?
Inheritance – eggs and sperm (germ cells) get one of each pair of chromosomes randomly from a parent
[Figure: Mom’s two copies of chr. #5 (M and P), one carrying allele a and the other allele b at SNP987, with mutation x near b; one copy goes to the egg]
Suppose mutation x arises on a chr. carrying allele b at a nearby polymorphism, SNP987
Offspring that inherit x (tend to) inherit SNP987 allele b -> disease has higher freq. in people with allele b vs allele a at SNP987 = “Founder” effect

Minor complication – chromosomes duplicate and recombine during formation of sperm and egg, which sometimes -> x being inherited with the allele from the other chromosome, but this is rare if x and b are close
[Figure: Mom’s chr. pair (M, P) before recombination, with x next to SNP987 allele b, and after recombination, with x next to allele a on the copy passed to the egg]
Chance of recombination ~ distance between b and x

Details for aficionados
How often does recombination occur (as a function of dist. in bases)?
  ~1% per Mb (10^6 b) per generation
How many generations does it take for the probability of assoc. to fall to ½, for mutations/SNPs separated by 1 Mb?
  0.99^n = 0.5 -> n ~ 69 generations
How long is association likely maintained?
  If ~25 yrs/generation, 69 gen -> ~1700 yrs
Implication – such associations can persist for a long time (by the same arithmetic, ~69,000 generations, i.e. over 10^6 years, for mutations within 1000 bases of the SNP)
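The half-life arithmetic can be reproduced in a few lines. This sketch assumes the slide's round numbers (1% recombination per Mb per generation, 25 years per generation) and a recombination rate that scales linearly with distance; the names are illustrative.

```python
import math

RATE_PER_MB = 0.01          # recombination probability per Mb per generation (slide's figure)
YEARS_PER_GENERATION = 25   # slide's figure

def generations_to_half(recomb_per_generation):
    """Solve (1 - r)^n = 0.5 for n."""
    return math.log(0.5) / math.log(1.0 - recomb_per_generation)

def half_life_years(distance_bases):
    r = RATE_PER_MB * distance_bases / 1e6
    return generations_to_half(r) * YEARS_PER_GENERATION

print(round(generations_to_half(0.01)))   # generations for 1 Mb separation
print(round(half_life_years(1e6)))        # years for 1 Mb separation
print(round(half_life_years(1e3)))        # years for 1 kb separation
```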

Implications for SNP-disease associations
Nearby SNPs may not themselves be responsible for increased risk of disease, just assoc. “markers”
Disease-SNP allele associations may be specific to certain ethnic groups in which the mutation arose; there may be associations with different SNP alleles in different ethnic groups if disease mutations arose multiple times
Disease-associated SNPs provide locational clues to causative mutations, useful for research

Genome-wide association studies (GWAS) = source of data for attributing disease risk to presence of some variants
• Basic idea – search for chr. regions with SNPs with diff. allele freq. (or genotype freq.) in cases vs controls, e.g.
• let A and a designate diff. alleles at some locus
• assume each person has 2 alleles
  genotype       aa     aA     AA    sum    A allele freq.
  dis. cases     45    510   1445   2000    3400/4000 = .85
  controls      120    960   1920   3000    4800/6000 = .80
• If you inherit A, are you really more likely to get the disease?
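The allele frequencies in the table follow from the genotype counts: each aA person carries one A allele and each AA person carries two. A minimal sketch (the function name is illustrative):

```python
def allele_freq_A(n_aa, n_aA, n_AA):
    """Frequency of allele A, given counts of people with each genotype."""
    total_alleles = 2 * (n_aa + n_aA + n_AA)   # 2 alleles per person
    return (n_aA + 2 * n_AA) / total_alleles

print(allele_freq_A(45, 510, 1445))    # disease cases
print(allele_freq_A(120, 960, 1920))   # controls
```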

How much do frequencies have to differ to be statistically significant?
Basic idea – see if data are reasonably likely if the 2 groups (e.g., cases and controls) do not differ in allele or genotype frequencies
If the groups are not really different, you could pool the data and ask: if you randomly chose 2 groups (of the size of the cases and controls) from this one population, how likely is it that the means of the 2 groups would differ by as much as you observe?
A “t” test gives you this probability. If it is very low, you may have reason to reject the null hypothesis.
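The "pool the data and redraw two groups" idea can be simulated directly. Note this is a resampling sketch, not the "t" test itself, and it treats every allele copy as an independent draw, which is a simplification. The counts are the A-allele counts from the GWAS example (3400 of 4000 in cases, 4800 of 6000 in controls); the function name, number of draws, and seed are arbitrary choices.

```python
import random

def resampling_p_value(case_A, case_n, control_A, control_n, n_draws=200, seed=0):
    """Estimate how often random regrouping gives a frequency gap at least as big as observed."""
    rng = random.Random(seed)
    total_A = case_A + control_A
    pooled = [1] * total_A + [0] * (case_n + control_n - total_A)   # 1 = allele A
    observed = abs(case_A / case_n - control_A / control_n)
    hits = 0
    for _ in range(n_draws):
        rng.shuffle(pooled)
        fake_case_A = sum(pooled[:case_n])          # random group the size of the cases
        fake_control_A = total_A - fake_case_A      # the remainder plays the controls
        if abs(fake_case_A / case_n - fake_control_A / control_n) >= observed:
            hits += 1
    return (hits + 1) / (n_draws + 1)               # +1 avoids reporting exactly 0

print(resampling_p_value(3400, 4000, 4800, 6000))
```

With only a few hundred draws the estimate cannot go below about 1/n_draws, so very small p-values need an analytic test such as chi sq.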

The chi sq test is very like the “t” test
  Chi sq = Σ (Exp - Obs)^2 / Exp
Its probability distribution is known for randomly selected groups from a single population.
If p(chi sq) < small # α, e.g. α = .05, you might choose to conclude the groups really do differ in allele freq., since these data are unlikely (p < 0.05) if the groups are the same
Using α = 0.05 as a cut-off means that you’ll make a mistake and declare identical groups different 5% of the time (5% FP rate). You pick the cut-off for whatever error rate you feel is appropriate

Complication – if one tests for association with 20 independent things, expect ~1 to have p(chi sq) < .05 even when no assoc. exists (i.e. expect 1 FP).
• Testing for assoc. with any of ~10^6 SNPs, one needs a much stricter criterion than α = .05 in order to avoid lots of FP’s
• Simplest correction – Bonferroni: divide α by n = # of SNPs tested; e.g. require p(chi sq) < 0.05/10^6 = 5x10^-8 (~10^-8) in order that the probability that any assoc. is a FP be < .05
• Rationale – prob. that an apparent assoc. is not a FP = 1-p; prob. that n apparent assocs. are not FP’s = (1-p)^n ~ 1-pn; you can make this ~1 by choosing pn < α, i.e. p < α/n
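The multiple-testing arithmetic, sketched with illustrative function names and assuming independent tests as the slide does:

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing the cut-off when no association exists."""
    return n_tests * alpha

def bonferroni_threshold(alpha, n_tests):
    """Per-test cut-off that keeps the overall false-positive probability below alpha."""
    return alpha / n_tests

def prob_any_false_positive(p, n_tests):
    """1 - (1 - p)^n: chance that at least one of n independent null tests passes cut-off p."""
    return 1.0 - (1.0 - p) ** n_tests

print(expected_false_positives(20, 0.05))
print(bonferroni_threshold(0.05, 10**6))
print(prob_any_false_positive(0.05, 20))
print(prob_any_false_positive(bonferroni_threshold(0.05, 10**6), 10**6))
```

The last line shows the corrected cut-off keeping the chance of any false positive just under 0.05.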

Example chi sq calculation
• hypothetical #’s with each genotype
                aa     aA     AA    sum
  dis. cases    45    510   1445   2000
  controls     120    960   1920   3000
  totals       165   1470   3365   5000
• If H0 true, can pool groups for best est. of probabilities
• p(aa) = 165/5000; p(aA) = 1470/5000; p(AA) = 3365/5000

Then expected # aa among dis. cases = p(aa)*2000 = 66
• Expected # of aA among dis. cases = p(aA)*2000 = 588
• Compute remaining expected #’s same way or from totals ->
  Expected #    aa     aA     AA    sum
  dis. cases    66    588   1346   2000
  controls      99    882   2019   3000
  totals       165   1470   3365   5000
• Chi sq = Σ (exp - obs)^2 / exp = (66-45)^2/66 + … = 40.52
• p(chi sq, 2 df) = 1.59x10^-9 (from table, or web) < α = 10^-8
• So assoc. is “statistically significant”
• For confirmation, repeat study in independent groups
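The worked example can be recomputed without a table. This sketch uses only the standard library; the p-value function relies on the fact that for exactly 2 degrees of freedom the chi sq tail probability is exp(-chi sq / 2), so it does not generalize to other df. Names are illustrative.

```python
import math

observed = {
    "dis. cases": [45, 510, 1445],    # counts of aa, aA, AA
    "controls":   [120, 960, 1920],
}

def chi_square(table):
    """Sum of (exp - obs)^2 / exp, with expected counts taken from the pooled data (H0)."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    chi_sq = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            exp = row_total * col_total / grand_total
            chi_sq += (exp - obs) ** 2 / exp
    return chi_sq

def p_value_2df(chi_sq):
    """Tail probability of the chi sq distribution with 2 degrees of freedom."""
    return math.exp(-chi_sq / 2.0)

chi_sq = chi_square(observed)
print(round(chi_sq, 2), p_value_2df(chi_sq))
```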

Relative risk might be measured as p(D|AA)/p(D), but is frequently expressed in terms of “odds ratio”
  Odds = p(event)/[1-p(event)], e.g. “2:1” if p(event) = .67
  Odds ratio = odds(D|AA)/odds(D)   (assume A is the hi-risk allele)
             = {p(D|AA)/[1-p(D|AA)]} / {p(D)/[1-p(D)]}
Odds ratios are frequently larger than relative risk (so make risk seem slightly greater).
Need to know p(D) in your population for an accurate odds ratio
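A short sketch comparing relative risk with odds ratio. The probabilities p(D) = 0.10 and p(D|AA) = 0.15 are made-up numbers for illustration only, not values from any study.

```python
def odds(p):
    return p / (1.0 - p)

def relative_risk(p_d_given_aa, p_d):
    return p_d_given_aa / p_d

def odds_ratio(p_d_given_aa, p_d):
    return odds(p_d_given_aa) / odds(p_d)

P_D = 0.10            # hypothetical disease probability in the population
P_D_GIVEN_AA = 0.15   # hypothetical disease probability for genotype AA

print(odds(2 / 3))                          # the "2:1" example
print(relative_risk(P_D_GIVEN_AA, P_D))
print(odds_ratio(P_D_GIVEN_AA, P_D))
```

For a rare disease the two measures nearly coincide; the gap grows as p(D) grows.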

Example of GWAS paper (basis of 23andme disease risk predictions): Nature 447:661 (2007)
Appreciate the magnitude, expense, complexity – and limitations
  ~100 authors, 10^6 SNPs tested in each of 17,000 samples (@ $1000)
  Could the study have been done if each test cost $1?

[Table: columns labeled raw, dis., p(chi sq), ORs]
Note most disease risks, measured by OR, are only ~1.2-2

Example of hit region
Large # of SNPs in the hit region associated with disease lends credence to the finding

Most GWAS assoc. are now “confirmed” in repeat studies
But does this prove that disease risk is increased?
What additional type of study might you want?
Prospective studies have not yet been done.

What additional information is in the full sequence that SNP analysis misses?
SNPs were selected based on their being common in the population, e.g. SNP allele freq. > 5%
Full sequence picks up all variants in an individual, many of which (? ~10%) will be new
Many of these will be sequencing errors: with 99.99% accuracy expect ~10^5 errors

Typical mutation counts in sequenced human genomes
3 individuals’ genomes sequenced
They estimate ~10% of variants are false positives

Challenges to clinical interpretation
Many variants are new, hence no clinical experience
Even considering only those most likely to have a biological effect (e.g. frameshifts in coding seq.), there will be hundreds of mutations per person
Mutations may have no clinical effect because:
  if heterozygous, the other allele provides enough protein
  the gene is non-essential
  other proteins do the same thing
  gene function is only important in special circumstances
Mutations with no effect in a heterozygous parent could affect offspring that inherit 2 variant copies

At observed mutational load, ~15% of pregnancies should carry ominous mutations in both copies of at least 1 gene
  p(FS mutation/allele) ~ 200/40,000 = 0.5%
  p(couple don’t share a gene with FS mutation) = .995^200 = .37
  p(≥1 gene for which both are carriers) = 1 - .37 = .63
  p(fetus of such a couple gets both mutations) = ¼
  p(random fetus has FS muts in both alleles of some gene) = .63/4 ≈ .16
Will this cause consternation among parents-to-be?
If <<15% of normal individuals have such double mutations, they may be responsible for spontaneous abortions
If ~15% of normal individuals have such double mutations, they are usually innocuous, but we won’t be sure in any particular case
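The slide's estimate, step by step. This sketch simply follows the slide's own assumptions (~200 frame-shift mutations per person spread over ~40,000 alleles, independence between partners and between genes); the function and parameter names are illustrative.

```python
def p_fetus_double_fs(fs_per_person=200, n_alleles=40_000):
    """Chance a random fetus has frame-shift (FS) mutations in both alleles of some gene."""
    p_fs = fs_per_person / n_alleles                    # p(FS mutation per allele)
    p_no_shared_gene = (1.0 - p_fs) ** fs_per_person    # partner is clean at all 200 genes
    p_shared_gene = 1.0 - p_no_shared_gene              # both are carriers for >= 1 gene
    return p_fs, p_no_shared_gene, p_shared_gene, p_shared_gene / 4.0   # 1/4: fetus gets both

p_fs, p_none, p_shared, p_fetus = p_fetus_double_fs()
print(p_fs, round(p_none, 2), round(p_shared, 2), round(p_fetus, 2))
```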

Bottom line – sequencing will likely provide huge amount of information of uncertain clinical significance for foreseeable future Rapidly decreasing costs (now ~$1000) will likely make wide-scale genome sequencing inevitable

Summary
Companies like 23andme provide info about
  carrier status for some common mutations
  disease risk based on GWAS studies
Testing is based on SNP analysis using random bead arrays
  clever method of identifying which bead is where based on pooled hybridization probes – using extra information (parity check) to eliminate errors

Disease risk information is based on genome-wide association studies (GWAS)
  caveats: only associations, not confirmed prospectively
  statistical evaluation – chi sq., p-values, correction for multiple comparisons, Bayesian method to get p(hypothesis|data) given p(data|hyp.), odds ratios
Additional information in genome sequence
  mutation load – uncertain clinical significance
FYI papers and math exercises on Blackboard
Next 3 classes will be on sequencing technologies

!! Advertisement !! Next semester I will teach a class on ethical, legal, social issues in engineering (MAE???), focusing on issues raised by new biomedical technologies like sequencing, high cost of medical technology, FDA regulation, social/economic costs of patents, cost-benefit and cost-effectiveness evaluation of biomedical technology…. Help recruiting students will be greatly appreciated!
