An Introduction to Sequence Variation

An Introduction to Sequence Variation Chris Lee Dept. of Chemistry & Biochemistry UCLA

Types of Polymorphism • Single nucleotide polymorphisms (SNP) constitute about 90% of polymorphisms. • Insertions, deletions. • Microsatellite repeats: a locus where different numbers of copies of a short repeat sequence are found in different people. • Gross genetic losses or rearrangements.

Large-scale Polymorphism

Single Nucleotide Polymorphisms • Any two people differ at about 1 in 1000 letters. • SNPs are responsible for human individuality! • Some SNPs cause human diseases (e.g. cancer, cystic fibrosis, Alzheimer’s). • Enormous efforts have been made to identify specific mutations that cause disease.

Single Nucleotide Polymorphism

Mutation can occur as easily as the loss of a single chemical group from one nucleotide base, e.g. the amino group of cytosine.

Creating a Mutation

Genomic Density of SNPs • Comparing two random chromosomes, expect one SNP per 1000 bp. • Comparing 40 people (2 chromosomes each), expect 17 million SNPs in the complete human genome (3 billion bp). • In coding regions (5% of genome) expect 500,000 cSNPs, perhaps 6 per gene.
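
The slide does not say how the 17 million figure was derived. As a rough check, the sketch below scales the pairwise rate with the standard neutral-theory harmonic-number factor (Watterson scaling), which is an assumption and not necessarily the author's method. It gives about 15 million for 80 chromosomes, the same order of magnitude as the slide's figure. The last line simply divides the slide's two coding-region numbers to show the gene count they imply.

```python
def harmonic_number(n):
    """Return 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def expected_snps(pairwise_rate, genome_bp, n_chromosomes):
    """Expected number of variable sites among n_chromosomes sampled chromosomes.

    Assumption (not from the slides): a neutral, constant-size population, so the
    count grows as the harmonic number H(n-1) times the pairwise difference count.
    """
    return pairwise_rate * genome_bp * harmonic_number(n_chromosomes - 1)


pairwise = expected_snps(1 / 1000, 3e9, 2)   # two random chromosomes
cohort = expected_snps(1 / 1000, 3e9, 80)    # 40 people, 2 chromosomes each
genes_implied = 500_000 / 6                  # cSNPs divided by cSNPs per gene

print(f"pairwise: {pairwise:.3g}, 40 people: {cohort:.3g}, genes implied: {genes_implied:.0f}")
```

The implied gene count, roughly 83,000, is in line with the 80,000 genes assumed elsewhere in the talk.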

SNPs: a detailed record of human genetic history • Each SNP is typically a single mutation event that occurred in a context of certain pre-existing SNPs. • As time passes this context is gradually lost due to recombination. [Figure: SNP C is initially created linked to SNPs A, B, D, E, F (A B C D E F); over time the “island” of linkage shrinks (B C D).]

A record of the origins, migrations, and mixing of the world’s peoples • The size of the “island” of strong linkage around a SNP indicates its age (small = old). • The SNPs it is linked to give a “genetic fingerprint” of the person in whom it first arose. • In principle each SNP can be used to track all of that person’s descendants. • Each person has 300,000 common SNPs: a very rich record of their genetic history.
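
The "small = old" rule can be illustrated with the textbook decay of linkage disequilibrium, D_t = D_0 (1 - r)^t, where r is the recombination fraction between two sites per generation. This is a minimal sketch. The proportionality of r to distance and the rate of 1e-8 per bp per generation are assumptions chosen for illustration, not values from the slides.

```python
def ld_remaining(generations, recomb_fraction):
    """Fraction of the original linkage disequilibrium left: (1 - r) ** t."""
    return (1.0 - recomb_fraction) ** generations


def island_half_width_bp(generations, rate_per_bp=1e-8, threshold=0.5):
    """Distance at which LD has decayed to `threshold` after `generations`.

    Assumes r = rate_per_bp * distance (hypothetical uniform recombination rate).
    """
    recomb_fraction = 1.0 - threshold ** (1.0 / generations)
    return recomb_fraction / rate_per_bp


for age in (100, 1_000, 10_000):
    print(age, "generations ->", round(island_half_width_bp(age)), "bp")
```

Under these assumptions a tenfold older SNP sits in an island roughly ten times smaller.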

SNPs in lipoprotein lipase (LPL) gene.

SNP genotypes in 71 individuals in the LPL gene [Figure legend: heterozygote (X/y), homozygote (y/y).]

SNP Allele Frequency

SNP Haplotypes reconstructed from LPL genotype data

SNP Linkage Disequilibrium

The Hunt for Disease Genes • Currently: finding a disease gene can take years, because there are very few markers, forcing researchers to search dozens of genes. • SNPs are a powerful tool for discovering genes that cause disease: with a SNP in every gene, could directly map a disease to a single gene.

Mapping Disease Genes • Look for genetic linkage of disease to marker. • Microsatellite markers are too widely spaced to get to the individual gene level. • There are common SNPs in every gene. [Figure: chromosome map showing a microsatellite marker, the disease gene, neighboring genes, and SNPs.]

Identification of SNPs • In 1998, Wang et al. reported ~3000 SNPs. • Currently about 200,000 SNPs have been identified in total by experiment (in public databases). • A pharmaceutical industry SNP consortium has been formed to fund identification of 300,000 SNPs to be shared publicly.

SNPs for Pharmacogenomics • Differences in efficacy and side effects from person to person can be a big problem for drug clinical trials / approval. • If SNPs that correlate with these differences can be identified, the clinical trial could be limited to patients where efficacy is likely to be best, with least side effects. • These SNPs would then also have to be tested on prospective patients for the drug.

Single-nucleotide polymorphism in the human mu opioid receptor gene alters β-endorphin binding and activity: Possible implications for opiate addiction. Bond et al., PNAS 95:9608. The mu opioid receptor is the primary site of action for the most commonly used opioids, including morphine, heroin, fentanyl, and methadone. The A118G variant receptor binds β-endorphin, an endogenous opioid that activates the mu opioid receptor, approximately three times more tightly than the most common allelic form of the receptor. Furthermore, β-endorphin is approximately three times more potent at the A118G variant receptor than at the most common allelic form in agonist-induced activation of G protein-coupled potassium channels.

Comprehensive EST Analysis of Single Nucleotide Polymorphism in the Human Genome Chris Lee Dept. of Chemistry & Biochemistry UCLA

Targeting Functional Polymorphism via Expressed Sequences • Only 5% of the human genome corresponds to “genes” coding for functional protein. • Look for functional SNPs by targeting these gene sequence regions. • Genes are “expressed” by transcription into mRNA, which is spliced, poly-adenylated and translated. • Purify polyA-mRNA, make cDNA, sequence.

SNP Detection from ESTs • 1.4 million Expressed Sequence Tag (EST) sequences, 300-500 bp, from 950 people. • How to put together all the ESTs from the same gene, without mixing up related genes? • How to distinguish sequencing errors (very common) from genuine Single Nucleotide Polymorphisms?

SNP Detection Approaches • Experimentally: random sampling of DNA. Very expensive, slow. • Computationally: find SNPs from existing experimental data. Sort out real SNPs from experimental sequencing errors. Difficult statistical and computational problems. • This experimental data was sitting around for years...

Distinguishing SNPs from Sequencing Errors [Figure: two candidate SNPs, labeled A and T.] The frequency and pattern in which a polymorphism is observed must rise above the rate of background, random error. Single-pass read sequences contain many errors, which complicate the reliable detection of SNPs. There are miscalls (N) and frequent letter duplications / losses in runs (repeats of a single letter). These non-uniform error rates are critical in assessing the statistical significance of candidate SNPs like A (not in a run) vs. T (problematic because it involves a GG run).
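
Why the local error rate matters can be seen with a simple binomial tail: the chance that at least k of n independent reads show a discrepancy at one position purely by error. This is only a sketch. The 0.3% rate is the background substitution rate quoted later in the talk, while the 5% rate inside a run is a hypothetical value picked for contrast.

```python
from math import comb


def prob_at_least(k, n, error_rate):
    """P(at least k of n independent reads are discrepant) under a binomial error model."""
    return sum(
        comb(n, i) * error_rate**i * (1 - error_rate) ** (n - i)
        for i in range(k, n + 1)
    )


outside_run = prob_at_least(3, 20, 0.003)  # background substitution rate
inside_run = prob_at_least(3, 20, 0.05)    # hypothetical elevated rate in a run

print(f"outside run: {outside_run:.2e}, inside run: {inside_run:.2e}")
```

The same three discrepant reads out of twenty are very unlikely to be errors at a quiet site, but unremarkable where the error rate is elevated.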

How to address this? • Adopt rigorous statistical approach based on measured frequencies from very large data. • Bayesian inference: carefully separate observations from hidden states you want to make inferences about. • “Integrate out” all assumptions by considering all possible values of the assumptions. • Explicitly measure degree of uncertainty in the predictions due to poor data, ambiguity.

Odds ratio: SNP model vs. sequencing error model Consider both models: are the observations more consistent with a SNP or sequencing error?
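
As a minimal sketch of such an odds ratio, consider a single alignment column where n reads cover the site and some of them show the variant base. The version below fixes one error rate and one allele frequency rather than integrating over them, and it ignores alignment uncertainty, so it is a simplification of the approach described in the talk. The function name and parameter values are illustrative.

```python
from math import log10


def lod_snp_vs_error(n_reads, n_variant, error_rate, allele_freq):
    """log10 of P(column | SNP model) / P(column | error-only model).

    Error-only model: every read comes from the consensus allele.
    SNP model: each read comes from the variant allele with probability allele_freq.
    A miscall lands on one specific wrong base with probability error_rate / 3.
    """
    miscall = error_rate / 3.0
    p_var_err, p_ref_err = miscall, 1.0 - error_rate
    p_var_snp = allele_freq * (1 - error_rate) + (1 - allele_freq) * miscall
    p_ref_snp = (1 - allele_freq) * (1 - error_rate) + allele_freq * miscall
    n_ref = n_reads - n_variant
    log_snp = n_variant * log10(p_var_snp) + n_ref * log10(p_ref_snp)
    log_err = n_variant * log10(p_var_err) + n_ref * log10(p_ref_err)
    return log_snp - log_err


print(round(lod_snp_vs_error(20, 3, 0.003, 0.2), 2))  # three variant reads of 20
print(round(lod_snp_vs_error(20, 0, 0.003, 0.2), 2))  # no variant reads
```

A positive score favors the SNP model and a negative score favors sequencing error.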

Error Model: treat True gene sequence as unknown • Treat all sequences T as equally likely (before you consider the actual observations, i.e. the chromatograms). • Sum error model probability over all possible T.

SNP Model • Rather than summing SNP model probability over all possible T, T*, calculate the probability for a specific SNP T* in a specific consensus T.

Sequencing error model Treat individual observed sequences i as independent; treat alignment (what errors occurred) as uncertain. Treat true gene sequence T as uncertain: sum over all possible T

Hidden Markov Model Discrimination of SNP vs. Error The match states (M) of a profile are the equivalent of the true population sequence; deletion (D), insertion (I) and emission probabilities are set to the observed frequencies of sequencing errors conditioned on local sequence context. The sum probability for the SNP model vs. the sum probability for the error-only model yields an odds ratio for the SNP.
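
The idea of summing over all alignments can be sketched with a small forward-style recursion. This is a deliberately simplified stand-in, not the profile HMM described above: it uses fixed, context-independent rates (the background values quoted elsewhere in the talk), a fixed allele frequency, and short toy reads built around the GGA C/T CAA example. Which allele is the consensus is an arbitrary choice here.

```python
from math import log10

# Background rates quoted in the talk: 0.3% substitution, 0.3% insertion, 0.7% deletion.
SUB, INS, DEL = 0.003, 0.003, 0.007


def read_likelihood(true_seq, read):
    """P(read | true_seq), summed over all global alignments."""
    p_match_step = 1.0 - INS - DEL
    rows, cols = len(true_seq) + 1, len(read) + 1
    f = [[0.0] * cols for _ in range(rows)]
    f[0][0] = 1.0
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:  # aligned pair: correct call or substitution
                same = true_seq[i - 1] == read[j - 1]
                emit = 1.0 - SUB if same else SUB / 3.0
                total += f[i - 1][j - 1] * p_match_step * emit
            if i > 0:            # base of true_seq missing from the read
                total += f[i - 1][j] * DEL
            if j > 0:            # extra base inserted in the read
                total += f[i][j - 1] * INS * 0.25
            f[i][j] = total
    return f[-1][-1]


def snp_lod(reads, consensus, variant, allele_freq=0.5):
    """log10 odds of 'consensus plus variant allele' vs 'consensus only'."""
    total = 0.0
    for read in reads:
        p_err = read_likelihood(consensus, read)
        p_snp = allele_freq * read_likelihood(variant, read) + (1 - allele_freq) * p_err
        total += log10(p_snp) - log10(p_err)
    return total


consensus, variant = "GGACCAA", "GGATCAA"  # toy window around GGA C/T CAA
reads = ["GGACCAA", "GGATCAA", "GGATCAA", "GGACCAA", "GGACAA"]  # last read has a deletion
print(round(snp_lod(reads, consensus, variant), 2))
```

Because every alignment path contributes, a read with an indel near the site is weighed rather than forced into one alignment.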

To assess putative SNP, consider all alternative possibilities • Sequencing error: calculate odds ratio SNP vs. error. Use PHRED score, local context. • Orientation errors: ESTs reported backwards? • Chimeras, mixed clusters: ESTs may not be properly clustered. Some ESTs chimeric? • Alignments: all possible ways EST could have been emitted from true sequence T. • “true” sequence: all possible T for the gene.
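
For the PHRED score mentioned in the first bullet, the standard definition is Q = -10 log10(p), where p is the estimated probability that the base call is wrong. A short sketch of the conversion in both directions (the function names are illustrative):

```python
from math import log10


def phred_to_error_prob(quality):
    """Convert a PHRED quality score to a base-call error probability."""
    return 10.0 ** (-quality / 10.0)


def error_prob_to_phred(prob):
    """Convert a base-call error probability to a PHRED quality score."""
    return -10.0 * log10(prob)


for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))
```

A per-base probability of this kind can replace a single global error rate when scoring a candidate SNP.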

SNP Model: “Local” allele frequency q_z in one person: z = 0, 1, 2, …; q_z = z/N, where N = 2 typically. Assuming Hardy-Weinberg.

Use Library information: which sequences are from the same person! Combine observations from all libraries L, and treat population allele frequency q as uncertain (so integrate over q in (0, 1)).

Posterior probability for population allele frequency q Gives posterior distribution for q, taking into account all error rates in the observations, amount of sequence and library availability, ambiguities in the sequence, etc.
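
The sketch below shows one way such a posterior could be computed numerically. It is written under stated assumptions, not taken from the talk: each library is one diploid person, the number of variant copies z follows Hardy-Weinberg proportions, reads sample the person's two chromosomes equally, the prior on q is uniform, and a single fixed error rate is used. The library counts are made up to contrast six variant reads from one library with six variant reads spread over six libraries.

```python
from math import comb


def library_likelihood(q, n_variant, n_ref, error_rate, ploidy=2):
    """P(reads from one library | q), summed over z = 0..ploidy variant copies."""
    total = 0.0
    for z in range(ploidy + 1):
        p_z = comb(ploidy, z) * q**z * (1 - q) ** (ploidy - z)  # Hardy-Weinberg
        local = z / ploidy                                       # q_z = z / N
        p_var = local * (1 - error_rate) + (1 - local) * error_rate / 3
        p_ref = (1 - local) * (1 - error_rate) + local * error_rate / 3
        total += p_z * p_var**n_variant * p_ref**n_ref
    return total


def posterior_over_q(libraries, error_rate=0.003, grid_size=1000):
    """Return (grid, density) for q on (0, 1) with a uniform prior."""
    step = 1.0 / grid_size
    grid = [(i + 0.5) * step for i in range(grid_size)]
    weights = []
    for q in grid:
        w = 1.0
        for n_variant, n_ref in libraries:
            w *= library_likelihood(q, n_variant, n_ref, error_rate)
        weights.append(w)
    norm = sum(weights) * step
    return grid, [w / norm for w in weights]


def posterior_mean(libraries):
    grid, density = posterior_over_q(libraries)
    return sum(q * d for q, d in zip(grid, density)) / len(grid)


# Hypothetical data: (variant reads, consensus reads) per library, 10 libraries each.
one_library = [(6, 0)] + [(0, 6)] * 9
scattered = [(1, 5)] * 6 + [(0, 6)] * 4

print(round(posterior_mean(one_library), 3), round(posterior_mean(scattered), 3))
```

Six observations from a single person support a lower population frequency than the same six observations spread across six people, because the former can be explained by one carrier.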

6 SNP observations from one library

6 SNP observations scattered over all libraries

Alignment Accuracy Challenges • Automatic Multiple Sequence Alignment of 1000+ sequences is problematic. • Alignment accuracy is much more of a problem for SNP detection than for simply getting the right consensus. Consensus merely requires that the majority be aligned, whereas even a single alignment error will result in an incorrect SNP prediction.

Sequencing Error Analysis • We have produced a dataset of 400,000,000 bp where we have reliable consensus, and therefore can identify all the sequencing errors. This could provide “corrected” EST sequences, or alternatively consensus, assembled gene sequences for a large fraction of human genes. • This also provides detailed statistics on the frequency of different types of sequencing errors, which show a startling variation depending on local sequence context. Background error rates of 0.3% substitution, 0.3% insertion, 0.7% deletion, rise dramatically

Example SNP: GGA C/T CAA Cluster AA702884 C vs. T polymorphism Novel SNP, not previously identified.

Automated SNP Detection • Input: Unigene, 1,400,000 human ESTs, 300-500 bp long. • Word-frequency-based overlap & orientation detection: try all possible orientations; don’t trust Unigene! Many errors in the reported data, e.g. reversals, in the majority of clusters! • Reorient ESTs: catch reversals, place in 5’ -> 3’ orientation. • EST alignment: accurately predict gene consensus & SNPs; 10-5000 ESTs per gene, 80,000 genes, 500-5000 bp long. • Statistical assessment of candidate SNPs: >50,000 believable SNPs hidden among >10,000,000 sequencing errors.
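
The orientation step can be sketched as a comparison of shared words (k-mers): count how many words an EST shares with a reference in its reported orientation and as a reverse complement, and keep whichever scores higher. This is a minimal illustration. The word length of 8, the function names, and the use of a single reference sequence are assumptions; the reference string is the start of the consensus from the megakaryocyte potentiating factor alignment with gaps removed.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reverse_complement(seq):
    """Reverse-complement a DNA string; unknown letters become 'n'."""
    return "".join(COMPLEMENT.get(base, "n") for base in reversed(seq.lower()))


def words(seq, k=8):
    """Set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def orient(est, reference, k=8):
    """Return (oriented_est, was_reversed) using shared word counts."""
    est, reference = est.lower(), reference.lower()
    ref_words = words(reference, k)
    forward = len(words(est, k) & ref_words)
    flipped = reverse_complement(est)
    backward = len(words(flipped, k) & ref_words)
    if backward > forward:
        return flipped, True
    return est, False


reference = "gagggccccactcccttgctggccccagccctgctggggat"
reported = reverse_complement(reference[5:35])  # an EST reported backwards
print(orient(reported, reference))
```

In a full pipeline the comparison would be made between ESTs in a cluster rather than against a known reference.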

Sequence Alignment

Current Status: Results • 400,000,000 bp aligned w/ reliable consensus. • 83,000 consensus gene sequences produced. • 20,000 show significant homology to known proteins, almost all in expected + orientation. • 75,000 SNPs above LOD score of 3. • 30,000 SNPs above LOD score of 6. • Current estimate: 60,000 high frequency SNPs.

Megakaryocyte Potentiating Factor (Unigene Cluster Hs.155981)
gagg..cccactcccttg.ctggccccagccctgctgan.at.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact Hs#S785496
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact Hs#S1065649
gagggccccactcccttg.ctagtgtcagccctgctggggat.ccccgcctggccaggagcagagcacgggtggtccccattccaccccaagagaact Hs#S706294
gagggccc.actcccttg.ctggccccagcc.tgctgga.gt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact Hs#S730843
gagggccc.actcccttg.ctggccccagccctgctgna.nt.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact Hs#S751356
gagggccccactcccttg.ctaggac.agcc.tgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact Hs#S786081
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtgatccccgttccaccccaagagaact Hs#S417458
gagggccccactccctgggcttggcccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact Hs#S751274
gagggccccactcccttg.ctggccccagccctgctgga.atancccgcctggccaggagcag.gcacgggtnatccccgttccaccccaagagaact Hs#S483955
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact Hs#S1434119
aagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaaaagaact Hs#S1065241
gagggccccactcccttg.ctggccccagccctgctggggat.ccccgcctggccaggagcag.gcacgggtggtccccgttccaccccaagagaact CONSENS0
Position ruler: 1970 1980 1990 2000 2010 2020 2030 2040 2050

Chromatographic Evidence [Figure: chromatogram traces. Hs#S785496 (zu42c08.r1) reads GGTGGTCCC, with G at the variable position; Hs#S1065649 (oz03ho7.x1*) reads GGTGATCCC, with A.]

RFLP Detection of SNPs [Figure: gel of MboI digests, lanes 1-11. The G/A SNP changes GGTC to GATC, the MboI site. Fragments labeled 86, 67, 35 and 32 nt. Genotypes by lane: 1 GG, 2 GG, 3 GG, 4 AA, 5 GG, 6 GA, 7 GG, 8 GA, 9 GG, 10 AA, 11 GA.]
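
The logic of the RFLP assay can be sketched in a few lines: MboI cuts immediately before each GATC, so only the A allele is cut and each genotype gives a distinct band pattern. The amplicon below is a toy construct. Its flanks are made up so that the fragment sizes come out as 67, 35 and 32 nt, and the 86 nt band in the figure is not modeled.

```python
def mboi_fragments(seq):
    """Fragment lengths after cutting before every GATC (MboI cuts ^GATC)."""
    seq = seq.upper()
    cuts = []
    start = seq.find("GATC")
    while start != -1:
        cuts.append(start)
        start = seq.find("GATC", start + 1)
    edges = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(edges, edges[1:]) if right > left]


def amplicon(allele):
    """Toy 67-nt amplicon; flanks are hypothetical, only the G<allele>TC site is from the slide."""
    return "T" * 35 + "G" + allele + "TC" + "T" * 28


def genotype_bands(genotype):
    """Distinct band sizes expected for a two-letter genotype such as 'GA'."""
    bands = set()
    for allele in genotype:
        bands.update(mboi_fragments(amplicon(allele)))
    return sorted(bands, reverse=True)


for genotype in ("GG", "GA", "AA"):
    print(genotype, genotype_bands(genotype))
```

Homozygous GG keeps the uncut 67 nt band, homozygous AA shows only the 35 and 32 nt fragments, and the heterozygote shows all three.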

Verified 56 of 79 SNPs tested so far
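
To put a rough uncertainty on a rate like 56 of 79, a Wilson score interval for a binomial proportion is one common choice. The talk does not report an interval, so this is an illustrative addition, assuming the tested SNPs are independent trials.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


low, high = wilson_interval(56, 79)
print(f"verified fraction: {56 / 79:.3f}, interval: ({low:.3f}, {high:.3f})")
```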

Verification Test: Whitehead cSNPs • Whitehead Institute has systematically searched for SNPs in 106 genes, using 20 Europeans, 10 Africans, 10 Asians. • On 54 genes, our predicted cSNPs (score > 3) are verified by their results at a 70% rate. The Whitehead set may be incomplete.
