Parametric linkage analysis and lod scores Steve Horvath Depts. of Human Genetics & Biostatistics UCLA
Contents • the big picture: meiotic mapping techniques • genetic distances and genetic maps • map functions • LOD (log of the odds) score analysis • 2-point analysis • testing for linkage between a marker and an affectation status locus • example: rare, fully penetrant, dominant Mendelian disease • more general disease models • parameters in parametric linkage analysis • multipoint analysis: algorithms for LOD scores • significance levels, thresholds and false positives
Meiotic mapping allows one to identify DNA segments that contain disease genes • Reverse genetics: trait -> DNA [Figure: traits 1, 2 and 3 mapped to DNA segments] • Mapping is part of the “positional cloning” strategy. • works well for Mendelian diseases, • which correspond to rare, highly penetrant disease alleles
Different ways of expressing the goal of genomics • goal: find stretches of DNA that are risk factors for a disease. • known as reverse genetics if you start with the phenotype (e.g. affectation status) • aka. positional cloning (Collins FS) • 3 step procedure (adapted) • first meiotic mapping (linkage, linkage disequilibrium) • second, physical mapping (includes sequencing) • third, find mutation and verify functional role
Different kinds of meiotic mapping methods • parametric (better model-based) lod score analysis • single point • multipoint • non-parametric (better model-free) linkage analysis • allele sharing methods • key concept: identity by descent • confusing factoid: non-parametric models sometimes equivalent to parametric methods (Knapp M, 1993?) • association studies, linkage disequilibrium mapping • family-based methods (TDT, FBAT) • population-based methods (chi-square test, log-linear model)
What do meiotic mapping methods have in common? • based on meiosis • made possible through the violation of Mendel’s law of independent assortment • crossing over effects, recombination, .... • recombination fraction θ • requires genetic markers, and sometimes the distances between them (“genetic map”) • usually test hypothesis of no linkage H: θ=1/2 • but sometimes test for no linkage disequilibrium
What is parametric linkage analysis? “A meiotic mapping technique based on constructing a disease gene transmission model to explain the inheritance of a disease in pedigrees.” Meaning will become clear....
Genetic markers • desirable properties of genetic markers • locus-specific • polymorphic in the studied population • many heterozygotes • easily genotyped • quality measures for markers • heterozygosity: homozygotes are uninformative! • or Polymorphism Information Content • = probability that the parent is heterozygous x probability that the offspring is informative
Important co-dominant genetic markers • microsatellites • variations in the number of tandem repeats • high level of polymorphism • even distribution across the genome • 2nd generation map • SNPs • single nucleotide polymorphisms • bi-allelic codominant marker • heterozygosity is limited at 50 percent • 3rd generation map
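The two marker quality measures can be sketched numerically. This is a minimal sketch, assuming Hardy-Weinberg proportions; the PIC expression used is the standard Botstein et al. (1980) formula, which is not spelled out on the slides, and the allele frequencies are hypothetical.

```python
from itertools import combinations

def heterozygosity(freqs):
    """Expected heterozygosity under Hardy-Weinberg: 1 - sum of p_i^2."""
    return 1.0 - sum(p * p for p in freqs)

def pic(freqs):
    """Polymorphism Information Content (assumed Botstein et al. 1980 form):
    heterozygosity minus the sum over allele pairs of 2 * p_i^2 * p_j^2."""
    cross = sum(2.0 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return heterozygosity(freqs) - cross

# Hypothetical allele frequencies, for illustration only.
snp = [0.5, 0.5]             # bi-allelic marker at its most informative
microsatellite = [0.1] * 10  # ten equally frequent repeat alleles

print("SNP:", heterozygosity(snp), pic(snp))
print("microsatellite:", heterozygosity(microsatellite), pic(microsatellite))
```

With two equally frequent alleles the heterozygosity is exactly 0.5, which is the 50 percent ceiling quoted for SNPs; a marker with many alleles can get much closer to 1.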
“Genetic“ distances and “genetic“ maps Will be very relevant for multipoint linkage studies.
The recombination fraction is a measure of distance between 2 loci • recombination fraction θ=the probability that a recombinant gamete is transmitted • If two loci are on different chromosomes, they will segregate independently • => recombination fraction θ=.5. • if two loci are right next to each other, they will segregate together during meiosis • => recombination fraction θ=0 • terminology • θ<.5 the loci are close (they are “linked”) • θ=.5 the loci are far apart (they are not linked)
Genetic distance (unit is Morgan)= expected no. of cross-over pts per gamete • notation: let a and b be 2 points in the genome. • N[ab] = number of chiasmata between them • chiasmata=crossing-over points • Definition: the genetic (map) distance is d=E(N[ab])/2 • Why factor of 2? Want no. of chiasmata per gamete. • Example: if on average 49 crossovers per cell in meiosis • then total genetic map distance=49/2=24.5 Morgans • 1 Morgan=100 centimorgans
There is a relationship between crossing over and recombination fraction • Mather’s formula: θ=.5*P(N[ab]>0) • for small distances, d is approximately equal to θ, • since in this case E(N[ab]) is approximately P(N[ab]>0) • P(N[ab]>0) is related to d=E(N[ab])/2 • different probability models for N[ab] lead to different relationships between θ and d. • each “sensible” relationship between θ and d is called a map function • Great reference: Lange K: “Mathematical and Statistical Methods for Genetic Analysis” book, Springer
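Mather's formula becomes concrete once a probability model for N[ab] is chosen. The sketch below assumes N[ab] is Poisson with mean 2d (crossovers placed at random, no interference); that model choice and the function name are assumptions made here for illustration.

```python
import math

def theta_from_distance(d_morgans):
    """Mather's formula, theta = .5 * P(N[ab] > 0), with the ASSUMED model
    N[ab] ~ Poisson(mean 2d), so that P(N[ab] > 0) = 1 - exp(-2d)."""
    p_at_least_one = 1.0 - math.exp(-2.0 * d_morgans)
    return 0.5 * p_at_least_one

# theta is close to d for small d and levels off at 1/2 for large d.
for d in (0.01, 0.10, 0.50, 2.00):
    print(f"d = {d:.2f} M  ->  theta = {theta_from_distance(d):.4f}")
```

Under this model θ tracks d closely for short distances and never exceeds 1/2, however far apart the two points are.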
The mathematical relationship between recombination fraction and genetic distance is called a mapping function • Haldane’s mapping function • d=-.5 ln(1-2θ) • the distance d is measured in Morgans • perfect if crossovers occurred at random (no interference) • Kosambi’s mapping function • d=.25 ln[(1+2θ)/(1-2θ)] • again distance is measured in Morgans • suitable if there is (crossover) interference: • one cross-over prevents another from taking place nearby • widely used
Note: for both mapping functions • if θ=.5, d = +infinite Morgans (infinite distance) • if θ=0, d = 0 M (0 distance) • if θ=27%, Haldane = .39 Morgans = 39cM, Kosambi = .30 Morgans = 30cM
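The two mapping functions are easy to check numerically. A minimal sketch, with function names chosen here for illustration; both return the distance in Morgans and treat θ=.5 as infinite distance.

```python
import math

def haldane(theta):
    """Haldane: d = -.5 * ln(1 - 2*theta), in Morgans (no interference)."""
    if theta >= 0.5:
        return math.inf
    return -0.5 * math.log(1.0 - 2.0 * theta)

def kosambi(theta):
    """Kosambi: d = .25 * ln[(1 + 2*theta) / (1 - 2*theta)], in Morgans."""
    if theta >= 0.5:
        return math.inf
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

for theta in (0.0, 0.01, 0.10, 0.27, 0.5):
    print(theta, haldane(theta), kosambi(theta))
```

At θ=27% the two functions give roughly .39 and .30 Morgans, and for small θ both are close to θ itself.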
Men are genetically shorter than women • Total male map length=2851cM • Total female map length=4296cM (excluding the X) • Thus over the 3000Mb (megabases) autosomal genome • 1 male cM averages 1.05 Mb (3000/2851) • 1 female cM averages about 0.70 Mb (3000/4296)
Meiotic versus physical maps • meiotic maps measure distances in “genetic” units, i.e. centimorgans • pretty coarse and often inaccurate • problem 1: which marker order? • problem 2: which mapping function? • physical maps measure distances in base pairs • extremely high resolution allows you to find the actual mutation • Connection between the 2 maps • rule of thumb: 1cM equals 1 million base pairs • but this thumb is very crooked!!!
The likelihood • likelihood=probability of data given the parameters • likelihoods are useful for estimation and for testing • example: phase-known fully informative case • observed data: R=no. of recombinations, NR=no. of non-recombinations • parameter: the recombination fraction θ=Pr(recombination) • likelihood is proportional to: θ^R (1-θ)^NR • maximum likelihood estimate of θ: R/(R+NR) • use the log of the likelihood for mathematical convenience
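For the phase-known, fully informative case the likelihood can be maximized directly. The sketch below assumes the binomial form θ^R (1-θ)^NR and compares the closed-form estimate R/(R+NR), capped at 1/2, with a simple grid search; the counts are hypothetical.

```python
import math

def log_likelihood(theta, r, nr):
    """Log of theta^R * (1 - theta)^NR; requires 0 < theta < 1."""
    return r * math.log(theta) + nr * math.log(1.0 - theta)

def mle_closed_form(r, nr):
    """R / (R + NR), capped at 1/2 because theta cannot exceed 1/2."""
    return min(r / (r + nr), 0.5)

def mle_grid(r, nr, steps=500):
    """Brute-force check: best theta on the grid 0.001, 0.002, ..., 0.5."""
    grid = [i / (2.0 * steps) for i in range(1, steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, r, nr))

# Hypothetical data: 2 recombinants and 8 non-recombinants.
print(mle_closed_form(2, 8), mle_grid(2, 8))
```

Working on the log scale turns the product into a sum and avoids underflow when many meioses are scored.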
Advantages of max. likelihood estimation • advantages • asymptotically most efficient, • high precision • asymptotically consistent: it will converge closer and closer to the true value • asymptotically unbiased • corresponding likelihood ratio test enjoys similar optimality criteria
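How to compute lod scores? Lod scores are computed for each pedigree (i) as: Z_i(θ) = log10[ L_i(θ) / L_i(θ=1/2) ], where L_i is the likelihood of pedigree i. For a given value of θ, pedigree-specific lod scores are summed across the F families to yield an overall lod score: Z(θ) = Z_1(θ) + ... + Z_F(θ)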
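A lod score is the log10 ratio of the pedigree likelihood at θ to the likelihood at θ=1/2, summed over families. The sketch below assumes the simplest setting, phase-known fully informative meioses with likelihood θ^R (1-θ)^NR; the family counts and function names are hypothetical.

```python
import math

def lod_phase_known(theta, r, nr):
    """log10[ L(theta) / L(1/2) ] for R recombinants and NR non-recombinants."""
    if theta == 0.0 and r > 0:
        return -math.inf  # a recombinant is impossible at theta = 0
    log10_lik = 0.0
    if r:
        log10_lik += r * math.log10(theta)
    if nr:
        log10_lik += nr * math.log10(1.0 - theta)
    return log10_lik - (r + nr) * math.log10(0.5)

def total_lod(theta, families):
    """Sum the pedigree-specific lod scores across families."""
    return sum(lod_phase_known(theta, r, nr) for r, nr in families)

# Hypothetical (R, NR) counts for three families.
families = [(1, 9), (0, 5), (2, 8)]
for theta in (0.05, 0.10, 0.20, 0.30, 0.50):
    print(f"theta = {theta:.2f}  lod = {total_lod(theta, families):.2f}")
```

Ten non-recombinant, phase-known meioses give a lod score of 10*log10(2), about 3.0, at θ=0, which is where the classical threshold of 3 becomes reachable.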
Example: lod score calculation [Figure: pedigree drawing] Message: disease status is not required....
2 point parametric linkage analysis • Setting • genotype of 1 marker locus is known for family members • the genotypes of the other locus (disease susceptibility locus) are unknown • but the disease locus phenotype (affectation status) is known • GOAL: • test whether the disease locus and marker are linked • Q: Why is it important? • A: If they are linked, the disease locus must be close to the marker, i.e. we have localized the disease gene.
Test for linkage is carried out in 3 steps Step 1: use the disease status to infer the underlying disease locus genotypes Step 2: count the number of recombinations and non-recombinations for the different possible paternal phases Step 3: compute the lod score and check whether it is bigger than 3.0
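Steps 2 and 3 can be sketched for one phase-unknown parent. This is a minimal sketch, assuming the two possible phases are equally likely a priori, so the likelihood is the average of the two phase-specific likelihoods; the offspring counts are hypothetical.

```python
import math

def lod_phase_unknown(theta, k, n):
    """Lod score for n informative offspring of one phase-unknown parent.
    k = number of offspring that are recombinant under phase 1
    (they are non-recombinant under phase 2, and vice versa)."""
    lik_phase1 = theta ** k * (1.0 - theta) ** (n - k)
    lik_phase2 = theta ** (n - k) * (1.0 - theta) ** k
    lik = 0.5 * lik_phase1 + 0.5 * lik_phase2  # ASSUMED equal phase priors
    if lik == 0.0:
        return -math.inf
    return math.log10(lik / 0.5 ** n)

# Hypothetical family: 12 offspring, none recombinant under phase 1.
thetas = [i / 100.0 for i in range(0, 51)]
best_theta = max(thetas, key=lambda t: lod_phase_unknown(t, 0, 12))
best_lod = lod_phase_unknown(best_theta, 0, 12)
print(best_theta, round(best_lod, 2), best_lod > 3.0)
```

Not knowing the phase costs roughly one meiosis: n non-recombinant offspring give (n-1)*log10(2) at θ=0 instead of n*log10(2).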
DATA for a single pedigree: rare, fully penetrant, dominant disease • Grandpa unaffected, marker genotype 22 • Grandma affected, marker genotype 11 • father affected
Steps 1-3 • STEP 1 • we assume that the disease locus carries 2 alleles • since the disease is fully penetrant, the genotypes of the unaffecteds must equal dd • the genotype of the grandma is Dd or DD. Since the disease is rare, it is probably Dd. • thus we get the same pedigree as described earlier • STEPs 2-3 were already carried out earlier.
Glitch for non-Mendelian diseases • the relation between disease locus genotypes and affectation status is in general very complex and can no longer be solved by inspection • need powerful statistical and computation methods • start with likelihood (easy to write down) • compute the likelihood (hard)
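Most general form of the likelihood of pedigree data • L = Σ_{G1} ... Σ_{Gn} Π_i Pen(X_i | G_i) Π_j Prior(G_j) Π_{(k,l,m)} Tran(G_m | G_k, G_l) • the sums run over the possible genotypes G_i of each of the n pedigree members; X_i is the observed phenotype of person i • product over j is taken over all founders (specify allele frequencies) • product over (k,l,m) is taken over all parent-offspring triples. • transmission probabilities depend on θ • for multiple markers (multipoint analysis) need to specify • a mapping function, e.g., Kosambi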
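The structure of the pedigree likelihood (sum over unobserved genotypes of penetrance terms, founder priors and transmission probabilities) can be sketched by brute force on the smallest pedigree. This is an illustration only, assuming a single two-allele disease locus, Hardy-Weinberg founder priors and one father-mother-child triple; with no marker locus there is no θ in it, and all names and parameter values are hypothetical.

```python
from itertools import product

# Genotypes are coded by the number of D alleles: 0 = dd, 1 = Dd, 2 = DD.

def prior(g, p_d):
    """Founder genotype probability under Hardy-Weinberg (assumed)."""
    q = 1.0 - p_d
    return (q * q, 2.0 * p_d * q, p_d * p_d)[g]

def transmission(child, father, mother):
    """P(child genotype | parental genotypes) under Mendelian segregation."""
    pf, pm = father / 2.0, mother / 2.0  # chance each parent passes on D
    return ((1 - pf) * (1 - pm), pf * (1 - pm) + (1 - pf) * pm, pf * pm)[child]

def pen_term(affected, g, f):
    """Penetrance term: f[g] if affected, 1 - f[g] if unaffected."""
    return f[g] if affected else 1.0 - f[g]

def trio_likelihood(aff_father, aff_mother, aff_child, p_d, f):
    """Sum over all genotype assignments of penetrance * prior * transmission."""
    total = 0.0
    for gf, gm, gc in product(range(3), repeat=3):
        total += (pen_term(aff_father, gf, f) * pen_term(aff_mother, gm, f)
                  * pen_term(aff_child, gc, f)
                  * prior(gf, p_d) * prior(gm, p_d)
                  * transmission(gc, gf, gm))
    return total

dominant = (0.0, 1.0, 1.0)  # (fdd, fDd, fDD): fully penetrant, no phenocopies
print(trio_likelihood(False, True, True, p_d=0.01, f=dominant))
```

The number of terms grows as 3 to the power of the number of people, which is why real pedigrees need algorithms such as Elston-Stewart or Lander-Green rather than this direct enumeration.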
Marker parameters • notation: marker alleles denoted here by 1, 2, …. • relation between marker genotype and phenotype • usually known (example: ABO blood group) • SNPs and microsatellites are codominant=>relation is trivial • allele frequencies p1,p2, …. • if parents are unavailable, the results may depend critically on getting them right. Also homozygosity mapping. • vary between different populations • but can be estimated from the pedigree data • genetic marker map for multiple markers • marker order • genetic distance • increasingly accurate because of DNA sequencing
Disease locus parameters • notation: often 2 alleles D (bad) and d (normal) • allele frequencies pD and pd • penetrances=P(affected | genotype) • fDD=P(affected | genotype DD) • fDd=P(affected | genotype Dd) • fdd=P(affected | genotype dd) • liability classes • fancy terminology for letting penetrances vary between individuals • example: different penetrances for men and women, • or age dependence: young versus old
The biology is modeled through penetrance values • fully penetrant, dominant disease, no phenocopies • fDD=fDd=1, fdd=0 • fully penetrant, recessive disease, no phenocopies • fDD=1, fDd=fdd=0 • no effect • fDD=fDd=fdd • incomplete penetrance: fDD<1 • definition: phenocopies are affecteds without disease genes • phenocopies are present if fdd>0 • for the experts: imprinting is modeled by using 4 penetrances and keeping track of maternally and paternally transmitted alleles
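The penetrance models translate into population quantities with a few lines of arithmetic. A minimal sketch, assuming Hardy-Weinberg genotype frequencies and a hypothetical disease allele frequency pD=0.01; the function names are illustrative.

```python
def genotype_freqs(p_d):
    """Hardy-Weinberg genotype frequencies (assumed) for allele frequency pD."""
    q = 1.0 - p_d
    return {"DD": p_d * p_d, "Dd": 2.0 * p_d * q, "dd": q * q}

def prevalence(p_d, pen):
    """P(affected) = sum over genotypes of frequency * penetrance."""
    freqs = genotype_freqs(p_d)
    return sum(freqs[g] * pen[g] for g in freqs)

def genotype_given_affected(p_d, pen):
    """Bayes' rule: P(genotype | affected)."""
    k = prevalence(p_d, pen)
    freqs = genotype_freqs(p_d)
    return {g: freqs[g] * pen[g] / k for g in freqs}

DOMINANT = {"DD": 1.0, "Dd": 1.0, "dd": 0.0}   # fDD=fDd=1, fdd=0
RECESSIVE = {"DD": 1.0, "Dd": 0.0, "dd": 0.0}  # fDD=1, fDd=fdd=0

print(prevalence(0.01, DOMINANT), prevalence(0.01, RECESSIVE))
print(genotype_given_affected(0.01, DOMINANT))
```

For a rare dominant allele almost every affected person is a Dd heterozygote, which is the reasoning used when inferring disease genotypes by inspection. Setting fdd above 0 adds phenocopies, and the same function then reports what share of affecteds carry no D allele.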
Two point mapping • computerized lod score analysis is the best way to analyze complex pedigrees for linkage with Mendelian traits • use computer software, e.g., Mendel • the result of a linkage analysis is a table of lod scores at various recombination fractions • the result can be plotted to give curves, • regions with lod>3 are linked and those with lod<-2 are excluded (exclusion mapping) • the curve will peak at the most likely recombination fraction
Output of a 2 point linkage analysis [Figure: lod score curve with the “significant” and “excluded” regions marked] Equivalently, consider the table
θ   | 0.01 | 0.10 | 0.20 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50
lod | -5.0 | -2.0 |  1.0 |  3.3 |  4.0 |  3.0 |  1.0 |  0.0
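Reading such a table amounts to three lookups: the peak, the values of θ with lod>3 and the values with lod<-2. A small sketch using the numbers from the table, with the thresholds taken as strict inequalities:

```python
thetas = [0.01, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.50]
lods = [-5.0, -2.0, 1.0, 3.3, 4.0, 3.0, 1.0, 0.0]

# Peak of the lod curve: the most likely recombination fraction.
best_lod, best_theta = max(zip(lods, thetas))

# Recombination fractions with significant evidence for linkage (lod > 3).
linked = [t for t, z in zip(thetas, lods) if z > 3.0]

# Recombination fractions that are excluded (lod < -2).
excluded = [t for t, z in zip(thetas, lods) if z < -2.0]

print("peak:", best_theta, best_lod)
print("linked at theta =", linked)
print("excluded at theta =", excluded)
```

The boundary entries (lod exactly 3.0 at θ=0.40 and exactly -2.0 at θ=0.10) fall outside both sets under strict inequalities.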
Multipoint mapping is more efficient than two point mapping • idea: analyze data for more than 2 loci simultaneously • helps overcome limited informativeness of markers • especially relevant for SNPs • peak heights depend crucially on the precise distances between markers and the mapping function->problematic • highest peak marks the most likely location • powerful method for scanning the genome in 20-Mb segments
Standard lod score analysis is not without problems • genotyping errors & misdiagnosis -> loss of power • they lead to spurious recombinants -> inflates the length of the genetic map • multi-locus maps can detect such errors by checking for double recombinants • locus heterogeneity is always a pitfall • mutations in unlinked loci may produce the same clinical phenotype • use Genehunter or Homog to test for homogeneity • computational difficulties limit the pedigrees that can be analyzed (nah, not really....)
Limitations of the different methods
Algorithm | Programs | Solution | Size restrictions
Elston-Stewart | (Fast)Linkage, Mendel4, Vitesse, etc. | exact | varies: ~8 loci, less with loops
Lander-Green | Allegro, Cri-Map, GeneHunter, Mendel4, etc. | exact | ~20 people: 2n - f < 20
Markov chain Monte Carlo | Loki, Pangea, SimWalk2, etc. | estimate | much larger: >200 people, >30 loci
Slide from webpage http://watson.hgen.pitt.edu/docs/simwalk2.html
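The Lander-Green size restriction 2n - f < 20 can be checked for a given pedigree. The slide does not define n and f; the sketch below assumes the usual reading, n = number of non-founders and f = number of founders, and treats the limit of 20 as the rough figure quoted above rather than a hard property of any particular program.

```python
def lander_green_bits(n_nonfounders, n_founders):
    """Size measure 2n - f (ASSUMED: n = non-founders, f = founders)."""
    return 2 * n_nonfounders - n_founders

def fits_lander_green(n_nonfounders, n_founders, limit=20):
    """True if the pedigree satisfies the rough restriction 2n - f < limit."""
    return lander_green_bits(n_nonfounders, n_founders) < limit

# Hypothetical pedigrees: (non-founders, founders).
for n, f in [(4, 2), (10, 4), (14, 6)]:
    bits = lander_green_bits(n, f)
    print(f"n={n}, f={f}: 2n-f={bits}, states=2**{bits}, ok={fits_lander_green(n, f)}")
```

Each extra non-founder adds two to the exponent, which is the exponential growth in the number of people noted for this algorithm.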
General-Pedigree Linkage Analysis Packages: computation times of the algorithms
Approximate increase in computational time with increase in:
Algorithm | People | Markers | Missing data
Elston-Stewart | linear | exponential | severe
Lander-Green | exponential | linear | modest
Markov chain Monte Carlo | linear | linear | mild
Distinction between pointwise (nominal) and genome-wide significance • pointwise p-value=probability of exceeding the observed value at a given point, under H: θ=1/2 • genome-wide p-value=prob. that the observed value will be exceeded anywhere in the genome • reality check about p-values • if the p-value < false positive rate alpha, the finding is significant • the smaller the p-value, the higher the statistical significance • genome-wide p-value > pointwise p-value
Lod score thresholds should ensure a .05 genome-wide false positive rate • genome-wide false positive rate alpha=chance of a false positive result occurring anywhere during a whole genome scan • for single point, classically want lod > 3.0 • multipoint threshold for a Mendelian disease: 3.3 • Lander and Schork 1994 • multipoint threshold for a complex disease • 3.3-4.0 (depends on the study design, Lander and Kruglyak 1995) • pointwise p-value for significant linkage: 5*10^(-5)
How to relate the pointwise (P) to the genome-wide false positive rate (G) • conservative Bonferroni correction: • P = G/(no. of potential pointwise tests) • Example: no. of potential pointwise tests=no. of potential SNPs=1 million, G=.05 => P = 5*10^(-8) • ignores dependencies (linkage) between markers • Lander and Kruglyak 1995 found the asymptotic relation • G(T) = [C+9.2*ρ*G*T]*P(T) • T=threshold lod score • C=number of chromosomes=23 • ρ=crossover rate, depends on the relationship being studied, e.g., sibs • G inside the brackets=length of the genome in Morgans=33 (not the false positive rate on the left-hand side) • for sibpairs use 3.6 for IBD testing and 4.0 for IBS testing
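Both corrections can be evaluated numerically. This is a sketch under stated assumptions: the pointwise p-value of a lod score T is taken from the one-sided asymptotic chi-square approximation, P(T) = .5*P(chi-square with 1 df > 2*ln(10)*T), and ρ=1 is used purely as an illustrative crossover rate; neither choice is given on the slide, and the function names are hypothetical.

```python
import math

def bonferroni_pointwise(genomewide_rate, n_tests):
    """Bonferroni: pointwise level P = G / (number of potential tests)."""
    return genomewide_rate / n_tests

def pointwise_p(lod):
    """ASSUMED one-sided asymptotic p-value for a lod score:
    .5 * P(chi2_1 > 2 ln(10) lod), using P(chi2_1 > x) = erfc(sqrt(x/2))."""
    x = 2.0 * math.log(10.0) * lod
    return 0.5 * math.erfc(math.sqrt(x / 2.0))

def genomewide_rate(lod, rho, chromosomes=23, genome_morgans=33.0):
    """Asymptotic relation [C + 9.2 * rho * (genome length) * T] * P(T)."""
    return (chromosomes + 9.2 * rho * genome_morgans * lod) * pointwise_p(lod)

print(bonferroni_pointwise(0.05, 1_000_000))  # about 5e-8
print(pointwise_p(3.3))                       # close to 5e-5
print(genomewide_rate(3.3, rho=1.0))          # rho = 1 is an illustrative value
```

With these assumptions a lod threshold of 3.3 corresponds to a pointwise p-value near 5*10^(-5) and a genome-wide rate near .05, far less stringent than the Bonferroni level, because linked markers are not independent tests.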
Linkage findings are controversial because of their high false positive rate. • The smart money knows • want to see a lod score > 4 (or even 5) • meiotic mapping techniques fail at detecting complex disease genes • if the disease is complex, it is a false positive…. • if the effect is real, 2 point linkage analysis performs pretty well • How to avoid arguments over a finding? • replicate the finding in a different sample • find the mutation