Haplotyping algorithms and structure of human variation. EECS 458 CWRU Fall 2004 Readings: see papers on the course website. Roadmap. Definition: haplotype and haplotype inference Why infer haplotypes Infer haplotypes from pedigree data Most probable haplotype configurations
Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website
Roadmap • Definition: haplotype and haplotype inference • Why infer haplotypes • Infer haplotypes from pedigree data • Most probable haplotype configurations • Haplotype configurations with minimum recombinations • Infer haplotypes from population data • Combinatorial: Clark’s, Perfect Phylogeny • Statistical methods: EM, Bayesian (MCMC) • Infer haplotypes from pooled samples • Haplotype block partition • Tag SNP selection
Genotype and Haplotype [Figure: two homologous sequences, labeled paternal and maternal, . . . A A T G C C G C A A . . . G T C . . . and . . . A G T G C C G C A A . . . T A C . . ., shown with the genotypes {1 2} {1 2}]
Typical Genotype Data • Observation: • Two alleles for each individual • Chromosome origin for each allele is unknown • Multiple haplotype pairs can fit observed genotype • Molecular haplotyping is expensive [Figure: genotype with alleles A/C at Marker1, G/A at Marker2 and T/C at Marker3, followed by the possible haplotypes]
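The observation that multiple haplotype pairs can fit one genotype can be made concrete with a minimal sketch. It assumes each marker is given as an unordered pair of single-letter alleles; the function name `compatible_haplotype_pairs` is hypothetical and not from the lecture. A genotype with h heterozygous markers has 2^(h-1) unordered haplotype pairs (one pair if h = 0).

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs that fit an unphased genotype.

    genotype: list of 2-tuples of single-character alleles, one per marker.
    (Hypothetical helper for illustration.)
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    # Fix the phase of the first heterozygous marker so that each
    # unordered pair is listed once; every other one may be flipped.
    free = het[1:]
    pairs = []
    for flips in product([False, True], repeat=len(free)):
        h1 = [a for a, _ in genotype]
        h2 = [b for _, b in genotype]
        for i, flip in zip(free, flips):
            if flip:
                h1[i], h2[i] = h2[i], h1[i]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

# The three-marker genotype from the slide: A/C, G/A, T/C.
genotype = [("A", "C"), ("G", "A"), ("T", "C")]
for h1, h2 in compatible_haplotype_pairs(genotype):
    print(h1, h2)
```

For the three heterozygous markers above this lists four pairs, which is why the number of ambiguous configurations grows quickly with the number of markers.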
Haplotypes are important! • Phase may determine phenotype • Phase helps exploit linkage disequilibrium • Infer state of neighboring alleles • Phase clarifies identity-by-descent status
Common Uses of Haplotypes • Linkage disequilibrium studies • Summarize genetic variation • Selecting markers to genotype • Identify haplotype tag SNPs • Candidate gene association studies • Test haplotype associations • Help interpret single marker associations • Understanding evolution of human populations
The problem… • Haplotypes are hard to measure directly • X-chromosome in males • Sperm typing • Other molecular techniques • Often, statistical or combinatorial methods for reconstruction required
Haplotype Inference on population data [Figure: genotype {1 2} {1 1} {1 2} {1 2} {1 2} {2 2} with m = 6 markers, m’ = 4 of them heterozygous, and its 2^m’ candidate phase assignments, e.g. 1|2 1|1 1|2 1|2 1|2 2|2, 2|1 1|1 1|2 1|2 1|2 2|2, 1|2 1|1 2|1 1|2 1|2 2|2, ……]
Information on Relatives • Number of ambiguous individuals increases rapidly with number of markers • Family information can help, but many ambiguities remain
Haplotype Inference on Pedigrees, Mendelian Law [Figure: pedigree with unphased genotypes, e.g. {1 2} {1 2} {1 2} and {1 1} {1 *} {1 2}, next to phased alleles such as 2|1 and 1|1]
Haplotype inference on pooled samples • The input contains n pools • Each pool contains k individuals, thus 2k haplotypes and m markers • At each marker, we are given the number of alleles for the k individuals for each pool • The goal is to find the haplotype frequencies • Example: n=3, k=2, m=5
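A minimal sketch of what is observed for one pool, assuming biallelic markers coded 0/1 and counting copies of allele 1 at each marker. The haplotypes below are made up for illustration (k = 2, so 2k = 4 haplotypes, m = 5); the inference problem runs in the opposite direction, from such counts to haplotype frequencies.

```python
def pool_allele_counts(haplotypes):
    """Count copies of allele 1 at each marker among a pool's haplotypes.

    haplotypes: list of equal-length tuples of 0/1 (hypothetical coding).
    """
    m = len(haplotypes[0])
    return [sum(h[j] for h in haplotypes) for j in range(m)]

# One illustrative pool: k = 2 individuals, hence 4 haplotypes, m = 5 markers.
pool = [
    (0, 1, 1, 0, 1),
    (1, 1, 0, 0, 1),
    (0, 0, 1, 0, 1),
    (1, 1, 1, 0, 0),
]
print(pool_allele_counts(pool))  # [2, 3, 3, 0, 3]
```

Only the count vector is available as input; many different sets of four haplotypes produce the same counts.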
Haplotyping pedigree data: statistical formulation • Statistical formulation: find the most probable haplotype configuration • Need to calculate the probability of the pedigree under every haplotype configuration • Recall that linkage analysis computes the probability of a pedigree as a sum over all possible haplotype configurations
Haplotyping pedigree data: statistical formulation • Thus linkage programs such as Genehunter, Allegro and Merlin can compute the most probable haplotypes • But it is time consuming • In addition to exact computation, there are some approximation algorithms, mainly based on importance sampling, e.g. SimWalk. • Still very time consuming; may consider many configurations with very small probabilities
Recombination and combinatorial formulation [Figure: pedigrees with genotypes {1 2} and {1 1} and alternative phase assignments such as 1|2, 2|1 and 1|1]
MRHC Problem • Find a minimum recombinant haplotype configuration from a given pedigree with genotype data. • Assumptions: • Mendelian law (no mutations); • Recombination events are rare. Well supported from real data. [Figure (input): pedigree with unphased genotypes, e.g. {1 1} {1 2} {2 2} ... and {1 2} {1 2} {1 2} ...]
MRHC Problem (cont’d) • PS: parental source of the two alleles at the locus (i.e. phase) • GS: grandparental source of an allele • Haplotype configuration = assignment of PS and GS values. [Figure (output): pedigree with phased haplotypes, e.g. 1|1 1|2 2|2 ... and 1|2 1|2 2|1 ..., with example PS and GS values marked for members A and B]
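To illustrate how GS values relate to recombinants, here is a minimal sketch that counts the fewest recombinants needed for one parent, with both haplotypes already phased, to transmit a given haplotype to a child. The 0/1 encoding of GS and the function name `min_recombinants` are assumptions for illustration. This is only the per-transmission count, not a solver for MRHC, where the phases themselves are unknown.

```python
def min_recombinants(parent_hap0, parent_hap1, child_hap):
    """Fewest GS switches explaining child_hap from one parent's haplotypes.

    GS = 0 (or 1) at a locus means the child's allele is copied from
    parent_hap0 (or parent_hap1); a recombinant is a change of GS between
    adjacent loci. Returns None if Mendelian law is violated.
    """
    INF = float("inf")
    parents = (parent_hap0, parent_hap1)
    # cost[g] = fewest recombinants so far if the current locus has GS = g
    cost = [0 if parents[g][0] == child_hap[0] else INF for g in (0, 1)]
    for j in range(1, len(child_hap)):
        new_cost = []
        for g in (0, 1):
            if parents[g][j] != child_hap[j]:
                new_cost.append(INF)  # this source cannot supply the allele
            else:
                new_cost.append(min(cost[g], cost[1 - g] + 1))
        cost = new_cost
    best = min(cost)
    return None if best == INF else best

print(min_recombinants((1, 1, 2), (2, 2, 1), (1, 1, 2)))  # 0
print(min_recombinants((1, 1, 2), (2, 2, 1), (1, 1, 1)))  # 1
print(min_recombinants((1, 1, 2), (2, 2, 1), (1, 2, 2)))  # 2
```

Summing such counts over every parent-child transmission gives the number of recombinants of a fully specified haplotype configuration, the quantity MRHC minimizes.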
Previous Results • Genotype elimination (O’Connell’00). • For data requiring no recombinant, exhaustive elimination. • Genetic algorithm (Tapadar et al.’00). • Time consuming. • MRH (Qian & Beckmann’02). • Six step rule-based algorithm. • Locus by locus at every step, extremely slow for biallelic (e.g. SNP) markers.
Thm. MRHC is NP-Hard. • Idea: Reduction from a variant of set cover. • First complexity result. • Remains hard for two loci. • Remains hard when no loops. Li & Jiang’03, Doi, Li & Jiang’03
Block-Extension Algorithm • Iterative, heuristic, five steps. • Rules are derived from Mendelian law, MR principle, block concept and some greedy ideas based on the following observations: • Block structures are common in haplotypes. • Double recombination events are rare. • Common haplotype blocks shared in siblings. • … • Advantages/Disadvantages • Time complexity (BE: O(dmn) / MRH: O(2^d m^3 n^2)) Li & Jiang’03
Block-Extension Algorithm [Figure: example pedigree with multiallelic genotypes, some alleles missing (*)]
Block-Extension Algorithm [Figure: the example pedigree with partially phased genotypes, e.g. 1|1 1|2 and 2|3, some annotated with value pairs such as 1|2(-1,0) and 2|3(1,-1)]
Dynamic Programming Algorithms • Locus-based dynamic programming algorithm • Linear time in the number of the members • Applicable to only tree pedigrees • Member-based dynamic programming algorithm • Linear time in the number of the loci • Applicable to general pedigrees with small sizes Doi, Li & Jiang’03
Locus-Based Dynamic Programming [Figure: tree pedigree with a root and members 1 to 8]
Constraint-Finding Algorithm • Assumptions: • No missing alleles, no errors. • Zero recombinants. • Idea: finding all feasible (i.e. 0-recombinant) haplotype configurations is equivalent to reducing the degree of freedom in PS/GS assignment. Li & Jiang’03
Four Levels of Constraints • Based on Mendelian law (on single locus): • Level 1: GS constraint • Level 2: PS constraint • Based on 0-recombinant (for a pair of loci): • Level 3: Haplotype constraint • Level 4: Grouping constraint [Figure: phased pedigree with members A and B, annotated with PS = 0 and GS2 = 1]
Level 3 and Level 4 Constraints [Figure: pedigrees with members 1 to 6, genotypes {1 2} and {1 1}, and candidate phase assignments]
Level 3 and Level 4 Constraints • The variables represent PS values and the equations are over Z_2
Analysis of Constraint-Finding Algorithm • Thm. Every solution consistent with the constraint equations is a feasible solution and vice versa. • Steps: • find all constraints, in the form of linear equations over Z_2 • solve the equations by Gaussian elimination • enumerate all feasible haplotype configurations • Exact polynomial time (O(n^3 m^3); genotype elimination: exponential)
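The solve-and-enumerate steps can be sketched as Gaussian elimination over Z_2 followed by enumeration of the free variables. The sketch below is generic: the actual constraint equations of the algorithm are not reproduced in these slides, so the example system is made up, and `solve_gf2` is a hypothetical name. Each equation states that the XOR of some 0/1 variables (standing in for PS values) equals 0 or 1.

```python
from itertools import product

def solve_gf2(equations, num_vars):
    """Return all 0/1 solutions of a linear system over Z_2.

    equations: list of (variable_indices, rhs), meaning the XOR of the
    listed variables equals rhs. Enumeration is exponential in the number
    of free variables, so this is only meant for small illustrations.
    """
    rows = []
    for var_indices, rhs in equations:
        row = [0] * (num_vars + 1)
        for v in var_indices:
            row[v] ^= 1
        row[num_vars] = rhs & 1
        rows.append(row)

    pivots, r = [], 0
    for c in range(num_vars):
        p = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if p is None:
            continue  # no pivot in this column: variable c is free
        rows[r], rows[p] = rows[p], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1

    # A row reading 0 = 1 means the system has no solution.
    if any(row[num_vars] and not any(row[:num_vars]) for row in rows):
        return []

    free = [c for c in range(num_vars) if c not in pivots]
    solutions = []
    for values in product((0, 1), repeat=len(free)):
        x = [0] * num_vars
        for c, v in zip(free, values):
            x[c] = v
        for i, c in enumerate(pivots):
            val = rows[i][num_vars]
            for f in free:
                val ^= rows[i][f] & x[f]
            x[c] = val
        solutions.append(tuple(x))
    return solutions

# Made-up system: x0 + x1 = 1 and x1 + x2 = 0 over Z_2.
print(solve_gf2([([0, 1], 1), ([1, 2], 0)], 3))
```

With one free variable the made-up system has two solutions; in the haplotyping setting each solution would correspond to one feasible zero-recombinant configuration.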
Integer Linear Programming • Combines missing data imputation and haplotype inference. • Regardless of the pedigree structure and the number of recombinants, the number of variables is linear in the problem size. • Implicitly checks the Mendelian consistency for pedigree genotype data with missing alleles, which is itself an NP-complete problem. • Could find all possible optimal solutions. • Solved by a branch-and-bound algorithm. • Effective for practical size problems in terms of time efficiency. • Accurate in terms of missing allele imputation and haplotype inference. Li & Jiang’04a
ILP for MRHC with Missing Data • Define variables. • Define linear constraints. • Define a linear objective function of the variables. • Preprocess constraints (a partial order relationship and some other special relationships). • Apply branch-and-bound strategy to find solutions. • Estimate bounds. • Apply a maximum likelihood approach to multiple optimal solutions.
Formulation • M_j := {m_k}: set of all possible alleles at marker locus j, and let t_j = |M_j|. Example: M_1 = {1, 2}, M_2 = {1, 2} • Define t_j f vars for each paternal allele and t_j m vars for each maternal allele at locus j of individual i: • Individual 4: … [Figure: pedigree with members 1 to 4 and genotypes {1 2} {1 2}, {1 0} {1 2}, {1 1} {1 2}, {1 2} {1 0}]
Formulation: Variables • Define 2 g vars for each paternal allele and maternal allele at locus j for individual i • Var g1 = 0 (or 1) iff paternal allele is copied from father’s paternal (or maternal) allele. Var g2 defined similarly. • Define r vars:
Formulation: Objective Function • Objective function: Subject to Genotype constraints:
Formulation: Constraints • Mendelian law of inheritance constraints (a child i and its father f ): • Constraints for r vars:
A Partial Order Relationship • Denote: • Inequalities with 2 variables: [Figure: graphs on nodes 1 to 11 and 3’]
Forced Variables • Rule 1: • Rule 2: • Rule 3:
Lower and Upper Bounds • Lower bounds • Linear relaxation. • Summation of the number of recombinants in each nuclear family. • Effective for data with large number of recombinants. • Upper bound • Obtained by block-extension algorithm. • Effective for data with small number of recombinants.
Statistical Assessment • E-M algorithm to estimate haplotype frequencies for data that consist of multiple pedigrees.
PedPhase software • Simulated data were generated to compare our algorithms with MRH in terms of efficiency and accuracy. • Three different pedigree structures. • Multiallelic and biallelic data. • Numbers of loci: 10, 25 and 50. • Number of recombinants: 0-4. • 100 runs per data set.
Real Data Analysis • Data set (Gabriel et al.’02) • 93 members, 12 pedigrees (each with 7-8 members); • chromosome 3, 4 regions, each region 1-4 blocks.
Results From ILP on the Whole Dataset [Table: row and column labels not recovered; values 3.82, 4.00, 0.45, 0.034]
What if there are no relatives? • Rely on linkage disequilibrium • Assume that population consists of small number of distinct haplotypes • Haplotypes tend to be similar
Clark’s Haplotyping Algorithm • Clark (1990) Mol Biol Evol 7:111-122 • One of the first haplotyping algorithms • Computationally efficient • Very fast • Today, more accurate alternatives are often available
Clark’s Haplotyping Algorithm • Find homozygous individuals • Initialize a list of known haplotypes • Resolve ambiguous individuals • If possible, use two haplotypes from list • Otherwise, use one known haplotype and augment list • If unphased individuals remain • Assign phase randomly to one individual • Augment haplotype list and continue from previous step
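The steps above can be sketched as follows. This is a minimal illustration, not Clark's published implementation: it assumes biallelic markers with genotypes coded per site as 0 or 1 (homozygous) and 2 (heterozygous), and it resolves each individual with the first known haplotype that fits, which is an arbitrary tie-breaking choice. The names `complement` and `clark` are hypothetical.

```python
import random

def complement(genotype, hap):
    """Haplotype that pairs with hap to give genotype, or None if hap does not fit."""
    other = []
    for g, h in zip(genotype, hap):
        if g == 2:          # heterozygous: partner carries the other allele
            other.append(1 - h)
        elif g == h:        # homozygous: both haplotypes carry allele g
            other.append(g)
        else:
            return None
    return tuple(other)

def clark(genotypes, seed=0):
    """Phase genotypes following the slide's steps; returns (pairs, known list)."""
    rng = random.Random(seed)
    known, phased = [], {}

    def add(h):
        if h not in known:
            known.append(h)

    # Homozygous individuals initialize the list of known haplotypes.
    for i, g in enumerate(genotypes):
        if 2 not in g:
            phased[i] = (tuple(g), tuple(g))
            add(tuple(g))

    while len(phased) < len(genotypes):
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            fits = [(h, complement(g, h)) for h in known]
            fits = [(h, c) for h, c in fits if c is not None]
            if not fits:
                continue
            # Prefer two known haplotypes; otherwise one known plus a new one.
            both_known = [p for p in fits if p[1] in known]
            h, c = (both_known or fits)[0]
            phased[i] = (h, c)
            add(c)
            progress = True
        if not progress:
            # Stuck: assign phase randomly to one individual, then continue.
            i = next(j for j in range(len(genotypes)) if j not in phased)
            g = genotypes[i]
            h = tuple(rng.randint(0, 1) if a == 2 else a for a in g)
            c = complement(g, h)
            phased[i] = (h, c)
            add(h)
            add(c)
    return [phased[i] for i in range(len(genotypes))], known

pairs, known = clark([(0, 0, 0), (2, 2, 0), (2, 1, 0)])
print(pairs)
```

The result depends on the order in which individuals are visited and on the random fallback, which is one reason later statistical methods are often more accurate.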
Haplotyping via Perfect Phylogeny - Model, Algorithms, Empirical studies Dan Gusfield, Ren Hua Chung U.C. Davis Cocoon 2003