
Haplotyping algorithms and structure of human variation

Haplotyping algorithms and structure of human variation. EECS 458 CWRU Fall 2004 Readings: see papers on the course website. Roadmap. Definition: haplotype and haplotype inference Why infer haplotypes Infer haplotypes from pedigree data Most probable haplotype configurations


Presentation Transcript


  1. Haplotyping algorithms and structure of human variation EECS 458 CWRU Fall 2004 Readings: see papers on the course website

  2. Roadmap • Definition: haplotype and haplotype inference • Why infer haplotypes • Infer haplotypes from pedigree data • Most probable haplotype configurations • Haplotype configurations with minimum recombinations • Infer haplotypes from population data • Combinatorial: Clark’s, Perfect Phylogeny • Statistical methods: EM, Bayesian (MCMC) • Infer haplotypes from pooled samples • Haplotype block partition • Tag SNP selection

  3. Genotype and Haplotype [Figure: a paternal chromosome (. . . A A T G C C G C A A . . . G T C . . .) and a maternal chromosome (. . . A G T G C C G C A A . . . T A C . . .), with genotype labels {1 2} {1 2}]

  4. Typical Genotype Data • Observation: • Two alleles for each individual • Chromosome origin for each allele is unknown • Multiple haplotype pairs can fit the observed genotype • Molecular haplotyping is expensive [Figure: an example genotype with alleles A/C at Marker1, G/A at Marker2 and T/C at Marker3, and the possible haplotypes that fit it]
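To make the ambiguity concrete, here is a minimal sketch that enumerates every unordered haplotype pair compatible with an unphased genotype. The function name `haplotype_pairs` and the tuple-based genotype encoding are illustrative assumptions, not part of the lecture material; the three-marker example uses the alleles named on the slide.

```python
from itertools import product

def haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele_a, allele_b) string tuples, one per marker
    (hypothetical encoding chosen for this sketch).
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Keep the first heterozygous marker fixed so mirror-image pairs are not repeated.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.add(("".join(h1), "".join(h2)))
    return sorted(pairs)

# Marker1 = A/C, Marker2 = G/A, Marker3 = T/C
genotype = [("A", "C"), ("G", "A"), ("T", "C")]
for h1, h2 in haplotype_pairs(genotype):
    print(h1, "/", h2)
```

With three heterozygous markers the sketch lists four distinct pairs; in general, m' heterozygous markers give 2^(m'-1) unordered pairs.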

  5. Haplotypes are important! • Phase may determine phenotype • Phase helps exploit linkage disequilibrium • Infer state of neighboring alleles • Phase clarifies identity-by-descent status

  6. Common Uses of Haplotypes • Linkage disequilibrium studies • Summarize genetic variation • Selecting markers to genotype • Identify haplotype tag SNPs • Candidate gene association studies • Test haplotype associations • Help interpret single marker associations • Understanding evolution of human populations

  7. The problem… • Haplotypes are hard to measure directly • X-chromosome in males • Sperm typing • Other molecular techniques • Often, statistical or combinatorial methods are required for reconstruction

  8. Haplotype Inference on population data [Figure: the genotype {1 2} {1 1} {1 2} {1 2} {1 2} {2 2} with m = 6 markers, m' = 4 of them heterozygous, and its 2^m' possible phase assignments, e.g. 1|2 1|1 1|2 1|2 1|2 2|2, 2|1 1|1 1|2 1|2 1|2 2|2, 1|2 1|1 2|1 1|2 1|2 2|2, ...]

  9. Information on Relatives • Number of ambiguous individuals increases rapidly with number of markers • Family information can help, but many ambiguities remain

  10. Haplotype Inference on Pedigrees, Mendelian Law [Figure: pedigree with unphased genotypes ({1 2}, {1 1}, {1 *}, {2 2}) and phased results (2|1, 1|1)]

  11. Haplotype inference on pooled samples • The input contains n pools • Each pool contains k individuals, thus 2k haplotypes, and m markers • At each marker, we are given the allele counts over the k individuals of each pool • The goal is to find the haplotype frequencies • Example: n=3, k=2, m=5
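The shape of the pooled input can be sketched as follows, assuming biallelic markers coded 0/1. The haplotypes below are invented purely for illustration (n=3 pools, k=2 individuals, m=5 markers); in a real study only the per-marker counts are observed, and the haplotypes behind them are what must be inferred.

```python
def pool_allele_counts(pool_haplotypes):
    """Per-marker count of allele 1 among the 2k haplotypes of one pool.

    pool_haplotypes: list of 0/1 tuples, all of length m
    (hypothetical encoding chosen for this sketch).
    """
    m = len(pool_haplotypes[0])
    return [sum(h[j] for h in pool_haplotypes) for j in range(m)]

# Hypothetical hidden data: 3 pools, each with 2 individuals = 4 haplotypes, 5 markers.
pools = [
    [(0, 0, 1, 0, 1), (0, 0, 1, 0, 1), (1, 1, 0, 0, 0), (1, 0, 0, 1, 1)],
    [(0, 0, 1, 0, 1), (1, 1, 0, 0, 0), (1, 1, 0, 0, 0), (0, 1, 1, 1, 0)],
    [(1, 0, 0, 1, 1), (1, 0, 0, 1, 1), (0, 0, 1, 0, 1), (1, 1, 0, 0, 0)],
]

# What the experiment actually reports: one count vector per pool.
observed = [pool_allele_counts(p) for p in pools]
for counts in observed:
    print(counts)
```

Each printed vector is the observable data for one pool; many different haplotype sets can produce the same counts, which is why frequencies rather than individual phases are the target.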

  12. Haplotyping pedigree data: statistical formulation • Statistical formulation: find the most probable haplotype configuration • Need to calculate the probability of a pedigree for every haplotype configuration • Recall that linkage analysis requires the probability of a pedigree, which sums over all possible haplotype configurations

  13. Haplotyping pedigree data: statistical formulation • Thus linkage programs like Genehunter, Allegro and Merlin can compute the most probable haplotypes • But it is time consuming • In addition to exact computation, there are some approximation algorithms, mainly based on importance sampling, e.g. SimWalk. • Still very time consuming; may consider many configurations with very small probabilities

  14. Recombination and combinatorial formulation [Figure: pedigrees with unphased genotypes ({1 2}, {1 1}) and alternative phased configurations (1|2, 2|1, 1|1)]

  15. MRHC Problem • Find a minimum recombinant haplotype configuration from a given pedigree with genotype data. • Assumptions: • Mendelian law (no mutations); • Recombination events are rare. • Well supported by real data. [Figure: input pedigree with unphased genotypes such as {1 1} {1 2} {2 2} ... for each member]

  16. MRHC Problem (cont’d) • PS: parental source of the two alleles at the locus (i.e. phase) • GS: grandparental source of an allele • Haplotype configuration = assignment of PS and GS values. [Figure: output pedigree with phased genotypes such as 1|1 1|2 2|2 ... for each member; members A and B are annotated with PS=0, PS=1, GS2=1 and GS2=0]
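One way to read the GS values: along a single transmitted haplotype, a change of grandparental source between consecutive loci corresponds to a recombination event, so the MRHC objective amounts to minimising the total number of such changes. The sketch below assumes GS is coded 0 or 1 per locus, with `None` for loci where the source is undetermined; both the encoding and the function name are assumptions made for illustration.

```python
def count_recombinants(gs_values):
    """Count switches of grandparental source along one transmitted haplotype.

    gs_values: list with one entry per locus, 0 or 1, or None where the
    source is undetermined (hypothetical encoding). Undetermined loci are
    skipped, so a switch is counted between the nearest determined loci.
    """
    known = [g for g in gs_values if g is not None]
    return sum(1 for a, b in zip(known, known[1:]) if a != b)

print(count_recombinants([0, 0, 1, 1, 0]))   # two switches
print(count_recombinants([0, None, 0, 1]))   # one switch
```

Summing this count over the paternal and maternal haplotypes of every non-founder gives the quantity a minimum recombinant configuration tries to make as small as possible.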

  17. Previous Results • Genotype elimination (O’Connell’00). • For data requiring no recombinant, exhaustive elimination. • Genetic algorithm (Tapadar et al.’00). • Time consuming. • MRH (Qian & Beckmann’02). • Six step rule-based algorithm. • Locus by locus at every step, extremely slow for biallelic (e.g. SNP) markers.

  18. Thm. MRHC is NP-Hard. • Idea: Reduction from a variant of set cover. • First complexity result. • Remains hard for two loci. • Remains hard when no loops. Li & Jiang’03, Doi, Li & Jiang’03

  19. Block-Extension Algorithm • Iterative, heuristic, five steps. • Rules are derived from Mendelian law, the minimum-recombinant (MR) principle, the block concept and some greedy ideas based on the following observations: • Block structures are common in haplotypes. • Double recombination events are rare. • Common haplotype blocks are shared among siblings. • … • Advantages/Disadvantages: time complexity (BE: O(dmn) / MRH: O(2^d m^3 n^2)) Li & Jiang’03

  20. Block-Extension Algorithm [Figure: worked example on a pedigree with multiallelic genotypes, some alleles missing (*)]

  21. Block-Extension Algorithm [Figure: worked example, continued; phased loci are annotated with value pairs, e.g. 1|2(-1,0), 2|3(1,-1), 3|1(1,0)]

  22. Dynamic Programming Algorithms • Locus-based dynamic programming algorithm • Linear time in the number of the members • Applicable to only tree pedigrees • Member-based dynamic programming algorithm • Linear time in the number of the loci • Applicable to general pedigrees with small sizes Doi, Li & Jiang’03

  23. Locus-Based Dynamic Programming [Figure: tree pedigree with members 1-8 and a designated root]

  24. Constraint-Finding Algorithm • Assumptions: • No missing alleles, no errors. • Zero recombinants. • Idea: finding all feasible (i.e. 0-recombinant) haplotype configurations is equivalent to reducing the degree of freedom in PS/GS assignment. Li & Jiang’03

  25. Four Levels of Constraints • Based on Mendelian law (on a single locus): • Level 1: GS constraint • Level 2: PS constraint • Based on 0-recombinant (for a pair of loci): • Level 3: Haplotype constraint • Level 4: Grouping constraint [Figure: phased pedigree with genotypes such as 1|1 1|2 2|2 ...; members A and B are annotated with PS=0 and GS2=1]

  26. Level 3 and Level 4 Constraints [Figure: pedigrees with members 1-6, unphased genotypes {1 2} and {1 1}, and the phased alternatives 1|2 and 2|1]

  27. Level 3 and Level 4 Constraints • The variables represent PS values and the equations are over Z_2

  28. Analysis of Constraint-Finding Algorithm • Thm. Every solution consistent with the constraint equations is a feasible solution and vice versa. • Steps: • find all constraints, in the form of linear equations over Z_2 • solve the equations by Gaussian elimination • enumerate all feasible haplotype configurations • Exact, polynomial time (O(n^3 m^3); genotype elimination: exponential)
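The "solve, then enumerate" steps can be sketched generically as Gaussian elimination over Z_2 followed by enumeration of the free variables. This is not the constraint-finding algorithm itself: deriving the equations from a pedigree is omitted, and the function name and input encoding are assumptions. In the setting above each variable would stand for one PS value.

```python
from itertools import product

def solve_gf2(equations, n_vars):
    """Return all solutions of a linear system over Z_2.

    equations: list of (coeffs, rhs) with coeffs a 0/1 list of length n_vars
    and rhs 0 or 1 (hypothetical encoding chosen for this sketch).
    """
    rows = [list(c) + [r] for c, r in equations]
    pivots, rank = [], 0
    for col in range(n_vars):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Clear this column in every other row (addition over Z_2 is XOR).
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        pivots.append(col)
        rank += 1
    # A row reading 0 = 1 means the constraints are inconsistent.
    if any(row[-1] and not any(row[:-1]) for row in rows):
        return []
    free = [c for c in range(n_vars) if c not in pivots]
    solutions = []
    for values in product([0, 1], repeat=len(free)):
        x = [0] * n_vars
        for c, v in zip(free, values):
            x[c] = v
        for r, c in enumerate(pivots):
            x[c] = rows[r][-1]
            for f in free:
                x[c] ^= rows[r][f] & x[f]
        solutions.append(tuple(x))
    return solutions

# x0 + x1 = 1 and x1 + x2 = 0 over Z_2
print(solve_gf2([([1, 1, 0], 1), ([0, 1, 1], 0)], 3))
```

The elimination itself is polynomial; the enumeration step produces 2^f solutions when f variables remain free, so listing every feasible configuration can still be large even though finding one is cheap.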

  29. Integer Linear Programming • Combines missing data imputation and haplotype inference. • Regardless of the pedigree structure and the number of recombinants, the number of variables is linear in the problem size. • Implicitly checks Mendelian consistency for pedigree genotype data with missing alleles, which is itself an NP-complete problem. • Could find all possible optimal solutions. • Solved by a branch-and-bound algorithm. • Effective for practical size problems in terms of time efficiency. • Accurate in terms of missing allele imputation and haplotype inference. Li & Jiang’04a

  30. ILP for MRHC with Missing Data • Define variables. • Define linear constraints. • Define a linear objective function of the variables. • Preprocess constraints. • Apply branch-and-bound strategy to find solutions (a partial order relationship and some other special relationships). • Estimate bounds. • Apply a maximum likelihood approach to multiple optimal solutions.

  31. Formulation • M_j := {m_k}: the set of all possible alleles at marker locus j; let t_j = |M_j|. • Define t_j f vars for each paternal allele and t_j m vars for each maternal allele at locus j of individual i. • Example: M_1 = {1, 2}, M_2 = {1, 2} [Figure: pedigree with members 1-4 and genotypes such as {1 2}, {1 0} and {1 1}; individual 4 is highlighted]

  32. Formulation: Variables • Define 2 g vars for each paternal allele and maternal allele at locus j for individual i • Var g1 = 0 (or 1) iff paternal allele is copied from father’s paternal (or maternal) allele. Var g2 defined similarly. • Define r vars:

  33. Formulation: Objective Function • Objective function: Subject to Genotype constraints:

  34. Formulation: Constraints • Mendelian law of inheritance constraints (a child i and its father f ): • Constraints for r vars:

  35. A Partial Order Relationship • Denote: • Inequalities with 2 variables: [Figure: graph on nodes 1-11, including 3 and 3’]

  36. Forced Variables • Rule 1: • Rule 2: • Rule 3:

  37. Lower and Upper Bounds • Lower bounds • Linear relaxation. • Summation of the number of recombinants in each nuclear family. • Effective for data with large number of recombinants. • Upper bound • Obtained by block-extension algorithm. • Effective for data with small number of recombinants.

  38. Statistical Assessment • E-M algorithm to estimate haplotype frequencies for data that consist of multiple pedigrees.
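As a rough illustration of the E-M idea, the sketch below estimates haplotype frequencies when each sampled unit comes with a list of candidate haplotype pairs (for example, several equally good configurations). It assumes independent units and random mating, which is the textbook setting for unrelated individuals rather than the pedigree-aware procedure used in the course; the function name and data layout are hypothetical.

```python
from collections import defaultdict

def em_haplotype_frequencies(candidates, n_iter=50):
    """EM estimate of haplotype frequencies.

    candidates: one list per unit, holding the candidate (hap_a, hap_b)
    pairs for that unit (hypothetical layout chosen for this sketch).
    """
    haps = {h for pairs in candidates for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior weight of each candidate pair under current frequencies.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n = 2 * len(candidates)
        freq = {h: counts[h] / n for h in haps}
    return freq

data = [
    [("AB", "AB")],
    [("ab", "ab")],
    [("AB", "ab"), ("Ab", "aB")],   # ambiguous double heterozygote
]
freq = em_haplotype_frequencies(data)
print({h: round(p, 3) for h, p in sorted(freq.items())})
```

In this toy input the two unambiguous units pull the ambiguous one towards the AB/ab resolution, so the estimated frequencies of Ab and aB shrink towards zero over the iterations.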

  39. PedPhase software • Simulated data were generated to compare our algorithms with each other and with MRH in terms of efficiency and accuracy. • Three different pedigree structures. • Multiallelic and biallelic data. • Numbers of loci: 10, 25 and 50. • Number of recombinants: 0-4. • 100 runs per data set.

  40. Pedigree Structures

  41. Accuracy Results of BE Algorithm

  42. Efficiency Results

  43. More Results from ILP

  44. Real Data Analysis • Data set (Gabriel et al.’02) • 93 members, 12 pedigrees (each with 7-8 members); • chromosome 3, 4 regions, each region 1-4 blocks.

  45. Common Haplotypes & Frequencies

  46. Results From ILP on the Whole Dataset 3.82 4.00 0.45 0.034

  47. What if there are no relatives? • Rely on linkage disequilibrium • Assume that population consists of small number of distinct haplotypes • Haplotypes tend to be similar

  48. Clark’s Haplotyping Algorithm • Clark (1990) Mol Biol Evol 7:111-122 • One of the first haplotyping algorithms • Computationally efficient • Very fast • Today, more accurate alternatives are often available

  49. Clark’s Haplotyping Algorithm • Find homozygous individuals • Initialize a list of known haplotypes • Resolve ambiguous individuals • If possible, use two haplotypes from list • Otherwise, use one known haplotype and augment list • If unphased individuals remain • Assign phase randomly to one individual • Augment haplotype list and continue from previous step
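The steps above can be sketched as follows, assuming biallelic sites coded 0 or 1 when homozygous and 2 when heterozygous. Two simplifications are assumptions of this sketch rather than part of the slide: the list is seeded from fully homozygous individuals only, and when no individual can be resolved the phase is assigned in a fixed arbitrary way instead of at random, so that the output is reproducible.

```python
def complement(genotype, hap):
    """Haplotype that pairs with `hap` to give `genotype`, or None if incompatible."""
    other = []
    for g, h in zip(genotype, hap):
        if g == 2:            # heterozygous site: partner carries the other allele
            other.append(1 - h)
        elif g == h:          # homozygous site: both haplotypes carry this allele
            other.append(h)
        else:
            return None
    return tuple(other)

def clark(genotypes):
    """Greedy phasing in the style of Clark's algorithm; returns one pair per individual."""
    known = []                           # haplotype list, in order of discovery
    phase = [None] * len(genotypes)

    def add(h):
        if h not in known:
            known.append(h)

    # Step 1: homozygous individuals seed the list of known haplotypes.
    for i, g in enumerate(genotypes):
        if 2 not in g:
            phase[i] = (tuple(g), tuple(g))
            add(tuple(g))

    while any(p is None for p in phase):
        progress = False
        for i, g in enumerate(genotypes):
            if phase[i] is not None:
                continue
            options = [(h, complement(g, h)) for h in known]
            options = [(h, c) for h, c in options if c is not None]
            if not options:
                continue
            # Prefer a resolution that uses two haplotypes already in the list.
            options.sort(key=lambda hc: hc[1] not in known)
            h, c = options[0]
            phase[i] = (h, c)
            add(c)
            progress = True
        if not progress:
            # Nobody can be resolved: phase one individual arbitrarily and continue.
            i = phase.index(None)
            g = genotypes[i]
            h = tuple(0 if x == 2 else x for x in g)
            phase[i] = (h, complement(g, h))
            add(h)
            add(phase[i][1])
    return phase

genotypes = [(0, 1, 0), (2, 1, 2), (2, 2, 1)]
for g, (h1, h2) in zip(genotypes, clark(genotypes)):
    print(g, "->", h1, h2)
```

The result depends on the order in which individuals are visited, which is one reason the method can give different answers on the same data and why more accurate alternatives are often preferred today.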

  50. Haplotyping via Perfect Phylogeny - Model, Algorithms, Empirical studies Dan Gusfield, Ren Hua Chung U.C. Davis Cocoon 2003
