
SplittingHeirs: Inferring Haplotypes by Optimizing Resultant Dense Graphs

Sharlee Climer, Alan R. Templeton, and Weixiong Zhang. ACM-BCB, Niagara Falls, August 2010.


Presentation Transcript


  1. SplittingHeirs: Inferring Haplotypes by Optimizing Resultant Dense Graphs. Sharlee Climer, Alan R. Templeton, and Weixiong Zhang. ACM-BCB, Niagara Falls, August 2010

  2. Overview • Introduction • Definition of haplotype inference problem • Previous approaches • SplittingHeirs • Experimental results

  3. Introduction • Only 0.1% of human DNA has variation • Most of this variation is due to Single Nucleotide Polymorphisms (SNPs) • Most SNPs have only two variants, or alleles, within a population • Broad definition of haplotype: A set of alleles for a given set of SNPs in relatively close proximity on a chromosome Image source: http://www.dnabaser.com/articles/SNP/SNP-Single-nucleotide-polymorphism.png

  4. Introduction • DNA is transcribed to produce RNA • RNA is translated, ultimately producing proteins • Variation in non-coding regions might have an effect on regulation • SNPs throughout the genome may be of interest Image source: http://www.cytochemistry.net/cell-biology/ribosome.htm

  5. Introduction • Humans are diploid • Pairs of chromosomes • Common sequencing produces a meld of the two haplotypes, referred to as a genotype • Computational methods are used to infer a pair of haplotypes from a genotype • Phasing the genotype [Figure: a genotype at SNP1 and SNP2 with two candidate phasings, G T A C + C T A G or C T A C + G T A G]

  6. Importance of accuracy when inferring haplotypes from genotypes • Frequently an early step in expensive and vitally important studies [Figure: alleles at SNP1 and SNP2]

  7. Introduction • Possible to identify the separate haplotypes directly • Only feasible for very small studies • Useful for testing accuracy of computational methods • Andres et al. [Genet. Epi. 2007] found that computational methods had poor accuracy and that their confidence levels were error prone • PHASE [Stephens et al., AJHG 2001] • fastPHASE [Scheet and Stephens, AJHG 2006] • HAP [Halperin and Eskin, Bioinformatics 2004] • GERBIL [Kimmel and Shamir, PNAS 2005] • Errors in confidence levels suggest that the models might not fully capture biological properties

  8. Problem Definition • Let ‘0’ and ‘1’ represent the two possible alleles for a given SNP • Haplotype represented by a string of binary values • Genotype for a pair of haplotypes • ‘0’ if both alleles are ‘0’ • ‘1’ if both alleles are ‘1’ • ‘2’ if heterozygous • Example: haplotypes G T A C = 1 1 0 0 and C T A G = 0 1 0 1 give genotype 2 1 0 2
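The encoding rule on this slide can be sketched in a few lines of Python. The function name `genotype` is a hypothetical helper chosen for illustration; it is not taken from the SplittingHeirs software.

```python
def genotype(hap_a: str, hap_b: str) -> str:
    """Combine two binary haplotype strings into a 0/1/2 genotype string."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Homozygous sites keep their allele; heterozygous sites become '2'.
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


# The slide's example pair: G T A C = 1100 and C T A G = 0101.
print(genotype("1100", "0101"))
```

For the slide's pair this yields the genotype string 2102, matching the example.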

  9. Problem Definition • For k heterozygous sites, there are 2^(k-1) feasible solutions • Not apparent which solution is more likely than another • Population-level characteristics • There tend to be relatively few unique haplotypes • There tend to be clusters of haplotypes that are similar to each other • Some haplotypes are relatively common
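As a rough illustration of the count of feasible solutions, the sketch below enumerates every unordered haplotype pair consistent with a 0/1/2 genotype string. The name `phasings` is hypothetical, and brute-force enumeration is used only to make the count concrete; it is not the method described in the talk.

```python
from itertools import product


def phasings(genotype: str) -> list[tuple[str, str]]:
    """List all unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    if not het:
        # No heterozygous sites: both haplotypes equal the genotype.
        return [(genotype, genotype)]
    results = []
    # Fix the first heterozygous site to '0' on the first haplotype so that
    # mirror-image pairs are not counted twice.
    for bits in product("01", repeat=len(het) - 1):
        first = list(genotype)
        second = list(genotype)
        for site, bit in zip(het, ("0",) + bits):
            first[site] = bit
            second[site] = "1" if bit == "0" else "0"
        results.append(("".join(first), "".join(second)))
    return results


for pair in phasings("2102"):
    print(pair)
```

A genotype with k = 2 heterozygous sites, such as 2102, has two feasible pairs; a genotype with five heterozygous sites has sixteen.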

  10. Problem Definition Given a set of genotypes drawn from a population: 1) Find the set of haplotypes that exist in the set 2) For each genotype, determine the pair of haplotypes that is most likely to exist in the given individual Image source: http://www.samepoint.com/blog/wp-content/uploads/2009/04/blog_group_of_people_1.jpg

  11. Example • Example problem • 5 individuals • 8 SNP sites • Display solutions as graphs • Each node represents a unique haplotype • Edge weight • Measure of difference between haplotypes • Set equal to the number of sites that differ between the haplotypes • Edges with smallest distances are shown g1: 1111 0001 g2: 2212 0202 g3: 2220 2102 g4: 2222 2121 g5: 2022 0222
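The edge weight described on this slide is the Hamming distance between two haplotype strings. The sketch below computes it and, as one possible reading of "edges with smallest distances", keeps for each haplotype the edges to its closest neighbors. The function names and the three sample haplotypes are illustrative assumptions; only 11110001 (the unambiguous genotype g1) comes from the slide.

```python
def hamming(hap_a: str, hap_b: str) -> int:
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(hap_a, hap_b))


def nearest_neighbor_edges(haplotypes: list[str]) -> list[tuple[str, str, int]]:
    """For each unique haplotype, keep the edges to its closest neighbors."""
    unique = sorted(set(haplotypes))
    edges = set()
    for hap in unique:
        others = [o for o in unique if o != hap]
        if not others:
            continue
        best = min(hamming(hap, o) for o in others)
        for o in others:
            if hamming(hap, o) == best:
                # Store each undirected edge once, endpoints in sorted order.
                edges.add((min(hap, o), max(hap, o), best))
    return sorted(edges)


# Hypothetical haplotype set for illustration.
haps = ["11110001", "11100001", "00010000"]
for edge in nearest_neighbor_edges(haps):
    print(edge)
```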

  12. Example • Solution found by: • Clark’s Subtraction Method [Mol. Biol. Evol. 1990] • Pure Parsimony [Gusfield, CPM’03] • EM [Excoffier and Slatkin, Mol. Biol. Evol. 1995] • 5 unique haplotypes • Haplotypes are not very similar to each other g1: 1111 0001 g2: 2212 0202 g3: 2220 2102 g4: 2222 2121 g5: 2022 0222

  13. Example • No Perfect Phylogeny solution • Solution found by HAP • 6 unique haplotypes • Haplotypes are slightly more similar to each other g1: 1111 0001 g2: 2212 0202 g3: 2220 2102 g4: 2222 2121 g5: 2022 0222

  14. Example • Solution found by PHASE • 9 unique haplotypes • Haplotypes are more similar to each other g1: 1111 0001 g2: 2212 0202 g3: 2220 2102 g4: 2222 2121 g5: 2022 0222

  15. Example • PHASE favors pair-wise similarities • Essentially evaluating a nearest-neighbor graph g1: 1111 0001 g2: 2212 0202 g3: 2220 2102 g4: 2222 2121 g5: 2022 0222

  16. SplittingHeirs • SplittingHeirs favors cluster-wide similarities, as well as reduced cardinality • Cast as a Mixed Integer Linear Program (MIP) Minimize: • where • d_i = the weight of edge i • h = the cardinality of the haplotype set • u = a weighting factor

  17. SplittingHeirs • Enforce cluster-wide similarities by requiring a minimum density of edges in the graph • Additional constraint: • where • e = number of edges • a is a configurable parameter • Can be decreased for a highly diverse sample • Can be increased for a sample with low diversity
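The objective and constraint formulas themselves do not survive in this transcript, so the sketch below does not attempt to reproduce them. It only computes, for a candidate set of phased pairs, the quantities the slides name: the edge weights d_i, the cardinality h, and the edge count e. The function name `graph_summary` and the `max_dist` cutoff that decides which haplotype pairs count as edges are assumptions made for illustration, not part of the authors' model.

```python
from itertools import combinations


def graph_summary(pairs: list[tuple[str, str]], max_dist: int) -> dict[str, int]:
    """Summarize the haplotype graph implied by a candidate phasing.

    pairs    -- one (haplotype, haplotype) tuple per individual
    max_dist -- ASSUMED cutoff: pairs of haplotypes differing at more
                sites than this are not treated as edges
    """
    haps = sorted({hap for pair in pairs for hap in pair})
    weights = []
    for a, b in combinations(haps, 2):
        dist = sum(x != y for x, y in zip(a, b))  # Hamming distance
        if dist <= max_dist:
            weights.append(dist)
    return {"h": len(haps), "e": len(weights), "total_weight": sum(weights)}


# Hypothetical candidate solution for two individuals.
candidate = [("11110001", "11110001"), ("11100001", "00010000")]
print(graph_summary(candidate, max_dist=4))
```

Comparing these summaries across candidate phasings gives a feel for the trade-off the slides describe between fewer unique haplotypes and a denser, lower-weight graph.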

  18. Example • Solution found by SplittingHeirs • 8 unique haplotypes • Haplotypes are quite similar to each other g1: 1111 0001 g2: 2212 0202 g3: 2220 2102 g4: 2222 2121 g5: 2022 0222

  19. Results Tested on 7 sets of haplotype data for which the true phase is known • n is the number of individuals • m is the number of sites • # Ambiguous is the number of genotypes that have more than one feasible solution

  20. Results

  21. Results

  22. Conclusions • Introduced a biologically intuitive model that optimizes cluster-wide similarities and reduced cardinality • Globally optimal solutions can be computed for small regions • Candidate locus studies • Future work • Speed up computation • Use model to guide an approximation method Image source: http://farm3.static.flickr.com/2268/2255581637_a59a956bfe.jpg

  23. Acknowledgments • Olin Fellowship • NIH grants • P50-GM065509 • R01-GM087194A2 • U01-GM063340 • NSF grants • IIS-053557 • DBI-0743797 • Alzheimer’s Association grant • Thanks to: • Taylor Maxwell • Gerold Jaeger
