Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion

Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang


Presentation Transcript


  1. Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion • Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

  2. Outline • Introduction • The MCTS Model • Our Algorithms • Experimental Result

  3. Outline • Introduction • The MCTS Model • Our Algorithms • Experimental Result

  4. Motivation • With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in the dbSNP database. • We aim to select a subset of informative SNPs (i.e. tagSNPs), in order to • Save the cost of genotyping all SNPs. • Perform disease association mapping.

  5. TagSNP Selection • Haplotype-based methods • Require the information of the phased multilocus haplotypes • Haplotype-free methods • Do not require haplotype information • TagSNP selection via r² linkage disequilibrium statistics

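  6. r² Linkage Disequilibrium Statistics • Given a pair of genetic markers 1 and 2 with alleles A/a and B/b, the r² statistic is $r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\, p_{\cdot B}(1 - p_{\cdot B})}$, where $p_{AB}$ is the frequency of haplotype AB and $p_{A\cdot}$, $p_{\cdot B}$ are the frequencies of alleles A and B. • If r² is no less than a given threshold r0, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).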
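
As a minimal sketch of the r² formula on this slide, the Python below estimates the three frequencies from a list of phased two-marker haplotypes and applies the threshold test. The function names, the string encoding (`"AB"`, `"Ab"`, `"aB"`, `"ab"`) and the availability of phased haplotypes are assumptions made for illustration; in a haplotype-free setting the same frequencies would be estimated from genotype data instead.

```python
from collections import Counter


def r_squared(haplotypes):
    """r^2 between two biallelic markers from phased haplotypes such as "AB" or "ab"."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    p_ab = counts["AB"] / n                   # frequency of haplotype AB
    p_a = (counts["AB"] + counts["Ab"]) / n   # frequency of allele A at marker 1
    p_b = (counts["AB"] + counts["aB"]) / n   # frequency of allele B at marker 2
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    return (p_ab - p_a * p_b) ** 2 / denom


def can_tag(haplotypes, r0):
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(haplotypes) >= r0


sample = ["AB"] * 4 + ["ab"] * 4   # the two markers are perfectly correlated
print(r_squared(sample), can_tag(sample, r0=0.8))
```

Because r² is symmetric in the two markers, a single test covers both tagging directions.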

  7. The TagSNP Selection Problem • Instance: a set V of SNP markers and LD patterns E = {(vj1, vj2) | r²(vj1, vj2) is no less than a given threshold r0, where vj1 and vj2 are in V}. • Feasible solution: a subset V' of V such that, for any v in V, there exists a v' in V' with r²(v, v') no less than r0. • Objective: minimize |V'|. • If we define G = (V, E), a tagSNP set is equivalent to a dominating set of G. • This model was introduced by Carlson et al., 2004. It is a simple and popular tagging method. [Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population.]

  8. Outline • Introduction • The MCTS Model • Our Algorithms • Experimental Result

  9. r² Statistics in Single and Admixed Populations • SNP 1: A, a • SNP 2: B, b • Haplotype frequencies (row and column totals are the allele frequencies):

     Population 1 (r² = 0)
              B        b        total
     A        0.9025   0.0475   0.95
     a        0.0475   0.0025   0.05
     total    0.95     0.05

     Population 2 (r² = 0)
              B        b        total
     A        0.0025   0.0475   0.05
     a        0.0475   0.9025   0.95
     total    0.05     0.95

     Admixed population: 50% population 1, 50% population 2 (r² = 0.6561)
              B        b        total
     A        0.4525   0.0475   0.5
     a        0.0475   0.4525   0.5
     total    0.5      0.5
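
The numbers on this slide can be checked directly. The sketch below takes each population as a tuple of haplotype frequencies (AB, Ab, aB, ab), mixes the two populations in equal proportion, and evaluates r² for all three. The names `r_squared`, `admix`, `pop_x` and `pop_y` are illustrative and not taken from the slides.

```python
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    """r^2 from the four haplotype frequencies of two biallelic SNPs."""
    p_A = p_AB + p_Ab          # frequency of allele A at SNP 1
    p_B = p_AB + p_aB          # frequency of allele B at SNP 2
    d = p_AB - p_A * p_B       # deviation from independence
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))


def admix(pop_x, pop_y, weight=0.5):
    """Haplotype frequencies of a mixture: weight * pop_x + (1 - weight) * pop_y."""
    return tuple(weight * x + (1 - weight) * y for x, y in zip(pop_x, pop_y))


# Haplotype frequencies (AB, Ab, aB, ab) of the two single populations.
pop_x = (0.0025, 0.0475, 0.0475, 0.9025)
pop_y = (0.9025, 0.0475, 0.0475, 0.0025)
mixed = admix(pop_x, pop_y)

for name, pop in (("pop_x", pop_x), ("pop_y", pop_y), ("mixed", mixed)):
    print(name, round(r_squared(*pop), 4))
```

Up to floating-point rounding, both single populations give r² = 0 while the 50/50 mixture gives r² ≈ 0.6561, matching the values on the slide.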

  10. TagSNP Selection across Populations • A pair of SNPs • may have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories, • yet may show strong LD in the admixed population. • TagSNPs picked from the admixed population or from one of the populations might not be sufficient to capture the variations in all populations.

  11. The MCTS Model • Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum set of tagSNPs that tags every marker in each of the populations. • This problem is called the minimum common tagSNP selection problem (MCTS). [Figure: (a) SNP markers and their LD patterns in two populations; (b) the minimum tagSNP set for these two populations.]
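
To make the feasibility condition of MCTS concrete, here is a small sketch that checks whether a candidate set tags every marker in every population. Each population is represented as a dictionary mapping a marker to the markers in LD with it (r² at or above the threshold) in that population; this encoding, the function name and the toy data are assumptions made for illustration.

```python
def is_common_tag_set(populations, tags):
    """True if every marker of every population is in `tags` or in LD with a member of it."""
    tags = set(tags)
    for ld in populations:
        for marker, neighbours in ld.items():
            if marker not in tags and not (neighbours & tags):
                return False
    return True


# Toy LD patterns (hypothetical): a path 1-2-3 in one population,
# and two separate pairs 1-2 and 3-4 in the other.
pop_1 = {1: {2}, 2: {1, 3}, 3: {2}}
pop_2 = {1: {2}, 2: {1}, 3: {4}, 4: {3}}

print(is_common_tag_set([pop_1], {2}))            # marker 2 alone tags pop_1
print(is_common_tag_set([pop_1, pop_2], {2}))     # but not pop_2
print(is_common_tag_set([pop_1, pop_2], {2, 3}))  # {2, 3} tags both
```

The single-population tagSNP selection problem is the special case with one dictionary in the list.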

  12. Outline • Introduction • The MCTS Model • Our Algorithms • Experimental Result

  13. Our Algorithms • The MCTS problem can be easily formulated as an integer linear program. • We first apply some data reduction rules, then use one of the following algorithms • A greedy algorithm: GreedyTag • A Lagrangian relaxation algorithm: LRTag • We calculate • the upper bound: the number of tagSNPs obtained by our algorithms • the lower bound: a bound on the minimum number of tagSNPs needed • GreedyTag_lb • LRTag_lb
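
The slide does not spell out the integer linear program. One natural way to write it, under notation assumed here, uses a binary variable $x_v$ for each marker $v$ (1 if $v$ is chosen as a tagSNP), $V_i$ for the markers occurring in population $i$, and $N_i[v]$ for the set consisting of $v$ together with the markers whose r² with $v$ in population $i$ is at least r0:

```latex
\begin{aligned}
\min \quad & \sum_{v \in V} x_v \\
\text{s.t.} \quad & \sum_{u \in N_i[v]} x_u \ge 1
    && \text{for every population } i \text{ and every marker } v \in V_i, \\
& x_v \in \{0, 1\} && \text{for every } v \in V .
\end{aligned}
```

Each constraint corresponds to one occurrence of a marker in one population, which matches the set-cover view used by the data reduction rules.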

  14. Data Reduction Rules • Pick all irreplaceable markers • Example: marker 7 • Remove less informative markers • Example: among markers 1, 2 and 6, remove markers 1 and 2. • Remove less stringent occurrences • Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.
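
The slide names the three rules without defining them. The sketch below reads them as the classical set-cover reductions, which is an assumption: a marker is irreplaceable if some occurrence can be tagged only by it, a marker is less informative if everything it tags is also tagged by another marker, and an occurrence is less stringent if tagging some other occurrence always tags it too. An occurrence is a (population, marker) pair, and `tagged` maps each marker to the occurrences it tags; all names and the toy data are hypothetical.

```python
def build_taggers(tagged):
    """Invert marker -> occurrences into occurrence -> markers able to tag it."""
    taggers = {}
    for marker, occs in tagged.items():
        for occ in occs:
            taggers.setdefault(occ, set()).add(marker)
    return taggers


def reduce_once(tagged):
    """Apply the three rules once; return (forced markers, reduced instance)."""
    # Rule 1: an occurrence with a single possible tagger forces that marker.
    forced = {next(iter(ms)) for ms in build_taggers(tagged).values() if len(ms) == 1}
    done = set()
    for marker in forced:
        done |= tagged[marker]
    rest = {m: occs - done for m, occs in tagged.items() if m not in forced}

    # Rule 2: drop a marker whose remaining occurrences are contained in another's
    # (ties are broken by marker id so that one of two equal markers survives).
    def less_informative(m):
        return any(rest[m] < rest[o] or (rest[m] == rest[o] and o < m)
                   for o in rest if o != m)
    kept = {m: occs for m, occs in rest.items() if occs and not less_informative(m)}

    # Rule 3: drop an occurrence whose taggers include all taggers of another one.
    taggers = build_taggers(kept)

    def less_stringent(o):
        return any(taggers[p] < taggers[o] or (taggers[p] == taggers[o] and p < o)
                   for p in taggers if p != o)
    needed = {o for o in taggers if not less_stringent(o)}
    return forced, {m: occs & needed for m, occs in kept.items() if occs & needed}


# Toy instance: occurrences are (population index, marker) pairs.
tagged = {
    1: {(0, 1), (0, 2), (1, 1), (1, 2)},
    2: {(0, 1), (0, 2), (0, 3), (1, 1), (1, 2)},
    3: {(0, 2), (0, 3), (1, 3), (1, 4)},
    4: {(1, 3), (1, 4)},
}
forced_1, reduced_1 = reduce_once(tagged)
forced_2, reduced_2 = reduce_once(reduced_1)
print(forced_1, reduced_1)
print(forced_2, reduced_2)
```

On this toy instance the first pass removes markers 1 and 4 and most occurrences, and the second pass picks markers 2 and 3 as irreplaceable, so repeated reduction alone solves it.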

  15. A Greedy Algorithm • Apply data reduction rules. • If no un-tagged occurrence remains, output the tagSNPs. • Otherwise, pick the marker which tags the most of the remaining occurrences as a tagSNP, and repeat.
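
The greedy loop on this slide can be sketched as follows. This is an illustrative version rather than the authors' GreedyTag implementation: it omits the data reduction step, breaks ties by the smallest marker id, and represents each population as a dictionary from a marker to the markers in LD with it. All of these choices, and the toy data, are assumptions.

```python
def occurrences_tagged(populations):
    """Map each marker to the (population index, marker) occurrences it tags."""
    tagged = {}
    for i, ld in enumerate(populations):
        for marker, neighbours in ld.items():
            tagged.setdefault(marker, set()).add((i, marker))  # a marker tags itself
            for other in neighbours:
                tagged.setdefault(other, set()).add((i, marker))
    return tagged


def greedy_tag(populations):
    """Repeatedly pick the marker that tags the most still-untagged occurrences."""
    tagged = occurrences_tagged(populations)
    remaining = set().union(*tagged.values())
    chosen = []
    while remaining:
        # max() keeps the first maximum, so ties go to the smallest marker id.
        best = max(sorted(tagged), key=lambda m: len(tagged[m] & remaining))
        chosen.append(best)
        remaining -= tagged[best]
    return chosen


# Toy LD patterns (hypothetical) for two populations.
pop_1 = {1: {2}, 2: {1, 3}, 3: {2}}
pop_2 = {1: {2}, 2: {1}, 3: {4}, 4: {3}}
print(greedy_tag([pop_1, pop_2]))
```

Here marker 2 is picked first because it tags five of the seven occurrences, and marker 3 then covers the remaining two.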

  16. A Lagrangian Relaxation Algorithm • Introduce the Lagrangian multipliers λ and obtain the relaxed integer program. • Initialize λ, set iteration := 0, and obtain the tagSNP set based on λ. • While iteration++ < max_iter: update λ towards the subgradient direction, then update the tagSNP set based on λ. • Output the tagSNPs.
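
For readers unfamiliar with the technique, here is a generic subgradient sketch of Lagrangian relaxation for the covering formulation, with one multiplier per occurrence. It is not the authors' LRTag: the step size 1/(iteration + 1), the repair heuristic that turns multipliers into a feasible tag set, and all names are assumptions. `tagged` maps each marker to the (population, marker) occurrences it tags.

```python
import math


def lagrangian_tag(tagged, max_iter=50):
    """Return (feasible tag set, integer lower bound) via subgradient optimisation."""
    occurrences = sorted(set().union(*tagged.values()))
    lam = {o: 0.0 for o in occurrences}          # Lagrangian multipliers, kept >= 0
    best_tags, best_lb = None, 0.0
    for it in range(max_iter):
        # Reduced cost of a marker once the covering constraints are relaxed.
        cost = {m: 1.0 - sum(lam[o] for o in occs) for m, occs in tagged.items()}
        picked = {m for m, c in cost.items() if c < 0}   # relaxed optimum
        lower = sum(lam.values()) + sum(c for c in cost.values() if c < 0)
        best_lb = max(best_lb, lower)

        # Repair: scan markers by reduced cost, keep those tagging something new.
        tags, covered = set(), set()
        for m in sorted(tagged, key=lambda m: (cost[m], m)):
            if not tagged[m] <= covered:
                tags.add(m)
                covered |= tagged[m]
        if best_tags is None or len(tags) < len(best_tags):
            best_tags = tags

        # Subgradient step: raise multipliers of untagged occurrences, lower over-tagged ones.
        for o in occurrences:
            g = 1 - sum(1 for m in picked if o in tagged[m])
            lam[o] = max(0.0, lam[o] + g / (it + 1))
    return best_tags, math.ceil(best_lb - 1e-9)


# Toy instance: occurrences are (population index, marker) pairs.
tagged = {
    1: {(0, 1), (0, 2), (1, 1), (1, 2)},
    2: {(0, 1), (0, 2), (0, 3), (1, 1), (1, 2)},
    3: {(0, 2), (0, 3), (1, 3), (1, 4)},
    4: {(1, 3), (1, 4)},
}
tags, lower_bound = lagrangian_tag(tagged)
print(sorted(tags), lower_bound)
```

The size of the returned tag set is an upper bound and the rounded-up Lagrangian value is a lower bound on the optimum, which is the kind of gap the experimental slides report; on this toy instance the two bounds coincide.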

  17. Outline • Introduction • The MCTS Model • Our Algorithms • Experimental Result

  18. Experimental Result • We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005). • There are four populations in HapMap data. • CEU: European descendants. • CHB: Chinese, Beijing. • JPT: Japanese, Tokyo. • YRI: Yoruba people of Ibadan, Nigeria. • We get tagSNPs for the following two datasets: • ENCODE regions • all 10 ENCODE regions • 10,859 markers. • Human genome • chromosomes 1 – 22 • 2,862,454 markers.

  19. Experimental Result for ENCODE Regions • We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS). • MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs together for multiple populations. • The gap between LRTag_lb and LRTag • r² = 0.5: at most two for each region, six in total for all regions. • r² = 0.8: there is no gap.

  20. Experimental Result for Human Genome • The gap between LRTag_lb and LRTag for the whole genome • 2,862,454 SNPs in total • r² = 0.5: 1,061 • r² = 0.8: 142 • The numbers of tagSNPs selected by our algorithms are almost optimal.

  21. Running Time of Our Algorithms • Running environment • a 32-processor SGI Altix 4700 supercomputer system • 1.6 GHz CPU • 64 GB shared memory • 15 threads in parallel. • Running time • r² = 0.5: • ENCODE regions: < 7 seconds for each region, < 1 minute for all regions. • Human genome: < 12 minutes for each chromosome, < 1 hour for the genome. • r² > 0.5: our algorithms run faster than the times above.

  22. Outline • Introduction • The MCTS Model • Our Algorithms • Experimental Result

  23. Thanks for your time and attention!
