Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion
Authors: Lan Liu, Yonghui Wu, Stefano Lonardi and Tao Jiang

Outline
• Introduction
• The MCTS Model
• Our Algorithms
• Experimental Results

Motivation
• With the rapid development of genotyping technologies, the dbSNP database now holds more than 10 million verified single-nucleotide polymorphisms (SNPs).
• We aim to select a subset of informative SNPs (i.e., tagSNPs) in order to
  • save the cost of genotyping all SNPs;
  • perform disease association mapping.

TagSNP Selection
• Haplotype-based methods
  • Require the information of the phased multilocus haplotypes
• Haplotype-free methods
  • Do not require haplotype information
  • TagSNP selection via r² linkage disequilibrium statistics

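r² Linkage Disequilibrium Statistics
• Given a pair of genetic markers 1 and 2, with alleles A/a at marker 1 and B/b at marker 2.
• r² statistic:
$$ r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\, p_{\cdot B}(1 - p_{\cdot B})} $$
  where p_AB is the frequency of haplotype AB, p_A· is the frequency of allele A at marker 1, and p_·B is the frequency of allele B at marker 2.
• If r² is no less than a given threshold r0, marker 1 can tag marker 2, and marker 2 can tag marker 1.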
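The r² statistic on this slide can be evaluated directly from three frequencies. The sketch below is only an illustration; the function names `r_squared` and `can_tag`, and the default threshold of 0.8, are assumptions chosen for the example and are not taken from the slides.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A (marker 1) and B (marker 2)
    p_a:  frequency of allele A at marker 1
    p_b:  frequency of allele B at marker 2
    """
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        # A monomorphic marker makes the denominator vanish.
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    d = p_ab - p_a * p_b  # deviation from independence between the two markers
    return d * d / denom


def can_tag(p_ab: float, p_a: float, p_b: float, r0: float = 0.8) -> bool:
    """True when r^2 is no less than the threshold r0 (r0 = 0.8 is an assumed default)."""
    return r_squared(p_ab, p_a, p_b) >= r0


# Hypothetical frequencies, chosen only to exercise the formula.
weak = r_squared(0.4, 0.5, 0.5)     # d = 0.15, so r^2 = 0.0225 / 0.0625
perfect = r_squared(0.5, 0.5, 0.5)  # A always travels with B
```

Because r² is symmetric in the two markers, a single test decides tagging in both directions.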

The TagSNP Selection Problem
[Figure: (a) SNP markers and their LD patterns in a population; (b) tagSNPs for the population]
• Instance: a set V of SNP markers and LD patterns E = {(v_j1, v_j2) | r²(v_j1, v_j2) is no less than a given threshold r0, and v_j1 and v_j2 are in V}.
• Feasible solution: a subset V' of V such that for any v in V there exists a v' in V' with r²(v, v') no less than r0.
• Objective: minimize |V'|.
• If we define G = (V, E), a tagSNP set is equivalent to a dominating set on G.
• This model was introduced by Carlson et al., 2004. It is a simple and popular tagging method.
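The dominating-set view can be made concrete with a few lines of Python. This is a minimal sketch under assumed inputs: the marker ids, the pairwise r² values, and the helper names `build_ld_graph` and `is_tag_set` are all hypothetical.

```python
def build_ld_graph(markers, r2, r0):
    """Return closed neighbourhoods: adj[v] holds v plus every marker in LD with v.

    r2 maps a pair (u, v) to its r^2 value; pairs that are absent are treated
    as being below the threshold.
    """
    adj = {v: {v} for v in markers}  # every marker tags itself
    for (u, v), value in r2.items():
        if value >= r0:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_tag_set(adj, tags):
    """A tag set is feasible when every marker has a tag in its closed neighbourhood."""
    tags = set(tags)
    return all(adj[v] & tags for v in adj)


# Hypothetical data: four markers on a line, the last pair in weak LD.
markers = [1, 2, 3, 4]
r2 = {(1, 2): 0.90, (2, 3): 0.85, (3, 4): 0.40}
adj = build_ld_graph(markers, r2, r0=0.8)
```

With these values marker 4 has no LD partner above the threshold, so any feasible tag set must contain it.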

r² Statistics in Single and Admixed Populations
• SNP 1: A, a
• SNP 2: B, b

Population 2 (r² = 0)
         B        b        total
A        0.0025   0.0475   0.05
a        0.0475   0.9025   0.95
total    0.05     0.95

Population 1 (r² = 0)
         B        b        total
A        0.9025   0.0475   0.95
a        0.0475   0.0025   0.05
total    0.95     0.05

Admixed population: 50% population 1, 50% population 2 (r² = 0.6561)
         B        b        total
A        0.4525   0.0475   0.5
a        0.0475   0.4525   0.5
total    0.5      0.5
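The numbers on this slide can be checked by mixing the two haplotype frequency tables and recomputing r². The sketch below assumes each table is stored as [[p_AB, p_Ab], [p_aB, p_ab]]; the helper names and variable names are illustrative only.

```python
def r_squared_from_table(table):
    """r^2 from a 2x2 haplotype frequency table [[p_AB, p_Ab], [p_aB, p_ab]]."""
    p_ab = table[0][0]
    p_a = table[0][0] + table[0][1]  # row total for allele A
    p_b = table[0][0] + table[1][0]  # column total for allele B
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


def admix(table_one, table_two, weight=0.5):
    """Haplotype frequencies of a mixture with proportion `weight` of table_one."""
    return [
        [weight * x + (1.0 - weight) * y for x, y in zip(row_one, row_two)]
        for row_one, row_two in zip(table_one, table_two)
    ]


# The two single-population tables from the slide.
table_one = [[0.0025, 0.0475], [0.0475, 0.9025]]
table_two = [[0.9025, 0.0475], [0.0475, 0.0025]]
mixed = admix(table_one, table_two)

r2_one = r_squared_from_table(table_one)  # about 0: the markers are independent
r2_two = r_squared_from_table(table_two)  # about 0 as well
r2_mixed = r_squared_from_table(mixed)    # about 0.6561, as stated on the slide
```

In the mixture, p_AB − p_A· p_·B = 0.4525 − 0.25 = 0.2025, and 0.2025² / 0.0625 = 0.6561, even though the markers are independent within each population.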

TagSNP Selection across Populations
• A pair of SNPs
  • may have remarkably different marker frequencies and very weak LD in two populations with different evolutionary histories;
  • may show strong LD in the admixed population.
• TagSNPs picked from the admixed population, or from one of the populations alone, might not be sufficient to capture the variation in all populations.

The MCTS Model
[Figure: (a) SNP markers and their LD patterns in two populations; (b) the minimum tagSNP set for these two populations]
• Given a set of SNP markers and their LD patterns in multiple populations, we want to find a minimum set of tagSNPs that is common to all populations, i.e., one that tags every marker in each population.
• This problem is called the minimum common tagSNP selection (MCTS) problem.
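To make the MCTS definition concrete, the sketch below solves a tiny instance by exhaustive search. It is a toy illustration only and not the authors' algorithm: exhaustive search is exponential, and the two hypothetical populations, their LD edges, and all function names are assumptions.

```python
from itertools import combinations


def closed_neighbourhoods(markers, edges):
    """adj[v] = v itself plus all markers in LD with v in one population."""
    adj = {v: {v} for v in markers}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def tags_population(adj, tags):
    """True when every marker of this population is tagged by some marker in tags."""
    return all(adj[v] & tags for v in adj)


def minimum_common_tag_set(populations):
    """Smallest marker set that tags every marker in every population (brute force)."""
    markers = sorted(set().union(*populations))
    for size in range(1, len(markers) + 1):
        for candidate in combinations(markers, size):
            if all(tags_population(adj, set(candidate)) for adj in populations):
                return set(candidate)
    return set(markers)


# Hypothetical LD patterns: markers 1 and 2 are linked only in population 1,
# markers 2 and 3 only in population 2.
pop1 = closed_neighbourhoods([1, 2, 3], [(1, 2)])
pop2 = closed_neighbourhoods([1, 2, 3], [(2, 3)])
common = minimum_common_tag_set([pop1, pop2])
```

Here {2, 3} is a minimum tag set for population 1 taken alone, yet it leaves marker 1 untagged in population 2, while the common set {1, 3} works in both.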

Our Algorithms
• The MCTS problem can be easily formulated as an integer linear program.
• We first apply some data reduction rules, then use one of the following algorithms:
  • a greedy algorithm: GreedyTag
  • a Lagrangian relaxation algorithm: LRTag
• We calculate
  • the upper bound: the number of tagSNPs obtained by our algorithms;
  • the lower bound: the minimum number of tagSNPs needed
    • GreedyTag_lb
    • LRTag_lb
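The slides do not spell out the integer linear program, so the following is one natural way to write it, under assumed notation: x_u is 1 when marker u is chosen as a tagSNP, V_i is the set of markers that occur in population i, and N_i[v] is the set containing v and every marker whose r² with v in population i is no less than r0.

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{u \in V} x_u \\
\text{subject to} \quad & \sum_{u \in N_i[v]} x_u \;\ge\; 1
      && \text{for every population } i \text{ and every marker } v \in V_i, \\
                        & x_u \in \{0, 1\} && \text{for every marker } u \in V .
\end{aligned}
```

Each constraint corresponds to one occurrence of a marker in one population, which matches the slides' use of the word "occurrence". With a single population the program reduces to the minimum dominating set problem.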

Data Reduction Rules
• Pick all irreplaceable markers
  • Example: marker 7
• Remove less informative markers
  • Example: among markers 1, 2 and 6, remove markers 1 and 2.
• Remove less stringent occurrences
  • Example: between the occurrences of markers 4 and 5 in population 2, remove the occurrence of marker 4.
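The slides give these rules by example only. The sketch below shows one plausible reading based on the standard reductions for covering problems; the exact definitions used by the authors may differ, and every name and the toy data are assumptions. An occurrence is a (population, marker) pair, and `taggers[occ]` is the set of markers able to tag it.

```python
def invert(taggers):
    """Turn occurrence -> markers into marker -> occurrences it can tag."""
    cover = {}
    for occ, markers in taggers.items():
        for m in markers:
            cover.setdefault(m, set()).add(occ)
    return cover


def irreplaceable_markers(taggers):
    """Markers that are the only possible tag for some occurrence."""
    return {next(iter(ms)) for ms in taggers.values() if len(ms) == 1}


def less_informative_markers(cover):
    """Markers whose tagged occurrences are contained in another marker's (ties keep the lower id)."""
    def dominated(m1, m2):
        return cover[m1] < cover[m2] or (cover[m1] == cover[m2] and m1 > m2)
    return {m1 for m1 in cover if any(dominated(m1, m2) for m2 in cover if m2 != m1)}


def less_stringent_occurrences(taggers):
    """Occurrences that are tagged automatically once a more demanding occurrence is tagged."""
    def implied(o1, o2):
        return taggers[o2] < taggers[o1] or (taggers[o1] == taggers[o2] and o1 > o2)
    return {o1 for o1 in taggers if any(implied(o1, o2) for o2 in taggers if o2 != o1)}


# Hypothetical single population "P": markers 1-2-3 form a chain, marker 4 is isolated.
taggers = {("P", 1): {1, 2}, ("P", 2): {1, 2, 3}, ("P", 3): {2, 3}, ("P", 4): {4}}
cover = invert(taggers)
forced = irreplaceable_markers(taggers)
dropped_markers = less_informative_markers(cover)
dropped_occurrences = less_stringent_occurrences(taggers)
```

On this toy instance marker 4 is forced, markers 1 and 3 are dropped in favour of marker 2, and the occurrence of marker 2 is redundant because tagging marker 1 already requires choosing marker 1 or 2.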

A Greedy Algorithm
• Apply data reduction rules.
• No untagged occurrence left?
  • Yes: output the tagSNPs.
  • No: pick the marker that tags the most of the remaining occurrences as a tagSNP, then repeat.
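The selection step of the flowchart can be sketched as follows. This is a simplified illustration, not GreedyTag itself: it omits the data reduction rules, breaks ties by lowest marker id (an assumption), and uses hypothetical input data.

```python
def greedy_tag(populations):
    """Greedy common tag selection.

    populations: list of dicts, one per population, mapping each marker to its
    closed neighbourhood (the marker itself plus the markers in LD with it).
    Returns the chosen markers in selection order.
    """
    # cover[u] = all occurrences (population index, marker) that marker u can tag.
    cover = {}
    for i, adj in enumerate(populations):
        for v, neighbours in adj.items():
            for u in neighbours:
                cover.setdefault(u, set()).add((i, v))

    untagged = set().union(*cover.values()) if cover else set()
    tags = []
    while untagged:
        # Pick the marker tagging the most remaining occurrences (lowest id on ties).
        best = max(sorted(cover), key=lambda m: len(cover[m] & untagged))
        tags.append(best)
        untagged -= cover[best]
    return tags


# Hypothetical instance: markers 1 and 2 are linked in population 0,
# markers 2 and 3 in population 1.
pop0 = {1: {1, 2}, 2: {1, 2}, 3: {3}}
pop1 = {1: {1}, 2: {2, 3}, 3: {2, 3}}
chosen = greedy_tag([pop0, pop1])
```

On this instance the plain greedy rule first takes marker 2 and ends with three tags, although {1, 3} would suffice. Picking irreplaceable markers first, as in the data reduction step, would avoid the extra tag here.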

A Lagrangian Relaxation Algorithm
• Introduce the Lagrangian multipliers λ.
• Obtain the relaxed integer program.
• Initialize λ.
• Obtain the tagSNP set based on λ.
• iteration := 0
• iteration++ < max_iter?
  • Yes: update λ towards the subgradient direction, update the tagSNP set based on λ, and repeat the test.
  • No: output the tagSNPs.
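The multiplier update can be illustrated with the textbook Lagrangian relaxation of a covering program, in which each "occurrence must be tagged" constraint is moved into the objective with a multiplier λ ≥ 0. The sketch below computes only a lower bound. The diminishing step size, the iteration count, and all names are assumptions, since the slides do not give these details, and this is not claimed to be LRTag itself.

```python
def lagrangian_lower_bound(cover, max_iter=50, step0=1.0):
    """Subgradient ascent on the Lagrangian dual of the covering program.

    cover: marker -> set of occurrences it can tag.
    Returns the best lower bound found on the minimum number of tagSNPs.
    """
    occurrences = sorted(set().union(*cover.values()))
    lam = {o: 0.0 for o in occurrences}
    best = 0.0
    for it in range(max_iter):
        # Reduced cost of a marker: 1 minus the multipliers of what it tags.
        reduced = {m: 1.0 - sum(lam[o] for o in cover[m]) for m in cover}
        # The relaxed program is solved by taking every marker with negative reduced cost.
        chosen = [m for m in cover if reduced[m] < 0.0]
        value = sum(lam.values()) + sum(reduced[m] for m in chosen)
        best = max(best, value)  # any lam >= 0 yields a valid lower bound
        step = step0 / (it + 1)  # assumed diminishing step size
        for o in occurrences:
            # Subgradient: 1 minus the number of chosen markers tagging occurrence o.
            g = 1 - sum(1 for m in chosen if o in cover[m])
            lam[o] = max(0.0, lam[o] + step * g)
    return best


# Hypothetical instance: occurrences are (population index, marker) pairs.
cover = {
    1: {(0, 1), (0, 2), (1, 1)},
    2: {(0, 1), (0, 2), (1, 2), (1, 3)},
    3: {(0, 3), (1, 2), (1, 3)},
}
bound = lagrangian_lower_bound(cover)
```

On this instance the bound reaches 2, which equals the size of the optimal tag set {1, 3}, so the lower and upper bounds meet.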

Experimental Results
• We apply our algorithms to real HapMap data (release #19, NCBI build 34, October 2005).
• There are four populations in the HapMap data.
  • CEU: European descendants.
  • CHB: Chinese, Beijing.
  • JPT: Japanese, Tokyo.
  • YRI: Yoruba people of Ibadan, Nigeria.
• We obtain tagSNPs for the following two datasets:
  • ENCODE regions
    • all 10 ENCODE regions
    • 10,859 markers
  • Human genome
    • chromosomes 1–22
    • 2,862,454 markers

Experimental Results for ENCODE Regions
• We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
• MultiPop-TagSelect first generates the tagSNPs for each single population, then combines the obtained tagSNPs for multiple populations.
• The gap between LRTag_lb and LRTag
  • r² = 0.5: at most two for each region, six in total for all regions
  • r² = 0.8: there is no gap.

Experimental Results for the Human Genome
• The gap between LRTag_lb and LRTag for the whole genome
  • 2,862,454 SNPs in total
  • r² = 0.5: 1,061
  • r² = 0.8: 142
• The numbers of tagSNPs selected by our algorithms are almost optimal.

Running Time of Our Algorithms
• Running environment
  • a 32-processor SGI Altix 4700 supercomputer system
  • 1.6 GHz CPU
  • 64 GB shared memory
  • 15 threads in parallel
• Running time
  • r² = 0.5
    • ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
    • Human genome: < 12 minutes for each chromosome, < 1 hour for the genome.
  • r² > 0.5: our algorithms run faster than the times above.

Thanks for your time and attention!
