Dynamic Programming Algorithms for Haplotype Block Partitioning: Applications to Human Chromosome 21 Haplotype Data. Speaker: Yao-Ting Huang Advisor: Kuan-Mao Chao. National Taiwan University Department of Computer Science & Information Engineering Algorithms and Computational Biology Lab.
References • Zhang, K., Sun, F., Waterman, M.S., Chen, T. Dynamic programming algorithms for haplotype block partitioning: applications to human chromosome 21 haplotype data. RECOMB, 2003. • Patil, N., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723, 2001. • Waterman, M.S., Eggert, M., Lander, E.S. Parametric sequence comparisons. Proceedings of the National Academy of Sciences of the United States of America, 1992. • Zhang, K., Deng, M., Chen, T., Waterman, M.S., Sun, F. A dynamic programming algorithm for haplotype block partitioning. Proceedings of the National Academy of Sciences of the United States of America, 2002. • Garey, M.R., Johnson, D.S. Computers and Intractability. New York, 1979.
Outline • Related biological background • Related works • The haplotype block partition problem • Three dynamic programming algorithms • Result
Introduction to Nucleic Acids • [Figure: a cell, with the nucleus, cytoplasm, cell membrane, chromosomes, genes, and DNA labeled]
Chromosomes and DNA • A human cell has 46 chromosomes, arranged in 23 pairs, with one chromosome of each pair inherited from each parent. • Genetic information is encoded in DNA (deoxyribonucleic acid) and organized on the chromosomes.
Structure of DNA • DNA is a chain of nucleotides, each of which has the structure phosphate-sugar-base. • [Figure: a nucleotide, with base, phosphate, and sugar labeled]
Structure of DNA • DNA has four bases: adenine (A), guanine (G), cytosine (C), and thymine (T). • A and G are purines, whereas C and T are pyrimidines. • Purines are double-ring bases. • Pyrimidines are single-ring bases. • [Figure: chemical structures of adenine, guanine, cytosine, and thymine]
Mutation • Mutation is caused by chemicals or by errors in DNA replication, and exchanges a single nucleotide for another, e.g., C <--> T or A <--> G. • [Figure: variation arises either through mutation during gene replication or through recombination between the chromosomes of Parent 1 and Parent 2]
Single Nucleotide Polymorphism • A single nucleotide polymorphism (SNP) arises from mutation. • A mutated nucleotide becomes a SNP when its observed frequency exceeds 1% in a population. • SNP: a single-base DNA variation found in > 1% of the population. • Mutation: a single-base DNA variation found in < 1% of the population. • Example (SNP): ACTTAGCTT in 94% of the general population, ACTTAGCTC in 6%. • Example (mutation): ACTTAGCTT in 99.9% of the general population, ACTTAGCTC in 0.1%.
Single Nucleotide Polymorphism • All humans share 99.9% of the same DNA sequence. • SNPs occur about every 600 base pairs. • 90% of human genome variation comes from SNPs. • The human genome contains about 3 million SNPs. • Because of the A-T/C-G complement, a SNP can have only two variants: (AT) or (CG). • A SNP is a variable with two states: • Major allele: the allele (i.e., (AT) or (CG)) with frequency > 50%. • Minor allele: the allele with frequency < 50%.
Haplotype • A set of closely linked SNPs located on one chromosome, which tend to be inherited together (not easily separated by recombination). • Example: six DNA sequences, each listed with its phenotype and its haplotype over SNP 1, SNP 2, and SNP 3: • 1. GATATTCGTACGGA-T, black eye, AG- • 2. GATGTTCGTACTGAAT, brown eye, GTA • 3. GATATTCGTACGGA-T, black eye, AG- • 4. GATATTCGTACGGAAT, blue eye, AGA • 5. GATGTTCGTACTGAAT, brown eye, GTA • 6. GATGTTCGTACTGAAT, brown eye, GTA • Haplotype frequencies: AG- 2/6, GTA 3/6, AGA 1/6.
An Example • The haplotype patterns for 20 independent chromosomes (columns) defined by 147 SNPs (rows) spanning 106 kb of genomic sequence. • Blue box = major allele • Yellow box = minor allele • The expanded box on the right is a SNP block of 26 SNPs over 19 kb. The 4 most common of its 7 different haplotypes account for 80% of the chromosomes, and can be distinguished with 2 SNPs.
Related Works • Patil et al. proposed a greedy algorithm to partition 20 haplotypes, comprising 24,047 SNPs spanning 32.4 Mbp of human chromosome 21, into blocks. • Their partition has 4,135 blocks with 4,563 tag SNPs. • Zhang et al. reduced the numbers of haplotype blocks and tag SNPs to 2,575 and 3,582, respectively, using a dynamic programming algorithm.
Zhang’s Algorithms for Haplotype Block Partitioning • Zhang et al. propose two dynamic programming algorithms to prioritize the SNPs and the corresponding chromosome regions. • Maximize the fraction of the genome covered using a fixed number of tag SNPs. • A further algorithm searches for the local maximal haplotypes that are shared by at least two haplotype samples. • Local maximal haplotype: the haplotype of maximal length that is shared by a given number of samples. • Local maximal haplotypes may correspond to important historical events during the evolution of the species.
Definition • Given K haplotype samples comprising n consecutive SNPs. • Let h_i be a K-dimensional vector, where i = 1, 2, …, n. • e.g., h1 = (0, 0, 1, 0), h2 = (0, 0, 1, 1) when K = 4, n = 2. • h_i(k) = 0, 1, or 2. • 0: missing data • 1: major allele • 2: minor allele
Definition • Two haplotypes are said to be compatible if their alleles are the same at every locus where neither has missing data. • A haplotype in the block is ambiguous if it is compatible with two other haplotypes that are themselves incompatible. • E.g., h1 = (1, 0, 0, 2), h2 = (1, 1, 2, 0), h3 = (1, 1, 1, 2) • h1 is compatible with h2 and h3, but h2 is incompatible with h3. • h1 is ambiguous, whereas h2 and h3 are unambiguous. • Only unambiguous haplotypes are discussed in this paper. • Compatible haplotypes are treated as identical.
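The two definitions above can be checked mechanically. The following is a minimal Python sketch, not code from the paper; the function names `compatible` and `is_ambiguous` are made up for illustration, and the encoding (0 = missing, 1 = major, 2 = minor) follows the previous slide.

```python
MISSING = 0  # slide encoding: 0 = missing data, 1 = major allele, 2 = minor allele

def compatible(a, b):
    """True if a and b carry the same allele at every locus observed in both."""
    return all(x == y for x, y in zip(a, b) if x != MISSING and y != MISSING)

def is_ambiguous(h, others):
    """True if h is compatible with two haplotypes that are mutually incompatible."""
    matches = [o for o in others if compatible(h, o)]
    return any(
        not compatible(p, q)
        for idx, p in enumerate(matches)
        for q in matches[idx + 1:]
    )

# The example from the slide.
h1, h2, h3 = (1, 0, 0, 2), (1, 1, 2, 0), (1, 1, 1, 2)
```

On the slide's example, `h1` is compatible with both `h2` and `h3`, which disagree at the third SNP, so `is_ambiguous(h1, [h2, h3])` holds while `h2` and `h3` are unambiguous.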
Definition • Haplotype block: a segment of consecutive SNPs can form a haplotype block if at least α percent of the unambiguous haplotypes are represented more than once in the samples. • The α value is set to 80 in both Zhang's and Patil's experiments. • Tag SNPs: the minimum set of SNPs that can distinguish at least α percent of the haplotypes.
Predefined Functions • block(i, …, j) is a boolean function. • block(i, …, j) = 1 iff at least αM unambiguous haplotypes defined by those SNPs are represented more than once, where M ≤ K is the total number of defined haplotypes. • f(·) is the number of tag SNPs within a block. • Let B = {B1, B2, …, Bl} be a set of disjoint blocks. • L(i, …, j) is the length of a block. • L(i, …, j) = j – i + 1
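As an illustration of block(·) and L(·), here is a small Python sketch under a simplifying assumption that is not in the slides: there is no missing data, so every haplotype is unambiguous and M = K. The names `is_block` and `block_length` and the toy data are hypothetical.

```python
from collections import Counter

def is_block(haplotypes, i, j, alpha=0.8):
    """block(i, ..., j) for 1-based, inclusive SNP positions.

    Assumes no missing data, so all K haplotypes are unambiguous (M = K).
    """
    patterns = [tuple(h[i - 1:j]) for h in haplotypes]
    counts = Counter(patterns)
    # Number of samples whose pattern over SNPs i..j occurs more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated >= alpha * len(patterns)

def block_length(i, j):
    """L(i, ..., j): number of SNPs from position i through position j."""
    return j - i + 1

# Toy data: K = 5 haplotypes over n = 3 SNPs (1 = major, 2 = minor).
toy = [(1, 1, 2), (1, 1, 2), (1, 1, 1), (1, 2, 1), (2, 1, 2)]
```

With this toy data, SNP 3 alone forms a block (all five samples share a repeated pattern), while SNPs 1 to 2 do not (only three of five samples do).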
Problem 1 • Block partition with a fixed number of tag SNPs: • Given K haplotypes consisting of n consecutive SNPs, and an integer m, find a set of disjoint blocks B = {B1, B2, …, Bl} with f(B) ≤ m such that L(B) is maximized. • 2D dynamic programming algorithm for Problem 1 • Let S(j,k) be the maximum length of the genome covered by at most k tag SNPs for the optimal block partition of the first j SNPs, j = 1, 2, …, n. • S(0,k) = 0 for any k ≥ 0 • S(j,k) = -∞ for any k < 0
2D Dynamic Programming Algorithm for Problem 1 • S(j, k) is the maximum over two cases: • Case 1: the last block ends before j • S(j, k) = S(j-1, k) • Case 2: the last block ends exactly at j and starts at some i ≤ j with block(i, …, j) = 1 • S(j, k) = S(i-1, k - f(i, …, j)) + L(i, …, j) • The optimal block partition can be found by backtracking through the elements of S that contribute to S(n, m).
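The recurrence can be sketched in Python as follows. This is an illustrative implementation, not the authors' code: `is_block(i, j)` and `num_tags(i, j)` stand in for the precomputed block(·) and f(·), the name `max_coverage` is made up, and length is counted in SNPs as L(i, …, j) = j − i + 1. The k < 0 boundary case is handled by skipping blocks that need more than k tag SNPs.

```python
def max_coverage(n, m, is_block, num_tags):
    """Return (S(n, m), blocks) for SNPs 1..n with at most m tag SNPs."""
    # S[j][k]: maximum covered length within the first j SNPs using <= k tags.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # choice[j][k]: start i of the block ending at j, or None for Case 1.
    choice = [[None] * (m + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for k in range(m + 1):
            S[j][k] = S[j - 1][k]                 # Case 1: no block ends at j
            for i in range(1, j + 1):             # Case 2: block i..j ends at j
                if not is_block(i, j):
                    continue
                t = num_tags(i, j)
                if t > k:                         # would need S(., k - t) with k - t < 0
                    continue
                cand = S[i - 1][k - t] + (j - i + 1)
                if cand > S[j][k]:
                    S[j][k] = cand
                    choice[j][k] = i
    # Backtrack from S(n, m) to recover the chosen blocks.
    blocks, j, k = [], n, m
    while j > 0:
        i = choice[j][k]
        if i is None:
            j -= 1
        else:
            blocks.append((i, j))
            k -= num_tags(i, j)
            j = i - 1
    blocks.reverse()
    return S[n][m], blocks

# Hypothetical toy instance: candidate blocks mapped to their tag SNP counts.
toy_blocks = {(1, 3): 2, (4, 6): 1, (2, 5): 1}
in_toy = lambda i, j: (i, j) in toy_blocks
toy_tags = lambda i, j: toy_blocks[(i, j)]
```

On the toy instance, a budget of one tag SNP covers four SNPs through block (2, 5), while a budget of three covers all six through blocks (1, 3) and (4, 6).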
Time Complexity to Compute S(n,m) • If the block(·), f(·), and L(·) functions are computed in advance, then computing S(n, m) has • space complexity O(m·n), and • time complexity O(N·m·n), where N is the number of SNPs contained in the largest block. • Computing L(·) takes O(1) time. • Computing block(i, …, i+k−1) takes O(K²·k) time, • because every pair of haplotypes must be checked for compatibility at these k SNPs.
Time Complexity to Compute S(n,m) • Computing f(·) is an NP-complete problem. • It is equivalent to the Minimum Test Set problem (Garey and Johnson, 1979). • e.g., simplest way to compute f(i, …, i+N−1) • Overall time complexity • O(2^N·K·N·n) + O(K²·N²·n) + O(N·m·n)
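One plausible reading of the "simplest way" to compute f(·) is exhaustive search over subsets of the SNPs in a block, which is where an exponential term in N would come from. The Python sketch below makes two assumptions that are not in the slides: there is no missing data, and the tag SNPs must distinguish every pair of distinct haplotypes (rather than α percent of them). The name `min_tag_snps` is hypothetical.

```python
from itertools import combinations

def min_tag_snps(haplotypes):
    """Smallest tuple of SNP columns that tells all distinct haplotypes apart.

    Brute force: tries subsets in order of increasing size, so the first hit
    is minimal. Exponential in the number of SNPs; expects a non-empty input.
    """
    distinct = sorted(set(map(tuple, haplotypes)))
    n_snps = len(distinct[0])
    for size in range(n_snps + 1):
        for cols in combinations(range(n_snps), size):
            projected = {tuple(h[c] for c in cols) for h in distinct}
            if len(projected) == len(distinct):
                # Restricting to these columns keeps every haplotype unique.
                return cols

# Toy block: four distinct haplotypes over four SNPs (1 = major, 2 = minor).
toy_block = [(1, 1, 2, 1), (1, 2, 2, 1), (2, 1, 2, 1), (2, 2, 2, 1)]
```

The value f(·) is then the length of the returned tuple; for `toy_block` two SNPs (columns 0 and 1) suffice, and a block with a single haplotype pattern needs none.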
Problem 2 • Block partition with a fixed genome coverage • Given a chromosome of length L, K haplotypes consisting of n consecutive SNPs and β≤ 1, find a set of disjoint blocks B = {B1, B2, …, Bl} with L(B) ≥ βL such that f(B) is minimized. • Parametric dynamic programming algorithm • Define the positive score for SNPs i, …, j, to be the number of tag SNPs, f(i, …,j), if block(i, …,j) = 1 and this block is included in the partition. • Define the penalty for SNPs i, …, j, to be λL(i,…,j) if they are excluded from the partition.
Parametric Dynamic Programming • Let S(j, λ) be the minimum score for the optimal block partition of the first j SNPs with respect to the deletion parameter λ. • S(0, λ) = 0 • S(n, ∞) = the minimum number of tag SNPs for the entire chromosome, because all SNPs are included in the block partition. • The scoring function S(j, λ) is the sum of • the total number of tag SNPs for included blocks, and • the penalty for excluded intervals.
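For a fixed λ, S(n, λ) can be computed with a one-dimensional dynamic program. The Python sketch below is an illustration rather than the authors' recursion; it assumes that the excluded length is additive and that each excluded SNP contributes length 1, and the names `parametric_score`, `is_block`, and `num_tags` are hypothetical stand-ins for the score, block(·), and f(·).

```python
def parametric_score(n, lam, is_block, num_tags):
    """Return (S(n, lam), total tag SNPs, total excluded length).

    Assumes each excluded SNP adds a penalty of lam (unit length per SNP).
    """
    # best[j] = (score, tags, excluded) for the first j SNPs.
    best = [(0.0, 0, 0)]
    for j in range(1, n + 1):
        score, tags, excl = best[j - 1]
        cand = (score + lam, tags, excl + 1)      # exclude SNP j
        for i in range(1, j + 1):                 # or end a block i..j at j
            if is_block(i, j):
                s, t, e = best[i - 1]
                f = num_tags(i, j)
                option = (s + f, t + f, e)
                if option[0] < cand[0]:
                    cand = option
        best.append(cand)
    return best[n]

# Hypothetical toy instance: candidate blocks mapped to their tag SNP counts.
lam_blocks = {(1, 3): 2, (4, 6): 1, (2, 5): 1}
lam_is_block = lambda i, j: (i, j) in lam_blocks
lam_tags = lambda i, j: lam_blocks[(i, j)]
```

On the toy instance, λ = 0 excludes everything at no cost, λ = 0.5 keeps only block (2, 5), and a large λ forces full coverage with three tag SNPs.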
Properties of the scoring function • S(j, λ) is an increasing, piecewise-linear, and concave function of λ (it is the minimum over block partitions of linear functions of λ); each linear segment has the form S(j, λ) = a + b·λ. • The right-most linear segment of S(j, λ) is constant. • The intercept of each linear segment is the total number of tag SNPs. • The slope of each linear segment is the total length of excluded intervals.
Compute S(n, λ) • The algorithm starts with S(n, 0) and S(n, ∞); let the corresponding lines L0 and L∞ intersect at (x, y). • Case 1: if S(n, x) = y, then L0 and L∞ together define the entire function S(n, λ). • Case 2: if S(n, x) < y, divide the range of λ into two regions, [0, x] and [x, ∞], and repeat the above procedure for these two regions. • [Figure: two plots of S(n, λ) against λ, one for each case, marking the points (x, y) and (x, S(n, x))]
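The divide-and-conquer search over λ can be sketched in Python as below. This is a hedged illustration: `line_at(lam)` is a hypothetical oracle that returns the (intercept, slope) pair of an optimal partition at a given λ, a large finite `lam_max` stands in for λ = ∞, and the name `envelope_lines` is made up.

```python
def envelope_lines(line_at, lam_max, tol=1e-9):
    """Return the (intercept, slope) lines that make up S(n, lam) on [0, lam_max]."""
    def value(line, lam):
        a, b = line
        return a + b * lam

    def refine(left, right):
        # left is optimal at the low end of the region, right at the high end.
        # Returns the segments in order, excluding right itself.
        if left[1] == right[1]:                   # parallel lines: nothing between
            return [left]
        x = (right[0] - left[0]) / (left[1] - right[1])   # intersection point
        y = value(left, x)
        mid = line_at(x)
        if value(mid, x) >= y - tol:              # Case 1: S(n, x) = y
            return [left]
        return refine(left, mid) + refine(mid, right)     # Case 2: S(n, x) < y

    first, last = line_at(0.0), line_at(lam_max)
    if first == last:
        return [first]
    return refine(first, last) + [last]

# Toy oracle over three hypothetical partitions, given as (tag SNPs, excluded length).
toy_lines = [(0, 6), (1, 2), (3, 0)]
toy_oracle = lambda lam: min(toy_lines, key=lambda l: l[0] + l[1] * lam)
```

For the toy oracle the search recovers all three segments: everything excluded, a partial partition, and the full-coverage partition whose slope is zero, matching the constant right-most segment described above.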
Time complexity of S(n, λ) • If L(·) is additive, the computational time can be reduced to O(n) by using the following recursion • The time complexity for finding S(n, λ) is • O(2^N·K·N·n) + O(K²·N²·n) + O(N·S·n) • S: the number of segments in S(n, λ) • When different block partitions contribute to the same score, the algorithm chooses the rightmost segment, with the maximum number of tag SNPs and the minimum length of excluded intervals.
Problem 3 • Finding local maximal haplotypes for a subset of samples • Given K haplotypes consisting of n consecutive SNPs, and two integers, k≤ K and m ≤ n, find all local maximal haplotypes that are shared by at least k samples and contain at least m SNPs.
Algorithm 3 • Step 1. Let S be a collection of sample sets that initially contains the single set {all K samples}, so that |S| = 1, and let j = i. • Step 2. For every set S_w in S, split S_w into two sets if there exist two samples in S_w that disagree at the jth SNP. • Step 3. Report one local maximal haplotype if • |j-i+1| ≥ m, • |S_w| ≥ k, and • there exist two samples that disagree at the (i-1)th SNP and two that disagree at the (j+1)th SNP. • Step 4. • Stop if |S| = k; • Otherwise, let j = j+1 and go to Step 2.
Time complexity of algorithm 3 • Let N be the length of the local maximal haplotypes shared by at least k samples. • The overall time complexity is O(K*N*n). • [Figure: example matrix of haplotype samples with alleles coded 1 and 2]
Results of Algorithm 1 • The data set is the published haplotype data of human chromosome 21 from Patil et al. • It includes 20 haplotypes of 24,047 SNPs (with at least 10% minor allele frequency) spanning about 32.4 Mb. • The parameter α for the block() function is set to 80%. • Figure a shows the relationship between the number of tag SNPs and the percentage of total SNPs covered, whereas Figure b shows it with respect to actual genome length.
Results of Algorithm 2 • Figure a shows the relationship between the percentages of the total number of SNPs included and the deletion parameter λ. • Figure b shows the relationship between the percentages of the total number of SNPs included and the number of tag SNPs.
Results of Algorithm 3 • Local maximal haplotypes are defined here as those that are shared by at least 2 samples and contain at least 100 consecutive SNPs.