Finding Bit Patterns Applying haplotype models to association study design Natalie Castellana Kedar Dhamdhere Russell Schwartz August 16, 2005
Problem: Applying haplotype models
• Input: 10000010100010010 00010100101101001 01101101001000010 10101011111000010
• Output: a set of recurring patterns of the form (start column, end column, pattern), e.g. (14,17,“0010”)
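To make the output format concrete, here is a minimal Python sketch that reports which input rows carry a given pattern. It assumes columns are 1-indexed and inclusive, which is consistent with the example (14,17,“0010”). The function name `rows_with_motif` is illustrative and does not come from the slides.

```python
# The four input rows from the slide.
rows = [
    "10000010100010010",
    "00010100101101001",
    "01101101001000010",
    "10101011111000010",
]

def rows_with_motif(rows, start, end, pattern):
    """Return 1-based indices of rows whose columns start..end equal pattern."""
    return [i for i, row in enumerate(rows, start=1)
            if row[start - 1:end] == pattern]

# The slide's example pattern occurs in rows 1, 3 and 4.
print(rows_with_motif(rows, 14, 17, "0010"))
```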
Background
[Figure: the sequence 1000011010110100000010 with labels Haplotype, Minor allele, Major allele, SNP]
Association Test: Given that this sample has haplotype 1101, does it have the disease?
Genetic Variation
• Mutation: …1001001… …1000001… …1000101…
• Recombination: …1000101… and …1110011… give …1000011… and …1110101…
Because of recombination, similar genetic variation can be found within closely linked regions.
Data Sets
[Flowchart: Cases: Download from HapMap.org, Apply Disease Model. Controls: Generate using MS. Apply Haplotype Model. Perform Association Tests.]
10010011101 01100101101 10010010101 10001110100
Input: 1001001010110 1001001110100 0110010110100 1000111010010
Testing individual SNPs
• Go through each SNP and determine which SNPs accurately predict which samples have the disease and which do not.
Case: 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0
Control: 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1
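The slides do not say which statistic is used to score a single SNP. As one simple illustration, the sketch below scores each SNP by the absolute difference in allele-1 frequency between cases and controls. The toy data and the names `allele_frequencies` and `snp_scores` are assumptions made for this example, not the deck's data or method.

```python
def allele_frequencies(samples):
    """Fraction of samples carrying allele 1 at each SNP (column)."""
    n = len(samples)
    width = len(samples[0])
    return [sum(int(s[j]) for s in samples) / n for j in range(width)]

def snp_scores(cases, controls):
    """Absolute case/control difference in allele-1 frequency per SNP."""
    case_freq = allele_frequencies(cases)
    control_freq = allele_frequencies(controls)
    return [abs(a - b) for a, b in zip(case_freq, control_freq)]

# Hypothetical toy samples: one string per sample, one character per SNP.
cases = ["0011", "0111", "0010"]
controls = ["0100", "1101", "0000"]

scores = snp_scores(cases, controls)
# 1-based index of the SNP that best separates cases from controls.
best_snp = max(range(len(scores)), key=scores.__getitem__) + 1
print(scores, best_snp)
```

A real study would replace the raw frequency difference with a proper association test, but the column-by-column loop stays the same.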
Haplotype block method
• Instead of looking at each individual SNP, we can look at groups of contiguous SNPs.
1101000000…11… 1101100100…01… 0111000000…10… 1101100100…00…
Haplotype motif method
• A sequence is treated as a concatenation of segments (as in the block method), but the segment boundaries do not have to be conserved across sequences.
1101000000… 1100100100… 0111000000… 1101100111…
Approximation Algorithm
General idea: Pick the best partition, minimizing the number of motifs needed to explain all the data.
[Figure: data rows 10000100… 00011100… 11011110… 01010110… with column groups each labeled c]
Finding Motifs
[Figure: tree labeled C, branching on 0 and 1, with leaves 000…000, 111…111, 000..100, …]
Problems
• Really, really, really slow. Took over a week to partition our biggest data set.
• Added a ‘max leaves explored’ feature.
• Useless for larger c.
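General Linear Program
Objective function: minimize x + y + z
Constraints:
  x + y <= 2
  x + 2z <= 5
In matrix form: $\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} \le \begin{pmatrix} 2 \\ 5 \end{pmatrix}$
Bounds: 0 <= x <= 3, 0 <= y <= Inf, -Inf <= z <= 0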
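The matrix form can be checked directly in code. The sketch below stores the slide's coefficients as plain Python lists and tests candidate points for feasibility. It does not solve the program, and the helper names `objective` and `is_feasible` are illustrative. One thing it makes visible: with the bounds as stated, z has no lower limit and a positive objective coefficient, so the points (0, 0, -t) stay feasible while the objective keeps decreasing.

```python
import math

# Coefficients of the example program: minimize c.p subject to A p <= b.
A = [[1, 1, 0],
     [1, 0, 2]]
b = [2, 5]
c = [1, 1, 1]
lower = [0, 0, -math.inf]   # lower bounds on x, y, z
upper = [3, math.inf, 0]    # upper bounds on x, y, z

def objective(p):
    """Value of x + y + z at point p = (x, y, z)."""
    return sum(ci * vi for ci, vi in zip(c, p))

def is_feasible(p):
    """True if p satisfies every row of A p <= b and every bound."""
    rows_ok = all(sum(a * v for a, v in zip(row, p)) <= bi
                  for row, bi in zip(A, b))
    bounds_ok = all(lo <= v <= hi for v, lo, hi in zip(p, lower, upper))
    return rows_ok and bounds_ok

# Feasible points with ever smaller objective values.
for t in (0, 10, 1000):
    p = (0, 0, -t)
    print(p, is_feasible(p), objective(p))
```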
A Linear Program Input: A matrix with M rows and N columns Output: The minimum number of motifs.
Variables
• X’s: each x corresponds to a motif. A motif is defined by a tuple: (start column, end column, string pattern).
• Y’s: each y corresponds to a row partition. A row partition is defined by a set of motifs: {(1,e1,“…”),(e1+1,e2,“…”),...,(en+1,N,“…”)}
Constraints
• Exactly one partition must be chosen per row.
• If a motif used in a row partition is not chosen, then the row partition may not be chosen.
Objective: minimize the sum of all X’s.
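Written out symbolically, the objective and the two constraints can be sketched as follows. The index notation is an assumption introduced here: $P_i$ denotes the set of partitions of row $i$, and $p \ni m$ means partition $p$ uses motif $m$. The slides do not state the variable ranges, so none are given.

```latex
\begin{aligned}
\min\ & \sum_{m} x_m \\
\text{s.t.}\ & \sum_{p \in P_i} y_p = 1
  && i = 1, \dots, M \\
& x_m - \sum_{p \in P_i,\ p \ni m} y_p \ge 0
  && i = 1, \dots, M,\ \text{for each motif } m \text{ of row } i
\end{aligned}
```

Because exactly one partition is chosen per row, the second constraint forces $x_m$ up to the chosen partition's $y$ value whenever that partition uses motif $m$.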
Example
Row: 10001101
X’s: (1,1,“1”),(1,2,“10”),(1,3,“100”), etc.
Y’s:
• (1,1,“1”),(2,8,“0001101”)
• (1,2,“10”),(3,3,“0”),(4,8,“01101”)
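A short Python sketch of this example: it lists every motif of the row 10001101 and checks whether a list of motifs is a valid row partition, meaning the motifs tile the row from column 1 to column 8 with no gaps or overlaps. Columns are taken as 1-indexed and inclusive. The function names are illustrative.

```python
def motifs_of(row):
    """All (start, end, pattern) motifs of a row, 1-indexed and inclusive."""
    n = len(row)
    return [(s, e, row[s - 1:e])
            for s in range(1, n + 1)
            for e in range(s, n + 1)]

def is_row_partition(row, parts):
    """True if parts tile the row left to right with matching patterns."""
    next_col = 1
    for start, end, pattern in parts:
        if start != next_col or end < start:
            return False            # gap, overlap, or empty segment
        if row[start - 1:end] != pattern:
            return False            # pattern does not match the row
        next_col = end + 1
    return next_col == len(row) + 1  # must end exactly at the last column

row = "10001101"
print(len(motifs_of(row)))
print(is_row_partition(row, [(1, 1, "1"), (2, 8, "0001101")]))
print(is_row_partition(row, [(1, 2, "10"), (3, 3, "0"), (4, 8, "01101")]))
```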
Constraint Matrix (1)
Exactly one row partition must be chosen per row.

          all X’s                                 all Y’s
          (1,1,“1”)  (1,1,“0”)  …  (1,2,“10”)     Y_1  Y_2  …
Row 1     0          0          …  0              1    1    …    = 1
Row 2     0          0          …  0              0    0    …    = 1
Row 3     0          0          …  0              1    1    …    = 1
…
Row M     0          0             0                             = 1

Y_1 := (1,1,“1”),(2,8,“0001101”)
Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)
Constraint Matrix (2)
If a motif used in a row partition is not chosen, then the row partition may not be chosen.

Row i:         all X’s                                all Y’s
               (1,1,“1”)  (1,1,“0”)  …  (1,2,“10”)    Y_1  Y_2  …
(1,1,“1”)      1          0          …  0             -1   0    …    >= 0
(1,2,“10”)     0          0          …  1             0    -1   …    >= 0
(1,3,“100”)    0          0          …  0             0    0    …    >= 0
…              …          …          …  …             …    …    …
(8,8,“1”)      0          0          …  0             0    0         >= 0

Y_1 := (1,1,“1”),(2,8,“0001101”)
Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)
Constraint Matrix
Columns 1 to K are the x’s; columns K+1 to K+P are the y’s.

** Constraint 1 ** (== 1)
Row    x’s                      y’s
1      0 0 0 0 0 … 0 0 0 0      1 1 1 0 0 0 0 … 0
2      0 0 0 0 0 … 0 0 0 0      1 0 0 1 1 1 0 … 0
…
M      0 0 0 0 0 … 0 0 0 0      0 0 1 0 0 0 1 … 1

** Constraint 2 ** (>= 0)
Row    Motif    x’s                      y’s
1      1        1 0 0 0 0 … 0 0 0 0      -1 0 0 0 … 0 0
       2        0 1 0 0 0 … 0 0 0 0      -1 -1 0 0 … -1 0
       …
       K_1      0 0 1 0 0 … 0 0 0 0      0 0 0 0 … 0 0
…
M

Where K is the number of unique motifs, K_i is the number of motifs appearing in row i, and P is the number of unique partitions.
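As a sketch of how such a matrix could be assembled, the Python below builds both constraint blocks for a tiny hypothetical input of two rows with two columns each. It enumerates every partition up front, so it is only workable for very short rows. The function names and the column ordering (sorted motifs first, then partitions row by row) are assumptions made for this example.

```python
from itertools import combinations

def partitions(row):
    """Yield every split of row into contiguous (start, end, pattern) motifs."""
    n = len(row)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield tuple((s + 1, e, row[s:e])
                        for s, e in zip(bounds, bounds[1:]))

def build_constraints(rows):
    """Return (motifs, eq_rows, ge_rows) over columns [x's | y's]."""
    row_parts = [list(partitions(r)) for r in rows]
    motifs = sorted({m for parts in row_parts for p in parts for m in p})
    x_col = {m: j for j, m in enumerate(motifs)}
    K = len(motifs)                               # number of unique motifs
    P = sum(len(parts) for parts in row_parts)    # number of partitions
    eq_rows, ge_rows = [], []
    offset = K                                    # first y column of this row
    for parts in row_parts:
        # Constraint 1: the y's of this row sum to exactly 1.
        eq = [0] * (K + P)
        for j in range(len(parts)):
            eq[offset + j] = 1
        eq_rows.append(eq)
        # Constraint 2: x_m minus the y's that use motif m is >= 0.
        for m in sorted({m for p in parts for m in p}):
            ge = [0] * (K + P)
            ge[x_col[m]] = 1
            for j, p in enumerate(parts):
                if m in p:
                    ge[offset + j] = -1
            ge_rows.append(ge)
        offset += len(parts)
    return motifs, eq_rows, ge_rows

motifs, eq_rows, ge_rows = build_constraints(["10", "11"])
for r in eq_rows + ge_rows:
    print(r)
```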
Problems
• Each row has N(N+1)/2 motifs, so there will be a polynomial number of X’s. Good!
• Each row can be partitioned in 2^(N-1) ways, so there will be an exponential number of Y’s. Bad!
Solution: column generation
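Both counts follow from a short argument. A motif starting at column $s$ can end at any of the $N - s + 1$ columns from $s$ to $N$, and a partition is fixed by deciding, for each of the $N - 1$ gaps between adjacent columns, whether to cut there:

```latex
\[
\#\text{motifs per row} = \sum_{s=1}^{N} (N - s + 1) = \frac{N(N+1)}{2},
\qquad
\#\text{partitions per row} = 2^{N-1}.
\]
```

For the 8-column example row this gives 36 motifs and 128 partitions.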
Column generation
We find the optimal solution to a restricted problem that contains all X’s and only some of the Y’s. Then we check whether adding any of the remaining Y’s would improve the solution.
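The slides do not describe how the check for improving Y’s is carried out. In standard column generation it is done with the dual values of the restricted problem, and the sketch below follows that approach as an assumption. For a row with dual value `pi` on its "exactly one partition" constraint and nonnegative dual values `mu[(start, end)]` on its motif constraints, a new partition has reduced cost equal to the sum of `mu` over its motifs, minus `pi`, and it can improve the solution only if that is negative. The cheapest partition can be found by dynamic programming over columns without enumerating all 2^(N-1) candidates. The names `pi`, `mu` and `best_partition`, and the dual values used, are hypothetical.

```python
def best_partition(n, mu):
    """Cheapest partition of columns 1..n, where a segment (start, end)
    costs mu.get((start, end), 0.0). Returns (total cost, segments)."""
    best = [0.0] + [float("inf")] * n   # best[j]: min cost to cover 1..j
    back = [0] * (n + 1)                # back[j]: start of the last segment
    for end in range(1, n + 1):
        for start in range(1, end + 1):
            cost = best[start - 1] + mu.get((start, end), 0.0)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back pointers to recover the segments.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append((start, end))
        end = start - 1
    segments.reverse()
    return best[n], segments

# Hypothetical dual values for one 3-column row.
pi = 1.0
mu = {(1, 1): 0.5, (2, 2): 0.5, (3, 3): 0.5,
      (1, 2): 0.2, (2, 3): 0.9, (1, 3): 1.0}

cost, segments = best_partition(3, mu)
reduced_cost = cost - pi
print(segments, reduced_cost, reduced_cost < 0)
```

If the reduced cost is negative for some row, that partition's Y is added and the restricted problem is solved again; if no row yields a negative value, the current solution is optimal for the full linear program.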
Where are we now? • Where are we going?