Explore minimizing k-mers in a string by flipping letters while adhering to budget constraints. Discuss related parameterized complexity and ILP formulations.
Flipping letters to minimize the support of a string
Giuseppe Lancia, Franca Rinaldi, Romeo Rizzi
University of Udine
Outline of talk:
1. Problem definition
2. Parametrized complexity
3. Polynomial cases
4. NP-hardness
5. ILP formulations
We are given a string s and a parameter k (e.g., k = 3):
s = 010010011
The string has a set of k-mers, its support, K(s):
K(s) = {010, 100, 001, 011}, |K(s)| = 4
By flipping some bits, we could reduce the number of k-mers:
s = 010010011, K(s) = {010, 100, 001, 011}, |K(s)| = 4
s’ = 010010010, K(s’) = {010, 100, 001}, |K(s’)| = 3
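The support computation in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name `support` is an illustrative choice and does not come from the talk.

```python
def support(s: str, k: int) -> set:
    """Return K(s), the set of distinct k-mers (length-k substrings) of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


s = "010010011"
s_prime = "010010010"  # s with its last bit flipped

print(sorted(support(s, 3)))        # the distinct 3-mers of s
print(sorted(support(s_prime, 3)))  # the distinct 3-mers of s'
```

Sliding a window of width k over the string and collecting the substrings in a set is enough here, because only distinct k-mers count toward the support.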
The Problem:
Ingredients:
- A string s over an alphabet Σ
- A parameter k (k-mer size)
- A budget B
Objective: Change at most B letters in s so that the resulting string s’ has as few distinct k-mers as possible. Equivalently, find a string s’ with d(s, s’) <= B that has the smallest number of k-mers, where d is the Hamming distance.
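For very small instances the objective can be checked by exhaustive search. The sketch below is only a reference implementation of the problem statement, not a method from the talk: it enumerates every candidate string, so it takes exponential time. The function names are hypothetical, and it assumes the candidate alphabet consists of the symbols already occurring in s.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions where the equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def support_size(s: str, k: int) -> int:
    """Number of distinct k-mers of s."""
    return len({s[i:i + k] for i in range(len(s) - k + 1)})


def min_support_bruteforce(s: str, k: int, budget: int) -> int:
    """Smallest |K(s')| over all s' with d(s, s') <= budget (exponential time)."""
    alphabet = sorted(set(s))  # assumption: only symbols already in s are used
    return min(
        support_size("".join(t), k)
        for t in product(alphabet, repeat=len(s))
        if hamming(s, "".join(t)) <= budget
    )


print(min_support_bruteforce("010010011", 3, 1))
```

A search like this is useful mainly as a correctness check for faster algorithms on short strings.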
Motivation:
Real: Curiosity-driven (it’s a cute combinatorial problem)
Fictitious: Analysis of DNA sequences. The sequence atcgattgatccttta has the 3-mers atc, tcg, cga, gat, …. 3-mers are amino acid codons. Protein complexity relates to the number of codons. Mutations may reduce complexity….
Our results: The problem has many parameters (|s|, |Σ|, k, B); we study all versions (where possibly some of the parameters are bounded).
- Polynomial special cases (e.g., for B fixed, or for both k and |Σ| fixed)
- NP-hard special cases (even for k = 2 or |Σ| = 2)
Parameter grid (one cell per combination of a column and a row):
Columns: |Σ| NO, k NO; |Σ| YES, k NO; |Σ| NO, k YES; |Σ| YES, k YES
Rows: |s| NO, B NO; |s| YES, B NO; |s| NO, B YES; |s| YES, B YES
We can assume:
- k <= |s|
- B <= |s|
- |Σ| <= |s| (we don’t need any symbol not already in s)
Polynomial cases (e.g., B fixed, or both |Σ| and k fixed)
NP-hard cases, in the row |s| NO, B NO: NP-hard for |Σ| = 2; NP-hard for k = 2; NP-hard
The case |Σ| and k fixed:
We start with this subproblem:
SUB(A): Given a set of k-mers A, can we correct s within the budget so that all of its k-mers are in A?
Example: s = 01000100101110, A = {0100, 1001, 0010, 0001}, B = 3
[Figure: layered graph in which each layer contains the k-mers of A (0100, 1001, 0010, 0001), labeled with numeric costs]
- Each path corresponds to a string s’ with all its k-mers in A
- The length of the path is the Hamming distance d(s’, s)
- SUB(A) has a solution iff the shortest path has length <= B
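The shortest-path idea can be sketched as a dynamic program over layers, one layer per window position of s. The code below is an illustrative reading of the slide, not the authors' implementation: the function names are hypothetical, it assumes |s| >= k and that every element of A has length k, and it makes no attempt to match the running time stated in the talk. Node costs in the first layer are the Hamming distances to the first window of s; moving from k-mer a to k-mer b (allowed when the last k-1 letters of a equal the first k-1 letters of b) costs 1 if the appended letter differs from s.

```python
def sub_a_min_distance(s: str, kmers, k: int) -> float:
    """Minimum d(s', s) over strings s' of length |s| whose k-mers all lie in kmers.

    Returns float('inf') if no such string exists. Assumes len(s) >= k.
    """
    inf = float("inf")
    # First layer: cost of rewriting the first window of s into each k-mer.
    dist = {a: sum(x != y for x, y in zip(a, s[:k])) for a in kmers}
    # Each later layer appends one letter: a -> b is an arc iff a[1:] == b[:-1].
    for i in range(k, len(s)):
        new_dist = {}
        for b in kmers:
            best = min((dist[a] for a in kmers if a[1:] == b[:-1]), default=inf)
            new_dist[b] = best + (b[-1] != s[i])
        dist = new_dist
    return min(dist.values(), default=inf)


def sub_a(s: str, kmers, k: int, budget: int) -> bool:
    """SUB(A): is there an s' within the budget with all its k-mers in kmers?"""
    return sub_a_min_distance(s, kmers, k) <= budget


print(sub_a("010010011", {"010", "100", "001"}, 3, 1))
```

Only the best distance per k-mer is kept between layers, which is all a shortest-path computation on a layered graph needs.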
- We can solve SUB(A) in polynomial time, O(|A| |Σ| |s|) = O(|s|), since |A| and |Σ| are bounded by constants when |Σ| and k are fixed
- There are “only” 2^(|Σ|^k) possible subsets A to try, a constant number for fixed |Σ| and k, so the problem is solved in polynomial time O(|s|)
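The subset-enumeration argument can be illustrated as follows. This is a sketch under stated assumptions rather than the talk's algorithm: the function names are hypothetical, the alphabet is taken to be the symbols occurring in s, |s| >= k is assumed, and subsets are tried in order of increasing size so that the first feasible one gives the optimum. The helper repeats a simple layered shortest-path computation for SUB(A) so that the block is self-contained.

```python
from itertools import combinations, product


def min_distance_within(s, kmers, k):
    """Shortest-path value for SUB(A): min d(s', s) with all k-mers of s' in kmers."""
    inf = float("inf")
    dist = {a: sum(x != y for x, y in zip(a, s[:k])) for a in kmers}
    for i in range(k, len(s)):
        # The comprehension reads the previous layer; dist is rebound afterwards.
        dist = {
            b: min((dist[a] for a in kmers if a[1:] == b[:-1]), default=inf)
            + (b[-1] != s[i])
            for b in kmers
        }
    return min(dist.values(), default=inf)


def min_support_by_subsets(s, k, budget):
    """Smallest |A| for which SUB(A) is feasible, trying subsets by increasing size."""
    alphabet = sorted(set(s))  # assumption: symbols not in s are never needed
    all_kmers = ["".join(t) for t in product(alphabet, repeat=k)]
    for size in range(1, len(all_kmers) + 1):
        for subset in combinations(all_kmers, size):
            if min_distance_within(s, subset, k) <= budget:
                return size
    return None  # not reached when len(s) >= k: A = K(s) is feasible at distance 0


print(min_support_by_subsets("010010011", 3, 1))
```

The number of subsets grows as 2^(|Σ|^k), so this is practical only when both |Σ| and k are small, which is exactly the regime this case covers.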
The case of B fixed: