Fingerprint Clustering with Bounded Number of Missing Values
Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri (Università di Milano-Bicocca, Italy)
Riccardo Dondi (Università di Bergamo, Italy)
Fingerprint Clustering - CPM 2006
Talk Outline
• Biological problem and combinatorial problem
• Three versions of the problem:
  • Clustering with Missing Values (CMV)
  • Inside Edge Clustering (IEC)
  • Outside Edge Clustering (OEC)
• Approximation algorithm for IEC and OEC
• Polynomial time algorithm for restricted CMV
• APX-hardness of CMV
• APX-hardness of IEC and OEC
• Future work
Biological Motivations
Classification of microorganisms:
• A library of rDNA (ribosomal RNA gene) clones is created
• A short DNA sequence (a probe) is applied to hybridize with all clones of the library
• After hybridization, unbound probes are removed; the library is analyzed to see how strongly the probe has hybridized to each spot
• The experiment is repeated for a set of probes
Biological Motivations
• Fingerprint of a clone: vector consisting of the hybridization intensity values between the clone and each probe
To classify microorganisms:
• Fingerprints are transformed into binary vectors
• Fingerprints are clustered to infer different properties with respect to the probes
Biological Motivations
• Goal: translate hybridization intensity values into binary values 0, 1
• Because some intensity values are inconclusive, it is not always possible to get binary vectors
• For each clone we are given a fingerprint over the alphabet {0, 1, N}:
  • 0 → no hybridization
  • 1 → hybridization
  • N → unable to determine whether hybridization has happened
Clustering of fingerprints – Combinatorial problem
• Two fingerprints are compatible iff they agree in every position where both are different from N
• Example:
  Two compatible fingerprints:
    0 1 0 N N 0 1 0
    0 1 N N 1 0 1 0
  Two incompatible fingerprints:
    0 1 0 N N 0 1 0
    0 1 N N 1 0 0 0
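The compatibility test is a position-by-position comparison. A minimal Python sketch, assuming fingerprints are stored as strings over the characters 0, 1 and N (the function name `compatible` is illustrative, not from the talk):

```python
def compatible(f, g):
    """Return True if f and g agree in every position where neither is 'N'."""
    if len(f) != len(g):
        raise ValueError("fingerprints must have the same length")
    return all(a == b or a == "N" or b == "N" for a, b in zip(f, g))


# The two pairs shown on the slide, written without spaces.
print(compatible("010NN010", "01NN1010"))  # compatible pair
print(compatible("010NN010", "01NN1000"))  # incompatible pair (position 7)
```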
Clustering of fingerprints – Combinatorial problem
Clustering of fingerprints: general formulation
• Input: a set F of fingerprints
• Output: a clustering (partition) C of F such that each cluster of C contains only pairwise compatible fingerprints
Clustering of fingerprints – Combinatorial problem
An example
• F: f1 = 010N, f2 = 0N01, f3 = N100, f4 = 1NN1
• Compatibility: f1 and f2; f1 and f3
• Some possible solutions:
  • (f1 = 010N, f2 = 0N01), (f3 = N100), (f4 = 1NN1)
  • (f1 = 010N, f3 = N100), (f2 = 0N01), (f4 = 1NN1)
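The example can be checked mechanically. The sketch below is an illustration under assumed names (`is_valid_clustering`, the dictionary `F`): it lists the compatible pairs of the four fingerprints and verifies that a proposed partition puts only pairwise compatible fingerprints together.

```python
from itertools import combinations


def compatible(f, g):
    """True if f and g agree wherever neither has an 'N'."""
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


def is_valid_clustering(fingerprints, clusters):
    """Check that clusters partition the fingerprint names and that
    every cluster contains only pairwise compatible fingerprints."""
    names = [name for cluster in clusters for name in cluster]
    if sorted(names) != sorted(fingerprints):
        return False  # not a partition of the input
    return all(
        compatible(fingerprints[a], fingerprints[b])
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )


F = {"f1": "010N", "f2": "0N01", "f3": "N100", "f4": "1NN1"}
compatible_pairs = [
    (a, b) for a, b in combinations(F, 2) if compatible(F[a], F[b])
]
print(compatible_pairs)
print(is_valid_clustering(F, [["f1", "f2"], ["f3"], ["f4"]]))
print(is_valid_clustering(F, [["f1", "f2", "f3"], ["f4"]]))
```

The last call fails because f2 and f3 disagree in the fourth position, so f1, f2 and f3 cannot share a cluster even though f1 is compatible with both.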
Clustering of fingerprints – Three versions of the problem
Three combinatorial versions of the problem with different objective functions:
• CMV (Clustering with Missing Values): minimize the number of clusters
• IEC (Inside Edge Clustering with missing values): maximize the number of co-clustered pairs of fingerprints
• OEC (Outside Edge Clustering with missing values): minimize the number of pairs of compatible fingerprints assigned to different clusters
CMV – An example
CMV: minimize the number of clusters
• F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
• Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
• A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 3
• Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2
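For an instance this small the CMV optimum can be confirmed by exhaustive search. The sketch below is not an algorithm from the talk; it is a brute-force check, assuming that a cluster can be represented by one binary vector compatible with all its members, so that the smallest set of such vectors covering every fingerprint gives the minimum number of clusters. All function names are illustrative.

```python
from itertools import combinations, product


def resolutions(f):
    """All binary vectors obtained by replacing each 'N' in f with 0 or 1."""
    slots = [i for i, c in enumerate(f) if c == "N"]
    result = []
    for bits in product("01", repeat=len(slots)):
        chars = list(f)
        for i, b in zip(slots, bits):
            chars[i] = b
        result.append("".join(chars))
    return result


def min_clusters(fingerprints):
    """Smallest set of binary vectors covering all fingerprints (exponential time)."""
    resolved_by = {}
    for name, f in fingerprints.items():
        for r in resolutions(f):
            resolved_by.setdefault(r, set()).add(name)
    everything = set(fingerprints)
    for k in range(1, len(fingerprints) + 1):
        for chosen in combinations(sorted(resolved_by), k):
            if set().union(*(resolved_by[r] for r in chosen)) == everything:
                return k, chosen


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
print(min_clusters(F))
```

On this instance the search reports two vectors, one resolving f2 and f4 and one resolving f1 and f3, which matches the optimum of size 2.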
IEC – An example
IEC: maximize the number of co-clustered pairs
• F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
• Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
• A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 1: pair (f1, f2) co-clustered
• Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 2: pairs (f1, f3) and (f2, f4) co-clustered
OEC – An example
OEC: minimize the number of compatible pairs that are not co-clustered
• F = {f1 = 01NN, f2 = 0NN1, f3 = 0N00, f4 = 00N1}
• Compatibility: f1 compatible with f2, f1 compatible with f3, f2 compatible with f4
• A solution: (f1 = 01NN, f2 = 0NN1), (f3 = 0N00), (f4 = 00N1) → size 2: pairs (f1, f3) and (f2, f4) not co-clustered
• Optimum: (f1 = 01NN, f3 = 0N00), (f2 = 0NN1, f4 = 00N1) → size 1: pair (f1, f2) not co-clustered
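The three objective values of a given clustering can be computed side by side. The following sketch uses an assumed helper name, `objectives`, and takes a valid clustering as input; it returns the CMV, IEC and OEC values for the two clusterings of the running example.

```python
from itertools import combinations


def compatible(f, g):
    """True if f and g agree wherever neither has an 'N'."""
    return all(a == b or "N" in (a, b) for a, b in zip(f, g))


def objectives(fingerprints, clusters):
    """Return (CMV, IEC, OEC) values of a valid clustering:
    number of clusters, co-clustered pairs, separated compatible pairs."""
    cluster_of = {
        name: index for index, cluster in enumerate(clusters) for name in cluster
    }
    inside = outside = 0
    for a, b in combinations(fingerprints, 2):
        if not compatible(fingerprints[a], fingerprints[b]):
            continue  # incompatible pairs never count
        if cluster_of[a] == cluster_of[b]:
            inside += 1
        else:
            outside += 1
    return len(clusters), inside, outside


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
print(objectives(F, [["f1", "f2"], ["f3"], ["f4"]]))   # first solution
print(objectives(F, [["f1", "f3"], ["f2", "f4"]]))     # optimum
```

In a valid clustering every co-clustered pair is compatible, so the IEC and OEC values always add up to the total number of compatible pairs (3 here). The two problems therefore share optimal solutions, even though their approximation ratios differ.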
Parameterized versions
• We consider parameterized versions of the problem, where the parameter p bounds the number of N's in each fingerprint
• CMV(p), IEC(p), OEC(p): the versions in which fingerprints have at most p positions with value N
Parameterized versions
• Resolution of a fingerprint f: a vector over {0, 1} that is compatible with f
• Example: f = 01NN10
• Possible resolutions:
  • 01 00 10
  • 01 01 10
  • 01 10 10
  • 01 11 10
Parameterized versions
• For each fingerprint with p N's: 2^p possible resolutions
• Reformulation of the problem: given a set of fingerprints and the corresponding set S of resolved vectors, assign each fingerprint f to exactly one of its resolutions in S in order to optimize the objective function
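Enumerating the resolutions of a fingerprint amounts to trying every 0/1 assignment of its N positions. A short sketch, with the function name `resolutions` chosen for illustration:

```python
from itertools import product


def resolutions(f):
    """List the 2^p binary vectors compatible with f, where p is the number of N's."""
    slots = [i for i, c in enumerate(f) if c == "N"]
    result = []
    for bits in product("01", repeat=len(slots)):
        chars = list(f)
        for i, b in zip(slots, bits):
            chars[i] = b  # fill the i-th missing position
        result.append("".join(chars))
    return result


print(resolutions("01NN10"))
print(len(resolutions("NNNNN")))  # 2^5 vectors for p = 5
```

The candidate set S in the reformulation can then be taken as the union of these lists over all input fingerprints, which has at most n · 2^p elements for n fingerprints.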
Previous results
CMV(p):
• NP-hard for p ≥ 2 [Figueroa et al., CATS 2005]
• Poly-time for p = 1 [Figueroa et al., J. of Comp. Biology 2004]
• Approximation algorithm with factor min(1 + log n, 2 + p log l) [Figueroa et al., CATS 2005]
IEC(p):
• Approximation algorithm with factor 2^{2p−1} for any p = O(log n) [Figueroa et al., CATS 2005]
OEC(p):
• Approximation algorithm with factor 2(1-1/2p) for restricted instances [Figueroa et al., CATS 2005]
Approximation algorithm for OEC(p) and IEC(p)
Greedy Algorithm:
WHILE (there exists an unassigned fingerprint)
  • select a resolved vector that resolves the maximum number of unassigned fingerprints
  • assign those fingerprints to the selected vector and delete them
ENDWHILE
• 2-factor approximation ratio for OEC
• ½-factor approximation ratio for IEC
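A minimal Python sketch of the greedy loop follows. It assumes fingerprints are given as a dictionary from names to strings over 0, 1, N, and it breaks ties between equally good resolved vectors lexicographically; the slides do not specify a tie-breaking rule, so that choice and all function names are assumptions made for illustration.

```python
from itertools import product


def resolutions(f):
    """All binary vectors obtained by replacing each 'N' in f with 0 or 1."""
    slots = [i for i, c in enumerate(f) if c == "N"]
    result = []
    for bits in product("01", repeat=len(slots)):
        chars = list(f)
        for i, b in zip(slots, bits):
            chars[i] = b
        result.append("".join(chars))
    return result


def greedy_clustering(fingerprints):
    """Map each chosen resolved vector to the sorted names assigned to it."""
    resolved_by = {}
    for name, f in fingerprints.items():
        for r in resolutions(f):
            resolved_by.setdefault(r, set()).add(name)
    unassigned = set(fingerprints)
    clusters = {}
    while unassigned:
        # Assumed tie-break: the lexicographically smallest best vector wins.
        best = max(
            sorted(resolved_by),
            key=lambda r: len(resolved_by[r] & unassigned),
        )
        members = resolved_by[best] & unassigned
        clusters[best] = sorted(members)
        unassigned -= members  # delete the assigned fingerprints
    return clusters


F = {"f1": "01NN", "f2": "0NN1", "f3": "0N00", "f4": "00N1"}
print(greedy_clustering(F))
```

The candidate list has at most n · 2^p vectors, so the loop runs in polynomial time whenever p = O(log n). With this tie-break the running example ends in two clusters, but a different tie-break can give a worse result.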
A tight example for IEC
• f1 = N001; f2 = 0N01; f3 = 01N1; f4 = 011N
• f1 compatible with f2, f2 compatible with f3, f3 compatible with f4
• Resolved vectors associated with each compatible pair: r12 = 0001; r23 = 0101; r34 = 0111
• Each of these resolved vectors resolves two fingerprints
A tight example for IEC
The algorithm chooses one resolved vector, for example r23:
• f2 and f3 are assigned to r23 and deleted
• r12 is chosen; f1 is assigned to it and deleted
• r34 is chosen; f4 is assigned to it and deleted
• Number of compatible co-clustered pairs: 1
The optimal solution consists of:
• r12, with f1 and f2 assigned to it
• r34, with f3 and f4 assigned to it
• Number of compatible co-clustered pairs in the optimal solution: 2
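The two runs can be replayed in a few lines. In this sketch the shared resolution of each compatible pair is computed from the fingerprints rather than typed in, and the order in which vectors are chosen is passed explicitly to mimic the unlucky and the optimal choices. The helper names are illustrative.

```python
def resolves(r, f):
    """True if binary vector r is a resolution of fingerprint f."""
    return all(c == "N" or c == b for b, c in zip(r, f))


def common_resolution(f, g):
    """One binary vector resolving both f and g, or None if incompatible.
    Positions where both have 'N' are arbitrarily set to 0."""
    out = []
    for a, b in zip(f, g):
        if a != "N" and b != "N" and a != b:
            return None
        out.append(a if a != "N" else (b if b != "N" else "0"))
    return "".join(out)


def co_clustered_pairs(fingerprints, order):
    """Pick resolved vectors in the given order, assigning to each every
    fingerprint that is still unassigned and resolved by it."""
    unassigned = dict(fingerprints)
    pairs = 0
    for r in order:
        members = [n for n, f in unassigned.items() if resolves(r, f)]
        pairs += len(members) * (len(members) - 1) // 2
        for n in members:
            del unassigned[n]
    if unassigned:
        raise ValueError("the chosen vectors do not cover every fingerprint")
    return pairs


F = {"f1": "N001", "f2": "0N01", "f3": "01N1", "f4": "011N"}
r12 = common_resolution(F["f1"], F["f2"])
r23 = common_resolution(F["f2"], F["f3"])
r34 = common_resolution(F["f3"], F["f4"])
unlucky = co_clustered_pairs(F, [r23, r12, r34])
optimal = co_clustered_pairs(F, [r12, r34])
print(r12, r23, r34, unlucky, optimal)
```

Starting from the middle pair yields one co-clustered pair against two in the optimum, which is the ratio ½ stated for IEC.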
A Polynomial Time Algorithm for Restricted CMV
• Restricted CMV: for each position j there is at most one fingerprint having value N in the j-th position
• An instance of restricted CMV:
  f1 = NN 01 01 01
  f2 = 01 NN 01 01
  f3 = 01 11 NN 01
  f4 = 01 11 11 NN
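Whether an instance is restricted can be tested column by column. A small sketch, with `is_restricted` as an assumed name and fingerprints given as equal-length strings:

```python
def is_restricted(fingerprints):
    """True if every position holds an 'N' in at most one fingerprint."""
    # zip(*fingerprints) iterates over the columns of the instance.
    return all(column.count("N") <= 1 for column in zip(*fingerprints))


restricted_instance = ["NN010101", "01NN0101", "0111NN01", "011111NN"]
general_instance = ["01NN", "0NN1", "0N00", "00N1"]
print(is_restricted(restricted_instance))
print(is_restricted(general_instance))
```

The second instance is the running CMV example; it is not restricted because, for instance, both f2 and f3 have an N in the second position.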
A Polynomial Time Algorithm for Restricted CMV
Two interesting properties of restricted CMV:
• There are at most n^2 interesting resolved vectors (a resolved vector is interesting if it resolves more than one fingerprint)
• There is a fingerprint (a private fingerprint) which is resolved by one interesting resolved vector
At each step the algorithm selects the interesting resolved vector that resolves a private fingerprint
APX-hardness of CMV(2)
• L-reduction from MIN Vertex Cover on cubic graphs (APX-hard [Alimonti and Kann, TCS 2000])
• G = (V, E) cubic graph → graph gadget GA = (VA, EA)
• For each vi in V define the vertex gadget GVi
[Figure: the gadget GVi]
Two possible vertex covers of the gadget:
• type 1: suboptimal
• type 2: optimal
APX-hardness of CMV(2)
• G = (V, E) cubic graph → graph gadget GA = (VA, EA)
• For each edge (vi, vj) in E define the edge gadget EGij
[Figure: the edge gadget EGij connecting GVi and GVj]
• Case 1: four vertices covered in EGij → GVi and GVj both optimal
• Case 2: two vertices covered in EGij → GVi or GVj suboptimal
• Case 2 is always better than case 1
APX-hardness of CMV(2)
The instance of CMV(2) is built as follows:
• a resolved vector is built for each vertex of the gadgets
• a fingerprint is built for each edge of the gadgets
• two fingerprints share a common resolution iff the corresponding edges are incident on a common vertex
APX-hardness of IEC(2) and OEC(2)
• L-reduction from MAX Independent Set on cubic graphs (APX-hard [Alimonti and Kann, TCS 2000])
• Similar to the reduction for CMV(2)
• G = (V, E) a cubic graph:
  • for each vertex vi in V, a set Fi of 9 fingerprints
  • for each edge (vi, vj), a fingerprint fij
Open Problems
• Approximation of CMV(p):
  • constant factor not dependent on p?
  • improve the min(1 + log n, 2 + p log l) approximation factor
• Approximation of IEC(p) and OEC(p):
  • improve the approximation factors ½ and 2
• Are restricted versions of IEC and OEC in P?
Conclusions
• Biological problem and combinatorial problem
• Three versions:
  • Clustering with Missing Values (CMV)
  • Inside Edge Clustering (IEC)
  • Outside Edge Clustering (OEC)
• Approximation algorithms for IEC(p) and OEC(p)
• Polynomial time algorithm for restricted CMV
• APX-hardness of CMV(2)
• APX-hardness of IEC(2) and OEC(2)
• Future work