An Algorithm for the Consecutive Ones Property. Claudio Eccher. Outline. C1P definition. Biological background Hybridization mapping. An algorithm for the C1P problem Dividing in components Taking care of a component Joining the components together. The consecutive ones property.
An Algorithm for the Consecutive Ones Property Claudio Eccher
Outline • C1P definition • Biological background • Hybridization mapping • An algorithm for the C1P problem • Dividing in components • Taking care of a component • Joining the components together
The consecutive ones property Definition: A binary matrix is said to have the consecutive ones property (C1P) if a permutation of its columns can be found such that all 1s in each row are consecutive
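The definition can be checked directly on small matrices. The following is a minimal brute-force sketch that tries every column permutation, so it runs in exponential time and is only meant to make the definition concrete. It is not the algorithm presented in these slides, and the function names are hypothetical.

```python
from itertools import permutations


def ones_are_consecutive(row):
    """True if all 1s in the row form one contiguous block (or there are none)."""
    positions = [k for k, bit in enumerate(row) if bit == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def has_c1p_bruteforce(matrix):
    """Return a column order (0-based indices) witnessing the C1P, or None.

    Tries all permutations of the columns, so it is usable only for tiny inputs.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(ones_are_consecutive([row[k] for k in order]) for row in matrix):
            return list(order)
    return None
```

For instance, `has_c1p_bruteforce([[1, 0, 1], [0, 1, 1]])` finds an order that swaps the last two columns, while a 3 x 3 matrix whose rows are `110`, `011` and `101` has no valid order.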
The consecutive ones property Observation: the C1P is closed under taking submatrices A bad matrix: Whichever column x I put in the middle there is a row in which x is 0 Hence, every matrix containing this submatrix is ‘bad’
Hybridization mapping (1) • Copies of a DNA molecule are broken into several fragments (~10^4 bases) and replicated by cloning (clones) • Each clone is tested for binding of small sequences (probes); the subset of probes bound (hybridized) to a clone becomes its fingerprint • The clones' overlaps, and thus their relative order, are determined by comparing fingerprints
Hybridization mapping (2) Two clones sharing part of their respective fingerprints are likely to have come from overlapping DNA regions [Figure: Clone 1 and Clone 2 with probes A, B, D, C]
Assumptions • Probes are unique • There are no errors • All “clones x probes” hybridization experiments have been done
Model • n clones and m probes • n × m binary matrix M built from experimental data • Mij = 1 ⇒ probe j hybridized to clone i • Mij = 0 ⇒ probe j not hybridized to clone i
Problem • Obtaining a physical map from M • Determining whether M has the C1P for rows • Finding a permutation of the columns such that all 1s in each row are consecutive
An algorithm for the C1P problem • The problem belongs to P • The algorithm is from Fulkerson and Gross (1965) Without loss of generality we can assume that: • All rows are different • No row is all zeros
Algorithm sketch • Separation of the rows into components (subsets of rows) • Permutation of the columns of each component • Joining of the components together
Row relations Definition: ∀ row i ∈ M, Si = {columns k | Mi,k = 1} • Given two rows i and j: • Si ∩ Sj = ∅ or • Si ⊆ Sj or Sj ⊆ Si or • Si ∩ Sj ≠ ∅ and none of them is a subset of the other
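The three possible relations between two rows can be sketched with Python sets. This is an illustrative sketch only: the names `column_set` and `relation`, the 0-based column indices, and the string labels are assumptions made for the example.

```python
def column_set(row):
    """S_i: the set of (0-based) column indices where the row has a 1."""
    return {k for k, bit in enumerate(row) if bit == 1}


def relation(s_i, s_j):
    """Classify two column sets as 'disjoint', 'contained' or 'overlap'."""
    if not (s_i & s_j):
        return "disjoint"      # empty intersection
    if s_i <= s_j or s_j <= s_i:
        return "contained"     # one set is a subset of the other
    return "overlap"           # non-empty intersection, neither is a subset
```

Only the third case, `"overlap"`, constrains the relative placement of the two rows strongly enough to tie them into the same component.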
Dividing in components (1) Let’s initially lump together in the same component the rows with non-empty intersection. If ∃ a row k s.t. Sk ∩ Si = ∅ or Sk ⊆ Si, ∀ i ≠ k in this component, then row k can be put in its own component
Dividing in components (2) A graph Gc = (V,E) is built from matrix M • Each vertex in V is a row of M • There is an undirected edge in E between Vi and Vj if Si ∩ Sj ≠ ∅ and none of them is a subset of the other. The components we want are the connected components of Gc
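The construction of Gc and its connected components can be sketched as follows. This is a naive version that compares all pairs of rows, so it does not reach the running time quoted later in the slides. The function names and the adjacency-dictionary representation are assumptions of the sketch.

```python
def build_gc(sets):
    """Build Gc as an adjacency dict: rows i and j are joined when their
    column sets intersect and neither is a subset of the other."""
    n = len(sets)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            a, b = sets[i], sets[j]
            if (a & b) and not a <= b and not b <= a:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def connected_components(adj):
    """Return the connected components of Gc as sorted lists of row indices."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                      # iterative depth-first search
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components
```

With the hypothetical sets `[{0, 1}, {1, 2}, {5}, {0, 1, 2, 3}]`, only the first two rows overlap properly, so they form one component and each remaining row is a component on its own.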
Building Gc: an example [Figure: graph Gc on rows l1 to l8, with components labeled α, β, γ, δ; successive slides add the edges (l1, l2), (l4, l5), (l6, l7) and (l6, l8)]
Taking care of a component (1) The 1s of the first row (l1) must be placed consecutively. The possible solutions can be represented as follows: [Figure: placement of rows l1, l2, l3] The second row is adjacent to the first one. Hence, for the second row (l2) there are 2 choices: the 1s can be placed to the left or to the right of those of row l1. In either case, the direction does not really matter
Taking care of a component (2) For the third row (l3) we have to consider its relations with the rows connected to l3 by edges. Let’s place l3 with respect to l2: we are not free to place l3 in either direction (left or right), because of its relation with l1. To take into account the relation between l1 and l3, it is necessary to consider the number of elements in the intersections between S1, S2 and S3
Taking care of a component (3) Definition: Let x·y = |Sx ∩ Sy| be the internal product of rows x and y • If l1·l3 < min(l1·l2, l2·l3) then l3 has to be placed in the same direction in which l2 was placed with respect to l1 • If l1·l3 > min(l1·l2, l2·l3) then l3 has to be placed in the direction opposite to that in which l2 was placed with respect to l1 • If we have equality, it isn’t possible to have the 1s of l3 consecutive
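The direction rule for a third row can be sketched as a small function. It follows the three cases exactly as stated on this slide, including the equality case. The names `inner` and `direction_for_third_row`, the `"left"`/`"right"` labels, and the use of `None` for the impossible case are assumptions of the sketch.

```python
def inner(s_x, s_y):
    """Internal product x . y = |S_x intersect S_y|."""
    return len(s_x & s_y)


def direction_for_third_row(s1, s2, s3, dir_l2_wrt_l1):
    """Direction ('left' or 'right') in which l3 is placed with respect to l2.

    dir_l2_wrt_l1 is the direction in which l2 was placed with respect to l1.
    Returns None in the equality case, where the slide states that the 1s of
    l3 cannot be made consecutive.
    """
    opposite = {"left": "right", "right": "left"}
    lhs = inner(s1, s3)
    rhs = min(inner(s1, s2), inner(s2, s3))
    if lhs < rhs:
        return dir_l2_wrt_l1            # same direction as l2 w.r.t. l1
    if lhs > rhs:
        return opposite[dir_l2_wrt_l1]  # opposite direction
    return None
```

As a hypothetical example, with S1 = {1,2}, S2 = {2,3}, S3 = {3,4} and l2 placed to the right of l1, the rule keeps going right, which matches the column order 1, 2, 3, 4. With S1 = {1,2,3}, S2 = {3,4}, S3 = {2,3} it returns the opposite direction, since l3 must extend back into l1.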
Taking care of a component (4) For l3, S3 = {1,4,7,8}, l1·l3= 2, l1·l2= 2, l1·l3= 1, so l3 has to be put to the right of l2:
Taking care of a component (5) We had no choice in placing l3. Therefore, if the component has the C1P, then l1 and l3 must turn out to be properly placed. If, on the contrary, l1 and l3 are not properly placed, then we conclude that the component (and hence the matrix) doesn’t have the C1P. The only choice made was in the placement of l2 with respect to l1, and both possibilities result in the same solutions up to reversal.
String generator We have seen the following examples of string generators. A permutation p of the probes is compatible with a string generator if, whenever A, B, C appear in this order in p and A and C are in a group G, then B is also included in G. An invariant of the algorithm is that, after considering rows 1..k, a permutation p certifies the C1P of the submatrix on rows 1..k iff either p or its reversal is compatible with the string generator
Taking care of a component: a ‘bad’ component The relations between the rows are the same as the preceding component
Taking care of a component (6) For a new row k in the same component, find two previously placed rows i and j s.t. ∃ edges (k,i) and (i,j) in Gc, and proceed as for the three-row case. Check also the consistency with the solution generator. The algorithm gives all possible permutations of a component having the C1P, up to reversal
Algorithm implementation Construct Gc and traverse it using depth-first search. When visiting a vertex, invoke procedure Place
Algorithm Place
  input: u, v, w vertices of Gc = (V,E) s.t. (u,v) ∈ E and (v,w) ∈ E
  output: A placement for row u, if possible
  if v = nil and w = nil then
    Place all 1s of u consecutively
  else if w = nil then
    Left- or right-place the 1s of u with respect to the 1s of v
    Record direction used
  else if u · w < min(u · v, v · w) then
    Place u with respect to v in the same direction used in v, w placement
    Record direction used
  else
    Place u with respect to v in the opposite direction used in v, w placement
    Record direction used
  Check consistency of column sets
  If column sets are not consistent then the component doesn’t have the C1P
Algorithm running time For an n × m matrix, building the graph Gc takes O(nm) time. Checking consistency of column sets requires O(m) time per row, and there are n rows to process. Total time is thus O(nm)
Joining components together (1) Construct a new graph GM = (V,E) in which: • Each component αk of M is a vertex in GM • For α, β ∈ V, there is a directed edge from α to β if, ∀ row i ∈ β, the set Si is contained in at least one set Sj of α. GM tells us how the components of M fit together
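The edge rule for GM can be sketched directly from the definition. Components are given as lists of row indices and `sets[i]` is the column set of row i. The function name `build_gm`, the integer component labels, and the quadratic all-pairs loop are assumptions made for illustration, not the preprocessing scheme the slides rely on for the running time.

```python
def build_gm(components, sets):
    """Directed graph GM: edge a -> b when every row set of component b is
    contained in at least one row set of component a."""
    edges = {a: set() for a in range(len(components))}
    for a, rows_a in enumerate(components):
        for b, rows_b in enumerate(components):
            if a == b:
                continue
            if all(any(sets[i] <= sets[j] for j in rows_a) for i in rows_b):
                edges[a].add(b)
    return edges
```

For the hypothetical input `sets = [{0, 1, 2}, {2, 3, 4}, {0, 1}, {6}]` with components `[[0, 1], [2], [3]]`, the only edge goes from the first component to the second, because `{0, 1}` is contained in `{0, 1, 2}`.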
GM for the example matrix [Figure: graph GM with vertices α, β, γ, δ]
Joining components together (2) For two sets Si ∈ β, Sj ∈ α, if Si ⊆ Sj then there is no row k ∈ α s.t. Si ⊄ Sk and Si ∩ Sk ≠ ∅. The exact same containments and disjunctions hold for all other sets from β. GM is acyclic
Joining components together (3) The joining of components depends on the way sets in one component contain or are contained in sets from other components Components having sets not contained anywhere else should be processed first Containment is specified by the directed edges in GM
Joining components together (4) GM has to be processed in topological order • Remove all sources from GM (e.g. α) and take the union of their string generators • While GM is not empty, take the next source β, remove β from GM, and refine the current string generator with the string generator of β
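The order in which components are taken from GM can be sketched with a standard source-removal (Kahn-style) topological sort. The sketch covers only the ordering. The union and refinement of string generators are not modeled, and the function name and dictionary representation are assumptions.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices of a DAG in topological order by repeatedly
    removing sources. `edges` maps every vertex to its set of successors."""
    indegree = {v: 0 for v in edges}
    for v in edges:
        for w in edges[v]:
            indegree[w] += 1
    queue = deque(v for v in edges if indegree[v] == 0)  # initial sources
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in sorted(edges[v]):   # sorted only to make the order deterministic
            indegree[w] -= 1
            if indegree[w] == 0:     # w has become a source
                queue.append(w)
    if len(order) != len(edges):
        raise ValueError("graph has a cycle")
    return order
```

With hypothetical edges `{"alpha": {"beta", "gamma"}, "beta": set(), "gamma": {"delta"}, "delta": set()}`, the components would be processed in the order alpha, beta, gamma, delta.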
Example (1) [Figure: graph GM with vertices α, β, γ, δ] One topological order is α, β, γ, δ
Example (2) [Figure: components α, β, δ, γ]
Example (5) In this particular case there are two solutions corresponding to the permutation of identical columns (5 and 9)
Algorithm solution is not unique In general multiple solutions may exist because: • Each component may on its own have several solutions • Each solution can be used in two ways: the permutation and its reversal
Algorithm running time Topological sorting of GM takes time O(n+m). If the entries of M are preprocessed, the queries needed for traversing GM can take constant time. Preprocessing takes at most O(nm). Total time for processing each component ci is O(ni m). Algorithm running time is O(nm)
Concluding remarks (1) Even if a C1P permutation exists, this is not necessarily the true permutation: • The solution is not unique • In general errors do exist, so the true permutation is not the C1P one
Concluding remarks (2) Generalizations to account for errors yield NP-hard problems Also relaxing the assumption of unique probes yields NP-hard problems
Related works • A considerably more complicated algorithm by Booth and Lueker (1976) takes O(n+m+r) time (r is the total number of 1s) • Quite recently a simple O(n+m+r)-time algorithm has been presented by Hsu, J. Algorithms 43 (2002), no. 1, 1-16