Celso C. Ribeiro Universidade Federal Fluminense, Brazil. Memory approaches to improve multi-start constructive heuristics. Joint work with Eraldo Fernandes (M.Sc., PUC-Rio, Brazil). WEA’2005 – IV Workshop on Experimental and Efficient Algorithms. Santorini, May 2005. Summary.
Celso C. Ribeiro Universidade Federal Fluminense, Brazil Memory approaches to improve multi-start constructive heuristics Joint work with Eraldo Fernandes (M.Sc., PUC-Rio, Brazil) WEA’2005 – IV Workshop on Experimental and Efficient Algorithms Santorini, May 2005
Summary
• Application: DNA sequencing
• Motivation: sequencing by hybridization
• Multi-start randomized constructive heuristic
• Adaptive memory strategy
• Vocabulary building
• Complete heuristic: MS+MEM+VB
• Computational experiments
• Numerical results and comparisons
• Concluding remarks
DNA sequencing
• DNA molecule: sequence formed by a combination of four different nucleotide bases: A, C, G, and T
• Each DNA molecule may be represented as a word over the alphabet {A,C,G,T} of nucleotide bases
• Example: ATAGGCAGGA
• Sequencing: identification of the contents of a DNA molecule
  • Gel electrophoresis
  • Chemical method
Sequencing by hybridization
• SBH: alternative approach to DNA sequencing
• Two phases:
  • Biochemical: hybridization experiment involving a DNA array and the target molecule to be sequenced
  • Computational: reconstruction problem using the results of the hybridization experiment
Sequencing by hybridization
• DNA array:
  • Bidimensional grid
  • Each cell contains a probe: small sequence of q nucleotides
  • Library C(q): set of all 4^q probes of size q in the array
• Hybridization experiment:
  • Array is introduced into a solution containing many copies of the target sequence
  • A copy of the target sequence reacts with a probe if the latter is a subsequence (of the complement) of the former
• Spectrum: set of all probes of size q that reacted with the target sequence, i.e., subsequences of size q that appear in the target
Sequencing by hybridization
Library C(4): (array figure not reproduced)
Target sequence: ATAGGCAGGA
Spectrum: {ATAG, TAGG, AGGC, GGCA, GCAG, CAGG, AGGA}
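The ideal (error-free) spectrum of the example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `spectrum` is an assumption, not something taken from the slides.

```python
def spectrum(target: str, q: int) -> set[str]:
    """Ideal spectrum: every substring of length q that occurs in the target."""
    return {target[i:i + q] for i in range(len(target) - q + 1)}


example = spectrum("ATAGGCAGGA", 4)
print(sorted(example))
```

A target of length n yields at most n - q + 1 distinct probes, here 10 - 4 + 1 = 7.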
Sequencing by hybridization
• Reconstruction problem:
  • Second phase: reconstruction of the target sequence from the spectrum
  • Find a sequence of the probes in the spectrum such that consecutive probes have q-1 bases of superposition
• Hamiltonian path problem on the spectrum:
  • One vertex for each probe u in the spectrum
  • Arc (u,v) from probe u to v if the last q-1 bases of u coincide with the first q-1 bases of v

ATAG
 TAGG
  AGGC
   GGCA
    GCAG
     CAGG
      AGGA
ATAGGCAGGA
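For the error-free example, the Hamiltonian path formulation can be sketched as a small depth-first search. The helper names (`successors`, `hamiltonian_path`, `spell`) are illustrative assumptions, and exhaustive search is only practical for tiny spectra like this one.

```python
def successors(probes, q):
    """Arc (u, v) exists when the last q-1 bases of u equal the first q-1 bases of v."""
    return {u: [v for v in probes if v != u and u[1:] == v[:q - 1]] for u in probes}


def hamiltonian_path(probes, first, q):
    """Depth-first search for a path from `first` that visits every probe once."""
    succ = successors(probes, q)

    def extend(path, seen):
        if len(path) == len(probes):
            return path
        for v in succ[path[-1]]:
            if v not in seen:
                found = extend(path + [v], seen | {v})
                if found:
                    return found
        return None

    return extend([first], {first})


def spell(path):
    """Each probe after the first contributes exactly one new base."""
    return path[0] + "".join(p[-1] for p in path[1:])


probes = ["ATAG", "TAGG", "AGGC", "GGCA", "GCAG", "CAGG", "AGGA"]
path = hamiltonian_path(probes, "ATAG", 4)
print(path, spell(path))
```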
Sequencing by hybridization
Spectrum: {ATAG, TAGG, AGGC, GGCA, GCAG, CAGG, AGGA}
(Figure: graph on the seven probes; the Hamiltonian path ATAG → TAGG → AGGC → GGCA → GCAG → CAGG → AGGA spells the target ATAGGCAGGA)
Sequencing by hybridization
• Hybridization errors:
  • Hybridization experiment is not perfect
  • False positives: probes that appear in the spectrum but not in the target sequence
  • False negatives: probes that occur in the target sequence but not in the spectrum

ATAG
 TAGG
  AGGC
   ----
    GCAG
     CAGG
      AGGA
ATAGGCAGGA
Sequencing by hybridization
• Problem of sequencing by hybridization (PSBH): given the spectrum S = {s1, s2, ..., sm}, the size q of the probes, the length n, and the first probe s0 of the target sequence, find a sequence of length at most n that contains a maximum number of probes from S.
• PSBH is NP-hard (Blazewicz et al., 1999)
Sequencing by hybridization
• Directed graph G = (V,E)
  • V = S (probes in the spectrum)
  • E = {(u,v): u ∈ S and v ∈ S}
• Superposition o(u,v) between two probes u,v ∈ S: size of the largest sequence that is both a suffix of u and a prefix of v
• Weight w(u,v) of the arc (u,v): w(u,v) = q − o(u,v)
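A minimal sketch of the superposition and arc weight follows. It assumes w(u,v) = q − o(u,v), which agrees with the arc labels 1, 2 and 3 in the example graph of the following slides, and it only considers proper overlaps (at most q − 1 bases), since distinct probes of size q cannot overlap completely. Function names are illustrative.

```python
def overlap(u: str, v: str) -> int:
    """o(u, v): length of the longest proper suffix of u that is a prefix of v."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0


def weight(u: str, v: str, q: int) -> int:
    """Assumed arc weight: number of new bases that v adds after u."""
    return q - overlap(u, v)


for u, v in [("ATAG", "TAGG"), ("AGGC", "GCAG"), ("GGCG", "GCAG")]:
    print(u, v, overlap(u, v), weight(u, v, 4))
```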
Sequencing by hybridization
Spectrum: {ATAG, TAGG, AGGC, GCAG, CAGG, AGGA, GGCG} (q = 4)
Target sequence: ATAGGCAGGA (n = 10)
GGCG: false positive
GGCA: false negative
(Figure: graph G on the seven probes of the spectrum, with arcs labeled by weights 1, 2 and 3)
Sequencing by hybridization
• Feasible solutions: acyclic paths in G emanating from vertex s0 with weight less than or equal to n-q
• A path in G is a sequence a = (a1, a2, ..., ak) of probes ai ∈ S, i ∈ {1, 2, ..., k}
• An optimal solution visits a maximum number of vertices and respects the above constraints
• Heuristics: ant colony, tabu search, genetic algorithm
• This work: multi-start constructive heuristic with a memory-based strategy
Sequencing by hybridization
Spectrum: {ATAG, TAGG, AGGC, GCAG, CAGG, AGGA, GGCG} (q = 4)
Target sequence: ATAGGCAGGA (n = 10)
GGCG: false positive
GGCA: false negative

ATAG
 TAGG
  AGGC
   ----
    GCAG
     CAGG
      AGGA
ATAGGCAGGA

(Figure: graph G with arc weights; the solution path is ATAG → TAGG → AGGC → GCAG → CAGG → AGGA)
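The feasibility condition can be checked on this example with a short sketch. It assumes the arc weight w(u,v) = q − o(u,v); the function names are illustrative only.

```python
def overlap(u, v):
    """Longest proper suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0


def path_weight(path, q):
    """Sum of the assumed arc weights q - o(u, v) along the path."""
    return sum(q - overlap(u, v) for u, v in zip(path, path[1:]))


def spell(path):
    """Sequence obtained by merging consecutive probes on their overlaps."""
    seq = path[0]
    for u, v in zip(path, path[1:]):
        seq += v[overlap(u, v):]
    return seq


def is_feasible(path, s0, q, n):
    """Acyclic path starting at s0 with total weight at most n - q."""
    return (path[0] == s0
            and len(set(path)) == len(path)
            and path_weight(path, q) <= n - q)


solution = ["ATAG", "TAGG", "AGGC", "GCAG", "CAGG", "AGGA"]
print(path_weight(solution, 4), spell(solution), is_feasible(solution, "ATAG", 4, 10))
```

The six-probe path has weight 1 + 1 + 2 + 1 + 1 = 6 = n − q and spells the target, whereas a path that also visits the false positive GGCG exceeds the weight bound.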
Multi-start randomized constructive heuristic
• Iteratively builds multiple solutions using a randomized constructive algorithm
• Randomized constructive algorithm builds a different solution at each run
• Returns the best solution found
• Initial solution formed by a single probe: a = (s0)
• Current partial solution (path) is extended at each iteration by the insertion of a new probe at the end
Multi-start randomized constructive heuristic
• Current partial solution (path) is extended at each iteration by the insertion of a new probe at the end
• Probe to be inserted is probabilistically selected from a restricted candidate list (RCL)
  • S(a): probes in the current partial solution a
  • u: last probe in the current path
  • RCL = {v ∈ S\S(a): o(u,v) ≥ (1−α)·max{o(u,t): t ∈ S\S(a)} and w(a) + w(u,v) ≤ n−q}
• Randomly select a probe v from RCL with probability p(u,v) = (1/w(u,v)) / Σ_{t ∈ S\S(a)} (1/w(u,t))
• Greediness: probes with a smaller arc weight (larger superposition) are more likely to be selected
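The construction step and the multi-start loop can be sketched as below. Several points are assumptions rather than details from the slides: the RCL parameter is named `alpha`, the arc weight is taken as q − o(u,v), and the selection probabilities are normalized over the RCL (via `random.Random.choices`) so that they sum to one.

```python
import random


def overlap(u, v):
    """Longest proper suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-k:] == v[:k]:
            return k
    return 0


def construct(spectrum, s0, q, n, alpha, rng):
    """Build one path from s0, drawing each new probe from the RCL."""
    path, used, total = [s0], {s0}, 0
    while True:
        u = path[-1]
        free = [v for v in spectrum if v not in used]
        if not free:
            break
        best = max(overlap(u, v) for v in free)
        # RCL: near-best overlap and total weight still within n - q
        rcl = [v for v in free
               if overlap(u, v) >= (1 - alpha) * best
               and total + q - overlap(u, v) <= n - q]
        if not rcl:
            break
        # Selection bias proportional to 1 / w(u, v), normalized over the RCL
        bias = [1 / (q - overlap(u, v)) for v in rcl]
        v = rng.choices(rcl, weights=bias)[0]
        path.append(v)
        used.add(v)
        total += q - overlap(u, v)
    return path


def multi_start(spectrum, s0, q, n, alpha, iterations, rng):
    """Repeat the randomized construction and keep the longest path."""
    best = [s0]
    for _ in range(iterations):
        a = construct(spectrum, s0, q, n, alpha, rng)
        if len(a) > len(best):
            best = a
    return best


S = ["ATAG", "TAGG", "AGGC", "GCAG", "CAGG", "AGGA", "GGCG"]
best = multi_start(S, "ATAG", 4, 10, 0.5, 200, random.Random(0))
print(len(best), best)
```

With `alpha = 0` the construction is purely greedy on the overlap; larger values admit more candidates into the RCL.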
Adaptive memory strategy
• Application to QAP: Fleurent and Glover, 1999
• Pool Q of elite solutions (best solutions found): diversity
• Intensification strategy for the constructive algorithm
• Makes use of two kinds of information in the construction: superposition between the probes and frequency of the arcs in the elite solutions
• A parameter balances the weights of the two terms: greediness (superposition) and frequency (memory)
Adaptive memory strategy
• The probability p(u,v) of selecting a probe v from the RCL to extend the current partial solution whose last probe is u combines two terms:
  • Greediness: higher when the superposition between probes u and v is larger
  • Frequency: higher for arcs (u,v) appearing more often in the solutions of the elite set
Adaptive memory strategy
• Pool update:
  • Pool size: at most q solutions
  • Solution a is a candidate to be inserted into the pool Q if it is better than the worst solution currently in the pool, i.e., |a| > min{|a'|: a' ∈ Q}
  • Candidate solution a replaces the worst solution in the pool if it is better than the best solution in the pool (|a| > max{|a'|: a' ∈ Q}) or if it is sufficiently different from every other solution in the pool (min{dist(a,a'): a' ∈ Q} ≥ dmin)
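The pool update rule can be sketched as follows. The slides do not define dist, so this sketch assumes it is the number of arcs that appear in exactly one of the two paths; it also assumes that solutions are simply appended while the pool is not yet full. Names such as `update_pool` and `d_min` are illustrative.

```python
def arcs(path):
    """Set of arcs (consecutive pairs) used by a path."""
    return set(zip(path, path[1:]))


def dist(a, b):
    """Assumed distance: arcs present in exactly one of the two paths."""
    return len(arcs(a) ^ arcs(b))


def update_pool(pool, a, max_size, d_min):
    """Try to insert solution a into the elite pool; return True if inserted."""
    if len(pool) < max_size:          # assumption: fill the pool first
        if a in pool:
            return False
        pool.append(a)
        return True
    worst = min(pool, key=len)
    if len(a) <= len(worst):          # not a candidate
        return False
    best = max(pool, key=len)
    if len(a) > len(best) or min(dist(a, b) for b in pool) >= d_min:
        pool[pool.index(worst)] = a   # replace the worst elite solution
        return True
    return False


pool = []
for candidate in ([1, 2, 3], [1, 3, 2], [1, 2, 3, 4]):
    update_pool(pool, candidate, max_size=2, d_min=2)
print(pool)
```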
Vocabulary building
• Good solutions are very often formed by the same building blocks (paths)
• Optimal solutions formed by components appearing in suboptimal solutions
• Identify short paths with optimal superposition and combine them to build optimal solutions
• Vocabulary building: Glover and Laguna, 1997
  • Find common paths appearing in good solutions (words)
  • Combine them into new good solutions (phrases)
Vocabulary building
• Solutions encoded as adjacency vectors
• Solution a = (a1,a2,...,ak) represented as a vector x = (x1,x2,...,x|S|)
• If xu = s, then probe s follows immediately after probe u, i.e., the arc (u,s) is used in the path
• Example with six probes: a = (1,4,2,3,5) gives x1 = 4, x4 = 2, x2 = 3, x3 = 5, with x5 and x6 undefined
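A possible encoding in Python uses a dictionary in which undefined positions are simply absent. The dictionary representation and the name `to_adjacency` are assumptions made for illustration.

```python
def to_adjacency(path):
    """Map each probe u on the path to the probe that follows it."""
    return dict(zip(path, path[1:]))


x = to_adjacency([1, 4, 2, 3, 5])
print(x)
```

Probe 5 (the last probe of the path) and probe 6 (not visited) have no successor, so they do not appear as keys.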
Vocabulary building
• Some notation:
  • Set X of adjacency vectors
  • Size(x): number of arcs in the adjacency vector x
  • Inter(X): subset of arcs that appear in all vectors in X
  • Enclosure(y,X): set formed by all vectors in X that contain the arcs in the adjacency vector y
Inter(x1,x2): (figure not reproduced: two adjacency vectors x1 and x2 on eight vertices, followed by the arcs common to both)
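The three operators can be sketched on dictionary-encoded adjacency vectors (key u, value the successor of u). The encoding and the lowercase function names are assumptions for illustration.

```python
def size(x):
    """Size(x): number of arcs defined in the adjacency vector."""
    return len(x)


def inter(vectors):
    """Inter(X): arcs that appear in every vector of the list."""
    common = set(vectors[0].items())
    for x in vectors[1:]:
        common &= set(x.items())
    return dict(common)


def enclosure(y, vectors):
    """Enclosure(y, X): vectors of X that contain all arcs of y."""
    return [x for x in vectors if y.items() <= x.items()]


x1 = {1: 2, 2: 3, 3: 4, 4: 5}   # path 1, 2, 3, 4, 5
x2 = {1: 2, 2: 3, 3: 5, 5: 4}   # path 1, 2, 3, 5, 4
common = inter([x1, x2])
print(common, size(common), len(enclosure({1: 2}, [x1, x2])))
```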
Vocabulary building
• Find words: given an elite set X, find vectors y with |Enclosure(y,X)| as large as possible and Size(y) ≥ smin (non-elementary small words), where smin is a parameter
Vocabulary building
• Algorithm FindWords(X,smin):
    Y ← ∅, X' ← X
    while X' ≠ ∅ do
      x ← rand(X'), Z ← {x}, X'' ← X − {x}
      while X'' ≠ ∅ do
        x ← rand(X'')
        if Size(Inter(Z ∪ {x})) ≥ smin then Z ← Z ∪ {x}
        X'' ← X'' − {x}
      end-while
      if |Z| > 1 then y ← Inter(Z); Y ← Y ∪ {y}
      X' ← X' − Z
    end-while
    return Y
• Martins and Plastino, 2005: more effective algorithm based on data mining strategies
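A Python rendering of FindWords might look like the sketch below. It assumes adjacency vectors are stored as dictionaries, and it visits the candidates in shuffled order, which is equivalent to repeatedly drawing rand(X'') without replacement. Names are illustrative.

```python
import random


def inter(vectors):
    """Arcs common to every adjacency vector (dict u -> successor) in the list."""
    common = set(vectors[0].items())
    for x in vectors[1:]:
        common &= set(x.items())
    return dict(common)


def find_words(X, s_min, rng):
    """Group elite vectors that share at least s_min arcs; return the shared arcs."""
    words, remaining = [], list(X)
    while remaining:
        seed = rng.choice(remaining)
        group = [seed]
        candidates = [x for x in X if x is not seed]
        rng.shuffle(candidates)
        for x in candidates:
            if len(inter(group + [x])) >= s_min:
                group.append(x)
        if len(group) > 1:
            word = inter(group)
            if word not in words:
                words.append(word)
        # Remove every grouped vector from the set still to be processed
        remaining = [x for x in remaining if not any(x is g for g in group)]
    return words


x1 = {1: 2, 2: 3, 3: 4, 4: 5}
x2 = {1: 2, 2: 3, 3: 5, 5: 4}
x3 = {5: 1}
words = find_words([x1, x2, x3], 2, random.Random(1))
print(words)
```

On this small elite set the result does not depend on the random choices: x1 and x2 share the word formed by arcs (1,2) and (2,3), and x3 shares nothing with either.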
Vocabulary building
• Additional notation:
  • x and y: adjacency vectors
  • ExtInter(x,y): undefined variables in one of the vectors are filled with the corresponding defined variables in the other
ExtInter(x1,x2): (figure not reproduced: two adjacency vectors x1 and x2 on eight vertices, followed by their extended intersection)
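ExtInter and the in-degree test used when combining words can be sketched as below, again on dictionary-encoded vectors. The slides do not say what happens when both vectors define a position with different values; this sketch assumes such positions are left undefined. Function names are illustrative.

```python
from collections import Counter


def ext_inter(x, y):
    """Fill positions undefined in one vector with the values defined in the other."""
    merged = {}
    for u in x.keys() | y.keys():
        if u in x and u in y:
            if x[u] == y[u]:
                merged[u] = x[u]
            # conflicting successors: left undefined (assumption)
        else:
            merged[u] = x.get(u, y.get(u))
    return merged


def max_in_degree(x):
    """Largest number of arcs entering the same probe."""
    return max(Counter(x.values()).values(), default=0)


compatible = ext_inter({1: 2, 2: 3}, {3: 4, 4: 5})
clashing = ext_inter({1: 3}, {2: 3})
print(compatible, max_in_degree(compatible), max_in_degree(clashing))
```

A maximum in-degree of 1 means no probe is entered by two arcs, so the merged vector can still be part of a path.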
Vocabulary building
• Combine words: given a set of words Y, combine them into phrases
• Very similar to the algorithm that finds words, replacing the original operator Inter by the new operator ExtInter
Vocabulary building
• Algorithm CombineWords(Y):
    Z ← ∅, Y' ← Y
    while Y' ≠ ∅ do
      y ← rand(Y'), W ← {y}, Y'' ← Y − {y}
      while Y'' ≠ ∅ do
        y ← rand(Y'')
        if MaxInDegree(ExtInter(W,y)) = 1 then W ← W ∪ {y}
        Y'' ← Y'' − {y}
      end-while
      if |W| > 1 then z ← ExtInter(W); Z ← Z ∪ {z}
      Y' ← Y' − W
    end-while
    return Z
Vocabulary building
• Phrases may be incomplete or infeasible
• Infeasible phrases (solutions) are made feasible:
  • Insert probe s0 in the best place in case it does not appear in the phrase
  • Complete the solution by joining the subpaths of the phrase
Vocabulary building
• Algorithm VocabularyBuilding(X,smin):
    Y ← FindWords(X,smin)
    Z ← CombineWords(Y)
    A ← ∅
    for each z ∈ Z do
      a ← MakeFeasible(z)
      A ← A ∪ {a}
    end-for
    return A
Complete heuristic: MS+MEM+VB
• Q: pool of elite solutions for adaptive memory
• X: pool of elite solutions for vocabulary building, |X| >> |Q|
• Algorithm MS+MEM+VB:
    Q ← ∅, X ← ∅; a* ← null
    for i = 1, ..., MAXITER do
      a ← GreedyRandomizedMemory(Q, weight)
      if |a| > |a*| then a* ← a
      update the weight and use a to update pools Q and X
      if i mod nVB = 0 then
        A ← VocabularyBuilding(X,smin)
        for every a ∈ A do
          use a to update pools Q and X
          if |a| > |a*| then a* ← a
        end-for
    end-for
    return a*
Computational experiments
• Conditions:
  • Pentium 2.4 GHz with 512 MB of RAM
  • Linux 10.0 with kernel 2.6.3
  • Codes in ANSI C++ compiled with GNU compiler version 3.3.2
• Instances:
  • Set A: instances generated from real human DNA sequences obtained from GenBank
  • Set R: randomly generated instances
Computational experiments
• Instances A:
  • Origin: 40 GenBank sequences
  • Five smaller sequences are generated from each original sequence, corresponding to their prefixes of size n = 109, 209, 309, 409, 509
  • For each of them, we consider its ideal spectrum, with size respectively equal to 100, 200, 300, 400, 500, using an array with probes of size q = 10
  • Total: 200 instances
  • 20% of false negatives and 20% of false positives generated for each instance (probe s0 appears in all of them, no repetitions)
Computational experiments
• Instances R:
  • Origin: 100 random sequences
  • Ten smaller sequences are generated from each original sequence, corresponding to their prefixes of size n = 100, 200, ..., 1000
  • For each of them, we consider its ideal spectrum, with size respectively equal to 92, 192, ..., 992, using an array with probes of size q = 7
  • Total: 1000 instances
  • 20% of false negatives and 20% of false positives generated for each instance (probe s0 appears in all of them, no repetitions)
Computational experiments
• Solution quality evaluation:
  • Number of probes in the solution: |a|
  • Similarity with the target sequence:
    • Perform the alignment between the sequence s(a) spelled by the solution and the target sequence s* (matches: +1, mismatches: -1) to compute the value align(s(a),s*) by dynamic programming
    • Compute similarity(a) = 100·(align(s(a),s*) + nmax)/(2·nmax), with nmax = max{|s(a)|,|s*|}
  • Fraction:
    • fraction(a) = 100·|a|/|a*|
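The similarity measure can be sketched with a standard global alignment by dynamic programming. The slides only give the scores for matches (+1) and mismatches (-1); this sketch assumes a gap is also scored -1. Function and parameter names are illustrative.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score, keeping one row of the DP table at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def similarity(solution_seq, target_seq):
    """Rescale the alignment score from [-nmax, nmax] to [0, 100]."""
    n_max = max(len(solution_seq), len(target_seq))
    return 100 * (align(solution_seq, target_seq) + n_max) / (2 * n_max)


print(similarity("ATAGGCAGGA", "ATAGGCAGGA"))
print(similarity("ATAGGCAGG", "ATAGGCAGGA"))
```

Identical sequences score 100, and the score drops as matches are replaced by mismatches or gaps.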
Computational experiments
• Random instances in set R used for parameter setting and tuning
• The weight parameter decreases with the iteration counter
• Small values of the RCL parameter are used in the beginning, so that purely greedy solutions are generated when no frequency information is available
• The initial parameter value decreases with the problem size
• MAXITER = 10·n (iterations)
• Both parameters are updated after blocks of n/2 iterations
Numerical results
(Plot: average similarity with the target sequence over all R instances with the same size, for MS and MS+MEM+VB)
• Each additional component (memory, VB) improves the multi-start heuristic
Numerical results
(Plot: average computation time over all R instances with the same size)