Approximate Search
邹权 (Zou Quan), PhD, Assistant Professor
http://datamining.xmu.edu.cn
Outline
• Global alignment
• Local alignment
• BLAST
Why compare sequences?
• Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ
• Goal: algorithms that solve this problem efficiently
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||     ||   ||  |  | ||| |  |||| |||||    |||  |||
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
|  | |              |  |||||| |   ||||  | || |   |
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||   |   |||||||||  ||   |||||| |
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
Two notions
• Similarity: a measure of how similar two sequences are
• Alignment: a basic operation for comparing two sequences; a way of placing one sequence above the other so that the correspondence between similar characters or substrings of the two sequences is clear
Alignments involve:
• global comparisons: entire sequences
• local comparisons: just substrings of the sequences
• Dynamic programming (DP) is the technique used to compare two sequences
global comparison: example
• Sequences to align: GACGGATTAG and GATCGGAATAG
• Alignment:
GA-CGGATTAG
GATCGGAATAG
• Differences: an extra T in the second sequence; a change from A to T
• A space is written as a dash (-)
global comparison: the basic algorithm
Definitions
• Alignment: insert spaces into the sequences so that both have the same size, then place one over the other to create a correspondence between their characters. A column may not contain a space in both sequences; spaces may also be inserted at the beginning or end.
• Scoring function: a measure of similarity between the two elements of a column: match (identical characters) +1; mismatch (distinct characters) -1; space -2
• Scoring system: rewards matches and penalizes mismatches and spaces
global comparison: the basic algorithm
• Alignment:
GA-CGGATTAG
GATCGGAATAG
• Example: the total score is 6 (9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6)
• Similarity sim(s, t): the maximum alignment score over all alignments of s and t
• Best (optimal) alignment: an alignment whose score equals sim(s, t); there may be many such alignments
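The scoring system above can be checked with a few lines of Python. This is a minimal sketch; the function name `score_alignment` and its keyword parameters are illustrative choices, not part of the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, space=-2):
    """Score two aligned rows column by column ('-' denotes a space)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += space      # one character against a space
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # distinct characters
    return total


# 9 matches, 1 mismatch, 1 space: 9 - 1 - 2 = 6
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))
```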
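Basic DP algorithm for comparing two sequences
• The number of alignments between two sequences is exponential, so an efficient algorithm is needed
• DP: compute the similarity of prefixes, from shorter to longer
• Idea: an (m+1)×(n+1) array a in which entry (i, j) is the similarity between s[1..i] and t[1..j]
• p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j]; in the figure these values appear in the upper left corner of each entry
• Each entry is computed from its three neighbors: a[i, j] = max(a[i-1, j] + g, a[i-1, j-1] + p(i, j), a[i, j-1] + g), where g = -2 is the space penalty, a[i, 0] = i×g and a[0, j] = j×g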
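DP array for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, space -2
t[j]:          A    G    C
j:        0    1    2    3
i=0       0   -2   -4   -6
i=1 A    -2    1   -1   -3
i=2 A    -4   -1    0   -2
i=3 A    -6   -3   -2   -1
i=4 C    -8   -5   -4   -1
The small corner values in the figure are p(i, j): +1 where s[i] = t[j], -1 otherwise. The bottom-right entry gives sim(AAAC, AGC) = -1.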
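The array for s = AAAC and t = AGC can be recomputed with a short script. This is a sketch under the slide's scoring (match +1, mismatch -1, space -2); the name `similarity_table` is hypothetical.

```python
def similarity_table(s, t, match=1, mismatch=-1, g=-2):
    """Fill the (m+1) x (n+1) array; entry (i, j) is sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * g                      # s[1..i] against spaces only
    for j in range(1, n + 1):
        a[0][j] = j * g                      # t[1..j] against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g,      # s[i] against a space
                          a[i - 1][j - 1] + p,  # s[i] against t[j]
                          a[i][j - 1] + g)      # t[j] against a space
    return a


table = similarity_table("AAAC", "AGC")
for row in table:
    print(row)
print("sim =", table[-1][-1])
```

The bottom-right entry is the similarity; following the choices that produced each maximum back to entry (0, 0) yields an optimal alignment.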
local comparison
• Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t
• Algorithm: find the highest-scoring local alignment between the two sequences
local comparison
• Idea (data structure): an (m+1)×(n+1) array; entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j]
• Initialization: the first row and column are filled with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero
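A minimal Python sketch of this array follows. It assumes the usual recurrence for local comparison, in which every entry also competes with 0 (the empty-suffix alignment mentioned above); the function name and the returned end position are illustrative choices.

```python
def local_similarity(s, t, match=1, mismatch=-1, g=-2):
    """Return (best local score, (i, j)) where the best alignment ends."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                     # empty suffixes
                          a[i - 1][j] + g,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + g)
            if a[i][j] > best:
                best, best_pos = a[i][j], (i, j)
    return best, best_pos


# The shared segment CAGT scores 4 and ends at s[6], t[6] (1-based).
print(local_similarity("TTCAGTAA", "GGCAGTCC"))
```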
Global alignment
Local vs. Global Alignment (cont'd)
• Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
• Local alignment (a better alignment for finding a conserved segment):
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
Local Alignment: Example
• Compute a "mini" global alignment to get the local alignment
• Figure labels: local alignment, global alignment
semiglobal comparison
Summary
• Forgiving initial spaces: initialize certain positions (first row and/or first column) with zero
• Forgiving final spaces: look for the maximum along certain positions (last row and/or last column)
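As an illustration of both rules at once, the sketch below forgives end spaces in both sequences: the whole first row and column start at zero, and the result is the maximum over the last row and last column. Which positions to forgive depends on the application, so treating all four ends as free is an assumption made here, and the function name is hypothetical.

```python
def semiglobal_similarity(s, t, match=1, mismatch=-1, g=-2):
    """Best alignment score when leading and trailing spaces are not charged."""
    m, n = len(s), len(t)
    # Zero-filled first row and column: initial spaces are forgiven.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)
    # Maximum over last row and last column: final spaces are forgiven.
    return max(max(a[m]), max(row[n] for row in a))


# ACGT against CG: the alignment ACGT / -CG- has two matches and only end spaces.
print(semiglobal_similarity("ACGT", "CG"))
```

With the global scoring the same pair would be charged for both end spaces (2 - 4 = -2); here the score is 2.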
saving space
Computing sim(s, t)
Algorithm BestScore
input: sequences s and t
output: vector a
m ← |s|
n ← |t|
for j ← 0 to n do
    a[j] ← j × g
for i ← 1 to m do
    old ← a[0]
    a[0] ← i × g
    for j ← 1 to n do
        temp ← a[j]
        a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
        old ← temp
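A direct Python transcription of BestScore is sketched below. The default scores (+1, -1, g = -2) are taken from the earlier slides; the function name `best_score` is an illustrative choice.

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the DP array using a single vector of size n+1."""
    m, n = len(s), len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]                 # value of entry (i-1, 0)
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]            # entry (i-1, j), the next column's diagonal
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g,       # from above
                       old + p,        # from the diagonal
                       a[j - 1] + g)   # from the left
            old = temp
    return a


row = best_score("AAAC", "AGC")
print(row, "sim =", row[-1])
```

The vector holds sim(s, t[1..j]) for every j, so memory is proportional to n rather than to m×n.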
An optimal alignment in linear space
• Idea: divide-and-conquer strategy
• Fix a position i in s and consider what is matched with s[i] in an optimal alignment. There are two possibilities:
1. The symbol t[j] matches s[i], for some j in 1..n (3.6)
2. A space between t[j] and t[j+1] matches s[i], for some j in 0..n (3.7)
• Recursive method:
1. Solve the problem for a fixed i
2. To decide which value of i to use in each recursive call, pick i as close as possible to the middle of the sequence
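The sketch below illustrates the divide-and-conquer idea with the closely related Hirschberg formulation: it splits s at its middle, uses two linear-space score vectors to find where t should be split, and recurses on both halves. This is a swapped-in variant rather than a transcription of cases (3.6) and (3.7), and all function names are hypothetical.

```python
def last_row(s, t, match, mismatch, g):
    """Linear-space scores of s against every prefix of t."""
    a = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old, a[0] = a[0], i * g
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            old, a[j] = a[j], max(a[j] + g, old + p, a[j - 1] + g)
    return a


def align_single(c, t, match, mismatch, g):
    """Best alignment of the single character c against a non-empty t."""
    scores = [match if c == x else mismatch for x in t]
    k = max(range(len(t)), key=scores.__getitem__)
    if scores[k] >= 2 * g:                       # pairing c with t[k] pays off
        return "-" * k + c + "-" * (len(t) - k - 1), t
    return c + "-" * len(t), "-" + t             # c against a space instead


def linear_space_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return an optimal global alignment (top, bottom) of s and t."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        return align_single(s, t, match, mismatch, g)
    mid, n = len(s) // 2, len(t)
    left = last_row(s[:mid], t, match, mismatch, g)
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, g)
    # Best place to cut t so that both halves are aligned optimally.
    split = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    top1, bot1 = linear_space_alignment(s[:mid], t[:split], match, mismatch, g)
    top2, bot2 = linear_space_alignment(s[mid:], t[split:], match, mismatch, g)
    return top1 + top2, bot1 + bot2


top, bottom = linear_space_alignment("AAAC", "AGC")
print(top)
print(bottom)
```

Only score vectors of length n+1 are kept at any time, and choosing the middle of s halves the problem at each level of the recursion.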
saving space
BLAST/Lucene
• Steps: (1) build an inverted index for the database; (2) query the inverted index; (3) extend and verify the hits
• Issues: choosing the value of k; variable-length k-mers
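The three steps can be sketched in Python as follows. This is a toy illustration, not BLAST itself: it indexes exact k-mers and extends a hit only while characters match exactly, whereas real BLAST scores the extension and tolerates mismatches. The function names and the sample sequences are made up for the example.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: map every k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def extend(query, target, q, p, k):
    """Step 3: grow an exact k-mer hit left and right while characters agree."""
    left = 0
    while q - left > 0 and p - left > 0 and query[q - left - 1] == target[p - left - 1]:
        left += 1
    right = k
    while (q + right < len(query) and p + right < len(target)
           and query[q + right] == target[p + right]):
        right += 1
    return q - left, p - left, left + right


def search(query, database, index, k):
    """Step 2 plus step 3: look up each query k-mer, then extend every hit."""
    hits = set()
    for q in range(len(query) - k + 1):
        for seq_id, p in index.get(query[q:q + k], []):
            q0, p0, length = extend(query, database[seq_id], q, p, k)
            hits.add((seq_id, q0, p0, length))
    # Longest extended hits first: (sequence id, query start, target start, length)
    return sorted(hits, key=lambda h: -h[3])


database = ["TTCAGTTATGTCAGAT", "GGGGGGGG"]
index = build_index(database, k=4)
print(search("CCCAGTTATGTCAGGG", database, index, k=4)[0])
```

The choice of k is the trade-off listed under "Issues": a small k produces many spurious seeds, while a large k misses similar regions that contain no exact k-mer match.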
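Homework
• Build a keyword tree for {apple, please, eat, apply} and draw all of its failure links
• Align the two strings aaac and agc, assuming match = +2, mismatch = -1, space = -2; draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that path
• Briefly describe the main idea of BLAST
• Compute the sp and sp' values at every position of the string "abababc"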