
Approximate Search

Approximate search. Quan Zou (邹权), Ph.D., Assistant Professor, http://datamining.xmu.edu.cn. Outline: global alignment, local alignment, BLAST. Why compare sequences? Sequence comparison is the operation of finding which parts of the sequences are alike and which parts differ; the topic is algorithms that solve it efficiently.

Presentation Transcript


  1. Approximate Search. Quan Zou (邹权), Ph.D., Assistant Professor. http://datamining.xmu.edu.cn

  2. Outline • Global alignment • Local alignment • BLAST

  3. Why compare sequences? Sequence comparison: the operation of finding which parts of the sequences are alike and which parts differ. Goal: algorithms that solve this efficiently.

  4. TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
     ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
     TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

     AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
     |  | |              |  |||||| |   ||||  | || |   |
     AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

     AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
     ||| | ||| || || |||   |   |||||||||  ||   |||||| |
     AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT

  5. Two notions
     • Similarity: a measure of how similar two sequences are.
     • Alignment: the basic operation for comparing two sequences. One sequence is placed above the other so that the correspondence between similar characters or substrings becomes clear.

  6. Alignments involve:
     • Global comparisons: entire sequences
     • Local comparisons: just substrings of the sequences
     • Dynamic programming (DP): the technique used to compare two sequences

  7. Global comparison: example
     • Aligning GACGGATTAG and GATCGGAATAG:

       GA-CGGATTAG
       GATCGGAATAG

     • The differences are an extra T and a change from A to T; the dash marks a space.

  8. Global comparison: the basic algorithm. Definitions
     • Alignment: spaces are inserted into the sequences so that both have the same size, and one is then placed over the other to create a correspondence. A column may not hold a space in both sequences; spaces may be inserted at the beginning or the end.
     • Scoring function: a measure of similarity between elements. A match (identical characters) scores +1, a mismatch (distinct characters) scores -1, and a space scores -2.
     • Scoring system: rewards matches and penalizes mismatches and spaces.

  9. Global comparison: the basic algorithm

       GA-CGGATTAG
       GATCGGAATAG

     • Example: the total score is 6 (nine matches, one mismatch, one space).
     • Similarity sim(s, t): the maximum alignment score over all alignments of s and t; many alignments may attain it.
     • Best alignment: an alignment whose score equals the similarity.
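
The score of this example can be checked mechanically. The following is a minimal sketch in Python, assuming the values from slide 8 (match +1, mismatch -1, space -2) and a dash as the space character; the name `alignment_score` and its parameters are illustrative and do not come from the slides.

```python
def alignment_score(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Add up the column scores of two alignment rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            # Slide 8: a column may not hold a space in both sequences.
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```

Nine matching columns contribute +9, the T/A column contributes -1, and the single space contributes -2.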

  10. Basic DP algorithm for comparing two sequences
      • The number of alignments between two sequences is exponential, so an efficient algorithm is needed.
      • DP: work on prefixes, from shorter to longer.
      • Idea: an (m+1)×(n+1) array whose entry (i, j) is the similarity between s[1..i] and t[1..j].
      • p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j]; these values appear in the upper left corners of the cells in the figure.
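
The array described on this slide can be filled row by row. The sketch below assumes the usual recurrence for this algorithm (each entry is the best of: the entry above plus a space, the diagonal entry plus p(i, j), the entry to the left plus a space) and the scores from slide 8. The name `similarity_table` and the space penalty parameter `g` are illustrative choices, not code from the presentation.

```python
def similarity_table(s, t, match=1, mismatch=-1, g=-2):
    """Return the (m+1) x (n+1) array a, where a[i][j] = sim(s[:i], t[:j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    # Aligning a prefix with the empty sequence costs one space per character.
    for i in range(1, m + 1):
        a[i][0] = i * g
    for j in range(1, n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g,      # s[i] aligned with a space
                          a[i - 1][j - 1] + p,  # s[i] aligned with t[j]
                          a[i][j - 1] + g)      # t[j] aligned with a space
    return a


table = similarity_table("AAAC", "AGC")
for row in table:
    print(row)
print("sim =", table[-1][-1])  # sim = -1
```

The work is proportional to m×n, in contrast with the exponential number of alignments.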

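  12. DP array for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, space -2:

                    A     G     C
              0     1     2     3
        0     0    -2    -4    -6
      A 1    -2     1    -1    -3
      A 2    -4    -1     0    -2
      A 3    -6    -3    -2    -1
      C 4    -8    -5    -4    -1

      In the figure, the upper left corner of each cell holds p(i, j): +1 where the row and column characters are identical, -1 otherwise.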
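
An optimal alignment can be read off such an array by walking back from the bottom right entry to the top left one. The sketch below is one hedged way to do this: it rebuilds the array, then at each step follows any move that explains the current entry. The name `global_alignment` and the tie-breaking order (diagonal first, then up, then left) are assumptions made here, so other optimal alignments with the same score may exist.

```python
def global_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return (score, row_s, row_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * g
    for j in range(1, n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)

    # Traceback: collect columns from right to left, then reverse them.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            p = match if s[i - 1] == t[j - 1] else mismatch
            if a[i][j] == a[i - 1][j - 1] + p:
                row_s.append(s[i - 1])
                row_t.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            row_s.append(s[i - 1])   # s[i] over a space
            row_t.append("-")
            i -= 1
        else:
            row_s.append("-")        # a space over t[j]
            row_t.append(t[j - 1])
            j -= 1
    return a[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


score, top, bottom = global_alignment("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

For AAAC and AGC the path ends at the bottom right entry of the array on slide 12, whose value is -1.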

  13. Local comparison
      • Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.
      • Algorithm: find the highest-scoring local alignment between the two sequences.

  14. Local comparison: idea
      • Data structure: an (m+1)×(n+1) array; entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
      • Initialization: the first row and column are filled with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
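
Because the empty suffixes always give score zero, no entry of this array needs to drop below zero, and the best local alignment ends at the largest entry. The sketch below illustrates that idea under the scores of slide 8; the function name `local_alignment_score`, the returned end position, and the toy input strings are assumptions for illustration.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, g=-2):
    """Return (best score, (i, j)) where the best local alignment ends."""
    m, n = len(s), len(t)
    # First row and column stay zero: the empty suffixes always score 0.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(0,                    # start over with empty suffixes
                          a[i - 1][j] + g,
                          a[i - 1][j - 1] + p,
                          a[i][j - 1] + g)
            if a[i][j] > best:
                best, best_pos = a[i][j], (i, j)
    return best, best_pos


# The shared substring ACGT is found regardless of the unrelated flanks.
print(local_alignment_score("xxxACGTyyy", "zzACGTww"))  # (4, (7, 6))
```

Walking back from the reported position until a zero entry is reached would recover the aligned substrings themselves.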

  18. Global alignment

  19. Local vs. global alignment (cont'd)
      • Global alignment:

        --T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
          |  || |  ||  | | | |||    || |      | ||||   |
        AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C

      • Local alignment, which is better at finding the conserved segment:

                          tccCAGTTATGTCAGgggacacgagcatgcagagac
                             ||||||||||||
        aattgccgccgtcgttttcagCAGTTATGTCAGatc

  20. Local alignment: example. Compute a "mini" global alignment to get the local alignment. (Figure labels: local alignment, global alignment.)

  21. Semiglobal comparison: summary
      • Forgiving initial spaces: initialize certain positions with zero.
      • Forgiving final spaces: look for the maximum along certain positions.
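
The two rules above can be sketched in a few lines. The version below assumes that end spaces are forgiven in both sequences and at both ends, so the whole first row and first column start at zero and the answer is the maximum over the last row and last column. The slide leaves the choice of positions open, so this is only one possible variant; `semiglobal_score` is an illustrative name.

```python
def semiglobal_score(s, t, match=1, mismatch=-1, g=-2):
    """Best alignment score when spaces at either end of s or t are free."""
    m, n = len(s), len(t)
    # Forgiving initial spaces: first row and first column stay at zero.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j] + g, a[i - 1][j - 1] + p, a[i][j - 1] + g)
    # Forgiving final spaces: take the maximum along the last row and column.
    return max(max(a[m]), max(row[n] for row in a))


print(semiglobal_score("TTACGTTT", "ACGT"))  # 4
```

Forgiving end spaces in only one of the sequences would zero only the corresponding row or column and search only the corresponding last row or column.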

  23. Saving space: computing sim(s, t)
      Algorithm BestScore
        input: sequences s and t
        output: vector a
        m ← |s|
        n ← |t|
        for j ← 0 to n do
          a[j] ← j × g
        for i ← 1 to m do
          old ← a[0]
          a[0] ← i × g
          for j ← 1 to n do
            temp ← a[j]
            a[j] ← max(a[j] + g, old + p(i, j), a[j-1] + g)
            old ← temp
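
BestScore translates almost line for line into Python. The sketch below keeps the slide's variable names (`a`, `old`, `temp`, `g`); the function name `best_score`, the 0-based string indexing, and the default scores from slide 8 are assumptions added here.

```python
def best_score(s, t, match=1, mismatch=-1, g=-2):
    """Return the last row of the DP array using a single vector of length n+1."""
    m, n = len(s), len(t)
    a = [j * g for j in range(n + 1)]
    for i in range(1, m + 1):
        old = a[0]            # value of the diagonal neighbour a[i-1][j-1]
        a[0] = i * g
        for j in range(1, n + 1):
            temp = a[j]       # a[i-1][j], needed as the next diagonal
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp
    return a


row = best_score("AAAC", "AGC")
print(row)      # [-8, -5, -4, -1]
print(row[-1])  # sim(s, t) = -1
```

Only one vector of n+1 entries is kept, so the similarity is obtained in linear space, but the alignment itself cannot be traced back from it.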

  24. An optimal alignment in linear space
      • Idea: divide and conquer.
      • Fix a position i in s and consider what s[i] is matched with in the alignment. There are two possibilities:
        1. The symbol t[j] matches s[i], for some j in 1..n (3.6).
        2. A space between t[j] and t[j+1] matches s[i], for some j in 0..n (3.7); j = 0 places the space before t[1].
      • Recursive method:
        1. For the fixed i, build the optimal alignment from optimal alignments of the parts to the left and to the right of s[i].
        2. In each recursive call, pick i as close as possible to the middle of the sequence.
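
The divide and conquer idea can be illustrated with the following sketch. It uses a common variant of the scheme rather than equations (3.6) and (3.7) literally: s is cut in the middle, a linear-space score vector is computed for the left half and for the reversed right half, and t is cut at the position where the two scores add up to the maximum. The names `last_row` and `linear_space_alignment` are assumptions, and the base case assumes match ≥ mismatch.

```python
def last_row(s, t, match=1, mismatch=-1, g=-2):
    """Scores of s against every prefix of t, kept in one vector."""
    a = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        old = a[0]
        a[0] = i * g
        for j in range(1, len(t) + 1):
            temp = a[j]
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[j] = max(a[j] + g, old + p, a[j - 1] + g)
            old = temp
    return a


def linear_space_alignment(s, t, match=1, mismatch=-1, g=-2):
    """Return (row_s, row_t) for one optimal global alignment."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Either pair s with its best partner in t, or pair it with a space.
        j = max(range(len(t)), key=lambda k: match if t[k] == s else mismatch)
        p = match if t[j] == s else mismatch
        if p >= 2 * g:
            return "-" * j + s + "-" * (len(t) - j - 1), t
        return s + "-" * len(t), "-" + t
    mid = len(s) // 2
    n = len(t)
    left = last_row(s[:mid], t, match, mismatch, g)
    right = last_row(s[mid:][::-1], t[::-1], match, mismatch, g)
    # Cut t where the left and right halves together score best.
    j = max(range(n + 1), key=lambda k: left[k] + right[n - k])
    ls, lt = linear_space_alignment(s[:mid], t[:j], match, mismatch, g)
    rs, rt = linear_space_alignment(s[mid:], t[j:], match, mismatch, g)
    return ls + rs, lt + rt


top, bottom = linear_space_alignment("AAAC", "AGC")
print(top)
print(bottom)
```

Each level of the recursion only stores score vectors, never the full array, at the price of roughly doubling the running time.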

  25. Saving space

  26. BLAST/Lucene
      • Steps
        • Build an inverted index for the database
        • Query the inverted index
        • Extend and verify the hits
      • Open issues
        • Choosing the value of k
        • Variable-length k-mers
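
The three steps can be sketched on toy data as follows. This is a deliberately simplified illustration, not BLAST itself: seeds are exact k-mer matches, and the extension step only grows a seed while characters are identical, with no scoring matrix or drop-off threshold. The names `build_index`, `extend_seed`, `search`, the `min_len` filter, and the example sequences are all assumptions.

```python
from collections import defaultdict


def build_index(database, k):
    """Step 1: inverted index mapping each k-mer to (sequence id, offset) pairs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def extend_seed(query, q_pos, seq, s_pos, k):
    """Step 3: grow an exact k-mer hit in both directions while characters agree."""
    left = 0
    while (q_pos - left > 0 and s_pos - left > 0
           and query[q_pos - left - 1] == seq[s_pos - left - 1]):
        left += 1
    right = k
    while (q_pos + right < len(query) and s_pos + right < len(seq)
           and query[q_pos + right] == seq[s_pos + right]):
        right += 1
    return q_pos - left, s_pos - left, left + right


def search(query, database, k=4, min_len=8):
    """Return sorted (sequence id, query start, subject start, length) hits."""
    index = build_index(database, k)
    hits = set()
    for q_pos in range(len(query) - k + 1):
        # Step 2: look up every k-mer of the query in the inverted index.
        for seq_id, s_pos in index.get(query[q_pos:q_pos + k], []):
            q_start, s_start, length = extend_seed(query, q_pos, database[seq_id], s_pos, k)
            if length >= min_len:
                hits.add((seq_id, q_start, s_start, length))
    return sorted(hits)


database = ["TTGACAGGTACC", "CCCAGTTATGTCAGATC"]
print(search("AGTTATGTCAGGG", database))  # [(1, 0, 3, 11)]
```

The choice of k shows up directly here: a smaller k yields more seeds (such as the short CAGG hit in the first sequence, which the length filter discards), while a larger k misses matches shorter than k.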

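  27. Homework
      • Build a keyword tree for {apple, please, eat, apply} and draw all of its failure links.
      • Align the two strings aaac and agc, assuming a match scores 2, a mismatch -1, and a space -2. Draw the dynamic programming table and the traceback path, and give the alignment that corresponds to that path.
      • Briefly describe the main idea of BLAST.
      • Compute the sp and sp′ values at every position of the string "abababc".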
