1 / 17

Text Database

Text Database. Text Fuzzy Match Keyword/Key Phrase Extraction Fuzzy Match Applications: Music Retrieval Term Association Analysis Text Categorization Text Summarization, Clustering, and Its Applications. Yuen-Hsien Tseng / 曾元顯 Digital Media Center National Taiwan Normal University.

jarvis
Download Presentation

Text Database

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Text Database • Text Fuzzy Match • Keyword/Key Phrase Extraction • Fuzzy Match Applications: Music Retrieval • Term Association Analysis • Text Categorization • Text Summarization, Clustering, and Its Applications Yuen-Hsien Tseng / 曾元顯 Digital Media Center National Taiwan Normal University

  2. Share Values of This Course • Information access service is highly profitable • Text database techniques are its core technologies By jiyanjiang on 2006-01-03

  3. Text Fuzzy Match • Applications • Text search, text highlight • duplicate detection, record linkage • Levels of Text Fuzzy Match • Phonic: homonyms, transliteration match • Symbolic: most common type of text match • Semantic: knowledge bases are often needed

  4. Text Fuzzy Match – Phonic (En) • For English, use Soundex code • a hashing system for English words • used by the United States Census Bureau • by Robert C. Russell (U.S. patent 1,261,167 on April 2, 1918) • Soundex Algorithm • begin with the first letter of a word followed by a three-digit (1-6) code that represents the (first three) remaining consonants • If the word is too short to generate 3 numbers, 0 is added as needed. • If the generated code is longer than 3 numbers, the extra are thrown away.

  5. Text Fuzzy Match – Phonic (En) • Examples: • perl -e "use Text::Soundex; print soundex('anything')" • Ans: A535 • References: • http://www.highprogrammer.com/alan/numbers/soundex.html • http://searches.rootsweb.com/cgi-bin/Genea/soundex.sh • Applications: • Yuen-Hsien Tseng, "Automatic Cataloguing and Searching for Retrospective Data by Use of OCR Text", Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI, SCI, EI)

  6. Text Fuzzy Match – Phonic (Ch) • For Chinese, • convert each character into its phonograms or pinyins • match their phonograms or pinyins instead of their characters • Applications • Transliteration match:克麗斯汀 vs 克莉思汀 • Traditional/simplified character match • Example: • D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸頭姿" 1 • Ans: <font color=red>國內大型電腦集團赴大陸投資</font>升溫 • D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 • Ans: 國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫

  7. Text Match (Naïve) • Simple brute force (naive) algorithm: O(nm) in its worse case • Text: T[n], where n is the text length • Pattern: P[m], where m is the pattern length • Naïve Algorithm: for (i=0; T[i] != '\0'; i++) { for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; if (P[j] == '\0') found a match } • Example: looking for pattern "nano" in text "banananobano" • Each row represents an iteration of the outer loop, with each character in the row representing the result of a comparison (X if the comparison was unequal) 0 1 2 3 4 5 6 7 8 9 10 11 T: b a n a n a n o b a n o i=0: X i=1: X i=2: n a n X i=3: X i=4: n a n o i=5: X i=6: n X i=7: X i=8: X i=9: n X i=10: X

  8. Text Match (Knuth-Morris-Pratt) • Knuth-Morris-Pratt Algorithm • Skipping outer iterations in the above naïve algorithm • If you've found a partial match of j characters starting at position i, you know what's in positions T[i]...T[i+j-1]. • i=2: n a n • i=3: n a n o <= no need to do this comparison • i=4: n a n o <= just skip to this step • Skipping inner iterations • i=2: n a n • i=4: n a n o <= n has been compared at i=2, skip this n further • Complexity: O(m+n) guaranteed • Reference: • http://www.ics.uci.edu/~eppstein/161/960227.html

  9. Fuzzy Text Match • Become popular after Prof. Sun Wu (吳昇) published his papers since 1992 • complexity O(kn), k: number of allowed errors • But his algorithm still is not good enough for music retrieval (as noted by Ian Witten in their papers about music retrieval) • Reference: • Sun Wu and Udi Manber, “Agrep—A Fast Approximate Pattern-Matching Tool” Usenix Winter 1992 - 被引用 140 次 • Sun Wu and Udi Manber, ‘‘Fast text searching: allowing errors,’’ Communications of the ACM, Vol. 35 , Issue 10 (Oct. 1992) pp: 83 - 91 – cited 71 times in ACM portal • Baeza-Yates, R.A. and Gonnet, G.H. A new approach to text searching. Commun. ACM 35, 10 (1992). Preliminary version appeared in Proceedings of the 12th Annual ACM-SIGIR Conference on Information Retrieval, Cambridge, Mass. (June 1989), pp. 168-175.

  10. Dynamic Programming for Fuzzy Match • DP algorithm: 比較兩字串,允許字元的插入、刪除、代換 • 將兩字串在二維矩陣中展開(如下頁範例),求累計距離 • d[i, j] = min( d[i-1, j] + w(A[i], 0), d[i-1, j-1] + w(A[i], B[j]), d[i, j-1] + w(0, B[j]) ) • d[i, j] : accumulated distance between sequence A and B • w(A[i], B[j]) : cost of substituting A[i]with B[j] • w(A[i], 0) : cost of inserting A[i] • w(0, B[j]) : cost of deleting B[j] • d[0, 0] = 0 • d[i, 0] = d[i-1, 0] + w(A[i], 0) , 1<= i <= m • d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n • Complexity: O(n * m), m : length of A, n : length of B (i-1,j-1) (i-1, j) (i, j-1) ( i, j )

  11. Basic Dynamic Programming: Examples • 假設文件B=adecdecf,查詢詞A=adec,則 d[4,4]為0,最小,從d[4,4]倒推回d[3,3],再倒推回d[2,2],再倒推回d[1,1],一路都有最小值0,可知match 的位置在最開頭 adec • 但假若match的字串adec在B的中間,就沒有預期的最小值0 • 很多人都不知道如何改進,因此認為DP只能從頭開始比起,尤其是一些音樂檢索系統,只能允許旋律從頭比起,限制很大 • 改進辦法,在第一列插入全部都是0的列,如下頁

  12. Improved Dynamic Programming • 在第一列插入全部都是0的列,其效果猶如允許從任何位置開始比較,亦即,初始值 • 從 d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n • 改成 d[0, j] = 0 , 1<= j <= n • 最後一列的值,範圍為0到m,定義一函數: 其中 min 是最後一列的最小值 • 函數d(m)的範圍在0到1之間,其與m的關係如下圖,因此,取門檻值如下: if ($min == $lenq) { $dm = 0; } else { $dm = 1/exp($min/($lenq-$min)); } if (($lenq < 5 && $dm < 0.7) or ($dm < 0.54)) { … } #The No Match threshold Reference: Daniel Lopresti and Jiangying Zhou, “Retrieval Strategies for Noisy Text,” Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, April 15-17, 1996, pp. 255-269.

  13. 應用範例(1/3) • 目前我實做的 fuzzy highlight 規格: • 中文、中文同音、英文、英文同音、中英文、中英文同音、多詞 highlight • 底下的 Perl 範例程式,第一個參數是待標示的文件(字串),第二個參數是查詢詞(可多個,用「,」或「,」分開即可),第三個參數如果不為 0 ,則表示要做同音比對。 D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商赴大陸投資" Result:<font color=red>國內大型電腦集團赴大陸投資</font>升溫 D:\>perl -s fuzzymatch5.pl “國際商業股份公司” “傷葉古份” 1 Result:國際<font color=red>商業股份</font>公司 D:\>perl -s fuzzymatch5.pl “這是CD-R光碟片” “cdr 胱跌騙” 1 Result:這是<font color=red>CD-R光碟片</font> D:\>perl -s fuzzymatch5.pl "My name is 'sam Tseng'" "sam Tseng" Result:My name is '<font color=red>sam Tseng</font>'

  14. 應用範例(2/3) • 中文同音、多詞 highlight: D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,生瘟" 1 Result:國內大型電腦集團<font color=red>赴大陸投資</font><font color=red>升溫</font> D:\>perl -s fuzzymatch5.pl "國內大型電腦集團赴大陸投資升溫" "國內電腦廠商,赴大陸投資,頭姿" 1 Result:國內大型電腦集團<font color=red>赴大陸<font color=red>投資</font></font>升溫 • 中英文模糊 highlight: D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization集團" Result:<font color=red>國內computer集團</font>赴大陸投資升溫

  15. 應用範例(3/3) • 英文的同音 high light: D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" Result:國內computer集團赴大陸投資升溫 • 錯太多,無法 high light,要改用同音比對 D:\>perl -s fuzzymatch5.pl "國內computer集團赴大陸投資升溫" "國內camputorization" 1 Result:<font color=red>國內computer</font>集團赴大陸投資升溫

  16. Conclusions (1) • We have shown a fuzzy match tool that is capable of • matching strings with some edit distance • regardless of spaces and capitalization • insensitive to plurals and tense • allowing homonym match • converting the edit distance into a similarity score • highlighting similar query string in the text • Google segments the query string into terms and highlight terms in the text • Applications: • Music Retrieval • Yuen-Hsien Tseng, "Content-Based Retrieval for Music Collections," ACM SIGIR '99, Aug. 15-19, Berkeley, U.S.A., 1999, pp.176-182. • Duplicate detection • Yuen-Hsien Tseng and William John Teahan, "Verifying a Chinese Collection for Text Categorization," ACM SIGIR '04, July 25 - 29, U.K., 2004, pp.556-557. • 曾元顯, "分類不一致對文件自動分類效果的影響", 「大學圖書館」, 第 9 卷第 1 期, 2005 年 3 月, 頁2-19.

  17. Conclusions (2) • However, its complexity is • O(mn) in time and O(n) in space • limited to applications of only small text sizes • Do we have better solutions? • Yes, index the texts to be searched first and then match them with the query string based on some retrieval models • In this way, we also are able to deal with semantic text match to some degree • In short, we need information retrieval (IR) techniques

More Related