Information Retrieval CSE 8337 Spring 2003 Simple Text Processing Material for these slides obtained from: Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book
Text Processing TOC • Simple Text Storage • String Matching • String-to-String Correction (Approximate matching)
Text storage • EBCDIC/ASCII • Array of characters • Linked list of characters • Trees: B-tree, trie • Stuart E. Madnick, “String Processing Techniques,” Communications of the ACM, Vol 10, No 7, July 1967, pp 420-424.
Pattern Matching (Recognition) • Pattern Matching: finds occurrences of a predefined pattern in the data. • Applications include speech recognition, information retrieval, time series analysis.
Similarity Measures • Determine similarity between two objects. • Similarity characteristics: • Alternatively, distance measures measure how unlike or dissimilar objects are.
String Matching Problem • Input: • Pattern – length m • Text string – length n • Find one (next, all) occurrences of the pattern in the text string • Ex: • String: 00110011011110010100100111 • Pattern: 011010
String Matching Algorithms • Brute Force • Knuth-Morris-Pratt • Boyer-Moore • P209 in text
Brute Force String Matching • Brute Force • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/711a.srch.c.html • Space O(m+n) • Time O(mn) • [Figure: pattern 011010 aligned at successive shifts under the text 00110011011110010100100111]
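The brute-force idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the handbook's C code; the function name `brute_force_search` is made up for this sketch. It tries every alignment of the pattern against the text, which is where the O(mn) time bound comes from.

```python
def brute_force_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try every possible alignment of the pattern against the text.
    for start in range(n - m + 1):
        j = 0
        # Compare left to right until a mismatch or a full match.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
    return matches


print(brute_force_search("abababa", "aba"))  # [0, 2, 4]
# The example from the String Matching Problem slide has no occurrence:
print(brute_force_search("00110011011110010100100111", "011010"))  # []
```

Note that overlapping occurrences are all reported, since the search restarts one position to the right after each attempt.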
Creating FSR • Create FSM: • Construct the “correct” spine. • Add a default “failure bus” to state 0. • Add a default “initial bus” to state 1. • For each state, decide its attachments to failure bus, initial bus, or other failure links.
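One common way to represent the failure links described above is a table that records, for each prefix of the pattern, how far the machine can fall back after a mismatch. The sketch below assumes this table form; the name `failure_links` and the 0-based indexing are choices made for this example and may not match the state numbering used on the slide.

```python
def failure_links(pattern):
    """fail[j] is the length of the longest proper prefix of pattern[:j + 1]
    that is also a suffix of it (0 means fall back to the start)."""
    m = len(pattern)
    fail = [0] * m
    k = 0  # length of the prefix currently matched against a suffix
    for j in range(1, m):
        # On a mismatch, follow existing failure links to a shorter prefix.
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


print(failure_links("011010"))  # [0, 0, 0, 1, 2, 1]
```

An entry of 0 plays the role of a link back to the start state; larger entries are the "other failure links" that reuse part of what has already been matched.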
Knuth-Morris-Pratt • Apply FSM to string by processing characters one at a time. • Accepting state is reached when pattern is found. • Space O(m+n) • Time O(m+n) • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/712.srch.c.html
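A minimal Python sketch of the search step follows. It is an illustration under stated assumptions, not the handbook's C implementation: the machine is represented by a failure table rather than an explicit state diagram, and the names `kmp_search`, `fail`, and `state` are invented here. Each text character is read once, and the state only falls back through failure links, which gives the O(m+n) time bound.

```python
def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # Preprocessing: fail[j] = longest proper prefix of pattern[:j + 1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    # Search: process the text one character at a time.
    matches = []
    state = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while state > 0 and ch != pattern[state]:
            state = fail[state - 1]  # follow a failure link
        if ch == pattern[state]:
            state += 1
        if state == m:  # accepting state: pattern found
            matches.append(i - m + 1)
            state = fail[state - 1]  # keep going to find later matches
    return matches


print(kmp_search("0110100110100", "011010"))  # [0, 6]
```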
Boyer-Moore • Scan pattern from right to left • Skip many positions when the mismatched text character does not occur in the pattern • Worst-case time O(mn) • Expected time better than KMP • Handbook of Algorithms and Data Structures http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713.preproc.c.html
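The right-to-left scan and the skipping behaviour can be illustrated with the Horspool simplification of Boyer-Moore, which uses only a bad-character shift table. This is a swapped-in, simpler variant, not the full Boyer-Moore algorithm and not the handbook's code; the name `horspool_search` is an assumption of this sketch.

```python
def horspool_search(text, pattern):
    """Return the start index of every occurrence of pattern in text
    (Boyer-Moore-Horspool: bad-character shifts only)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # For each character in pattern[:-1], the distance from its rightmost
    # occurrence to the end of the pattern. Absent characters shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches = []
    start = 0
    while start <= n - m:
        j = m - 1
        # Compare the pattern against the text from right to left.
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(start)
        # Shift according to the text character under the pattern's last cell.
        start += shift.get(text[start + m - 1], m)
    return matches


print(horspool_search("abababa", "aba"))  # [0, 2, 4]
```

When the text character under the last pattern position does not occur in the pattern at all, the whole pattern jumps m positions, which is the source of the good expected behaviour.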
String-to-String Correction • Measure of similarity between strings • Can be used to determine how to convert from one string to another • Cost to convert one to the other • Transformations • Match: Current characters in both strings are the same • Delete: Delete current character in input string • Insert: Insert current character of the target string into the input string
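The conversion cost can be computed with dynamic programming. The sketch below uses only the three transformations listed on this slide (match, delete, insert) and assumes a unit cost for each delete and insert and zero cost for a match; these costs, and the name `correction_cost`, are assumptions made for illustration. Because no substitution operation is modelled, changing one character costs one delete plus one insert.

```python
def correction_cost(source, target, delete_cost=1, insert_cost=1):
    """Minimum cost to convert source into target using match (free),
    delete, and insert."""
    m, n = len(source), len(target)
    # cost[i][j] = cheapest way to turn source[:i] into target[:j]
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * delete_cost  # delete everything
    for j in range(1, n + 1):
        cost[0][j] = j * insert_cost  # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                cost[i - 1][j] + delete_cost,  # delete source[i - 1]
                cost[i][j - 1] + insert_cost,  # insert target[j - 1]
            )
            if source[i - 1] == target[j - 1]:
                best = min(best, cost[i - 1][j - 1])  # match
            cost[i][j] = best
    return cost[m][n]


print(correction_cost("kitten", "sitting"))  # 5
```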
Approximate String Matching • Find patterns “close to” the string • Fuzzy matching • Applications: • Spelling checkers • IR • Define similarity (distance) between string and pattern
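One way to make "close to" concrete is to report every text position where some substring ending there is within a chosen edit distance of the pattern. The sketch below assumes unit-cost insert, delete, and substitute operations (Levenshtein distance) and a caller-supplied threshold `max_distance`. Both are assumptions of this example, as is the name `approximate_matches`, and the slides may define the distance differently.

```python
def approximate_matches(text, pattern, max_distance):
    """Return (end_index, distance) for each text position where a substring
    ending there is within max_distance edits of pattern."""
    m = len(pattern)
    # column[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        prev_diag = column[0]  # column[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = column[i]
            column[i] = min(
                old + 1,                            # extra character in text
                column[i - 1] + 1,                  # pattern character missing
                prev_diag + (pattern[i - 1] != ch)  # match or substitute
            )
            prev_diag = old
        if column[m] <= max_distance:
            hits.append((pos, column[m]))
    return hits


print(approximate_matches("abcdefg", "cde", 0))  # [(4, 0)]
```

With `max_distance` set to 0 this reduces to exact matching; a spelling checker would typically use a small positive threshold.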