Introduction to Digital Libraries: Information Retrieval
Sample Statistics of Text Collections • Dialog: claims to have >12 terabytes of data in >600 Databases, > 800 million unique records • LEXIS/NEXIS: claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, 11,400 databases; >200,000 searches per day; 9 mainframes, 300 Unix servers, 200 NT servers
Information Retrieval • Motivation • the larger the holdings of the archive, the more useful it is • however, the larger the collection, the harder it is to find what you want
Simple IR Model (figure): user, query, results; pre-processing and post-processing (ranking, clustering, weighting, stemming, thesaurus, signature, feedback); searching (Boolean, vector); storage (flat files, inverted files, signature files, PAT trees); collection and processing (stemming, stoplist).
IR problem • In libraries, a record has external attributes and internal attributes (content), e.g.: ISBN: 0-201-12227-8; Author: Salton, Gerard; Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer; Publisher: Addison-Wesley; Date: 1989; Content: <Text> • Search by external attributes = search in a DB • IR: search by content
Basic concepts • Document is described by a set of representative keywords (index terms) • Keywords may have binary weights or weights calculated from statistics of their frequency in text • Retrieval is a ‘matching’ process between document keywords and words in queries
IR Outline • Index Storage • flat files, inverted files, signature files, PAT trees • Processing • Stemming, stop-words • Searching & Queries • Boolean, vector (including ranking, weighting, feedback) • Results • clustering
Flat Files Index • Simple files, no additional processing or storage needed • Worst case keyword search time: O(DW) • D = # of documents • W = # words per document • linear search • Clearly only acceptable for small collections
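A minimal sketch of this linear scan (Python; the collection directory name is hypothetical):

```python
import os

def flat_file_search(directory, keyword):
    """Linear scan: read every document and every word -- O(D * W)."""
    matches = []
    for name in os.listdir(directory):            # D documents
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            words = f.read().lower().split()      # W words per document
            if keyword.lower() in words:
                matches.append(name)
    return matches

# Example (hypothetical collection directory):
# print(flat_file_search("collection/", "retrieval"))
```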
Inverted Files • All input files are read, and a list of which words appear in what documents (records) is made • Extra space required can be up to 100% of original input files • Worst case keyword search time is now O(log(DW)) • Almost all indexing systems in popular usage use inverted files
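A minimal sketch of building such an index and intersecting posting sets for a Boolean AND query (the sample documents are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}.  Returns {word: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "information retrieval in digital libraries",
        2: "database systems and XML retrieval"}
index = build_inverted_index(docs)

# Boolean AND query: intersect the posting sets
print(index["retrieval"] & index["database"])   # {2}
```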
Structure of inverted index • May be a hierarchical set of addresses, e.g. word number within sentence number within paragraph number within chapter number within volume number within document number • Consider as a vector (d,v,c,p,s,w)
Inverted File Index • Store the appearance of terms in documents as (document-ID, position in the document) pairs, like the index of a book:
alphabet: (15,42); (26,186); (31,86)
database: (41,10)
index: (15,76); (51,164); (76,641); (81,64)
information: (16,76)
retrieval: (16,88)
semistructured: (5,61); (15,174); (25,41)
XML: (1,108); (2,65); (15,741); (21,421)
XPath: (5,90); (21,301)
• Answers queries like "xml and index" or "information near retrieval" • But: not suitable for evaluating path expressions
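A minimal sketch of a positional index of this kind, with a simple proximity ("near") query; the sample documents and the window size k are illustrative, not from the slides:

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: {doc_id: text}.  Returns {term: [(doc_id, position), ...]}."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

def near(index, t1, t2, k=3):
    """Documents where t1 and t2 occur within k word positions of each other."""
    hits = set()
    for d1, p1 in index.get(t1, []):
        for d2, p2 in index.get(t2, []):
            if d1 == d2 and abs(p1 - p2) <= k:
                hits.add(d1)
    return hits

docs = {16: "an overview of information retrieval",
        15: "indexing the alphabet with an index"}
idx = build_positional_index(docs)
print(near(idx, "information", "retrieval"))   # {16}
```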
An Inverted File • Search for • “databases” • “microsoft”
Other indexing structures • Signature files • Each document has an associated signature, generated by hashing each term it contains • Leads to possible matches; further processing is needed to resolve them • Bitmaps • One-to-one hash function; each distinct term in the collection has a bit vector with one bit per document • Special case of a signature file; storage-expensive
Signature Files Signature size. Number of bits in a signature, F. Word signature. A bit pattern of size F with exactly m bits set to 1 and the others 0. Block. A sequence of text that contains D distinct words. Block signature. The logical or of all the word signatures in a block of text.
Signature File • Each document is divided into “logical blocks” -- pieces of text that contain a constant number D of distinct, non-common words • Each word yields a “word signature” which is a bit pattern of size F, with m bits set to 1 and the rest to 0 • F and m are design parameters
Sample Signature File (figure): D=2, F=12, m=4
data 0000 0000 0000 0010 0000 base 0000 0001 0000 0000 0000 management 0000 1000 0000 0000 0000 system 0000 0000 0000 0000 1000 ---------------------------------------- block signature 0000 1001 0000 0010 1000 Figure, D=4, F=20, m=1
Signature File • Searching • Examine each block signature: every bit position that is 1 in the search word's signature must also be 1 in the block signature • False Drop • the probability that the signature test "fails", creating a "false hit" or "false drop" • A word signature may match the block signature even though the word is not in the block; this is a false hit and must be resolved by checking the actual text
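A minimal sketch of the whole scheme: word signatures, the block signature as their logical OR, and the containment test. The hash-based choice of bit positions is an assumption for illustration; real systems tune F and m carefully:

```python
import hashlib

F, M = 20, 2   # design parameters: signature width and bits set per word

def word_signature(word, F=F, m=M):
    """Set m pseudo-random bit positions derived from hashes of the word."""
    sig = 0
    for i in range(m):
        h = hashlib.md5(f"{word}:{i}".encode()).hexdigest()
        sig |= 1 << (int(h, 16) % F)
    return sig

def block_signature(words):
    """Logical OR of all word signatures in the block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def maybe_contains(block_sig, word):
    """Signature test: every 1-bit of the word must also be 1 in the block.
    A positive answer can be a false drop; verify against the actual text."""
    ws = word_signature(word)
    return block_sig & ws == ws

block = ["data", "base", "management", "system"]
bsig = block_signature(block)
for query in ["data", "retrieval"]:
    if maybe_contains(bsig, query):
        print(query, "-> candidate block (verify to rule out a false drop)")
    else:
        print(query, "-> definitely not in the block")
```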
Sistrings • Original text: "The traditional approach for searching a regular expression…" • Sistrings (one starting at each position): • sistring 1: "The traditional approach for searching …" • sistring 2: "he traditional approach for searching a…" • sistring 3: "e traditional approach for searching a …" • a later sistring: "onal approach for searching a regular …"
Sistrings • Once upon a time, in a far away land ... • sistring1: Once upon a time ... • sistring2: nce upon a time ... • sistring8: on a time, in a ... • sistring11: a time, in a far ... • sistring22: a far away land ...
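A minimal sketch that generates the sistrings of the example text (1-based positions, as above):

```python
def sistrings(text):
    """One semi-infinite string starting at every character position (1-based)."""
    return {i + 1: text[i:] for i in range(len(text))}

s = sistrings("Once upon a time, in a far away land")
print(s[1][:20])    # 'Once upon a time, in'
print(s[8][:20])    # 'on a time, in a far '
print(s[11][:20])   # 'a time, in a far awa'
```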
PAT Trees • PAT Tree: • a Patricia tree constructed over all the possible sistrings of a document • bits of the key decide branching • 0 branches to the left subtree • 1 branches to the right subtree • each internal node records which bit of the key to test • at the leaf node, check any skipped bits against the full key • The PAT (suffix) tree of a string S is a compacted trie that represents all sistrings (semi-infinite strings), and hence all substrings, of S.
PATRICIA TREE • A particular type of "trie" • Example: a trie and a Patricia tree containing '010', '011', and '101' (figure).
PAT Tree query example (figure): query 00101 against the text 01100100010111…, with sistrings 1-8 already indexed; each internal node records the bit position to check, and the search ends at the leaf for sistring 4.
Try to build the Patricia tree • A 00001 • S 10011 • E 00101 • R 10010 • C 00011 • H 01000 • I 01001 • N 01110 • G 00111 • X 11000 • M 01101 • P 10000
PAT Tree (figure): the resulting Patricia tree with leaves A, S, E, C, X, R, H, I, G, N, P, M.
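A minimal Patricia-trie sketch over the fixed-length 5-bit codes from the slide above (not a full PAT tree over sistrings, but it shows the bit-branching and the final key check at the leaf):

```python
class Leaf:
    def __init__(self, key):
        self.key = key                      # full bit string, e.g. "00101"

class Node:
    def __init__(self, bit, left, right):
        self.bit = bit                      # index of the bit tested at this node
        self.left, self.right = left, right # 0 -> left subtree, 1 -> right subtree

def search(root, key):
    """Follow the tested bits down to a leaf, then verify the skipped bits."""
    node = root
    while isinstance(node, Node):
        node = node.left if key[node.bit] == "0" else node.right
    return node if node is not None and node.key == key else None

def insert(root, key):
    if root is None:
        return Leaf(key)
    # find the leaf a search for `key` would reach
    node = root
    while isinstance(node, Node):
        node = node.left if key[node.bit] == "0" else node.right
    if node.key == key:
        return root                         # already present
    # first bit position where the two keys differ
    diff = next(i for i in range(len(key)) if key[i] != node.key[i])
    # re-descend and splice in a new internal node testing bit `diff`
    def splice(t):
        if isinstance(t, Leaf) or t.bit > diff:
            leaf = Leaf(key)
            return Node(diff, leaf, t) if key[diff] == "0" else Node(diff, t, leaf)
        if key[t.bit] == "0":
            t.left = splice(t.left)
        else:
            t.right = splice(t.right)
        return t
    return splice(root)

keys = {"A": "00001", "S": "10011", "E": "00101", "R": "10010",
        "C": "00011", "H": "01000", "I": "01001", "N": "01110"}
root = None
for code in keys.values():
    root = insert(root, code)
print(search(root, "00101") is not None)    # True  (E)
print(search(root, "11111") is not None)    # False
```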
Example (figure) • Text: 01100100010111 … • sistring 1: 01100100010111 … • sistring 2: 1100100010111 … • sistring 3: 100100010111 … • sistring 4: 00100010111 … • sistring 5: 0100010111 … • sistring 6: 100010111 … • sistring 7: 00010111 … • sistring 8: 0010111 … • In the tree, each external node stores a sistring as an integer displacement into the text; each internal node stores a skip counter (the total displacement of the bit to be inspected) and pointers to its subtrees.
SISTRING • The bit level is too abstract and application-dependent; we rarely work at the bit level. The character level is a better idea! • e.g. CUHK • The corresponding sistrings would be • CUHK000… • UHK000… • HK000… • K000… • We require each sistring to be at least 4 characters long. • (Why do we pad 0/NULL at the end of each sistring?)
SISTRING (USAGE) • We may store the sistrings of 'CUHK' instead of all its substrings; stored explicitly this still requires O(n²) space • CUHK ← represents C, CU, CUH, CUHK at the same time • UHK0 ← represents U, UH, UHK at the same time • HK00 ← represents H, HK at the same time • K000 ← represents K only • Prefix matching on sistrings is equivalent to exact matching on substrings • Conclusion: sistrings are a better representation for storing substring information.
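A minimal sketch of padded sistrings and of the claim that prefix matching on sistrings equals exact substring matching (the padding character and minimum length follow the CUHK example):

```python
def padded_sistrings(text, min_len=4, pad="\0"):
    """One sistring per position, padded so every sistring is at least min_len long."""
    out = []
    for i in range(len(text)):
        s = text[i:]
        out.append(s + pad * max(0, min_len - len(s)))
    return out

sis = padded_sistrings("CUHK")        # ['CUHK', 'UHK\0', 'HK\0\0', 'K\0\0\0']

def contains_substring(sistrings, pattern):
    """Exact substring matching == prefix matching on some sistring."""
    return any(s.startswith(pattern) for s in sistrings)

print(contains_substring(sis, "UH"))   # True  (prefix of 'UHK\0')
print(contains_substring(sis, "KC"))   # False
```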
PAT Tree (Example) • By digitizing the string, we can manually visualize what the PAT tree looks like. • Following is the actual bit pattern of the four sistrings
PAT Tree (Example) • This works! BUT… • We still need O(n²) memory for storing those sistrings explicitly • We can reduce the memory to O(n) by making use of pointers (positions into the text)
Space/Time Tradeoffs (figure): PAT trees, inverted files, signature files, and flat files, ordered from most space and fastest search to least space and slowest search.
Stemming • Reason: • Different word forms may bear similar meaning (e.g. search, searching): create a "standard" representation for them • Stemming: • Removing word endings, e.g. computer, compute, computes, computing, computed, computation → comput
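A toy suffix-stripping stemmer, for illustration only (real stemmers such as Porter's use many more rules and conditions); it maps all the forms above to "comput":

```python
# Longest-first list of suffixes to strip; purely illustrative rule set.
SUFFIXES = ["ation", "ing", "es", "ed", "er", "e", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: stem(w) for w in words})
# every form maps to the same stem: 'comput'
```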
Stemming • am, are, is → be • car, cars, car's, cars' → car • "the boy's cars are different colors" → "the boy car be differ color"
Stemming • Manual or Automatic • Can reduce index files up to 50% • Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall
Stopwords • Stopwords exist in stoplists or negative dictionaries • Idea: remove words of low semantic content • the index should only hold the "important stuff" • What not to index is domain-dependent, but often includes: • "small" words: a, and, the, but, of, an, very, etc. • punctuation • (case is also removed)
Stop words • Very common words that have no discriminatory power • e.g. in Arabic: في ("in"), من ("from"), إلى ("to"), …
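A minimal sketch of stoplist filtering (the stoplist itself is a small illustrative sample, not a real negative dictionary):

```python
STOPWORDS = {"a", "an", "and", "the", "but", "of", "very", "in", "to", "from"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the (domain-dependent) stoplist."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("the boy and the car of a very rich man".split()))
# ['boy', 'car', 'rich', 'man']
```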
Normalization • Token normalization • Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens • U.S.A vs USA • Anti-discriminatory vs antidiscriminatory • Car vs automobile?
Capitalization/case folding • Good for • Allowing instances of Automobile at the beginning of a sentence to match a query for automobile • Helping a search engine when most users type ferrari although they are interested in a Ferrari car • Bad for • Proper names vs common nouns • General Motors, Associated Press, Black • Heuristic solution: lowercase only words at the beginning of a sentence; alternatively, infer the true case with machine learning (truecasing)
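A minimal sketch of the sentence-initial heuristic (the sentence splitter is deliberately crude); note in the output how it still folds a proper name that happens to start a sentence:

```python
import re

def fold_sentence_initial(text):
    """Lowercase only the first word of each sentence; keep mid-sentence
    capitals (likely proper nouns such as 'General Motors') intact."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    folded = []
    for s in sentences:
        if s:
            first, _, rest = s.partition(" ")
            folded.append(first.lower() + (" " + rest if rest else ""))
    return " ".join(folded)

print(fold_sentence_initial("Automobiles are fast. General Motors builds them."))
# 'automobiles are fast. general Motors builds them.'
```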
Performance of search • 3 major classes of performance measures • precision / recall • TREC conference series, http://trec.nist.gov/ • space / time • see Esler & Nelson, JNCA, for an example • http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf • usability • probably the most important measure, but largely ignored
Precision and Recall • Precision = (no. of relevant documents retrieved) / (total no. of documents retrieved) • Recall = (no. of relevant documents retrieved) / (total no. of relevant documents in the database)
Standard Evaluation Measures • Start with a contingency table:
                    retrieved     not retrieved
  relevant              w               x          n1 = w + x
  not relevant          y               z
                   n2 = w + y                      N (total documents)
Precision and Recall • Recall = w / (w + x): of all the relevant documents out there, how many did the IR system retrieve? • Precision = w / (w + y): of all the documents retrieved by the IR system, how many are relevant?
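A minimal sketch computing precision and recall from sets of retrieved and relevant document ids (w = |retrieved ∩ relevant|, as in the table above):

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of document ids.
    w = |retrieved & relevant|, w + y = |retrieved|, w + x = |relevant|."""
    w = len(retrieved & relevant)
    precision = w / len(retrieved) if retrieved else 0.0
    recall = w / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}
print(precision_recall(retrieved, relevant))   # (0.4, 0.5)
```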
User-Centered IR Evaluation • More user-oriented measures • Satisfaction, informativeness • Other types of measures • Time, cost-benefit, error rate, task analysis • Evaluation of user characteristics • Evaluation of interface • Evaluation of process or interaction