CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval
Course Administration Web site: http://www.cs.cornell.edu/courses/cs430/2004fa Notices: See the course web site Sign-up sheet: If you did not sign up at the first class, please sign up now.
Course Administration Please send all questions about the course to: cs430-l@cs.cornell.edu The message will be sent to William Arms and all Teaching Assistants.
Course Administration Discussion class, Wednesday, September 1 Upson B17, 7:30 to 8:30 p.m. Prepare for the class as instructed on the course Web site. Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.
Discussion Classes Format: Questions. Ask a member of the class to answer. Provide opportunity for others to comment. When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear. Suggestions: Do not be shy at presenting partial answers. Differing viewpoints are welcome.
Information Retrieval from Collections of Textual Documents Major Categories of Methods Exact matching (Boolean) Ranking by similarity to query (vector space model) Ranking of matches by importance of documents (PageRank) Combination methods Course begins with Boolean, then similarity methods, then importance methods.
Text Based Information Retrieval Most matching methods are based on Boolean operators. Most ranking methods are based on the vector space model. Web search methods combine the vector space model with ranking based on importance of documents. Many practical systems combine features of several approaches. In the basic form, all approaches treat words as separate tokens with minimal attempt to interpret them linguistically.
Documents A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens or terms. A textual document can be: • Free text, also known as unstructured text, which is a continuous sequence of tokens. • Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup. [Methods of markup, e.g., XML, are covered in CS 431.]
Word Frequency Observation: Some words are more common than others. Statistics: Most large collections of text documents have similar statistical characteristics. These statistics: • influence the effectiveness and efficiency of data structures used to index documents • underpin many retrieval models
Word Frequency Example The following example is taken from: Jamie Callan, Characteristics of Text, 1997 Sample of 19 million words The next slide shows the 50 commonest words in rank order (r), with their frequency (f).
word     f          word     f        word     f
the      1130021    from     96900    or       54958
of       547311     he       94585    about    53713
to       516635     million  93515    market   52110
a        464736     year     90104    they     51359
in       390819     its      86774    this     50933
and      387703     be       85588    would    50828
that     204351     was      83398    you      49281
for      199340     company  83070    which    48273
is       152483     an       76974    bank     47940
said     148302     has      74405    stock    47401
it       134323     are      74097    trade    47310
on       121173     have     73132    his      47116
by       118863     but      71887    more     46244
as       109135     will     71494    who      42142
at       101779     say      66807    one      41635
mr       101679     new      64456    their    40910
with     101210     share    63925
Rank Frequency Distribution For all the words in a collection of documents, for each word w: f is the frequency with which w appears; r is the rank of w in order of frequency. (The most commonly occurring word has rank 1, etc.) [Figure: frequency f plotted against rank r; a word w has rank r and frequency f.]
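The rank frequency distribution can be computed directly from a token stream. The following is a minimal sketch, not the method used for Callan's data: the tokenizer (lowercase runs of letters), the alphabetical tie-breaking rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically so that
    # the ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


# Hypothetical sample text, far too small to show realistic statistics.
sample = "the cat sat on the mat and the dog sat on the cat"
table = rank_frequency(sample)
for rank, word, freq in table:
    print(rank, word, freq)
```

Words with equal frequency receive consecutive ranks here; other conventions (for example, shared ranks) are equally plausible.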
Rank Frequency Example The next slide shows the words in Callan's data normalized. In this example: r is the rank of word w in the sample. f is the frequency of word w in the sample. n is the total number of word occurrences in the sample.
word     r*f*1000/n    word     r*f*1000/n    word     r*f*1000/n
the      59            from     92            or       101
of       58            he       95            about    102
to       82            million  98            market   101
a        98            year     100           they     103
in       103           its      100           this     105
and      122           be       104           would    107
that     75            was      105           you      106
for      84            company  109           which    107
is       72            an       105           bank     109
said     78            has      106           stock    110
it       78            are      109           trade    112
on       77            have     112           his      114
by       81            but      114           more     114
as       80            will     117           who      106
at       80            say      113           one      107
mr       86            new      112           their    108
with     91            share    114
Zipf's Law If the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation: r * f = c Different collections have different constants c. In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection. For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949
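As a rough check of the relation r * f = c, the sketch below recomputes r * f * 1000 / n for a few words from Callan's sample. The frequencies are taken from the frequency table earlier in the lecture; the ranks for "from" (18) and "or" (35) are inferred from the table order, and n is assumed to be exactly 19,000,000, which the slides give only as an approximate figure.

```python
# Approximate sample size from the slides (an assumption: exactly 19 million).
N = 19_000_000

# (rank, word, frequency); frequencies come from Callan's table, ranks
# for "from" and "or" are inferred from their position in that table.
observations = [
    (1, "the", 1130021),
    (2, "of", 547311),
    (3, "to", 516635),
    (18, "from", 96900),
    (35, "or", 54958),
]


def normalized_product(rank, freq, n=N):
    """Return r * f * 1000 / n; Zipf's law predicts a roughly constant value."""
    return rank * freq * 1000 / n


products = [round(normalized_product(r, f)) for r, _, f in observations]
for (rank, word, _), value in zip(observations, products):
    print(f"{word:5s} rank {rank:2d}  r*f*1000/n = {value}")
```

If c is about n / 10, the normalized product should be near 100. The lower-ranked words come close; the very commonest words fall somewhat below, which is why the fit is described as rough.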
Methods that Build on Zipf's Law Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems. Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used. Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
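A stop list and a frequency-based weight can both be derived from word counts. This sketch is only illustrative: the slides do not specify a weighting formula, so the logarithmic weight log(n / f) is an assumption, as are the function names and the toy token list.

```python
import math
from collections import Counter

# Hypothetical token stream, used only for illustration.
tokens = "the cat sat on the mat the end".split()
counts = Counter(tokens)
total = sum(counts.values())


def stop_list(counts, k):
    """Upper cut-off: treat the k most frequent words as stop words."""
    return {word for word, _ in counts.most_common(k)}


def term_weights(counts, total):
    """Assumed weighting: log(n / f), so frequent words get smaller weights."""
    return {word: math.log(total / freq) for word, freq in counts.items()}


stops = stop_list(counts, 1)
kept = [token for token in tokens if token not in stops]
weights = term_weights(counts, total)

print(stops)  # the single most frequent word
print(kept)   # tokens that survive the stop list
print(weights["the"] < weights["cat"])
```

Practical systems usually use a fixed, hand-curated stop list rather than one computed per collection; the computed version is shown only because it follows directly from the rank frequency idea.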
1. Exact Matching (Boolean Model) [Diagram with components: Documents; Query; Index database; Mechanism for determining whether a document matches a query; Set of hits.]
Evaluation of Matching: Recall and Precision If information retrieval were perfect, every hit would be relevant to the original query, and every relevant item in the body of information would be found. • Precision: percentage (or fraction) of the hits that are relevant, i.e., the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query. • Recall: percentage (or fraction) of the relevant items that are found by the query, i.e., the extent to which the query found all the items that satisfy the requirement.
Recall and Precision with Exact Matching: Example • Collection of 10,000 documents, 50 on a specific topic • Ideal search finds these 50 documents and rejects all others • Actual search identifies 25 documents; 20 are relevant but 5 are on other topics • Precision: 20/25 = 0.8 (80% of hits were relevant) • Recall: 20/50 = 0.4 (40% of relevant were found)
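The arithmetic in this example can be reproduced with sets of document identifiers. The sketch below is illustrative only; the identifiers are arbitrary integers chosen so that 20 of the 25 hits fall among the 50 relevant documents.

```python
def precision(hits, relevant):
    """Fraction of the hits that are relevant."""
    return len(hits & relevant) / len(hits)


def recall(hits, relevant):
    """Fraction of the relevant documents that were found."""
    return len(hits & relevant) / len(relevant)


# Hypothetical identifiers: documents 0..49 are the 50 relevant ones.
relevant = set(range(50))
# The search returns 20 relevant documents plus 5 on other topics.
hits = set(range(20)) | set(range(100, 105))

p = precision(hits, relevant)
r = recall(hits, relevant)
print(f"precision = {p:.1f}, recall = {r:.1f}")
```

Both functions assume non-empty sets; a query with no hits would need a separate convention for precision.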
Measuring Precision and Recall • Precision is easy to measure: • A knowledgeable person looks at each document that is identified and decides whether it is relevant. • In the example, only the 25 documents that are found need to be examined. • Recall is difficult to measure: • To know all relevant items, a knowledgeable person must go through the entire collection, looking at every object to decide if it fits the criteria. • In the example, all 10,000 documents must be examined.
Query A query is a string to match against entries in an index. The string may contain: search terms, e.g., computation; operators, e.g., computation and parallel; fields, e.g., author = Newton; metacharacters, e.g., b[aeiou]n*g. (Metacharacters can be used to build regular expressions, which will be covered later in the course.)
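To give a feel for the metacharacter example, the sketch below tests a few words against b[aeiou]n*g using Python's re module. This assumes the slide's pattern follows ordinary regular-expression conventions, where [aeiou] matches one vowel and n* matches zero or more occurrences of "n"; a particular retrieval system may define its metacharacters differently.

```python
import re

# The pattern from the slide, interpreted with Python's regex syntax
# (an assumption; search systems may use other conventions).
PATTERN = re.compile(r"b[aeiou]n*g")


def matches(word):
    """True if the whole word matches the pattern."""
    return PATTERN.fullmatch(word) is not None


for word in ["bang", "big", "bunng", "bring", "banger"]:
    print(word, matches(word))
```

Because fullmatch is used, "banger" is rejected even though it begins with a matching prefix.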
Boolean Queries Boolean query: two or more search terms, related by logical operators, e.g., and, or, not. Examples: abacus and actor; abacus or actor; (abacus and actor) or (abacus and atoll); not actor
Boolean Diagram [Venn diagram of two overlapping sets A and B, with regions labeled: A and B; A or B; not (A or B).]
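The Boolean operators correspond to set operations on the sets of documents that contain each term. The sketch below evaluates the example queries with Python sets. The document numbers are borrowed from the inverted file example later in the lecture; treating those eight documents as the whole collection (needed to evaluate not) is an assumption.

```python
# Documents containing each term (numbers from the inverted file example).
abacus = {3, 19, 22}
actor = {2, 19, 29}
atoll = {11, 34}

# Assumed universe: every document number that appears in that example.
collection = {2, 3, 5, 11, 19, 22, 29, 34}

abacus_and_actor = abacus & actor                 # intersection
abacus_or_actor = abacus | actor                  # union
combined = (abacus & actor) | (abacus & atoll)    # nested expression
not_actor = collection - actor                    # complement

print(sorted(abacus_and_actor))
print(sorted(abacus_or_actor))
print(sorted(combined))
print(sorted(not_actor))
```

Note that not is only meaningful relative to a known collection, which is one reason many systems restrict it to the form "A and not B".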
Adjacent and Near Operators abacus adj actor: Terms abacus and actor are adjacent to each other, as in the string "abacus actor". abacus near 4 actor: Terms abacus and actor are near to each other, as in the string "the actor has an abacus". Some systems support other operators, such as with (two terms in the same sentence) or same (two terms in the same paragraph).
Evaluation of Boolean Operators Precedence of operators must be defined: adj, near (high); and, not; or (low). Example: A and B or C and B is evaluated as (A and B) or (C and B)
Inverted File Inverted file: A list of search terms that are used to index a set of documents. The inverted file is organized for associative look-up, i.e., to answer the question, "In which documents does a specified search term appear?" In practical applications, the inverted file contains related information, such as the location within the document where the search terms appear.
Inverted File -- Basic Concept
Word      Document
abacus    3
          19
          22
actor     2
          19
          29
aspen     5
atoll     11
          34
Stop words are removed before building the index.
Inverted List -- Concept Inverted List: All the entries in an inverted file that apply to a specific word, e.g.
abacus    3
          19
          22
Posting: Entry in an inverted list, e.g., there are three postings for "abacus".
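An inverted file of this basic kind can be sketched as a mapping from each word to a sorted list of document numbers. The document texts, the stop list, and the function name below are all hypothetical; they are chosen so that the result reproduces the basic inverted file shown in the lecture.

```python
from collections import defaultdict

# Illustrative stop list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and"}

# Hypothetical document texts, keyed by document number.
documents = {
    2: "actor",
    3: "the abacus",
    5: "aspen",
    11: "atoll",
    19: "an abacus and the actor",
    22: "abacus",
    29: "the actor",
    34: "an atoll",
}


def build_inverted_file(documents):
    """Map each non-stop word to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:
                index[token].add(doc_id)
    # Each sorted list is the inverted list for that word; each element
    # is one posting.
    return {word: sorted(doc_ids) for word, doc_ids in sorted(index.items())}


inverted_file = build_inverted_file(documents)
for word, postings in inverted_file.items():
    print(word, postings)
```

Keeping each inverted list sorted by document number is what makes the merge-based evaluation of Boolean operators efficient.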
Evaluating a Boolean Query Example: abacus and actor. Postings for abacus: 3, 19, 22. Postings for actor: 2, 19, 29. Document 19 is the only document that contains both terms, "abacus" and "actor". To evaluate the and operator, merge the two inverted lists with a logical AND operation.
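One common way to realize the merge is a single linear pass over two sorted inverted lists. The slides do not specify the algorithm, so the two-pointer approach and the function name below are assumptions.

```python
def intersect_postings(left, right):
    """Intersect two ascending lists of document numbers in one pass."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Document appears in both lists: it satisfies the and.
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # advance the list with the smaller document number
        else:
            j += 1
    return result


abacus = [3, 19, 22]
actor = [2, 19, 29]
print(intersect_postings(abacus, actor))
```

The pass takes time proportional to the combined length of the two lists and relies on both being sorted by document number.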
Enhancements to Inverted Files -- Concept Location: The inverted file can hold information about the location of each term within the document. Uses: adjacency and near operators; user interface design, e.g., highlighting the location of search terms. Frequency: The inverted file includes the number of postings for each term. Uses: term weighting; query processing optimization.
Inverted File -- Concept (Enhanced)
Word      Postings    Document    Location
abacus    4           3           94
                      19          7
                      19          212
                      22          56
actor     3           2           66
                      19          213
                      29          45
aspen     1           5           43
atoll     3           11          3
                      11          70
                      34          40
Evaluating an Adjacency Operation Example: abacus adj actor. Postings for abacus (document, location): (3, 94), (19, 7), (19, 212), (22, 56). Postings for actor (document, location): (2, 66), (19, 213), (29, 45). Document 19, locations 212 and 213, is the only occurrence of the terms "abacus" and "actor" adjacent.
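With locations stored in the postings, adjacency and nearness can be checked per document. The sketch below uses the postings from the enhanced inverted file. It assumes that locations are word positions, that adj means the first term is immediately followed by the second, and that near n means the two positions differ by at most n in either order; the slides do not fix these details, and the function names are hypothetical.

```python
# Positional postings from the enhanced inverted file:
# document number -> list of locations within that document.
abacus = {3: [94], 19: [7, 212], 22: [56]}
actor = {2: [66], 19: [213], 29: [45]}


def adjacent(first, second):
    """Find (doc, p, p + 1) where first is at p and second at p + 1."""
    found = []
    for doc_id in sorted(first.keys() & second.keys()):
        following = set(second[doc_id])
        for location in first[doc_id]:
            if location + 1 in following:
                found.append((doc_id, location, location + 1))
    return found


def near(first, second, max_distance):
    """Find (doc, p, q) where the terms are within max_distance, any order."""
    found = []
    for doc_id in sorted(first.keys() & second.keys()):
        for p in first[doc_id]:
            for q in second[doc_id]:
                if p != q and abs(p - q) <= max_distance:
                    found.append((doc_id, p, q))
    return found


print(adjacent(abacus, actor))
print(near(abacus, actor, 4))
```

Only documents that appear in both inverted lists are examined, so the Boolean and step acts as a filter before any locations are compared.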