Natural Language Processing for Information Retrieval . Douglas W. Oard College of Information Studies. Roadmap. IR overview NLP for monolingual IR NLP for cross-language IR. What do We Mean by “Information?”. Information is data in context Databases contain data and produce information
Natural Language Processing for Information Retrieval Douglas W. Oard College of Information Studies CMSC 723 / LING 845
Roadmap • IR overview • NLP for monolingual IR • NLP for cross-language IR
What do We Mean by “Information?” • Information is data in context • Databases contain data and produce information • IR systems contain and provide information • How is it different from “Knowledge”? • Knowledge is a basis for making decisions • Many “knowledge bases” contain decision rules
What Do We Mean by “Retrieval?” • Find something that you are looking for • 3 general categories: • Known item search • Find the class home page • Answer seeking • Is Lexington or Louisville the capital of Kentucky? • Directed exploration • Who makes videoconferencing systems?
Retrieval System Model [Diagram components: User, Query Formulation, Detection, Selection, Examination, Delivery, Indexing, Index, Docs]
Query Formulation [Diagram excerpt: User, Query Formulation, Detection]
Detection • Searches the index • Not the web! • Looks for words • Desired • Required (+) • Together (“...”) • Ranks the results • Goal is “best first” [Diagram excerpt: Query Formulation, Detection, Selection, Index, Docs]
Selection
About 7381 documents match your query.
1. MAHEC Videoconference Systems. Major Category. Product Type. Product. Network System. Multipoint Conference Server (MCS) PictureTel Prism - 8 port. - size 5K - 6-Jun-97 - English
2. VIDEOCONFERENCING PRODUCTS. Aethra offers a complete product line of multimedia and videocommunications products to meet all the applications needs of... - size 4K - 1-Jul-97 - English
[Diagram excerpt: User, Detection, Selection, Index, Examination, Docs]
Relevance Feedback • Query refinement based on search results
Examination
Aethra offers a complete product line of multimedia and videocommunications products to meet all the applications needs of users. The standard product line is augmented by a bespoke service to solve customer specific functional requirements. Standard Videoconferencing Product Line Vega 384 and Vega 128, the improved Aethra Set-top systems, can be connected to any TV monitor for high quality videoconferencing up to 384 Kbps. A compact and lightweight device, VEGA is very easy to use and can be quickly installed in any office and work environment. Voyager, is the first Videoconference briefcase designed for journalist, reporters and people on-the-go. It combines high quality video-communication (up to 384 Kbps) with the necessary reliability in a small and light briefcase.
[Diagram excerpt: User, Selection, Examination, Docs, Delivery]
Delivery • Bookmark a page for later use • Email as a URL or as HTML • Cut and paste into a presentation • Print a hardcopy for later review [Diagram excerpt: User, Examination, Docs, Delivery]
Human-Machine Synergy • Machines are good at: • Doing simple things accurately and quickly • Scaling to larger collections in sublinear time • People are better at: • Accurately recognizing what they are looking for • Evaluating intangibles such as “quality” • Humans and machines are pretty bad at: • Mapping concepts into search terms
Detection Component Model [Diagram components: Information Need, Query Formulation, Query, Query Processing, Representation Function, Query Representation; Document, Document Processing, Representation Function, Document Representation; Comparison Function, Retrieval Status Value; Human Judgment, Utility]
Controlled Vocabulary Retrieval • A straightforward concept retrieval approach • Works equally well for non-text materials • Assign a unique “descriptor” to each concept • Can be done by hand for collections of limited scope • Assign some descriptors to each document • Practical for valuable collections of limited size • Use Boolean retrieval based on descriptors
Controlled Vocabulary Example
Document 1: The quick brown fox jumped over the lazy dog’s back. [Canine] [Fox]
Document 2: Now is the time for all good men to come to the aid of their party. [Political action] [Volunteerism]

Descriptor         Doc 1   Doc 2
Canine             1       0
Fox                1       0
Political action   0       1
Volunteerism       0       1

• Canine AND Fox • Doc 1 • Canine AND Political action • Empty • Canine OR Political action • Doc 1, Doc 2
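The three Boolean queries in this example can be sketched with set operations. This is a minimal illustration, not the deck's own implementation: the descriptor assignments come from the example, while the `descriptors` dictionary and the `docs_with` helper are hypothetical names introduced here.

```python
# Descriptor assignments taken from the example; the data layout is an assumption.
descriptors = {
    "Doc 1": {"Canine", "Fox"},
    "Doc 2": {"Political action", "Volunteerism"},
}

def docs_with(descriptor):
    """Return the set of documents that were assigned the given descriptor."""
    return {doc for doc, assigned in descriptors.items() if descriptor in assigned}

# AND is set intersection, OR is set union.
canine_and_fox = docs_with("Canine") & docs_with("Fox")
canine_and_political = docs_with("Canine") & docs_with("Political action")
canine_or_political = docs_with("Canine") | docs_with("Political action")

print(sorted(canine_and_fox))
print(sorted(canine_and_political))
print(sorted(canine_or_political))
```

Because retrieval only looks at assigned descriptors, the same code would work for non-text materials, as long as someone has indexed them.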
Challenges • Thesaurus design is expensive • Shifting concepts generate continuing expense • Manual indexing is even more expensive • And consistent indexing is very expensive • User needs are often difficult to anticipate • Challenge for thesaurus designers and indexers • End users find thesauri hard to use • Codesign problem with query formulation
“Bag of Words” Representation • Simple strategy for representing documents • Count how many times each term occurs • A “term” is any lexical item that you choose • A fixed-length sequence of characters (an “n-gram”) • A word (delimited by “white space” or punctuation) • Some standard “root form” of each word (e.g., a stem) • A phrase (e.g., phrases listed in a dictionary) • Counts can be recorded in any consistent order
Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword List: for, is, of, the, to

Indexed Term   Document 1   Document 2
aid            0            1
all            0            1
back           1            0
brown          1            0
come           0            1
dog            1            0
fox            1            0
good           0            1
jump           1            0
lazy           1            0
men            0            1
now            0            1
over           1            0
party          0            1
quick          1            0
their          0            1
time           0            1
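A rough sketch of how the indexed terms in this example could be produced. The stopword list and the two documents come from the slide; `toy_stem` is a hypothetical stand-in for a real stemmer and only handles the two cases this example needs (a possessive 's and a trailing "ed").

```python
import re
from collections import Counter

STOPWORDS = {"for", "is", "of", "the", "to"}  # stopword list from the slide

def toy_stem(word):
    """Crude stand-in for a real stemmer (assumption, for illustration only)."""
    if word.endswith("'s"):
        word = word[:-2]          # dog's -> dog
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]          # jumped -> jump
    return word

def bag_of_words(text):
    """Lowercase, tokenize, drop stopwords, stem, and count each term."""
    words = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    return Counter(toy_stem(w) for w in words if w not in STOPWORDS)

doc1 = bag_of_words("The quick brown fox jumped over the lazy dog’s back.")
doc2 = bag_of_words("Now is the time for all good men to come to the aid of their party.")

print(sorted(doc1))
print(sorted(doc2))
```

Every count is 1 here because no indexed term repeats within a document; the repeated words ("the", "to") are all stopwords.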
Why Boolean Retrieval Works • Boolean operators approximate natural language • Find documents about a good party that is not over • AND can discover relationships between concepts • good party • OR can discover alternate terminology • excellent party • NOT can discover alternate meanings • Democratic party
Proximity Operators • More precise versions of AND • “NEAR n” allows at most n-1 intervening terms • “WITH” requires terms to be adjacent and in order • Easy to implement, but less efficient • Store a list of positions for each word in each doc • Stopwords become very important! • Perform normal Boolean computations • Treat WITH and NEAR like AND with an extra constraint
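The implementation outline above (store positions, do the normal Boolean AND, then apply an extra constraint) can be sketched as follows. The two sample documents and all function names are hypothetical, and treating NEAR as order-independent is an assumption; the slide only fixes the order for WITH.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> doc id -> list of word positions (stopwords are kept)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def near(index, a, b, n):
    """Docs where a and b occur with at most n-1 intervening terms (either order)."""
    postings_a, postings_b = index.get(a, {}), index.get(b, {})
    return {
        doc_id
        for doc_id in postings_a.keys() & postings_b.keys()   # the plain AND step
        if any(0 < abs(pa - pb) <= n                           # the extra constraint
               for pa in postings_a[doc_id] for pb in postings_b[doc_id])
    }

def with_(index, a, b):
    """Docs where b immediately follows a (adjacent and in order)."""
    postings_a, postings_b = index.get(a, {}), index.get(b, {})
    return {
        doc_id
        for doc_id in postings_a.keys() & postings_b.keys()
        if any(pa + 1 in postings_b[doc_id] for pa in postings_a[doc_id])
    }

# Hypothetical two-document collection.
docs = {1: "the good party is not over", 2: "the party was very good"}
index = build_positional_index(docs)

print(near(index, "good", "party", 1))
print(near(index, "good", "party", 3))
print(with_(index, "good", "party"))
```

Positions are counted over all words, which is why stopwords matter: dropping them before indexing would change the distances between the remaining terms.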
Ranked Retrieval Paradigm • Exact match retrieval often gives useless sets • No documents at all, or way too many documents • Query reformulation is one “solution” • Manually add or delete query terms • “Best-first” ranking can be superior • Select every document within reason • Put them in order, with the “best” ones first • Display them one screen at a time
Similarity-Based Queries • Treat the query as if it were a document • Create a query bag-of-words • Find the similarity of each document • Using the coordination measure, for example • Rank order the documents by similarity • Most similar to the query first • Surprisingly, this works pretty well! • Especially for very short queries
Document Similarity • How similar are two documents? • In particular, how similar is their bag of words?
1: Nuclear fallout contaminated Siberia.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

Term           1   2   3
complicated    -   -   1
contaminated   1   -   -
fallout        1   -   -
information    -   1   1
interesting    -   1   -
nuclear        1   -   -
retrieval      -   1   1
siberia        1   -   -
Coordination Measure Example

Term           1   2   3
complicated    -   -   1
contaminated   1   -   -
fallout        1   -   -
information    -   1   1
interesting    -   1   -
nuclear        1   -   -
retrieval      -   1   1
siberia        1   -   -

Query: complicated retrieval, Result: 3, 2
Query: interesting nuclear fallout, Result: 1, 2
Query: information retrieval, Result: 2, 3
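The coordination measure simply counts how many query terms a document contains. The sketch below reproduces the three results in this example; the three documents are the ones from the Document Similarity slide, while the function names, the handling of "is" as a stopword, and breaking ties by document number are assumptions.

```python
docs = {
    1: "Nuclear fallout contaminated Siberia.",
    2: "Information retrieval is interesting.",
    3: "Information retrieval is complicated.",
}

def terms(text):
    """Lowercase, strip the trailing period, and drop the stopword 'is'."""
    return {word.strip(".").lower() for word in text.split()} - {"is"}

def coordination_rank(query):
    """Rank documents by the number of query terms they contain, best first."""
    query_terms = terms(query)
    scores = {doc_id: len(query_terms & terms(text)) for doc_id, text in docs.items()}
    matching = [doc_id for doc_id in scores if scores[doc_id] > 0]
    # Ties are broken by document number (an assumption).
    return sorted(matching, key=lambda doc_id: (-scores[doc_id], doc_id))

print(coordination_rank("complicated retrieval"))
print(coordination_rank("interesting nuclear fallout"))
print(coordination_rank("information retrieval"))
```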
Incorporating Term Frequency • High term frequency is evidence of meaning • And high IDF is evidence of term importance • Recompute the bag-of-words • Compute TF * IDF for every element
TF*IDF Example

Term           TF: 1   2   3   4    IDF     TF*IDF: 1      2      3      4
complicated        -   -   5   2    0.301           -      -      1.51   0.60
contaminated       4   1   3   -    0.125           0.50   0.13   0.38   -
fallout            5   -   4   3    0.125           0.63   -      0.50   0.38
information        6   3   3   2    0.000           -      -      -      -
interesting        -   1   -   -    0.602           -      0.60   -      -
nuclear            3   -   7   -    0.301           0.90   -      2.11   -
retrieval          -   6   1   4    0.125           -      0.75   0.13   0.50
siberia            2   -   -   -    0.602           1.20   -      -      -

Unweighted query: contaminated retrieval, Result: 2, 3, 1, 4
Weighted query: contaminated(3) retrieval(1), Result: 1, 3, 2, 4
IDF-weighted query: contaminated retrieval, Result: 2, 3, 1, 4
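The IDF column in this example is consistent with IDF = log10(N / df), where N = 4 documents and df is the number of documents containing the term; the slide does not state the formula, so treat it as an inference. The sketch below recomputes the weights and the weighted-query ranking. The assignment of term counts to documents is a reading of the flattened table, and `rank` is a hypothetical helper.

```python
import math

# Term frequencies read off the example table: term -> {doc: tf}.
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
N = 4  # number of documents in the collection

# Assumed formula: IDF = log10(N / document frequency).
idf = {term: math.log10(N / len(postings)) for term, postings in tf.items()}
weights = {term: {doc: count * idf[term] for doc, count in postings.items()}
           for term, postings in tf.items()}

def rank(query_weights):
    """Score each doc by the weighted sum of its TF*IDF values, best first."""
    scores = {doc: 0.0 for doc in range(1, N + 1)}
    for term, query_weight in query_weights.items():
        for doc, weight in weights.get(term, {}).items():
            scores[doc] += query_weight * weight
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

weighted = rank({"contaminated": 3, "retrieval": 1})
print(weighted)
```

For the unweighted query, documents 1, 3 and 4 all score exactly 4 × log10(4/3) before rounding, so their relative order depends on how ties are broken.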
Document Length Normalization • Long documents have an unfair advantage • They use a lot of terms • So they get more matches than short documents • And they use the same words repeatedly • So they have much higher term frequencies
Cosine Normalization Example

Term           TF: 1   2   3   4    IDF     TF*IDF: 1      2      3      4      Normalized: 1      2      3      4
complicated        -   -   5   2    0.301           -      -      1.51   0.60               -      -      0.57   0.69
contaminated       4   1   3   -    0.125           0.50   0.13   0.38   -                  0.29   0.13   0.14   -
fallout            5   -   4   3    0.125           0.63   -      0.50   0.38               0.37   -      0.19   0.44
information        6   3   3   2    0.000           -      -      -      -                  -      -      -      -
interesting        -   1   -   -    0.602           -      0.60   -      -                  -      0.62   -      -
nuclear            3   -   7   -    0.301           0.90   -      2.11   -                  0.53   -      0.79   -
retrieval          -   6   1   4    0.125           -      0.75   0.13   0.50               -      0.77   0.05   0.57
siberia            2   -   -   -    0.602           1.20   -      -      -                  0.71   -      -      -
Length                                              1.70   0.97   2.67   0.87

Unweighted query: contaminated retrieval, Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4)
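Cosine normalization divides each TF*IDF weight by the Euclidean length of its document vector. The sketch below recomputes the Length row and the normalized ranking for this example. The assignment of counts to documents is a reading of the flattened table, and IDF = log10(N / df) is inferred from the IDF column rather than stated on the slide.

```python
import math

# Term frequencies read off the example table: term -> {doc: tf}.
tf = {
    "complicated":  {3: 5, 4: 2},
    "contaminated": {1: 4, 2: 1, 3: 3},
    "fallout":      {1: 5, 3: 4, 4: 3},
    "information":  {1: 6, 2: 3, 3: 3, 4: 2},
    "interesting":  {2: 1},
    "nuclear":      {1: 3, 3: 7},
    "retrieval":    {2: 6, 3: 1, 4: 4},
    "siberia":      {1: 2},
}
doc_ids = [1, 2, 3, 4]

# Assumed formula: IDF = log10(N / document frequency).
idf = {term: math.log10(len(doc_ids) / len(postings)) for term, postings in tf.items()}
tfidf = {term: {doc: count * idf[term] for doc, count in postings.items()}
         for term, postings in tf.items()}

# Euclidean length of each document's TF*IDF vector.
length = {doc: math.sqrt(sum(row.get(doc, 0.0) ** 2 for row in tfidf.values()))
          for doc in doc_ids}

normalized = {term: {doc: weight / length[doc] for doc, weight in row.items()}
              for term, row in tfidf.items()}

def rank(query_terms):
    """Inner product of an unweighted query with the normalized doc vectors."""
    scores = {doc: sum(normalized.get(term, {}).get(doc, 0.0) for term in query_terms)
              for doc in doc_ids}
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

ranking = rank(["contaminated", "retrieval"])
print({doc: round(value, 2) for doc, value in length.items()})
print(ranking)
```

Document 3 is the longest vector, so normalization pushes it from second place to last for this query.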
Summary So Far • Find documents most similar to the query • Optionally, obtain query term weights • Given by the user, or computed from IDF • Compute document term weights • Some combination of TF and IDF • Normalize the document vectors • Cosine is one way to do this • Compute inner product of query and doc vectors • Multiply corresponding elements and then add
Passage Retrieval • Another approach to the long-document problem • Break it up into coherent units • Recognizing topic boundaries is hard • But overlapping 300-word passages work fine • Document rank is best passage rank • And passage information can help guide browsing
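One way to sketch the overlapping-passage idea: slide a fixed-size window over the document and let the document take the score of its best passage. The 50% overlap (`step=150`), the use of the coordination measure as the passage score, and all function names are assumptions; the slide specifies only overlapping 300-word passages and best-passage ranking.

```python
def passages(words, size=300, step=150):
    """Yield overlapping windows of `size` words, advancing `step` words each time."""
    if len(words) <= size:
        yield words
        return
    for start in range(0, len(words) - size + step, step):
        yield words[start:start + size]

def passage_score(passage, query_terms):
    """Coordination measure: number of distinct query terms in the passage."""
    return len(query_terms & set(passage))

def document_score(text, query, size=300, step=150):
    """A document scores as well as its best passage."""
    words = text.lower().split()
    query_terms = set(query.lower().split())
    return max(passage_score(p, query_terms) for p in passages(words, size, step))

# Tiny hypothetical document, with a 4-word window so the effect is visible.
text = "a b c d e f g h i j"
print(document_score(text, "d e", size=4, step=2))
print(document_score(text, "a j", size=4, step=2))
```

In the second query the two terms sit at opposite ends of the document, so no single passage contains both and the score is lower than a whole-document match would give. That is the intended effect: scattered matches in a long document count for less than matches that occur close together.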
Advantages of Ranked Retrieval • Closer to the way people think • Some documents are better than others • Enriches browsing behavior • Decide how far down the list to go as you read it • Allows more flexible queries • Long and short queries can produce useful results
Ranked Retrieval Challenges • “Best first” is easy to say but hard to do! • Probabilistic retrieval tries to approximate it • How can the user understand the ranking? • It is hard to use a tool that you don’t understand • Efficiency may become a concern • More complex computations take more time
Evaluation Criteria • Effectiveness • Set, ranked list, user-machine system • Efficiency • Retrieval time, indexing time, index size • Usability • Learnability, novice use, expert use
What is Relevance? • Relevance relates a topic and a document • Duplicates are equally relevant by definition • Constant over time and across users • Pertinence relates a task and a document • Accounts for quality, complexity, language, … • Utility relates a user and a document • Accounts for prior knowledge • We seek utility, but relevance is what we get!
IR Effectiveness Evaluation • System-centered strategy • Given documents, queries, and relevance judgments • Try several variations on the retrieval system • Measure which ranks more good docs near the top • User-centered strategy • Given several users, and at least 2 retrieval systems • Have each user try the same task on both systems • Measure which system works the “best”
Measures of Effectiveness • Good measures: • Capture some aspect of what the user wants • Have predictive value for other situations • Different queries, different document collection • Are easily replicated by other researchers • Can be expressed as a single number • Allows two systems to be easily compared
IR Test Collections • Documents • Representative quantity • Representative sources and topics • Topics • Used to form queries • Relevance judgments • For each document, with respect to each topic • This is the expensive part!
Some Assumptions • Unchanging, known queries • The same queries are used by each system • Binary relevance • Every document is either relevant or it is not • Unchanging, known relevance • The relevance of each doc to each query is known • But only used for evaluation, not retrieval! • Focus on effectiveness
The Contingency Table

                      Action
Doc            Retrieved              Not Retrieved
Relevant       Relevant Retrieved     Relevant Rejected
Not relevant   Irrelevant Retrieved   Irrelevant Rejected
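Precision and recall are both read off the "Relevant Retrieved" cell of this table. The slides use the two terms without defining them, so the sketch below uses the standard definitions: precision is relevant retrieved over all retrieved, and recall is relevant retrieved over all relevant. The document ids in the usage example are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    # Precision: what fraction of the retrieved documents are relevant?
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents were retrieved?
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical case: 5 documents retrieved, 4 relevant overall, 3 in common.
precision, recall = precision_recall({1, 2, 3, 4, 5}, {1, 3, 5, 9})
print(precision, recall)
```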
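The Precision-Recall Curve [Figure: precision-recall curve for a ranked list of 10 documents (Doc=10), of which 4 are relevant (Relevant=4, marked R) and 6 are not relevant (Not relevant=6), shown with the contingency table: Retrieved / Not Retrieved by Relevant / Not relevant]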
[Figure labels on the precision-recall curve: Precision at recall=0.1, Average Precision, Breakeven Point, Precision at 10 docs]
Single-Number MOE Weaknesses • Precision at 10 documents • Pays no attention to recall • Precision at constant recall • A specific recall fraction is rarely the user’s goal • Breakeven point • Nobody ever searches at the breakeven point • Average precision • Users typically operate near an extreme of the curve • So the average is not very informative
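Three of these single-number measures can be computed directly from a ranked list of relevance flags. The slides do not define them, so the sketch assumes the common uninterpolated definition of average precision (the mean of the precision values at each relevant document's rank, divided by the total number of relevant documents) and takes the breakeven point as precision at rank R, where R is the number of relevant documents. The ranked list is hypothetical: 10 documents with 4 relevant.

```python
def precision_at(flags, k):
    """Precision over the top k documents of a ranked list of 0/1 relevance flags."""
    return sum(flags[:k]) / k

def average_precision(flags, total_relevant):
    """Uninterpolated average precision (assumed definition, see lead-in)."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank      # precision at this relevant document
    return total / total_relevant if total_relevant else 0.0

# Hypothetical ranking: relevant documents at ranks 1, 3, 6 and 10.
flags = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
total_relevant = 4

p_at_10 = precision_at(flags, 10)
breakeven = precision_at(flags, total_relevant)   # precision equals recall at rank R
ap = average_precision(flags, total_relevant)
print(p_at_10, breakeven, round(ap, 3))
```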
Why Choose Average Precision? • It is easy to trade between recall and precision • Adding related query terms improves recall • But naive query expansion techniques kill precision • Limiting matches by part-of-speech helps precision • But it almost always hurts recall • Comparisons should give some weight to both • Average precision is a principled way to do this • Rewards improvements in either factor
How Much is Enough? • The maximum average precision is 1.0 • But inter-rater reliability is 0.8 or less • So 0.8 is a practical upper bound at every point • Precision 0.8 is sometimes seen at low recall • Two goals • Achieve a meaningful amount of improvement • This is a judgment call, and depends on the application • Achieve that improvement reliably across queries • This can be verified using statistical tests
Obtaining Relevance Judgments • Exhaustive assessment can be too expensive • TREC has 50 queries for >1 million docs each year • Random sampling won’t work either • If relevant docs are rare, none may be found! • IR systems can help focus the sample • Each system finds some relevant documents • Different systems find different relevant documents • Together, enough systems will find most of them
Pooled Assessment Methodology • Each system submits top 1000 documents • Top 100 documents for each are judged • All are placed in a single pool • Duplicates are eliminated • Placed in an arbitrary order to avoid bias • Evaluated by the person who wrote the query • Assume unevaluated documents are not relevant • Overlap evaluation shows diminishing returns • Compute average precision over all 1000 docs
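The pooling steps above can be sketched in a few lines. This is an illustration under stated assumptions, not TREC's actual tooling: `build_pool`, `is_relevant`, the system names, and the document ids are all hypothetical, and a seeded shuffle stands in for "arbitrary order".

```python
import random

def build_pool(runs, depth=100, seed=0):
    """Pool the top `depth` documents of every run, dedupe, and shuffle."""
    pool = set()                                  # a set eliminates duplicates
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    pool = sorted(pool)                           # deterministic starting order
    random.Random(seed).shuffle(pool)             # arbitrary order to avoid bias
    return pool

def is_relevant(doc_id, judgments):
    """Documents that were never judged are assumed not relevant."""
    return judgments.get(doc_id, False)

# Hypothetical runs from two systems, pooled to depth 2.
runs = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d5"]}
pool = build_pool(runs, depth=2)
judgments = {"d1": True, "d2": False, "d4": True}  # assessor's decisions on the pool

print(sorted(pool))
print(is_relevant("d3", judgments))
```

Documents below the pooling depth (here `d3` and `d5`) still count when scoring a run; they are simply treated as not relevant.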
Lessons From TREC • Incomplete judgments are useful • If sample is unbiased with respect to systems tested • Different relevance judgments change absolute score • But rarely change comparative advantages when averaged • Evaluation technology is predictive • Results transfer to operational settings (Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999)
Roadmap • IR overview • NLP for monolingual IR • NLP for cross-language IR
Machine Assisted Indexing • Automatically suggest controlled vocabulary • Better consistency with lower cost • Typically use a rule-based expert system • Design thesaurus by hand in the usual way • Design an expert system to process text • String matching, proximity operators, … • Write rules for each thesaurus/collection/language • Try it out and fine tune the rules by hand