
CS533 Information Retrieval

Learn about techniques and examples of relevance feedback, query modification, and automatic abstract generation in information retrieval systems. Understand how to improve retrieval results using user judgments and query adjustments.


Presentation Transcript


  1. CS533 Information Retrieval Dr. Michal Cutler Lecture #24 April 28, 1999

  2. Relevance feedback • The main idea • Issues • Query modification examples

  3. Relevance Feedback • A technique for modifying a query • The weights of query terms may be modified and/or new terms may be added to the query • Relevance feedback is a very powerful technique that yields significant improvements in retrieval results

  4. Using relevance feedback (traditional) • Use the initial query to retrieve some items • Ask the user to judge the retrieved items as relevant/non-relevant • (true/false only, with no fuzzier judgments such as “very relevant” or “somewhat relevant”) • Modify the query

  5. Main idea • [Diagram: the original query is moved towards the relevant documents and away from the non-relevant documents, producing the modified query] • Move the query towards good items, away from bad ones

  6. Changing the query - how? • Change the weights of query terms? • Add new query terms? How many new terms, and which ones, will be included in the modified query? • In TREC only the “best” 20-100 terms from the relevant documents are usually added • What weights should the new terms receive? • Delete some query terms?

  7. Modify based on what? • Original query? • Relevant retrieved documents? • Use or ignore (non relevant & retrieved) documents? • Use non-retrieved items? How?

  8. All retrieved documents? • Long documents may cover many topics and move the query in undesirable directions • Should all retrieved documents be used, or maybe only some (for example ignore long ones)?

  9. Part or whole document? • Use only “good passages” instead of the whole document?

  10. Not enough information? • Query has 0 hits? • Hits but no relevant documents? • Only one relevant document?

  11. Query Modification for the vector space model • In this method the original query is ignored and a new query is formed based on the retrieved set • Given a set R of relevant retrieved items and a set N of non-relevant retrieved items

  12. Query Modification • Let Di be the weight vector for document i • This method computes an average weight vector from the good (relevant) items and subtracts from it the average weight vector of the bad (non-relevant) items

  13. Query Modification • The new query is the difference of the two averages: Q1 = (1/|R|) * (sum of Di over Di in R) - (1/|N|) * (sum of Di over Di in N)

  14. Query Modification (2) • The original query Q0 is retained • The method includes three parameters (the weights given to Q0, to the relevant average and to the non-relevant average), determined experimentally or using some learning technique
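
  A minimal sketch of this style of query modification, assuming queries and documents are represented as term-to-weight dictionaries; the function name and the default parameter values are illustrative, not taken from the lecture:

    from collections import defaultdict

    def rocchio_modify(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Vector-space query modification: keep the original query and move it
        towards the average relevant vector and away from the average
        non-relevant vector.  Vectors are dicts mapping term -> weight."""
        q1 = defaultdict(float)
        for term, w in q0.items():                    # retain the original query
            q1[term] += alpha * w
        for doc in relevant:                          # move towards the good items
            for term, w in doc.items():
                q1[term] += beta * w / len(relevant)
        for doc in nonrelevant:                       # move away from the bad items
            for term, w in doc.items():
                q1[term] -= gamma * w / len(nonrelevant)
        return {t: w for t, w in q1.items() if w > 0}   # drop negative weights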

  15. Feedback without relevance information • In well-performing retrieval systems the top few (small n) documents have high precision • Experiments (e.g. in TREC) have shown that assuming all of the top n documents are relevant and performing positive feedback improves retrieval results
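
  A sketch of this blind (pseudo-relevance) feedback, reusing the rocchio_modify sketch above and assuming a hypothetical search(query) function that returns a ranked list of document vectors:

    def pseudo_feedback(q0, search, n=10):
        """Treat the top n retrieved documents as relevant and perform
        positive-only feedback (no non-relevant set, gamma = 0)."""
        assumed_relevant = search(q0)[:n]
        return rocchio_modify(q0, assumed_relevant, nonrelevant=[], gamma=0.0)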

  16. Feedback with other retrieval models • Feedback has been employed for other ranking retrieval models such as the probabilistic and fuzzy models • Pure Boolean systems use feedback to extend Boolean queries

  17. Using relevance feedback (Dunlop 1997) • Main idea: also use • (relevant & non-matching) documents and/or • (non-relevant & non-matching) documents • Such documents can be found by browsing hypertext or hypermedia collections

  18. Using relevance feedback (Dunlop 1997) • A (relevant & non matching) document should affect the query more than a (relevant & matching) one • A (non relevant & non matching) document should affect the query less than a (non relevant & matching) one
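
  A sketch of feedback that uses all four document categories; the influence values below are purely illustrative (the slides do not give Dunlop's actual weighting) and only encode the ordering described above:

    def four_category_feedback(q0, rel_matching, rel_nonmatching,
                               nonrel_matching, nonrel_nonmatching):
        """Relevant & non-matching documents pull the query harder than
        relevant & matching ones; non-relevant & non-matching documents
        push it away less than non-relevant & matching ones."""
        groups = [(rel_matching, 0.5), (rel_nonmatching, 1.0),           # stronger pull
                  (nonrel_matching, -0.5), (nonrel_nonmatching, -0.25)]  # weaker push
        q1 = dict(q0)
        for docs, influence in groups:
            for doc in docs:
                for term, w in doc.items():
                    q1[term] = q1.get(term, 0.0) + influence * w / max(len(docs), 1)
        return {t: w for t, w in q1.items() if w > 0}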

  19. Automatic abstract generation • AI approach • IR approach • Examples of text extraction systems

  20. Types of Abstracts • Automatic creation of summaries of texts • Short summaries indicating what a document is about • Longer abstracts which summarize the main information contained in the text

  21. Abstract types • Summary in response to a query • Summarizing more than one text, or • Creating a summary of the portions of the text which are relevant to the query

  22. Abstract classification • Classified as indicative and/or informative • Indicative abstracts - help the reader decide whether to read the document • Informative abstracts - also contain informative material such as the main results and conclusions

  23. Abstracts • In this case a user may not need to read the paper • Critical or comparative material is more difficult to generate and is ignored in this discussion

  24. Evaluation criteria • Cohesion • Balance and coverage • Repetition • Length

  25. Automatic abstracting • The information retrieval approach involves • Selecting portions of the text, and • Attempting to make them belong together

  26. Automatic abstracting • Artificial intelligence approaches to text summarization: • Extract semantic information, • Instantiate pre-defined constructs • Use instantiated constructs to generate a summary

  27. An artificial intelligence approach • DeJong’s FRUMP system analyses news articles by: • Instantiating slots in one of a predefined set of scripts • Using the instantiated script to generate a summary

  28. An artificial intelligence approach • In Rau’s SCISOR system, a detailed linguistic analysis of a text results in the construction of semantic graphs. • A natural language generator produces a summary from the stored material

  29. The artificial intelligence approach • Systems are only capable of summarizing text in a narrow domain • The artificial intelligence approaches are fragile • If the system does not recognize the main topic, its extraction may be erroneous

  30. Automatic text extraction • First experiment reported by Luhn (1958) • Provided extracts • An extract is a set of sentences (or paragraphs) selected to provide a good indication of the subject matter of the document

  31. Luhn’s approach 1. For each sentence • look for clues of its importance, • compute a score for the sentence based on the clues found in it

  32. Luhn’s approach 2. Select all the sentences with a score above a threshold, or the highest scoring sentences up to a predefined sum of the scores 3. Print the sentences in their order of occurrence in the original text
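
  Steps 2 and 3 reduce to a simple filter that keeps high-scoring sentences in their original order; a minimal sketch of the threshold variant:

    def luhn_extract(sentences, scores, threshold):
        """Keep the sentences scoring above the threshold, in their original
        order of occurrence in the text (steps 2-3 of Luhn's procedure)."""
        return [s for s, score in zip(sentences, scores) if score > threshold]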

  33. Concept importance (Luhn) • Extracted words, • Eliminated stop words, • Did some conflation (stemming), • Selected words with frequency above a threshold • (Luhn was the first to associate concept importance with frequency)
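
  A sketch of this keyword-selection step; the stop-word list and the crude suffix-stripping below stand in for Luhn's actual lists and conflation rules:

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

    def significant_words(text, min_freq=3):
        """Tokenize, drop stop words, apply a crude conflation step, and keep
        the words whose frequency is above the threshold."""
        stem = lambda w: w.rstrip("s")                # stand-in for real stemming
        tokens = [stem(w) for w in re.findall(r"[a-z]+", text.lower())
                  if w not in STOP_WORDS]
        counts = Counter(tokens)
        return {w for w, c in counts.items() if c >= min_freq}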

  34. Sentence importance (Luhn) • Looked for clusters of keywords in a sentence and • Based the sentence score on these clusters • A cluster is formed by keywords that occur close together (no more than 4 words between)

  35. Sentence score (Luhn) • If the length of the cluster is X, • and it contains Y significant words, • the score of the cluster is Y^2/X • The score of the sentence is the highest cluster score, or 0 if the sentence has no clusters

  36. Example (Luhn) • Sentence is (- -[*-**- -*] - -) (11 words) • Length of cluster X=7 (number of words in brackets) • Y=4 (number of significant (*) words) • Sentence score is Y^2/X = 16/7 = 2.3
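
  The cluster scoring can be sketched as follows; treating “*” as the only significant word reproduces the example above (16/7 ≈ 2.29):

    def luhn_sentence_score(tokens, significant, max_gap=4):
        """Score a sentence by its best keyword cluster.  A cluster is a run
        bounded by significant words with at most max_gap insignificant words
        between consecutive significant words; its score is Y**2 / X, where Y
        is the number of significant words and X is the cluster length."""
        positions = [i for i, w in enumerate(tokens) if w in significant]
        if not positions:
            return 0.0
        best, start, prev, count = 0.0, positions[0], positions[0], 1
        for pos in positions[1:]:
            if pos - prev - 1 <= max_gap:             # still inside the same cluster
                count += 1
            else:                                     # close the cluster, open a new one
                best = max(best, count ** 2 / (prev - start + 1))
                start, count = pos, 1
            prev = pos
        return max(best, count ** 2 / (prev - start + 1))

    print(luhn_sentence_score("- - * - * * - - * - -".split(), {"*"}))   # 2.2857...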

  37. Clues for word importance (Edmundson 1964) • Keywords in titles • Keywords selected from the title, subtitle and headings of the document get a higher score • Edmundson eliminated stop words, and gave higher scores to terms from the main title than to terms from lower-level headings

  38. Sentence importance clues (Edmundson) • The location of the sentence • Frequently the first and the last sentence of a paragraph are the most important sentences

  39. Sentence importance clues (Edmundson) • Edmundson used this observation to score sentences using their location • in a paragraph, • in a document (the first few and last few paragraphs are important), • below a heading, etc.

  40. Clues for sentence importance (Edmundson) • Certain words and phrases, which are not keywords, provide information on sentence importance • Used 783 bonus words, which increase the sentence score, and • 73 stigma words, which decrease the score

  41. Clues for sentence importance (Edmundson) • Bonus words include superlatives and value words such as “greatest” and “significant” • Stigma words include anaphors and belittling expressions such as “hardly” and “impossible”
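
  A sketch of a combined Edmundson-style score; the word lists are tiny stand-ins (not the actual 783 bonus and 73 stigma words) and the clue weights are illustrative:

    BONUS_WORDS = {"greatest", "significant", "important"}   # stand-in cue words
    STIGMA_WORDS = {"hardly", "impossible"}                  # stand-in stigma words

    def edmundson_score(tokens, title_words, sentence_index, paragraph_length,
                        w_cue=1.0, w_title=1.0, w_location=1.0):
        """Combine the cue-word, title-keyword and location clues linearly."""
        cue = (sum(w in BONUS_WORDS for w in tokens)
               - sum(w in STIGMA_WORDS for w in tokens))
        title = sum(w in title_words for w in tokens)
        # the first and last sentence of a paragraph are frequently the most important
        location = 1.0 if sentence_index in (0, paragraph_length - 1) else 0.0
        return w_cue * cue + w_title * title + w_location * location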

  42. Indicator phrases (Rush 1973) • Used a word control list of positive and negative indicators • Negative indicators eliminated sentences • Strong positives were expressions about the text topic: “our work”, “the purpose of”

  43. Indicator constructs (Paice 1981) • More elaborate positive constructs: • “The main aim of the present paper is to describe” • “The purpose of this article is to review” • “our investigation has shown that”

  44. Indicative constructs (Paice) • There are only 7 or 8 distinctive types of indicative phrases, which can be identified by matching against a template that allows substitution of alternative words or phrases • Not all texts contain such phrases • Useful when they can be found
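
  A sketch of template matching for such constructs; the patterns below are invented examples in the spirit of the phrases on slide 43, not Paice's actual templates:

    import re

    INDICATOR_TEMPLATES = [
        r"the (main |chief )?(aim|purpose) of (this|the present) (paper|article|study)",
        r"the purpose of this (paper|article) is to (describe|review|present)",
        r"our (work|investigation) has shown that",
    ]

    def indicator_score(sentence):
        """Count how many indicator constructs the sentence matches."""
        s = sentence.lower()
        return sum(bool(re.search(pattern, s)) for pattern in INDICATOR_TEMPLATES)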

  45. Skorokhod’ko’s extraction (1972) • Builds a semantic structure for the document • Generates a graph, • Sentences are nodes, • Sentences that refer to the same concepts (identified using a thesaurus) are connected by an edge

  46. Skorokhod’ko’s extraction • The most significant sentences are those which are related to a large number of other sentences • Such sentences are prime candidates for extraction

  47. Skorokhod’ko’s extraction • Sentences are scored based on: • Number of sentences to which they are significantly related and • Degree of change in the graph structure which would result from a deletion of the sentence
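
  A sketch of the first scoring clue, the number of significantly related sentences; the second clue (the change in graph structure caused by deleting a sentence) is not modelled here, and related(a, b) is an assumed predicate for shared thesaurus concepts:

    def skorokhodko_scores(sentences, related):
        """Build the sentence graph and score each sentence by the number of
        other sentences it is connected to."""
        n = len(sentences)
        degree = [0] * n
        for i in range(n):
            for j in range(i + 1, n):
                if related(sentences[i], sentences[j]):   # add an edge i -- j
                    degree[i] += 1
                    degree[j] += 1
        return degree            # higher degree -> stronger extraction candidate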

  48. Text cohesion • The extracts discussed so far suffer from a lack of cohesion • We discuss the lack of cohesion caused by explicit references in a sentence which can only be understood by reference to material elsewhere in the text

  49. Text cohesion • This covers • anaphoric references (he, she, etc.), • lexical or definite references (these objects, the oldest), and • the use of rhetorical connectives (“so”, “however”, “on the other hand”) • Other levels of coherence are not addressed

  50. Text cohesion (Rush) • Rush attempted to deal with the problem of anaphors by either adding preceding sentences or, if more than three preceding sentences would need to be added, by deleting the sentence
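
  A crude sketch of this repair rule; the anaphor list is a toy stand-in, and since the code cannot tell how many predecessors are really needed it simply pulls in the immediately preceding sentence:

    ANAPHORS = {"he", "she", "it", "they", "this", "these", "however", "so"}  # toy list

    def patch_cohesion(extract_indices, sentences):
        """If an extracted sentence opens with an anaphoric or connective word,
        add the preceding sentence for context; a fuller implementation would
        add up to three predecessors and drop the sentence otherwise."""
        patched = set()
        for i in extract_indices:
            first = sentences[i].split()[0].lower().strip(",.")
            if first in ANAPHORS and i > 0:
                patched.add(i - 1)                    # pull in the preceding sentence
            patched.add(i)
        return sorted(patched)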
