
CS 430 / INFO 430 Information Retrieval

This lecture covers course administration details and provides guidelines for preparing for the discussion class, focusing on exploring search services. It also introduces similarity ranking methods in information retrieval.

sandraphall

Presentation Transcript


  1. CS 430 / INFO 430 Information Retrieval Lecture 2 Searching Full Text 2

  2. Course Administration Web site: http://www.cs.cornell.edu/courses/cs430/2005fa Notices: See the home page on the course Web site Sign-up sheet: If you did not sign up at the first class, please sign up now.

  3. Course Administration Please send all questions about the course to: cs430-l@lists.cs.cornell.edu The message will be sent to William Arms and the Teaching Assistants.

  4. Course Administration Discussion class, Wednesday, August 2 Upson B17, 7:30 to 8:30 p.m. Prepare for the class as instructed on the course Web site. Participation in the discussion classes is one third of the grade, but tomorrow's class will not be included in the grade calculation.

  5. Discussion Classes Format: Questions. Ask a member of the class to answer. Provide opportunity for others to comment. When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear. Suggestions: Do not be shy at presenting partial answers. Differing viewpoints are welcome.

  6. Discussion Class: Preparation You are given two problems to explore: • What is the medical evidence that red wine is good or bad for your health? • What in history led to the current turmoil in Palestine and the neighboring countries? In preparing for the class, focus on the question: What characteristics of the three search services are helpful or lead to difficulties in addressing these two problems? The aim of your preparation is to explore the search services, not to solve these two problems. Take care. Many of the documents that you might find are written from a one-sided viewpoint.

  7. Discussion Class: Preparation In preparing for the discussion classes, you may find it useful to look at the slides from last year's class on the old Web site: http://www.cs.cornell.edu/Courses/cs430/2004fa/

  8. Similarity Ranking Methods Methods that look for matches (e.g., Boolean) assume that a document is either relevant to a query or not relevant. Similarity ranking methods measure the degree of similarity between a query and a document. [Diagram: a Query and a set of Documents, linked by "Similar": how similar is a document to a request?]

  9. Similarity Ranking Methods [Diagram: Documents (via an Index database) and a Query feed a mechanism for determining the similarity of the query to the document. The output is a set of documents ranked by how similar they are to the query.]

  10. Term Similarity: Example Problem: Given two text documents, how similar are they? [Methods that measure similarity do not assume exact matches.] A document can be any length from one word to thousands. A query is a special type of document. Example: Here are three documents. How similar are they?
      d1: ant ant bee
      d2: dog bee dog hog dog ant dog
      d3: cat gnu dog eel fox

  11. Term Similarity: Basic Concept Two documents are similar if they contain some of the same terms. Possible measures of similarity might take into consideration: (a) The number of terms that are shared (b) Whether the terms are common or unusual (c) How many times each term appears (d) The lengths of the documents

  12. TERM VECTOR SPACE Term vector space: an n-dimensional space, where n is the number of different terms used to index a set of documents (i.e., the size of the word list). Vector: document i is represented by a vector. Its magnitude in dimension j is $t_{ij}$, where $t_{ij} > 0$ if term j occurs in document i, and $t_{ij} = 0$ otherwise. $t_{ij}$ is the weight of term j in document i.

  13. A Document Represented in a 3-Dimensional Term Vector Space [Figure: document vector d1 drawn against axes t1, t2, t3, with components t11, t12, t13 along the respective axes.]

  14. Basic Method: Incidence Matrix (No Weighting)
      document   text                          terms
      d1         ant ant bee                   ant bee
      d2         dog bee dog hog dog ant dog   ant bee dog hog
      d3         cat gnu dog eel fox           cat dog eel fox gnu
      Incidence matrix:
           ant  bee  cat  dog  eel  fox  gnu  hog
      d1    1    1    0    0    0    0    0    0
      d2    1    1    0    1    0    0    0    1
      d3    0    0    1    1    1    1    1    0
      3 vectors in 8-dimensional term vector space. Weights: tij = 1 if document i contains term j and zero otherwise.
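
As a rough illustration of the incidence matrix on this slide, the following Python sketch builds the word list and the 0/1 vectors from the three example documents. The function and variable names are illustrative choices, not part of the lecture, and splitting on whitespace is an assumed tokenization.

```python
# Illustrative sketch only: names and whitespace tokenization are assumptions.
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

def incidence_matrix(docs):
    """Return (sorted word list, {document name: 0/1 vector over the word list})."""
    vocab = sorted({term for text in docs.values() for term in text.split()})
    matrix = {}
    for name, text in docs.items():
        present = set(text.split())  # repeated terms count once (no weighting)
        matrix[name] = [1 if term in present else 0 for term in vocab]
    return vocab, matrix

vocab, matrix = incidence_matrix(docs)
print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Each row has one entry per term in the word list, so the three documents become three vectors in an 8-dimensional space.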

  15. Basic Vector Space Methods: Similarity Similarity The similarity between two documents is a function of the angle between their vectors in the term vector space.

  16. Two Documents Represented in 3-Dimensional Term Vector Space [Figure: document vectors d1 and d2 drawn against axes t1, t2, t3, with θ the angle between them.]

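  17. Vector Space Revision $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.
      Length of $\mathbf{x}$ is given by (extension of Pythagoras's theorem): $|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$
      If $\mathbf{x}_1$ and $\mathbf{x}_2$ are vectors, the inner product (or dot product) is given by: $\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23} + \cdots + x_{1n}x_{2n}$
      Cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$: $\cos(\theta) = \dfrac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1|\,|\mathbf{x}_2|}$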
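
The three formulas on this slide translate fairly directly into code. The sketch below is one possible rendering in Python; the function names are illustrative, and returning 0.0 when either vector has zero length is an assumption made here to avoid division by zero (the slide does not address that case).

```python
import math

def length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(xi * xi for xi in x))

def dot(x1, x2):
    """Inner (dot) product of two vectors of equal dimension."""
    return sum(a * b for a, b in zip(x1, x2))

def cosine(x1, x2):
    """Cosine of the angle between x1 and x2.

    Assumption (not from the slide): a zero-length vector gives 0.0.
    """
    denom = length(x1) * length(x2)
    return dot(x1, x2) / denom if denom else 0.0

print(length([3, 4]))
print(cosine([1, 0, 0], [1, 1, 0]))
```

For vectors with non-negative weights, as in the term vector space, the cosine lies between 0 (no shared terms) and 1 (same direction).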

  18. Example: Comparing Documents (No Weighting)
           ant  bee  cat  dog  eel  fox  gnu  hog  length
      d1    1    1    0    0    0    0    0    0    √2
      d2    1    1    0    1    0    0    0    1    √4 = 2
      d3    0    0    1    1    1    1    1    0    √5

  19. Example: Comparing Documents Similarity of documents in example:
            d1    d2    d3
      d1    1     0.71  0
      d2    0.71  1     0.22
      d3    0     0.22  1
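
The similarity values on this slide can be checked with a short script. The sketch below hard-codes the unweighted vectors for d1, d2 and d3 (term order: ant, bee, cat, dog, eel, fox, gnu, hog) and computes the cosine for every pair. The helper name and the rounding to two decimals are choices made for this illustration.

```python
import math

# Unweighted (0/1) vectors over: ant bee cat dog eel fox gnu hog
vectors = {
    "d1": [1, 1, 0, 0, 0, 0, 0, 0],
    "d2": [1, 1, 0, 1, 0, 0, 0, 1],
    "d3": [0, 0, 1, 1, 1, 1, 1, 0],
}

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    inner = sum(a * b for a, b in zip(x, y))
    return inner / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

# Pairwise similarities, rounded to two decimals as on the slide.
similarity = {
    (i, j): round(cosine(vectors[i], vectors[j]), 2)
    for i in vectors
    for j in vectors
}

for i in vectors:
    print(i, [similarity[(i, j)] for j in vectors])
```

For example, d1 and d2 share two terms and have lengths √2 and 2, giving 2 / (√2 × 2) ≈ 0.71, while d1 and d3 share no terms and so have similarity 0.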

  20. Simple Uses of Vector Similarity in Information Retrieval Threshold For query q, retrieve all documents with similarity above a threshold, e.g., similarity > 0.50. Ranking For query q, return the n most similar documents ranked in order of similarity. [This is the standard practice.]
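
The two retrieval strategies on this slide could be sketched as follows, assuming the similarity of each document to the query has already been computed. The function names, the document labels and the scores in `scores` are made up for illustration and do not come from the lecture.

```python
def retrieve_by_threshold(scores, threshold):
    """Return (document, similarity) pairs strictly above the threshold, best first."""
    hits = [(doc, s) for doc, s in scores.items() if s > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

def retrieve_top_n(scores, n):
    """Return the n most similar (document, similarity) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical similarity scores for four documents against some query q.
scores = {"a": 0.82, "b": 0.15, "c": 0.50, "d": 0.64}

print(retrieve_by_threshold(scores, 0.50))
print(retrieve_top_n(scores, 3))
```

With a threshold, the size of the result set depends on the scores (here document c, at exactly 0.50, is excluded), whereas ranking always returns up to n documents regardless of how low the scores are.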

  21. Similarity of Query to Documents (No Weighting)
      query      q             ant dog
      document   text                          terms
      d1         ant ant bee                   ant bee
      d2         dog bee dog hog dog ant dog   ant bee dog hog
      d3         cat gnu dog eel fox           cat dog eel fox gnu
      Incidence matrix:
           ant  bee  cat  dog  eel  fox  gnu  hog
      q     1    0    0    1    0    0    0    0
      d1    1    1    0    0    0    0    0    0
      d2    1    1    0    1    0    0    0    1
      d3    0    0    1    1    1    1    1    0

  22. Calculate Ranking Similarity of query to documents in example:
           d1    d2     d3
      q    1/2   1/√2   1/√10
           0.5   0.71   0.32
      If the query q is searched against this document set, the ranked results are: d2, d1, d3
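
The ranking on this slide can be reproduced with a compact sketch. With 0/1 weights, the dot product of two vectors is the number of shared terms and the squared length is the number of distinct terms, so the cosine can be computed on term sets directly. The function names and whitespace tokenization are assumptions of this illustration.

```python
import math

docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

def binary_cosine(text_a, text_b):
    """Cosine similarity under 0/1 weights: shared terms / sqrt(|A| * |B|)."""
    a, b = set(text_a.split()), set(text_b.split())
    if not a or not b:
        return 0.0  # assumption: an empty text is similar to nothing
    return len(a & b) / math.sqrt(len(a) * len(b))

def rank(query, docs):
    """Return (document name, similarity) pairs, most similar first."""
    scored = [(name, binary_cosine(query, text)) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for name, score in rank("ant dog", docs):
    print(name, round(score, 2))
```

For the query "ant dog" this gives 1/√2 for d2, 1/2 for d1 and 1/√10 for d3, matching the order d2, d1, d3 on the slide.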
