Basic IR: Queries

Explore various query types such as Boolean queries and phrase search in information retrieval. Understand the advantages and limitations of different query languages used in web search engines.

mkutz

Presentation Transcript


  1. Basic IR: Queries • Query is a statement of the user’s information need. • Index is designed to map queries to documents that are likely to be relevant. Query type, content, and representation dictate what the index must do. • Queries vary from single keywords through specialized query languages to exemplar documents.

  2. Documents as Queries • “Find other documents like this one.” • Query is itself a document; it can go through same sort of pre-processing (e.g., stop word removal, stemming). • Characteristics of queries mimic those of documents.
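
The idea of treating a query as a document can be sketched as follows. This is a minimal illustration, not the method of any particular system: the stop list, the toy suffix stripper `crude_stem` (standing in for a real stemmer), and the choice of cosine similarity over raw term counts are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop list, not a standard one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def crude_stem(token):
    # Toy suffix stripper standing in for a real stemmer (e.g. Porter's).
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # The same pipeline is applied to documents and to the query document.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def more_like_this(query_doc, collection):
    # Rank documents by similarity to the query document; drop non-matches.
    q = preprocess(query_doc)
    scored = [(cosine(q, preprocess(text)), doc_id)
              for doc_id, text in collection.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

collection = {
    "d1": "Indexing documents for searching",
    "d2": "Cooking pasta in the kitchen",
    "d3": "Search engines index the documents",
}
print(more_like_this("document indexes and searches", collection))
```

Because the query passes through the same `preprocess` step as the documents, surface differences such as "indexes" versus "indexing" no longer prevent a match.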

  3. Query Term Distribution in SavvySearch

  4. Keyword Queries • Query is composed of a set of keywords. • Retrieve the documents that best match the keywords. • Advantage: easy to use, supports fast indexing • Disadvantage: coarse, easy to lead astray (e.g., words with multiple meanings), difficult to express a complex information need

  5. Boolean Queries • Combine queries (keywords) with Boolean operators: • OR: “children OR kids” • AND: “windows AND software” • BUT: “unix BUT solaris” (documents that match unix but not solaris) • no NOT! • Advantage: more precise queries • Disadvantage: does not support ranking, less intuitive
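
The three operators map naturally onto set operations over an inverted index. The sketch below assumes a whitespace tokenizer and hypothetical helper names (`build_index`, `q_or`, `q_and`, `q_but`); BUT is read as "matches the first term but not the second".

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: term -> set of ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def q_or(index, a, b):
    return index.get(a, set()) | index.get(b, set())

def q_and(index, a, b):
    return index.get(a, set()) & index.get(b, set())

def q_but(index, a, b):
    # Documents containing a but not b.
    return index.get(a, set()) - index.get(b, set())

docs = {
    1: "children play outside",
    2: "kids play games",
    3: "windows software update",
    4: "unix solaris admin",
    5: "unix linux admin",
    6: "windows cleaning tips",
}
index = build_index(docs)
print(q_or(index, "children", "kids"))
print(q_and(index, "windows", "software"))
print(q_but(index, "unix", "solaris"))
```

Each result is a plain set of document ids with no order among them, which is the "does not support ranking" disadvantage in concrete form.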

  6. Phrase Search • Supplement single terms with phrases: exact sequence of terms • Requires index that tracks proximity of terms or stores both singletons and phrases • Extension is context: where proximity between terms is stated (e.g., “adele w/2 howe” to CiteSeer)
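
One way to meet the index requirement is a positional index that records where each term occurs. The following is a sketch under assumptions: whitespace tokenization, hypothetical function names, and a reading of "w/2" as "at most two positions apart, in either order" (the exact CiteSeer semantics are not given in the slides).

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> doc_id -> list of token positions
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    # Documents containing the terms as an exact consecutive sequence.
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + i in index[t].get(doc_id, ())
                   for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def within(index, a, b, k):
    # Documents where a and b occur at most k positions apart.
    a, b = a.lower(), b.lower()
    if a not in index or b not in index:
        return set()
    hits = set()
    for doc_id, positions_a in index[a].items():
        positions_b = index[b].get(doc_id, ())
        if any(abs(x - y) <= k for x in positions_a for y in positions_b):
            hits.add(doc_id)
    return hits

docs = {
    1: "adele e howe publications",
    2: "howe and others cite adele",
    3: "information retrieval basics",
}
index = build_positional_index(docs)
print(phrase_search(index, "adele e howe"))
print(within(index, "adele", "howe", 2))
```

The alternative mentioned on the slide, storing phrases as index entries alongside single terms, trades a larger index for faster phrase lookups.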

  7. Query Language: Web Search Engines I • Altavista Advanced Search: • Form: • all of these words • this exact phrase • any of these words • none of these words • Boolean Expression: • AND • OR • AND NOT • NEAR • Date, File Type, Location

  8. Query Language: Web Search Engines II • Google Advanced Search: • Find Results: • with all of the words • with the exact phrase • with at least one of the words • without the words • Language, File Format, Date, Occurrences, Domain, SafeSearch • Page-Specific Search: • Find pages similar to the page • Find pages that link to the page • Topic Specific Searches

  9. Typical Query Behavior on WWW • Query term distribution obeys Zipf’s Law (quite skewed, although skew does drift). • Length is ? terms. • Few users exploit full power of query languages; most enter terms without operations and do not use advanced search interfaces. • Change in behavior?

  10. Natural Language • Augmented Boolean approach: • Treat query as document. • Rank documents by how well they match the constraints of the query and return those above a certain threshold. • NLP approach: • Interpret semantics in a limited way to constrain query (e.g., “who” indicates a person)

  11. NL Example: AskJeeves

  12. Advanced Querying • Pattern Matching: • combinations of syntactic features, e.g., regular expressions, wild-card queries • Structural Queries: • forms, hypertext and hierarchies • typically support iterative querying as in guided browsing (e.g., WebGlimpse or Letizia)
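
Pattern matching is often applied to the index vocabulary: the pattern is expanded into the set of matching terms, which are then looked up as ordinary keywords. A minimal sketch using Python's standard `fnmatch` and `re` modules; the vocabulary and function names are made up for illustration.

```python
import fnmatch
import re

def wildcard_terms(vocabulary, pattern):
    # Shell-style wildcards: * matches any run of characters, ? a single one.
    return sorted(t for t in vocabulary if fnmatch.fnmatchcase(t, pattern))

def regex_terms(vocabulary, pattern):
    # The regular expression must match the whole term.
    compiled = re.compile(pattern)
    return sorted(t for t in vocabulary if compiled.fullmatch(t))

vocabulary = {"retrieve", "retrieval", "retrieved", "query", "queries", "index"}
print(wildcard_terms(vocabulary, "retriev*"))
print(regex_terms(vocabulary, r"quer(y|ies)"))
```

A linear scan over the vocabulary is fine for a toy example; larger systems typically use dedicated structures for wildcard lookup, which this sketch does not attempt.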

  13. Advanced Querying: Letizia • Recommends new pages based on user’s browsing preferences • Infers interests by observing user behavior: save bookmark, follow link, spend time on page… • Models documents as list of keywords • Figure from http://lieber.www.media.mit.edu/people/lieber/Liebrary/Letizia/Letizia.html

  14. Caching • An astonishing number of people submit the same queries (e.g., “Harry Potter”). • Just as single word usage is skewed (Zipf’s Law) so is query submission on WWW. • Can exploit this by caching results for oft repeated queries.
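
A result cache for repeated queries can be sketched as a small least-recently-used (LRU) cache. The class name `QueryCache`, the `search_fn` callback, the capacity, and the LRU eviction policy are all assumptions for illustration; the slides do not specify a policy.

```python
from collections import OrderedDict

class QueryCache:
    """LRU cache in front of a (hypothetical) search function."""

    def __init__(self, search_fn, capacity=1000):
        self.search_fn = search_fn
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def search(self, query):
        # Normalize case and whitespace so trivially different
        # spellings of the same query share one cache entry.
        key = " ".join(query.lower().split())
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        results = self.search_fn(key)
        self.entries[key] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return results

calls = []

def fake_search(query):
    # Stand-in for an expensive index lookup.
    calls.append(query)
    return [query + " result"]

cache = QueryCache(fake_search, capacity=2)
cache.search("Harry Potter")
cache.search("harry  potter")
print(cache.hits, cache.misses, calls)
```

Because query submission is skewed, even a small cache of this kind can absorb a large share of the traffic; how large a share depends on the actual query log.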

  15. Single vs. On-Going Queries: Filtering • Find new documents from information stream that satisfy a static information need • User profile represents interests; threshold represents how closely documents must match. • User may provide the query or it may be learned through relevance feedback.
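
Filtering can be sketched as matching each incoming document against every stored profile and delivering it when the match exceeds the threshold. The profile representation (term counts), the cosine measure, the threshold value, and all names below are assumptions for the example; learning the profile through relevance feedback is not shown.

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def filter_stream(stream, profiles, threshold):
    # Yield (user, document) for every document close enough to a profile.
    for doc in stream:
        vec = vectorize(doc)
        for user, profile in profiles.items():
            if cosine(vec, profile) >= threshold:
                yield user, doc

profiles = {
    "user1": vectorize("information retrieval search"),
    "user2": vectorize("football scores"),
}
stream = [
    "new search and retrieval methods",
    "football scores tonight",
    "weather report",
]
for user, doc in filter_stream(stream, profiles, threshold=0.4):
    print(user, "<-", doc)
```

Unlike a one-off query, the profiles stay fixed while the documents arrive over time, so the roles of query and collection are effectively swapped.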

  16. Filtering Process (figure from MIR text; labels: Documents Stream, User 1 Profile, User 2 Profile, Docs for User 1, Docs Filtered for User 2)
