Explore various query types such as Boolean queries and phrase search in information retrieval. Understand the advantages and limitations of different query languages used in web search engines.
Basic IR: Queries • A query is a statement of the user’s information need. • The index is designed to map queries to documents that are likely to be relevant. Query type, content, and representation dictate what the index must do. • Queries vary from single keywords through specialized query languages to exemplar documents.
Documents as Queries • “Find other documents like this one.” • Query is itself a document; it can go through same sort of pre-processing (e.g., stop word removal, stemming). • Characteristics of queries mimic those of documents.
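A minimal sketch of the idea that a query document can pass through the same pre-processing as the indexed documents. The stop list and the suffix-stripping `naive_stem` are illustrative assumptions, not a real stemmer such as Porter's, and Jaccard overlap is just one possible way to compare the results.

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

def naive_stem(word):
    """Strip a few common suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def jaccard(a, b):
    """Overlap between two term lists, from 0.0 (disjoint) to 1.0 (same set)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# The stored document and the query document go through the same pipeline.
stored = preprocess("Indexing the documents")
query_doc = preprocess("Index of a document")
print(stored, query_doc, jaccard(stored, query_doc))
```

Because both sides are normalized identically, surface differences such as "Indexing" versus "Index" no longer prevent a match.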
Keyword Queries • Query is composed of a set of keywords. • Retrieve documents that best match the keywords. • Advantage: easy to use, supports fast indexing • Disadvantage: coarse, easily led astray (e.g., by words with multiple meanings), difficult to express a complex information need
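A rough sketch of keyword retrieval over an inverted index, assuming "best match" simply means "contains the most distinct query keywords". The function names and sample documents are made up for illustration; real engines weight terms rather than just counting them.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def keyword_search(index, query):
    """Rank documents by how many distinct query keywords they contain."""
    scores = Counter()
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

docs = {
    1: "windows software update",
    2: "house windows and doors",
    3: "unix software",
}
index = build_index(docs)
print(keyword_search(index, "windows software"))
```

Document 2 is returned even though it concerns glass windows, which is the "multiple meanings" weakness listed above.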
Boolean Queries • Combine queries (keywords) with Boolean operators: • OR: “children OR kids” • AND: “windows AND software” • BUT: “unix BUT solaris” (matches unix but not solaris) • No standalone NOT! • Advantage: more precise queries • Disadvantage: does not support ranking, less intuitive
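One way to picture these operators is as set operations on postings (the sets of documents containing each term). The sketch below assumes BUT means "matches the first operand but not the second"; the postings table and function names are hypothetical.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
POSTINGS = {
    "children": {1, 2},
    "kids": {2, 3},
    "windows": {4, 5},
    "software": {5, 6},
    "unix": {6, 7, 8},
    "solaris": {8},
}

def docs_for(term):
    """Postings for a term, or an empty set if the term is unknown."""
    return POSTINGS.get(term, set())

def q_or(a, b):
    return a | b   # union: documents matching either operand

def q_and(a, b):
    return a & b   # intersection: documents matching both operands

def q_but(a, b):
    return a - b   # difference: documents matching a, excluding those matching b

print(sorted(q_or(docs_for("children"), docs_for("kids"))))     # [1, 2, 3]
print(sorted(q_and(docs_for("windows"), docs_for("software")))) # [5]
print(sorted(q_but(docs_for("unix"), docs_for("solaris"))))     # [6, 7]
```

Every result is an unordered set, which is why a pure Boolean model gives no ranking. A likely rationale for omitting a standalone NOT is that the complement of one term's postings would be nearly the whole collection, whereas BUT always starts from a positive operand.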
Phrase Search • Supplement single terms with phrases: exact sequences of terms • Requires an index that tracks proximity of terms or stores both singletons and phrases • An extension is the context query, in which the required proximity between terms is stated (e.g., “adele w/2 howe” to CiteSeer)
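A minimal sketch of an index that tracks term positions, which is enough to answer both exact-phrase and proximity queries. The reading of "w/2" as "within two word positions, in either order" is an assumption here, not CiteSeer's documented semantics, and all names and sample documents are illustrative.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Documents containing the terms of `phrase` at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    candidates = set.intersection(*(set(index[t]) for t in terms))
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # Term i of the phrase must sit exactly i positions after the start.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

def within(index, first, second, k):
    """Documents where the two terms occur at most k positions apart (any order)."""
    if first not in index or second not in index:
        return set()
    hits = set()
    for doc_id in set(index[first]) & set(index[second]):
        if any(1 <= abs(q - p) <= k
               for p in index[first][doc_id]
               for q in index[second][doc_id]):
            hits.add(doc_id)
    return hits

docs = {
    1: "adele e howe teaches",
    2: "howe and adele",
    3: "adele howe",
    4: "adele wrote many papers with howe",
}
pos_index = build_positional_index(docs)
print(sorted(phrase_search(pos_index, "adele howe")))
print(sorted(within(pos_index, "adele", "howe", 2)))
```

The phrase query matches only document 3, while the looser proximity query also admits documents 1 and 2, where the names are separated or reversed.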
Query Language: Web Search Engines I • Altavista Advanced Search: • Form: • all of these words • this exact phrase • any of these words • none of these words • Boolean Expression: • AND • OR • AND NOT • NEAR • Date, File Type, Location
Query Language: Web Search Engines II • Google Advanced Search: • Find Results: • with all of the words • with the exact phrase • with at least one of the words • without the words • Language, File Format, Date, Occurrences, Domain, SafeSearch • Page-Specific Search: • Find pages similar to the page • Find pages that link to the page • Topic Specific Searches
Typical Query Behavior on WWW • Query term distribution obeys Zipf’s Law (quite skewed, although skew does drift). • Length is ? terms. • Few users exploit the full power of query languages; most enter terms without operators and do not use advanced search interfaces. • Change in behavior?
Natural Language • Augmented Boolean approach: • Treat query as document. • Rank documents by how well they match the constraints of the query and return those above a certain threshold. • NLP approach: • Interpret semantics in a limited way to constrain query (e.g., “who” indicates a person)
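The augmented Boolean approach can be sketched by treating the query as a small document, scoring each stored document against it, and keeping only those above a threshold. Cosine similarity over raw term counts is an assumed scoring choice here; the slides do not specify one, and the sample data is made up.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a text (queries and documents alike)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def ranked_above_threshold(docs, query, threshold):
    """Return (doc_id, score) pairs with score >= threshold, best first."""
    q = tf_vector(query)
    scored = [(doc_id, cosine(q, tf_vector(text))) for doc_id, text in docs.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

docs = {
    "a": "harry potter books",
    "b": "harry potter",
    "c": "pottery classes",
}
for doc_id, score in ranked_above_threshold(docs, "harry potter", 0.5):
    print(doc_id, round(score, 3))
```

Unlike a strict Boolean match, a document that satisfies the query only partially still receives a score, and the threshold decides whether it is returned.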
Advanced Querying • Pattern Matching: • combinations of syntactic features, e.g., regular expressions, wild-card queries • Structural Queries: • forms, hypertext and hierarchies • typically supports iterative querying as in guided browsing (e.g., WebGlimpse or Letizia)
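As a small illustration of pattern matching, wild-card and regular-expression queries can be expanded against the index vocabulary before the usual lookup. This sketch scans the vocabulary linearly using Python's standard `fnmatch` and `re` modules; the vocabulary and function names are hypothetical, and production systems typically use dedicated structures instead of a scan.

```python
import re
from fnmatch import fnmatchcase

def wildcard_terms(vocabulary, pattern):
    """Vocabulary terms matching a shell-style wild-card pattern."""
    return sorted(t for t in vocabulary if fnmatchcase(t, pattern))

def regex_terms(vocabulary, pattern):
    """Vocabulary terms that match the regular expression in full."""
    compiled = re.compile(pattern)
    return sorted(t for t in vocabulary if compiled.fullmatch(t))

vocabulary = {"retrieve", "retrieval", "retrieving", "query", "queries"}
print(wildcard_terms(vocabulary, "retriev*"))
print(regex_terms(vocabulary, r"quer(y|ies)"))
```

The matched terms can then be combined with OR and looked up like any other keywords.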
Advanced Querying: Letizia • Recommends new pages based on user’s browsing preferences • Infers interests by observing user behavior: save bookmark, follow link, spend time on page… • Models documents as list of keywords Figure from http://lieber.www.media.mit.edu/people/lieber/Liebrary/Letizia/Letizia.html
Caching • An astonishing number of people submit the same queries (e.g., “Harry Potter”). • Just as single word usage is skewed (Zipf’s Law) so is query submission on WWW. • Can exploit this by caching results for oft repeated queries.
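A minimal sketch of result caching with Python's `functools.lru_cache`. Here `run_search` is a hypothetical stand-in for the expensive index lookup, and the normalization step (lowercasing and collapsing whitespace) is an assumption about which queries should count as "the same".

```python
from functools import lru_cache

# Counts how often the (hypothetical) expensive backend is actually hit.
stats = {"backend_calls": 0}

def run_search(query):
    """Hypothetical stand-in for an expensive index lookup."""
    stats["backend_calls"] += 1
    return (f"results for {query}",)

@lru_cache(maxsize=1024)
def cached_search(normalized_query):
    """Results for a normalized query; repeats are served from the cache."""
    return run_search(normalized_query)

def search(query):
    """Normalize case and whitespace so near-identical queries share an entry."""
    return cached_search(" ".join(query.lower().split()))
```

Calling `search("Harry Potter")` and then `search("harry  potter")` reaches the backend only once. Because query popularity is skewed, even a small cache of the most frequent queries can absorb a large share of the traffic.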
Single vs. On-Going Queries: Filtering • Find new documents from information stream that satisfy a static information need • User profile represents interests; threshold represents how closely documents must match. • User may provide the query or it may be learned through relevance feedback.
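A rough sketch of filtering, assuming each user profile is a set of interest terms and a document's score is the fraction of those terms it contains. The profiles, threshold, and names are illustrative; learning a profile through relevance feedback is not shown.

```python
def profile_score(profile, text):
    """Fraction of the profile's terms that appear in the document."""
    words = set(text.lower().split())
    return len(profile & words) / len(profile) if profile else 0.0

def filter_stream(stream, profiles, threshold):
    """Yield (user, document) pairs as documents arrive on the stream."""
    for doc in stream:
        for user, profile in profiles.items():
            if profile_score(profile, doc) >= threshold:
                yield user, doc

# Hypothetical static profiles; the document stream keeps changing instead.
profiles = {
    "user1": {"unix", "solaris"},
    "user2": {"harry", "potter", "books"},
}
stream = [
    "solaris unix release notes",
    "new harry potter film",
    "gardening tips",
]
for user, doc in filter_stream(stream, profiles, threshold=0.6):
    print(user, "<-", doc)
```

In contrast to ad hoc search, the information need (the profile) stays fixed while new documents flow past it, and the threshold controls how closely a document must match before it is delivered.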
Figure: Filtering Process (from MIR text). Labels: Documents Stream; User 1 Profile; User 2 Profile; Docs for User 1; Docs Filtered for User 2.