1 / 27

Information Retrieval and Web Search

This outline provides an overview of query languages intended for users in information retrieval and web search, including types of queries, query types, keyword-based querying, and pattern matching.

pamundson
Download Presentation

Information Retrieval and Web Search

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Vasile Rus, PhD vrus@memphis.edu www.cs.memphis.edu/~vrus/teaching/ir-websearch/ Information Retrieval and Web Search

  2. Query Languages Outline

  3. Query Languages Intended for users Two types Query languages for information retrieval Allow the answer to be ranked Query languages for data retrieval Protocols Intended for high-level software packages to query an on-line database or a CD-ROM archive Languages vs Protocols

  4. Query preprocessing has already been done Stemming Removal of stop words Query expansion Keyword = word that can be retrieved by a query Retrieval unit is ‘documents’ File Document Web page paragraph Some Assumptions

  5. Keyword-based Use simple words and phrases Boolean operators to manipulate sets Pattern matching Complements keyword retrieval with data retrieval capabilities Querying based on structure of text Protocols for Internet and CD-ROM publishing Query Types

  6. Intuitive Easy to express Allow fast ranking of answer documents Types of keyword-based queries, aka basic queries (patterns, covered later, are also basic queries) Single-word Multiple-word Keyword-based Querying

  7. Single-word Queries Context Queries Boolean Queries Natural Language Keyword-based Querying

  8. What is a word A sequence of letters surrounded by separators The text database manager will decide on letters and separators Result of word queries is a set of documents containing at least one of the words in the query Documents are ranked according to a degree of similarity to the query (e.g., tf-idf) Single-Word Queries

  9. Words which appear near each other indicate higher likelihood of relevance Retrieve documents with a specific phrase (ordered list of contiguous words) “information theory” May allow intervening stop words and/or stemming. “buy camera” matches: “buy a camera” “buying the cameras” Context Queries

  10. Must have an inverted index that also stores positions of each keyword in a document Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions Best to start contiguity check with the least common word in the phrase Phrase-based Retrieval

  11. List of words with specific maximal distance constraints between terms Example: “enhance” and “retrieval” within 4 words matches “…enhance the power of retrieval …” May also perform stemming and/or not count stop words Proximity Queries

  12. Use approach similar to phrasal search to find documents in which all keywords are found in a context that satisfies the proximity constraints During search for positions of remaining keywords, find closest position of next keywordto current position p and check that it is within maximum allowed distance Proximity Retrieval with Inverted Index

  13. Atoms, i.e. basic queries, combined with Boolean operators: OR: (e1 OR e2) AND: (e1 AND e2) BUT: (e1 BUT e2) Satisfy e1 but note2 Boolean operators work on sets of documents Negation only allowed using BUT to allow efficient use of inverted index by filtering another efficiently retrievable set Naïve users have trouble with Boolean logic Boolean Queries

  14. Atom: Retrieve containing documents using the inverted index OR: Recursively retrieve e1 and e2 and take union of results (duplicates are eliminated) AND: Recursively retrieve e1 and e2 and take intersection of results BUT: Recursively retrieve e1 and e2 and take set difference of results Boolean Retrieval with Inverted Indices

  15. Full text queries as arbitrary strings Typically just treated as a bag-of-words for a vector-space model Typically processed using standard vector-space retrieval methods “Natural Language” Queries

  16. Allow queries that match strings rather than word tokens Data retrieval queries Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently Pattern Matching

  17. Words Prefixes: Pattern that matches start of word “anti” matches “antiquity”, “antibody”, etc. Suffixes: Pattern that matches end of word: “ix” matches “fix”, “matrix”, etc. Simple Patterns

  18. Substrings: Pattern that matches arbitrary subsequence of characters “rapt” matches “enrapture”, “velociraptor” etc. “any flow” will match ‘… many flowers …’ Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them “tin” to “tix” matches “tip”, “tire”, “title”, etc. Simple Patterns

  19. What if query or document contains typos or misspellings? Judge letter-similarity of words (or arbitrary strings) using: Edit distance (Levenstein distance) Allow proximity search with bound on string similarity Allowing Errors

  20. Minimum number of character deletions, additions, or replacements needed to make two strings equivalent. “misspell” to “mispell” is distance 1 “misspell” to “mistell” is distance 2 “misspell” to “misspelling” is distance 3 Can be computed efficiently using dynamic programming in O(mn) time where m and n are the lengths of the two strings being compared Edit (Levenstein) Distance

  21. Language for composing complex patterns from simpler ones. An individual character is a regex. Union: If e1 and e2 are regexes, then (e1 | e2) is a regex that matches whatever either e1 or e2 matches. Concatenation: If e1 and e2 are regexes, then e1e2 is a regex that matches a string that consists of a substring that matches e1 immediately followed by a substring that matches e2 Repetition (Kleene closure): If e1 is a regex, then e1* is a regex that matches a sequence of zero or more strings that match e1 Regular Expressions

  22. Special terms for common sets of characters, such as alphabetic or numeric or general “wildcard”. Special repetition operator (+) for 1 or more occurrences. Special optional operator (?) for 0 or 1 occurrences. Special repetition operator for specific range of number of occurrences: {min,max}. A{1,5} One to five A’s. A{5,} Five or more A’s A{5} Exactly five A’s Enhanced Regex’s (Perl)

  23. Assumes documents have structure that can be exploited in search. Structure could be: Fixed set of fields, e.g. title, author, abstract, etc. Hierarchical (recursive) tree structure: Structural Queries book chapter chapter title section title section title subsection

  24. Allow queries for text appearing in specific fields: “nuclear fusion” appearing in a chapter title Queries with Structure

  25. Z39.50: Querying bibliographical information WAIS: network publishing protocol to query databases through the Internet CD-ROM protocols CCL: defines 19 commands CD-RDx: client-server; allows fixed-length fields, images and audio SFQL: Relational database query language SQL enhanced with “full text” search Select abstract from journal.papers where author contains “Teller” and title contains “nuclear fusion” and date < 1/1/1950 More general and flexible than CCL or CD-RDx Protocols

  26. Query Languages Summary

  27. Query Operations Next

More Related