Information Retrieval

Information Retrieval: Introduction/Overview. Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto (http://www.sims.berkeley.edu/~hearst/irbook/) and Data Mining Introductory and Advanced Topics by Margaret H. Dunham.

Presentation Transcript


  1. Information Retrieval • Introduction/Overview • Material for these slides obtained from: • Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ • Data Mining Introductory and Advanced Topics by Margaret H. Dunham http://www.engr.smu.edu/~mhd/book

  2. Information Retrieval • Information Retrieval (IR): retrieving desired information from textual data. • Library Science • Digital Libraries • Web Search Engines • Traditionally keyword based • Sample query: Find all documents about “data mining”.

  3. DB vs IR • Records (tuples) vs. documents • Well defined results vs. fuzzy results • DB grew out of files and traditional business systems • IR grew out of library science and the need to categorize/group/access books/articles

  4. DB vs IR (cont’d) • Data retrieval • which docs contain a set of keywords? • Well defined semantics • a single erroneous object implies failure! • Information retrieval • information about a subject or topic • semantics is frequently loose • small errors are tolerated • IR system: • interpret contents of information items • generate a ranking which reflects relevance • notion of relevance is most important

  5. Motivation • IR in the last 20 years: • classification and categorization • systems and languages • user interfaces and visualization • Still, the area was seen as being of narrow interest • Advent of the Web changed this perception once and for all • universal repository of knowledge • free (low cost) universal access • no central editorial board • many problems though: IR seen as key to finding the solutions!

  6. Basic Concepts: Logical view of the documents • Document representation viewed as a continuum: the logical view of docs might shift • [Figure: text operations applied to docs (accents and spacing, stopwords, noun groups, stemming, manual indexing) plus structure recognition, taking the logical view from full text to index terms]

  7. The Retrieval Process • [Figure: flow diagram. Boxes: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Arrow labels: user need, text, logical view, query, user feedback, retrieved docs, ranked docs]

  8. IR is Fuzzy • [Figure: two panels, “Simple” and “Fuzzy”, each divided into Accept and Reject regions]

  9. Information Retrieval • Similarity: measure of how close a query is to a document. • Documents which are “close enough” are retrieved. • Metrics: • Precision = |Relevant and Retrieved| / |Retrieved| • Recall = |Relevant and Retrieved| / |Relevant|
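
The two metrics on this slide can be sketched in a few lines. This is a minimal illustration, assuming documents are identified by ids held in sets; the function name `precision_recall` and the sample ids are hypothetical and not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant AND retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant, 3 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d7", "d9"})
print(p, r)  # 0.75 0.6
```

Returning 0.0 for an empty denominator is a convention chosen here for convenience; other conventions are possible.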

  10. Indexing • IR systems usually adopt index terms to process queries • Index term: • a keyword or group of selected words • any word (more general) • Stemming might be used: • connect: connecting, connection, connections • An inverted file is built for the chosen index terms
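
To make the stemming bullet concrete, here is a deliberately naive suffix stripper. It is a toy, not a real stemming algorithm: the suffix list and the name `toy_stem` are assumptions chosen only so that the slide's four example words collapse to one index term.

```python
# Toy suffix list, chosen only to cover the slide's example; not a real stemmer.
SUFFIXES = ("ions", "ion", "ing", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["connect", "connecting", "connection", "connections"]
print({w: toy_stem(w) for w in words})  # every word maps to 'connect'
```

A production system would use an established stemming algorithm; this sketch only shows why stemming shrinks the vocabulary.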

  11. Indexing Docs Index Terms doc match Ranking Information Need query

  12. Inverted Files • There are two main elements: • Vocabulary – set of unique terms • Occurrences – where those terms appear • The occurrences can be recorded as term (word) offsets or byte offsets • Term offsets are good for retrieving concepts such as proximity, whereas byte offsets allow direct access to the text

  13. Inverted Files • The vocabulary is often several orders of magnitude smaller than the document collection (MBs vs. GBs) • The space consumed by the occurrence lists is not trivial: each time a term appears, its position must be added to that term's list in the inverted file • That may lead to a considerable index overhead

  14. Example • Text: That house has a garden. The garden has many flowers. The flowers are beautiful • Word start positions: 1 6 12 16 18 25 29 36 40 45 54 58 66 70 • Inverted file (Vocabulary: Occurrences) • beautiful: 70 • flowers: 45, 58 • garden: 18, 29 • house: 6
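
A small sketch of how such an inverted file could be built from the example text. The stopword list, the tokenizing regular expression, and the name `build_inverted_file` are assumptions made for illustration. Offsets here are 1-based character positions computed over the string exactly as typed, punctuation included, so they need not coincide with every number on the slide.

```python
import re
from collections import defaultdict

TEXT = ("That house has a garden. The garden has many flowers. "
        "The flowers are beautiful")

# Assumed stopword list for this toy example only.
STOPWORDS = {"that", "has", "a", "the", "many", "are"}


def build_inverted_file(text, stopwords=STOPWORDS):
    """Map each index term to the 1-based character offsets where it starts."""
    occurrences = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in stopwords:
            occurrences[term].append(match.start() + 1)
    # Sort by term so the result reads like the vocabulary column.
    return dict(sorted(occurrences.items()))


for term, offsets in build_inverted_file(TEXT).items():
    print(term, offsets)
```

The vocabulary holds four entries while the occurrence lists hold one entry per appearance, which is the overhead the previous slide refers to.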

  15. Ranking • A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the query • A ranking is based on fundamental premises regarding the notion of relevance, such as: • common sets of index terms • sharing of weighted terms • likelihood of relevance • Each set of premises leads to a distinct IR model

  16. Classic IR Models - Basic Concepts • Each document is represented by a set of representative keywords or index terms • An index term is a document word useful for remembering the document's main themes • Usually, index terms are nouns because nouns have meaning by themselves • However, search engines assume that all words are index terms (full text representation)

  17. Classic IR Models - Basic Concepts • The importance of the index terms is represented by weights associated with them • ki: an index term • dj: a document • wij: a weight associated with (ki,dj) • The weight wij quantifies the importance of the index term for describing the document contents

  18. Classic IR Models - Basic Concepts • t is the total number of index terms • K = {k1, k2, …, kt} is the set of all index terms • wij >= 0 is a weight associated with (ki,dj) • wij = 0 indicates that the term ki does not occur in the document dj • dj = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj • gi(dj) = wij is a function which returns the weight associated with the pair (ki,dj)
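
The notation above can be sketched as follows. The slides do not say how the weights are computed, so raw term counts are used here purely as an assumption; the vocabulary `K`, the sample document, and the function names are likewise hypothetical.

```python
from collections import Counter

# Hypothetical vocabulary K = {k1, ..., kt}, for illustration only.
K = ["beautiful", "flowers", "garden", "house"]


def weight_vector(doc_tokens, vocabulary=K):
    """Return dj = (w1j, ..., wtj), using raw term counts as the weights."""
    counts = Counter(doc_tokens)
    return tuple(counts[k] for k in vocabulary)


def g(i, dj):
    """gi(dj): weight of the i-th index term (1-based, as on the slide)."""
    return dj[i - 1]


d1 = weight_vector("the garden has many flowers the flowers are beautiful".split())
print(d1)        # (1, 2, 1, 0)
print(g(4, d1))  # 0, so "house" does not occur in this document
```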

  19. The Boolean Model • Simple model based on set theory • Queries specified as Boolean expressions • precise semantics and neat formalism • Terms are either present or absent. Thus, wij ∈ {0,1} • Consider • q = ka ∧ (kb ∨ ¬kc) • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) • qcc = (1,1,0) is a conjunctive component
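
The disjunctive normal form on this slide can be checked by brute force. The sketch reads the query as ka AND (kb OR NOT kc), which is the reading that yields exactly the three listed components; the helper names are hypothetical, and documents are assumed to be described by binary weights for (ka, kb, kc) only.

```python
from itertools import product


def q(ka, kb, kc):
    """The slide's query: ka AND (kb OR NOT kc)."""
    return ka and (kb or not kc)


# Each satisfying assignment (ka, kb, kc) is one conjunctive component.
q_dnf = [bits for bits in product((1, 0), repeat=3) if q(*bits)]
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]


def matches(doc_weights):
    """A document matches if its binary weights equal some component."""
    return tuple(doc_weights) in q_dnf


print(matches((1, 1, 0)))  # True
print(matches((1, 0, 1)))  # False
```

A document either matches or it does not; there is no partial match, which is the limitation the vector model addresses.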

  20. The Vector Model • Use of binary weights is too limiting • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • Ranked set of documents provides for better matching

  21. The Vector Model • wij > 0 whenever ki appears in dj • wiq >= 0 associated with the pair (ki,q) • dj = (w1j, w2j, ..., wtj) • q = (w1q, w2q, ..., wtq) • To each term ki is associated a unit vector vec(i) • The unit vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) • The t unit vectors vec(i) form an orthonormal basis for a t-dimensional space where queries and documents are represented as weighted vectors
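
One common way to turn these vectors into a degree of similarity is the cosine of the angle between dj and q. The slides do not name a similarity function, so treat the cosine as an assumption here; the weights and document names are made up.

```python
import math


def cosine_similarity(dj, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(dj, q))
    norm = math.sqrt(sum(w * w for w in dj)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0


# Hypothetical weights over t = 3 index terms.
query = (1.0, 1.0, 0.0)
docs = {"d1": (2.0, 0.0, 1.0), "d2": (1.0, 1.0, 0.0), "d3": (0.0, 0.0, 3.0)}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(docs[name], query),
                 reverse=True)
print(ranking)  # ['d2', 'd1', 'd3']
```

Here d1 shares only one query term and still receives a non-zero score, which is the partial matching that binary weights cannot express.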

  22. Query Languages • Keyword Based • Boolean • Weighted Boolean • Context Based (Phrasal & Proximity) • Pattern Matching • Structural Queries

  23. Keyword Based Queries • Basic Queries • Single word • Multiple words • Context Queries • Phrase • Proximity

  24. Boolean Queries • Keywords combined with Boolean operators: • OR: (e1 OR e2) • AND: (e1 AND e2) • BUT: (e1 BUT e2) Satisfies e1 but not e2 • Negation is only allowed through BUT, so that the inverted index can be used efficiently: the negated set only filters another efficiently retrievable set. • Naïve users have trouble with Boolean logic.

  25. Boolean Retrieval with Inverted Indices • Primitive keyword: Retrieve containing documents using the inverted index. • OR: Recursively retrieve e1 and e2 and take union of results. • AND: Recursively retrieve e1 and e2 and take intersection of results. • BUT: Recursively retrieve e1 and e2 and take set difference of results.
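
The recursive scheme above maps directly onto set operations. This is a minimal sketch under stated assumptions: queries are nested tuples of the form (operator, e1, e2), and the small `INDEX` dictionary stands in for a real inverted index. Both are hypothetical.

```python
# Hypothetical inverted index: term -> set of document ids.
INDEX = {
    "data": {1, 2, 4},
    "mining": {2, 3, 4},
    "web": {4, 5},
}


def retrieve(expr, index=INDEX):
    """Evaluate a keyword or a nested (op, e1, e2) tuple against the index."""
    if isinstance(expr, str):  # primitive keyword: look up its posting set
        return set(index.get(expr, ()))
    op, e1, e2 = expr
    left, right = retrieve(e1, index), retrieve(e2, index)
    if op == "OR":
        return left | right   # union
    if op == "AND":
        return left & right   # intersection
    if op == "BUT":
        return left - right   # set difference
    raise ValueError(f"unknown operator: {op}")


print(retrieve(("BUT", ("AND", "data", "mining"), "web")))  # {2}
```

Because BUT is a set difference against an already retrieved set, the complement of a term over the whole collection never has to be materialized.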

  26. Phrasal Queries • Retrieve documents with a specific phrase (ordered list of contiguous words) • “information theory” • May allow intervening stop words and/or stemming. • “buy camera” matches: “buy a camera” “buying the cameras” etc.

  27. Phrasal Retrieval with Inverted Indices • Must have an inverted index that also stores positions of each keyword in a document. • Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. • Best to start contiguity check with the least common word in the phrase.
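
The three steps on this slide (fetch positions, intersect documents, check contiguity starting from the least common word) can be sketched as below. Tokenization by whitespace, the sample documents, and the function names are assumptions; stemming and stop-word skipping are left out.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]} for whitespace-tokenized docs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return ids of documents containing the words of `phrase` contiguously."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Intersect the document lists of all terms.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    # Anchor the contiguity check on the term found in the fewest documents.
    anchor = min(range(len(terms)), key=lambda i: len(index[terms[i]]))
    hits = set()
    for doc_id in candidates:
        for pos in index[terms[anchor]][doc_id]:
            start = pos - anchor  # where the phrase would have to begin
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


DOCS = {  # hypothetical documents
    1: "information theory studies coding",
    2: "theory of information retrieval",
    3: "a course on information theory and retrieval",
}
print(phrase_query(build_positional_index(DOCS), "information theory"))  # {1, 3}
```

Document 2 contains both words but not adjacent and in order, so it passes the intersection and fails the contiguity check.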

  28. Proximity Queries • List of words with specific maximal distance constraints between terms. • Example: “dogs” and “race” within 4 words matches “…dogs will begin the race…” • May also perform stemming and/or not count stop words.
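
A proximity check for two words can be sketched directly on word positions. The sketch assumes that "within 4 words" means the word positions differ by at most 4 and that order does not matter; both are interpretations, not something the slide states. The function name `within` is hypothetical.

```python
def within(text, word_a, word_b, max_distance):
    """True if word_a and word_b occur at most max_distance words apart."""
    # Strip surrounding punctuation (including the ellipsis character).
    tokens = [tok.strip(".,;:!?\"'…").lower() for tok in text.split()]
    pos_a = [i for i, tok in enumerate(tokens) if tok == word_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == word_b]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)


print(within("…dogs will begin the race…", "dogs", "race", 4))  # True
print(within("dogs sleep while the horses get ready to race",
             "dogs", "race", 4))  # False
```

With an inverted index that stores term offsets, the two position lists would come from the index rather than from rescanning the text.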

  29. Pattern Matching • Allow queries that match strings rather than word tokens. • Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.

  30. Simple Patterns • Prefixes: Pattern that matches start of word. • “anti” matches “antiquity”, “antibody”, etc. • Suffixes: Pattern that matches end of word. • “ix” matches “fix”, “matrix”, etc. • Substrings: Pattern that matches an arbitrary contiguous sequence of characters. • “rapt” matches “enrapture”, “velociraptor”, etc. • Ranges: Pair of strings that matches any word lexicographically (alphabetically) between them. • “tin” to “tix” matches “tip”, “tire”, “title”, etc.
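
The four pattern types can be illustrated over a small sorted vocabulary. The word list and function names are hypothetical, the range is treated as inclusive of both endpoints (an assumption), and the prefix, suffix, and substring checks are plain linear scans rather than the more sophisticated structures an efficient system would need.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted vocabulary built from the slide's examples.
VOCAB = sorted(["antibody", "antiquity", "enrapture", "fix", "matrix",
                "tin", "tip", "tire", "title", "tix", "velociraptor"])


def prefix(p):
    return [w for w in VOCAB if w.startswith(p)]


def suffix(s):
    return [w for w in VOCAB if w.endswith(s)]


def substring(s):
    return [w for w in VOCAB if s in w]


def word_range(lo, hi):
    """Words w with lo <= w <= hi, found by binary search on the sorted list."""
    return VOCAB[bisect_left(VOCAB, lo):bisect_right(VOCAB, hi)]


print(prefix("anti"))            # ['antibody', 'antiquity']
print(suffix("ix"))              # ['fix', 'matrix', 'tix']
print(substring("rapt"))         # ['enrapture', 'velociraptor']
print(word_range("tin", "tix"))  # ['tin', 'tip', 'tire', 'title', 'tix']
```

Only the range query benefits from the sorted order here; the other three inspect every word in the vocabulary.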
