Information Retrieval: Introduction/Overview • Material for these slides obtained from: • Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/ • Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book
Information Retrieval • Information Retrieval (IR): retrieving desired information from textual data. • Library Science • Digital Libraries • Web Search Engines • Traditionally keyword based • Sample query: Find all documents about “data mining”.
DB vs IR • Records (tuples) vs. documents • Well-defined results vs. fuzzy results • DB grew out of files and traditional business systems • IR grew out of library science and the need to categorize/group/access books/articles
DB vs IR (cont’d) • Data retrieval • which docs contain a set of keywords? • Well defined semantics • a single erroneous object implies failure! • Information retrieval • information about a subject or topic • semantics is frequently loose • small errors are tolerated • IR system: • interpret contents of information items • generate a ranking which reflects relevance • notion of relevance is most important
Motivation • IR in the last 20 years: • classification and categorization • systems and languages • user interfaces and visualization • Still, the area was seen as being of narrow interest • The advent of the Web changed this perception once and for all: • universal repository of knowledge • free (low cost) universal access • no central editorial board • many problems though: IR is seen as key to finding the solutions!
Basic Concepts: Logical View of the Documents [Figure: text-operations pipeline (accents/spacing, stopword removal, noun groups, stemming, manual indexing) applied to docs, moving the representation from full text, possibly with structure, toward index terms] • Document representation is viewed as a continuum: the logical view of the docs may shift
The Retrieval Process [Figure: the user need enters through the User Interface; Text Operations produce the logical view of the query and of the documents; Query Operations build the query; Searching runs it against the Index (an inverted file built by the Indexing step through the DB Manager Module over the Text Database); retrieved docs are passed to Ranking, which returns ranked docs; user feedback can refine the query]
IR is Fuzzy [Figure: a simple (crisp) match boundary with hard Accept/Reject regions vs. a fuzzy boundary where acceptance is a matter of degree]
Information Retrieval • Similarity: measure of how close a query is to a document. • Documents which are “close enough” are retrieved. • Metrics: • Precision = |Relevant ∩ Retrieved| / |Retrieved| • Recall = |Relevant ∩ Retrieved| / |Relevant|
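As a small illustration (the document ids below are hypothetical, not from the slides), precision and recall can be computed directly from the retrieved and relevant sets:

# Hypothetical example: compute precision and recall from document-id sets
retrieved = {1, 2, 3, 5, 8}        # documents returned by the system
relevant = {2, 3, 4, 8, 9, 10}     # documents judged relevant by a user

hits = retrieved & relevant        # relevant AND retrieved = {2, 3, 8}
precision = len(hits) / len(retrieved)   # 3/5 = 0.6
recall = len(hits) / len(relevant)       # 3/6 = 0.5
print(precision, recall)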
Indexing • IR systems usually adopt index terms to process queries • Index term: • a keyword or group of selected words • any word (more general) • Stemming might be used: • connect: connecting, connection, connections • An inverted file is built for the chosen index terms
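As an illustration of the stemming bullet above, a short sketch using NLTK's Porter stemmer (NLTK is not part of these slides; any stemmer would do):

# Illustrative only: reduce inflected forms to a common stem
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connecting", "connection", "connections"]:
    print(word, "->", stemmer.stem(word))   # all reduce to "connect"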
Indexing [Figure: Docs are indexed into Index Terms; an Information Need is expressed as a query; matching docs against the query feeds Ranking]
Inverted Files • There are two main elements: • vocabulary: the set of unique terms • occurrences: where those terms appear • The occurrences can be recorded as term positions or byte offsets • Term positions are good for retrieving concepts such as proximity, whereas byte offsets allow direct access
Inverted Files • The number of indexed terms (the vocabulary) is often several orders of magnitude smaller than the size of the documents (MBs vs. GBs) • The space consumed by the occurrence lists is not trivial: each time a term appears it must be added to a list in the inverted file • That may lead to quite considerable index overhead
Example • Text (word start offsets): That(1) house(6) has(12) a(16) garden(18). The(25) garden(29) has(36) many(40) flowers(45). The(54) flowers(58) are(66) beautiful(70) • Inverted file (Vocabulary → Occurrences): beautiful → 70; flowers → 45, 58; garden → 18, 29; house → 6
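A minimal sketch of building such an inverted file in Python, here recording term positions (word offsets) rather than the byte offsets used in the example above:

from collections import defaultdict

text = "That house has a garden. The garden has many flowers. The flowers are beautiful"

# vocabulary term -> list of word positions where it occurs
inverted = defaultdict(list)
for position, token in enumerate(text.lower().split()):
    term = token.strip(".")            # crude normalization, enough for the example
    inverted[term].append(position)

print(inverted["garden"])              # [4, 6]
print(inverted["flowers"])             # [9, 11]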
Ranking • A ranking is an ordering of the retrieved documents that (hopefully) reflects the relevance of the documents to the query • A ranking is based on fundamental premises regarding the notion of relevance, such as: • common sets of index terms • sharing of weighted terms • likelihood of relevance • Each set of premises leads to a distinct IR model
Classic IR Models - Basic Concepts • Each document is represented by a set of representative keywords or index terms • An index term is a document word useful for capturing the document's main themes • Usually, index terms are nouns because nouns have meaning by themselves • However, search engines assume that all words are index terms (full text representation)
Classic IR Models - Basic Concepts • The importance of the index terms is represented by weights associated with them • ki - an index term • dj - a document • wij - a weight associated with (ki, dj) • The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models - Basic Concepts • t is the total number of index terms • K = {k1, k2, …, kt} is the set of all index terms • wij >= 0 is a weight associated with (ki, dj) • wij = 0 indicates that the term does not belong to the doc • dj = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj • gi(dj) = wij is a function which returns the weight associated with the pair (ki, dj)
The Boolean Model • Simple model based on set theory • Queries specified as Boolean expressions • precise semantics and neat formalism • Terms are either present or absent. Thus, wij ∈ {0, 1} • Consider • q = ka ∧ (kb ∨ ¬kc) • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) • qcc = (1,1,0) is a conjunctive component
The Vector Model • Use of binary weights is too limiting • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • Ranked set of documents provides for better matching
The Vector Model • wij > 0 whenever ki appears in dj • wiq >= 0 is associated with the pair (ki, q) • dj = (w1j, w2j, ..., wtj) • q = (w1q, w2q, ..., wtq) • To each term ki is associated a unit vector i • The unit vectors i and j are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) • The t unit vectors form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
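The vector model classically ranks documents by the cosine of the angle between dj and q; a minimal sketch with made-up weights (all values hypothetical):

import math

def cosine(d, q):
    # sim(dj, q) = (dj . q) / (|dj| * |q|)
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

dj = [0.5, 0.0, 0.3]   # weights w1j, w2j, w3j (hypothetical)
q = [0.4, 0.4, 0.0]    # weights w1q, w2q, w3q (hypothetical)
print(cosine(dj, q))   # documents are ranked by this similarity score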
Query Languages • Keyword Based • Boolean • Weighted Boolean • Context Based (Phrasal & Proximity) • Pattern Matching • Structural Queries
Keyword Based Queries • Basic Queries • Single word • Multiple words • Context Queries • Phrase • Proximity
Boolean Queries • Keywords combined with Boolean operators: • OR: (e1 OR e2) • AND: (e1 AND e2) • BUT: (e1 BUT e2) satisfies e1 but not e2 • Negation is only allowed using BUT, which permits efficient use of the inverted index by filtering another efficiently retrievable set. • Naïve users have trouble with Boolean logic.
Boolean Retrieval with Inverted Indices • Primitive keyword: Retrieve containing documents using the inverted index. • OR: Recursively retrieve e1 and e2 and take union of results. • AND: Recursively retrieve e1 and e2 and take intersection of results. • BUT: Recursively retrieve e1 and e2 and take set difference of results.
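A sketch of these set operations over a toy inverted index (term -> set of doc ids; the index contents are made up for illustration):

index = {
    "data": {1, 2, 4},
    "mining": {2, 4, 5},
    "text": {1, 3},
}

def docs(term):
    return index.get(term, set())

# e1 OR e2, e1 AND e2, e1 BUT e2
print(docs("data") | docs("mining"))   # union        -> {1, 2, 4, 5}
print(docs("data") & docs("mining"))   # intersection -> {2, 4}
print(docs("data") - docs("text"))     # difference   -> {2, 4}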
Phrasal Queries • Retrieve documents with a specific phrase (ordered list of contiguous words) • “information theory” • May allow intervening stop words and/or stemming. • “buy camera” matches: “buy a camera” “buying the cameras” etc.
Phrasal Retrieval with Inverted Indices • Must have an inverted index that also stores positions of each keyword in a document. • Retrieve documents and positions for each individual word, intersect documents, and then finally check for ordered contiguity of keyword positions. • Best to start contiguity check with the least common word in the phrase.
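A sketch of the contiguity check against a positional inverted index (term -> doc id -> word positions); the index contents are hypothetical:

# positional index: term -> {doc_id: [word positions]}
pos_index = {
    "information": {1: [0, 10], 2: [3]},
    "theory": {1: [11], 2: [7]},
}

def phrase_docs(w1, w2):
    """Docs in which w2 occurs immediately after w1."""
    common = pos_index.get(w1, {}).keys() & pos_index.get(w2, {}).keys()
    hits = set()
    for doc in common:
        first = set(pos_index[w1][doc])
        if any(p - 1 in first for p in pos_index[w2][doc]):
            hits.add(doc)
    return hits

print(phrase_docs("information", "theory"))   # {1}: positions 10 and 11 are contiguous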
Proximity Queries • List of words with specific maximal distance constraints between terms. • Example: “dogs” and “race” within 4 words match “…dogs will begin the race…” • May also perform stemming and/or not count stop words.
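The distance test itself is a simple pairwise check over the two terms' position lists; a minimal sketch (positions hypothetical):

def within_k(positions1, positions2, k):
    """True if some occurrence of the two terms is at most k words apart."""
    return any(abs(p1 - p2) <= k for p1 in positions1 for p2 in positions2)

# "dogs" at word 2 and "race" at word 6 are 4 words apart -> matches with k = 4
print(within_k([2], [6], 4))   # True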
Pattern Matching • Allow queries that match strings rather than word tokens. • Requires more sophisticated data structures and algorithms than inverted indices to retrieve efficiently.
Simple Patterns • Prefixes: pattern that matches the start of a word. • “anti” matches “antiquity”, “antibody”, etc. • Suffixes: pattern that matches the end of a word. • “ix” matches “fix”, “matrix”, etc. • Substrings: pattern that matches an arbitrary substring of characters within a word. • “rapt” matches “enrapture”, “velociraptor”, etc. • Ranges: a pair of strings that matches any word lexicographically (alphabetically) between them. • “tin” to “tix” matches “tip”, “tire”, “title”, etc.
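A sketch of how these four pattern types can be tested against a small vocabulary using plain string operations (a real engine would use tries, suffix structures, or a sorted vocabulary for efficiency):

vocab = ["antibody", "antiquity", "enrapture", "fix", "matrix",
         "tip", "tire", "title", "velociraptor"]

prefix = [w for w in vocab if w.startswith("anti")]      # antibody, antiquity
suffix = [w for w in vocab if w.endswith("ix")]          # fix, matrix
substring = [w for w in vocab if "rapt" in w]            # enrapture, velociraptor
in_range = [w for w in vocab if "tin" <= w <= "tix"]     # tip, tire, title
print(prefix, suffix, substring, in_range)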