University of Palestine

University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2nd Semester 2008-2009

Chapter 2 – Part 1: Information Retrieval Models

Introduction • Traditional information retrieval systems usually adopt index terms to index and retrieve documents. • An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun). • Advantages: • Simple • The semantics of the documents and of the user's information need can be naturally expressed through sets of index terms.

IR Models • Ranking algorithms are at the core of information retrieval systems (predicting which documents are relevant and which are not).

A taxonomy of information retrieval models [figure] • User task: Retrieval (ad hoc, filtering) and Browsing • Classic models: Boolean, Vector, Probabilistic • Set theoretic: Fuzzy, Extended Boolean • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks • Probabilistic: Inference Network, Belief Network • Structured models: Non-overlapping Lists, Proximal Nodes • Browsing: Flat, Structure Guided, Hypertext

Basic Concept • Each document is described by a set of representative keywords called index terms. • Index term: simply a word whose semantics help in remembering the document's main themes. • Consider a collection of one hundred documents. A word which appears in every one of the hundred documents is completely useless as an index term, because it tells us nothing about which documents the user is interested in. • But a word which appears in just 5 documents is quite useful, because it considerably narrows down the space of documents which might be of interest to the user.
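
The point about useful and useless index terms can be illustrated by counting, for each term, how many documents contain it. This is a minimal sketch; the three-document collection and the function name `document_frequency` are hypothetical and not taken from the slides.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # count each term at most once per document
    return df

# Hypothetical toy collection (an assumption for illustration only).
docs = [
    ["the", "boolean", "model"],
    ["the", "vector", "model"],
    ["the", "fuzzy", "set"],
]
df = document_frequency(docs)
n_docs = len(docs)

# A term found in every document cannot separate one document from another.
useless = sorted(t for t, n in df.items() if n == n_docs)
# A term found in a single document narrows the search the most.
selective = sorted(t for t, n in df.items() if n == 1)
print(useless, selective)
```

Here "the" appears in all three documents and so discriminates nothing, while terms such as "fuzzy" point to a single document.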

Retrieval Strategy • An IR strategy is a technique by which a relevance measure is obtained between a query and a document. • Manual Systems • Boolean, Fuzzy Set • Automatic Systems • Vector Space Model • Language Models • Latent Semantic Indexing

Basic Concepts • In the classic models • each document is described by a set of representative keywords called index terms • index terms are mainly nouns • index term weights are usually assumed to be mutually independent

Boolean Model • Binary decision criterion • Data retrieval model • A query is a Boolean expression which can be represented as a disjunction of conjunctive vectors • Advantage • clean formalism, simplicity • Disadvantage • exact matching may lead to retrieval of too few or too many documents

Boolean IR • Pre-1970s • Dominant industrial model through 1994 (Lexis-Nexis, DIALOG) • Documents composed of TERMS (words, stems) • Express result in set-theoretic terms, e.g. A AND B, (A AND B) OR C • [Figure: Venn diagrams of the sets of docs containing term A, term B and term C]

Information Retrieval Models • An information retrieval model is a formal framework that supports all the major phases of the information retrieval process, including: • Item (document) representation • User need representation • Matching of needs to items • Ranking of retrieved items • An information retrieval model is analogous to a database model (relational, object-oriented, semi-structured, etc.).

A Generic Model • D: A set of document representations. • Q: A set of user-need representations (queries). • R: D × Q → ℝ • A function that assigns to each (document, query) pair a real number that represents the ranking (relevance) of the document with respect to the query.
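
The generic model can be sketched as a function type plus a sort. In this sketch the scoring function `overlap` is a hypothetical stand-in for R (it is not one of the models in these slides), and the document names are invented for illustration.

```python
from typing import Callable, Dict, FrozenSet, List

Doc = FrozenSet[str]                    # an element of D
Query = FrozenSet[str]                  # an element of Q
RankFn = Callable[[Doc, Query], float]  # R: D x Q -> real numbers

def overlap(doc: Doc, query: Query) -> float:
    """Hypothetical R: fraction of the query terms present in the document."""
    return len(doc & query) / len(query) if query else 0.0

def rank(docs: Dict[str, Doc], query: Query, r: RankFn) -> List[str]:
    """Order document names from highest to lowest score under r."""
    return sorted(docs, key=lambda name: r(docs[name], query), reverse=True)

docs = {
    "d1": frozenset({"a", "b"}),
    "d2": frozenset({"a"}),
    "d3": frozenset({"c"}),
}
ranking = rank(docs, frozenset({"a", "b"}), overlap)
print(ranking)
```

Any model in the taxonomy can be plugged in by swapping the function passed as `r`.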

Common Preprocessing Steps • Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.). • Break into tokens (keywords) on whitespace. • Stem tokens to “root” words • computational → compute • Remove common stopwords (e.g. a, the, it, etc.). • Detect common phrases (possibly using a domain specific dictionary). • Build inverted index (keyword → list of docs containing it).
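
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions: the stopword list is tiny, `toy_stem` is a crude suffix-rule stand-in for a real stemmer, phrase detection is omitted, and the two sample documents are invented.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "the", "it"}  # tiny illustrative list
# Toy suffix rules (assumption): real systems use a proper stemming algorithm.
SUFFIX_RULES = [("ational", "e"), ("ies", "y"), ("s", "")]

def strip_markup(text):
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    return re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits

def toy_stem(token):
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)] + replacement
    return token

def preprocess(text):
    tokens = strip_markup(text).lower().split()     # tokenize on whitespace
    stems = [toy_stem(t) for t in tokens]           # stem to "root" words
    return [t for t in stems if t not in STOPWORDS]  # remove stopwords

def build_inverted_index(docs):
    """Map each keyword to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "d1": "<p>The computational networks, 2009!</p>",
    "d2": "It ranks the network queries.",
}
index = build_inverted_index(docs)
print(index)
```

With these toy rules "computational" becomes "compute" as on the slide, and "network" is indexed under both documents.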

Common Assumptions • The three classic models (Boolean, vector, probabilistic) are all based on document representations that are sets of index terms. • Index terms (keywords): Words (mostly nouns) extracted from a document to summarize its contents. • Method: Could extract all distinct words, only those that appear in a global lexicon, etc. (to be discussed in Topic 8). • Weights: Not all extracted index terms carry the same importance in summarizing the contents of a document. • Let t be the number of terms in the entire system. • Let k_i be a term, let d_j be a document. • w_{i,j} ≥ 0 is a weight associated with the pair (k_i, d_j). • w_{i,j} = 0 when k_i does not appear in d_j. • Document d_j is associated with the index term vector d_j = (w_{1,j}, w_{2,j}, …, w_{t,j}).
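
A small sketch of the index term vector, assuming raw term counts as the weights. The slides do not fix a weighting scheme, so that choice, the vocabulary and the sample document are all assumptions made for illustration.

```python
def term_vector(doc_terms, vocabulary):
    """Return the weights for every term in the vocabulary, in order.

    The weight used here is the raw count of the term in the document
    (an assumption); it is 0 when the term does not appear.
    """
    return [doc_terms.count(k) for k in vocabulary]

vocabulary = ["computer", "network", "retrieval"]  # k_1 .. k_t with t = 3
d_j = ["network", "computer", "network"]           # hypothetical document
vector = term_vector(d_j, vocabulary)
print(vector)  # [1, 2, 0]
```

The vector always has t entries, one per term in the system, regardless of how many terms the document actually contains.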

Common Assumptions (cont.) • A naïve assumption is made that the terms in a document are mutually independent. • Term independence: The appearance of one term in a document is unrelated to the appearance of another. • This assumption is made for simplicity of calculations, but is often wrong. • Example: • The terms computer and network are not independent. • If the term computer appears, the probability that the term network appears is higher than if the term computer does not appear. • Whereas independence requires P(network|computer) = P(network). • Nonetheless, it has been demonstrated that performance is still good under this naïve assumption.
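
The independence condition can be checked on a toy collection by comparing P(network) with P(network | computer). The six documents below are invented for illustration; only the two term names come from the slide.

```python
from fractions import Fraction

# Hypothetical collection: each document is a set of terms.
docs = [
    {"computer", "network"},
    {"computer", "network"},
    {"computer"},
    {"network"},
    {"music"},
    {"music"},
]

with_network = [d for d in docs if "network" in d]
with_computer = [d for d in docs if "computer" in d]
with_both = [d for d in with_computer if "network" in d]

p_network = Fraction(len(with_network), len(docs))
p_network_given_computer = Fraction(len(with_both), len(with_computer))

print(p_network, p_network_given_computer)  # 1/2 2/3
```

Since 2/3 differs from 1/2, the two terms are not independent in this collection.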

The Boolean Model • Document: A document is a set of index terms without weights; or, equivalently, with binary term weights: w_{i,j} = 0 or w_{i,j} = 1. • i.e., a keyword is either present in or absent from a document. • Query: A query is a Boolean expression of index terms, i.e., index terms connected with and, or and not, according to the usual conventions. • Ranking: With each index term k_i we associate the set D_{k_i} of the documents in which k_i appears: D_{k_i} = {d_j | w_{i,j} = 1}, and the Boolean expression is converted to a set-theoretic expression: • Each term k_i is substituted by the set D_{k_i}. • The Boolean operators and (∧), or (∨) and not (¬) are substituted, respectively, by the set operators intersection (∩), union (∪) and complement. • The documents in the resulting set are relevant; all others are non-relevant.

The Boolean Model (cont.) • Example: • Terms: K1, …, K8. • Documents: • D1 = {K1, K2, K3, K4, K5} • D2 = {K1, K2, K3, K4} • D3 = {K2, K4, K6, K8} • D4 = {K1, K3, K5, K7} • D5 = {K4, K5, K6, K7, K8} • D6 = {K1, K2, K3, K4} • Query: K1 AND (K2 OR NOT K3) • Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5}) = {D1, D2, D6}
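
The example can be reproduced with Python sets, replacing AND, OR and NOT by intersection, union and complement. This is a minimal sketch; the helper name `containing` is an assumption, while the documents and the query are those of the slide.

```python
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def containing(term):
    """Set of the documents in which the given term appears."""
    return {name for name, terms in docs.items() if term in terms}

# K1 AND (K2 OR NOT K3); NOT is the complement with respect to all documents.
not_k3 = all_docs - containing("K3")
answer = containing("K1") & (containing("K2") | not_k3)
print(sorted(answer))  # ['D1', 'D2', 'D6']
```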

The Boolean Model (cont.) • Popular retrieval model because: • Easy to understand for simple queries. • Simple and clean formalism. • Reasonably efficient implementations possible for normal queries. • Adopted by many early commercial IR systems. • A document is either relevant or non-relevant to a query (i.e., no “strong” or “weak” relevance). • Hence, a query splits the collection into two distinct sets of documents: relevant and non-relevant, and there is no ranking from most relevant to least relevant. • May lead to answers with too few or too many documents.

Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided (absence of a grading scale) • Very rigid: AND means all; OR means any. • Difficult to express complex user requests. • Information need has to be translated into a Boolean expression which most users find awkward • The Boolean queries formulated by the users are most often too simplistic • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query • Difficult to perform relevance feedback. • If a document is identified by the user as relevant or irrelevant, how should the query be modified?

Example • Assume that we have 5 documents D1, D2, D3, D4, D5 and the following terms found in each document. • D1: {play, sport, football, swimming, university, just} • D2: {jordan, news, sport, university} • D3: {football, yarmouk, university, sport, play} • D4: {university, just, sport, jordan} • D5: {play, jordan, university, football}

Example (cont.) • The following query will be used to search for relevant documents: “Yarmouk” AND (“University” OR NOT “Just”) • Translate the query into disjunctive normal form (DNF), writing each conjunctive component as a binary vector over (Yarmouk, University, Just): • dnf1: Yarmouk AND NOT University AND NOT Just = (1,0,0) • dnf2: Yarmouk AND University AND NOT Just = (1,1,0) • dnf3: Yarmouk AND University AND Just = (1,1,1) • The set of vectors representing the query is Q_dnf = {(1,1,1), (1,1,0), (1,0,0)}
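
One way to obtain the conjunctive components is to enumerate the truth table of the query and keep the rows where it holds. This is a sketch of that idea, not necessarily the procedure used in the lecture; the function name `query` is an assumption.

```python
from itertools import product

def query(yarmouk, university, just):
    """Yarmouk AND (University OR NOT Just), with 1/0 standing for true/false."""
    return bool(yarmouk and (university or not just))

# Each satisfying row (yarmouk, university, just) is one conjunctive component.
q_dnf = [bits for bits in product((1, 0), repeat=3) if query(*bits)]
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]
```

The three rows that survive are exactly the three component vectors listed on the slide.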


Finally • To decide whether a document matches the query, we look for a conjunctive component of the DNF such that every key term in the query has the same value in that component as it does in the document.

Query key-term matrix

The result • The query returns the document D3 as the relevant document
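
The matching step can be sketched by building, for each document, its row over the three query terms and testing whether that row is one of the query's conjunctive components. The documents are those of the example with term spellings normalized; the variable names are assumptions.

```python
docs = {
    "D1": {"play", "sport", "football", "swimming", "university", "just"},
    "D2": {"jordan", "news", "sport", "university"},
    "D3": {"football", "yarmouk", "university", "sport", "play"},
    "D4": {"university", "just", "sport", "jordan"},  # spelling normalized
    "D5": {"play", "jordan", "university", "football"},
}
query_terms = ("yarmouk", "university", "just")
q_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}  # conjunctive components of the query

# One row per document: 1 if the query term occurs in it, 0 otherwise.
matrix = {
    name: tuple(int(term in terms) for term in query_terms)
    for name, terms in docs.items()
}

# A document is relevant when its row equals one of the components.
relevant = [name for name, row in matrix.items() if row in q_dnf]

for name, row in matrix.items():
    print(name, row)
print(relevant)  # ['D3']
```

Only D3 contains "yarmouk", and its row (1, 1, 0) equals the component dnf2, so it is the single relevant document.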
