Chapter 2 Modeling. Computer Science and Information Engineering 4B, 86075800, 陳建勳
Introduction. Traditional information retrieval systems usually adopt index terms to index and retrieve documents. An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun).
The advantages of using index terms • Simple. • The semantics of the documents and of the user information need can be naturally expressed through sets of index terms. • Ranking algorithms are at the core of information retrieval systems (they predict which documents are relevant and which are not).
A taxonomy of information retrieval models (figure) • User task: Retrieval (ad hoc, filtering) and Browsing. • Classic models: Boolean, vector, probabilistic. • Set theoretic: fuzzy, extended Boolean. • Algebraic: generalized vector, latent semantic indexing, neural networks. • Probabilistic: inference network, belief network. • Structured models: non-overlapping lists, proximal nodes. • Browsing: flat, structure guided, hypertext.
Figure 2.2 Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
Retrieval : Ad hoc and Filtering • Ad hoc : The documents in the collection remain relatively static while new queries are submitted to the system. • Filtering : The queries remain relatively static while new documents come into the system.
Filtering • Typically, the filtering task simply indicates to the user the documents which might be of interest to him. • Routing : Rank the filtered documents and show this ranking to the user. • User profiles can be constructed in two ways.
A formal characterization of IR models • D : A set composed of logical views (or representations) of the documents in the collection. • Q : A set composed of logical views (or representations) of the user information needs (queries). • F : A framework for modeling document representations, queries, and their relationships. • R(qi, dj) : A ranking function which associates a real number with a query qi and a document dj, and so defines an ordering among the documents with regard to the query.
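The characterization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `IRModel`, `View`, and `overlap` are hypothetical, logical views are reduced to sets of index terms, and the ranking function is a deliberately naive count of shared terms rather than any model from this chapter.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# Assumption: a logical view (of a document or a query) is a set of index terms.
View = FrozenSet[str]


@dataclass
class IRModel:
    documents: List[View]                    # D: logical views of the documents
    ranking: Callable[[View, View], float]   # R(q, d): score of document d for query q

    def rank(self, query: View) -> List[int]:
        """Return document positions ordered by decreasing R(query, d)."""
        scores = [self.ranking(query, d) for d in self.documents]
        return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def overlap(query: View, document: View) -> float:
    """Naive illustrative ranking function: number of shared index terms."""
    return float(len(query & document))


model = IRModel(
    documents=[
        frozenset({"index", "term"}),
        frozenset({"query", "index", "ranking"}),
        frozenset({"hypertext"}),
    ],
    ranking=overlap,
)
order = model.rank(frozenset({"index", "ranking"}))
```

Here the framework F is implicit in the choice of sets and set intersection; swapping `overlap` for another function changes the model without changing D or Q.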
Classic information retrieval model • Basic concepts : Each document is described by a set of representative keywords called index terms. • Distinct index terms have varying relevance for describing the document contents; this is captured by assigning a numerical weight to each index term of a document.
Definitions • ki : A generic index term • K : The set of all index terms {k1,…,kt} • wi,j : A weight associated with index term ki in a document dj (wi,j = 0 if ki does not appear in dj) • gi : A function that returns the weight associated with ki in any t-dimensional vector (gi(dj) = wi,j)
Boolean model • Based on a binary decision criterion (a document is either relevant or non-relevant) without any notion of a grading scale. • Boolean expressions have precise semantics, but it is not simple to translate an information need into a Boolean expression. • A query can be represented as a disjunction of conjunctive vectors (i.e., in disjunctive normal form, DNF).
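As a minimal sketch of DNF matching, assume an illustrative query q = ka AND (kb OR NOT kc). Over the terms (ka, kb, kc) its disjunctive normal form is (1,1,1) OR (1,1,0) OR (1,0,0). The function name `boolean_sim` and the term names are hypothetical; the point is only that the similarity is 1 when the document's binary pattern on the query terms equals one of the conjunctive components, and 0 otherwise.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: illustrative query q = ka AND (kb OR NOT kc).
QUERY_TERMS: Tuple[str, ...] = ("ka", "kb", "kc")
# Conjunctive components of q in DNF, as binary vectors over (ka, kb, kc).
Q_DNF: List[Tuple[int, ...]] = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]


def boolean_sim(doc_terms: Set[str],
                query_terms: Iterable[str] = QUERY_TERMS,
                q_dnf: List[Tuple[int, ...]] = Q_DNF) -> int:
    """Return 1 if the document matches some conjunctive component, else 0."""
    # Binary weights: 1 if the query term occurs in the document, 0 otherwise.
    doc_vector = tuple(1 if term in doc_terms else 0 for term in query_terms)
    return 1 if doc_vector in q_dnf else 0


relevant = boolean_sim({"ka", "kb"})       # pattern (1, 1, 0) is a component
not_relevant = boolean_sim({"ka", "kc"})   # pattern (1, 0, 1) is not
```

There is no partial credit: a document that contains kb and kc but not ka scores 0, exactly like a document that contains none of the query terms.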
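Vector model • Assign non-binary weights to index terms in queries and in documents. • Compute the degree of similarity between each document and the query, and rank the documents by it. • The ranked answer set is more precise than the answer set retrieved by the Boolean model.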
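The slides do not say which similarity measure is used. One common choice for the vector model is the cosine of the angle between the document vector and the query vector, sketched below under that assumption. The function name `cosine_similarity` and the sample weights are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(doc: Sequence[float], query: Sequence[float]) -> float:
    """Cosine of the angle between two t-dimensional weight vectors."""
    dot = sum(w * v for w, v in zip(doc, query))
    norm = math.sqrt(sum(w * w for w in doc)) * math.sqrt(sum(v * v for v in query))
    # An all-zero vector has no direction; treat its similarity as 0.
    return dot / norm if norm else 0.0


query_vector = [1.0, 0.0, 1.0]
doc_vectors = [
    [0.5, 0.8, 0.0],   # shares one query term
    [0.9, 0.0, 0.7],   # shares both query terms
    [0.0, 1.0, 0.0],   # shares none
]
# Document positions ordered by decreasing similarity to the query.
ranking = sorted(range(len(doc_vectors)),
                 key=lambda j: cosine_similarity(doc_vectors[j], query_vector),
                 reverse=True)
```

Unlike the Boolean case, a document that matches the query only partially still receives a non-zero score and a place in the ranking.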
Idea: We think of the documents as a collection C of objects and of the user query as a specification of a set A of objects. In this scenario, the IR problem can be reduced to the problem of determining which documents are in the set A and which ones are not (i.e., the IR problem can be viewed as a clustering problem).
Intra-cluster : One needs to determine which features better describe the objects in the set A. • Inter-cluster : One needs to determine which features better distinguish the objects in the set A from the remaining objects in the collection C.
tf : Intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. This term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents. • idf : Inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This factor is often referred to as the inverse document frequency, or idf factor.
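The two factors can be combined into a term weight. The sketch below assumes one common formulation that the slides do not spell out: tf is the raw frequency normalized by the largest frequency in the same document, idf is log(N / ni) where N is the number of documents and ni the number of documents containing ki, and the weight is their product. The function name `tf_idf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return, for each document, a mapping from term to tf * idf weight."""
    n_docs = len(docs)
    # ni: number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            # Assumed formulation: normalized tf times log(N / ni).
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


sample_docs = [
    ["index", "term", "index"],
    ["index", "query"],
    ["term", "browsing"],
]
sample_weights = tf_idf_weights(sample_docs)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to distinguishing one document from another.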
The vector model is simple and fast, and it is a popular retrieval model. • Disadvantage : Index terms are assumed to be mutually independent, so the model does not account for index term dependencies.
Probabilistic model • We can think of the querying process as a process of specifying the properties of an ideal answer set. The problem is that we do not know exactly what these properties are.
Structured text retrieval model • Retrieval models which combine information on text content with information on the document structure are called structured text retrieval models. • Match point : refers to the position in the text of a sequence of words which matches the user query. • Region : refers to a contiguous portion of the text. • Node : refers to a structural component of the document such as a chapter, a section, or a subsection.
Model based on Non-overlapping lists • Divide the whole text of each document into non-overlapping text regions which are collected in a list. • Text regions in the same list do not overlap, but text regions from distinct lists might overlap.
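A minimal sketch of this organization, assuming a region is an inclusive (start, end) pair of text positions and that each list is named after the structural unit it holds. The names `Region`, `is_non_overlapping`, `regions_containing`, and the sample positions are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a region is an inclusive (start, end) pair of text positions.
Region = Tuple[int, int]


def is_non_overlapping(regions: List[Region]) -> bool:
    """Check that no two regions of one list share a text position."""
    ordered = sorted(regions)
    return all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def regions_containing(regions: List[Region], match_point: int) -> List[Region]:
    """Return the regions of one list that contain the given match point."""
    return [r for r in regions if r[0] <= match_point <= r[1]]


# Two lists over the same text: regions within a list never overlap,
# but a section overlaps the chapter it belongs to.
region_lists: Dict[str, List[Region]] = {
    "chapters": [(0, 499), (500, 999)],
    "sections": [(0, 199), (200, 499), (500, 799), (800, 999)],
}
hit_section = regions_containing(region_lists["sections"], 250)
hit_chapter = regions_containing(region_lists["chapters"], 250)
```

Because regions in one list are disjoint, a match point falls in at most one region per list, which keeps this kind of lookup simple.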
Model based on Proximal nodes • A model which allows the definition of independent hierarchical indexing structures over the same document text. • Each of these index structures is a strict hierarchy composed of chapters, sections, paragraphs, pages, and lines, which are called nodes.
Models for browsing • Flat browsing • Structure guided browsing • The hypertext model
Flat browsing • The documents might be represented as dots in a plane or as elements in a list. • Relevance feedback • Disadvantage : In a given page or screen there may not be any indication about the context where the user is.
Structure guided browsing • Documents are organized in a directory structure which groups documents covering related topics. • The same idea can be applied to a single document. • A history map can be used.
The hypertext model • Written text is usually conceived to be read sequentially. • The reader should not expect to fully understand the message conveyed by the writer by randomly reading pieces of text here and there.