Information Retrieval System. Prof. R.C. Tripathi.
Content
1. Introduction
2. Conceptual Models of IR
3. File Structures
4. Query Operations
5. Term Operations
6. Document Operations
7. Hardware for IR
8. IR AND OTHER TYPES OF INFORMATION SYSTEMS
9. IR SYSTEM EVALUATION
10. Typical IR Task
11. IR System
12. Relevance
13. Keyword Search
14. Problems with Keywords
15. Beyond Keywords
16. Intelligent IR
17. IR System Architecture
18. IR System Components
19. IR Models
20. Web Search
21. Related Areas
1. Introduction Information Retrieval is a wide, often loosely defined term. Unfortunately, the word "information" can be very misleading. In the context of information retrieval (IR), information in the technical sense given by Shannon's theory of communication is not readily measured (Shannon and Weaver). Information retrieval is the term conventionally, though somewhat inaccurately, applied to the type of activity described by Lancaster. Researchers also distinguish data retrieval from information retrieval; some examples of the differences are listed in the table below.
2. Conceptual Models of IR The most general facet in the previous classification scheme is conceptual model. An IR conceptual model is a general approach to IR systems. Several taxonomies for IR conceptual models have been proposed. Faloutsos (1985) gives three basic approaches: text pattern search, inverted file search, and signature search. Belkin and Croft (1987) categorize IR conceptual models differently. They divide retrieval techniques first into exact match and inexact match. The exact match category contains text pattern search and Boolean search techniques.
Contd.. The inexact match category contains such techniques as probabilistic, vector space, and clustering, among others. Almost all of the IR systems fielded today are either Boolean IR systems or text pattern search systems. Text pattern search queries are strings or regular expressions. Text pattern systems are more common for searching small collections, such as personal collections of files.
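As a rough illustration of text pattern search over a small personal collection, the sketch below runs a regular expression against every document. The file names, texts, and the helper `pattern_search` are hypothetical and only stand in for a real collection.

```python
import re

# Hypothetical small collection: document identifier -> text.
documents = {
    "notes.txt": "Boolean retrieval uses AND, OR and NOT operators.",
    "todo.txt": "Read the chapter on signature files.",
    "draft.txt": "Inverted files map each term to the documents containing it.",
}


def pattern_search(pattern, docs):
    """Return the identifiers of documents whose text matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(doc_id for doc_id, text in docs.items() if regex.search(text))


# The query is itself a regular expression: "file" or "files" as a whole word.
matches = pattern_search(r"\bfiles?\b", documents)
```

Every document is scanned for each query, which is acceptable for small collections but is one reason larger systems turn to index structures.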
3. File Structures A fundamental decision in the design of IR systems is which type of file structure to use for the underlying document database. As can be seen in the table below, the file structures used in IR systems are flat files, inverted files, signature files, PAT trees, and graphs. Though it is possible to keep file structures in main memory, in practice IR databases are usually stored on disk because of their size.
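To make the difference between a flat file and an inverted file concrete, here is a minimal in-memory sketch. The three sample documents and the name `build_inverted_file` are assumptions for illustration; a real IR database would keep this structure on disk.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the sorted list of document identifiers that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Flat file view: each document identifier is stored with its full text.
flat_file = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information theory",
}

# Inverted file view: each term points back to the documents that contain it.
inverted_file = build_inverted_file(flat_file)
```

Looking up a term in `inverted_file` avoids scanning every document, at the cost of building and storing the extra structure.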
4. Query Operations Queries are formal statements of information needs put to the IR system by users. The operations on queries are obviously a function of the type of query, and the capabilities of the IR system. One common query operation is parsing, that is breaking the query into its constituent elements. Boolean queries, for example, must be parsed into their constituent terms and operators. The set of document identifiers associated with each query term is retrieved, and the sets are then combined according to the Boolean operators.
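The parse-then-combine procedure described above can be sketched as follows. This is a simplified illustration, not a full query parser: it assumes queries alternate term and operator, evaluates strictly left to right without parentheses, and treats NOT as a binary "and not". The index contents are made up.

```python
def parse_query(query):
    """Split a Boolean query into its constituent terms and operators."""
    tokens = query.split()
    terms = [token.lower() for token in tokens[0::2]]
    operators = [token.upper() for token in tokens[1::2]]
    return terms, operators


def evaluate(query, index):
    """Retrieve the identifier set of each term and combine the sets per operator."""
    terms, operators = parse_query(query)
    result = set(index.get(terms[0], ()))
    for operator, term in zip(operators, terms[1:]):
        ids = set(index.get(term, ()))
        if operator == "AND":
            result &= ids
        elif operator == "OR":
            result |= ids
        elif operator == "NOT":
            result -= ids
        else:
            raise ValueError(f"unknown operator: {operator}")
    return sorted(result)


# Hypothetical index: term -> set of document identifiers.
index = {"information": {1, 3}, "retrieval": {1}, "systems": {1, 2}}
```

For example, `evaluate("information AND systems", index)` intersects the two identifier sets, while `OR` takes their union.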
5. Term Operations Operations on terms in an IR system include stemming, truncation, weighting, and stoplist and thesaurus operations. Stemming is the automated conflation (fusing or combining) of related words, usually by reducing the words to a common root form. Truncation is manual conflation of terms by using wildcard characters in the word, so that the truncated term will match multiple words. For example, a searcher interested in finding documents about truncation might enter the term "truncat?" which would match terms such as truncate, truncated, and truncation. Another way of conflating related terms is with a thesaurus which lists synonymous terms, and sometimes the relationships among them. A stoplist is a list of words considered to have no indexing value, used to eliminate potential indexing terms. Each potential indexing term is checked against the stoplist and eliminated if found there.
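Two of these term operations, truncation and stoplist filtering, are simple enough to sketch directly. The stoplist contents, the vocabulary, and both function names are illustrative assumptions; the trailing "?" is treated as the truncation symbol, as in the "truncat?" example above.

```python
# Illustrative stoplist; real stoplists are much longer.
STOPLIST = {"a", "an", "and", "of", "the", "to"}


def remove_stopwords(terms):
    """Drop every potential indexing term that is found in the stoplist."""
    return [term for term in terms if term.lower() not in STOPLIST]


def expand_truncation(term, vocabulary):
    """Expand a truncated term such as 'truncat?' into the matching vocabulary words."""
    if term.endswith("?"):
        prefix = term[:-1]
        return sorted(word for word in vocabulary if word.startswith(prefix))
    return [term] if term in vocabulary else []


# Hypothetical vocabulary of indexed words.
vocabulary = {"truncate", "truncated", "truncation", "trunk", "stemming"}
expanded = expand_truncation("truncat?", vocabulary)
kept = remove_stopwords("the truncation of a term".split())
```

Note that "trunk" is not matched, because it does not share the full prefix "truncat".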
6. Document Operations Documents are the primary objects in IR systems and there are many operations for them. In many types of IR systems, documents added to a database must be given unique identifiers, parsed into their constituent fields, and those fields broken into field identifiers and terms. Once in the database, one sometimes wishes to mask off certain fields for searching and display. For example, the searcher may wish to search only the title and abstract fields of documents for a given query, or may wish to see only the title and author of retrieved documents. One may also wish to sort retrieved documents by some field, for example by author.
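Field masking and sorting can be illustrated with a small sketch. The records, field names, and the helpers `search_fields` and `display` are hypothetical; they only show the idea of restricting which fields are searched and which are shown.

```python
# Hypothetical records, already parsed into fields and given unique identifiers.
records = [
    {"id": 1, "title": "Vector space retrieval", "author": "Baker",
     "abstract": "Ranking documents by similarity", "body": "Full text one."},
    {"id": 2, "title": "Signature files", "author": "Adams",
     "abstract": "A retrieval structure for text", "body": "Full text two."},
    {"id": 3, "title": "Disk layout", "author": "Clark",
     "abstract": "Storage notes", "body": "The word retrieval appears only here."},
]


def search_fields(records, term, fields):
    """Return records in which the term occurs in at least one of the searched fields."""
    term = term.lower()
    return [r for r in records if any(term in r[f].lower() for f in fields)]


def display(records, fields, sort_by=None):
    """Show only the selected fields, optionally sorted by one field."""
    if sort_by is not None:
        records = sorted(records, key=lambda r: r[sort_by])
    return [{f: r[f] for f in fields} for r in records]


# Search only title and abstract, then show only title and author, sorted by author.
hits = search_fields(records, "retrieval", ["title", "abstract"])
shown = display(hits, ["title", "author"], sort_by="author")
```

Record 3 is not retrieved because "retrieval" occurs only in its masked-off body field.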
7. Hardware for IR Hardware affects the design of IR systems because it determines, in part, the operating speed of an IR system--a crucial factor in interactive information systems--and the amounts and types of information that can be stored practically in an IR system. Most IR systems in use today are implemented on von Neumann machines--general purpose computers with a single processor. Most of the discussion of IR techniques in this book assumes a von Neumann machine as an implementation platform. The computing speeds of these machines have improved enormously over the years, yet there are still IR applications for which they may be too slow. In response to this problem, some researchers have examined alternative hardware for implementing IR systems. There are two approaches--parallel computers and IR specific hardware.
8. IR AND OTHER TYPES OF INFORMATION SYSTEMS How do IR systems relate to different types of information systems such as database management systems (DBMS), and artificial intelligence (AI) systems? Table 1.3 summarizes some of the similarities and differences.
Contd.. One difference between IR, DBMS, and AI systems is the amount of usable structure in their data objects. Documents, being primarily text, in general have less usable structure than the tables of data used by relational DBMS, and structures such as frames and semantic nets used by AI systems. It is possible, of course, to analyze a document manually and store information about its syntax and semantics in a DBMS or an AI system. The barriers for doing this to a large collection of documents are practical rather than theoretical. The work involved in doing knowledge engineering on a set of say 50,000 documents would be enormous. Researchers have devoted much effort to constructing hybrid systems using IR, DBMS, AI, and other techniques; see, for example, Tong (1989). The hope is to eventually develop practical systems that combine IR, DBMS, and AI.
Contd.. Another distinguishing feature of IR systems is that retrieval is probabilistic. That is, one cannot be certain that a retrieved document will meet the information need of the user. In a typical search in an IR system, some relevant documents will be missed and some nonrelevant documents will be retrieved. This may be contrasted with retrieval from, for example, a DBMS where retrieval is deterministic. In a DBMS, queries consist of attribute-value pairs that either match, or do not match, records in the database.
9. IR SYSTEM EVALUATION IR systems can be evaluated in terms of many criteria including execution efficiency, storage efficiency, retrieval effectiveness, and the features they offer a user. The relative importance of these factors must be decided by the designers of the system, and the selection of appropriate data structures and algorithms for implementation will depend on these decisions.
Contd.. Because one often wishes to compare IR performance in terms of both recall and precision, methods for evaluating them simultaneously have been developed. One method involves the use of recall-precision graphs--bivariate plots where one axis is recall and the other precision. Figure 1.2 shows an example of such a plot. Recall-precision plots show that recall and precision are inversely related. That is, when precision goes up, recall typically goes down and vice-versa.
Contd..
• Efficiency: time, space
• Effectiveness:
  • How capable is a system of retrieving relevant documents?
  • Is one system better than another?
• Metrics often used (together):
  • Precision = retrieved relevant docs / retrieved docs
  • Recall = retrieved relevant docs / relevant docs
[Figure: Venn diagram of the relevant and retrieved document sets; their overlap is the set of retrieved relevant documents.]
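The two metrics can be computed directly from the retrieved and relevant sets. The sketch below uses made-up document identifiers; returning 0.0 for an empty set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from sets of document identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result: 4 documents retrieved, 5 documents actually relevant.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7})
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).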
General form of precision/recall
• Precision changes with respect to recall (it is not a fixed point)
• Systems cannot be compared at a single precision/recall point
• Average precision (over 11 points of recall: 0.0, 0.1, …, 1.0)
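One common way to average precision over the 11 recall points is to interpolate: at each recall level, take the highest precision observed at that recall or beyond. The slide does not say which variant it has in mind, so the interpolation rule, the function name, and the sample ranking below are assumptions.

```python
def eleven_point_average_precision(ranking, relevant):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    # (recall, precision) measured each time a relevant document appears in the ranking.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision: best precision at any recall >= the level (0 if none).
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0) for level in levels
    ]
    return sum(interpolated) / len(levels)


# Hypothetical ranking in which the 1st and 3rd documents are the relevant ones.
score = eleven_point_average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

In this example precision is 1.0 up to recall 0.5 and 2/3 from there to recall 1.0, so the average is (6 × 1.0 + 5 × 2/3) / 11 = 28/33.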
10. Typical IR Task
• Given:
  • A corpus of textual natural-language documents.
  • A user query in the form of a textual string.
• Find:
  • A ranked set of documents that are relevant to the query.
11. IR System
[Figure: a document corpus and a query string are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
12. Relevance
• Relevance is a subjective judgment and may include:
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her intended use of the information (information need).
13. Keyword Search The simplest notion of relevance is that the query string appears verbatim in the document. A slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
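The two notions of relevance can be contrasted in a few lines. The sample document and both function names are hypothetical, and counting raw occurrences is only one simple way to read "appear frequently".

```python
def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: count occurrences of each query word, ignoring word order."""
    words = document.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Hypothetical document that contains both query words, but never side by side.
doc = "retrieval of information is the goal of information systems"
is_verbatim = verbatim_match("information retrieval", doc)
score = bag_of_words_score("information retrieval", doc)
```

The verbatim test fails because the words never appear in that order, while the bag-of-words score still credits the document for containing them.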
14. Problems with Keywords
• May not retrieve relevant documents that include synonymous terms.
  • “restaurant” vs. “café”
  • “PRC” vs. “China”
• May retrieve irrelevant documents that include ambiguous terms.
  • “bat” (baseball vs. mammal)
  • “Apple” (company vs. fruit)
  • “bit” (unit of data vs. act of eating)
15. Beyond Keywords We will cover the basics of keyword-based IR, but… We will focus on extensions and recent developments that go beyond keywords. We will cover the basics of building an efficient IR system, but… We will focus on basic capabilities and algorithms rather than systems issues that allow scaling to industrial size databases.
16. Intelligent IR
• Taking into account the meaning of the words used.
• Taking into account the order of words in the query.
• Adapting to the user based on direct or indirect feedback.
• Taking into account the authority of the source.
17. IR System Architecture
[Figure: architecture diagram. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Database Manager, Index (inverted file), Text Database. Labeled flows: User Need, Text, Logical View, User Feedback, Query, Retrieved Docs, Ranked Docs.]
18. IR System Components
• Text Operations forms index words (tokens).
  • Stopword removal
  • Stemming
• Indexing constructs an inverted index of word to document pointers.
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
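The four components can be chained into a very small end-to-end sketch. Everything here is a stand-in: the stopword list is tiny, plural stripping replaces a real stemmer, and summed term frequency is just one possible relevance metric. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "the", "of", "in", "and"}  # illustrative only


def text_operations(text):
    """Tokenize, remove stopwords, and apply a crude plural-stripping 'stemmer'."""
    tokens = [token.strip(".,;:!?").lower() for token in text.split()]
    tokens = [token for token in tokens if token and token not in STOPWORDS]
    # Placeholder for real stemming: drop a trailing "s" from longer words.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def build_index(corpus):
    """Indexing: inverted index from token to {document identifier: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in corpus.items():
        for token in text_operations(text):
            index[token][doc_id] += 1
    return index


def search_and_rank(query, index):
    """Searching and ranking: score documents by summed query-term frequency."""
    scores = Counter()
    for token in text_operations(query):
        for doc_id, frequency in index.get(token, {}).items():
            scores[doc_id] += frequency
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered]


corpus = {
    "Doc1": "Indexing of documents in the database.",
    "Doc2": "Ranking documents: documents are scored.",
    "Doc3": "Hardware for parallel computers.",
}
ranked = search_and_rank("document ranking", build_index(corpus))
```

Doc3 shares no token with the query and is never retrieved; Doc2 ranks above Doc1 because it matches more query-term occurrences.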
IR System Components (continued)
• User Interface manages interaction with the user:
  • Query input and document output.
  • Relevance feedback.
  • Visualization of results.
• Query Operations transform the query to improve retrieval:
  • Query expansion using a thesaurus.
  • Query transformation using relevance feedback.
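Query expansion using a thesaurus can be sketched as a simple lookup. The thesaurus entries reuse the synonym pairs from the "Problems with Keywords" slide; the structure of the thesaurus and the name `expand_query` are assumptions.

```python
# Hypothetical thesaurus: term -> set of synonymous terms.
THESAURUS = {
    "restaurant": {"café"},
    "china": {"prc"},
}


def expand_query(terms, thesaurus):
    """Add the synonyms of each query term, keeping order and avoiding duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *sorted(thesaurus.get(term, ()))]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_query = expand_query(["restaurant", "review"], THESAURUS)
```

The expanded query can then be matched with OR semantics, so documents that mention only "café" are still retrieved.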
19. IR Models
[Figure: taxonomy of IR models, organized by user task.]
User Task:
• Retrieval: Adhoc, Filtering
  • Classic Models: boolean, vector, probabilistic
    • Set Theoretic: Fuzzy, Extended Boolean
    • Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
    • Probabilistic: Inference Network, Belief Network
  • Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
20. Web Search
• Application of IR to HTML documents on the World Wide Web.
• Differences:
  • Must assemble document corpus by spidering the web.
  • Can exploit the structural layout information in HTML (XML).
  • Documents change uncontrollably.
  • Can exploit the link structure of the web.
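A spider assembles the corpus by repeatedly extracting links from fetched pages. The sketch below shows only the link-extraction step using Python's standard `html.parser`; the page content and URLs are made up, and fetching, politeness rules, and duplicate detection are deliberately left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the URL of the page itself.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page, as if already downloaded by the spider.
page = (
    '<html><body><a href="/page2.html">Next</a> '
    '<a href="http://example.org/">Other</a></body></html>'
)
links = extract_links(page, "http://example.com/index.html")
```

The extracted links serve two purposes mentioned above: they tell the spider which pages to visit next, and they expose the link structure that a web search system can exploit.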
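Web Search System
[Figure: a web spider crawls the Web to assemble the document corpus; the corpus and a query string are the inputs to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]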
21. Related Areas
• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning
Database Management Focused on structured data stored in relational tables rather than free-form text. Focused on efficient processing of well-defined queries in a formal language (SQL). Clearer semantics for both data and queries. Recent move towards semi-structured data (XML) brings it closer to IR.
Library and Information Science
• Focused on the human user aspects of information retrieval (human-computer interaction, user interface, visualization).
• Concerned with effective categorization of human knowledge.
• Concerned with citation analysis and bibliometrics (structure of information).
• Recent work on digital libraries brings it closer to CS & IR.
Artificial Intelligence
• Focused on the representation of knowledge, reasoning, and intelligent action.
• Formalisms for representing knowledge and queries:
  • First-order Predicate Logic
  • Bayesian Networks
• Recent work on web ontologies and intelligent information agents brings it closer to IR.
Natural Language Processing Focused on the syntactic, semantic, and pragmatic analysis of natural language text and discourse. Ability to analyze syntax (phrase structure) and semantics could allow retrieval based on meaning rather than keywords.
Thank You.