Introduction to Text Retrieval

CSE3201/4500 Information Retrieval Systems
(c) Maria Indrawan 2004

Database Types
(figure: a spectrum from highly structured to ill-structured: Relational DB, XML collections, Text Collections, Multimedia Collections)

Ill-structured data
• Attributes:
  • Variable-length records and fields
  • Repeated fields [non-normalised]
  • Mixed media
  • Often large
  • Often accessed by “novice” users
  • Need for both currency and completeness

Information Retrieval
• Information retrieval has been the term applied to such areas as:
  • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications, etc.
• These systems are typical of variable-length record systems.
• Text retrieval is a subset of information retrieval.
• Research articles often use the term IR to mean text retrieval, especially in the 1970s, 1980s and 1990s.

Text Retrieval - Overview
• Information retrieval:
  • is a branch of database theory
  • specialises in managing the retrieval of unstructured data
  • deals with large amounts of free-format text.
• Key problem:
  • How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query.
• Response to a query:
  • Does not answer the query directly
  • Identifies relevant information.

Text Retrieval Characteristics
• The document space is large.
• The document space may or may not be structured.
• The query may not be structured.
• Exact matching, as used in relational databases, will not work effectively.
• The objects to be retrieved are usually represented by surrogate records.

Surrogate Records
• Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves.
• The quality of the surrogate records often decides how well the system retrieves.
• The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.

Text Retrieval Processes
• Representation
• Storage
• Organization
• Retrieval
• Presentation

Text Retrieval Processes Model

Retrieval Process

Indexing (Document Analysis)
• Document: natural language text
• ANALYSE: keywords, stemming, thesauri replacement, (weight assignment)
• STORE

Query Formulation
• Controlled vocabulary:
  • the keywords of a query are drawn from the keywords in the document collection

Indexing in Text Retrieval Systems

Indexed Files in Traditional Databases
• An index is a look-up table which establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file.
• One named (physical) file, two logical files:
  • Data file: contains the full data records
  • Index file: “records” consist of two fields, key value and address
• The index file is small, so it is quick to search.
• Addresses obtained from the index enable direct access to the data file.
• Logically sequential access is also available via the index.
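The key-value-to-address idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory "data file", the comma-separated record layout, and the helper names `build_index` and `fetch` are all hypothetical and are not taken from the lecture.

```python
import io

def build_index(data_file, key_of):
    """Scan the data file once; map each key value to its record's byte offset."""
    index = {}
    data_file.seek(0)
    while True:
        offset = data_file.tell()      # address of the record about to be read
        line = data_file.readline()
        if not line:                   # empty bytes object means end of file
            break
        index[key_of(line)] = offset
    return index

def fetch(data_file, index, key):
    """Direct access: jump straight to the address stored in the index."""
    data_file.seek(index[key])
    return data_file.readline().rstrip(b"\n")

# Hypothetical data file: one record per line, key in the first field.
data = io.BytesIO(b"103,Smith,Sales\n101,Jones,IT\n102,Chan,HR\n")
index = build_index(data, key_of=lambda line: line.split(b",")[0].decode())

record = fetch(data, index, "102")

# Logically sequential access: walk the index in key order,
# even though the records are not stored in that order.
in_key_order = [fetch(data, index, key) for key in sorted(index)]
```

The index holds only (key, address) pairs, so it stays much smaller than the data file it points into.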

Indexed Non-Sequential File
(figure labels: Index, Data Records)

Indexed Sequential File
(figure labels: Index, Data Records)

Indexing in Text Retrieval Systems
(figure labels: Doc-1 (data record), Doc-2 (data record))

Purpose of Indexing
• To give a sufficiently general description of a document, so that it can be retrieved by queries that concern the same subject as the document.
• To give a sufficiently specific description, so that the document will not be returned for queries that are not related to it.

Indexing
• Manual indexing
• Automatic indexing

Style of indexing
• The style of indexing depends on the form of the queries, and vice versa.
• We must decide whether the terms available for indexing are predefined (a controlled vocabulary) or chosen at the time of indexing (an uncontrolled vocabulary).

Controlled Vocabulary
• Controlled vocabulary is a method of predetermining the terms which will be used in a specific domain so that:
  • indexers will select from a limited set of terms
  • searchers can use terms knowing that they have been applied in an objective manner
  • index sets are reduced in size
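As a rough sketch of how an indexing tool might enforce a controlled vocabulary, the Python below accepts only terms that appear in a predetermined set. The vocabulary contents and the function name `assign_terms` are hypothetical examples, not part of the lecture.

```python
# Hypothetical controlled vocabulary for a small computing collection.
CONTROLLED_VOCABULARY = frozenset(
    {"information retrieval", "indexing", "databases", "stemming"}
)

def assign_terms(proposed_terms, vocabulary=CONTROLLED_VOCABULARY):
    """Split an indexer's proposed terms into accepted and rejected lists."""
    accepted, rejected = [], []
    for term in proposed_terms:
        normalised = term.strip().casefold()
        if normalised in vocabulary:
            accepted.append(normalised)
        else:
            rejected.append(normalised)   # not in the predetermined set
    return accepted, rejected

accepted, rejected = assign_terms(["Indexing", "Stemming", "search engines"])
```

The same check can be applied to query terms, so that searchers and indexers draw on one shared term set.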

Manual Indexing Methods
• 1. Give the document a single code from a predefined list, e.g.:
  • the first letter of the first author’s family name
  • a Dewey Decimal number
• 2. Assign several codes from a predefined list to a document, e.g.:
  • assign the Computing Reviews classification to articles.
• 3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.

Manual Indexing - Analysis
• Single-term indexing: simple, with a low indexing cost, but poor retrieval.
• All other techniques require that a more complex index be maintained.
• When a controlled vocabulary is used, a taxonomy of the document contents must be devised. Once devised, it must be adhered to from then on.

Manual Indexing - Analysis
• Advantage: terms that never appear in the text but are extremely descriptive may be assigned to the document.
• Disadvantages:
  • poor inter-indexer consistency
  • an inflexible view of documents
  • no control over the number of documents that satisfy a query.

Automatic Indexing - A Basic Method
• Assume that a document consists of just text and that we will derive our indexing terms from this text.
• Break the text up into words, casefold, and index on every word. This technique is very simple and performs reasonably well.
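The basic method above (split into words, casefold, index every word) could look roughly like this in Python. It is a minimal sketch: the two sample documents, the `\w+` word pattern, and the names `tokenize` and `build_inverted_index` are assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Break the text into words and casefold each one."""
    return [word.casefold() for word in re.findall(r"\w+", text)]

def build_inverted_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

# Hypothetical two-document collection.
documents = {
    "Doc-1": "Text retrieval systems index every word.",
    "Doc-2": "An index maps a word to the documents containing it.",
}
index = build_inverted_index(documents)

docs_with_index = index["index"]   # documents containing the word "index"
```

Because of casefolding, a query for "text" finds Doc-1 even though the document spells it "Text".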

Automatic Indexing - Refinement
• Refinements are language dependent.
  • Refinements for English will differ from those for Chinese.
• Stop List
• Stemming
• Term Weighting

Indexing Refinement – Stop List
• A stop list is a list of common words.
• It generally contains words that are not nouns, verbs, adjectives or adverbs.
• A stop list might consist of: a, the, an, is, be, ...
• Common stop lists run from 10 to hundreds of words.
• The exact choice of stop list matters little; typically a list of around 300 common words will do well.
• The indexing process ignores the words listed in the stop list.
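A stop list can be applied as a simple filter during indexing. The sketch below uses only the five example words from the slide as its stop list; the function name `index_terms` and the whitespace tokenisation are simplifying assumptions.

```python
# Stop list built from the slide's example words; real lists are longer.
STOP_LIST = frozenset({"a", "the", "an", "is", "be"})

def index_terms(text, stop_list=STOP_LIST):
    """Casefold, split on whitespace, and drop every word in the stop list."""
    return [word for word in text.casefold().split() if word not in stop_list]

terms = index_terms("The index is a look up table")
# terms == ["index", "look", "up", "table"]
```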

Stop Lists
• Fox indicates that the first 20 stop words account for 31.19% of the English corpus.
  • Fox, C. (1992). Lexical Analysis and Stoplists. In Frakes, W.B. and Baeza-Yates, R. (Eds.), Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
• The first 20 stop words:
  • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.

Refinement - Stemming
• Stemming conflates the variant forms of a word, so that a single term accommodates the many variations that make up a concept.
• This avoids exceedingly long “or” query statements.
  • Example: inquiry or inquired or inquiries
• The process is performed after the “stop list” process.
• Porter stemming algorithm:
  • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.

Stemming - Suffix
• Most English meaning shifts for grammatical purposes are handled by suffixes.
• Most retrieval systems allow for “trailing” (suffix) truncation.
• Example:
  • “inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry”, etc.
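Trailing truncation such as “inquir$” can be sketched as a prefix match over the indexed terms. In the Python below, the tiny index, the document numbers, and the helper names `matches` and `search` are hypothetical; only the “$” wildcard convention comes from the slide.

```python
def matches(pattern, term, wildcard="$"):
    """True if the term satisfies the pattern; a trailing wildcard means 'any suffix'."""
    pattern, term = pattern.casefold(), term.casefold()
    if pattern.endswith(wildcard):
        return term.startswith(pattern[: -len(wildcard)])
    return term == pattern

def search(pattern, index):
    """Union the document ids of every indexed term that matches the pattern."""
    hits = set()
    for term, doc_ids in index.items():
        if matches(pattern, term):
            hits |= doc_ids
    return hits

# Hypothetical index: term -> ids of the documents that contain it.
index = {
    "inquire": {1},
    "inquiry": {2},
    "inquiries": {2, 3},
    "index": {4},
}

truncated_hits = search("inquir$", index)   # prefix match on "inquir"
exact_hits = search("inquiry", index)       # no wildcard, exact term only
```

A linear scan over all terms is used here for brevity; a sorted term list would let a real system find the matching range faster.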

Stemming - Prefix
• Prefix stemming is not usually used in English text retrieval systems.
• A prefix is a substantial modifier, and can even be a negation.
• Example:
  • flammable and inflammable.
• Prefix stemming may be useful in chemical databases.

Stemming – Exception List
• Irregularity in the language needs to be implemented as a “lookup list”.
• Example:
  • Irregular plurals
    • woman => women
    • child => children
  • Past tense
    • choose => chose
    • find => found
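One way to combine an exception list with suffix stripping is to consult the lookup list first and strip suffixes only when no exception applies. The sketch below is deliberately crude and is not the Porter algorithm: the suffix tuple, the minimum stem length of three, and the function name `stem` are assumptions, while the four exceptions are the ones on the slide (stored in the irregular-form-to-base direction).

```python
# Irregular forms from the slide, mapped back to their base form.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}

# Hypothetical suffix list, tried in this order.
SUFFIXES = ("ies", "ing", "ed", "es", "s")

def stem(word):
    """Return a crude stem: exception lookup first, then suffix stripping."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [stem(w) for w in ["Children", "chose", "inquired", "inquiries"]]
# stems == ["child", "choose", "inquir", "inquir"]
```

Checking the exception list before stripping keeps irregular forms such as "found" away from the suffix rules, which could not reduce them to "find" in any case.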

Summary
• Text Retrieval Systems:
  • motivation
  • model
• Indexing Refinements:
  • Stop List
  • Stemming
  • Term Weight (week 8)
