Introduction to Text Retrieval CSE3201/4500 Information Retrieval Systems (c) Maria Indrawan 2004
Database Types • A spectrum from highly structured to ill-structured: Relational DB, XML collections, Text Collections, Multimedia Collections
Ill-structured data • Attributes: • Variable length records, fields • Repeated fields [non-normalised] • Mixed media • Often large • Often accessed by “novice” users • Need for both currency and completeness
Information Retrieval • Information retrieval is the term applied to such areas as: • text retrieval systems, library systems, citation retrieval systems, records management and archives, photo library applications, etc. • These systems are typical of variable-length record systems. • Text retrieval is a subset of Information Retrieval. • Research articles may use the term IR to mean text retrieval, especially in the 1970s, 1980s and 1990s.
Text Retrieval - Overview • Information retrieval: • a branch of database theory • specialises in managing retrieval of unstructured data • deals with large amounts of free-format text. • Key problem: • How to retrieve the appropriate pieces of unstructured data (e.g. documents) in response to a more or less structured query. • Response to a query: • Does not answer the query directly • Identifies relevant information.
Text Retrieval Characteristics • The document space is large. • The document space may or may not be structured. • The query may not be structured. • Exact matching, as used in relational databases, will not work effectively. • The objects to be retrieved are usually represented by surrogate records.
Surrogate Records • Most text retrieval systems rely on surrogate records rather than directly accessing the objects themselves. • The quality of the surrogate records often decides how well the system retrieves. • The structure of the surrogate records will affect how well they can be indexed or otherwise accessed.
Text Retrieval Processes • Representation • Storage • Organization • Retrieval • Presentation
Text Retrieval Processes Model
Retrieval Process
Indexing (Document Analysis) • Document (natural language text) • ANALYSE: keywords, stemming, thesauri replacement, (weight assignment) • STORE
Query Formulation • Controlled vocabulary: • keywords of the query must match keywords in the document collection
Indexing in Text Retrieval Systems
Indexed Files in Traditional Databases • An index is a lookup table which establishes a correspondence between a particular attribute (or attributes) and the address of the record in the file. • One named (physical) file, two logical files: • Data file: contains the full data records • Index file: its “records” consist of two fields, key value and address • The index file is small, so it is quick to search. • Addresses obtained from the index enable direct access to the data file. • Logically sequential access is also possible via the index.
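The two logical files described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `id` key, and the function names `lookup` and `scan_in_key_order` are hypothetical, and a list position stands in for a disk address.

```python
import bisect

# Hypothetical data file: full records stored in arrival order.
# The list position plays the role of the record's address.
data_file = [
    {"id": 30, "name": "Chen"},
    {"id": 10, "name": "Adams"},
    {"id": 20, "name": "Brown"},
]

# Index file: (key value, address) pairs, kept sorted by key value.
index_file = sorted((rec["id"], addr) for addr, rec in enumerate(data_file))


def lookup(key):
    """Search the small index, then access the data file directly."""
    keys = [k for k, _ in index_file]
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        address = index_file[pos][1]
        return data_file[address]
    return None


def scan_in_key_order():
    """Logically sequential access: follow the index, not the file order."""
    return [data_file[address] for _, address in index_file]
```

Here the data file is not stored in key order, yet `scan_in_key_order()` still returns the records sorted by key because it walks the index.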
Indexed Non-Sequential File • [Figure: Index and Data Records]
Indexed Sequential File • [Figure: Index and Data Records]
Indexing in Text Retrieval Systems • [Figure: Doc-1 (data record), Doc-2 (data record)]
Purpose of Indexing • To give a sufficiently general description of a document so that it can be retrieved with queries that concern the same subject as the document; • yet a sufficiently specific description so that the document will not be returned for queries that are unrelated to it.
Indexing • Manual indexing • Automatic indexing
Style of Indexing • The style of indexing depends on the form of queries, and vice versa. • We must decide whether the terms available for indexing are predefined (a controlled vocabulary) or chosen at the time of indexing (an uncontrolled vocabulary).
Controlled Vocabulary • Controlled vocabulary is a method of predetermining the terms which will be used in a specific domain so that • indexers will select from a limited set of terms • searchers can use terms knowing that they have been applied in an objective manner • index sets are reduced in size
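As a rough illustration of indexers selecting from a limited set of terms, the sketch below accepts only candidate terms that appear in a predefined vocabulary. The vocabulary contents and the function name `assign_terms` are made up for this example.

```python
# Hypothetical controlled vocabulary for a small computing collection.
CONTROLLED_VOCABULARY = {"databases", "information retrieval", "indexing"}


def assign_terms(candidate_terms, vocabulary=CONTROLLED_VOCABULARY):
    """Split candidate index terms into those allowed by the vocabulary
    and those that must be rejected or mapped to an allowed term."""
    accepted = [t for t in candidate_terms if t.lower() in vocabulary]
    rejected = [t for t in candidate_terms if t.lower() not in vocabulary]
    return accepted, rejected
```

A searcher can apply the same check to query terms, so queries and index entries draw on the same term set.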
Manual Indexing Methods • 1. Give the document a single code from a predefined list, e.g.: • the first letter of the first author’s family name • a Dewey Decimal number • 2. Assign several codes from a predefined list to a document, e.g.: • assign the Computing Reviews classification to articles. • 3. Assign to each document a set of descriptors that are not predefined. The descriptors may be words from the text of the document and/or from a thesaurus.
Manual Indexing - Analysis • Single-term indexing: simple and low index cost, but poor retrieval. • All other techniques require that a more complex index be maintained. • When a controlled vocabulary is used, a taxonomy of the document contents must be devised. Once devised, it must be adhered to from then on.
Manual Indexing - Analysis • Advantage: terms that never appear in the text but are extremely descriptive may be assigned to the document. • Disadvantages: • inter-indexer consistency is hard to maintain • inflexible view of documents • no control over the number of satisfying documents.
Automatic Indexing - A Basic Method • Assume that a document consists of just text and that we will derive our indexing terms from this text. • Break the text up into words, casefold, and index on every word. This technique is very simple and performs reasonably well.
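The basic method can be sketched as follows. This is a minimal example, not a definitive implementation: the tokenisation rule (runs of ASCII letters and digits), the sample documents, and the names `tokenize` and `build_index` are assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Casefold the text and break it into words.
    Assumption: a word is a run of ASCII letters or digits."""
    return re.findall(r"[a-z0-9]+", text.casefold())


def build_index(documents):
    """Index on every word: map each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    "Doc-1": "Text retrieval systems index text.",
    "Doc-2": "Relational databases use exact matching.",
}
index = build_index(docs)
```

Because of casefolding, a query for `text` finds Doc-1 whether the document wrote "Text" or "text".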
Automatic Indexing - Refinement • Refinements are language dependent: • refinements for English will differ from those for Chinese • Stop List • Stemming • Term Weighting
Indexing Refinement – Stop List • A list of common words. • Generally contains words that are not nouns, verbs, adjectives or adverbs. • A stop list might consist of a, the, an, is, be, ... • Common stop lists run from 10 to hundreds of words. • The exact contents of the stop list matter little; typically around 300 common words will do well. • The indexing process ignores the words listed in the stop list.
Stop Lists • Fox indicates that the first 20 stop words account for 31.19% of the English corpus. • Fox C. (1992). Lexical Analysis and Stoplists. In Frakes W.B. and Baeza-Yates R. (Eds.), Information Retrieval: Data Structures and Algorithms, Englewood Cliffs, NJ: Prentice-Hall. • The first 20 stop words: • the, of, and, to, a, in, that, is, was, he, for, it, with, as, not, his, on, be, at, by.
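A stop list can be applied as a simple filter during indexing. The sketch below uses the 20 stop words listed on this slide; the function name `remove_stop_words` and the sample sentence are illustrative assumptions.

```python
# The first 20 stop words as listed on the slide.
STOP_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was", "he",
    "for", "it", "with", "as", "not", "his", "on", "be", "at", "by",
}


def remove_stop_words(words):
    """Drop every word that appears in the stop list (case-insensitive)."""
    return [w for w in words if w.casefold() not in STOP_WORDS]
```

For example, `remove_stop_words("The index is stored in a file".split())` keeps only `index`, `stored` and `file`.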
Refinement - Stemming • Stemming conflates the variant forms of a word so that a single term covers the many variations that make up a concept. • This avoids exceedingly long “or” query statements. • Example: inquiry or inquired or inquiries • The process is performed after the “stop list” process. • Porter stemming algorithm: • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3): 130-137.
Stemming - Suffix • Most English meaning shifts for grammatical purposes are handled by suffixes. • Most retrieval systems allow for “trailing”, or suffix, truncation. • Example: • “inquir$” will retrieve documents containing the words “inquire”, “inquired”, “inquires”, “inquiring”, “inquiry”, etc.
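Trailing truncation amounts to a prefix match against the indexed vocabulary. The following sketch assumes `$` is the truncation symbol, as in the slide's example; the function name `truncation_match` and the small vocabulary are hypothetical.

```python
def truncation_match(pattern, vocabulary):
    """Return the indexed words matched by a query term.
    A trailing '$' means 'any suffix'; otherwise the match is exact."""
    if pattern.endswith("$"):
        stem = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(stem))
    return sorted(w for w in vocabulary if w == pattern)


# Hypothetical set of words found in the index.
vocabulary = {"inquire", "inquired", "inquires", "inquiring", "inquiry", "require"}
```

With this vocabulary, `truncation_match("inquir$", vocabulary)` returns the five "inquir" forms and leaves out `require`.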
Stemming - Prefix • Prefix stemming is not usually used in English text retrieval systems. • A prefix is a substantial modifier, sometimes even a negation. • Example: • flammable and inflammable. • Prefix stemming may be useful in chemical databases.
Stemming – Exception List • Irregularities in the language need to be handled with a “lookup list”. • Example: • Irregular plurals • woman => women • child => children • Irregular past tense • choose => chose • find => found
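One way to combine an exception list with suffix stripping is to consult the lookup list first and fall back to the stemmer only when the word is regular. In the sketch below the lookup maps each irregular form from the slide back to its base form. The suffix rule is a deliberately crude stand-in and is not the Porter algorithm; the name `normalise` is an assumption.

```python
# Lookup list built from the slide's examples: irregular form -> base form.
EXCEPTIONS = {
    "women": "woman",
    "children": "child",
    "chose": "choose",
    "found": "find",
}


def normalise(word):
    """Check the exception list first, then strip a common suffix.
    The suffix rule is a crude placeholder, NOT the Porter stemmer."""
    word = word.casefold()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Checking the exception list before stripping suffixes ensures irregular forms are never passed through the suffix rules.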
Summary • Text Retrieval Systems: • motivation • model • Indexing Refinements: • Stop List • Stemming • Term Weight (week 8)