
Computer Evaluation of Indexing and Text Processing


Presentation Transcript


  1. Computer Evaluation of Indexing and Text Processing. By G. Salton and M. E. Lesk, 1968. Journal of the ACM, 15(1), 8-36.

  2. SMART System • The subject of this paper is the storage and retrieval of information from text documents, and the determination of which procedures perform these tasks best.

  3. SMART System • The SMART system is a fully automatic document retrieval system. • It operates on the IBM 7094. • Seven facilities are incorporated. • The search process is controlled by the user.

  4. 7 Facilities of the SMART System • A system for separating words into stems and affixes, which can be used to reduce incoming texts to “word stem” form.
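
As a rough illustration of this word-stem reduction, here is a minimal suffix-stripping sketch in Python. It is not the actual SMART stemming procedure (which is dictionary-based); the suffix list is an assumption chosen for the example.

```python
# Illustrative sketch only: naive suffix stripping, not the SMART stemming dictionary.
SUFFIXES = ["ation", "ing", "ed", "es", "s"]  # assumed suffix list, longest first

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

print([stem(w) for w in ["computers", "indexing", "retrieval"]])
# ['computer', 'index', 'retrieval']
```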

  5. 7 Facilities of the SMART System • Synonym dictionary, or thesaurus, used to replace significant word stems by “concept numbers”, each concept representing a class of related word stems.
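
A minimal sketch of how such a look-up might work, with hypothetical stems and concept numbers that are not taken from the actual SMART thesaurus:

```python
# Illustrative sketch: hypothetical thesaurus mapping word stems to concept numbers.
THESAURUS = {
    "comput": 101, "machin": 101, "processor": 101,  # concept 101: computing machinery
    "retriev": 205, "search": 205,                   # concept 205: retrieval
}

def to_concepts(stems):
    """Replace each significant stem by its concept number, dropping unknown stems."""
    return [THESAURUS[s] for s in stems if s in THESAURUS]

print(to_concepts(["comput", "retriev", "the"]))  # [101, 205]
```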

  6. 7 Facilities of the SMART System • A hierarchical arrangement of the concepts included in the thesaurus, which makes it possible, given any concept, to find its “parent” in the hierarchy, its “sons” and “brothers”, or any of a set of cross-references.
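
One simple way to represent such a hierarchy is a parent-link table, sketched below with hypothetical concept numbers; the real SMART hierarchy also carries cross-references, which are omitted here.

```python
# Illustrative sketch: concept hierarchy as a parent-link table (hypothetical data).
PARENT = {205: 200, 206: 200, 200: 100}  # concept -> parent concept

def sons(concept):
    return [child for child, parent in PARENT.items() if parent == concept]

def brothers(concept):
    parent = PARENT.get(concept)
    return [] if parent is None else [s for s in sons(parent) if s != concept]

print(PARENT.get(205), sons(200), brothers(205))  # 200 [205, 206] [206]
```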

  7. 7 Facilities of the SMART System • Statistical association methods used to compute similarity coefficients between words, word stems, or concepts, based on co-occurrence patterns between entities in the sentences of a document or in the documents of the collection.
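
A sketch of one such similarity coefficient, here a Jaccard-style measure over sentence co-occurrence counts; the exact coefficient used in SMART may differ.

```python
# Illustrative sketch: co-occurrence-based association between two concepts,
# computed over sentences represented as sets of concept numbers.
def association(sentences, a, b):
    has_a = sum(1 for s in sentences if a in s)
    has_b = sum(1 for s in sentences if b in s)
    both = sum(1 for s in sentences if a in s and b in s)
    union = has_a + has_b - both
    return both / union if union else 0.0

sents = [{101, 205}, {101}, {205, 301}, {101, 205, 301}]
print(association(sents, 101, 205))  # 0.5
```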

  8. 7 Facilities of the SMART System • Syntactic analysis used to generate phrases consisting of several words or concepts. • Each phrase serves as an indicator of document content, provided certain pre-specified syntactic relations hold between the phrase components.

  9. 7 Facilities of the SMART System • Statistical Phrase Recognition operates like the preceding syntactic procedures by using a pre-constructed phrase dictionary, except that no test is made to ensure that the syntactic relationships between the phrase components are satisfied.

  10. 7 Facilities of the SMART System • Request-Document Matching allows a variety of methods to be used to compare analyzed documents with analyzed requests, including: • Concept weight adjustments • Variations in the length of document texts

  11. SMART Procedure • Stored documents and search requests are processed by the system without any prior manual analysis, using one of several hundred automatic content analysis methods, and those documents which most nearly match a given search request are identified.

  12. SMART Procedure • A correlation coefficient is computed to indicate the similarity between each document and each search request. • The documents are then ranked in order of the correlation coefficient. • A cutoff can be chosen, and the documents above the cutoff are withdrawn from the file and turned over to the user as answers to the search request.
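
A schematic version of this rank-and-cutoff step is sketched below; the correlation values are placeholders, and the fixed correlation threshold is an assumption (a rank-based cutoff could be used instead).

```python
# Illustrative sketch: rank documents by correlation with the request,
# then return those above an assumed cutoff value.
def retrieve(correlations, cutoff=0.25):
    """correlations: dict mapping doc_id -> correlation with the search request."""
    ranked = sorted(correlations.items(), key=lambda kv: kv[1], reverse=True)
    return [(doc, c) for doc, c in ranked if c >= cutoff]

print(retrieve({"d1": 0.72, "d2": 0.10, "d3": 0.31}))
# [('d1', 0.72), ('d3', 0.31)]
```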

  13. SMART System • The system can evaluate the effectiveness of various processing methods. • It runs searches over the same documents and requests while varying the analysis procedures. • The effectiveness of each procedure is measured by a common metric.

  14. SMART System Evaluation Metrics • R: standard recall • P: standard precision
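
For reference, the standard definitions of these two measures (in terms of the retrieved set and the relevant set) are:

$$R = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}}, \qquad P = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}}$$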

  15. Normalized Recall • n: size of the relevant document set • N: size of the total document collection • Ri: rank of the ith relevant document when the documents are arranged in decreasing order of their correlation with the search request
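
With these definitions, the normalized recall measure can be written as follows (a reconstruction of the paper's formula from the quantities defined above):

$$R_{\text{norm}} = 1 - \frac{\sum_{i=1}^{n} R_i - \sum_{i=1}^{n} i}{n(N - n)}$$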

  16. Normalized Precision
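
Normalized precision uses the same n, N, and Ri, replacing the ranks by their logarithms; the denominator below is a reconstruction of the normalizing factor (the worst-case value):

$$P_{\text{norm}} = 1 - \frac{\sum_{i=1}^{n} \log R_i - \sum_{i=1}^{n} \log i}{\log \dfrac{N!}{(N-n)!\,n!}}$$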

  17. Rank Recall
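
Rank recall, in the same notation, compares the ideal sum of ranks (1 + 2 + ... + n) with the actual sum of ranks of the relevant documents:

$$\text{rank recall} = \frac{\sum_{i=1}^{n} i}{\sum_{i=1}^{n} R_i} = \frac{n(n+1)/2}{\sum_{i=1}^{n} R_i}$$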

  18. Log Precision
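
Log precision is the logarithmic counterpart of rank recall, again in the same notation:

$$\text{log precision} = \frac{\sum_{i=1}^{n} \log i}{\sum_{i=1}^{n} \log R_i}$$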

  19. SMART System Overall Effectiveness Measure • The overall evaluation measures (normalized recall, normalized precision, rank recall, and log precision) are averaged over the search requests. • A graph is generated using a five-step process defined in the paper (p. 14).

  20. Fourteen Statistics are Generated for Each Search Request and Processing Method • Four global statistics: rank recall, log precision, normalized recall, and normalized precision

  21. Fourteen Statistics are Generated for Each Search Request and Processing Method • Ten local statistics: standard precision at each of ten recall levels

  22. Comparing Two Methods • The fourteen statistics are compared between the two methods and a difference is calculated for each. • A probability value is then computed for each difference. • Finally, an aggregate probability value is computed.

  23. Two Testing Procedures • t-test: requires normally distributed data • Sign test: normality is not required
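
As a brief sketch of how these two tests might be applied to per-request scores from two methods (using SciPy; the score values below are made up, and the sign test is implemented via a binomial test):

```python
# Illustrative sketch: paired t-test and sign test on per-request evaluation scores.
from scipy import stats

method_a = [0.62, 0.55, 0.71, 0.48, 0.66]  # e.g. normalized recall per request
method_b = [0.58, 0.57, 0.64, 0.41, 0.60]

# Paired t-test: assumes the per-request differences are roughly normal.
t_stat, t_p = stats.ttest_rel(method_a, method_b)

# Sign test: among non-tied pairs, count how many favor method A, then
# test against a fair coin with a binomial test (normality not required).
diffs = [a - b for a, b in zip(method_a, method_b) if a != b]
wins = sum(d > 0 for d in diffs)
sign_p = stats.binomtest(wins, n=len(diffs), p=0.5).pvalue

print(t_p, sign_p)
```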

  24. Three Document Collections • Computer science (IRE-3): a set of 780 abstracts of documents in the computer literature, published from 1959-1961. • Documentation (ADI): a set of 82 short papers, averaging 1380 words in length, presented at the 1963 Annual Meeting of the American Documentation Institute. • Aerodynamics (CRAN-1): a set of 200 abstracts of documents used by the second ASLIB Cranfield Project.

  25. Variable of Interest • Document Length • Conclusion • Document abstracts are more effective for content analysis purposes than document titles alone; further improvements appear possible when abstracts are replaced by larger text portions; however, the increase in effectiveness is not large enough to reach the unequivocal conclusion that full text processing is always superior.

  26. Term Weights • Weights are assigned to the indicators in proportion to their presumed importance. • Such weights can be derived in part from the frequency of occurrence of the original text words which give rise to the various indicators. • On the whole, it appears that weighted content indicators produce better results than unweighted ones.
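
A minimal sketch of frequency-derived weights (simple relative term frequency; the actual SMART weighting schemes may differ in detail):

```python
# Illustrative sketch: weight each indicator by its relative frequency in the text.
from collections import Counter

def weighted_indicators(stems):
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

print(weighted_indicators(["comput", "retriev", "comput", "index"]))
# {'comput': 0.5, 'retriev': 0.25, 'index': 0.25}
```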

  27. Matching Functions • Cosine Correlation • Overlap Correlation

  28. Matching Functions • Where q and d are considered to be n-dimensional vectors of terms representing an analyzed query q and an analyzed document d, respectively, in a space of n terms assignable as information identifiers. • Both functions range from 0 for no match to 1 for perfect identity between the respective vectors.
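
The cosine correlation is the standard formula below. The overlap function shown is one common formulation consistent with the 0-to-1 range described on the slide; it may differ in detail from the paper's exact definition.

$$\cos(q, d) = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^{2}}\;\sqrt{\sum_{i=1}^{n} d_i^{2}}}, \qquad \text{overlap}(q, d) = \frac{\sum_{i=1}^{n} \min(q_i, d_i)}{\min\left(\sum_{i=1}^{n} q_i,\ \sum_{i=1}^{n} d_i\right)}$$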

  29. Matching Function Conclusion • The cosine correlation function is more useful as a measure of document-request similarity than the overlap function.

  30. Language Normalization • Suffix “s” process • Word-stem dictionary • Synonym dictionary • Statistical Phrase Dictionary • Concept Association Method

  31. Synonym Recognition • It appears that dictionaries providing synonym recognition produce statistically significant improvements in retrieval effectiveness compared with the word-stem matching process.

  32. Phrase Recognition • The phrase generation methods, whether implemented by dictionary look-up or by statistical association processes, appear to offer improvements in retrieval effectiveness at some recall levels; however, the improvement is not sufficient when averaged over many search requests.

  33. Hierarchical Expansion • Hierarchical arrangements of subject identifiers are used in many standard library classification systems, and are also incorporated into many non-conventional information systems. • Subject Hierarchies are useful for representation of generic inclusion relations between terms, and they serve to broaden, or narrow, or otherwise “expand” a given description.

  34. Hierarchical Expansion • The conclusion is that more often the change specified by the hierarchy option is too violent, and the average performance of most hierarchy procedures does not appear to be sufficiently promising to advocate their immediate incorporation in an analysis system.

  35. Manual Indexing • The Cranfield collection was used for this experiment because it had been indexed by trained indexers, with an average of 30 terms per document. • Based on the results, there was no significant difference between manual indexing and automatic text processing.

  36. Conclusions • The results indicate that automatic systems should use weighted terms, derived from excerpts whose length is at least equivalent to that of an abstract; furthermore, synonym dictionaries should be incorporated where available. Other, local improvements may be obtained by incorporating phrases, hierarchies, and word-word association techniques.
