Modern Information Retrieval Chapter 7: Text Operations

Ricardo Baeza-Yates, Berthier Ribeiro-Neto

Document Preprocessing • Lexical analysis of the text • Elimination of stopwords • Stemming • Selection of index terms • Construction of term categorization structures

Lexical Analysis of the Text • Word separators • space • digits • hyphens • punctuation marks • the case of the letters

Elimination of Stopwords • A list of stopwords • words that are too frequent among the documents • articles, prepositions, conjunctions, etc. • Can reduce the size of the indexing structure considerably • Problem • Search for “to be or not to be”?
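A minimal sketch of lexical analysis followed by stopword elimination. The stop list `STOPWORDS` and the helper names are made up for illustration, and treating digits, hyphens, and punctuation as separators is only one of several possible policies.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists hold hundreds of words.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "be", "or", "not", "in", "is"}

def tokenize(text):
    """Lowercase the text and treat every non-letter as a word separator."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(tokenize("The elimination of stopwords in a text")))
print(remove_stopwords(tokenize("To be or not to be")))
```

With this stop list the second query loses every token, which is the problem the slide points out.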

Stemming • Example • connect, connected, connecting, connection, connections (all reduce to the stem connect) • effectiveness --> effective --> effect • picnicking --> picnic • king -/-> k (stripping -ing here would be wrong) • Stemming strategies • affix removal: intuitive, simple • table lookup • successor variety • n-gram
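A rough sketch of affix (suffix) removal. The suffix list and the minimum stem length are assumptions chosen for this example, not the rules of any published stemmer; the length guard is what keeps "king" from collapsing to "k". Cases such as picnicking --> picnic need extra rules that this sketch does not attempt.

```python
# Hypothetical suffix list, longest first; not the Porter rule set.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")
MIN_STEM_LEN = 3  # assumed guard so that "king" is not reduced to "k"

def strip_suffix(word):
    """Remove the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem
    return word

for w in ["connect", "connected", "connecting", "connection", "connections", "king"]:
    print(w, "-->", strip_suffix(w))
```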

Index Terms Selection • Motivation • A sentence is usually composed of nouns, pronouns, articles, verbs, adjectives, adverbs, and connectives. • Most of the semantics is carried by the noun words. • Identification of noun groups • A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

Thesauri • Peter Roget, 1988 • Example: cowardly adj. Ignobly lacking in courage: cowardly turncoats. Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang). • A controlled vocabulary for indexing and searching

The Purpose of a Thesaurus • To provide a standard vocabulary for indexing and searching • To assist users with locating terms for proper query formulation • To provide classified hierarchies that allow the broadening and narrowing of the current query request

Thesaurus Term Relationships • BT: broader • NT: narrower • RT: non-hierarchical, but related
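One way to picture how BT, NT, and RT links support broadening and narrowing a query is a small lookup table. The entries and the function name `expand_query` below are invented for illustration only.

```python
# Hypothetical thesaurus entries, invented for illustration only.
THESAURUS = {
    "dog": {"BT": ["mammal"], "NT": ["poodle", "terrier"], "RT": ["kennel"]},
    "mammal": {"BT": ["animal"], "NT": ["dog", "cat"], "RT": []},
}

def expand_query(term, relation):
    """Return the term followed by its BT, NT, or RT neighbours."""
    if relation not in ("BT", "NT", "RT"):
        raise ValueError("relation must be 'BT', 'NT', or 'RT'")
    return [term] + THESAURUS.get(term, {}).get(relation, [])

print(expand_query("dog", "BT"))  # broaden the query
print(expand_query("dog", "NT"))  # narrow the query
```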

Term Selection Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.

Automatic Indexing • Indexing: • assign identifiers (index terms) to text documents. • Identifiers: • single-term vs. term phrase • controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules, …) • objective vs. nonobjective text identifiers (cataloging rules define, e.g., author names, publisher names, dates of publication, …)

Two Issues • Issue 1: indexing exhaustivity • exhaustive: assign a large number of terms • nonexhaustive • Issue 2: term specificity • broad terms (generic): cannot distinguish relevant from nonrelevant documents • narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant

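Parameters of retrieval effectiveness • Recall: the fraction of the relevant items that are retrieved • Precision: the fraction of the retrieved items that are relevant • Goal: high recall and high precision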

[Figure: the collection divided into relevant and nonrelevant items, with the retrieved part marked; regions labeled a, b, c, d]

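A Joint Measure • F-score: $F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}$, with P the precision and R the recall • β is a parameter that encodes the relative importance of recall and precision. • β=1: equal weight • β<1: precision is more important • β>1: recall is more important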
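A small sketch of recall, precision, and the F-score, assuming the usual weighted harmonic mean form F = (1 + β²)PR / (β²P + R). The document identifiers are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for two sets of document identifiers."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

retrieved = {1, 2, 3, 4}          # made-up result set
relevant = {2, 4, 5, 6, 7, 8}     # made-up relevance judgments
p, r = precision_recall(retrieved, relevant)
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_score(p, r, beta), 3))
```

Here precision (0.5) exceeds recall (1/3), so the score drops as β grows and recall gains weight.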

Choices of Recall and Precision • Both recall and precision vary from 0 to 1. • Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall. • In many circumstances, recall and precision that both lie between 0.5 and 0.6 are more satisfactory for the average user.

Term-Frequency Consideration • Function words • for example, "and", "or", "of", "but", … • the frequencies of these words are high in all texts • Content words • words that actually relate to document content • varying frequencies in the different texts of a collection • indicate term importance for content

A Frequency-Based Indexing Method • Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. • Compute the term frequency tf_ij for all remaining terms T_j in each document D_i, specifying the number of occurrences of T_j in D_i. • Choose a threshold frequency T, and assign to each document D_i all terms T_j for which tf_ij > T.
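The three steps above can be sketched as follows. The stop list and the sample document are invented; `threshold` plays the role of T.

```python
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in", "is", "to"}  # hypothetical stop list

def index_terms(doc_tokens, threshold):
    """Assign every non-stopword whose term frequency exceeds the threshold."""
    tf = Counter(t for t in doc_tokens if t not in STOP_LIST)
    return {term for term, freq in tf.items() if freq > threshold}

doc = "the retrieval of text and the indexing of text in text retrieval".split()
print(index_terms(doc, threshold=1))
```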

Inverse Document Frequency • Inverse Document Frequency (IDF) for term T_j: $idf_j = \log\frac{N}{df_j}$, where N is the number of documents in the collection and df_j (document frequency of term T_j) is the number of documents in which T_j occurs. • To serve both recall and precision, good index terms should occur frequently in individual documents but rarely in the remainder of the collection.

TFxIDF • Weight w_ij of a term T_j in a document D_i: $w_{ij} = tf_{ij} \times \log\frac{N}{df_j}$ • Eliminating common function words • Computing the value of w_ij for each term T_j in each document D_i • Assigning to the documents of a collection all terms with sufficiently high (tf x idf) factors
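A minimal sketch of TF x IDF weighting, assuming the weight w_ij = tf_ij x log(N / df_j). The natural logarithm and the toy documents are arbitrary choices made for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return weights

docs = [
    ["apple", "apple", "fruit"],
    ["fruit", "banana"],
    ["fruit", "cherry", "cherry", "cherry"],
]
for w in tfidf(docs):
    print(w)
```

A term that occurs in every document ("fruit" here) gets weight zero, whatever its term frequency.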

Term-discrimination Value • Useful index terms • Distinguish the documents of a collection from each other • Document space • When two documents are assigned very similar term sets, the corresponding points in the document space appear close together • When a high-frequency term without discriminating power is assigned, it increases the document space density

[Figure: A Virtual Document Space. Panels: original state; after assignment of a good discriminator; after assignment of a poor discriminator]

Good Term Assignment • When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection. • This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.

Poor Term Assignment • A high-frequency term is assigned that does not discriminate between the objects of a collection. Its assignment will render the documents more similar. • This is reflected in an increase in document space density.

Term Discrimination Value • Definition: dv_j = Q - Q_j, where Q and Q_j are the space densities before and after the assignment of term T_j. • dv_j > 0: T_j is a good term; dv_j < 0: T_j is a poor term.

Variations of Term-Discrimination Value with Document Frequency • [Figure: dv_j against document frequency, from 0 to N] • Low frequency: dv_j = 0 • Medium frequency: dv_j > 0 • High frequency: dv_j < 0

tf_ij x dv_j • w_ij = tf_ij x dv_j • compared with $w_{ij} = tf_{ij} \times \log\frac{N}{df_j}$ • $\log\frac{N}{df_j}$: decreases steadily with increasing document frequency • dv_j: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger.

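Document Centroid • Issue: efficiency problem, N(N-1) pairwise similarities • Document centroid: C = (c_1, c_2, c_3, ..., c_t), where c_j is computed from the weights w_ij over all N documents and w_ij is the weight of the j-th term in document i. • Space density: $Q = \frac{1}{N}\sum_{i=1}^{N} \mathrm{sim}(C, D_i)$, the average similarity between the centroid and each document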
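A sketch of the centroid shortcut and of the discrimination value dv_j = Q - Q_j. It assumes cosine similarity, a centroid taken as the per-term mean of the document vectors, and density taken as the average document-to-centroid similarity; the weight matrix is invented. Because cosine similarity ignores vector length, a summed centroid would give the same densities.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def density(docs):
    """Average similarity between the centroid and each document (N comparisons)."""
    t = len(docs[0])
    centroid = [sum(d[j] for d in docs) / len(docs) for j in range(t)]
    return sum(cosine(centroid, d) for d in docs) / len(docs)

def discrimination_value(docs, j):
    """dv_j = Q - Q_j: density without term j minus density with term j assigned."""
    without_j = [d[:j] + [0.0] + d[j + 1:] for d in docs]
    return density(without_j) - density(docs)

# Made-up weights: term 0 occurs in every document, terms 1-3 in one document each.
docs = [
    [1.0, 3.0, 0.0, 0.0],
    [1.0, 0.0, 3.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
for j in range(4):
    print(j, round(discrimination_value(docs, j), 3))
```

In this toy collection the term shared by all documents gets a negative value and the rarer terms get positive values.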

Probabilistic Term Weighting • Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection • Definition: given a user query q and the ideal answer set of relevant documents • From decision theory, the best ranking algorithm for a document D: $g(D) = \log\frac{\Pr(D|rel)}{\Pr(D|nonrel)} + \log\frac{\Pr(rel)}{\Pr(nonrel)}$

Probabilistic Term Weighting • Pr(rel), Pr(nonrel): the document's a priori probabilities of relevance and nonrelevance • Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets

Assumptions • Terms occur independently in documents

Derivation Process

Given a document D = (d_1, d_2, …, d_t) • Assume each d_i is either 0 (absent) or 1 (present). • For a specific document D, with x_i the value of d_i: • Pr(x_i=1|rel) = p_i • Pr(x_i=0|rel) = 1 - p_i • Pr(x_i=1|nonrel) = q_i • Pr(x_i=0|nonrel) = 1 - q_i

Term Relevance Weight

Issue • How to compute p_j and q_j? • p_j = r_j / R • q_j = (df_j - r_j) / (N - R) • r_j: the number of relevant documents that contain term T_j • R: the total number of relevant documents • N: the total number of documents

Estimation of Term-Relevance • The occurrence probability of a term in the nonrelevant documents, q_j, is approximated by the occurrence probability of the term in the entire document collection: q_j = df_j / N • The occurrence probabilities of the terms in the small number of relevant documents are assumed equal, using a constant value p_j = 0.5 for all j.

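Comparison • With p_j = 0.5 and q_j = df_j / N: $tr_j = \log\frac{p_j(1-q_j)}{q_j(1-p_j)} = \log\frac{N - df_j}{df_j}$ • When N is sufficiently large, N - df_j ≈ N, so $tr_j \approx \log\frac{N}{df_j} = idf_j$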
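A short numerical check of this comparison, assuming the term relevance weight has the form tr = log(p(1 - q) / (q(1 - p))). The collection statistics and relevance counts below are made up, and the function names are illustrative.

```python
import math

def term_relevance(p, q):
    """tr = log(p (1 - q) / (q (1 - p))), valid for 0 < p < 1 and 0 < q < 1."""
    return math.log(p * (1 - q) / (q * (1 - p)))

def estimate_from_counts(r, R, df, N):
    """p = r / R and q = (df - r) / (N - R), usable when relevance counts are known."""
    return r / R, (df - r) / (N - R)

N, df = 1_000_000, 100                      # made-up collection statistics
tr_no_info = term_relevance(0.5, df / N)    # no relevance data: p = 0.5, q = df / N
idf = math.log(N / df)
print(tr_no_info, idf)                      # nearly equal when N is much larger than df

p, q = estimate_from_counts(r=8, R=10, df=df, N=N)
print(term_relevance(p, q))                 # weight when relevance counts are available
```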

Estimation of Term-Relevance • Estimate the number of relevant documents r_j in the collection that contain term T_j as a function of the known document frequency df_j of the term T_j. • p_j = r_j / R • q_j = (df_j - r_j) / (N - R) • R: an estimate of the total number of relevant documents in the collection.

Summary • Inverse document frequency, idf_j • tf_ij * idf_j (TFxIDF) • Term discrimination value, dv_j • tf_ij * dv_j • Probabilistic term weighting, tr_j • tf_ij * tr_j • Global properties of terms in a document collection
