CSM06 Information Retrieval

CSM06 Information Retrieval Lecture 3: Text IR part 2 Dr Andrew Salway a.salway@surrey.ac.uk

Recap from Lecture 2 • IR Systems treat documents as ‘bags of words’: common document preprocessing techniques - tokenization, stop lists and stemming • Data about the occurrence of words in documents is stored in postings data structures: • the simple inverted index stores the minimum data required for full-text indexing/retrieval; • the extra data stored by the STAIRS data model facilitates more IR functionality

Recap from Lecture 2 How documents are matched / ranked for a query is determined by the IR model used: • Boolean Model: exact matching of documents according to query terms / Boolean operators (underpinned by set theory). • Vector Space Model: documents and queries are represented by vectors in the same vector space; dimensions are frequencies of keywords. The similarity of each document to a query is measured by the cosine of the angle between their vectors (cosine similarity), which gives a ranking of documents. • VSM Lab Exercise: create a frequency table and see which documents are ranked highest for queries of your choosing. (System Quirk will help in making the frequency table; Microsoft Excel will help in calculating cosine similarities and ranking.)
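
As a reminder of the Vector Space Model recap above, here is a minimal Python sketch of ranking documents by the cosine measure. The keyword list, the frequency table and the query are hypothetical values invented for illustration; they are not taken from the lab exercise.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical frequency table: one row per document, one column per keyword.
keywords = ["retrieval", "index", "query"]
doc_vectors = {
    "doc1": [3, 0, 1],
    "doc2": [0, 2, 0],
    "doc3": [1, 1, 1],
}
query_vector = [1, 0, 1]  # the query mentions "retrieval" and "query"

def rank_documents(query, docs):
    """Return (document name, score) pairs, highest cosine first."""
    scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = rank_documents(query_vector, doc_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up numbers doc2 shares no keywords with the query, so its score is zero and it is ranked last.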

Lecture 3: OVERVIEW • TFIDF: Term Frequency - Inverse Document Frequency, an example of term weighting • Semi-automatic query modification with Relevance Feedback • Automatic creation of term clusters for query expansion • Latent Semantic Indexing

Term Weighting • In the simplest case an index is binary, i.e. either a keyword is present in a document or it is not • However, we may want to deal with ‘degrees of aboutness’ to characterise a document more accurately • Use a weighting to capture the strength of the relationship between a keyword and a document • As a starting point we can consider the frequency with which a term occurs in a document

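TF-IDF (Term Frequency - Inverse Document Frequency) • Term frequency alone does not show how well a word discriminates between documents: a word that occurs in most documents is a poor discriminator, however frequent it is • To incorporate a word’s discriminating power into the weighting, consider its inverse document frequency, which takes into account the number of different documents in which the term occurs • This leads to the widely used TF-IDF weighting for index terms, and also for terms in long queries (Belew 2000, Section 3.6)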
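
The slide does not give a formula, so the following Python sketch assumes one common variant of the weighting: weight = tf × log(N / df), where tf is the term's frequency in the document, N is the number of documents and df is the number of documents containing the term. Other variants (different logarithms, smoothing, normalisation) exist, and the tokenised documents below are hypothetical.

```python
import math
from collections import Counter

# Hypothetical collection: each document is already tokenised.
documents = {
    "doc1": ["query", "index", "query", "term"],
    "doc2": ["index", "term"],
    "doc3": ["term", "vector", "vector"],
}

def tf_idf(docs):
    """Return {document: {term: weight}} using the assumed tf * log(N / df) variant."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency within this document
        weights[name] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tf_idf(documents)
for name, row in weights.items():
    print(name, {t: round(w, 3) for t, w in row.items()})
```

In this toy collection "term" occurs in every document, so its weight is zero everywhere, while "query" occurs twice in a single document and receives the highest weight in doc1.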

Modifying Query with Relevance Feedback • User makes initial query and system returns ranked documents • User identifies the top-ranked documents as relevant or irrelevant • The document vectors of the top-ranked documents are used to modify the initial query vector, e.g. using the Standard-Rocchio equation • The effect is to emphasise appropriate index terms in the query (and de-emphasise others), with no ‘technical’ input from the user • May also introduce new query terms (Baeza-Yates and Ribeiro-Neto 1999, pp. 118-120)

Relevance feedback (RelFbk): the vector view [Figure: Belew (2000), Fig. 4.4]

Making a new query vector… • The query vector is moved towards the centroid of the documents judged relevant by the user. • It may also be moved away from the centroid of the irrelevant documents, but these are less likely to be clustered; alternatively, select one irrelevant document, such as the highest-ranked one, and move the query vector away from it. (Belew 2000, Fig. 4.6)

Standard-Rocchio equation
qm = αq + (β / Dr) · Σ(relevant document vectors) - (γ / Di) · Σ(irrelevant document vectors)
• q = original query vector
• qm = modified query vector
• α, β, γ are constants
• Dr = number of documents marked relevant by the user
• Di = number of documents marked irrelevant by the user
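
The Standard-Rocchio equation can be sketched in Python as follows. This is a minimal illustration: the function name `rocchio`, the default values of α, β and γ, and the example vectors are all assumptions chosen for the sketch, not values from the lecture.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Return alpha*q + beta*(centroid of relevant) - gamma*(centroid of irrelevant).

    The default constants are illustrative assumptions only.
    """
    def centroid(vectors):
        # Sum of the vectors divided by how many there are (zero vector if none).
        if not vectors:
            return [0.0] * len(query)
        return [sum(component) / len(vectors) for component in zip(*vectors)]

    rel_centroid = centroid(relevant)
    irr_centroid = centroid(irrelevant)
    return [alpha * q + beta * r - gamma * i
            for q, r, i in zip(query, rel_centroid, irr_centroid)]

# Hypothetical three-term example: two relevant documents, one irrelevant.
modified = rocchio(
    query=[1, 0, 1],
    relevant=[[2, 0, 0], [0, 0, 2]],
    irrelevant=[[0, 4, 0]],
)
print(modified)  # [1.5, -1.0, 1.5]
```

The first and third terms are emphasised because they occur in the relevant documents, while the second term, which occurs only in the irrelevant document, receives a negative weight.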

EXERCISE Consider a query vector vq, two documents returned by an information retrieval system that a user considers relevant, with vectors v1 and v2, and three returned documents considered irrelevant, with vectors v3, v4 and v5. Compute a modified query using the Standard-Rocchio equation with α = β = γ = 1.
vq = (2, 1, 0, 0)
v1 = (0, 4, 0, 2)
v2 = (0, 3, 0, 1)
v3 = (1, 0, 2, 0)
v4 = (0, 1, 4, 0)
v5 = (1, 1, 0, 0)

Creating Term Clusters for Query Expansion • Generally a query may be expanded by adding index terms that are related to the terms in the initial query • Related terms may be: • synonyms • stems/grammatical variants • co-occurring terms • Relationships between terms may be calculated by analysing the results set (local analysis), or by analysing the whole document collection (global analysis)

Creating Term Clusters for Query Expansion • Aim is to produce clusters of related terms by automatic analysis of the local document set • Restricting the analysis to the local document set may improve the quality of the clusters • Different techniques for measuring term correlation give different kinds of clusters: • Association clusters • Metric clusters • Scalar clusters For more details, see Baeza-Yates and Ribeiro-Neto 1999, pp. 123-7

Association Clusters • Based on how often terms/stems co-occur in documents, i.e. if term A and term B appear together in a large number of documents then they may be related (at least in the local context) • Query is expanded by adding the n most correlated terms for each term in the original query [BY&RN, p.125]
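
A minimal Python sketch of association clusters, assuming one common formulation: the correlation of two terms is the sum over the local documents of the product of their frequencies, c(u,v) = Σ f(u,d) × f(v,d), optionally normalised as c(u,v) / (c(u,u) + c(v,v) - c(u,v)). The terms and frequencies below are hypothetical, and the function names are invented for the sketch.

```python
# Hypothetical local document set: frequency of each term in three retrieved documents.
frequencies = {
    "search": [2, 1, 0],
    "engine": [1, 1, 0],
    "car":    [0, 0, 3],
    "web":    [1, 0, 1],
}

def association(u, v, freq=frequencies):
    """Unnormalised correlation: sum over documents of f(u, d) * f(v, d)."""
    return sum(fu * fv for fu, fv in zip(freq[u], freq[v]))

def normalised_association(u, v, freq=frequencies):
    """Correlation scaled into [0, 1]; equals 1 when u and v are the same term."""
    c_uv = association(u, v, freq)
    denom = association(u, u, freq) + association(v, v, freq) - c_uv
    return c_uv / denom if denom else 0.0

def expand(term, n, freq=frequencies):
    """Return the n terms most strongly associated with `term`."""
    others = [t for t in freq if t != term]
    others.sort(key=lambda t: normalised_association(term, t, freq), reverse=True)
    return others[:n]

print(expand("search", 2))  # ['engine', 'web']
```

Here "car" never occurs in the same document as "search", so it has zero correlation and would not be added to a query containing "search".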

Metric Clusters • Based on how close terms/stems are in documents (i.e. where they occur rather than how often) [BY&RN p. 126]
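
A minimal Python sketch of a metric correlation, assuming the common formulation in which every pair of occurrences of the two terms within the same document contributes 1/r, where r is the distance between them in words; occurrences in different documents contribute nothing. The two tokenised documents are hypothetical.

```python
# Hypothetical local document set, already tokenised.
docs = [
    ["web", "search", "engine", "for", "the", "web"],
    ["search", "the", "web", "with", "an", "engine"],
]

def metric_correlation(term_u, term_v, documents):
    """Sum of 1/distance over all pairs of occurrences within the same document."""
    total = 0.0
    for tokens in documents:
        positions_u = [i for i, t in enumerate(tokens) if t == term_u]
        positions_v = [i for i, t in enumerate(tokens) if t == term_v]
        for i in positions_u:
            for j in positions_v:
                if i != j:  # skip a term paired with its own occurrence
                    total += 1.0 / abs(i - j)
    return total

print(metric_correlation("search", "web", docs))     # 1.75
print(metric_correlation("search", "engine", docs))  # about 1.2
```

In this toy data "search" and "engine" co-occur in both documents, exactly as "search" and "web" do, but "web" tends to sit closer to "search", so it gets the higher metric correlation.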

Scalar Clusters • Scalar clusters are formed by grouping terms/stems which correlate in similar ways with other terms • For each term, calculate a vector whose components are that term’s correlation (association or metric) with each of the other terms • Calculate the cosine similarity between the vectors of two terms to get the scalar correlation between those terms [BY&RN p. 127]
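
A minimal Python sketch of scalar correlation, assuming the unnormalised association correlation (sum over documents of the product of frequencies) is used to build each term's vector; a metric correlation could be substituted. The frequencies are hypothetical, and each vector here includes the term's correlation with itself, which is a simplifying assumption.

```python
import math

# Hypothetical local document set: frequency of each term in three retrieved documents.
frequencies = {
    "search": [2, 1, 0],
    "engine": [1, 1, 0],
    "car":    [0, 0, 3],
    "web":    [1, 0, 1],
}
terms = list(frequencies)

def association(u, v):
    """Unnormalised association: sum over documents of f(u, d) * f(v, d)."""
    return sum(fu * fv for fu, fv in zip(frequencies[u], frequencies[v]))

def correlation_vector(u):
    """The term's correlation with every term in the vocabulary, in a fixed order."""
    return [association(u, t) for t in terms]

def scalar_correlation(u, v):
    """Cosine between the correlation vectors of u and v."""
    vec_u, vec_v = correlation_vector(u), correlation_vector(v)
    dot = sum(a * b for a, b in zip(vec_u, vec_v))
    norm = math.sqrt(sum(a * a for a in vec_u)) * math.sqrt(sum(b * b for b in vec_v))
    return dot / norm if norm else 0.0

for other in ["engine", "web", "car"]:
    print(other, round(scalar_correlation("search", other), 3))
```

Two terms score highly when their patterns of correlation with the rest of the vocabulary are similar; with this toy data "engine" is by far the closest neighbour of "search".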

EXERCISE Which keyword (K2-K4) clusters most closely to keyword K1 using association clusters?

Latent Semantic Indexing WHY? • PROBLEMS for Vector Model: • Size of frequency table (a matrix) becomes prohibitive (e.g. hundreds of terms and millions of texts); the matrix is also sparse • Synonymy: different people may use different words to mean the same thing • VSM assumes that the frequency of each keyword is independent of the frequencies of all other keywords

Latent Semantic Indexing • LSI involves dimensionality reduction: the dimensions in the reduced space are taken to reflect the ‘latent semantics’ • In the VSM, making each term a dimension assumes that the terms are orthogonal • LSI exploits term co-occurrence: co-occurring terms are projected onto the same dimensions

Latent Semantic Indexing So… • Storage space is saved • Texts and queries can be recognised as being similar even if they don’t share the same words (so long as they do share words that have been projected onto the same dimension) ***In the latent semantic space a query and a document can have a cosine similarity close to 1 even if they do not share any terms***

Latent Semantic Indexing HOW… Singular Value Decomposition • SVD is a technique for dimensionality reduction (cf. eigenvector analysis / Principal Components Analysis) • In effect SVD takes the (very large) frequency table matrix and represents it as three smaller matrices • The dimensions of the reduced space correspond to the axes of greatest variance: the question remains of how many dimensions to keep • NB: tools such as MATLAB can be used to perform SVD
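
The slide gives no notation for the decomposition, so the following is a sketch in standard linear-algebra notation. The symbols are assumptions: A is the term-by-document frequency matrix with t terms and d documents, k is the number of dimensions kept, q is a query vector in the original term space, and the hatted vectors are representations in the reduced space.

```latex
\[
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{T},
\qquad
U_k \in \mathbb{R}^{t \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{d \times k},\quad
\sigma_1 \ge \dots \ge \sigma_k > 0
\]
\[
\hat{q} \;=\; \Sigma_k^{-1} U_k^{T} q,
\qquad
\hat{d}_j \;=\; \text{row } j \text{ of } V_k,
\qquad
\operatorname{sim}(q, d_j) \;=\; \cos\bigl(\hat{q}, \hat{d}_j\bigr)
\]
```

The three matrices are small when k is much smaller than t and d. Queries and documents are compared in the k-dimensional space, which is why a query can match a document with which it shares no terms. Some presentations scale both vectors by the singular values before taking the cosine; the choice shown here is one convention, not the only one.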

Set Reading (LECTURE 3) • See references in previous slides for reading about TFIDF, relevance feedback and term clusters • For an overview of Latent Semantic Indexing (LSI) – http://lsi.research.telcordia.com/lsi/papers/execsum.html

Further Reading (LECTURE 3) • For more about LSI, see: Deerwester et al. (1990), ‘Indexing by Latent Semantic Analysis’, Journal of the American Society for Information Science 41(6), 391-407. http://lsi.research.telcordia.com/lsi/papers/JASIS90.pdf

Lecture 3: LEARNING OUTCOMES After this lecture you should be able to: • Explain how TFIDF weights terms • Explain how relevance feedback can be used to automatically modify a query, and apply the Standard-Rocchio equation • Explain how term clusters can be used for automatic query expansion, and calculate association clusters • Explain how LSI modifies the VSM • Critically discuss how each of these techniques could improve an information retrieval system

Reading ahead for LECTURE 4 If you want to read about next week’s lecture topics, see: Brin and Page (1998), “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, SECTIONS 1 and 2 http://www-db.stanford.edu/~backrub/google.html Hock (2001), The extreme searcher's guide to web search engines, pages 25-31. (An overview of the factors used to rank webpages.) AVAILABLE in Main Library collection and in Library Article Collection.
