
Chapter 5: Query Operations


Presentation Transcript


  1. Chapter 5: Query Operations Hassan Bashiri April 2009

  2. Cross-Language • What is CLIR? • Users enter their query in one language and the search engine retrieves relevant documents in other languages. [diagram: English Query → Retrieval System → French Documents]

  3. Cross-Language Text Retrieval [taxonomy diagram] • Query Translation vs. Document Translation (Text Translation or Vector Translation) • Controlled Vocabulary vs. Free Text • Knowledge-based: Ontology-based, Dictionary-based, Thesaurus-based • Corpus-based: Parallel or Comparable corpora; Term-aligned, Sentence-aligned, Document-aligned, or Unaligned

  4. Query Language • Visual languages • Example: a library is shown on the screen and the user acts on it directly: takes books, opens catalogs, etc. • Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

  5. IR Interface • Query interface • Selection interface • Examination interface • Document delivery

  6. Retrieval System Model [diagram: User, Query Formulation, Detection, Selection, Examination, Docs Delivery; Docs, Indexing, Index]

  7. Starfield

  8. Query Formulation • No detailed knowledge of collection and retrieval environment • difficult to formulate queries well designed for retrieval • Need many formulations of queries for good retrieval • First formulation: naïve attempt to retrieve relevant information • Documents initially retrieved: • Examined for relevance information • Improved query formulations for retrieving additional relevant documents • Query reformulation: • Expanding original query with new terms • Reweighting the terms in expanded query

  9. Three approaches • Approaches based on feedback from users (relevance feedback) • Approaches based on information derived from set of initially retrieved documents (local set of documents) • Approaches based on global information derived from document collection

  10. User relevance feedback • Most popular query reformulation strategy • Cycle: • User is presented with a list of retrieved documents • User marks those which are relevant • In practice: the top 10-20 ranked documents are examined • Incremental • Select important terms from the documents assessed relevant by the user • Enhance the importance of these terms in a new query • Expected: • New query moves towards relevant documents and away from non-relevant documents • For instance: • Q1: US Open • Q2: US Open Robocup

  11. User relevance feedback • Two basic techniques • Query expansion: add new terms from relevant documents • Term reweighting: modify term weights based on user relevance judgements

  12. Query Expansion and Term Reweighting for the Vector Model • basic idea • Relevant documents resemble each other • Non-relevant documents have term-weight vectors which are dissimilar from those of the relevant documents • The reformulated query is moved closer to the term-weight vector space of relevant documents

  13. Query Expansion and Term Reweighting for the Vector Model (Continued) • Dr: set of relevant documents, as identified by the user, among the retrieved documents • Dn: set of non-relevant documents among the retrieved documents • Cr: set of relevant documents among all documents in the collection • Cn: set of non-relevant documents among all documents in the collection

  14. User relevance feedback: Vector Space Model • Dr: set of relevant documents, as identified by the user, among the retrieved documents • Dn: set of non-relevant documents among the retrieved documents • Cr: set of relevant documents among all documents in the collection • |Dr|, |Dn|, |Cr|: number of documents in the sets Dr, Dn, Cr, respectively • α, β, γ: tuning constants

  15. Calculate the modified query qm • Standard Rocchio: qm = α·q + (β/|Dr|)·Σ(dj ∈ Dr) dj − (γ/|Dn|)·Σ(dj ∈ Dn) dj • Ide Regular: qm = α·q + β·Σ(dj ∈ Dr) dj − γ·Σ(dj ∈ Dn) dj • Ide Dec-Hi: qm = α·q + β·Σ(dj ∈ Dr) dj − γ·maxnon-relevant(dj), where maxnon-relevant(dj) is the highest ranked non-relevant document • α, β, γ: tuning constants (usually β > γ) • α = 1 (Rocchio, 1971) • α = β = γ = 1 (Ide, 1971) • γ = 0: positive feedback • The three formulas achieve similar performance (see the sketch below)
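
A minimal sketch of the three reformulation formulas, assuming queries and documents are term-weight vectors of equal length (numpy arrays); the function names, default constants, and the clipping of negative weights are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard Rocchio: alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)."""
    qm = alpha * q
    if len(rel):
        qm = qm + beta * np.mean(rel, axis=0)      # centroid of relevant docs
    if len(nonrel):
        qm = qm - gamma * np.mean(nonrel, axis=0)  # centroid of non-relevant docs
    return np.clip(qm, 0.0, None)                  # drop negative term weights

def ide_regular(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide Regular: plain sums instead of centroids."""
    qm = alpha * q + beta * np.sum(rel, axis=0) - gamma * np.sum(nonrel, axis=0)
    return np.clip(qm, 0.0, None)

def ide_dec_hi(q, rel, top_nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide Dec-Hi: subtract only the highest-ranked non-relevant document."""
    qm = alpha * q + beta * np.sum(rel, axis=0) - gamma * top_nonrel
    return np.clip(qm, 0.0, None)

# Toy usage: 4-term vocabulary, one relevant and one non-relevant document.
q = np.array([1.0, 1.0, 0.0, 0.0])
rel = [np.array([0.9, 0.1, 0.8, 0.0])]
nonrel = [np.array([0.0, 0.9, 0.0, 0.7])]
print(rocchio(q, rel, nonrel))
```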

  16. Analysis • advantages • simplicity • good results • disadvantages • No optimality criterion is adopted

  17. User relevance feedback: Probabilistic Model • The similarity of a document dj to a query q: sim(dj, q) ∝ Σi wi,q · wi,j · ( log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|¬R)) / P(ki|¬R) ) ) • P(ki|R): the probability of observing the term ki in the set R of relevant documents • P(ki|¬R): the probability of observing the term ki in the set of non-relevant documents • Initial search: P(ki|R) = 0.5 and P(ki|¬R) = ni/N, where ni is the number of documents containing ki and N is the total number of documents

  18. User relevance feedback: Probabilistic Model Feedback search: P(ki|R) = |Dr,i| / |Dr| and P(ki|¬R) = (ni − |Dr,i|) / (N − |Dr|), where Dr,i is the subset of Dr whose documents contain ki

  19. User relevance feedback: Probabilistic Model Feedback search: only the terms already present in the query are reweighted with the formulas above; no query expansion occurs

  20. User relevance feedback: Probabilistic Model For small values of |Dr| and |Dr,i| (e.g., |Dr| = 1, |Dr,i| = 0) the estimates break down, so an adjustment factor is added • Alternative 1: P(ki|R) = (|Dr,i| + 0.5) / (|Dr| + 1) and P(ki|¬R) = (ni − |Dr,i| + 0.5) / (N − |Dr| + 1) • Alternative 2: P(ki|R) = (|Dr,i| + ni/N) / (|Dr| + 1) and P(ki|¬R) = (ni − |Dr,i| + ni/N) / (N − |Dr| + 1) (see the sketch below)
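
A minimal sketch of the feedback reweighting using Alternative 1 (the 0.5 adjustment); the function name and argument names are illustrative:

```python
import math

def prob_term_weight(n_i, N, Dr_i, Dr):
    """Feedback term weight with the 0.5 adjustment (Alternative 1).

    n_i  -- documents in the collection containing term k_i
    N    -- total number of documents in the collection
    Dr_i -- retrieved relevant documents containing k_i
    Dr   -- retrieved relevant documents
    """
    p_rel = (Dr_i + 0.5) / (Dr + 1)                # P(k_i | R)
    p_non = (n_i - Dr_i + 0.5) / (N - Dr + 1)      # P(k_i | not R)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# A term in 2 of 3 known relevant docs but only 50 of 10,000 overall
# gets a high weight.
print(prob_term_weight(n_i=50, N=10000, Dr_i=2, Dr=3))
```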

  21. Analysis • advantages • Feedback process is directly related to the derivation of new weights for query terms • The term reweighting is optimal • disadvantages • Document term weights are not considered • No query expansion is used

  22. Query Expansion [taxonomy diagram] • Global analysis: Similarity Thesaurus, Statistical Thesaurus • Local analysis: Local Context Analysis; Local Clustering (Association Clustering, Metric Clustering, Scalar Clustering)

  23. Automatic Local Analysis • user relevance feedback • Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents, with assistance from the user (clustering) • automatic analysis • Obtain a description (in terms of index terms) for a larger cluster of relevant documents automatically • global strategy: a global thesaurus-like structure is built from all documents before querying • local strategy: terms from the documents retrieved for a given query are selected at query time

  24. Query Expansion based on a Similarity Thesaurus • Query expansion is done in three steps as follows: • Represent the query in the concept space used for representation of the index terms • Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q. • Expand the query with the top r ranked terms according to sim(q,kv)

  25. Query Expansion – Step 1 • To the query q is associated a vector q⃗ in the term-concept space, given by q⃗ = Σ(ki ∈ q) wi,q · k⃗i • where wi,q is a weight associated with the index term–query pair [ki, q]

  26. Query Expansion – Step 2 • Compute a similarity sim(q, kv) between each term kv and the user query q: sim(q, kv) = q⃗ · k⃗v = Σ(ku ∈ q) wu,q · cu,v • where cu,v is the correlation factor between the terms ku and kv

  27. Query Expansion – Step 3 • Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q’ • To each expansion term kv in the query q’ is assigned a weight wv,q’ given by wv,q’ = sim(q, kv) / Σ(ku ∈ q) wu,q • The expanded query q’ is then used to retrieve new documents for the user (see the sketch below)
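
A minimal sketch of the three steps, assuming the correlation factors cu,v have already been computed (their construction is sketched later, after slide 42); the data structures and names here are illustrative:

```python
def expand_query(q_weights, c, r):
    """Expand a query using a precomputed term-term correlation matrix.

    q_weights -- dict: query term k_u -> w_{u,q}
    c         -- dict: (k_u, k_v) -> c_{u,v} from the similarity thesaurus
    r         -- number of expansion terms to add
    """
    # Step 2: sim(q, k_v) = sum over query terms k_u of w_{u,q} * c_{u,v}
    vocab = {v for (_, v) in c}
    sim = {v: sum(w * c.get((u, v), 0.0) for u, w in q_weights.items())
           for v in vocab}
    # Step 3: add the top-r terms, each weighted by sim(q,k_v) / sum_u w_{u,q}
    norm = sum(q_weights.values())
    expanded = dict(q_weights)
    for v in sorted(sim, key=sim.get, reverse=True)[:r]:
        if v not in expanded:
            expanded[v] = sim[v] / norm
    return expanded
```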

  28. Query Expansion - Sample • Doc1 = D, D, A, B, C, A, B, C • Doc2 = E, C, E, A, A, D • Doc3 = D, C, B, B, D, A, B, C, A • Doc4 = A • c(A,A) = 10.991 • c(A,C) = 10.781 • c(A,D) = 10.781 • ... • c(D,E) = 10.398 • c(B,E) = 10.396 • c(E,E) = 10.224

  29. Query Expansion - Sample • Query: q = A E E • sim(q,A) = 24.298 • sim(q,C) = 23.833 • sim(q,D) = 23.833 • sim(q,B) = 23.830 • sim(q,E) = 23.435 • New query: q’ = A C D E E • w(A,q')= 6.88 • w(C,q')= 6.75 • w(D,q')= 6.75 • w(E,q')= 6.64

  30. Query Expansion • Methods of local analysis extract information from local set of documents retrieved to expand the query • An alternative is to expand the query using information from the whole set of documents

  31. Local Cluster • stem • V(s): a non-empty subset of words which are grammatical variants of each other, e.g., {polish, polishing, polished} • A canonical form s of V(s) is called a stem, e.g., polish • local document set Dl • the set of documents retrieved for a given query • local vocabulary Vl (Sl) • the set of all distinct words (stems) in the local document set

  32. Local Cluster • basic concept • Expanding the query with terms correlated to the query terms • The correlated terms are present in the local clusters built from the local document set • local clusters • association clusters: co-occurrences of pairs of terms in documents • metric clusters: distance factor between two terms • scalar clusters: terms with similar neighborhoods have some synonymity relationship

  33. Association Clusters • idea • Based on the co-occurrence of stems (or terms) inside documents • association matrix • fsi,j: the frequency of a stem si in a document dj (dj ∈ Dl) • m = (fsi,j): an association matrix with |Sl| rows and |Dl| columns • s = m·mᵀ: a local stem-stem association matrix

  34. • cu,v = Σ(dj ∈ Dl) fsu,j × fsv,j: a correlation between the stems su and sv (an element su,v of the matrix s) • su,v = cu,v: unnormalized matrix • su,v = cu,v / (cu,u + cv,v − cu,v): normalized matrix • local association cluster Su(n) around the stem su: take the u-th row and return the set of n largest values su,v (u ≠ v) (see the sketch below)
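
A minimal sketch of an association cluster, assuming the local document set is already reduced to lists of stems; the names and toy data are illustrative:

```python
from collections import Counter

def association_cluster(docs, target, n):
    """Return the n stems with the largest normalized association to `target`."""
    freqs = [Counter(d) for d in docs]          # f_{s,j}: stem frequency per doc
    stems = set().union(*docs)
    def c(u, v):                                # c_{u,v} = sum_j f_{u,j} * f_{v,j}
        return sum(f[u] * f[v] for f in freqs)
    s = {v: c(target, v) / (c(target, target) + c(v, v) - c(target, v))
         for v in stems if v != target}         # normalized s_{u,v}
    return sorted(s, key=s.get, reverse=True)[:n]

# Toy local document set, already stemmed.
docs = ["polish wax car polish".split(), "car wax shine".split()]
print(association_cluster(docs, "polish", 2))   # e.g., ['wax', 'car']
```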

  35. Metric Clusters • idea • Consider the distance between two terms in the computation of their correlation factor • local stem-stem metric correlation matrix • r(ki, kj): the number of words between keywords ki and kj in the same document • cu,v = Σ(ki ∈ V(su)) Σ(kj ∈ V(sv)) 1/r(ki, kj): metric correlation between stems su and sv

  36. • su,v = cu,v: unnormalized matrix • su,v = cu,v / (|V(su)| × |V(sv)|): normalized matrix • local metric cluster Su(n) around the stem su: take the u-th row and return the set of n largest values su,v (u ≠ v) (see the sketch below)
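
A minimal sketch of the metric correlation, assuming each document is a list of stems in text order. Two simplifying assumptions: grammatical variants are collapsed to one stem (the slide sums over all ki ∈ V(su), kj ∈ V(sv)), and the distance r is taken as the absolute position difference rather than a strict "words between" count, which avoids division by zero for adjacent words:

```python
def metric_correlation(docs, s_u, s_v):
    """c_{u,v} = sum of 1 / r(k_i, k_j) over co-occurring keyword pairs."""
    c = 0.0
    for doc in docs:
        pos_u = [i for i, w in enumerate(doc) if w == s_u]
        pos_v = [i for i, w in enumerate(doc) if w == s_v]
        # r approximated by the position difference within the same document
        c += sum(1.0 / abs(i - j) for i in pos_u for j in pos_v if i != j)
    return c

doc = "the polish was applied to the car before the wax".split()
print(metric_correlation([doc], "polish", "wax"))   # 1/8 = 0.125
```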

  37. Scalar Clusters The row corresponding to a specific term in a term co-occurrence matrix forms its neighborhood • idea • Two stems with similar neighborhoods have a synonymity relationship • The relationship is indirect, induced by the neighborhood • scalar association matrix: su,v = (s⃗u · s⃗v) / (|s⃗u| · |s⃗v|), the cosine between the rows s⃗u and s⃗v of the association matrix • local scalar cluster Su(n) around the stem su: take the u-th row and return the set of n largest values su,v (u ≠ v) (see the sketch below)
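
A minimal sketch of the scalar correlation, assuming the association matrix from the previous slides is given as a nested dict; the names are illustrative:

```python
import math

def scalar_similarity(assoc, s_u, s_v):
    """Cosine between the rows of stems s_u and s_v in the association matrix.

    assoc -- nested dict: assoc[s_u][s_v] = s_{u,v} from the earlier slides
    """
    row_u, row_v = assoc[s_u], assoc[s_v]
    dot = sum(w * row_v.get(k, 0.0) for k, w in row_u.items())
    norm_u = math.sqrt(sum(w * w for w in row_u.values()))
    norm_v = math.sqrt(sum(w * w for w in row_v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```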

  38. Interactive Search Formulation [figure: a cluster Sv(n) built around the stem sv; the stem su lies inside it] • neighbors of the query term sv • Terms su belonging to clusters associated to sv, i.e., su ∈ Sv(n) • su is called a searchonym of sv

  39. Similarity Thesaurus • The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. • These relationships are not derived directly from co-occurrence of terms inside documents. • They are obtained by considering that the terms are concepts in a concept space. • In this concept space, each term is indexed by the documents in which it appears. • Terms assume the original role of documents, while documents are interpreted as indexing elements.

  40. Similarity Thesaurus • Inverse term frequency for document dj: itfj = log(t / tj) • t: number of terms in the collection • N: number of documents in the collection • fi,j: frequency of occurrence of the term ki in the document dj • tj: number of distinct index terms in document dj (its vocabulary) • itfj: inverse term frequency for document dj • To ki is associated a vector k⃗i = (wi,1, wi,2, ..., wi,N)

  41. Similarity Thesaurus • where wi,j is a weight associated with the index term–document pair [ki, dj]. These weights are computed as wi,j = ( (0.5 + 0.5 · fi,j / maxj(fi,j)) · itfj ) / sqrt( Σ(l = 1..N) (0.5 + 0.5 · fi,l / maxl(fi,l))² · itfl² ), with wi,j = 0 when ki does not occur in dj; maxj(fi,j) is the maximum frequency of ki over all documents

  42. Similarity Thesaurus • The relationship between two terms ku and kv is computed as a correlation factor cu,v given by cu,v = k⃗u · k⃗v = Σ(dj) wu,j × wv,j • The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of index terms [ku, kv] in the collection (see the sketch below)
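
A minimal sketch of the construction, assuming the collection is given as token lists and that a term absent from a document gets weight zero; the input format and names are illustrative:

```python
import math

def build_thesaurus(docs):
    """docs: list of token lists. Returns c[(k_u, k_v)] for every term pair."""
    N = len(docs)
    terms = set().union(*docs)
    t = len(terms)                                    # terms in the collection
    itf = [math.log(t / len(set(d))) for d in docs]   # itf_j = log(t / t_j)
    w = {}
    for k in terms:                                   # term vector over the N docs
        f = [d.count(k) for d in docs]
        fmax = max(f)
        raw = [(0.5 + 0.5 * f[j] / fmax) * itf[j] if f[j] else 0.0
               for j in range(N)]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        w[k] = [x / norm for x in raw]                # normalized w_{i,j}
    # c_{u,v} = vector inner product over the documents
    return {(u, v): sum(a * b for a, b in zip(w[u], w[v]))
            for u in terms for v in terms}

docs = ["polish wax car polish".split(), "car wax shine".split()]
c = build_thesaurus(docs)
print(round(c[("polish", "wax")], 3))
```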

  43. • Represent the query in the concept space used for representation of the index terms • Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q [figure: query terms and expansion terms in the concept space]

  44. Expand the query with the top r ranked terms according to sim(q,kv)

  45. Similarity Thesaurus • This computation is expensive • However, the global similarity thesaurus has to be computed only once and can be updated incrementally
