1. Alternative IR models
Slides from Samer Hassan, Ian Ruthven
2. Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
3. Technical Memo Example: Terms and Documents
4. Alternative IR models
5. Probabilistic model Asks the question: what is the probability that the user will see relevant information if they read this document?
P(rel|di): the probability of relevance after reading document di
How likely is the user to get relevant information from reading this document?
A high probability means the user is more likely to get relevant information.
6. Probabilistic model "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance ... the overall effectiveness of the system to its user will be the best that is obtainable."
This is the Probability Ranking Principle
Rank documents based on decreasing probability of relevance to user
Calculate P(rel|di) for each document and rank
Suggests mathematical approach to IR and matching
Can predict (somewhat) how good retrieval models will be, based on their estimates of P(rel|di)
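For illustration, a minimal sketch of ranking by decreasing probability of relevance; the P(rel|di) values here are made-up numbers, and how to estimate them is the subject of the following slides:

    # Rank documents by decreasing estimated probability of relevance.
    p_rel = {"d1": 0.8, "d2": 0.3, "d3": 0.55}   # hypothetical estimates of P(rel|di)
    ranking = sorted(p_rel, key=p_rel.get, reverse=True)
    print(ranking)  # ['d1', 'd3', 'd2']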
7. Example
8. Relevance Feedback What these systems are doing is
Using examples of documents the user likes
To detect which words are useful
new query words
query expansion
To detect how useful these words are
change weights of query words
term reweighting
Use new query for retrieval
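A minimal sketch of this loop under a bag-of-words view; the function and the counts-as-weights scheme are illustrative stand-ins, not a specific system's method:

    from collections import Counter

    def feedback(query, relevant_docs, top_n=3):
        # Count how often each word occurs in the documents the user liked.
        counts = Counter(term for doc in relevant_docs for term in doc)
        # Query expansion: add the most frequent new words to the query.
        new_terms = [t for t, _ in counts.most_common() if t not in query][:top_n]
        expanded = list(query) + new_terms
        # Term reweighting: weight each query word by its frequency in the
        # relevant documents (a crude placeholder for F4, slide 15); words
        # unseen in the relevant documents keep weight 1.
        weights = {t: counts.get(t, 1) for t in expanded}
        return expanded, weights   # the new weighted query, used for retrieval

    relevant = [["sausage", "recipe", "pork"], ["sausage", "grill", "recipe"]]
    print(feedback(["recipe"], relevant))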
9. Query Expansion Add useful terms to the query
Terms that appear often in relevant documents
Try to compensate for poor queries
usually short, can be ambiguous, use too many common words
add better terms to the user's query
Try to emphasize recall
10. Term Weighting Reweight query terms
Assign new weights according to importance in relevant documents
Personalized searching
Try to improve precision
11. Probabilistic Models Most probabilistic models are based on combining the probabilities of relevance and non-relevance of individual terms
Probability that a term will appear in a relevant document
Probability that the term will not appear in a non-relevant document
These probabilities are estimated based on counting term appearances in document descriptions
12. Example Assume we have a collection of 100 documents
N = 100
20 of the documents contain the term sausage
n = 20
Searcher has read and marked 10 documents as relevant
R = 10
Of these relevant documents, 5 contain the term sausage
r = 5
How important is the word sausage to the searcher?
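These four counts fill out a 2×2 contingency table; every quantity used on the next three slides is one of its cells:

                    contains sausage    does not contain sausage    total
    relevant        r = 5               R − r = 5                   R = 10
    non-relevant    n − r = 15          N − n − R + r = 75          N − R = 90
    total           n = 20              N − n = 80                  N = 100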
13. Probability of Relevance From these four numbers we can estimate the probability of sausage given relevance information
i.e. how important the term sausage is to relevant documents
Eq. (I): r / (R − r)
r is the number of relevant documents containing sausage (5)
R − r is the number of relevant documents that do not contain sausage (5)
Eq. (I) is
higher if most relevant documents contain sausage
lower if most relevant documents do not contain sausage
a high value means sausage is an important term to the user; in our example 5/5 = 1
14. Probability of Non-relevance We can also estimate the probability of sausage given non-relevance information
i.e. how important the term sausage is to non-relevant documents
Eq. (II): (n − r) / (N − n − R + r)
n − r is the number of non-relevant documents that contain the term sausage (20 − 5 = 15)
N − n − R + r is the number of non-relevant documents that do not contain sausage (100 − 20 − 10 + 5 = 75)
in our example Eq. (II) = 15/75 = 0.2
15. F4 reweighting formula w = Eq. (I) / Eq. (II) = [r / (R − r)] / [(n − r) / (N − n − R + r)]
the numerator measures how important sausage being present in relevant documents is
the denominator measures how important sausage being absent from non-relevant documents is
from the example on slide 12, the weight of sausage is 5 (1 / 0.2)
16. F4 reweighting formula F4 gives new weights to all terms in the collection (or just the query)
high weights to important terms
low weights to unimportant terms
replaces idf, tf, or any other weights
a document's score is the sum of the weights of the query terms it contains
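A minimal sketch of the F4 computation with the numbers from slide 12 (F4 is often stated with a log and 0.5 added to each count to avoid zeros; the plain ratio below matches the worked example above):

    def f4_weight(N, n, R, r):
        # N: collection size            n: docs containing the term
        # R: docs marked relevant       r: relevant docs containing the term
        eq1 = r / (R - r)                # Eq. (I): presence in relevant docs
        eq2 = (n - r) / (N - n - R + r)  # Eq. (II): presence in non-relevant docs
        return eq1 / eq2

    print(f4_weight(N=100, n=20, R=10, r=5))  # 5.0, i.e. 1 / 0.2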
17. Probabilistic model The reweighting formula can also be used to rank terms for addition to the query
rank the terms in the relevant documents by the term reweighting formula
i.e. by how good the terms are at retrieving relevant documents
add all terms
add some, e.g. the top 5
in the probabilistic approach, query expansion and term reweighting are done separately
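A sketch of that expansion step, with hypothetical (n, r) counts per candidate term and N = 100, R = 10 as before:

    def f4_weight(N, n, R, r):
        return (r / (R - r)) / ((n - r) / (N - n - R + r))

    # Hypothetical per-term counts: term -> (n, r).
    candidates = {"sausage": (20, 5), "pork": (10, 4), "recipe": (50, 6)}

    ranked = sorted(candidates,
                    key=lambda t: f4_weight(100, candidates[t][0], 10, candidates[t][1]),
                    reverse=True)
    print(ranked[:2])  # the best expansion terms, to be added to the query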
18. Probabilistic model Advantages over vector-space
Strong theoretical basis
Based on probability theory (very well understood)
Easy to extend
Disadvantages
Models are often complicated
No term frequency weighting
Which is better: vector-space or probabilistic?
Both perform about equally well
Depends on collection, query, and other factors
19. Explicit/Latent Semantic Analysis BOW (bag of words)
Explicit Semantic Analysis
Latent Semantic Analysis
20. Explicit/Latent Semantic Analysis Objective
Replace indexes that use sets of index terms/docs by indexes that use concepts.
Approach
Map the term vector space into a lower dimensional space, using singular value decomposition.
Each dimension in the new space corresponds to an explicit/latent concept in the original data.
21. Deficiencies with Conventional Automatic Indexing Synonymy:
Various words and phrases refer to the same concept (lowers recall).
Polysemy:
Individual words have more than one meaning (lowers precision)
Independence:
No significance is given to two terms that frequently appear together
Explicit/Latent semantic indexing addresses the first of these (synonymy) and the third (dependence)
22. Technical Memo Example: Titles
23. Technical Memo Example: Terms and Documents
24. Technical Memo Example: Query Query: Find documents relevant to "human computer interaction"
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
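A minimal sketch of that term matching over the slide 2 titles, where a document matches if its title shares at least one word with the query:

    titles = {
        "c1": "Human machine interface for Lab ABC computer applications",
        "c2": "A survey of user opinion of computer system response time",
        "c3": "The EPS user interface management system",
        "c4": "System and human system engineering testing of EPS",
        "c5": "Relation of user-perceived response time to error measurement",
    }
    query = {"human", "computer", "interaction"}

    matches = [d for d, title in titles.items()
               if query & set(title.lower().split())]
    print(matches)  # ['c1', 'c2', 'c4']; c3 and c5 are missed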
25. Mathematical concepts Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
Singular Value Decomposition
For any matrix X, with t rows and d columns, there exist matrices T0, S0 and D0', such that:
X = T0S0D0'
T0 and D0 are the matrices of left and right singular vectors
S0 is the diagonal matrix of singular values
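A quick numerical check of the decomposition (a sketch using numpy on an arbitrary small matrix, not the memo data):

    import numpy as np

    X = np.array([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])            # t = 4 terms, d = 3 documents

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    S0 = np.diag(s0)                        # diagonal matrix of singular values

    print(np.allclose(X, T0 @ S0 @ D0t))    # True: X = T0 S0 D0'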
26. Dimensions of matrices
27. Reduced Rank S0 can be chosen so that the diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero.
Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives:
X̂ = TSD', a rank-k approximation of X
Interpretation
If the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.
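Continuing the numpy sketch, truncating to k = 2 (a value chosen arbitrarily for this toy matrix):

    import numpy as np

    X = np.array([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

    k = 2                                   # number of latent concepts kept
    T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]

    X_hat = T @ S @ Dt                      # rank-k approximation of X
    print(np.round(X_hat, 2))               # close to X, with noise smoothed away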
28. Dimensionality Reduction
29. Recombination after Dimensionality Reduction
30. Mathematical Revision
31. Comparing a Query and a Document A query can be expressed as a vector in the term-document vector space xq.
xqi = 1 if term i is in the query and 0 otherwise.
(Ignore query terms that are not in the term vector space.)
Let pqj be the inner product of the query xq with document dj in the term-document vector space.
32. Comparing a Query and a Document
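A minimal sketch of the comparison, continuing the running numpy example; the query is assumed to contain terms 0 and 2 of the toy term space:

    import numpy as np

    X = np.array([[1., 0., 1.],
                  [0., 1., 0.],
                  [1., 1., 0.],
                  [0., 0., 1.]])            # rows = terms, columns = documents

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    X_hat = T0[:, :2] @ np.diag(s0[:2]) @ D0t[:2, :]   # slide 27's reduction

    xq = np.array([1., 0., 1., 0.])         # xqi = 1 if term i is in the query
    pq = xq @ X_hat                         # pq[j]: inner product with document dj
    print(np.argsort(-pq))                  # documents ranked by decreasing score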