
CSA4080: Adaptive Hypertext Systems II


Presentation Transcript


1. CSA4080: Adaptive Hypertext Systems II
Topic 6: Information and Knowledge Representation
Dr. Christopher Staff
Department of Computer Science & AI, University of Malta
cstaff@cs.um.edu.mt

2. Aims and Objectives
• Models of Information Retrieval
• Vector Space Model
• Probabilistic Model
• Relevance Feedback
• Query Reformulation

3. Aims and Objectives
• Dealing with General Knowledge
• Programs that reason
• Conceptual Graphs
• Intelligent Tutoring Systems

4. Background
• We've talked about how user information can be represented
• We need to be able to represent information about the domain so that we can reason about what the user's interests are, etc.
• We covered the difference between data, information, and knowledge in CSA3080...

5. Background
• In 1945, Vannevar Bush writes "As We May Think"
• Gives rise to seeking "intelligent" solutions to information retrieval, etc.
• In 1949, Warren Weaver writes that if Chinese is English + codification, then machine translation should be possible
• Leads to surface-based/statistical techniques

6. Background
• Even today, nearly 60 years later, there is significant effort in both directions
• For years, intelligent solutions were hampered by the lack of sufficiently fast hardware and software
• That no longer seems to be an issue, and the Semantic Web may be testimony to that
• But there are sceptics

7. Background
• Take IR as an example
• At the dumb end we have "reasonable" generic systems; at the other end, systems are domain specific and more expensive, but do they give "better" results?

8. Background
• At what point does it cease to be cost effective to attempt more intelligent solutions to the IR problem?

9. Background
• Is "Information" Retrieval a misnomer?
• Consider your favourite Web-based IR system... does it retrieve information?
• Can you ask "Find me information about all flights between Malta and London"?
• And what would you get back?
• Can you ask "Who was the first man on the moon?"

10. Background
• With many IR systems that we use, the "intelligence" is firmly rooted in the user
• We must learn how to construct our queries so that we get the information we seek
• We sift through relevant and non-relevant documents in the results list
• What we can hope for is that "patterns" can be identified to make life easier for us - e.g., recommender systems

11. Background
• Surface-based techniques tend to look for and re-use patterns as heuristics, without attempting to encode "meaning"
• The Semantic Web, and other "intelligent" approaches, try to encode meaning so that it can be reasoned with and about
• Cynics/sceptics/opponents believe that there is more success to be had in giving users more support than in encoding meaning into documents to support automation

12. However...
• We will cover both surface-based and some knowledge-based approaches to supporting the user in his or her task

13. Information Retrieval
• We will discuss two IR models...
• Vector Space Model
• Probabilistic Model
• ... and surface-based techniques that can improve their usability
• Relevance Feedback
• Query Reformulation
• Question-Answering

14. Knowledge
• Conceptual graphs support the encoding and matching of concepts
• Conceptual graphs are more "intelligent" and can be used to overcome some problems like the Vocabulary Problem

15. Reasoning on the Web
• REWERSE (FP6 NoE) is an attempt to represent meaning contained in documents and to reason with and about it so that a single high-level user request may be carried out even if it contains several sub-tasks
• E.g., "Find me information about cheap flights between Malta and London"

16. Vector-Space Model
• Recommended Reading
• p18-wong (Generalised Vector Space Model).pdf - look at refs [1], [2], [3] for original work

17. Vector-Space Model
• Documents are represented as m-dimensional vectors or "bags of words"
• m is the size of the vocabulary
• wk = 1 indicates the term is present in the document
• wk = 0 indicates the term is absent
• dj = <1, 0, 0, 1, ..., 0, 0>
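A minimal sketch of the binary bag-of-words representation described on this slide; the toy vocabulary, example document, and naive whitespace tokenisation are illustrative assumptions (a real indexer would also remove stop words and stem terms).

```python
# Illustrative only: build an m-dimensional binary vector over a fixed vocabulary.
def binary_vector(document: str, vocabulary: list[str]) -> list[int]:
    """wk = 1 if term k occurs in the document, 0 otherwise."""
    terms = set(document.lower().split())          # naive tokenisation, no stemming
    return [1 if term in terms else 0 for term in vocabulary]

vocabulary = ["cat", "chase", "dog", "flight", "malta"]       # m = 5, toy vocabulary
dj = binary_vector("a cat may chase a dog", vocabulary)       # [1, 1, 1, 0, 0]
```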

18. Vector-Space Model

19. Vector-Space Model
• The query is then plotted into m-dimensional space and the nearest neighbours are the most relevant
• However, the results set is usually presented as a list ranked by similarity to the query

20. Vector-Space Model
• Cosine Similarity Measure (from IR vector space model.pdf):
• sim(dj, q) = (dj · q) / (|dj| |q|), i.e. the dot product of the document and query vectors divided by the product of their lengths
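A minimal sketch of the cosine similarity measure matching the formula above; the function name and plain-list vectors are illustrative assumptions, not code from the cited PDF.

```python
# Illustrative only: cosine similarity between a document vector and a query vector.
import math

def cosine_similarity(d: list[float], q: list[float]) -> float:
    """sim(d, q) = (d . q) / (|d| |q|); returns 0.0 if either vector is all zeros."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# The results list is then ranked by this score, highest first.
```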

21. Vector-Space Model
• Calculating term weights
• Term weights may be binary, integers, or reals
• Binary values are thresholded, rather than simply indicating presence or absence
• Integers or reals will be a measure of the relative significance of the term in the document
• Usually, the term weight is TF x IDF

22. Vector-Space Model
• Steps in calculating term weights
• Remove stop words
• Stem remaining words
• Count term frequency (TF)
• Count number of documents containing term (DF)
• Invert it (log(C/DF)), where C is the total number of documents in the collection
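A minimal sketch of the TF x IDF weighting steps above, assuming documents arrive already stop-worded and stemmed as term lists; taking IDF as log(C/DF) follows the slide, while the use of the natural log is an assumption.

```python
# Illustrative only: weight term k in document j as TF(k, j) * log(C / DF(k)).
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]], vocabulary: list[str]) -> list[list[float]]:
    C = len(docs)
    df = Counter()                    # DF: number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)             # TF: term frequency within this document
        vectors.append([tf[t] * math.log(C / df[t]) if df[t] else 0.0
                        for t in vocabulary])
    return vectors
```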

23. Vector-Space Model
• Normalising weights for vector length
• Documents with longer vectors have a better chance of being retrieved than short ones (simply because there are a larger number of terms that they will match in a query)
• IR should treat all relevant documents as important for retrieval purposes
• Solution: divide each weight by the document's vector length, wt / sqrt(Σi wi²), where wt is the weight of term t

24. Vector-Space Model
• Why does this work?
• Term discrimination
• Assumes that terms with high TF and low DF are good discriminators of relevant documents
• Because documents are ranked, documents do not need to contain precisely the terms expressed in the query
• We cannot say anything (in VSM) about terms that occur in relevant and non-relevant documents - though we can in probabilistic IR

25. Vector-Space Model
• The Vector-Space Model is also used by Recommender Systems to index user profiles and product, or item, features
• Apart from ranking documents, results lists can be controlled (to list the top n relevant documents), and the query can be automatically reformulated based on relevance feedback

26. Relevance Feedback
• When a user is shown a list of retrieved documents, the user can give relevance judgements
• The system can take the original query and relevance judgements and re-compute the query
• Rocchio...

27. Relevance Feedback
• Basic Assumptions
• Similar docs are near each other in vector space
• Starting from some initial query, the query can be reformulated to reflect subjective relevance judgements given by the user
• By reformulating the query we can move the query closer to more relevant docs and further away from nonrelevant docs

28. Relevance Feedback
• In VSM, reformulating the query means re-weighting terms in the query
• Not failsafe: may move the query towards nonrelevant docs!

29. Relevance Feedback
• The Ideal Query
• If we know the answer set rel, then the ideal query is the difference between the centroids of the relevant and nonrelevant documents:
• Qideal = (1/|rel|) Σ dj (for dj in rel) - (1/|nonrel|) Σ dj (for dj not in rel)

30. Relevance Feedback
• In reality, a typical interaction will be:
• User formulates query and submits it
• IR system retrieves set of documents
• User selects the relevant documents R' and nonrelevant documents N'
• The query is reformulated as Q' = αQ + βR - γN, where R and N are built from R' and N', and 0 <= α, β, γ <= 1 (and the vector magnitude is usually dropped...)

31. Relevance Feedback
• What are the values of β and γ?
• α is typically given a value of 0.75, but this can vary. Also, after a number of iterations, the original weights of terms can be highly reduced
• If β and γ have equal weight, then relevant and nonrelevant docs make an equal contribution to the reformulated query
• If β = 1, γ = 0, then only relevant docs are used in the reformulated query
• Usually, use β = 0.75, γ = 0.25

32. Relevance Feedback
• Example
Q: (5, 0, 3, 0, 1)
R: (2, 1, 2, 0, 0)
N: (1, 0, 0, 0, 2)
α = 0.75, β = 0.50, γ = 0.25
Q' = 0.75Q + 0.5R - 0.25N
   = 0.75(5, 0, 3, 0, 1) + 0.5(2, 1, 2, 0, 0) - 0.25(1, 0, 0, 0, 2)
   = (4.5, 0.5, 3.25, 0, 0.25)
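A minimal sketch of the Rocchio-style reformulation, reproducing the slide's example; the function name and the default coefficients are illustrative.

```python
# Illustrative only: Q' = alpha*Q + beta*R - gamma*N, computed component-wise.
def rocchio(q: list[float], r: list[float], n: list[float],
            alpha: float = 0.75, beta: float = 0.5, gamma: float = 0.25) -> list[float]:
    return [alpha * wq + beta * wr - gamma * wn for wq, wr, wn in zip(q, r, n)]

q_new = rocchio([5, 0, 3, 0, 1], [2, 1, 2, 0, 0], [1, 0, 0, 0, 2])
# q_new -> [4.5, 0.5, 3.25, 0.0, 0.25], matching the slide's example
```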

33. Relevance Feedback
• How many docs to use in R' and N'?
• Use all docs selected by user
• Use all rel docs and highest ranking nonrel docs
• Usually, user selects only relevant docs...
• Should entire document vector be used?
• Really want to identify the significant terms...
• Use terms with high frequency/weight
• Use terms in doc adjacent to terms from query
• Use only common terms in R' (and N')

34. Automatic Relevance Feedback
• Users tend not to select nonrelevant documents, and rarely choose more than one relevant document (http://www.dlib.org/dlib/november95/11croft.html)
• This makes it difficult to use relevance feedback
• Current research uses automatic relevance feedback techniques...

35. Automatic Relevance Feedback
• Two main approaches
• To improve precision
• To improve recall

36. Automatic Relevance Feedback
• Reasons for low precision
• Documents contain query terms, but documents are not "about" the "concept" or "topic" the user is interested in
• E.g., the user wants documents in which a cat chases a dog, but the query <cat, chase, dog> also retrieves docs in which dogs chase cats
• Term ambiguity

37. Automatic Relevance Feedback
• Improving precision
• Want to promote relevant documents in the results list
• Assume that the top-n (typically 20) documents are relevant, and assume docs ranked 500-1000 are nonrelevant
• Choose co-occurring discriminatory terms
• Re-rank docs ranked 21-499 using the (modified) Rocchio method (see the sketch below)
p206-mitra.pdf
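A minimal sketch of this blind (automatic) feedback loop, reusing the rocchio() and cosine_similarity() sketches above; the centroid() helper, cut-off values, and data layout are illustrative assumptions, not the modified Rocchio method of p206-mitra.pdf.

```python
# Illustrative only: assume the top_n docs are relevant and docs ranked 500-1000 are not,
# reformulate the query, then re-rank the documents in between.
def centroid(vecs: list[list[float]]) -> list[float]:
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def pseudo_feedback_rerank(ranking: list[int], vectors: dict[int, list[float]],
                           query: list[float], top_n: int = 20,
                           nonrel_from: int = 500, nonrel_to: int = 1000) -> list[int]:
    rel = centroid([vectors[d] for d in ranking[:top_n]])
    nonrel = centroid([vectors[d] for d in ranking[nonrel_from:nonrel_to]])
    q_new = rocchio(query, rel, nonrel)                      # sketch from slide 32
    middle = sorted(ranking[top_n:nonrel_from],
                    key=lambda d: cosine_similarity(vectors[d], q_new), reverse=True)
    return ranking[:top_n] + middle + ranking[nonrel_from:]
```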

38. Automatic Relevance Feedback
• Improving precision
• Does improve precision by 6%-13% at P-21 to P-100
• But remember that precision concerns the proportion of retrieved documents that are relevant
• There may be many relevant documents that were never retrieved (i.e., low recall)

39. Automatic Relevance Feedback
• Reasons for low recall
• The "concept" or "topic" that the user is interested in can be described using terms additional to those expressed by the user in the query
• E.g., think of all the different ways in which you can express "car", including manufacturers' names (e.g., Ford, Vauxhall, etc.)
• There is only a small probability that user and author use the same term to describe the same concept

40. Automatic Relevance Feedback
• Reasons for low recall
• "Imprudent" query term "expansion" improves recall, simply because more documents are retrieved, but hurts precision!

41. Automatic Relevance Feedback
• Improving recall
• Manually or automatically generated thesaurus used to expand query terms before query is submitted
• We're currently working on other techniques to pick synonyms that are likely to be relevant
• Semantic Web attempts to encode semantic meaning into documents
p61-voorhees.pdf, qiu94improving.pdf, MandalaSigir99EvComboWordNet.pdf
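A minimal sketch of thesaurus-based query expansion; the tiny hand-made thesaurus and the max_synonyms cap are illustrative assumptions (real systems draw on resources such as WordNet or corpus-derived thesauri, as in the papers cited above).

```python
# Illustrative only: expand query terms with related terms before submission.
thesaurus = {"car": ["automobile", "vehicle", "ford", "vauxhall"]}   # toy stand-in

def expand_query(terms: list[str], max_synonyms: int = 2) -> list[str]:
    expanded = list(terms)
    for t in terms:
        expanded.extend(thesaurus.get(t, [])[:max_synonyms])
    return expanded

# expand_query(["cheap", "car"]) -> ["cheap", "car", "automobile", "vehicle"]
```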

42. Indexing Documents
• Obviously, comparing a query vector to each document vector to determine the similarity is expensive
• So how can we do it efficiently, especially for gigantic document collections, like the Web?

43. Indexing Documents
• Inverted indices
• An inverted index is a list of terms in the vocabulary together with a postings list for each term
• A postings list is a list of documents containing the term

44. Indexing Documents
• Inverted index
• Several pieces of information can be stored in the postings list
• term weight
• location of the term in the document (to support proximity operators)

45. Indexing Documents
• Results set is obtained using set operators (see the sketch below)
• Once documents in the results set are known, their vectors can be retrieved to perform ranking operations on them
• The document vectors also allow automatic query reformulation to occur following relevance feedback
• See brin.pdf and p2-arasu.pdf
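A minimal sketch of an inverted index with one postings list per term, and a conjunctive (AND) query answered by intersecting postings; storing only document ids (rather than weights and term positions) is a simplifying assumption.

```python
# Illustrative only: map each term to the set of document ids that contain it,
# then answer an AND query with set intersection.
from collections import defaultdict

def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def and_query(index: dict[str, set[int]], terms: list[str]) -> set[int]:
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: ["cat", "chase", "dog"], 2: ["dog", "chase", "cat"], 3: ["cheap", "flight", "malta"]}
index = build_inverted_index(docs)
results = and_query(index, ["cat", "dog"])   # {1, 2}; these are then ranked via their vectors
```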

46. Probabilistic IR
• VSM assumes that a document that contains some term x is about that term
• PIR compares the probability of seeing term x in a relevant document as opposed to a nonrelevant document
• Binary Independence Retrieval Model proposed by Robertson & Sparck Jones, 1976
robertson97simple.pdf, SparckJones98.pdf

47. BIR
• BIR Fundamentals:
• Given a user query there is a set of documents which contains exactly the relevant documents and no others: the "ideal" answer set
• Given the ideal answer set, a query can be constructed that retrieves exactly this set
• Assumes that relevant documents are "clustered", and that the terms used adequately discriminate against non-relevant documents

48. BIR
• We do not know, in general, what the properties of the ideal answer set are
• All we know is that documents have terms which "capture" semantic meaning
• When the user submits a query, "guess" what might be the ideal answer set
• Allow the user to interact, refining the probabilistic description of the ideal answer set (by marking docs as relevant/non-relevant)

49. BIR
• Probabilistic Principle: Assumption
• Given a user query q and a document dj in the collection:
• Estimate the probability that the user will find dj relevant to q
• Rank documents in order of their probability of relevance to the query (Probability Ranking Principle)

50. BIR
• Model assumes that probability of relevance depends on q and doc representations only
• Assumes that there is an ideal answer set!
• Assumes that terms are distributed differently in relevant and non-relevant documents
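For reference, a minimal sketch of the standard BIR/Robertson-Sparck Jones term weight; the slides do not give the formula, so the smoothed estimate shown here is an assumption drawn from the standard literature (e.g., robertson97simple.pdf).

```python
# Illustrative only: smoothed Robertson/Sparck Jones relevance weight for term i,
# w_i = log( p_i(1 - q_i) / (q_i(1 - p_i)) ), estimated from relevance feedback counts.
import math

def rsj_weight(N: int, R: int, n_i: int, r_i: int) -> float:
    """N: collection size, R: known relevant docs,
    n_i: docs containing term i, r_i: relevant docs containing term i."""
    return math.log(((r_i + 0.5) * (N - n_i - R + r_i + 0.5)) /
                    ((n_i - r_i + 0.5) * (R - r_i + 0.5)))

# A document's score is the sum of rsj_weight(...) over the query terms it contains;
# ranking by this score follows the Probability Ranking Principle.
```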
