CSA4080: Adaptive Hypertext Systems II
Topic 6: Information and Knowledge Representation
Dr. Christopher Staff
Department of Computer Science & AI, University of Malta
cstaff@cs.um.edu.mt
Aims and Objectives
• Models of Information Retrieval
• Vector Space Model
• Probabilistic Model
• Relevance Feedback
• Query Reformulation

Aims and Objectives
• Dealing with General Knowledge
• Programs that reason
• Conceptual Graphs
• Intelligent Tutoring Systems
Background
• We’ve talked about how user information can be represented
• We also need to be able to represent information about the domain, so that we can reason about the user’s interests, etc.
• We covered the difference between data, information, and knowledge in CSA3080...
Background
• In 1945, Vannevar Bush writes “As We May Think”
• Gives rise to seeking “intelligent” solutions to information retrieval, etc.
• In 1949, Warren Weaver writes that if Chinese is English + codification, then machine translation should be possible
• Leads to surface-based/statistical techniques
Background
• Even today, nearly 60 years later, there is significant effort in both directions
• For years, intelligent solutions were hampered by the lack of sufficiently fast hardware and software
• That no longer seems to be an issue, and the Semantic Web may be testimony to that
• But there are sceptics
Background
• Take IR as an example
• At the dumb end we have “reasonable” generic systems; at the other end, systems are domain specific and more expensive, but do they give “better” results?
Background
• At what point does it cease to be cost effective to attempt more intelligent solutions to the IR problem?
Background
• Is “Information” Retrieval a misnomer?
• Consider your favourite Web-based IR system... does it retrieve information?
• Can you ask “Find me information about all flights between Malta and London”?
• And what would you get back?
• Can you ask “Who was the first man on the moon?”
Background
• With many IR systems that we use, the “intelligence” is firmly rooted in the user
• We must learn how to construct our queries so that we get the information we seek
• We sift through relevant and non-relevant documents in the results list
• What we can hope for is that “patterns” can be identified to make life easier for us - e.g., recommender systems
Background
• Surface-based techniques tend to look for and re-use patterns as heuristics, without attempting to encode “meaning”
• The Semantic Web, and other “intelligent” approaches, try to encode meaning so that it can be reasoned with and about
• Cynics/sceptics/opponents believe that there is more success to be had in giving users more support than in encoding meaning into documents to support automation
However...
• We will cover both surface-based and some knowledge-based approaches to supporting the user in his or her task
Information Retrieval
• We will discuss two IR models...
• Vector Space Model
• Probabilistic Model
• ... and surface-based techniques that can improve their usability
• Relevance Feedback
• Query Reformulation
• Question-Answering
Knowledge
• Conceptual graphs support the encoding and matching of concepts
• Conceptual graphs are more “intelligent” and can be used to overcome some problems like the Vocabulary Problem
Reasoning on the Web
• REWERSE (an FP6 Network of Excellence) is an attempt to represent the meaning contained in documents and to reason with and about it, so that a single high-level user request may be carried out even if it contains several sub-tasks
• E.g., “Find me information about cheap flights between Malta and London”
Vector-Space Model
• Recommended Reading
• p18-wong (Generalised Vector Space Model).pdf - look at refs [1], [2], [3] for original work
Vector-Space Model
• Documents are represented as m-dimensional vectors or “bags of words”
• m is the size of the vocabulary
• wk = 1 indicates that term k is present in the document
• wk = 0 indicates that term k is absent
• dj = <1, 0, 0, 1, ..., 0, 0>
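A minimal sketch of such a binary vector (the vocabulary and document text below are invented purely for illustration):

```python
# Binary "bag of words" vector: wk = 1 if term k occurs in the document, else 0.
# The vocabulary and document are invented for illustration only.
vocabulary = ["adaptive", "hypertext", "retrieval", "vector", "query"]

def binary_vector(document_text, vocabulary):
    terms = set(document_text.lower().split())
    return [1 if term in terms else 0 for term in vocabulary]

d_j = binary_vector("a query is matched against each hypertext document", vocabulary)
print(d_j)  # [0, 1, 0, 0, 1]
```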
Vector-Space Model
• The query is then plotted into m-dimensional space and the nearest neighbours are the most relevant
• However, the results set is usually presented as a list ranked by similarity to the query
Vector-Space Model
• Cosine Similarity Measure (from IR vector space model.pdf)
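The formula itself did not survive the slide-to-text conversion; the standard cosine measure between a query q and a document dj is:

```latex
\mathrm{sim}(\vec{q}, \vec{d_j})
  = \frac{\vec{q} \cdot \vec{d_j}}{\lVert \vec{q} \rVert \, \lVert \vec{d_j} \rVert}
  = \frac{\sum_{k=1}^{m} w_{k,q}\, w_{k,j}}
         {\sqrt{\sum_{k=1}^{m} w_{k,q}^{2}} \; \sqrt{\sum_{k=1}^{m} w_{k,j}^{2}}}
```

A value of 1 means the query and document vectors point in the same direction; 0 means they share no terms.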
Vector-Space Model
• Calculating term weights
• Term weights may be binary, integers, or reals
• Binary values are thresholded, rather than simply indicating presence or absence
• Integer or real values are a measure of the relative significance of a term in a document
• Usually, the term weight is TF×IDF
Vector-Space Model
• Steps in calculating term weights
• Remove stop words
• Stem remaining words
• Count term frequency (TF)
• Count number of documents containing term (DF)
• Invert it (IDF = log(C/DF)), where C is the total number of documents in the collection
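A minimal sketch of these steps (the toy documents and stop-word list are invented; the “stemming” is deliberately crude, where a real system would use something like the Porter stemmer):

```python
import math
from collections import Counter

# Toy collection; stop words and "stemming" are deliberately simplistic.
documents = {
    "d1": "the cat chases the dog",
    "d2": "dogs chase cats in the garden",
    "d3": "the garden is quiet",
}
stop_words = {"the", "is", "in"}

def tokens(text):
    # Remove stop words; crude "stemming" by stripping a trailing 's'.
    return [w.rstrip("s") for w in text.lower().split() if w not in stop_words]

C = len(documents)                                                 # collection size
tf = {d: Counter(tokens(text)) for d, text in documents.items()}  # term frequency per doc
df = Counter(t for counts in tf.values() for t in counts)         # document frequency
idf = {t: math.log(C / df[t]) for t in df}                        # "invert it": log(C/DF)

tfidf = {d: {t: f * idf[t] for t, f in counts.items()} for d, counts in tf.items()}
print(tfidf["d1"])   # e.g. {'cat': 0.405..., 'chase': 0.405..., 'dog': 0.405...}
```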
Vector-Space Model
• Normalising weights for vector length
• Documents with longer vectors have a better chance of being retrieved than short ones (simply because they contain a larger number of terms that can match a query)
• IR should treat all relevant documents as equally important for retrieval purposes
• Solution: divide each weight by the length of the document vector, w't,d = wt,d / sqrt(Σk wk,d²), where wt,d is the weight of term t in document d
Vector-Space Model
• Why does this work?
• Term discrimination
• Assumes that terms with high TF and low DF are good discriminators of relevant documents
• Because documents are ranked, documents do not need to contain precisely the terms expressed in the query
• We cannot say anything (in VSM) about terms that occur in both relevant and non-relevant documents - though we can in probabilistic IR
Vector-Space Model
• The Vector-Space Model is also used by Recommender Systems to index user profiles and product, or item, features
• Apart from ranking documents, results lists can be controlled (to list the top n relevant documents), and the query can be automatically reformulated based on relevance feedback
Relevance Feedback
• When a user is shown a list of retrieved documents, the user can give relevance judgements
• The system can take the original query and the relevance judgements and re-compute the query
• Rocchio...
Relevance Feedback
• Basic Assumptions
• Similar docs are near each other in vector space
• Starting from some initial query, the query can be reformulated to reflect subjective relevance judgements given by the user
• By reformulating the query we can move the query closer to more relevant docs and further away from nonrelevant docs
Relevance Feedback
• In VSM, reformulating the query means re-weighting the terms in the query
• Not failsafe: may move the query towards nonrelevant docs!
Relevance Feedback
• The Ideal Query
• If we know the answer set rel, then the ideal query is:
  qopt = (1/|rel|) Σ{dj ∈ rel} dj - (1/(C - |rel|)) Σ{dj ∉ rel} dj
  where C is the number of documents in the collection
Relevance Feedback
• In reality, a typical interaction will be:
• User formulates query and submits it
• IR system retrieves set of documents
• User selects R’ (relevant docs) and N’ (nonrelevant docs), and the query is re-computed as
  Q’ = αQ + (β/|R’|) Σ{d ∈ R’} d - (γ/|N’|) Σ{d ∈ N’} d
  where 0 <= α, β, γ <= 1 (and the vector-magnitude factors 1/|R’|, 1/|N’| are usually dropped, as in the example below)
Relevance Feedback
• What are the values of α, β, and γ?
• α is typically given a value of 0.75, but this can vary. Also, after a number of iterations, the original weights of the query terms can be highly reduced
• If β and γ have equal weight, then relevant and nonrelevant docs make an equal contribution to the reformulated query
• If β = 1, γ = 0, then only relevant docs are used in the reformulated query
• Usually, use β = 0.75, γ = 0.25
Relevance Feedback
• Example
  Q: (5, 0, 3, 0, 1)
  R: (2, 1, 2, 0, 0)
  N: (1, 0, 0, 0, 2)
  α = 0.75, β = 0.50, γ = 0.25
  Q’ = 0.75Q + 0.5R - 0.25N
     = 0.75(5, 0, 3, 0, 1) + 0.5(2, 1, 2, 0, 0) - 0.25(1, 0, 0, 0, 2)
     = (4.5, 0.5, 3.25, 0, 0.25)
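A short sketch that reproduces the arithmetic above (vectors taken directly from the example):

```python
# Rocchio reformulation with the magnitude factors dropped: Q' = aQ + bR - cN
alpha, beta, gamma = 0.75, 0.50, 0.25
Q = (5, 0, 3, 0, 1)   # original query
R = (2, 1, 2, 0, 0)   # relevant document vector
N = (1, 0, 0, 0, 2)   # nonrelevant document vector

Q_prime = [alpha * q + beta * r - gamma * n for q, r, n in zip(Q, R, N)]
print(Q_prime)   # [4.5, 0.5, 3.25, 0.0, 0.25]
```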
Relevance Feedback
• How many docs to use in R’ and N’?
• Use all docs selected by user
• Use all rel docs and highest ranking nonrel docs
• Usually, user selects only relevant docs...
• Should the entire document vector be used?
• Really want to identify the significant terms...
• Use terms with high frequency/weight
• Use terms in doc adjacent to terms from query
• Use only common terms in R’ (and N’)
Automatic Relevance Feedback
• Users tend not to select nonrelevant documents, and rarely choose more than one relevant document (http://www.dlib.org/dlib/november95/11croft.html)
• This makes it difficult to use relevance feedback
• Current research uses automatic relevance feedback techniques...
Automatic Relevance Feedback
• Two main approaches
• To improve precision
• To improve recall
Automatic Relevance Feedback
• Reasons for low precision
• Documents contain the query terms, but the documents are not “about” the “concept” or “topic” the user is interested in
• E.g., the user wants documents in which a cat chases a dog, but the query <cat, chase, dog> also retrieves docs in which dogs chase cats
• Term ambiguity
Automatic Relevance Feedback
• Improving precision
• Want to promote relevant documents in the results list
• Assume that the top-n (typically 20) documents are relevant, and assume docs ranked 500-1000 are nonrelevant
• Choose co-occurring discriminatory terms
• Re-rank docs ranked 21-499 using a (modified) Rocchio method
• p206-mitra.pdf
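A rough sketch of this pseudo-relevance-feedback step (the rank cut-offs follow the slide; document vectors are assumed to be plain lists of term weights, and the co-occurrence filtering of expansion terms described by Mitra et al. is omitted):

```python
def pseudo_feedback_query(query_vec, ranked_doc_vecs,
                          top_n=20, nonrel_from=500, nonrel_to=1000,
                          alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style reweighting that assumes the top-n docs are relevant and
    docs ranked nonrel_from..nonrel_to are nonrelevant; the reweighted query
    can then be used to re-rank the docs ranked in between."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(query_vec)
        return [sum(d[k] for d in docs) / len(docs) for k in range(len(query_vec))]

    rel = centroid(ranked_doc_vecs[:top_n])
    nonrel = centroid(ranked_doc_vecs[nonrel_from:nonrel_to])
    return [alpha * query_vec[k] + beta * rel[k] - gamma * nonrel[k]
            for k in range(len(query_vec))]
```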
Automatic Relevance Feedback
• Improving precision
• Does improve precision by 6%-13% at cut-offs of 21 to 100 retrieved documents (P@21 to P@100)
• But remember that precision concerns the proportion of retrieved documents that are relevant
• There may be many relevant documents that were never retrieved (i.e., low recall)
Automatic Relevance Feedback
• Reasons for low recall
• The “concept” or “topic” that the user is interested in can be described using terms additional to those expressed by the user in the query
• E.g., think of all the different ways in which you can express “car”, including manufacturers’ names (e.g., Ford, Vauxhall, etc.)
• There is only a small probability that user and author use the same term to describe the same concept
Automatic Relevance Feedback
• Reasons for low recall
• “Imprudent” query term “expansion” improves recall, simply because more documents are retrieved, but hurts precision!
Automatic Relevance Feedback
• Improving recall
• Manually or automatically generated thesaurus used to expand query terms before query is submitted
• We’re currently working on other techniques to pick synonyms that are likely to be relevant
• The Semantic Web attempts to encode semantic meaning into documents
• p61-voorhees.pdf, qiu94improving.pdf, MandalaSigir99EvComboWordNet.pdf
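A minimal sketch of thesaurus-based expansion (the thesaurus here is a hand-made stand-in; the papers cited above use WordNet or an automatically constructed similarity thesaurus):

```python
# Hand-made thesaurus, for illustration only.
thesaurus = {
    "car": ["automobile", "vehicle", "ford", "vauxhall"],
    "cheap": ["inexpensive", "budget", "low-cost"],
}

def expand_query(query_terms, thesaurus, max_synonyms=2):
    """Add up to max_synonyms related terms per query term before submission."""
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(thesaurus.get(term, [])[:max_synonyms])
    return expanded

print(expand_query(["cheap", "flights", "malta"], thesaurus))
# ['cheap', 'flights', 'malta', 'inexpensive', 'budget']
```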
Indexing Documents
• Obviously, comparing a query vector to each document vector to determine the similarity is expensive
• So how can we do it efficiently, especially for gigantic document collections, like the Web?
Indexing Documents
• Inverted indices
• An inverted index is a list of terms in the vocabulary together with a postings list for each term
• A postings list is a list of documents containing the term
Indexing Documents
• Inverted index
• Several pieces of information can be stored in the postings list
• term weight
• location of the term in the document (to support proximity operators)
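A minimal sketch of an inverted index with per-document postings (the toy documents are invented for illustration; each posting records the term’s positions, from which a weight could also be derived):

```python
from collections import defaultdict

documents = {
    "d1": "malta flights to london",
    "d2": "cheap flights from malta",
    "d3": "london hotels",
}

# term -> postings list, here a mapping {doc_id: [positions of term in doc]}
inverted_index = defaultdict(dict)
for doc_id, text in documents.items():
    for pos, term in enumerate(text.lower().split()):
        inverted_index[term].setdefault(doc_id, []).append(pos)

# The results set is then obtained with set operations on postings lists,
# e.g. an AND query:
def docs_containing(term):
    return set(inverted_index.get(term, {}))

print(docs_containing("malta") & docs_containing("flights"))   # {'d1', 'd2'} (order may vary)
```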
Indexing Documents
• Results set is obtained using set operators
• Once documents in the results set are known, their vectors can be retrieved to perform ranking operations on them
• The document vectors also allow automatic query reformulation to occur following relevance feedback
• See brin.pdf and p2-arasu.pdf
Probabilistic IR
• VSM assumes that a document that contains some term x is about that term
• PIR compares the probability of seeing term x in a relevant document as opposed to a nonrelevant document
• Binary Independence Retrieval Model proposed by Robertson & Sparck Jones, 1976
• robertson97simple.pdf, SparckJones98.pdf
BIR
• BIR Fundamentals:
• Given a user query, there is a set of documents which contains exactly the relevant documents and no others: the “ideal” answer set
• Given the ideal answer set, a query can be constructed that retrieves exactly this set
• Assumes that relevant documents are “clustered”, and that the terms used adequately discriminate relevant from non-relevant documents
BIR
• We do not know, in general, what the properties of the ideal answer set are
• All we know is that documents have terms which “capture” semantic meaning
• When the user submits a query, “guess” what the ideal answer set might be
• Allow the user to interact, to refine the probabilistic description of the ideal answer set (by marking docs as relevant/non-relevant)
BIR
• Probabilistic Principle: Assumption
• Given a user query q and a document dj in the collection:
• Estimate the probability that the user will find dj relevant to q
• Rank documents in order of their probability of relevance to the query (Probability Ranking Principle)
BIR
• Model assumes that probability of relevance depends on the query and document representations only
• Assumes that there is an ideal answer set!
• Assumes that terms are distributed differently in relevant and non-relevant documents
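The slides do not reproduce the BIR term weight; in the standard Robertson-Sparck Jones formulation, documents are ranked by summing, over the query terms they contain, a weight of the form:

```latex
w_k = \log \frac{p_k\,(1 - q_k)}{q_k\,(1 - p_k)},
\qquad p_k = P(t_k \in d \mid d\ \text{relevant}),
\quad  q_k = P(t_k \in d \mid d\ \text{nonrelevant})
```

Initially pk and qk must be guessed (e.g., pk = 0.5 and qk approximated from the term's document frequency), and they are re-estimated from the user's relevance judgements on each iteration.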