CSA3080: Adaptive Hypertext Systems I

Lecture 5: Information Retrieval I
Dr. Christopher Staff
Department of Computer Science & AI
University of Malta
cstaff@cs.um.edu.mt

Aims and Objectives
• Aims and objectives of IR
• Boolean, Extended Boolean, Statistical Models

Aims and Objectives
• You should end up knowing the major differences between the simple matching algorithms
• And what each algorithm considers to be a relevant document…
• Bear in mind that we will use IR in AHS to find information relevant to our user, so that we can present it to the user or lead the user to it…

Aims and Objectives of IR
• To facilitate the identification and retrieval of documents that contain information relevant to an information need expressed by a user
• We are particularly interested in the retrieval of information from unstructured data

Boolean Information Retrieval
• Developed in the 1950s
• A document is represented by a collection of terms that occur in the document (index)
• The set of unique terms occurring in the collection is called the vocabulary
• A document is represented by a bit sequence, with a 1 for each vocabulary term that is present and a 0 otherwise
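
The bit-sequence representation can be sketched in a few lines of Python. This is only an illustration: the three document texts, the whitespace tokenizer, and the names `docs`, `vocabulary`, and `bit_vector` are assumptions made for the example, not part of the lecture.

```python
# Toy collection; the document texts are invented for illustration.
docs = {
    "d1": "boolean retrieval uses index terms",
    "d2": "vector space retrieval ranks documents",
    "d3": "boolean queries use set operations",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# The vocabulary is the sorted set of unique terms in the whole collection.
vocabulary = sorted({term for text in docs.values() for term in tokenize(text)})


def bit_vector(text):
    """Return one bit per vocabulary term: 1 if the term occurs, else 0."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]


vectors = {doc_id: bit_vector(text) for doc_id, text in docs.items()}
```

Every vector has the same length as the vocabulary, so position i always refers to the same term across documents.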

Boolean Information Retrieval
• How is the query expressed?
• User thinks of terms that describe an information need
• Formalises query as a boolean expression
• (Term27 OR Term46) NOT (Term30 AND Term16)

Boolean Information Retrieval
• How does the matching algorithm work?
• Each term in the vocabulary has a set (or postings list) of documents that contain the term
• For each term in the query, the postings lists are retrieved
• Set operations (union, intersection, difference) are applied according to the query's OR, AND, and NOT operators
• All documents in the results set are returned
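
As a minimal sketch of this matching step, the query (Term27 OR Term46) NOT (Term30 AND Term16) can be evaluated with Python sets. The postings lists and document ids below are hypothetical, and NOT is read here as set difference ("and not"), which is an assumption about the intended query semantics.

```python
# Hypothetical postings lists: term -> set of ids of documents containing it.
postings = {
    "Term27": {1, 2, 5},
    "Term46": {2, 3, 7},
    "Term30": {2, 5, 8},
    "Term16": {2, 8, 9},
}


def docs_with(term):
    """Retrieve the postings list for a term (empty if the term is unknown)."""
    return postings.get(term, set())


# (Term27 OR Term46): union of the two postings lists.
left = docs_with("Term27") | docs_with("Term46")

# (Term30 AND Term16): intersection of the two postings lists.
right = docs_with("Term30") & docs_with("Term16")

# left NOT right: documents in the left set that are not in the right set.
result = left - right

print(sorted(result))
```

The result is an unordered set: every document either matches or does not, which is why plain Boolean retrieval offers no basis for ranking.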

Questions Arising…
• Is this really information retrieval?
• Just because a document contains term x, does it mean that the document is about term x?
• What about concepts?
• What makes it possible for us to know that a fish cake is not a dessert? That “she is the apple of my eye” does not make her a piece of fruit?

Questions Arising…
• Can we rank the results of a boolean query?
• All we are doing is checking the presence and absence of terms
• On what grounds would we rank?
• And doesn’t it look suspiciously like RDBMS/SQL???

Does Boolean IR work?
• BIR works, and works well, when the vocabulary is reasonably small…
• … when there is no ambiguity in the meaning of terms
• … when the presence of a term in a document is significant
• … when the absence of a term from a document means that the document cannot be about that term

Does Boolean IR work?
• Boolean IR is typically applied to a document surrogate
• And is used with tremendous success in RDBMS
• Most general purpose IR systems in use on the Internet are derived from BIR with some extensions…

Vector Space Model of IR
• Briefly…
• Documents (and the query) are represented by vectors of term weights
• A term weight describes the relative importance of the term to the document (query)
• The similarity of each document to the query is measured
• The more similar the document to the query, the more relevant it is
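
A minimal sketch of the idea follows. The slide does not fix a weighting scheme or a similarity measure, so two common choices are assumed here: raw term frequency as the weight and cosine similarity as the measure. The document texts and function names are invented for the example.

```python
import math
from collections import Counter


def term_weights(text):
    """Weight each term by its raw frequency (an assumed, very simple scheme)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = {
    "d1": "boolean retrieval uses index terms",
    "d2": "vector space retrieval ranks documents",
    "d3": "fish cake recipes",
}

query = term_weights("boolean retrieval")

# Rank documents from most to least similar to the query.
scores = {doc_id: cosine(query, term_weights(text)) for doc_id, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Unlike the Boolean model, every document receives a graded score, so the output can be ordered and cut off after the top few results.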

Vector Space Model of IR
• VSM gives improved results over Boolean
• Can rank documents
• Can control output (limit the no. of documents returned)
• But… not as easy to construct query
• Query does not contain any structure
• Can’t express synonymy, etc.

Extended Boolean Retrieval Model
• Developed to address ranking problem in BIR, using VSM-like approach, while retaining Boolean query structures
• E-BIR not as strict as BIR (fuzzy matches supported, as in VSM)
• Term features can include frequency, location, …
• Reference:
• G. Salton, E. Fox, and U. Wu. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(12):1022-1036.

Extended Boolean Retrieval Model
• Matching is still based on presence or absence of terms, but now results can be ranked
• Terms in docs and query are weighted according to term features
• With structured documents (e.g., HTML), term features can also include structural information (title, heading, style, …)
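
One way to see how weighted terms yield a ranking while keeping AND/OR structure is the p-norm scoring associated with the extended Boolean model. The sketch below is simplified: query terms are unweighted, document term weights are assumed to lie in [0, 1], the value p = 2 is an arbitrary choice, and the two example documents are hypothetical.

```python
def pnorm_or(weights, p=2.0):
    """Soft OR: high if any term weight is high (distance from the all-zero point)."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """Soft AND: high only if all term weights are high (closeness to the all-one point)."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Hypothetical document term weights in [0, 1] for the query terms.
doc_a = {"computer": 0.9, "science": 0.1}   # strong on one term only
doc_b = {"computer": 0.5, "science": 0.5}   # moderate on both terms

terms = ["computer", "science"]
and_scores = {name: pnorm_and([doc[t] for t in terms])
              for name, doc in (("doc_a", doc_a), ("doc_b", doc_b))}
or_scores = {name: pnorm_or([doc[t] for t in terms])
             for name, doc in (("doc_a", doc_a), ("doc_b", doc_b))}
```

With these weights the soft AND prefers the balanced document and the soft OR prefers the document with one strong term, so the Boolean operators still shape the ranking instead of acting as hard filters.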

Extended Boolean Retrieval Model
• With location information, it is possible to find terms NEAR each other
• “computer NEAR science” is not the same as “computer AND science”
• ADJ (adjacent) refines the proximity measure
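
The difference between AND, NEAR, and ADJ can be sketched with a per-document positional index. The exact semantics are assumptions made for this example: NEAR is taken to mean "within `window` word positions, in either order", and ADJ to mean "the second term immediately follows the first". The sample sentences are invented.

```python
def positions(text):
    """Map each term to the list of word positions at which it occurs."""
    index = {}
    for pos, token in enumerate(text.lower().split()):
        index.setdefault(token, []).append(pos)
    return index


def both(index, a, b):
    """Plain AND: both terms occur somewhere in the document."""
    return a in index and b in index


def near(index, a, b, window=3):
    """NEAR (assumed semantics): some occurrences lie within `window` positions."""
    return any(abs(i - j) <= window
               for i in index.get(a, []) for j in index.get(b, []))


def adj(index, a, b):
    """ADJ (assumed semantics): b occurs immediately after a."""
    return any(j - i == 1
               for i in index.get(a, []) for j in index.get(b, []))


doc1 = positions("she studies computer science at the university")
doc2 = positions("the science museum bought a new computer")
doc3 = positions("computer aided science")
```

Here all three documents satisfy "computer AND science", only `doc1` and `doc3` satisfy NEAR with the assumed window, and only `doc1` satisfies ADJ.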

Questions Arising…
• Ranked results are an improvement
• NEAR is also useful to improve the quality of results
• … as is ADJ
• Are we any closer to information retrieval?

Phrase Matching
• Concepts may be evidenced in text as complex/compound identifiers
• New York, Computer Science, information retrieval, database management systems, …
• Brings us closer to information retrieval, but still only identifies documents that contain phrases
• Reference:
• W. Bruce Croft, Howard R. Turtle, and David D. Lewis, (1991), The use of phrases and structured queries in information retrieval, ACM SIGIR, 32-45.

Phrase Matching
• Boolean and Extended Boolean models can express phrases using AND together with a proximity operator
• VSM cannot, unless the phrase has been indexed!
• When is a sequence of words a phrase?
• Croft et al. use a probabilistic inference net model…

Conclusion
• The Boolean and Extended Boolean Models give us a simple mechanism for representing documents
• If we can represent a user’s interest by the presence or absence of terms, then the user model could be used as a query to locate interesting documents
• Phrase matching allows us to recognise complex nouns: useful only if the phrase is pervasive
