Information Retrieval
CSE 8337, Spring 2003
Introduction/Overview
Material for these slides obtained from:
• Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
• Data Mining: Introductory and Advanced Topics by Margaret H. Dunham, http://www.engr.smu.edu/~mhd/book
Motivation
• IR: representation, storage, organization of, and access to information items
• Focus is on the user information need
• User information need:
  • Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.
• Emphasis is on the retrieval of information (not data)
DB vs IR
• Records (tuples) vs. documents
• Well-defined results vs. fuzzy results
• DB grew out of files and traditional business systems
• IR grew out of library science and the need to categorize/group/access books/articles
DB vs IR (cont’d)
• Data retrieval
  • which docs contain a set of keywords?
  • well-defined semantics
  • a single erroneous object implies failure!
• Information retrieval
  • information about a subject or topic
  • semantics is frequently loose
  • small errors are tolerated
• IR system:
  • interpret contents of information items
  • generate a ranking which reflects relevance
  • notion of relevance is most important
Motivation
• IR in the last 20 years:
  • classification and categorization
  • systems and languages
  • user interfaces and visualization
• Still, the area was seen as of narrow interest
• Advent of the Web changed this perception once and for all
  • universal repository of knowledge
  • free (low cost) universal access
  • no central editorial board
  • many problems though: IR seen as key to finding the solutions!
Basic Concepts
• The User Task
  • Retrieval
    • information or data
    • purposeful
  • Browsing
    • glancing around
    • e.g., cars, Le Mans, France, tourism
[Figure: the user's tasks, Retrieval and Browsing, against a Database]
Basic Concepts
• Logical view of the documents
• Document representation viewed as a continuum: logical view of docs might shift
[Figure: from Docs to a logical view ranging from Full text to Index terms; steps shown: structure recognition, accents and spacing, stopwords, noun groups, stemming, manual indexing]
The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, DB Manager Module, Index (inverted file), Text Database. Flows: user need, logical view, query, user feedback, retrieved docs, ranked docs]
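As a rough illustration of the stages named in the figure (text operations, indexing into an inverted file, searching, ranking), here is a minimal Python sketch. The stopword list, the function names, the sample documents, and the term-overlap scoring are all assumptions made for illustration; they are not taken from the slides or the book.

```python
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "to"}

def text_operations(text):
    """Reduce raw text to a logical view: lowercase terms with stopwords removed."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]

def build_inverted_file(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: count, per document, how many distinct query terms it contains."""
    hits = defaultdict(int)
    for term in set(text_operations(query)):
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return hits

def rank(hits):
    """Ranking (assumed scoring): more matched query terms first, ties by doc id."""
    return sorted(hits, key=lambda d: (-hits[d], d))

# Made-up sample documents.
docs = {
    1: "College tennis teams in the NCAA tournament",
    2: "Tennis rackets and strings",
    3: "The history of the NCAA",
}
index = build_inverted_file(docs)
ranked = rank(search(index, "NCAA tennis"))
print(ranked)
```

Document 1 matches both query terms and so ranks ahead of documents 2 and 3, which match one term each. A real system would use a weighted similarity measure rather than a raw match count.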
Fuzzy Sets and Logic
• Fuzzy set: the set membership function is a real-valued function with output in the range [0,1].
  • f(x): probability x is in F.
  • 1-f(x): probability x is not in F.
• Example:
  • T = {x | x is a person and x is tall}
  • Let f(x) be the probability that x is tall
  • Here f is the membership function
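The "tall" example can be sketched in Python as follows. The linear ramp and the 160 cm, 175 cm, and 190 cm cut-offs are assumptions chosen only to make the idea concrete; the slides do not specify a particular membership function.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Fuzzy membership f(x) in [0, 1] for the set T of tall people.

    Assumed shape: 0 at or below `low`, 1 at or above `high`,
    and a linear ramp in between.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def crisp_tall(height_cm, threshold=175.0):
    """Classical (crisp) set for comparison: membership is either 0 or 1."""
    return 1.0 if height_cm >= threshold else 0.0

for h in (150, 170, 175, 185, 200):
    f = tall_membership(h)
    # 1 - f(x) is the degree to which x is NOT in the set.
    print(h, round(f, 2), round(1 - f, 2), crisp_tall(h))
```

The crisp version jumps from 0 to 1 at the threshold, while the fuzzy version moves gradually between the two.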
IR is Fuzzy
[Figure: two panels, Simple and Fuzzy, each showing Accept and Reject regions]
Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual data.
  • Library Science
  • Digital Libraries
  • Web Search Engines
• Traditionally keyword based
• Sample query: Find all documents about “data mining”.
Information Retrieval
• Similarity: measure of how close a query is to a document.
• Documents which are “close enough” are retrieved.
• Metrics:
  • Precision = |Relevant and Retrieved| / |Retrieved|
  • Recall = |Relevant and Retrieved| / |Relevant|
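The two metrics can be computed directly from sets of document ids. The sketch below is illustrative: the function name, the sample ids, and the choice to return 0.0 when a denominator is empty are assumptions, not part of the slides.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for sets of relevant and retrieved doc ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    # Assumed convention: report 0.0 when the denominator is empty.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: 4 relevant docs, 3 retrieved, 2 in common.
relevant = {1, 2, 3, 4}
retrieved = {2, 3, 5}
p, r = precision_recall(relevant, retrieved)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three retrieved documents are relevant (precision 2/3), and two of the four relevant documents were retrieved (recall 1/2).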