Intelligent Information Retrieval • CS 336 • Xiaoyan Li • Spring 2006 • Modified from Lisa Ballesteros’s slides
What is Information Retrieval? • Includes the following: • Organization • Storage/Representation • Manipulation/Analysis • Search/Retrieval • How far back in history can we find examples?
IR Through the Ages • 3rd Century BCE • Library of Alexandria • 500,000 volumes • catalogs and classifications • 13th Century A.D. • First concordance of the Bible • What is a concordance? • 15th Century A.D. • Invention of printing • 1600 • University of Oxford Library • All books printed in England
IR Through the Ages • 1755 • Johnson’s Dictionary • Set standard for dictionaries • Included common language • Helped standardize spelling • 1800 • Library of Congress • 1828 • Webster’s Dictionary • Significantly larger than previous dictionaries • Standardized American spelling • 1852 • Roget’s Thesaurus
IR Through the Ages • 1876 • Dewey Decimal Classification • 1880’s • Carnegie Public Libraries • 1,681 built (first public library 1850) • 1930’s • Punched card retrieval systems • 1940’s • Bush’s Memex • Shannon’s Communication Theory • Zipf’s “Law”
Historical Summary • 1960’s • Basic advances in retrieval and indexing techniques • 1970’s • Probabilistic and vector space models • Clustering, relevance feedback • Large, on-line, Boolean information services • Fast string matching • 1980’s • Natural Language Processing and IR • Expert systems and IR • Off-the-shelf IR systems
IR Through the Ages • Late 1980’s • First minicomputer and PC systems incorporating “relevance ranking” • Early 1990’s • Information storage revolution • 1992 • First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)
IR Through the Ages • Mid 1990’s to present • Multimedia databases • 1994 to present • The Internet and Web explosion • e.g. Google, Yahoo, Lycos, Infoseek (now Go) • 1995 to present • Digital Libraries • Data Mining • Agents and Filtering • Knowledge and Distributed Intelligence • Information Organization • Knowledge Management
Historical Summary • 1990’s • Large-scale, full-text IR and filtering experiments and systems (TREC) • Dominance of ranking • Many web-based retrieval engines • Interfaces and browsing • Multimedia and multilingual • Machine learning techniques
Trends in IR Technology • [Figure: volume of on-line information (gigabytes, terabytes, petabytes) plotted against time (1970, 1990, and later), with technologies labeled on the plot: Boolean Retrieval and Filtering; Ranked Retrieval; Ranked Filtering; Concept-Based Retrieval; Information Extraction; Summarization; Distributed Retrieval; Data Mining; Visualization; Image and Video Retrieval] • Eras along the time axis: batch systems, interactive systems, database systems, cheap storage, Internet, multimedia • A 1-page Word document without any images = ~10 kilobytes (KB) of disk space • 1 terabyte = one hundred million image-free Word documents • 1 petabyte = one thousand terabytes
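The scale figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 10^3 bytes, 1 TB = 10^12 bytes), which the slide does not state; the constant names are illustrative only.

```python
# Decimal (SI) units are an assumption; binary units would shift the results slightly.
KB = 10**3
TB = 10**12
PB = 10**15

# The slide's estimate for a 1-page Word document without images.
DOC_SIZE_BYTES = 10 * KB

docs_per_terabyte = TB // DOC_SIZE_BYTES
terabytes_per_petabyte = PB // TB

print(f"{docs_per_terabyte:,} documents per terabyte")
print(f"{terabytes_per_petabyte:,} terabytes per petabyte")
```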
Historical Summary • The Future • Logic-based IR? • NLP? • Integration with other functionality • Distributed, heterogeneous database access • IR in context • “Anytime, Anywhere”
Information Retrieval • Ad Hoc Retrieval • Given a query and a large database of text objects, find the relevant objects • Distributed Retrieval • Many distributed databases • Information Filtering • Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match • Multimedia Retrieval • Databases of other types of unstructured data, e.g. images, video, audio
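Filtering turns ad hoc retrieval around: the profiles are fixed and the documents arrive one at a time. The following is a minimal sketch of that idea, assuming each profile is simply a set of terms that must all appear in the incoming text. Real filtering systems typically score and threshold instead; the profile names and the sample story are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matching_profiles(document, profiles):
    """Return the names of profiles whose terms all occur in the document."""
    words = tokenize(document)
    return sorted(name for name, terms in profiles.items() if terms <= words)

# Hypothetical long-term queries, each a set of required terms.
profiles = {
    "energy": {"oil", "prices"},
    "space": {"nasa", "launch"},
    "markets": {"stock", "prices"},
}

# One text object from the stream, e.g. a newswire story.
story = "Oil prices rose sharply as stock markets reacted to the news."
matched = matching_profiles(story, profiles)
print(matched)
```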
Information Retrieval • Multilingual Retrieval • Retrieval in a language other than English • Cross-language Retrieval • Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)
Information Retrieval • Text Representation (Indexing) • given a text document, identify the concepts that describe the content and how well they describe it • what makes a “good” representation? • how is a representation generated from text? • what are retrievable objects and how are they organized? • Representing an Information Need (Query Formulation) • describe and refine information needs as explicit queries • what is an appropriate query language? • how can interactive query formulation and refinement be supported?
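One common way to represent text for retrieval is an inverted index that maps each term to the documents containing it. The sketch below assumes single words stand in for "concepts" and raw term frequency stands in for "how well they describe" a document. Both are simplifications chosen for illustration, and the document collection is invented.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to a dict of {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index

# A tiny, made-up collection of retrievable objects.
docs = {
    "d1": "Information retrieval finds relevant documents.",
    "d2": "Database systems store structured data.",
    "d3": "Retrieval systems rank documents; ranking improves retrieval.",
}

index = build_inverted_index(docs)
print(index["retrieval"])
print(index["systems"])
```

A query term can then be looked up directly in `index` instead of scanning every document.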
Information Retrieval • Comparing Representations (Retrieval) • compare text and information need representations to determine which documents are likely to be relevant • what is a “good” model of retrieval? • how is uncertainty represented? • Evaluating Retrieved Text (Feedback) • present documents for user evaluation and modify query based on feedback • what are good metrics? • what constitutes a good experimental testbed?
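As one concrete instance of comparing representations, the sketch below ranks documents by cosine similarity between term-frequency vectors, in the spirit of the vector space model mentioned in the historical summary. It is only one possible retrieval model. The unweighted term counts, function names, and sample texts are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent text as a bag of words with raw term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return ids of documents with a nonzero score, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]

docs = {
    "d1": "ranking documents by relevance",
    "d2": "concurrency control in database systems",
    "d3": "relevance feedback improves ranking of retrieved documents",
}

ranking = rank("relevance ranking", docs)
print(ranking)
```

Both `d1` and `d3` contain the two query terms, but `d1` is shorter, so the query terms account for a larger share of its vector and it ranks first.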
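Information Retrieval and Filtering • [Figure: an information need is represented as a query; text objects are represented as indexed objects; comparing the query with the indexed objects yields retrieved objects; evaluation/feedback on the retrieved objects is used to modify the query]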
Features of a Modern IR Product • Effective “relevance ranking” • Simple free text (“natural language”) query capability • Boolean and proximity operators • Term weighting • Query formulation assistance • Query by example • Filtering • Field-based retrieval • Distributed architecture • Index anything • Fast retrieval • Information Organization
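To make the Boolean and proximity operators concrete, here is a minimal sketch that evaluates an AND test and a NEAR test against a single text using word positions. The function names and the window semantics (distance measured in word positions) are assumptions for illustration and do not reflect any particular product's query language.

```python
import re

def positions(text):
    """Map each term to the list of word positions where it occurs."""
    index = {}
    for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        index.setdefault(term, []).append(pos)
    return index

def matches_and(text, terms):
    """Boolean AND: every term must occur somewhere in the text."""
    idx = positions(text)
    return all(term in idx for term in terms)

def matches_near(text, a, b, window):
    """Proximity: some occurrence of a lies within `window` words of b."""
    idx = positions(text)
    return any(abs(i - j) <= window
               for i in idx.get(a, [])
               for j in idx.get(b, []))

text = "Boolean systems dominated early on-line information retrieval services"

print(matches_and(text, ["boolean", "retrieval"]))
print(matches_near(text, "boolean", "retrieval", 3))
print(matches_near(text, "information", "retrieval", 1))
```

The second test fails even though both words are present, which is the point of a proximity operator: it is stricter than AND.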
Typical Systems • IR systems • Verity, Fulcrum, Excalibur • Database systems • Oracle, Informix • Web search and In-house systems • West, LEXIS/NEXIS, Dialog • Yahoo, Google, MSN, AskJeeves
IR vs. Database Systems • Emphasis on effective, efficient retrieval of unstructured data • IR systems typically have very simple schemas • Query languages emphasize free text, although Boolean combinations of words are also common
IR vs. Database Systems • Matching is more complex than with structured data (semantics less obvious) • easy to retrieve the wrong objects • need to measure accuracy of retrieval • Less focus on concurrency control and recovery, although update is very important
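Measuring the accuracy of retrieval usually means comparing what the system returned against human relevance judgments. The sketch below computes precision and recall, two standard measures that the slides do not name explicitly. The document ids and judgments are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical system output and relevance judgments for one query.
retrieved = ["d1", "d3", "d7", "d9"]
relevant = ["d1", "d2", "d3"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were found, so retrieving the wrong objects lowers precision while missing relevant ones lowers recall.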