TP6084 CAPAIAN MAKLUMAT / INFORMATION RETRIEVAL (IR): Introduction
What will be covered today… • Course overview • Introduction to IR
What this course is about • Search Engines • What are they? • How to build one? • How to evaluate one? • What are the models? • How does Google rank results? • etc. • Models? • What research is being done in this area? • What about multimedia data? • What about the semantic web? • etc.
Course Overview • What this course is about • How people search for and find information. • How computers store and retrieve information. • How computer systems are designed to help people find the information they need.
Course Overview • The course will emphasize • Understanding of • Theories • Tools • Algorithms, and • Evaluations for Information Retrieval Systems • Viewing web search engines as a practical application of IR systems
Course Content (subject to change) • Introduction • IR and Search Engines • Architecture of a Search Engine • Text processing • Indexing and Ranking • Queries & Interface • Retrieval Models • Evaluation • Classification & Clustering • Social Search
References • The textbook for this course: Croft, W.B., Metzler, D. & Strohman, T. 2009. Search Engines: Information Retrieval in Practice. New York: Addison Wesley. • Other recommended books: • Grossman, D.A. & Frieder, O. 2004. Information Retrieval: Algorithms & Heuristics, 2nd Edition. Berlin: Springer. • Baeza-Yates, R. & Ribeiro-Neto, B. 1999. Modern Information Retrieval. New York: Addison Wesley. • Manning, C., Raghavan, P. & Schütze, H. 2008. Introduction to Information Retrieval. New York: Cambridge University Press. • For general reading on search engines, you must read: • Battelle, J. 2005. The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture. New York: Portfolio Hardcover. • A list of related journal and proceedings articles will be announced from time to time during class.
Assessment • Exam – 50% • Project/Assignments – 50% • Lectures: • Monday (11 am – 12 noon) BK8 • Thursday (10 am – 12 noon) BK8
Any problems? • Dr. Shereena Arif (PhD) • Room H-2-8, IT School, Faculty of Information Science & Technology, UKM Bangi. • E-mail: shereen@ftsm.ukm.my OR shereen.ukm@gmail.com • Website/blog: shereenarif.wordpress.com • Blog dedicated to this course: tp6084.wordpress.com • Any suggested media for communication?
What is IR? • Finding relevant information in large collections of data • In such a collection you may want to find: • ‘Give me information on the history of Tun Razak’ → An article about Tun Razak (text retrieval) • ‘What does a brain tumor look like on a CT scan?’ → A picture of a brain tumor (image retrieval) • ‘It goes like this: I do, I do, I do, I do do do do do . . .’ → A certain song (music retrieval)
What is IR? • IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. [System Centered] • IR involves helping users find information that matches their information needs. [User Centered]
Text Retrieval • Online library catalogs (OPAC) • Internet search engines, such as • AltaVista, Google, Ilse • Specialized systems (aka vendors): • MEDLINE (medical articles) • Lexis-Nexis (legal, business, academic, . . . ) • Westlaw (legal articles) • Dialog (business information)
Retrieval vs. Browsing • Popular Web Directories: • Yahoo!, Open Directory Project (dmoz) • The user has to ‘guess’ the ‘right’ directories to find the information • The user has to adapt to the designers' conceptualization of the directory • The goal of information retrieval is to provide immediate random access to the data • The user can specify his information need
IR vs. Database Querying • IR is not the same thing as querying a database • Database querying assumes that the data is in a standardized format. • Transforming all information (news articles, web sites) into a database format is difficult, and for large data collections it is effectively impossible. • Text retrieval can work with plain, unformatted data.
Data Retrieval vs. Information Retrieval
                       Data retrieval       Information retrieval
Content                Data                 Information
Data object            Table                Document
Matching               Exact match          Partial match, best match
Items wanted           Matching             Relevant
Query language         SQL (artificial)     Natural
Query specification    Complete             Incomplete
Model                  Deterministic        Probabilistic
Structure              Highly structured    Less structured
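The "exact match" versus "best match" contrast in the comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the documents, and the function names `exact_match` and `best_match` are hypothetical, and counting shared query terms is a deliberately crude stand-in for a real ranking model.

```python
# Hypothetical toy data, invented for illustration only.
records = [
    {"id": 1, "author": "Croft", "year": 2009},
    {"id": 2, "author": "Manning", "year": 2008},
]

documents = {
    "d1": "search engines rank web pages",
    "d2": "database systems store structured tables",
    "d3": "web search engines index pages and rank results",
}


def exact_match(rows, field, value):
    """Data retrieval: a row either satisfies the condition or it does not."""
    return [row["id"] for row in rows if row[field] == value]


def best_match(docs, query):
    """Information retrieval: rank documents by how many query terms they share."""
    query_terms = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(exact_match(records, "year", 2008))
print(best_match(documents, "how do search engines rank results"))
```

The exact-match call returns only rows that satisfy the condition completely, while the best-match call returns a ranked list in which partially matching documents still appear.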
Relevance as Similarity • A fundamental idea within IR is: ‘A document is relevant to a query if they are similar’ • Similarity can be defined as: • string matching/comparison • similar vocabulary • same meaning of text
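Two of the similarity notions listed above, string matching and similar vocabulary, can be sketched as follows. The function names are hypothetical, and Jaccard overlap is used here as one possible vocabulary measure, not necessarily the one the course adopts. "Same meaning of text" is not covered by this sketch, since it requires semantic analysis.

```python
def contains_string(document, query):
    """String matching: is the query a literal substring of the document?"""
    return query.lower() in document.lower()


def jaccard(document, query):
    """Vocabulary similarity: shared words divided by all distinct words."""
    doc_words = set(document.lower().split())
    query_words = set(query.lower().split())
    union = doc_words | query_words
    if not union:
        return 0.0
    return len(doc_words & query_words) / len(union)


doc = "history of Tun Razak"
query = "Tun Razak history"
print(contains_string(doc, query))  # literal match fails: word order differs
print(jaccard(doc, query))          # vocabulary overlap is still high
```

The example shows why the definition of similarity matters: the same document is a miss under literal string matching but a strong match under vocabulary overlap.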
The Ubiquity of IR • Search engines • Information filtering • E-mail routing • Text categorization • Detecting information structure • Hyperlink generation • Topic/Information detection/Screening • Portal development and maintenance • Digital libraries • Question Answering
“Web brings IR to the Center of the Stage” IR has become a center of focus in the Web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.
Challenges of IR • Translating information needs into queries • Matching queries to stored information • Query result evaluation: does the information found match the user’s information needs? [Diagram: information user (info. needs), search/select, queries, stored information]
Data and Information • Data • String of symbols associated with objects, people, and events • Values of an attribute • Data need not have meaning to everyone • Data must be interpreted with associated attributes. • Information • The meaning of the data as interpreted by a person or a system • Data that changes the state of the person or system that perceives it. • Data that reduces uncertainty. • If data resolve no uncertainty, they carry no information. • Examples: It snows in the winter. It does not snow this winter.
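One common way to make "data that reduces uncertainty" concrete is Shannon's self-information, which assigns -log2(p) bits to an event of probability p. The slide does not name this measure, so treat it as an assumption; the probabilities below are hypothetical values chosen only to mirror the two snow examples.

```python
import math


def information_bits(probability):
    """Self-information in bits: -log2(p) for an event with probability p."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities, for illustration only.
p_snow_in_winter = 0.99       # expected: tells us almost nothing new
p_no_snow_this_winter = 0.01  # surprising: tells us a lot

print(information_bits(p_snow_in_winter))
print(information_bits(p_no_snow_this_winter))
```

Under this measure a certain statement (p = 1) carries zero bits, which matches the point that data resolving no uncertainty carry no information.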
Information and Knowledge [Diagram: data, information, knowledge] • Knowledge • Structured information • through structuring, information becomes understandable • Processed information • through processing, information becomes meaningful and useful • Information shared and agreed upon within a community
Text • Strings of ASCII symbols or Unicode • structured by the author • indexed by information service providers • Representation of natural languages people use • To convey meanings • To communicate between readers and authors. • Data or information? • If it can be understood, it’s information. • By whom? A person or a system?
Documents • Logical unit of text • articles, books, • links, web pages • Other components that come with the text • figures, charts, graphics • multimedia
Textual Data • Repository of human intellectual output • Rich and diverse resources for all answers. • If it is written, it is there (in text) • Meaningful and understandable (to users). • Simple ASCII representation • Free of pre-formatted structures • continuous • separated into documents • Easy to process by the computer • Machine intensive (not labor intensive)
Problems with Text • Massive • Any IR system needs the capability of large-scale data processing. • Use of indexes and various representations is required. • Inconsistent • It’s a human language • Syntactic and semantic variation • Same information expressed in different ways. • Different information expressed in similar ways. • Incomplete • It uses common knowledge. • It’s an open system.
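To illustrate why indexes help with massive text, here is a minimal sketch of an inverted index: a map from each term to the set of documents containing it, so a query touches only the relevant postings instead of scanning every document. The function names and the sample documents are hypothetical, and whitespace splitting stands in for real text processing.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample collection.
docs = {
    "d1": "information retrieval systems",
    "d2": "database retrieval",
    "d3": "information systems design",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "information systems"))
```

The index is built once and reused for every query, which is the design choice that makes retrieval over large collections practical.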
Retrieval • What do we retrieve? • Data • Information • Knowledge • We retrieve documents that contain text, which carries information. • Information can be anywhere • in the text, in the links, in the processing of the text.
Information Retrieval • Are they the same? • Text retrieval • Document retrieval • Information retrieval
Information Retrieval • Conceptually, information retrieval covers all problems related to finding needed information • Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit • Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.
IR Systems • IR systems contain three components: • System • People • Documents (information items) [Diagram: user, systems (browsing, retrieval), documents (database)]
Historical Summary • 1950s and 1960s • Basic advances in retrieval and indexing techniques • 1950: Calvin N. Mooers coins the term ‘Information Retrieval’ • 1959: Luhn describes statistical retrieval • 1960: Maron and Kuhns define a probabilistic model of IR • 1966: Cranfield project defines evaluation measures • 1968: Gerard Salton's first book about the SMART retrieval system
Historical Summary • 1990’s and 2000’s • Large-scale, full-text IR and filtering experiments and systems • Dominance of ranking • Many Web-based retrieval engines • Interfaces and browsing • Multimedia and multilingual • Machine learning techniques • Question answering (factoids) • The Future • IR in context (the right answer for you now here) • Logic-based IR? • NLP? • Integration with other functionality • Distributed, heterogeneous database access