Information Retrieval Overview Dr. Aboud Madlin
Contents • Overview • Text-based Information retrieval • Documents and Query representation • Information retrieval models • Clustering and classification • Evaluation Measures • Search Engines on WWW • References: Information Retrieval, C. J. van Rijsbergen; Automatic Text Processing, Gerard Salton
Overview • Definitions • Comparison of data retrieval and information retrieval • Text Based IR • Search Engines on WWW
Definitions: Information Retrieval • IR is a branch of applied computer science focusing on the acquisition, organization, storage, retrieval, and distribution of information. • IR involves helping users find information that matches their information needs. • IR has become a center of focus in the web era. Its theories, techniques, and applications have reached many fields where processing large amounts of information is essential.
Definitions • Document Retrieval • Automatic selection of a subset of Documents in a Corpus of Documents in such a way that the selected Documents are relevant to the Query or Information Need of the User • Text Retrieval • Retrieval of Documents written or spoken in natural language (the latter = Spoken Document Retrieval)
Components of IR Systems • Human Components • Users -- who create the needs of the system (the user) • Organization -- which makes it possible to have the system (the funder) • Information professionals -- who operate the system and provide the services (the server) • System Components • Data -- the content of the system • Device & media -- hardware of the system • Algorithms & procedures -- software of the system
Users • The user • anyone who needs to find some information • The user groups • grouped by their knowledge of the system • grouped by their domain knowledge • grouped by information needs • need to locate a particular item • need some information • need all information on a subject
User's information needs [Figure: several users, each with their own Reality and Goals, form Info. Needs; how these reach the Info. Systems is marked "??"]
[Figure: users' Reality and Goals lead to Info. Needs, then Problems, Requests, and Queries (Second Abstraction Principle); Data is abstracted into the Info. Systems (First Abstraction Principle); the match between Queries and Data is marked "??"]
Abstraction Principles • First Abstraction Principle • Abstract data from the “real world” and make them available to the system. • Second Abstraction Principle • Abstract the user’s information needs into a form the system understands.
Challenges of IR • Translating information needs to queries • Matching queries to stored information • Query result evaluation: does the information found match the user’s information needs? [Figure: User, Info. Needs, Queries, Search/select, Stored Information]
Information Retrieval Model [Figure: Documents, Analyze, Indexation, Document Model; Query, Interpretation, Query Model; Matching Function, Selected Documents; Interrogation; Thesaurus Construction]
Indexation • Indexing with natural-language Index Terms (keywords, phrases, other sophisticated structures) and full-text search: • Detection of content • Index Term weighting • Indexing with a controlled language: Thesaurus class terms, Classification Codes, ... • Both can be Manual or Computerized
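Computerized indexing with natural-language index terms can be sketched as an inverted index. The following is a minimal illustration only: the `tokenize` and `build_inverted_index` helpers and the sample documents are hypothetical, and real systems would also strip punctuation, remove stop words, and stem.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each index term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical three-document corpus.
docs = {
    1: "information retrieval systems",
    2: "data retrieval with SQL",
    3: "web search engines",
}
index = build_inverted_index(docs)
print(index["retrieval"])  # [1, 2]
```

Looking up a term is then a dictionary access, which is what makes full-text search over a large corpus practical.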
Interrogation • Allowing the user to express an Information Need using the Query Model, and answering this Query using the Matching Function: • Index Terms and Term weighting • Operators • Relevance Feedback and Query expansion with Synonyms and Related terms • The matching function is usually a numeric function that computes the degree of Relevance between a Document and a Query
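One common numeric matching function is cosine similarity between weighted term vectors. The slide does not name a specific function, so this choice is an assumption, and the `cosine` helper and the sample weights below are hypothetical.

```python
import math

def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# Hypothetical query and documents with unit term weights.
query = {"information": 1.0, "retrieval": 1.0}
doc1 = {"information": 1.0, "retrieval": 1.0, "systems": 1.0}
doc2 = {"data": 1.0, "retrieval": 1.0}

score1 = cosine(query, doc1)  # shares both query terms
score2 = cosine(query, doc2)  # shares one query term
print(score1 > score2)  # True
```

The score is a degree of relevance between 0 and 1 for non-negative weights, so documents can be ranked rather than simply accepted or rejected.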
Retrieval Models • They are defined by: • The form used to represent Documents and Queries • Matching algorithms, summarized as the quadruple (D, Q, F, R(qi, dj)): document representations D, query representations Q, a modeling framework F, and a ranking function R(qi, dj) • Examples: • Boolean & Extended Boolean Models (set theory) • Vector Space and Generalized Vector Space Models (algebra) • Semantic Models • Probabilistic Models • Inference Network Models • Logic Models • ...
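The set-theoretic basis of the Boolean model can be shown with Python sets. This is a sketch under assumed data: the `postings` table, `all_docs`, and `docs_with` are hypothetical names, not part of the slides.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
postings = {
    "information": {1, 3},
    "retrieval": {1, 2},
    "data": {2},
}
all_docs = {1, 2, 3}

def docs_with(term):
    """Return the documents containing a term (empty set if unknown)."""
    return postings.get(term, set())

# information AND retrieval -> set intersection
and_result = docs_with("information") & docs_with("retrieval")
# information OR retrieval -> set union
or_result = docs_with("information") | docs_with("retrieval")
# retrieval AND NOT data -> intersection with the complement
not_result = docs_with("retrieval") & (all_docs - docs_with("data"))

print(and_result, or_result, not_result)  # {1} {1, 2, 3} {1}
```

A document either satisfies the Boolean expression or it does not, so this model produces an unranked result set.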
Evaluation Measures • Concept of Relevance • Classical Measures: • Recall = number of relevant documents retrieved / number of all relevant documents • Precision = number of relevant documents retrieved / number of documents retrieved
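The two classical measures can be computed directly from the set of retrieved documents and the set of relevant documents. The `precision_recall` function and the document ids below are illustrative assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 5 relevant in the corpus, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Retrieving more documents tends to raise recall while lowering precision, which is why the two measures are usually reported together.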
Relevance Feedback • Consideration of User Decisions [Figure: User, Query, Reformulation, Matching Function, Documents, Selected Documents, Results]
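One classical way to reformulate a query from user decisions is the Rocchio formula, which moves the query vector toward documents judged relevant and away from those judged non-relevant. The slide does not name a method, so Rocchio is an assumption here; the `rocchio` function, the parameter defaults, and the sample vectors are illustrative.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reformulated query: alpha*q + beta*mean(rel) - gamma*mean(nonrel)."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for term in terms:
        pos = sum(d.get(term, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * query.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:  # negative weights are dropped, a common simplification
            new_query[term] = weight
    return new_query

# Hypothetical feedback: the user marks one document relevant, one non-relevant.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "car": 1.0}]
non_relevant = [{"jaguar": 1.0, "cat": 1.0}]
new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query))  # ['car', 'jaguar']
```

The reformulated query gains the term "car" from the relevant document, which is also a simple form of query expansion.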
Information Retrieval Vs Data Retrieval • Data and Information • Data • String of symbols associated with objects, people, ... • Values of an attribute • Data must be interpreted with associated attributes. • Information • The meaning of the data interpreted by a person or a system • Data that changes the state of a person or system that perceives it.
Information Retrieval Vs Data Retrieval • Information and Knowledge • Knowledge • Structured information • through structuring, information becomes understandable • Processed information • through processing, information becomes meaningful and useful knowledge [Figure labels: Data, Information]
Information Retrieval Vs Data Retrieval • Documents • Logical unit of text • articles, books, • links, web pages • Other components that come with the text • figures, charts, graphics • multimedia
Information Retrieval Vs Data Retrieval • Textual Data • Repository of human intellectual work • Rich and diverse resources for all answers. • Meaningful and understandable (to users). • Free of pre-formatted structures • continuous • separated into documents • Easy to process by the computer
Information Retrieval Vs Data Retrieval • Textual Data • Massive • Any IR system needs the capability of large-scale data processing. • Indexes and various representations are required. • Inconsistent • It is human language • The same information is expressed in different ways. • Different information is expressed in similar ways. • Incomplete (it is an open system)
Information Retrieval Vs Data Retrieval • Retrieval • Text retrieval • Document retrieval • Information retrieval • We can’t retrieve information! • We can only retrieve documents that contain text which carries information. • Information can be anywhere • in the text, in the links, in the processing of text.
Information Retrieval Vs Data Retrieval • Information Retrieval • Conceptually, information retrieval covers all problems related to finding needed information • Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit • Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.
Information Retrieval Vs Data Retrieval • Information Retrieval Systems • The goal of IR systems is to help users find information that satisfies their information needs. • The process of IR systems is to match two abstractions: • data abstracted in the system • queries abstracted from user’s information needs • Information retrieval is much more difficult than data retrieval
Comparison of data retrieval and information retrieval
                     Data retrieval      Information retrieval
Content              Data                Information
Data object          Table               Document
Matching             Exact match         Partial match, best match
Items wanted         Matching            Relevant
Query language       SQL (artificial)    Natural
Query specification  Complete            Incomplete
Model                Deterministic       Probabilistic
Structure            Highly structured   Less structured
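The difference between exact match and best match in the comparison above can be illustrated with a small sketch. The records, documents, and the `best_match` helper are made up for illustration; term overlap stands in for a real ranking function.

```python
# Data retrieval: an exact-match filter over structured records (hypothetical rows).
rows = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
]
exact = [row["id"] for row in rows if row["status"] == "open"]

# Information retrieval: rank free-text documents by how well they match.
docs = {
    1: "automatic text processing",
    2: "information retrieval",
    3: "text retrieval on the web",
}

def best_match(query, docs):
    """Rank documents by the number of query terms they share, best first."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(text.lower().split())), doc_id)
              for doc_id, text in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ranked if score > 0]

ranked = best_match("text retrieval", docs)
print(exact, ranked)  # [1] [3, 1, 2]
```

The filter returns only rows that satisfy the condition completely, while the ranking also returns partially matching documents, ordered by how well they match.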
Other methods for Retrieving Information • Browsing or Navigation System • Following Hypertext or Hypermedia links until a relevant document is reached • Question-Answering System • User asks Questions • Answer is directly extracted from the Document Collection
Text-Based Information Retrieval • Fundamental Techniques • Document and Query Representation • Term weighting schemes based on Corpus Statistics • Retrieval Models • Document Clustering/Classification • Data Structures and Search Techniques • Evaluation Measures
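A widely used term weighting scheme based on corpus statistics is tf-idf, where a term's weight grows with its frequency in the document and shrinks with the number of documents that contain it. The slide does not specify a scheme, so tf-idf with raw term frequency and a natural-log idf is an assumption; the `tf_idf` function and corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for a small corpus."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    weights = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

# Hypothetical corpus: "retrieval" occurs in two documents, "text" in one.
docs = {
    1: "retrieval of text",
    2: "retrieval of data",
    3: "web search",
}
weights = tf_idf(docs)
print(weights[1]["text"] > weights[1]["retrieval"])  # True
```

The rarer term "text" receives the higher weight in document 1 because it discriminates better between documents than "retrieval" does.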
Text-Based Information Retrieval • New Challenges • Statistical methods & Machine Learning Techniques applied to Text Retrieval • Text Categorization • Text Summarization • Cross-Language IR • Knowledge Representation and use of Knowledge Bases
Search Engines on WWW • Web Search Engines • Crawling Agents • Indexes of the Web pages • Query Interface • Answer Interface • Retrieval Models • Content based models • Link based models • Specific Purpose Search Engines
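A well-known example of a link-based model is PageRank, which scores a page by the scores of the pages linking to it. The slide does not name an algorithm, so PageRank is an assumption; the `pagerank` function, the damping value, and the three-page graph are illustrative, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical web graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # C
```

Page C ranks highest because it receives links from both A and B, independent of the text on any page; content-based and link-based scores are typically combined in practice.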
From web to semantic web • Current web problems (unstructured information) • How can we convert the unstructured information to structured information? • Semantic Web mining • Search engines
Using clustering algorithms to facilitate the browsing of search engine results • Web Document Clustering Problem • Online clustering framework vs offline clustering framework • Modeling the online clustering framework (vector-space theory) • Reducing the high dimensionality of the feature vectors • Modeling the offline clustering framework (graph-based theory) • Future research issues • Arabic language • Google and Semantic Web
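Online clustering of result snippets in the vector-space framework can be sketched with a single-pass (leader) algorithm: each snippet joins the most similar existing cluster or starts a new one. The slides do not specify an algorithm, so this choice, the 0.4 threshold, and the sample snippets are assumptions for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a snippet as a term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def single_pass_cluster(snippets, threshold=0.4):
    """Assign each snippet to the closest cluster leader, or start a new cluster."""
    clusters = []  # each cluster: {"leader": vector, "members": [snippet indices]}
    for i, text in enumerate(snippets):
        vec = vectorize(text)
        best = max(clusters, key=lambda c: cosine(vec, c["leader"]), default=None)
        if best is not None and cosine(vec, best["leader"]) >= threshold:
            best["members"].append(i)
        else:
            clusters.append({"leader": vec, "members": [i]})
    return [c["members"] for c in clusters]

# Hypothetical result snippets for the ambiguous query "jaguar".
snippets = [
    "jaguar car dealer",
    "jaguar big cat habitat",
    "used jaguar car prices",
    "big cat conservation",
]
groups = single_pass_cluster(snippets)
print(groups)  # [[0, 2], [1, 3]]
```

Because each snippet is processed once as it arrives, this style of clustering suits the online setting, at the cost of depending on the order of the results.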