Information Retrieval

Acknowledgements: Dr Mounia Lalmas (QMW), Dr Joemon Jose (Glasgow)

Course Text
• Modern Information Retrieval
• R. Baeza-Yates and B. Ribeiro-Neto
• Addison-Wesley and ACM Press, 1999
• ISBN: 0-201-39829-X

Introduction
• Example of an information need in the context of the World Wide Web:
• “Find all documents containing information on computer courses which: (1) are offered by universities in South England, and (2) are accredited by the BCS/IEE bodies. To be relevant, the document must include information on admission requirements, and e-mail and phone number for contact purpose.”

Information Retrieval
• Representation, storage, organisation, and access to information items
• (Usually) keyword-based representation
[Figure labels: Information Need; Query; Documents; Search Engine / Retrieval System; Set of retrieved documents; Useful or relevant information to the user]
Primary goal of an IR system
• “Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”

User tasks
• Pull technology
  • User requests information in an interactive manner
  • 3 retrieval tasks
    • Browsing (hypertext)
    • Retrieval (classical IR systems)
    • Browsing and retrieval (modern digital libraries and web systems)
• Push technology
  • Automatic and permanent pushing of information to the user
  • Software agents
  • Example: news service
  • Filtering (retrieval task): relevant information for later inspection by the user

Documents
• Unit of retrieval
• A passage of free text
  • composed of text, strings of characters from an alphabet
  • composed of natural language
  • e.g. a newspaper article, a journal paper, a dictionary definition, an email message
• Size of documents
  • arbitrary
  • newspaper article vs. journal paper vs. email

What is a document?

Representation of documents
• Set of index terms or keywords
  • extracted directly from text
  • specified by human subjects (information science) → metadata
  • Most concise representation
  • Poor quality of retrieval
• Full text representation
  • Most complete representation
  • High computational cost
• Large collections
  • Reduce set of representative keywords
    • Elimination of stop words
    • Stemming
    • Identification of noun phrases
  • Further compression
• Structure representation
  • Chapter, section, sub-section, etc.
Document term descriptors to access texts
• Generation of descriptors for text
  • By hand
  • By analysing the text
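
Generating descriptors by analysing the text can be sketched as follows. This is only an illustration: the stop list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (it is not the Porter algorithm). The function names are hypothetical.

```python
import re

# Tiny illustrative stop list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "in", "is", "of", "on", "the", "to"}

# Suffixes tried in order; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ies", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return the sorted set of stemmed, non-stop-word terms in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({naive_stem(t) for t in tokens if t not in STOP_WORDS})


print(index_terms("Retrieving documents and the indexing of documents"))
# prints ['document', 'index', 'retriev']
```

The stems need not be dictionary words; they only have to map related word forms to the same index term.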

The retrieval process
[Figure labels: Information need; Formulation; Query; Documents; Indexing; Document representation; Retrieval functions; Retrieved documents; Relevance feedback]

Queries
User term descriptors characterising the user need
• Information need
• Simple queries
  • composed of two or three, or perhaps even dozens of, keywords
  • e.g., as in web retrieval
• Boolean queries
  • “neural networks AND speech recognition”
• Context queries
  • Proximity search, phrase queries
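
A rough sketch of how Boolean, phrase, and proximity queries might be evaluated by scanning documents directly. It assumes whitespace tokenisation with no punctuation handling, and the function names and sample documents are invented for illustration.

```python
def contains_phrase(tokens, phrase):
    """Phrase query: True if the words of `phrase` occur contiguously in `tokens`."""
    words = phrase.lower().split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def boolean_and(docs, phrases):
    """Boolean AND: ids of documents that contain every phrase."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if all(contains_phrase(text.lower().split(), p) for p in phrases)
    ]


def within_distance(tokens, a, b, k):
    """Proximity search: True if terms a and b occur at most k positions apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)


# Hypothetical sample collection.
docs = {
    "d1": "neural networks for speech recognition",
    "d2": "speech recognition with hidden markov models",
    "d3": "recognition of speech using neural networks",
}

print(boolean_and(docs, ["neural networks", "speech recognition"]))
# prints ['d1']
print(within_distance(docs["d3"].split(), "speech", "recognition", 2))
# prints True
```

Document d3 fails the phrase query because "speech" and "recognition" are not adjacent, but it satisfies the looser proximity condition.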

Best-Match Retrieval
• Compare the terms in a document and query
• Compute similarity between each document in the collection and the query, based on the terms that they have in common
• Sort the documents in order of decreasing similarity with the query
• The output is a ranked list displayed to the user; the top documents are the most relevant as judged by the system
[Figure labels: Document term descriptors to access texts; User term descriptors characterising the user need]
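
The steps above can be sketched as follows. The similarity used here, the number of distinct terms shared by query and document, is an assumption chosen for simplicity; the slides do not fix a particular measure. Names and sample data are hypothetical.

```python
def term_set(text):
    """Distinct lower-cased terms, split on whitespace."""
    return set(text.lower().split())


def similarity(query, document):
    """Number of distinct terms the query and the document have in common."""
    return len(term_set(query) & term_set(document))


def rank(query, docs):
    """Return (doc_id, score) pairs in order of decreasing similarity.

    Ties are broken by document id so the ordering is deterministic.
    """
    scored = [(doc_id, similarity(query, text)) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample collection.
docs = {
    "d1": "information retrieval systems",
    "d2": "database systems",
    "d3": "retrieval of information from text",
}

for doc_id, score in rank("information retrieval systems text", docs):
    print(doc_id, score)
```

Unlike a Boolean query, every document receives a score, so partially matching documents still appear lower in the ranked list.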

Conceptual View of Text Retrieval
[Figure labels: Queries; Documents; Similarity Computation; Retrieved Documents]

Expanded view of text retrieval system
[Figure labels: Queries; Documents; Indexing; Indexed Documents; Similarity Computation; Retrieved Documents; Ranked Documents]

Process of retrieving info
[Figure labels: User Interface; User need; User feedback; Text; Text Operations; Logical view; Query Operations; Query; Indexing; Inverted file; Index; Document Repository Manager; Text repository; Similarity Computation; Retrieved docs; Ranking; Ranked docs]
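
The figure names an inverted file as the index structure. A minimal sketch of one, assuming whitespace tokenisation and an in-memory dictionary (the function names and sample documents are hypothetical):

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Sorted ids of documents containing the term; empty list if unseen."""
    # .get avoids creating an entry for a term that was never indexed.
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample collection.
docs = {
    "d1": "information retrieval systems",
    "d2": "database systems",
    "d3": "retrieval of information from text",
}

index = build_inverted_file(docs)
print(lookup(index, "systems"))
# prints ['d1', 'd2']
```

With such an index, a query term is answered by a single dictionary lookup instead of a scan over every document.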

Key Topics
• Indexing text documents
• Retrieving text documents
• Evaluation
• Query reformulations
Search Engines = IR + Link Structure + Name Interpretation

Information Retrieval vs Information Extraction
• Information Retrieval
  • Given a set of query terms and a set of document terms, select only the most relevant documents [precision], and preferably all the relevant documents [recall].
• Information Extraction
  • Extract from the text what the document means.
• IR systems can FIND documents but need not “understand” them
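
Precision and recall, mentioned in brackets above, can be computed as sketched below, using the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name and document ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids.

    Both values are reported as 0.0 when their denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant in the collection.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).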
