Information Retrieval
Acknowledgements: Dr Mounia Lalmas (QMW), Dr Joemon Jose (Glasgow)
Course Text
• Modern Information Retrieval
• R. Baeza-Yates and B. Ribeiro-Neto
• Addison-Wesley and ACM Press, 1999
• ISBN: 0-201-39829-X
Introduction
• Example of an information need in the context of the World Wide Web:
• “Find all documents containing information on computer courses which: (1) are offered by universities in South England, and (2) are accredited by the BCS/IEE bodies. To be relevant, the document must include information on admission requirements, and e-mail and phone number for contact purpose.”
• Information Retrieval
Information Retrieval
• Representation, storage, organisation of, and access to information items
• (Usually) keyword-based representation
[Diagram; labels: Information Need, Query, Search Engine / Retrieval System, Documents, Set of retrieved documents, Useful or relevant information to the user]
Primary goal of an IR system
• “Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”
User tasks
• Pull technology
  • User requests information in an interactive manner
  • 3 retrieval tasks
    • Browsing (hypertext)
    • Retrieval (classical IR systems)
    • Browsing and retrieval (modern digital libraries and web systems)
• Push technology
  • Automatic and permanent pushing of information to the user
  • Software agents
  • Example: news service
  • Filtering (retrieval task): relevant information for later inspection by the user
Documents
• Unit of retrieval
• A passage of free text
  • composed of text, i.e. strings of characters from an alphabet
  • composed of natural language
  • e.g. a newspaper article, a journal paper, a dictionary definition, email messages
• Size of documents
  • arbitrary
  • newspaper article vs. journal paper vs. email
Representation of documents
• Document term descriptors, used to access texts
• Generation of descriptors for text
  • By hand
  • By analysing the text
• Set of index terms or keywords
  • extracted directly from the text
  • specified by human subjects (information science): metadata
  • Most concise representation
  • Poor quality of retrieval
• Full text representation
  • Most complete representation
  • High computational cost
• Large collections
  • Reduce the set of representative keywords
    • Elimination of stop words
    • Stemming
    • Identification of noun phrases
  • Further compression
• Structure representation
  • Chapter, section, sub-section, etc.
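The keyword-reduction steps listed above (stop-word elimination and stemming) can be sketched as follows. This is a minimal illustration, not the course's method: the stop list `STOP_WORDS` and the suffix-stripping helper `naive_stem` are hypothetical stand-ins, far cruder than a real stop list or a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return the sorted set of index terms for a passage of free text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sorted({naive_stem(t) for t in tokens if t not in STOP_WORDS})


print(index_terms("Indexing the documents"))
```

The result is much smaller than the full text, which is the trade-off the slide describes: a concise representation at the price of retrieval quality.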
The retrieval process
[Diagram; labels: Information need, Formulation, Query, Documents, Indexing, Document representation, Retrieval functions, Retrieved documents, Relevance feedback]
Queries
• User term descriptors characterising the user need
• Express the user's information need
• Simple queries
  • composed of two or three, perhaps even dozens of, keywords
  • e.g., as in web retrieval
• Boolean queries
  • “neural networks AND speech recognition”
• Context queries
  • Proximity search, phrase queries
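As a rough illustration of the Boolean and phrase queries above, the sketch below evaluates a query such as “neural networks AND speech recognition” against a single document. It assumes whitespace tokenisation, treats each operand of `AND` as an exact phrase, and supports no other operators; the function names are hypothetical.

```python
def contains_phrase(tokens, phrase):
    """True if the words of `phrase` occur contiguously, in order, in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def boolean_and(query, text):
    """Evaluate 'phrase AND phrase AND ...' against one document's text."""
    tokens = text.lower().split()
    # Split on the operator first, then normalise each operand to lowercase words.
    operands = [part.lower().split() for part in query.split(" AND ")]
    return all(contains_phrase(tokens, phrase) for phrase in operands)


query = "neural networks AND speech recognition"
print(boolean_and(query, "speech recognition with neural networks"))
print(boolean_and(query, "neural speech networks recognition"))
```

The second document contains all four words but neither phrase, so it does not match. A document either satisfies a Boolean query or it does not; there is no ranking.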
Best-Match Retrieval
• Compare the terms in a document and in the query
• Compute the similarity between each document in the collection and the query, based on the terms they have in common
• Sort the documents in order of decreasing similarity with the query
• The output is a ranked list displayed to the user; the top documents are the most relevant as judged by the system
[Callouts: document term descriptors to access texts; user term descriptors characterising the user need]
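The steps above can be sketched in a few lines. The similarity used here, the number of distinct terms shared by query and document, is an assumption chosen for brevity; the slides do not fix a particular measure, and practical systems usually weight terms. The function name `rank` and the sample collection are hypothetical.

```python
def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        # Similarity = number of distinct terms the document shares with the query.
        score = len(query_terms & set(text.lower().split()))
        scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "information retrieval systems",
    "d2": "speech recognition",
    "d3": "retrieval of information from text documents",
}
for doc_id, score in rank("information retrieval text", docs):
    print(doc_id, score)
```

Unlike a Boolean query, every document receives a score, so partially matching documents still appear in the list, further down.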
Conceptual View of Text Retrieval
[Diagram; labels: Queries, Documents, Similarity Computation, Retrieved Documents]
Expanded view of text retrieval system
[Diagram; labels: Queries, Documents, Indexing, Indexed Documents, Similarity Computation, Retrieved Documents, Ranked Documents]
Process of retrieving info
[Diagram; labels: User Interface, User need, User feedback, Text, Text Operations, Logical view, Query Operations, Query, Indexing, Document Repository Manager, Inverted file, Index, Text repository, Similarity Computation, Retrieved docs, Ranking, Ranked docs]
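The inverted file named in this diagram maps each term to the documents that contain it, so a query term can be looked up without scanning every text. A minimal in-memory sketch follows, assuming whitespace tokenisation and no text operations such as stemming; the function name and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {
    "d1": "information retrieval systems",
    "d2": "speech recognition systems",
}
index = build_inverted_index(docs)
# .get() avoids creating an empty entry for a term that was never indexed.
print(sorted(index.get("systems", set())))
print(sorted(index.get("retrieval", set())))
print(sorted(index.get("browsing", set())))
```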
Key Topics
• Indexing text documents
• Retrieving text documents
• Evaluation
• Query reformulations
Search Engines = IR + Link Structure + Name Interpretation
Information Retrieval vs Information Extraction
• Information Retrieval
  • Given a set of query terms and a set of document terms, select only the most relevant documents [precision], and preferably all the relevant ones [recall].
• Information Extraction
  • Extract from the text what the document means.
• IR systems can FIND documents but need not “understand” them
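Precision and recall, mentioned in brackets above, have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The sketch below computes both from two lists of document ids; the function name and the ids are hypothetical, and returning 0.0 for an empty set is a convention assumed here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67).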