WMES3103 INFORMATION RETRIEVAL WEEK 1 AND 2
WHAT IS INFORMATION RETRIEVAL? • Information Retrieval – IR • Information • Retrieval • Lancaster (1968): An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request
IR – the process of getting/retrieving information • Now: a lot of information, both print and electronic • Requirement: obtain information quickly and accurately • IR aims to provide fast, effective and efficient methods of representing, managing, searching, retrieving and presenting such information • IR = the representation, storage and organization of, and access to, information items
Computer science perspective • Design and build large-scale systems that store, manipulate, retrieve and display electronic information of any kind • Text, audio, images and graphics are stored in such a way that they are available for interaction with humans or machines • Library and information perspectives • Search features – author (au), title (ti), subject (su), keywords • Relevance of retrieved items
3 challenges for IR researchers and practitioners • Technical challenge: what tools should IR systems provide to allow effective and efficient manipulation of information within such diverse media as text, image, video and audio? • Interaction challenge: what features should IR systems provide in order to support a wide variety of users in their search for relevant information? • Evaluation challenge: how can we evaluate which tools and features are effective and usable, given the increasing diversity of end-users and information-seeking situations?
3 basic areas of research • Content analysis – describing the contents of the documents in a form suitable for computer processing • Information structures – exploiting relationships between documents to improve the efficiency and effectiveness of retrieval strategies • Evaluation – measurement of effectiveness of retrieval
Information Retrieval System • Information Retrieval System = IRS • Before: index documents and retrieve them • E.g. library OPAC – cataloguing • Now: modelling, document classification and categorization, system architecture, user interfaces, data visualization, filtering, languages • E.g. WWW
Basic Information Retrieval Process • Question, or full description of the user's information need • Translated into a query, or keywords that summarize the description of the user's information need • Query processed by a search engine or IRS • IRS retrieves information that is useful/relevant to the user
Basic Concepts in Information Retrieval • User Task • Logical View of documents
User Task • A user has to translate his information need into a query in the language provided by the system • Specify a set of words • English-language statement: I want a book by J. K. Rowling titled The Chamber of Secrets
Query entered in a computer system • Au = Rowling • Ti = Chamber of Secrets • “Chamber of Secrets” • Rowling AND Stone • Au rowling ti chamber of secrets ti stone
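A fielded query such as Au = Rowling, Ti = Chamber of Secrets can be illustrated with a minimal Python sketch. The catalogue records, the `field=value` syntax and the helper names `parse_fielded_query` and `search` are assumptions made for illustration only; real OPAC query languages differ from system to system.

```python
import re

# Hypothetical catalogue records. The field names au (author) and ti (title)
# mirror the search prefixes used on the slide.
CATALOGUE = [
    {"au": "Rowling, J. K.", "ti": "Harry Potter and the Chamber of Secrets"},
    {"au": "Rowling, J. K.", "ti": "Harry Potter and the Philosopher's Stone"},
    {"au": "Tolkien, J. R. R.", "ti": "The Hobbit"},
]

# Matches a field prefix followed by "=", e.g. "au=" or "Ti = ".
FIELD_PATTERN = re.compile(r"\b(au|ti)\s*=\s*", re.IGNORECASE)


def parse_fielded_query(query):
    """Split 'au=rowling ti=chamber of secrets' into (field, value) pairs."""
    # re.split with one capturing group returns the text before the first
    # field, then alternating field names and values.
    parts = FIELD_PATTERN.split(query)
    pairs = []
    for i in range(1, len(parts) - 1, 2):
        pairs.append((parts[i].lower(), parts[i + 1].strip().lower()))
    return pairs


def search(query, records=CATALOGUE):
    """Return records in which every fielded value occurs (implicit AND)."""
    criteria = parse_fielded_query(query)
    if not criteria:
        return []  # no recognizable field in the query
    return [
        record
        for record in records
        if all(value in record[field].lower() for field, value in criteria)
    ]


if __name__ == "__main__":
    for record in search("au=rowling ti=chamber of secrets"):
        print(record["au"], "-", record["ti"])
```

Combining the two fields narrows the result to a single record, while the author field alone matches both Rowling titles. This is the reason for translating the English statement into the system's query language.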
2 User Tasks • Two user tasks – browsing and retrieval • Browsing – the process of retrieving information whereby the main objective is not clearly defined from the beginning and may change during the interaction with the system • E.g. a user searching the Internet for information about marine organisms goes on to look for information about Australian Aborigines; the user is said to be browsing the collection, not searching it • E.g. searching for a book on the library shelves
Retrieval – the process of retrieving information whereby the main objective is clearly defined from the onset of the search • E.g. searching for a book on the library shelves
2 actions when a user interacts with an IRS • Two actions can be identified when a user interacts with an IRS: pulling and pushing • Pulling action – the user requests information interactively, e.g. browsing and retrieval • Pushing action – information is pushed towards the user periodically by specified or specially designed software; also known as filtering • E.g. Yahoo Messenger service alerting the user each time a new message arrives; online stock exchange
Interaction of the user with the IRS through distinct tasks [Figure: User, Retrieval (IR), Browsing, Database (DB)]
Logical View of Documents • Documents in a collection are represented by a set of index terms or keywords • Keywords • Abstract • Full text
Logical View of Documents • Documents in a collection are represented by a set of index terms/keywords • Documents → indexing process → keywords/subject headings = logical view of the document • Keywords are either extracted from the text of the document or assigned by humans
If full text: • Each word in the text is a keyword • Most complex form • Expensive • If the full text is too large, mechanisms built into the IRS reduce the number of keywords:
Logical view of documents - continued • Stop words (e.g. articles and connectives – a, the, an, and, of, etc.) • Stemming (reduces distinct words to their common grammatical root), e.g. run will represent runs, running; diary** will find diary or diaries • Truncation – e.g. catalog* will retrieve catalog, catalogs, catalogue, catalogues • Noun words (eliminates adjectives, adverbs, verbs) • Compression • These steps make up the conversion process
Logical view of documents - continued • This conversion process is known as text operation or transformation • It reduces the complexity of the document representation and allows the logical view to move from that of the full text to that of a set of index terms • On the other hand, human-assigned keywords provide the most concise logical view of a document but might lead to retrieval of poor quality – different interpretations, limited keywords if a thesaurus is used
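The text operations above (stop-word removal, stemming, truncation) can be sketched in a few lines of Python. This is an illustrative sketch only: the stop-word list is deliberately tiny, `naive_stem` is a crude suffix stripper and not a real stemming algorithm, and all function names are assumptions rather than part of any particular IRS.

```python
import fnmatch
import re

# Tiny illustrative stop-word list; production systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop articles and connectives that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def naive_stem(token):
    """Crude suffix stripping: diaries -> diary, runs -> run.

    A real stemmer handles many more cases, such as the doubled
    consonant in 'running'.
    """
    rules = (("ies", "y"), ("ing", ""), ("s", ""))
    for suffix, replacement in rules:
        stem = token[: len(token) - len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3 and not token.endswith("ss"):
            return stem + replacement
    return token


def index_terms(text):
    """Full text -> reduced set of index terms (the logical view)."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({naive_stem(token) for token in tokens})


def truncation_search(pattern, vocabulary):
    """Expand a truncated term such as 'catalog*' against known terms."""
    return sorted(w for w in vocabulary if fnmatch.fnmatchcase(w, pattern))


if __name__ == "__main__":
    print(index_terms("The diaries of a library and the catalogues"))
    print(truncation_search(
        "catalog*",
        ["catalog", "catalogs", "catalogue", "catalogues", "category"],
    ))
```

Each step shrinks the vocabulary the index has to hold, which is the sense in which the conversion process reduces the complexity of the document representation.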
2 modes of retrieval • Ad hoc – the documents in the IRS remain static while new queries are submitted to the system, e.g. CD-ROM database • Filtering – the queries remain relatively static while new documents come into the IRS, e.g. stock market
Filtering • Construct a user profile that reflects the user’s preferences; the profile is matched against incoming documents to find a match or hit • Only documents of interest to the user, as specified in the user profile, are retrieved • The user selects relevant documents from the list • Filtered documents can also be ranked to indicate their relevance to the user • Construction of a user profile – the user provides the necessary keywords, or the system collects information about the user’s preferences and uses it to construct the profile dynamically
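Filtering against a user profile can be sketched as follows. The sketch assumes the simplest case, a profile built from keywords supplied by the user and a score equal to the number of profile keywords found in a document. The names `build_profile`, `score` and `filter_stream` and the `threshold` parameter are hypothetical.

```python
import re


def build_profile(keywords):
    """Static user profile: a set of lower-cased keywords from the user."""
    return {keyword.lower() for keyword in keywords}


def score(profile, document):
    """Number of profile keywords that occur in the document."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return len(profile & words)


def filter_stream(profile, incoming, threshold=1):
    """Keep incoming documents that hit the profile, best match first."""
    hits = [(score(profile, doc), doc) for doc in incoming]
    hits = [(s, doc) for s, doc in hits if s >= threshold]
    # Python's sort is stable, so equally scored documents keep arrival order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]


if __name__ == "__main__":
    profile = build_profile(["stock", "market", "shares"])
    incoming = [
        "Stock market closes higher",
        "New library opens",
        "Shares fall",
    ]
    for doc in filter_stream(profile, incoming):
        print(doc)
```

Here the profile stays fixed while documents keep arriving, the reverse of the ad hoc mode. Raising `threshold` makes the filter stricter.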
INFORMATION RETRIEVAL PROCESS • A. DEFINE TEXT DATABASE • The text database has to be defined before the retrieval process begins • Done by the database manager, who specifies the documents to be used, the operations to be performed on the text, and the text model • The original documents are transformed into a logical view via the various text operations • The database manager then builds the index of the text, manually or by computer • The retrieval system is tested
B. RETRIEVAL PROCESS • The IRS can be used once the document database has been indexed • The user presents his question/user need to the IRS • The question is converted into a logical view via the text operations • The query operations then present it to the system in a form the system understands • The query is processed to obtain the retrieved documents
Continued… • The retrieved documents are ranked according to relevance • The retrieved documents are sent to the user • The user looks through the ranked documents and can modify the question/user need/query via the user feedback cycle • The same process is repeated
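The retrieval steps above (process the query, rank the retrieved documents, let the user refine the query) can be sketched as below. The ranking rule, a count of distinct query terms present in each document, is an assumption chosen for brevity and not a method stated in these slides. `DOCS`, `terms` and `rank` are hypothetical names.

```python
import re

# Hypothetical document database: document id -> text.
DOCS = {
    1: "information retrieval systems index documents",
    2: "library catalogue of books",
    3: "retrieval of documents from a digital library",
}


def terms(text):
    """Minimal text operation: lower-case and split into words."""
    return re.findall(r"[a-z]+", text.lower())


def rank(query, docs=DOCS):
    """Return ids of matching documents, highest term overlap first."""
    query_terms = set(terms(query))
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & set(terms(text)))
        if overlap:  # documents sharing no term are not retrieved
            scored.append((overlap, doc_id))
    # Sort by descending overlap, then by id to break ties predictably.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    print(rank("documents retrieval"))
    # Feedback cycle: after looking at the results, the user reformulates
    # the query and the same process is repeated.
    print(rank("digital library retrieval"))
```

The second call stands in for the user feedback cycle: the user reformulates the query after inspecting the first ranked list, and the system repeats the same steps.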
DEVELOPMENT • For the past 4000 years, man has organized information for retrieval and use • It started with the table of contents of a book; then the amount of information grew to span many books • A specialized data structure is needed to ensure faster access to the stored information • The oldest and most popular data structure for fast IR is a collection of words or concepts, each with associated pointers to the related information = INDEX • Previously – manual
Development…continued • Now, with the advent of computers, large indexes can be generated automatically. These automatic indexes provide the logical view of the documents as perceived by the system, not the user • 2 different views of the IR problem: • Computer-centered – building efficient indexes, processing user queries with high performance, and developing ranking algorithms that improve the quality of the answer set • Human-centered – studying the behavior of the user, understanding his main needs, and determining how such understanding affects the organization and operation of the IRS
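The index described in this section, a collection of words each with pointers to the related information, is commonly implemented as an inverted index. A minimal sketch of generating one automatically, with hypothetical names `build_index` and `lookup` and document ids standing in for the pointers:

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map every word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Follow the pointers for one word; empty list if the word is unknown."""
    # .get() avoids creating an empty entry for unknown words.
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    docs = {
        1: "Table of contents",
        2: "Contents of the library",
        3: "Library index",
    }
    index = build_index(docs)
    print(lookup(index, "library"))
    print(lookup(index, "contents"))
```

A lookup touches only the entry for the requested word instead of scanning every document, which is what makes the index a structure for fast access.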
IR in the Library • Libraries were among the first users of IR systems to retrieve information • Usually developed by academic institutions and later by commercial vendors • 1st generation – automation of the card catalog; allowed searches based on author and title • 2nd generation – increased search functionality: searching by subject headings, keywords, complex queries – OPAC • 3rd generation – graphical interfaces, electronic forms, hypertext features, open system architecture – Digital Libraries
The Web and Digital Libraries • Search engines on the web still use indexes similar to the ones used by libraries years ago • So, what has changed? • Advances in computer technology have led to: • Cheaper access to various sources of information • Greater access to networks due to advances in all kinds of digital communication • Freedom to post information on the web
Problems • People still find it difficult to retrieve info relevant to their information needs from the web • Issues to address: • Dynamic world on the web • Demand for access and quick response • Quality of retrieval task is affected by user interaction with the system