An information retrieval system for parliamentary documents

Book: Bayesian Networks: A Practical Guide to Applications
Chapter: 12
Paper authors: Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Carlos Martín, Alfonso E. Romero
CSE 655 Probabilistic Reasoning, Faculty of Computer Science, Institute of Business Administration
Presented by Quratulain

Outline
• Introduction
• Overview of information retrieval systems
• Bayesian networks and information retrieval
• Theoretical foundations
• Building the information retrieval system
• Conclusion

Introduction/Motivation
• To meet the objectives of democracy, all activities of parliament need to be made public.
• Previously, information was sent in printed form to all official organizations and libraries.
• Currently, electronic documents are published on the web, which is faster, cheaper and easier.
• The official bulletin and the transcripts of all speeches in the different sessions are edited and then published on the website as PDF.
• The documents are accessible using database-like queries.

Problems
• To access information, the user must know:
  • the session number
  • the date of the legislature
• Information is therefore difficult to access.

Goal
• A website with a real, content-based search engine.
• Information is accessed with natural language queries.
• The system returns the relevant documents.
• The output is a set of document components of varying granularity (from a complete document to a single paragraph), sorted by degree of relevance.
** This will avoid manual search **

Overview of information retrieval
• Information retrieval is concerned with the representation, storage, organization and accessing of information items.
• Information retrieval systems work as follows:
  • Given a set of documents
  • Pre-processing
    • Remove words that are not useful in search (stopwords)
    • Convert each word to its stem (reduces the vocabulary)
  • Each word is associated with a weight expressing its importance (in the document or in the collection of documents)
  • The natural language query is indexed in the same way, and its representation is matched against the stored documents using any IR model.
  • Finally, a set of document identifiers is presented to the user, sorted by degree of relevance.
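The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the chapter: the stopword list, the suffix-stripping `stem` function, the TF-IDF weighting and the cosine ranking are all assumptions chosen for brevity, and the three sample documents are invented.

```python
import math
from collections import Counter

# Toy stopword list, assumed for illustration only.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords, stem."""
    tokens = [w.strip(".,;:!?()").lower() for w in text.split()]
    return [stem(w) for w in tokens if w and w not in STOPWORDS]

def build_index(docs):
    """Return TF-IDF weights per document and the IDF table."""
    counts = {doc_id: Counter(preprocess(text)) for doc_id, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}
    weights = {d: {t: tf * idf[t] for t, tf in c.items()} for d, c in counts.items()}
    return weights, idf

def search(query, weights, idf):
    """Index the query like a document and rank documents by cosine similarity."""
    q_counts = Counter(preprocess(query))
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    scores = {}
    for doc_id, d_vec in weights.items():
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        if dot > 0:
            d_norm = math.sqrt(sum(v * v for v in d_vec.values()))
            scores[doc_id] = dot / (q_norm * d_norm)
    return sorted(scores, key=scores.get, reverse=True)

# Invented sample collection.
docs = {
    "d1": "Budget debate in the parliament",
    "d2": "Questions on education funding",
    "d3": "The education budget and schools",
}
weights, idf = build_index(docs)
ranking = search("education budget", weights, idf)
print(ranking)
```

Only `d3` contains both query terms, so it is ranked first; the other two documents each match a single term.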

Overview of information retrieval
• Standard IR treats documents as atomic entities.
• XML allows structured documents with semantics.
• Structured IR views documents as aggregates of interrelated structural elements that are indexed.
• Structured IR models exploit both the content and the structure of documents to estimate the relevance of document components to a query.

Bayesian Networks and information retrieval
• Bayesian networks were first applied to IR at the beginning of the 1990s by Croft and Turtle.
• Bayesian network models for IR compute the probability of relevance given a document and a query.
• Two important BN models within IR:
  • Belief network model
  • Bayesian network retrieval model
• Common features:
  • Each index term and each document is represented as a node in the network.
  • Links connect each document node with the nodes of the terms that index it.
• The models differ in:
  • The direction of the arcs.
  • Additional arcs (relationships among documents and among terms).

BN-based retrieval model
[Figure: network with term nodes T1 to T7 and document nodes D1 to D3.]

Drawbacks of Bayesian networks
• The time and space required to assess the distributions and store them (the number of conditional probabilities per node grows exponentially with the number of parent nodes).
• The efficiency of carrying out inference, because general inference in BNs is an NP-hard problem.
Therefore, the direct approach, in which the evidence contained in a query is propagated through the whole network, is infeasible.
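To make the storage problem concrete, the following sketch counts the parent configurations of a single node, assuming (as in the model presented later) that the node and all of its parents are binary. The function name `cpt_rows` is hypothetical.

```python
def cpt_rows(num_parents):
    """Number of parent configurations for a node whose parents are all binary.

    A full conditional probability table needs one distribution per configuration.
    """
    return 2 ** num_parents

for k in (2, 10, 20, 30):
    print(f"{k:>2} parents -> {cpt_rows(k):,} conditional distributions")
```

A document node linked to a few dozen index terms already exceeds what can be stored or estimated directly.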

Theoretical foundations
• Set of documents D={D1, D2, ..., DM}
• Set of terms used to index these documents
• Each document Di is organized hierarchically, representing the structural associations of its elements, which are called structural units.
• These associations form a tree for each document, for example a scientific article.

The structure of a scientific article
[Figure: tree for Document 1. Node labels: Document 1; Title, Author, Abstract, Section 1, Section 2, Bibliography; Title, Parag 1, Subsec 1, Subsec 2; Title, Parag 1, Parag 2; Title, Parag 1; Ref 1, Ref 2; index terms at the bottom level.]
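A document tree of this kind can be written down as nested (name, children) pairs. The sketch below is one plausible reading of the figure; the exact nesting of the sub-units is an assumption, and the helper `leaf_and_inner_units` is hypothetical.

```python
# One plausible reading of the figure; the exact nesting is assumed.
article = ("Document 1", [
    ("Title", []),
    ("Author", []),
    ("Abstract", []),
    ("Section 1", [
        ("Title", []),
        ("Parag 1", []),
        ("Subsec 1", [("Title", []), ("Parag 1", []), ("Parag 2", [])]),
        ("Subsec 2", [("Title", []), ("Parag 1", [])]),
    ]),
    ("Section 2", []),
    ("Bibliography", [("Ref 1", []), ("Ref 2", [])]),
])

def leaf_and_inner_units(node, path=""):
    """Split the tree into units without sub-units and units that contain others."""
    name, children = node
    full_name = f"{path}/{name}" if path else name
    if not children:
        return [full_name], []
    leaves, inner = [], [full_name]
    for child in children:
        child_leaves, child_inner = leaf_and_inner_units(child, full_name)
        leaves += child_leaves
        inner += child_inner
    return leaves, inner

leaves, inner = leaf_and_inner_units(article)
print(len(leaves), "leaf units")
print(inner)
```

Each unit has exactly one parent, which is what makes the structure a tree.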

BN model for document
• The BN model of a document contains three kinds of nodes:
  • Term set, T={T1, T2, ..., Tl}
  • Basic structural units, Ub={B1, B2, ..., Bm}
  • Complex structural units, Uc={S1, S2, ..., Sm}
• Set of all structural units, U = Ub ∪ Uc
• Each node T, B or S is associated with a binary random variable taking values in {t-, t+}, {b-, b+} or {s-, s+} respectively, where (-) means not relevant and (+) means relevant.

BN model for document
• Pa(S1) ∩ Pa(S2) = ∅ for any two different units S1, S2 ∈ Uc
[Figure: network with term nodes T1 to T16, basic units B1 to B7 and complex units S1 to S4.]

BN for document
• Conditional probabilities:
  • P(t+)
  • P(b+|pa(B))
  • P(s+|pa(S))
• Because nodes can have a large number of parents, an efficient inference procedure is needed.
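The slide does not say how these conditional probabilities are specified. One way to avoid exponential tables, assumed here purely for illustration, is an additive model: each parent carries a non-negative weight, the weights of a node sum to at most 1, and the probability of relevance is the sum of the weights of the relevant parents. The function names and the weights below are hypothetical.

```python
def check_weights(weights):
    """Weights must be non-negative and sum to at most 1 (assumed model)."""
    if any(w < 0 for w in weights.values()) or sum(weights.values()) > 1.0 + 1e-9:
        raise ValueError("weights must be non-negative and sum to at most 1")

def p_relevant_given_parents(weights, relevant_parents):
    """P(u+ | pa(U)) as the sum of the weights of the relevant parents."""
    check_weights(weights)
    return sum(w for parent, w in weights.items() if parent in relevant_parents)

def p_relevant(weights, parent_probs):
    """P(u+) from the parents' own relevance probabilities.

    Because the model is additive, this is a weighted sum and needs one
    number per parent instead of one per parent configuration.
    """
    check_weights(weights)
    return sum(w * parent_probs[parent] for parent, w in weights.items())

# Hypothetical basic unit B1 indexed by three terms.
weights_b1 = {"T1": 0.5, "T2": 0.3, "T3": 0.2}
print(p_relevant_given_parents(weights_b1, {"T1", "T3"}))
# T1 appears in the query (probability 1); T2 and T3 keep a prior of 0.1.
print(p_relevant(weights_b1, {"T1": 1.0, "T2": 0.1, "T3": 0.1}))
```

Under this assumption a node with k parents needs k weights rather than 2^k table rows, and the same weighted sum can be applied one level at a time from terms up to complex units.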

Influence Diagram Model
• Once the BN has been constructed, it is transformed into an influence diagram by adding decision and utility nodes.
• Chance nodes: the nodes of the previous BN
• Decision node:
• Utility node:
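In general, an influence diagram is solved by choosing the decision with the highest expected utility. The sketch below shows that calculation for a single structural unit, assuming a binary decision (retrieve or skip) and a utility that depends only on the decision and on whether the unit is relevant. The decision names and utility values are invented for illustration and are not taken from the slide.

```python
DECISIONS = ("retrieve", "skip")

# Hypothetical utilities indexed by (decision, unit is relevant).
UTILITIES = {
    ("retrieve", True): 1.0,
    ("retrieve", False): -0.5,
    ("skip", True): -1.0,
    ("skip", False): 0.0,
}

def expected_utility(decision, p_relevant, utilities=UTILITIES):
    """Average the utility of a decision over the two relevance states."""
    return (utilities[(decision, True)] * p_relevant
            + utilities[(decision, False)] * (1.0 - p_relevant))

def best_decision(p_relevant, utilities=UTILITIES):
    """Pick the decision with the highest expected utility."""
    return max(DECISIONS, key=lambda d: expected_utility(d, p_relevant, utilities))

for p in (0.8, 0.1):
    print(p, best_decision(p), round(expected_utility("retrieve", p), 3))
```

Sorting units by the expected utility of retrieving them gives a ranked output list.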

Building the information retrieval system (PAIRS)
• PAIRS is a software package (it stores documents in a relational database).
• Written in C++.
• Specifically developed to store and retrieve documents generated by the Parliament of Andalusia.
• Based on a probabilistic model.
[Figure: General scheme of PAIRS. Components: PDF document collection, XML document collection, Indexing System, Indexed Document Collection, Query, Indexed Query, Search Engine, Retrieved Document Components.]

Conclusion
• This paper presents a retrieval system for parliamentary information based on a probabilistic model.
• The system has proven efficient in terms of indexing and retrieval time.
• Bayesian network technologies can be employed in problem domains whose dimensionality would previously have ruled out their use.
• The system is not a finished product; several improvements are still needed.
