An information retrieval system for parliamentary documents
Book: Bayesian Networks: A Practical Guide to Applications
Chapter: 12
Paper authors: Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, Carlos Martín, Alfonso E. Romero
CSE 655 Probabilistic Reasoning
Faculty of Computer Science, Institute of Business Administration
Presented by Quratulain
Outline
• Introduction
• Overview of information retrieval systems
• Bayesian network and information retrieval
• Theoretical foundations
• Building the information retrieval system
• Conclusion
Introduction/Motivation
• To fulfil the objectives of democracy, all activities of parliament need to be made public.
• Previously, this information was sent in printed form to official organizations and libraries.
• Currently, electronic documents are published on the web, which is faster, cheaper and easier.
• The official bulletins and the transcripts of all speeches in the different sessions are edited and then published on the website as PDF.
• The documents are accessible using database-like queries.
Problems
• To access information, the user must know:
  • the session number
  • the date of the legislature
• This makes the information difficult to access.
Goal
• A website with a real search engine based on content.
• Information is accessed with a natural language query.
• The system returns the documents relevant to the query.
• The output is a set of document components of varying granularity (from a complete document down to a single paragraph), sorted by degree of relevance.
** This avoids manual search **
Overview of information retrieval
• Information retrieval is concerned with the representation, storage, organization and accessing of information items.
• An information retrieval system works as follows:
  • It is given a set of documents.
  • Pre-processing:
    • remove words that are not useful for search (stopwords)
    • reduce each word to its stem (which shrinks the vocabulary)
  • Each term is assigned a weight expressing its importance (in the document or in the whole collection).
  • The natural language query is indexed in the same way, and its representation is matched against the stored documents using an IR model.
  • Finally, a list of document identifiers is presented to the user, sorted by degree of relevance.
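The steps above can be sketched in a few lines of Python. This is only an illustration, not the system described in the chapter: the stopword list, the crude suffix-stripping `stem` function and the tf-idf weighting are all assumptions standing in for whatever a real system would use.

```python
import math
import re
from collections import Counter

# Hypothetical toy stopword list; real systems use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was"}

def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

def build_index(docs):
    """Return tf-idf weights per document: {doc_id: {term: weight}}."""
    counts = {d: Counter(preprocess(text)) for d, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    return {
        d: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in c.items()}
        for d, c in counts.items()
    }

def search(query, index):
    """Rank document identifiers by the summed weight of matching query terms."""
    q_terms = set(preprocess(query))
    scores = {d: sum(w.get(t, 0.0) for t in q_terms) for d, w in index.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(d, s) for d, s in ranked if s > 0]

# Hypothetical three-document collection.
docs = {
    "d1": "The budget debate in the plenary session",
    "d2": "Questions on education funding",
    "d3": "Education budget and school funding debates",
}
index = build_index(docs)
results = search("education budgets", index)
```

Here `results` lists `d3` first because it matches both query terms, while `d1` and `d2` each match only one.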
Overview of information retrieval
• Standard IR treats documents as atomic entities.
• XML makes it possible to represent structured documents with semantic markup.
• Structured IR views documents as aggregates of interrelated structural elements, each of which is indexed.
• Structured IR models exploit both the content and the structure of documents to estimate the relevance of document components to a query.
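To make the idea of retrievable document components concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The markup and tag names are hypothetical, not the format used for the parliamentary documents. Every element, from the whole document down to a single paragraph, becomes a candidate answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the tag names are illustrative only.
XML = """
<document>
  <title>Plenary session</title>
  <section>
    <title>Budget</title>
    <paragraph>The budget debate opened the session.</paragraph>
    <paragraph>Education funding was discussed.</paragraph>
  </section>
</document>
"""

def components(element, path=""):
    """Yield (path, text) for an element and each of its descendants."""
    here = f"{path}/{element.tag}"
    # itertext() gathers the text of the element and everything below it.
    text = " ".join("".join(element.itertext()).split())
    yield here, text
    for child in element:
        yield from components(child, here)

root = ET.fromstring(XML)
units = list(components(root))

# Components of any granularity whose text mentions a word.
matches = [path for path, text in units if "funding" in text.lower()]
```

In this toy example the word "funding" matches the whole document, its section and one paragraph, which is why a structured IR model needs a way to decide which level of granularity to return.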
Bayesian networks and information retrieval
• Bayesian networks were first applied to IR in the early 1990s by Turtle and Croft.
• Bayesian network IR models compute the probability of relevance given a document and a query.
• Two important BN models within IR:
  • Belief network model
  • Bayesian network retrieval model
• Common features:
  • Each index term and each document is represented as a node in the network.
  • Links connect each document node with the nodes of the terms that index it.
• The models differ in:
  • the direction of the arcs
  • additional arcs (relationships between documents and between terms)
BN-based retrieval model
[Figure: a layer of term nodes T1 to T7 and a layer of document nodes D1 to D3, linked by arcs.]
Drawbacks of Bayesian networks
• The time and space required to assess the probability distributions and store them: the size of each node's conditional probability table is exponential in its number of parents.
• The efficiency of inference: general inference in BNs is an NP-hard problem.
Therefore, the direct approach, in which the evidence contained in a query is propagated through the whole network, is infeasible.
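A quick calculation shows how fast the storage cost grows. The sketch below assumes binary variables, so a node with n parents needs one probability per parent configuration, that is 2^n values; the function name is hypothetical.

```python
def cpt_rows(num_parents):
    """Number of parent configurations for a node whose parents are all binary."""
    return 2 ** num_parents

# Table sizes for a few parent counts.
sizes = {n: cpt_rows(n) for n in (5, 10, 20, 30)}
for n, rows in sizes.items():
    print(f"{n} parents: {rows} configurations")
```

A document node linked to every term that indexes it can easily have hundreds of parents, so explicit tables are out of the question.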
Theoretical foundations
• Set of documents D = {D1, D2, ..., DM}
• Set of terms used to index these documents
• Each document Di is organized hierarchically into elements called structural units, related by structural associations.
• These associations form a tree for each document, as in a scientific article, for example.
The structure of a scientific article
[Figure: tree for Document 1. The root contains Title, Author, Abstract, Section 1, Section 2 and Bibliography; lower levels contain subsections (Subsec 1, Subsec 2), titles, paragraphs (Parag 1, Parag 2) and references (Ref 1, Ref 2), with the index terms at the leaves.]
BN model for document
• The BN model of a document contains three kinds of nodes:
  • Term nodes, T = {T1, T2, ..., Tl}
  • Basic structural units, Ub = {B1, B2, ..., Bm}
  • Complex structural units, Uc = {S1, S2, ..., Sn}
• Set of all structural units: U = Ub ∪ Uc
• Each node T, B or S has an associated binary random variable taking values in {t-, t+}, {b-, b+} or {s-, s+} respectively, where (-) means not relevant and (+) means relevant.
BN model for document
• Each structural unit is contained in at most one complex unit: Pa(S1) ∩ Pa(S2) = ∅ for all S1 ≠ S2 ∈ Uc.
[Figure: example network with term nodes T1 to T16, basic structural units B1 to B7 and complex structural units S1 to S4.]
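The tree constraint, that no unit is a parent of two different complex units, is easy to state as a check on parent sets. The following is a minimal sketch on a hypothetical toy network; the node names and the convention that complex units start with "S" are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy network: the parent set of each basic (B) and complex (S) unit.
parents = {
    "B1": {"T1", "T2"},
    "B2": {"T2", "T3"},
    "B3": {"T4"},
    "S1": {"B1", "B2"},
    "S2": {"B3", "S1"},
}

def parent_sets_disjoint(parents):
    """True if no unit is a parent of two different complex units."""
    complex_units = [u for u in parents if u.startswith("S")]
    return all(
        parents[a].isdisjoint(parents[b])
        for a, b in combinations(complex_units, 2)
    )

ok = parent_sets_disjoint(parents)
```

Note that terms may be shared between basic units (T2 above indexes both B1 and B2); the constraint applies only to the parents of complex units.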
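BN for document
• Conditional probabilities to be specified:
  • P(t+) for each term node
  • P(b+ | pa(B)) for each basic structural unit
  • P(s+ | pa(S)) for each complex structural unit
• Because nodes can have a large number of parents, an efficient inference procedure is needed.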
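The slide does not say how these distributions are specified. One option that avoids exponentially large tables, assumed here purely for illustration, is an additive canonical model: P(u+ | pa(U)) is the sum of the weights w(X, U) of the parents X that are relevant in the configuration pa(U), with the weights of each unit non-negative and summing to at most 1. Under that assumption the posterior of a unit is P(u+ | Q) = Σ w(X, U) · P(x+ | Q), so one bottom-up pass suffices. The weights, priors and the treatment of query terms as certainly relevant are all hypothetical.

```python
# Hypothetical weights w[unit][parent]; each unit's weights sum to at most 1.
weights = {
    "B1": {"T1": 0.6, "T2": 0.4},
    "B2": {"T2": 0.5, "T3": 0.5},
    "S1": {"B1": 0.5, "B2": 0.5},
}
# Hypothetical prior relevance P(t+) of each term.
prior = {"T1": 0.1, "T2": 0.1, "T3": 0.1}

def posterior(query_terms, weights, prior):
    """Return P(node+ | query) for every node under the additive model."""
    # Assumption: terms in the query are relevant with certainty,
    # all other terms keep their prior.
    prob = {t: (1.0 if t in query_terms else p) for t, p in prior.items()}

    def relevance(unit):
        # Weighted sum of the parents' posteriors, memoized in prob.
        if unit not in prob:
            prob[unit] = sum(
                w * relevance(parent) for parent, w in weights[unit].items()
            )
        return prob[unit]

    for unit in weights:
        relevance(unit)
    return prob

p = posterior({"T1"}, weights, prior)
```

Each node is visited once and each arc contributes one multiplication, so the cost is linear in the size of the network rather than exponential in the number of parents.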
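Influence diagram model
• Once the BN has been constructed, it is transformed into an influence diagram by adding decision and utility nodes.
• Chance nodes: the nodes of the previous BN.
• Decision nodes: one per structural unit, representing whether or not to retrieve that unit.
• Utility nodes: one per structural unit, measuring the value of the corresponding decision.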
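As an illustration of how decision and utility nodes turn probabilities into a ranking, the sketch below computes, for each structural unit, the expected utility of retrieving it versus skipping it. It is a simplification under stated assumptions: each unit is decided independently, and the utility values and posterior probabilities are hypothetical numbers, not taken from the chapter.

```python
# Hypothetical utilities v(decision, state of the unit).
utility = {
    ("retrieve", "u+"): 1.0,   # relevant unit shown to the user
    ("retrieve", "u-"): -0.5,  # irrelevant unit shown to the user
    ("skip", "u+"): -1.0,      # relevant unit missed
    ("skip", "u-"): 0.0,       # irrelevant unit correctly left out
}

def expected_utility(p_relevant, utility):
    """Expected utility of each decision, given P(u+ | query)."""
    return {
        d: utility[(d, "u+")] * p_relevant + utility[(d, "u-")] * (1 - p_relevant)
        for d in ("retrieve", "skip")
    }

def rank_units(posteriors, utility):
    """Keep units where retrieving beats skipping, best expected utility first."""
    kept = []
    for unit, p_rel in posteriors.items():
        eu = expected_utility(p_rel, utility)
        if eu["retrieve"] > eu["skip"]:
            kept.append((eu["retrieve"], unit))
    return [unit for _, unit in sorted(kept, reverse=True)]

# Hypothetical posterior probabilities P(u+ | query).
posteriors = {"S1": 0.37, "B1": 0.64, "B2": 0.10}
ranked = rank_units(posteriors, utility)
```

With these numbers B1 is ranked ahead of S1, and B2 is dropped because skipping it has the higher expected utility.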
Building the information retrieval system (PAIRS)
• PAIRS is a software package (it stores documents in a relational database).
• Written in C++.
• Specifically developed to store and retrieve the documents generated by the Parliament of Andalusia.
• Based on a probabilistic model.
[Figure: general scheme of PAIRS. Components shown: PDF document collection, XML document collection, indexing system, indexed document collection, query, indexed query, search engine, retrieved document components.]
Conclusion
• This paper presents a retrieval system for parliamentary information based on a probabilistic model.
• The system has proven efficient in terms of indexing and retrieval time.
• Bayesian network technology can be employed in problem domains whose dimensionality would previously have ruled out its use.
• The system is not a finished product; several improvements are still needed.