A proposal for an unsupervised extraction method based on information retrieval to perform various IETS tasks, eliminating the need for user involvement in source-specific training processes and offering flexibility in extraction styles.
ONDUX: On-Demand Unsupervised Learning for Information Extraction Eli Cortez, Altigran da Silva and Edleno de Moura Federal University of Amazonas (UFAM) - BRAZIL Marcos Gonçalves Federal University of Minas Gerais (UFMG) - BRAZIL
Agenda • Introduction • Information Extraction by Text Segmentation • Challenges • Related Work • ONDUX • Experiments • Conclusions and Future Work
Introduction (I) • Abundance of on-line sources of text documents containing implicit semi-structured data records • Addresses • Bibliographic References • Classified Ads • Product Descriptions
Introduction (II) Classified Ad Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 Address Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214 Bibliographic Reference Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
Introduction (III) • Why extract information? • Database Storage, Query… • Data Mining • Record Linkage. <Neighborhood> : Regent Square <Price> : $228,900 <No.> : 1028 <Street> : Mifflin Ave. <Bed.> : 6 Bedrooms <Bath.> : 2 Bathrooms <Phone> : 412-638-7273 Classified Ad Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
IETS – Challenges(I) • Information Extraction by Text Segmentation (IETS) • Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09 • Diversity of templates and styles • Attribute Ordering • Capitalization • Abbreviations. • Different applications share similar domains • Ex.: Address and Ads • Records from both domains contain address information
IETS – Challenges(II) • Diversity of templates and styles • Attribute Ordering; Capitalization; Abbreviations. Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006 HomePage Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221(2006) DBLP Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006 ACM
IETS – Challenges(III) • Existing approaches to this problem use Machine Learning techniques • Hidden Markov Models (HMM) • Conditional Random Fields (CRF) • Structured Support Vector Machines (SSVM) • (Semi-)supervised approaches require a hand-labeled training set created by an expert • Each generated model is specific to a given application • High computational cost
Related Work • [Borkar et al. @ SIGMOD 2001] • Supervised extraction method based on Hidden Markov Models (HMM) • [McCallum et al. @ ICML 2001] • Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF) • [Mansuri et al. @ ICDE 2006] • Semi-supervised approach based on CRF models All of these approaches require an expert to create a hand-labeled training set for each application.
Related Work (II) • [Agichtein et al. @ SIGKDD 2004] • Use of reference tables to create an unsupervised model based on Hidden Markov Models (HMM) • [Zhao et al. @ SIAM ICDM 2008] • Use of reference tables to create unsupervised CRF models (U-CRF) • [Cortez et al. @ JASIST 2009] • Unsupervised method to extract bibliographic information Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?) Domain-specific heuristics, not generally applicable.
Contributions • Proposal of an extraction method based on information retrieval to perform IETS tasks; • Eliminates the need for user involvement in any source-specific training process; • Flexible, in the sense that it does not rely on any particular style to perform the extraction • Unsupervised Reinforcement Phase • Attribute ordering and positioning learned On-Demand • Experimental comparison with the state-of-the-art information extraction approach (CRF).
Basic Concepts(I) • Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of: • Segmenting I • Assigning to each segment a label corresponding to an attribute
Basic Concepts(II) • Knowledge Base • Set of pairs KB = {(a_1, O_1), ..., (a_n, O_n)}, each pairing an attribute a_i with a set O_i of its known values • Easily built from pre-existing sources • Bibliographic DBs, Freebase, Google Fusion Tables, etc. KB = { (Neighborhood, O_Neigh.), (Street, O_Street), (Phone, O_Phone) } O_Neigh. = { “Regent Square”, “Milenight Park” } O_Street = { “Regent St.”, “Morewood Ave.”, “Square Ave. Park” } O_Phone = { “323 462-6252”, “(171) 289-7527” }
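A minimal sketch of how such a knowledge base could be held in memory, using the example values from the slide. The dictionary layout and the helper `attributes_containing` are illustrative assumptions, not something the slides specify.

```python
from typing import Dict, List, Set

# Assumed layout: attribute name -> set of known values (occurrences).
KnowledgeBase = Dict[str, Set[str]]

kb: KnowledgeBase = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def attributes_containing(kb: KnowledgeBase, term: str) -> List[str]:
    """Hypothetical helper: attributes with a known value that contains
    `term` as a whitespace-separated token (case-insensitive)."""
    term = term.lower()
    return sorted(
        attribute
        for attribute, occurrences in kb.items()
        if any(term in value.lower().split() for value in occurrences)
    )
```

The term "Regent" appears under both Neighborhood and Street here, which is the kind of ambiguity the later matching and reinforcement steps have to resolve.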
ONDUX (I) • Three main steps • Blocking • Matching • Reinforcement
ONDUX (II) • General View 1
ONDUX (III) • Blocking • Split the input text into substrings called blocks; • Consider the co-occurrence of consecutive terms based on the KB Left separated (no presence in the KB) Co-occur in the KB (Neighborhood) Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
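The blocking idea can be sketched as follows. This is an illustration under stated assumptions rather than the authors' algorithm: tokens are split on whitespace, trailing punctuation is ignored when comparing, and two consecutive terms are kept in the same block when both appear in at least one common KB value. The function names and the toy KB are hypothetical.

```python
from typing import Dict, List, Set


def normalize(token: str) -> str:
    # Assumption: compare terms case-insensitively, ignoring edge punctuation.
    return token.strip(".,;").lower()


def blocking(text: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Split `text` into blocks; consecutive terms stay together only if
    they co-occur in some known value of the KB."""
    value_terms = [
        {normalize(t) for t in value.split()}
        for occurrences in kb.values()
        for value in occurrences
    ]
    blocks: List[List[str]] = []
    previous = None
    for token in text.split():
        current = normalize(token)
        together = previous is not None and any(
            previous in terms and current in terms for terms in value_terms
        )
        if together:
            blocks[-1].append(token)   # extend the current block
        else:
            blocks.append([token])     # start a new block
        previous = current
    return [" ".join(block) for block in blocks]


# Toy KB, assumed for illustration only.
toy_kb = {"Neighborhood": {"Regent Square"}, "Bed.": {"6 Bedrooms"}}
blocks = blocking("Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms;", toy_kb)
```

With this toy KB, "Regent Square" and "6 Bedrooms;" each form one block, while terms absent from the KB are left as separate single-term blocks.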
ONDUX (IV) • General View 1 2
ONDUX (V) • Matching • Associate each block generated in the previous phase with an attribute according to the Knowledge Base • Use distinct functions to compute the similarity between a block and the known values of the attributes in the KB
ONDUX (VI) • Matching • Textual Values: FF Function (Field Frequency) • Similarity between the terms in the block and the terms of a given attribute of the KB • Numeric Values: NM Function (Numeric Matching) [Agrawal @ CIDR 2003] • Similarity between the value in the block and the mean and standard deviation of a numeric attribute in the KB
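The slides describe the two matching functions only in words, so the following is a sketch under explicit assumptions, not the published definitions. Here `ff` averages, over the block's terms, the share of KB values containing the term that belong to the given attribute; `nm` scores a number with a Gaussian-shaped function of its distance from the attribute's mean, scaled by the standard deviation. Both formulas and all names are illustrative.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Set


def terms(text: str) -> List[str]:
    # Assumption: whitespace tokens, lowercased, edge punctuation removed.
    return [t.strip(".,;").lower() for t in text.split()]


def ff(block: str, attribute: str, kb: Dict[str, Set[str]]) -> float:
    """Assumed field-frequency score in [0, 1] for a textual block."""
    block_terms = terms(block)
    if not block_terms:
        return 0.0
    total = 0.0
    for t in block_terms:
        in_attribute = sum(t in terms(v) for v in kb[attribute])
        in_kb = sum(t in terms(v) for values in kb.values() for v in values)
        if in_kb:
            total += in_attribute / in_kb
    return total / len(block_terms)


def nm(value: float, known_values: Sequence[float]) -> float:
    """Assumed numeric-matching score: 1.0 at the mean, decaying with distance."""
    mu = mean(known_values)
    sigma = pstdev(known_values)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))
```

Under these assumptions a block such as "Regent Square" scores the same for Neighborhood and Street against the example KB, which illustrates why matching alone can leave blocks mislabeled.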
ONDUX (VI) • Matching Street Price No. ??? Street Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 Bed. Bath. Phone
ONDUX (VII) • How can we deal with blocks that were incorrectly labeled or were not associated with any attribute? Street Price No. ??? Street Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 Bed. Bath. Phone
ONDUX (VIII) • Reinforcement • Review the labeling task performed in the Matching step • Unmatched blocks must receive the label of some attribute • Mismatched blocks must be correctly labeled • How to handle these cases? • Using positioning and sequencing information obtained On-Demand.
ONDUX (IX) • General View 3 2
ONDUX (X) • Reinforcement • Given the extraction output of the Matching step • ONDUX automatically builds a graphical structure, the PSM. • PSM: Positioning and Sequencing Model.
ONDUX (XI) • Reinforcement – PSM • In the PSM, states represent the attributes of the KB, plus the special states start and end • Edges represent transition probabilities • Ordering and positioning probabilities are learned On-Demand from the test instances through the Matching Phase
ONDUX (XII) • Reinforcement • Remarks • The PSM is automatically learned On-Demand from test instances • No a priori training required • No assumptions regarding a particular order of attribute values • Relies on the very effective strategies deployed in the Matching Step
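A rough sketch of how a PSM could be estimated from the Matching output. The slides do not give the estimation formulas, so this assumes plain relative frequencies: transition probabilities from counts of consecutive labels (including the special start and end states), and positioning probabilities from counts of labels at each block position. Unmatched blocks are represented as `None` and skipped; that choice, like the function name `build_psm`, is an assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Labels = List[Optional[str]]  # one label per block; None = unmatched


def build_psm(records: List[Labels]) -> Tuple[Dict[str, Dict[str, float]],
                                               Dict[int, Dict[str, float]]]:
    """Estimate transition and positioning probabilities from labeled records."""
    transition_counts: Dict[str, Counter] = defaultdict(Counter)
    position_counts: Dict[int, Counter] = defaultdict(Counter)
    for labels in records:
        path = ["start", *labels, "end"]
        for src, dst in zip(path, path[1:]):
            if src is not None and dst is not None:
                transition_counts[src][dst] += 1
        for k, label in enumerate(labels):
            if label is not None:
                position_counts[k][label] += 1

    def to_probabilities(counts):
        # Turn each counter into a relative-frequency distribution.
        return {
            key: {x: c / sum(counter.values()) for x, c in counter.items()}
            for key, counter in counts.items()
        }

    return to_probabilities(transition_counts), to_probabilities(position_counts)


# Labels as they might come out of the Matching step (illustrative data).
matched = [
    ["Neighborhood", "Price", "Street"],
    ["Neighborhood", "Price", None],
    ["Price", "Neighborhood", "Street"],
]
transitions, positions = build_psm(matched)
```

Because the counts come from the very records being extracted, no ordering is fixed in advance; whatever ordering dominates the input batch dominates the model.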
ONDUX (XIII) • Reinforcement • Once the PSM is built, we combine the matching, sequencing and positioning evidence using the Bayesian OR operator. Matching Sequence Positioning
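The slide names the operator but not its formula. A sketch, assuming the usual noisy-OR form in which the combined score is one minus the product of the complements of the individual scores; the function name `bayesian_or` and the sample values are illustrative.

```python
def bayesian_or(*scores: float) -> float:
    """Combine evidence scores in [0, 1] as 1 - prod(1 - s_i) (assumed form)."""
    complement = 1.0
    for s in scores:
        complement *= 1.0 - s
    return 1.0 - complement


# Example: a weak matching score reinforced by sequencing and positioning evidence.
final_score = bayesian_or(0.5, 0.5, 0.0)
```

With this form, any single strong piece of evidence is enough to produce a high combined score, and a zero score from one source never lowers the result.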
ONDUX (XIV) • Reinforcement • Extraction Result Street Neighborhood Street Price No. ??? Street Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273 Bed. Bath. Phone
ONDUX (XV) • Overview 3 1 2
Experiments (I) • Setup • We tested our proposed approach with several sources from 3 distinct domains: • Addresses • BigBook, Restaurants [RISE] • Bibliographic Data • CORA [Peng@IPM’06], PersonalBib [Mansuri@ICDE’06] • Classified Ads • 7 distinct newspaper sites [Oliveira@SBBD’06] • We limited the presentation to one experiment per domain. More in the paper
Experiments (II) • Evaluation • Metrics • Precision, Recall and F-Measure • T-Test for the statistical validation of the results • Baselines • Conditional Random Fields (CRF) • U-CRF (Unsupervised method) [Zhao@SICDM’ 08] • S-CRF (Classical supervised method) [Peng@IPM’ 06]
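For reference, the three metrics can be computed as in the sketch below. How the experiments count a correct extraction is not stated on the slides; this assumes a per-attribute comparison between the set of extracted values and the set of reference values. The function name and sample values are illustrative.

```python
from typing import Iterable, Tuple


def precision_recall_f1(extracted: Iterable[str],
                        reference: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure for one attribute (assumed set-based)."""
    extracted_set, reference_set = set(extracted), set(reference)
    correct = len(extracted_set & reference_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: two values were labeled Street, but only one is truly a street.
p, r, f = precision_recall_f1(["Mifflin Ave.", "Regent Square"], ["Mifflin Ave."])
```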
Experiments (III) • Extraction Quality • U-CRF results are similar to those in Zhao@SICDM (validation) • The dataset follows the single-order assumption • After Reinforcement, ONDUX achieved similar quality
Experiments (IV) • Extraction Quality • CORA includes a variety of citation styles (conference, journal, books, etc.) • S-CRF achieved better results than U-CRF due to the hand-labeled training set • In general, ONDUX outperformed the CRF models
Experiments (V) • Extraction Quality • U-CRF performed poorly (very heterogeneous dataset) • Due to the Matching Phase and the PSM that is learned On-Demand, ONDUX achieves very high quality results
Experiments (VI) • Varying the number of terms common to test instances and the KB • Determine how much the quality of the results depends on the overlap between the previously known data and the text input. • These experiments were conducted with the BigBook dataset.
Experiments (VII) • Varying the number of shared terms • Starting with a batch of 500 input strings, ONDUX achieved high quality results once the overlap reached 500 terms • Even when the Matching Phase yields poor quality, the PSM is able to increase ONDUX’s quality in the Reinforcement Step
Experiments (VIII) • Varying the number of shared terms • As the number of shared terms increases, the quality achieved by the Matching Phase improves
Conclusions and Future Work (I) • New approach for information extraction independent of the style of the data records • ONDUX • Flexible: Does not assume any particular style • Unsupervised: Does not require any human effort to create a training set • On-Demand: Ordering and positioning information is learned through the Matching Phase
Conclusions and Future Work (II) • The proposed strategy achieves good precision and recall • Small size of the Knowledge Base • Comparison with the state of the art • Future Work • Investigate different matching functions; • Nested structures?
Acknowledgements UFMG