Joint Unsupervised Structure Discovery and Information Extraction

Joint Unsupervised Structure Discovery and Information Extraction. Eli Cortez, Daniel Oliveira, Altigran S. da Silva, Edleno S. de Moura, Alberto H. F. Laender. Univ. Fed. de Minas Gerais (UFMG), Brazil; Univ. Fed. do Amazonas (UFAM), Brazil. Presented by Eli Cortez. ACM SIGMOD Conference, Athens, Greece, June 2011

The IETS Problem • Information Extraction by Text Segmentation • Goal: • To extract attribute values occurring in implicit semi-structured data records • Current IETS methods are able to accurately predict a sequence of labels to be assigned to a sequence of text segments corresponding to attribute values • HMM – Borkar et al. (SIGMOD01) • CRF – Lafferty et al. (ICML01) • ONDUX – Cortez et al. (SIGMOD10)

Examples – Delimited Records Product Descriptions Apple iPad 2 Wi-Fi + 3G 64 GB - Apple iOS 4 1 GHz - Black $589 LG - 32LE5300 - 32" LED-backlit LCD TV - 1080p (FullHD) - $400 Samsung - UN55D7000 - 55" Class ( 54.6" viewable ) LED-backlit LCD ... $2,048 Mixter Max Accessory Plasma TV Rack Tilt Bracket 248-A05 $65 HP Deskjet 3050 All-in-One Color Ink-jet - Printer / copier / scanner $50 Bibliographic Citations L. Barbosa and J. Freire. Using Latent-structure to Detect … In Proc. of the 13th WeDB, pages 1–6, 2010. A. Doan et. al. Information Extraction Challenges in Managing .. SIGMOD Record, 37(4):14–20, 2008. J. Pearl and G. Shafer. Probabilistic reasoning in intelligent systems: Morgan Kaufmann, 1988. Classified Ads $1106 / 2br - Luxury 2 BR, 1 BA apartment loaded with amenities - (Bothell) $1945 / 2br - Beautiful HighPoint Community "Built Green" 2 BR 2.5 Bth Town Home! - (West Seattle) $735 / 1br - Top floor 1 bedroom apt available just minutes from downtown!! - (Seattle,Burien,Highline) $820 / 1br - Lovely 1 bedroom 1k sq ft! Nearly a 2 bdrm! - (Federal Way,Edgewood,Milton, Tacoma) $895 / 2br - ****Lovely 2-Bedroom/2-Bathroom Condo with a View! FREE RENT!!!**** - (Monroe)

Example Non-delimited Records Chocolate Cake Recipe 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce 2 cups all-purpose flour 1/4 cup cocoa powder 2 teaspoons baking soda 1/8 teaspoon salt 1 cup raisins 1/4 cup dark rum

Current IETS Methods • Assume input records are already separated • e.g., manually by a user or using HTML-based heuristics • Infeasible in fully automatic settings 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce …

JUDIE • Structure Discovery + Information Extraction • Jointly carried out in an unsupervised way • Suitable for fully automatic settings: raw text streaming, crawler output, micro-blogs, etc 1/2 cup butter 2 eggs 4 cups white sugar ground cinnamon 2 tablespoons dark rum 6 chopped pecans 1/2 cup milk 1 1/2 cups applesauce … JUDIE

JUDIE • Joint Unsupervised Structure Discovery and Information Extraction • Introduces a new Structure Discovery Algorithm • Detects the structure of each individual record being extracted without any user intervention • Looks for frequent patterns of label repetitions or cycles • Integrates this algorithm in the IE process • Accomplished by successive refinement steps that alternate information extraction and structure discovery

Related Work – IETS Approaches/Methods • Probabilistic – Supervised • Hidden Markov Models (HMM) • Borkar et al.@SIGMOD’01; McCallum et al.@AAAI’00 • Conditional Random Fields (CRF) • Lafferty et al.@ICML’01; McCallum et al.@IPM’06 • Require training instances labeled on each input text <Neighborhood>Regent Square </Neighborhood> <Price> $228,900 </Price> <No>1028 </No><Street>Mifflin Ave, </Street> <Bed>6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone>412-638-7273 </Phone>

Related Work - IETS Approaches / Methods • Probabilistic – Unsupervised • Rely on previously built datasets • Unsup. HMM (Agichtein et al.@SIGKDD’04) • Relies on records in reference tables • Batches of fixed-order records as input • Unsup. CRF (Zhao et al.@SIAM ICDM’08) • Also relies on reference tables • Batches of fixed-order records as input • ONDUX (Cortez et al.@SIGMOD’10) • Knowledge base: sets of typical values per attribute, no records • All of them require one input record at a time • No structure discovery

JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 1st IE Step: Structure-free Labeling

JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 1st SD Step: Structure Sketching

JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I U I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 2nd IE Step: Structure-aware Labeling

JUDIE Overview 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I U I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla 2nd SD Step: Structure Refinement

JUDIE – Structure-free Labeling • What is the best label for each segment? • No structural information is available • Initially labels potential values with attribute names • No information on the structure of the data records • Resort only to content-related features • Learned from the pre-existing KB 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

Features – Content Related • Features Considered: KB Bayes. Noisy OR A1 Attribute Vocabulary A2 Ingredient Value Range White sugar A3 Value Format
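As a rough illustration of the "Bayes. Noisy OR" box on this slide, the sketch below combines the three content-related feature scores (attribute vocabulary, value range, value format) with a noisy-OR, 1 - Π(1 - s_i). The slides do not give the exact formulation, so the function names and all score values here are hypothetical.

```python
from math import prod

def noisy_or(scores):
    """Combine independent evidence scores in [0, 1] with a noisy-OR."""
    return 1.0 - prod(1.0 - s for s in scores)

def best_label(feature_scores):
    """Return the attribute whose combined score is highest.

    feature_scores maps an attribute name to its per-feature scores,
    e.g. [vocabulary, value_range, value_format] (hypothetical values).
    """
    combined = {attr: noisy_or(s) for attr, s in feature_scores.items()}
    return max(combined, key=combined.get), combined

# Hypothetical scores for the segment "White sugar".
scores_for_segment = {
    "Ingredient": [0.5, 0.0, 0.5],
    "Unit": [0.1, 0.0, 0.0],
    "Quantity": [0.0, 0.0, 0.0],
}
label, combined = best_label(scores_for_segment)
print(label, combined)
```

A noisy-OR lets any single strong feature push the combined score up, which fits a setting where a segment may match the KB on vocabulary but not on format, or the other way round.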

JUDIE – Structure-free Labeling • Initially labels potential values with attribute names • No information on the structure of the data records • Resort only to content-related features • Learned from the pre-existing KB 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Limitations: Label Fault : “Tbsp” Misassignment : “a little”

JUDIE – Structure Sketching • Organizes the labeled candidate values into records • Induces a structure on the unstructured text input • Outputs labeled values grouped into records • Uses a novel algorithm called Structure Discovery (SD) Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

The SD Algorithm • Uncovers the structure of implicit records from the input text • Used in the Structure Sketching and Structure Refinement steps • Takes as input a sequence of labels and generates the structure of each record • Assumption: It is possible to identify patterns of sequences by looking for cycles in a graph (the Adjacency Graph) that models the ordering of labels

The SD Algorithm Title Conference Year Author Author Title Conference Year Author Title Conference Year … Author Title Journal Issue Year Author Title Journal Issue Year Author Author Journal Issue Year Title Year … Author Title Conference Year Author Author Author Title Journal Issue Year Conference Title Year Author Journal Issue

The SD Algorithm Exploits the occurrence of cycles in the adjacency graph [Author, Title, Conference, Year] [Author, Title, Journal, Issue, Year] [Title,Conference, Year] Conference Title Year Author Journal Issue

The SD Algorithm Coincident Cycles TitleConferenceYear Author Author TitleConferenceYear Author TitleConferenceYear … Author TitleJournalIssueYear Author TitleJournalIssueYear Author Author JournalIssueYearTitleYear … Author TitleConferenceYear Author Author Author TitleJournalIssueYear Viable Cycle Conference Title Year Author Journal Issue

The SD Algorithm • Dominant Cycles • Among the coincident cycles that are also viable, the dominant cycles are those that occur most frequently in the input • The algorithm works by first identifying all dominant cycles in the adjacency graph and then processing each of these cycles • In the given example, the dominant cycles are: • [Author, Title, Journal, Issue, Year] • [Author, Title, Conference, Year] • [Author, Journal, Issue, Year] • [Title, Conference, Year] • [Title, Year]
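The idea of looking for cycles in the adjacency graph can be sketched as follows. This is only an illustration under simplifying assumptions: it builds the graph from a single label sequence, enumerates simple cycles by depth-first search, and ranks them by how often they appear contiguously in the input. The coincident and viable tests from the slides are not defined here and are omitted; all function names are hypothetical.

```python
from collections import defaultdict

def build_adjacency_graph(labels):
    """Map each label to the set of labels that immediately follow it."""
    graph = defaultdict(set)
    for current, following in zip(labels, labels[1:]):
        if current != following:  # skip self-loops such as Author -> Author
            graph[current].add(following)
    return graph

def simple_cycles(graph):
    """Enumerate simple cycles; each one starts at its alphabetically smallest label."""
    cycles = []
    for start in sorted(graph):
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            for nxt in sorted(graph.get(node, ())):
                if nxt == start:
                    cycles.append(tuple(path))
                elif nxt > start and nxt not in path:
                    stack.append((nxt, path + [nxt]))
    return cycles

def count_occurrences(labels, cycle):
    """Count contiguous occurrences of the cycle in the label sequence."""
    size = len(cycle)
    return sum(tuple(labels[i:i + size]) == cycle
               for i in range(len(labels) - size + 1))

labels = ("Author Title Conference Year "
          "Author Title Journal Issue Year "
          "Author Title Conference Year").split()
graph = build_adjacency_graph(labels)
cycles = simple_cycles(graph)
ranked = sorted(((c, count_occurrences(labels, c)) for c in cycles),
                key=lambda item: item[1], reverse=True)
for cycle, frequency in ranked:
    print(frequency, list(cycle))
```

Each cycle is reported in the rotation that starts at its alphabetically smallest label, which happens to coincide with the record start (Author) in this toy input; a fuller version would need to consider all rotations.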

JUDIE – Structure Sketching • Organizes the labeled candidate values into records • Induces a structure on the unstructured text input • Outputs labeled values grouped into records • Uses a novel algorithm called Structure Discovery (SD) Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

JUDIE – Structure-aware Labeling • Now, what is the best label for each segment? • We already know some structural information • Re-labels segments considering content-related features and structure-based features • Structure-based features learned using a graphical model (PSM) Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

Positioning and Sequencing Model (PSM) • Built from the Structure Sketching output • States: attribute labels • Likelihood of: • absolute position of labels within text segments • relative position considering other labels [Figure: example PSM with states START, QUANTITY, UNIT, INGREDIENT and END, with transition probabilities on the edges]
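A minimal sketch of how such a model could be estimated from the records produced by Structure Sketching, assuming simple relative-frequency estimates. Here the absolute position is taken to be the index of a label within its record, which is an assumption; the function name and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

def build_psm(records):
    """Estimate transition and position likelihoods from labeled records."""
    transitions = defaultdict(Counter)  # label -> counts of the next label
    positions = defaultdict(Counter)    # label -> counts of its index in a record
    for record in records:
        path = ["START", *record, "END"]
        for src, dst in zip(path, path[1:]):
            transitions[src][dst] += 1
        for index, label in enumerate(record):
            positions[label][index] += 1

    def normalize(table):
        # Turn raw counts into relative frequencies per source key.
        return {key: {item: n / sum(counts.values()) for item, n in counts.items()}
                for key, counts in table.items()}

    return normalize(transitions), normalize(positions)

# Toy records with Q = Quantity, U = Unit, I = Ingredient.
records = [["Q", "U", "I"], ["Q", "U", "I"], ["Q", "I"], ["U", "I"]]
trans, pos = build_psm(records)
print(trans["START"])  # likelihood of each label opening a record
print(pos["I"])        # likelihood of Ingredient at each position
```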

JUDIE – Structure-aware Labeling KB Content-related features Bayes. Noisy OR Quantity A little

JUDIE – Structure-aware Labeling • Labels textual values considering both content-related features and structure-based features • Uses a graphical model representing the likelihood of attribute transitions within the input text Q U I Q U ? I U I Q U I Q I I I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

JUDIE – Structure Refinement • Applies the SD algorithm again • Considers the output of the structure-aware labeling • Fixes structural problems • Structure-aware labeling produces more precise results Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla Q U I Q U U I U I Q U I Q I Q I 1/2 cup raising flour 2 level Tbsp Cocoa pinch Salt 1/4 cup Melted butter 1 Egg a little Vanilla

JUDIE Overview Phase 1 Structure-free Labeling Structure Sketching Phase 2 Structure-aware Labeling Structure Refinement

Experiments • Datasets previously used in other papers • Only 3 of the domains are discussed in this presentation; more results are in the paper.

Metrics • F-Measure • Harmonic mean of precision and recall • Attribute-Level • Results considering values of a single attribute in all output records • Record-Level • Results considering all attributes in a single record • Average over the results of all records • T-Test for the statistical validation of the results
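For reference, the F-measure is F = 2PR / (P + R), where P is precision and R is recall. The sketch below computes it for one attribute over sets of extracted values; the function name and the example values are hypothetical, not taken from the experiments.

```python
def f_measure(predicted, expected):
    """Harmonic mean of precision and recall over two collections of values."""
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(expected)
    return 2 * precision * recall / (precision + recall)

# Hypothetical attribute-level example for Quantity:
# three values extracted, all correct, one expected value missed.
extracted = {"1/2", "2", "1/4"}
gold = {"1/2", "2", "1/4", "1"}
score = f_measure(extracted, gold)
print(round(score, 3))
```

Here precision is 1.0 and recall is 0.75, so the score is 6/7. A record-level score would apply the same function to the attribute values of each record and average over all records.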

Evaluation – Attribute Level - Recipes • High-quality results for all attributes even in Phase 1 • Structural information in Phase 2 led to gains above 5% on average

Evaluation – Attribute Level - CORA • Title and Journal have a large term overlap • Phase 2 was able to correct the mismatches from Phase 1

Evaluation – Attribute Level – Web Ads • Input strings from several websites • Still, F = 0.84 on average • Value range feature was useful for Phone, etc.

Evaluation – Record Level • Phase 1: acceptable (F≈0.7) • Phase 2: positive impact (Gains>9%) • In CORA, gains higher than 19% • Structural information led to significant improvements

Structure Diversity Impact • How does our method deal with a dataset that is heterogeneous in terms of structure? • In CORA, 33 distinct styles were identified L. Barbosa and J. Freire. Using Latent-structure to Detect … In Proc. of the 13th WeDB, pages 1–6, 2010. A. Doan et. al. Information Extraction Challenges in Managing .. SIGMOD Record, 37(4):14–20, 2008. J. Pearl and G. Shafer. Probabilistic reasoning in intelligent systems: Morgan Kaufmann, 1988.

Structure Diversity Impact • Perfect Labeling: all segments are correctly labeled

Comparison with baselines – Attribute Level • Results very close to ONDUX and even better than U-CRF • Recall: JUDIE faces a harder task CORA Web Ads

Knowledge Base Impact Achieves results comparable with the baselines on a considerably harder task JUDIE is more dependent on the KB: the input does not contain structural information # of common terms between the KB and the input

Conclusions • Novel method for extracting semi-structured data records in the form of continuous text • Detects the structure of records being extracted • Integrates information extraction and structure discovery • Achieved good results in comparison with state-of-the-art methods while demanding less user effort • Suitable for fully automatic settings: raw text streaming, crawler output, micro-blogs, etc.

Conclusions • Content-related / Domain-dependent features • Learned from a pre-existing KB on the domain • Used for executing a structure-free labeling step • Structure-related / Source-dependent features • Learned from the structure-free labeling over the input text • Content-related features are used to induce structure-based features through successive refinement steps • Thus, no manual training for each input is required

Future Work • Develop methods for automatically generating knowledge bases • Extend the SD algorithm to deal with nested structures

Acknowledgments UFMG

Thank you!

Summary: JUDIE x Previous IETS

Attribute Vocabulary

Value Range

Value Format • Value Format (Style) • First, a Markov model is generated for each attribute • Then, for each attribute, the probability that the input mask sequence represents a path in that attribute's Markov model is computed [Figure: example Markov model for the value "White sugar", with Start and End states, mask states such as [A-Z][a-z]+, [a-z][a-z]+ and [A-Z]., and transition probabilities 1.0, 0.8 and 0.2]
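A small sketch of the value-format feature, assuming each token is abstracted into a character-class mask and the per-attribute Markov model is estimated by relative frequency over the KB values. The mask set, the function names and the KB values below are hypothetical choices made for illustration.

```python
import re
from collections import Counter, defaultdict

MASK_PATTERNS = (r"[A-Z][a-z]+", r"[a-z]+", r"[0-9]+")  # illustrative mask set

def mask(token):
    """Abstract a token into the first mask it fully matches, else keep it as is."""
    for pattern in MASK_PATTERNS:
        if re.fullmatch(pattern, token):
            return pattern
    return token

def mask_path(value):
    """Mask sequence of a value, wrapped in Start and End states."""
    return ["Start", *(mask(token) for token in value.split()), "End"]

def train_format_model(values):
    """Estimate mask-to-mask transition probabilities from an attribute's KB values."""
    counts = defaultdict(Counter)
    for value in values:
        path = mask_path(value)
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {src: {dst: n / sum(c.values()) for dst, n in c.items()}
            for src, c in counts.items()}

def path_probability(model, value):
    """Probability that the value's mask sequence is a path in the model."""
    probability = 1.0
    path = mask_path(value)
    for src, dst in zip(path, path[1:]):
        probability *= model.get(src, {}).get(dst, 0.0)
    return probability

# Hypothetical KB values for the Ingredient attribute.
ingredient_kb = ["White sugar", "Brown sugar", "Cocoa powder", "Butter", "dark rum"]
model = train_format_model(ingredient_kb)
p_ingredient = path_probability(model, "White sugar")
p_quantity_like = path_probability(model, "1/2")
print(p_ingredient, p_quantity_like)
```

With this toy KB, a capitalized word followed by a lowercase word is a likely Ingredient format, while a token such as "1/2" has no path in the Ingredient model and gets probability zero.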

Positioning and Sequencing Model
