Mining Reference Tables for Automatic Text Segmentation Eugene Agichtein Columbia University Venkatesh Ganti Microsoft Research
Scenarios • Importing unformatted strings into a target structured database • Data warehousing • Data integration • Requires each string to be segmented into the target relation schema • Input strings are prone to errors (e.g., data warehousing, data exchange)
Current Approaches • Rule-based • Hard to develop, maintain, and deploy comprehensive sets of rules for every domain • Supervised • E.g., [BSD01] • Hard to obtain comprehensive datasets needed to train robust models
Our Approach • Exploit large reference tables • Learn domain-specific dictionaries • Learn structure within attribute values • Challenges • Order of attribute concatenation in future test input is unknown • Robustness to errors in test input after training on clean and standardized reference tables
Problem Statement • Target schema: R[A1,…,An] • For a given string s (a sequence of tokens) • segment s into substrings s1,…,sn at token boundaries • map s1,…,sn to Ai1,…,Ain • maximize P(Ai1|s1)*…*P(Ain|sn) among all possible segmentations of s • Product combination function handles arbitrary concatenation order of attribute values • P(Ai|x), the probability that a string x belongs to Ai, is estimated by an Attribute Recognition Model ARMi • ARMs are learned from a reference relation r[A1,…,An]
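A minimal brute-force sketch of this objective in Python, assuming each ARM is exposed as a hypothetical callable returning P(Ai | token sequence); it enumerates split points for a fixed attribute order rather than using the dynamic program the paper applies:

```python
from itertools import combinations

def segment(tokens, arms):
    """Try every way to split `tokens` into len(arms) contiguous
    substrings (fixed attribute order) and keep the segmentation that
    maximizes the product of per-attribute probabilities.
    `arms[i]` is a callable estimating P(A_i | token sequence)."""
    n = len(arms)
    best_score, best_split = 0.0, None
    # Choose n-1 cut points between tokens.
    for cuts in combinations(range(1, len(tokens)), n - 1):
        bounds = (0,) + cuts + (len(tokens),)
        segs = [tokens[bounds[i]:bounds[i + 1]] for i in range(n)]
        score = 1.0
        for seg, arm in zip(segs, arms):
            score *= arm(seg)
        if score > best_score:
            best_score, best_split = score, segs
    return best_split, best_score
```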
ARMs • Design goals • Accurately distinguish an attribute value from other attributes • Generalize to unobserved/new attribute values • Robust to input errors • Able to learn over large reference tables
ARM: Instantiation of HMMs • Purpose: Estimate probabilities of token sequences belonging to attributes • ARM: instantiation of HMMs (sequential models) • Acceptance probability: product of emission and transition probabilities
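A sketch of the acceptance probability, assuming a hypothetical dict-based HMM representation (start_p[state], trans_p[state][next], emit_p[state][token]); it scores the best state path, a product of emission and transition probabilities, Viterbi-style:

```python
def acceptance_probability(tokens, start_p, trans_p, emit_p):
    """Probability of the best state path for a token sequence:
    a product of emission and transition probabilities."""
    # best[s] = probability of the best path emitting the prefix
    # so far and ending in state s
    best = {s: start_p.get(s, 0.0) * emit_p[s].get(tokens[0], 0.0)
            for s in emit_p}
    for tok in tokens[1:]:
        best = {s: max(best[q] * trans_p[q].get(s, 0.0) for q in best)
                   * emit_p[s].get(tok, 0.0)
                for s in emit_p}
    return max(best.values())
```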
Instantiating HMMs • Instantiation has to define • Topology: states & transitions • Emission & transition probabilities • Current automatic approaches search for a topology from among a pre-defined class of topologies using cross validation [FC00, BSD01] • Expensive • The number of states in the ARM is kept small to keep the search space tractable
Intuition behind ARM Design • Street address examples • [nw 57th St], [Redmond Woodinville Rd] • Album names • [The best of eagles], [The fury of aquabats], [Colors Soundtrack] • Large dictionaries (e.g., aquabats, soundtrack, st, …) to exploit • Begin and end tokens are very important for distinguishing values of an attribute (nw, st, the, …) • Can learn patterns on tokens (e.g., 57th generalizes to *th) • Need robustness to input errors • [Best of eagles] for [The best of eagles], [nw 57th] for [nw 57th st]
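One level of such token generalization can be sketched as follows; the function name and the specific patterns are illustrative assumptions, not the paper's exact hierarchy:

```python
import re

def generalize(token):
    """Back off from a raw token to a pattern so unseen tokens
    can still match learned states."""
    m = re.fullmatch(r"\d+([a-z]+)", token.lower())
    if m:
        return "*" + m.group(1)   # 57th -> *th, 101st -> *st
    if token.isdigit():
        return "*digit+"          # 1998 -> *digit+
    return token                  # keep dictionary words as-is
```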
Large Number of States • Associate a state per token: Each state only emits a single base token • More accurate transition probabilities • Model sizes for many large reference tables are still within a few megabytes • Not a problem with current main memory sizes! • Prune the number of states (say, remove low frequency tokens) to limit the ARM size
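A tiny sketch of the frequency-based pruning mentioned above; the threshold is an assumption:

```python
def prune_states(token_counts, min_count=3):
    """Drop per-token states whose token occurs fewer than
    `min_count` times in the reference column, limiting ARM size."""
    return {tok: c for tok, c in token_counts.items() if c >= min_count}
```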
BMT Topology: Relax Positional Specificity • A single state per distinct symbol within a category: the emission probability of a symbol is the same anywhere within a category
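A sketch of how tokens map to the three categories, assuming the natural reading that the first token feeds BEGIN, the last feeds TRAILING, and the rest feed MIDDLE:

```python
def bmt_categories(tokens):
    """Assign each token of an attribute value to a BMT category."""
    if len(tokens) == 1:
        return [("BEGIN", tokens[0])]
    return ([("BEGIN", tokens[0])]
            + [("MIDDLE", t) for t in tokens[1:-1]]
            + [("TRAILING", tokens[-1])])
```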
Robustness Operations: Relax Sequential Specificity • Make ARMs robust to common errors in the input, i.e., maintain high probability of acceptance despite these errors • Common types of errors [HS98] • Token deletions • Token insertions • Missing values • Intuition: Simulate the effects of such erroneous values over each ARM
Robustness Operations Simulating the effect of token insertions: the token and its corresponding transition probabilities are copied from the BEGIN to the MIDDLE state
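A sketch of this operation on the emission side, assuming emit_p is keyed by BMT category and a hypothetical damping factor controls how much mass is copied:

```python
def simulate_insertions(emit_p, damping=0.1):
    """Copy each BEGIN token's emission mass into the MIDDLE category
    (down-weighted by `damping`), so that a token inserted before the
    true begin token, which pushes it into a middle position, does not
    zero out the whole sequence; then renormalize MIDDLE."""
    middle = dict(emit_p["MIDDLE"])
    for tok, p in emit_p["BEGIN"].items():
        middle[tok] = middle.get(tok, 0.0) + damping * p
    total = sum(middle.values())
    emit_p["MIDDLE"] = {t: p / total for t, p in middle.items()}
    return emit_p
```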
Transition Probabilities • Transitions B→M, B→T, M→M, and M→T are allowed • Learned from examples in the reference table • Transition probabilities are also weighted by their ability to distinguish an attribute • A transition "*" → "*" that is common across many attributes gets a low weight
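The slide does not give the weighting formula; one plausible IDF-style instantiation, purely as an assumption, down-weights transitions shared across many attributes:

```python
import math

def transition_weight(transition, attr_transitions):
    """Distinctiveness weight for a transition.
    `attr_transitions` maps attribute -> set of transitions observed
    in its ARM; a transition like ("*", "*") present in most
    attributes receives a low weight."""
    n_attrs = len(attr_transitions)
    n_with = sum(1 for trans in attr_transitions.values()
                 if transition in trans)
    return math.log((1 + n_attrs) / (1 + n_with))
```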
Summary of ARM Instantiation • BMT topology • Token hierarchy to generalize observed patterns • Robustness operations on HMMs to address input errors • One state per token in reference table to exploit large dictionaries
Attribute Order Determination • If attribute order is known • Can use dynamic programming algorithm to segment [Rabiner89] • If attribute order is unknown • Can ask the user to provide attribute order • Can discover attribute order • Naïve expensive strategy: evaluate all concatenation orders and segmentations for each input string • Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples • Several datasets on the web satisfy this assumption • Allows us to efficiently • Determine the attribute order over a batch of tuples • Segment input strings (using dynamic programming)
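A sketch of order discovery under the consistent-order assumption, reusing the hypothetical `segment` sketch shown earlier: score every permutation on a small sample of the batch and keep the best one:

```python
from itertools import permutations

def determine_order(batch, arms, sample_size=50):
    """Pick the attribute order maximizing total best-segmentation
    score over a sample; `batch` is a list of tokenized input strings.
    The chosen order is then reused for the rest of the batch."""
    sample = batch[:sample_size]
    best_order, best_total = None, float("-inf")
    for order in permutations(range(len(arms))):
        ordered_arms = [arms[i] for i in order]
        total = sum(segment(s, ordered_arms)[1] for s in sample)
        if total > best_total:
            best_order, best_total = order, total
    return best_order
```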
Experimental Evaluation • Reference relations from several domains • Addresses: 1,000,000 tuples • [Name, #1, #2, Street Address, City, State, Zip] • Media: 280,000 tuples • [ArtistName, AlbumName, TrackName] • Bibliography: 100,000 tuples • [Title, Author, Journal, Volume, Month, Year] • Compare CRAM (our system) with DataMold [BSD01]
Test Datasets • Naturally erroneous datasets: unformatted input strings seen in operational databases • Media • Customer addresses • Controlled error injection: • Clean reference table tuples → inject errors → concatenate to generate input strings • Evaluate whether a segmentation algorithm recovered the original tuple • Accuracy Measure: % of attribute values correctly recognized
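A sketch of the controlled error-injection step; the error rates and noise vocabulary are hypothetical, but the three error types match those listed earlier (token deletion, token insertion, missing values):

```python
import random

def inject_errors(values, p_delete=0.1, p_insert=0.1, p_missing=0.05,
                  noise=("xx",)):
    """Given one clean tuple's attribute values, randomly drop whole
    values, delete tokens, and insert noise tokens, then concatenate
    into a single test string."""
    out = []
    for value in values:
        if random.random() < p_missing:
            continue                   # missing value
        for tok in value.split():
            if random.random() < p_delete:
                continue               # token deletion
            out.append(tok)
            if random.random() < p_insert:
                out.append(random.choice(noise))  # token insertion
    return " ".join(out)
```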
Overall Accuracy [charts: DBLP, Addresses]
Topology & Robustness Operations [chart: Addresses]
Exploiting Dictionaries [chart: accuracy vs. reference table size]
Conclusions • Reference tables leveraged for segmentation • Combining ARMs based on independence allows segmenting input strings with unknown attribute order • ARM models learned over clean reference relations can accurately segment erroneous input strings • BMT topology • Robustness operations • Exploiting large dictionaries
Model Sizes & Pruning [charts: accuracy, #states & transitions, model size in MB]
Topology [chart: Media]
Specificities of HMM Models • Model “specificity” restricts accepted token sequences • Positional specificity • Number ending in ‘th|st’ can only be the 2nd token in an address value • Token specificity • Last state only accepts “st, rd, wy, blvd” • Sequential specificity • “st, rd, wy, blvd” have to follow a number in ‘st|th’
Robustness Operations [figures: token insertion, token deletion, missing values]