Ad Hoc Data and the Token Ambiguity Problem
Qian Xi*, Kathleen Fisher+, David Walker*, Kenny Zhu*
2009/1/19
*: Princeton University, +: AT&T Labs Research
Ad Hoc Data
• Standardized data formats (HTML, XML) come with many data processing tools: visualizers (HTML browsers), query engines (XQuery), ...
• Ad hoc data is non-standard and semi-structured, and has few data processing tools.
• Examples: web server logs (CLF), phone call provisioning data, ...

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941

9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001
learnPADS Goal
• Automatically generate a description of the format
• Automatically generate a suite of data processing tools

Example records:
"0,24"
"bar,end"
"foo,16"

Inferred declarative description:
Punion payload {
  Pint32 i;
  PstringFW(3) s2;
};
Pstruct source {
  '\"'; payload p1; ","; payload p2; '\"';
};

The description drives generated tools such as an XML converter and a grapher.
learnPADS Architecture
Raw data passes through Chunking & Tokenization into the Format Inference Engine (Structure Discovery followed by Format Refinement), which emits a data description. The PADS Compiler turns that description into tools such as an XML converter and a profiler that produces an XML analysis report.
learnPADS framework
• Chunking & Tokenization: the records "0,24", "bar,end", "foo,bag", "0,56", "cat,name" become token sequences such as "int , int" and "str , str".
• Structure Discovery: infers a candidate structure, a struct of the form
  " union { int; str } , union { int; str } "
• Format Refinement: cleans up the discovered structure into the final description.
Token Ambiguity Problem (TAP)
Given a string, there are multiple ways to tokenize it. For example, the same line might tokenize as:
• Message
• Word White Word White Word White ... White URL
• Word White Quote Filepath Quote White Word White ...

The old learnPADS:
• the user defines a set of base tokens with a fixed priority order
• the tokenizer takes the first, longest match

New solution: probabilistic tokenization
• use probabilistic models to find the most likely token sequences (see the sketch below)
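To make the ambiguity concrete, here is a minimal Python sketch contrasting the old fixed-order, first-longest-match strategy with an enumeration of all candidate token sequences. The base-token names and regular expressions are illustrative stand-ins, not the actual learnPADS token definitions:

```python
import re

# Illustrative base tokens in priority order (not the real learnPADS set).
BASE_TOKENS = [
    ("DATE",     re.compile(r"[A-Za-z]{3}/\d{2}/\d{2}")),
    ("FILEPATH", re.compile(r"[\w.]+(/[\w.]+)+")),
    ("ID",       re.compile(r"[A-Za-z]\w*")),
    ("INT",      re.compile(r"\d+")),
    ("WHITE",    re.compile(r"\s+")),
    ("PUNCT",    re.compile(r"[^\w\s]")),
]

def greedy_tokenize(s):
    """Old learnPADS style: the longest match wins; ties go to the
    token that comes first in the fixed priority order."""
    out, i = [], 0
    while i < len(s):
        best = None
        for name, pat in BASE_TOKENS:
            m = pat.match(s, i)
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no base token matches at position {i}")
        out.append(best[0])
        i = best[1]
    return out

def all_tokenizations(s, i=0):
    """Token Ambiguity Problem: yield every token sequence that covers s.
    (Only the longest match per token is explored at each position;
    the search is still exponential in general.)"""
    if i == len(s):
        yield []
        return
    for name, pat in BASE_TOKENS:
        m = pat.match(s, i)
        if m and m.end() > i:
            for rest in all_tokenizations(s, m.end()):
                yield [name] + rest

print(greedy_tokenize("qian Jan/19/09"))
# ['ID', 'WHITE', 'DATE']
for seq in all_tokenizations("qian Jan/19/09"):
    print(seq)
# ['ID', 'WHITE', 'DATE']
# ['ID', 'WHITE', 'FILEPATH']
# ['ID', 'WHITE', 'ID', 'PUNCT', 'INT', 'PUNCT', 'INT']
```

The greedy tokenizer commits to one answer; the probabilistic tokenizers described below instead weigh all the candidate sequences.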
Probabilistic Graphical Models
• node: random variable
• edge: probabilistic relationship
Example (Bayesian network): earthquake and burglar both influence alarm, and alarm influences whether the parent comes home.
Hidden Markov Model (HMM)
• Observation/character Ci, with character features: upper/lower case, digit, punctuation, ...
• Hidden state/pseudo-token Ti
• Goal: maximize P(token sequence | character sequence)

Example on the input " f o o , 1 6 ":
input characters:  "     f    o    o    ,     1    6    "
pseudo-tokens:     Quote Word Word Word Comma Int  Int  Quote
tokens:            Quote Word Comma Int Quote

• transition probability: P(Ti | Ti-1)
• emission probability: P(Ci | Ti)
Hidden Markov Model Formula
The probability of a token sequence given the character sequence is the probability that token T1 comes first, times the probability that each token Ti follows Ti-1 (the transition probability), times the probability of seeing character Ci given token Ti (the emission probability):

$$P(T_1, \dots, T_n \mid C_1, \dots, C_n) \;\propto\; P(T_1) \prod_{i=2}^{n} P(T_i \mid T_{i-1}) \prod_{i=1}^{n} P(C_i \mid T_i)$$
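The most likely pseudo-token sequence under this model can be found with the standard Viterbi algorithm. The following is a minimal sketch with toy hand-set probability tables (the states, probabilities, and character features are illustrative assumptions, not parameters learned in the paper):

```python
import math

def viterbi(chars, states, start_p, trans_p, emit_p):
    """Most likely pseudo-token sequence for a character sequence under an HMM.

    start_p[t]    = P(T_1 = t)
    trans_p[s][t] = P(T_i = t | T_{i-1} = s)
    emit_p[t](c)  = P(C_i = c | T_i = t), given here as a function per state
    Log probabilities avoid numeric underflow on long inputs.
    """
    # best[t] = (log prob of best path ending in state t, that path)
    best = {t: (math.log(start_p[t]) + math.log(emit_p[t](chars[0])), [t])
            for t in states if start_p[t] > 0 and emit_p[t](chars[0]) > 0}
    for c in chars[1:]:
        nxt = {}
        for t in states:
            e = emit_p[t](c)
            if e == 0:
                continue
            cands = [(lp + math.log(trans_p[s][t]) + math.log(e), path + [t])
                     for s, (lp, path) in best.items() if trans_p[s][t] > 0]
            if cands:
                nxt[t] = max(cands)
        best = nxt
    return max(best.values())[1]

# Toy model: three pseudo-tokens over the string "foo,16".
states = ["Word", "Comma", "Int"]
start_p = {"Word": 0.6, "Comma": 0.1, "Int": 0.3}
trans_p = {"Word":  {"Word": 0.7, "Comma": 0.2, "Int": 0.1},
           "Comma": {"Word": 0.4, "Comma": 0.0, "Int": 0.6},
           "Int":   {"Word": 0.1, "Comma": 0.2, "Int": 0.7}}
emit_p = {"Word":  lambda c: 0.9 if c.isalpha() else 0.05,
          "Comma": lambda c: 1.0 if c == "," else 0.0,
          "Int":   lambda c: 0.9 if c.isdigit() else 0.05}

print(viterbi("foo,16", states, start_p, trans_p, emit_p))
# ['Word', 'Word', 'Word', 'Comma', 'Int', 'Int']
```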
Hidden Markov Model Parameters
The model is specified by two tables: the transition probabilities P(Ti | Ti-1) and the emission probabilities P(Ci | Ti).
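The slide's estimation formulas did not survive the export; a standard maximum-likelihood estimate from labeled training data (an assumption about how such parameters are typically obtained, not a statement from the slide) would be:

$$\hat{P}(T_i = t \mid T_{i-1} = s) = \frac{\#(s \to t)}{\#(s)}, \qquad \hat{P}(C_i = c \mid T_i = t) = \frac{\#(t \text{ emits } c)}{\#(t)}$$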
Hierarchical Models
Figure: the input " foo , 16 " is segmented at the top level into the tokens Quote Word Comma Int Quote; at the lower level, the characters inside each token are modeled by per-token classifiers such as Maximum Entropy models or Support Vector Machines.
Three Probabilistic Tokenizers
• Character-by-character Hidden Markov Model (HMM)
  • Each pseudo-token depends only on the previous one.
• Hierarchical Maximum Entropy Model (HMEM)
  • The upper level models the transition probabilities.
  • The lower level builds a Maximum Entropy model for each individual token.
• Hierarchical Support Vector Machines (HSVM)
  • Same as HMEM, except that the lower level builds a Support Vector Machine model for each individual token.
A sketch of the hierarchical scoring idea follows this list.
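A minimal sketch of the two-level scoring used by the hierarchical models, with hand-written per-token scorers standing in for trained Maximum Entropy or SVM classifiers (all names, tables, and probabilities are illustrative assumptions):

```python
import math

# Lower level: per-token models returning P(characters | token).
# HMEM/HSVM would train a Maximum Entropy or SVM classifier per token;
# these hand-written scorers are stand-ins for illustration.
def p_word(s):  return 0.9 if s.isalpha() else 0.01
def p_int(s):   return 0.9 if s.isdigit() else 0.01
def p_comma(s): return 1.0 if s == "," else 0.0

TOKEN_MODELS = {"Word": p_word, "Int": p_int, "Comma": p_comma}

# Upper level: token-to-token transition probabilities (None = start).
TRANS = {None:    {"Word": 0.5, "Int": 0.4, "Comma": 0.1},
         "Word":  {"Word": 0.1, "Int": 0.1, "Comma": 0.8},
         "Int":   {"Word": 0.1, "Int": 0.1, "Comma": 0.8},
         "Comma": {"Word": 0.5, "Int": 0.5, "Comma": 0.0}}

def score(segmentation):
    """Log probability of a candidate (token, substring) segmentation:
    upper-level transitions times lower-level per-token character models."""
    logp, prev = 0.0, None
    for token, substr in segmentation:
        p = TRANS[prev][token] * TOKEN_MODELS[token](substr)
        if p == 0:
            return float("-inf")
        logp += math.log(p)
        prev = token
    return logp

# "foo,16": the correct segmentation outscores a mislabeled one.
good = [("Word", "foo"), ("Comma", ","), ("Int", "16")]
bad  = [("Word", "foo"), ("Comma", ","), ("Word", "16")]
print(score(good) > score(bad))  # True
```

Under this scoring, the model prefers the segmentation whose per-token character evidence and token-to-token transitions are jointly most probable.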
Tokenization by the old learnPADS, HMM and HMEM

Input record:
Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port

Old learnPADS:
date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]

HMM:
word[Sat] white[ ] word[Jun] white[ ] int[24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation[:] message[(ipc/send) invalid destination port]

HMEM:
date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation[:] message[(ipc/send) invalid destination port]
Test Data Sources
(The slide lists the 20 data sources used in the evaluation.)
Evaluation 1 – Tokenization Accuracy
• Token error rate = % of misidentified tokens
• Token boundary error rate = % of misidentified token boundaries

Example:
input string: qian Jan/19/09
ideal token sequence: id white date
inferred token sequence: id white filepath
token error rate = 1/3; token boundary error rate = 0/3
(the filepath token has the wrong type, but its boundaries match the ideal date token)
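Both metrics are easy to compute once tokens carry their spans. A small helper, assuming each token is represented as a (type, start, end) triple (a representation chosen for this sketch, not taken from the paper):

```python
def error_rates(ideal, inferred):
    """Token error rate and token boundary error rate.

    Each token is a (type, start, end) triple over the input string.
    A token is boundary-correct if some ideal token has the same span,
    and fully correct if the type matches as well.
    """
    spans = {(s, e) for _, s, e in ideal}
    exact = set(ideal)
    n = len(inferred)
    token_errors = sum(1 for tok in inferred if tok not in exact)
    boundary_errors = sum(1 for _, s, e in inferred if (s, e) not in spans)
    return token_errors / n, boundary_errors / n

# "qian Jan/19/09": ideal = id white date, inferred = id white filepath
ideal    = [("id", 0, 4), ("white", 4, 5), ("date", 5, 14)]
inferred = [("id", 0, 4), ("white", 4, 5), ("filepath", 5, 14)]
print(error_rates(ideal, inferred))  # (0.333..., 0.0)
```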
Evaluation 1 – Tokenization Accuracy (results)
Results chart. PT: probabilistic tokenization; number of testing data sources: 20.
Evaluation 2 – Type and Data Costs
Results chart. PT: probabilistic tokenization; number of testing data sources: 20.
• type cost: cost in bits of transmitting the description
• data cost: cost in bits of transmitting the data given the description
Evaluation 3 – Execution Time
• The old learnPADS system takes from 10 seconds to 25 minutes.
• The new system using probabilistic tokenization takes from a few seconds to several hours:
  • extra time to find all possible token sequences
  • extra time to find the most likely token sequences
• Fastest: Hidden Markov Model
• Most time-consuming: Hierarchical Support Vector Machines
Related Work
• Grammar induction & structure discovery without the token ambiguity problem:
  • Arasu & Garcia-Molina '03: extracting structure from web pages
  • Garofalakis et al. '00: XTRACT for inferring DTDs
  • Kushmerick et al. '97: wrapper induction
• Detecting table rows with Hidden Markov Models & Conditional Random Fields: Pinto et al. '03
• Extracting fields from text records: Borkar et al. '01
• Predicting exons and introns in DNA sequences with a generalized HMM: Kulp '96
• Part-of-speech tagging in natural language processing: Heeman '99 (decision trees)
• Speech recognition: Rabiner '89
Contributions
• Identified the Token Ambiguity Problem and took initial steps toward solving it with statistical models:
  • use all possible token sequences
  • integrate three statistical approaches into the learnPADS framework: the Hidden Markov Model, the Hierarchical Maximum Entropy Model, and the Hierarchical Support Vector Machines model
• Evaluated correctness and performance with a number of measures
• The results show that multiple token sequences and statistical methods achieve partial success.
Future Work
• Make use of "vertical" information:
  • one record is not independent of the others
  • key: alignment
  • Conditional Random Fields
• Online learning: old description + new data → new description
Evaluation 3 – Qualitative Comparison
Descriptions are rated on a scale from -2 to 2, with 0 optimal: at one end of the scale the description is too general and loses much useful information; at the other it is too verbose and its structure is unclear.