
Ad Hoc Data and the Token Ambiguity Problem

Qian Xi*, Kathleen Fisher+, David Walker*, Kenny Zhu*. 2009/1/19. *: Princeton University, +: AT&T Labs Research.



  1. Ad Hoc Data and the Token Ambiguity Problem. Qian Xi*, Kathleen Fisher+, David Walker*, Kenny Zhu*. 2009/1/19. *: Princeton University, +: AT&T Labs Research

  2. Ad Hoc Data
     • Standardized data formats: HTML, XML
       • Data processing tools: visualizers (HTML browsers), XQuery
     • Ad hoc data formats: non-standard, semi-structured
       • Not many data processing tools
       • Examples: web server log (CLF), phone call provisioning data…

     Web server log (CLF):
       207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
       244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941

     Phone call provisioning data:
       9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
       9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001

  3. learnPADS Goal
     • Automatically generates a description of the format
     • Automatically generates a suite of data processing tools

     Raw data:
       "0,24"
       "bar,end"
       "foo,16"

     Declarative Description:
       Punion payload {
         Pint32 i;
         PstringFW(3) s2;
       };
       Pstruct source {
         '\"';
         payload p1;
         ",";
         payload p2;
         '\"';
       }

     Generated tools: XML converter, Grapher, etc.
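To make the description on this slide concrete, here is a minimal Python sketch of a parser that accepts the same three records. It is not code generated by PADS. The function names (`parse_payload`, `parse_source`, `expect`) are hypothetical, and the sketch assumes that the union tries its branches in the order written (integer first, then a fixed-width string of 3 characters).

```python
import re
from typing import Tuple, Union

Payload = Union[int, str]
INT_PATTERN = re.compile(r"[+-]?\d+")


def parse_payload(text: str, pos: int) -> Tuple[Payload, int]:
    """Mimic `Punion payload`: try a 32-bit integer, then a 3-character string."""
    m = INT_PATTERN.match(text, pos)
    if m and -2**31 <= int(m.group()) < 2**31:
        return int(m.group()), m.end()
    if pos + 3 <= len(text):
        return text[pos:pos + 3], pos + 3
    raise ValueError(f"no payload branch matches at offset {pos}")


def expect(text: str, pos: int, literal: str) -> int:
    """Consume a literal such as the quote or the comma, or fail."""
    if not text.startswith(literal, pos):
        raise ValueError(f"expected {literal!r} at offset {pos}")
    return pos + len(literal)


def parse_source(record: str) -> Tuple[Payload, Payload]:
    """Mimic `Pstruct source`: quote, payload, comma, payload, quote."""
    pos = expect(record, 0, '"')
    p1, pos = parse_payload(record, pos)
    pos = expect(record, pos, ",")
    p2, pos = parse_payload(record, pos)
    pos = expect(record, pos, '"')
    if pos != len(record):
        raise ValueError("trailing characters after the closing quote")
    return p1, p2


for rec in ['"0,24"', '"bar,end"', '"foo,16"']:
    print(rec, "->", parse_source(rec))
```

The point of learnPADS is that neither the description nor a parser like this has to be written by hand; both are derived from the raw data.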

  4. learnPADS Architecture
     [Diagram: Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (Profiler → Analysis Report; XML converter → XML)]

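  5. learnPADS framework
     • Chunking & Tokenization: each record is turned into a token sequence
         "0,24"      →  " int , int "
         "bar,end"   →  " str , str "
         "foo,bag"   →  " str , str "
         "0,56"      →  " int , int "
         "cat,name"  →  " str , str "
     • Structure Discovery
     • Format Refinement
     [Diagram: two description trees. One is struct( ", union( struct(0 , INT), struct(STR , STR) ), " ); the other is struct( ", union(INT, STR), ",", union(INT, STR), " ).]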
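The tokenization step on this slide can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the learnPADS algorithm: the base token set and the helper names (`BASE_TOKENS`, `tokenize`, `group_by_signature`) are assumptions made for illustration. It tokenizes each record and groups records that share a token sequence, which is the kind of regularity that structure discovery can then turn into a struct or a union.

```python
import re
from collections import defaultdict

# Hypothetical base tokens, tried in this order at every position.
BASE_TOKENS = [
    ("int", re.compile(r"\d+")),
    ("str", re.compile(r"[A-Za-z]+")),
    ("quote", re.compile(r'"')),
    ("comma", re.compile(r",")),
]


def tokenize(record):
    """Return the record as a list of (token name, matched text) pairs."""
    tokens, pos = [], 0
    while pos < len(record):
        for name, pattern in BASE_TOKENS:
            m = pattern.match(record, pos)
            if m:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at offset {pos} in {record!r}")
    return tokens


def group_by_signature(records):
    """Group records by their sequence of token names."""
    groups = defaultdict(list)
    for record in records:
        signature = tuple(name for name, _ in tokenize(record))
        groups[signature].append(record)
    return dict(groups)


records = ['"0,24"', '"bar,end"', '"foo,bag"', '"0,56"', '"cat,name"']
for signature, members in group_by_signature(records).items():
    print(" ".join(signature), "<-", members)
```

The five records fall into two groups (int, int and str, str), matching the two token sequences shown on the slide.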

  6. Token Ambiguity Problem (TAP)
     Given a string, there are multiple ways to tokenize it, for example:
     • Message
     • Word White Word White Word White... White URL
     • Word White Quote Filepath Quote White Word White...
     Old learnPADS:
     • The user defines a set of base tokens in a fixed order
     • Take the first, longest match
     New solution: probabilistic tokenization
     • Use probabilistic models to find the most likely token sequences
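A small Python sketch can show the ambiguity directly. The token definitions below are invented for illustration and are not the learnPADS base tokens. "First, longest match" is read here as: take the longest match at each position, breaking ties by the order in which the tokens are listed. For simplicity each token contributes only its greedy match at a position, so the enumeration is a lower bound on the real number of tokenizations.

```python
import re

# Hypothetical base tokens in a fixed, user-chosen order.
BASE_TOKENS = [
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")),
    ("filepath", re.compile(r"[\w.]+(?:/[\w.]+)+")),
    ("int", re.compile(r"\d+")),
    ("word", re.compile(r"[A-Za-z]+")),
    ("white", re.compile(r"\s+")),
    ("punct", re.compile(r"[^\w\s]")),
]


def matches_at(text, pos):
    """Return (name, end) for every base token whose greedy match starts at pos."""
    found = []
    for name, pattern in BASE_TOKENS:
        m = pattern.match(text, pos)
        if m and m.end() > pos:
            found.append((name, m.end()))
    return found


def first_longest(text):
    """Deterministic tokenizer: longest match wins, ties go to the earlier token."""
    pos, names = 0, []
    while pos < len(text):
        candidates = matches_at(text, pos)
        if not candidates:
            raise ValueError(f"no token matches at offset {pos}")
        # max() returns the first maximal element, which gives the tie-break.
        name, end = max(candidates, key=lambda c: c[1])
        names.append(name)
        pos = end
    return names


def all_tokenizations(text, pos=0):
    """Enumerate every token-name sequence that covers text[pos:]."""
    if pos == len(text):
        return [[]]
    results = []
    for name, end in matches_at(text, pos):
        for rest in all_tokenizations(text, end):
            results.append([name] + rest)
    return results


print(first_longest("1/19/09"))
for sequence in all_tokenizations("1/19/09"):
    print(sequence)
```

With these definitions the string `1/19/09` is a date, a file path, or a run of integers and punctuation. The deterministic rule commits to one reading based only on token order, whereas a probabilistic tokenizer scores the alternatives.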

  7. Probabilistic Graphical Models
     • Node: random variable
     • Edge: probabilistic relationship
     [Diagram: example network with the nodes "earthquake", "burglar", "alarm", and "parent comes home"]

  8. Hidden Markov Model (HMM)
     • Observation / character Ci
       • Character features: upper/lower case, digit, punctuation...
     • Hidden state / pseudo-token Ti
     • Maximize the probability P(token sequence | character sequence)
     • Transition probability: P(Ti|Ti-1)
     • Emission probability: P(Ci|Ti)

     Example:
       input characters:  " f o o , 1 6 "
       pseudo-tokens:     Quote Word Word Word Comma Int Int Quote
       tokens:            Quote Word Comma Int Quote
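The most likely pseudo-token sequence in a model of this shape is usually found with the Viterbi algorithm; the slides do not say which decoder learnPADS uses, so treat this as an illustrative sketch. All probabilities below are made-up toy values, and the names (`feature`, `viterbi`, `collapse`) are hypothetical. Adjacent identical pseudo-tokens are merged into one token, which is a simplification.

```python
import math
from itertools import groupby

STATES = ["Quote", "Word", "Comma", "Int"]
FEATURES = ["quote", "comma", "digit", "letter", "other"]


def feature(ch):
    """Map a character to a coarse character feature (the observation)."""
    if ch == '"':
        return "quote"
    if ch == ",":
        return "comma"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def emit_row(main):
    """Toy emission distribution: one dominant feature, small mass elsewhere."""
    return {f: (0.96 if f == main else 0.01) for f in FEATURES}


# Toy parameters (assumptions, not learned values).
START = {s: 0.25 for s in STATES}
TRANS = {s: {t: (0.7 if s == t else 0.1) for t in STATES} for s in STATES}
EMIT = {
    "Quote": emit_row("quote"),
    "Word": emit_row("letter"),
    "Comma": emit_row("comma"),
    "Int": emit_row("digit"),
}


def viterbi(text):
    """Return the most likely pseudo-token for each character of text."""
    if not text:
        return []
    obs = [feature(c) for c in text]
    # best[s] = (log-probability, path) of the best path that ends in state s
    best = {s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), [s]) for s in STATES}
    for o in obs[1:]:
        new_best = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p][0] + math.log(TRANS[p][s]))
            score = best[prev][0] + math.log(TRANS[prev][s]) + math.log(EMIT[s][o])
            new_best[s] = (score, best[prev][1] + [s])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]


def collapse(pseudo_tokens):
    """Merge runs of identical pseudo-tokens into tokens."""
    return [name for name, _ in groupby(pseudo_tokens)]


path = viterbi('"foo,16"')
print(path)
print(collapse(path))
```

On the slide's input `"foo,16"` these toy parameters reproduce the pseudo-token and token rows shown above.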

  9. Hidden Markov Model Formula
     The probability of a token sequence given a character sequence
       = the probability that token T1 comes first
       * the probability that token Ti follows Ti-1, for all i (transition probability)
       * the probability that we see character Ci given token Ti, for all i (emission probability)
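Written in symbols, with n characters and the notation of the previous slide, the product described in words above reads as follows. This is a reconstruction from the slide's wording, not a copy of the original formula. Strictly speaking the product is the joint probability of tokens and characters; for a fixed character sequence it is proportional to the conditional probability, so maximizing either selects the same token sequence.

```latex
P(T_1,\dots,T_n \mid C_1,\dots,C_n)
  \;\propto\;
  P(T_1)\,\prod_{i=2}^{n} P(T_i \mid T_{i-1})\,\prod_{i=1}^{n} P(C_i \mid T_i)
```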

  10. Hidden Markov Model Parameters
      [Formulas for the transition probability and the emission probability]
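The formulas on this slide are not part of the transcript. A common way to obtain such parameters, shown here purely as an assumption about what the slide may have contained, is maximum-likelihood estimation by counting over hand-labelled training data. The function name `estimate` and the data layout are hypothetical.

```python
from collections import Counter, defaultdict


def estimate(labelled):
    """Estimate HMM parameters from labelled sequences by relative frequency.

    `labelled` is a list of sequences; each sequence is a list of
    (character_feature, pseudo_token) pairs.
    Returns (transition, emission) as nested dictionaries of probabilities.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in labelled:
        for i, (feat, tok) in enumerate(seq):
            emit_counts[tok][feat] += 1
            if i > 0:
                previous_tok = seq[i - 1][1]
                trans_counts[previous_tok][tok] += 1

    def normalise(table):
        # Divide each count by the total count of its row.
        return {
            key: {x: n / sum(row.values()) for x, n in row.items()}
            for key, row in table.items()
        }

    return normalise(trans_counts), normalise(emit_counts)


training = [[("quote", "Quote"), ("letter", "Word"), ("letter", "Word"), ("quote", "Quote")]]
transition, emission = estimate(training)
print(transition)
print(emission)
```

A practical estimator would also smooth the counts so that unseen transitions and features do not get probability zero; that step is omitted here.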

  11. Hierarchical Models
      tokens:  Quote Word Comma Int Quote
      input:   " foo , 16 "
      • Maximum Entropy
      • Support Vector Machines

  12. Three Probabilistic Tokenizers
      • Character-by-character Hidden Markov Model (HMM)
        • One pseudo-token depends only on the previous one.
      • Hierarchical Maximum Entropy Model (HMEM)
        • The upper level models the transition probabilities.
        • The lower level constructs Maximum Entropy models for individual tokens.
      • Hierarchical Support Vector Machines (HSVM)
        • Same as HMEM, except that the lower level constructs Support Vector Machine models for individual tokens.

  13. Tokenization by the old learnPADS, HMM and HMEM
      Input:
        Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port

      Old learnPADS:
        date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] string[crashreporterd] char[[] int[120] char[]] char[:] white[ ] string[mach_msg] char[(] char[)] white[ ] string[reply] white[ ] string[failed] char[:] white[ ] char[(] string[ipc] char[/] string[send] char[)] white[ ] string[invalid] white[ ] string[destination] white[ ] string[port]

      HMM:
        word[Sat] white[ ] word[Jun] white[ ] int[24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation:[[[] int[120] punctuation:][]] punctuation::[:] message[mach_msg() reply failed] punctuation::[:] message[(ipc/send) invalid destination port]

      HMEM:
        date[Sat Jun 24] white[ ] time[06:38:46] white[ ] int[2006] white[ ] word[crashreporterd] punctuation:[[[] int[120] punctuation:][]] punctuation::[:] message[mach_msg() reply failed] punctuation::[:] message[(ipc/send) invalid destination port]

  14. Test Data Sources
      [Table of test data sources]

  15. Evaluation 1 – Tokenization Accuracy
      • Token error rate = % misidentified tokens
      • Token boundary error rate = % misidentified token boundaries

      Example:
        input string:             qian Jan/19/09
        ideal token sequence:     id white date
        inferred token sequence:  id white filepath
        token error rate = 1/3
        token boundary error rate = 0/3
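The two metrics can be sketched as follows. The slide defines them only informally, so the exact counting rule here is an assumption: a token is a (name, start, end) triple, a token is misidentified if no inferred token has the same name and span, and a boundary is misidentified if no inferred token has the same span. The function names are hypothetical.

```python
from fractions import Fraction


def token_error_rate(ideal, inferred):
    """Fraction of ideal tokens with no identical (name, start, end) in inferred."""
    inferred_set = set(inferred)
    wrong = sum(1 for tok in ideal if tok not in inferred_set)
    return Fraction(wrong, len(ideal))


def boundary_error_rate(ideal, inferred):
    """Fraction of ideal tokens whose (start, end) span is absent from inferred."""
    spans = {(start, end) for _, start, end in inferred}
    wrong = sum(1 for _, start, end in ideal if (start, end) not in spans)
    return Fraction(wrong, len(ideal))


# The slide's example: "qian Jan/19/09"
ideal = [("id", 0, 4), ("white", 4, 5), ("date", 5, 14)]
inferred = [("id", 0, 4), ("white", 4, 5), ("filepath", 5, 14)]

print(token_error_rate(ideal, inferred))     # 1/3
print(boundary_error_rate(ideal, inferred))  # 0
```

The example separates the two failure modes: the inferred sequence cuts the string in the right places but labels the last piece as a file path instead of a date.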

  16. Evaluation 1 – Tokenization Accuracy
      [Results chart. PT: probabilistic tokenization. Number of testing data sources: 20]

  17. Evaluation 2 – Type and Data Costs
      • Type cost: cost in bits of transmitting the description
      • Data cost: cost in bits of transmitting the data given the description
      [Results chart. PT: probabilistic tokenization. Number of testing data sources: 20]

  18. Evaluation 3 – Execution Time
      • The old learnPADS system takes 10 secs to 25 mins.
      • The new system using probabilistic tokenization approaches takes a few seconds to several hours.
        • Requires extra time to find all possible token sequences
        • Requires extra time to find the most likely token sequences
        • Fastest: Hidden Markov Model
        • Most time-consuming: Hierarchical Support Vector Machines

  19. Related Work
      • Grammar induction & structure discovery without the token ambiguity problem
        • Arasu & Garcia-Molina '03: "extracting structure from web pages"
        • Garofalakis et al. '00: "XTRACT for inferring DTDs"
        • Kushmerick et al. '97: "wrapper induction"
      • Detect row table components by Hidden Markov Model & Conditional Random Fields: Pinto et al. '03
      • Extract certain fields in records from text: Borkar et al. '01
      • Predict exons and introns in DNA sequences using generalized HMM: Kulp '96
      • Part-of-speech tagging in natural language processing:
        • Heeman '99 (Decision Tree)
      • Speech recognition: Rabiner '89

  20. Contributions
      • Identify the Token Ambiguity Problem and take initial steps towards solving it by statistical models
        • Use all possible token sequences.
        • Integrate 3 statistical approaches into the learnPADS framework.
          • Hidden Markov Model
          • Hierarchical Maximum Entropy Model
          • Hierarchical Support Vector Machines Model
      • Evaluate correctness and performance by a number of measures
      • Results have shown that multiple token sequences and statistical methods achieve partial success.

  21. End

  22. Future Work
      • How to make use of "vertical" information
        • One record is not independent of others
        • Key: alignment
      • Conditional Random Fields
      • Online learning:
        • old description + new data → new description

  23. Evaluation 3 – Qualitative Comparison
      [Scale: -2, -1, 0, 1, 2, where 0 is optimal. One end of the scale: "The description is too general and it loses much useful information." The other end: "The description is too verbose and the structure is unclear."]
