From Dirt to Shovels: Fully Automatic Tool Generation from ASCII Data

David Walker, Pamela Dragosh, Mary Fernandez, Kathleen Fisher, Andrew Forrest, Bob Gruber, Yitzhak Mandelbaum, Peter White, Kenny Q. Zhu
www.padsproj.org

Data, data, everywhere
• AT&T and other information technology companies spend huge amounts of time and energy processing “ad hoc data”
• Ad hoc data = data in non-standard formats with no a priori data processing tools/libraries available
  • not free text; not HTML; not XML
• Common problems: no documentation, evolving formats, huge volume, error-filled ...
Router Configs, Network Monitoring, Web Logs, Billing Info, Call Details

Data, data, everywhere
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
(web server common log format)

Data, data, everywhere
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982
(AT&T phone call provisioning data)

Data, data, everywhere
HA00000000START OF TEST CYCLE
aA00000001BXYZ U1AB0000040000100B0000004200
HE00000005START OF SUMMARY
f 00000006NYZX B1QB00052000120000070000B000050000000520000 00490000005100+00000100B00000005300000052500000535000
HF00000007END OF SUMMARY
k00000008LYXW B1KB0000065G0000009900100000001000020000
HB00000009END OF TEST CYCLE
(www.opradata.com)

Data, data, everywhere
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into daughter cells after mitosis or meiosis\, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
(www.geneontology.org)

Goal
[Diagram labels: Raw Data (Billing Info, ASCII log files, Call Detail); Standard formats & schema (CSV, XML); End-user tools; Visual Information]
We want to create this arrow

Half-way there: The PADS System 1.0 [FG pldi 05, FMW popl 06, MFWFG popl 07]
[Architecture diagram labels: “Ad Hoc” Data Source; PADS Data Description; PADS Compiler; Generated Libraries (Parsing, Printing, Traversal); PADS Runtime System (I/O, Error Handling); generic description-directed programs, coded once: XML Converter, Data Profiler, Graphing Tool, Query Engine; Custom App; outputs: XML, Analysis Report, Graph, Information, ?]

PADS Language Overview
• Rich base type library:
  • integers: Pint8, Puint32, …
  • strings: Pstring('|'), Pstring_FW(3), ...
  • systems data: Pdate, Ptime, Pip, …
• Type constructors describe complex data sources:
  • sequences: Pstruct, Parray
  • choices: Punion, Penum, Pswitch
  • constraints: arbitrary predicates describe expected semantic properties
  • parameterization: allows definition of generic descriptions
Data formats are described using a specialized language of types.
A formal semantics gives meaning to descriptions in terms of both the external format and the internal data structures generated.

The Last Mile: The PADS System 2.0
[Pipeline diagram labels: Raw Data; Format Inference Engine (Chunking & Tokenization, Structure Discovery, Format Refinement, Scoring Function); PADS Data Description; PADS Compiler; XMLifier; Profiler; outputs: XML, Analysis Report]

Chunking Process
• Convert raw input into a sequence of “chunks.”
• Supported divisions:
  • Various forms of “newline”
  • File boundaries
• Also possible: user-defined “paragraphs”
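A minimal sketch of the chunking step, assuming chunks are plain strings. The function names, the choice to drop empty lines, and the blank-line definition of a "paragraph" are illustrative assumptions, not the PADS implementation.

```python
from pathlib import Path


def chunk_by_newline(raw: str) -> list[str]:
    # str.splitlines() accepts "\n", "\r\n", "\r" and a few other line
    # boundaries. Dropping empty lines is an assumption made here.
    return [line for line in raw.splitlines() if line]


def chunk_by_file(paths: list[str]) -> list[str]:
    # One chunk per file: the whole file content is a single chunk.
    return [Path(p).read_text(errors="replace") for p in paths]


def chunk_by_paragraph(raw: str, separator: str = "\n\n") -> list[str]:
    # Hypothetical user-defined "paragraph": text between blank lines.
    return [part.strip("\n") for part in raw.split(separator) if part.strip()]
```

Whatever division is chosen, every later phase only sees the resulting list of chunks, so the chunker can be swapped without touching the rest of the pipeline.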

Tokenization
• Tokens/base types expressed as regular expressions.
• Basic tokens
  • Integer, white space, punctuation, strings
• Distinctive tokens
  • IP addresses, dates, times, MAC addresses, ...
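A small sketch of regular-expression tokenization with one distinctive token (IP address) and the basic tokens. The token names and patterns below are illustrative assumptions, not the token definitions shipped with PADS; the distinctive token is listed first so that it wins over the integer and punctuation tokens it overlaps with.

```python
import re

# Ordered (name, pattern) pairs: earlier entries win when several match.
TOKEN_SPECS = [
    ("IP", r"\d{1,3}(?:\.\d{1,3}){3}"),   # distinctive token
    ("INT", r"-?\d+"),                    # basic tokens follow
    ("WHITE", r"[ \t]+"),
    ("STRING", r"[A-Za-z][A-Za-z0-9_]*"),
    ("PUNCT", r"[^\sA-Za-z0-9]"),
]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))


def tokenize(chunk: str) -> list[tuple[str, str]]:
    """Return (token name, matched text) pairs for one chunk."""
    return [(m.lastgroup, m.group()) for m in LEXER.finditer(chunk)]


tokens = tokenize("207.136.97.49 - 200")
# [('IP', '207.136.97.49'), ('WHITE', ' '), ('PUNCT', '-'),
#  ('WHITE', ' '), ('INT', '200')]
```

Without the IP entry, the same address would come out as four INT tokens separated by PUNCT tokens, which is why distinctive tokens matter for the later phases.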

Histograms

Clustering
Group tokens with similar frequency distributions into clusters.
[Figure labels: Cluster 1, Cluster 2, Cluster 3]
Two frequency distributions are similar if they have the same shape (within some error tolerance) when the columns are sorted by height.
Rank clusters by a metric that rewards high coverage and narrower distributions.
Choose the cluster with the highest score.
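The histogram and clustering steps can be sketched as follows. This is a simplified stand-in: the similarity test (largest per-column difference after sorting), the greedy grouping, and the `score` formula are assumptions chosen to match the slide's wording, not the metric used by the actual system. Chunks are given as lists of token names.

```python
from collections import Counter


def histogram(token, chunks):
    """Map k -> number of chunks in which `token` occurs exactly k times."""
    return Counter(chunk.count(token) for chunk in chunks)


def shape(hist, total):
    """Column heights as fractions of all chunks, tallest first."""
    return sorted((n / total for n in hist.values()), reverse=True)


def similar(h1, h2, total, tol=0.1):
    """Same shape within `tol` once columns are sorted by height."""
    a, b = shape(h1, total), shape(h2, total)
    width = max(len(a), len(b))
    a += [0.0] * (width - len(a))
    b += [0.0] * (width - len(b))
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def cluster_tokens(chunks, tol=0.1):
    """Greedily group tokens whose histograms have similar shapes."""
    total = len(chunks)
    names = sorted({t for chunk in chunks for t in chunk})
    hists = {t: histogram(t, chunks) for t in names}
    clusters = []
    for t in names:
        for cluster in clusters:
            if similar(hists[t], hists[cluster[0]], total, tol):
                cluster.append(t)
                break
        else:
            clusters.append([t])
    return clusters, hists


def score(cluster, hists, total):
    """Hypothetical score: high coverage and few columns score higher."""
    coverage = min(1 - hists[t][0] / total for t in cluster)
    width = max(len(hists[t]) for t in cluster)
    return coverage / width


chunks = [
    ["Quote", "Int", "Comma", "White", "Int", "Quote"],
    ["Quote", "Int", "Comma", "White", "String", "Quote"],
    ["Quote", "Int", "Comma", "White", "String", "Quote"],
    ["Quote", "Int", "Comma", "White", "Int", "Quote"],
]
clusters, hists = cluster_tokens(chunks)
best = max(clusters, key=lambda c: score(c, hists, len(chunks)))
# clusters == [['Comma', 'Quote', 'White'], ['Int', 'String']]
# best == ['Comma', 'Quote', 'White']
```

Quote, Comma, and White each have a single-column histogram (they occur a fixed number of times in every chunk), so they land in one cluster and score highest.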

Partition chunks
In our example, all the tokens appear in the same order in all chunks, so the union is degenerate.

Find subcontexts
Tokens in selected cluster: Quote(2), Comma, White

Then Recurse...

Inferred type

Structure Discovery Review
• Compute frequency distribution for each token.
• Cluster tokens with similar frequency distributions.
• Create hypothesis about data structure from cluster distributions
  • Struct
  • Array
  • Union
  • Basic type (bottom out)
• Partition data according to hypothesis & recurse
• Once structure discovery is complete, later phases massage & rewrite candidate description to create final form
Example data:
"123, 24"
"345, begin"
"574, end"
"9378, 56"
"12, middle"
"-12, problem"
…
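The recursion reviewed above can be sketched in a few lines. This is a deliberately reduced version under stated assumptions: the struct hypothesis is taken only when some tokens occur the same number of times and in the same order in every chunk, the union hypothesis simply groups chunks by their exact token sequence, and the array hypothesis is omitted. The function name `discover` and the tuple-based result format are hypothetical.

```python
def discover(chunks):
    """Infer a nested description from chunks given as lists of token names."""
    if all(not chunk for chunk in chunks):
        return "Empty"
    # Basic type: every chunk is the same single token (bottom out).
    if all(len(c) == 1 for c in chunks) and len({c[0] for c in chunks}) == 1:
        return chunks[0][0]
    # Struct hypothesis: tokens with a fixed count in every chunk.
    first = chunks[0]
    fixed = {t for t in set(first)
             if all(c.count(t) == first.count(t) for c in chunks)}
    skeleton = [t for t in first if t in fixed]
    if skeleton and all([t for t in c if t in fixed] == skeleton for c in chunks):
        # Split each chunk into the subcontexts between skeleton tokens.
        columns = [[] for _ in range(len(skeleton) + 1)]
        for chunk in chunks:
            parts, current = [], []
            for t in chunk:
                if t in fixed:
                    parts.append(current)
                    current = []
                else:
                    current.append(t)
            parts.append(current)
            for column, part in zip(columns, parts):
                column.append(part)
        fields = []
        for i, column in enumerate(columns):
            if any(column):  # recurse only into non-empty subcontexts
                fields.append(discover(column))
            if i < len(skeleton):
                fields.append(skeleton[i])
        return ("Struct", fields)
    # Union hypothesis (simplified): one branch per distinct token sequence.
    groups = {}
    for chunk in chunks:
        groups.setdefault(tuple(chunk), []).append(chunk)
    return ("Union", [discover(group) for group in groups.values()])


quoted_int = ["Quote", "Int", "Comma", "White", "Int", "Quote"]
quoted_str = ["Quote", "Int", "Comma", "White", "String", "Quote"]
inferred = discover([quoted_int, quoted_str, quoted_str, quoted_int])
# ('Struct', ['Quote', 'Int', 'Comma', 'White',
#             ('Union', ['Int', 'String']), 'Quote'])
```

On token sequences shaped like the example data, the sketch yields a struct of quote, integer, comma, white space, a union of integer or string, and a closing quote.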

Testing and Evaluation
• Evaluated overall results qualitatively
  • Compared with Excel: a manual process with limited facilities for representation of hierarchy or variation
  • Compared with hand-written descriptions: performance variable depending on tokenization choices & complexity
• Evaluated accuracy quantitatively
  • For many formats: 95%+ accuracy from 5% of available data
• Evaluated performance quantitatively
  • Hours to days to hand-write formats
  • After fixing the format, appears to scale linearly with data size
  • <1 min on 300K data

Technical Summary [www.padsproj.org]
• PADS 1.0 is an effective implementation framework for many data processing tasks
• PADS 2.0 improves programmer productivity further by automatically inferring formats & generating many tools & libraries
[Figure labels: Email, ASCII log files, Binary Traces; struct { ... }; CSV, XML]

End

Execution Time
[Table not recovered. Legend: SD = structure discovery; Ref = refinement; Tot = total; HW = hand-written]

Training Time

Minimum Necessary Training Sizes

Problem: Tokenization
• Technical problem:
  • Different data sources assume different tokenization strategies
  • Useful token definitions sometimes overlap, can be ambiguous, and aren't always easily expressed using regular expressions
  • Matching the tokenization of the underlying data source can make a big difference in structure discovery.
• Current solution:
  • Parameterize learning system with customizable configuration files
  • Automatically generate lexer file & basic token types
• Future solutions:
  • Use existing PADS descriptions and data sources to learn probabilistic tokenizers
  • Incorporate probabilities into sophisticated back-end rewriting system
  • Back end has more context for making final decisions than the tokenizer, which reads 1 character at a time without lookahead
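To illustrate why the token configuration matters, the sketch below builds a lexer from an ordered list of token definitions and runs the same input through two configurations. The list-of-pairs configuration, the names `build_lexer`, `BASIC`, and `WITH_TIME`, and the first-match-wins rule are assumptions for illustration; they do not reproduce the PADS configuration file format.

```python
import re


def build_lexer(token_defs):
    """token_defs: ordered (name, regex) pairs; earlier entries win on overlap."""
    pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in token_defs))

    def lex(text):
        return [m.lastgroup for m in pattern.finditer(text)]

    return lex


BASIC = [
    ("Int", r"\d+"),
    ("Punct", r"[^\w\s]"),
    ("White", r"\s+"),
    ("String", r"[A-Za-z]+"),
]
# Same basic tokens plus one distinctive token that overlaps Int and Punct.
WITH_TIME = [("Time", r"\d{2}:\d{2}:\d{2}")] + BASIC

basic_view = build_lexer(BASIC)("18:46:51")
time_view = build_lexer(WITH_TIME)("18:46:51")
# basic_view == ['Int', 'Punct', 'Int', 'Punct', 'Int']
# time_view == ['Time']
```

The two token streams lead structure discovery toward very different descriptions of the same field: a five-part struct in one case and a single base type in the other.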
