Ad Hoc Data: From Uggh to Smug
David Walker, Princeton University
00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man............. :-
Ad Hoc Data is Everywhere
• Lots of data in databases ==> even more data that isn’t
• Ad Hoc Data: sets of semi-structured data files for which standard data processing tools are unavailable
• Tasks: “getting the data into a database” (and other kinds of transformations), data cleaning, querying, editing, parsing, ...
• Troubles: error prone, limited documentation, evolving formats, huge volume, ...
• Examples: router configs, network monitoring, web logs, billing info, cosmology data
Two New Systems
• Anne: A “Mark-up Language” for Ad Hoc Data [PLDI 2010], with Qian Xi (Princeton)
• Forest: A Language for Specifying Environmental Assumptions, with Kathleen Fisher (AT&T), Nate Foster (Princeton), and Kenny Zhu (Shanghai Jiao Tong University)
Anne: A Context-free Mark-up Language for Ad Hoc Data [PLDI 2010]
Qian Xi
The Problem
What is the fastest, most reliable way to go from data like this:
207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
polux.entelchile.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
...
to a parse tree like this:
[Figure: parse tree. EntryList contains Entry nodes; an Entry has children IP (207.136.97.49), ..., Message, Code (200), and Size (3013); Message has children Sort (GET), URL (/turkey/amnty1.gif), and Protocol (HTTP/1.0).]
and generate documentation (a grammar) and tools such as a parser, printer, query engine, editor, XML converter, ...
Our Solution: Anne
• Develop a “mark-up language” for ordinary text
• programmers annotate raw text using a set of “grammatical directives”
• a simple, predictable algorithm generates a complete grammar & processing tools from directives + the surrounding raw data
Pros:
• really easy to use
• directives are simple -- applied when & where needed
• you can do it at 3am
• predictable
• documentation and tools may be generated automatically
Cons:
• not completely automatic
• but I’m skeptical any other more magical bullet exists anyway
Document:
207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
Document (edit the document to add directives):
{Entry:207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
Entry ::= int . int . int . int ‘ ’ ‘-’ ‘ ’ ‘-’ ‘ ’ ‘"’ word ... int ‘ ’ int
• Default tokenization of tagged data
• Non-terminal name drawn from directive
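The default tokenization step can be pictured with a small Python sketch. The token classes (int, word, quoted literals for everything else) and the name `default_rule` are assumptions made for illustration; Anne's actual tokenizer is driven by its config file and may classify text differently (the slide, for instance, shows the dots unquoted).

```python
import re

# Hypothetical token classes, tried in order; not Anne's real token set.
TOKEN_SPEC = [
    ("int", r"[0-9]+"),
    ("word", r"[A-Za-z]+"),
    ("space", r" "),
    ("punct", r"[^A-Za-z0-9 ]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def default_rule(name, text):
    """Build one grammar rule from tagged text.

    Runs of digits and letters become the classes `int` and `word`;
    every other character is kept as a quoted literal.
    """
    parts = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        parts.append(kind if kind in ("int", "word") else repr(match.group()))
    return f"{name} ::= " + " ".join(parts)


print(default_rule("Entry", "207.136.97.49 - -"))
```

The non-terminal name comes from the directive, and the right-hand side is derived purely from the tagged text, which is what makes the generated grammar predictable.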
Document (second directive):
{Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
ID ::= ‘-’
Entry ::= int . int . int . int ‘ ’ ‘-’ ‘ ’ ID ‘ ’ ‘"’ word ... int ‘ ’ int
• New grammar rule (ID)
• Default grammar now includes new non-terminal
Document (multiple identical name occurrences imply union of grammars):
{Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
ID ::= ‘-’ + word
Entry ::= int . int . int . int ‘ ’ ‘-’ ‘ ’ ID ‘ ’ ‘"’ word ... int ‘ ’ int
• ID is now a union of grammars
Document (= denotes presence of constant string):
{Entry:207.136.97.49 - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
ID ::= ‘-’ + word
Entry ::= int . int . int . int ‘ ’ ‘-’ ‘ ’ ID ‘ ’ ‘"’ ‘GET’ ... int ‘ ’ int
Document ($ directs the system to infer a terminating symbol; here a space follows the closing brace):
{Entry:{Loc$:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
Loc ::= {[^ ]*} (any string terminated by a space)
ID ::= ‘-’ + word
Entry ::= Loc ‘ ’ ‘-’ ‘ ’ ID ‘ ’ ‘"’ ‘GET’ ... int ‘ ’ int
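A rough Python sketch of the terminator inference follows. It assumes the terminator is simply the first character after the directive's closing brace, as the slide suggests; the function name `infer_terminated_rule` is hypothetical and this is not Anne's implementation.

```python
import re


def infer_terminated_rule(name, following_text):
    """Build a rule matching any string up to the inferred terminator.

    `following_text` is the raw text right after the directive's closing
    brace; its first character is taken as the terminating symbol.
    """
    terminator = following_text[0]
    # Equivalent to the slide's Loc ::= [^ ]* when the terminator is a space.
    return name, re.compile(f"[^{re.escape(terminator)}]*")


name, loc = infer_terminated_rule("Loc", ' - {ID:-} "GET ...')
print(name, loc.match("polux.entel.net - - ...").group())
```

Because the rule is "anything up to a space", the same Loc rule covers both dotted IP addresses and host names without further annotation.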
Interjection: The Config File
• A config file provides a mechanism for defining regular expressions and giving them names
• def is an internal definition
• exp is an exported named regular expression
• The default config file provides regular expressions for common systems data (IP, dates, times, URL, email, ...)
default.config:
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
...
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
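To make the def/exp distinction concrete, here is a minimal Python sketch of a config loader. It is an illustration only: the definitions are rewritten in Python regex syntax (`|` and `( )` instead of `\|` and `\( \)`), and `load_config` is a hypothetical name, not part of Anne.

```python
import re

# Modeled on the slide's default.config, translated to Python regex syntax
# (an assumption; the original uses backslash-escaped operators).
CONFIG = r"""
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am|AM|pm|PM
def trip [0-9][0-9][0-9]|[0-9][0-9]|[0-9]
exp Time {db}:{db}:{db}([ ]*{ampm})?([ \t]+{zone})?
exp IP {trip}\.{trip}\.{trip}\.{trip}
"""


def load_config(text):
    """Return {name: compiled regex} for exported (exp) definitions only."""
    defs, exported = {}, {}
    for line in text.strip().splitlines():
        kind, name, body = line.split(None, 2)
        # Expand {name} references to earlier definitions, each in its own group.
        expanded = re.sub(r"\{(\w+)\}", lambda m: f"(?:{defs[m.group(1)]})", body)
        defs[name] = expanded
        if kind == "exp":
            exported[name] = re.compile(expanded)
    return exported


tokens = load_config(CONFIG)
print(sorted(tokens))
print(bool(tokens["IP"].fullmatch("207.136.97.49")))
```

Internal definitions such as `trip` are usable inside later entries but are not visible as token names, which mirrors the internal/exported split described on the slide.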
Document (pre-defined token):
{Entry:{IP:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gi .... 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
IP ::= ... from config file ... (definition drawn from config file)
ID ::= ‘-’ + word
Entry ::= IP ‘ ’ ‘-’ ‘ ’ ID ‘ ’ ‘"’ ‘GET’ ... int ‘ ’ int
Other Features
• Most features inspired by similar constructs found in PADS
• Enumerations
• Recursion (context-freedom)
• Kleene Star (with optional element definitions, separators, and terminators)
• Options
• Prioritized Unions
• Assertions
• Tables
• Generated Artifacts:
• PADS description (and from there, the PADS tool suite)
• XML & CSS for debugging
• Semantics: connections to Relevance Logic [see PLDI 10]
Repetition (1)
Kleene Star with elements separated by ‘|’ and defined by first element
{Record*[|]:9152271|9152271|1|0|0|0|0|1}
Elem ::= int
Record ::= (Elem (‘|’ Elem)* )?
Repetition (2)
Kleene Star with elements separated by ‘|’ and defined by Item
{Record/Item*[|]:9152271|{Item:9152271}|1|0|0|0|0|1}
Item ::= int
Record ::= (Item (‘|’ Item)* )?
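The generated rule Record ::= (Elem (‘|’ Elem)* )? can be read as a separated-list parser. The sketch below is one way to express it in Python, assuming Elem is an unsigned integer; `parse_separated` is a hypothetical helper, not a generated Anne or PADS tool.

```python
import re


def parse_separated(text, sep="|", elem=r"[0-9]+"):
    """Parse (Elem (sep Elem)*)? and return the elements, or None on mismatch."""
    if text == "":
        return []  # the whole sequence is optional
    items = text.split(sep)
    if all(re.fullmatch(elem, item) for item in items):
        return items
    return None


print(parse_separated("9152271|9152271|1|0|0|0|0|1"))
```

Passing a different `elem` pattern corresponds to naming the element type explicitly, as the Item variant of the directive does.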
Optional Data
? denotes optional data
{Record/Item*[|]:9152271|{Item?:9152271}|1|0||0||1} (note the missing elements)
Item ::= int?
Record ::= (Item (‘|’ Item)* )?
Assertions & Context-Freedom
! claims underlying data will satisfy nonterminal Parens
{Parens?:({Parens!:(((())))})}
Parens ::= (‘(’ Parens ‘)’)?
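The recursive rule Parens ::= (‘(’ Parens ‘)’)? is what takes the language beyond regular expressions. A small recursive recognizer in Python, written here only as an illustration of that grammar (the function names are hypothetical), looks like this:

```python
def parse_parens(text, pos=0):
    """Recognize Parens ::= ('(' Parens ')')? from pos; return the end position."""
    if pos < len(text) and text[pos] == "(":
        inner_end = parse_parens(text, pos + 1)
        if inner_end < len(text) and text[inner_end] == ")":
            return inner_end + 1
        # The optional part did not match, so fall back to the empty string.
    return pos


def is_parens(text):
    """True if the whole text is derivable from Parens."""
    return parse_parens(text) == len(text)


print(is_parens("(((())))"), is_parens("(()"))
```

Note that this grammar only describes nesting: a string such as `()()` is rejected because the rule has no sequencing of sibling groups.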
Table (1)
{E#:Jason Blake, 78 25 38 63 -2
Alexei Ponikarovsky, 82 23 38 61 6
...}
Row ::= Word ‘ ’ Word ‘,’ ‘\t’ int ...
Record ::= Row (NL Row)*
Table (2)
{E#h:Name GP Goals Assists Points +/-
Jason Blake, 78 25 38 63 -2
Alexei Ponikarovsky, 82 23 38 61 6
...}
Row ::= ...
Header ::= ‘Name’ ‘\t’ ...
Record ::= Header NL Row*
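As a loose Python analogue of the table directives, the sketch below parses a header line followed by rows. It assumes columns are tab-separated (the Row rule on the slide mentions ‘\t’, but the slide text has lost the original whitespace) and that every column after the name is an integer; `parse_table` is a hypothetical helper.

```python
def parse_table(text, has_header=True):
    """Parse Header NL Row* (or Row (NL Row)* when has_header is False)."""
    lines = text.splitlines()
    header = lines[0].split("\t") if has_header else None
    rows = []
    for line in (lines[1:] if has_header else lines):
        name, *numbers = line.split("\t")
        # Drop the trailing comma after the name and convert the statistics.
        rows.append((name.rstrip(","), [int(n) for n in numbers]))
    return header, rows


SAMPLE = (
    "Name\tGP\tGoals\tAssists\tPoints\t+/-\n"
    "Jason Blake,\t78\t25\t38\t63\t-2\n"
    "Alexei Ponikarovsky,\t82\t23\t38\t61\t6"
)
header, rows = parse_table(SAMPLE)
print(header)
print(rows[0])
```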
Forest: A Specification Language for Environmental Assumptions [work in progress!]
Kathleen Fisher, Nate Foster, Kenny Zhu
Various causes for errors: • Missing files • Directories/files in wrong locations • Wrong permissions • Links to wrong targets
If only we could... • Describe required file and directory structure, including permissions, etc. • Check that the actual file system matches the spec. • Eliminate a whole class of errors!
CORAL Monitoring System • Monitoring system for an “Internet-scale, self-organizing, web-content distribution network” developed by Mike Freedman, Princeton.
Observations on Monitoring
• Coral is similar to other monitoring systems: PlanetLab and a multitude of systems at AT&T.
• Often a configuration file specifies which hosts to monitor, what data to collect, and how often.
• File and directory names encode meta-data.
• Want to ask questions such as:
• what was the total load on planetlab1 last week?
• on what days and at what times are files missing?
• what is the maximum memory usage?
• Answering questions requires formulating queries both in terms of the contents of files and the structure of the file system (directory names, file names)
Other Possible Examples
• Filesystem Hierarchy Standard (FHS) for Unix-like installations
• Haskell code base, PADS Source Tree
• source code, data, examples, executables, ...
• Cabal system for GHC libraries
• Disk cache for browser history, IMAP mail
• Scientific data sets
• CVS, SVN, other source control systems
To Do!
• We need a language not just for specifying the contents (formats) of ad hoc data files but also for the structure of file system fragments
• specify files
• directory structure
• dependencies (config files determine file system structure)
• meta-data (permissions, sizes, owners, modification times)
The Plan
• Build such a specification language on top of PADS
• Generate a checker from the specifications
• Interface that allows programs to slurp up specified data from the file system
• Stand-alone tools: query engine, monitor, etc...
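To give a feel for what a generated checker might do, here is a minimal Python sketch. The spec format (a nested dict in which a dict value means "directory" and an integer value is a permission mask required on a regular file) and the function name `check` are assumptions for illustration; Forest's real descriptions are richer and this sketch ignores links, owners, sizes, and modification times.

```python
import os
import stat


def check(root, spec):
    """Return a list of mismatches between the file system at root and spec."""
    errors = []
    for name, expected in spec.items():
        path = os.path.join(root, name)
        if not os.path.lexists(path):
            errors.append(f"missing: {path}")
        elif isinstance(expected, dict):
            # A dict describes a directory and its required entries.
            if os.path.isdir(path):
                errors.extend(check(path, expected))
            else:
                errors.append(f"not a directory: {path}")
        elif not os.path.isfile(path):
            errors.append(f"not a regular file: {path}")
        elif (stat.S_IMODE(os.stat(path).st_mode) & expected) != expected:
            # All permission bits in the mask must be set on the file.
            errors.append(f"wrong permissions: {path}")
    return errors
```

A call such as `check("/some/root", {"Config": 0o400, "host1": {"corald.log": 0o400}})` would report missing files, entries of the wrong kind, and files lacking owner-read permission; the path and names in this call are placeholders.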
Example: CORAL
ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}
Example: CORAL
ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}
ptype date_d(t::pdate) = pdirectory {
  corald is "corald.log" :: corald_t <| timestamp >= t |>;
  coraldns is "nssrv.log" :: dns_t <| timestamp >= t |>;
  coralweb is "websrv.log" :: web_t <| timestamp >= t |>;
  probe is "probed.log" :: probe_t <| timestamp >= t |>;
  time :: pdate = t;
}
Example: CORAL
ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}
ptype date_d(t::pdate) = pdirectory { ... as before ... }
ptype host_d = pdirectory {
  times is [t::date_d(t) | t <- pdate];
}
Example: CORAL
ptype conf_t = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t = ... {- pads description -}
ptype web_t = ... {- pads description -}
ptype probe_t = ... {- pads description -}
ptype date_d(t::pdate) = pdirectory { ... as before ... }
ptype host_d () = pdirectory {
  times is [t::date_d(t) | t <- pdate];
}
ptype coral_d () = pdirectory {
  hostNames is "Config" :: conf_t;
  hosts is [h::host_d | h <- hostNames];
}
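The key idea in coral_d is that the contents of one file (Config) determine which directories are expected. The Python sketch below mimics that dependency by hand. It makes several labeled assumptions: Config is taken to list one host name per line (the real conf_t is elided on the slide), every subdirectory of a host directory is treated as a date directory, and the timestamp constraints are ignored. The function names are hypothetical.

```python
import os

# Field name -> file name, taken from the date_d description.
LOG_FILES = {
    "corald": "corald.log",
    "coraldns": "nssrv.log",
    "coralweb": "websrv.log",
    "probe": "probed.log",
}


def load_date_dir(path):
    """Map each date_d field to its file path, or None if the file is missing."""
    result = {}
    for field, fname in LOG_FILES.items():
        full = os.path.join(path, fname)
        result[field] = full if os.path.isfile(full) else None
    return result


def load_coral(root):
    """Return {host: {date: {field: path-or-None}}} for hosts named in Config."""
    # ASSUMPTION: one host name per line; the actual Config format is not shown.
    with open(os.path.join(root, "Config")) as f:
        hosts = [line.strip() for line in f if line.strip()]
    tree = {}
    for host in hosts:
        host_dir = os.path.join(root, host)
        dates = sorted(os.listdir(host_dir)) if os.path.isdir(host_dir) else []
        tree[host] = {d: load_date_dir(os.path.join(host_dir, d)) for d in dates}
    return tree
```

The resulting nested dictionary makes a question like "on what days are files missing?" a simple traversal looking for None entries, which is the kind of query the description language is meant to support directly.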
Current & Future Plans
• Designing a semantics based on a classical logic of trees
• We considered using one of the substructural (“separating”) tree logics but we discarded it as the substructural logics gave us the wrong defaults & made the system harder to design and understand (especially in the presence of parent pointers)
• Building a “file system parser” & tool generation infrastructure in Haskell
• Leverage type-directed programming.
• Leverage laziness in loading structures.
• Envision a collection of file system management tools based on descriptions
• valid -desc d -- check for conformance to d
• ls -desc d -- list files described by d
• grep pattern -desc d -- grep for pattern in files described by d
• mv -desc d foo bar -- move files described by d rooted at foo to bar
• Thinking about a query engine & continuous monitoring system
• Considering extensions to handle other elements of the programming environment: environment variables