260 likes | 275 Views
This project aims to help end-user programmers infer data formats from examples, eliminating the need for learning specialized notations. The Topes system converts human-readable formats into context-free grammars, enabling efficient validation and customization of formats.
E N D
Unsupervised Inference of Data Formats inHuman-Readable Notation Christopher Scaffidi Carnegie Mellon University
Target audience • In 2012, we project that there will be 90 millioncomputer end users (“EUs”) in American workplaces. • Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). • For professional programmers, programs are a deliverable. • For EUPs, programs are a means to an end. motivation ●overview ● algorithm ●evaluation ● conclusion
Current practice:Storing data as strings • Typical tools of EUPs: • Excel “text” cells • Access/MSSQL “varchar” fields • FrontPage/Dreamweaver “textfield” inputs • Validation involves… • Learning an exotic new notation (VBScript, regexps, etc) • Writing cumbersome expressions in that notation • Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations. motivation ●overview ● algorithm ●evaluation ● conclusion
Our Topes system to date… • Formats are presented in human-readable notationin our format editor • Format = sequence of parts with constraints on parts • Constraints can be “often” true (rather than “always”) • The format is automatically converted to a context-free grammar, with constraints attached to productions. • At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value. motivation ●overview ● algorithm ●evaluation ● conclusion
Needed: Format inference • Problem: To date, the system offers limited support for helping users to get started. • Although users do not need to learn specialized notation, there is still the cognitive work of… • examining data • breaking it into parts • representing parts in the format • Solution: Infer a boilerplate format from examples motivation ●overview ● algorithm ●evaluation ● conclusion
Talk Outline • Motivation / Problem • Solution • Overview of Topei • Inference algorithm • Evaluation • Conclusion motivation ●overview ● algorithm ●evaluation ● conclusion
PrototypeTask flow diagram User creates a format from scratch or User highlights spreadsheet cells Plug-in flags cells that don’t match format User loads an existing format from a file or Algorithm infers a format from cell values User reviews and customizes format [1][6] motivation ●overview ● algorithm ●evaluation ● conclusion
Sample task: validating a spreadsheetwith the prototype we have built • The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in. motivation ●overview ● algorithm ●evaluation ● conclusion
Sample task: validating a spreadsheetCustomizing an inferred format • Inferred format is presented in editor with sentence-like prompts to improve human-readability • User can specify meaningful names for parts motivation ●overview ● algorithm ●evaluation ● conclusion
Sample task: validating a spreadsheetCustomizing constraints in our prototype • User can add/edit constraints motivation ●overview ● algorithm ●evaluation ● conclusion
Sample task: validating a spreadsheetFlagging potential errors • A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message motivation ●overview ● algorithm ●evaluation ● conclusion
Our algorithm has 2 phases Input: An array of strings Phase 1: Identify format parts Phase 2: Identify constraints on each part of each format Output: An array of formats • Sorted according to how many examples they match motivation ●overview ● algorithm●evaluation ● conclusion
Phase 1: Identify format parts • For each string, replace each character with its class, then collapse runs, generating a string “signature” • Supported character classes: A uppercase letter a lowercase letter 0 digit • Example: apple@gmail.com a@a.a banana1@hotmail.com a0@a.a carrot@company.com a@a.a DATE@UNIVERSITY.edu A@A.a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion
Phase 1: Identify format parts • Pack strings with identical signatures (often leads to significant performance improvement) • Example: apple@gmail.com a@a.a banana1@hotmail.com a0@a.a carrot@company.com <<packed with 1st, above>> DATE@UNIVERSITY.edu A@A.a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion
Phase 1: Identify format parts • Align signatures based on separators • Example: apple@gmail.com a @ a . a banana1@hotmail.com a0 @ a . a DATE@UNIVERSITY.edu A @ A . a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion
Phase 1: Identify format parts • Abstract to least general composite character class, yielding the parts of each format. • Example (3 formats below): apple@gmail.com a @ a . a banana1@hotmail.com a0 @ a . a DATE@UNIVERSITY.edu A @ A . a a0A aA a eggplant@firm-name.com a @ a - a . a fig.plant@mail.univ.edu a . a @ a . a . a motivation ●overview ● algorithm●evaluation ● conclusion
Phase 2: Identify constraints on each part • Constrain each part’s contents to the character classes • Require indicated separators before/after parts • Infer an additional content constraint that is “often” true: • Must be in a set of 3 or fewer literals? • Must be in a numeric range? • Must start with or end with certain characters? • A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature. • Afterward, the user can review/customize format. motivation ●overview ● algorithm●evaluation ● conclusion
Evaluation as an outlier finder • Outlier finding: • Infer a format from example values • Use the inferred format to check the examples Reveals “outliers” that might contain typos or other errors • Comparison algorithm: Lapis Lapis example @DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing @ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing @ShortYear is Number equal to /\d\d/ ignoring nothing Date is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation motivation ●overview ● algorithm ●evaluation ● conclusion
Evaluation dataDrawn from EUSES spreadsheet corpus • 6288 US phone numbers in 37 columns • First cell in column contains “phone” • And at least 20 cells have exactly 10 digits • And at least 2/3 of cells have exactly 10 digits • 1124 country names in 7 columns • First cell in column contains “country” • And there are at least 20 cells • And at least one cell contains “Portugal” motivation ●overview ● algorithm ●evaluation ● conclusion
Run each algorithm (Topei & Lapis)and compare their output to hand-labeling • For determining “true outliers” in calculating accuracy: • Outlier phone numbers have an area code that is not in service, or if they contain errant separators such as spaces not shared by most cells in the column. • Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English-speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy] motivation ●overview ● algorithm ●evaluation ● conclusion
Results: Topei’s precision/recall exceed Lapis’s • Standard machine learning measures for outlier finding • Precision = # outliers found / # outliers claimed • Recall = # outliers found / # true outliers motivation ●overview ● algorithm ●evaluation ● conclusion
Limitations & future work • Topei still makes mistakes: • Doesn’t infer constraints aggressively enough • Doesn’t recognize non-ASCII chars in character classes • Doesn’t handle formats with repeating parts • Need deeper integration with EUPs’ tools • Computational complexity: • Is intended to be O(# examples), seems to be true • More careful verification needed • Usability has not yet been evaluated in user study motivation ●overview ● algorithm ●evaluation ● conclusion
Thank You… • …to you for your interest and attention • …to INSTICC for the opportunity to present • …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929) motivation ●overview ● algorithm ●evaluation ● conclusion
Another example:Carnegie Mellon University phone #: 8-1234 motivation ●overview ● algorithm ●evaluation ● conclusion
Integration • Formats can be inferred from… • Spreadsheet cells • Database queries (e.g.: Access/MSSQL) • Arbitrary collection of text strings (via C#) • Formats can then be used without modification in other venues, as well. • E.g.: infer a format from spreadsheet cells, then use it to create a trigger for a database table motivation ●overview ● algorithm ●evaluation ● conclusion
Related Work • Many algorithms train a recognizer to notice features • See (Mitchell, 1997) for a summary • Such algorithms do not infer a human-editable format. • Others generate formats in specialized notation. • (Miller, 2001) (Blackwell, 2001) (Lerman, 2000), (Lieberman, 2001) (Nardi, 1998) • Regular expressions and CFGs have limited readability. • Several tools recognize or manipulate some of the same kinds of data as Topei. • (Hong, 2006) (Pandit, 1997) (Stylos 2004) • Custom formats are unsupported (only hardcoded formats) motivation ●overview ● algorithm ●evaluation ● conclusion