Unsupervised Inference of Data Formats in Human-Readable Notation

Unsupervised Inference of Data Formats inHuman-Readable Notation Christopher Scaffidi Carnegie Mellon University

Target audience • In 2012, we project that there will be 90 millioncomputer end users (“EUs”) in American workplaces. • Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). • For professional programmers, programs are a deliverable. • For EUPs, programs are a means to an end. motivation ●overview ● algorithm ●evaluation ● conclusion

Current practice:Storing data as strings • Typical tools of EUPs: • Excel “text” cells • Access/MSSQL “varchar” fields • FrontPage/Dreamweaver “textfield” inputs • Validation involves… • Learning an exotic new notation (VBScript, regexps, etc) • Writing cumbersome expressions in that notation • Most EUPs do not know these notations and have no time, interest, or incentive to learn the notations. motivation ●overview ● algorithm ●evaluation ● conclusion

Our Topes system to date… • Formats are presented in human-readable notationin our format editor • Format = sequence of parts with constraints on parts • Constraints can be “often” true (rather than “always”) • The format is automatically converted to a context-free grammar, with constraints attached to productions. • At runtime, our parser checks values against formats, returning a confidence in the range [0,1] for each value. motivation ●overview ● algorithm ●evaluation ● conclusion

Needed: Format inference • Problem: To date, the system offers limited support for helping users to get started. • Although users do not need to learn specialized notation, there is still the cognitive work of… • examining data • breaking it into parts • representing parts in the format • Solution: Infer a boilerplate format from examples motivation ●overview ● algorithm ●evaluation ● conclusion

Talk Outline • Motivation / Problem • Solution • Overview of Topei • Inference algorithm • Evaluation • Conclusion motivation ●overview ● algorithm ●evaluation ● conclusion

PrototypeTask flow diagram User creates a format from scratch or User highlights spreadsheet cells Plug-in flags cells that don’t match format User loads an existing format from a file or Algorithm infers a format from cell values User reviews and customizes format [1][6] motivation ●overview ● algorithm ●evaluation ● conclusion

Sample task: validating a spreadsheetwith the prototype we have built • The second column is “supposed” to contain first names, but some outlier values containing initials have snuck in. motivation ●overview ● algorithm ●evaluation ● conclusion

Sample task: validating a spreadsheetCustomizing an inferred format • Inferred format is presented in editor with sentence-like prompts to improve human-readability • User can specify meaningful names for parts motivation ●overview ● algorithm ●evaluation ● conclusion

Sample task: validating a spreadsheetCustomizing constraints in our prototype • User can add/edit constraints motivation ●overview ● algorithm ●evaluation ● conclusion

Sample task: validating a spreadsheetFlagging potential errors • A red flag (reviewer comment, actually) appears on cells that do not match the format; mouse over for message motivation ●overview ● algorithm ●evaluation ● conclusion

Our algorithm has 2 phases Input: An array of strings Phase 1: Identify format parts Phase 2: Identify constraints on each part of each format Output: An array of formats • Sorted according to how many examples they match motivation ●overview ● algorithm●evaluation ● conclusion

Phase 1: Identify format parts • For each string, replace each character with its class, then collapse runs, generating a string “signature” • Supported character classes: A uppercase letter a lowercase letter 0 digit • Example: apple@gmail.com a@a.a banana1@hotmail.com a0@a.a carrot@company.com a@a.a DATE@UNIVERSITY.edu A@A.a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion

Phase 1: Identify format parts • Pack strings with identical signatures (often leads to significant performance improvement) • Example: apple@gmail.com a@a.a banana1@hotmail.com a0@a.a carrot@company.com <<packed with 1st, above>> DATE@UNIVERSITY.edu A@A.a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion

Phase 1: Identify format parts • Align signatures based on separators • Example: apple@gmail.com a @ a . a banana1@hotmail.com a0 @ a . a DATE@UNIVERSITY.edu A @ A . a eggplant@firm-name.com a@a-a.a fig.plant@mail.univ.edu a.a@a.a.a motivation ●overview ● algorithm●evaluation ● conclusion

Phase 1: Identify format parts • Abstract to least general composite character class, yielding the parts of each format. • Example (3 formats below): apple@gmail.com a @ a . a banana1@hotmail.com a0 @ a . a DATE@UNIVERSITY.edu A @ A . a a0A aA a eggplant@firm-name.com a @ a - a . a fig.plant@mail.univ.edu a . a @ a . a . a motivation ●overview ● algorithm●evaluation ● conclusion

Phase 2: Identify constraints on each part • Constrain each part’s contents to the character classes • Require indicated separators before/after parts • Infer an additional content constraint that is “often” true: • Must be in a set of 3 or fewer literals? • Must be in a numeric range? • Must start with or end with certain characters? • A content constraint is inferred if it covers at least 95% of the examples supporting that format’s signature. • Afterward, the user can review/customize format. motivation ●overview ● algorithm●evaluation ● conclusion

Evaluation as an outlier finder • Outlier finding: • Infer a format from example values • Use the inferred format to check the examples  Reveals “outliers” that might contain typos or other errors • Comparison algorithm: Lapis Lapis example @DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing @ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing @ShortYear is Number equal to /\d\d/ ignoring nothing Date is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation motivation ●overview ● algorithm ●evaluation ● conclusion

Evaluation dataDrawn from EUSES spreadsheet corpus • 6288 US phone numbers in 37 columns • First cell in column contains “phone” • And at least 20 cells have exactly 10 digits • And at least 2/3 of cells have exactly 10 digits • 1124 country names in 7 columns • First cell in column contains “country” • And there are at least 20 cells • And at least one cell contains “Portugal” motivation ●overview ● algorithm ●evaluation ● conclusion

Run each algorithm (Topei & Lapis)and compare their output to hand-labeling • For determining “true outliers” in calculating accuracy: • Outlier phone numbers have an area code that is not in service, or if they contain errant separators such as spaces not shared by most cells in the column. • Outlier country names contain abbreviations, misspellings or a different name than the one usually used by English-speakers, except for a specific list of allowed exceptions that are very commonly used (e.g.: Brasil, US, UK) [note: allowing these exceptions hurts Topei’s accuracy] motivation ●overview ● algorithm ●evaluation ● conclusion

Results: Topei’s precision/recall exceed Lapis’s • Standard machine learning measures for outlier finding • Precision = # outliers found / # outliers claimed • Recall = # outliers found / # true outliers motivation ●overview ● algorithm ●evaluation ● conclusion

Limitations & future work • Topei still makes mistakes: • Doesn’t infer constraints aggressively enough • Doesn’t recognize non-ASCII chars in character classes • Doesn’t handle formats with repeating parts • Need deeper integration with EUPs’ tools • Computational complexity: • Is intended to be O(# examples), seems to be true • More careful verification needed • Usability has not yet been evaluated in user study motivation ●overview ● algorithm ●evaluation ● conclusion

Thank You… • …to you for your interest and attention • …to INSTICC for the opportunity to present • …to NSF and EUSES for funding (ITR-0325273 and CCF-0438929) motivation ●overview ● algorithm ●evaluation ● conclusion

Another example:Carnegie Mellon University phone #: 8-1234 motivation ●overview ● algorithm ●evaluation ● conclusion

Integration • Formats can be inferred from… • Spreadsheet cells • Database queries (e.g.: Access/MSSQL) • Arbitrary collection of text strings (via C#) • Formats can then be used without modification in other venues, as well. • E.g.: infer a format from spreadsheet cells, then use it to create a trigger for a database table motivation ●overview ● algorithm ●evaluation ● conclusion

Related Work • Many algorithms train a recognizer to notice features • See (Mitchell, 1997) for a summary • Such algorithms do not infer a human-editable format. • Others generate formats in specialized notation. • (Miller, 2001) (Blackwell, 2001) (Lerman, 2000), (Lieberman, 2001) (Nardi, 1998) • Regular expressions and CFGs have limited readability. • Several tools recognize or manipulate some of the same kinds of data as Topei. • (Hong, 2006) (Pandit, 1997) (Stylos 2004) • Custom formats are unsupported (only hardcoded formats) motivation ●overview ● algorithm ●evaluation ● conclusion

Unsupervised Inference of Data Formats in Human-Readable Notation

Unsupervised Inference of Data Formats in Human-Readable Notation

Presentation Transcript

NASA SATELLITE DATA FORMATS

Bayesian models of human inference

Data Formats and Tools

EPC Data Formats

Writing HUMAN READABLE CODE

Data formats in Bioinformatics

Spatial Data Formats

Human-Understandable Inference of Causal Relationships

Data Formats

CHAPTER 3: Data Formats

MTEM data formats

Standard Metadata in Scientific Data Formats

Spatial Data Formats

Data Formats

Processing different formats of data

Data formats in e-Science

Other formats for data

Interoperable Data Formats

CHAPTER 3: Data Formats

Data Storage Formats

Bayesian models of human inference

Data Formats and Codecs