A Data Model and Development Environment to Help End-User Programmers Validate and Reuse Data
Christopher Scaffidi
Thesis Proposal, May 8, 2007
Committee
Target audience
• In 2012, we project that there will be 90 million computer end users (“EUs”) in American workplaces.
• Of these, at least half will create spreadsheets, databases, and/or web applications. These are called end-user programmers (“EUPs”). [5]
• Both EUs and EUPs will benefit from the proposed research, though it is primarily aimed at EUPs (including EUs who become EUPs because of the research).
introduction ● related work ● studies ● prototype ● proposed work ● evaluation ● summary
Contextual inquiry: What are the problems of EUs and EUPs?
• Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs each)
How do you validate web forms if you do not know JavaScript?
• Is the input valid? “EDSH 225”
• Is the input nearly valid? “EDXH 225”
• Does it just need reformatting? “Smith 225”
• Or is it obviously invalid? “Robotics Institute”
Other tasks, other data, other problems
• When building a staff roster by merging data sources into a single spreadsheet, one of the EUs:
  • Had to manually transform data to a consistent format (e.g., put person names in Lastname, Firstname format)
  • Had to scrutinize data to identify questionable values that deserved double-checking (e.g., a first name with 15 characters might be right)
  • Had to manually check for (near-)duplicates (e.g., “Scaffidi, Christopher” and “Scaffidi, Chris”)
• We and research collaborators identified many additional data validation and data reuse tasks that were poorly supported by existing tools. [3][7][9]
Underlying problem: abstraction mismatch
• Tools support strings, integers, floats, sometimes dates.
• The problem domain involves higher-level categories of data:
  • University names: “Carnegie Mellon”, “CMU”
  • Person names: “Scaffidi, Christopher”, “Chris Scaffidi”
  • CMU phone numbers: “8-1234”, “x8-1234”
  • CMU room numbers: “WeH 4623”, “Wean 4623”
• These data categories are:
  • Human-readable
  • Short (~ 1 input field)
  • Multi-format
  • Sometimes ambiguous / fuzzy (non-binary scale of validity)
  • Often particular to certain groups of people
A New Direction: Create a new abstraction for each category of data
• Like software “libraries,” implementations of these abstractions could be reused in many programs.
• Abstractions would need to include functionality for:
  • Recognizing instances of the category (for automating data validation)
  • Transforming instances among various formats (for automating data reformatting)
  • Testing instances for equality (for automating removal of duplicates)
A New Direction: Other requirements for abstractions
• EUPs over a range of programming expertise must be able to create custom new abstractions.
• Flexibility:
  • Abstractions must capture fuzziness when recognizing instances of the category and when testing equivalence.
  • EUPs must have the option of configuring abstractions to learn exceptional cases.
• Sharability:
  • EUPs must still be able to share and find useful abstractions even as the number of abstractions grows.
  • Latency and throughput of operations must not become burdensome as EUPs share numerous abstractions.
Thesis
The proposed data model and development environment will enable end-user programmers to implement and share custom abstractions for flexibly recognizing, transforming, and equivalence-testing values in categories of short, human-readable data. The model and environment will help end-user programmers validate and reuse data more quickly and correctly than is possible through currently practiced methods.
Topes
• Tope = an abstraction implementation for a data category
  • Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain
• Topes in practice:
  • EUPs create new topes by using the basic tope editor (or by writing topes in another language, such as JavaScript).
  • EUPs publish topes on repositories.
  • Other EUs & EUPs download topes to their local cache.
  • Tool plug-ins let EUs & EUPs browse their local cache and associate topes with variables and input fields.
  • Plug-ins get topes from the local cache and use them to recognize, transform, and equivalence-test data.
Outline
• Introduction
• Related work
• Exploratory studies
• Prototype
• Proposed work
• Evaluation
• Summary and schedule
Related work: Existing approaches lack an easy way for EUPs to create flexible, sharable abstractions for data categories.
Existing programming tools for EUPs (e.g., Excel, Visual Studio Express, Robofox)
• Limited support for a closed set of data categories:
  • Spreadsheets (like Excel) allow EUs to associate certain formats with cells, but these do not actually validate data.
  • Web application design tools (like Visual Studio) allow EUPs to apply certain limited constraints to validate input.
  • Web macro tools (like Robofox) allow EUPs to store certain personal data (e.g., phone number) and reuse it.
• No straightforward mechanisms for EUPs to create new abstractions for unsupported categories of data
User-definable data formats (e.g., SWYN, Grammex, Lapis, Data Detectors)
• EUPs struggle to understand and create regexps/CFGs.
• These formats are binary (non-fuzzy) recognizers.
• Formats alone do not transform or equivalence-test data.
• Only Apple Data Detectors offers sharing mechanisms.
Lapis example:
    @DayOfMonth is Number equal to /[12][0-9]|3[01]|0?[1-9]/ ignoring nothing
    @ShortMonth is Number equal to /1[012]|0?[1-9]/ ignoring nothing
    @ShortYear is Number equal to /\d\d/ ignoring nothing
    Date is flatten @ShortMonth then @DayOfMonth then @ShortYear ignoring either Spaces or Punctuation
Formal and OO types (e.g., ML, Java, C#)
• Type systems are inflexible:
  • A value is or is not a valid instance of a type (non-fuzzy).
  • If a value is invalid at compile-time, it cannot become valid at runtime.
• Typed languages are likely to be difficult for EUPs, who are already uncomfortable with untyped scripting languages.
Format-inference and constraint-enforcing (e.g., information extraction, Lapis, Cues, Slate)
• Various approaches:
  • Many algorithms infer an abstract model, CFG-like grammar, or other format with very low editability.
  • Other algorithms enforce constraints (either inferred or specified by EUPs) that cannot handle string-like data.
• Formats, grammars, and constraints are not able to transform or equivalence-test data.
Outline: Exploratory studies
• Tasks commonly involve recognizing, transforming, and equivalence-testing values in categories of short, human-readable data.
Survey of EUPs: Better data-manipulation features needed
• Asked 831 information workers about use of 23 features in 5 tools (e.g., creating spreadsheet macros, database stored procedures, and web forms) [4][9]
• The most widely used features were related to manipulating linked structures of data (e.g., database tables) rather than imperative or macro programming.
• Yet respondents complained about these features:
  • “Not always easy to move sturctured [sic] data or text”
  • “Not always integrated a lot of data manipulation redundant”
  • “Information entered inconsistently into database fields by different people leaves a lot of database cleaning”
Contextual inquiry of EUs and EUPs: Specific data-manipulation features needed
• Observed 3 administrative assistants, 4 managers, and 3 webmasters/graphic designers (1-3 hrs each) [3][9]
• They needed better support for automatically:
  • Transforming data values among different formats within the same category of data (e.g., ST to State)
  • Identifying questionable data values that could be acceptable for a task but deserve double-checking
  • Identifying duplicate values, including values that were probably equivalent
Interviews of web site creators: Confirmation of specific features needed
• Interviewed 6 people involved in creating “person locator” web sites after Hurricane Katrina [7][9]
• Many omitted data validation on web forms.
  • Hard to detect that “12 Years old” is an invalid street address (what would the regexp look like?)
• “Aggregator” sites were built to scrape and consolidate data from numerous person locator sites.
  • Hard to transform data into a single consistent format
  • Hard to identify probable duplicates in the merged data set
Outline: Prototype
• How could flexible formats be expressed?
Prototype
Task flow diagram
[Diagram: the user obtains a format in one of three ways (creates a format from scratch, loads an existing format from a file, or highlights spreadsheet cells so that an algorithm infers a format from the cell values); the user then reviews and customizes the format; the plug-in flags cells that don’t match the format.] [1][6]
Sample task: validating a spreadsheet with the prototype we have built
• The second column is “supposed” to contain first names, but some initials have snuck in.
Sample task: validating a spreadsheet
Customizing an inferred format
• User can specify meaningful names for parts.
Sample task: validating a spreadsheet
Customizing constraints in our prototype
• User can add/edit constraints.
Sample task: validating a spreadsheet
Flagging potential errors
• A red flag (actually a reviewer comment) appears on cells that do not match the format; mouse over it for the message.
Sample task: web form validation
The painful old way
• Drag widgets and validator onto page, select a regexp, customize if desired.
Sample task: web form validation
Results of the painful old way
• Invalid inputs cause a hard-coded message to appear.
  • Oops, forgot to enter a message at design-time.
• For valid inputs, no error message appears.
  • Hm, didn’t realize the area code was optional.
  • What if I want to allow campus phone numbers?
Sample task: web form validation
The wonderful new way
• Drag widgets and validator onto page, select a format, customize if desired.
Sample task: web form validation
Creating this format took 55 seconds
Sample task: web form validation
Results of the new way
• Invalid inputs cause a targeted message to appear.
• Inputs that violate an always or never constraint cannot be submitted to the server.
• Inputs that violate an often constraint cause a warning, which the application user can override.
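The three constraint modalities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Constraint` class, the `check` function, and the three phone-number constraints are hypothetical and are not taken from the prototype.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Constraint:
    modality: str                  # "always", "never", or "often"
    description: str               # used to build a targeted message
    holds: Callable[[str], bool]   # True if the described property holds


def check(value: str, constraints: List[Constraint]) -> Tuple[str, List[str]]:
    """Return ("accept" | "warn" | "reject", targeted messages)."""
    verdict, messages = "accept", []
    for c in constraints:
        ok = c.holds(value)
        # A "never" constraint is violated when its property holds;
        # "always" and "often" are violated when it does not.
        violated = ok if c.modality == "never" else not ok
        if not violated:
            continue
        messages.append(f"Input {c.modality} {c.description}")
        if c.modality in ("always", "never"):
            verdict = "reject"     # hard constraint: block submission
        elif verdict != "reject":
            verdict = "warn"       # soft constraint: user may override
    return verdict, messages


# Hypothetical constraints for a phone-number field.
PHONE = [
    Constraint("always", "contains only digits and hyphens",
               lambda v: re.fullmatch(r"[\d-]+", v) is not None),
    Constraint("never", "starts with 0", lambda v: v.startswith("0")),
    Constraint("often", "has 10 digits (area code included)",
               lambda v: sum(ch.isdigit() for ch in v) == 10),
]
```

With these assumed constraints, `check("555-1212", PHONE)` yields a warning rather than a rejection, because only the soft area-code constraint is violated.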
Prototype implementation
System block diagram
[Block diagram; component labels: Microsoft Excel, Plug-in, Spreadsheet, Microsoft Visual Studio.NET, Plug-in, Web application, Validator, Format editor, Parser.]
Benefits of the format editor
• Exotic regexp notation is replaced with sentence-like screen prompts.
• Soft constraints (“often”) are supported.
• Negation constraints (“never”) are supported.
• In terms of expressiveness: augmented context-free grammars > context-free grammars > regexps
But is the expressiveness adequate for common data?
Expressiveness evaluation
• Four administrative assistants’ use of a web browser was logged for three weeks, resulting in nearly 6000 sample data values that they typed into web forms.
  • Values were not logged verbatim: characters were generalized.
  • E.g., Cscaffid0@gmail.com → Aa{7}0@a{5}.a{3}
• We manually grouped values into 19 semantic families (e.g., email address) based on each widget’s HTML name and the words visually near the widget.
• Created and tested formats for 14 families (4250 values)
  • Omitted: usernames/passwords and long blocks of “text”
  • Inference & testing features were not used during format creation
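The character generalization used for logging can be illustrated with a short Python sketch. The mapping below is an assumption inferred from the single example on the slide (uppercase letters become `A`, lowercase letters become `a`, digits become `0`, runs are collapsed into `{n}` counts, and other characters are kept verbatim); the function names are hypothetical.

```python
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to its generalized symbol (assumed mapping)."""
    if ch.isupper():
        return "A"
    if ch.islower():
        return "a"
    if ch.isdigit():
        return "0"
    return ch  # punctuation and other characters are kept as-is


def generalize(value: str) -> str:
    """Collapse runs of same-class characters, e.g. 'abc' -> 'a{3}'."""
    pieces = []
    for symbol, run in groupby(value, key=char_class):
        n = len(list(run))
        if symbol in ("A", "a", "0") and n > 1:
            pieces.append(f"{symbol}{{{n}}}")
        else:
            pieces.append(symbol * n)
    return "".join(pieces)


print(generalize("Cscaffid0@gmail.com"))  # Aa{7}0@a{5}.a{3}
```

Generalizing in this way keeps the shape of each value, which is what format creation needs, without storing what the participant actually typed.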
Expressiveness evaluation results
• 9 families needed 1 format each; 5 needed 2 formats each.
• Easy to quickly express a reasonably correct format?
  • 11 families took < 1 minute each; the others took 3, 5, and 7 minutes.
  • No errors found in formats for 9 families; 5 had errors.
  • Most errors: forgetting to mark a part as optional
  • Testing feature was added after this evaluation
• The only error attributable to editor expressiveness:
  • 1 of the 4250 test values had a trailing period on a street type (in an address line).
  • This particular version of the editor had no way to say that a part could contain a period, but only at the end. [6]
Extension and further evaluation needed
• The editor evaluation again highlighted the need for supporting multiple formats within each data category.
• The proposed work will add this support.
• Then, usability of the editor as a whole will be evaluated.
Outline: Proposed work
• Generalizing the prototype: a lightweight data model + a development environment to help EUPs create, share, and use topes
Proposed data model
• 1 tope implementation contains executable functions:
  • 1 isa: string → [0,1] function per format, for recognizing instances of the format
  • 0 or 1 eqc: string × string → [0,1] function per format, for testing equivalence of two values in a format (default is a binary test for being exactly identical)
  • 0 or more trf: string → string functions linking formats, for transforming values from one format to another
• A lightweight data model…
  • Only contains 3 kinds of functions (isa/eqc/trf)
  • These correspond to the operations that people had to keep performing manually in our studies.
Example tope
Notional representation
• An example tope for CMU room numbers
  • 3 isa functions, up to 3 eqc functions, 4 trf functions
  • A tope’s eqc and trf functions can be omitted if desired.
[Diagram: three formats linked by trf functions]
  • Formal building name & room number: “Elliot Dunlap Smith Hall 225”
  • Building abbreviation & room number: “EDSH 225”
  • Colloquial building name & room number: “Smith 225”
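A minimal Python sketch of the isa/eqc/trf data model, applied to two of the three room-number formats, might look as follows. Everything here is hypothetical: the class names, the one-entry building table, the regular expressions, and the choice of 0.5 as the score for a well-shaped value with an unknown building are assumptions made for illustration, not the proposed implementation.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical lookup table: building abbreviation -> colloquial name.
BUILDINGS = {"EDSH": "Smith"}


def exact_eqc(a: str, b: str) -> float:
    """Default eqc: a binary test for being exactly identical."""
    return 1.0 if a == b else 0.0


@dataclass
class Format:
    isa: Callable[[str], float]                  # string -> [0,1]
    eqc: Callable[[str, str], float] = exact_eqc  # string x string -> [0,1]


@dataclass
class Tope:
    formats: Dict[str, Format]
    # (source format, target format) -> trf: string -> string
    trfs: Dict[Tuple[str, str], Callable[[str], str]] = field(default_factory=dict)


def isa_abbreviation(value: str) -> float:
    m = re.fullmatch(r"([A-Za-z]{3,4}) (\d{3,4})", value)
    if not m:
        return 0.0
    # Right shape but unknown building: questionable rather than invalid.
    return 1.0 if m.group(1) in BUILDINGS else 0.5


def isa_colloquial(value: str) -> float:
    m = re.fullmatch(r"([A-Z][a-z]+) (\d{3,4})", value)
    if not m:
        return 0.0
    return 1.0 if m.group(1) in BUILDINGS.values() else 0.5


def abbreviation_to_colloquial(value: str) -> str:
    building, room = value.split(" ")
    return f"{BUILDINGS[building]} {room}"


room_tope = Tope(
    formats={
        "abbreviation": Format(isa=isa_abbreviation),
        "colloquial": Format(isa=isa_colloquial),
    },
    trfs={("abbreviation", "colloquial"): abbreviation_to_colloquial},
)
```

Under these assumptions, “EDSH 225” scores 1.0, “EDXH 225” scores 0.5, and “Robotics Institute” scores 0.0 against the abbreviation format, which mirrors the valid / nearly valid / invalid distinction from the introduction.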
Proposed development environment
Functional decomposition diagram
[Diagram; component labels: Development Environment, Basic Topes Editor, Repository Software, Publishing Tools, Search Tools, Plug-Ins, Normalization.]
• EUPs implement topes in the basic topes editor (or JavaScript), then publish them in repositories.
• Other EUs and EUPs search for topes, download them, then use them through plug-ins.
Proposed development environment
Enhanced basic topes editor
[Functional decomposition diagram, repeated.]
Proposed work
Enhancing the basic topes editor
• Extend isa support
  • Improve error message generation
• Add trf support
  • EUPs will specify a series of steps:
    • Select a part, select an operator
    • Operators: permutation, lookup, arithmetic, capitalization
  • Add (regression) testing features to facilitate consistency
• Add eqc support
  • For each part, EUPs will specify a comparison operator returning a value in [0,1], and these values will be multiplied.
  • Operators: exactly identical, case-insensitive comparison, ~arithmetic distance, ~edit distance
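The proposed eqc design (one comparison operator per part, with the per-part scores multiplied) can be sketched in Python as follows. The way distances are turned into scores in [0,1] is an assumption, since the slide only names the operators; all function names and the `scale` parameter are hypothetical.

```python
from math import prod
from typing import Callable, Dict


def exact(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def case_insensitive(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0


def arithmetic_closeness(a: str, b: str, scale: float = 10.0) -> float:
    """Assumed scoring: 1.0 for equal numbers, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(float(a) - float(b)) / scale)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_closeness(a: str, b: str) -> float:
    """Assumed scoring: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest


def eqc(parts_a: Dict[str, str], parts_b: Dict[str, str],
        operators: Dict[str, Callable[[str, str], float]]) -> float:
    """Multiply the per-part comparison scores into one value in [0,1]."""
    return prod(op(parts_a[name], parts_b[name])
                for name, op in operators.items())


# Hypothetical person-name tope: last names compared case-insensitively,
# first names by edit distance.
NAME_OPERATORS = {"last": case_insensitive, "first": edit_closeness}
score = eqc({"last": "Scaffidi", "first": "Christopher"},
            {"last": "scaffidi", "first": "Chris"},
            NAME_OPERATORS)
```

Because the scores are multiplied, a single part that clearly differs (score 0.0) forces the overall equivalence to 0.0, while small differences in several parts compound.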
Proposed development environment
Repository software
[Functional decomposition diagram, repeated.]
Proposed work
Repository software
• Clients will have a list of “known” repository servers.
  • Generally pre-configured to include a global server at CMU
  • Organizations will configure clients to include the organizational server.
  • EUs and EUPs will be able to add new servers to their list.
• To support publishing/searching, the repository will house meta-information about topes.
• (EUPs can also simply email topes to EUs and other EUPs, bypassing the repository system.)
Proposed development environment
Publishing tools
[Functional decomposition diagram, repeated.]
Proposed work
Publishing topes
• Publishing a tope on a repository
  • Anonymously, or authenticated
  • EUPs can gather into groups and publish group-private topes.
  • Each tope can have a non-unique name & description.
  • Internally, each tope will have a globally unique id (guid).
    • For a published tope, guid = URL of the master copy
    • (For an emailed tope, guid is based on the sender’s email address)
• Tope aliases
  • EUPs can publish tope aliases.
  • An alias has no implementation; it just points to another tope.
  • An alias can have its own name and description.
Proposed development environment
Search tools
[Functional decomposition diagram, repeated.]
Proposed work
Searching for relevant topes
• Search by keyword:
  • Search tope name and description
  • And match based on words that are visually near to topes
• Search by groups of people:
  • Within an organization, or by author’s email domain
  • Within spaces that are “group-private”
• Search by groups of topes:
  • “If you liked this tope, you may also like XYZ”
  • Similar to Amazon.com’s product recommendations
• Search by example:
  • “Find me a tope that recognizes 412-555-1212”
  • For efficiency, filter based on “signature” (\d{3}-\d{3}-\d{4})
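Search by example with signature filtering could work roughly as in the following Python sketch. The signature rule for letters, the catalog layout, and the two catalog entries are assumptions for illustration only; the slide specifies just the digit-based signature for the phone-number example.

```python
import re
from itertools import groupby
from typing import List


def signature(example: str) -> str:
    """Coarse shape of a value, e.g. '412-555-1212' -> r'\\d{3}-\\d{3}-\\d{4}'."""
    def cls(ch: str) -> str:
        if ch.isdigit():
            return r"\d"
        if ch.isalpha():
            return "[A-Za-z]"   # assumed class for letters
        return ch               # punctuation kept verbatim
    out = []
    for key, run in groupby(example, key=cls):
        n = len(list(run))
        out.append(f"{key}{{{n}}}" if n > 1 else key)
    return "".join(out)


US_PHONE = re.compile(r"[2-9]\d{2}-\d{3}-\d{4}")
ROOM = re.compile(r"[A-Za-z]{4} \d{3}")

# Hypothetical repository catalog: each tope lists the signatures it accepts.
CATALOG = [
    {"name": "US phone number",
     "signatures": {r"\d{3}-\d{3}-\d{4}"},
     "isa": lambda v: 1.0 if US_PHONE.fullmatch(v) else 0.0},
    {"name": "CMU room number",
     "signatures": {r"[A-Za-z]{4} \d{3}"},
     "isa": lambda v: 1.0 if ROOM.fullmatch(v) else 0.0},
]


def search_by_example(example: str) -> List[str]:
    """Cheap signature filter first, then run isa only on the survivors."""
    sig = signature(example)
    hits = [(tope["isa"](example), tope["name"])
            for tope in CATALOG if sig in tope["signatures"]]
    return [name for score, name in sorted(hits, reverse=True) if score > 0]
```

The point of the filter is that computing one signature and doing set lookups is much cheaper than executing every published tope's isa function on the example.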
Proposed work
Searching for trustworthy topes
Proposed development environment
Enhanced plug-ins
[Functional decomposition diagram, repeated.]
Proposed work
Enhancing plug-ins
• Microsoft Excel
  • Outlier finding: infer format on selected cells, run isa
  • Assertions: run isa on selected cells
  • Transformation: run trf on selected cells
  • De-duplication: run eqc on selected cells, cluster the cells
• Microsoft Visual Studio.NET
  • Input validation: run isa on form widget, show error message
  • Input consistency: run trf on value if in wrong format
• Robofox
  • Assertions: run isa on selected variable
  • Transformation: run trf on selected variable
• In each, support basic editor topes & JavaScript topes
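The de-duplication step (run eqc on the selected cells, then cluster them) could be sketched in Python as below. The greedy single-pass clustering, the 0.7 threshold, and the `name_eqc` function with its 0.8 prefix score are all assumptions; the slide does not say which clustering method the plug-in would use.

```python
from typing import Callable, List


def name_eqc(a: str, b: str) -> float:
    """Hypothetical eqc for names in 'Lastname, Firstname' format."""
    last_a, _, first_a = a.partition(", ")
    last_b, _, first_b = b.partition(", ")
    if last_a.lower() != last_b.lower():
        return 0.0
    if first_a.lower() == first_b.lower():
        return 1.0
    short, long_ = sorted((first_a.lower(), first_b.lower()), key=len)
    # A first name that is a prefix of the other is a probable match.
    return 0.8 if short and long_.startswith(short) else 0.0


def cluster(cells: List[str], eqc: Callable[[str, str], float],
            threshold: float = 0.7) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose first member is
    equivalent enough, otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for cell in cells:
        for group in clusters:
            if eqc(group[0], cell) >= threshold:
                group.append(cell)
                break
        else:
            clusters.append([cell])
    return clusters


groups = cluster(["Scaffidi, Christopher", "Doe, Jane", "Scaffidi, Chris"],
                 name_eqc)
```

Any cluster with more than one member would then be shown to the user as a set of probable duplicates to review.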