My Redneck Brother's Tire Size, and Other Unrelated Topes Christopher Scaffidi Carnegie Mellon University
Even when lives are at stake, people still make typos.
Hurricane Katrina “Person Locator” Web site
intro ● current practice ● real data ● topes ● closing vignette
Data errors reduce the usefulness of data.
Age is not useful for flying my helicopter to come rescue you. Age belongs in the description/additional information field so I can recognize you. And a “city name” with 1 letter is no use at all. Even little typos impede data de-duplication.
The website creators omitted input validation.
• Reasons: They thought…
• it would be too hard to write the validation.
• catching obviously-wrong inputs would prevent collecting maybe-correct data.
This is the UI code for the web form where people could type in the data. A RAD tool called CodeCharge Studio was used to create the UI.
Outline and main points
• Current practice
• Currently, writing validation code is hard…
• Real data
• because how do you express, “this is questionable data?”
• Topes
• Topes can express that, and they’re easy to create, too.
• Closing vignette: my brother’s truck tires
• Email me some vignettes of your own.
Programmers have lots of tricks to simplify writing validation code.
Split inputs into multiple easy-to-validate fields. Who cares if the user has to type tabs now, or if he can’t just copy-paste into one field?
Make users pick from drop-downs. Who cares if it’s faster for users to type “NJ” or “1/2007”? (Disclaimer: drop-downs are sometimes good!)
I implemented this code for NJTransit.com.
Even with these tricks, writing validation is still very time-consuming.
Overall, the site had over 1100 lines of JavaScript just for validation… plus equivalent server-side Java code (can’t trust some users). Sample code below.
if (!rfcCheckEmail(frm.primaryemail.value))
    return messageHelper(frm.primaryemail, "Please enter a valid Primary Email address.");
var atloc = frm.primaryemail.value.indexOf('@');
if (atloc > 31 || atloc < frm.primaryemail.value.length-33)
    return messageHelper(frm.primaryemail, "Sorry. You may only enter 32 characters or less for your email name\r\n" +
        "and 32 characters or less for your email domain (including @).");
That was worst case. Best case: reusable regexps.
• Many IDEs allow the programmer to enter one regular expression for validating each input field.
• Usually, this drastically reduces the amount of code, since most validation ain’t fancy.
• Unfortunately…
Regexps are a good bullet but not a silver bullet, so lots of data goes unvalidated.
• The world is full of programmers who can’t read regexps.
• Do a search on Google some time for “regular expressions” and read what people say in the forums.
• The USA alone has over 55 million non-expert creators of web sites, databases, and spreadsheets (which have most of the same data problems that web sites do).
• Regexps only work for data where you can say, “Yes, this is definitely ok” or “No, this is definitely wrong”.
• What would a regexp for a valid company name look like?
So we did a preliminary review of real data needing validation.
• Sources:
• Comments from Information Week readers to a survey
• Observations of people as they created and used websites
• Many Hurricane Katrina sites
• Cursory browsing of the EUSES spreadsheet corpus
• Browsing around the web
• My own experience as a professional webapp developer
• We found 3 primary problems with regexps…
1. Real data doesn’t always conform well to the simple “binary” regexp model.
• Data is sometimes questionable… yet valid.
• Remember the suspiciously long NJTransit email address?
• In practice, person names and other proper nouns are never validated with regexps… too brittle.
• Life is full of corner cases and exceptions.
• If your code can identify questionable data, then it can double-check the data:
• Ask an application end user to confirm the input
• Flag the input for checking by a system administrator
• Compare the value to a list of known allowable exceptions
• Call up a server and see if it can confirm the value
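One way to picture "questionable… yet valid" in code: a validator that returns a score between 0 and 1 instead of a boolean, plus a router that decides whether to accept, double-check, or reject. This is only a sketch. The names `isaEmail` and `triage`, the 0.5 score, and the 32-character threshold (borrowed from the sample error message) are assumptions for illustration, not code from the NJTransit site.

```javascript
// Hypothetical graded validator: returns a score in [0, 1], not true/false.
function isaEmail(value) {
  const at = value.indexOf("@");
  // Definitely wrong: spaces, no "@", nothing before it, or nothing after it.
  if (value.includes(" ") || at <= 0 || at === value.length - 1) return 0;
  const name = value.slice(0, at);
  const domain = value.slice(at + 1);
  if (!domain.includes(".")) return 0;
  // Questionable but possibly valid: unusually long name or domain.
  // The 32-character threshold is an assumption, not a standard.
  if (name.length > 32 || domain.length > 32) return 0.5;
  return 1;
}

// Route each input by its score: accept, double-check, or reject.
function triage(value, knownExceptions = new Set()) {
  const score = isaEmail(value);
  if (score === 0) return "reject";
  if (score === 1 || knownExceptions.has(value)) return "accept";
  return "confirm"; // e.g. ask the user, or flag for an administrator
}
```

The "confirm" branch is where any of the four double-checking strategies listed above could plug in.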
2. Real data often can occur in multiple different formats.
• Two different strings can be equivalent.
• How many ways can you write a date?
• What if an end user types a date in the wrong format?
• “Jan-1-2007” and “1/1/2007” mean the same thing because of the category that they are in: date.
• Sometimes the interpretation is ambiguous. In real life, we use preferences and experience to guide interpretation.
• If your code can transform among formats (i.e., not just recognize formats with regexps), then it can put data in an unambiguous format of your choice.
• Display the result so users can fix interpretations if needed
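A minimal sketch of "transform, don't just recognize": each format gets a small parser, and everything is rewritten into one unambiguous target format (here YYYY-MM-DD, which is my choice for the example). The names `DATE_FORMATS`, `toIso`, and `sameDate` are hypothetical, and reading “1/2/2007” month-first is a US-style preference baked in as an assumption.

```javascript
const MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"];

// Each format parser returns { year, month, day } or null.
const DATE_FORMATS = [
  { name: "Mon-D-YYYY",
    parse(s) {
      const m = /^([A-Za-z]{3})-(\d{1,2})-(\d{4})$/.exec(s);
      if (!m) return null;
      const month = MONTHS.indexOf(m[1].toLowerCase()) + 1; // 0 if unknown
      return month ? { year: +m[3], month, day: +m[2] } : null;
    } },
  { name: "M/D/YYYY", // assumption: month comes first
    parse(s) {
      const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(s);
      return m ? { year: +m[3], month: +m[1], day: +m[2] } : null;
    } },
];

// Transform any recognized format into one unambiguous format.
function toIso(s) {
  for (const format of DATE_FORMATS) {
    const d = format.parse(s.trim());
    if (d) {
      const mm = String(d.month).padStart(2, "0");
      const dd = String(d.day).padStart(2, "0");
      return `${d.year}-${mm}-${dd}`;
    }
  }
  return null; // not recognized as a date in any known format
}

// Two strings are equivalent if they transform to the same value.
function sameDate(a, b) {
  const x = toIso(a);
  return x !== null && x === toIso(b);
}
```

Showing the `toIso` result back to the user is what lets them correct a wrong interpretation.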
3. The meaning of data is often tied to its “parts”, not directly to its characters.
• Real data often has parts, each with its own meaning.
• What are the parts of a date, 1/1/2007?
• Valid data conforms to intra- and inter-part constraints.
• Writing regexps requires you to translate constraints into a character sequence… tough in many cases, practically or truly impossible in others.
• No wonder most people can’t read or write regexps.
• If your code could succinctly state the parts, as well as mandatory and optional constraints on the parts, wouldn’t the code be easier to write and maintain?
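To make "state the parts and the constraints" concrete, here is a sketch where the date 1/1/2007 is described as named parts plus a list of mandatory ("hard") and optional ("soft") constraints. The spec layout, the `check` function, the 0.5 score for a failed soft constraint, and the 1900 to 2100 year range are all assumptions for illustration.

```javascript
function daysInMonth(year, month) {
  const leap = (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
  return [31, leap ? 29 : 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1];
}

// Hypothetical declarative description of one date format.
const dateSpec = {
  pattern: /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/,
  parts: ["month", "day", "year"],
  constraints: [
    // Intra-part, mandatory.
    { kind: "hard", test: (p) => p.month >= 1 && p.month <= 12 },
    // Inter-part, mandatory: the day depends on the month and the year.
    { kind: "hard", test: (p) => p.day >= 1 && p.day <= daysInMonth(p.year, p.month) },
    // Optional: a far-off year is questionable, not wrong (range is assumed).
    { kind: "soft", test: (p) => p.year >= 1900 && p.year <= 2100 },
  ],
};

// Returns 0 (invalid), 0.5 (questionable), or 1 (fine).
function check(spec, input) {
  const m = spec.pattern.exec(input);
  if (!m) return 0;
  const parts = {};
  spec.parts.forEach((name, i) => { parts[name] = Number(m[i + 1]); });
  let score = 1;
  for (const c of spec.constraints) {
    if (c.test(parts)) continue;
    if (c.kind === "hard") return 0;
    score = Math.min(score, 0.5);
  }
  return score;
}
```

The "February has at most 29 days, and only in leap years" rule is one line here; squeezing it into a single regexp is the kind of translation the slide calls tough.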
Imagine a world…
• Where your code could say to an oracle, “Is this input a company name?”, and the oracle would say yes, no, almost definitely, probably not, and other shades of gray.
• Where your code could accept an input in any reasonable format, since your code could ask the oracle to put the input into whatever format you actually want.
• Where you could teach the oracle about a new category of data by concisely stating the parts and constraints of that data.
Tope = an abstraction for a data category
• Greek word for “place,” because each corresponds to a data category with a natural place in the problem domain
• Topes in practice:
• People implement new topes by using the basic tope editor (or another language such as JavaScript).
• People publish tope implementations on repositories.
• People download tope implementations to local caches.
• Tool plug-ins let people browse their local cache and associate topes with variables and input fields.
• Plug-ins use tope implementations from local cache to recognize, transform, and equivalence-test data.
A tope is a graph. Node = format, edge = transformation
• A notional representation for a CMU room number tope.
• Note that edges (transformations) can be chained.
Formal building name & room number: Elliot Dunlap Smith Hall 225
Building abbreviation & room number: EDSH 225
Colloquial building name & room number: Smith 225
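A rough sketch of the graph idea, using the room number example. Formats are node names, each edge holds a transformation function, and a breadth-first search chains edges when no direct edge exists. Which edges exist is my assumption (the slide only shows the three formats), and the table covers just the one building.

```javascript
// edges[from][to] = transformation function. Illustrative, one building only.
const edges = {
  formal: {
    abbrev: (s) => s.replace("Elliot Dunlap Smith Hall", "EDSH"),
  },
  abbrev: {
    formal: (s) => s.replace("EDSH", "Elliot Dunlap Smith Hall"),
    colloquial: (s) => s.replace("EDSH", "Smith"),
  },
  colloquial: {
    abbrev: (s) => s.replace("Smith", "EDSH"),
  },
};

// Breadth-first search for the shortest chain of transformations.
function transform(value, from, to) {
  const queue = [{ format: from, value }];
  const seen = new Set([from]);
  while (queue.length > 0) {
    const current = queue.shift();
    if (current.format === to) return current.value;
    for (const [next, fn] of Object.entries(edges[current.format] || {})) {
      if (seen.has(next)) continue;
      seen.add(next);
      queue.push({ format: next, value: fn(current.value) });
    }
  }
  return null; // no chain of edges connects the two formats
}
```

There is no direct formal-to-colloquial edge here, so `transform("Elliot Dunlap Smith Hall 225", "formal", "colloquial")` goes through the abbreviation format on the way.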
A tope is a conceptual abstraction. A tope implementation is code you can run.
• Each tope implementation contains executable functions:
• 1 isa: string → [0,1] function per format, for recognizing instances of the format
• 0 or more trf: string → string functions linking formats, for transforming values from one format to another
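As a sketch of what such an implementation might look like when hand-written in JavaScript: one `isa` function per format returning a number in [0, 1], and a `trf` function between formats. The object layout, the `recognize` helper, the scores 0.5 and 0.2, and the one-entry building table are assumptions of mine, not the format the tope tools actually generate.

```javascript
// Illustrative lookup table: abbreviation -> formal building name.
const KNOWN = new Map([["EDSH", "Elliot Dunlap Smith Hall"]]);

const roomTope = {
  // isa: string -> [0, 1], one function per format.
  isa: {
    abbrev(s) {
      const m = /^([A-Z]{2,5}) (\d{1,4}[A-Z]?)$/.exec(s);
      if (!m) return 0;
      // Right shape but unknown building: questionable, not wrong.
      return KNOWN.has(m[1]) ? 1 : 0.5;
    },
    formal(s) {
      const m = /^(.+) (\d{1,4}[A-Z]?)$/.exec(s);
      if (!m) return 0;
      return [...KNOWN.values()].includes(m[1]) ? 1 : 0.2;
    },
  },
  // trf: string -> string, keyed by "from>to".
  trf: {
    "abbrev>formal": (s) => {
      const [building, room] = s.split(" ");
      return `${KNOWN.get(building) || building} ${room}`;
    },
  },
};

// Pick the format whose isa function gives the highest score.
function recognize(tope, s) {
  let best = { format: null, score: 0 };
  for (const [format, isa] of Object.entries(tope.isa)) {
    const score = isa(s);
    if (score > best.score) best = { format, score };
  }
  return best;
}
```

A plug-in could call `recognize` on an input, then apply the matching `trf` to put the value into whatever format the application stores.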
Common kinds of topes: enumerations and proper nouns
• Multi-format, non-binary enums, e.g.: US states
• Fixed list of definitely valid names (e.g.: “Maryland”)
• Transformed to other formats via lookup tables (“MD”)
• Augmented with a list of unusual values that technically might be ok in some circumstances (“PR”)
• Open-set proper nouns, e.g.: Company names
• You certainly can’t list all of these
• Collect a whitelist of definitely valid names (“Google”), with alternate formats (e.g. “Google Corporation”, “GOOG”)
• Augment with a pattern for recognizing promising inputs that are not yet on the whitelist
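Both kinds are mostly lookup tables. A sketch, with a deliberately tiny state list, “PR” read as Puerto Rico, and a loose company-name pattern that I made up for the example; the function names and the 0.5 score are likewise assumptions.

```javascript
// Closed-set enum: name -> abbreviation (sample only, not all states).
const STATES = new Map([
  ["Maryland", "MD"], ["New Jersey", "NJ"], ["Pennsylvania", "PA"],
]);
// Unusual values that might be acceptable in some circumstances.
const UNUSUAL = new Map([["Puerto Rico", "PR"]]);

function isaStateName(s) {
  if (STATES.has(s)) return 1;
  return UNUSUAL.has(s) ? 0.5 : 0;
}

function isaStateAbbrev(s) {
  if ([...STATES.values()].includes(s)) return 1;
  return [...UNUSUAL.values()].includes(s) ? 0.5 : 0;
}

// Transformation between formats is a table lookup.
function stateNameToAbbrev(s) {
  return STATES.get(s) || UNUSUAL.get(s) || null;
}

// Open-set proper noun: whitelist plus a pattern for promising inputs.
const COMPANIES = new Set(["Google", "Google Corporation", "GOOG"]);

function isaCompanyName(s) {
  if (COMPANIES.has(s)) return 1;
  // Hypothetical pattern: starts with a capital or digit, reasonable length.
  return /^[A-Z0-9][A-Za-z0-9&.,' -]{1,60}$/.test(s) ? 0.5 : 0;
}
```

Inputs that score 0.5 on the company pattern are candidates for review and, once confirmed, for adding to the whitelist.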
Two more common kinds of topes: numeric and hierarchical
• Numeric, e.g.: area codes
• Check that inputs are numeric and in a certain range
• Values outside the range might be valid but questionable
• Very rarely, numeric data are explicitly flagged with a unit
• Hierarchical, e.g.: address lines
• Parts are described with other topes (e.g.: “100 Main St.” uses a numeric, a proper noun, and an enum)
• Each part has its own internal constraints; the hierarchical tope may add inter-part constraints.
• Simple isa functions can be implemented with regexps.
• Transformations involve permutation of parts, changes to separators, simple arithmetic, and lookup tables.
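A sketch of how a hierarchical tope can be assembled from simpler ones: a numeric check with a soft range, a small enum, a proper-noun pattern, and an address line that takes the weakest score among its parts. The ranges, the scores, the suffix list, and the choice of `Math.min` as the combining rule are all assumptions for illustration.

```javascript
// Numeric tope factory: non-numeric scores 0, in-range scores 1,
// out-of-range scores `questionable` (valid-looking but suspicious).
function numericIsa(min, max, questionable) {
  return (s) => {
    if (!/^\d+$/.test(s)) return 0;
    const n = Number(s);
    return n >= min && n <= max ? 1 : questionable;
  };
}

const isaAreaCode = numericIsa(200, 999, 0.3);      // assumed range
const isaHouseNumber = numericIsa(1, 99999, 0.5);   // assumed range

// Enum part (sample list only).
const SUFFIXES = new Set(["St.", "Ave.", "Rd.", "Blvd."]);
const isaSuffix = (s) => (SUFFIXES.has(s) ? 1 : 0);

// Proper-noun part: capitalized words.
const isaStreetName = (s) =>
  /^[A-Z][A-Za-z]*( [A-Z][A-Za-z]*)*$/.test(s) ? 1 : 0;

// Hierarchical tope: split into parts, score each part with its own tope,
// and report the weakest part's score.
function isaAddressLine(s) {
  const m = /^(\S+) (.+) (\S+)$/.exec(s);
  if (!m) return 0;
  return Math.min(isaHouseNumber(m[1]), isaStreetName(m[2]), isaSuffix(m[3]));
}
```

An inter-part constraint (say, a house number that is implausible for a particular street) would be one more term inside `isaAddressLine`.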
We have a tool to help people succinctly express common kinds of topes.
• Features:
• Format inference
• Format/part names
• Soft constraints
• “isa” generation
• Testing features
• Format reusability
• (Similar UI style for implementing trfs)
And we have plug-ins for using topes in web forms, databases, and spreadsheets.
Visual Studio: drag-and-drop code generation
Microsoft Excel: buttons and menus
We have conducted a variety of evaluations.
• Expressiveness:
• We have implemented formats for dozens of kinds of data from (1) the EUSES spreadsheet corpus, (2) Hurricane Katrina and Google Base website data, and (3) logs of admin assistants’ web browsing
• … and topes were very effective at identifying data errors.
• Usability:
• Controlled experiment shows that our format editor enables admin assistants and master’s students to validate data more quickly and accurately than with Lapis patterns or with regexps.
For more details…
• Ask me for the papers on…
• Surveys and other studies of programmers and users
• The topes model
• Our user interfaces
• Our evaluations
• Ask me for the tools:
• Some modules are already open-sourced
• Modules have a clean API (if you just want a binary)
• The evaluations pointed out some places for improvement
• (when this is done, the rest will be open-sourced, too)
A closing vignette
• This vignette illustrates many of the characteristics of data that I mentioned today.
• I would value similar (true story) vignettes from you
• To help highlight what real data looks like
• To help communicate the concept of a tope
• To provide me with test cases for topes
My brother (Ben) and I hit a rake while driving his truck around the backyard. We got a flat.
Fortunately, Ben knows A LOT about trucks… and their tires.
Observe the pencil marks on the tire, where my brother drew while explaining what the parts of the tire size meant. (I tweaked the contrast on this image to make the lettering stand out.)
So Ben went online to order a tire.
My brother is very web savvy. He is an electrician by day, but he assembles computers and sets up web sites as a side job. Observe the red neck and boots.
But even though Ben knows tires really well, he couldn’t implement tire size validation.
• Each part has meaning (cross section, sidewall aspect ratio, internal construction, etc).
• Though parts must be selected from a simple enumeration, there are inter-part constraints.
• “Questionable” sizes? Dunno. Maybe those that are reasonable but hard-to-find?
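For a feel of what a tire size tope might start from, here is a sketch that splits the common metric sidewall notation (such as P215/65R15) into named parts and treats well-formed sizes missing from a small catalog as questionable. The catalog contents, the function names, and the 0.5 score are invented for illustration; real inter-part constraints would need someone like Ben to supply them.

```javascript
// Parts: optional type (P or LT), width in mm, aspect ratio,
// construction letter, rim diameter in inches.
const TIRE_PATTERN = /^(P|LT)?(\d{3})\/(\d{2})([RDB])(\d{2})$/;

// Hypothetical catalog of easy-to-find sizes.
const COMMON_SIZES = new Set(["P215/65R15", "LT245/75R16"]);

function parseTireSize(s) {
  const m = TIRE_PATTERN.exec(s.trim().toUpperCase());
  if (!m) return null;
  return {
    type: m[1] || "",
    widthMm: Number(m[2]),
    aspectRatio: Number(m[3]),
    construction: m[4],
    rimInches: Number(m[5]),
  };
}

// 0 = not a tire size, 0.5 = well-formed but hard to find, 1 = common size.
function isaTireSize(s) {
  if (!parseTireSize(s)) return 0;
  return COMMON_SIZES.has(s.trim().toUpperCase()) ? 1 : 0.5;
}
```

Once the parts have names, an inter-part rule is just a function of `widthMm`, `aspectRatio`, and `rimInches` rather than something to encode in the pattern.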
Summary
• Real data is full of input errors.
• Real validation is currently hard to write.
• Topes enable accurate, convenient validation by capturing soft constraints and the multiple formats of real data.
• Please email me some vignettes of your own.
Thank you…
• … to lots of people (including Mary Shaw, Brad Myers, Jim Herbsleb, and Jonathan Aldrich) for encouraging feedback about topes and lots of suggestions.
• … to anybody who emails me a vignette.
• … to NSF and EUSES for funding (ITR-0325273 and CCF-0438929).