Empirical Evaluation in End-User Software Engineering

Janice Singer, National Research Council Canada

Outline
• Summary of papers with respect to empirical evaluation, looking at:
  • Research Question
  • Domain
  • Method
  • Subjects/Objects of study
  • Results
• Themes/cross-cutting issues
• Questions for Discussion
Singer WEUSE IV

Spreadsheet debugging behaviour of expert and novice end-users - Bishop, McDaid
• Research Question
  • Four basic research questions concerning the performance of expert vs. novice users in detecting and correcting errors, debugging behaviour, and cell inspection coverage
• Domain
  • Spreadsheet
• Method
  • Experiment, qualitative inquiry
• Subjects/Objects of study
  • 13 professionals and 34 accounting and finance students (experts and novices)

Results
• Experts perform better than novices at detecting errors that require ‘deep’ understanding
• Cell coverage correlates with performance - experts look at more cells than novices
• There is a specific pattern of cell inspection depending on the characteristics and place of the cell in the spreadsheet
• A tool whose aim was to increase cell inspection coverage showed a trend, but did not significantly improve performance

Gender in EUSE - Burnett et al.
• Research Question
  • Do male and female end-users employ different strategies in debugging?
• Domain
  • Spreadsheets
• Method
  • Experiment, qualitative study
• Subjects/Objects of study
  • Males, females, professionals, students

Results
• There are significant gender differences in strategies for approaching testing and debugging
• Some of the strategies preferred by females are not well supported in end-user environments
• Modeling of problem solving behaviour may improve females’ confidence, and therefore their performance on tasks
• Gender matters

End users as unwitting software developers - Costabile et al.
• Research Question
  • Are our theories/descriptions accurate?
• Domain
  • Group of companies who cooperate in candy distribution
• Method
  • Not clear…
• Subjects/Objects of study
  • Users of a web-based portal

Results
• In industrial practice, we see a variety of end-users with their own attendant needs.

An EU oriented graph based visualization for spreadsheets - Kankuzi, Ayalew
• Research Question
  • Is our tool technically correct?
• Domain
  • Spreadsheet
• Method
  • Generate and test?
• Subjects/Objects of study
  • Generated visualizations

Results
• For the most part, the algorithm that drives the visualization is correct
• Still need to evaluate whether the visualization is actually usable by and useful to end-users

Using two heads in practice - Karlsson
• Research Question
  • Are dyads more effective than singletons (or nominal dyads) in debugging spreadsheets?
  • Is there a process loss for two people working together?
• Domain
  • Spreadsheets
• Method
  • Field experiment - experiment conducted in the field
• Subjects/Objects of study
  • Professionals

Results
• Dyads make fewer spreadsheet development errors than monads
• There is no significant difference in performance between nominal dyads and real dyads, so the study cannot determine whether there is a process loss

TDD: can it work for spreadsheets? - McDaid, Rust, Bishop
• Research Question
  • Is test-driven development an appropriate process for spreadsheet development?
• Domain
  • Spreadsheet
• Method
  • Developed a tool to support TDD, then ran case studies (real professionals using the tool to develop spreadsheets)
• Subjects/Objects of study
  • Professionals

Results
• TDSD (test-driven spreadsheet development) was easy to understand and use
• Development time increased
• Overall, participants seemed to believe TDSD was effective in reducing errors
• Some improvements to the tool were suggested

Software support for building EUP environments in the automation domain - Prähofer et al.
• Research Question
  • Is our solution technically correct/feasible?
• Domain
  • Automation
• Method
  • Case study, reimplementation of existing systems
• Subjects/Objects of study
  • Developers of system/Two existing systems

Results
• The existing systems could be reimplemented in the framework, with a large reduction in code size

Patterns in mash-ups - Wong, Hong
• Research Question
  • Are there typical application domains for mash-ups?
• Domain
  • Web programming/mash-ups
• Method
  • Survey (in the sense of categorization) and qualitative analysis of mash-ups
• Subjects/Objects of study
  • Popular GreaseMonkey scripts and 22 recommended mash-ups

Results
• Mash-ups can be categorized according to their functionality. These patterns include personalization, search, and aggregation, amongst others

Summary of empirical studies
• Wide variety of research questions
  • Not so much use of theory (is it necessary?)
• Not so wide variety of domain
  • Mostly spreadsheet
• Wide variety of methods
  • Experiments, surveys, case studies, tool correctness
• Subjects/Objects of study
  • Varied and related to research question

Themes and Questions

THEME: Domain
What other domains should we be looking at in terms of empirical evaluation or tool support?
Related to this, shouldn’t we be doing more qualitative and observational work in real settings?

THEME: End-User Characteristics
What other end-user characteristics do we need to be aware of when studying and designing tools for end-users?
Are there general cognitive limitations in terms of abilities, or is it mostly poor tool support that limits the ability of end-users?

THEME: Technical Correctness
How do we help end-users test, debug, and determine the technical correctness of their solutions?
What methods can we use to test the technical correctness/quality of our solutions - can experimentation with humans alone do this?

THEME: Software Engineering
What can studies in EUSE tell us about SE, and vice versa?
Given the changing world of SE (e.g., SOA, component and model based development, interoperability and integration issues), is there any longer a difference between EUSE and SE?

THEME: Building the Research Area
What are the critical/BIG research questions?
Is there enough information to start building meta-theories or to do meta-analyses?

Discussion
• We are already using theories, but many of them are implicit; what we need to do is be explicit about them.
• The concept of theory has many meanings, but how can we generalize from individual cases to more general findings? So many of our findings are context dependent that it is very difficult to generalize to other contexts. In that case, how do we transfer knowledge? But we can’t work simply with theories.
• Do we really need qualitative studies first? Yes, they can provide much information. Who are end-users? Can we characterize them? There is a lot of diversity, and those points of diversity matter in terms of how to help people. In EUSE, there is a lot of qualitative work, and it is highly regarded.
• The question bears on the difference between significance and meaning. We ask about two classes, e.g., male vs. female, but really there is a huge overlap between the two populations. Changing things for one population will often help the other. Is that true for the experiments reported here? Can tools for EUSE benefit software engineers?
• There are huge individual differences: no typical male and no typical female. So when we find differences, what we are really finding is a barrier that affects performance.
• Statistics lie: we must accept that the numbers only make sense in particular situations. We must be explicit about what the numbers mean and how we apply the generalizations to discover meaning.
• External validity: even if you create a tool that is usable in a lab experiment and performs well on a set of tasks, it doesn’t mean that the general effect is there. Is it possible to run field studies by releasing the tool and seeing if it works in context? That is difficult, because if you fail, you can’t always be sure exactly why it went wrong. Other people have experience with trying to measure success in the field and could share it.
• We may need to look at a certain kind of EU application. Spreadsheets are mainly defined tasks, whereas tailoring is a different issue. These areas may not have comparable measures across instances of the case; here you need to use other methods and types of measures. By emphasizing one side of the method question, you are only asking certain kinds of questions.
• One of the major difficulties in introducing a tool in the field is that users get annoyed when it doesn’t work. You can’t get good results with a semi-finished prototype; you need an almost product-quality prototype.
• It is important to use real spreadsheets with real errors. Powell, et al. concluded that you shouldn’t look at cell errors, but rather at cell error instances.
• EUSPRIG.com is more practitioner focused than this group.
• Connection between SE and EUSE: there is a strong connection on two levels. Systems that are to be developed by non-professionals need to be modularized more, and it is very difficult to decide what the right decomposition is. In the future, SE will need to watch what users do and how requirements evolve; techniques are needed there. SE can perhaps benefit from the methodological practices of EUSE, where there is a huge degree of expertise in looking at practice and an understanding of how practices in software evolve.
• Speculate about how the conclusions would be different if gender weren’t a factor. We could not have found the same things if we hadn’t thought about gender. It was a basic emphasis in all the work because of the theoretical perspective.
• Professionals and EUSE: a good case in point is curb cuts. Now we all benefit from curb cuts even though they were designed for disabled people. Sometimes when you look at a case that you haven’t looked at before, you can come back to the majority and improve their tools. Whyline is another case of this.
• We used the differentiation between EU and professional for many reasons, and also postulated that there is a continuum. Are the empirical studies too broad? Perhaps we should try to figure out which part of this continuum our results apply to. About 10K hours are necessary to distinguish someone as an expert. It may be that experts are really rare: a small percentage of people who are very good, and a large continuum of skills.
• To point out some differences: EUSE is need-driven, which means that I am a user using technology, I encounter a problem or see an innovation problem, and I try to solve this problem using technology. It is not necessary for the end-user to implement it him/herself. A software engineer is bound to the artifact he is about to create. The considerations that lead to the choice are different in the two cases. What type of technological structure is necessary for problems in practice?
• 84% of people surveyed by Umarji are self-taught. Perhaps educating people or creating a set of guidelines is one of the low-hanging fruits.
