Lecture 5: Experimental Design • Lecturer – Prof Jim Warren (with references to Dix et al. chapter 9)
Evaluating Implementations Requires an artefact: a simulation, a prototype, or a full implementation
Experimental evaluation • controlled evaluation of specific aspects of interactive behaviour • evaluator chooses hypothesis to be tested • a number of experimental conditions are considered which differ only in the value of some controlled variable. • changes in behavioural measure are attributed to different conditions
Experimental factors • Subjects (i.e., the users, aka ‘participants’) • who – representative, sufficient sample • not the programmer’s friend, boss, etc. • huge variability in performance of individuals • Variables • things to modify and measure • Hypothesis • what you’d like to show • Experimental design • how you are going to show it • Includes ‘Protocol’ – what the subjects do
Variables • independent variable (IV) • characteristic changed to produce different conditions • e.g. interface style, number of menu items • dependent variable (DV) • characteristics measured in the experiment • e.g. time taken, number of errors.
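For concreteness, here is a minimal sketch (in Python) of recording trial data with one IV and two DVs; the condition names, field names and values are purely illustrative assumptions, not from the lecture.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    participant: str       # who performed the trial
    interface_style: str   # IV: the condition set by the experimenter
    task_time_sec: float   # DV: a measured outcome
    errors: int            # DV: another measured outcome

# Hypothetical trial records: "menu" vs. "toolbar" are made-up conditions.
trials = [
    Trial("P01", "menu", 41.2, 1),
    Trial("P01", "toolbar", 33.7, 0),
    Trial("P02", "menu", 52.9, 2),
]
```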
Hypothesis • prediction of outcome • framed in terms of IV and DV • e.g. “error rate will increase as font size decreases” • null hypothesis: • states no difference between conditions • aim is to disprove this • e.g. null hyp. = “no change with font size”
Experimental design • “within groups” design (also called “repeated measures”) • each subject performs experiment under each condition • transfer of learning possible (practice makes performance better; or alternatively fatigue or boredom makes it worse) • less costly and less likely to suffer from user variation (each user is compared to themselves) • between groups design • each subject performs under only one condition • no transfer of learning • more users required
Within v. Between • Consider a test of the effect of beer v. vodka martinis on reaction time • Null hypothesis – no difference in the increase in reaction time between the two beverages • Design 1: • 30 people try beer; 30 other people try vodka – D.V. is the change in reaction time pre- v. post-drinking • Not bad – be sure to randomize who goes into the beer group v. the vodka group • But the ‘power’ of the experiment will be reduced due to the great variability between individuals in their reaction to alcohol
Within v. Between (contd.) • Design 2: • All 60 people first try beer, then immediately try vodka • Problem of carryover effect • Better Design: • All 60 try beer, then a week later try vodka • Now each individual is compared with themselves • Still possible problem of ordering effect (e.g., they might get a little better at the reaction time test) • Best Design: • 30 try beer, then a week later vodka; 30 try vodka and then a week later beer
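A small sketch of how the "best design" above could be set up: half the participants are randomly assigned to the beer-first order and half to the vodka-first order. Participant IDs and condition labels are illustrative assumptions.

```python
import random

participants = [f"P{i:02d}" for i in range(1, 61)]  # 60 hypothetical subjects
random.shuffle(participants)                         # randomise the assignment

half = len(participants) // 2
orders = {}
for p in participants[:half]:
    orders[p] = ["beer", "vodka"]   # session 1, then session 2 a week later
for p in participants[half:]:
    orders[p] = ["vodka", "beer"]   # reversed order for the other half

print(orders[participants[0]])
```

Because each order appears equally often, any practice or fatigue effect is balanced across the two beverages rather than favouring one of them.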
Analysis of data • Before you start to do any statistics: • look at the data (e.g. average = 5.25 – but 4.9 without the “outlier”) • Choice of statistical technique depends on • type of data • information required • Type of data • discrete • finite number of values • may be ordered or unordered (e.g., colors) • continuous • any value
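To illustrate the "look at the data first" point, a tiny sketch: the raw numbers below are invented, but chosen so the means reproduce the slide's example (5.25 overall, 4.9 once the outlier is dropped).

```python
times = [4.8, 5.1, 4.7, 5.0, 4.9, 7.0]   # last value looks like an outlier

mean_all = sum(times) / len(times)
mean_trimmed = sum(sorted(times)[:-1]) / (len(times) - 1)  # drop the largest

print(f"mean with outlier:    {mean_all:.2f}")      # 5.25
print(f"mean without outlier: {mean_trimmed:.2f}")  # 4.90
```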
ANOVA – analysis of variance • Quite easy to test whether there’s a significant difference between groups in Excel • Need to invoke Tools/Add-ins/Analysis ToolPak to enable it • Then just apply Tools/Data Analysis/ANOVA: Single Factor to the data
ANOVA from Excel Say we have three columns of numbers representing the time to complete a task for 5, 5 and 7 users using three variations of an interface. If the P-value < 0.05 then we usually say the result is ‘significant’ (the observed difference is more than expected chance variation)
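The same single-factor ANOVA can be run outside Excel; here is a sketch using SciPy's f_oneway. The task-time numbers for the three interface variants (5, 5 and 7 users) are invented for illustration.

```python
from scipy.stats import f_oneway

variant_a = [12.1, 10.8, 13.5, 11.9, 12.6]              # 5 users
variant_b = [14.2, 15.0, 13.8, 14.9, 15.3]              # 5 users
variant_c = [11.0, 10.5, 12.2, 11.7, 10.9, 11.4, 12.0]  # 7 users

f_stat, p_value = f_oneway(variant_a, variant_b, variant_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Result is 'significant': more than expected chance variation")
```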
When is a difference a difference? • In the world of parametric stats, we look for a statistic to be large enough to be ‘significant’ • On the Gaussian (‘normal’) curve, |Z| = 1.96 bounds the central 95% of the area (2.5% in each tail), so it is a common ‘critical value’ for claiming significance
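Where the 1.96 comes from can be checked directly; a small sketch using SciPy's standard normal distribution:

```python
from scipy.stats import norm

# Two-tailed test at the 5% level: 2.5% of the area in each tail,
# so the critical value is the 97.5th percentile of the standard normal.
z_crit = norm.ppf(0.975)
print(f"critical |Z| = {z_crit:.2f}")                             # ~1.96

# Area of the curve within +/-1.96 standard deviations:
print(f"central area = {norm.cdf(1.96) - norm.cdf(-1.96):.3f}")   # ~0.950
```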
Parametric assumptions • Parametric statistics assume that some mathematically elegant assumptions hold true for the data • E.g., ANOVA (and standard ‘regression’) assume, among other things, normally distributed random error • Trivia: the mathematical form of the probability density function for the normal distribution is remarkably formidable • Centres on the mean, μ, and is flattened by the standard deviation, σ • [Figures: a Galton machine simulating the normal distribution (aka ‘bell curve’); the exponential distribution, which models the time between events happening at a constant average rate]
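For reference, the density that the "trivia" bullet alludes to has the standard form (mean μ, standard deviation σ):

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```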
So, what to measure? • Usually one (or several) of these things: • Speed / efficiency • How many (or whatever) per unit time can the user process with this interface? • Accuracy (errors) • Learnability (time to acquire the ability to do something in particular with the interface) • And retention (how well they can do it some particular time later) • Satisfaction • Subjective assessment – how does the user feel about the interface
Subjective assessment • Very important • If the user seriously doesn’t like it, probably there’s something really wrong with the design (just maybe they can’t articulate what) • Quantifiable • Yes/No, or better, Likert scale (ratings) through interview or questionnaire • Need to ask the right questions • E.g., don’t have leading questions (BAD: “Is this the absolute worst system you have used ever?”) • And need to ask the questions well (so user reliably expresses what they mean) • E.g., don’t trip them up with double negatives
Likert Scale See Heim – pages 117-119 • Can be from 4 to 7 “points” • Usually about agreement to a phrase • E.g., “I found the search function easy to use” • Strongly Agree, Agree, Neutral (optional), Disagree, Strongly Disagree • May also be about importance • E.g., “A site search function is…” • “not very important” to “extremely important” • Or a general assessment • E.g., “The performance of the search function was…” • Poor, Fair, Satisfactory, Good, Excellent • Great to ask open-ended questions, too • E.g., “What was the best aspect of the search function?” • But it’s the Likert scale data that you can quantify
Dimensions and validity • When designing a questionnaire… • Have in mind a few underlying issues that you are trying to assess • Ask a few different questions that are coordinated around each issue • Ask different ways – vary whether positive or negative favours the issue in question • Ideally, verify the questionnaire by having people role-play particularly happy or angry, and middle-of-the-road users • And see if they answer the questions the way you’d expect!
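A minimal sketch of how Likert responses might be quantified, including reverse-coding of negatively worded items so that all items on an issue point the same way. The 5-point coding (1 = Strongly Disagree … 5 = Strongly Agree) and the question keys are assumptions for illustration.

```python
responses = {
    "search_easy_to_use": 4,       # positively worded item
    "search_was_frustrating": 2,   # negatively worded item -> reverse-code
}
reverse_coded = {"search_was_frustrating"}

# On a 1..5 scale, reverse-coding maps v -> 6 - v.
scores = [(6 - v) if key in reverse_coded else v
          for key, v in responses.items()]

issue_score = sum(scores) / len(scores)
print(f"mean score for the 'search' issue: {issue_score:.2f}")  # 4.00
```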
Questionnaire administration • Face-to-face or telephone interviews • Esp. efficient (and unbiased) to have a third party give the telephone survey • Mail-out (including email) questionnaire • Web-based questionnaire (maybe email out the URL) • Even a questionnaire is an experiment on humans from the point of view of research ethics • Easier to achieve an ethical questionnaire administration if it’s truly anonymous
Response rates and bias • Low response rate is a problem • Below 50% response rate, one wonders whether respondents were exceptional (happiest, angriest or a mix; but not the “normal” folks) • Better if you have some authority to motivate a response • But back to issues of ethics – e.g., not truly anonymous if you know who to nag about non-response; and the “pressure” may be unfair • Similar bias problems when you use volunteers for any experiment • Are these volunteers representative of your “normal” users?
Experimental studies on groups More difficult than single-user experiments Problems with: • subject groups • choice of task • data gathering • analysis • Unfortunately (in terms of experimental requirements) a lot of the things that are interesting in the real world involve computers mediating group behaviour
Subject groups • larger number of subjects • more expensive • longer time to ‘settle down’ … even more variation! • difficult to timetable • so … often only three or four groups
Groups (contd.): Data gathering • several video cameras + direct logging of the application • problems: synchronisation, sheer volume! • one solution: record from each perspective
Groups (contd.): Analysis • N.B. vast variation between groups • solutions: ‘within groups’ experiments (each group works under various conditions), micro-analysis (e.g., gaps in speech), anecdotal and qualitative analysis • controlled experiments may ‘waste’ resources! • experiments dominated by dynamics of group formation • field studies are apt to be more realistic
About statistics • It’s an amazingly complex field • A lot of hidden complexities in running experiments and saying that the observed differences really make a difference • ‘Threats to validity’ are the things that could make your experimental conclusion wrong • Threats to internal validity: like carryover effects, or lack of randomization • Threats to external validity: like that your whole population of subjects was unusual in some way, or the task was not representative of real use of the tool • When the outcomes are serious (e.g., medical trials) professional statisticians are always used in the design of the experiment as well as the analysis and reporting of the findings • Plenty of texts and courses on stats available (Wikipedia is pretty good on these topics, too – e.g., for ANOVA)
Unwanted biases in studies • You can’t always take a study result at face value… must be attentive to what subjects are feeling • Hawthorne effect • Worker is more productive when observed • John Henry effect • Worker is [stubbornly] more productive when using his old tools (see http://www.ibiblio.org/john_henry/) • Placebo effect • [Patient usually] gets some benefit just because they expect a benefit • Pygmalion effect • Student performs better simply because they are expected to do so
Usability Analysis - Conclusion • Remember: the ultimate goal is to learn • Learn what’s working and, most critically, what isn’t working for the end user • Do the usability testing that helps you make the best possible interface • Test within your constraints • A quick talk-aloud protocol session is far better than nothing and will probably find the most critical flaws • Then again, if it’s a “bet the business” interface, and it’s a big business, then organise testing on an appropriate scale! • Hardest bit might be finding the time; nobody likes to delay a product release (but nobody wants to release a failure, either)