SBD: Usability Evaluation Chris North CS 3724: HCI
Scenario-Based Design [framework diagram] • ANALYZE: analysis of stakeholders, field studies; claims about current practice → Problem scenarios • DESIGN: metaphors, information technology, HCI theory, guidelines; iterative analysis of usability claims and re-design → Activity scenarios, Information scenarios, Interaction scenarios • PROTOTYPE & EVALUATE: Usability specifications; formative evaluation; summative evaluation
Evaluation • Formative vs. Summative • Analytic vs. Empirical
Usability Engineering [iterative lifecycle diagram] • Reqs Analysis, Design, Develop, Evaluate • many iterations
Usability Engineering [lifecycle diagram, annotated] • Formative evaluation (during the design iterations) • Summative evaluation (at the end)
Usability Evaluation • Analytic Methods: • Usability inspection, Expert review • Heuristic: Nielsen’s 10 • Cognitive walk-through • GOMS analysis • Empirical Methods: • Usability Testing • Field or lab • Observation, problem identification • Controlled Experiment • Formal controlled scientific experiment • Comparisons, statistical analysis
User Interface Metrics • Ease of learning • Ease of use • User satisfaction
User Interface Metrics • Ease of learning • learning time, … • Ease of use • performance time, error rates… • User satisfaction • surveys… Not “user friendly”
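To make these metrics concrete, here is a minimal Python sketch of computing them from logged sessions; the log format, field names, and numbers are illustrative assumptions, not part of the course materials.

```python
# A minimal sketch (hypothetical log format) of turning raw session logs
# into the three metric families above: learning, performance, satisfaction.

# Each record: (user, task, start_sec, end_sec, errors)
task_log = [
    ("u1", "find_houses", 0, 42, 1),
    ("u1", "find_houses", 60, 85, 0),   # second attempt: faster, fewer errors
    ("u2", "find_houses", 0, 130, 3),
]
survey_scores = {"u1": 4, "u2": 2}      # 1-5 satisfaction ratings

times = [end - start for _, _, start, end, _ in task_log]
errors = [e for *_, e in task_log]

print("mean task time (s):", sum(times) / len(times))
print("error rate per task:", sum(errors) / len(errors))
print("mean satisfaction (1-5):", sum(survey_scores.values()) / len(survey_scores))
```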
Usability Testing • Formative: helps guide design • Early in design process • If you wait until the architecture is finalized, it's too late! • Small # of users • Usability problems, incidents • Qualitative feedback from users • Quantitative usability specification
Usability Specification Table • Target values for the usability metrics • e.g. frequent tasks should be fast
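The specification table itself did not survive the export. Below is a hedged sketch of what entries in such a table might look like, using the worst-acceptable / planned-target / best-possible levels common in usability engineering texts; the tasks, field names, and values are assumptions.

```python
# Hypothetical usability specification entries; the actual table from the
# slide is not preserved, so these fields and values are illustrative only.
usability_spec = [
    {"task": "find 5 most expensive houses",  # frequent task -> should be fast
     "metric": "completion time (s)",
     "worst_acceptable": 60, "planned_target": 30, "best_possible": 10},
    {"task": "first-time zoom to street level",
     "metric": "errors before success",
     "worst_acceptable": 3, "planned_target": 1, "best_possible": 0},
]

def meets_spec(observed, entry):
    """True if the observed value is at least as good as the planned target."""
    return observed <= entry["planned_target"]
```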
Usability Test Setup • Set of benchmark tasks • Derived from scenarios (Reqs analysis phase) • Derived from claims analysis (Design phase) • Easy to hard, specific to open-ended • Coverage of different UI features • E.g. “Find the 5 most expensive houses for sale” • Different types: learnability vs. performance • Consent forms • Not needed unless recording user’s face/voice (new rule) • Experimenters: • Facilitator: instructs user • Observers: take notes, collect data, video tape screen • Executor: run the prototype for faked parts • Users • Solicit from target user community (Reqs analysis) • 3-5 users, quality not quantity
Usability Test Procedure • Goal: mimic real life • Do not cheat by helping them complete tasks • Initial instructions • “We are evaluating the system, not you.” • Repeat: • Give user next benchmark task • Ask user to “think aloud” • Observe, note mistakes and problems • Avoid interfering, hint only if completely stuck • Interview • Verbal feedback • Questionnaire • ~1 hour / user
Usability Lab • E.g. McBryde 102
Data • Note taking • E.g. “&%$#@ user keeps clicking on the wrong button…” • Verbal protocol: think aloud • E.g. user thinks that button does something else… • Rough quantitative measures • HCI metrics: e.g. task completion time, … • Interview feedback and surveys • Video-tape screen & mouse • Eye tracking, biometrics?
Analyze • Initial reaction: • “stupid user!”, “that’s developer X’s fault!”, “this sucks” • Mature reaction: • “how can we redesign UI to solve that usability problem?” • the data is always right • Identify usability problems • Learning issues: e.g. can’t figure out or didn’t notice feature • Performance issues: e.g. arduous, tiring to solve tasks • Subjective issues: e.g. annoying, ugly • Problem severity: critical vs. minor
Cost-Importance Analysis • Importance 1-5: (task effect, frequency) • 5 = critical, major impact on user, frequent occurrence • 3 = user can complete task, but with difficulty • 1 = minor problem, small speed bump, infrequent • Ratio = importance / cost • Sort by this, highest to lowest • 3 categories: Must fix, next version, ignored
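A minimal sketch of the ratio-and-sort step described above; the problem descriptions, cost units (developer-hours), and bucket thresholds are illustrative assumptions.

```python
# Cost-importance analysis: rank problems by importance / cost, then bucket.
problems = [
    {"desc": "zoom button not noticed", "importance": 5, "cost": 2},
    {"desc": "repeated zooming is slow", "importance": 3, "cost": 8},
    {"desc": "ugly splash screen",       "importance": 1, "cost": 1},
]

for p in problems:
    p["ratio"] = p["importance"] / p["cost"]

problems.sort(key=lambda p: p["ratio"], reverse=True)  # highest ratio first

# Bucket cut-offs are project-specific; these thresholds are illustrative.
for p in problems:
    if p["importance"] == 5 or p["ratio"] >= 1.0:
        p["bucket"] = "must fix"
    elif p["ratio"] >= 0.25:
        p["bucket"] = "next version"
    else:
        p["bucket"] = "ignored"
    print(f'{p["bucket"]:12s} {p["desc"]} (ratio={p["ratio"]:.2f})')
```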
Refine UI • Solve problems in order of: importance/cost • Simple solutions vs. major redesigns • Iterate: • Test, refine, test, refine, test, refine, … • Until?
Refine UI • Solve problems in order of: importance/cost • Simple solutions vs. major redesigns • Iterate: • Test, refine, test, refine, test, refine, … • Until? Meets usability specification
Examples • Learnability problem: • Problem: user didn’t know he could zoom in to see more… • Potential solutions: • Better labeling: Better zoom button icon, tooltip • Clearer affordance: Add a zoom bar slider (like google maps) • … • NOT: more “help” documentation! You can do better. • Performance problem: • Problem: user took too long to repeatedly zoom in… • Potential solutions: • Faster affordance: Add a real-time zoom bar • Shortcuts: Icons for each zoom level: state, city, street • …
Project (step 6): Usability Test • Usability Evaluation: • >=3 users: Not (tainted) HCI students • ~10 benchmark tasks • Simple data collection (Biometrics optional!) • Exploit this opportunity to improve your design • Report: • Procedure (users, tasks, specs, data collection) • Usability problems identified, specs not met • Design modifications
Usability test vs. Controlled Experiment • Usability test (engineering oriented): • Formative: helps guide design • Single UI, early in design process • Few users • Usability problems, incidents • Qualitative feedback from users • Controlled experiment (science oriented): • Summative: measure final result • Compare multiple UIs • Many users, strict protocol • Independent & dependent variables • Quantitative results, statistical significance
What is Science? • [Diagram relating: Phenomenon, Measurement, Modeling, Engineering, Science]
Scientific Method • Form Hypothesis • Collect data • Analyze • Accept/reject hypothesis • How to “prove” a hypothesis in science?
Scientific Method • Form Hypothesis • Collect data • Analyze • Accept/reject hypothesis • How to “prove” a hypothesis in science? • Easier to disprove things, by counterexample • Null hypothesis = opposite of hypothesis • Disprove null hypothesis • Hence, hypothesis is proved
Example • Typical question: • Which visualization is better for which user tasks? Spotfire vs. TableLens
Cause and Effect • Goal: determine "cause and effect" • Cause = visualization tool (Spotfire vs. TableLens) • Effect = user performance time on task T • Procedure: • Vary cause • Measure effect • Problem: random variation • Cause = vis tool OR random variation? • [Diagram: real world → collected data → uncertain conclusions, with random variation in between]
Stats to the Rescue • Goal: • Measured effect unlikely to result by random variation • Hypothesis: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Null hypothesis: • Visualization tool has no effect (e.g. Spotfire = TableLens) • Hence: Cause = random variation • Stats: • If null hypothesis true, then measured effect occurs with probability < 5% (e.g. measured effect >> random variation) • Hence: • Null hypothesis unlikely to be true • Hence, hypothesis likely to be true
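One way to see what "measured effect unlikely to result by random variation" means is a permutation test: shuffle the pooled measurements many times and count how often chance alone produces a difference as large as the observed one. The sketch below uses made-up timing data.

```python
import random

# Sketch: estimate how often random variation alone produces a difference in
# mean task time at least as large as the one we measured (times are made up).
spotfire = [31, 35, 40, 28, 33, 37, 30, 36]
tablelens = [25, 27, 30, 24, 29, 26, 28, 23]
observed = abs(sum(spotfire)/len(spotfire) - sum(tablelens)/len(tablelens))

pooled = spotfire + tablelens
count = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    a, b = pooled[:len(spotfire)], pooled[len(spotfire):]
    if abs(sum(a)/len(a) - sum(b)/len(b)) >= observed:
        count += 1

p = count / trials
print(f"p ≈ {p:.3f}")  # small p -> difference unlikely to be random variation
```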
Variables • Independent Variables (what you vary), and treatments (the variable values): • Visualization tool: • Spotfire, TableLens, Excel • Task type: • Find, count, pattern, compare • Data size (# of items): • 100, 1000, 1000000 • Dependent Variables (what you measure) • User performance time • Errors • Subjective satisfaction (survey) • HCI metrics
Example: 2 x 3 design • n users per cell • Ind Var 1: Vis. Tool (2 treatments, rows) • Ind Var 2: Task Type (3 treatments, columns) • Cells: measured user performance times (dep var)
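A small sketch of enumerating the cells of this design; which three task types are used and the value of n are assumptions for illustration.

```python
from itertools import product

# Enumerate the cells of the 2 x 3 design described above.
vis_tools = ["Spotfire", "TableLens"]        # independent variable 1
task_types = ["find", "count", "pattern"]    # independent variable 2 (3 of the 4 listed earlier)
n_per_cell = 20                              # assumption; the slide just says "n"

cells = list(product(vis_tools, task_types))
print(len(cells), "cells,", len(cells) * n_per_cell, "measurements if between-subjects")
for tool, task in cells:
    print(f"cell: {tool} x {task} -> collect {n_per_cell} performance times")
```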
Groups • "Between-subjects" variable • 1 group of users for each variable treatment • Group 1: 20 users, Spotfire • Group 2: 20 users, TableLens • Total: 40 users, 20 per cell • "Within-subjects" (repeated) variable • All users perform all treatments • Counter-balancing the order effect • Group 1: 20 users, Spotfire then TableLens • Group 2: 20 users, TableLens then Spotfire • Total: 40 users, 40 per cell
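For the within-subjects case, counterbalancing can be as simple as alternating the two presentation orders across users, as in this sketch (user IDs and group size are illustrative).

```python
# Within-subjects assignment with counterbalancing: half the users see
# Spotfire first, half see TableLens first, to cancel out order effects.
users = [f"u{i:02d}" for i in range(1, 41)]    # 40 users, as in the example above
orders = [("Spotfire", "TableLens"), ("TableLens", "Spotfire")]

assignment = {u: orders[i % 2] for i, u in enumerate(users)}
print(assignment["u01"])   # ('Spotfire', 'TableLens')
print(assignment["u02"])   # ('TableLens', 'Spotfire')
```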
Issues • Eliminate or measure extraneous factors • Randomization • Fairness • Identical procedures, … • Bias • User privacy, data security • IRB (Institutional Review Board)
Procedure • For each user: • Sign legal forms • Pre-Survey: demographics • Instructions • Do not reveal true purpose of experiment • Training runs • Actual runs • Give task • Measure performance • Post-Survey: subjective measures • Repeat for all n users
Data • Measured dependent variables • Spreadsheet:
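The spreadsheet image is not preserved here; the sketch below assumes a long-format layout with one row per user × treatment measurement (column names and values are illustrative).

```python
import pandas as pd

# Hypothetical long-format recording of the measured dependent variables.
data = pd.DataFrame(
    [("u01", "Spotfire",  "find",    41.2, 1, 4),
     ("u01", "TableLens", "find",    29.8, 0, 5),
     ("u02", "Spotfire",  "pattern", 55.0, 2, 3)],
    columns=["user", "vis_tool", "task", "time_sec", "errors", "satisfaction"],
)
print(data.groupby("vis_tool")["time_sec"].mean())  # quick per-treatment summary
```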
Step 1: Visualize it • Dig out interesting facts • Qualitative conclusions • Guide stats • Guide future experiments
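A quick way to "show me the data" before running stats is to plot the full distributions rather than two bars; this matplotlib sketch uses made-up times for illustration.

```python
import matplotlib.pyplot as plt

# Plot per-user distributions, not just the two averages (made-up times).
spotfire = [31, 35, 40, 28, 33, 37, 30, 36]
tablelens = [25, 27, 30, 24, 29, 26, 28, 23]

plt.boxplot([spotfire, tablelens])
plt.xticks([1, 2], ["Spotfire", "TableLens"])
plt.ylabel("task completion time (s)")
plt.title("Per-user performance times (dep var)")
plt.show()
```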
Step 2: Stats • Table of average user performance times (dep var): Ind Var 1: Vis. Tool (rows) × Ind Var 2: Task Type (columns)
TableLens better than Spotfire? • Problem with Averages? • [Bar chart: avg perf time (secs), Spotfire vs. TableLens]
TableLens better than Spotfire? • Problem with Averages? Lossy • Compares only 2 numbers • What about the 40 data values? (Show me the data!) • [Bar chart: avg perf time (secs), Spotfire vs. TableLens]
The real picture • Need stats that compare all data • What if all users were 1 sec faster on TableLens? • What if only 1 user was 20 sec faster on TableLens? • [Chart: per-user perf times (secs), Spotfire vs. TableLens]
Statistics • t-test • Compares 1 dep var on 2 treatments of 1 ind var • ANOVA: Analysis of Variance • Compares 1 dep var on n treatments of m ind vars • Result: • p = probability that difference between treatments is random (null hypothesis) • “statistical significance” level • typical cut-off: p < 0.05 • Hypothesis confidence = 1 - p
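A hedged sketch of running these tests with SciPy on made-up timing data; the third group (Excel) is added only to show a case with more than two treatments. A multi-way ANOVA over several independent variables would need a different routine, but the one-factor cases look like this:

```python
from scipy import stats

# Two-sample t-test: 1 dep var, 2 treatments of 1 ind var (made-up data).
spotfire = [31, 35, 40, 28, 33, 37, 30, 36]
tablelens = [25, 27, 30, 24, 29, 26, 28, 23]
excel = [45, 50, 48, 52, 47, 49, 51, 46]

res = stats.ttest_ind(spotfire, tablelens)
print(f"t-test: t={res.statistic:.2f}, p={res.pvalue:.4f}")   # p < 0.05 -> statistically significant

# One-way ANOVA: 1 dep var, 3 treatments of 1 ind var.
res = stats.f_oneway(spotfire, tablelens, excel)
print(f"ANOVA: F={res.statistic:.2f}, p={res.pvalue:.4f}")
```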
p < 0.05 • Woohoo! • Found a “statistically significant” difference • Averages determine which is ‘better’ • Conclusion: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Vis Tool has an effect on user performance for task T … • “95% confident that TableLens better than Spotfire …” • NOT “TableLens beats Spotfire 95% of time” • 5% chance of being wrong! • Be careful about generalizing
p > 0.05 • Hence, no difference? • Vis Tool has no effect on user performance for task T…? • Spotfire = TableLens ?
p > 0.05 • Hence, no difference? • Vis Tool has no effect on user performance for task T…? • Spotfire = TableLens ? • NOT! • Did not detect a difference, but could still be different • Potential real effect did not overcome random variation • Provides evidence for Spotfire = TableLens, but not proof • Boring, basically found nothing • How? • Not enough users • Need better tasks, data, …
Data Mountain • Robertson, “Data Mountain” (Microsoft)
Data Mountain: Experiment • Data Mountain vs. IE favorites • 32 subjects • Organize 100 pages, then retrieve based on cues • Indep. Vars: • UI: Data mountain (old, new), IE • Cue: Title, Summary, Thumbnail, all 3 • Dependent variables: • User performance time • Error rates: wrong pages, failed to find in 2 min • Subjective ratings
Data Mountain: Results • Spatial Memory! • Limited scalability?