SBD: Usability Evaluation • Chris North • cs3724: HCI
Scenario-Based Design • ANALYZE: analysis of stakeholders, field studies • Problem scenarios • claims about current practice • DESIGN: Activity scenarios, Information scenarios, Interaction scenarios • metaphors, information technology, HCI theory, guidelines • iterative analysis of usability claims and re-design • PROTOTYPE & EVALUATE: Usability specifications • formative evaluation • summative evaluation
Evaluation • Formative vs. Summative • Analytic vs. Empirical
Usability Engineering • Reqs Analysis • Design • Evaluate • Develop • many iterations
Usability Engineering • Formative evaluation • Summative evaluation
Usability Evaluation • Analytic Methods: • Usability inspection, Expert review • Heuristic Evaluation • Cognitive walk-through • GOMS analysis • Empirical Methods: • Usability Testing • Field or lab • Observation, problem identification • Controlled Experiment • Formal controlled scientific experiment • Comparisons, statistical analysis
User Interface Metrics • Ease of learning • learning time, … • Ease of use • perf time, error rates… • User satisfaction • surveys… Not “user friendly”
Usability Testing • Formative: helps guide design • Early in design process • when architecture is finalized, then it's too late! • A few users • Usability problems, incidents • Qualitative feedback from users • Quantitative usability specification
Usability Test Setup • Set of benchmark tasks • Easy to hard, specific to open-ended • Coverage of different UI features • E.g. “find the 5 most expensive houses for sale” • Different types: learnability vs. performance • Consent forms • Not needed unless video-taping user’s face (new rule) • Experimenters: • Facilitator: instructs user • Observers: take notes, collect data, video tape screen • Executor: run the prototype if faked • Users • 3-5 users, quality not quantity
Usability Test Procedure • Goal: mimic real life • Do not cheat by showing them how to use the UI! • Initial instructions • “We are evaluating the system, not you.” • Repeat: • Give user a task • Ask user to “think aloud” • Observe, note mistakes and problems • Avoid interfering, hint only if completely stuck • Interview • Verbal feedback • Questionnaire • ~1 hour / user
Usability Lab • E.g. McBryde 102
Data • Note taking • E.g. “&%$#@ user keeps clicking on the wrong button…” • Verbal protocol: think aloud • E.g. user thinks that button does something else… • Rough quantitative measures • HCI metrics: e.g. task completion time, .. • Interview feedback and surveys • Video-tape screen & mouse • Eye tracking, biometrics?
Analyze • Initial reaction: • “stupid user!”, “that’s developer X’s fault!”, “this sucks” • Mature reaction: • “how can we redesign UI to solve that usability problem?” • the user is always right • Identify usability problems • Learning issues: e.g. can’t figure out or didn’t notice feature • Performance issues: e.g. arduous, tiring to solve tasks • Subjective issues: e.g. annoying, ugly • Problem severity: critical vs. minor
Cost-Importance Analysis • Importance 1-5: (task effect, frequency) • 5 = critical, major impact on user, frequent occurrence • 3 = user can complete task, but with difficulty • 1 = minor problem, small speed bump, infrequent • Ratio = importance / cost • Sort by this • 3 categories: Must fix, next version, ignored
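A minimal sketch of the importance/cost ranking described above. The slide names the three categories but not the rule for assigning them, so the cut-off values, the person-hours cost unit, and the example problems here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    importance: int  # 1 (minor) to 5 (critical), as on the slide
    cost: float      # estimated fix effort in person-hours (assumed unit)

    @property
    def ratio(self) -> float:
        # Ratio = importance / cost
        return self.importance / self.cost


def prioritize(problems):
    """Sort problems by importance/cost ratio, best payoff first."""
    return sorted(problems, key=lambda p: p.ratio, reverse=True)


def categorize(problem, must_fix_ratio=1.0, ignore_ratio=0.25):
    """Assign one of the three categories using hypothetical cut-offs."""
    if problem.importance == 5 or problem.ratio >= must_fix_ratio:
        return "must fix"
    if problem.ratio < ignore_ratio:
        return "ignored"
    return "next version"


# Made-up problems for illustration only.
problems = [
    Problem("slow search results", importance=3, cost=8.0),
    Problem("zoom not discoverable", importance=5, cost=2.0),
    Problem("ugly icons", importance=1, cost=6.0),
    Problem("label typo", importance=1, cost=0.5),
]

for p in prioritize(problems):
    print(f"{p.ratio:5.2f}  {categorize(p):12}  {p.name}")
```

Note that a cheap minor fix (the typo) can outrank an expensive moderate one, which is the point of dividing by cost.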
Refine UI • Simple solutions vs. major redesigns • Solve problems in order of: importance/cost • Example: • Problem: user didn’t know he could zoom in to see more… • Potential solutions: • Better zoom button icon, tooltip • Add a zoom bar slider (like moosburg) • Icons for different zoom levels: boundaries, roads, buildings • NOT: more “help” documentation!!! You can do better. • Iterate • Test, refine, test, refine, test, refine, … • Until? Meets usability specification
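The stopping rule "until it meets the usability specification" could be checked along these lines. This is only a sketch: the target values and the measurements are invented, and a real specification would list whatever metrics the project chose.

```python
from statistics import mean

# Hypothetical usability specification for one benchmark task.
SPEC = {"max_mean_time_s": 60.0, "max_error_rate": 0.10}


def meets_spec(times_s, errors, attempts, spec=SPEC):
    """Return True if mean task time and error rate are both within spec."""
    mean_time = mean(times_s)
    error_rate = errors / attempts
    return (mean_time <= spec["max_mean_time_s"]
            and error_rate <= spec["max_error_rate"])


# Round 1 of testing: too slow, so refine and test again.
print(meets_spec([80, 90, 75], errors=1, attempts=15))
# Round 2 after a redesign: within spec, so iteration can stop.
print(meets_spec([50, 55, 70], errors=1, attempts=15))
```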
Project: Usability Evaluation • Usability Evaluation: • >=3 users: Not (tainted) HCI students • Simple data collection (Biometrics optional!) • Exploit this opportunity to improve your design • Report: • Procedure (users, tasks, specs, data collection) • Usability problems identified, specs not met • Design modifications
Usability test vs. Controlled Experiment • Usability test: • Formative: helps guide design • Single UI, early in design process • Few users • Usability problems, incidents • Qualitative feedback from users • Controlled experiment: • Summative: measure final result • Compare multiple UIs • Many users, strict protocol • Independent & dependent variables • Quantitative results, statistical significance
What is Science? • Measurement • Modeling
Scientific Method • Form Hypothesis • Collect data • Analyze • Accept/reject hypothesis • How to “prove” a hypothesis in science? • Easier to disprove things, by counterexample • Null hypothesis = opposite of hypothesis • Disprove null hypothesis • Hence, hypothesis is supported (“proved” only in this loose sense)
Empirical Experiment • Typical question: • Which visualization is better in which situations? Spotfire vs. TableLens
Cause and Effect • Goal: determine “cause and effect” • Cause = visualization tool (Spotfire vs. TableLens) • Effect = user performance time on task T • Procedure: • Vary cause • Measure effect • Problem: random variation • Cause = vis tool OR random variation? • [Diagram labels: real world, random variation, collected data, uncertain conclusions]
Stats to the Rescue • Goal: • Measured effect unlikely to result by random variation • Hypothesis: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Null hypothesis: • Visualization tool has no effect (e.g. Spotfire = TableLens) • Hence: Cause = random variation • Stats: • If null hypothesis true, then measured effect occurs with probability < 5% (e.g. measured effect >> random variation) • Hence: • Null hypothesis unlikely to be true • Hence, hypothesis likely to be true
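One way to make "measured effect unlikely to result by random variation" concrete is an exact permutation test. This technique is swapped in for illustration and is not named in the slides. The timing data are invented. If the tool had no effect, the group labels would be arbitrary, so the sketch counts how many relabelings of the pooled data give a difference in means at least as large as the one measured.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference of group means."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n = len(pooled)
    extreme = total = 0
    # Enumerate every way to pick which pooled values get label "a".
    for idx in combinations(range(n), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(n) if i not in chosen]
        # Small tolerance guards against floating-point rounding.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical task times in seconds, 5 users per tool.
spotfire = [42, 38, 45, 40, 44]
tablelens = [30, 33, 29, 35, 31]

p = permutation_p_value(spotfire, tablelens)
print(f"p = {p:.4f}")  # below 0.05: random relabeling rarely looks like this
```

With only 10 users the 252 relabelings can be enumerated exactly; larger studies would sample relabelings at random or use a parametric test instead.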
Variables • Independent Variables (what you vary), and treatments (the variable values): • Visualization tool • Spotfire, TableLens, Excel • Task type • Find, count, pattern, compare • Data size (# of items) • 100, 1000, 1000000 • Dependent Variables (what you measure) • User performance time • Errors • Subjective satisfaction (survey) • HCI metrics
Example: 2 x 3 design • n users per cell • Ind Var 1: Vis. Tool • Ind Var 2: Task Type • Cell contents: measured user performance times (dep var)
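The cells of a 2 x 3 design are just the cross product of the treatments. The sketch below assumes two tools and three task types picked from the lists on the Variables slide; the helper `users_needed` is hypothetical and covers only the two simplest cases (every variable between subjects, or every variable within subjects).

```python
from itertools import product

tools = ["Spotfire", "TableLens"]          # Ind Var 1: 2 treatments
task_types = ["find", "count", "compare"]  # Ind Var 2: 3 treatments

# Each cell is one (tool, task type) combination.
cells = list(product(tools, task_types))


def users_needed(n_per_cell, n_cells, between_subjects=True):
    """Total users for n per cell.

    Fully between subjects: a separate group per cell.
    Fully within subjects: the same n users visit every cell.
    """
    return n_per_cell * n_cells if between_subjects else n_per_cell


for cell in cells:
    print(cell)
print(users_needed(20, len(cells)))                          # between subjects
print(users_needed(20, len(cells), between_subjects=False))  # within subjects
```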
Groups • “Between subjects” variable • 1 group of users for each variable treatment • Group 1: 20 users, Spotfire • Group 2: 20 users, TableLens • Total: 40 users, 20 per cell • “Within subjects” (repeated) variable • All users perform all treatments • Counter-balancing order effect • Group 1: 20 users, Spotfire then TableLens • Group 2: 20 users, TableLens then Spotfire • Total: 40 users, 40 per cell
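Counter-balancing for a two-treatment within-subjects variable can be sketched as a random split of users into the two presentation orders. The function name, the user IDs, and the fixed seed are assumptions made for a reproducible illustration.

```python
import random


def counterbalance(user_ids, treatments, seed=0):
    """Randomly assign each user one of the two treatment orders.

    Shuffling first and then alternating keeps the two order groups
    equal in size (for an even number of users).
    """
    rng = random.Random(seed)
    users = list(user_ids)
    rng.shuffle(users)
    orders = [tuple(treatments), tuple(reversed(treatments))]
    return {user: orders[i % 2] for i, user in enumerate(users)}


assignment = counterbalance(range(40), ["Spotfire", "TableLens"])
first_spotfire = [u for u, order in assignment.items() if order[0] == "Spotfire"]
print(len(first_spotfire), "users see Spotfire first")
```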
Issues • Eliminate or measure extraneous factors • Randomized • Fairness • Identical procedures, … • Bias • User privacy, data security • IRB (institutional review board)
Procedure • For each user: • Sign legal forms • Pre-Survey: demographics • Instructions • Do not reveal true purpose of experiment • Training runs • Actual runs • Give task • measure performance • Post-Survey: subjective measures • * n users
Data • Measured dependent variables • Spreadsheet:
Step 1: Visualize it • Dig out interesting facts • Qualitative conclusions • Guide stats • Guide future experiments
Step 2: Stats • Ind Var 1: Vis. Tool • Ind Var 2: Task Type • Cell contents: average user performance times (dep var)
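Filling in the table of per-cell averages from the raw spreadsheet rows might look like the sketch below. The rows are invented, and the standard deviation is reported next to each mean as one small hedge against relying on averages alone.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical spreadsheet rows: (vis tool, task type, time in seconds).
rows = [
    ("Spotfire", "find", 42), ("Spotfire", "find", 38),
    ("Spotfire", "count", 55), ("Spotfire", "count", 61),
    ("TableLens", "find", 30), ("TableLens", "find", 34),
    ("TableLens", "count", 58), ("TableLens", "count", 52),
]


def cell_summary(rows):
    """Group times by (tool, task) cell and return (mean, stdev) per cell."""
    cells = defaultdict(list)
    for tool, task, seconds in rows:
        cells[(tool, task)].append(seconds)
    return {cell: (mean(times), stdev(times)) for cell, times in cells.items()}


summary = cell_summary(rows)
for (tool, task), (avg, sd) in summary.items():
    print(f"{tool:10} {task:6} mean={avg:5.1f}s  sd={sd:4.1f}s")
```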
TableLens better than Spotfire? • Problem with Averages: lossy • Compares only 2 numbers • What about the 40 data values? (Show me the data!) • [Chart: Avg Perf time (secs), Spotfire vs. TableLens]
The real picture • Need stats that compare all data • [Chart: Avg Perf time (secs), Spotfire vs. TableLens]
Statistics • t-test • Compares 1 dep var on 2 treatments of 1 ind var • ANOVA: Analysis of Variance • Compares 1 dep var on n treatments of m ind vars • Result: • p = probability of measuring a difference at least this large between treatments if only random variation were at work (null hypothesis true) • “statistical significance” level • typical cut-off: p < 0.05 • Informal reading: hypothesis confidence ≈ 1 - p (a rule of thumb, not a literal probability that the hypothesis is true)
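For the two-treatment case, the t statistic can be computed by hand. This is a sketch that assumes the pooled-variance (Student) form for two independent groups, uses invented timing data, and compares against the standard two-tailed critical value of 2.306 for 8 degrees of freedom at the 0.05 level instead of computing p itself. In practice a statistics package would report p directly.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, df


# Hypothetical task times in seconds, 5 users per tool (between subjects).
spotfire = [42, 38, 45, 40, 44]
tablelens = [30, 33, 29, 35, 31]

t, df = pooled_t(spotfire, tablelens)
T_CRIT_DF8 = 2.306  # two-tailed critical value for df = 8, alpha = 0.05

print(f"t = {t:.3f}, df = {df}")
print("significant at p < 0.05" if abs(t) > T_CRIT_DF8 else "not significant")
```

Unlike a comparison of the two averages alone, the statistic uses every data value through the variances, so noisier data needs a larger difference to reach significance.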
p < 0.05 • Woohoo! • Found a “statistically significant” difference • Averages determine which is ‘better’ • Conclusion: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Vis Tool has an effect on user performance for task T … • “95% confident that TableLens better than Spotfire …” • NOT “TableLens beats Spotfire 95% of time” • 5% chance of being wrong! • Be careful about generalizing
p > 0.05 • Hence, no difference? • Vis Tool has no effect on user performance for task T…? • Spotfire = TableLens ? • NOT! • Did not detect a difference, but could still be different • Potential real effect did not overcome random variation • Provides evidence for Spotfire = TableLens, but not proof • Boring, basically found nothing • How? • Not enough users • Need better tasks, data, …
Data Mountain • Robertson, “Data Mountain” (Microsoft)
Data Mountain: Experiment • Data Mountain vs. IE favorites • 32 subjects • Organize 100 pages, then retrieve based on cues • Indep. Vars: • UI: Data mountain (old, new), IE • Cue: Title, Summary, Thumbnail, all 3 • Dependent variables: • User performance time • Error rates: wrong pages, failed to find in 2 min • Subjective ratings
Data Mountain: Results • Spatial Memory! • Limited scalability?