Evaluating Visualizations cs5764: Information Visualization Chris North
Evaluating Visualizations • Usability Test • Observation, problem identification • Controlled Experiment • Formal controlled scientific experiment • Comparisons, statistical analysis • Expert Review • Examination by visualization expert • Heuristic Evaluation • Principles, Guidelines • Algorithmic
Projects • Implementation projects: • Small usability test of implementation • Short usability report • Experiment projects: • Main controlled experiment • Experiment materials and raw data • Then data analysis
Usability test vs. Controlled Expm. • Usability test: • Formative: helps guide design • Single UI, early in design process • Few users • Usability problems, incidents • Qualitative feedback from users • Controlled experiment: • Summative: measure final result • Compare multiple UIs • Many users, strict protocol • Independent & dependent variables • Quantitative results, statistical significance
What is Science? • Measurement • Modeling
Scientific Method • Form hypothesis • Collect data • Analyze • Accept/reject hypothesis • How to “prove” a hypothesis in science? • Easier to disprove things, by counterexample • Null hypothesis = opposite of hypothesis • Disprove the null hypothesis • Hence, the hypothesis is supported (never strictly proved)
Empirical Experiment • Typical question: • Which visualization is better in which situations? Spotfire vs. TableLens
Cause and Effect • Goal: determine “cause and effect” • Cause = visualization tool (Spotfire vs. TableLens) • Effect = user performance time on task T • Procedure: • Vary cause • Measure effect • Problem: random variation • Cause = vis tool OR random variation? • [Diagram: real world → collected data; random variation → uncertain conclusions]
Stats to the Rescue • Goal: show the measured effect is unlikely to result from random variation • Hypothesis: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Null hypothesis: • Visualization tool has no effect (e.g. Spotfire = TableLens) • Hence: cause = random variation • Stats: • If the null hypothesis were true, an effect as large as the measured one would occur with probability < 5% • But the measured effect did occur! (e.g. measured effect >> random variation) • Hence: • Null hypothesis unlikely to be true • Hence, hypothesis likely to be true (see the sketch below)
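To make this logic concrete, here is a minimal permutation-test sketch in Python. The timing data is made up for illustration; the test counts how often random relabelings of the pooled data (i.e. the null hypothesis, where the tool makes no difference) produce a gap between means at least as large as the one measured.

    import random

    # Hypothetical task times (secs); not real experiment data
    spotfire = [52, 48, 61, 55, 47, 58, 50, 53]
    tablelens = [41, 39, 45, 38, 44, 40, 43, 42]

    observed = abs(sum(spotfire) / len(spotfire) - sum(tablelens) / len(tablelens))

    pooled = spotfire + tablelens
    extreme, trials = 0, 10_000
    for _ in range(trials):
        random.shuffle(pooled)                  # relabel under the null hypothesis
        a, b = pooled[:len(spotfire)], pooled[len(spotfire):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1

    print(f"p ≈ {extreme / trials:.4f}")        # small p => effect unlikely to be random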
Variables • Independent Variables (what you vary), and treatments (the variable values): • Visualization tool • Spotfire, TableLens, Excel • Task type • Find, count, pattern, compare • Data size (# of items) • 100, 1000, 1000000 • Dependent Variables (what you measure) • User performance time • Errors • Subjective satisfaction (survey) • HCI metrics
Example: 2 x 3 design • n users per cell • Rows: Ind Var 1 (Vis. Tool) • Columns: Ind Var 2 (Task Type) • Cell entries: measured user performance times (dep var)
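A brief sketch of how the cells of such a factorial design can be enumerated in Python; the treatment names follow the slides, and n per cell is an assumed value.

    from itertools import product

    vis_tools = ["Spotfire", "TableLens"]       # ind var 1: 2 treatments
    task_types = ["find", "count", "pattern"]   # ind var 2: 3 treatments
    n = 20                                      # users per cell (assumed)

    for tool, task in product(vis_tools, task_types):
        print(f"cell {tool} x {task}: {n} users, measure perf time (dep var)")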
Groups • “Between-subjects” variable • 1 group of users for each variable treatment • Group 1: 20 users, Spotfire • Group 2: 20 users, TableLens • Total: 40 users, 20 per cell • “Within-subjects” (repeated) variable • All users perform all treatments • Counter-balance the order effect (see the sketch below) • Group 1: 20 users, Spotfire then TableLens • Group 2: 20 users, TableLens then Spotfire • Total: 40 users, 40 per cell
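A minimal sketch of counterbalancing for the within-subjects case: alternating the presentation order across users so that practice and fatigue effects cancel between the two orderings. The function name and conditions are illustrative.

    def assign_orders(user_ids, conditions=("Spotfire", "TableLens")):
        """Alternate which tool each user sees first (simple counterbalancing)."""
        orders = {}
        for i, uid in enumerate(user_ids):
            first, second = conditions
            orders[uid] = (first, second) if i % 2 == 0 else (second, first)
        return orders

    print(assign_orders(range(1, 5)))
    # {1: ('Spotfire', 'TableLens'), 2: ('TableLens', 'Spotfire'), ...}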
Issues • Eliminate or measure extraneous factors • Randomization • Fairness • Identical procedures, … • Bias • User privacy, data security • IRB (Institutional Review Board)
Procedure • For each user: • Sign legal forms • Pre-Survey: demographics • Instructions • Do not reveal true purpose of experiment • Training runs • Actual runs • Give task, measure performance • Post-Survey: subjective measures • Repeat for all n users (see the sketch below)
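A sketch of the measurement loop for the actual runs, assuming a console setup where the facilitator presses Enter to start and stop each task; the task text, file name, and function name are all hypothetical.

    import csv
    import time

    tasks = ["find the 5 most expensive houses for sale"]   # benchmark tasks (example)

    def run_session(user_id, tool):
        """Time each task for one user and append the rows to results.csv."""
        rows = []
        for task in tasks:
            input(f"[user {user_id}, {tool}] {task} -- press Enter to start")
            start = time.perf_counter()
            input("Press Enter when the task is complete")
            rows.append((user_id, tool, task, round(time.perf_counter() - start, 2)))
        with open("results.csv", "a", newline="") as f:
            csv.writer(f).writerows(rows)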
Data • Measured dependent variables, collected in a spreadsheet
Step 1: Visualize it • Dig out interesting facts • Qualitative conclusions • Guide stats • Guide future experiments
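One way to “show me the data” before running any stats: a boxplot per condition, which reveals the spread and outliers that averages hide. A minimal matplotlib sketch using the same hypothetical timing data as above.

    import matplotlib.pyplot as plt

    spotfire = [52, 48, 61, 55, 47, 58, 50, 53]     # hypothetical times (secs)
    tablelens = [41, 39, 45, 38, 44, 40, 43, 42]

    plt.boxplot([spotfire, tablelens])
    plt.xticks([1, 2], ["Spotfire", "TableLens"])   # boxplot positions start at 1
    plt.ylabel("Task time (secs)")
    plt.title("Per-user performance by visualization tool")
    plt.show()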
Step 2: Stats • Average user performance times (dep var) per cell: Ind Var 1 (Vis. Tool) x Ind Var 2 (Task Type)
TableLens better than Spotfire? • Problem with averages: lossy • Compares only 2 numbers • What about the 40 data values? (Show me the data!) • [Bar chart: avg perf time (secs), Spotfire vs. TableLens]
The real picture • Need stats that compare all data • [Plot: full distributions of perf times (secs), Spotfire vs. TableLens]
Statistics • t-test • Compares 1 dep var on 2 treatments of 1 ind var • ANOVA: Analysis of Variance • Compares 1 dep var on n treatments of m ind vars • Result: • p = probability that difference between treatments is random (null hypothesis) • “statistical significance” level • typical cut-off: p < 0.05 • Hypothesis confidence = 1 - p
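A minimal SciPy sketch of both tests on the hypothetical data. Note that scipy.stats.f_oneway is one-way ANOVA (one independent variable); a factorial design with several independent variables needs a different tool, e.g. statsmodels.

    from scipy import stats

    spotfire = [52, 48, 61, 55, 47, 58, 50, 53]     # hypothetical times (secs)
    tablelens = [41, 39, 45, 38, 44, 40, 43, 42]
    excel = [63, 70, 58, 66, 72, 61, 65, 68]

    t, p = stats.ttest_ind(spotfire, tablelens)     # 1 dep var, 2 treatments
    print(f"t-test: p = {p:.4f}")

    f, p = stats.f_oneway(spotfire, tablelens, excel)   # 1 dep var, 3 treatments
    print(f"ANOVA:  p = {p:.4f}")                   # p < 0.05 => significant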
p < 0.05 • Woohoo! • Found a “statistically significant” difference • Averages determine which is ‘better’ • Conclusion: • Cause = visualization tool (e.g. Spotfire ≠ TableLens) • Vis tool has an effect on user performance for task T… • “95% confident that TableLens is better than Spotfire…” • NOT “TableLens beats Spotfire 95% of the time” • 5% chance of being wrong! • Be careful about generalizing
p > 0.05 • Hence, no difference? • Vis Tool has no effect on user performance for task T…? • Spotfire = TableLens ? • NOT! • Did not detect a difference, but could still be different • Potential real effect did not overcome random variation • Provides evidence for Spotfire = TableLens, but not proof • Boring, basically found nothing • How? • Not enough users • Need better tasks, data, …
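“Not enough users” can be estimated in advance with a power analysis. A sketch using statsmodels, where the effect size (Cohen’s d) is an assumed guess; it answers how many users per group a between-subjects t-test needs to detect that effect 80% of the time.

    from statsmodels.stats.power import TTestIndPower

    n = TTestIndPower().solve_power(
        effect_size=0.5,    # assumed medium effect (Cohen's d)
        alpha=0.05,         # significance cut-off
        power=0.8,          # desired chance of detecting the effect
    )
    print(f"~{n:.0f} users per group")   # about 64 per group for these numbers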
Data Mountain • Robertson, “Data Mountain” (Microsoft)
Comparison of Info Vis Systems • Kobsa
Cleveland’s Rules for Secondary Tasks • Chewar et al.
Usability Test Setup • Set of benchmark tasks • Easy to hard, specific to open-ended • Coverage of different UI features • E.g. “find the 5 most expensive houses for sale” • Consent forms • Not needed unless video-taping the user’s face (new rule) • Experimenters: • Facilitator: instructs the user • Observers: take notes, collect data, video-tape the screen • Executor: runs the prototype if its functionality is faked • Users • 3-5 users, quality not quantity
Usability Test Procedure • Goal: mimic real life • Do not cheat by showing them how to use the UI! • Initial instructions • “We are evaluating the system, not you.” • Repeat: • Give user a task • Ask user to “think aloud” • Observe, note mistakes and problems • Avoid interfering, hint only if completely stuck • Interview • Verbal feedback • Questionnaire • ~1 hour / user
Usability Lab • E.g. McBryde 102
Data • Note taking • E.g. “&%$#@ user keeps clicking on the wrong button…” • Verbal protocol: think aloud • E.g. user expects that button to do something else… • Rough quantitative measures • HCI metrics: e.g. task completion time, .. • Interview feedback and surveys • Video-tape screen & mouse • Eye tracking, biometrics?
Analyze • Initial reaction: • “stupid user!”, “that’s developer X’s fault!”, “this sucks” • Mature reaction: • “how can we redesign UI to solve that usability problem?” • the user is always right • Identify usability problems • Learning issues: e.g. can’t figure out or didn’t notice feature • Performance issues: e.g. arduous, tiring to solve tasks • Subjective issues: e.g. annoying, ugly • Problem severity: critical vs. minor
Cost-Importance Analysis • Importance 1-5: (task effect, frequency) • 5 = critical, major impact on user, frequent occurrence • 3 = user can complete task, but with difficulty • 1 = minor problem, small speed bump, infrequent • Ratio = importance / cost • Sort by this ratio (see the sketch below) • 3 categories: must fix, next version, ignored
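The bookkeeping here is simple enough to script. A sketch that sorts hypothetical usability problems by the importance/cost ratio; all the numbers are invented for illustration.

    problems = [                          # (description, importance 1-5, fix cost)
        ("can't find the zoom feature", 5, 2),
        ("ugly icon colors", 1, 1),
        ("filter slider feels sluggish", 3, 4),
    ]

    for desc, imp, cost in sorted(problems, key=lambda p: p[1] / p[2], reverse=True):
        print(f"{desc}: importance/cost = {imp / cost:.2f}")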
Refine UI • Simple solutions vs. major redesigns • Solve problems in order of importance/cost • Example: • Problem: user didn’t know he could zoom in to see more… • Potential solutions: • Better zoom button icon, tooltip • Add a zoom bar slider (like Moosburg) • Icons for different zoom levels: boundaries, roads, buildings • NOT: more “help” documentation!!! You can do better. • Iterate • Test, refine, test, refine, test, refine, … • Until? Meets usability specification
Project revisited • For implementation projects: • Informal test • A few users • Not (tainted) info vis students • 102 lab not required • Simple data collection • Biometrics optional! • 1 iteration • Exploit this opportunity to improve your design • For experiment projects: • See controlled experiments