
HCI460: Week 8 Lecture



  1. HCI460: Week 8 Lecture • October 28, 2009

  2. Outline • Midterm Review • How Many Participants Should I Test? • Review • Exercises • Stats • Review of material covered last week • New material • Project 3 • Next Steps • Feedback on the Test Plans

  3. Midterm Review

  4. Midterm Review Overall N = 44 • Mean / average: 8.55 • Median: 8.75 • Mode: 10 (most frequent score)

  5. Midterm Review Q1: Heuristic vs. Expert Evaluation • Question: What is the main difference between a heuristic evaluation and an expert evaluation? • Answer: • Heuristic evaluation uses a specific set of guidelines or heuristics. • Expert evaluation relies on the evaluator’s expertise (including internalized guidelines) and experience. • No need to explicitly match issues to specific heuristics. • More flexibility.

  6. Midterm Review Q2: Research-Based Guidelines (RBGs) • Question: What is unique about the research-based guidelines on usability.gov relative to heuristics and other guidelines? What are the unique advantages of using the research-based guidelines? • Answer: • This is a comprehensive list of very specific guidelines (over 200); other guideline sets are much smaller, and their guidelines are more general. • RBGs were created by a group of experts (not an individual). • RBGs are specific to the web. • Unlike other heuristics and guidelines, RBGs have two ratings: • Relative importance to the success of a site • Helps prioritize issues. • Strength of research evidence that supports the guideline • Research citations lend credibility to the guidelines.

  7. Midterm Review Q3: Positive Findings • Question: Why should positive findings be presented in usability reports? • Answer: • To let stakeholders know what they should not change and which current practices they should try to emulate. • To make the report sound more objective and make stakeholders more receptive to the findings in the report. • Humans are more open to criticism if it is balanced with praise.

  8. Midterm Review Q4: Think-Aloud vs. Retrospective TA RTA ≠ post-task interview • Question: What is the main difference between the think-aloud protocol (TA) and the retrospective think-aloud protocol (RTA)? When should you use each of these methods and why? • Answer: • TA involves having the participant state what they are thinking while they are completing a task. • Great for formative studies; helps understand participant actions as they happen. • RTA is used after the task has been completed in silence. The participant walks through the task one more time (or watches a video of himself/herself performing the task) and explains their thoughts and actions. • Good when time on task and other quantitative behavioral measures need to be collected in addition to qualitative data. • Good for participants who may not be able to do TA.

  9. Midterm Review Q5: Time on Task in Formative UTs • Question: What are the main concerns associated with using time on task in a formative study with 5 participants? • Answer: • Formative studies often involve think-aloud protocol. • Time on task will be longer because thinking aloud takes more time and changes the workflow. • Sample size is too small for the time on task to generalize to the population or show significant differences between conditions.

  10. Midterm Review Q6: Human Error • Question: Why is the term “human error” no longer used in the medical field? • Answer: • “Human error” places the blame on the human when in fact errors usually result from problems with the design. • A more neutral term “use error” is used instead.

  11. Midterm Review Q7: Side-by-Side Moderation Moderating from another room with audio communication ≠ remote study • Question: When would you opt for side-by-side moderation in place of moderation from another room with audio communication? • Answer: • Side-by-side moderation is better when: • Building rapport with the participant is important (e.g., in formative think-aloud studies) • The moderator has to simulate interaction (e.g., paper prototype) • The tested object / interaction may be difficult to see via camera feed or through the one-way mirror

  12. How Many Participants Should I Test? Review from Last Week

  13. How Many Participants Should I Test? Overview

  14. How Many Participants Should I Test? Sample Size Calculator for Formative Studies Jeff Sauro’s Sample Size Calculator for Discovering Problems in a User Interface: http://www.measuringusability.com/problem_discovery.php
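A minimal sketch of the cumulative problem-discovery formula that calculators like this are typically based on, 1 − (1 − p)^n. The per-participant discovery probability p is an assumption you must supply (0.31 is a commonly cited average, used here for illustration):

```python
import math

def discovery_rate(p, n):
    """Expected proportion of problems seen at least once by n participants."""
    return 1 - (1 - p) ** n

def participants_needed(p, goal):
    """Smallest n such that discovery_rate(p, n) >= goal."""
    # (1 - p)**n <= 1 - goal  =>  n >= log(1 - goal) / log(1 - p)
    return math.ceil(math.log(1 - goal) / math.log(1 - p))

print(discovery_rate(0.31, 5))           # ~0.84: five users see ~84% of problems
print(participants_needed(0.31, 0.95))   # 9 users to see ~95% of problems
```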

  15. How Many Participants Should I Test? Sample Size for Precision Testing Sampling Error: • We need sufficient sample size to be able to generalize the results to the population. • Sample size for precision testing depends on: • Confidence level (usually 95% or 99%) • Desired level of precision • Acceptable sampling error (+/- 5%) • Size of population to which we want to generalize the results • Free online sample size calculator from Creative Research Systems: http://www.surveysystem.com/sscalc.htm
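A sketch of the standard proportion formula such calculators typically use (function name and defaults are illustrative; p = 0.5 is the conservative worst-case assumption about the response distribution):

```python
import math
from statistics import NormalDist

def sample_size(confidence, margin, population=None, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # e.g., 1.96 for 95%
    n = (z ** 2) * p * (1 - p) / margin ** 2          # infinite-population size
    if population is not None:                        # finite-population correction
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size(0.95, 0.05))          # 385 at 95% confidence, +/- 5%
print(sample_size(0.95, 0.05, 10_000))  # 370 when generalizing to 10,000 users
```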

  16. How Many Participants Should I Test? Sample Size for Precision Testing • Confidence level: 95% • When generalizing a score to the population, a high sample size is needed. • However, more is not always better. • Getting 2,000 participants is a waste.

  17. How Many Participants Should I Test? Sample Size for Hypothesis Testing • Hypothesis testing: comparing means • E.g., accuracy of typing on Device A is significantly better than it is on Device B. • Inferential statistics • Necessary sample size is derived from a calculation of power. • Under assumed criteria, the study will have a good chance of detecting a significant difference if the difference indeed exists. • Sample size depends on: • Assumed confidence level (e.g., 95%, 99%) • Acceptable sampling error (e.g., +/- 5%) • Expected effect size • Power • Statistical test (e.g., t-test, correlation, ANOVA)
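As a rough illustration, participants per group for a two-sample comparison can be estimated from the confidence level (alpha), power, and expected effect size (Cohen's d). This is a normal-approximation sketch, not the exact t-based calculation:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sample test (effect_size = Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)            # power requirement
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Cohen's benchmarks: small d = 0.2, medium d = 0.5, large d = 0.8
print(n_per_group(0.5))  # ~63 per group; Cohen's exact t-based table gives 64
```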

  18. How Many Participants Should I Test? Hypothesis Testing: Sample Size Table* • *Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159. http://www.math.unm.edu/~schrader/biostat/bio2/Spr06/cohen.pdf

  19. How Many Participants Should I Test? Reality • Usability tests do not typically require statistical significance. • Objectives dictate the type of study and the reasonable sample sizes necessary. • The sample size used is influenced by many factors, not all of them statistically driven. • Power analysis provides an estimate of the sample size necessary to detect a difference, if it does indeed exist. • Risk of not performing power analysis? • Too few → low power → inability to detect a difference • Too many → waste (and possibly finding differences that are not real) • What if you find significance even with a small sample size? • It is probably really there (at a certain p level).

  20. How Many Participants Should I Test? Exercises

  21. How Many Participants Should I Test? Exercise 1: Background Information • (Images: old vs. new package insert) • Package inserts for chemicals used in hospital labs were shortened and standardized to reduce cost. • E.g., many chemicals × many languages = high translation cost • New inserts: • ½ the size of the old inserts (booklet, not “map”) • More concise (charts, bullet points) • Less redundant • Users: Lab techs in hospitals

  22. How Many Participants Should I Test? Exercise 1: The Question • Client question: • Will the new inserts negatively impact user performance? • How many participants do you need for the study and why? • Exercise: • Discuss in groups and prepare questions for the client. • Q & A with the client • Come up with the answer and be prepared to explain it.

  23. How Many Participants Should I Test? Exercise 1: Possible Method • Each participant was asked to complete 30 search tasks • 2 insert versions × 3 chemicals × 5 search tasks • Sample question: “How long may the _____ be stored at the refrigeration temperature?” (for up to 7 days) • Task instructions were printed on a card and placed in front of the participant. • The tasks were timed. • Participants had a maximum of 1 minute to complete each task. • Those who exceeded the time limit were asked to move on to the next task. • To indicate that they were finished, participants had to: • Say the answer to the question out loud • Point to the answer in the insert

  24. How Many Participants Should I Test? Exercise 1: Possible Method

  25. How Many Participants Should I Test? Exercise 1: Sample Size

  26. How Many Participants Should I Test? Exercise 1: Sample Size • 32 lab techs: • 17 in the US • 15 in Germany

  27. How Many Participants Should I Test? Exercise 1: Results • Short inserts performed significantly better than the long inserts in terms of: (results chart not preserved in transcript)

  28. How Many Participants Should I Test? Exercise 2 • Client question: • Does our website work for our users? • How many participants do you need for the study and why? • Exercise: • Discuss in groups and prepare questions for the client. • Q & A with the client • Come up with the answer and be prepared to explain it.

  29. Stats: What you needed to hear…

  30. Stats Planted the Seed Last Week • Stats are: • Not learnable in an hour • More than just p-values • Powerful • Dangerous • Time-consuming • But you are the expert • Need to know: • Foundation • Rationale • Useful application

  31. Stats Foundation • Just as with any user experience research endeavor: • Think hard first (test plan) • Anticipate outcomes to keep you on track with objectives and not get pulled into tangents • Then begin research... • Definition of statistics: • A set of procedures for describing measurements and for making inferences about what is generally true • Statistics and experimental design go hand in hand • Objective → Method → Measures → Outcomes → Objectives

  32. Stats Measures • Ratio scales (interval + absolute zero) • A measure that has: • Comparable intervals (inches, feet, etc.) • An absolute zero (a distinction that may not seem meaningful to non-statisticians) • Differences are comparable • Device A is 72 inches in length while Device B is 36 inches • One is twice as tall as the other • Performance on Device A was 30 sec while on Device B it was 60 sec • Users were twice as fast completing the task on Device A as on Device B • Interval scales do not have a true zero • The difference between 40°F and 50°F equals the difference between 90°F and 100°F, but 0°F does not mean “no temperature” • Take-away: You get powerful statistics using ratio/interval measures

  33. Stats What Does Power Mean Again? • Statistical power • “The power to find a significant difference, if it does indeed exist” • Too little power → miss significance when it is really there • Too much power → may flag differences too small to matter • You can get MORE power by: • Adding more participants (but the impact is non-linear) • Having a greater “effect size,” which is the anticipated difference • Picking within-subjects designs over between-subjects designs • Using ratio/interval measures • Changing alpha • Practical power • Sample size costs money • If you find significance, then it is probably really true!

  34. Stats Other Measures (non-ratio/non-interval) • Likert scales • Rank data • Count data • Each of these measures uses a different statistical test • Power is different (reduced) • Consider Likert data: three responses, A, B, and C, marked on a 1–5 scale • Could say: • A = 1, B = 2, C = 5 • C came in 1st, B came in 2nd, and A came in 3rd • Precision influences power: the less precise your measure, the less power you have to detect differences

  35. Stats Between-Groups Designs • A between-groups study splits the sample into two or more groups • Each group interacts with only one device • What causes variability? • Measurement error • The tool or procedure can be imprecise • Starting and stopping the stopwatch • Unreliability • We are human, so if you test the same participant on different days, you might get a different time! • Individual differences • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem.

  36. Stats What About Within-Groups Designs? • A within-groups study has participants interact with all devices • What causes variability? • Measurement error • The tool or procedure can be imprecise • Starting and stopping the stopwatch • Unreliability • We are human, so if you test the same participant on different days, you might get a different time! • Individual differences • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem • No longer applies • Thus, fewer causes of variability result in more statistical power

  37. Stats More Common Statistical Tests • You are actually well aware of statistics – Descriptive statistics! • Measures of central tendency • Mean • Median • Mode • Definitions? • Mean = ? • Average • Median = ? • The exact point that divides the distribution into two parts such that an equal number fall above and below that point • Mode = ? • Most frequently occurring score
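A quick check of all three measures of central tendency on hypothetical scores (Python's standard library covers these directly):

```python
from statistics import mean, median, mode

scores = [7, 8.5, 8.75, 9, 10, 10, 10]  # hypothetical midterm scores
print(mean(scores))    # ~9.04 (the average)
print(median(scores))  # 9 (the middle score)
print(mode(scores))    # 10 (the most frequent score)
```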

  38. Stats When In Doubt, Plot • A normal distribution of randomly sampled scores: Mean = Median = Mode • (Figure: histogram with frequency on the y-axis and score, 1–5, on the x-axis) • Take scores: • 1 2 3 4 5 • 2 3 4 5 1 • 3 3 4 5 3

  39. Stats Skewed Distributions • Positive skew → tail on the right • Negative skew → tail on the left • Impact on measures of central tendency? • Mode • Median • Mean • “Central tendency”
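A small illustration with hypothetical, positively skewed task times: the long right tail drags the mean upward, so mode < median < mean.

```python
from statistics import mean, median, mode

# Hypothetical task times (seconds) with a tail on the right
times = [20, 22, 22, 23, 25, 26, 30, 45, 90]
print(mode(times))    # 22
print(median(times))  # 25
print(mean(times))    # ~33.7, pulled toward the tail
```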

  40. Stats Got Central Tendency, Now Variability • We must first understand variability • We tend to think of a mean as “it” or “everything” • Consider time on task as a metric • Measurement error • The tool or procedure can be imprecise • Starting and stopping the stopwatch • Individual differences • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem • Unreliability • We are human, so if you test the same participant on different days, you might get a different time!

  41. Stats Got Central Tendency, Now Variability • We must first understand variability • We tend to think of a mean as “it” or “everything” • Class scored 80 • School scored 76 • Many scores went into the mean score • Variability can be quantified • [draw]

  42. Stats Variability is Standardized (~Normalized) • (Figure: normal curve marked off in standard deviations) • Your score is in the 50th percentile • Ethan and Madeline are the smartest kids in class • On standardized test scores (e.g., SAT), you saw your score; how did they get a percentile? • The distribution is normal • Numerically, the distribution can be described by only: • Mean and standard deviation

  43. Stats Empirical Rule • (Figure: normal curve marked off in standard deviations) • Empirical rule = 68/95/99.7 • 68% within mean +/- 1 std dev • 95% within mean +/- 2 std dev • 99.7% within mean +/- 3 std dev
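The empirical rule in action, using a hypothetical distribution with mean 60 and std dev 10 (the 60%/10% success-rate example from the next slide); statistics.NormalDist is the standard-library normal distribution:

```python
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=10)     # e.g., success rate: mean 60%, std dev 10%
print(scores.cdf(60))                    # 0.5: the mean is the 50th percentile
print(scores.cdf(70) - scores.cdf(50))   # ~0.68 within 1 std dev
print(scores.cdf(80) - scores.cdf(40))   # ~0.95 within 2 std dev
print(scores.cdf(90) - scores.cdf(30))   # ~0.997 within 3 std dev
```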

  44. Stats Clear on Normal Curves? • Normal curves represent a single dataset on a single measure for a single sample • Once data are normalized, you can describe the dataset simply by: • Mean and standard deviation • E.g., a 60% success rate with a std dev of 10%

  45. Stats Randomly Sampled • Think of data as coming from a population

  46. Stats Things Can Happen By Chance Alone • Are these really two samples of 10 drawn from a population that may have these characteristics?
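A minimal simulation of the point (all values hypothetical): two samples of 10 drawn from the same population will usually have different means by chance alone.

```python
import random
from statistics import mean

random.seed(4)                                              # reproducible example
population = [random.gauss(50, 10) for _ in range(10_000)]  # one population

sample_a = random.sample(population, 10)   # two samples of 10 ...
sample_b = random.sample(population, 10)   # ... from the SAME population
print(round(mean(sample_a), 1), round(mean(sample_b), 1))
# The means differ even though no real difference exists.
```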

  47. Stats Exercise • Pass out Yes / No cards • Procedure • I will give you a task • I will count (because it is hard to install a timer on DePaul PCs) • Make a Yes or No decision • Record count • Hold up the Yes / No card • Practice

  48. Stats Exercise • Is the man wearing a red shirt? • Decide Yes or No • Note time • Hold up card • Ready? • Decide Yes or No • Note time • Hold up card

  49. Stats 1 sec

  50. Stats 2 sec
