This guide covers quantitative methods for experiment design, implementation, and evaluation in user interface studies. It discusses key concepts such as hypothesis testing, experimental variables, data collection, and statistical analysis. The guide emphasizes the importance of controlling variables, addressing privacy concerns, and aiming for generalizable results. Practical examples and considerations for experimental protocol, task design, and subject selection are included.
Evaluate—Quantitative Methods • October 4, 2007 • Turn in Project Proposal • [Title slide diagram: the iterative design cycle of Needs → Design → Implement → Evaluate]
Today • Quantitative methods • Scientific method • Aim for generalizable results • Privacy issues when collecting data
Quantitative methods • Reliably measure some aspect of an interface • Especially to measurably compare interfaces • Approaches • Controlled experiments • Doing Psychology Experiments, David W. Martin, 7th edition, 2007 • Collect usage data
Designing an experiment • State hypothesis • Identify variables • Independent • Dependent • Design experimental protocol • Apply for human subjects review • Select user population • Conduct experiment
Conducting experiment • Run pilot test • Collect data from running experiment • Perform statistical analysis • Interpret data, draw conclusions
Experiment hypothesis • Testable hypothesis • Precise statement of expected outcome • More specifically, how you predict the dependent variable (e.g., accuracy) will depend on the independent variable(s) • “Null” hypothesis (H0) • States that there will be no effect • e.g., “There will be no difference in performance between the two groups” • The data are used to try to disprove this null hypothesis
Experiment design • Independent variables • Attributes we manipulate / vary across conditions • Levels: the values an attribute can take • Dependent variables • Outcomes of the experiment; the measures used to evaluate it • Usually measures of user performance • Time to completion • Errors • Amount of production • Measures of satisfaction
Experiment design (2) • Control variables • Attributes that remain the same across conditions • Random variables • Attributes that are randomly sampled • Can be used to increase generalizability • Avoiding confounds • Confounds are attributes that changed across conditions but were not accounted for • Confounds prevent drawing conclusions about the independent variables
Example: Person picker • Picking from a list of names to invite to use a Facebook application • [Slide shows two mockups of the same name list (Bryan Tsao, Christine Robson, David Sun, John Tang, Jonathan Tong, …), one per design variant]
Example: Variables • Independent variables • Picture vs. no picture • Ordered horizontally or vertically • One column vs. two columns • Dependent variables • Time to complete • Error rate • User perception • Control variables • Test setting • List to pick from • Random variables • Subject demographics • Confounds • Only one woman in the list • List is mostly Asian names
Experimental design goals • Internal validity • Cause and effect: changes in the independent variables cause changes in the dependent variables • Eliminating confounds (turn them into independent variables or random variables) • Replicability of the experiment • External validity • Results generalizable to other settings • Ecological validity—generalizable to the real world • Confidence in results • Statistical power (number of subjects; at least 10)
Experimental protocol • Defining the task(s) • What are all the combinations of conditions? • How often to repeat each condition combination? • Between-subjects or within-subjects? • Avoiding bias (instructions, ordering)
Task • Defining the task(s) to test the hypotheses • Pictures will lead to fewer errors • Same time to pick users with and without pictures (H0) • Pictures will lead to higher satisfaction • How do you present the task? • Task: Users must select the following list of people to share the application with • Jonathan Tong • Christine Robson • David Sun
Motivating user tasks • Create scenario, movie plot for task • Immerse subject in story that removes them from “user testing” situation • Focus subject on goal, system becomes tool (and more subject to critique)
Number of Conditions • Consider all combinations to isolate the effects of each independent variable: • (2 orders) × (2 columns) × (2 formats) = 8 • Horizontal, 1 column, picture + text • Horizontal, 2 columns, picture + text • Vertical, 1 column, picture + text • Vertical, 2 columns, picture + text • Horizontal, 1 column, text only • Horizontal, 2 columns, text only • Vertical, 1 column, text only • Vertical, 2 columns, text only • Adding levels or factors grows the number of combinations exponentially (see the sketch below) • This can make experiments expensive!
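A quick way to enumerate a full factorial design is to take the cross product of every independent variable's levels. A minimal Python sketch, using the person-picker variables (the variable names are illustrative, not from the slides):

```python
from itertools import product

# Levels of each independent variable (person-picker example)
orders = ["horizontal", "vertical"]
columns = [1, 2]
formats = ["picture + text", "text only"]

# Full factorial design: every combination of levels
conditions = list(product(orders, columns, formats))
print(len(conditions))  # 2 * 2 * 2 = 8
for order, ncols, fmt in conditions:
    print(f"{order}, {ncols} column(s), {fmt}")
```

Adding a fourth two-level factor doubles the list to 16, which is why full factorial experiments get expensive fast.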
Reducing conditions • Vary only one independent variable at a time • But can miss interactions • Factor experiment into series of steps • Prune branches if no significant effects found
Choosing subjects • Balance the sample to reflect the diversity of the target user population (random variable) • Novices, experts • Age group • Gender • Culture • Example • 30 college-age subjects, normal or corrected-to-normal vision, with demographic distributions of gender and culture
Population as variable • Population as an independent variable • Identifies interactions • Adds conditions • Population as controlled variable • Consistency across experiment • Misses relevant features • Statistical post-hoc analysis can suggest need for further study • Collect all the relevant demographic info
Recruiting participants • “Subject pools” • Volunteers • Paid participants • Students (e.g., psych undergrads) for course credit • Friends, acquaintances, family, lab members • “Public space” participants - e.g., observing people walking through a museum • Must fit user population (validity) • Motivation is a big factor - not only $$ but also explaining the importance of the research
Current events: Population sampling issue • Election polling is currently conducted on land-line phones • Legacy practice • Laws about manual dialing of cell phones • Higher refusal rates • Cell phone users pay for incoming calls, so pollsters have to compensate recipients • What bias is introduced by excluding cell-phone-only users? • 7% of population (2004), growing to 15% (2008) • Which candidate claims the polls underrepresent them? http://www.npr.org/templates/story/story.php?storyId=14863373
Between subjects design • Different groups of subjects use different designs • 15 subjects use text only • 15 subjects use text + pictures
Within subjects design • All subjects try all conditions • (The same subjects complete both the text-only and text + pictures designs)
Within Subjects Designs • More efficient: • Each subject gives you more data; they complete more “blocks” or “sessions” • More statistical “power” • Each person is their own control, so fewer confounds • Therefore, can require fewer participants • May mean a more complicated design to avoid “order effects” • Participant may learn from the first condition • Fatigue may make second performance worse
Between Subjects Designs • Fewer order effects • Simpler design & analysis • Easier to recruit participants (only one session, shorter time) • Subjects can’t compare across conditions • Need more subjects • Need to do more to control for confounds (differences between groups)
Within Subjects: Ordering effects • Countering order effects • Equivalent tasks (less sensitive to learning) • Randomize order of conditions (random variable) • Counterbalance ordering (ensure all orderings covered) • Latin Square ordering (partial counterbalancing; see the sketch below)
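A balanced Latin square is one common way to implement partial counterbalancing: each condition appears in each serial position once, and (for an even number of conditions) each condition immediately precedes every other condition equally often. A minimal sketch using the standard 0, 1, n−1, 2, n−2, … construction (condition names are hypothetical; for an odd n you would also run each row reversed):

```python
def balanced_latin_square(conditions):
    """Return one row of condition orderings per participant slot."""
    n = len(conditions)
    square = []
    for r in range(n):
        row = []
        for i in range(n):
            if i == 0:
                c = 0
            elif i % 2 == 1:      # odd positions count up: 1, 2, 3, ...
                c = (i + 1) // 2
            else:                 # even positions count down: n-1, n-2, ...
                c = n - i // 2
            row.append(conditions[(c + r) % n])
        square.append(row)
    return square

# Cycle through the rows as participants arrive
for row in balanced_latin_square(["A", "B", "C", "D"]):
    print(row)
```

With four conditions this prints four orderings in which every condition occupies every serial position exactly once.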
Study setting • Lab setting • Complete control through isolation • Uniformity across subjects • Field study • Ecological validity • Variations across subjects
Before Study • Always pilot test first • Reveals unexpected problems • You can’t change the experiment design after collecting data • Make sure participants know you are testing the software, not them • (Usability testing, not user testing) • Maintain privacy • Explain procedures without compromising results • Participants can quit at any time • Administer a signed consent form
During Study • Always follow same steps—use checklist • Make sure participant is comfortable • Session should not be too long • Maintain relaxed atmosphere • Never indicate displeasure or anger
After Study • State how session will help you improve system (“debriefing”) • Show participant how to perform failed tasks • Don’t compromise privacy (never identify people, only show videos with explicit permission) • Data to be stored anonymously, securely, and/or destroyed
Exercise: Quantitative test • Pair up with someone who has a computer and has downloaded the files • DO NOT OPEN THE FILE (yet) • Make sure one of you has a stopwatch • Cell phone • Watch • The computer user will run the test; the observer will time the event
Exercise: Task • Open the file • Find the item in the list • Highlight that entry [slide shows an example of the highlighted entry]
Exercise: Variables • Independent variables • Dependent variables • Control variables • Random variables • Confound
Data Inspection • Look at the results • First look at each participant’s data • Were there outliers, people who fell asleep, anyone who tried to mess up the study, etc.? • Then look at aggregate results and descriptive statistics • “What happened in this study?” relative to hypothesis, goals
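One way to automate the outlier pass is the common 1.5 × IQR rule; a minimal sketch with made-up completion times (the threshold is a convention, not a rule from the slides):

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# Completion times in seconds, one per participant (made-up data)
times = [21.4, 19.2, 25.0, 22.8, 305.0, 24.1, 20.5]
print(iqr_outliers(times))  # [305.0] -- inspect that session before aggregating
```

A flagged value is a prompt to look at that participant's session (did they fall asleep? misunderstand the task?), not an automatic license to discard data.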
Descriptive Statistics • For all variables, get a feel for results: • Total scores, times, ratings, etc. • Minimum, maximum • Mean, median, ranges, etc. What is the difference between mean & median? Why use one or the other? • e.g. “Twenty participants completed both sessions (10 males, 10 females; mean age 22.4, range 18-37 years).” • e.g. “The median time to complete the task in the mouse-input group was 34.5 s (min=19.2, max=305 s).”
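These summaries are one-liners with Python's standard library; a quick sketch over hypothetical timing data, which also shows why mean and median can disagree:

```python
import statistics

times = [34.5, 19.2, 41.0, 28.7, 305.0, 33.1]  # seconds, hypothetical
print(f"n={len(times)}, min={min(times)}, max={max(times)}")
print(f"mean={statistics.mean(times):.1f}")      # ~76.9 s
print(f"median={statistics.median(times):.1f}")  # ~33.8 s
# The single 305 s outlier drags the mean far above the median --
# one reason skewed timing data are often reported as medians.
```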
Subgroup Stats • Look at descriptive stats (means, medians, ranges, etc.) for any subgroups • e.g. “The mean error rate for the mouse-input group was 3.4%. The mean error rate for the keyboard group was 5.6%.” • e.g. “The median completion times (in seconds) for the three groups were: novices: 4.4, moderate users: 4.6, and experts: 2.6.”
Plot the Data • Look for the trends graphically
Other Presentation Methods • Scatter plot [slide: time in secs. (0–20) plotted against age] • Box plot [slide: box spans the middle 50% of the data, with the mean marked and whiskers at the low and high extremes]
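A minimal matplotlib sketch of both plot types (matplotlib is an assumed dependency; the data are made up):

```python
import matplotlib.pyplot as plt

ages = [18, 21, 24, 30, 33, 37]            # hypothetical participants
times = [12.1, 9.8, 10.5, 8.9, 11.2, 7.4]  # completion times in seconds

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(ages, times)                   # trend of time vs. age
ax1.set_xlabel("Age")
ax1.set_ylabel("Time in secs.")
ax2.boxplot(times, showmeans=True)         # median, mean, middle 50%
ax2.set_ylabel("Time in secs.")
plt.show()
```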
Experimental Results • How does one know if an experiment’s results mean anything or confirm any beliefs? • Example: 40 people participated, 28 preferred interface 1, 12 preferred interface 2 • What do you conclude?
Inferential (Diagnostic) Stats • Tests to determine whether what you see in the data (e.g., differences in the means) is reliable (replicable) and likely caused by the independent variables rather than by random effects • e.g. t-test to compare two means • e.g. ANOVA (Analysis of Variance) to compare several means • e.g. test the “significance level” of a correlation between two variables
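With SciPy (an assumed dependency), both tests are one call each; the group data below are hypothetical, loosely echoing the novice/moderate/expert example above:

```python
from scipy import stats

novices  = [4.6, 4.2, 4.8, 4.1, 4.5]   # completion times (s), hypothetical
moderate = [4.7, 4.3, 4.9, 4.4, 4.6]
experts  = [2.5, 2.8, 2.4, 2.7, 2.6]

# t-test: are two means reliably different?
t, p = stats.ttest_ind(novices, experts)
print(f"t = {t:.2f}, p = {p:.4f}")

# One-way ANOVA: are several means reliably different?
f, p = stats.f_oneway(novices, moderate, experts)
print(f"F = {f:.2f}, p = {p:.4f}")
```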
Means Not Always Perfect • Experiment 1: Group 1 = 1, 10, 10 (mean 7); Group 2 = 3, 6, 21 (mean 10) • Experiment 2: Group 1 = 6, 7, 8 (mean 7); Group 2 = 8, 11, 11 (mean 10) • Same difference in means, but very different spreads within the groups
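Running both experiments through a t-test (SciPy again assumed) makes the point concrete: the mean difference is 3 in both cases, but the within-group spread changes the conclusion:

```python
from scipy import stats

# Same means (7 vs. 10) in both experiments, very different variability
experiments = {
    "Experiment 1": ([1, 10, 10], [3, 6, 21]),
    "Experiment 2": ([6, 7, 8], [8, 11, 11]),
}
for name, (g1, g2) in experiments.items():
    t, p = stats.ttest_ind(g1, g2)
    print(f"{name}: t = {t:.2f}, p = {p:.3f}")
# Experiment 1's large spread gives |t| of roughly 0.5 (no evidence of a
# difference); Experiment 2's tight groups give |t| of roughly 2.6.
```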
Inferential Stats and the Data • Ask diagnostic questions about the data: Are these really different? What would that mean?
Hypothesis Testing • Going back to the hypothesis—what do the data say? • Translate the hypothesis into an expected difference in the measure • If “First name” ordering is faster, then TimeFirst < TimeLast • Under the “null hypothesis” there should be no difference between the completion times: H0: TimeFirst = TimeLast
Hypothesis Testing • “Significance level” (p): • The probability of seeing the observed result simply by chance, if the null hypothesis were true • The cutoff or threshold level of p (the “alpha” level) is often set at 0.05, i.e., a 5% chance of getting the observed result purely by chance • e.g. if your statistical t-test (testing the difference between two means) returns t = 4.5 with p = .01, the difference between the means is statistically significant
Errors • Errors in analysis do occur • Main types: • Type I / false positive: you conclude there is a difference when in fact there isn’t • Type II / false negative: you conclude there is no difference when in fact there is
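A small simulation shows what the 0.05 alpha level implies: if both groups are drawn from the same population (so the null hypothesis is true), a t-test will still come out “significant” about 5% of the time. A sketch (SciPy assumed; all numbers are synthetic):

```python
import random
from scipy import stats

random.seed(1)
runs, false_positives = 2000, 0
for _ in range(runs):
    # Both groups sampled from the SAME population: H0 is true by design
    g1 = [random.gauss(10, 2) for _ in range(15)]
    g2 = [random.gauss(10, 2) for _ in range(15)]
    if stats.ttest_ind(g1, g2).pvalue < 0.05:
        false_positives += 1  # a Type I error

print(f"Type I error rate: {false_positives / runs:.3f}")  # close to 0.05
```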
Drawing Conclusions • Make your conclusions based on the descriptive stats, but back them up with inferential stats • e.g., “The expert group performed faster than the novice group t(1,34) = 4.6, p < .01.” • Translate the stats into words that regular people can understand • e.g., “Thus, those who have computer experience will be able to perform better, right from the beginning…”
Feeding Back Into Design • Your study was designed to yield information you can use to redesign your interface • What were the conclusions you reached? • How can you improve on the design? • What are quantitative redesign benefits? • e.g. 2 minutes saved per transaction, 24% increase in production, or $45,000,000 per year in increased profit • What are qualitative, less tangible benefit(s)? • e.g. workers will be less bored, less tired, and therefore more interested --> better customer service
Remote usability testing • Telephone or video communication • Screen-sharing technology • Microsoft NetMeeting https://www.microsoft.com/downloads/details.aspx?FamilyID=26c9da7c-f778-4422-a6f4-efb8abba021e&DisplayLang=en • VNC http://www.realvnc.com/ • Greater flexibility in recruiting subjects, environments
Usage logging • Embed logging mechanisms into code • Study usage in actual deployment • Some code can even “phone home” • e.g., Facebook usage metrics
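A minimal sketch of an embedded usage logger (the event names and log path are illustrative; a “phone home” variant would send the same records to a server instead, with the privacy safeguards discussed earlier):

```python
import json
import time

LOG_PATH = "usage_log.jsonl"  # illustrative local path

def log_event(event, **details):
    """Append one timestamped usage event as a JSON line."""
    record = {"t": time.time(), "event": event, **details}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example call sites sprinkled through the UI code:
log_event("picker_opened", variant="pictures")
log_event("person_selected", index=3, elapsed_ms=840)
```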
Example: Rhythmic Work Activity • Drawn from about 50 Awarenex (IM) users • Bi-coastal teams (3-hour time difference) • Work from home team members • Based on up to 2 years of collected data Sun Microsystems Laboratories: James "Bo" Begole, Randall Smith, and Nicole Yankelovich
Activity Data Collected • Activity information • Input device activity (1-minute granularity) • Device location (office, home, mobile) • Email fetching and sending • Online calendar appointments Activity ≠ Availability