IS 4800 Empirical Research Methods for Information Science
Class Notes, Feb. 24, 2012
Instructor: Prof. Carole Hafner, 446 WVH
hafner@ccs.neu.edu, Tel: 617-373-5116
Course Web site: www.ccs.neu.edu/course/is4800sp12/
Types of Quantitative Studies We've Discussed
• Observational
• Survey
• Experimental
  • One-factor, two-level, between-subjects
  • One-factor, two-level, within-subjects, aka "repeated measures" or "crossover"
  • Matched pairs
Types of Experimental Designs: Between-Subjects Design
• Different groups of subjects are randomly assigned to the levels of your independent variable
• Data are averaged for analysis
• Use the t-test for independent means
• Example: a "single factor, two-level, between-subjects" design, with Level A = Word vs. Level B = Wizziword
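For concreteness, a minimal Python sketch (not part of the original notes) of the t-test for independent means using scipy; the Word vs. Wizziword scores are invented for illustration.

```python
from scipy import stats

# Hypothetical task-completion times (minutes) for two independently sampled groups.
word_scores      = [23.5, 19.2, 25.1, 21.8, 24.0, 20.6]   # Level A = Word (invented data)
wizziword_scores = [18.4, 17.9, 21.3, 16.5, 19.8, 18.1]   # Level B = Wizziword (invented data)

# t-test for independent means: each score comes from a different subject,
# randomly assigned to one of the two levels.
t, p = stats.ttest_ind(word_scores, wizziword_scores)
print(f"t = {t:.2f}, p = {p:.3f}")
```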
Types of Experimental Designs: Within-Subjects Design
• A single group of subjects is exposed to all levels of the independent variable; aka "repeated measures design" or "crossover design"
• Data are averaged for analysis
• Use the t-test for dependent means, aka "paired-samples t-test"
• We will discuss "single factor, two-level, within-subjects" designs
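A companion sketch (again, not from the notes) of the t-test for dependent means: scipy's paired-samples test on made-up scores from the same subjects under both levels.

```python
from scipy import stats

# Invented scores: each subject is measured under both levels of the independent variable.
scores_level_a = [23.5, 19.2, 25.1, 21.8, 24.0]
scores_level_b = [18.4, 17.9, 21.3, 16.5, 19.8]   # same subjects, other level

# t-test for dependent means (paired-samples t-test): pairs scores subject by subject.
t, p = stats.ttest_rel(scores_level_a, scores_level_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```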
Between-Subjects Design
• Each group is a sample from a population
• Big question: are the populations the same (null hypothesis), or are they significantly different?
Sidebar: Randomization
• Crucial: the method must not be applied subjectively
• The point in time at which randomization occurs is important
• [Timeline: recruiting → randomization → experiment → final measures]
Sidebar: Randomization
• Simple randomization
  • Flip a coin
  • Random number generator
  • Table of random numbers
  • Partition the numeric range into the number of conditions
• Problems?
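One way to see the "Problems?" prompt: a quick, illustrative Python sketch of simple randomization. Because each assignment is an independent coin flip, the resulting group sizes can end up noticeably unequal, which motivates blocking.

```python
import random

def simple_randomize(n_subjects, conditions=("A", "B")):
    """Assign each subject independently at random (one 'coin flip' per subject)."""
    return [random.choice(conditions) for _ in range(n_subjects)]

assignments = simple_randomize(12)
print(assignments)
print({c: assignments.count(c) for c in ("A", "B")})   # group sizes may be far from 6/6
```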
Sidebar: Randomization
• Blocked randomization
  • Avoids serious imbalances in the assignment of subjects to conditions
  • Guarantees that the imbalance will never be larger than a specified amount
  • Example: we want to ensure that for every 4 subjects, an equal number is assigned to each of 2 conditions => a "block size of 4"
  • Method: write out all orderings of one block, with each condition appearing equally often. Example: AABB, ABAB, BAAB, BABA, BBAA, ABBA
  • At the start of each block, select one of the orderings at random
  • Block sizes > 2 should be used
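An illustrative Python sketch of blocked randomization as just described (the function name and defaults are mine, not the course's): it enumerates the orderings of one block and picks one at random per block.

```python
import itertools
import random

def blocked_randomize(n_subjects, conditions=("A", "B"), block_size=4):
    """Within every block of block_size subjects, each condition appears equally often."""
    per_block = block_size // len(conditions)
    # All distinct orderings of one block: AABB, ABAB, ABBA, BAAB, BABA, BBAA in this case.
    block_orders = sorted(set(itertools.permutations(conditions * per_block)))
    assignments = []
    while len(assignments) < n_subjects:
        assignments.extend(random.choice(block_orders))   # one randomly chosen ordering per block
    return assignments[:n_subjects]

print(blocked_randomize(12))
```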
Sidebar: Randomization
• Stratified randomization
  • First stratify subjects based on factors measured prior to randomization (e.g., gender)
  • Within each stratum, randomize, using either simple or blocked randomization

  Stratum   Sex   Condition assignment
  1         M     ABBA BABA …
  2         F     BABA BBAA …
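A hypothetical Python sketch of stratified blocked randomization; the subject roster, stratum labels, and block orderings are invented to mirror the table above.

```python
import random

BLOCK_ORDERS = ("AABB", "ABAB", "ABBA", "BAAB", "BABA", "BBAA")

def stratified_blocked_randomize(subjects):
    """Randomize within each stratum (here: sex), using blocks of 4."""
    strata, assignments = {}, {}
    for sid, sex in subjects:                      # group subject ids by stratum
        strata.setdefault(sex, []).append(sid)
    for members in strata.values():
        schedule = []
        while len(schedule) < len(members):        # build a block schedule for this stratum
            schedule.extend(random.choice(BLOCK_ORDERS))
        for sid, condition in zip(members, schedule):
            assignments[sid] = condition
    return assignments

subjects = [(1, "M"), (2, "F"), (3, "M"), (4, "F"), (5, "M"), (6, "F")]   # invented roster
print(stratified_blocked_randomize(subjects))
```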
Within-Subjects Designs: Benefits
• More power! Why? Controls for all inter-subject variability
• A randomized between-subjects design just balances the effects between groups
• (A matched-pairs design controls for identified and matched extraneous variables)
The Problem of Error Variance
• Error variance is the variability among scores not caused by the independent variable
• Error variance is common to all experimental designs
• Error variance is handled differently in each design
• Sources of error variance ("extraneous variables"):
  • Individual differences among subjects
  • Environmental conditions that are not constant across levels of the independent variable
  • Fluctuations in the physical/mental state of an individual subject
[Diagram: Independent Variable + Error Variance (individual differences, environmental conditions) → Measured Outcomes]
Handling Error Variance
• Taking steps to reduce error variance
  • Hold extraneous variables constant by treating subjects as similarly as possible
  • Match subjects on crucial characteristics
• Increasing the effectiveness of the independent variable
  • Strong manipulations yield less error variance than weak manipulations
Matched Group Design
[Diagram: Match pairs → Randomize → Treatment 1 / Treatment 2]
• Use when you know some third variable has a significant correlation with the outcome
• A between-subjects design
• Use the paired-samples t-test!
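An illustrative Python sketch of forming matched pairs: subjects are ranked on a hypothetical matching variable (a pre-test score invented here), adjacent subjects are paired, and treatments are randomized within each pair; outcomes are later compared with a paired-samples t-test.

```python
import random

# Invented pre-test scores used as the matching variable (assumes an even number of subjects).
pretest = {"s1": 42, "s2": 87, "s3": 55, "s4": 83, "s5": 40, "s6": 58}

ranked = sorted(pretest, key=pretest.get)                     # rank subjects on the matching variable
pairs = [ranked[i:i + 2] for i in range(0, len(ranked), 2)]   # adjacent subjects form a pair

assignment = {}
for first, second in pairs:
    t1, t2 = random.sample((first, second), 2)                # randomize treatments within the pair
    assignment[t1], assignment[t2] = "Treatment 1", "Treatment 2"
print(assignment)

# Outcomes would then be compared pair by pair, e.g. scipy.stats.ttest_rel(t1_scores, t2_scores).
```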
Handling Error Variance
• Randomizing error variance across groups
  • Distribute error variance equivalently across levels of the independent variable
  • Accomplished with random assignment of subjects to levels of the independent variable
• Statistical analysis
  • Random assignment tends to equalize error variance across groups, but does not guarantee that it will
  • You can estimate the probability that observed differences are due to error variance by using inferential statistics
Within-Subjects Designs
• Subjects are not randomly assigned to treatment conditions; the same subjects are used in all conditions
• Closely related to the matched-groups design
• Advantages
  • Reduces error variance due to individual differences among subjects across treatment groups
  • Reduced error variance results in a more powerful design: effects of the independent variable are more likely to be detected
Within-Subjects Designs: Disadvantages
• More demanding on subjects, especially in complex designs
• Subject attrition is a problem
• Carryover effects: exposure to a previous treatment affects performance in a subsequent treatment
Carryover Example: Embodied Conversational Agents to Promote Health Literacy for Older Adults
[Diagram: two conditions (Brochure vs. Computer); Diabetes Knowledge Assessment administered at T0, T1, and T2]
Sources of Carryover
• Learning: learning a task in the first treatment may affect performance in the second
• Fatigue: fatigue from earlier treatments may affect performance in later treatments
• Habituation: repeated exposure to a stimulus may lead to unresponsiveness to that stimulus
• Sensitization: exposure to a stimulus may make a subject respond more strongly to another stimulus
• Contrast: subjects may compare treatments, which may affect behavior
• Adaptation: if a subject undergoes adaptation (e.g., dark adaptation), earlier results may differ from later ones
Dealing With Carryover Effects
• Counterbalancing
  • The various treatments are presented in a different order for different subjects
  • May be complete or partial
  • Balances the effects of carryover on each treatment
  • Assumes the carryover effect is independent of the order
Dealing With Carryover Effects
• Taking steps to minimize carryover: techniques such as pre-training, practice sessions, or rest periods between treatments can reduce some forms of carryover
• Making treatment order an independent variable: allows you to measure the size of carryover effects, which can be taken into account in future experiments
Dealing With Carryover Effects: The Latin Square Design
• A sample partial counterbalancing approach
• Used when you make the number of treatment orders equal to the number of treatments (each treatment occurs once in every row and column)
• Example: we want to evaluate 4 different word processors, using 4 admins in 4 departments. A completely counterbalanced design would require 4 x 4 x 4 = 64 trials. The Latin square attempts to eliminate systematic bias in the assignment of treatments to departments and subjects.

  Subj   Dept 1   Dept 2   Dept 3   Dept 4
   1       C        B        A        D
   2       B        A        D        C
   3       D        C        B        A
   4       A        D        C        B
  (Treatments A-D)
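For illustration, a small Python sketch (not the course's procedure) that builds a Latin square by cyclic shifts of a shuffled treatment order, so each treatment appears once per row and column; this simple cyclic construction is only one way to build such a square.

```python
import random

def latin_square(treatments):
    """Each treatment occurs once in every row and column (rows = subjects, columns = departments)."""
    order = list(treatments)
    random.shuffle(order)                 # random starting order
    n = len(order)
    return [[order[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square("ABCD"):
    print(" ".join(row))
```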
Example of a Counterbalanced Single-Factor Design With Two Treatments
• Order 1: treatment sequence A, B
• Order 2: treatment sequence B, A

  Subject        1      2     …
  Order          2      1     …
  Treatment A    23.5   14.6  …
  Treatment B    14.2   11.5  …

How do you test for "order effects"?
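One common way to answer the order-effects question, sketched in Python with invented numbers (a sketch, not the analysis prescribed in the course): compute each subject's A - B difference and test whether it differs between the two presentation orders; a significant difference suggests an order (carryover) effect.

```python
from scipy import stats

# Invented scores; order_ab subjects received A then B, order_ba subjects received B then A.
order_ab = {"A": [23.5, 22.1, 24.7], "B": [14.2, 15.0, 13.8]}
order_ba = {"A": [14.6, 15.9, 16.2], "B": [11.5, 12.3, 10.9]}

# Per-subject treatment differences within each order group.
diff_ab = [a - b for a, b in zip(order_ab["A"], order_ab["B"])]
diff_ba = [a - b for a, b in zip(order_ba["A"], order_ba["B"])]

# If order did not matter, the mean A - B difference should be similar in both groups.
t, p = stats.ttest_ind(diff_ab, diff_ba)
print(f"order effect: t = {t:.2f}, p = {p:.3f}")
```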
Types of Studies We've Discussed
• Review the pros and cons of between-subjects and within-subjects designs
• What is matched pairs?
You’ve developed a new web-based help system for your email client. You want to compare your system to the old printed manual. Example – Best Design?
You’ve just developed the “Matchmaker” – a handheld device that beeps when you are in the vicinity of a compatible person who is also carrying a Matchmaker. You evaluate the number of users who are married after six months of use compared to a non-intervention control group. Example – Best Design?
You’ve just developed “Reado Speedo” that reads print books using OCR and speaks them to you at twice your normal reading rate. You want to evaluate your product against the old fashioned way on reading rate, comprehension and satisfaction. Example – Best Design?
Introduction to Usability Testing
• I. Summative evaluation: measure/compare user performance and satisfaction
  • Quantitative measures
  • Statistical methods
• II. Formative evaluation: identify usability problems
  • Quantitative and qualitative measures
  • Ethnographic methods such as interviews, focus groups
Usability Goals (Nielsen)
• Learnability
• Efficiency
• Memorability
• Error avoidance/recovery
• User satisfaction
Operationalize these goals to evaluate usability
What is a Usability Experiment?
• Usability testing in a controlled environment
  • There is a test set of users
  • They perform pre-specified tasks
  • Data are collected (quantitative and qualitative)
  • Take the mean and/or median value of measured attributes
  • Compare to a goal or to another system
• Contrasted with "expert review" and "field study" evaluation methodologies
• The growth of usability groups and usability laboratories
Experimental factors
• Subjects: representative; a sufficient sample
• Variables
  • Independent variable (IV): the characteristic changed to produce different conditions, e.g. interface style, number of menu items
  • Dependent variable (DV): the characteristics measured in the experiment, e.g. time taken, number of errors
Experimental factors (cont.)
• Hypothesis: prediction of the outcome, framed in terms of IV and DV
  • Null hypothesis: states no difference between conditions; the aim is to disprove this
• Experimental design
  • Within-groups design: each subject performs the experiment under each condition; transfer of learning is possible; less costly and less likely to suffer from user variation
  • Between-groups design: each subject performs under only one condition; no transfer of learning; more users required; variation can bias results
Summative Analysis: What to measure? (and its relationship to a usability goal)
• Total task time
• User "think time" (dead time??)
• Time spent not moving toward the goal
• Ratio of successful actions to errors
• Commands used/not used
• Frequency of user expressions of confusion, frustration, satisfaction
• Frequency of reference to manuals/help system
• Percent of time such reference provided the needed answer
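A hypothetical Python sketch of deriving two of these measures from a session log; the log structure and field names are invented, not from any real logging tool.

```python
# Invented session records: one dict per test session.
sessions = [
    {"task_time_s": 412, "successful_actions": 18, "errors": 3},
    {"task_time_s": 365, "successful_actions": 22, "errors": 1},
    {"task_time_s": 498, "successful_actions": 15, "errors": 6},
]

mean_time = sum(s["task_time_s"] for s in sessions) / len(sessions)          # total task time, averaged
ratios = [s["successful_actions"] / max(s["errors"], 1) for s in sessions]   # successful actions per error

print(f"mean total task time: {mean_time:.0f} s")
print("success/error ratios:", [round(r, 1) for r in ratios])
```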
Measuring User Performance
• Measuring learnability
  • Time to complete a set of tasks
  • Learnability/efficiency trade-off
• Measuring efficiency
  • Time to complete a set of tasks
  • How to define and locate "experienced" users
• Measuring memorability
  • The most difficult, since "casual" users are hard to find for experiments
  • Memory quizzes may be misleading
Measuring User Performance (cont.)
• Measuring user satisfaction
  • Likert scale (agree or disagree)
  • Semantic differential scale
  • Physiological measures of stress
• Measuring errors
  • Classification of minor vs. serious
Reliability and Validity
• Reliability means repeatability; statistical significance is a measure of reliability. Reliability is difficult to achieve because of the high variability in individual user performance.
• Validity means the results will transfer to a real-life situation; it depends on matching the users, tasks, and environment.
Formative Evaluation: What is a Usability Problem?
• Unclear: the planned method for using the system is not readily understood or remembered (information design level)
• Error-prone: the design leads users to stray from the correct operation of the system (any design level)
• Mechanism overhead: the mechanism design creates awkward work flow patterns that slow down or distract users
• Environment clash: the design of the system does not fit well with the users' overall work processes (any design level). Ex: an incomplete transaction cannot be saved
Qualitative methods for collecting usability problems
• Thinking-aloud studies
  • Difficult to conduct; experimenter prompting must be non-directive
  • Alternatives: constructive interaction, coaching method, retrospective testing
  • Output: notes on what users did and expressed: goals, confusions or misunderstandings, errors, reactions
• Questionnaires: should be usability-tested beforehand
• Focus groups, interviews
Observational Methods - Think Aloud
• The user is observed performing a task and asked to describe what he is doing and why, what he thinks is happening, etc.
• Advantages
  • Simplicity: requires little expertise
  • Can provide useful insight
  • Can show how the system is actually used
• Disadvantages
  • Subjective and selective
  • The act of describing may alter task performance
Observational Methods - Cooperative Evaluation
• A variation on think-aloud: the user collaborates in the evaluation, and both user and evaluator can ask each other questions throughout
• Additional advantages
  • Less constrained and easier to use
  • The user is encouraged to criticize the system
  • Clarification is possible
Observational Methods - Protocol Analysis
• Paper and pencil: cheap, but limited to writing speed
• Audio: good for think-aloud, but difficult to match with other protocols
• Video: accurate and realistic, but needs special equipment and is obtrusive
• Computer logging: automatic and unobtrusive, but large amounts of data are difficult to analyze
• User notebooks: coarse and subjective, but give useful insights; good for longitudinal studies
• Mixed use in practice. Transcription of audio and video is difficult and requires skill; some automatic support tools are available
Query Techniques - Interviews
• The analyst questions the user on a one-to-one basis, usually based on prepared questions
• Informal, subjective, and relatively cheap
• Advantages
  • Can be varied to suit the context
  • Issues can be explored more fully
  • Can elicit user views and identify unanticipated problems
• Disadvantages
  • Very subjective
  • Time-consuming
Query Techniques - Questionnaires
• A set of fixed questions given to users
• Advantages
  • Quick, and reaches a large user group
  • Can be analyzed more rigorously
• Disadvantages
  • Less flexible
  • Less probing
Laboratory studies: Pros and Cons
• Advantages: specialist equipment available; uninterrupted environment
• Disadvantages: lack of context; difficult to observe several users cooperating
• Appropriate if the actual system location is dangerous or impractical, or to allow controlled manipulation of use
Steps in a usability experiment
• The planning phase
• The execution phase
• Data collection techniques
• Data analysis
The planning phase
• Who, what, where, when and how much?
  • Who are the test users, and how will they be recruited?
  • Who are the experimenters?
  • When, where, and how long will the test take?
  • What equipment/software is needed?
  • How much will the experiment cost?
• Prepare a detailed test protocol
  • *What test tasks? (written task sheets)
  • *What user aids? (written manual)
  • *What data collected? (include questionnaire)
  • How will results be analyzed/evaluated?
• Pilot test the protocol with a few users
Detailed Test Protocol
• What tasks? Criteria for completion?
• User aids
• What will users be asked to do (thinking-aloud studies)?
• Interaction with the experimenter
• What data will be collected?
• All materials to be given to users as part of the test, including a detailed description of the tasks
Execution phase
• Prepare the environment, materials, and software
• The introduction should include:
  • purpose (evaluating the software)
  • voluntary participation and confidentiality
  • an explanation of all procedures: recording, question-handling
  • an invitation to ask questions
• During the experiment
  • give the user written task description(s), one at a time
  • only one experimenter should talk
• De-briefing
Execution phase: ethics of human experimentation applied to usability testing
• Users feel exposed using unfamiliar tools and making errors
• Guidelines:
  • Re-assure users that individual results will not be revealed
  • Re-assure users that they can stop at any time
  • Provide a comfortable environment
  • Don't laugh, and don't refer to users as subjects or guinea pigs
  • Don't volunteer help, but don't allow the user to struggle too long
  • In the de-briefing: answer all questions, reveal any deception, and thank users for helping
Execution Phase: Designing Test Tasks
Tasks should:
• Be representative
• Cover the most important parts of the UI
• Not take too long to complete
• Be goal- or result-oriented (possibly with a scenario)
• Not be frivolous or humorous (unless that is part of the product goal)
The first task should build confidence; the last task should create a sense of accomplishment.