SIMS 247 Lecture 11 Evaluating Interactive Interfaces

Presentation Transcript


  1. SIMS 247 Lecture 11 Evaluating Interactive Interfaces February 24, 1998 Marti Hearst SIMS 247

  2. Many slides from this lecture are closely derived from some created by Professor James Landay, 1997. Also, from Chapter 4 of Shneiderman 97 Marti Hearst SIMS 247

  3. Outline • Why do evaluation? • What types of evaluations? • Choosing participants • Designing the test • Collecting data • Analyzing the data • Drawing conclusions Marti Hearst SIMS 247

  4. Why do Evaluation? • To tell how good or bad a visualization is • People must use it to evaluate it • Must compare against the status quo • Something that looks useful to the designer might be too complex or superfluous for real users • For iterative design • Interface might be almost right but require adjustments • The interactive components might have problems • To advance our knowledge of how people understand and use technology Marti Hearst SIMS 247

  5. Types of Evaluation • Expert Reviews • Usability Testing • Controlled Psychologically-Oriented Experiments • There are tradeoffs for each • see McGrath paper Marti Hearst SIMS 247

  6. Types of Evaluation • Expert Reviews • Heuristic Evaluation • An expert critiques the design with respect to certain performance criteria • Cognitive Walkthrough • Make a low-fidelity prototype • Designers “walk through” or simulate how a user would use the design under various circumstances • Relies on relevance of experts’ opinions (as opposed to what users experience) Marti Hearst SIMS 247

  7. Types of Evaluation • Usability Testing • Goals • Rapid assessment with real users • Find problems with the design • Techniques • Carefully chosen set of tasks • Only a few participants (as opposed to a scientific experiment) • Results • Recommended changes • (as opposed to acceptance or rejection of hypothesis) Marti Hearst SIMS 247

  8. Types of Evaluation • Techniques for Usability Testing • “Thinking Aloud” while using the system • Acceptance Tests • Does the interface meet the performance objectives? • Time for users to learn specific functions • Speed of task performance • Rate of errors • User retention of commands over time • Surveys • Focus group discussions • Field Tests Marti Hearst SIMS 247

  9. Types of Evaluation • Controlled Psychologically-Oriented Experiments • Usually part of a theoretical framework • Propose a testable hypothesis • Identify a small number of independent variables to manipulate • Choose the dependent variables to measure • Judiciously choose participants • Control for biasing factors • Apply statistical methods to data analysis • Place results within the theory, or refine or refute the theory if necessary, point direction to future work Marti Hearst SIMS 247

  10. Choosing Participants • Should be representative of eventual users, in terms of • job-specific vocabulary / knowledge • tasks • If you can’t get real users, get an approximation • system intended for doctors • get medical students • system intended for electrical engineers • get engineering students • Use incentives to get participants Marti Hearst SIMS 247

  11. Ethical Considerations • Sometimes tests can be distressing • users leave in tears • users can be embarrassed by mistakes • You have a responsibility to alleviate this • make voluntary • use informed consent • avoid pressure to participate • let them know they can stop at any time • stress that you are testing the system, not them • make collected data as anonymous as possible • Often must get human subjects approval Marti Hearst SIMS 247

  12. User Study Proposal • A report that contains • objective • description of system being tested • task environment & materials • participants • methodology • tasks • test measures • Get approved • Once this is done, it is useful for writing the final report Marti Hearst SIMS 247

  13. Selecting Tasks • Should reflect what real tasks will be like • may need to shorten if • they take too long • require background that test user won’t have • Be sure tasks measure something directly related to your design • But don’t bias the tasks so that only your design can win • should be a realistic task in order to avoid this • Don’t choose tasks that are too fragmented Marti Hearst SIMS 247

  14. Special Considerations for Evaluating Visualizations • Be careful about what is being compared • Example of how to do it wrong: • One study compared a web path history visualization that had • thumbnails • fisheye properties • hierarchical layout • against the Netscape textual history list • Problem: • too many variables changed at once! • can’t tell which of the novel properties caused the effects Marti Hearst SIMS 247

  15. Important Factors • Novices vs. Experts • often no effect is found for experts, or experts are slowed down at the same time that novices are helped • experts might know the domain while novices do not • need to try to separate learning about the domain from learning about the visualization Marti Hearst SIMS 247

  16. Important Factors • Perceptual abilities • spatial abilities tests • colorblindness • handedness (lefthanded vs. righthanded) Marti Hearst SIMS 247

  17. The “Thinking Aloud” Method • This is for usability testing, not formal experiments • Need to know what users are thinking, not just what they are doing • Ask users to talk while performing tasks • tell us what they are thinking • tell us what they are trying to do • tell us questions that arise as they work • tell us things they read • Make a recording or take good notes • make sure you can tell what they were doing Marti Hearst SIMS 247

  18. Thinking Aloud (cont.) • Prompt the user to keep talking • “tell me what you are thinking” • Only help on things you have pre-decided • keep track of anything you do give help on • Recording • use a digital watch/clock • take notes • keep a computerized log of what actions were taken • if possible • record audio and video Marti Hearst SIMS 247
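
A sketch of what the computerized action log mentioned above might look like; the file name and event labels are placeholders, not something from the lecture.

```python
# Sketch of a timestamped action log for a think-aloud session.
# The file name and event labels are placeholders.
import datetime

LOG_FILE = "session01_actions.log"

def log_action(event, detail=""):
    """Append one timestamped line per user action or observer note."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(LOG_FILE, "a") as f:
        f.write(f"{stamp}\t{event}\t{detail}\n")

log_action("clicked", "history button")
log_action("observer_note", "participant asked what the fisheye view shows")
```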

  19. Pilot Study • Goal: • help fix problems with the study • make sure you are measuring what you mean to be • Procedure: • do twice, • first with colleagues • then with real users • usually end up making changes both times Marti Hearst SIMS 247

  20. Instructions to Participants (Gomoll 90) • Describe the purpose of the evaluation • “I’m testing the product; I’m not testing you” • Tell them they can quit at any time • Demonstrate the equipment • Explain how to think aloud • Explain that you will not provide help • Describe the task • give written instructions Marti Hearst SIMS 247

  21. Designing the Experiment • Reducing variability • recruit test users with similar background • brief users to bring them to common level • perform the test the same way every time • don’t help some more than others (plan in advance) • make instructions clear • control for outside factors • Evaluating an interface that uses web hyperlinks can cause problems • variability in network traffic can affect results • participants should be run under as similar conditions as possible • try to eliminate outside interruptions Marti Hearst SIMS 247

  22. Comparing Two Alternatives • Between groups experiment • two groups of test users • each group uses only 1 of the systems • Within groups experiment • one group of test users • each person uses both systems • can’t use the same tasks (learning effects) • See if differences are statistically significant • assumes normal distribution & same std. dev. Marti Hearst SIMS 247
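
As a rough illustration of the last bullet, a minimal between-groups comparison might be run as a two-sample t-test. The timing data below are invented, and the sketch assumes scipy is available; with so few participants per group a real study would need more data.

```python
# Between-groups comparison of task-completion times (seconds) for designs A and B.
# The numbers are invented for illustration.
from scipy import stats

times_a = [95, 110, 87, 102, 120, 98]   # group that used design A
times_b = [80, 92, 75, 88, 101, 79]     # group that used design B

# equal_var=True mirrors the slide's assumptions: roughly normal data with
# the same standard deviation in both groups.
t_stat, p_value = stats.ttest_ind(times_a, times_b, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant")
```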

  23. Experimental Details • Order of tasks • between groups • choose one simple order (simple -> complex) • within groups • must vary the ordering to make sure there are no effects based on the order in which the tasks occurred • Training • depends on how the real system will be used • What if someone doesn’t finish • assign very large time & large # of errors Marti Hearst SIMS 247
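
One simple way to vary task ordering in a within-groups design is to rotate the task list across participants. This is only a sketch with placeholder task names; a fully balanced Latin square would control order effects more thoroughly.

```python
# Rotate task order across participants so no single ordering dominates.
# Task names are placeholders.
tasks = ["find_document", "compare_versions", "summarize_history", "bookmark_page"]

def order_for(participant_index):
    """Shift the task list by one position per participant."""
    shift = participant_index % len(tasks)
    return tasks[shift:] + tasks[:shift]

for p in range(6):          # six hypothetical participants
    print(p, order_for(p))
```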

  24. Measurements • Attributes that are useful to measure • time requirements for task completion • successful task completion • compare two designs on speed or # of errors • application-specific measures • e.g., how many web pages visited • Time is easy to record • Error or successful completion is harder • define in advance what these mean Marti Hearst SIMS 247
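
A minimal sketch of recording these measurements per task, assuming the success and error definitions have been fixed in advance. The file name, field names, and the perform_task callable are illustrative placeholders.

```python
# Record per-task measurements: elapsed time, completion, and error count.
# perform_task stands in for running one task with one participant and
# returning (completed, error_count) under the pre-agreed definitions.
import csv
import time

def run_task(participant_id, task_name, perform_task):
    start = time.monotonic()
    completed, error_count = perform_task()
    elapsed = time.monotonic() - start
    with open("measurements.csv", "a", newline="") as f:
        csv.writer(f).writerow(
            [participant_id, task_name, round(elapsed, 1), completed, error_count])
```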

  25. Measuring User Preference • How much users like or dislike the system • can ask them to rate on a scale of 1 to 10 • or have them choose among statements • “this visualization helped me with the problem…”, • hard to be sure what the data will mean • novelty of UI, feelings, not realistic setting, etc. • If many give you low ratings, you are in trouble • Can get some useful data by asking • what they liked, disliked, where they had trouble, best part, worst part, etc. (redundant questions) Marti Hearst SIMS 247

  26. Debriefing Participants • Interview the participants at the end of the study • Ask structured questions • Ask general open-ended questions about the interface • Subjects often don’t remember details • video segments can help with this • Ask for comments on specific features • show them the screen (online or on paper) and then ask questions Marti Hearst SIMS 247

  27. Analyzing the Numbers • Example: trying to get task time <=30 min. • test gives: 20, 15, 40, 90, 10, 5 • mean (average) = 30 • median (middle) = 17.5 • looks good! • but that conclusion is wrong; we can’t be certain of anything • Factors contributing to our uncertainty • small number of test users (n = 6) • results are very variable (standard deviation = 32) • std. dev. measures dispersion from the mean Marti Hearst SIMS 247
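
The slide's numbers can be reproduced with Python's statistics module:

```python
# Reproducing the slide's numbers with Python's statistics module.
import statistics

times = [20, 15, 40, 90, 10, 5]        # task times (minutes) from the slide
print(statistics.mean(times))          # 30    (the mean)
print(statistics.median(times))        # 17.5  (the median)
print(round(statistics.stdev(times)))  # 32    (sample standard deviation)
```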

  28. Analyzing the Numbers (cont.) • This is what statistics are for • Get a statistics book • Landay recommends (for undergrads) • The Cartoon Guide to Statistics, Gonick and Smith • Crank through the procedures and you find • 95% certain that typical value is between 5 & 55 • Usability test data is quite variable • Need many subjects to get good estimates of typical values • 4 times as many tests will only narrow range by 2 times • breadth of range depends on sqrt of # of test users Marti Hearst SIMS 247
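
The 5-to-55 range quoted above matches a normal-approximation 95% confidence interval on the mean; a sketch of that computation (with only six users, a t-based interval would actually be somewhat wider):

```python
# Normal-approximation 95% confidence interval for the mean task time,
# using the data from the previous slide; reproduces the 5-to-55 range.
import math
import statistics

times = [20, 15, 40, 90, 10, 5]
n = len(times)
mean = statistics.mean(times)
se = statistics.stdev(times) / math.sqrt(n)      # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"95% CI: {low:.0f} to {high:.0f}")        # roughly 5 to 55
# Quadrupling n only halves se (and the interval width), since se ~ 1/sqrt(n).
```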

  29. Analyzing the Data • Summarize the data • make a list of all critical incidents (CI) • positive: something they liked or worked well • negative: difficulties with the UI • include references back to original data • try to judge why each difficulty occurred • What does data tell you? • Does the visualization work the way you thought it would? • Is something missing? Marti Hearst SIMS 247

  30. Using the Results • For usability testing: • Update task analysis and rethink design • rate severity & ease of fixing CIs • fix both severe problems & make the easy fixes • Will thinking aloud give the right answers? • not always • if you ask a question, people will always give an answer, even if it has nothing to do with the facts • try to avoid specific questions Marti Hearst SIMS 247

  31. Study Good Examples of Experiments! • Papers in reader • by Byrne on Icons • Studies done by Shneiderman’s students • www.otal.umd.edu/Olive/Class Marti Hearst SIMS 247

  32. Byrne Icon Study • Question: do icons facilitate searching for objects in a graphical UI? • Do they work better than lists of file names? • What characteristics of icons work best? Marti Hearst SIMS 247

  33. Byrne Icon Study • A Task Analysis (of how icons are used) identified three kinds of factors: • General factors • Visual search factors • Semantic search factors • Twelve factors, all told • Only a subset will be investigated Marti Hearst SIMS 247

  34. Byrne Icon Study • Theoretical Model: • Model of Mixed Search • Icon search involves two kinds of search that may or may not be related • the icon picture • the textual name associated with it • This leads to two kinds of search that get mixed together • visual search • semantic search • The visual characteristics of the icon will partly determine visual search time Marti Hearst SIMS 247

  35. Byrne Icon Study • Goals of experiment are two-fold • estimate effects of several of the factors • evaluate the Mixed Search model • Mixed search model depends on timing for visual search, so vary parameters relating to visual search: • vary complexity of visual form of icons • vary icon set size • color kept constant • Model also depends on association of meaning with icons, so vary type of knowledge needed: • file name knowledge • picture knowledge Marti Hearst SIMS 247

  36. Byrne Icon Study • Method • Participants • 45 undergrads getting extra credit in a course • note this will probably allow for statistical significance • Materials • instructions and stimuli presented in HyperCard • Design of Experiment • Complex: • one factor evaluated between subjects • icon type, one of three levels • four factors evaluated within subjects • set size of icons • match level varied within sizes • amount of a priori picture knowledge • amount of a priori filename knowledge Marti Hearst SIMS 247
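
A sketch of how a mixed design like this could be enumerated, with icon type assigned between subjects and the remaining factors crossed within subjects. The factor levels other than the three icon types are simplified placeholders, not the study's actual values, and match level is omitted for brevity.

```python
# Enumerating conditions for a mixed design like Byrne's: icon type is a
# between-subjects factor; the others are crossed within subjects.
# Levels other than the three icon types are simplified placeholders.
from itertools import product

icon_types = ["blank", "simple", "complex"]   # between subjects
set_sizes = ["small", "medium", "large"]      # placeholder levels
picture_knowledge = ["known", "unknown"]
filename_knowledge = ["known", "unknown"]

# 45 participants, assigned to icon types round-robin (15 per type).
participant_icon_type = {p: icon_types[p % 3] for p in range(45)}

within_conditions = list(product(set_sizes, picture_knowledge, filename_knowledge))
print(len(within_conditions), "within-subject conditions per participant")
```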

  37. Byrne Icon Study • Procedures and Stimuli • Participants got 3 practice runs and 72 experimental runs • Each run had three stages • encoding (see the target document) • decay (do something else to distract) • search (find the correct icon as quickly as possible) Marti Hearst SIMS 247

  38. Byrne Icon Study • Some Results • Mean across all conditions was 4.36 seconds with a standard deviation of 6.61s • Outliers discarded • No overall effect for icon type • But interactions occurred • The search time was affected by a combination of the picture knowledge and icon type • participants could only make use of the meaning of the picture if the meaning was displayed in a simple way Marti Hearst SIMS 247

  39. Byrne Icon Study • More Results • The search time was affected by a combination of filename knowledge and icon type • participants did best with a blank icon if they knew the filename! • If they did not know the filename, participants did best with the simple icon • Simple icons were faster to search on overall • compared against a baseline of blank icons Marti Hearst SIMS 247

  40. Byrne Icon Study • Conclusions • For icons to be effective aids in visual search • they should be simple • they should be easily discriminable from one another • simple icons • more effective for large set sizes • allow user to use knowledge of picture meaning • are less affected by lack of filename knowledge • complex icons • are worse than blank! • Support found for the mixed search two-pass processing model Marti Hearst SIMS 247

  41. Summary • User evaluation is important, but takes time & effort • Early testing can be done on mock-ups (low-fi) • Use real tasks & representative participants • Be ethical & treat your participants well • Goal: learn what people are doing & why • Doing scientific experiments requires more users to get statistically reliable results Marti Hearst SIMS 247
