Empirical Evaluation Susanne Eklund IS247 Presentation 22 March 2002
What is Empirical Evaluation? Why do it? • Be sure we’re improving on current methods • Be sure a new vis actually helps people complete tasks and doesn’t just look pretty Empirical: based on observation or experience (M-W.com)
Why Do It, Continued • Learn what works about particular systems • Pull together best parts of different systems • Or, figure out when to use one system over another • A step back from building entirely new systems • (CS system-syndrome) • Is it the same as “usability evaluation”? • Partly…need both usability and good Vis packaged together • Hidden problems: usability does not always equal accuracy/effectiveness, and vice versa
How do we judge value of a Vis?* • Different measures • Impact on community as a whole, influential ideas • Assistance to people in the tasks they care about • Strong View: • Unless a new technique or tool helps people with some kind of problem or task, it doesn’t have any value • Broaden Thinking: • Sometimes the chain of influence can be long and drawn out • System X influences System Y influences System Z which is incorporated into a practical tool that is of true value to people • This is what research is all about (typically) *From slides by John Stasko, Prof. at Georgia Tech
Evaluation of InfoVis v. GUIs • Techniques seem similar • Measure usability of UI • Harder to measure the success of a vis without actual real-world use? Often, knowledge of the domain is needed. • InfoVis can be “usable” but not “successful” – example in the Bullseye search study • Graham et al.’s methodology seems sound • Articulating a methodology assures all steps are followed
Evaluation Techniques A spectrum from control to authenticity: • Formal lab experiment - XML3D • Field experiment - Taxonomy? • Lab observation - Metadata? • Field observation - Hypertext
Spectrum of Measures • Lots of ways to measure the effectiveness of a system, on a spectrum from quantitative to qualitative: measures of task success, clickstream analysis, observation, satisfaction surveys, task timing, think-aloud, system adoption rate • Does a variety of measures = a better test?
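To make the quantitative end of this spectrum concrete, here is a minimal sketch of deriving two such measures, task time and success rate, from a clickstream-style event log. The log format, field names, and numbers are my own illustration, not data from any of the studies discussed here.

```python
# Sketch: deriving task time and success rate from a hypothetical
# clickstream log. Event format and values are invented for illustration.

# Each event: (participant, task, timestamp_seconds, event_type)
events = [
    ("p1", "find_recipe", 0.0,  "task_start"),
    ("p1", "find_recipe", 4.2,  "click"),
    ("p1", "find_recipe", 61.5, "task_success"),
    ("p2", "find_recipe", 0.0,  "task_start"),
    ("p2", "find_recipe", 88.0, "task_abandon"),
]

starts, outcomes = {}, {}
for participant, task, t, kind in events:
    key = (participant, task)
    if kind == "task_start":
        starts[key] = t
    elif kind in ("task_success", "task_abandon"):
        # duration from task start, plus whether the task succeeded
        outcomes[key] = (t - starts[key], kind == "task_success")

times = [duration for duration, _ in outcomes.values()]
successes = [ok for _, ok in outcomes.values()]
print("mean task time (s):", sum(times) / len(times))
print("success rate:", sum(successes) / len(successes))
```

The qualitative measures at the other end of the spectrum (observation, think-aloud, satisfaction surveys) resist this kind of reduction, which is part of the argument on the next slide for combining both.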
Which technique is best? Arguments for quantitative: • Observers aren’t biased • Results based, easier to compare • You can gather a lot of very rich data Arguments for qualitative: • Gauge thought processes • Understand why users do what they do • Avoid small sample problem IMHO, the best studies use a combination of both. Qualitative to understand “why”, and quantitative to confirm observations.
The Studies Papers Required for This Week • Ease of use for 2D and 3D information visualizations of web content - Risden et al • Examining the usability of web site search - SIMS • Towards a methodology for developing visualizations - Graham et al Additional Studies in Reader • Evaluating the effectiveness of visual user interfaces for information retrieval - Sutcliffe et al • Hypertext authoring and visualization - Pohl and Purgathofer
Risden et al – XML3D • An initial examination of ease of use for 2D and 3D information visualizations of web content. • Risden, Czerwinski, Munzner, and Cook. International Journal of Human Computer Studies, Special Issue on Empirical evaluation of information visualizations, Vol. 53, No. 5, November 1, 2000.
Study Design Target Users • Webmasters and web content producers • Males (according to their participant demographics…) Task Domain • Adding content to a directory scheme • Searching for appropriate existing categories • Browsing for places to put new categories • Some categories have multiple parents
The Interfaces XML3D • Hyperbolic space Focus+Context • Handles multiple-inheritance hierarchies • Selected node moves to focus point at center • Accompanied by 2D lists of parents, children, sibling nodes • High visibility of location in hierarchy
The Interfaces Snap.com • Category directory similar to Yahoo! • Limited visibility of hierarchy • Multiple parents indicated but not explained Collapsible Tree Browser • Similar to Windows Explorer • Can only show one parent at a time
Procedure • Controlled for learning effects • Participants received “a small amount of training” for each interface • Had participants complete a set of four types of tasks • Didn’t appear to use observation data. • Did they even have people present? • Relied on system log data to answer questions about how people used system.
Variables Independent Variables: • Interface used • Task type Dependent Variables: • Time to complete task • Consistency (correctness?) of answers • Frequency of use of XML3D elements • “Satisfaction survey”
Results Speed Analysis • Snap and Tree lumped into “2D” and compared to XML3D (was this a good decision?) • XML3D faster than 2D overall • Existing category faster than new one overall • XML3D only significantly faster on existing category tasks • No speed/”accuracy” tradeoff Can we think of a better metric than speed?
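As a concrete illustration of the speed analysis above, here is a sketch of the kind of statistical comparison such a study might run on completion times for XML3D versus the pooled “2D” interfaces. The numbers are invented, and the use of Welch’s t-test is my own choice; the paper’s actual analysis may differ.

```python
# Sketch: comparing task completion times (seconds) between XML3D and
# the pooled "2D" conditions. All numbers below are invented.
from scipy import stats

xml3d_times = [42.0, 55.3, 38.1, 61.0, 47.5, 50.2]
twod_times  = [58.4, 72.1, 49.9, 80.3, 66.0, 59.7]

# Welch's t-test: does not assume equal variances across conditions
t, p = stats.ttest_ind(xml3d_times, twod_times, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```

Note that pooling Snap and the tree browser into one “2D” group (as questioned above) folds two quite different interfaces into a single condition, which is exactly why a per-interface comparison might be more informative.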
Results, cont’d Use of 2D list in XML3D system • 2D list was used frequently for new category tasks (and these weren’t significantly faster with XML3D) • Existing category tasks used either 3D or 2D list • Because they didn’t use a “think-aloud” protocol, they don’t know why participants used one or the other.
What they mean When we’re looking for something specific in a sea of related items… • This 3D vis seemed to be effective. When we’re looking for a place to put the new item… • 2D lists may work better….OR people use lists for harder task b/c they’re familiar • The best solution may be to have both methods available.
+/- of Study + • Focused on a specific domain and user group • Used skilled users to minimize individual skill differences - • Did not isolate effect of 3D visualization • Felt like it was comparing apples and oranges • Relied on time, “accuracy”, and behavior measurements only. Did not effectively answer “why”. Ineffective post-test survey.
The SIMS Search Study English, Hearst, Sinha, Swearingen, and Yee • Examining the Usability of Web Site Search, submitted for publication, 2002. Goals of Study: • Find out how people use different search interfaces for different tasks • See how people use metadata • Use this information to improve website navigation and search
Improving the Middlegame • Good “scent” • Help user explore • Get sense of collection • Narrow or broaden results • Revise query as needed Stages: Opening (enter query) → Midgame (revise, filter) → Endgame (review results)
Study Design – 3 Searches Basic Search • Keyword-based • Results in laundry list • No way to refine Try it out
Enhanced Search User selects facet values • High degree of control • Easy to get 0 results • Results appear in laundry list Try it
Browse • Yahoo-like category browsing • Preview of number of recipes in each child category • Can refine by different facets – causes query preview to update • Breadcrumb allows easy backtrack
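The query preview mentioned above can be implemented as a simple count of items under each remaining facet value, given the constraints already chosen. The sketch below uses invented recipe data and facet names; it only illustrates the idea, not the actual implementation behind the Browse interface.

```python
# Sketch of a query preview: given the facets already selected, count how
# many recipes fall under each remaining value so the UI can grey out
# choices that would lead to zero results. Data and facet names invented.
recipes = [
    {"course": "dessert", "main_ingredient": "chocolate", "cuisine": "french"},
    {"course": "dessert", "main_ingredient": "fruit",     "cuisine": "italian"},
    {"course": "main",    "main_ingredient": "chicken",   "cuisine": "french"},
]

def preview_counts(recipes, selected, facet):
    """Count matching recipes for each value of `facet`, honouring `selected`."""
    matching = [r for r in recipes
                if all(r.get(f) == v for f, v in selected.items())]
    counts = {}
    for r in matching:
        counts[r[facet]] = counts.get(r[facet], 0) + 1
    return counts

print(preview_counts(recipes, {"course": "dessert"}, "main_ingredient"))
# {'chocolate': 1, 'fruit': 1}
```

Greying out or hiding zero-count values is what keeps browsing from dead-ending in empty result sets, the problem the Enhanced search ran into.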
The Study • 9 participants • Controlled for: • Interest (all like to cook, personal goals) • Motivation (recipe booklet) • Stress (site preview) • Learning effect (random order)
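For the learning-effect control, a common approach is to counterbalance the order in which participants see the interfaces. The sketch below assigns a simple rotation (a 3×3 Latin square) round-robin to nine participants; this is my own illustration of the idea, since the study itself only says the order was randomized.

```python
# Sketch: counterbalancing interface order across participants so
# practice effects don't systematically favour one interface.
# The study reports randomized order; a rotation-based Latin square
# is shown here as one common way to balance it.
from itertools import cycle, islice

interfaces = ["Basic", "Enhanced", "Browse"]

def rotated_orders(items):
    """Yield each cyclic rotation of `items` (the rows of a Latin square)."""
    for start in range(len(items)):
        yield list(islice(cycle(items), start, start + len(items)))

orders = list(rotated_orders(interfaces))
for participant in range(9):
    print(f"participant {participant + 1}:", orders[participant % len(orders)])
```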
Tasks • Find a dish 3x, once with each method • Using personal scenarios • Structured search tasks • Find specific items using specific interfaces • Hypothetical tasks • To see which interface they would pick
Results • Perception of keywords v. metadata is off • Enhanced search required more constraints and often produced 0 results (27%) • Satisfaction was high for all methods, but especially so for Browse and Enhanced • Users prefer Enhanced for high-constraint tasks, Browse for low-constraint tasks • Basic search is a good entry point, but doesn’t offer a mid-game • Enhanced would benefit from a dynamic results count… as well as the ability to refine • Browse has a strong middle-game
Contributions • Users recognize that different search interfaces are better for certain tasks • Metadata search can be a valuable way to improve searching and results management
+/- of Study + • Novel procedure controls for many factors (learning, motivation, training) • “Think-aloud” and probing questions get at WHY people do things - • No cons (it’s a SIMS paper!) • BUT it might benefit from repetition of study with different subject matter and/or hierarchical facets to see if conclusions hold • AND would be interesting to measure recall/precision with a dataset where there is more of a “right answer” concept?
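If the study were repeated with a dataset that has an agreed-upon “right answer” set per task, recall and precision could be computed directly. A minimal sketch, with invented item IDs:

```python
# Sketch: recall/precision for a task with a known set of relevant items.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["r12", "r7", "r31", "r4"]   # what the participant found
relevant  = ["r7", "r31", "r50"]         # the "right answers" for the task
print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
```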
Taxonomy System Graham, Kennedy, and Benyon. Towards a methodology for developing visualizations. • International Journal of Human Computer Studies, Special Issue on Empirical evaluation of information visualizations, Vol. 53, No. 5, November 1, 2000.
Problem • No single methodology for developing a good visualization. We have HCI methods for interfaces but no set method for InfoVis systems. • We design for the way people work, yet tech usually changes work habits. • Therefore we must have *more than 1* round of testing and redesign (Diagram: Requirements / Work / Artifacts / Possibilities)
Development Methodology • Get requirements from users. Make task storyboard. • Show storyboard to users and confirm correct interpretation. • Test visualization to be sure it gives users what they need. Identify needed functionality. • Test extra functionality and general interface usability. • Test overall interface usability. • Test whole product in a statistically rigorous manner to obtain satisfaction ratings, error rates, etc. This is not so different from regular UI design practice. *But it clearly separates Vis testing from UI testing.
Domain: Taxonomy Landscape • What is a taxonomy? • All published taxonomies are “right” • Work requires looking at multiple taxonomies and comparing categorizations
System Goals • Manage accumulation of old taxonomies • Identify relationships between different taxonomies • Perform new tasks that weren’t possible with paper system
About the tests • Highly qualitative • Lots of interaction with actual end users • Informal interviews for requirements • “Budget” • Iterative • Accuracy and discovery rather than speed
Procedures • Step 1: Requirements • Informal interview with taxonomy experts. • Step 2: Storyboarding • Confirm that concepts behind vis are right • Step 3: Test of vis • Exploration of two different visualizations with a given set of tasks
Procedures, cont’d • Result of 1st test: • Users prefer the mental model over the data model (!!) • Step 4: Usability Test • Test new functionality and UI usability • Scenarios instead of tasks • Accuracy of vis and whether the UI got in the way • Bigger scale • Step 5/6: Further refined usability tests
Results / Conclusions • Fixed several usability bugs • Found the model that best suits users • Functionality requirements change as users see possibilities • (Don’t use low-fi video cameras in the test) Would be nice to see how the product worked in the real world.
Pros/Cons of Paper + • I had a good feeling about the methodology • Involvement of users from start to end • Product evolved - • Suggestions for any cons?
Optional Reading: Bullseye • Sutcliffe, Ennis, and Hu: Evaluating the effectiveness of visual user interfaces for information retrieval. • Evaluation of “Integrated Thesaurus-Results Browsing System” with Bullseye visualization of clusters • Questions: • How effective is this system for retrieval? • How effectively do visual metaphors represent system model or search functionality to user?
Everything on ONE SCREEN (Screenshot callouts: query entry form; semi-confusing Bullseye option settings; confusing thesaurus tree; Bullseye display with auto-clustering; article abstract in very tiny print; more settings)
Primary Findings • Overall performance was poor • Low recall (valid?) and precision • Subject matter problem • Participants were mistaken about how the system actually works, even though trained • But people liked using the system • Usability is high by many measures • Errors, questionnaire-reported problems, and observed problems all low • Good thing they had multiple measures!
Conclusions • Non-expert users may prefer simpler search interfaces (Google) • More complicated methods may require further help (wizards, training) • Product was built for task-based efficiency, but all-in-one-place may not be what is needed in this domain • Vis tools aren’t a substitute for analysis; may encourage “sub-optimal and cognitively lazy practice”
Things to learn from this paper • People don’t always listen to or read directions. • Search tech is *complicated* and not always walk-up-and-use. A good system will not require people to understand the black box. (Epicurious) • Human processing is a necessary part of every search, and even excellent interfaces can’t bypass it. • “Good users” can have poor results and vice versa • Be sure the system is successful as well as usable
Optional Reading: Hypertext • Hypertext = HyperCard-based system • Does the writing process change with use of hypertext tools? • Does vis of info structures play a role in authoring? • Field study – gathered data from students who used the system to write papers (Screenshot: node editor view with overview map; the node editor looks much like a regular text editor, except you can add links to other nodes)
Findings • “Windowing” technique shows major blocks of activity • Nice technique? Individual variation of activity distribution is high (edit, make node, move, delete, other) • No single pattern • Resulting overview maps—and documents—vary greatly in structure and organization • Overall writers prefer hierarchy
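As a rough sketch of what a “windowing” analysis over logged authoring actions could look like (my reconstruction of the idea, not the authors’ actual procedure): bucket events into fixed-length time windows and count each action type per window, so blocks of mostly-editing or mostly-linking activity become visible. The event data and the 5-minute window length are invented.

```python
# Sketch: bucketing logged authoring actions into fixed-length time
# windows and counting action types per window. Data is invented.
from collections import Counter

# (minutes_from_start, action) pairs from a hypothetical session log
log = [(1, "edit"), (2, "edit"), (6, "make_node"), (7, "link"),
       (8, "link"), (13, "edit"), (14, "move"), (16, "edit")]

WINDOW = 5  # minutes per window

windows = {}
for minute, action in log:
    bucket = minute // WINDOW
    windows.setdefault(bucket, Counter())[action] += 1

for bucket in sorted(windows):
    start = bucket * WINDOW
    print(f"{start:>3}-{start + WINDOW} min:", dict(windows[bucket]))
```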
Their Conclusions • Conclusions are weak, partly b/c the study had no comparative elements • Also because the analysis of resulting documents sounded subjective and vague • “Students used this feature a lot, therefore it is important” • “Results indicate that visualizing information structure is one of the most important new features of hypertext systems” • Study would benefit from: • Analysis of hypertext authoring *without* the map • Structured comparison of docs written with and without maps