Computer-Aided Techniques for Extracting Usability Information from User Interface Events David M. Hilbert Information and Computer Science University of California, Irvine Irvine, California 92697-3425 dhilbert@ics.uci.edu http://www.ics.uci.edu/~dhilbert/ 1
Overview • Introduction • Key Insights • Discussion • Conclusions 2
Introduction • Introduction • Survey Topic • Survey Context • Survey Goals & Research Method • Survey Results • Key Insights • Discussion • Conclusions 3
Survey Topic • UI events are a potentially fruitful source of usability data • natural products of most modern graphical user interface systems • easily and unobtrusively captured • indicate user actions with respect to user interface components • provide detailed information regarding timing and sequence of actions • However, extracting useful information from UI events is difficult • volume and detail can be overwhelming • need to map lower level events into higher level events of interest • need to map results of analysis back to features of the interface/application • Need computer-aided support 4
Survey Context (Types of Evaluation) • Reason for evaluating vs. evaluation type: • User behavior/performance: predictive (x), observational (X) • Comparing design alternatives: predictive (x), observational (X), participative (X) • Usability metrics: predictive (x), observational (X), participative (X) • User needs, desires, thought processes, & experiences: observational (x), participative (X) • Certifying conformance w/ standards: predictive (X) 5
Survey Context (Data Collection) • Usability indicators vs. data-collection technique: • On-line behavior (task times, % tasks completed, error rates, on-line help use, functions used): UI event recording (X), audio & video recording (x) • Non-verbal off-line behavior (eye movements, facial gestures, off-line document use & problem solving): audio & video recording (X) • Cognition/understanding (verbal protocols, comprehension-question answers, sorting task scores): UI event recording (x), audio & video recording (X) • Stress/anxiety (galvanic skin response (GSR), heart rate (ECG), event-related brain potentials (ERPs), EEG’s, etc.): psycho-physical recording (X) • Attitude/opinion (post-hoc comments, questionnaire and interview results) 6
Survey Goals & Research Method • Survey goals • framework to help HCI practitioners/researchers compare and evaluate • allow comparison of “classes” of systems, not just “instances” • Research method • search of academic and professional computing forums, including magazines, conference proceedings, journals • initial set of papers used to identify “key” characteristics • matrix constructed w/ techniques on one axis and attributes on the other • visually detectable “clusters” of features suggested an initial classification • attributes and classification iteratively refined based on further exploration 7
Survey Results • Synch and Search • Transformation • Filtering • Abstraction • Recoding • Analysis • Counts and Summary Statistics • Sequence Detection • Sequence Comparison • Sequence Characterization • Visualization • Integrated Support 8
• Synch & Search: Playback; Microsoft, SunSoft, & Apple Labs; DRUM; MacSHAPA; I-Observe • Filtering: Chen; CHIME; Hawk; MacSHAPA; Remote CI’s; EDEM • Abstraction: CHIME; Hawk; MacSHAPA; Remote CI’s; EDEM • Recoding • Counts & Summary Stats (e.g. 30%, 15/hr., 25 sec.): MIKE UIMS; KRI/AG; MacSHAPA; Long-Term Monitoring; AUS 9
• Sequence Detection: TOP/G; LSA; Fisher’s Cycles; MRP; Chunk Detection; Expectation Agents; EDEM • Sequence Comparison: ADAM; UsAGE; MacSHAPA; EMA; “Process Validation” • Sequence Characterization: Markov-based; Grammar-based; “Process Discovery” • Visualization: I-Observe; AUS; UsAGE; MacSHAPA • Integrated Support: DRUM; Hawk; MacSHAPA 10
Key Insights • Introduction • Key Insights • Transformation is Critical (But Hard) • The Grammatical Structure of Events • Events at Multiple Levels • Scattered and Missing Context • Discussion • Conclusions • Related Work 11
Transformation is Critical (But Hard) • Meaningful transformation is a prerequisite to meaningful analysis • allows events and patterns of interest to emerge from the “noise” • can significantly impact results of human and automated analyses • can be helpful in linking results back to interface/application features • However, meaningful transformation is hard to do • the grammatical structure of UI events • events occur at multiple levels of abstraction • problems of scattered and missing contextual information 12
The Grammatical Structure of Events
• A grammar for activating a print job:
  Print -> PrintToolbarButtonPressed OR (PrintDialogActivated THEN PrintDialogOkayed)
  PrintDialogActivated -> PrintMenuItemSelected OR PrintAcceleratorKeyEntered
• Assume the lexical elements in this language are:
  A: indicating "print toolbar button pressed"
  B: indicating "print menu item selected"
  C: indicating "print accelerator key entered"
  D: indicating "print dialog okayed"
• What do these "sentences" indicate? AAAA CDAAA ABDBDA BDCDACD CDBDCDBD 13
The Grammatical Structure (cont’d) • These sequences are semantically equivalent AAAA CDAAA ABDBDA BDCDACD CDBDCDBD • all occurrences of ‘A’ indicate an immediate print job activation • all occurrences of ‘BD’ or ‘CD’ indicate a print job activated by using the print dialog and then okaying it • Notice, however • each sequence contains a different number of lexical elements • some have absolutely no lexical elements in common • the lexical elements occupying the first and last positions differ • Significance: • most techniques do not address grammatical issues, and thus, are overly sensitive to such lexical differences 14
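To make the point concrete, here is an illustrative sketch (not part of the original survey; it uses the lexical elements A–D defined above) of grammar-based abstraction: rewriting the lexical event stream into abstract Print events. All five "sentences" reduce to the same four print activations even though some pairs share no lexical elements.

```python
# Sketch of grammar-based abstraction over the print-activation grammar above.
# A = print toolbar button pressed, B = print menu item selected,
# C = print accelerator key entered, D = print dialog okayed.

def abstract_print_events(events):
    """Rewrite a string of lexical events into a list of abstract events."""
    abstracted = []
    i = 0
    while i < len(events):
        if events[i] == "A":                       # immediate print activation
            abstracted.append("Print")
            i += 1
        elif events[i] in "BC" and i + 1 < len(events) and events[i + 1] == "D":
            abstracted.append("Print")             # dialog activated, then okayed
            i += 2
        else:
            abstracted.append(events[i])           # leave unrecognized events alone
            i += 1
    return abstracted

for sentence in ["AAAA", "CDAAA", "ABDBDA", "BDCDACD", "CDBDCDBD"]:
    print(sentence, "->", abstract_print_events(sentence))
# Each sequence abstracts to exactly four Print events, despite the
# lexical differences that confuse techniques operating on raw events.
```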
Events at Multiple Levels • Goal/Problem-Related (e.g. placing an order) • Domain/Task-Related (e.g. providing address information) • Abstract Interaction Level (e.g. providing input values) • Window System Events (e.g. key events, mouse events, input focus events) • Input Device Events (e.g. hardware-generated key or mouse interrupts) • Physical Events (e.g. finger presses, hand movements) 15
Events at Multiple Levels (cont’d) • Higher level events do not (in themselves) reveal their composition from lower level events • recording only higher level events results in information loss • Lower level events do not (in themselves) reveal how they combine to form higher level events • either higher level events must be recorded in conjunction with lower level events, or • there must be some model (e.g. a grammar) to describe how lower level events combine to form higher level events • Significance: • most techniques do not address issues of events at multiple levels, and thus, mapping from lower level events to higher level events can be difficult 16
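As a rough illustration (widget and event names are invented, not taken from the survey), a model mapping window system events up to abstract interaction level events might look like the sketch below; without such a model, the lower level events alone do not reveal the higher level actions they compose.

```python
# Hedged sketch: a model that maps window system events (event type + widget)
# up to abstract interaction level events. All names are illustrative only.

WIDGET_ROLE = {
    "nameField":   "provide input value (name)",
    "streetField": "provide input value (street)",
    "okButton":    "accept dialog",
}

def to_abstract_interaction(window_events):
    """window_events: list of (event_type, widget_id) pairs."""
    abstract = []
    for event_type, widget in window_events:
        role = WIDGET_ROLE.get(widget)
        if role is not None:
            abstract.append(role)
        # Events on widgets the model does not cover cannot be mapped upward,
        # which is why either a model or joint recording of levels is needed.
    return abstract

session = [("keyPress", "nameField"), ("keyPress", "nameField"),
           ("keyPress", "streetField"), ("mouseClick", "okButton")]
print(to_abstract_interaction(session))
```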
Scattered and Missing Context • Given a transcript of a conversation between A & B at a car show, identify A’s favorite cars based on utterances in the transcript: • Example 1: "The Lamborghini is one of my favorite cars". • everything you need to know is contained in a single utterance • Example 2: "The Lamborghini". • need access to prior context — probably a response to a prior utterance, e.g., "Which car here is your least favorite?" • Example 3: "That is one of my favorite cars". • need ability to de-reference an indexical — information carried by the indexical “that” may not be available in any of the utterances in the transcript, but was available at the time of the utterance • Example 4: "That is another one". • need access to prior context and ability to de-reference an indexical 17
Scattered and Missing Context (cont’d) • Analogous UI event examples: • Example 1’: PrintToolbarButtonPressed • this event carries enough information to indicate the action taken • Example 2’: CancelButtonPressed • what was canceled? indicates a response to a prior event, e.g., a prior PrintMenuItemSelected event. • Example 3’: ErrorDialogActivated • what’s the error? query the dialog for its error message; this is similar to de-referencing an indexical if you think of the error dialog as “pointing at” an error message that doesn’t appear in the event stream • Example 4’: ErrorDialogActivated • assuming the error message is “Invalid Command”, then the information needed to interpret the significance of this event is not only found by de-referencing the indexical, but must be supplemented by information available in prior events 18
Scattered and Missing Context (cont’d) • Significance: • context is critical in interpreting the significance of UI events • sometimes context is spread across multiple events • sometimes context is missing altogether from the event stream • however, context is usually available “for the asking” from the user interface system at the time of the events, but not afterwards • most techniques do not perform transformation and analysis in-context, and thus, cannot access contextual information critical in interpretation 19
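A small sketch of what capturing context in-context might look like (the classes and method names are invented, not a real toolkit API): when the event fires, the observer asks the component the event "points at" for the missing information, which would not be recoverable from the event stream afterwards.

```python
# Hedged sketch with invented classes and names (not an actual toolkit API):
# capture context at the time of the event by querying the component the
# event refers to. Once the dialog is dismissed, its message is gone.

class ErrorDialog:
    def __init__(self, message):
        self.message = message
    def get_text(self):
        return self.message

log = []

def on_event(event_type, source):
    record = {"type": event_type}
    if event_type == "ErrorDialogActivated":
        # De-reference the "indexical": ask the dialog what it is displaying.
        # This text never appears in the raw event stream itself.
        record["error_message"] = source.get_text()
    log.append(record)

on_event("ErrorDialogActivated", ErrorDialog("Invalid Command"))
print(log)  # [{'type': 'ErrorDialogActivated', 'error_message': 'Invalid Command'}]
```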
Discussion • Introduction • Insights • Discussion • Problems • Non-Answers • Requirements for a Successful Solution • Conclusions • Related Work 20
Problems • Meaningful transformation appears to be a prerequisite to meaningful analysis • However, of the 25+ techniques surveyed, only a handful provide any assistance in automating transformation • e.g., CHIME, Hawk, & EDEM • Unfortunately, all suffer from some combination of the following problems • failure to perform transformation in-context • failure to allow analysis at multiple levels of abstraction • lack of assistance in mapping results of analysis back to features of the interface/application being evaluated • lack of support for end-user modifiability and reuse of transformation and analysis mechanisms 21
Non-Answers • Why not use a UIMS? (e.g. MIKE, KRI/AG, UsAGE) • UIMS’s model relationships between UI actions and application features • however, more general techniques are required • Why not rely on hand-instrumentation? (e.g. LTM, EMA) • hand-instrumentation allows higher level and application-specific events to be reported, and is more general than a UIMS-based approach • however, too much burden placed on application developers, and • much useful usability-related information is not available to applications • e.g., shifts in input focus, mouse movements, specific UI actions used to invoke application features, etc. • Why not just rely on users? (e.g. User-Identified CI’s) • users often fail to recognize when their behavior violates expectations about proper usage (Smilowitz et al., 1994) 22
Requirements for a Successful Solution • Need an approach that: • does not require developers to adopt a particular UIMS • does not require developers to call an API to report every potentially interesting event • does not rely (solely) on users to perform event filtering and abstraction • The approach must also address: • transformation in-context • analysis at multiple levels of abstraction • mapping of results back to features of the interface/application • end-user modifiability and reuse of transformation and analysis mechanisms 23
Conclusions • Introduction • Insights • Discussion • Conclusions • More Research • A Hypothesis • Open Issues 24
More Research • If meaningful usability information is to be extracted automatically from UI events, then more research is needed to address: • how to allow transformation and analysis to be performed in-context without significantly impacting application performance, evolution, and deployment. • how to establish a reasonable mapping between lower level events and higher level events of interest that is maintainable as the application interface and functionality evolves. • how to establish a reasonable mapping between events and application features that is also maintainable as the application interface and application functionality evolves. • how to make transformation and analysis mechanisms easily adaptable and reusable by investigators wishing to employ them. 25
A Hypothesis • Performing transformation and analysis automatically and in-context has the potential of greatly reducing data collection • As a result, such an approach might scale up to large-scale and ongoing use over the Internet • lift current restrictions on evaluation size, location, duration • help developers and managers answer such empirical questions as: • how thoroughly has beta testing exercised relevant features? • how are application features actually being used? • which features warrant more/less development and testing effort? • which features if modified/added/deleted are most likely to impact application utility, usability, and productivity? • does actual usage match expectations? • how can the design be improved to better match actual usage? 26
Open Issues • Analysis might address single or multiple users at one time • The techniques surveyed here primarily address single users • If each event records the user associated with it, these techniques might be extended to address multiple users • Analysis might address synchronic and diachronic aspects of processes • Characterizing a user’s behavior in a single session in terms of a “process” addresses synchronic aspects of processes. • Characterizing changes in a user’s process as it evolves over time addresses diachronic aspects of processes. • The techniques surveyed here primarily address synchronic aspects of processes. 27
Related Work 28
Related • Model-based debugging/testing (e.g. EBBA & TSL) • Beta test data collection (e.g. Aqueduct Profiler) • API usage monitoring (e.g. HP/Tivoli ARM API) • Event histories and undo (e.g. Kosbie & Myers, 1994) • Layered interaction models (e.g. Nielsen, 1986, Taylor, 1988) • Command & task-action grammars (e.g. Moran, 1981, Payne & Green, 1986) • UI testing/debugging (e.g. Mercury Interactive X/WinRunner, SunTest JavaStar) • Process validation techniques (Cook & Wolf, 1997) • Process discovery techniques (Cook & Wolf, 1996) 29
Somewhat Related • Collaborative remote usability (e.g. “Remote Evaluation” web site) • Simple event capture (e.g. Windows SPY) • Macro recording/playback (e.g. Windows Macro Recorder) • Programming by demonstration (e.g. Cypher et al., 1994) • Task recognition and assisted completion (e.g. Cypher, 1991) • Enterprise management (e.g. TIBCO Hawk) • Product condition monitoring • Firewall monitoring, intrusion detection • Network monitoring/diagnostics • Trade monitoring 30
Supporting Technologies • Transport mechanisms • Hypertext transfer protocol (HTTP) • Mobile agent infrastructures (e.g. ObjectSpace Voyager) • Event notification • Simple mail transfer protocol (SMTP) • More advanced event notification systems (e.g. TIB/Rendezvous) 31
“Usability” • “Usability” has multiple components and is traditionally associated with these five usability attributes (Nielsen, 1993): • Learnability: can users rapidly figure out how to accomplish tasks? • Efficiency: once learned, can tasks be accomplished quickly? • Memorability: can the casual user return after some period of time without having to relearn everything? • Errors: are error rates reasonably low? can users easily recover from errors? are catastrophic errors prevented? • Satisfaction: are users subjectively satisfied when using the system? Do they like it? 33
Usability Evaluation & Usability Data • Usability evaluation: • the act of measuring — or detecting potential “issues” affecting — usability attributes of a system or device with respect to particular users, performing particular tasks, in particular contexts • values of usability attributes will typically vary depending on the background knowledge and experience of users, tasks for which the system is used, and context in which it is used • Usability data: • any information useful in measuring — or detecting potential “issues” affecting — usability attributes of a system being evaluated 34
Usefulness • Usefulness = Utility + Usability (Grudin, 1992) • “utility” is (roughly) the question of whether system functionality can in principle support the needs/desires of users • “usability” is (roughly) the question of how satisfactorily users can make use of system functionality • Usability evaluations often uncover issues relating to both utility and usability, and thus more properly address “usefulness” 35
Evaluation Approaches • Predictive evaluation • predict values of usability attributes based on psychologically-inspired models (e.g. GOMS or the Cognitive Walkthrough), or based on formal reviews by experts • key strength is the ability to produce results with non-functioning design artifacts and without actual users (not so great for evaluating “utility”) • Observational evaluation • measure usability attributes by observing users interacting with prototypes or fully functioning systems • key strength is the ability to uncover aspects of user behavior and performance difficult to capture using other techniques • Participative evaluation • collect information regarding usability attributes directly from users based on their subjective reports • key benefit is the ability to capture aspects of users’ needs, desires, thought processes, and experiences difficult to obtain otherwise 36
Synch and Search • Strengths • combines advantages of detailed on-line behavior data with more semantically rich data • allows searches in one source to locate supplementary data in others • allows context missing from events to be used in interpretation • facilitates communication of results back to developers • Limitations • requires video recording equipment, operator, observer • equipment can be expensive to obtain/operate • may make subjects self-conscious and affect behavior • often produces massive amounts of data that is time-consuming to analyze • places limitations on evaluation context, size, duration, etc. 39
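The mechanics of the "search" side can be illustrated with a small sketch (the timestamp fields and event names are assumed, not from any surveyed system): given UI events and a video recording on the same clock, searching the event log yields offsets into the video where the richer contextual data can be reviewed.

```python
# Minimal sketch of the "search" half of synch and search: locate every
# occurrence of an event of interest as an offset into a synchronized video.

video_start = 1000.0  # seconds, on the same clock as the event log (assumed)

event_log = [
    {"t": 1003.2, "event": "PrintDialogActivated"},
    {"t": 1004.0, "event": "CancelButtonPressed"},
    {"t": 1090.5, "event": "PrintDialogActivated"},
]

def video_offsets(log, event_name, start):
    """Return (minutes, seconds) offsets into the video for matching events."""
    offsets = []
    for record in log:
        if record["event"] == event_name:
            delta = record["t"] - start
            offsets.append((int(delta // 60), round(delta % 60, 1)))
    return offsets

print(video_offsets(event_log, "PrintDialogActivated", video_start))
# [(0, 3.2), (1, 30.5)] -- jump the video to these points to see what the
# user was doing when the dialog appeared.
```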
Transformation • Strengths • allows events and patterns of interest to emerge from the “noise” • can significantly impact results of human and automatic analysis • if performed “in-context”, could support large-scale remote collection • Limitations • if performed “in-context”, important information may be lost • approaches relying on users are even more likely to drop important data • most techniques don’t support transformation “in-context” • meaningful interpretation, particularly without the help of users, is challenging 40
Counts and Summary Statistics • Strengths • with the large number of potential counts and summary statistics, built-in support makes some sense • Limitations • most approaches do not allow users to modify or define their own counts, summary statistics, and reports • the MIKE UIMS approach relies on information about the mapping between application commands and UI actions not typically available • the MacSHAPA approach does not operate in-context and thus misses potentially important contextual information • the AUS approach does not make use of application-specific information and thus is limited in what it can say with regard to application features 41
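For illustration only (the log format and event names are invented), a few of the counts and summary statistics mentioned above can be computed directly from an abstracted event log:

```python
# Sketch: simple counts and summary statistics over an abstracted event log.
from collections import Counter
from statistics import mean

# Each record: (timestamp_in_seconds, event_name) -- assumed format.
log = [(0.0, "OpenDocument"), (4.2, "Print"), (9.8, "Print"),
       (15.1, "Undo"), (21.4, "Print"), (30.0, "CloseDocument")]

command_frequency = Counter(name for _, name in log)
session_length = log[-1][0] - log[0][0]
prints_per_minute = command_frequency["Print"] / (session_length / 60)
inter_event_times = [b[0] - a[0] for a, b in zip(log, log[1:])]

print(command_frequency)                   # how often each feature was used
print(round(prints_per_minute, 1))         # rate of use of a particular feature
print(round(mean(inter_event_times), 1))   # average time between actions
```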
Sequence Detection • Strengths • addresses sequential information in addition to isolated events • ESDA techniques may be helpful in “discovering” unanticipated patterns • Limitations • ESDA techniques produce large amounts of output that is difficult and time-consuming to interpret • the efficacy of ESDA techniques at uncovering usability issues has been called into doubt (Cuomo, 1994) • non-ESDA techniques require more effort to specify sequences of interest 42
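A minimal sketch of the non-ESDA, investigator-specified style of detection (a generic contiguous matcher, not the algorithm of any surveyed system; the event names are invented):

```python
# Sketch: find occurrences of an investigator-specified target subsequence
# in an abstracted event stream.

def find_sequence(events, target):
    """Return start indices where the target sequence occurs contiguously."""
    hits = []
    for i in range(len(events) - len(target) + 1):
        if events[i:i + len(target)] == target:
            hits.append(i)
    return hits

events = ["Open", "Print", "Cancel", "Print", "Cancel", "Print", "Okay"]
# Repeated cancel-after-print might flag a usability issue worth inspecting.
print(find_sequence(events, ["Print", "Cancel"]))  # [1, 3]
```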
Sequence Comparison • Strengths • allows actual usage to be compared against some model or trace of “ideal” or expected usage to identify potential usability issues • particularly attractive when “ideal” usage can be specified “by demonstration” • Limitations • some approaches assume short, pre-segmented sequences (e.g. ADAM) • some approaches assume a reasonable correspondence will actually exist between separate traces (e.g. UsAGE). • some approaches assume a “complete” model of all possible/expected sequences (e.g. EMA) • all approaches assume reasonably abstract events and thus presuppose transformation or careful hand instrumentation. • all approaches require expert user interpretation of output to determine whether sequences are interestingly similar or different (especially ADAM) 43
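As a generic illustration of the idea (none of the surveyed systems necessarily uses this exact measure, and the event names are invented), an actual trace can be scored against an "ideal" trace with an edit distance:

```python
# Sketch: score divergence between actual and expected usage sequences.

def edit_distance(a, b):
    """Classic Levenshtein distance between two event sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

ideal  = ["OpenDialog", "EnterAddress", "Confirm"]
actual = ["OpenDialog", "EnterAddress", "Cancel", "OpenDialog",
          "EnterAddress", "Confirm"]
print(edit_distance(actual, ideal))  # 3: the user took three extra steps
```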
Sequence Characterization • Strengths • provides insight into the sequential structure of UI event sequences • Markov-based techniques provide insight into transition probabilities between particular features or sets of features • grammar-based techniques are powerful for performing event abstraction • Limitations • typically requires extensive human involvement, particularly grammar-based techniques • automated approaches (e.g. “process discovery”, Cook and Wolf, 1996) appear to be less able to produce models meaningful to investigators (Olson et al., 1994), at least without careful event stream transformation ahead of time • investigators are more likely to want to see actual instances of subsequences, in order to count and analyze them, rather than a single abstract model (such as a grammar or FSM) that summarizes them all 44
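A rough sketch of the Markov-based flavor (the session contents are invented): estimate first-order transition probabilities between abstracted events, giving a compact picture of how users move between features instead of a pile of raw sequences.

```python
# Sketch: first-order Markov characterization of abstracted event sequences.
from collections import Counter, defaultdict

sessions = [
    ["Open", "Edit", "Print", "Close"],
    ["Open", "Edit", "Edit", "Print", "Close"],
    ["Open", "Print", "Close"],
]

transitions = defaultdict(Counter)
for session in sessions:
    for src, dst in zip(session, session[1:]):
        transitions[src][dst] += 1

for src, counts in transitions.items():
    total = sum(counts.values())
    probs = {dst: round(n / total, 2) for dst, n in counts.items()}
    print(src, "->", probs)
# e.g. Open -> {'Edit': 0.67, 'Print': 0.33}
```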
Visualization • Strengths • allows humans to use innate visual analysis capabilities • particularly useful in linking results of analysis back to features of the interface • Limitations • most non-standard graph visualizations must be created by hand 45
Integrated Support • Strengths • reduces the burden of managing multiple artifacts and media types and the creation and composition of various transformation and analysis techniques • Limitations • existing environments do not address event capture, so contextual information critical in proper transformation and analysis is missing • most environments don’t allow for flexible definition and reuse of automated transformations and analyses (except for Hawk) 46
Synch and Search • Playback (Neal & Simons, 1983) • events, coded observations, play back • DRUM (Macleod et al., 1993) • events, coded observations, video • Usability labs at Microsoft, Apple, SunSoft (Weiler et al., 1993) • events, coded observations, video, “highlight videos” • I-Observe (Badre et al., 1995) • events, video, regular expression search, visualizations 48
Transformation • “Intrinsic” monitoring (Chen, 1990) • automatic & investigator-specifiable filtering, limited widget context • User-identified “Critical Incidents” (Hartson et al., 1996) • only events surrounding user-identified CI’s reported, users provide context • CHIME (Badre & Santos, 1991) • investigator-specifiable filtering, limited abstraction via notion of IU’s • Hawk (Guzdial, 1993) • AWK used to filter and abstract log files, no context • MacSHAPA (Sanderson & Fisher, 1994) • database query and manipulation language, manual techniques, no context • EDEM (Hilbert & Redmiles, 1998) • investigator-specifiable filtering & abstraction, CI filtering, grammar-based abstraction, access to context via standard software component model 49
Counts and Summary Statistics • MIKE UIMS (Olsen & Halverson, 1988) • computation and reports of task performance time, mouse travel, command frequency, command-pair frequency, cancel and undo, physical device swapping, and more • MacSHAPA (Sanderson & Fisher, 1994) • support for user-specifiable counts, summary statistics, and reports • Automatic Usability Software (AUS) (Chang & Dillon, 1997) • use of help, cancel and undo, mouse travel, mouse click density per window, and more 50