740 likes | 883 Views
User-centered System Evaluation. Reference. Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012. introduction. Interactive Information Retrieval (IIR).
E N D
Reference • Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012
Interactive Information Retrieval (IIR) • Traditional IR evaluations abstract users out of the evaluation process • IIR focuses on user’s behaviors and experiences, • physical, cognitive and affective • Interactions between users and systems • Interactions between users and information
Different evaluation questions • Classic IR evaluation (non-user centric): does this system retrieve relevant document? • IIR evaluation (user-centric): can people use this system to retrieve relevant documents. • Therefore: IIR is viewed as a sub-area of HCI
Relevance Feedback • Same information needs different queries different search results different relevance feedback. • Dealing with users is difficult as causes and consequences of interactions cannot be observed easily (it is in user’s head) • The available observation: query, save a document, provide relevance feedback. • Based on these observation, we must infer
Difficulties • Each individual user has a different cognitive composition and behavioral disposition • Some interactions are not easily observable nor measurable • Motivation, • How much to know the topic • expectations
IIR • Using users to evaluate IR • Different approaches • Using users to evaluate the research results of a system (users are treated as black boxes) • Search log analysis (queries, search results and click-through behavior) • TREC Interactive Track evaluation model (evaluating a system or interface) • General information search behavior in electronic environments (observing and documenting user’s natural search behaviors and interactions)
Research goals • Setting up a clear research goal: • Exploration: when the subject is less known, focusing on learning the subject, rather than make prediction, open-end research questions or hypotheses are uncommon. • Description: documenting and describing a subject (query log or query behavior analysis), to provide benchmark description and classification, results can be used to inform other studies • Explanation: examine the relationship between two or more variables with the goal of prediction and explanation, establish causality,
Approaches • Evaluations vs. Experiments • Evaluation: to assess the goodness of a system, interface or interaction technique. • Experiments: to understand behavior, (similar as experiments in psychology or education), compare at least two things. • Lab and naturalistic studies • Lab (more controls) vs. naturalistic (less controls) • Longitudinal studies • Taking place over an extended period of time and measurements are taken at fixed intervals.
Approaches • Case studies • The intensive study of a small number of cases • A case maybe a user, a system or an organization. • It usually takes place in naturalistic settings and involve some longitudinal elements. • Not for generalizing rather than gaining an in-depth view of a particular case. • Wizard of Oz studies and simulations • Testing “non-real” or simulated system • Used for proof-of-concept • Provide an indication of what might happen in ideal circumstances • Wizard of Oz studies are simulations • Simulated users can represent different actions or steps a real user might take while interacting with an IR system
Problems and Questions • Identify and describe problems • Provide roadmap for research • Example of research questions • Exploratory: • How do people re-find information on the Web? • Descriptive: • What Web browser functionalities are currently being used during web-based information-seeking tasks • Explanatory: • What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes? • What is the relationship between query box size and query length? What is the relationship between query length and performance?
Theory • A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena. • Theory is abstract, general, can generate more specific hypotheses
Hypotheses • Hypotheses state expected relationships between two variables • Alternative hypotheses vs. null hypotheses • Specific relationship vs. no relationship • Hypotheses can be directional or non-directional
Variables and measurement • Variables represent concept • To analyze concepts • Conceptualization • To define concepts: provide temporary definition, divide into dimensions • Operationalization • How to measure the concept: • Direct and indirect observables • Directly observed: • # of queries entered, the amount time spent searching • Indirectly observed: • User satisfaction
Variables • Independent: the causes • examining differences in how males and females use an experimental and baseline IIR system • Sex is independent variables • Dependent: the effects • E.g., Satisfaction or performance of the systems. • Confounding variables • Affect the independent or dependent variable, but have not been controlled by the researcher. • E.g., maybe males are more familiar with these systems than females.
Measurement • Range of variation • Preciseness of the measure • E.g., category of usage frequency of a system • Exhaustiveness • Complete list of choices • Exclusiveness • How to differentiate partially relevant vs. somewhat relevant (in your relevance rubric) • Equivalence • Find items that are of the same type and at the same level of specificity • Different scales: I know details=very familiar, I know nothing=very unfamiliar • Appropriateness • How likely are you to recommend this system to others? Scale: a five-point scale with strongly agree and strongly disagree – which does not match the question
Level of Measurement • Two basic levels of measurement: discrete vs. continuous • Discrete measures: categorical responses • Nominal: no order • E.g., interface type, sex, task-type • Ordinal: ordered • Rank-order (from most relevant to least relevant) or Likert-type order (five-point scale with 1=not relevant, 5=relevant) • Relative measure • one subject’s 2 may not represent the same thing internally as another subject’s 2. • we could not say that a document rated 4 was twice as relevant as a document rated 2 since the scale contains no true zero
Level of Measurement • Two basic levels of measurement: discrete vs. continuous • Continuous measure: interval vs. ratio • Different between consecutive points are equal, but there is no true zero for interval scales • Fahrenheit temperature scale, IQ test scores • Zero does not mean no heat or no intelligence • The differences between 50 vs. 80 and 90 vs. 120 are same • Ratio: the highest level of measurement: the number of occurrences. • There is a true zero • E.g. time, number of pages viewed (zero is meaningful)
The basic experimental design in IIR evaluation examines the relationship between two or more systems or interfaces (independent variable) on some set of outcome measures (dependent variables)
IIR design • General goal of IIR is to determine if a particular system helps subjects find relevant documents • Developing a valid baseline in IIR evaluation involves identifying and blending the status quo and the experimental system. • Random assignment can be used to increase the characteristics being evenly distributed across groups
Factorial Designs • Good for studying the impact of more than one stimulus or variable
Rotation and counterbalancing • The primary purpose of rotation and counterbalancing is to control for order effects and to increase the change that results can be attributed to the experimental treatments and conditions. • Rotating variables: • Latin square design • Graeco-Latin square design
Rotation and counterbalancing A basic design with no rotation. Numbers in cells represent different topics Cons: Order effects Some topics are easier than others, some systems may do better with some topics than others. Fatigue can impact the results
Latin Square rotation Basic Latin Square rotation of topics • Problems: • Interaction among topics • the order effects of interfaces still exist Basic Latin Square rotation of topics and randomization of columns
Graeco-Latin Square Design • To solve the problem of orders of interfaces existing above. • Graeco-Latin Square is a combination of two or more Latin squares.
Study mode • Batch-mode • Multiple subjects complete the study at the same location and time • Single-mode • Subjects complete the study alone, with only the researcher present. • The choice of mode is determined by the purpose of the study. • Single-mode: if each subject has to be interviewed, or some interactive communication needed between subject and researcher • Batch-mode: self-contained, efficient (but subject can influence each other)
Protocols • A protocol is a step by step account of what will happen in a study. • Protocol helps maintain the integrity of the study and ensure that subjects experience the study in similar ways.
Tutorials • Provide some instruction on how to use a new IIR system • Printed materials • Verbal instructions • Video tutorial • Try to avoid bias in the tutorial • Such as specially focusing on one special feature.
Pilot testing • To estimate time • To identify problems with instruments, instructions, and protocols • To get detailed feedback from test subjects
Sampling • It is not possible to include all elements from a population in a study • The population in IIR evaluation is assumed to be all people who engage in online information search. • The size of sample: the more the better • Two approaches to sampling: probability sampling and non-probability sampling
Probability Sampling • Selecting a sample from a population that maintains the same variation and diversity that exists within the population. • Representative sample: • In a population: 60% are males and 40% are females, then your representative sample would also contain roughly the same ratio of males and females. • Increase the generalizability of the results • Assumes that all elements in the population have an equal chance of being selected.
Probability sampling • Simple random sampling • Randomly pick up an element • Systematic sampling • Pick up every kth element, where k=population size/sample size • Stratified sampling • Subdivide the population into more refined groups according to specific strata • Select a sample that is proportionate to the population in each strata.
Non-probability sampling • Used when all of the elements in a population is unknown, or not available. • It limits its ability to generalize • Researchers should be cautious when generalizing their data and be aware of the sampling limitations in their research.
Non-probability sampling • Three major types of non-probability sampling: • Convenience: relying on available elements the researcher can access: undergraduate students, people is located closer to the researcher. • Purposive or judgmental sampling: a researcher selects subjects or other elements that have particular characteristics, expertise or perspectives • Quota sampling: similar as stratified sampling, but the subjects for the strata are based on a first-come-first-served policy.
Subject Recruitment • Many ways to recruit subjects • Send solicitations to mailinglists • Inviting • Using referral services • Crowdsourcing • Mechanical Turk • Web advertising • Mass mailings • Virtual posting in online locations • Pros and Cons: using lab mates, or own research group members as study subjects
Collections for testing • Identification of a set of documents for subjects to search, a set of tasks or topics which directs this searching, and the ground truth about the relevance of the searched objects to the topics - • A test collection: corpus, topics, and relevance judgments
TREC collections • TREC Interactive and HARD tracks • Newswire, blog, legal • Artificial topics • Relevance assessment generalization problem
Web corpora • The major drawback is that it is impossible to replicate the study since the Web is constantly changing. • The same queries issued at different time can get completely different results
Natural corpora • Corpora assembled over time by study participants • Pros: meaningful to subjects, controllable • Cons: lack of replicability and equivalence across participants,
Tasks and topics • Most information needs can be characterized in terms of tasks and topics • Information need = task = topic • Information needs • People do not know their information needs • People have difficulties to articulate their information needs • Or using a vocabulary proper for a system
Generating information needs • It is not clear at what level of specificity a task or topic should be defined • Task can be broken down into a series of sub-tasks, such as writing a research proposal • Working on the query logs to develop information needs
Data collection techniques • Corpora, tasks, topics, and relevance assessments are major instruments to evaluate IIR systems • Other instruments: questionnaires, screen capture software allow researchers to collect data.