Evaluating Mixed-initiative Systems: Goals and Metrics

How to Evaluate a Mixed-initiative System?

Mike Pazzani’s caution:
• Don’t lose sight of the goal.
• The metrics are just approximations of the goal.
• Optimizing the metric may not optimize the goal.

Question: What is the goal to be optimized?

Possible goals of mixed-initiative systems:
• General goal: Mixed-initiative systems integrate human and automated reasoning to take advantage of their complementary reasoning styles and computational strengths.
• More specific goal: Mixed-initiative systems combine the human’s experience, flexibility, creativity, … with the agent’s speed, memory, tirelessness, … to take advantage of these complementary strengths.
• Even more specific goal: Mixed-initiative systems increase a human’s speed, memory, accuracy, competence, creativity, …
• Other goals: …

The more precise the goal, the easier it is to evaluate its achievement.

Question: How to evaluate the goal (or claim)?

Mixed-initiative system X increases a human’s speed, memory, accuracy, competence, creativity, …

Sub-questions:
• How to define and measure the speed, memory, accuracy, competence, creativity, … of the human-system combination?
• How to measure the relative contribution of the human and the system to the emergent behavior? (Is the overall performance mostly due to a smart user, to a good system, or to both?)

Compare to baseline behavior?

Measure and compare speed, memory, accuracy, competence, creativity, … for solving a class of problems in different settings:
• Human alone
• Agent alone
• Mixed-initiative human-agent system (MI)
• Non mixed-initiative human-agent system (¬MI)
• Ablated mixed-initiative human-agent system (MI-)
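The comparison across settings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the setting keys, the choice of accuracy as the metric, and all the numbers are hypothetical placeholders, not results from any study.

```python
from statistics import mean

# Hypothetical per-problem accuracy scores for each evaluation setting.
# Keys and values are illustrative placeholders, not measured data.
scores = {
    "human_alone": [0.60, 0.70, 0.65],
    "agent_alone": [0.50, 0.55, 0.45],
    "MI":          [0.80, 0.85, 0.90],  # mixed-initiative system
    "non_MI":      [0.70, 0.72, 0.68],  # non mixed-initiative system
    "MI_ablated":  [0.75, 0.78, 0.72],  # MI system with a capability removed
}

def compare_to_baselines(scores, target="MI"):
    """Return mean(target) minus mean(setting) for every other setting."""
    target_mean = mean(scores[target])
    return {
        name: target_mean - mean(values)
        for name, values in scores.items()
        if name != target
    }

gains = compare_to_baselines(scores)
for name, gain in gains.items():
    print(f"MI vs {name}: {gain:+.3f}")
```

A difference of means is the simplest possible summary. As the caution above notes, such a metric only approximates the goal, so a positive gain by itself does not establish that the goal has been met.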

Other complex questions

Consider the setting: human alone (baseline) versus mixed-initiative human-agent system (MI).
• How to account for human learning during baseline evaluation? Use other humans?
• How to account for human variability? Use many humans? How to pay for the associated cost?
• Replace a human with a simulation? How well does the simulation actually represent a human? Since the simulation is not perfect, how good is the result? How much does a good simulation cost?
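One conventional way to approach the learning and variability questions, offered here as an assumption rather than as an answer the slides give, is a counterbalanced within-subject design: every participant works in both settings, half of them starting with the baseline and half with the MI system, and the gain is computed per participant. The sketch below uses hypothetical participant identifiers and scores.

```python
import random
from statistics import mean, stdev

def counterbalanced_orders(participants, conditions=("human_alone", "MI"), seed=0):
    """Assign each participant an order of conditions, alternating the order
    so that learning effects are spread evenly over both conditions."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    forward = tuple(conditions)
    backward = tuple(reversed(conditions))
    return {p: (forward if i % 2 == 0 else backward) for i, p in enumerate(shuffled)}

def paired_gain(results):
    """results maps participant -> {"human_alone": score, "MI": score}.
    Returns the mean and standard deviation of the per-participant gain."""
    diffs = [r["MI"] - r["human_alone"] for r in results.values()]
    spread = stdev(diffs) if len(diffs) > 1 else 0.0
    return mean(diffs), spread

# Hypothetical participants and scores, for illustration only.
orders = counterbalanced_orders(["p1", "p2", "p3", "p4"])
results = {
    "p1": {"human_alone": 0.5, "MI": 0.7},
    "p2": {"human_alone": 0.6, "MI": 0.7},
}
gain_mean, gain_spread = paired_gain(results)
```

Pairing each participant with themselves reduces the effect of differences between people, but it does not remove the cost question: more participants are still needed to narrow the spread of the estimated gain.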

Evaluation Framework for MI systems

Currently no such framework exists, but it may emerge from generalization of specific cases.
• Specific problem: Knowledge authoring by subject matter experts who do not have prior knowledge engineering experience.
• Specific case: Disciple learning agent taught by a subject matter expert to become a knowledge-based assistant. The agent can help to formalize the knowledge. The expert has knowledge but cannot formalize it by himself.

Question: What are the characteristics of good case studies?
