School of Computer Science
RADAR EVALUATION: Goals, Targets, Review & Discussion
Jaime Carbonell & (soon) the full SRI/CMU/IET RADAR Team
1 February 2005
Supported by the DARPA IPTO PAL Program: "Personalized Assistant That Learns"
Outline: RADAR Evaluation
• Brief Review of the RADAR Challenge Task
• Evaluation Objectives: Obligations and Desiderata
• Evaluation Components: RADAR Tasks
• RADAR Metrics: Tasks → Meaningful Measures
• Putting It All Together: Tin-Man Formula Proposal
[Diagram: RADAR architecture (Planning & Scheduling, NLP, Crisis Resolver, Learning, E-Mail Handler, Knowledge Base) connecting conference organizers, conference participants, the website, and Wings A & B]
Test: RADAR will assist a conference planner in a crisis situation. The original plan has been disrupted: conference Wing A is no longer available, and other rooms may be affected. The resolver needs to replan: gather information, commandeer other rooms, change schedules, post to websites, and inform participants. The test will be evaluated on the quality and completeness of the new plan and on the successful completion of related tasks.
Conference Re-planning Tasks
• Situation assessment
  • Which resources have become unavailable
  • What alternative resources exist, and at what price
• Tentative re-planning of the conference schedule
  • Elicit and satisfy as many preferences as possible
• Validating the conference schedule & resource allocation
  • Securing buy-in from key stakeholders (requires a meeting)
  • Awaiting external confirmations (or default assumptions)
  • Modifying the plan as/when needed
• Informing all stakeholders
  • Briefings to VIPs; update the website for participants
• Cope with background tasks (time permitting)
Scoring Criteria (Adapted from Garvey)
• Task Realism
  • Must reflect RADAR challenge performance
• Sensitive to Learning
  • Must allow headroom beyond Y2 (no low ceiling)
  • Must include measurement of learning effects
• Auditable with Pride
  • Objective, simple, clear, transparent, statistically sound, replicable, …
• Comprehensive & Research-Useful
  • All RADAR modules included, albeit differentially
  • Responsive to RADAR scientific objectives
Evaluation Components
• All RADAR Modules (schedule quality)
  • Time-Space Planning (TSP): schedule quality
  • Meeting Scheduling (CMRadar): meetings, bumps
  • Webmaster + Briefing Assistant (VIO)
  • Email + NLP: other tasks completed (background)
• Additional Learning Targets (?)
  • Relevant facts & preferences acquired
  • Strategic knowledge (when/how to apply K)
• Combination Function (utility-like)
  • Linear weighted sum with +/- terms (see the sketch after this list)
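The combination function is specified here only as a utility-like linear weighted sum with positive and negative terms. A minimal Python sketch of that idea follows; the function name, component names, weights, and numbers are hypothetical illustrations, not part of the RADAR system.

```python
# Minimal sketch of a utility-like combination function: a linear weighted
# sum over normalized component scores, with negative terms for penalties.
# All names and weights below are hypothetical, not RADAR's actual values.

def combine_scores(component_scores, weights, penalties, penalty_weights):
    """Return an overall evaluation score.

    component_scores / weights: positive contributions, e.g. schedule quality,
        meetings scheduled, web/briefing tasks completed (each in [0, 1]).
    penalties / penalty_weights: negative contributions, e.g. bumped meetings,
        unmet hard constraints.
    """
    gain = sum(w * s for w, s in zip(weights, component_scores))
    loss = sum(v * p for v, p in zip(penalty_weights, penalties))
    return gain - loss


# Example usage with made-up numbers:
score = combine_scores(
    component_scores=[0.8, 0.6, 0.9],   # schedule quality, meetings, web/briefing
    weights=[0.5, 0.3, 0.2],
    penalties=[0.1],                    # e.g. fraction of meetings bumped
    penalty_weights=[0.25],
)
print(score)  # approximately 0.735
```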
Example: Schedule Quality Metric
• W = weight = importance of the session (e.g. keynote > posters)
• P = penalty for distance from ideal (e.g. room smaller than target), linear or step function
• f = factors of sessions (e.g. room size, duration, equipment, …)
• r = resource (e.g. ballroom at Flagstaff)
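The metric formula itself appeared as a slide graphic and is not recoverable verbatim; the following is only a plausible reconstruction consistent with the definitions above, assuming penalties are summed over each session's factors and weighted by session importance:

$$
Q(\text{schedule}) \;=\; \sum_{s \,\in\, \text{sessions}} W_s \left( 1 \;-\; \sum_{f \,\in\, \text{factors}(s)} P\big(f, r(s)\big) \right)
$$

Here r(s) denotes the resource assigned to session s; whether P is linear or a step function of the distance from ideal is left open, as on the slide.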
Putting It All Together
• Normalizing components
• Summing (one of two variants; see the reconstruction below)
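The normalization and summing equations on this slide were also graphics. A hedged reconstruction, assuming each component score Q_i is normalized against its maximum attainable value and that the "or" contrasts an additive-only sum with one that subtracts penalty terms (the symbols w_i, v_j, p_j are assumptions, not the original notation):

$$
\hat{Q}_i = \frac{Q_i}{Q_i^{\max}}, \qquad
S = \sum_i w_i\,\hat{Q}_i
\quad\text{or}\quad
S = \sum_i w_i\,\hat{Q}_i \;-\; \sum_j v_j\,p_j
$$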
Next Steps for Evaluation Metrics
• Metrics for the other components
• Metrics for the learning boost
• Discuss/refine/redo the combination
  • A true open-ended scale?
  • Something other than a weighted sum?
  • A quality metric without penalties (positive terms only)
• Test in a full walk-through scenario
  • Refine the details
• Don't lose sight of the objectives