RADAR EVALUATION Goals, Targets, Review & Discussion

Presentation Transcript


  1. School of Computer Science. RADAR EVALUATION: Goals, Targets, Review & Discussion. Jaime Carbonell (& soon the full SRI/CMU/IET RADAR team). 1 February 2005. Supported by the DARPA IPTO PAL Program: “Personalized Assistant That Learns”.

  2. Outline: Radar Evaluation • Brief Review of Radar Challenge Task • Evaluation Objectives: Obligations and Desiderata • Evaluation Components: Radar Tasks • Radar Metrics: Tasks → Meaningful Measures • Putting it all together: Tin-man formula proposal

  3. [Slide diagram: RADAR modules (Planning & Scheduling, NLP, Crisis Resolver, Learning, E-Mail Handler, Knowledge Base) connecting conference Wings A and B, the conference organizers, the participants, and the website.] Test: RADAR will assist a conference planner in a crisis situation. The original plan has been disrupted: conference Wing A is no longer available, and other rooms may be affected. The resolver needs to replan: gather information, commandeer other rooms, change schedules, post to websites, inform participants. The test will be evaluated on the quality and completeness of the new plan and on the successful completion of related tasks.

  4. Conference Re-planning Tasks • Situation Assessment • Which resources have become unavailable • What alternative resources exist and at what price • Tentative re-planning of conference schedule • Elicit and satisfy as many preferences as possible • Validating conference schedule & resource allocation • Securing buy-in from key stakeholders (requires meeting) • Awaiting external confirmations (or default assumptions) • Modifying plan as/when needed • Informing all stakeholders • Briefings to VIPs, update website for participants • Cope with background tasks (time permitting)

  5. Scoring Criteria (Adapted from Garvey) • Task Realism • Must reflect RADAR challenge performance • Sensitive to Learning • Must allow headroom beyond Y2 (no low ceiling) • Must include measurement of learning effects • Auditable with Pride • Objective, Simple, Clear, Transparent, Statistically Sound, Replicable, … • Comprehensive & Research-Useful • All RADAR modules included, albeit differentially • Responsive to RADAR scientific objectives

  6. Evaluation Components • All RADAR Modules (schedule quality) • Time-Space Planning (TSP): Schedule quality • Meeting Scheduling (CMRadar): Meetings, bumps • Webmaster + Briefing Assistant (VIO) • Email + NLP: Other tasks completed: background • Additional Learning Targets (?) • Relevant facts & preferences acquired • Strategic knowledge (when/how to apply K) • Combination Function (Utility-like) • Linear weighted sum with +/- terms
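A minimal sketch of how such a utility-like combination might be computed, assuming the score is a linear weighted sum over the components listed above with positive (achievement) and negative (penalty) terms. The component names, weights, and scores below are illustrative placeholders, not values from the RADAR evaluation:

```python
# Illustrative sketch only: a utility-like combination function realized as a
# linear weighted sum with positive (achievement) and negative (penalty) terms.
# Component names, weights, and scores are hypothetical placeholders.

def combined_score(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of component scores; penalty components carry negative scores."""
    return sum(weights[name] * score for name, score in components.items())

example = combined_score(
    components={
        "schedule_quality": 0.82,    # TSP: time-space schedule quality
        "meetings_scheduled": 0.75,  # CMRadar: stakeholder meetings arranged
        "meeting_bumps": -0.10,      # CMRadar: penalty term for bumped meetings
        "briefings_website": 0.60,   # VIO: webmaster + briefing assistant output
        "background_tasks": 0.40,    # Email + NLP: background tasks completed
    },
    weights={
        "schedule_quality": 0.35,
        "meetings_scheduled": 0.20,
        "meeting_bumps": 0.10,
        "briefings_website": 0.20,
        "background_tasks": 0.15,
    },
)
print(round(example, 3))  # single scalar evaluation score
```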

  7. Example: Schedule Quality Metric [the formula appears only as an image on the original slide] • W = weight = importance of the session (e.g. keynote > posters) • P = penalty for distance from ideal (e.g. room smaller than target), linear or step function • f = factors of a session (e.g. room size, duration, equipment, …) • r = resource (e.g. ballroom at Flagstaff)
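Since the formula itself did not survive transcription, here is one plausible reconstruction from the symbol definitions above; the exact functional form is an assumption, not the slide's formula. For each session s with assigned resource r(s):

Q_sched = \sum_{s \in sessions} W_s \Big( 1 - \sum_{f \in factors} P\big(f(s),\, r(s)\big) \Big)

That is, an importance-weighted sum over sessions of how well the assigned resource matches the session's ideal on each factor, where P(f, r) is 0 at the ideal and grows (linearly or in steps) with distance from it.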

  8. Putting It All Together • Normalizing components: [formula image not preserved] • Summing: [two alternative formula images not preserved]
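The normalization and summing expressions on this slide are likewise images missing from the transcript. A sketch consistent with the linear-weighted-sum combination described on slide 6, with symbols chosen here for illustration: normalize each component score to a common range, e.g.

\hat{Q}_i = Q_i / Q_i^{max} \in [0, 1]

and then combine either as a plain weighted sum or with separate positive and negative (penalty) groups:

Score = \sum_i \alpha_i \hat{Q}_i \quad \text{or} \quad Score = \sum_{i \in achievements} \alpha_i \hat{Q}_i - \sum_{j \in penalties} \beta_j \hat{Q}_j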

  9. Next Steps for Evaluation Metrics • Metrics for other components • Metrics for learning boost • Discuss/Refine/Redo combination • True open-ended scale? • Something other than weighted sum? • Quality metric w/o penalties (+ terms only) • Test in a full walk-through scenario • Refine the details • Don't lose sight of objectives
