Recording Meetings with the CMU Meeting Recorder Architecture
Satanjeev Banerjee, et al.
School of Computer Science, Carnegie Mellon University
Goals
• End goal: Build conversational agents
• That “understand” meetings
• E.g.: Identify action items
• Make contributions to meetings
• E.g.: Confirm details of action items
• Part of Project CALO: Cognitive Agent that Learns and Organizes
• First goal: Create corpus of human meetings
• Capture data that we expect agents to use
• E.g.: Speech, video, whiteboard markings, etc.
Desirable Properties of the Recorder
• Need to record meetings anywhere
• Emphasis on instrumenting the user, not the room
• Assume low network bandwidth
• Should still be able to record in the extreme situation where there is no network access!
• Should be easy to add new data streams
• “Easy” = little time needed to incorporate a new stream
• Should support the major operating systems
The Recorder Architecture
• Information stream is discretized into events
• Either a sequence of events, e.g. utterances
• Or one long event, e.g. video data
• Each event is given start/end time stamps
• Coincide for instantaneous events, e.g. keystroke
• Events are stored on local disks
• Laptops, shuttle PCs, etc.
• Events are (slowly) uploaded to a central server when there is network access
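The event abstraction on this slide can be sketched as a small record type. This is only an illustration, not MRCP's actual data structure: the class name `Event`, its field names, and the choice of float seconds for time stamps are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One discretized piece of a recorded stream (all names are illustrative)."""
    meeting: str
    user: str
    modality: str   # e.g. "SPEECH", "VIDEO", "NOTES"
    path: str       # where the event data sits on the local disk
    start: float    # seconds on the synchronized clock (assumed unit)
    end: float

    def __post_init__(self):
        # An event can be instantaneous, but it can never end before it starts.
        if self.end < self.start:
            raise ValueError("end time stamp precedes start time stamp")

    @property
    def is_instantaneous(self) -> bool:
        # Start and end coincide, as for a keystroke.
        return self.start == self.end

    @property
    def duration(self) -> float:
        return self.end - self.start
```

A keystroke would be built with equal start and end values, while an utterance or a whole video recording carries a non-zero duration.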
Event Identification and Logging
• Each recorded event has the following identifying information associated with it:
• Start and stop time stamps
• Name of the meeting and the user
• Modality (speech, video, handwriting, etc.)
• After recording an event, its identification information is sent to a logging server
• Server creates a list of all the events in a meeting
• Good for bookkeeping (but not essential)
Architecture of Meeting Recorder
[Figure: diagram with the labels Participant 1, Participant 2, Participant 3, their streams P1, P2, P3, Time server, Browse Meeting, and [master].]
Example event record from the figure:
{ DATA_BLOCK
session: OTTER
user: arudnicky
datatype: SPEECH
file: \\spot\data\u1.raw
Start: 20030917::18:27.600
End: 20030917::18:35.357
}
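The DATA_BLOCK record on this slide looks like a brace-wrapped list of `key: value` fields. A minimal parser for that shape might look as follows. The function name `parse_data_block` is hypothetical, and the grammar is inferred from this single example, so it assumes that values never end in a single colon. Time stamps are kept as opaque strings because the slide does not define their format.

```python
def parse_data_block(text: str) -> dict:
    """Parse a '{ DATA_BLOCK key: value ... }' record into a dict of strings."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("record must be wrapped in braces")
    tokens = body[1:-1].split()
    if not tokens or tokens[0] != "DATA_BLOCK":
        raise ValueError("missing DATA_BLOCK tag")

    fields = {}
    key = None
    for tok in tokens[1:]:
        # A token ending in exactly one ':' opens a new field (assumed grammar).
        if tok.endswith(":") and not tok.endswith("::"):
            key = tok[:-1].lower()
            fields[key] = ""
        elif key is None:
            raise ValueError("value found before any field name")
        else:
            # Anything else extends the current field's value.
            fields[key] = (fields[key] + " " + tok).strip()
    return fields


SAMPLE = (r"{ DATA_BLOCK session: OTTER user: arudnicky datatype: SPEECH "
          r"file: \\spot\data\u1.raw "
          r"Start: 20030917::18:27.600 End: 20030917::18:35.357 }")
record = parse_data_block(SAMPLE)
```

Field names are lower-cased so that `Start` and `End` can be looked up as `record["start"]` and `record["end"]`.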
Synchronizing the Time Stamps
• All event time stamps must be synchronized
• We use the Simple Network Time Protocol (SNTP)
• Query a central NTP server for the time
• Use the reply and the round-trip time to estimate the time difference between the local machine and the server
• Use this to create server-time time stamps
• Rough experiments reveal 10 ms variance
• Caveat: Experiments done on a high-speed network
• What if there is *no* network access?
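The offset estimate described on this slide can be sketched as below. This is a simplification, not an SNTP client: `query_server` is a hypothetical callable standing in for the network exchange, and the sketch assumes the reply was generated halfway through the round trip.

```python
import time


def estimate_offset(query_server, clock=time.time):
    """Return (offset, round_trip) such that server_time ~= local_time + offset.

    query_server is a hypothetical callable that returns the server's time in
    seconds; clock returns the local time in seconds.
    """
    t_send = clock()
    server_time = query_server()
    t_recv = clock()
    round_trip = t_recv - t_send
    # Assume symmetric delay: the server read its clock at the midpoint.
    offset = server_time - (t_send + round_trip / 2.0)
    return offset, round_trip


def server_timestamp(offset, clock=time.time):
    """Convert the current local time into an estimated server-time stamp."""
    return clock() + offset
```

Once an offset has been estimated, every later event can be stamped with `server_timestamp(offset)` without contacting the server again, which may be one way to keep recording while the network is unavailable.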
Aggregating the Data
• When network access becomes available, data is transferred from all sites to a central location
• Current recording sites: CMU and Stanford
• Implemented a cross-platform version of the Microsoft Background Intelligent Transfer Service
• Uploads files in a transparent background process
• Throttles bandwidth use as the user’s activity goes up
• Pauses if network connection is lost
• Resumes once network access is restored
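The throttle, pause, and resume behaviour listed on this slide can be illustrated with a toy uploader. This is not the Background Intelligent Transfer Service API or the authors' implementation. `ResumableUpload`, `InMemoryServer`, and the `user_activity` scale from 0.0 (idle) to 1.0 (busy) are all assumptions made for the sketch.

```python
class InMemoryServer:
    """Stand-in for the central server (hypothetical, only exercises the sketch)."""

    def __init__(self):
        self.received = bytearray()
        self.online = True

    def send_chunk(self, offset, chunk):
        if not self.online:
            raise ConnectionError("network unavailable")
        if offset != len(self.received):
            raise ValueError("chunk arrived out of order")
        self.received.extend(chunk)


class ResumableUpload:
    """Upload bytes chunk by chunk, remembering how far the transfer got."""

    def __init__(self, data, send_chunk, max_chunk=4096):
        self.data = data
        self.send_chunk = send_chunk
        self.max_chunk = max_chunk
        self.offset = 0  # number of bytes the server has confirmed so far

    @property
    def done(self):
        return self.offset >= len(self.data)

    def step(self, user_activity=0.0):
        """Try to send one chunk; return bytes sent (0 if paused or finished)."""
        if self.done:
            return 0
        # Throttle: the busier the user, the smaller the chunk we allow.
        activity = min(max(user_activity, 0.0), 1.0)
        budget = int(self.max_chunk * (1.0 - activity))
        if budget == 0:
            return 0
        chunk = self.data[self.offset:self.offset + budget]
        try:
            self.send_chunk(self.offset, chunk)
        except ConnectionError:
            return 0  # network lost: keep the offset so a later step resumes
        self.offset += len(chunk)
        return len(chunk)
```

Calling `step()` from a background loop gives the pause and resume behaviour for free: a failed send leaves `offset` untouched, so the next successful call continues from the same byte.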
Data Collection Process (proposed)
[Figure: flow diagram with the labels Transcription, Annotation; MEETING DATABASE; CALO Learning Analysis; preparation; Independent cross-site collection; integration; Background data transmission; research.]
Capturing Close-Talking Speech
• Implemented Meeting Recorder Cross Platform (MRCP) to record speech and notes
• Speech recorded using head-mounted microphones
• 11.025 kHz sampling rate used for portability
• Endpointing done using the CMU Sphinx 3 speech recognizer
• Each endpointed utterance is an event
• Utterance is recorded to local disk (WAV format)
• Time stamps are generated using Simple NTP
• Utterance’s identifying information is sent to the logging server, and the utterance is queued for upload
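To show what endpointing produces, here is a deliberately simple energy-threshold endpointer. It is a stand-in for illustration only and is not how CMU Sphinx 3 segments speech. The function name `endpoint`, the frame size, the threshold, and the silence hang-over are all assumed values.

```python
def endpoint(samples, rate=11025, frame_ms=20, threshold=500.0,
             min_silence_frames=10):
    """Return (start_sec, end_sec) pairs for runs of loud frames.

    A frame is "loud" when its mean absolute amplitude reaches the threshold.
    An utterance ends once min_silence_frames quiet frames follow it.
    """
    frame_len = max(1, rate * frame_ms // 1000)
    n_frames = len(samples) // frame_len
    utterances = []
    start = None   # index of the first loud frame of the open utterance
    silence = 0    # quiet frames seen since the last loud frame

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(abs(s) for s in frame) / frame_len
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                end = i - silence + 1  # first quiet frame after the speech
                utterances.append((start * frame_len / rate,
                                   end * frame_len / rate))
                start, silence = None, 0

    if start is not None:
        # The recording stopped while an utterance was still open.
        end = n_frames - silence
        utterances.append((start * frame_len / rate, end * frame_len / rate))
    return utterances
```

Each returned pair corresponds to one utterance event, whose start and end would then be converted to synchronized time stamps.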
Capturing Typed Notes
• Users type notes in client’s note-taking area
• “Snapshots” of notes are taken at each carriage return
• Each snapshot is an event
• Each snapshot is saved to disk, time-stamped, logged, and queued for upload
• [Demonstration of MRCP]
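The snapshot rule can be sketched as follows. The input format, a list of `(timestamp, character)` pairs, and the decision to store the full note text in each snapshot are assumptions; the slide does not say what a snapshot contains.

```python
def snapshots_from_keystrokes(keystrokes):
    """Return (timestamp, notes_so_far) for every carriage return typed.

    keystrokes is an iterable of (timestamp, character) pairs in typing order.
    """
    buffer = []
    snapshots = []
    for timestamp, char in keystrokes:
        buffer.append(char)
        if char in ("\n", "\r"):
            # A line was completed: capture the whole note area as one event.
            snapshots.append((timestamp, "".join(buffer)))
    return snapshots
```

Because a snapshot is taken at a single moment, its start and stop time stamps would coincide, as for other instantaneous events.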
More Details about MRCP
• Implemented using cross-platform libraries:
• wxWidgets for GUI, file access, networking
• PortAudio for audio
• Currently compiles on Windows, Mac OS X, and Linux
• Windows version distributed to other Project CALO sites
• Macintosh and Linux versions in beta testing
• WinCE version in development
Capturing Whiteboard Pen Strokes
• We use Mimio to capture whiteboard pen strokes
• A “stroke” consists of all the x-y coordinates between pen-down and pen-up
• Each stroke is an event. It is recorded, time-stamped, logged, and queued for upload.
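Grouping pen samples into stroke events can be sketched like this. The sample format `(t, x, y, pen_down)` is hypothetical and is not Mimio's actual output. The sketch only illustrates the pen-down to pen-up rule.

```python
def group_strokes(samples):
    """Group (t, x, y, pen_down) samples into strokes.

    Each stroke is a dict with start and end times and the list of (x, y)
    points seen while the pen was down.
    """
    strokes = []
    current = None
    for t, x, y, pen_down in samples:
        if pen_down:
            if current is None:
                # Pen just touched the board: open a new stroke.
                current = {"start": t, "end": t, "points": []}
            current["points"].append((x, y))
            current["end"] = t
        elif current is not None:
            # Pen lifted: close the stroke at the pen-up time.
            current["end"] = t
            strokes.append(current)
            current = None
    if current is not None:
        # Input ended with the pen still down; keep the partial stroke.
        strokes.append(current)
    return strokes
```

Each dict in the result maps onto one event, with `start` and `end` as its time stamps and `points` as its data.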
Capturing PowerPoint Slide Information
• We use Microsoft’s PowerPoint API to capture slide-change timing information and slide contents
• Events = slide changes
• Event data = content of the new slide
• Content is in the form of all the text and all the “shapes” on the slide
• Events are instantaneous
• Start and stop time stamps coincide
• Events are processed as before
Capturing Panoramic Video
• We capture panoramic video using a 4-camera CAMEO device
• Developed by the Physical Awareness group at CMU
• Video recording done in MPEG-4 format
• One long event is produced and uploaded
Current Status of Data Collection
• Recorded meetings vary widely in size…
• From 2-person to 10-person meetings
• …in meeting type
• Scheduling meetings, presentations, brainstorming sessions
• …in content
• Speech group meetings, dialog group meetings, physical awareness group meetings
• Currently have a total of more than 11,000 utterances (including cross talk)
Using the Data: Some Initial Research
• Question: Can we detect the state of a meeting, and the roles of participants, from simple speech data?
• Introduced a taxonomy of meeting states and participant roles
Detection Methods and Initial Results
• Used Anvil to hand-annotate 45 minutes of meeting video with states and roles
• Trained a decision tree classifier on 30 minutes of data
• Input features:
• Number of speakers, lengths of utterances, pauses, and interruptions within a short history of the meeting
• Initial results: About 50% detection accuracy on a separate 15 minutes of test data
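The input features named on this slide could be computed from utterance events roughly as follows. The exact definitions are assumptions, since the slide does not give them. Here a pause is a gap between consecutive utterances, an interruption is an utterance by a different speaker that starts before the previous one ends, and the history window is 60 seconds.

```python
def window_features(utterances, now, history=60.0):
    """Summarize the utterances that overlap the window (now - history, now].

    utterances is a list of (speaker, start, end) tuples with times in seconds.
    """
    recent = sorted(
        (u for u in utterances if u[2] > now - history and u[1] <= now),
        key=lambda u: u[1],
    )
    if not recent:
        return {"num_speakers": 0, "mean_utt_len": 0.0,
                "pause_time": 0.0, "interruptions": 0}

    lengths = [end - start for _, start, end in recent]
    pause_time = 0.0
    interruptions = 0
    for prev, cur in zip(recent, recent[1:]):
        gap = cur[1] - prev[2]
        if gap > 0:
            pause_time += gap       # silence between two utterances
        elif gap < 0 and cur[0] != prev[0]:
            interruptions += 1      # a different speaker started early

    return {
        "num_speakers": len({speaker for speaker, _, _ in recent}),
        "mean_utt_len": sum(lengths) / len(lengths),
        "pause_time": pause_time,
        "interruptions": interruptions,
    }
```

One such feature dictionary per time step, paired with the hand-annotated state or role, would form a training example for a decision tree learner.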
Questions? Thanks to DARPA grant NBCH-D-02-0010