Technical Aspects of the CALO Recorder

By Satanjeev Banerjee, Thomas Quisel, Jason Cohen, Arthur Chan, Yitao Sun, David Huggins-Daines, and Alex Rudnicky

  1. Technical Aspects of the CALO Recorder • By Satanjeev Banerjee, Thomas Quisel, Jason Cohen, Arthur Chan, Yitao Sun, David Huggins-Daines, and Alex Rudnicky

  2. Role of the CALO Recorder • A centralized mechanism for collecting all perceptual events • Speech, text • CMU provides technology for • Event recording • Speech recognition

  3. Role of the CALO Recorder • One of the components of CAMPER • The four components: • CALO recorder • Speechalizer • End-pointing information • Prosodic information • Speech recognition • CAMSeg • Speech segmentation • Understanding

  4. An Architecture Diagram (Client Side) • Diagram components: Audio Capturing; Text Capturing through Keyboard; Other Events; Ring Buffers; End-Pointer; VU Meter; Speech Decoder; Storage

  5. Persistence of Data • Background Intelligent Transfer Service (BITS) • Used to transfer data off-line

  6. Technical Challenges in the Recorder • Threading • Audio Buffering • Time-synchronization • Real-time processing • End-pointing • Speech processing • Portability • Maintenance/Distribution

  7. Threading • Several kinds of processing need to run concurrently • VU meter • Speech processing and higher-level understanding • Graphical user interface • Considerable development time was invested in making the communication between threads correct (by Thomas Quisel) • See the architecture diagram on the next slides • Example issue: on some platforms, the WX implementation prevents threads other than the GUI thread from calling its drawing functions.
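
The GUI-thread restriction on this slide is commonly handled by having worker threads post drawing requests to a queue that only the GUI thread drains. The sketch below illustrates that general pattern; it is not taken from the recorder's source, and the names `run_demo`, `vu_meter_worker`, and `draw_queue` are hypothetical.

```python
import queue
import threading


def run_demo(n_updates=5):
    """Worker posts VU-meter updates; only the 'GUI' thread performs the drawing."""
    draw_queue = queue.Queue()          # thread-safe FIFO shared by both threads
    drawn = []                          # stands in for the screen (assumption)
    gui_thread_id = threading.get_ident()

    def vu_meter_worker():
        # Runs off the GUI thread, so it never draws directly;
        # it only enqueues requests describing what should be drawn.
        for level in range(n_updates):
            draw_queue.put(("vu_level", level))
        draw_queue.put(None)            # sentinel: no more updates

    worker = threading.Thread(target=vu_meter_worker)
    worker.start()

    # "GUI" loop: the only place where drawing happens.
    while True:
        request = draw_queue.get()
        if request is None:
            break
        assert threading.get_ident() == gui_thread_id
        drawn.append(request)

    worker.join()
    return drawn
```

In a real toolkit the drain loop would be replaced by the toolkit's own event loop, with the worker posting an event to it rather than to a plain queue.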

  8. Audio Buffering • Sphinx 2 and 3.X libaudio require the caller to • Capture audio • Process the audio buffer • If the processing thread is even slightly slower than 1xRT, audio will be lost • A ring buffer structure was implemented (by Jason Cohen).
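
A minimal sketch of a ring buffer shared between a capture thread and a processing thread follows. It is an illustration under assumptions, not the recorder's implementation: the class name `RingBuffer`, its overwrite-oldest policy, and the `dropped` counter are all hypothetical choices.

```python
import threading


class RingBuffer:
    """Fixed-capacity byte ring buffer; overwrites the oldest data when full."""

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._start = 0       # index of the oldest unread byte
        self._size = 0        # number of unread bytes
        self.dropped = 0      # bytes overwritten before they were read
        self._lock = threading.Lock()

    def write(self, data):
        """Called by the capture thread with freshly captured audio bytes."""
        with self._lock:
            for byte in data:
                end = (self._start + self._size) % self._capacity
                self._buf[end] = byte
                if self._size == self._capacity:
                    # Buffer was full: the oldest byte has just been overwritten.
                    self._start = (self._start + 1) % self._capacity
                    self.dropped += 1
                else:
                    self._size += 1

    def read(self, max_bytes):
        """Called by the processing thread; returns up to max_bytes, oldest first."""
        with self._lock:
            n = min(max_bytes, self._size)
            out = bytes(
                self._buf[(self._start + i) % self._capacity] for i in range(n)
            )
            self._start = (self._start + n) % self._capacity
            self._size -= n
            return out
```

The buffer absorbs short stalls in the processing thread; a non-zero `dropped` count would indicate that processing fell behind for longer than the buffer's capacity covers.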

  9. Time Synchronization • By David Huggins-Daines • Simple NTP (SNTP) is used to obtain Coordinated Universal Time (UTC) from an arbitrary NTP server • Clone of the standard NTP implementation • Internal synchronization • Synchronization time between machines • 50-60 ms • The major challenge is the delay imposed by the OS and the audio capturing software.
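
The clock arithmetic behind an SNTP exchange can be sketched as follows, using the standard four-timestamp offset and delay formulas. The function names are hypothetical, and the sketch leaves out the network exchange itself.

```python
NTP_UNIX_EPOCH_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def ntp_to_unix(ntp_seconds, ntp_fraction):
    """Convert a 64-bit NTP timestamp (32-bit seconds, 32-bit fraction) to Unix time."""
    return ntp_seconds - NTP_UNIX_EPOCH_DELTA + ntp_fraction / 2.0**32


def sntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one SNTP exchange.

    t0: client clock when the request was sent
    t1: server clock when the request was received
    t2: server clock when the reply was sent
    t3: client clock when the reply was received
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # how far the client clock is behind
    delay = (t3 - t0) - (t2 - t1)            # network round trip, minus server time
    return offset, delay


# Example: client clock 50 ms behind the server, 10 ms network latency each way,
# 2 ms of server processing time.
offset, delay = sntp_offset_and_delay(100.000, 100.060, 100.062, 100.022)
```

The offset estimate assumes the outbound and return network latencies are equal; any asymmetry shows up directly as synchronization error.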

  10. Real-time Processing • Role of end-pointing and recognition • After a long debate • A two-stage end-pointing and recognition architecture was chosen • By Ziad • A high-performance end-pointing routine was created • Gaussian Mixture Model-based • End-pointer implemented as a frame voter within segments • The parameters were further tuned manually • Speed optimized • Now in s3ep, a customized version of Sphinx
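
One way to read "GMM-based frame voter within segments" is sketched below: each frame is labeled speech or non-speech by comparing the likelihoods of two Gaussian mixtures, and fixed-length segments are then decided by majority vote. This is an assumption-laden illustration, not s3ep's algorithm: the one-dimensional log-energy feature, the mixture parameters, the segment length, and all function names are hypothetical.

```python
import math

# Hypothetical 1-D mixtures over frame log-energy: (weight, mean, variance).
SPEECH_GMM = [(0.6, 8.0, 1.0), (0.4, 10.0, 2.0)]
SILENCE_GMM = [(0.7, 2.0, 1.0), (0.3, 4.0, 1.5)]


def gmm_log_likelihood(x, components):
    """Log-likelihood of scalar x under a 1-D Gaussian mixture."""
    total = 0.0
    for weight, mean, var in components:
        norm = math.sqrt(2.0 * math.pi * var)
        total += weight * math.exp(-((x - mean) ** 2) / (2.0 * var)) / norm
    return math.log(total + 1e-300)  # small floor avoids log(0)


def classify_frames(features):
    """Label each frame True (speech) or False (non-speech)."""
    return [
        gmm_log_likelihood(x, SPEECH_GMM) > gmm_log_likelihood(x, SILENCE_GMM)
        for x in features
    ]


def vote_segments(frame_is_speech, segment_len=5, threshold=0.5):
    """Majority vote over consecutive fixed-length segments of frames."""
    decisions = []
    for start in range(0, len(frame_is_speech), segment_len):
        seg = frame_is_speech[start:start + segment_len]
        decisions.append(sum(seg) / len(seg) > threshold)
    return decisions


def endpoints(segment_decisions, n_frames, segment_len=5):
    """Merge runs of speech segments into (start_frame, end_frame) spans, end exclusive."""
    spans, run_start = [], None
    for i, is_speech in enumerate(segment_decisions):
        if is_speech and run_start is None:
            run_start = i * segment_len
        elif not is_speech and run_start is not None:
            spans.append((run_start, i * segment_len))
            run_start = None
    if run_start is not None:
        spans.append((run_start, n_frames))
    return spans
```

Voting within segments smooths over isolated misclassified frames, so a single quiet frame inside an utterance does not split it in two.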

  11. Speech Recognizer • The end-pointer's output is fed to the recognizer • Speech recognition in meetings • Regarded as one of the biggest challenges in the field • Results vary widely with meeting style, number of attendees, topics, and speaker disfluencies.

  12. Accuracy Performance (still under heavy work; current status) • In the cleanest meeting (Bdb001) • With one very dominant male speaker • With one very dominant female speaker • Speaker speaking-rate entropy is lowest • Error rate: 29.4%

  13. Phase IV of Accuracy Improvement (Core) • Boosting-based training • Confidence-based N-best re-ranking • Speaker adaptation based on transformation • Speaker normalization • Include BN and SWB material in LM training • Dictionary refinement

  14. Phase IV of Accuracy Improvement (Optional) • STC • MLLT • DT • PLP, TRAP • LM with disfluencies and back-channeling

  15. Speed • 2.2 GHz machine • Communicator • S2: 17.3%, 0.34xRT • S3.X BL: 11.8%, 4xRT • S3.X Tuned: 12.8%, 0.87xRT • WSJ 5k • S3.X BL: 7.4%, 1.61xRT • S3.X BL: 8.3%, 0.5xRT • ICSI • With SVQ and CIGMMS tuned, 0.7xRT is achieved • The results can possibly be tuned further • Benchmarking results need time to prepare
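
The xRT figures on this slide are real-time factors: processing time divided by audio duration, so values below 1 mean the decoder runs faster than real time. A small sketch of the arithmetic follows; the function names are hypothetical.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """xRT: seconds of processing needed per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up_with_live_audio(xrt):
    """A live recorder only keeps up if it runs faster than real time."""
    return xrt < 1.0


def backlog_seconds(xrt, audio_seconds):
    """Processing time still outstanding once audio_seconds of live audio have arrived."""
    return max(0.0, (xrt - 1.0) * audio_seconds)


# Example: one minute of audio decoded in 30 s gives 0.5xRT, which keeps up.
# At 4xRT, the same minute of audio leaves 180 s of processing outstanding.
xrt_fast = real_time_factor(30.0, 60.0)
backlog_slow = backlog_seconds(4.0, 60.0)
```

This is why a configuration such as the 4xRT baseline is unsuitable for live capture, whereas the tuned configurations below 1xRT can process audio as it arrives.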

  16. Maintenance and Distribution • All in local CVS • C, Java • Will soon move to SRI • Regular releases are created; use of SRI’s CVS will blur this line.

  17. Conclusion • Engineering work on the recorder is mostly done • Time to improve individual components • Everyone is welcome to join the effort.