SpeechBuilder: Facilitating Spoken Dialogue System Creation
Eugene Weinstein
Project Oxygen Core Team
MIT Laboratory for Computer Science
ecoder@mit.edu
Bridging the Experience Gap

[Diagram: GALAXY architecture with Audio, Speech Recognition, Language Processing, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, and Database Server components connected through a central Hub]

• Developing robust, mixed-initiative spoken dialogue systems is difficult
  • Complex systems can be created by human-language technology experts
  • Novice developers must overcome a considerable technical challenge
• SpeechBuilder aims to help novices rapidly create speech-based systems
  • Uses intuitive methods for specifying domain-specific constraints
  • Automatically configures HLT components using the MIT GALAXY architecture
  • Leverages future technical advances
  • Encourages research on portability
Baseline Configuration

[Diagram: Audio Server, Speech Recognition, Language Processing, Generation, and Speech Synthesis connected through the Hub; the SpeechBuilder server exchanges CGI parameters with the Developer Application over HTTP]

• Communication with Galaxy via a simple HTTP protocol
• Gives the developer total control over application functionality
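Because the baseline simply posts key/value (CGI-style) parameters over HTTP, the developer application can be any ordinary web endpoint. A minimal sketch in Python follows; the parameter names ("action", "city"), the port, and the reply text are hypothetical illustrations, not the documented SpeechBuilder wire format.

```python
# Minimal sketch of an HTTP endpoint that could receive SpeechBuilder-style
# CGI parameters. Parameter names, port, and reply text are hypothetical.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class SpeechHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Parse the query string into a dict of key -> list of values
        params = parse_qs(urlparse(self.path).query)
        action = params.get("action", [""])[0]   # e.g. "lookup_weather"
        city = params.get("city", [""])[0]       # e.g. "Boston"
        reply = f"Handling {action} for {city}"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(reply.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SpeechHandler).serve_forever()
```

The same endpoint shape works in any language; the only contract is parsing the query parameters that SpeechBuilder sends.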
Modified Baseline Configuration (this class)

[Diagram: as in the baseline, but a Frame Relay server delivers the semantic frame to the Developer Application over a TCP socket]

• Still gives the developer total control over application functionality
• The Frame Relay server exposes the Galaxy meaning representation to the app
Database Access Configuration **

[Diagram: Audio Server, Speech Recognition, Language Processing, Discourse Resolution, Dialogue Management, Language Generation, and Speech Synthesis connected through the Hub, with a Database Server and INFO I/O Server providing data access]

• No programming required; the developer specifies table(s) and constraints
• For a speech-based interface to structured data
Creating a Speech-Based Application

Step 1: Off-line creation and compilation
[Diagram: the developer uploads a SpeechBuilder (SB) domain specification, which is compiled into ASR, NLU, NLG, discourse, dialogue, and INFO configurations]

Step 2: On-line deployment
[Diagram: at run time the Hub connects the Audio, ASR, NLU, Discourse, Dialogue, Response/NLG, TTS, and INFO (query) components]
Human Language Technologies

[Diagram: GALAXY components (Audio Server, Speech Recognition, Language Processing, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Database Server) connected through the Hub, annotated as follows]

• Audio Server: telephone or lightweight audio server
• Speech Recognition: generic acoustic models; unknown word model; class or hierarchical n-gram
• Language Processing: N-best interface with ASR; grammar from attributes & actions; backs off to concept spotting
• Context Resolution: new component; performs concept inheritance & masking; processes the ‘E-form’
• Language Generation: generates ‘E-form’, SQL, & responses; default entries made
• Hub: Galaxy programmable hub controls interactions between all components
• Database Server: generic server handles interaction; accesses the back-end database
• Speech Synthesis: commercial product
Extracting Database Information **

• Some columns are used to access entries (e.g., Name)
  • Column entries must be incorporated into ASR & NLU
• Some columns are only used in responses (e.g., Phone)
  • Column names must be incorporated into ASR & NLU
• Example query: "What is the phone number for Victor Zue?"
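To make the access/response distinction concrete, here is a hedged Python sketch; the table, columns, and rows are invented for illustration (SpeechBuilder itself configures this from its web interface, without code):

```python
# Hypothetical sketch of the kind of structured data SpeechBuilder targets.
# The table name, columns, and row contents are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, phone TEXT, email TEXT)")
conn.execute("INSERT INTO people VALUES ('Victor Zue', '555-0100', 'vz@example.edu')")

# "Access" columns: their *values* must be recognizable, so the entries
# (here, the person names) go into the ASR vocabulary and NLU grammar.
access_vocab = [row[0] for row in conn.execute("SELECT DISTINCT name FROM people")]

# "Response" columns: only their *names* need to be understood
# ("phone", "email"), since the values are spoken back, not recognized.
response_vocab = ["phone", "email"]

print(access_vocab, response_vocab)
```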
Knowledge Representation

• Concepts and actions form the basis for understanding
• Concepts become key/value entries in the meaning representation
  • city: Boston, New York… day: Monday, Tuesday…
• Actions provide sentence-level patterns for specific queries
  • "I want to fly from Boston to Taipei…" → action=lookup_flight
• Action text can be bracketed to define hierarchical concepts **
  • "I want to fly source=(from Boston) destination=(to Taipei)"
  • → source=Boston destination=Taipei
• Database columns define basic concepts
  • Column names can be grouped into concepts
  • property: phone, email… weather: snow, rain…
• Concepts and actions are used to configure the following components:
  • Speech Recognition
  • Natural Language Understanding
  • Discourse
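The resulting meaning representation can be pictured as a flat attribute/value structure. A minimal sketch, assuming a plain Python dict stands in for the richer Galaxy semantic frame:

```python
# Minimal sketch of the key/value meaning representation produced for
# "I want to fly from Boston to Taipei". A real Galaxy semantic frame is a
# richer recursive structure; this flat dict is an illustrative assumption.
utterance = "I want to fly from Boston to Taipei"

meaning = {
    "action": "lookup_flight",   # sentence-level pattern that matched
    "source": "Boston",          # hierarchical concept from "(from Boston)"
    "destination": "Taipei",     # hierarchical concept from "(to Taipei)"
}

# An application dispatches on the action and reads the concept values:
if meaning["action"] == "lookup_flight":
    print(f"Searching flights {meaning['source']} -> {meaning['destination']}")
```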
Language Modeling and Understanding

[Diagram: the query "Will it snow?" maps to weather:snow; a cloud of fine-grained surface words (rain, snow, hail, snowfall, snowstorm, sprinkles, breezy, accumulation, showers, snowy, thunderstorm, flurries, blizzard, rainfall, rainy) maps to a coarser weather concept such as weather:snow]

• By default, concepts are used for language modeling, the parsing grammar, and the meaning representation
• Concept usage can be fine-tuned to improve performance: **
  • For language modeling and parsing grammar only (i.e., no meaning)
  • For keyword spotting only (i.e., no role in language modeling)
  • For fine-grained language modeling with a coarser meaning representation
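The last option (fine-grained language modeling with a coarser meaning) amounts to recognizing many specific words while reporting a single concept value. A hedged sketch; the word lists and concept names are illustrative assumptions, not SpeechBuilder output:

```python
# Illustrative sketch: fine-grained words improve the language model, while
# the meaning representation collapses them to one coarse concept value.
WEATHER_CONCEPT = {
    "snow": ["snow", "snowfall", "snowstorm", "flurries", "blizzard"],
    "rain": ["rain", "rainfall", "showers", "sprinkles", "thunderstorm"],
}

# Invert the table: surface word -> coarse concept value
word_to_value = {w: v for v, words in WEATHER_CONCEPT.items() for w in words}

def spot_weather(utterance: str):
    """Keyword-spot a weather concept anywhere in an utterance."""
    for word in utterance.lower().split():
        word = word.strip("?!.,")
        if word in word_to_value:
            return {"weather": word_to_value[word]}
    return {}

print(spot_weather("Will it snow?"))           # {'weather': 'snow'}
print(spot_weather("Any flurries expected?"))  # {'weather': 'snow'}
```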
Current Status

• SpeechBuilder has been operational for over two years
  • Used by over 50 developers from MIT and elsewhere
  • Used in undergraduate classes at MIT and Georgetown University
• ASR capabilities benchmarked against existing systems
  • Achieves the same ASR performance as the MIT Jupiter weather information system (6.8% word error rate on clean data)
• Several prototype systems have been developed
  • Information about faculty, staff, and students at the LCS and AI Labs (phone, email, room, voice messages, transfer, etc.)
  • Application to control the various physical items in a typical office (lights, curtains, TV, VCR, projector, etc.)
  • Others include TV schedules, real-time weather forecasts, and hotel and restaurant information
• SpeechBuilder has been used for the initial design of many more complex domains
Ongoing and Future Work

• Increase the sophistication of the discourse and dialogue manager to handle more complex dialogues
  • Enable finer specification of discourse capabilities
  • Add generic capabilities for times, dates, etc.
• Incorporate confidence scoring and implement unsupervised training of acoustic and language models
• Create functionality to allow developers to build domain-specific concatenative speech synthesis
• Create alternative methods of domain specification to streamline development
  • Advanced developers don't necessarily use the web interface
  • Allow more efficient automatic generation of SpeechBuilder domains
Acknowledgements

Issam Bazzi, Scott Cyphers, Ed Filisko, Jim Glass, TJ Hazen, Lee Hetherington, Joe Polifroni, Stephanie Seneff, Michelle Spina, Eugene Weinstein, Jon Yi, Misha Zitser
SpeechBuilder Hands-on Activity
Eugene Weinstein
Project Oxygen Core Team
MIT Laboratory for Computer Science
ecoder@mit.edu
Modified Baseline Configuration (this class)

[Diagram: as before, the Frame Relay server delivers the semantic frame over a TCP socket to the Developer Application (here, Jaim)]

• Still gives the developer total control over application functionality
• The Frame Relay server exposes the Galaxy meaning representation to the app
SpeechBuilder API

• Galaxy meaning representation provided through the Frame Relay
• Applications connect via TCP sockets
• API provided in Perl, Python, and Java
• This class: the Python API

Python class galaxy.frame.Frame; methods:
  • getAction()
  • getAttribute(attr_name)
  • getText()
  • toString()

Python class galaxy.server.Server; methods:
  • Constructor(machine, port, ID)
  • connect()
  • processMessage(blocking)
  • disconnect()

[Diagram: Application ↔ Python API (galaxy.server.Server) ↔ TCP socket ↔ Galaxy Frame Relay, delivering galaxy.frame.Frame objects]
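A minimal usage sketch of this Python API. The host, port, application ID, action name, and attribute name below are placeholders, and the assumption that processMessage() returns a galaxy.frame.Frame (or None when no message is pending) is ours; the exact calling conventions may differ.

```python
# Hedged sketch of a SpeechBuilder application built on the Python API above.
# Host, port, application ID, and the names "lookup_weather"/"city" are
# placeholders; the return-value behavior of processMessage() is assumed.
import galaxy.server

server = galaxy.server.Server("localhost", 15000, "my_app")  # machine, port, ID
server.connect()
try:
    while True:
        frame = server.processMessage(1)  # blocking call; assumed to yield a Frame
        if frame is None:
            continue
        print("Heard:", frame.getText())    # recognized utterance text
        print("Frame:", frame.toString())   # printable meaning representation
        if frame.getAction() == "lookup_weather":    # hypothetical action
            city = frame.getAttribute("city")        # hypothetical concept
            print("Weather query for", city)
finally:
    server.disconnect()
```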