ICSLP’ 98 CONTROLLING A HIFI WITH A CONTINUOUS SPEECH UNDERSTANDING SYSTEM. J. Ferreiros, J. Colás, J. Macías-Guarasa, A. Ruiz, J. M. Pardo Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica E.T.S.I. Telecomunicación - Universidad Politécnica de Madrid
ICSLP’ 98
CONTROLLING A HIFI WITH A CONTINUOUS SPEECH UNDERSTANDING SYSTEM
J. Ferreiros, J. Colás, J. Macías-Guarasa, A. Ruiz, J. M. Pardo
Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica
E.T.S.I. Telecomunicación - Universidad Politécnica de Madrid
Ciudad Universitaria s/n, 28040 Madrid, Spain
General Architecture
[Block diagram]
• Processing modules: Speech Recogniser, Tagger, Tags Refiner, Understanding, Actuator, Speech Generation Module, Text to Speech
• Knowledge sources and I/O: SCHMM + Word Pair, Tagged Dictionary, Context Dependent Rules (two blocks), HIFI Status, IR-LED, Alternative Expressions
Speech recogniser
• Characteristics:
    • Continuous speech commands
    • One-pass search with word-pair grammar
    • 163 words
    • SCHMM phone models
• Implementation:
    • Front-end: DSP LSI board
    • Rest of processing: PC
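A word-pair grammar only states which words may follow which. The sketch below illustrates that idea with a toy vocabulary; the word list, the `<s>`/`</s>` boundary symbols, and the function names are assumptions made for illustration and are not taken from the system's 163-word grammar.

```python
# Hypothetical word-pair grammar: each word maps to the set of words
# that are allowed to follow it. Vocabulary and pairs are invented.
START, END = "<s>", "</s>"

WORD_PAIRS = {
    START: {"set", "switch"},
    "set": {"the"},
    "switch": {"the"},
    "the": {"volume", "radio"},
    "volume": {"higher", "lower"},
    "radio": {"on", "off"},
    "higher": {END},
    "lower": {END},
    "on": {END},
    "off": {END},
}


def allowed_successors(word):
    """Return the words that may follow `word` (empty set if unknown)."""
    return WORD_PAIRS.get(word, set())


def is_accepted(words):
    """Check that every adjacent pair in the sentence is allowed."""
    path = [START, *words, END]
    return all(nxt in allowed_successors(prev)
               for prev, nxt in zip(path, path[1:]))
```

During a one-pass search, a recogniser would typically use something like `allowed_successors` to restrict which word models can be entered after the current word; here it is only used to validate a finished word string.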
Speech understanding (I)
• TAGGER:
    • 78 semantic tags
    • Several tags may be applied to each word
    • “garbage” tag used for words that carry no meaning
        • Gives robustness against speech recogniser errors
        • Will allow out-of-vocabulary (OOV) words in the recognised string
    • Tagging directly specified in the lexicon
Example: “Please, set the volume higher”
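A minimal sketch of lexicon-driven tagging follows. The lexicon entries and tag names are invented for illustration (the real system uses 78 tags that are not listed on the slide), and mapping unknown words to “garbage” is an assumption about how OOV words could be absorbed.

```python
GARBAGE = "garbage"

# Hypothetical lexicon: each word carries one or several semantic tags.
LEXICON = {
    "please": [GARBAGE],
    "the": [GARBAGE],
    "set": ["action"],
    "volume": ["parameter"],
    "higher": ["increment"],
    "right": ["position", "increment"],  # ambiguous word: two tags
}


def tag(words):
    """Attach the lexicon tags to each word.

    Words missing from the lexicon get the garbage tag (an assumption),
    so they cannot disturb the later stages.
    """
    return [(word, LEXICON.get(word.lower(), [GARBAGE])) for word in words]
```

For the slide's example, `tag("Please set the volume higher".split())` leaves only “set”, “volume” and “higher” with meaningful tags.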
Speech understanding (II)
• TAGS REFINER:
    • Aims:
        • Numbers processing
        • Disambiguation of words with several tags
        • “garbage” removal
    • May change the literal of the words: “two five” → “25”
    • May introduce new refined semantic tags
    • Context dependent rules
Example:
    word: “right”
    tags: “position”, “increment”
    rule: “if there exists any other word tagged as a tape parameter, then the word right is the position of this tape, else it is an increment indicator”
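The three aims of the refiner can be sketched as one pass over the tagged words. This is an illustrative sketch only: the tag names (`tape_parameter`, `number`, and so on) and the rule encoding are assumptions, and only the “right” rule from the slide is implemented.

```python
DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def refine(tagged):
    """Refine a list of (word, tags) pairs.

    - drops words whose only tag is "garbage"
    - merges consecutive digit words into one number literal
    - resolves the ambiguous word "right" from the sentence context
    """
    # Context test used by the "right" rule (tag name is hypothetical).
    has_tape_parameter = any("tape_parameter" in tags for _, tags in tagged)

    refined = []
    for word, tags in tagged:
        if tags == ["garbage"]:
            continue
        if word in DIGITS:
            digit = DIGITS[word]
            # Extend the previous number if the last kept item is one.
            if refined and refined[-1][1] == ["number"]:
                refined[-1] = (refined[-1][0] + digit, ["number"])
            else:
                refined.append((digit, ["number"]))
            continue
        if word == "right" and set(tags) == {"position", "increment"}:
            tags = ["position"] if has_tape_parameter else ["increment"]
        refined.append((word, tags))
    return refined
```

Because garbage is dropped before digits are merged, digit words separated only by garbage words are also joined; whether the original system behaves this way is not stated on the slide.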
Speech understanding (III)
• UNDERSTANDING STAGE:
    • Context dependent rules
        • Makes understanding independent of the order of the concepts
    • Tries to fill in frames:
        SUBSYSTEM=(radio,cd-player,cassette,...)
        PARAMETER=(volume,tone,broadcast station,song,...)
        VALUE=(higher,number,...)
    • One or several frames for each command
    • More specific rules are executed first
    • Message strings are also filled in:
        • With the “reasoning”
        • With the problems found in the understanding stage
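Frame filling with ordered rules might look like the sketch below. The rule set, rule names, and lowercase tag names are hypothetical; specificity is approximated here by the number of slots a rule needs, which is one possible reading of “more specific rules first”.

```python
# Hypothetical rules: (name, {frame slot: tag that must be present}).
RULES = [
    ("parameter change", {"PARAMETER": "parameter", "VALUE": "value"}),
    ("subsystem switch", {"SUBSYSTEM": "subsystem", "VALUE": "value"}),
    ("full command", {"SUBSYSTEM": "subsystem", "PARAMETER": "parameter",
                      "VALUE": "value"}),
]


def understand(refined):
    """Fill one frame from (word, tags) pairs and return (frame, message)."""
    # Index words by tag; the position of a word in the sentence is
    # ignored, which gives independence from concept order.
    by_tag = {}
    for word, tags in refined:
        for t in tags:
            by_tag.setdefault(t, word)

    # Try the rules that need the most slots first.
    for name, slots in sorted(RULES, key=lambda r: len(r[1]), reverse=True):
        if all(t in by_tag for t in slots.values()):
            frame = {slot: by_tag[t] for slot, t in slots.items()}
            return frame, f"rule '{name}' matched"

    found = ", ".join(sorted(by_tag)) or "nothing"
    return None, f"no rule matched; concepts found: {found}"
```

The returned message plays the role of the “reasoning” or “problems” strings on the slide: it can be handed to the generation module whether or not a frame was produced.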
Speech understanding (IV)
• ACTUATOR:
    • Sends IR commands to the HIFI set
    • Keeps track of the set status
    • Informs the user of the actions performed or the problems found
Example:
    USER: “switch the radio on”
    ACTUATOR: “The radio was already on”
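The status-tracking behaviour in the example can be sketched as follows. The class, the `send_ir` callback, the IR code names, and the assumption that every subsystem starts switched off are all hypothetical; the slide only states that the actuator sends IR commands, keeps the status, and reports back.

```python
class Actuator:
    """Tracks the assumed HIFI status and emits IR commands when needed."""

    def __init__(self, send_ir):
        # send_ir: callable that transmits one IR code (hypothetical API).
        self.send_ir = send_ir
        self.status = {}  # subsystem -> "on" / "off"

    def switch(self, subsystem, value):
        """Switch a subsystem on or off and return a message for the user."""
        # Subsystems never seen before are assumed to be off.
        if self.status.get(subsystem, "off") == value:
            return f"The {subsystem} was already {value}"
        self.send_ir(f"{subsystem}_{value}")
        self.status[subsystem] = value
        return f"The {subsystem} is now {value}"
```

Since an IR link is one-way, the stored status is only the system's belief about the set; it drifts if someone uses the original remote control.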
Speech generation
• Input: pattern string of both literals and concepts coming from the rest of the architecture
• Performs random substitution of concepts by text to achieve a certain degree of naturalness / variety
• Output through a text-to-speech subsystem
Example:
    Input: “C_SEEING the word higher with an increment meaning, C_THINK that put means an increasing action”
    C_SEEING: "As I can see", "As I have discovered", "As it appears", ...
    C_THINK: "I think", "I imagine", "I suppose", ...
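Random concept substitution can be sketched with a regular expression over the pattern string. The alternatives are the ones shown on the slide; the `C_` prefix convention for recognising concepts, the function name, and leaving unknown concepts untouched are assumptions.

```python
import random
import re

# Alternative expressions per concept, taken from the slide's example.
ALTERNATIVES = {
    "C_SEEING": ["As I can see", "As I have discovered", "As it appears"],
    "C_THINK": ["I think", "I imagine", "I suppose"],
}


def generate(pattern, rng=random):
    """Replace each concept token in `pattern` by a random alternative.

    Literal text is kept unchanged; concepts without alternatives are
    left as they are (an assumption, not stated in the source).
    """
    def substitute(match):
        options = ALTERNATIVES.get(match.group(0))
        return rng.choice(options) if options else match.group(0)

    return re.sub(r"\bC_[A-Z_]+\b", substitute, pattern)
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient for testing; the resulting string would then be sent to the text-to-speech subsystem.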
CONCLUSIONS & FUTURE WORK
• Supporting ideas of the system:
    • Semantic-like tagging
    • Context dependent rules
    • “garbage” tag
    • Pattern-based generation
    • Random concept substitution for generation
• Desirable new aspects:
    • Use of more information from the recognised sentences
    • Handling more complex commands
    • Introducing semantic-syntactic parsing of the sentence structure
    • Introducing dialogue to recover information that was not understood or not given, and as a confirmation strategy