INVOCA Project: Speech Interfaces for Air Traffic Control Tasks
Javier Macías-Guarasa
Speech Technology Group (GTH)
Department of Electronic Engineering
E.T.S.I. Telecomunicación (ETSIT)
Universidad Politécnica de Madrid (UPM)
Overview
• Introduction
• Tasks (applications, prototypes)
• Data collection
• System architecture & technical details
• Evaluation
• Demo
• Conclusions
Introduction (I)
• INVOCA: INterfaces VOcales para Control de tráfico Aéreo (Speech Interfaces for Air Traffic Control)
• Project proposed by:
  • AENA (Spanish Airports and Air Navigation)
  • Speech Technology Group, ETSIT-UPM
• Exploratory project for technology evaluation:
  • Analyze the state of the art of speech recognition technology and its applications to air traffic control tasks
  • Feasibility study: could it be integrated in SACTA? (SACTA = Advanced System for Air Traffic Control)
Introduction (II)
• SACTA [figure]
Introduction (III)
• People @ GTH: José M. Pardo, Javier Ferreiros, José Colás, Fernando Fernández, Valentín Sama, Ricardo de Córdoba, Juan M. Montero, Javier Macías, José D. Romeral
• More people @ GTH: Sergio Díaz, María J. Pozuelo, Gregoire Prime, Jordi Safont, Eduardo Campos et al.!
• AENA staff: Germán González, Myriam Santamaría
Tasks (I)
• Identifying suitable target applications (tasks) within the SACTA environment:
  • Air traffic controllers (ATCs) in control towers at Barajas (Madrid airport)
  • 'Feasible' tasks
  • 'Useful' tasks
• Outcome:
  • Isolated word recognition → IF1
  • Spontaneous speech recognition & understanding → IF2
Tasks (II): Speech Interface IF1 (I)
• Target:
  • Air Traffic Controllers (ATCs) in control towers
  • Must keep an eye on traffic around the airport
  • Feasibility of command & control (C&C) speech interfaces to help them handle complex control systems?
• Application:
  • Hard to identify in the current SACTA status
  • Instead: replace the FOCUCS system (tactile display) used to control the main display visualization
Tasks (IV): Speech Interface IF1 (III)
• Prototype architecture (block diagram; see System Architecture (I) below)
Tasks (V): Speech Interface IF2 (I)
• Target:
  • Air Traffic Controllers (ATCs) in control towers
  • ATCs provide aircraft pilots with instructions regarding flight level, transponder code, etc.
  • Some data must/should be entered in the computer system
• Application:
  • Detect key concepts (slots) and associated data values in controller ↔ pilot radio communication
Tasks (VI): Speech Interface IF2 (II)
• IF2 subtasks: five, one for every control position at Barajas Airport:
  • Arrivals
  • Authorizations
  • North tower
  • South tower
  • Take-offs
• IF1 & IF2 both handle Spanish and 'English' (as spoken by Spaniards!)
Data Collection (I): Standard databases
• Spanish SpeechDat (M & FDB)
  • Telephone speech… but ATC radio channels are band-limited as well!
  • ~4000 speakers, isolated & continuous read speech
  • Digits, isolated words, digit strings, phonetically rich sentences, etc. (40 items per speaker)
  • But… not related to the task
• Need more data!
  • For adaptation
  • For full retraining?
Data Collection (II): Speech Interface IF1
• Read isolated words in the task domain:
  • 16 kHz, 16-bit linear (downsampled to 8 kHz; see the sketch below)
  • 30 speakers (15 male & 15 female)
  • 5 repetitions of every command in the FOCUCS task vocabulary (228 Spanish / 176 English)
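The 16 kHz studio recordings were downsampled to 8 kHz so that they match the band-limited ATC radio channel and the telephone-bandwidth SpeechDat models. A minimal sketch of that resampling step, assuming scipy is available (the function name is ours, not the project's):

```python
# Minimal sketch of the 16 kHz -> 8 kHz downsampling step.
# resample_poly applies an anti-aliasing filter before decimation.
import numpy as np
from scipy.signal import resample_poly

def downsample_to_8k(samples_16k: np.ndarray) -> np.ndarray:
    """Downsample 16 kHz PCM audio to 8 kHz (factor 1/2)."""
    return resample_poly(samples_16k, up=1, down=2)
```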
Data Collection (III): Speech Interface IF2 (I)
• Real recordings: controller ↔ pilot
  • 16 kHz, 16-bit linear, downsampled to 8 kHz
  • Stereo recording: speech + PTT signal
  • 33 h total, ~6 s/sentence, ~16 words/sentence
Data Collection (IV): Speech Interface IF2 (II)
• Process:
  • Recording chunks of 15 minutes, continuously
  • Segmenting into sentences: easy thanks to the PTT signal (average 'real speech' content = 16.4%); see the sketch below
  • Transcribing:
    • Hard, especially in English
    • Labeling pauses, breathing and aspiration, tongue clicks, unidentified noises, clicks, coughs
  • Also concept labeling
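Because speech and the PTT signal sit on separate stereo channels, cutting the 15-minute chunks into sentences amounts to finding the spans where PTT is pressed. A sketch of that idea; the threshold and padding values are illustrative assumptions, not INVOCA's:

```python
import numpy as np

def ptt_segments(ptt: np.ndarray, fs: int, thresh: float = 0.5,
                 pad_s: float = 0.2):
    """Return (start, end) sample indices where the PTT channel is active.

    thresh and pad_s are illustrative values, not the project's settings;
    the PTT channel is assumed to carry a level well above thresh while
    the push-to-talk button is pressed.
    """
    active = (np.abs(ptt) > thresh).astype(int)
    # Rising edges (+1) mark segment starts, falling edges (-1) mark ends.
    edges = np.diff(active, prepend=0, append=0)
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    pad = int(pad_s * fs)  # keep a little context around each press
    return [(max(0, s - pad), min(len(ptt), e + pad))
            for s, e in zip(starts, ends)]
```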
Data Collection (V): Speech Interface IF2 (III)
• Samples of "Authorizations" sentences:
  • thai niner four three start up approved qnh one zero one eight !P clear eh !LP fiumicino via flight plan route !P eh nando !P one charlie departure squawk !P on one four two six
  • alitalia zero six nine roger start up approved and according slot one zero one eight and clear to !P milan malpensa airport via pinar one !P bravo departure squawk one four one six
  • !RUIDO ok we havent got it yet but the supervisor eh lets me give you start up clearance !ASP and we will give you the atc clearance we when we receive it so start up approved eh report your position again please
Data Collection (VI): Speech Interface IF2 (IV)
• Concept labeling sample:
  • olympic two four eight on stand eighty start up approved with qnh one zero one niner clear to destination athens via flight plan route nando two golf standard departure initial flight level one three zero on the squawk one four seven three

====== UNDERSTANDING RESULT ======
identifier=[olympic248]
startup_status=[START UP APPROVED]
destination=[athens]
exit_using=[nando2G]
transponder=[1473]
initial_flight_level=[130]
qnh=[1019]
parking=[stand80]
==================================
Data Collection (VII): Speech Interface IF2 (V)
• Samples of "Arrivals" sentences:
  • airfrance one five zero zero yes swissair six five zero vacating
  • klm seven zero one good morning continue approach runway three three as number two wind calm precedent traffic seven six seven four miles ahead
Data Collection (VIII): Speech Interface IF2 (VI)
• Samples of "Take-offs" sentences:
  • nostrum eight six one five wind two eight zero one zero cleared take off runway three six left
  • speedbird four six five you are number four behind iberia airbus three twenty on sierra
Data Collection (IX): Speech Interface IF2 (VII)
• Samples of "North tower" sentences:
  • airnostrum eight seven two five continue via alfa behind iberia seven five seven via kilo tango forty i call you back hold short mike taxi way
  • airnostrum eight ou triple seven roger taxi via kilo mike holding three six left and please give way traffic mike delta spanair coming out via mike ou ten is now crossing juliett gate
Data Collection (X): Speech Interface IF2 (VIII)
• Samples of "South tower" sentences:
  • alitalia zero six niner are you able to enter mike between the airfrance traffic and the aireuropa your right side atp
  • sabena now proceed via in a taxi way to the left and wait for the follow me car
System Architecture (I): Speech Interface IF1
• [Block diagram] Feature extraction (12 LPC-cepstra + log energy, plus Δ and ΔΔ: 13 + 13 + 13 coefficients) → One-Pass decoder run with Spanish HMMs + Spanish dictionary and with English HMMs + English dictionary → recognized command → command sent over UDP → change in the main display
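The final hop in the IF1 prototype ships the recognized command over UDP to the application driving the main display. A minimal sketch of that step; the address, port, command string, and plain-text encoding are assumptions, since the diagram only specifies "command to UDP":

```python
import socket

# Hypothetical display-application address; the diagram does not give one.
DISPLAY_ADDR = ("127.0.0.1", 9000)

def send_command(command: str) -> None:
    """Send a recognized command to the display application over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(command.encode("utf-8"), DISPLAY_ADDR)

send_command("ZOOM_IN")  # illustrative FOCUCS-style command name
```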
System Architecture (II): Speech Interface IF2
• [Block diagram] Feature extraction (12 LPC-cepstra + log energy, plus Δ and ΔΔ: 13 + 13 + 13 coefficients) → two parallel One-Pass + rescoring decoders (Spanish HMMs, N-gram, dictionary & CD rules; English HMMs, N-gram, dictionary & CD rules) → language ID → recognized sentence in one language → preprocessing → tagger (tagged dictionary) → tag refiner → task-dependent understanding module → conceptual frame & data
System Architecture (III)
• Preprocessing & modeling:
  • 12 LPC cepstra + log energy, plus Δ and ΔΔ coefficients (13 + 13 + 13 = 39 features)
  • CMN + CVN (utterance level); a sketch follows below
  • Context-dependent (CD) continuous HMMs trained with HTK
    • Spanish: 1509 states, 8 mixtures per state
    • English: 1400 states, 8 mixtures per state
  • Multiple pronunciations in the dictionary!
  • Training database: Spanish SpeechDat
  • Further adaptation: task & speaker
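A sketch of the utterance-level cepstral mean and variance normalization (CMN + CVN) applied to the 39-dimensional feature vectors, using plain numpy:

```python
import numpy as np

def cmn_cvn(features: np.ndarray) -> np.ndarray:
    """Utterance-level cepstral mean and variance normalization.

    features: (num_frames, 39) array of 12 LPC-cepstra + log energy
    plus their delta and delta-delta coefficients.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Guard against zero variance in a dimension.
    return (features - mean) / np.maximum(std, 1e-8)
```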
System Architecture (IV)
• Search, first pass:
  • One-Pass beam search (on all states and on last states)
  • Search space reduced to 18% without performance penalty
  • Bigram LM guided
  • Scores computed on demand
  • Non-speech model handling (regarding LM scoring)
  • Able to generate n-best output sentences
• Search, second pass:
  • Rescores the first-pass output (graph) with a trigram; see the sketch below
  • Task-dependent tuned LM and word insertion penalty (IWP) weights
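A simplified sketch of the second-pass idea: re-ranking first-pass hypotheses with a trigram LM plus tuned LM weight and insertion penalty. The real system rescores the word graph; this version operates on an n-best list for brevity, and the weight values are placeholders, not the task-tuned ones:

```python
def rescore_nbest(nbest, trigram_logprob, lm_weight=12.0, iwp=-1.0):
    """Re-rank first-pass hypotheses with a trigram LM.

    nbest: list of (words, acoustic_logscore) pairs from the first pass.
    trigram_logprob(w, h1, h2): callable returning log P(w | h1, h2).
    lm_weight and iwp are tuned per task in the real system; the values
    here are placeholders.
    """
    def total(words, ac_logscore):
        padded = ["<s>", "<s>"] + list(words)
        lm = sum(trigram_logprob(padded[i + 2], padded[i], padded[i + 1])
                 for i in range(len(words)))
        return ac_logscore + lm_weight * lm + iwp * len(words)

    return max(nbest, key=lambda h: total(h[0], h[1]))
```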
System Architecture (V)
• Language ID:
  • Spanish speakers:
    • Great variability in 'canonical' pronunciation
    • Some words pronounced in 'Spanish' (e.g. bravo)
    • ATCs mix languages (to greet or say goodbye)
  • Initial effort using well-known techniques (PPRLM, etc.)
  • Final system uses LM score comparison! (sketched below)
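The final language ID scheme can be sketched as a straight comparison of the best combined scores of the two parallel decodes; the recognizer interface below is an assumption:

```python
def identify_language(decode_es, decode_en, utterance):
    """Run both recognizers and keep the one with the higher score.

    decode_es / decode_en are assumed to return a
    (hypothesis_words, combined_logscore) pair for the utterance,
    where the score already mixes acoustic and LM contributions.
    """
    hyp_es, score_es = decode_es(utterance)
    hyp_en, score_en = decode_en(utterance)
    if score_es >= score_en:
        return "es", hyp_es
    return "en", hyp_en
```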
System Architecture (VI)
• Understanding module:
  • Tagger: several candidate categories per word
  • Number preprocessing (sketched below)
  • Tag refiner
  • Understanding module proper
• Understanding module architecture reused from other tasks in our Group
• Task-dependent & time-consuming
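The number-preprocessing step can be sketched as merging runs of spoken digits into single tokens, as in the Lufthansa example in the backup slides ("four three four seven" → "4347"). A self-contained sketch; the digit table includes the ATC pronunciation "niner":

```python
WORD_TO_DIGIT = {"zero": "0", "one": "1", "two": "2", "three": "3",
                 "four": "4", "five": "5", "six": "6", "seven": "7",
                 "eight": "8", "nine": "9", "niner": "9"}

def group_digits(words):
    """Merge runs of spoken digits into a single token.

    Sketch of the number-preprocessing stage of the understanding chain.
    """
    out, run = [], []
    for w in words:
        if w in WORD_TO_DIGIT:
            run.append(WORD_TO_DIGIT[w])
        else:
            if run:                      # a digit run just ended
                out.append("".join(run))
                run = []
            out.append(w)
    if run:                              # trailing digit run
        out.append("".join(run))
    return out

# group_digits("four three four seven clearance".split())
# -> ["4347", "clearance"]
```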
Evaluation (I)
• Multiple environments:
  • Off line, using the recorded database
  • Online with people at GTH, predefined script
  • With users (advanced ATC trainees):
    • Predefined script (online)
    • Predefined scenarios (free online)
    • Subjective evaluation
  • English & Spanish
• Measuring:
  • Word accuracy rates (IF1 & IF2)
  • Concept accuracy rates (IF2); both computed as sketched below
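Word accuracy here is the usual edit-distance measure, 1 − (S + D + I)/N; concept accuracy applies the same formula to slot sequences instead of word sequences. A minimal reference implementation:

```python
def word_accuracy(ref, hyp):
    """Accuracy = 1 - (S + D + I) / N via Levenshtein alignment.

    ref, hyp: lists of tokens (words or, for concept accuracy, slots).
    Assumes a non-empty reference.
    """
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deletions only
    for j in range(m + 1):
        d[0][j] = j          # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 1.0 - d[n][m] / n

# word_accuracy("cleared take off runway three six left".split(),
#               "cleared take off runway three six".split())
```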
Evaluation (II): Speech Interface IF1 (I)
• Off line, main results: [table]
Evaluation (III): Speech Interface IF1 (II)
• On line, predefined script (11 speakers):
  • Spanish: 50 commands (98 words/speaker)
  • English: 30 commands (60 words/speaker)
  • [results table]
Evaluation (IV): Speech Interface IF1 (III)
• On line, predefined script (11 speakers):
  • Detailed error analysis [table]
Evaluation (V): Speech Interface IF1 (IV)
• On line, real-task test (11 ATC speakers):
  • Questionnaire with different questions (subjective)
  • "The system understands what I say…", rated 1-5: average 4.0
Evaluation (VI): Speech Interface IF1 (V)
• On line, real-task test (11 ATC speakers):
  • Questionnaire with different questions (subjective)
  • "I would use this system instead of the current one…", rated 1-5: average 3.4
Evaluation (VII): Speech Interface IF2 (I)
• Training, adaptation & recognition issues:
  • Spanish Authorizations task (preliminary experiments):
    • Full retraining is used (with the Authorizations DB only)
    • Rescoring improves only 4% relative (vs. 20% in read speech) → not used in the final prototype
Evaluation (VIII): Speech Interface IF2 (II)
• Database & LM statistics:
  • Spanish: [table]
  • English: [table]
Evaluation (IX): Speech Interface IF2 (III)
• Off/on line*, word/concept recognition rates:
  • Spanish: [table]
  • English: [table]
• * GTH online: 16 speakers, 10 sentences/speaker in Spanish & 6 sentences/speaker in English (read speech!)
• * ATC online: 7 speakers, 10 sentences/speaker in Spanish & 6 sentences/speaker in English (read speech!)
Evaluation (X): Speech Interface IF2 (IV)
• Off line / free online, word/concept recognition rates:
  • Spanish: [table]
  • English: [table]
• * ATC free online: 7 speakers, scenario-based, 10 sentences/speaker in Spanish & 6 sentences/speaker in English; 5 additional OOVs
Evaluation (XI): Speech Interface IF2 (V)
• Real-world*, real-time system working in the tower:
  • Word/concept recognition rates: [table]
  • Language ID rates: [table]
• * Real world: 205 sentences, 3433 reference words, 588 slots; 10 additional OOVs
Evaluation (XII): Speech Interface IF2 (VI)
• Cross-task comparison (off line):
  • Spanish (average rate over all other tasks): [table]
  • English (average rate over all other tasks): [table]
Demo
• Start praying
  • Wrong microphone & channel
  • Wrong speaker!
• IF1:
  • Using the defined dictionary
  • Only Spanish, sorry
• IF2:
  • Random sentences, Spanish & English
  • Will (try to) point out mistakes
Conclusions (I)
• Great fun!
• Plenty of room for improvement:
  • Task-dependent restrictions (existing frequencies & flight IDs, airport layout data, etc.)
  • Concept refining (the current set is very broad)
  • Rules development
  • Speaker/gender adaptation
  • More data!
Conclusions (II)
• ASR technology not ready for prime time!
  • Difficult task
  • We are talking about planes and people!
  • 'Political' issues
• Other applications in this field:
  • Non-critical tasks
  • Pseudo-pilots for ATC training?
  • Phraseology trainers
  • Indexing
Evaluation: Speech Interface IF2
• Database & LM statistics:
  • Spanish: [table]
  • English: [table]
Evaluation: Speech Interface IF2
• Off line, recognition rates:
  • Spanish: [table]
  • English: [table]
Evaluation: Speech Interface IF2
• Off line, understanding rates:
  • Spanish: [table]
  • English: [table]
System Architecture: Language ID in IF2
• Preliminary experiments with PPRLM:
  • Needs almost 5 seconds of speech to reach 96% accuracy
  • Bad performance on the real task
• Implemented system uses LM score comparison!
System Architecture: Understanding example
• Lufthansa four three four seven clearance correct on stand eight one next call one two one decimal seven bye
• Step 1, tagger output (one or more candidate categories per word):
  • <lufthansa> -DATA_identifier-
  • <4> -single_digit-
  • <3> -single_digit-
  • <4> -single_digit-
  • <7> -single_digit-
  • <clearance> -ID_freq_change-
  • <correct> -DATA_correct-
  • <on> -garbage- -ID_freq_change-
  • <stand> -DATA_park-
  • <8> -single_digit-
  • <1> -single_digit-
  • <next> -garbage-
  • <call> -ID_standby- -ID_freq_change-
  • <1> -single_digit-
  • <2> -single_digit-
  • <1> -single_digit-
  • <decimal> -freq_decimal_point-
  • <7> -single_digit-
  • <bye> -goodbye-
• Step 2, after number preprocessing (digit runs merged):
  • <lufthansa> -DATA_identifier-
  • <4347> -single_digit-
  • <clearance> -ID_freq_change-
  • <correct> -DATA_correct-
  • <on> -garbage- -ID_freq_change-
  • <stand> -DATA_park-
  • <81> -single_digit-
  • <next> -garbage-
  • <call> -ID_standby- -ID_freq_change-
  • <121> -single_digit-
  • <decimal> -freq_decimal_point-
  • <7> -single_digit-
  • <bye> -goodbye-
System Architecture: Understanding example (cont.)
• Continuing from the digit-grouped output above:
• Step 3, after tag refinement (context selects one category and builds composite values):
  • <lufthansa4347> -SLOT_identifier-
  • <clearance> -ID_freq_change-
  • <correct> -DATA_correct-
  • <stand> -DATA_park-
  • <81> -single_digit-
  • <call> -ID_standby- -ID_freq_change-
  • <121.7> -SLOT_freq_change-
  • <bye> -goodbye-
• Step 4, remaining DATA/digit pairs merged into slots:
  • <lufthansa4347> -SLOT_identifier-
  • <clearance> -ID_freq_change-
  • <correct> -DATA_correct-
  • <stand81> -SLOT_park_id-
  • <call> -ID_standby- -ID_freq_change-
  • <121.7> -SLOT_freq_change-
  • <bye> -goodbye-
• Step 5, final conceptual frame:

====== UNDERSTANDING RESULTS ======
identifier=[lufthansa4347]
park_id=[stand81]
freq_change=[121.7]
===================================
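The last stage of the example, mapping refined SLOT tags to the conceptual frame, can be sketched as a simple lookup. The tag-to-slot table below covers only the slots appearing in this example and is an illustration, not the project's rule set:

```python
def build_frame(tagged):
    """Map refined SLOT tags to the final conceptual frame.

    tagged: list of (value, tag) pairs after tag refinement.
    The tag-to-slot mapping is a sketch based on the worked example.
    """
    slot_names = {"-SLOT_identifier-": "identifier",
                  "-SLOT_park_id-": "park_id",
                  "-SLOT_freq_change-": "freq_change"}
    return {slot_names[t]: v for v, t in tagged if t in slot_names}

print(build_frame([("lufthansa4347", "-SLOT_identifier-"),
                   ("stand81", "-SLOT_park_id-"),
                   ("121.7", "-SLOT_freq_change-")]))
# {'identifier': 'lufthansa4347', 'park_id': 'stand81',
#  'freq_change': '121.7'}
```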