A Simulative Study of Distributed Speech Recognition over Internet Protocol Networks
MS Thesis Defense, December 6, 2001
University of Illinois at Chicago / Politecnico di Torino
Daniele Quercia, d.quercia@studenti.to.it
Outline
- Distributed Speech Recognition: Our Focus
- Experimental framework
- Experimental results
- Conclusions: the impact of packet losses
Introduction
- The Internet has proliferated rapidly
- Strong interest in transporting voice over IP networks
- Novel Internet applications can benefit from Automatic Speech Recognition (ASR)
Introduction (cont'd)
- Desire for speech input on hand-held devices (mobile phones, PDAs, etc.)
- Speech recognition requires high computation, RAM, and disk resources
- If hand-held devices are connected to a network, speech recognition can take place remotely (e.g., the ETSI Aurora project)
Distributed Speech Recognition Architecture
The speech recognition task is distributed between two end systems:
- client side (lightweight)
- server side
[Diagram: Client Side -> IP Network -> Server Side]
Distributed Speech Recognition Architecture (cont'd)
[Diagram: Client device (Front-end, Packing and Framing) -> IP Network -> Remote site (Unpacking, Recognizer)]
Technical challenges
- IP networks are not designed for transmitting real-time traffic
- Lack of guarantees in terms of packet losses, network delay, and delay jitter
Technical challenges (cont'd)
Speech packets must be received:
- without significant losses
- with low delay
- with small delay variation (jitter)
The design of Distributed Speech Recognition systems must therefore consider the effect of packet losses
Our Focus
Performance evaluation of a Distributed Speech Recognition (DSR) system operating over simulated IP networks
Our Research Contributions
- Evaluation of a standard front-end that achieves state-of-the-art performance
- Simulative study of DSR under increasingly realistic scenarios: random losses, Gilbert-model losses, network simulations
Experimental framework
- Front-end
- Recognizer
- Speech database
- Network scenarios
Front-end
The front-end extracts from the speech signal the information significant for recognition: the spectral envelope
Front-end (cont'd)
- The ETSI Aurora standard front-end produces 14 coefficients per frame
- The front-end is based on mel coefficients (a rough sketch follows)
- Mel coefficients represent the short-time spectral envelope
- The frequency axis is warped closer to the perceptual axis
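For illustration only, the sketch below computes mel-cepstral features in NumPy/SciPy. It is not the ETSI Aurora reference implementation: the 8 kHz sampling rate, 25 ms frames with a 10 ms shift, 23 mel filters, and the split into log-energy plus 13 cepstral coefficients (14 features per frame) are assumptions typical of Aurora-style front-ends.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced uniformly on the mel (perceptual) scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    freqs = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * freqs / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fb[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    return fb

def front_end(signal, sr=8000, frame_len=200, hop=80, n_filters=23, n_ceps=13):
    """Return one 14-dimensional feature vector (log-energy + 13 mel cepstra) per frame."""
    fb = mel_filterbank(n_filters, frame_len, sr)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        log_energy = np.log(max(np.sum(frame ** 2), 1e-10))
        power = np.abs(np.fft.rfft(frame)) ** 2              # short-time power spectrum
        log_mel = np.log(np.maximum(fb @ power, 1e-10))      # mel-warped spectral envelope
        ceps = dct(log_mel, type=2, norm='ortho')[:n_ceps]   # decorrelate with a DCT
        feats.append(np.concatenate(([log_energy], ceps)))
    return np.array(feats)
```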
Recognizer
HMM-based speech recognizer. Each HMM model consists of:
- states q_i (16 states per word)
- transition probabilities a_ij
- initial state distribution pi_i
- emission distributions b_i(o) (3 Gaussian mixtures per state)
Recognizer (cont'd)
Training phase: from the training examples of each word (Word 1, Word 2, Word 3, ...), one model per word is estimated (M1, M2, M3, ...)
Recognition phase: for an unknown observation sequence O, the probabilities P(O|M1), P(O|M2), P(O|M3) are evaluated and the word whose model gives the maximum probability is chosen (sketched below)
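A minimal sketch of this train-then-score scheme, using hmmlearn as a stand-in for whatever HMM toolkit the thesis actually used; the model settings mirror the slide (16 states per word, 3 Gaussian mixtures per state), everything else is illustrative.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM   # assumption: hmmlearn stands in for the actual recognizer

def train_word_models(training_data):
    """training_data: {word: [feature matrix (frames x 14), ...]}.
    Training phase: estimate one HMM (16 states, 3 mixtures per state) per word."""
    models = {}
    for word, examples in training_data.items():
        X = np.vstack(examples)                   # stack all examples of this word
        lengths = [len(ex) for ex in examples]    # frame count of each individual example
        model = GMMHMM(n_components=16, n_mix=3, covariance_type='diag', n_iter=20)
        model.fit(X, lengths)
        models[word] = model
    return models

def recognize(models, observation):
    """Recognition phase: evaluate log P(O | M_w) for every word model, keep the maximum."""
    return max(models, key=lambda word: models[word].score(observation))
```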
Speech Database
- ETSI Aurora TIdigits Database 2.0
- For training, 8440 utterances were selected
- For testing, 4004 utterances were selected, without noise added
Network scenarios
Three network scenarios were considered (1 frame per packet):
- Random losses
- Gilbert-model losses
- Network simulations
Random losses
- Each packet has the same, independent loss probability (a sketch follows)
- Packet loss ratios: 10% to 40%
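As a simple illustration (function name and NumPy usage are my own), independent losses at a given packet loss ratio can be simulated like this:

```python
import numpy as np

def random_losses(n_packets, loss_ratio, seed=0):
    """Every packet is dropped independently with the same probability."""
    rng = np.random.default_rng(seed)
    return rng.random(n_packets) < loss_ratio   # True = packet (one speech frame) lost

mask = random_losses(10_000, loss_ratio=0.20)
print(f"simulated loss ratio: {mask.mean():.1%}")
```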
Random losses (cont'd)
- Random losses do not model the temporal dependencies of losses
- In general, packet losses appear in bursts: when the network is congested and a packet is lost at time t, the probability that the packet at time t + dt is also lost is high
Gilbert-model losses
2-state Markov model:
- p = P(next packet lost | previous packet arrived)
- q = P(next packet arrived | previous packet lost)
[Diagram: State 1 (no loss) and State 2 (loss), with transition probabilities p, 1-p, q, 1-q]
Gilbert-model losses (cont'd)
- A 2-state Markov model is less accurate than an nth-order Markov model, but offers a better accuracy vs. complexity trade-off
- Documented in the literature: "the Gilbert model is a suitable loss model"
- Simulated packet loss ratios: 10% to 40% (a sketch of the loss generator follows)
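A sketch of the 2-state Gilbert loss generator is shown below. In this model the average loss ratio is p / (p + q) and the mean loss-burst length is 1 / q, so the parameter values in the example (chosen to give a 40% loss ratio with 4-packet average bursts, matching the figures reported later) are just one possible setting.

```python
import numpy as np

def gilbert_losses(n_packets, p, q, seed=0):
    """2-state Markov (Gilbert) loss model:
    p = P(next packet lost | previous packet arrived)
    q = P(next packet arrived | previous packet lost)"""
    rng = np.random.default_rng(seed)
    lost = np.zeros(n_packets, dtype=bool)
    in_loss_state = False
    for i in range(n_packets):
        if in_loss_state:
            in_loss_state = rng.random() >= q    # stay in the loss state with probability 1 - q
        else:
            in_loss_state = rng.random() < p     # enter the loss state with probability p
        lost[i] = in_loss_state
    return lost

# Example: loss ratio p / (p + q) = 40%, mean burst length 1 / q = 4 packets
mask = gilbert_losses(100_000, p=1/6, q=0.25)
print(f"simulated loss ratio: {mask.mean():.1%}")
```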
Network simulations
- The previous models are mathematically simple
- Network simulations represent more realistic IP scenarios
Network simulations (cont'd)
- The VINT simulation environment was used
- Components: NS-2 (network simulator version 2) and NAM (network animator)
- The NS package allows extension by the user
Network simulations (cont'd)
The NS simulator receives a scenario description as input and produces trace files
Network simulations (cont'd)
Our analysis: a scenario in which users are speaking while interfering FTP traffic is going on
[Diagram: speech sources and FTP sources send to speech and FTP receivers over shared links; link parameters include 1 ms and 3 ms delays and a 64 kb/s rate]
Network simulations (cont'd)
A playout buffer is required at the receiver to deal with delay variations
[Diagram: sender timeline, network delay across the IP net, receiver timeline with a playout buffer of fixed size]
Network simulations (cont'd)
Characteristics of the scenario:
- Speech traffic uses the RTP protocol with header compression (8-byte-long packets)
- Round-trip time: 10 ms
- Playout buffer size: 100 ms (its effect is sketched below)
- Competing traffic: on/off TCP sources
- Simulated packet loss ratios: 5% to 20%
- Simulation length: 350 s
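The role of the playout buffer can be illustrated as follows: with a fixed buffer delay (100 ms in the scenario above), a packet that arrives after its scheduled playout instant is useless and is counted as lost. This is a simplified sketch, not the thesis' actual receiver code.

```python
def late_as_lost(send_times, arrival_times, buffer_delay=0.100):
    """Fixed playout buffer: packet i must be played at send_times[i] + buffer_delay.
    Packets arriving later than that are discarded, i.e. treated as lost."""
    return [arrival > sent + buffer_delay          # True = packet arrived too late
            for sent, arrival in zip(send_times, arrival_times)]

# Example: three packets sent 10 ms apart; the second is delayed by 150 ms in the network
print(late_as_lost([0.00, 0.01, 0.02], [0.03, 0.16, 0.05]))   # [False, True, False]
```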
Performance measures
- Word accuracy is a good measure of performance
- Word accuracy of the baseline system (no errors): 99%
Performance measures (cont'd)
Three kinds of errors: I = insertion, D = deletion, S = substitution
Reference:  I want to go to Venezia
Recognized: - want to go to the Verona
Word Accuracy = 100 * (1 - (S + D + I) / #spoken words) %
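The measure is straightforward to compute. Using the example above, and assuming the counts are 6 reference words with one deletion ("I"), one insertion ("the"), and one substitution ("Venezia" recognized as "Verona"):

```python
def word_accuracy(n_spoken, substitutions, deletions, insertions):
    """Word Accuracy = 100 * (1 - (S + D + I) / N) %, with N the number of spoken words."""
    return 100.0 * (1.0 - (substitutions + deletions + insertions) / n_spoken)

# The reference/recognized pair above: N = 6, S = 1, D = 1, I = 1
print(word_accuracy(6, substitutions=1, deletions=1, insertions=1))   # 50.0 (%)
```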
Packet loss concealment
- An error concealment technique is applied against packet losses
- When packet losses occur, the missing packets are replaced by interpolation (a sketch follows)
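The slide does not spell out the interpolation scheme, so the sketch below uses plain linear interpolation of each feature coefficient across lost frames (one frame per packet, as stated earlier). Treat it as an illustration rather than the thesis' exact concealment algorithm.

```python
import numpy as np

def conceal_by_interpolation(features, lost):
    """features: (frames x coefficients) matrix; lost: boolean mask of lost frames.
    Lost frames are rebuilt by linearly interpolating each coefficient between the
    nearest correctly received frames (losses at the edges are clamped to the
    nearest received value). Assumes at least one frame was received."""
    features = features.copy()
    lost = np.asarray(lost)
    idx = np.arange(len(features))
    for c in range(features.shape[1]):
        features[lost, c] = np.interp(idx[lost], idx[~lost], features[~lost, c])
    return features
```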
Results for random losses
- For packet loss ratios of 10% and 20%, predominantly single-packet losses occur
- Overall, 94% of loss bursts are shorter than 5 packets
Results for random losses (cont'd)
- As the packet loss ratio increases, performance deteriorates
- With concealment, word accuracy recovers from 83% to 99%
Results for Gilbert-model losses
- For packet loss ratios of 10% and 20%, predominantly single-packet losses occur
- For packet loss ratios of 30% and 40%, loss bursts are shorter than 6 packets
Results for Gilbert-model losses (cont'd)
- With a packet loss ratio of 40%, the average loss burst length is 4 packets
- With concealment, word accuracy recovers from 80% to 98%
Results for network simulations
Average loss burst length: 45 packets. Why? TCP packets are much larger than speech packets: when speech packets get delayed in the queues behind them, they may reach the receiver too late to be played out
Results for network simulations (cont'd)
- Loss burst lengths are very large
- With such long loss bursts, nothing can be done
Summary and Conclusions
We have analyzed the impact of packet losses on a DSR system over IP networks using the ETSI Aurora database. Packet losses were modeled by:
- random losses
- Gilbert-model losses
- network simulations
Summary and Conclusions (cont'd)
Recognition performance can be anticipated from the length of the loss bursts:
- losses with small burst lengths: good recognition results
- losses with large burst lengths: degraded recognition results
Summary and Conclusions (cont'd)
- Single packet losses and short bursts can be tolerated
- Bursty packet losses lead to large performance degradation
- The error concealment technique provides good results if the error bursts are short (4-5 packets)
Submission
Submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2002