Speech Processing for NSR vs. DSR
Veeru Ramaswamy, PhD
CTO, Vianix LLC
Email: veeru@vianix.com

Vianix Background
Fast-paced speech technology company with corporate headquarters in Virginia Beach, Virginia.
Vianix has developed, tested, proven and licensed MASC® (Managed Audio Sound Compression):
  State-of-the-art speech compression technology
  High-performance enabling voice technology
  For a broad spectrum of healthcare, multimedia communications and enterprise applications

NSR vs. DSR
[Figure: side-by-side diagram with panels labeled NSR and DSR]

Disadvantages of DSR
Bandwidth Requirements: Bit rates of ~7-11 kbps, no better than those of compressed voice (e.g., many VBR encoders can compress speech to between 5 and 17 kbps with good recognition accuracy).
Speech Reconstruction: The original voice cannot be played back, although more recent advances in DSR allow low-quality reconstruction of voice from features such as LPC or cepstral coefficients (MFCC, as in ETSI-based Aurora).
Playback using TTS: Most DSR applications can only synthesize voice using TTS for audio playback.
DWER: Overall DWER may be lower or greater than that of NSR-based recognition.
Feature-aware recognition: The recognition engine has to know a priori which type of feature extraction is being done in order to transcribe accurately.
Cost of additional client: Additional expenditure on the front end each time a client needs to be changed.

Advantages of NSR
Delay/Jitter for Transcription: Any delay in the network transmission of NSR is inconsequential because most transcription applications are non-real-time.
Single Client: NSR front-end clients do not need to be changed. The same front-end terminal, such as those used in VoIP and other applications, can be used.
Bandwidth Requirement: Transmitting speech data over any data network for NSR applications requires almost the same bandwidth as encoded speech in general (e.g., encoders today offer different VBR levels to meet bandwidth requirements without compromising too much on recognition accuracy).
Bit-stream domain recognition: Recognizing speech in the compressed bit-stream domain avoids complications: no additional feature extraction mechanism is required on the device, and there are no reconstruction losses on the server.
Channel coding: Standard schemes can be used with the compressed stream (to protect against channel errors).
VoIP robustness: Earlier, it was difficult to send compressed voice (rather than only voice features) through the data channel. Now that VoIP has become very robust, high-quality compressed voice content can also be sent via data channels.

PESQ / MOS
PESQ (Perceptual Evaluation of Speech Quality):
  Originally defined in ITU-T P.861 as PSQM, an objective measure.
  PSQM was revised as PESQ in ITU-T P.862.
  PESQ combines the excellent psycho-acoustic and cognitive model of PSQM+ with a time-alignment algorithm that handles varying delays.
  PESQ usually ranges from 1 to 4.5.
MOS (Mean Opinion Score):
  A linear mapping, proportional to PESQ.
  According to the ITU standard, MOS can range from 1.0 to 5.0.
  MOS is a subjective measure, whereas PESQ is an objective measure.

Other Metrics
Variable Bit-Rate: Codecs that support variable bit rates, including the MASC codec, were compared at various bit rates. Bit rates range from about 5 kbps to 20 kbps.
MIPS: The computational efficiency of the different codecs is compared using V-Tune. MIPS ranges from 20 to about 200 depending on the codec used.
WER: The percentage of words that have NOT been correctly identified by an ASR engine. The accuracy of the ASR engine is computed by identifying how many words were inaccurate.
DWER: The difference in WER between the original uncompressed PCM samples and the decompressed/decoded PCM samples. It can be absolute or relative. The absolute value is used here; a relative value can be obtained as the ratio of the absolute DWER to the WER of the original uncompressed samples.

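The WER and DWER definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the common word-level edit distance (substitutions, deletions, insertions) as the error count, which may differ from the scoring used in the actual evaluation, and it takes the sign convention WER_REF - WER_DEG from the ADWER procedure in this deck. The function names are hypothetical.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions and insertions
    separating the hypothesis from the reference (word-level edit distance).
    Assumption: case-insensitive, whitespace-tokenized comparison."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev holds the distances computed for the previous reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_percent(reference: str, hypothesis: str) -> float:
    """WER as a percentage of the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference text is empty")
    return 100.0 * word_errors(reference, hypothesis) / n_ref


def absolute_dwer(wer_ref: float, wer_deg: float) -> float:
    """Absolute DWER with the sign convention WER_REF - WER_DEG."""
    return wer_ref - wer_deg


def relative_dwer(wer_ref: float, wer_deg: float) -> float:
    """Ratio of the absolute DWER to the WER of the uncompressed original."""
    if wer_ref == 0:
        raise ValueError("relative DWER is undefined when WER_REF is 0")
    return absolute_dwer(wer_ref, wer_deg) / wer_ref
```

For example, `wer_percent("a b c d", "a x c")` counts one substitution and one deletion against four reference words. With this sign convention, a codec that degrades recognition yields a negative DWER.
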
Procedure for Comparison of different Codecs
Procedure for ADWER computation
Comparison of MASC with various other codecs:
  ADWER
  PESQ
  Bit-rate

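Signal Train for DWER Calculation
[Figure: block diagram of the three-stage signal train]
Stage 1: PCM REF -> Automatic Speech Recognition Engine -> Transcribed Text from PCM Ref -> Comparison of Text Files (against Original Text) for Word Error Rate -> %WER REF
Stage 2: PCM REF -> Encoder -> Decoder -> PCM Deg -> Automatic Speech Recognition Engine -> Transcribed Text from PCM Deg -> Comparison of Text Files for Word Error Rate -> %WER Deg
Stage 3: Delta WER = %WER REF - %WER Deg
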
Procedure for computing ADWER
Stage 1: Obtain the transcribed text of the PCM reference file by passing it through the PSM. Obtain the % WER of the transcribed text against the original text (WER_REF). All inputs were converted from 16 kHz to 8 kHz using Adobe Audition 2.0.
Stage 2: Repeat Stage 1 with the PCM reference file encoded and decoded by the different encoders and decoders, i.e., repeat Stage 1 using the "Degraded/Decompressed PCM" as input to the ASR (WER_DEG). Adobe Audition 2.0 or Sound Recorder was used to convert from PCM to compressed/encoded data and back to decoded/decompressed PCM.
Stage 3: ADWER = WER_REF - WER_DEG

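The three stages can be sketched as a small Python function. This is a minimal sketch, not the tooling used in the evaluation: the recognizer, the codec round trip and the WER scorer are passed in as callables, and every name here (`adwer`, `transcribe`, `codec_round_trip`, `wer_percent`, and the toy stand-ins `fake_asr`, `fake_codec`, `toy_wer`) is hypothetical.

```python
from typing import Callable, Dict, Sequence

Pcm = Sequence[int]  # hypothetical stand-in for a buffer of PCM samples


def adwer(
    pcm_ref: Pcm,
    original_text: str,
    transcribe: Callable[[Pcm], str],
    codec_round_trip: Callable[[Pcm], Pcm],
    wer_percent: Callable[[str, str], float],
) -> Dict[str, float]:
    # Stage 1: transcribe the reference PCM and score it against the original text.
    wer_ref = wer_percent(original_text, transcribe(pcm_ref))
    # Stage 2: encode and decode the reference PCM, then repeat Stage 1 on the result.
    pcm_deg = codec_round_trip(pcm_ref)
    wer_deg = wer_percent(original_text, transcribe(pcm_deg))
    # Stage 3: ADWER = WER_REF - WER_DEG
    return {"wer_ref": wer_ref, "wer_deg": wer_deg, "adwer": wer_ref - wer_deg}


# Toy stand-ins so the sketch runs end to end; none of these model a real system.
def fake_asr(pcm: Pcm) -> str:
    """'Recognizes' one word per sample value."""
    vocab = {0: "zero", 1: "one", 2: "two", 3: "three"}
    return " ".join(vocab.get(s, "unknown") for s in pcm)


def fake_codec(pcm: Pcm) -> Pcm:
    """Lossy round trip that clears the lowest bit of every sample."""
    return [s & ~1 for s in pcm]


def toy_wer(reference: str, hypothesis: str) -> float:
    """Positional mismatch rate in percent; only valid for equal-length texts."""
    ref, hyp = reference.split(), hypothesis.split()
    return 100.0 * sum(a != b for a, b in zip(ref, hyp)) / len(ref)


result = adwer([0, 1, 2, 3], "zero one two three", fake_asr, fake_codec, toy_wer)
print(result)
```

Passing the recognizer and codec in as callables keeps the procedure identical across codecs: only `codec_round_trip` changes between runs, so differences in the result can be attributed to the codec.
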
Inputs and Outputs
Input: Speech Test Vectors
  A set of test vectors in .wav format is required to adapt and evaluate the ASR.
  456 test vectors from eight users (4 male and 4 female). Each user has eleven adaptation files and forty-six evaluation files (8 x (11 + 46) = 456).
Output: Transcribed Text
  WER is computed from the original text and the transcribed text from the reference PCM.
  ADWER is computed as the difference between the WER of the text from the reference PCM and the WER of the text from the degraded/decompressed PCM.

Comparison of 8 kHz Codecs on ASR1
MASC is the only codec that exists today at 8 kHz with an ADWER in the 0.5 range.

Summary
Although there is a perception that DSR offers low bandwidth and high accuracy, in reality NSR offers considerably more advantages than DSR, given the importance of voice reconstruction at the back end and of accuracy with respect to ASR engines.