Speaker Verification System: Part B Final Presentation
Performed by: Barak Benita & Daniel Adler
Instructor: Erez Sabbag
The Project Goal
Implementation of a speaker verification algorithm on a DSP.
The verification module will perform real-time authentication of the user based on sampled voice data. The idea is to integrate the speaker verification module with other security and management modules, allowing them to grant access to resources based on verification of the speaker’s voice.
Introduction
Speaker verification is the process of automatically authenticating a speaker on the basis of individual information contained in speech waves.
[Diagram: Speaker’s Identity (Reference) and Speaker’s Voice Segment → Speaker Verification System → Result [0:1]]
System Overview
[Diagram labels: BT Base Station (two instances), LAN, Speaker Verification Unit, Server; speech bubble “My name is Bob!”; response “Access Denied”]
System Description
The system consists of TI’s C6701 floating-point DSP running the speaker verification algorithm. A user with a handheld device (e.g. Bluetooth on a PDA) receives access to different resources (door opening, file access, etc.) based on a voice verification process.
The project implements only the speaker verification algorithm on the DSP, with input and output interfaces for interacting with other devices (e.g. Bluetooth). The DSP is loaded with the user’s voice signature. Each time user verification is needed, the algorithm compares the speaker’s voice with the signature.
System Block Diagram
[Diagram labels: “My name is Bob”; Voice Channel; Bluetooth unit with Codec; Bluetooth Radio Interface; Bluetooth Base station with Codec; LAN; Authorization Server; DSP with Signature parameters; Enrollment Server (training phase: building a signature); Verification Channel; Voice Channel (optional)]
Project Description
• Part One:
  • Literature review
  • Algorithm selection
  • MATLAB implementation
  • Result analysis
• Part Two:
  • Implementation of the chosen algorithm on a DSP
Speaker Verification Process
[Diagram: Analog Speech → Pre-Processing → Feature Extraction → Pattern Matching (with Reference Model) → Decision → Result [0:1]]
Implemented Algorithms: Feature Extraction Module – MFCC
MFCC (Mel Frequency Cepstral Coefficients) is the most common technique for feature extraction. MFCC tries to mimic the way our ears work by analyzing the speech waves linearly at low frequencies and logarithmically at high frequencies. The processing chain is:
[Diagram: Windowed PDS Frame → FFT → Spectrum → Mel-frequency Wrapping → Mel Spectrum → Cepstrum → Mel Cepstrum]
Implemented Algorithms: Pattern Matching Modeling Module – Vector Quantization (VQ)
In the enrolment part we build a codebook of the speaker using the LBG (Linde, Buzo, Gray) algorithm, which creates a codebook of size N from a set of L feature vectors. In the verification stage, we measure the distortion of the given sequence of feature vectors relative to the reference codebook.
[Diagram: Feature Vector → Pattern Matching (= Distortion measure) against the Reference Model (= Codebook) → Distortion Rate]
Implemented Algorithms: Decision
In VQ the decision is based on comparing the distortion rate with a preset threshold t: acceptance if distortion rate < t, else rejection (a lower distortion means a closer match to the reference codebook).
In this project no decision model is built. The output of the system is the following score (a value between 0 and 1), which indicates how well the person matches the reference model:
Score = exp(-mean distance)
Implementation Environment
• Hardware tools:
  • TI DSP 6701 EVM board
  • PC host station
• Software development tools:
  • TI Code Composer
  • Matlab 6.1
• Programming languages:
  • C
  • Assembler
  • Matlab
TI DSP 6701 EVM
• Why?
  • Floating point
  • Designed especially for voice applications
  • Large bank of on-chip memory
  • High-level development (C)
  • PCI interface
• Why not?
  • Price
  • Size
  • Power consumption
Program Workflow
[Diagram: DSP Program: Analog Speech (input) → Pre-Processing → Feature Extraction → Pattern Matching → Decision → Result [0:1] (output); MATLAB Program: Reference Model, supplied to Pattern Matching]
Step By Step Implementation
• Pre-processing a ‘ones’ vector on the DSP and comparing it to the Matlab results
• Pre-processing an audio file and comparing it to the Matlab results
• Extracting features from the audio file (after pre-processing) and comparing them to the Matlab results
• Pattern matching the feature vectors to a ‘ones’ codebook matrix and comparing to the Matlab results (running with the same codebook)
• Creating a real codebook from a reference speaker, importing it to the DSP, and comparing the results of the DSP and Matlab runs
• Verifying that the distances of the speakers from the codebook are the same in the DSP program and in the Matlab program
Creating the Assembler Lookup Files
• Creating the output data through Matlab functions (e.g. hamming(n)):
    h = hamming(n);
• Saving the output in an assembler lookup table format
• Referencing the lookup table with a name that will be called from the C source code in the DSP project (as a function):
    % size and n are set beforehand (8192 and 256 in the generated file shown below)
    hamming = fopen('hamming.asm', 'wt', 'l');
    fprintf(hamming, '; hamming.asm - single precision floating point table generated from MATLAB\n');
    fprintf(hamming, '\t.def\t_hamming\n');
    fprintf(hamming, '\t.sym\t_hamming, _hamming, 54, 2, %d,, %d\n', size, n);
    fprintf(hamming, '\t.data\n');
    fprintf(hamming, '_hamming:\n');
    fprintf(hamming, '\t.word\t%tXh, %tXh, %tXh, %tXh\n', h);
    fprintf(hamming, '\n');
    fclose(hamming);
• Importing the file as an asm file (adding a file to the project) to the DSP project:
    ; hamming.asm - single precision floating point table generated from MATLAB
        .def    _hamming
        .sym    _hamming, _hamming, 54, 2, 8192,, 256
        .data
    _hamming:
        .word   3DA3D70Ah, 3DA4203Fh, 3DA4FBD3h, 3DA669A4h
        .word   3DA86978h, 3DAAFB01h, 3DAE1DD8h, 3DB1D180h
        .word   3DB61567h, 3DBAE8E1h, 3DC04B30h, 3DC63B7Dh
        .word   3DCCB8DCh, 3DD3C24Bh, 3DDB56B1h, 3DE374E1h
        .word   3DEC1B99h, 3DF5497Fh, 3DFEFD27h, 3E049A87h
        .word   3E09F7D0h, 3E0F9597h, 3E1572FFh, 3E1B8F1Ch
• Using the lookup table in the C source code:
    // ----- Windowing the filtered frame with Hamming ----
    // frame[] is assumed to be zero-initialized before this loop.
    for (k = 0; k < N; k++) {
        for (j = 0; j < N; j++) {
            if (k - j < 0) break;   // only terms with j <= k contribute
            frame[k] += hamming[j] * filtered_frame[k - j];
        }
    }
Binding All The Pieces
[Diagram: files generated through Matlab feed the DSP program (C code).
Labels: generation of a voice data file from a *.wav format file through the Matlab wavread function; generation of assembly functions through Matlab.
Files: Sari5fix.asm, Hamming.asm, Melbank.asm, Rdct.asm, Codebook.asm.
DSP Program (C code): Analog Speech (input) → Pre-Processing → Feature Extraction → Pattern Matching → Decision → Result [0:1] (output)]
Software Modules
[Diagram: module call graph rooted at main, with the complexity of each module]
• main
• init: O(1)
• extract_frame: O(n^2)
• digitrev_index: O(n)
• bitrev: O(n)
• calc_dist: O(1)
• hamming: O(1)
• melbank: O(n)
• cfftr2_dit: O(n log(n))
Project Structure
speakerverification.pjt
• Include Files: board.h, codec.h, dma.h, intr.h, mcbsp.h, pci.h, regs.h, verification.h
• Libraries: rts6700.lib
• Source: bitrevf.asm, cfftr2.asm, codebook.asm, digitrev_index.c, hamming.asm, melbank.asm, rdct.asm, verification.c
• link.cmd
Tested System
• The tested algorithms and methods were MFCC and VQ, with the following parameters:
  • Sampling frequency: 11025 Hz
  • Feature vector size: 18
  • Window size: 256
  • Offset size: 128
  • Codebook size: 128
  • Number of iterations for codebook creation: 25
• We compared the Matlab and DSP results using a codebook created from 60 seconds of Daniel’s random speech and a random selection of five-second recordings of different speakers.
Verifications
• The DSP results were compared to the Matlab simulation.
• We chose random speakers from the speakers DB with one reference codebook.
• For example (each entry is the score as a percentage, with the mean distance in parentheses):
    Person    MATLAB             DSP
    Daniel    66.95% (0.4011)    66.95% (0.4011)
    Barak     44.01% (0.8206)    44.01% (0.8206)
    Ayelet    43.61% (0.8299)    43.61% (0.8299)
    Diego     53.97% (0.6166)    53.97% (0.6166)
    Adi       42.07% (0.8656)    42.07% (0.8656)
Conclusions
• The TI DSP 6701 EVM is capable of performing speaker verification analysis and achieving high-resolution results (matching those achieved in Matlab)
• Speaker verification algorithms are not mature enough to become a good biometric detection solution
• Code Composer is not stable or good enough to be an “easy to use” development environment
• A second-phase project, which will implement a complete verification system, should be built
Time Table – First Semester
• 14.11.01 – Project description presentation
• 15.12.01 – Completion of phase A: literature review and algorithm selection
• 25.12.01 – Handing out the mid-term report
• 25.12.01 – Beginning of phase B: algorithm implementation in MATLAB
• 10.04.02 – Publishing the MATLAB results and selecting the algorithm that will be implemented on the DSP
Time Table – Second Semester
• 10.04.02 – Presenting the progress and planning of the project to the supervisor
• 17.04.02 – Finishing MATLAB testing
• 17.04.02 – Beginning of the implementation on the DSP
• 07.11.02 – Project presentation and handing in the final project report
Pre-Processing (step 1)
[Diagram: Analog Speech → Pre-Processing → Windowed PDS Frames [1, 2, … , N]]
Pre-Processing module
Analog Speech
• LPF: anti-aliasing filter to avoid aliasing during sampling. LPF [0, Fs/2]
  → Band-Limited Analog Speech
• A/D: analog-to-digital converter with a sampling frequency (Fs) in the range [10, 16] kHz
  → Digital Speech
• First-Order FIR: low-order digital system to spectrally flatten the signal (in favor of vocal tract parameters) and make it less susceptible to later finite-precision effects
  → Pre-emphasized Digital Speech (PDS)
• Frame Blocking: frame blocking of the sampled signal. Each frame is of N samples, overlapped with N-M samples of the previous frame. Frame rate ~ 100 frames/sec. N values: [200, 300], M values: [100, 200]
  → PDS Frames
• Frame Windowing: using Hamming (or Hanning or Blackman) windowing in order to minimize the signal discontinuities at the beginning and end of each frame
  → Windowed PDS Frames
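The pre-emphasis and frame-blocking stages can be sketched in C as below. This is an illustrative sketch, not the project's DSP code: the function names are hypothetical, and the pre-emphasis coefficient a (values near 0.95 are common) is an assumption, since the slides do not state the value used.

```c
#include <stddef.h>

/* First-order FIR pre-emphasis: y[n] = x[n] - a * x[n-1], with x[-1] = 0.
 * in and out may be the same buffer. */
void pre_emphasize(const float *in, float *out, size_t len, float a)
{
    float prev = 0.0f;
    for (size_t n = 0; n < len; n++) {
        float cur = in[n];       /* read before writing, so in == out is safe */
        out[n] = cur - a * prev;
        prev = cur;
    }
}

/* Number of complete frames of frame_len samples (N) when each frame
 * starts hop samples (M) after the previous one, so that consecutive
 * frames overlap by frame_len - hop samples. */
size_t frame_count(size_t len, size_t frame_len, size_t hop)
{
    if (frame_len == 0 || hop == 0 || len < frame_len)
        return 0;
    return 1 + (len - frame_len) / hop;
}

/* Copy frame number idx into dst (frame_len samples). The caller must
 * ensure idx < frame_count(len, frame_len, hop). */
void copy_frame(const float *signal, size_t idx,
                size_t frame_len, size_t hop, float *dst)
{
    for (size_t i = 0; i < frame_len; i++)
        dst[i] = signal[idx * hop + i];
}
```

With the parameters listed on the Tested System slide (Fs = 11025 Hz, N = 256, M = 128), one second of speech yields 1 + (11025 - 256) / 128 = 85 complete frames, and a new frame starts every 128 / 11025 ≈ 11.6 ms.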
Feature Extraction (step 2)
Extracting the features of speech from each frame and representing them in a vector (feature vector).
[Diagram: Windowed PDS Frames [1, 2, … , N] → Feature Extraction → Set of Feature Vectors [1, 2, … , K]]
Pattern Matching Modeling (step 3)
• The pattern matching modeling technique is divided into two parts:
  • The enrolment part, in which we build the reference model of the speaker.
  • The verification (matching) part, in which users are compared to this model.
Enrollment part – Modeling
[Diagram: Set of Feature Vectors [1, 2, … , K] → Modeling → Speaker Model]
This part is done outside the DSP; the DSP receives only the speaker model (calculated offline on a host).
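The deck names LBG (Linde, Buzo, Gray) as the codebook-building algorithm used for enrolment. A minimal sketch of an LBG-style trainer follows, under several assumptions that the slides do not confirm: codewords are split multiplicatively by a factor eps, a fixed number of refinement passes runs after every split, the distance is squared Euclidean, and the codebook size is a power of two. All names are hypothetical.

```c
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

/* Squared Euclidean distance between two dim-length vectors. */
static double sq_dist(const double *a, const double *b, size_t dim)
{
    double s = 0.0;
    for (size_t i = 0; i < dim; i++) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

/* One refinement pass: assign every vector to its nearest codeword,
 * then move each codeword to the mean of the vectors assigned to it.
 * sums (cb_size x dim) and counts (cb_size) are scratch buffers. */
static void lloyd_pass(const double *vecs, size_t n, size_t dim,
                       double *cb, size_t cb_size,
                       double *sums, size_t *counts)
{
    for (size_t c = 0; c < cb_size; c++) {
        counts[c] = 0;
        for (size_t i = 0; i < dim; i++)
            sums[c * dim + i] = 0.0;
    }
    for (size_t v = 0; v < n; v++) {
        size_t best = 0;
        double best_d = DBL_MAX;
        for (size_t c = 0; c < cb_size; c++) {
            double d = sq_dist(vecs + v * dim, cb + c * dim, dim);
            if (d < best_d) { best_d = d; best = c; }
        }
        counts[best]++;
        for (size_t i = 0; i < dim; i++)
            sums[best * dim + i] += vecs[v * dim + i];
    }
    for (size_t c = 0; c < cb_size; c++) {
        if (counts[c] == 0)
            continue;  /* an empty cell keeps its previous codeword */
        for (size_t i = 0; i < dim; i++)
            cb[c * dim + i] = sums[c * dim + i] / (double)counts[c];
    }
}

/* Train a codebook of `target` codewords (a power of two) into cb,
 * which must hold target * dim doubles. Returns 0 on success, -1 on
 * bad arguments or allocation failure. */
int lbg_train(const double *vecs, size_t n, size_t dim,
              double *cb, size_t target, int iters, double eps)
{
    if (n == 0 || dim == 0 || target == 0 || (target & (target - 1)) != 0)
        return -1;
    double *sums = malloc(target * dim * sizeof *sums);
    size_t *counts = malloc(target * sizeof *counts);
    if (!sums || !counts) { free(sums); free(counts); return -1; }

    /* Start from a single codeword: the centroid of all vectors. */
    for (size_t i = 0; i < dim; i++)
        cb[i] = 0.0;
    for (size_t v = 0; v < n; v++)
        for (size_t i = 0; i < dim; i++)
            cb[i] += vecs[v * dim + i] / (double)n;

    for (size_t size = 1; size < target; ) {
        /* Split every codeword y into y*(1+eps) and y*(1-eps). */
        for (size_t c = 0; c < size; c++)
            for (size_t i = 0; i < dim; i++) {
                double y = cb[c * dim + i];
                cb[c * dim + i] = y * (1.0 + eps);
                cb[(size + c) * dim + i] = y * (1.0 - eps);
            }
        size *= 2;
        for (int it = 0; it < iters; it++)
            lloyd_pass(vecs, n, dim, cb, size, sums, counts);
    }
    free(sums);
    free(counts);
    return 0;
}
```

A call such as `lbg_train(features, L, 18, codebook, 128, 25, 0.01)` would correspond to the feature vector size, codebook size, and iteration count listed on the Tested System slide; the value 0.01 for eps is only an illustrative choice. Because the split is multiplicative, a codeword whose components are all zero cannot be separated, which a production implementation would need to handle.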
Pattern Matching
[Diagram: Speaker Model and Set of Feature Vectors [1, 2, … , K] → Pattern Matching → Matching Rate]
Decision Module (Optional)
In VQ the decision is based on comparing the distortion rate with a preset threshold t: if distortion rate < t, Output = Yes, else Output = No (a lower distortion means a closer match).
In HMM the decision is based on checking whether the probability score is higher than a preset threshold: if probability score > t, Output = Yes, else Output = No.
The Voice Database
• Two reference models were generated (one male and one female); each model was trained in 3 different ways:
  • repeating the same sentence for 15 seconds
  • repeating the same sentence for 40 seconds
  • reading random text for one minute
• The voice database consists of 10 different speakers (5 males and 5 females); each speaker was recorded in 3 ways:
  • repeating the reference sentence once (5 seconds)
  • repeating the reference sentence 3 times (15 seconds)
  • speaking a random sentence for 5 seconds
Experiment Description Cont.
Conclusions: A window size of 330 samples with an offset of 110 samples performs better than a window size of 256 samples with an offset of 128 samples.
Experiment Description Cont.
Conclusions: A feature vector of 18 coefficients performs better than a feature vector of 12 coefficients.
Experiment Description Cont.
Conclusions:
• Worst combinations:
  • 5 seconds of a fixed sentence for testing, with an enrolment of 15 seconds of the same sentence.
  • 5 seconds of a fixed sentence for testing, with an enrolment of 40 seconds of the same sentence.
• Best combinations:
  • 15 seconds of a fixed sentence for testing, with an enrolment of 40 seconds of the same sentence.
  • 15 seconds of a fixed sentence for testing, with an enrolment of 60 seconds of random sentences.
  • 5 seconds of a random sentence for testing, with an enrolment of 60 seconds of random sentences.
Experiment Description Cont. The Best Results:
Additional verification results
• The DSP results were compared to the Matlab simulation.
• We chose random speakers from the speakers DB with one reference codebook.
• For example (each entry is the score as a percentage, with the mean distance in parentheses):
    Person    MATLAB             DSP
    Alex      69.58% (0.3627)    69.58% (0.3627)
    Sari      61.66% (0.4835)    61.66% (0.4835)
    Roee      49.97% (0.6938)    49.97% (0.6938)
    Eran      54.75% (0.6023)    54.75% (0.6023)
    Hila      55.72% (0.5849)    55.72% (0.5849)