Look who’s talking? Project 3.1. Yannick Thimister, Han van Venrooij, Bob Verlinden. 27-01-2011, DKE Maastricht University
Contents
• Speaker recognition
• Speech samples
• Voice activity detection
• Feature extraction
• Speaker recognition
• Multi-speaker recognition
• Experiments and results
• Discussion
• Conclusion
Speaker recognition
• Speech contains several layers of information:
  • Spoken words
  • Speaker identity
• Speaker-related differences are a combination of anatomical differences and learned speaking habits
Speech samples
• Self-recorded database
  • 55 sentences from 11 different people
  • Per speaker: 2x2 predefined sentences and 1 random sentence
  • Professional recording and built-in laptop microphone
• Database via Voxforge.org
  • 610 sentences from 61 different people
  • Varying recording microphones and environments
Voice activity detection
• Three methods, all with adaptive noise estimation:
  • Power-based
  • Entropy-based
  • Long-term spectral divergence
• The signal is split into frames; the initial frames are assumed to contain only noise
• A hangover scheme keeps the decision at "speech" for a few frames after speech ends
Voice activity detection: power-based
• Assumes that the noise is normally distributed
• Calculate the mean μ and standard deviation σ of the initial noise frames
• For each sample n: calculate whether |x(n) − μ| > c·σ
• For each frame j: classify j as speech if the majority of its samples exceed the threshold
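A minimal numpy sketch of this power-based rule; the frame length, the number of initial noise frames, and the threshold factor c are illustrative assumptions, and the function name is mine:

```python
import numpy as np

def power_vad(signal, frame_len=256, noise_frames=10, c=3.0):
    """Per-frame speech/noise decision under a Gaussian noise model.

    The first `noise_frames` frames are assumed to contain only noise;
    their mean and standard deviation set the decision threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    noise = frames[:noise_frames].ravel()
    mu, sigma = noise.mean(), noise.std()
    # A sample is speech-like if it lies more than c standard deviations
    # from the noise mean; a frame is speech if the majority of its
    # samples are speech-like.
    speech_like = np.abs(frames - mu) > c * sigma
    return speech_like.mean(axis=1) > 0.5

# Toy check: quiet noise followed by a louder burst
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.01, 2560), rng.normal(0, 0.2, 2560)])
print(power_vad(x))  # first half False, second half True
```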
Voice activity detection: entropy-based
• Scale the DFT coefficients to a probability distribution: p_k = |X(k)|² / Σ_j |X(j)|²
• The entropy equals H = −Σ_k p_k · log p_k
• Speech spectra are structured (low entropy); noise spectra are flat (high entropy)
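A short sketch of the per-frame entropy computation; the base-2 logarithm and the small epsilon guarding log(0) are implementation choices not fixed by the slide:

```python
import numpy as np

def spectral_entropy(frame):
    """Entropy of one frame's DFT magnitude spectrum, scaled so that
    the coefficients form a probability distribution."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    p = spectrum / (spectrum.sum() + 1e-12)      # scale DFT coefficients
    return -np.sum(p * np.log2(p + 1e-12))       # entropy of the spectrum

rng = np.random.default_rng(0)
noise = rng.normal(0, 1, 256)                         # flat spectrum: high entropy
tone = np.sin(2 * np.pi * 10 * np.arange(256) / 256)  # peaked spectrum: low entropy
print(spectral_entropy(noise), spectral_entropy(tone))
```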
Voice activity detection: long-term spectral divergence (LTSD)
• Uses an L-frame window around each frame
• Estimation: the long-term spectral envelope is the maximum DFT magnitude over the window, LTSE_j(k) = max_i |X_i(k)|
• Divergence: LTSD_j = 10·log10( (1/K) Σ_k LTSE_j(k)² / N(k)² ), with K frequency bins

Voice activity detection: long-term spectral divergence (continued)
• Estimate the noise spectrum N(k) as the average of the DFT coefficients of the initial noise frames
• Calculate the mean μ of the LTSD over the noise frames
• For each frame f: classify f as speech if LTSD(f) > c·μ
• Update the noise estimate during non-speech frames
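A sketch covering both LTSD slides, assuming an rfft spectrum and an illustrative threshold factor c; `ltsd` and `ltsd_vad` are hypothetical names, and the online noise-spectrum update is omitted for brevity:

```python
import numpy as np

def ltsd(frames, noise_spectrum, L=3):
    """Long-term spectral divergence of each frame against a noise
    spectrum estimated as the average DFT magnitude of noise frames.

    frames: 2-D array (num_frames, frame_len) of time-domain samples."""
    X = np.abs(np.fft.rfft(frames, axis=1))
    scores = np.empty(len(X))
    for j in range(len(X)):
        lo, hi = max(0, j - L), min(len(X), j + L + 1)
        envelope = X[lo:hi].max(axis=0)          # long-term spectral envelope
        scores[j] = 10 * np.log10(
            np.mean(envelope**2 / (noise_spectrum**2 + 1e-12)) + 1e-12)
    return scores

def ltsd_vad(frames, noise_frames=10, L=3, c=1.5):
    """Decision rule from the slide: frame f is speech if LTSD(f) > c * mu,
    where mu is the mean LTSD over the initial noise frames."""
    noise_spectrum = np.abs(np.fft.rfft(frames[:noise_frames], axis=1)).mean(axis=0)
    scores = ltsd(frames, noise_spectrum, L)
    mu = scores[:noise_frames].mean()
    return scores > c * mu
```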
Feature extraction
• Representation of speakers
• Mel-frequency cepstral coefficients (MFCC)
  • Imitate human hearing
• Linear predictive coding (LPC)
  • Linear function of previous samples
MFCC
• Pipeline: Hamming window → FFT → mel-scale filterbank → log → FFT (cepstral transform)
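A compact sketch of this pipeline for one frame. The final cepstral step is written as a DCT-II of the log filterbank energies, the usual realization of the slide's second FFT; the sample rate, filter count, and 10 output coefficients are assumptions (10 matches the optimum reported in the experiments):

```python
import numpy as np

def mfcc(frame, sample_rate=16000, n_filters=26, n_coeffs=10):
    """One frame through the slide's pipeline: Hamming window -> FFT ->
    mel-scale filterbank -> log -> cepstral transform (DCT here)."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Filter corner frequencies, evenly spaced on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    freqs = inv_mel(np.linspace(0, mel(sample_rate / 2), n_filters + 2))
    bins = np.floor((len(frame) + 1) * freqs / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):           # triangular filters
        fbank[i - 1, bins[i - 1]:bins[i]] = np.linspace(
            0, 1, bins[i] - bins[i - 1], endpoint=False)
        fbank[i - 1, bins[i]:bins[i + 1]] = np.linspace(
            1, 0, bins[i + 1] - bins[i], endpoint=False)
    log_energies = np.log(fbank @ power + 1e-12)
    # DCT-II decorrelates the log energies; keep the first n_coeffs
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * k + 1) / (2 * n_filters))
    return basis @ log_energies

print(mfcc(np.random.default_rng(0).normal(size=512)).shape)  # (10,)
```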
LPC
• A pth-order linear function of the previous samples is estimated: s(n) ≈ a_1·s(n−1) + … + a_p·s(n−p)
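A sketch of the autocorrelation method for estimating the pth-order predictor; the default order 8 matches the optimum reported in the experiments, and the function name is mine:

```python
import numpy as np

def lpc(signal, order=8):
    """Estimate coefficients a_1..a_p of the pth-order linear predictor
    s(n) ~ a_1*s(n-1) + ... + a_p*s(n-p) (autocorrelation method)."""
    # Autocorrelation at lags 0..order
    r = np.array([signal[:len(signal) - k] @ signal[k:] for k in range(order + 1)])
    # Normal equations: Toeplitz system R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Check on a signal that really is a linear function of its past:
# s(n) = 0.9*s(n-1) plus small noise should give a_1 close to 0.9.
rng = np.random.default_rng(0)
s = np.zeros(2000)
for n in range(1, 2000):
    s[n] = 0.9 * s[n - 1] + rng.normal(0, 0.1)
print(lpc(s, order=2))  # roughly [0.9, 0.0]
```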
Speaker recognition
• Nearest neighbor
  • Euclidean distance
• Neural network
  • Multilayer perceptron
Nearest neighbor
• Features are compared pairwise
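A sketch of the pairwise comparison with Euclidean distance; the majority vote over frames is one plausible way to turn per-frame matches into a single speaker label and is not spelled out on the slide:

```python
import numpy as np

def nearest_neighbor_speaker(test_frames, train_frames, train_labels):
    """Compare each test feature vector pairwise against all training
    vectors (Euclidean distance) and take a majority vote over frames."""
    votes = []
    for f in test_frames:
        d = np.linalg.norm(train_frames - f, axis=1)
        votes.append(train_labels[np.argmin(d)])
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]

# Toy stand-in for MFCC vectors of two speakers
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(0, 1, (40, 10)), rng.normal(3, 1, (40, 10))])
labels = np.array([0] * 40 + [1] * 40)
print(nearest_neighbor_speaker(rng.normal(3, 1, (5, 10)), train, labels))  # 1
```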
Neural network
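The slide gives no further detail, so this is a hedged sketch using scikit-learn's MLPClassifier as a stand-in multilayer perceptron; the 25-node hidden layer mirrors the optimum reported below for the self-recorded database, and the data is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy stand-in for MFCC feature vectors of two speakers
train = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
labels = np.array([0] * 50 + [1] * 50)

# One hidden layer; the experiments below tune its size (25 or 100 nodes)
clf = MLPClassifier(hidden_layer_sizes=(25,), max_iter=1000, random_state=0)
clf.fit(train, labels)
print(clf.predict(rng.normal(3, 1, (3, 10))))  # expect speaker 1
```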
Multi-speaker recognition
• Preprocessing using VAD
• Consecutive speech frames form segments
• Single-speaker recognition per segment
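A sketch of the segmentation step: consecutive speech frames from the VAD are grouped into segments, and each segment would then be passed to the single-speaker recognizer. The minimum segment length is an assumed parameter:

```python
def speech_segments(vad_flags, min_len=5):
    """Group consecutive speech frames (the VAD output) into
    (start, end) index pairs, dropping segments shorter than min_len."""
    segments, start = [], None
    for i, is_speech in enumerate(vad_flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(vad_flags) - start >= min_len:
        segments.append((start, len(vad_flags)))
    return segments

print(speech_segments([False] * 3 + [True] * 8 + [False] * 2 + [True] * 6))
# [(3, 11), (13, 19)]
```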
Experiments: VAD
• Hand-labeled samples
• Percentage of correctly classified frames
• False negatives
Results: VAD
• Entropy-based: correctly classified 65.3%, false negatives 9.3%
• Power-based: correctly classified 76.3%, false negatives 6.2%
• Long-term spectral divergence: correctly classified 79.0%, false negatives 1.6%
Experiments: feature extraction
• Number of coefficients
• MFCC: optimal at 10 coefficients (90.9%)
• LPC: optimal at 8 coefficients (77.3%)
Experiments: single-speaker recognition
• Professional vs. built-in laptop microphone
• Silence removal
Experiments: neural network
• Optimal number of hidden nodes
  • Self-recorded database: 25 nodes
  • Voxforge database: 100 nodes
Experiments: neural network
• Training cycles
Experiments: multi-speaker recognition
• Self-made samples, optimal settings used
• Neural network: 66.7%
• Nearest neighbor: 76.5%
Discussion
• Is nearest neighbor better than a neural network?
• The neural network is more broadly applicable
• VAD gives no improvement
Conclusions
• LTSD is the best VAD method
• MFCC outperforms LPC
• Training and testing with different microphones gives significantly lower accuracy
• Nearest neighbor works better than an optimized neural network
Questions?