Mark D. Skowronski and John G. Harris Computational Neuro-Engineering Lab

Statistical automatic identification of microchiroptera from echolocation calls
Lessons learned from human automatic speech recognition
Mark D. Skowronski and John G. Harris
Computational Neuro-Engineering Lab
Electrical and Computer Engineering
University of Florida
Gainesville, FL, USA
November 19, 2004

Overview
• Motivations for bat acoustic research
• Review bat call classification methods
• Contrast with 1970s human ASR
• Experiments
• Conclusions

Bat research motivations
• Bats are among:
  • the most diverse,
  • the most endangered,
  • and the least studied mammals.
• 1000 species, ~25% of all mammal species
• Close relationship with insects, agricultural impact, disease vectors
• Acoustical research non-invasive, significant domain (echolocation)
• Simplified biological acoustic communication system (compared to human speech)

Bat echolocation
• Ultrasonic, brief chirps
• Determine range, velocity of nearby objects (clutter, prey, conspecifics)
• Tailored for task, environment
Tadarida brasiliensis (Mexican free-tailed bat)
[Audio: 10x time-expanded search calls]

Echolocation calls
• Two characteristics
  • Frequency modulated -- range
  • Constant frequency -- velocity
• Features (holistic)
  • Freq. extrema
  • Duration
  • Shape
  • # harmonics
  • Call interval
[Figure: Mexican free-tailed calls, concatenated]
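Frequency extrema and duration are the kind of holistic features that can be read off a call with very little machinery. The sketch below is only an illustration, not the authors' extraction code: it synthesizes a linear FM downsweep and estimates Fmin, Fmax, and duration from interpolated zero crossings. The 50 kHz to 25 kHz sweep and the 10 ms duration are assumed values chosen for the example; the 200 kHz sampling rate matches the A/D card listed under data collection. The function names are hypothetical.

```python
import math

def synth_fm_call(f_start, f_end, duration, fs):
    """Synthetic linear FM sweep from f_start to f_end (Hz), `duration` seconds long."""
    n = int(round(duration * fs))
    k = (f_end - f_start) / duration  # sweep rate in Hz per second
    return [math.sin(2 * math.pi * (f_start * t + 0.5 * k * t * t))
            for t in (i / fs for i in range(n))]

def zero_crossing_times(x, fs):
    """Times (s) of sign changes, linearly interpolated between samples."""
    times = []
    for i in range(1, len(x)):
        a, b = x[i - 1], x[i]
        if (a < 0 <= b) or (a >= 0 > b):
            frac = a / (a - b)  # fractional position of the crossing
            times.append((i - 1 + frac) / fs)
    return times

def call_features(x, fs):
    """Fmin, Fmax (Hz) and duration (s) from zero-crossing intervals."""
    t = zero_crossing_times(x, fs)
    # Consecutive crossings are half a period apart.
    freqs = [0.5 / (t[i + 1] - t[i]) for i in range(len(t) - 1)]
    return {"fmin": min(freqs), "fmax": max(freqs), "duration": t[-1] - t[0]}

fs = 200_000  # Hz, assumed to match the 200 kS/s A/D card
call = synth_fm_call(50_000, 25_000, 0.010, fs)  # assumed sweep parameters
features = call_features(call, fs)
print(features)
```

On clean synthetic data the estimates land within a few percent of the true sweep limits. On field recordings, noise adds spurious crossings, which is one reason such feature estimates can be brittle.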

Current classification methods
• Expert sonogram readers
  • Manual or automatic feature extraction
  • Comparison with exemplar sonograms
• Automatic classification
  • Decision trees
  • Discriminant function analysis
  • Artificial neural networks
  • Spectrogram correlation
These methods parallel the knowledge-based approach to human ASR from the 1970s (acoustic phonetics, expert systems, cognitive approach).

Acoustic phonetics
[Figure: phoneme-labeled utterance: DH AH F UH T B AO L G EY EM IH Z OW V ER]
• Bottom-up paradigm
  • Frames, boundaries, groups, phonemes, words
• Manual or automatic feature extraction
  • Formants, voicing, duration, intensity, transitions
• Classification
  • Decision tree, discriminant functions, neural network, Gaussian mixture model, Viterbi path

Acoustic phonetics limitations
• Variability of conversational speech
• Complex rules, difficult to train
• Boundaries difficult to define
• Coarticulation
• Feature estimates brittle
• Variable noise robustness
• Hard decisions, errors accumulate
The field shifted to the information theoretic paradigm of human ASR, which is better able to account for variability in speech and for noise.

Information theoretic ASR
• Data-driven models from computer science
  • Non-parametric: dynamic time warp (DTW)
  • Parametric: hidden Markov model (HMM)
• Frame-based
• Expert information in feature extraction
• Models account for feature, temporal variability
Information theoretic ASR dominates state-of-the-art speech understanding systems.
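To make the non-parametric option concrete, here is a minimal sketch of dynamic time warping used as a template classifier. It assumes each call has already been reduced to a per-frame frequency track (values in kHz); the tracks, the labels "steep" and "shallow", and the absolute-difference local cost are illustrative assumptions, not details taken from the experiments.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two scalar sequences.

    Local cost is |a[i] - b[j]|; allowed steps are insertion,
    deletion, and match (the three neighbours of each cell).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def classify(query, templates):
    """Return the label of the stored template closest to `query`."""
    return min(templates, key=lambda t: dtw_distance(query, t[1]))[0]

# Hypothetical frequency tracks (kHz), one template per class.
templates = [
    ("steep", [50, 45, 40, 35, 30, 25]),
    ("shallow", [30, 29, 28, 27, 26, 25]),
]
# A slower rendition of the steep sweep: same shape, more frames.
query = [50, 48, 45, 42, 40, 37, 35, 32, 30, 27, 25]
label = classify(query, templates)
print(label)
```

There is no training step beyond storing templates, but every classification costs one DTW alignment per template. This sketch pins both endpoints of the alignment and does not implement any relaxed endpoint range.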

Data collection
• UF Bat House, home to 60,000 bats
  • Mexican free-tailed bat (vast majority)
  • Evening bat
  • Southeastern myotis
• Continuous recording
  • 90 minutes around sunset
  • ~20,000 calls
• Equipment:
  • B&K mic (4939), 100 kHz
  • B&K preamp (2670)
  • Custom amp/AA filter
  • NI 6036E 200 kS/s A/D card
  • Laptop, Matlab

Experiment design
• Designs and assumptions
  • All recorded bats are Mexican free-tailed
  • Calls divided into intraspecies call classes
  • All calls are search phase
  • Hand-labeled call detection is complete (no discarded calls)
• Hand labels
  • Narrowband spectrogram
  • Endpoints, class label
  • 436 calls in 261 0.5-sec sequences (2% of data)
  • Four classes, a priori probabilities: 34%, 40%, 20%, 6%
• All experiments on hand-labeled data only
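The a priori class proportions give a useful yardstick for any accuracy figure on this data set. As a small worked calculation (the class names are placeholders, since the slides do not name the four classes), the snippet below derives the accuracy of two trivial strategies: always guessing the largest class, and guessing at random in proportion to the priors.

```python
# A priori class proportions from the hand-labeled set (class names are hypothetical).
priors = {"class 1": 0.34, "class 2": 0.40, "class 3": 0.20, "class 4": 0.06}
n_calls = 436

# Approximate call counts per class (the slide gives rounded percentages only).
approx_counts = {c: p * n_calls for c, p in priors.items()}

# Always predict the most common class.
majority_accuracy = max(priors.values())

# Guess each class with probability equal to its prior.
proportional_accuracy = sum(p * p for p in priors.values())

print(approx_counts)
print(f"majority-class accuracy: {majority_accuracy:.1%}")
print(f"prior-proportional guessing: {proportional_accuracy:.1%}")
```

Under these assumptions a classifier has to beat roughly 40% correct before it is doing better than ignoring the acoustics entirely.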

Experiments
• Baseline
  • Features: Fmin, Fmax, Fmax_energy, and duration, from zero crossings and MUSIC
  • Classifier: discriminant function analysis, quadratic boundaries
• DTW and HMM
  • Frame-based features: fundamental frequency (MUSIC super-resolution estimate), log energy, temporal derivatives (HMM only)
  • DTW: MUSIC frequencies, 10% endpoint range
  • HMM: 5 states/model, 4 Gaussian mixtures/state, diagonal covariances
• Tests
  • Leave one out
  • 75% train, 25% test, 1000 trials
  • Test on train (HMM only)
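For readers unfamiliar with how an HMM scores a call, the following sketch evaluates the log-likelihood of a frame sequence with the forward algorithm, using Gaussian-mixture emissions with diagonal covariances. It covers scoring only; parameter training (for example Baum-Welch) is not shown. The toy models are assumptions made for illustration: five left-to-right states with a single Gaussian component each, a one-dimensional frequency feature in kHz, and hand-picked means and variances. The left-to-right topology and all function names are likewise assumptions, not details from the experiments.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def log_emission(x, mixture):
    """mixture: list of (weight, mean, var) components for one state."""
    return logsumexp([math.log(w) + log_gauss_diag(x, mu, var)
                      for w, mu, var in mixture])

def log_likelihood(frames, log_pi, log_A, states):
    """Forward algorithm in the log domain: log p(frames | model)."""
    n = len(states)
    alpha = [log_pi[s] + log_emission(frames[0], states[s]) for s in range(n)]
    for x in frames[1:]:
        alpha = [logsumexp([alpha[r] + log_A[r][s] for r in range(n)])
                 + log_emission(x, states[s]) for s in range(n)]
    return logsumexp(alpha)

def make_left_to_right(track, var=4.0, stay=0.5):
    """Toy model: one state per entry of `track` (kHz), one Gaussian per state."""
    n = len(track)
    log_pi = [safe_log(1.0 if s == 0 else 0.0) for s in range(n)]
    A = [[0.0] * n for _ in range(n)]
    for s in range(n):
        if s + 1 < n:
            A[s][s], A[s][s + 1] = stay, 1.0 - stay
        else:
            A[s][s] = 1.0
    log_A = [[safe_log(p) for p in row] for row in A]
    states = [[(1.0, [f], [var])] for f in track]
    return log_pi, log_A, states

# Two hypothetical call classes, each modelled by a 5-state HMM.
models = {
    "steep": make_left_to_right([50, 44, 38, 32, 26]),
    "shallow": make_left_to_right([30, 29, 28, 27, 26]),
}
frames = [[49], [47], [44], [40], [37], [33], [30], [27]]
scores = {name: log_likelihood(frames, *m) for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Classification is one forward pass per model, which is cheap once the parameters are known. The cost sits in estimating those parameters, and four mixture components per state means many more parameters than this single-component toy.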

Results
• Baseline, zero crossing
  • Leave one out: 72.5% correct
  • Repeated trials: 72.5 ± 4% (mean ± std)
• Baseline, MUSIC
  • Leave one out: 79.1%
  • Repeated trials: 77.5 ± 4%
• DTW, MUSIC
  • Leave one out: 74.5%
  • Repeated trials: 74.1 ± 4%
• HMM, MUSIC
  • Test on train: 85.3%
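The two held-out protocols behind these figures, leave one out and repeated 75%/25% splits reported as mean ± std, can be sketched as follows. The nearest-class-mean classifier and the synthetic one-dimensional data are placeholders chosen to keep the example short; they stand in for the discriminant, DTW, and HMM classifiers and are not the experimental setup.

```python
import random
import statistics

def nearest_mean_fit(train):
    """Placeholder classifier: store the mean feature value of each class."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    return {y: statistics.mean(xs) for y, xs in by_label.items()}

def nearest_mean_predict(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

def accuracy(train, test):
    model = nearest_mean_fit(train)
    hits = sum(nearest_mean_predict(model, x) == y for x, y in test)
    return hits / len(test)

def leave_one_out(data):
    """Hold out each sample once; train on all the others."""
    hits = 0.0
    for i in range(len(data)):
        hits += accuracy(data[:i] + data[i + 1:], [data[i]])
    return hits / len(data)

def repeated_splits(data, train_frac=0.75, trials=1000, seed=0):
    """Random train/test splits; returns (mean, std) of test accuracy."""
    rng = random.Random(seed)
    n_train = int(train_frac * len(data))
    scores = []
    for _ in range(trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        scores.append(accuracy(shuffled[:n_train], shuffled[n_train:]))
    return statistics.mean(scores), statistics.stdev(scores)

# Synthetic, well-separated two-class data (illustrative only).
rng = random.Random(1)
data = ([(rng.uniform(-1, 1), "A") for _ in range(20)]
        + [(10 + rng.uniform(-1, 1), "B") for _ in range(20)])

loo = leave_one_out(data)
mean_acc, std_acc = repeated_splits(data)
print(f"leave one out: {loo:.1%}")
print(f"repeated trials: {mean_acc:.1%} ± {std_acc:.1%}")
```

Testing on the training data skips the hold-out step altogether, so a test-on-train figure is optimistic and is not directly comparable with leave-one-out or repeated-trial numbers.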

Confusion matrices
[Figure panels: Baseline, zero crossing; Baseline, MUSIC; DTW, MUSIC; HMM, MUSIC]

Conclusions
• Human ASR algorithms applicable to bat echolocation calls
• Experiments
  • Weakness: accuracy of class labels
  • No labeled calls excluded
  • HMM most accurate (test on train only), undertrained
  • MUSIC frequency estimate robust, slow
• Machine learning
  • DTW: fast training, slow classification
  • HMM: slow training, fast classification

Future work
• Find robust features of bat echolocation calls that match assumptions of machine learning algorithms
  • Noise robust
  • Distribution modeled by Gaussian mixtures
• Use hand-labeled subset of data to create call detection algorithm
• Explore unsupervised learning
  • Self-organizing maps
  • Clustering
• Real-time portable detection/classification system on laptop PC

Further information
• http://www.cnel.ufl.edu/~markskow
• markskow@cnel.ufl.edu
• DTW reference:
  • L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
• HMM reference:
  • L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” in Readings in Speech Recognition, A. Waibel and K.-F. Lee, Eds., pp. 267–296. Morgan Kaufmann, San Mateo, CA, 1990.
