Understanding the soundscape concept: the role of sound recognition and source identification David Chesmore, Audio Systems Laboratory, Department of Electronics, University of York
Overview of Presentation • Role of soundscape analysis • Instrument for Soundscape Recognition, Identification and Evaluation (ISRIE) • Soundscape description language • Applications • Conclusions
Role of Soundscape Analysis • Potential applications: • identifying relevant sound elements in a soundscape (e.g. high intensity sounds) • determining positive and negative sounds • biodiversity studies • tranquil areas • preserving important soundscapes • planning and noise abatement studies
Soundscape Analysis Options • Manual • Advantage: captures subjective human perception • Disadvantages: time consuming, limited resources, subjective (inconsistent between listeners), very large storage requirements • Automatic • Advantages: objective (once trained), continuous analysis possible, much reduced data storage requirements • Disadvantage: reliability of sound element classification
How to Automatically Classify Sounds? • Major issues to address: • separation and localisation of sounds in the soundscape (especially with multiple simultaneous sounds) • classification of sounds, which depends on feature overlap and the number of elements • The number of elements, localisation, etc. depend on the application
Instrument for Soundscape Recognition, Identification and Evaluation (ISRIE) • ISRIE is a collaborative project between York, Southampton and Newcastle Universities • 1 of 3 projects arising from EPSRC Noisy Futures Sandpit • York - sound separation + sound classification • Southampton - applications + interface with users • Newcastle - sound localisation + arrays
Aim of ISRIE • Aim is to produce an instrument capable of automatically identifying sounds in a soundscape by: • separating sounds in the 3-D sound field • localising sounds within the 3-D field • classifying sounds into a restricted range of categories
Outline of ISRIE (block diagram): Sensor → Localisation + Separation → Classification • localisation/separation outputs: location (alt, az), duration, SPL, LEQ • classification output: category
Sound Separation - Sensor • B-format microphone as sensor • Provides 3-D directional information • A coincident microphone array reduces convolutive separation problems to instantaneous ones • More compact and practical than multi-microphone solutions • Outputs: W - omnidirectional component; X - figure-8 response along x-axis; Y - figure-8 response along y-axis; Z - figure-8 response along z-axis
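As a quick illustration of the channel definitions above, a mono source at a known direction can be mixed to virtual B-format with the standard first-order Ambisonics panning equations (this is how the test material for the separation results below was produced; the function is a sketch, not the ISRIE code):

```python
import numpy as np

def encode_b_format(sig, az_deg, el_deg):
    """Mix a mono signal to virtual B-format at a known direction
    using standard first-order Ambisonics panning (sketch only)."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    w = sig / np.sqrt(2)                # W: omnidirectional (-3 dB convention)
    x = sig * np.cos(az) * np.cos(el)   # X: figure-8 along x
    y = sig * np.sin(az) * np.cos(el)   # Y: figure-8 along y
    z = sig * np.sin(el)                # Z: figure-8 along z
    return w, x, y, z
```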
Overview of Separation Method • Use a coincident microphone array • Transform into the time-frequency domain • Find the direction-of-arrival (DOA) vector for each time-frequency point (a sketch follows) • Filter sources based on known or estimated positions in 3-D space
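A minimal sketch of the per-point DOA step, assuming B-format input and an STFT as the time-frequency transform (the project itself uses the DT-CWT, described below). The intensity-vector approach shown here is one common way to derive a DOA from B-format and is not necessarily the ISRIE method:

```python
import numpy as np
from scipy.signal import stft

def doa_per_tf_point(w, x, y, z, fs, nperseg=1024):
    """Estimate a DOA unit vector at every time-frequency point
    from the four B-format channels (illustrative sketch)."""
    _, _, W = stft(w, fs, nperseg=nperseg)
    _, _, X = stft(x, fs, nperseg=nperseg)
    _, _, Y = stft(y, fs, nperseg=nperseg)
    _, _, Z = stft(z, fs, nperseg=nperseg)

    # Active-intensity-style vector at each TF point, shape (3, freq, time);
    # with B-format channel conventions it points toward the source.
    I = np.real(np.conj(W)[None] * np.stack([X, Y, Z]))
    doa = I / (np.linalg.norm(I, axis=0, keepdims=True) + 1e-12)

    az = np.degrees(np.arctan2(doa[1], doa[0]))          # azimuth
    el = np.degrees(np.arcsin(np.clip(doa[2], -1, 1)))   # elevation
    return az, el
```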
Assumptions • Approximately W-disjoint orthogonal: sparse in the time-frequency domain, i.e. the power in any time-frequency window is attributed to one source • Sound sources are geographically spaced (sparse) • Noise sources have unique directions of arrival (DOA)
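In the notation standard in the source-separation literature (not taken from these slides), W-disjoint orthogonality says that at most one source carries the power at any time-frequency point:

```latex
% S_i(\tau,\omega) is the time-frequency transform of source i
S_i(\tau,\omega)\, S_j(\tau,\omega) \approx 0
\qquad \text{for all } i \neq j \text{ and all } (\tau,\omega)
```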
The Dual Tree Complex Wavelet Transform (DT-CWT) • Efficient filterbank structure • Approximately shift invariant
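For readers who want to experiment, the third-party Python package dtcwt provides a 1-D DT-CWT; this is only a convenient stand-in, not the project's implementation:

```python
import numpy as np
import dtcwt  # third-party package: pip install dtcwt

fs = 48000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 440 * t)  # 1 s test tone

# Forward 1-D dual-tree complex wavelet transform, 5 levels
pyramid = dtcwt.Transform1d().forward(sig, nlevels=5)

# pyramid.highpasses is a tuple of complex coefficient arrays, one per
# level; their magnitudes are approximately shift invariant
for level, hp in enumerate(pyramid.highpasses, start=1):
    print(level, hp.shape, np.abs(hp).mean())
```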
Separation results - speech • 3 male speakers • Recorded in the anechoic chamber at ISVR; mixed to virtual B-format at known locations spaced around the microphone
Source Estimation and Tracking • The examples used known source locations; in many deployment scenarios this is acceptable • More versatility could be provided by finding and tracking source locations • Two approaches considered: • 3-D histogram approach (a toy sketch follows) • Clustering using a plastic self-organising map
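A toy version of the histogram approach, using a plain azimuth/elevation grid rather than a true geodesic binning, standing in for (not reproducing) the ISRIE tracker:

```python
import numpy as np

def doa_histogram_peaks(az, el, n_peaks=2, bins=36):
    """Locate sources by histogramming the per-TF DOA estimates
    and picking the largest peaks (illustrative sketch)."""
    H, az_edges, el_edges = np.histogram2d(
        az.ravel(), el.ravel(),
        bins=[bins, bins // 2], range=[[-180, 180], [-90, 90]])

    peaks = []
    for _ in range(n_peaks):
        i, j = np.unravel_index(np.argmax(H), H.shape)
        peaks.append((az_edges[i:i + 2].mean(), el_edges[j:j + 2].mean()))
        # zero a neighbourhood so the next peak is a different source
        H[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2] = 0
    return peaks
```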
Results (figure: directional geodesic histogram for 2 speakers) • Peaks at (0, 0) and (10, 20) degrees • Blur between the peaks because the two sources only approximate the assumptions
Signal Classification • What features? TDSC • Which classifier? ANN - MLP, LVQ • Which sounds?
Time-Domain Signal Coding • Purely time-domain technique • Successfully used for: • Species recognition • birds, crickets, bats, wood-boring insects • Heart sound recognition • Current applications • Environmental sound • Vehicle recognition
Time-Domain Signal Coding (figure: waveform divided into epochs along the time axis)
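TDSC segments the waveform into epochs between successive positive-going zero crossings and codes each epoch by its duration and shape. A minimal sketch based on published descriptions of TDSC; the shape convention used here (counting positive minima) is one common variant and may differ in detail from the author's code:

```python
import numpy as np

def tdsc_epochs(sig):
    """Split a signal into TDSC epochs, returning one
    (duration_in_samples, shape) pair per epoch."""
    # indices where the signal crosses zero going positive
    zc = np.where((sig[:-1] < 0) & (sig[1:] >= 0))[0] + 1
    epochs = []
    for start, end in zip(zc[:-1], zc[1:]):
        seg = sig[start:end]
        interior = seg[1:-1]
        # shape: count of local minima that stay above zero
        minima = (interior < seg[:-2]) & (interior < seg[2:]) & (interior > 0)
        epochs.append((end - start, int(minima.sum())))
    return epochs
```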
Multiscale TDSC (MTDSC) • New method of presenting duration-shape (D-S) data • Replaces the S-matrix, A-matrix or D-matrix • Multiscale: built from groups of epochs in powers of 2 (512, 256, etc.) • Inspired by wavelets
MTDSC (figure: MTDSC value in frame, n = 4)
MTDSC Example (figure): logarithmic chirp, 100 Hz-24 kHz; epoch frame length 2^m
MTDSC (cont.) • Currently uses shape, but will investigate: • epoch duration (zero-crossing interval) only • epoch duration and shape • epoch duration, shape and energy • Also uses the mean; variance and higher-order statistics can be used for larger values of m (e.g. 9). A sketch of the multiscale grouping follows.
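Under the description above, per-epoch features (here just shape) are averaged over frames of 2^m epochs, wavelet-style; variance or higher-order statistics could be appended at the larger scales. Names and details are illustrative:

```python
import numpy as np

def mtdsc_features(shapes, max_level=9):
    """Build a multiscale TDSC feature vector from a per-epoch
    shape sequence (e.g. from tdsc_epochs above). Sketch only."""
    feats = []
    for m in range(1, max_level + 1):
        frame = 2 ** m           # group epochs in powers of 2
        n = len(shapes) // frame
        if n == 0:
            break                # sequence too short for this scale
        grouped = np.asarray(shapes[:n * frame], float).reshape(n, frame)
        feats.append(grouped.mean(axis=1))   # mean per frame
    return np.concatenate(feats)
```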
MTDSC Results (1) • Pipeline: audio → MTDSC data generation & stacking → 3-output LVQ network • Winning output determines the result (a minimal LVQ sketch follows) • Overall network accuracy: 76% • Some categories better than others • Road, rail: 93%
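For orientation, a minimal LVQ1 classifier with one prototype per class; the slide's 3-output LVQ network will differ in detail (prototype counts, learning schedule), so treat this purely as a sketch:

```python
import numpy as np

def train_lvq1(X, y, n_classes=3, lr=0.05, epochs=50, seed=0):
    """Minimal LVQ1: one prototype per class, attracted toward
    correctly classified samples and repelled otherwise."""
    rng = np.random.default_rng(seed)
    protos = np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(protos - X[i], axis=1)
            w = np.argmin(d)                      # winning prototype
            sign = 1.0 if w == y[i] else -1.0     # attract or repel
            protos[w] += sign * lr * (X[i] - protos[w])
    return protos

def classify(protos, x):
    # the winning (nearest) prototype determines the category
    return int(np.argmin(np.linalg.norm(protos - x, axis=1)))
```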
MTDSC Results (2) • 3 different Japanese cicada species used for biodiversity studies (2 common, 1 rare) in northern Japan • 21 test files from field recordings including 1 with -6dB SNR • Backpropagation MLP classifier • 20 out of 21 test files correctly classified • ~ 95% accuracy
Practical ISRIE (block diagram): as in the outline above (Sensor → Localisation + Separation → Classification), plus user-supplied data: approximate location and required sound category
Restricting Location (figure): a cone of acceptance around the target direction; signals arriving from outside the cone are automatically rejected
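The acceptance test itself is a simple angular threshold; a sketch (function and parameter names are illustrative):

```python
import numpy as np

def in_cone(doa, target, half_angle_deg):
    """Accept a DOA unit vector only if it lies within the cone of
    acceptance around the target direction; reject it otherwise."""
    target = target / np.linalg.norm(target)
    cos_angle = np.clip(np.dot(doa, target), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) <= half_angle_deg
```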
Further Automated Analysis • At present, ISRIE only provides a classified sound element in a small range of categories • Can we create a soundscape description language (SDL)? • Needs to be flexible enough to accommodate manually and automatically generated soundscapes • Take inspiration from speech recognition, natural language, bioacoustics (e.g. automated ID of insects, birds, bats, cetaceans)
sonotag = G(L, θ, φ, d, t, D, a, c, p, G) where
L = label
θ, φ = direction of sound
d = estimated distance to sound
t = onset time
D = duration
a = received sound pressure level
c = classification (a = automatic, m = manual)
p = level of confidence in classification
G = geotag = G(ll, lo, al), where ll = latitude, lo = longitude, al = altitude
• Other possibilities exist
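One possible concrete encoding of the sonotag record, as a Python dataclass; field names are illustrative and not part of the slides:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Sonotag:
    label: str           # L - label
    azimuth: float       # theta - direction of sound (deg; -1 if unknown)
    elevation: float     # phi - direction of sound (deg; -1 if unknown)
    distance: float      # d - estimated distance to sound (m)
    onset: str           # t - onset time
    duration: float      # D - duration (s)
    spl: float           # a - received sound pressure level
    classification: str  # c - 'a' automatic or 'm' manual
    confidence: float    # p - confidence in classification
    geotag: Tuple[float, float, float]  # G - (latitude, longitude, altitude)

# first sonotag from the monaural example below
plane = Sonotag("plane", -1, -1, 100, "11:52.5", 5, 35,
                "a", 0.96, (53.914, -0.845, 10))
```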
Example of Monaural Sonotags 18 s recording of O. viridulus at a nature reserve in Yorkshire in 2003 G(plane,-1,-1,100,11:52.5,5,35,a,0.96,(53.914,-0.845,10)) G(Bird1,-1,-1,100,12:02,5,41,a,0.99,(53.914,-0.845,10)) G(O. viridulus,-1,-1,1,11:50,1.5,50,a,0.99,(53.914,-0.845,10)) G(O. viridulus,-1,-1,1,11:45,2,50,a,0.99,(53.914,-0.845,10))
Example of 3-D Sonotags • Treat separated sounds as monaural recordings for classification: G(speaker1,10,20,2,14:00,5,42,a,0.92,(53.9,-0.9,10)) G(speaker2,0,0,1.5,14:00,5,43,a,0.96,(53.9,-0.9,10))
Applications (1) • BS 4142 assessments • PPG 24 assessments • Noise nuisance applications • Other acoustic consultancy problems • Soundscape recordings • Future noise policy
Applications (2) • Biodiversity assessment, endangered species monitoring • Alien invasive species (e.g. cane toad in Australia) • Anthropogenic noise effects on animals • Habitat fragmentation • Tranquillity studies
Conclusions • ISRIE has been shown to be successful in separating and classifying urban sounds • much work still to be done, especially in classification • Automated soundscape description is possible but a flexible and formal framework is needed