ASSESS: a descriptive scheme for speech in databases Roddy Cowie
To refresh people’s memory… • ASSESS embodies an approach to processing the audio element of a database • It is about going beyond the raw audio signal; • Providing processing that a lot of people might want, • But not everyone can do.
ASSESS covers several levels: • Basic transformations of the signal; • Key boundaries and the units that go with them; • Properties of the units. • The system generates a lot of files, but a lot of the things you might want are there if you know where to look
The processes ASSESS uses • A reasonable model: • Developed for inconsiderate inputs • Robust • Maximise availability • Systematic rather than selective
ASSESS input characteristics • Input file: • Reasonably long (up to 2.5 mins) • 20 kHz sampling rate • No header (.raw, not .wav) • Messy, but conversion techniques are easily available
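One possible conversion route, sketched with Python's standard `wave` module: copy the sample data out of a WAV file into a headerless file. The function name `wav_to_raw` is hypothetical, and the slides do not state the sample width or byte order ASSESS expects, so the samples are copied unchanged. The sketch does not resample; it simply refuses files that are not already at 20 kHz.

```python
import wave


def wav_to_raw(wav_path, raw_path, expected_rate=20000):
    """Write the sample data of a WAV file to a headerless file.

    Returns the number of bytes written. Sample width and channel
    layout are copied as found (an assumption: the slides do not say
    what ASSESS expects).
    """
    with wave.open(wav_path, "rb") as wav:
        rate = wav.getframerate()
        if rate != expected_rate:
            # Resampling is out of scope for this sketch.
            raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
        frames = wav.readframes(wav.getnframes())
    with open(raw_path, "wb") as raw:
        raw.write(frames)
    return len(frames)
```

A file at another sampling rate would need resampling with an external tool before this step.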
Using ASSESS • Woefully undramatic • Supply 3 command lines • e.g. for a file called ‘test’ lasting x secs • filterbank test.raw test.spc 20000 • howard test.raw test.tx • stage2 test • Wait about x/2 secs • Admire outputs
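For batch use, the three command lines can be generated from the file stem. This is a minimal sketch: the helper names `assess_commands` and `run_assess` are hypothetical, and `run_assess` assumes the `filterbank`, `howard` and `stage2` executables are on the PATH. The argument order is copied from the example above.

```python
import subprocess


def assess_commands(stem, sample_rate=20000):
    """Return the three ASSESS command lines for <stem>.raw as argument lists."""
    return [
        ["filterbank", f"{stem}.raw", f"{stem}.spc", str(sample_rate)],
        ["howard", f"{stem}.raw", f"{stem}.tx"],
        ["stage2", stem],
    ]


def run_assess(stem):
    """Run the three stages in order, stopping at the first failure."""
    for cmd in assess_commands(stem):
        subprocess.run(cmd, check=True)
```

Calling `assess_commands("test")` reproduces the three lines shown for the file called ‘test’.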
Basic transformations and 1st order output • Intensity • 1/3 octave spectrum • ‘pulses’ corresponding to vocal cord openings • - basis for estimating pitch • 1st order output consists of 2 files • intensity & 1/3 octave spectrum • estimated ‘pulses’ • Everything else ASSESS calculates is derived from those
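As a reminder of what a 1/3 octave spectrum divides the signal into, the sketch below lists band edges and centre frequencies using the usual base-2 definition (centres at 1000 × 2^(k/3) Hz, edges a sixth of an octave either side). The band range ASSESS actually uses is not given in the slides; the 100 Hz to 10 kHz limits in the usage note are an assumption, with 10 kHz being the Nyquist frequency for 20 kHz sampling.

```python
import math


def third_octave_bands(f_low, f_high, f_ref=1000.0):
    """Return (lower_edge, centre, upper_edge) in Hz for every 1/3 octave
    band whose centre frequency lies between f_low and f_high."""
    k_min = math.ceil(3 * math.log2(f_low / f_ref))
    k_max = math.floor(3 * math.log2(f_high / f_ref))
    bands = []
    for k in range(k_min, k_max + 1):
        centre = f_ref * 2 ** (k / 3)
        # Each band spans one sixth of an octave either side of its centre.
        bands.append((centre * 2 ** (-1 / 6), centre, centre * 2 ** (1 / 6)))
    return bands
```

For example, `third_octave_bands(100, 10000)` lists the bands with centres between 100 Hz and 10 kHz.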
Conditioning 1st order outputs in ASSESS • Raw intensity • Scaled by parameter derived from a ‘reference’ file • - representing normal speaking level under same recording conditions • Clumsy, but checks show it allows reasonable comparison across files • Same scaling applied to spectrum
Conditioning 1st order outputs in ASSESS • Raw pulse estimates cleaned • by selecting sequences where intervals are very close • Results (in pink) comparable to standard autocorrelation, but easier to clean further • High noise associated with frication filtered using spectrum
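The idea of keeping only sequences of pulses whose intervals are very close can be sketched as follows. This illustrates the general technique and is not ASSESS's own code: the function name, the 10% tolerance and the minimum run length of three intervals are all assumptions.

```python
def stable_pulse_runs(pulse_times, rel_tol=0.1, min_intervals=3):
    """Return runs of pulse times whose successive intervals are nearly equal.

    An interval continues a run if it differs from the previous interval
    by at most rel_tol times that previous interval. Runs with fewer than
    min_intervals intervals are discarded.
    """
    intervals = [b - a for a, b in zip(pulse_times, pulse_times[1:])]
    runs = []
    start = 0  # index of the first interval in the current run
    for i in range(1, len(intervals) + 1):
        end_of_data = i == len(intervals)
        if end_of_data or abs(intervals[i] - intervals[i - 1]) > rel_tol * intervals[i - 1]:
            if i - start >= min_intervals:
                # Intervals start..i-1 lie between pulses start..i.
                runs.append(pulse_times[start:i + 1])
            start = i
    return runs
```

Each surviving run gives a stretch of regular voicing, with pitch obtainable as the reciprocal of the interval; isolated or irregular pulses are dropped.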
Conditioning 1st order outputs in ASSESS • Fitting a flexible ‘rope’ filters out extremes and captures the broad shape • (zeroes mark pause boundaries, which are taken into account)
Conditioning 1st order outputs in ASSESS • In contrast, standard methods try to correct for octave jumps - • with the kind of result shown in the lower panel
Boundary finding in ASSESS • Silences are found iteratively • find an intensity level that separates a cluster of low-intensity samples (pauses) from a cluster of high-intensity samples (speech); • fine-tune using the spectrum of the definite pauses. • Again, robust: in a comparison sample • a phonetician identified 503 pauses • ASSESS identified 498 • difference between times of corresponding bounds averaged • 10.4 ms for pause starts • -1.7 ms for pause ends • A similar approach is applied to frication
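The first step, finding an intensity level that separates a low cluster from a high cluster, can be sketched with a simple iterative two-cluster threshold: split the samples at the current threshold, move the threshold to the midpoint of the two cluster means, and repeat. This is a stand-in technique chosen for illustration; the slides do not give ASSESS's actual iteration, and the spectral fine-tuning step is not attempted here. Function names are hypothetical.

```python
def two_cluster_threshold(levels, max_iter=100, eps=1e-6):
    """Iteratively find a level separating low samples from high samples."""
    threshold = (min(levels) + max(levels)) / 2.0
    for _ in range(max_iter):
        low = [v for v in levels if v <= threshold]
        high = [v for v in levels if v > threshold]
        if not low or not high:
            break  # only one cluster present; nothing to separate
        new_threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2.0
        if abs(new_threshold - threshold) < eps:
            break
        threshold = new_threshold
    return threshold


def pause_spans(levels, threshold):
    """Return (start, end) index pairs, end exclusive, of runs at or below threshold."""
    spans = []
    start = None
    for i, v in enumerate(levels):
        if v <= threshold and start is None:
            start = i
        elif v > threshold and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(levels)))
    return spans
```

The spans are candidate pauses in units of intensity frames; converting them to times depends on the frame rate of the intensity file.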
2nd order output of ASSESS • .exm files specify • pitch and intensity contours • in terms of local maxima and minima • and speech/silence boundaries • episodes with frication (boundaries & average spectra)
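Describing a contour in terms of local maxima and minima can be illustrated with a minimal sketch. The layout of .exm files is not given in the slides, so the tuples returned here are purely illustrative, and flat plateaus are ignored because only strict peaks and troughs are reported.

```python
def local_extrema(contour):
    """Return (index, value, kind) for each strict interior peak or trough."""
    extrema = []
    for i in range(1, len(contour) - 1):
        prev, cur, nxt = contour[i - 1], contour[i], contour[i + 1]
        if cur > prev and cur > nxt:
            extrema.append((i, cur, "max"))
        elif cur < prev and cur < nxt:
            extrema.append((i, cur, "min"))
    return extrema
```

The same function applies to either a pitch contour or an intensity contour.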
Describing units – 3rd order outputs of ASSESS • Basic units: • Pauses • Tunes (structures between pauses lasting over 150ms) • Pauses have only duration • Tunes have multiple attributes, and ASSESS covers them systematically
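A minimal sketch of dividing a file into tunes, given pause boundaries. It reads the definition above as "pauses lasting over 150 ms delimit tunes", so shorter pauses stay inside a tune; that reading is an assumption, as are the function name and the use of milliseconds.

```python
def tunes_from_pauses(pauses, total_ms, min_pause_ms=150):
    """Return (start_ms, end_ms) tune spans between long pauses.

    pauses: list of (start_ms, end_ms) tuples sorted by start time.
    Pauses of min_pause_ms or less do not split a tune.
    """
    long_pauses = [(s, e) for s, e in pauses if e - s > min_pause_ms]
    tunes = []
    cursor = 0  # end of the most recent long pause
    for s, e in long_pauses:
        if s > cursor:
            tunes.append((cursor, s))
        cursor = e
    if cursor < total_ms:
        tunes.append((cursor, total_ms))
    return tunes
```

With pauses at 0–200 ms, 1000–1100 ms and 2000–2400 ms in a 3000 ms file, the 100 ms pause is absorbed into the first tune.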
Describing units – 3rd order outputs of ASSESS • Basic module of description (in .psg file) - Pattern repeated for pitch, & for each tune
Describing units – structural properties • Tune properties include • global slope & curvature of pitch contour, • movement at start and end, • measures of spectral balance & change • Relations between tunes include • abruptness of change from last tune • ‘crescendo’ … • etc.
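One way to put numbers on the global slope and curvature of a pitch contour is a least-squares quadratic fit over the tune. The slides do not say how ASSESS defines these measures, so this is an assumption: times are centred on their mean, the linear coefficient is taken as the slope and the quadratic coefficient as the curvature. Function names are hypothetical.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def slope_and_curvature(times, pitches):
    """Fit pitch = a + b*t + c*t**2 with t measured from the mean time.

    Returns (a, b, c): level at the mid-time, slope, quadratic coefficient.
    """
    n = len(times)
    if n < 3 or n != len(pitches):
        raise ValueError("need at least 3 matching time/pitch samples")
    mean_t = sum(times) / n
    tc = [t - mean_t for t in times]
    # Normal equations: a*s[k] + b*s[k+1] + c*s[k+2] = r[k] for k = 0, 1, 2.
    s = [sum(t ** k for t in tc) for k in range(5)]
    r = [sum(y * t ** k for t, y in zip(tc, pitches)) for k in range(3)]
    A = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = _det3(A)
    if d == 0:
        raise ValueError("times must contain at least 3 distinct values")
    coeffs = []
    for col in range(3):  # Cramer's rule, one coefficient per column
        m = [row[:] for row in A]
        for i in range(3):
            m[i][col] = r[i]
        coeffs.append(_det3(m) / d)
    return tuple(coeffs)
```

A negative quadratic coefficient corresponds to a contour that rises and then falls; a positive one to a dip.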
Summary • ASSESS is part system, part philosophy • The system delivers robust estimates of spectrum, F0 and intensity contours, key boundaries, and properties of the units they define • The philosophy is using signal processing expertise to make multiple alternatives at multiple levels available to others.