10 likes | 99 Views
Word Prominence Detection using Robust yet Simple Prosodic Features. Intonational Prominence in Spoken Communication Word prominence is acoustically realized by increased pitch , greater energy and longer duration than the neighboring words.
E N D
Word Prominence Detection using Robust yet Simple Prosodic Features • Intonational Prominence in Spoken Communication • Word prominence is acoustically realized by increased pitch, greater energy and longer duration than the neighboring words. • Indicates discourse-salient elements such as focus of an utterance, introduction of new topics, information status of a word (new or given), emotion or attitude about the topic being discussed, or simply to draw the listener's attention. • Automatic prominence detection or prediction is important for • Spoken language understanding (SLU): To identify discourse-salient elements • Text-to-speech (TTS) synthesis: To synthesize expressive speech • Challenges • Lexical and syntactic features are strongly correlated with the notion of word prominence. However, such features from the output of automatic speech recognition (ASR) is not reliable due to the recognition errors • Automatic identification of salient points (such as F0 peaks) in the F0 contour is inherently difficult. • Aggregate measures of F0 such as mean, slope, variance, max and minlose fair amount of salient information in the aggregation process, especially about the shape and amplitude of the F0 peak. • Our approach: Use computationally inexpensive prosodic features that capture the changes in F0 curve magnitude and shape, along with changes in duration and energy, without requiring explicit identification of particularly salient points • Contribution • We present novel prosodic features that capture changes in F0 curve shape and magnitude in conjunction with duration and energy • Features are robust with respect to contour approximation while being computationally inexpensive • New features are more predictive than standard aggregation-based features. • New features demonstrate significant improvements in word prominence accuracy on spontaneous speech. • Our features complement the lexico-syntactic features and outperform them when used in isolation(suitable for noisy text from ASR) • Prosodic Features ( ** denotes novel features) • ** Area under the F0 curve (AFC): This is intended to capture the raised F0 and the increased duration that is often associated with prominent words. It is the integral of the smoothed F0 and duration within the interval of the word. • ** Energy-F0-Integral (EFI): Since word prominence is also often accompanied by increase in energy, this feature is an integral of the F0, duration and energy spanning the within the interval of the word. • ** Voiced-to-unvoiced ratio (VUR): This is intended to act as measure of reliability. VUR informs the model how much the AFC and EFI features should be trusted. If the VUR is less than 0.5, then a majority of the segments in the word are unvoiced and most of F0 contour is obtained by smoothing and AFC and EFI are not very reliable. • ** F0 curve shape (SHP): For each word, isotonic regression is used to estimate the likelihood of the F0 curve associated with the word being (i) a rising curve, (ii) a falling curve, (iii) a curve containing a peak, and (iv) a curve containing a valley. • ** F0 peak/valley amplitude (FAMP) and location (FLOC): If a peak is encountered in the F0 contour of a word, its location is defined as its relative distance from the beginning of the word, and its amplitude is computed as the distance from the mean of the GMM-based low frequency component in the word interval. • Duration of the word (STANDARD-DUR): The duration of the word in number of 10 msec frames. • Aggregate statistics (AGG-STATS): Includes mean, median, max, min, and variance of F0 and energy computed per word. • Lexico-Syntactic Features • Word identity (WI): In our final prominence detection model, we do not want to use word identity as a feature because this feature does not generalize to unseen data. We have however used word identity as a feature for experimental comparison. • Part-of-speech tags (POS): The POS tags were hand-marked using the Penn Treebank tagset. • Word type: Each word was classified either as a content word or as a function word by using the POS information. • Number of syllables in word: Three values of this feature were considered; 1-syllable, 2-syllables, and more-than-2-syllables. Syllabification was performed using the AT&T's Natural Voices TTS system. • Break tags (BT): Three categories were considered. No break, small breaks (corresponding to a comma in punctuated text) and big breaks (corresponding to a terminal punctuation, e.g., period, question mark or exclamation in punctuated text). • Each lexico-syntactic feature was computed over a three word window: w(i-1), w(i), and w(i+1), where w(i) was the current word. • Experiments • Data description: • A subset of 67K words selected from the Switchboard Corpus. • Manually-corrected word segmentations and hand-marked prominence tags (Ostendorf et. al, 2001) • Words were simply marked prominent or not; roughly 1/3 marked prominent. • Dataset previously used for word prominence detection using prosodic and syntactic features (Sridhar et. al, 2008). • Prominence detection modeled as a binary classification task. Agiven word is classified as prominent (1) or non-prominent (0) based on 8 different subsets of the input features. • Two ensemble methods, Random Forest and AdaBoost, used to train models. • Randomly selected 70% of the data used for training, 15% for validation, and 15% for testing • Question: Which feature sets were best for word prominence detection? • Results: Word prominence detection models • Each feature set increases accuracy over the 69% baseline accuracy. • Results: Top-10 most important prosodic features • Four out of the top-5 most discriminative features are the new prosodic features. • Summary • Presented a set of novel prosodic features • That capture shape and magnitude changes in F0 • Are easily computed and robust to challenges of identifying salient points in the F0 contour • Are more discriminative than aggregate statistics of F0, duration & energy • In real-time prominence detection, new prosodic features are more predictive than • Commonly used aggregated statistics of F0, duration and energy • Lexico-syntactic features.