Probabilistic Modeling of Tone Perception

Deepti Ramadoss and Colin Wilson
Johns Hopkins University
Email: ramadoss@cogsci.jhu.edu
APCAM, Nov 19, 2009

Outline
• Introduction
• Linguistic tone
• Thai
• Moren and Zsiga (2006)’s representation of tone
• Probabilistic model of tone perception
• Results
• Future directions

Introduction
Tone languages:
• Languages that use pitch contrastively, e.g. Thai
[Figure: examples of Thai tones from a single speaker in citation form (from Zsiga and Nitisaroj 2008)]

Proposals of tone perception:
• Gauthier et al. (2006)
  • Fine-grained representation of tone
  • String of f0 values
  • Point-by-point matching of stimulus with ‘category’
We’re investigating a different theory of perception:
• Moren and Zsiga (2006)
  • Informed by linguistic theory

Linguistic Tone:
• Tones: instantiations of certain tone targets
• Tone target types: High (H), Mid (M) and Low (L)
• Instantiating these targets yields two kinds of tones:
  • Level tones: a single target type, giving a ‘level’ trajectory
  • Contour tones: multiple dissimilar target types, giving a contour trajectory with a change of direction

Moren and Zsiga (2006) • Phonological representation of Thai tones • Thai tones are carried on single syllables • Level tones: • single target right aligned to the syllable (i.e. second mora) High tone: _ H Mid tone: _ M or _ _ (since Mid is default) Low tone: _ L • Contour tones: • Two targets, first aligned around the mid point, (i.e. the first mora) and the second aligned to the right edge of the syllable(i.e. the second mora) Falling tone: HL Rising tone: LH • Production of tones = producing these targets • Perception of tones = perceiving these targets (Zsiga and Nitisaroj 2008)

The problem of perception
• Instances of tone targets naturally do not always reach their exact ‘canonical’ values
• There is a significant amount of variation, even within a single speaker’s utterances
Our aim is to build a model that can categorize stimuli correctly despite this variation

Perception model based on the Moren and Zsiga (2006) representation of Thai tones
• Two departures from the theory:
  • An initial point is added to the original representation
  • The representation for level tones also has a middle target
• Total of 3 points: an initial point, plus middle and final target points
• Values for each target point were drawn from the data produced by a single speaker (in Zsiga and Nitisaroj 2008)
• The mean and variance of each target were computed, creating 3 target distributions
• Time was normalized
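The estimation step described above could be sketched as follows. This is only an illustration: the f0 values are invented rather than taken from the Zsiga and Nitisaroj (2008) data, the names `tokens` and `fit_targets` are hypothetical, and the use of the population variance is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, pvariance

# Hypothetical time-normalized tokens: (initial, middle, final) f0 in Hz.
# Illustrative numbers only, not the Zsiga and Nitisaroj (2008) recordings.
tokens = {
    "F": [(230.0, 250.0, 180.0), (225.0, 255.0, 175.0), (235.0, 245.0, 185.0)],
    "R": [(200.0, 180.0, 240.0), (195.0, 175.0, 245.0), (205.0, 185.0, 235.0)],
}

def fit_targets(category_tokens):
    """Return [(mean, variance), ...] for the initial, middle and final points."""
    columns = zip(*category_tokens)  # regroup the values by target point
    # Population variance is an assumption; the slides do not name the estimator.
    return [(mean(col), pvariance(col)) for col in columns]

# One list of three (mean, variance) pairs per tone category.
params = {label: fit_targets(toks) for label, toks in tokens.items()}
```

Each category ends up with three (mean, variance) pairs, one per point, which is the form the likelihood computation needs.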

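Probabilistic Model
• The probability of a stimulus x given category k is computed by multiplying the probabilities of its three points under that category’s target distributions:
$$p(x \mid \text{category } k) = p(x_1 \mid \mu_{k1}, \sigma_{k1}) \cdot p(x_2 \mid \mu_{k2}, \sigma_{k2}) \cdot p(x_3 \mid \mu_{k3}, \sigma_{k3})$$
where 1, 2, 3 index the target points, and $\mu_{ki}$, $\sigma_{ki}$ are the mean and standard deviation of category k’s distribution at point i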
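A minimal sketch of this product, assuming each target distribution is Gaussian (the slides give only a mean and a variance per target, so the Gaussian form is an assumption). The parameter values and the names `gaussian_pdf`, `likelihood` and `falling` are hypothetical.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(stimulus, targets):
    """p(x | category k): product over the three points, treated as independent."""
    p = 1.0
    for x_i, (mu_i, sigma_i) in zip(stimulus, targets):
        p *= gaussian_pdf(x_i, mu_i, sigma_i)
    return p

# Hypothetical (mean, standard deviation) pairs in Hz for one category.
falling = [(230.0, 10.0), (250.0, 10.0), (180.0, 10.0)]

at_means = likelihood((230.0, 250.0, 180.0), falling)   # stimulus on the means
off_means = likelihood((230.0, 250.0, 200.0), falling)  # final point 2 SD away
```

A stimulus whose final point lies two standard deviations from the category mean receives a likelihood smaller by a factor of exp(-2) than one that sits exactly on the means.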

Probabilistic Model
From Bayes’ Theorem:
posterior probability ∝ likelihood × prior
$$p(\text{category } k \mid x) \propto p(x \mid \text{category } k) \, p(\text{category } k)$$
• p(x|category k) is
  • the conditional probability of stimulus x given category k
  • the effect of the observed data, or the likelihood function
  • relativized (normalized) to the probability of the stimulus given all categories (McMurray et al. 2005, Clayards et al. 2006)
For Thai tones, p(x|category k) is normalized by:
p(x|category F) + p(x|category R) + p(x|category H) + p(x|category M) + p(x|category L)
Hence, with priors assumed to be equal, the probability of each category given stimulus x is:
$$p(\text{category } k \mid x) = \frac{p(x \mid \text{category } k)}{\sum_{j \in \{F, R, H, M, L\}} p(x \mid \text{category } j)}$$
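The normalization over the five Thai tone categories might look like the sketch below. It assumes Gaussian target distributions and works in log space to avoid numerical underflow, which is an implementation choice not mentioned in the slides. All parameter values in `CATEGORIES` are invented for illustration.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of a normal distribution (the Gaussian form is an assumption)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def posterior(stimulus, categories, priors=None):
    """Return p(category k | x) for every category via Bayes' theorem."""
    if priors is None:  # equal priors, as assumed on the slide
        priors = {k: 1.0 / len(categories) for k in categories}
    # log p(x | k) + log p(k), summing log densities over the three points
    scores = {
        k: math.log(priors[k])
        + sum(log_gaussian(x, mu, sd) for x, (mu, sd) in zip(stimulus, targets))
        for k, targets in categories.items()
    }
    top = max(scores.values())  # subtract the maximum before exponentiating
    unnorm = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}

# Hypothetical (mean, standard deviation) pairs in Hz; not the authors' values.
CATEGORIES = {
    "F": [(230.0, 10.0), (250.0, 10.0), (180.0, 10.0)],
    "R": [(200.0, 10.0), (180.0, 10.0), (240.0, 10.0)],
    "H": [(210.0, 10.0), (215.0, 10.0), (250.0, 10.0)],
    "M": [(200.0, 10.0), (195.0, 10.0), (190.0, 10.0)],
    "L": [(190.0, 10.0), (170.0, 10.0), (160.0, 10.0)],
}

post = posterior((228.0, 248.0, 182.0), CATEGORIES)
best = max(post, key=post.get)  # category with the highest posterior
```

Passing a `priors` dictionary in place of the default would let unequal category frequencies enter the same computation.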

Fit to the learning data
• The model was first tested on each speech stimulus used to create the distributions

The model was next tested on synthetic stimuli used in behavioral experiments in Zsiga and Nitisaroj (2008).

Categorization based on slopes
• The stimuli are dynamic; hence, rate of change of pitch is likely to be an important cue (Liberman, p.c.)
• Features: slopes of the first and second ‘halves’ of the tones
• When tested on the training data, the model performed reasonably well, classifying 90.7% correctly
• When tested on the steeply sloped synthesized stimuli, the model matched human responses 63.6% of the time
• When tested on the straight-line synthesized stimuli, the model matched human responses only 7.7% of the time
• Performance may improve with encoding of the starting pitch value and an increase in the number of slopes
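One way the two slope features and the agreement score could be computed is sketched here. The slides do not say how the slopes were measured, so taking them between the initial, middle and final points on a normalized time axis is an assumption, and `half_slopes` and `percent_agreement` are hypothetical names.

```python
def half_slopes(initial, middle, final, duration=1.0):
    """Slopes (Hz per unit of normalized time) of the first and second halves."""
    half = duration / 2.0
    return ((middle - initial) / half, (final - middle) / half)

def percent_agreement(model_labels, human_labels):
    """Percentage of stimuli on which the model's label matches the human response."""
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return 100.0 * matches / len(human_labels)

# Hypothetical rising-like token: dips first, then climbs.
slopes = half_slopes(200.0, 180.0, 240.0)

# Hypothetical model and listener labels for four stimuli.
agreement = percent_agreement(["F", "R", "H", "M"], ["F", "R", "L", "M"])
```

Note that the two slopes discard the absolute pitch height, which is consistent with the suggestion that encoding the starting pitch value may help.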

Conclusion
• Our model instantiates the phonological representation proposed by Moren and Zsiga (2006)
  • H and L tone targets per syllable
• It uses Bayesian inference, treating the targets as probability distributions
  • This allows the model to classify stimuli using a predictive, generative method
  • It is possible to extend this method to partial stimuli (to simulate processing as the signal unfolds over time)
• Information provided by the slopes alone appears insufficient to characterize human categorization
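The extension to partial stimuli could, for instance, update the posterior after each point is heard. The sketch below is one possible reading of that idea, not the authors' implementation; it assumes Gaussian targets, equal priors, and invented parameter values for just two categories.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution (the Gaussian form is an assumption)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def incremental_posteriors(stimulus, categories):
    """Yield the posterior over categories after each successive point is heard."""
    running = {k: 1.0 for k in categories}  # likelihood so far; equal priors cancel
    for i, x in enumerate(stimulus):
        for k, targets in categories.items():
            mu, sd = targets[i]
            running[k] *= gaussian_pdf(x, mu, sd)
        total = sum(running.values())
        yield {k: v / total for k, v in running.items()}

# Hypothetical (mean, standard deviation) pairs in Hz for two categories.
CATEGORIES = {
    "F": [(230.0, 10.0), (250.0, 10.0), (180.0, 10.0)],
    "R": [(200.0, 10.0), (180.0, 10.0), (240.0, 10.0)],
}

# The initial point (215 Hz) is equidistant from both categories' initial means.
steps = list(incremental_posteriors((215.0, 250.0, 180.0), CATEGORIES))
```

With these values the model is undecided after the initial point and commits to the falling tone once the middle point arrives.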

Future Directions
• Include time information
• Independent distributions vs. co-varying distributions across targets: performance of the model dropped
• Compare the variances of the independent model with the covariance matrices of the other model: artifact of parameters
• With the model that uses slope information:
  • include information about the initial point
  • increase the number of “points”; increase the number of slopes
• Store tokens as exemplars; categorize stimuli based on some measure of similarity (Exemplar theory: Pierrehumbert 2003)
• Extensive comparison with behavioural data
• Compare the probabilities with which humans and the models identify artificial stimuli
• Consider priors; some tone categories are more frequent than others
• Compare performance with non-straight-line stimuli presented to human subjects
• Note: evaluating the model’s performance on the learning data assumes that a human perceiver can distinguish the stimuli reliably; this assumption needs to be rigorously tested
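The exemplar-based direction could be prototyped along these lines. The slides leave the similarity measure open, so the exponentially decaying similarity over Euclidean distance used here is an assumed choice, common in exemplar models, and the sensitivity parameter `c` and the stored tokens are hypothetical.

```python
import math

def similarity(a, b, c=0.05):
    """Exponentially decaying similarity over Euclidean distance (c is a free parameter)."""
    return math.exp(-c * math.dist(a, b))

def exemplar_probabilities(stimulus, exemplars, c=0.05):
    """p(category | stimulus) from summed similarity to each category's stored tokens."""
    totals = {
        label: sum(similarity(stimulus, token, c) for token in stored)
        for label, stored in exemplars.items()
    }
    norm = sum(totals.values())
    return {label: t / norm for label, t in totals.items()}

# Hypothetical stored tokens: (initial, middle, final) f0 in Hz.
EXEMPLARS = {
    "F": [(230.0, 250.0, 180.0), (225.0, 255.0, 175.0)],
    "R": [(200.0, 180.0, 240.0), (195.0, 175.0, 245.0)],
}

probs = exemplar_probabilities((228.0, 248.0, 182.0), EXEMPLARS)
```

Unlike the distribution-based model, this version keeps every token and needs no mean or variance estimates, at the cost of one extra free parameter.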

Thank you
Thanks also to: Elizabeth Zsiga, Rattima Nitisaroj, Luigi Burzio
