Analyses on IFA corpus
Louis C.W. Pols
Institute of Phonetic Sciences (IFA)
Amsterdam Center for Language and Communication (ACLC)
Project meeting INTAS 915, May 23-25, 2003, Jyvaskyla
overview
• structure of IFA corpus
• see the three reports by R. van Son and papers in the open literature
• why corrected means?
• how corrected means?
• some results
• conclusions
INTAS 915, Jyvaskyla
structure of IFA corpus
• 4 male and 4 female speakers; 5 hours of speech; 8 speaking styles, among others informal storytelling (I), retelling (R), reading a story (T), and reading sentences (S)
• ~50k words (AIFC, 44 kHz, 16 bit)
• label files with annotation tiers
• phonemic segmentation and labeling (automatically generated, hand corrected; ~200k boundaries; 0.84 word labels/min; 3.3 boundaries/min)
• description levels: phoneme, demi-syllable, syllable, word, sentence, paragraph
• tiers: POS, lemma, lexical frequency, etc.
access to IFA corpus
• use of CGN protocols
• non-speech data in database structure
• relational DB, SQL query language
• basic structure = table of items (individual phoneme occurrences) x attributes (phoneme parent word, duration, position, speaker, etc.)
• WWW front end to simplify access (automatically generating SQL queries; direct links to relevant files)
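The items x attributes layout and the idea of a front end that generates SQL can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely for illustration; the table name `phoneme_items`, the column names, the helper `build_query`, and the four data rows are all hypothetical and are not taken from the actual IFA database.

```python
import sqlite3

# Hypothetical schema: one row per phoneme occurrence (item),
# one column per attribute. Not the real IFA corpus schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE phoneme_items (
           phoneme TEXT, parent_word TEXT, duration_ms REAL,
           position INTEGER, speaker TEXT, style TEXT)"""
)

# Made-up example rows, for illustration only.
rows = [
    ("a", "kat", 82.0, 2, "F1", "I"),
    ("a", "kat", 95.0, 2, "F1", "S"),
    ("e", "bed", 70.0, 2, "M1", "I"),
    ("a", "man", 105.0, 2, "M1", "S"),
]
conn.executemany("INSERT INTO phoneme_items VALUES (?, ?, ?, ?, ?, ?)", rows)


def build_query(group_by):
    """Generate a SQL query for count and mean duration per level of one attribute.

    Column names cannot be passed as bound parameters, so the attribute
    is checked against a whitelist before it is placed in the query text.
    """
    allowed = {"phoneme", "parent_word", "position", "speaker", "style"}
    if group_by not in allowed:
        raise ValueError(f"unknown attribute: {group_by}")
    return (
        f"SELECT {group_by}, COUNT(*), AVG(duration_ms) "
        f"FROM phoneme_items GROUP BY {group_by} ORDER BY {group_by}"
    )


# Count and mean duration per speaking style.
result = conn.execute(build_query("style")).fetchall()
print(result)
```

A query of this kind yields plain per-level averages, which is exactly the quantity that becomes unreliable when the data are unbalanced.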
why corrected means?
• non-ideal design (no fixed numbers of observations for all relevant factors; this precludes the use of e.g. ANOVA)
• confounding (occurrences of factor values are correlated, thus many combinations of values are rare)
• interaction (one factor being modulated by other factors): additive, multiplicative, or ordinal interaction
• factors of interest vs. nuisance factors
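A toy calculation may help show why confounding matters. In the sketch below, lexical stress is the factor of interest and speaking style is the nuisance factor; all cell means and counts are invented for illustration and are not values from the IFA corpus. The data are constructed so that stress adds 20 ms in every style, but stressed vowels occur mostly in style I and unstressed vowels mostly in style S.

```python
# Made-up cell means (ms) and counts, keyed by (stress, style).
# Within each style, stressed vowels are 20 ms longer than unstressed ones.
cell_mean = {("+", "I"): 80.0, ("+", "S"): 110.0,
             ("-", "I"): 60.0, ("-", "S"): 90.0}
cell_count = {("+", "I"): 9, ("+", "S"): 1,
              ("-", "I"): 1, ("-", "S"): 9}


def raw_mean(stress):
    """Plain average over all observations with this stress level."""
    cells = [key for key in cell_mean if key[0] == stress]
    total = sum(cell_mean[key] * cell_count[key] for key in cells)
    return total / sum(cell_count[key] for key in cells)


def balanced_mean(stress):
    """Average of the cell means, giving every style equal weight."""
    cells = [key for key in cell_mean if key[0] == stress]
    return sum(cell_mean[key] for key in cells) / len(cells)


raw_diff = raw_mean("+") - raw_mean("-")
balanced_diff = balanced_mean("+") - balanced_mean("-")
print(raw_diff, balanced_diff)
```

With these numbers the raw means put stressed vowels 4 ms shorter than unstressed ones, while the balanced means recover the 20 ms lengthening that holds within each style. The balanced average is only possible here because no cell is empty; with empty cells a different approach is needed.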
how corrected means?
• incidence matrix from basic data
• rows = combinations of levels on factors of interest; columns = combinations of levels on nuisance factors
• quasi-minimal pairs method
• mean difference per row pair: by comparing (non-empty) pairs of columns
• matrix of differences (fitted with additive model)
• variable sample sizes: use weighting factors
• corrected means
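The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the IFA analyses: the per-column weight n_i * n_j / (n_i + n_j), the iterative least-squares fit, and the anchoring of the result to the grand mean are all choices made for this sketch, and the toy table is invented. Every row is assumed to contain at least one observation.

```python
def corrected_means(means, counts, n_iter=500):
    """Sketch of corrected means from an incidence matrix.

    means[i][k]  : cell mean for row i (factor of interest), column k (nuisance)
    counts[i][k] : number of observations in that cell (0 = empty cell)
    """
    n_rows, n_cols = len(means), len(means[0])

    # Step 1: mean difference per row pair, using only columns
    # in which both rows have data (quasi-minimal pairs).
    diff, weight = {}, {}
    for i in range(n_rows):
        for j in range(n_rows):
            if i == j:
                continue
            num = den = 0.0
            for k in range(n_cols):
                ni, nj = counts[i][k], counts[j][k]
                if ni > 0 and nj > 0:
                    w = ni * nj / (ni + nj)  # assumed weighting factor
                    num += w * (means[i][k] - means[j][k])
                    den += w
            if den > 0:
                diff[i, j] = num / den
                weight[i, j] = den

    # Step 2: fit the additive model diff[i, j] ~ m[i] - m[j] by weighted
    # least squares, here via repeated coordinate-wise updates.
    row_n = [sum(row) for row in counts]
    m = [sum(means[i][k] * counts[i][k] for k in range(n_cols)) / row_n[i]
         for i in range(n_rows)]  # start from the raw row means
    for _ in range(n_iter):
        for i in range(n_rows):
            partners = [j for j in range(n_rows) if (i, j) in diff]
            if partners:
                total_w = sum(weight[i, j] for j in partners)
                m[i] = sum(weight[i, j] * (m[j] + diff[i, j])
                           for j in partners) / total_w

    # Step 3: the fit fixes only differences, so shift all values such that
    # their count-weighted average equals the grand mean (an assumption).
    total_n = sum(row_n)
    grand = sum(means[i][k] * counts[i][k]
                for i in range(n_rows) for k in range(n_cols)) / total_n
    shift = grand - sum(m[i] * row_n[i] for i in range(n_rows)) / total_n
    return [value + shift for value in m]


# Invented, exactly additive toy table: row effects 100, 90, 70 ms and
# column effects 0, +10, -20 ms, with two empty cells (count 0).
means = [[100.0, 110.0, 0.0], [90.0, 100.0, 70.0], [0.0, 80.0, 50.0]]
counts = [[20, 2, 0], [3, 10, 5], [0, 4, 30]]
corrected = corrected_means(means, counts)
print(corrected)
```

On this toy table the raw row means differ by about 36 ms between the second and third rows because of the uneven counts, whereas the fitted values should differ by the 10 ms and 20 ms built into the row effects.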
example: vowel duration (ms), speaking style (I, R, S, T) vs. lexical stress (+, -)
[tables not recovered: common means vs. corrected means; 38061 total counts, 13323 row differences; row difference counts and significance (* = 0.001 significance)]
conclusions
• simple averaging of unbalanced data is dangerous
• free conversational speech data are always unbalanced
• the corrected means method is then a good alternative
• it can be interpreted as a least-RMS-error approximation of ‘balanced’ means from an unbalanced data set