Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application. Kiyotaka Uchimoto* Yasuharu Den† *National Institute of Information and Communications Technology (NICT) †Chiba University
Outline • Background • Dependency Structure in the CSJ • Dependency-structure Annotation • Word-level Dependency-structure Analysis • Towards Construction of Middle Words • Summary and future work
Background (1) • Corpus of Spontaneous Japanese (CSJ) [Maekawa et al., 2000] • The largest spontaneous-speech corpus in the world • Includes transcriptions of speeches as well as audio recordings • One tenth of the CSJ has been manually annotated with morphemes, sentence boundaries, syntactic structures, discourse structures, prosodic information, etc.
Background (2) • Syntactic structure of a sentence • Represented by dependency relationships between bunsetsus • As represented in the Kyoto University text corpus • The syntactic structure inside a bunsetsu is not considered • Example: [nihon gata kokusai kouken ga] (Japanese-style international contribution) [motome rare te iru] (is required)
Dependency Structure in the CSJ (1) • Dependency relationships between bunsetsus • Annotated within “sentences” in the CSJ • Dependency relationships between words • Annotated within bunsetsus • Word segments in the word-level dependency structure: short words • A short word approximates a term found in an ordinary dictionary • A long word represents various compounds • Example: [nihon gata kokusai kouken ga] (Japanese-style international contribution) [motome rare te iru] (is required)
Dependency Structure in the CSJ (2) • Disfluencies characteristic of spontaneous speech • Self-correction • Represented as a dependency between bunsetsus, to which label D is assigned • Example: Yamada (Yamada) | Yamada san wa (Mr. Yamada) | kyoujin na (strong) | nikutai no (body) | mochinushi da to (possessor) | it te mashi ta ne (said), with label D on the self-correction (“Yamada, Mr. Yamada said that he had a strong body.”)
Dependency Structure in the CSJ (3) • Disfluencies characteristic of spontaneous speech • Self-correction • Represented as a dependency between words, to which label D is assigned • Example: kokuritsu (national) | Nihon (Japanese) | go (word) | kokugo (Japanese language) | kenkyuu (research) | jo (institute) | de (case marker), with label D on the self-correction (“At the National Japanese word, Japanese language research institute”)
Dependency-structure Annotation • Manual annotation • 199 speeches for dependency relationships between bunsetsus • 50 speeches for dependency relationships between words • Human annotation using a tool • Initial state: every bunsetsu depends on the next one • Step 1: two annotators examined each dependency and modified it if it was inappropriate • Step 2: a checker examined all dependencies • Annotators referred to the audio recordings as well as the transcriptions
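The initial state and the later corrections can be pictured as a list of head indices that annotators overwrite. The following is a minimal sketch under stated assumptions, not the annotation tool itself: the function names `initial_heads` and `revise` are hypothetical, and the corrected head for "Yamada san wa" is our reading of the English gloss rather than something stated on the slides.

```python
def initial_heads(n_units):
    """Default annotation: unit i depends on unit i + 1; the last unit has no head."""
    if n_units == 0:
        return []
    return list(range(1, n_units)) + [None]


def revise(heads, labels, dependent, new_head, label=None):
    """Apply one annotator correction in place (hypothetical helper)."""
    if new_head is not None and not 0 <= new_head < len(heads):
        raise ValueError("head index out of range")
    if new_head == dependent:
        raise ValueError("a unit cannot depend on itself")
    heads[dependent] = new_head
    if label is not None:
        labels[dependent] = label  # e.g. D, P, or A


bunsetsus = ["Yamada", "Yamada san wa", "kyoujin na",
             "nikutai no", "mochinushi da to", "it te mashi ta ne"]
heads = initial_heads(len(bunsetsus))
labels = {}

# Self-correction: "Yamada" -> "Yamada san wa", marked with label D.
revise(heads, labels, 0, 1, label="D")
# Assumed correction: "Yamada san wa" depends on the final predicate.
revise(heads, labels, 1, 5)

for i, b in enumerate(bunsetsus):
    target = bunsetsus[heads[i]] if heads[i] is not None else "(none)"
    print(f"{b} -> {target} {labels.get(i, '')}".rstrip())
```

Starting from "every unit depends on the next" means annotators only touch the dependencies that are wrong, which is the workflow Step 1 describes.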
Annotation tool (bunsetsu level) • Each line represents a bunsetsu • Dependencies are modified by mouse drag-and-drop • Self-corrections, coordination, and appositives can be annotated with labels D, P, and A by right-clicking the mouse
Annotation tool (word level) • Each line represents a word • Dependencies are modified by mouse drag-and-drop
Word-level Dependency-structure Analysis (1) • Finding a modifiee for each word in a bunsetsu • Each dependency goes from left to right • The rightmost word is assumed to have no modifiee • Existing methods were applied • E.g., the shift-reduce method [Nivre and Scholz, 2004] • Figure: a stack and a queue of input words for nihon/noun gata/Suffix kokusai/noun kouken/noun ga/ppp …
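A stack-based parse under these two constraints (left-to-right dependencies, no modifiee for the rightmost word) can be sketched as follows. This is a simplified illustration, not Nivre and Scholz's transition system or the authors' implementation: the function `parse_bunsetsu` and the hand-written `TOY_PAIRS` table are hypothetical, the table stands in for a trained classifier, and the dependencies it encodes are our assumed reading of the example.

```python
def parse_bunsetsu(words, depends_on):
    """Assign each word a head to its right; the rightmost word gets None.

    words      : list of (surface, pos) tuples for one bunsetsu
    depends_on : callable(dependent, candidate_head) -> bool
    """
    heads = [None] * len(words)
    stack = []  # indices of words still waiting for a head
    for j, word in enumerate(words):
        is_last = j == len(words) - 1
        # Attach waiting words to the current word while the decision
        # function agrees; the last word takes everything left over.
        while stack and (is_last or depends_on(words[stack[-1]], word)):
            heads[stack.pop()] = j
        stack.append(j)
    return heads


# Assumed dependencies for the example; a stand-in for a learned model.
TOY_PAIRS = {("nihon", "gata"), ("gata", "kouken"),
             ("kokusai", "kouken"), ("kouken", "ga")}


def toy_decision(dependent, head):
    return (dependent[0], head[0]) in TOY_PAIRS


words = [("nihon", "noun"), ("gata", "Suffix"), ("kokusai", "noun"),
         ("kouken", "noun"), ("ga", "ppp")]
heads = parse_bunsetsu(words, toy_decision)
for (surface, _), h in zip(words, heads):
    print(surface, "->", words[h][0] if h is not None else "(none)")
```

Because a word is only compared with the top of the stack, the resulting structure never contains crossing dependencies.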
Word-level Dependency-structure Analysis (2) • Experiments • 50 speeches in the CSJ • Word-level dependencies (total: 33,429) • The rightmost dependency in each bunsetsu was not counted • 10-fold cross-validation • Features: words and their POS categories
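The 10-fold setup over 50 speeches can be sketched as below. The slides do not say how the folds were formed, so splitting by speech, the round-robin assignment, and the names `k_fold_splits` and `speech_ids` are all assumptions made for illustration.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; each item is held out exactly once."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Placeholder identifiers, not real CSJ speech IDs.
speech_ids = [f"speech_{n:02d}" for n in range(50)]
splits = list(k_fold_splits(speech_ids, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With 50 speeches and 10 folds, each round trains on 45 speeches and tests on the remaining 5.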
Application of Word-level Dependency-structure • In text-to-speech synthesis • A basic unit is required that indicates the appropriate pronunciation and accent • Example phenomenon: “rendaku” (Weijer et al., 2005)
Application of Word-level Dependency-structure • A sound change or an accent change is blocked by right-branching tree structures (Kubozono, 1995)
Construction of Middle Words • Construction rule • Combine adjacent short words that have a dependency relationship, under the condition that a middle word is never longer than a long word • Morphological information • If a middle word corresponds to a long word: taken from the long word • Otherwise: taken from the rightmost short word in the middle word • Example • kihon/shuuha/suu/pataan • Noun Noun Suffix Noun • (basic frequency pattern) • kihon | shuuha suu pataan • Noun Noun
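The construction rule can be sketched as a single left-to-right pass over the short words of one long word. This is one possible reading of the rule, not the authors' procedure: the function `build_middle_words` is hypothetical, and the head indices used for the example are assumed, chosen because they reproduce the segmentation shown above.

```python
def build_middle_words(short_words, heads, long_word_pos):
    """Merge adjacent short words linked by a dependency.

    short_words   : list of (surface, pos) tuples inside ONE long word, so a
                    middle word can never exceed the long word
    heads         : heads[i] is the index of the word that word i depends on
    long_word_pos : POS of the enclosing long word
    """
    if not short_words:
        return []
    groups = [[0]]
    for i in range(1, len(short_words)):
        if heads[i - 1] == i:      # left neighbour depends on this word
            groups[-1].append(i)
        else:
            groups.append([i])

    middle_words = []
    for group in groups:
        surface = " ".join(short_words[i][0] for i in group)
        if len(group) == len(short_words):
            pos = long_word_pos                # coincides with the long word
        else:
            pos = short_words[group[-1]][1]    # rightmost short word
        middle_words.append((surface, pos))
    return middle_words


short_words = [("kihon", "Noun"), ("shuuha", "Noun"),
               ("suu", "Suffix"), ("pataan", "Noun")]
# Assumed heads: kihon -> suu, shuuha -> suu, suu -> pataan.
heads = [2, 2, 3, None]
result = build_middle_words(short_words, heads, long_word_pos="Noun")
print(" | ".join(surface for surface, _ in result))
```

Under these assumed heads, "kihon" stays separate because its head is not its immediate right neighbour, while "shuuha", "suu", and "pataan" chain into one middle word.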
Middle Words and Accent Phrases • Relationships between middle words and accent phrases (BI=2, 2+p, 2+b, 2+bp, 3) in the CSJ • Examples from the slide: kaku|zokusei, gen|jiten, zen|shikiichi, emuten|chuuouchi/heikatsuka, yuudo/saidaika|kijun, nihonjin/gakushuusha, rittai/chuushajou • Slide annotation: “should be reduced”
Summary and Future Work • Dependency structure of a large spontaneous Japanese speech corpus, the Corpus of Spontaneous Japanese (CSJ) • Application of the word-level dependency structure • Constructing new basic units, middle words • Middle words: useful as constituents of accent phrases • Annotation of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) • Supported by the priority area program ‘Japanese Corpus’, a five-year (2006-2010) project