Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application



  1. Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application Kiyotaka Uchimoto* Yasuharu Den† *National Institute of Information and Communications Technology (NICT) † Chiba University

  2. Outline • Background • Dependency Structure in the CSJ • Dependency-structure Annotation • Word-level Dependency-structure Analysis • Towards Construction of Middle Words • Summary and future work

  3. Background (1) • Corpus of Spontaneous Japanese (CSJ) [Maekawa et al., 2000] • The largest spontaneous-speech corpus in the world • Includes transcriptions of speeches as well as audio recordings • One tenth of the CSJ has been manually annotated with morphemes, sentence boundaries, syntactic structures, discourse structures, prosodic information, etc.

  4. Background (2) • Syntactic structure of a sentence • Represented by dependency relationships between bunsetsus • As represented in the Kyoto University text corpus • Syntactic structure within a bunsetsu is not considered • Example: nihon gata kokusai kouken ga (Japanese-style international contribution) motome rare te iru (is required)

  5. Dependency Structure in the CSJ (1) • Dependency relationships between bunsetsus • Annotated within “sentences” in the CSJ • Dependency relationships between words • Annotated within bunsetsus • Word segments in the word-level dependency structure: short words • A short word approximates a term found in an ordinary dictionary • A long word represents various compounds • Example: nihon gata kokusai kouken ga (Japanese-style international contribution) motome rare te iru (is required)

  6. Dependency Structure in the CSJ (2) • Disfluencies characteristic of spontaneous speech • Self-correction • Represented as a dependency between bunsetsus, to which label D is assigned • Example (one dependency labeled D): Yamada (Yamada) Yamada san wa (Mr. Yamada) kyoujin na (strong) nikutai no (body) mochinushi da to (possessor) it te mashi ta ne (said) • Translation: “Yamada, Mr. Yamada said that he had a strong body.”

  7. Dependency Structure in the CSJ (3) • Disfluencies characteristic of spontaneous speech • Self-correction • Represented as a dependency between words, to which label D is assigned • Example (one dependency labeled D): kokuritsu (national) Nihon (Japanese) go (word) kokugo (Japanese language) kenkyuu (research) jo (institute) de (case marker) • Translation: “At National Japanese word, Japanese language research institute”

  8. Dependency-structure Annotation • Manual annotation • 199 speeches for dependency relationships between bunsetsus • 50 speeches for dependency relationships between words • Human annotation using a tool • Initial state: every bunsetsu depends on the next one • Step 1: two annotators examined each dependency and modified it if it was inappropriate • Step 2: a checker examined all dependencies • Annotators referred to audio recordings as well as transcriptions
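
The initial state described on this slide, in which every bunsetsu depends on the next one, can be sketched as follows. This is only an illustration: the function name `initial_heads` and the use of `None` for the final bunsetsu are assumptions, not part of the annotation tool.

```python
from typing import List, Optional


def initial_heads(n_bunsetsus: int) -> List[Optional[int]]:
    """Default heads before manual correction (hypothetical helper).

    Bunsetsu i depends on bunsetsu i + 1; the last bunsetsu has no head.
    """
    return [i + 1 if i + 1 < n_bunsetsus else None for i in range(n_bunsetsus)]


# A sentence with four bunsetsus starts out as a simple chain.
heads = initial_heads(4)
print(heads)  # [1, 2, 3, None]
```

Annotators would then change individual entries of such a list wherever the default link is inappropriate.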

  9. • Each line represents a bunsetsu • Dependencies are modified by mouse drag-and-drop • Self-corrections, coordination, and appositives can be annotated with labels D, P, and A by right-clicking the mouse

  10. • Each line represents a word • Dependencies are modified by mouse drag-and-drop

  11. Word-level Dependency-structure Analysis (1) • Finding a modifiee for each word in a bunsetsu • Each dependency goes from left to right • The rightmost word is assumed to have no modifiee • Existing methods were applied • E.g., the shift-reduce method [Nivre and Scholz, 2004] • Example input: nihon/noun gata/suffix kokusai/noun kouken/noun ga/ppp … • (Figure: a stack and the remaining input words)
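
A shift-reduce analysis under the constraints listed on this slide (dependencies go left to right, the rightmost word has no modifiee) might look like the sketch below. It is a simplified head-final variant, not the exact transition system of Nivre and Scholz. The names `parse_bunsetsu` and `depends_on` are hypothetical, the decision function stands in for a trained classifier, and the gold heads used in the demonstration are assumed for illustration because the slide does not list them.

```python
from typing import Callable, List, Optional


def parse_bunsetsu(n_words: int,
                   depends_on: Callable[[int, int], bool]) -> List[Optional[int]]:
    """Assign a head (modifiee) index to each word of one bunsetsu.

    depends_on(i, j) decides whether word i modifies word j (i < j).
    In a real system this would be a classifier over words and POS tags.
    """
    heads: List[Optional[int]] = [None] * n_words
    stack: List[int] = []  # words that do not have a head yet
    for j in range(n_words):
        is_last = j == n_words - 1
        # Reduce: attach words on top of the stack to word j.
        # Every word still waiting is attached to the rightmost word.
        while stack and (is_last or depends_on(stack[-1], j)):
            heads[stack.pop()] = j
        stack.append(j)  # Shift: word j now waits for its own head.
    return heads


words = ["nihon", "gata", "kokusai", "kouken", "ga"]
# Assumed gold heads (illustration only): nihon->gata, gata->kouken,
# kokusai->kouken, kouken->ga; "ga" is rightmost and has no head.
gold = [1, 3, 3, 4, None]
result = parse_bunsetsu(len(words), lambda i, j: gold[i] == j)
print(result)  # [1, 3, 3, 4, None]
```

Stopping the inner loop at the first word that does not attach keeps the resulting structure projective, which is why a single stack is enough here.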

  12. Word-level Dependency-structure Analysis (2) • Experiments • 50 speeches in the CSJ • Word-level dependencies (total: 33,429) • The rightmost dependency in each bunsetsu was not counted • 10-fold cross validation • Features: words and their POS categories
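
The way such an accuracy figure could be computed is sketched below. It rests on one assumption about the slide's wording: "rightmost dependency" is read here as the dependency of the second-to-last word, which is trivially the last word when every dependency goes left to right. The function name `word_dependency_accuracy` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence

Heads = Sequence[Optional[int]]


def word_dependency_accuracy(gold: List[Heads], pred: List[Heads]) -> float:
    """Fraction of correctly predicted heads over all counted words.

    Each element of gold/pred holds the head indices for one bunsetsu.
    The last word (no head) and the second-to-last word (assumed to be
    the excluded "rightmost dependency") are skipped.
    """
    correct = 0
    total = 0
    for g, p in zip(gold, pred):
        for i in range(len(g) - 2):
            total += 1
            if g[i] == p[i]:
                correct += 1
    return correct / total if total else 0.0


gold = [[1, 3, 3, 4, None]]
pred = [[1, 2, 3, 4, None]]  # one wrong head, for the word at index 1
accuracy = word_dependency_accuracy(gold, pred)
print(round(accuracy, 3))  # 0.667
```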

  13. Application of Word-level Dependency Structure • In text-to-speech synthesis • A basic unit is required to indicate appropriate pronunciation and accent • “rendaku” (Weijer et al., 2005)

  14. Application of Word-level Dependency Structure • A sound change or an accent change is blocked by right-branching tree structures (Kubozono, 1995)

  15. Construction of Middle Words • Construction rule • Combine adjacent short words that have a dependency relationship, under the condition that a middle word is not longer than a long word • Morphological information • If a middle word corresponds to a long word: extracted from the long word • Otherwise: extracted from the rightmost short word in the middle word • Example • Short words: kihon/shuuha/suu/pataan (Noun Noun Suffix Noun; basic frequency pattern) • Middle words: kihon | shuuha suu pataan (Noun Noun)
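
The construction rule on this slide can be sketched roughly as follows. The function name `build_middle_words`, the `long_word_ids` representation, and the head indices in the demonstration are assumptions made for illustration; the slide gives only the short words and the resulting segmentation. The sketch covers the combination rule only, not the extraction of morphological information.

```python
from typing import List, Optional, Sequence


def build_middle_words(short_words: Sequence[str],
                       heads: Sequence[Optional[int]],
                       long_word_ids: Sequence[int]) -> List[List[str]]:
    """Group short words into middle words (hypothetical helper).

    Adjacent short words i-1 and i are combined when word i-1 depends on
    word i and both belong to the same long word, so a middle word never
    extends beyond a long word.
    """
    middle: List[List[str]] = []
    for i, word in enumerate(short_words):
        linked = i > 0 and heads[i - 1] == i
        same_long = i > 0 and long_word_ids[i - 1] == long_word_ids[i]
        if linked and same_long:
            middle[-1].append(word)   # extend the current middle word
        else:
            middle.append([word])     # start a new middle word
    return middle


short_words = ["kihon", "shuuha", "suu", "pataan"]
# Assumed heads (illustration only): kihon->suu, shuuha->suu, suu->pataan.
heads = [2, 2, 3, None]
long_word_ids = [0, 0, 0, 0]  # all four short words form one long word
middle_words = build_middle_words(short_words, heads, long_word_ids)
print(" | ".join(" ".join(m) for m in middle_words))
# kihon | shuuha suu pataan
```

Under these assumed heads, kihon is not combined with shuuha because its head is not the adjacent word, which reproduces the segmentation shown in the slide's example.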

  16. Middle Words and Accent Phrases • Relationships between middle words and accent phrases (BI=2, 2+p, 2+b, 2+bp, 3) in the CSJ • Examples from the figure: kaku|zokusei, gen|jiten, zen|shikiichi, emuten|chuuouchi/heikatsuka, yuudo/saidaika|kijun, nihonjin/gakushuusha, rittai/chuushajou • Figure note: “should be reduced”

  17. Summary and Future Work • Dependency structure of a large spontaneous Japanese speech corpus, the Corpus of Spontaneous Japanese (CSJ) • Application of word-level dependency structure • Constructing new basic units, middle words • Middle words: useful as constituents of accent phrases • Annotation of the Balanced Corpus of Contemporary Written Japanese (BCCWJ) • Supported by the priority area program ‘Japanese Corpus’, a five-year (2006-2010) project
