The Syntax-Morphology Interface and Natural Language Processing
Veronika Vincze
University of Szeged, Hungary
vinczev@inf.u-szeged.hu
Thematic Training Course on Processing Morphologically Rich Languages, 11-15 April 2011
Outline
• Introduction
• Syntax vs. morphology from a linguistic viewpoint
• Morphological coding systems in Hungarian
• Morphosyntactic information in Hungarian corpora
• Language-specific morphosyntactic problems
• Effects on IE, NER and MT
Syntax vs. morphology
• Typological differences among languages
  • Agglutinative languages: morphology plays the stronger role (a lot of information is carried by morphemes)
  • Isolating languages: syntax plays the stronger role (fewer morphemes, more constructions)
• Focus on Hungarian (agglutinative) and English (fusional/isolating)
Basic Hungarian syntax
• A lot of information is encoded in morphemes
• No fixed word order
• Information structure is reflected in word order (theme-rheme, old-new)
Péter szereti Marit.
Peter love-3SgObj Mary-ACC
‘Peter loves Mary.’
Péter Marit szereti. ‘It is Mary who Peter loves.’
Marit szereti Péter. ‘It is Mary who Peter loves.’
Marit Péter szereti. ‘It is Peter who loves Mary.’
Szereti Péter Marit. ‘Peter LOVES Mary (and does not hate her).’
Szereti Marit Péter. ‘Peter LOVES Mary (and does not hate her).’
Morphosyntactic features of Hungarian
• Nominal declension (nouns, adjectives, numerals)
• Verbal conjugation
• Several hundred word forms for each lemma
• Grammatical relations are encoded primarily by morphemes -> morpho + syntactic
Nominal suffixes
A stem can be extended by:
• Derivational suffixes
• Plural
• Possessive
• Case suffixes
hat-ás-a-i-nak ‘to its effects’
stem-DERIV.SUFF-POSS-POSS.PL-DAT
egész-ség-ed-re ‘cheers’
stem-DERIV.SUFF-POSS.Sg2-SUB
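The two glossed word forms above pair each morpheme with exactly one label. A minimal Python sketch of that pairing is given below; the function name `align_gloss` is purely illustrative and does not come from any tool mentioned in the course.

```python
def align_gloss(segmented, gloss):
    """Pair each hyphen-separated morpheme with its gloss label.

    Both arguments use '-' as the morpheme boundary, as on the slide.
    """
    morphemes = segmented.split("-")
    labels = gloss.split("-")
    if len(morphemes) != len(labels):
        raise ValueError("morpheme and label counts differ")
    return list(zip(morphemes, labels))


# The two examples from the slide.
effects = align_gloss("hat-ás-a-i-nak", "stem-DERIV.SUFF-POSS-POSS.PL-DAT")
cheers = align_gloss("egész-ség-ed-re", "stem-DERIV.SUFF-POSS.Sg2-SUB")
```

In both examples the last pair carries the case label (DAT, SUB), which matches the observation that case suffixes close the word form.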
Case suffixes in Hungarian
• ~20 cases ("rare" cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül…))
• Always at the right end of the word form
• Grammatical relations are encoded:
  • Arguments of the verb
  • Adjuncts (temporal and locative adverbials)
…and in English
Pisti szerdánként edzésre jár.
Steve Wednesday-DIST-TEMP training-SUB go-3Sg
‘Each Wednesday Steve goes to training.’
szerdánként – each Wednesday
edzésre – to training
Pisti bort iszik.
Steve wine-ACC drink-3Sg
‘Steve is drinking wine.’
Pisti (NOM) – Steve – subject
bort – wine – object
Possessive in Hungarian
A fiú kutyája
the boy dog-POSS
‘the boy’s dog’
A(z ő) kutyája
the (he) dog-POSS
‘his dog’
• Possessor in nominative
• Possessed with a possessive marker
A fiúnak a kutyája
the boy-DAT the dog-POSS
• Possessor in dative
• Possessed with a possessive marker
…and in English
The boy’s dog
His dog
• Possessor with a possessive marker (pronoun)
• Possessed with no marker
The dog of the boy
• Possessive relation is marked by a preposition
Hungarian vs. English - nouns
• Number of word forms: several hundred (HU) vs. 2-3 (EN)
• Means to express grammatical relations:
  • Suffixes (HU)
  • Preposition, fixed position (word order), suffix, determiner (EN)
• Methods for morphological parsing are very different for Hungarian and English
Verbal suffixes
A stem can be extended by:
• Derivational suffixes
• Mood markers
• Tense markers
• Person/number suffixes
• Objective markers
vág-at-ná-k
cut-CAUS-COND-3PlObj
‘they would have it cut’
Mood and tense in Hungarian
• Mood:
  • Indicative: default (not marked)
  • Conditional: suffixes (present), analytic form (past)
  • Imperative: suffixes
• Tense:
  • Present: default (not marked)
  • Past: suffixes
  • Future: analytic (auxiliary fog)
…and in English
• Mood:
  • Indicative: default (not marked)
  • Conditional: past tense forms + analytic forms (auxiliary would)
  • Imperative: auxiliaries + grammatical structure
• Tense:
  • Present: default (not marked)
  • Past: suffix / irregular forms (suppletives or ablaut (vowel change))
  • Future: analytic (auxiliary will)
Person & Number
Hungarian: suffixes
Fut-ok, Fut-sz, Fut, Fut-unk, Fut-tok, Fut-nak
3Sg is the default (not marked!)
English: 3Sg + pronouns / obligatory subject
I run, You run, He runs, We run, You run, They run
3Sg marked!
Derivational suffixes in Hungarian
Possibility/permission: fut-hat-ok run-MOD-1Sg ‘I may run’
Reflexive: mos-akod-unk wash-REFL-1Pl ‘we wash ourselves’
Frequentative: üt-öget-sz hit-FREQ-2Sg ‘you hit something repeatedly’
Causative: csinál-tat-nak do-CAUS-3Pl ‘they have something done’
…and in English
• Possibility/permission: auxiliaries
• Reflexive: pronominal objects
• Frequentative: adverb
• Causative: construction
Hungarian vs. English - verbs
• Number of word forms: several hundred (HU) vs. 4-5 (EN)
• Means to express grammatical relations:
  • Suffixes + auxiliaries (HU)
  • Auxiliaries + reflexive pronouns + constructions (EN)
• A lot of syntactic information is encoded in Hungarian morphemes
Morphosyntactic coding systems
• Language independent (?)
• Language dependent
• (Dis)advantages:
  • comparability
  • considering language-specific features
  • complexity
• Different information is necessary for each language
Hungarian coding systems
• HUMOR
  • recall Thursday Session 1
  • in the Hungarian National Corpus
• MSD
  • In Szeged Treebank
  • Parser and POS-tagger available at: http://www.inf.u-szeged.hu/rgai/magyarlanc
• KR
  • No database
  • Parser and POS-tagger available at: http://mokk.bme.hu/resources/hunmorph/index_html http://code.google.com/p/hunpos/
MSD
• Morphosyntactic Description
• International coding system:
  • English
  • Romanian
  • Slovenian
  • Czech
  • Bulgarian
  • Estonian
  • Hungarian
MSD - 2
• Positional codes
• A given position encodes a given type of information
  • Position 0: part-of-speech
  • Position 1: (sub)type within POS
  • Further positions: other grammatical information (person, number, case, etc.)
• Irrelevant positions are marked with a hyphen (-)
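The positional scheme can be illustrated with a small Python sketch. It relies only on what the slide states (position 0 is the part of speech, position 1 the subtype, a hyphen marks an irrelevant position); the function name `decode_msd` and the generic `position_N` keys are assumptions made for illustration, since the attribute assigned to each further position depends on the language and the part of speech. The two codes are taken from the unknown-word examples later in the slides.

```python
def decode_msd(code):
    """Split a positional MSD code into its filled positions.

    Only positions 0 and 1 are named; later positions keep a generic
    'position_N' key because their meaning depends on the POS.
    """
    if not code:
        raise ValueError("empty MSD code")
    features = {"pos": code[0]}
    if len(code) > 1 and code[1] != "-":
        features["subtype"] = code[1]
    for index, value in enumerate(code[2:], start=2):
        if value != "-":  # a hyphen marks an irrelevant position
            features[f"position_{index}"] = value
    return features


noun = decode_msd("Nc-sn---p1")
verb = decode_msd("Vmip1p---n")
```

Keeping the hyphens in the code itself is what makes the format positional: dropping them would shift every later value to a different slot.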
KR
• Created for Hungarian
• Hierarchical attribute-value matrices
• Default values (3Sg, singular…)
• Derivational information is encoded
• Compounds are also segmented
MSD vs. KR
• Differences between the two systems:
  • derivation
  • compounds
• Harmonization efforts aim at a morphological parser (magyarlanc) whose output is fully consistent with the Szeged Treebank (Farkas et al. 2010)
Nouns in MSD
Verbs in MSD
Morphosyntactically annotated Hungarian corpora
• Hungarian National Corpus
  • 100-million-word balanced reference corpus of present-day Hungarian
  • Word forms automatically annotated for stem, part of speech and inflectional information
  • http://corpus.nytud.hu/mnsz/index_eng.html
• Szeged Treebank
  • 1 million words, 82K sentences
  • Manually annotated for lemma and POS tags
  • Constituency and dependency trees
  • http://www.inf.u-szeged.hu/rgai/nlp
Szeged Treebank
• Manually annotated treebank for Hungarian
• Covers various linguistic styles
  • literature, newspapers, laws, student essays, computer books, etc.
  • multilingual connection: Orwell’s 1984; Win2000 manual in Hungarian
• Available free of charge for research
• Developed by
  • University of Szeged, HLT group
  • MorphoLogic Ltd.
  • Academy of Sciences, Research Institute for Linguistics
Szeged Treebank 2.
• TEI XML format
• Manually annotated
  • sentence splitting & word segmentation
  • morphological analysis
  • PTB-style syntactic structure
  • verb argument structure
• Converted / extended to Dependency Grammar format manually
Szeged Treebank 3.
• Several versions
  • Constituency and dependency versions
  • Old MSD codes
  • New (harmonized) MSD codes
• (Dependency) parser under development
• Being extended with folklore texts
Dependency vs. constituency
• Each node corresponds to a word -> no virtual nodes (CP, I’…) in dependency trees
• Constituency grammars are said to be well suited to languages with fixed word order
• Syntactic relations are determined
  • by the position in the tree (constituency grammar)
  • by dependency relations, i.e. labeled edges (dependency grammar)
Constituency trees in SzT2.0
• Based on generative syntax (É. Kiss et al. 1999)
• Syntactic features of Hungarian are also considered (i.e. not hardcore Chomskyan trees)
• Verb-argument relations are encoded by labels
• Very detailed information: a different grammatical role for each case suffix
• Semantic information is also included (temporal and locative adverbials)
Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS
‘Aggie received all of her relatives the day before yesterday.’
Dependency trees in Szeged Dependency Treebank
• Based on SzT2.0
• Automatic conversion and manual correction
• Word forms are the nodes of the tree
• Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT
• Semantic information kept
• Sentences without 3Sg copula are distinctively marked
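Because the nodes are word forms and the relations are labeled edges, a dependency analysis does not change when the words are reordered. The Python sketch below illustrates this with the Péter szereti Marit example from the word-order slide. The token format (form, head index, label) and the `ROOT` label are assumptions made for illustration, not the treebank's actual file format; SUBJ and OBJ are the relation names listed above.

```python
def labeled_arcs(tokens):
    """Turn (form, head_index, label) tokens into a set of
    (head_form, dependent_form, label) arcs; the root has head None."""
    arcs = set()
    for form, head, label in tokens:
        head_form = tokens[head][0] if head is not None else None
        arcs.add((head_form, form, label))
    return arcs


# Three of the six word orders, each annotated on its own.
# Sentence-initial capitalisation is ignored for simplicity.
svo = [("Péter", 1, "SUBJ"), ("szereti", None, "ROOT"), ("Marit", 1, "OBJ")]
ovs = [("Marit", 1, "OBJ"), ("szereti", None, "ROOT"), ("Péter", 1, "SUBJ")]
vso = [("szereti", None, "ROOT"), ("Péter", 0, "SUBJ"), ("Marit", 0, "OBJ")]

same_analysis = labeled_arcs(svo) == labeled_arcs(ovs) == labeled_arcs(vso)
```

The head indices differ between the three orders, but the set of labeled arcs is identical; the differences in information structure (focus, emphasis) are not captured by these arcs.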
Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.
Virtual nodes
• No overt copula in present tense 3Sg
• Only the subject and the predicative noun/adjective are manifest
• No syntactic structure in SzT (grammatical roles are not marked)
• Virtual nodes in SzDT
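The idea of a virtual node can be sketched in Python as follows. This is only an illustration under stated assumptions: the placeholder name `<virtual-copula>`, the function `attach_nominal_sentence` and the arc format are hypothetical, and the example sentence István juhász. (‘Stephen is a shepherd.’) is constructed here rather than taken from the treebank. SUBJ and PRED are relation names used elsewhere in these slides.

```python
VIRTUAL = "<virtual-copula>"  # hypothetical placeholder, not the treebank's label


def attach_nominal_sentence(tokens, subj, pred):
    """Add a virtual root to a copula-less sentence.

    tokens: list of word forms; subj / pred: indices of the subject and
    of the predicative noun or adjective.
    Returns (nodes, arcs), where arcs are (head, dependent, label) indices.
    """
    nodes = list(tokens) + [VIRTUAL]
    root = len(tokens)  # index of the virtual node just appended
    arcs = [(root, subj, "SUBJ"), (root, pred, "PRED")]
    return nodes, arcs


nodes, arcs = attach_nominal_sentence(["István", "juhász"], subj=0, pred=1)
```

With the virtual node in place, the sentence has the same shape as one with an overt verb, so both the subject and the predicate receive a grammatical role.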
I like to go to school because it is good to be at school though not always.
Szeged Treebank vs. Szeged Dependency Treebank
• Labeled relations in both cases -> the contrast is not so sharp
• Virtual nodes in SzDT -> grammatical structure is marked for every sentence (IE, MT)
• No word order constraints in SzDT
• Word forms are marked
• Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))
Language-specific morphosyntactic problems
• Morphology vs. syntax:
  • Pseudo-subjects
  • Pseudo-objects
  • Pseudo-datives
• Morphological analysis of unknown words
• Lemmatization of named entities
Pseudo-subjects
• A noun in the nominative that is not the subject of the sentence -> special attention is required when parsing
• Possessor:
  a kisfiú labdája
  the boy ball-3SgPOSS
  ‘the boy’s ball’
• Predicative noun:
  István juhász maradt.
  Stephen shepherd remain-PAST
  ‘Stephen remained a shepherd.’
• Object:
  A kutyám kergeti a macska.
  the dog-POSS chase-3SgObj the cat
  ‘The cat is chasing my dog.’ (garden path sentence)
  A fiam szereti a lányod.
  the son-1SgPOSS love-3SgObj the daughter-2SgPOSS
  ‘My son loves your daughter’ or ‘Your daughter loves my son’
Solutions
• Possessor:
  • SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)
  • SzDT: ATT relation
• Predicative noun: PRED relation
  • Virtual node in SzDT
• Object: OBJ relation
  • Sometimes contextual information is needed even for humans…
Pseudo-objects
Adverbials with an apparently accusative ending:
Futottam egy jót.
run-PAST-1Sg a good-ACC
‘I have had a good run.’
Nagyot aludtam.
big-ACC sleep-PAST-1Sg
‘I have slept a lot.’
The verbs are intransitive -> the accusative form cannot be an object -> MODE relation
Pseudo-datives
Not all (semantic) subjects are in the nominative:
• Dative subject:
  Sándornak kell elrendeznie az ügyeket.
  Alexander-DAT must arrange-INF-3Sg the issue-PL
  ‘Alexander has to arrange the issues.’
• DAT in both corpora
• Certain auxiliaries with dative subjects (exceptions)
• Dative-nominative parallelism in the possessive as well
Unknown words
Unknown words can be:
• Compounds
• Named entities
• Derivations
Examples: fémkapunk, félmillió, csokinyúl, NATO-hoz
Methods for analysis (Zsibrita et al. 2010):
• Segmentation into two or more analyzable parts
• Expert rules to filter impossible combinations (*V+N)
• Analysis of the last part goes to the whole word
• Substitution for hyphenated words (pre-defined patterns for each morphological class)
félmillió
fél+millió Mc-snl
Expert rules:
NUM + NUM
* non-NUM + NUM
fémkapunk
fém+kap+unk Vmip1p---n
fém+kapu+nk Nc-sn---p1
Expert rules:
N + N
N-nonNOM + V
* N-NOM + V
csokinyúl
csoki+nyúl Vmip3s---n Nc-sn
cso+kinyúl (?) Vmip3s---n
Expert rules:
N + N
N-nonNOM + V
* N-NOM + V
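The three worked examples follow the same procedure: split the unknown word into known parts, discard combinations that the expert rules forbid, and give the whole word the analysis of the last part. The Python sketch below is a toy reconstruction under stated assumptions, not the implementation of Zsibrita et al. (2010): the lexicon, the coarse class names and the function names are hypothetical, only two-part splits are tried, and the MSD codes are copied from the slides.

```python
from collections import namedtuple

# cls: coarse class used by the rules; msd: code given to the whole word.
Analysis = namedtuple("Analysis", ["cls", "msd"])

# Toy lexicon (hypothetical). Non-final parts need no MSD code here.
LEXICON = {
    "fél": [Analysis("NUM", "")],
    "millió": [Analysis("NUM", "Mc-snl")],
    "fém": [Analysis("N-NOM", "")],
    "kapunk": [Analysis("V", "Vmip1p---n"), Analysis("N-NOM", "Nc-sn---p1")],
    "csoki": [Analysis("N-NOM", "")],
    "nyúl": [Analysis("V", "Vmip3s---n"), Analysis("N-NOM", "Nc-sn")],
}

NOUN = {"N-NOM", "N-nonNOM"}


def allowed(first, last):
    """Expert rules from the slides, keyed on the class of the last part."""
    if last == "NUM":
        return first == "NUM"       # NUM + NUM, * non-NUM + NUM
    if last in NOUN:
        return first in NOUN        # N + N
    if last == "V":
        return first == "N-nonNOM"  # N-nonNOM + V, * N-NOM + V
    return False


def analyze_unknown(word):
    """Return (segmentation, MSD) pairs that survive the rules."""
    results = []
    for cut in range(1, len(word)):
        left, right = word[:cut], word[cut:]
        for first in LEXICON.get(left, []):
            for last in LEXICON.get(right, []):
                if allowed(first.cls, last.cls):
                    # The analysis of the last part goes to the whole word.
                    results.append((left + "+" + right, last.msd))
    return results
```

With this toy lexicon, the verbal readings of fémkapunk and csokinyúl are filtered out by the * N-NOM + V rule, leaving only the nominal analyses.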