
Disambiguation of homographic adjective and adverb forms in Croatian texts

Danijela Merkler*, Daša Berović*, Željko Agić** (* Department of Linguistics, ** Department of Information Sciences; Faculty of Humanities and Social Sciences, University of Zagreb)


Presentation Transcript


  1. Disambiguation of homographic adjective and adverb forms in Croatian texts • Danijela Merkler*, Daša Berović*, Željko Agić** • * Department of Linguistics • ** Department of Information Sciences • Faculty of Humanities and Social Sciences, University of Zagreb • dmerkler@ffzg.hr; dberovic@ffzg.hr; zagic@ffzg.hr • NooJ 2011, Dubrovnik, 2011-06-15

  2. Talk overview • the ACCURAT project • problem and corpora • modeling and applying local grammars • statistical evaluation

  3. ACCURAT • FP7 project • main goal: to develop methods and techniques to overcome one of the central problems of machine translation, the lack of linguistic resources for under-resourced areas of machine translation • key innovation: creation of a methodology and tools to measure, find and use comparable corpora to improve the quality of MT • the ACCURAT project will contribute significantly not only to the theory of MT, but also to corpus linguistics, information extraction and natural language processing in general

  4. Scientific objectives • create comparability metrics: develop the methodology and determine the criteria to measure the comparability of source and target language documents in comparable corpora • establish research methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora • disambiguation: an important process for POS and MSD tagging

  5. Problem • parallel and comparable resources are sparse for Croatian when paired with any of the languages included in the project, especially if the other language is under-resourced as well • high-quality annotation of existing language resources for Croatian is important for: building (factored) language models for MT; using text anchors in comparable resources • MSD-tagging and lemmatization errors were detected in existing Croatian language resources, e.g. Croatian National Corpus v2.5 (automatically lemmatized and MSD-tagged), manually annotated subcorpora, Croatian Dependency Treebank • manual analysis of their annotation reveals regular patterns in these errors

  6. Problem • the neuter nominative singular forms of descriptive adjectives are identical to the forms of the adverbs derived from those adjectives by suffixation • these adverbs are realized in context • in most cases the adverb is derived from an adjective that has an abstract meaning • there are several types of word forms: • forms of adverbs and adjectives that occur with no semantic constraints: razdragano (gleeful), bahato (arrogant), ubrzano (rapidly), uzrujano (upset), umiljato (cuddly) • forms derived from verbs: drhtavo (shaking), laskavo (flattering), šepavo (lame) • forms that have a dual meaning (concrete and abstract): mlako (lukewarm), šugavo (itchy), mračno (darkly), hladno (cold), gorko (bitter) • forms that denote spatial and temporal relations: rano (early), duboko (deeply), plitko (shallow), lijevo (left)

  7. Corpora • Croatia Weekly: • 100 kw newspaper corpus (newspaper published from 1998 to 2000, 118 issues) • it covers different domains: politics, economy and finance, tourism, ecology, culture, art, sports • part of the Croatian side of the Croatian-English Parallel Corpus, manually lemmatized and MSD-tagged using the MULTEXT-East v3 morphosyntactic specifications • 1984: • Orwell's "1984" corpus, manually lemmatized and MSD-tagged using MULTEXT-East v4 • languages: En, Ro, Sl, Cs, Et, Hu, Sr, Bg, Ru, Mk, Hr... • encoded in TEI P4 (XML)

  8. Corpora • imported the corpora into NooJ • used the NooJ XML import feature • kept the MSD feature annotations for adjectives, adverbs, nouns and verbs • converted the annotations for these PoS from the MULTEXT-East format to the NooJ format for lexical resources • modified feature annotations, e.g. the MTE verb types auxiliary and copulative were mapped to PG (auxiliary verb) in NooJ • this preprocessing made it possible to design the rules without using Croatian resources for NooJ, i.e. skipping the NooJ linguistic analysis
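The conversion step on this slide can be pictured with a minimal Python sketch. It is not the authors' tooling: the function name `mte_to_nooj`, the `form,lemma,CAT+features` output layout, and the assumption that the second MSD character encodes verb type (`a` for auxiliary, `c` for copula) are illustrative assumptions. Only the mapping of auxiliary and copulative verbs to `PG`, and the restriction to adjectives, adverbs, nouns and verbs, come from the slide.

```python
# Illustrative only: the category letters follow MULTEXT-East MSDs, but the
# output layout and everything except the PG mapping are assumptions.
KEPT_CATEGORIES = {"A", "R", "N", "V"}   # adjectives, adverbs, nouns, verbs
AUX_LIKE_VERB_TYPES = {"a", "c"}         # assumed MTE codes: auxiliary, copula


def mte_to_nooj(form, lemma, msd):
    """Return a NooJ-style 'form,lemma,CAT+features' string, or None."""
    if not msd or msd[0] not in KEPT_CATEGORIES:
        return None  # annotation is dropped for all other parts of speech
    category = msd[0]
    features = []
    # Slide 8: MTE auxiliary and copulative verbs both become PG in NooJ.
    if category == "V" and len(msd) > 1 and msd[1] in AUX_LIKE_VERB_TYPES:
        features.append("PG")
    return ",".join([form, lemma, "+".join([category] + features)])
```

For example, `mte_to_nooj("je", "biti", "Var3s")` yields a verb entry carrying the `PG` feature, while a conjunction MSD such as `Cc` yields `None` because its annotation is not kept.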

  9. Patterns • we noticed several types of patterns in which adverbs that are homographic with adjectives occur • the patterns are defined by their contextual environment • Vpg + A* + V → Vpg + R* + V • Vpg + A + A* → Vpg + R + A* • A* + V → R* + V • A + A* + N → R + A* + N
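The four patterns can be read as rewrite rules over a sequence of PoS tags. The Python sketch below is one possible reading and not the NooJ grammars themselves. It assumes that every symbol matches exactly one token and that the asterisk is only a marker that can be ignored when matching (the slide does not define it). The names `RULES` and `apply_rules` are hypothetical.

```python
# Each rule is (left-hand side, right-hand side), copied from slide 9.
RULES = [
    (["Vpg", "A*", "V"], ["Vpg", "R*", "V"]),
    (["Vpg", "A", "A*"], ["Vpg", "R", "A*"]),
    (["A*", "V"], ["R*", "V"]),
    (["A", "A*", "N"], ["R", "A*", "N"]),
]


def _plain(symbol):
    """Drop the asterisk marker (assumption: it does not affect matching)."""
    return symbol.rstrip("*")


def apply_rules(tags, rules=RULES):
    """Return a copy of `tags` with A retagged as R wherever a rule matches."""
    out = list(tags)
    for lhs, rhs in rules:
        n = len(lhs)
        for i in range(len(tags) - n + 1):
            # Match against the original tags so rules do not feed each other.
            if all(tags[i + k] == _plain(lhs[k]) for k in range(n)):
                for k in range(n):
                    if lhs[k] != rhs[k]:  # only positions the rule changes
                        out[i + k] = _plain(rhs[k])
    return out
```

Under these assumptions, `apply_rules(["Vpg", "A", "V"])` retags the middle token as `R`, while a plain adjective followed by a noun (`["A", "N"]`) is left unchanged because no pattern covers it.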

  10. Vpg + A* + V

  11. Vpg + A + A*

  12. A* + V

  13. A + A* + N

  14. Statistics 1 • we manually checked the concordances • the errors frequently involved the word sve, so we updated all grammars so that they no longer match sve
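The effect of excluding sve can be sketched in Python for the simplest pattern, A* + V. This is an analogue of the idea, not the upgraded NooJ grammar. The token representation as `(form, tag)` pairs, the names `EXCLUDED_FORMS` and `retag_before_verb`, and the assumption that sve carries the tag `A` in the input are all hypothetical.

```python
# Word forms that the rule must never retag (slide 14 names only "sve").
EXCLUDED_FORMS = {"sve"}


def retag_before_verb(tokens):
    """Retag A as R when the next token is V, skipping excluded forms.

    `tokens` is a list of (form, tag) pairs; a new list is returned.
    """
    out = []
    for i, (form, tag) in enumerate(tokens):
        next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
        if (
            tag == "A"
            and next_tag == "V"
            and form.lower() not in EXCLUDED_FORMS
        ):
            tag = "R"
        out.append((form, tag))
    return out
```

With this guard, a pair such as `("ubrzano", "A")` followed by a verb is retagged as an adverb, whereas `("sve", "A")` in the same position keeps its original tag.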

  15. Example of upgraded grammar

  16. Statistics 2 • the results improved after we applied the new grammars • there is a significant difference between the newspaper corpus and the literary corpus

  17. Future work • the masculine nominative singular forms of relational adjectives are identical to the forms of the adverbs derived from those adjectives by suffixation (junački, pučki, bratski, životinjski) • disambiguation of these forms also depends on the grammatical context in which they occur, so it can be done in a similar way • applying the disambiguation rules to other Croatian language resources

  18. Thank you for your attention. The research within the ACCURAT project leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no. 248347. www.accurat-project.eu
