
Comparative Structures in Croatian: MWU Approach

Explore the challenges and importance of idioms in the computational processing of Croatian, focusing on comparative structures as a subtype of idiom, using a rule-based computational approach and the NooJ tool.


Presentation Transcript


  1. Comparative Structures in Croatian: MWU Approach • Kristina Kocijan, Sara Librenjak • Department of Information and Communication Sciences, University of Zagreb • krkocijan@ffzg.hr, sara.librenjak@gmail.com • Europhras 2015, Malaga, Spain, 2015-07-01

  2. Language of our work - Croatian • South Slavic language • Highly similar to Bosnian, Serbian and Montenegrin • Latin alphabet • Properties: • Highly inflected (7 cases) • Syntactically flexible (almost any word order possible) • Pronoun dropping • A challenge for computational processing

  3. Computational approach to idioms • Comparative structures as a subtype of idiomatic structures • Two manners of computational language processing • Statistical approach • Rule-based approach • Idioms • Highly specific part of language (i.e. replacing one word changes the whole meaning) • Statistical approach would yield imprecise results • Rule-based approach preferable, especially when dealing with inflected languages

  4. Importance of idioms in computational processing of texts • Present in language, yet often ignored • Difficult to process – described only linguistically • Causing incomplete computational understanding of the language and imprecise translation • Lack of real data about their frequency • Why are they difficult to process? • Because of their multi-word nature • Because of their elusive semantic properties (meaning is not the sum of the words) • Because of their cultural and historical nuances, which make them very difficult to translate without special preparation

  5. Croatian phraseology and comparisons • Well described linguistically (Croatian Dictionary of Idioms with ~2500 entries) • Lacks the systematic approach essential for text processing • Sorted into categories for the purposes of this work • Comparative structures are one of the main categories of idioms • Radi kao pčela (works like a bee, i.e. works very hard) • Puši kao Turčin (smokes like a chimney, lit. like a Turk) • Brz poput strijele (fast as an arrow) • Approximately 540 set comparative phrases in Croatian (Fink-Arsovski)

  6. Comparisons in literature and beyond • Comparative structures (usporedbe or poredbe) are mainly a feature of literary texts and newspapers • Filaković (2008) assumes their presence in works of fiction, based on an analysis of the works of Croatian writer I. B. Mažuranić • Kovačević (2012) reports linguistic creativity in the use of comparative structures in newspaper articles • Mance and Trtanj (2010) note the use of modern slang variants of the comparisons • No statistical data about their real usage in various types of text

  7. Goals of this work • To build a tool for automated processing of comparative idioms in Croatian texts • To be able to recognize them as multi-word units in any type of text • Extract, describe and enumerate the structures • Collect statistical data about their frequency in different styles of texts • Serve as an example for similar work in other languages • Be used as a tool in automated or semi-automated machine translation from Croatian into any language (given additional work)

  8. NooJ – a tool for rule-based automated text processing • NooJ – free-to-use linguistic development environment for various kinds of rule-based automated text and corpora processing • http://nooj4nlp.net/ • Morphological, syntactic and semantic processing with options for translation and transformation of sentences • Ready-made resources for about two dozen languages: • Acadian, Arabic, Armenian, Belarusian, Bulgarian, Catalan, Croatian, English, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Polish, Portuguese, Russian, Serbian, Slovene, Spanish, Turkish, Vietnamese • Great tool for highly inflected languages

  9. Methodology • Listing and categorizing the idioms • Definition and recognition of rules • Construction of training and testing corpora • Construction of grammars for processing texts • Using NooJ as a platform • Testing phase • Calculation of results

  10. Listing and categorizing the idioms • Based on the Croatian Dictionary of Idioms and idioms manually found in a Croatian corpus • For the purposes of the computational approach, we defined five major categories • a) Noun phrase with an attribute or apposition • b) Verbal phrase with a direct object • c) Verbal phrase with an optional direct object which can disrupt the syntactic structure • d) Comparative structure (A/V as N) • e) Fixed phrase which doesn't change in any syntactic environment

  11. Definition and recognition of rules • 312 different comparative constructions in our dictionary • Recognized in any form, tense, case and word order • Divided into 5 subcategories according to syntactic properties • Adjective AS Noun = 89 • Noun AS Preposition = 9 • AS a Noun/Adjective = 49 (AS a Noun = 7; AS a PP fixed phrase = 37; AS a N + PP = 5) • Verb AS Noun = 157 • AS IF Verb = 8
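The subcategory counts on this slide can be checked against the stated total of 312 with simple arithmetic. The following is only a minimal Python sketch: the labels and numbers are copied from the slide, while the variable names and the percentage shares are illustrative additions, not part of the original presentation.

```python
# Counts per subcategory, copied from the slide.
subcategories = {
    "Adjective AS Noun": 89,
    "Noun AS Preposition": 9,
    "AS a Noun/Adjective": 49,
    "Verb AS Noun": 157,
    "AS IF Verb": 8,
}

# Breakdown of the "AS a Noun/Adjective" subcategory, also from the slide.
as_noun_adjective_breakdown = {
    "AS a Noun": 7,
    "AS a PP fixed phrase": 37,
    "AS a N + PP": 5,
}

total = sum(subcategories.values())
breakdown_total = sum(as_noun_adjective_breakdown.values())

# Share of each subcategory in percent (derived here, not given on the slide).
shares = {name: round(100 * n / total, 1) for name, n in subcategories.items()}

print(total, breakdown_total)
print(shares)
```

The five subcategories add up to the 312 constructions, and the three nested counts add up to the 49 entries of "AS a Noun/Adjective"; "Verb AS Noun" alone accounts for roughly half of the dictionary.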

  12. Construction of training and testing corpora • First phase: training • A smaller corpus of sentences containing exclusively the structures in question (comparative structures with the words "kao" or "poput") • Second phase: testing • After the grammars (NooJ files for processing texts) are completed, the results are tested on larger corpora • Corpus 1: random texts in different styles from the Web corpus (2.2 million words) • Corpus 2: literary texts, mostly by Croatian authors (658,000 words)
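Collecting sentences that contain "kao" or "poput" for a training corpus can be approximated with a simple filter. The sketch below is an assumption about how such a pre-selection might look in Python; it is not the authors' procedure and does not use NooJ. The function names, the naive sentence splitter and the sample text are all hypothetical.

```python
import re

# Whole-word, case-insensitive match for the two comparison markers.
MARKER = re.compile(r"\b(kao|poput)\b", re.IGNORECASE)

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def comparison_candidates(text):
    """Return the sentences that contain 'kao' or 'poput' as a whole word."""
    return [s for s in split_sentences(text) if MARKER.search(s)]

# Hypothetical sample text, not taken from the authors' corpora.
sample = "Šutio je kao grob. Vani pada kiša. Brz je poput strijele!"
candidates = comparison_candidates(sample)
print(candidates)
```

A filter like this over-generates, because "kao" also occurs in ordinary, non-idiomatic comparisons. It only narrows the text down to candidate sentences; deciding which of them are set comparative phrases is the job of the grammars.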

  13. Construction of grammars for processing texts • Grammar – a file constructed in the NooJ environment for syntactic processing of texts • Input, output, variables, nested grammars • Concordance with marked texts as an output

  14. Adjective AS Noun • Recognizes: • Lijep kao slika (pretty as a picture) • Pijan kao smuk (drunk as a sponge) • Brz kao zec (fast as a bullet)
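To illustrate what recognizing an idiom "in any form" involves, here is a toy Python sketch of the Adjective AS Noun pattern. It is not a NooJ grammar and not the authors' implementation. The mini-lexicon `ADJ_FORMS`, the set `IDIOMS` and the function `find_adj_as_noun` are hypothetical, and only a few nominative adjective forms and the canonical adjacent word order with "kao" are covered.

```python
import re

# Hypothetical mini-lexicon: surface form -> lemma. A real system would use
# a full morphological dictionary instead of this hand-written table.
ADJ_FORMS = {
    "lijep": "lijep", "lijepa": "lijep", "lijepo": "lijep",
    "pijan": "pijan", "pijana": "pijan",
    "brz": "brz", "brza": "brz",
}

# Idioms from the slide, stored as (adjective lemma, noun) pairs.
IDIOMS = {("lijep", "slika"), ("pijan", "smuk"), ("brz", "zec")}

def find_adj_as_noun(sentence):
    """Return (adjective lemma, noun) pairs for 'ADJ kao NOUN' idioms found."""
    tokens = re.findall(r"\w+", sentence.lower())
    hits = []
    for i in range(len(tokens) - 2):
        adj, marker, noun = tokens[i], tokens[i + 1], tokens[i + 2]
        lemma = ADJ_FORMS.get(adj)
        # Only "kao" is handled: "poput" governs the genitive, so the noun
        # would need its own inflection table as well.
        if lemma and marker == "kao" and (lemma, noun) in IDIOMS:
            hits.append((lemma, noun))
    return hits

print(find_adj_as_noun("Bila je lijepa kao slika."))
```

Mapping surface forms to lemmas before matching is what lets the feminine form "lijepa" match the dictionary entry "lijep kao slika". Handling free word order and all seven cases would need considerably more machinery than this sketch provides.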

  15. Noun AS Preposition • Recognizes: • Mrak kao u rogu (pitch dark) • AS a Noun • Recognizes: • Kao drvena Marija (being stiff, unrelaxed) • Poput guske u magli (without thinking)

  16. Verb AS Noun • Recognizes: • Ići kao po loju (go smoothly, lit. slide as if over fat) • Šutjeti kao grob (be silent as a grave) • AS IF Verb • Recognizes: • Kao da je u zemlju propao (as if the Earth swallowed him) • Kao da je pao s Marsa (clueless, as if he came from Mars)

  17. Example of results Comparative structure

  18. Evaluation

  19. Conclusions about comparison in Croatian • Number of comparative structures varies greatly between types of texts • General texts (web corpus) – 1 per every 10,000 words • Literary texts (books by Croatian authors) – 1 per every 1,000 words • Confirmed the hypothesis that such structures belong mostly to literary style • 10 times more frequent in books and works of fiction • Rare in other styles of writing due to the stylistic marking they bring to the text
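The "10 times more frequent" figure follows directly from the two rates on this slide. A minimal Python sketch of the normalization, assuming a hypothetical helper `per_n_words`; only the two rates come from the slide, and no raw occurrence counts are implied.

```python
def per_n_words(occurrences, corpus_words, n=10_000):
    """Normalize a raw count to occurrences per n words."""
    return occurrences * n / corpus_words

# Rates as stated on the slide, expressed per 10,000 words.
web_rate = per_n_words(1, 10_000)       # general web texts
literary_rate = per_n_words(1, 1_000)   # literary texts

ratio = literary_rate / web_rate
print(web_rate, literary_rate, ratio)
```

The same helper can be applied to actual concordance counts and corpus sizes, which makes corpora of different sizes directly comparable.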

  20. Thank you for your attention. Questions?
