Sample text http://wt.jrc.it/lt/Acquis/. Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene. This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser

Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
{violeta.seretan, eric.wehrli, luka.nerima, gabriela.soare}@unige.ch

[Figure: POS-tagging output and parsing output for the sample text; labels: subject, predicate, direct object]

Extending Fips to Romanian: two main tasks

• Lexicon construction
  • list of headwords (DEX, 1998)
  • morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
  • manual and semi-automatic insertion
  • manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, …)
  • Current status:
    • simple entries: 60K lexemes / 380K words (10K proper nouns)
    • complex entries: multi-word expressions (compounds and collocations), e.g. de jur împrejurul “around”, problemă – a se pune “problem – to arise”
• Grammar implementation
  • Specifications (Soare, 2005)
  • Customisation of FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
  • Similarities and differences. Examples:
    • clitic system
    • wh-fronting
  • Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
  • Current status: about 100 rules specified; nearly half implemented and tested

Romanian language

• Vocabulary
  • Latin origin (fundamental vocabulary)
  • Slavic origin
  • Neologisms: French, Italian, …
  • Loanwords: Turkish, Greek, Hungarian, Albanian, ...
• Morphology
  • Case system inherited from Latin
    • nominative-accusative, genitive-dative, vocative
  • Three grammatical genders
    • masculine, feminine, neuter
  • Rich inflection of determiners, nouns, adjectives, and verbs
    • e.g., about 35 forms for a verb
  • The definite article is enclitic, i.e., suffixed to nouns and adjectives:
    • casă/house – casa/house-the
    • mare/big – marea/big-the
• Orthography
  • phonemic; Latin alphabet (since 1859)
  • Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ
• Syntax
  • VSO language, relatively free word order

[Figure: Europe - Romance languages]

Fips: a multilingual parsing architecture (Wehrli, 2007)

• Underlying theory
  • Generative Grammar (Chomsky, 1995)
  • Similarities:
    • Simpler Syntax (Culicover and Jackendoff, 2005)
    • Lexical Functional Grammar (Bresnan, 2001)
• Output
  • Rich sentence representation:
    • constituent structure
    • predicate-argument table
    • co-indexation chains
    • intra-sentential pronoun resolution
• Implementation
  • Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
  • Language-independent core + language-specific implementation
  • Component Pascal, OOP paradigm, BlackBox IDE
  • Supported languages: French, English, German, Spanish, Italian, Greek; others in progress

[Figure: Sample parse tree produced by Fips]

FipsRomanian: Sample results

[Figures: screen captures]

Preliminary results

• Parsing experiment
  • data: journalistic texts, 1.05M words
  • average sentence length: 26.9 tokens
  • 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
  • average length of partial parses: 5.3 tokens
  • unknown words: 6.5% (of which 39.2% proper nouns)
  • satisfactory lexical coverage
  • grammatical coverage needs to be improved (work in progress!)
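The unknown-word figures in the parsing experiment can be unpacked with a little arithmetic. The sketch below is illustrative only: it assumes that the 6.5% rate applies to the whole 1.05M-word corpus and that "1.05M" can be treated as exactly 1,050,000 (the poster gives only the rounded figure). The function and variable names are hypothetical and not part of Fips.

```python
# Figures reported for the FipsRomanian parsing experiment.
TOTAL_WORDS = 1_050_000     # "1.05M words" (rounded in the source; assumed exact here)
UNKNOWN_RATE = 0.065        # unknown words: 6.5%
PROPER_NOUN_SHARE = 0.392   # share of the unknown words that are proper nouns


def unknown_word_breakdown(total_words, unknown_rate, proper_share):
    """Split the overall unknown-word rate into proper nouns and other words."""
    proper_rate = unknown_rate * proper_share   # fraction of ALL words
    other_rate = unknown_rate - proper_rate     # fraction of ALL words
    return {
        "unknown_words": round(total_words * unknown_rate),
        "proper_noun_rate": proper_rate,
        "other_rate": other_rate,
    }


breakdown = unknown_word_breakdown(TOTAL_WORDS, UNKNOWN_RATE, PROPER_NOUN_SHARE)
print(f"unknown words (approx.): {breakdown['unknown_words']:,}")
print(f"unknown proper nouns: {breakdown['proper_noun_rate']:.1%} of all words")
print(f"other unknown words: {breakdown['other_rate']:.1%} of all words")
```

Under these assumptions, unknown proper nouns account for roughly 2.5% of all words and the remaining unknown words for roughly 4%, which is the part of the gap that lexicon extension (rather than named-entity handling) would have to close.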
• Task-based evaluation
  • Collocation extraction from parsed data (Seretan, 2008)
  • Collocations are half idioms (idioms of encoding, but not of decoding)
  • Used by the parser and by the in-house rule-based machine translation system
  • Precision for top 2000 results: 30.3%
  • (Precision for French data: 65.9%, top 500 results)

[Screen captures: Fips interface; Lexicon interface; Sample collocations extracted]

Related work & Useful resources

• Data-driven dependency parser for Romanian based on the MaltParser; learns dependencies from manual annotations (Călăcean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length only 9 words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing webservices, RACAI – Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR - Consortium for the Romanian Language: Resources & Tools, research groups from Iaşi, Bucharest and Chişinău. http://consilr.info.uaic.ro/

References

• Bresnan, J. 2001. Lexical-Functional Syntax. Blackwell, Oxford.
• Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
• Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65–76, Groningen, the Netherlands.
• DEX. 1998. Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
• Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
• Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
• Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.
Faculté des Lettres, Département de Linguistique