Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian. Krešimir Šojat, Željko Agić, Marko Tadić. Department of Linguistics, Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb {ksojat, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference, Dubrovnik, Croatia, 2010-10-05
Overview • What? • extraction and semi-automatic construction of verb valency frames • How? • rule-based extraction procedure run on the Croatian dependency treebank • manual assignment of tectogrammatical functors • inference of rules for assigning functors to unseen text • Why? • creation of treebank-based verb valency lexicon • enhancement and enrichment of existing resources
Valency frames • valency frame extraction means detecting all environments of a particular verb that are attested in the treebank • this approach aims at fast construction of valency frames • extraction is automatic; no frame elements are added manually by human annotators • the automatically acquired verb valency lexicon can serve as a basis for enriching and enhancing manually constructed resources, whether existing or built from scratch
The treebank • Croatian Dependency Treebank (HOBS) • follows the guidelines of the Prague Dependency Treebank • taken from the Croatia Weekly 100,000-word (100 kw) sub-corpus of the Croatian National Corpus (HNK) • XCES-encoded up to the word level • sentence-delimited, tokenized, manually lemmatized and MSD-tagged • serves as the morphological layer of the treebank • annotated on the syntactic layer • approximately 2,700 sentences, 67,000 tokens • manually assigned syntactic functions • ca. 1,300 sentences double-checked and used in this experiment
Extraction algorithm • the algorithm extracts verb valency frame instances • for each verb in the treebank sample, it descends • one level down the dependency tree to retrieve subjects (Sb), objects (Obj), adverbials (Adv) and nominal predicates (Pnom) • two levels down to retrieve the same kinds of tokens when they are introduced by subordinating conjunctions (AuxC) or prepositions (AuxP)
Extraction algorithm • algorithm illustration:
dogovorila (dogovoriti Pred) [Unija Ncfsn Sb] [mjere Ncfpa Obj] [već Rt Adv] [kako Css AuxC]
Extraction algorithm • the first version retrieved predicates only; it was then extended to retrieve every verb in the treebank sample, regardless of its analytical function or position in the dependency tree • the extension is meant to raise recall while maintaining precision, since the simple set of descending rules is left unchanged • i.e. to retrieve as many verbs as possible given the limited size of the treebank sample used in the experiment
Extraction algorithm • the verb “imati” (Vmn) is annotated as object (Obj)
Extraction algorithm • thus, the number of frames extracted from each sentence corresponds to the number of verbs in it: • one frame for the main clause, which captures the whole syntactic structure of the sentence • one frame for each verb in a dependent clause • example:
naglasio (naglasiti Vmps-sma Pred) [Mikuška Np-sn Sb] [kako->imati Css AuxC->Obj]
imati (imati Vmn Obj) [stanovništvo Ncnsn Sb] [korist Ncfsa Obj] [od->projekta Spsg->Ncmsg AuxP->Adv] [kroz->ekoturizam Spsa->Ncmsa AuxP->Adv]
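The descending rules can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the function names and the slot representation as (form, MSD, analytical function) tuples are assumptions made for this example, and the toy tree is rebuilt by hand from the two frames shown above. Verbs are recognised here by an MSD starting with "V", which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """Hypothetical treebank node: word form, MSD tag, analytical function."""
    form: str
    msd: str
    afun: str
    lemma: str = ""
    children: List["Node"] = field(default_factory=list)

TARGETS = {"Sb", "Obj", "Adv", "Pnom"}   # retrieved one level down
LINKERS = {"AuxC", "AuxP"}               # conjunctions/prepositions: look two levels down

def extract_frame(verb):
    """Collect the frame slots of one verb using the two descending rules."""
    slots = []
    for child in verb.children:
        if child.afun in TARGETS:
            slots.append((child.form, child.msd, child.afun))
        elif child.afun in LINKERS:
            for grandchild in child.children:
                if grandchild.afun in TARGETS:
                    slots.append((f"{child.form}->{grandchild.form}",
                                  f"{child.msd}->{grandchild.msd}",
                                  f"{child.afun}->{grandchild.afun}"))
    return slots

def iter_nodes(root):
    """Depth-first, left-to-right traversal of the dependency tree."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))

def extract_all(root):
    """One frame per verb, whatever its analytical function or depth."""
    return [(n.lemma, extract_frame(n)) for n in iter_nodes(root)
            if n.msd.startswith("V")]

# Toy tree reconstructed from the example frames (lemmas given for verbs only).
imati = Node("imati", "Vmn", "Obj", "imati", [
    Node("stanovništvo", "Ncnsn", "Sb"),
    Node("korist", "Ncfsa", "Obj"),
    Node("od", "Spsg", "AuxP", children=[Node("projekta", "Ncmsg", "Adv")]),
    Node("kroz", "Spsa", "AuxP", children=[Node("ekoturizam", "Ncmsa", "Adv")]),
])
root = Node("naglasio", "Vmps-sma", "Pred", "naglasiti", [
    Node("Mikuška", "Np-sn", "Sb"),
    Node("kako", "Css", "AuxC", children=[imati]),
])

frames = extract_all(root)
for lemma, slots in frames:
    print(lemma, slots)
```

The sketch always joins both MSD tags for two-level slots, so its rendering of the conjunction slot differs slightly from the slide, which shows only `Css` there.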
Functor assignment • to annotate verbal frames we used a set of 5 argument functors and 32 free-modification functors: • Argument functors: ACT, PAT, ADDR, ORIG, EFF • Temporal functors: TWHEN, TFHL, TFRWH, THL, THO, TOWH, TPAR, TSIN, TTILL • Locative and directional functors: DIR1, DIR2, DIR3, LOC • Functors for causal relations: AIM, CAUS, CNCS, COND, INTT • Functors for expressing manner: ACMP, CPR, CRIT, DIFF, EXT, MANN, MEANS, REG, RESL, RESTR • Functors for specific modifications: BEN, CONTRD, HER, SUBS • 936 frame instances were manually annotated for 424 different verbs
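The functor inventory above could be kept as a small lookup table, for instance to check annotated frames against it. The sketch below is only one possible encoding; the class labels and the `split_frame` helper are hypothetical names introduced for this example.

```python
# Functor inventory as listed on the slide, grouped by class.
FUNCTORS = {
    "argument": ["ACT", "PAT", "ADDR", "ORIG", "EFF"],
    "temporal": ["TWHEN", "TFHL", "TFRWH", "THL", "THO", "TOWH", "TPAR", "TSIN", "TTILL"],
    "locative": ["DIR1", "DIR2", "DIR3", "LOC"],
    "causal": ["AIM", "CAUS", "CNCS", "COND", "INTT"],
    "manner": ["ACMP", "CPR", "CRIT", "DIFF", "EXT", "MANN", "MEANS", "REG", "RESL", "RESTR"],
    "specific": ["BEN", "CONTRD", "HER", "SUBS"],
}

# Reverse index: functor -> class label.
CLASS_OF = {f: cls for cls, members in FUNCTORS.items() for f in members}

def split_frame(functors):
    """Split a frame into (arguments, free modifications); reject unknown labels."""
    unknown = [f for f in functors if f not in CLASS_OF]
    if unknown:
        raise ValueError(f"unknown functors: {unknown}")
    arguments = tuple(f for f in functors if CLASS_OF[f] == "argument")
    modifications = tuple(f for f in functors if CLASS_OF[f] != "argument")
    return arguments, modifications

n_free = sum(len(m) for cls, m in FUNCTORS.items() if cls != "argument")
print(len(FUNCTORS["argument"]), n_free)
print(split_frame(["ACT", "PAT", "LOC"]))
```

Counting the entries reproduces the 5 argument functors and 32 free-modification functors named on the slide.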
Results • valency frame frequency across verb lemmas
Results • frequency of verb valency frames, i.e. n-tuples of tectogrammatical functors
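Both frequency views can be computed with simple counters once each annotated frame is stored as a (verb lemma, functor tuple) pair. The sketch below uses invented placeholder data (`verb_a`, `verb_b`, `verb_c` and their functor tuples are not treebank results), and it keeps the functor order as annotated; whether the original counts normalise the order is not stated in the slides.

```python
from collections import Counter, defaultdict

# Invented placeholder annotations: (verb lemma, n-tuple of functors).
annotated = [
    ("verb_a", ("ACT", "PAT")),
    ("verb_a", ("ACT", "PAT", "LOC")),
    ("verb_b", ("ACT", "PAT")),
    ("verb_b", ("ACT", "PAT")),
    ("verb_c", ("ACT", "ADDR", "PAT")),
]

# Frequency of each frame type (n-tuple of functors) over all instances.
frame_counts = Counter(frame for _, frame in annotated)

# Frame frequency per verb lemma.
frames_per_lemma = defaultdict(Counter)
for lemma, frame in annotated:
    frames_per_lemma[lemma][frame] += 1

# Number of distinct frames attested for each lemma.
distinct_frames = {lemma: len(c) for lemma, c in frames_per_lemma.items()}

print(frame_counts.most_common())
print(distinct_frames)
```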
Results • frames annotated with MSD, analytical functions and tectogrammatical functors
Results • distribution of (MSD, analytical function) pairs across tectogrammatical functors • serves as a basis for defining rules that assign functors from MSD and analytical function
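One way such a distribution could be turned into assignment rules is a most-frequent-functor baseline: for each (MSD, analytical function) pair, propose the functor it was annotated with most often, provided that functor is dominant enough. This is an assumption made for illustration, since the slides do not say how the rules are derived. The observations, the `min_share` threshold and the function names below are all invented.

```python
from collections import Counter, defaultdict

# Invented observations: ((MSD, analytical function), functor).
observations = [
    (("Ncfsn", "Sb"), "ACT"),
    (("Ncfsn", "Sb"), "ACT"),
    (("Ncfsn", "Sb"), "PAT"),
    (("Ncmsn", "Sb"), "ACT"),
    (("Ncfsa", "Obj"), "PAT"),
    (("Rt", "Adv"), "TWHEN"),
]

by_functor = defaultdict(Counter)  # functor -> counts of (MSD, afun) pairs
by_pair = defaultdict(Counter)     # (MSD, afun) -> counts of functors
for pair, functor in observations:
    by_functor[functor][pair] += 1
    by_pair[pair][functor] += 1

def infer_rules(by_pair, min_share=0.6):
    """Keep a rule only if the top functor covers at least min_share of the pair."""
    rules = {}
    for pair, counts in by_pair.items():
        functor, n = counts.most_common(1)[0]
        if n / sum(counts.values()) >= min_share:
            rules[pair] = functor
    return rules

def assign(msd, afun, rules):
    """Return the functor proposed for a token, or None if no rule applies."""
    return rules.get((msd, afun))

rules = infer_rules(by_pair)
print(rules)
print(assign("Ncfsa", "Obj", rules), assign("Vmn", "Obj", rules))
```

Raising `min_share` trades coverage for reliability: ambiguous pairs drop out of the rule set and their tokens stay unassigned.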
Conclusions • in this experiment we have designed and implemented one possible approach: • to the semi-automatic extraction of a valency frame lexicon for Croatian verbs • to the refinement of existing lexicons by using the Croatian Dependency Treebank as an underlying resource • we have automatically extracted 2930 verb valency frame instances and manually annotated 936 of them, which yields two results: • the distribution of valency frames for each of the encountered verbs • the distribution of analytical functions and morphosyntactic tags for each of the tectogrammatical functors
Future work • the first result enables the enrichment of existing valency lexicons, such as CROVALLEX • the second result enables the implementation of a rule-based system for automatic assignment of tectogrammatical functors to morphosyntactically tagged and dependency-parsed unseen text • this procedure for automatic detection of valency frames will also be used in several other projects dealing with factored SMT (e.g. ACCURAT) • regarding dependency parsing of Croatian with the Croatian Dependency Treebank, we shall pursue several research directions in order to increase overall parsing accuracy
Thank you for your attention. The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347. www.accurat-project.eu