100 likes | 364 Views
Sentence Classification and Clause Detection for Croatian . Kristina Vučković, Željko Agić, Marko Tadić Department of Information Sciences, Department of Linguistics Faculty of Humanities and Social Sceinces, University of Zagreb {kvuckovi, zagic, marko.tadic}@ffzg.hr FASSBL 7 Conference
E N D
Sentence Classification and Clause Detection for Croatian Kristina Vučković, Željko Agić, Marko Tadić Department of Information Sciences, Department of LinguisticsFaculty of Humanities and Social Sceinces, University of Zagreb {kvuckovi, zagic, marko.tadic}@ffzg.hr FASSBL 7 Conference Dubrovnik, Croatia2010-10-05
Overview • What? • classifying Croatian sentences by structure • detecting independent and dependent clauses • How? • implemented a prototype system in NooJ • linked it with a morphosyntactic tagger • evaluated on a sample from Croatian corpora • Why? • rule-based chunking and shallow parsing
Classification and detection • sentence segmentation is easy when considering sentence boundaries only • here, we: • detect boundaries of clauses in complex sentences • assign type to sentences • sentence classification • purpose: declarative, interrogative, etc. • structure: simple and complex • complex sentences • independent complex, i.e. compound sentences • dependent complex sentences
Classification and detection • independent complex sentences • independent clause connected to the main clause by using a conjunction • type defined by the choice of conjunction • e.g. constituent clause, conjunctions {i, pa, te, ni, niti} • disjunctive, opposite, exclusive, conclusive and explanatory clause • Svi su spavali, jedino sam ja bio budan. (exclusive) • dependent complex sentences • main clause is independent, all the others depend on it and cannot stand alone in a sentence • Predicative, subjective, objective, attributive, appositional and adverbial clause • Ispričat ću tišto mi se dogodilo.(objective)
The system • prototype implemented in NooJ • finite state transducer cascades (local grammars) • Croatian lexical resources • each cascade detects and annotates a different type of clause • built on top of a chunker for Croatian • the top-level grammar • two types of subgraphs: main clauses and independent clauses
The system • Main clause grammar • presence of a VP and possibly any other phrase • independent clauses recognized just by using the conjunctions • implementation of dependent clause detection varies across clause types
Experiment setup • used the CW100 corpus • XCES-encoded to word level • sentence delimited, tokenized, manually lemmatized and MSD-annotated • 200 randomly selected sentences • 100 for the development and 100 for testing • utilized the CroTag tagger • NooJ input format allows external annotation • created three systems • no preprocessing • tagging input sentences with CroTag (~85% accuracy) • using the manually assigned tags from CW100 • recall, precision, F1-measure
Results • scores for the three systems • “perfect” tagging system is the top-performer • benefits of automatic tagging? • distribution of assigned types • main, objective, opposite, adverbial, attribute, ... • misclassifications • attributive and objective most commonly misclassified • data sparseness
Conclusions and future work • the system scores good in terms of F1-measure • open issues • verb coordination • dislocated nominal predicates • attribute classes starting with a PP • complex insertion of dependent clauses • no real benefit from automatic MSD-tagging • future work • resolving the issues • re-evaluation on a larger test set? • integration with a rule-based shallow parser
Thank you for your attention. The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347. www.accurat-project.eu