Sentence Classification and Clause Detection for Croatian

Kristina Vučković, Željko Agić, Marko Tadić
Department of Information Sciences, Department of Linguistics
Faculty of Humanities and Social Sciences, University of Zagreb
{kvuckovi, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference, Dubrovnik, Croatia, 2010-10-05

Overview
• What?
  • classifying Croatian sentences by structure
  • detecting independent and dependent clauses
• How?
  • implemented a prototype system in NooJ
  • linked it with a morphosyntactic tagger
  • evaluated on a sample from Croatian corpora
• Why?
  • rule-based chunking and shallow parsing

Classification and detection
• sentence segmentation is easy when considering sentence boundaries only
• here, we:
  • detect boundaries of clauses in complex sentences
  • assign a type to each sentence
• sentence classification
  • purpose: declarative, interrogative, etc.
  • structure: simple and complex
• complex sentences
  • independent complex, i.e. compound sentences
  • dependent complex sentences

Classification and detection
• independent complex sentences
  • independent clause connected to the main clause by a conjunction
  • type defined by the choice of conjunction
  • e.g. constituent clause, conjunctions {i, pa, te, ni, niti}
  • disjunctive, opposite, exclusive, conclusive and explanatory clause
  • Svi su spavali, jedino sam ja bio budan. (exclusive; "Everyone was asleep, only I was awake.")
• dependent complex sentences
  • main clause is independent; all the others depend on it and cannot stand alone in a sentence
  • predicative, subjective, objective, attributive, appositional and adverbial clause
  • Ispričat ću ti što mi se dogodilo. (objective; "I will tell you what happened to me.")
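The idea that the conjunction determines the type of an independent clause can be illustrated with a small lookup table. This is only a minimal Python sketch, not the NooJ grammar from the talk: the "constituent" set and the word "jedino" come from the slide, while the "disjunctive" and "opposite" entries and the function name `classify_independent_clause` are assumptions added for illustration.

```python
from typing import Optional

# Conjunction sets per clause type.
CONJUNCTION_TYPES = {
    "constituent": {"i", "pa", "te", "ni", "niti"},  # listed on the slide
    "exclusive": {"jedino"},                          # taken from the slide's example
    # The two sets below are illustrative assumptions, not taken from the slides.
    "disjunctive": {"ili"},
    "opposite": {"a", "ali", "nego", "no", "već"},
}


def classify_independent_clause(clause: str) -> Optional[str]:
    """Return the clause type suggested by the clause-initial word, or None."""
    words = clause.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(",.;:!?")
    for clause_type, conjunctions in CONJUNCTION_TYPES.items():
        if first in conjunctions:
            return clause_type
    return None


# The second clause of the slide's example starts with "jedino".
print(classify_independent_clause("jedino sam ja bio budan."))
```

A lookup on the first word alone ignores multi-word conjunctions and ambiguous words, which a full grammar would have to handle.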

The system
• prototype implemented in NooJ
  • finite state transducer cascades (local grammars)
  • Croatian lexical resources
• each cascade detects and annotates a different type of clause
• built on top of a chunker for Croatian
• the top-level grammar
  • two types of subgraphs: main clauses and independent clauses

The system
• main clause grammar
  • presence of a VP and possibly any other phrase
• independent clauses recognized just by using the conjunctions
• implementation of dependent clause detection varies across clause types
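The two rules above (a clause needs a VP; a conjunction opens an independent clause) can be approximated over chunker output. The following Python sketch is a simplification under stated assumptions: the chunk labels (`NP`, `VP`, `PP`, `CONJ`), the input format and the function `split_clauses` are hypothetical and do not reproduce the actual NooJ transducers.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (text, label); labels here are hypothetical


def has_vp(chunks: List[Chunk]) -> bool:
    """A segment counts as a clause candidate only if it contains a VP."""
    return any(label == "VP" for _, label in chunks)


def split_clauses(chunks: List[Chunk]) -> List[Dict[str, str]]:
    # Step 1: cut the chunk sequence in front of every conjunction.
    segments: List[List[Chunk]] = [[]]
    for chunk in chunks:
        if chunk[1] == "CONJ":
            segments.append([chunk])
        else:
            segments[-1].append(chunk)

    # Step 2: a segment starts a new clause only if it has a VP and the
    # previous clause is already complete; otherwise it is phrase-level
    # coordination and is merged back into the previous clause.
    clauses: List[List[Chunk]] = []
    for segment in segments:
        if not segment:
            continue
        if not clauses or (has_vp(segment) and has_vp(clauses[-1])):
            clauses.append(list(segment))
        else:
            clauses[-1].extend(segment)

    # Step 3: label clauses by whether they open with a conjunction.
    return [
        {
            "type": "independent" if clause[0][1] == "CONJ" else "main",
            "text": " ".join(text for text, _ in clause),
        }
        for clause in clauses
    ]


example = [("Otišao je", "VP"), ("kući", "NP"), ("i", "CONJ"), ("zaspao je", "VP")]
for clause in split_clauses(example):
    print(clause)
```

Because any conjunction followed by a VP opens a new clause here, coordinated verbs inside a single clause would be split incorrectly.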

Experiment setup
• used the CW100 corpus
  • XCES-encoded to word level
  • sentence-delimited, tokenized, manually lemmatized and MSD-annotated
• 200 randomly selected sentences
  • 100 for development and 100 for testing
• utilized the CroTag tagger
  • NooJ input format allows external annotation
• created three systems
  • no preprocessing
  • tagging input sentences with CroTag (~85% accuracy)
  • using the manually assigned tags from CW100
• recall, precision, F1-measure
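For reference, the three evaluation measures can be computed from gold and predicted clause annotations as sketched below. The slides do not say how a match was counted, so exact matching on `(start, end, type)` triples is an assumption, as are the function name and the toy data.

```python
from typing import Iterable, Tuple

Annotation = Tuple[int, int, str]  # (start token, end token, clause type), assumed format


def precision_recall_f1(
    gold: Iterable[Annotation], predicted: Iterable[Annotation]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over clause annotations."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: one clause is found with the right type, one gets the wrong
# type, and one is missed entirely.
gold = [(0, 3, "main"), (4, 8, "objective"), (9, 12, "attributive")]
predicted = [(0, 3, "main"), (4, 8, "attributive")]
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Under exact matching, a clause with correct boundaries but the wrong type counts as both a false positive and a false negative.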

Results
• scores for the three systems
  • “perfect” tagging system is the top performer
  • benefits of automatic tagging?
• distribution of assigned types
  • main, objective, opposite, adverbial, attributive, ...
• misclassifications
  • attributive and objective most commonly misclassified
  • data sparseness

Conclusions and future work
• the system scores well in terms of F1-measure
• open issues
  • verb coordination
  • dislocated nominal predicates
  • attributive clauses starting with a PP
  • complex insertion of dependent clauses
• no real benefit from automatic MSD-tagging
• future work
  • resolving the open issues
  • re-evaluation on a larger test set?
  • integration with a rule-based shallow parser

Thank you for your attention. The research within the project ACCURAT leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347. www.accurat-project.eu
