Semantic Argument Classification and Semantic Categorization of Turkish Existential Sentences Using Support Vector Learning by Aylin Koca December 7, 2004
OUTLINE • Introduction • On Turkish Existential Sentences • Categories Due to Senses of var and yok • Corpus and Semantic Annotation • Abstract Thematic Roles • Methodology • Shallow Semantic Parsing • Classifier: SVM • Features • Sentence Categorization • Experimentation & Results • Without Semantic Information • With Semantic Information • Concluding Remarks
INTRODUCTION • Three types of sentences • Verbal sentences (e.g. She read the book.) • Copulative sentences (e.g. The book is on the table.) • Existential sentences (e.g. There is a book on the table.) • Overview of system • Shallow semantic parsing for defining the predicate-argument relationships in a Turkish existential sentence on a word-by-word basis via support vector learning • Accurately categorizing these sentences accordingly • Improving the system • Incorporating semantic information
TURKISH EXISTENTIAL SENTENCES • A somewhat "overlooked" sentence type • Controversial status: meaning & category • Minimally characterized by two particles • Var: 'there is/are' • Yok: 'there is/are no' • In [Sezer, 2003]*: • "Sence aşk yok mu?" (As far as you are concerned, does love not exist?) • "İçimde bir şüphe var." (There is a doubt in me.) • "Bizim bir şikâyetimiz yok." (We don't have a complaint.) • "İçeride Müdür Bey var." (Mr. Director is inside.) • "Biz o toplantıda vardık." (We were present at that meeting.) * On Syntactic and Semantic Properties of Turkish Existential Sentences. Harvard University.
Bare Existentials • Overt subject • Sence aşk yok mu? according to you love NE Q “As far as you are concerned, does love not exist?” • Bugün su var. today water E “Today there is water.” • Hiç şüphe yok. no doubt NE “There is no doubt.”
Case Existentials • Case information (i.e. locative, ablative, dative, instrumental) • Şehir-de güzel evler var. city-LOC nice houses E “There are nice houses in the city.” • Anne-m-den haber yok. mother-P1SG-ABL news NE “There is no news from my mother.” • Siz-e bir mektup var. you-DAT a letter E “There is a letter to you.” • Göz-üm-le ilgili bir derd-im yok-tu. eye-P1SG-INS about a problem-P1SG NE-APAST “I did not have a problem with my eye.”
Existential Possession • Due to the lack of a verb meaning ‘to have’ in Turkish • Ad-ınız yok mu? name-P2PL NE Q “Don’t you have a name?” • Kim-in silgi-si var? who-GEN eraser-P3SG E “Who has an eraser?”
Other Categories… • Existential sentences with an initial definite subject assign a <participant> role to their subject and a <scene> role to their locative NP: • O bu komite-de var. s/he this committee-LOC E "S/he is on this committee." • Picture existentials • Ayşe bu dosya-da yok. Ayşe this file-LOC NE "Ayşe is not in this file." • Compound tense existentials • Kimse-ye kız-dığ-ım yok. anyone-DAT angry-PASTPART-P1SG NE "I am not angry at anyone."
CORPUS • March 2004 release of the Turkish Treebank* [Oflazer et al., 2003] • 7262 sentences • var: 187 occurrences • yok: 105 occurrences • 232 of 292 sentences taken as existential sentences for manual semantic annotation *METU-Sabancı Turkish Treebank (www.ii.metu.edu.tr/~corpus/treebank.html)
Semantic Annotation • Manual annotation of the semantic arguments of existential sentences • [POSSESSOR Onun] [LOCATION bu evde] [POSSESSED yeri] [predicate yok] [NULL artık]. • Further annotation on a word-by-word basis using the IOB representation: • [B-POSSESSOR Onun] [B-LOCATION bu] [I-LOCATION evde] [B-POSSESSED yeri] [predicate yok] [O artık]. • IOB2* representation: • I means the word is inside a chunk • O means the word is outside any chunk • B means the word is the beginning of a chunk * E. Sang and J. Veenstra. Representing Text Chunks. Proc. of EACL, 1999.
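As an illustration, a minimal conversion sketch in Python (not the thesis code), assuming each token is paired with the role label of the chunk it belongs to, None for words outside any argument, and "predicate" for var/yok itself:

```python
def to_iob2(tokens):
    """tokens: list of (word, role) pairs -> list of (word, IOB2 tag) pairs."""
    tagged, prev_role = [], None
    for word, role in tokens:
        if role is None:
            tag = "O"                      # outside any chunk
        elif role == "predicate":
            tag = "predicate"              # the predicate keeps its own label
        elif role != prev_role:
            tag = "B-" + role              # first word of a new chunk
        else:
            tag = "I-" + role              # continuation of the same chunk
        tagged.append((word, tag))
        prev_role = role
    return tagged

sentence = [("Onun", "POSSESSOR"), ("bu", "LOCATION"), ("evde", "LOCATION"),
            ("yeri", "POSSESSED"), ("yok", "predicate"), ("artık", None)]
print(to_iob2(sentence))
# [('Onun', 'B-POSSESSOR'), ('bu', 'B-LOCATION'), ('evde', 'I-LOCATION'),
#  ('yeri', 'B-POSSESSED'), ('yok', 'predicate'), ('artık', 'O')]
```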
Abstract Thematic Roles • Type of semantic knowledge required: • <THEME> = overt subject† of the predicate var/yok • <LOCATION> = place in which the subject is situated • <SOURCE> = entity from which the subject originates • <GOAL> = entity towards which the subject heads • <RELATION> = entity with which the subject shares a relation • <POSSESSOR> = referent of the subject that possesses • <POSSESSED> = entity that is possessed † The overt subject should not be marked with possession information.
“Refined” Corpus • Added semantic SEM tags • Eliminated LEM and MORPH tags
SHALLOW SEMANTIC PARSING • Process of assigning a simple structure to sentences in text: • WHO did WHAT to WHOM, WHEN, WHERE, WHY, HOW, etc. • Technically: • Group sequences of words together (identification) • Assign labels to these semantic arguments (classification)
Classifier • Chunking and labeling as a classification-based learning task • Support Vector Machines (SVMs) • Capable of handling a large number of features with strong generalization properties [Joachims, 1998†; Kudoh and Matsumoto, 2000‡] • Binary classifiers • But semantic parsing is a multi-class classification problem † Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. of ECML, 1998. ‡ Use of Support Vector Learning for Chunk Identification. Proc. of CoNLL-2000 and LLL-2000, 2000.
Classifier (cont’d) • “One class vs. all others” (OVA) approach • For K classes, build K classifiers that separate one class from among all others • “Pairwise” (OVO) approach • For K classes, build K(K-1)/2 classifiers, considering all pairs of classes • Tradeoff: • Number of classifiers to be trained • Amount of data used in training each classifier
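For illustration, a minimal scikit-learn sketch (not the LIBSVM setup used in the experiments) of the same base SVM wrapped in the two schemes; X and y are toy stand-ins for the word feature vectors and their IOB2 role tags:

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 0, 1, 1], [1, 1, 0, 0]]   # toy feature vectors
y = ["B-LOCATION", "B-THEME", "I-LOCATION", "O"]               # toy role tags

ova = OneVsRestClassifier(SVC(kernel="linear"))   # K binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear"))    # K*(K-1)/2 binary classifiers

ova.fit(X, y)
ovo.fit(X, y)
print(ova.predict([[0, 1, 1, 1]]), ovo.predict([[0, 1, 1, 1]]))
```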
Features • Used in assigning semantic roles • Represent various aspects of: • Syntactic structure of the sentence • Lexical information • Feature set: the POS category of the word; the POS category of the word it is related to in the sentence; the name of that relation; whether the word appears before or after the predicate; [optionally, the predicted semantic labels of the previous words within the context].
Features and Context (figure): a 5-word context window around the current prediction, and the features used to classify a word.
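A minimal sketch of how such a feature dictionary for one word might be assembled (field and relation names are hypothetical; the exact feature set and context handling in the thesis may differ):

```python
def word_features(sent, i, predicted_tags, window=2):
    """sent: list of dicts with 'word', 'pos', 'rel', 'head_pos', 'before_pred'."""
    feats = {
        "pos": sent[i]["pos"],                  # POS category of the word
        "head_pos": sent[i]["head_pos"],        # POS of the word it is related to
        "rel": sent[i]["rel"],                  # name of that relation
        "before_pred": sent[i]["before_pred"],  # before or after var/yok
    }
    for d in range(1, window + 1):
        feats[f"pos-{d}"] = sent[i - d]["pos"] if i - d >= 0 else "<s>"
        feats[f"pos+{d}"] = sent[i + d]["pos"] if i + d < len(sent) else "</s>"
        # labels already predicted for the preceding words in the window
        feats[f"tag-{d}"] = predicted_tags[i - d] if i - d >= 0 else "<s>"
    return feats

toy_sent = [
    {"word": "Onun", "pos": "Pron", "rel": "POSSESSOR", "head_pos": "Noun", "before_pred": True},
    {"word": "bu",   "pos": "Det",  "rel": "DET",       "head_pos": "Noun", "before_pred": True},
    {"word": "evde", "pos": "Noun", "rel": "LOCATIVE",  "head_pos": "Verb", "before_pred": True},
    {"word": "yeri", "pos": "Noun", "rel": "SUBJECT",   "head_pos": "Verb", "before_pred": True},
    {"word": "yok",  "pos": "Verb", "rel": "ROOT",      "head_pos": "Verb", "before_pred": False},
]
print(word_features(toy_sent, 2, ["B-POSSESSOR", "B-LOCATION"]))
```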
SENTENCE CATEGORIZATION • Thematic hierarchy over the annotated roles: <POSSESSOR>, <POSSESSED> > <LOCATION>, <SOURCE>, <GOAL>, <RELATION> > <THEME> • The highest-ranked role found in a sentence determines its category: possession existentials, case existentials, or bare existentials, respectively
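A minimal sketch of this categorization rule as read from the hierarchy above (my own rendering, not the thesis code):

```python
POSSESSION_ROLES = {"POSSESSOR", "POSSESSED"}
CASE_ROLES = {"LOCATION", "SOURCE", "GOAL", "RELATION"}

def categorize(roles):
    """roles: set of semantic role labels predicted for one sentence."""
    if roles & POSSESSION_ROLES:
        return "possession existential"
    if roles & CASE_ROLES:
        return "case existential"
    if "THEME" in roles:
        return "bare existential"
    return "uncategorized"

print(categorize({"POSSESSOR", "LOCATION", "POSSESSED"}))  # possession existential
print(categorize({"THEME", "LOCATION"}))                   # case existential
```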
SENTENCE CATEGORIZATION (cont’d) • Performance evaluation based on: • Precision • Recall • Fβ, where β = 1 • Overall accuracy
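For reference, the three chunk-level measures in a minimal sketch, assuming per-category counts of true positives, false positives and false negatives:

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_beta(p, r, beta=1.0):
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)   # beta = 1 gives F1

p, r = precision(42, 8), recall(42, 10)
print(round(p, 3), round(r, 3), round(f_beta(p, r), 3))
```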
EXPERIMENTATION & RESULTS • LIBSVM* software • The standard package implements the OVO approach • One of its multi-class classification tools was used for the OVA approach * http://www.csie.ntu.edu.tw/~cjlin/libsvm/
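LIBSVM's tools read training data as plain text, one instance per line in the sparse format `<label> <index>:<value> …` with feature indices in increasing order. The sketch below (with a hypothetical label mapping and feature indices) shows how word-level feature vectors could be written out in that format:

```python
LABELS = {"O": 0, "B-THEME": 1, "B-LOCATION": 2, "I-LOCATION": 3}  # hypothetical mapping

def to_libsvm_line(label, feature_vector):
    """feature_vector: dict mapping 1-based feature index -> numeric value."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(feature_vector.items()))
    return f"{LABELS[label]} {pairs}"

with open("train.svm", "w") as out:
    out.write(to_libsvm_line("B-LOCATION", {3: 1, 17: 1, 42: 1}) + "\n")
    out.write(to_libsvm_line("O", {5: 1, 17: 1}) + "\n")
```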
Experiments • Cross Validation • In v-fold cross-validation, the training set is divided into v subsets of equal size. Each subset in turn is tested using the classifier trained on the remaining v-1 subsets. Thus, every instance of the training set is predicted once, and the cross-validation accuracy is the percentage of instances that are correctly classified. • Classification (9-to-1 train/test split) • Sentence Categorization
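A minimal sketch of v-fold cross-validation in plain Python (no ML library; the trainer and predictor are passed in), shown here with a toy majority-class "classifier":

```python
def cross_validate(instances, labels, train_fn, predict_fn, v=10):
    folds = [list(range(i, len(instances), v)) for i in range(v)]
    correct = 0
    for held_out in folds:
        train_idx = [i for i in range(len(instances)) if i not in set(held_out)]
        model = train_fn([instances[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_out:
            if predict_fn(model, instances[i]) == labels[i]:
                correct += 1
    return correct / len(instances)   # cross-validation accuracy

def train_majority(X, y):
    return max(set(y), key=y.count)   # "model" is just the majority label

def predict_majority(model, x):
    return model

print(cross_validate(list(range(10)), ["O"] * 7 + ["B-THEME"] * 3,
                     train_majority, predict_majority, v=5))
```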
Without Semantic Information • Cross Validation
Without Semantic Information • Classification • MSE = Mean Squared Error • SCC = Squared Correlation Coefficient
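For reference, a minimal sketch of how these two measures can be computed from numeric label vectors (MSE is the mean of the squared differences; SCC is the squared Pearson correlation between predicted and true values):

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def scc(y_true, y_pred):
    n = len(y_true)
    st, sp = sum(y_true), sum(y_pred)
    stp = sum(t * p for t, p in zip(y_true, y_pred))
    stt = sum(t * t for t in y_true)
    spp = sum(p * p for p in y_pred)
    num = (n * stp - st * sp) ** 2
    den = (n * stt - st ** 2) * (n * spp - sp ** 2)
    return num / den if den else 0.0

print(mse([1, 2, 2, 0], [1, 2, 1, 0]), round(scc([1, 2, 2, 0], [1, 2, 1, 0]), 3))
```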
With Semantic Information • Cross Validation
With Semantic Information • Classification
CONCLUDING REMARKS • A novel way of utilizing the Turkish Treebank for domain-independent shallow semantic parsing of Turkish existential sentences (ES) by recognizing their predicate-argument structures • Automatic categorization • Thematic role hierarchy • Semantic annotation and refinement of the Turkish ESs in the Treebank
CONCLUDING REMARKS (cont’d) • Evaluation of results: • Incorporating semantic information into the SVM input files shows promise for applications in various natural language tasks in Turkish • The results of the ES categorization task did not appear to be affected by the incorporation of semantic information • Word-level vs. sentence-level features
CONCLUDING REMARKS (cont’d) • Future work: • A more consistently and accurately annotated corpus • Enlarging the data set • Research scope: • Issues from various strands of linguistics and computer science, such as natural language processing and machine learning
CONCLUDING REMARKS (cont’d) • Big picture: • The results can play a major role in tasks like Information Extraction, Question Answering, and Summarization • Also an intermediate step in machine translation • The system could be extended to cover phonology and speech processing if it were based on speech rather than text, thus better serving the field of Artificial Intelligence
Selected Bibliography • Joachims, T. 1998. “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”. In Proceedings of ECML. • Kudoh, T. and Matsumoto, Y. 2000. “Use of Support Vector Learning for Chunk Identification”. In Proceedings of CoNLL-2000 and LLL-2000, pp. 142-144. • Oflazer, K., Say, B., Hakkani-Tür, D. Z. and Tür, G. 2003. “Building a Turkish Treebank”. In A. Abeille (ed.), Building and Exploiting Syntactically Annotated Corpora, pp. 1-18. Kluwer Academic Publishers. • Pradhan, S., Ward, W., Hacioglu, K., Martin, J. and Jurafsky, D. 2004. “Support Vector Learning for Semantic Argument Classification”. To appear in Journal of Machine Learning. Center for Spoken Language Research, Boulder, CO. • Sang, E. and Veenstra, J. 1999. “Representing Text Chunks”. In Proceedings of EACL, pp. 173-179, Bergen, Norway. • Sezer, E. 2003. “On Syntactic and Semantic Properties of Turkish Existential Sentences”. Harvard University.