Atanas Georgiev Chanev PhD student in Cognitive Sciences and Education, University of Trento Bachelor’s: FMI, University of Plovdiv, Bulgaria
A PP-Attachment Conundrum for Bulgarian Based on the parser I have implemented (an extension of the Earley–Stolcke algorithm) and the results I have obtained, a.k.a. my bachelor’s diploma work
Note: I won’t discuss algorithms dealing specifically with PP attachment. I’ll show that my approach fails to resolve PP-attachment ambiguities in most cases
Contents:
• The problem
• The prerequisites
• The algorithm
• The grammar
• The results
• The PP-attachment problem
• Future work
• Acknowledgements
The Problem: Parsing natural languages (Bulgarian) Shallow parsing vs. full parsing
What Is Syntax? POS tagging? Phrase structures? Grammatical relations? Grammatical functions?
Constituent Structures: Rules like S -> NP VP; NP -> NP PP … Problem: ambiguity An approach to resolving ambiguity in Bulgarian – Tanev’ 2001
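The classic illustration of this ambiguity is a PP that can attach either to the verb phrase or to the object noun phrase. A minimal CKY-style sketch can count how many parses such rules admit; the toy English grammar below is purely illustrative, not the Bulgarian grammar from this work:

```python
from collections import defaultdict

# Toy CNF grammar (illustrative): NP -> NP PP and VP -> VP PP
# together create the PP-attachment ambiguity.
UNARY = {  # word -> possible nonterminals
    'I': ['NP'], 'saw': ['V'], 'the': ['Det'],
    'man': ['N'], 'telescope': ['N'], 'with': ['P'],
}
BINARY = [  # (lhs, rhs1, rhs2)
    ('S', 'NP', 'VP'),
    ('VP', 'V', 'NP'),
    ('VP', 'VP', 'PP'),
    ('NP', 'NP', 'PP'),
    ('NP', 'Det', 'N'),
    ('PP', 'P', 'NP'),
]

def count_parses(words, start='S'):
    """Count the distinct parse trees of `words` via a CKY chart."""
    n = len(words)
    # chart[(i, j)][A] = number of ways A derives words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for a in UNARY.get(w, []):
            chart[(i, i + 1)][a] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, r1, r2 in BINARY:
                    c = chart[(i, k)][r1] * chart[(k, j)][r2]
                    if c:
                        chart[(i, j)][lhs] += c
    return chart[(0, n)][start]

# Two parses: the PP attaches to the VP or to the object NP.
print(count_parses("I saw the man with the telescope".split()))  # 2
```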
The Prerequisites: Morphological processor for Bulgarian (Krushkov’ 97) POS tagger (Tachev’ 2001) … The grammar is covered in a separate section
The Algorithm: The Earley algorithm (Earley’ 70) Three steps: predictor, scanner, completer
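The three steps can be sketched as a minimal Earley recognizer; the toy grammar and lexicon below are illustrative assumptions, not the grammar built for this work:

```python
# A minimal Earley recognizer sketch showing the three steps.
GRAMMAR = {
    'S':  [['NP', 'VP']],
    'NP': [['Det', 'N'], ['NP', 'PP']],
    'VP': [['V', 'NP'], ['VP', 'PP'], ['V']],
    'PP': [['P', 'NP']],
}
LEXICON = {'the': 'Det', 'dog': 'N', 'park': 'N', 'walked': 'V', 'in': 'P'}

def earley_recognize(words, start='S'):
    """Return True iff `words` is derivable from `start`."""
    n = len(words)
    # A state is (lhs, rhs, dot, origin).
    chart = [set() for _ in range(n + 1)]
    for rhs in GRAMMAR[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in GRAMMAR:
                    # Predictor: expand the expected non-terminal.
                    for prod in GRAMMAR[nxt]:
                        st = (nxt, tuple(prod), 0, i)
                        if st not in chart[i]:
                            chart[i].add(st)
                            agenda.append(st)
                elif i < n and LEXICON.get(words[i]) == nxt:
                    # Scanner: consume the next input word.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance states waiting for this constituent.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        st = (l2, r2, d2 + 1, o2)
                        if st not in chart[i]:
                            chart[i].add(st)
                            agenda.append(st)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])

print(earley_recognize("the dog walked in the park".split()))  # True
```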
Stolcke’s extension: Each chart state is assigned two probabilities – an inner probability and a forward probability – computed differently in each step (predictor, scanner, completer) A stochastic extension capable of resolving ambiguities (Stolcke’ 93)
Shallow trees are better? Deep trees always have smaller probabilities!
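A short sketch of why this happens: in a stochastic CFG a parse’s probability is the product of its rule probabilities, and every extra rule contributes a factor ≤ 1, so each additional level of structure can only shrink the total. The rule probabilities below are hypothetical:

```python
# Hypothetical rule probabilities (not from the talk's grammar).
RULE_P = {
    ('NP', ('Det', 'N')): 0.6,
    ('NP', ('NP', 'PP')): 0.4,
    ('PP', ('P', 'NP')): 1.0,
}

def tree_prob(rules):
    """Probability of a parse = product of the probabilities of its rules."""
    p = 1.0
    for r in rules:
        p *= RULE_P[r]
    return p

np_simple = [('NP', ('Det', 'N'))]                      # "the man"
np_one_pp = [('NP', ('NP', 'PP')), ('NP', ('Det', 'N')),
             ('PP', ('P', 'NP')), ('NP', ('Det', 'N'))]  # one PP attached
np_two_pp = [('NP', ('NP', 'PP'))] + np_one_pp + \
            [('PP', ('P', 'NP')), ('NP', ('Det', 'N'))]  # two PPs, deeper still

# Each extra attachment level multiplies in more factors <= 1,
# so the deeper tree is always less probable.
print(tree_prob(np_simple), tree_prob(np_one_pp), tree_prob(np_two_pp))
```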
+ Basic Unification: A basic unification mechanism based on agreement constraints Full unification as described in (Jurafsky, Martin’ 2001), performed at each step, is too inefficient
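A minimal sketch of what such an agreement-only check might look like; the feature names and values are illustrative assumptions:

```python
# Instead of full feature-structure unification, compare only a
# small fixed set of agreement features (names are illustrative).
AGREEMENT_FEATURES = ('gender', 'number')

def agree(head, dependent):
    """Succeed unless the two words carry conflicting agreement features.
    A missing feature is treated as unspecified and matches anything."""
    for f in AGREEMENT_FEATURES:
        a, b = head.get(f), dependent.get(f)
        if a is not None and b is not None and a != b:
            return False
    return True

# 'golyam grad' ("big city"): masculine singular adjective + noun -> agree
print(agree({'gender': 'm', 'number': 'sg'},
            {'gender': 'm', 'number': 'sg'}))  # True
# Mismatched gender -> the combination is rejected
print(agree({'gender': 'f', 'number': 'sg'},
            {'gender': 'm', 'number': 'sg'}))  # False
```

A check like this runs in constant time per rule application, which is the appeal over full unification at every step.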
The Grammar: Two versions of the grammar, collected from a mini corpus of sentences in the newspaper-article register
The mini corpus: 5331 tokens, > 450 sentences Grammatically and syntactically annotated
The PPs: Two types of PPs: Modifying the verb – AdvPs Modifying the noun – PPs
POS tags can be ambiguous: Shte – future tense auxiliary or particle Govoreshtiqt – verb or adjective, as in ‘govoreshtiqt student’ (‘the speaking student’)
The Results: Precision: 42.42% Recall: 66.00% F-measure: 51.65%
How I define precision and recall:
Precision = (number of correctly parsed sentences) / (number of sentences given any prediction)
Recall = (number of sentences given any prediction) / (number of sentences tested)
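Under these definitions the reported F-measure is the usual harmonic mean of precision and recall, which can be checked directly:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

p, r = 0.4242, 0.6600  # the reported precision and recall
print(round(f_measure(p, r) * 100, 2))  # 51.65 – matches the reported value
```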
The PP-Attachment Problem (+ the next 4 slides): How many of the correctly parsed sentences contain PPs? How many of the mistakenly parsed sentences contain PPs?
A considerable amount of mistaken PPs:
How many of the correctly parsed sentences contain PPs? 27.59%
How many of the mistakenly parsed sentences contain PPs? 32.61%
BUT:
• Sentences that are not given any prediction also contain PPs
• AdvPs are sometimes not ambiguous – e.g. at the beginning of the sentence
Conclusion: Stochastic context-free grammars are not powerful enough to deal with the PP-attachment problem (or at least not with this approach)
Future Work: Clause Splitter for Bulgarian A better grammar = a better corpus A better unification processor Semantic Constraints
Acknowledgements:
[1] Krushkov, Hr., Modelling and Building Machine Dictionaries and Morphological Processors, PhD dissertation, Plovdiv University “P. Hilendarski”, Plovdiv, 1997. (in Bulgarian)
[2] Tachev, G., A Stochastic Part-of-Speech Tagger, diploma thesis, Plovdiv University “P. Hilendarski”, Plovdiv, 2001. (in Bulgarian)
[3] Tanev, Hr., Automatic Text Analysis and Ambiguity Resolution in Bulgarian, PhD dissertation, Plovdiv University “P. Hilendarski”, Plovdiv, 2001. (in Bulgarian)
[4] Earley, J., An Efficient Context-Free Parsing Algorithm, Communications of the ACM, 13(2):94-102, 1970.
[5] Stolcke, A., An Efficient Probabilistic Context-Free Parsing Algorithm That Computes Prefix Probabilities, Technical Report TR-93-065, International Computer Science Institute, Berkeley, CA, 1993. Revised 1994.
[6] Jurafsky, D., Martin, J. H., Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, New Jersey, 2001.