This research paper presents a system for automatically extracting keyphrases from Croatian newspaper articles. The system utilizes various features and techniques, including candidate generation, feature calculation, and learning and ranking. The results show promising performance, although there are some deficiencies in the corpora used.
Automatic Keyphrase Extraction from Croatian Newspaper Articles Renee Ahel, Bojana Dalbelo Bašić, Jan Šnajder Knowledge Technologies Lab Department of Electronics, Microelectronics, Computer and Intelligent Systems University of Zagreb, Faculty of Electrical Engineering and Computing INFuture 2009
Agenda • Assigning keyphrases • Related work • Extraction system • Corpora • Results • Conclusion INFuture 2009, 05.11.2009.
Assigning keyphrases • Keyphrases • Summarize documents • Are often not assigned to documents • Manual assignment is a tedious task • Automatic assignment methods • Keyphrase assignment (in the narrow sense) • Keyphrase extraction
Related work • Our work is inspired by KEA (Witten et al. 1999) • Good performance despite using a simple set of features • Our approach: more features, improvements on candidate generation • POS tag filtering similar to Hulth (2003) • Larger set of POS tags for filtering (Petrović et al. 2009)
Extraction system • Diagram of two parts: a program and a relational database • Processing steps: pre-processing, candidate generation, feature calculation, learning, ranking • Data stores: staging area, candidate warehouse, learning candidate feature matrix, classification candidate feature matrix, knowledge base
Extraction – pre-processing • Steps shown: tokenization, phrase boundaries, lemmatization • Example sentence: “Vrhovni sud potvrdio je presudu Županijskog suda u Splitu.” (“The Supreme Court upheld the verdict of the County Court in Split.”) • HINA categories
Extraction – candidate generation • Candidates are word sequences labeled with their POS patterns • vrhovni (A) • vrhovni sud (AN) • vrhovni sud potvrdio (ANV)
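A minimal sketch of how candidates such as vrhovni sud (AN) could be enumerated from POS-tagged tokens. The function name `generate_candidates`, the `max_n` limit of 3, and the (token, tag) input format are illustrative assumptions; the slides do not specify them, and phrase-boundary handling is left out here.

```python
from typing import List, Tuple


def generate_candidates(tagged: List[Tuple[str, str]], max_n: int = 3) -> List[Tuple[str, str]]:
    """Return (phrase, POS pattern) pairs for every n-gram of up to max_n tokens."""
    candidates = []
    for start in range(len(tagged)):
        for n in range(1, max_n + 1):
            window = tagged[start:start + n]
            if len(window) < n:
                break  # ran past the end of the token list
            phrase = " ".join(token for token, _ in window)
            pattern = "".join(tag for _, tag in window)
            candidates.append((phrase, pattern))
    return candidates


# Hand-tagged input for the first three tokens of the example sentence (assumed tags).
tagged_tokens = [("vrhovni", "A"), ("sud", "N"), ("potvrdio", "V")]
candidates = generate_candidates(tagged_tokens)
```

A real system would presumably restart the window at each phrase boundary so that no candidate spans two phrases.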
Extraction – feature calculation • Fields shown in the diagram: Corpus, Candidate_ID, Original_text, Lemmatized_text, Document_length, Ngram_level, Appearance_index, Lemma_matches, First_lemma_match, Is_in_title, Is_in_categories, IsKeyword • Example feature values for the candidate presuda: First_appearance_relative = 0.212, Is_in_categories = 1, Is_in_title = 0, TF_IDF = 0.967
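The four feature names on this slide could be computed roughly as follows. This is a sketch under stated assumptions: the slides give no formulas, so the TF-IDF variant (relative term frequency times log of inverse document frequency), the function name `candidate_features`, the hand-written lemmas, and the document-frequency value are all hypothetical. Only single-lemma candidates are handled.

```python
import math
from typing import Dict, List


def candidate_features(candidate: str, doc_lemmas: List[str], title_lemmas: List[str],
                       category_lemmas: List[str], doc_freq: int, n_docs: int) -> Dict[str, float]:
    """Compute illustrative feature values for a single-lemma candidate."""
    positions = [i for i, lemma in enumerate(doc_lemmas) if lemma == candidate]
    if not positions:
        raise ValueError("candidate does not occur in the document")
    tf = len(positions) / len(doc_lemmas)   # relative term frequency
    idf = math.log(n_docs / doc_freq)       # assumed IDF variant
    return {
        "First_appearance_relative": positions[0] / len(doc_lemmas),
        "Is_in_title": int(candidate in title_lemmas),
        "Is_in_categories": int(candidate in category_lemmas),
        "TF_IDF": tf * idf,
    }


# Hand-lemmatized example sentence and a made-up title (assumptions for the sketch).
doc_lemmas = ["vrhovni", "sud", "potvrditi", "biti", "presuda", "županijski", "sud", "u", "split"]
title_lemmas = ["sud", "potvrditi", "presuda"]
features = candidate_features("presuda", doc_lemmas, title_lemmas,
                              category_lemmas=[], doc_freq=10, n_docs=200)
```

The values produced here will not match the 0.212 and 0.967 shown on the slide, since those come from a full article and the authors' own definitions.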
Extraction – learning and ranking • Learning: discretization for naive Bayes, naive Bayes learning, knowledge base • Ranking: discretization for naive Bayes, ranking
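The learning and ranking steps might look like the sketch below: feature values are discretized, a naive Bayes model is fitted on labeled candidates, and new candidates are ranked by their estimated keyphrase probability. Several things here are assumptions rather than the authors' method: equal-width binning stands in for whatever discretization the system uses, ranking by posterior probability follows the KEA approach, and the class name, training rows, and feature values are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def discretize(value: float, n_bins: int = 4) -> int:
    """Map a value in [0, 1] to an equal-width bin (stand-in for the real discretization)."""
    return min(int(value * n_bins), n_bins - 1)


class NaiveBayes:
    def __init__(self, n_bins: int = 4) -> None:
        self.n_bins = n_bins
        self.class_counts: Counter = Counter()
        self.bin_counts = defaultdict(Counter)  # (class, feature index) -> bin counts

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[int]) -> None:
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for j, value in enumerate(row):
                self.bin_counts[(label, j)][discretize(value, self.n_bins)] += 1

    def keyphrase_probability(self, row: Sequence[float]) -> float:
        """Posterior probability of class 1 (keyphrase), with Laplace smoothing."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for j, value in enumerate(row):
                seen = self.bin_counts[(label, j)][discretize(value, self.n_bins)]
                score += math.log((seen + 1) / (count + self.n_bins))
            log_scores[label] = score
        if 1 not in log_scores:
            return 0.0
        peak = max(log_scores.values())
        norm = sum(math.exp(s - peak) for s in log_scores.values())
        return math.exp(log_scores[1] - peak) / norm


def rank(model: NaiveBayes, candidates: List[Tuple[str, List[float]]], top_k: int = 10) -> List[str]:
    """Return the top_k candidate phrases by estimated keyphrase probability."""
    scored = [(model.keyphrase_probability(feats), phrase) for phrase, feats in candidates]
    return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]


# Made-up training rows: [first appearance (relative), TF-IDF scaled to 0..1].
rows = [[0.05, 0.9], [0.10, 0.8], [0.7, 0.1], [0.9, 0.2], [0.6, 0.3], [0.15, 0.85]]
labels = [1, 1, 0, 0, 0, 1]
model = NaiveBayes()
model.fit(rows, labels)
ranked = rank(model, [("potvrdio", [0.8, 0.1]), ("presuda", [0.1, 0.9])])
```

With this toy data the early, high-TF-IDF candidate presuda is ranked ahead of potvrdio.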
Corpora • Deficiencies • Assigned keyphrases not appearing in text removed (57% of original keyphrases) • Unknown inter-annotator agreement • Inconsistent keyphrases (63% of keyphrases assigned to only one document) • Experimental set • 200 documents • On average, 6.5 keyphrases and 370 candidates per document
Results – basic configurations
Results – additional POS filter • Filtering out the candidates that do not match the POS patterns N, AN, NN, NXN • discards 30% of negative candidates • discards only 7.5% of positive candidates • Legend: F = lemmatization failed, X = stopword
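The additional POS filter amounts to a whitelist check on each candidate's POS pattern. A minimal sketch, assuming candidates are (phrase, pattern) pairs; the function name `pos_filter` and the example candidates with their tags are illustrative assumptions.

```python
from typing import List, Tuple

# POS patterns kept by the additional filter, as listed on the slide.
ALLOWED_PATTERNS = {"N", "AN", "NN", "NXN"}


def pos_filter(candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only candidates whose POS pattern is on the whitelist."""
    return [(phrase, pattern) for phrase, pattern in candidates if pattern in ALLOWED_PATTERNS]


# Hypothetical candidates; X marks a stopword, here the preposition "u".
candidates = [
    ("presuda", "N"),
    ("vrhovni sud", "AN"),
    ("vrhovni sud potvrdio", "ANV"),
    ("sud u Splitu", "NXN"),
]
kept = pos_filter(candidates)
```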
Results – additional POS filter
Results – ablation study • Influence of each feature on performance • Holding out one feature and doing keyphrase extraction using the remaining features
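The hold-one-out procedure can be sketched as a loop over features. The `evaluate` callback is a placeholder for running the whole extraction pipeline and scoring it; the toy scorer and its weights below are made up purely so the sketch runs, and are not results from the slides.

```python
from typing import Callable, Dict, List, Sequence


def ablation_study(features: Sequence[str],
                   evaluate: Callable[[List[str]], float]) -> Dict[str, float]:
    """Return, per feature, the score change when that feature is held out."""
    baseline = evaluate(list(features))
    return {
        held_out: evaluate([f for f in features if f != held_out]) - baseline
        for held_out in features
    }


# Made-up weights for the sketch only; a real run would retrain and score the extractor.
TOY_WEIGHTS = {"TF_IDF": 0.10, "First_appearance_relative": 0.05,
               "Is_in_title": 0.02, "Is_in_categories": 0.01}


def toy_evaluate(active: List[str]) -> float:
    """Stand-in scorer: sums the toy weights of the active features."""
    return sum(TOY_WEIGHTS[name] for name in active)


deltas = ablation_study(list(TOY_WEIGHTS), toy_evaluate)
most_influential = min(deltas, key=deltas.get)  # largest drop when removed
```

A negative delta means performance drops when the feature is removed, so the feature with the most negative delta contributes the most.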
Examples * Keyphrase normalization is a work in progress
Examples
Conclusion • Overall best result achieved with MDL + additional POS filtering, 10 extracted keyphrases • In the absence of comparable results, we consider our results to be of modest performance • Possible causes • Low inter-annotator agreement suspected • Inconsistently assigned keyphrases • Results show that performance can be improved, despite deficiencies in corpora • New corpus of much higher quality obtained
Acknowledgements • This work has been supported by the Ministry of Science, Education and Sports, Republic of Croatia, under Grant 036-1300646-1986 • The authors are grateful to the Croatian News Agency (HINA) for making the newspaper corpora available
Thank you! Questions?