Automatic Keyphrase Extraction from Croatian Newspaper Articles

Renee Ahel, Bojana Dalbelo Bašić, Jan Šnajder
Knowledge Technologies Lab
Department of Electronics, Microelectronics, Computer and Intelligent Systems
University of Zagreb, Faculty of Electrical Engineering and Computing
INFuture 2009

Agenda
• Assigning keyphrases
• Related work
• Extraction system
• Corpora
• Results
• Conclusion
INFuture 2009, 05.11.2009.

Assigning keyphrases
• Keyphrases
  • Summarize documents
  • Are often not assigned to documents
  • Manual assignment is a tedious task
• Automatic assignment methods
  • Keyphrase assignment (in the narrow sense)
  • Keyphrase extraction

Related work
• Our work is inspired by KEA (Witten et al. 1999)
  • Good performance despite using a simple set of features
• Our approach: more features, improved candidate generation
• POS tag filtering similar to Hulth (2003)
  • Larger set of POS tags for filtering (Petrović et al. 2009)

Extraction system
[Diagram: two parts, "Program" and "Relational database". Labels: pre-processing, candidate generation, feature calculation, learning, ranking, staging area, candidate warehouse, knowledge base, learning candidate feature matrix, classification candidate feature matrix.]

Extraction – pre-processing
[Diagram labels: tokenization, phrase boundaries, lemmatization, HINA categories.]
Example sentence: “Vrhovni sud potvrdio je presudu Županijskog suda u Splitu.” (“The Supreme Court upheld the verdict of the County Court in Split.”)

Extraction – candidate generation
Candidate               POS pattern
vrhovni                 A
vrhovni sud             AN
vrhovni sud potvrdio    ANV
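The candidate generation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes candidates are all n-grams of up to three tokens inside one phrase (so they never cross a phrase boundary), each labelled with the concatenation of its tokens' one-letter POS tags. The function name `generate_candidates`, the `max_n` limit and the hand-written tags are assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, one-letter POS tag), e.g. ("sud", "N")


def generate_candidates(phrase: List[Token], max_n: int = 3) -> List[Tuple[str, str]]:
    """Return (candidate text, POS pattern) for every n-gram of length 1..max_n.

    `phrase` is assumed to hold the tokens between two phrase boundaries.
    """
    candidates = []
    for start in range(len(phrase)):
        for n in range(1, max_n + 1):
            end = start + n
            if end > len(phrase):
                break  # the n-gram would run past the phrase boundary
            window = phrase[start:end]
            text = " ".join(word for word, _ in window)
            pattern = "".join(tag for _, tag in window)
            candidates.append((text, pattern))
    return candidates


# Hand-tagged tokens from the start of the example sentence (tags are illustrative).
phrase = [("vrhovni", "A"), ("sud", "N"), ("potvrdio", "V")]
candidates = generate_candidates(phrase)
for text, pattern in candidates:
    print(text, pattern)
```

For these three tokens the sketch produces the three candidates shown on the slide (A, AN, ANV) followed by the candidates that start at later positions.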

Extraction – feature calculation
[Diagram: feature calculation over the corpus. Candidate attributes: Candidate_ID, Original_text, Lemmatized_text, Ngram_level, Appearance_index, Lemma_matches, First_lemma_match, Document_length, Is_in_title, Is_in_categories, IsKeyword.]
Example feature values for the candidate "presuda":
First_appearance_relative = 0.212
Is_in_categories = 1
Is_in_title = 0
TF_IDF = 0.967
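Two of the numeric features named above can be sketched as follows. The slides do not give the exact formulas, so this assumes the common KEA-style definitions: the relative first appearance is the token index of the first occurrence divided by the document length, and TF-IDF is the relative term frequency times the log of the inverse document frequency. The function names and the toy lemmatized corpus are hypothetical.

```python
import math
from typing import List


def count_matches(candidate: List[str], tokens: List[str]) -> int:
    """Count occurrences of the candidate token sequence in a token list."""
    n = len(candidate)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == candidate)


def first_appearance_relative(candidate: List[str], doc: List[str]) -> float:
    """Index of the first occurrence of the candidate, divided by document length."""
    n = len(candidate)
    for i in range(len(doc) - n + 1):
        if doc[i:i + n] == candidate:
            return i / len(doc)
    raise ValueError("candidate does not occur in the document")


def tf_idf(candidate: List[str], doc: List[str], corpus: List[List[str]]) -> float:
    """Relative term frequency in `doc` times log(N / document frequency)."""
    tf = count_matches(candidate, doc) / len(doc)
    df = sum(1 for d in corpus if count_matches(candidate, d) > 0)
    df = max(df, 1)  # guard for a document that is not part of the corpus
    return tf * math.log(len(corpus) / df)


# Toy lemmatized corpus (made up for illustration only).
corpus = [
    ["sud", "potvrditi", "presuda", "sud"],
    ["vlada", "donijeti", "odluka"],
    ["presuda", "biti", "objaviti", "danas"],
]
doc = corpus[0]
print(first_appearance_relative(["presuda"], doc))
print(tf_idf(["presuda"], doc, corpus))
```

Binary features such as Is_in_title or Is_in_categories would reduce to a membership test against the title tokens or the category list.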

Extraction – learning and ranking
[Diagram labels: discretization for naive Bayes, naive Bayes learning, knowledge base, ranking.]
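The learning and ranking steps can be illustrated with a small sketch. It assumes a KEA-style setup: numeric features are discretized into intervals, a naive Bayes model with add-one smoothing estimates the probability that a candidate is a keyphrase, and candidates are ranked by that probability. The cut points below are hand-picked for the example (the MDL discretization mentioned in the conclusion is not implemented), and the class name, training rows and candidate names are hypothetical.

```python
import bisect
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def discretize(value: float, cut_points: Sequence[float]) -> int:
    """Map a numeric value to the index of the interval it falls into."""
    return bisect.bisect_right(cut_points, value)


class NaiveBayesRanker:
    """Naive Bayes over discrete features with add-one (Laplace) smoothing."""

    def __init__(self, n_bins: List[int]):
        self.n_bins = n_bins  # number of possible values for each feature
        self.class_counts = Counter()    # label -> count
        self.feature_counts = Counter()  # (feature index, bin, label) -> count

    def fit(self, rows: List[Tuple[int, ...]], labels: List[bool]) -> None:
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for j, b in enumerate(row):
                self.feature_counts[(j, b, label)] += 1

    def _joint(self, row: Tuple[int, ...], label: bool) -> float:
        total = sum(self.class_counts.values())
        p = (self.class_counts[label] + 1) / (total + 2)
        for j, b in enumerate(row):
            p *= (self.feature_counts[(j, b, label)] + 1) / (
                self.class_counts[label] + self.n_bins[j]
            )
        return p

    def score(self, row: Tuple[int, ...]) -> float:
        """Estimated probability that the candidate is a keyphrase."""
        pos = self._joint(row, True)
        neg = self._joint(row, False)
        return pos / (pos + neg)


# Toy training data: (TF-IDF bin, first-appearance bin) and keyphrase labels.
rows = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 0), (1, 1)]
labels = [True, True, False, False, False, False]
model = NaiveBayesRanker(n_bins=[2, 2])
model.fit(rows, labels)

# Rank unseen candidates by descending keyphrase probability.
unseen: Dict[str, Tuple[int, ...]] = {"candidate_a": (0, 1), "candidate_b": (1, 0)}
ranked = sorted(unseen, key=lambda name: model.score(unseen[name]), reverse=True)
print(ranked)
```

The top-ranked candidates (for example the first 10) would then be returned as the extracted keyphrases.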

Corpora
• Deficiencies
  • Assigned keyphrases not appearing in text removed (57% of original keyphrases)
  • Unknown inter-annotator agreement
  • Inconsistent keyphrases (63% of keyphrases assigned to only one document)
• Experimental set
  • 200 documents
  • On average, 6.5 keyphrases and 370 candidates per document

Results – basic configurations

Results – additional POS filter
F = lemmatization failed, X = stopword
• Filtering out the candidates that do not match the POS patterns N, AN, NN, NXN
  • discards 30% of negative candidates
  • discards only 7.5% of positive candidates
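The additional POS filter amounts to a whitelist check on each candidate's POS pattern. A minimal sketch, assuming candidates are (text, pattern) pairs; the function name and the sample candidates are illustrative and not taken from the corpus:

```python
from typing import List, Tuple

# POS patterns kept by the filter, as listed on the slide (X = stopword).
ALLOWED_PATTERNS = {"N", "AN", "NN", "NXN"}


def pos_filter(candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the candidates whose POS pattern is on the whitelist."""
    return [(text, pattern) for text, pattern in candidates if pattern in ALLOWED_PATTERNS]


sample = [
    ("vrhovni", "A"),
    ("vrhovni sud", "AN"),
    ("vrhovni sud potvrdio", "ANV"),
    ("sud", "N"),
    ("sud u Split", "NXN"),
]
kept = pos_filter(sample)
print(kept)
```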

Results – additional POS filter

Results – ablation study
• Influence of each feature on performance
• Holding out one feature and doing keyphrase extraction using the remaining features
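The hold-one-out procedure can be sketched as a simple loop. This assumes an `evaluate` function that trains and scores the extractor on a given feature subset; here it is replaced by a placeholder that merely counts the features so that the sketch runs, and it implies nothing about the actual results. The function names are hypothetical; the feature names are the ones shown on the feature calculation slide.

```python
from typing import Callable, Dict, List, Sequence


def ablation_study(
    features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Dict[str, float]:
    """For each feature, return the score change when that feature is held out."""
    baseline = evaluate(list(features))
    deltas = {}
    for held_out in features:
        remaining = [f for f in features if f != held_out]
        deltas[held_out] = evaluate(remaining) - baseline
    return deltas


def placeholder_evaluate(feature_subset: List[str]) -> float:
    """Stand-in metric: a real run would retrain and evaluate the extractor."""
    return float(len(feature_subset))


FEATURES = ["TF_IDF", "First_appearance_relative", "Is_in_title", "Is_in_categories"]
deltas = ablation_study(FEATURES, placeholder_evaluate)
for name, delta in deltas.items():
    print(name, delta)
```

With a real evaluation function, a large negative change for a held-out feature would indicate that the feature contributes strongly to performance.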

Examples
* Keyphrase normalization is a work in progress

Examples

Conclusion
• Overall best result achieved with MDL + additional POS filtering, 10 extracted keyphrases
• In the absence of comparable results, we consider the performance of our system to be modest
• Possible causes
  • Low inter-annotator agreement suspected
  • Inconsistently assigned keyphrases
• Results show that performance can be improved, despite deficiencies in corpora
• New corpus of much higher quality obtained

Acknowledgements
• This work has been supported by the Ministry of Science, Education and Sports, Republic of Croatia, under Grant 036-1300646-1986
• The authors are grateful to the Croatian News Agency (HINA) for making the newspaper corpora available

Thank you! Questions?
