AUTOMATIC KEYPHRASE EXTRACTION VIA TOPIC DECOMPOSITION

AUTOMATIC KEYPHRASE EXTRACTION VIA TOPIC DECOMPOSITION • Reporter: Ying-Ying Chen • Proceedings: EMNLP '10, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

OUTLINE • Introduction • Building Topic Interpreters • Topical PageRank for Keyphrase Extraction • Experiments • Related work • Conclusion

INTRODUCTION • Keyphrases are defined as a set of terms in a document that give readers a brief summary of its content. • They are widely used in information retrieval and digital libraries • Keyphrase extraction is also an essential step in document categorization, clustering and summarization • Two principal approaches: supervised and unsupervised • Supervised method • regards keyphrase extraction as a classification task • requires a document set with human-assigned keyphrases

INTRODUCTION • Unsupervised method • Graph-based ranking • Process: • first build a word graph according to word co-occurrences within the document • then use random walk techniques to measure word importance • the top-ranked words are selected as keyphrases • Problems (properties that plain graph-based ranking does not guarantee): • keyphrases should be relevant to the major topics of the given document • keyphrases should also have good coverage of the document's major topics

INTRODUCTION • To address these problems, it is intuitive to consider the topics of words and documents in the random walk for keyphrase extraction. • decompose traditional PageRank into multiple PageRanks specific to various topics • obtain the importance scores of words under different topics • We call the topic-decomposed PageRank Topical PageRank (TPR). • Moreover, TPR is unsupervised and language independent • TPR for keyphrase extraction is a two-stage process: • Build a topic interpreter to acquire the topics of words and documents. • Perform TPR to extract keyphrases for documents.

BUILDING TOPIC INTERPRETERS • There are two methods to acquire topic distributions of words • Use manually annotated knowledge bases. • Ex. WordNet • Use unsupervised machine learning techniques to obtain word topics from a large-scale document collection. • LSA (Latent Semantic Analysis) • pLSA (probabilistic LSA) • LDA (Latent Dirichlet Allocation)

BUILDING TOPIC INTERPRETERS • LDA • Each word w of a document d is regarded as generated by first sampling a topic z from d's topic distribution θ(d), and then sampling a word from the distribution over words φ(z) that characterizes topic z. • In LDA, θ(d) and φ(z) are drawn from conjugate Dirichlet priors α and β, respectively. • Therefore, θ and φ are integrated out and the probability of word w given document d and the priors is represented as follows: $pr(w \mid d, \alpha, \beta) = \sum_{z=1}^{K} pr(w \mid z, \beta)\, pr(z \mid d, \alpha)$ • where K is the number of topics

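LDA (LATENT DIRICHLET ALLOCATION) • Dirichlet distribution • The Dirichlet distribution is the conjugate prior of the multinomial distribution • If the prior is a Dirichlet distribution and the likelihood function is multinomial, then the posterior is still a Dirichlet distribution • P(Y|X): posterior probability; P(Y): prior probability; P(X|Y): likelihood function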

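LDA (LATENT DIRICHLET ALLOCATION) • LDA maps documents into a topic space: it assumes that a document is a random mixture of many topics, and it uses topics to obtain the relationships between documents. • LDA, LSA and pLSA share the same premise, the bag-of-words assumption, so none of them considers grammar or word order. • Differences between LDA and pLSA • In pLSA, document parameters are learned only for documents that appear in the training corpus • LDA gives documents that do not appear in the training corpus a probabilistic representation, so it needs fewer parameters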

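LDA (LATENT DIRICHLET ALLOCATION) • LDA is a generative model: it can randomly generate observable data, that is, a document composed of multiple topics. Modeling works in the reverse direction, building the generative model from a collection of documents. The generative steps are: • Choose N ~ Poisson(ξ), where N is the document length (number of words) • Choose θ ~ Dirichlet(α), where θ gives the probability of each topic and α is the parameter of the Dirichlet distribution • For each of the N words: • Choose a topic zn ~ Multinomial(θ); zn is the currently selected topic • Choose a word wn according to p(wn|zn;β), a multinomial distribution conditioned on zn; β is a K×V matrix with βij = P(wj=1|zi=1) • In LDA, each document has its own θ, and θ can be used to judge the similarity between documents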
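
The generative steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`sample_poisson`, `sample_dirichlet`, `generate_document`) and the toy values of α, β and ξ are hypothetical and do not come from the slides, and only the standard library is used.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw from Poisson(mean) with Knuth's method (adequate for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, xi, rng):
    """Follow the slide: N ~ Poisson(xi), theta ~ Dirichlet(alpha), then z_n and w_n."""
    n_words = sample_poisson(xi, rng)
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(alpha)), weights=theta)[0]      # z_n ~ Multinomial(theta)
        w = rng.choices(range(len(beta[z])), weights=beta[z])[0]  # w_n ~ p(w | z_n; beta)
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy setting (hypothetical values): K = 3 topics, V = 4 vocabulary entries.
alpha = [0.5, 0.5, 0.5]
beta = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
theta, topics, words = generate_document(alpha, beta, xi=20, rng=random.Random(0))
print(theta, words)
```

Fitting LDA runs this process in reverse: given only the word indices, inference estimates θ for each document and the topic-word matrix β.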

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • Given a document d, the process of keyphrase extraction using TPR consists of the following four steps: 1. Construct a word graph for d according to word co-occurrences within d. 2. Perform TPR to calculate the importance scores for each word with respect to different topics. 3. Using the topic-specific importance scores of words, rank candidate keyphrases with respect to each topic separately. 4. Given the topics of document d, integrate the topic-specific rankings of candidate keyphrases into a final ranking, and select the top-ranked ones as keyphrases.

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • We construct a word graph according to word co-occurrences within the given document • Link weight between words • the co-occurrence count within a sliding window of at most W words in the word sequence. • Direction • When sliding a W-width window, at each position, we add links from the first word pointing to the other words within the window. • Format • only add adjectives and nouns to the word graph
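
A minimal sketch of this graph construction follows. Several details are assumptions rather than something the slide states: tokens arrive as (word, tag) pairs with Penn Treebank style tags, the window slides over the full word sequence and links are kept only when both ends are adjectives or nouns, and self-links are skipped. The function name `build_word_graph` is hypothetical.

```python
from collections import defaultdict

def is_graph_word(tag):
    """Keep adjectives (JJ*) and nouns (NN*); the tag scheme is an assumption."""
    return tag.startswith("JJ") or tag.startswith("NN")

def build_word_graph(tagged, window):
    """Return {first_word: {other_word: co-occurrence count}} as a directed graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for i, (first, first_tag) in enumerate(tagged):
        if not is_graph_word(first_tag):
            continue
        # The window holds the first word plus the next (window - 1) words.
        for other, other_tag in tagged[i + 1 : i + window]:
            if is_graph_word(other_tag) and other != first:
                graph[first][other] += 1  # link from the first word to the other word
    return {u: dict(nbrs) for u, nbrs in graph.items()}

tagged = [
    ("topical", "JJ"), ("pagerank", "NN"), ("ranks", "VBZ"),
    ("topical", "JJ"), ("words", "NNS"),
]
graph = build_word_graph(tagged, window=3)
print(graph)
```

An alternative reading filters out non-adjective, non-noun tokens before sliding the window, which makes the effective window wider; the slide does not say which variant was used.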

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • PageRank • The basic idea of PageRank is that a vertex is important if there are other important vertices pointing to it. • This can be regarded as voting or recommendation among vertices. • G = (V,E) is the graph of a document • vertex set V = {w1,w2, · · · ,wN} • link set: (wi,wj) ∈ E if there is a link from wi to wj • the weight of link (wi,wj) is e(wi,wj) • the out-degree of vertex wi is $O(w_i) = \sum_{j: w_i \to w_j} e(w_i, w_j)$ • PageRank score: $R(w_i) = \lambda \sum_{j: w_j \to w_i} \frac{e(w_j, w_i)}{O(w_j)} R(w_j) + (1 - \lambda) \frac{1}{|V|}$ • λ is a damping factor ranging from 0 to 1 • |V| is the number of vertices
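
The PageRank definitions on this slide can be illustrated with a short power-iteration sketch. It assumes the standard weighted update, in which each vertex passes its score along outgoing links in proportion to link weight and every vertex receives a uniform jump of (1 − λ)/|V|. The function name, the toy graph and the stopping parameters are hypothetical.

```python
def pagerank(graph, damping, max_iter=100, tol=1e-6):
    """graph: {u: {v: weight}} with a link u -> v. Returns {vertex: score}."""
    vertices = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    out_weight = {u: sum(graph.get(u, {}).values()) for u in vertices}  # O(w_i)
    scores = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(max_iter):
        # Uniform random-jump term: (1 - lambda) / |V|
        new_scores = {v: (1.0 - damping) / len(vertices) for v in vertices}
        for u, nbrs in graph.items():
            for v, weight in nbrs.items():
                # u votes for v in proportion to the weight of link (u, v)
                new_scores[v] += damping * weight / out_weight[u] * scores[u]
        change = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if change < tol:
            break
    return scores

toy_graph = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1}}
toy_scores = pagerank(toy_graph, damping=0.3)
print(toy_scores)
```

In this sketch, vertices without outgoing links simply lose their score mass at each step; a production implementation would usually redistribute it.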

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • Topical PageRank (TPR) • Each topic-specific PageRank prefers those words with high relevance to the corresponding topic. • In the PageRank of a specific topic z, we assign a topic-specific preference value pz(w) to each word w as its random jump probability, with $\sum_{w \in V} p_z(w) = 1$

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • Topical PageRank (TPR) • We use three measures to set preference values for TPR: • pz(w) = pr(w|z) • This indicates how much topic z focuses on word w. • pz(w) = pr(z|w) • This indicates how much word w focuses on topic z. • pz(w) = pr(w|z) * pr(z|w) • This measure is inspired by the work in (Cohn and Chang, 2000). • Termination conditions: • the number of iterations reaches 100, or • the difference of each vertex's score between two consecutive iterations is less than 0.001.
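
A rough sketch of how the three preference measures and the termination conditions could fit together is given below. It assumes the preference values are normalized over the document's vertices so that they act as jump probabilities, and that they replace PageRank's uniform jump term. The function names and all probability values are hypothetical toy data.

```python
def preference_values(vocab, topic, pr_w_given_z, pr_z_given_w, measure):
    """Build p_z(w) over the document's words and normalize it to sum to 1."""
    raw = {}
    for w in vocab:
        if measure == "w|z":
            raw[w] = pr_w_given_z[topic][w]
        elif measure == "z|w":
            raw[w] = pr_z_given_w[w][topic]
        elif measure == "product":
            raw[w] = pr_w_given_z[topic][w] * pr_z_given_w[w][topic]
        else:
            raise ValueError(f"unknown measure: {measure}")
    total = sum(raw.values())
    return {w: value / total for w, value in raw.items()}

def topical_pagerank(graph, preference, damping, max_iter=100, tol=0.001):
    """PageRank whose random jump follows `preference` instead of being uniform."""
    out_weight = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    scores = dict(preference)
    for _ in range(max_iter):                      # stop after 100 iterations ...
        new_scores = {w: (1.0 - damping) * p for w, p in preference.items()}
        for u, nbrs in graph.items():
            for v, weight in nbrs.items():
                new_scores[v] += damping * weight / out_weight[u] * scores[u]
        change = max(abs(new_scores[w] - scores[w]) for w in preference)
        scores = new_scores
        if change < tol:                           # ... or when changes fall below 0.001
            break
    return scores

word_graph = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1}}
pr_w_given_z = {0: {"a": 0.1, "b": 0.1, "c": 0.8}, 1: {"a": 0.6, "b": 0.3, "c": 0.1}}
pr_z_given_w = {"a": {0: 0.2, 1: 0.8}, "b": {0: 0.3, 1: 0.7}, "c": {0: 0.9, 1: 0.1}}

tpr = {}
for z in (0, 1):
    p_z = preference_values(["a", "b", "c"], z, pr_w_given_z, pr_z_given_w, "z|w")
    tpr[z] = topical_pagerank(word_graph, p_z, damping=0.3)
print(tpr)
```

With these toy numbers, word "c" scores much higher under topic 0 than under topic 1, because topic 0's preference values send most random jumps to it.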

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • Extract Keyphrases Using Ranking Scores • We select noun phrases from a document as candidate keyphrases for ranking. • First, the document is tokenized. • Second, we annotate the document with part-of-speech (POS) tags. • Third, we extract noun phrases with the pattern (adjective)*(noun)+ • We regard these noun phrases as candidate keyphrases.
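
The (adjective)*(noun)+ pattern can be matched with a single pass over POS-tagged tokens. The sketch below assumes Penn Treebank style tags (JJ* for adjectives, NN* for nouns) and greedy, maximal matches; the tokenizer and tagger themselves are left out, and the function name is hypothetical.

```python
def extract_candidates(tagged):
    """Return phrases matching (adjective)*(noun)+ from a list of (word, tag) pairs."""
    candidates = []
    adjectives, nouns = [], []

    def flush():
        # Emit a phrase only if at least one noun was collected.
        if nouns:
            candidates.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for word, tag in tagged:
        if tag.startswith("NN"):
            nouns.append(word)
        elif tag.startswith("JJ"):
            if nouns:            # an adjective after nouns starts a new phrase
                flush()
            adjectives.append(word)
        else:
            flush()              # any other tag ends the current phrase
    flush()
    return candidates

sentence = [
    ("the", "DT"), ("topical", "JJ"), ("page", "NN"), ("rank", "NN"),
    ("extracts", "VBZ"), ("good", "JJ"), ("keyphrases", "NNS"),
    ("quickly", "RB"), ("nice", "JJ"), ("and", "CC"),
]
print(extract_candidates(sentence))  # ['topical page rank', 'good keyphrases']
```

Adjectives that are never followed by a noun, such as "nice" here, are dropped because the pattern requires at least one noun.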

TOPICAL PAGERANK FOR KEYPHRASE EXTRACTION • Extract Keyphrases Using Ranking Scores • We rank them using the ranking scores obtained by TPR. • By considering the topic distribution of the document, we further integrate the topic-specific rankings of candidate keyphrases into a final ranking
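
The slide does not spell out the scoring formulas, so the sketch below rests on two labeled assumptions: a phrase's score under topic z is the sum of its words' topic-specific scores, and the final score weights each topic by the document's topic distribution pr(z|d). The function name and all numbers are hypothetical toy data.

```python
def rank_keyphrases(candidates, topic_word_scores, doc_topics, top_m):
    """Combine topic-specific word scores into one ranking of candidate phrases.

    topic_word_scores: {topic: {word: topic-specific score of the word}}
    doc_topics:        {topic: pr(z|d)}
    """
    final = {}
    for phrase in candidates:
        words = phrase.split()
        final[phrase] = sum(
            doc_topics[z] * sum(word_scores.get(w, 0.0) for w in words)
            for z, word_scores in topic_word_scores.items()
        )
    return sorted(final, key=final.get, reverse=True)[:top_m]

topic_word_scores = {
    0: {"topical": 0.5, "pagerank": 0.3, "news": 0.2},
    1: {"topical": 0.1, "pagerank": 0.2, "news": 0.7},
}
candidates = ["topical pagerank", "news"]
print(rank_keyphrases(candidates, topic_word_scores, {0: 0.9, 1: 0.1}, top_m=2))
print(rank_keyphrases(candidates, topic_word_scores, {0: 0.1, 1: 0.9}, top_m=2))
```

The same candidates are ranked differently depending on which topic dominates the document. Summing word scores tends to favor longer phrases, so an averaged score is a possible variant.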

EXPERIMENTS • Datasets • One dataset was built by Wan and Xiao and was used in (Wan and Xiao, 2008b). • It contains 308 news articles from DUC2001 (Over et al., 2001) • 2,488 manually annotated keyphrases. • There are at most 10 keyphrases for each document. • In experiments we refer to this dataset as NEWS. • The other dataset was built by Hulth and was used in (Hulth, 2003). • It contains 2,000 abstracts of research articles • 19,254 manually annotated keyphrases. • In experiments we refer to this dataset as RESEARCH.

EXPERIMENTS • Dataset • We use the Wikipedia snapshot of March 2008 to build topic interpreters with LDA. • collected 2,122,618 articles • built the vocabulary by selecting 20,000 words according to their document frequency. • learned several models with different numbers of topics, from 50 to 1,500.

EXPERIMENTS • Evaluation Metrics • In experiments we select three evaluation metrics. • Precision / recall / F-measure • Binary preference measure (Bpref): $Bpref = \frac{1}{|R|} \sum_{r \in R} \left(1 - \frac{|n \text{ ranked higher than } r|}{|M|}\right)$ • R: correct keyphrases; M: extracted keyphrases • r: a correct keyphrase; n: an incorrect keyphrase • Mean reciprocal rank (MRR): $MRR = \frac{1}{|D|} \sum_{d \in D} \frac{1}{rank_d}$ • D: the document set; d: a document; rankd: the rank of the first correct keyphrase among all extracted keyphrases
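
The three metrics can be computed as in the sketch below. It assumes exact string matching between extracted and correct keyphrases, normalizes Bpref by the number of correct keyphrases found among the extracted ones, and counts a document with no correct keyphrase as contributing zero to MRR. The function names and the toy data are hypothetical.

```python
def precision_recall_f1(extracted, correct):
    """extracted: ranked list of keyphrases; correct: set of gold keyphrases."""
    hits = len(set(extracted) & set(correct))
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def bpref(extracted, correct):
    """Penalize each correct keyphrase by the incorrect ones ranked above it."""
    found = [k for k in extracted if k in correct]
    if not found:
        return 0.0
    incorrect_seen, total = 0, 0.0
    for keyphrase in extracted:
        if keyphrase in correct:
            total += 1.0 - incorrect_seen / len(extracted)
        else:
            incorrect_seen += 1
    return total / len(found)

def mean_reciprocal_rank(runs):
    """runs: list of (extracted, correct) pairs, one per document."""
    total = 0.0
    for extracted, correct in runs:
        for rank, keyphrase in enumerate(extracted, start=1):
            if keyphrase in correct:
                total += 1.0 / rank   # only the first correct keyphrase counts
                break
    return total / len(runs)

extracted = ["a", "x", "b", "y"]
correct = {"a", "b", "c"}
print(precision_recall_f1(extracted, correct))   # precision 0.5, recall 2/3, F-measure 4/7
print(bpref(extracted, correct))                 # 0.875
print(mean_reciprocal_rank([(extracted, correct), (["x", "a"], {"a"})]))  # 0.75
```

Precision, recall and F-measure ignore ordering, while Bpref and MRR reward rankings that place correct keyphrases near the top.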

EXPERIMENTS • Influences of Parameters on TPR • There are four parameters in TPR that may influence the performance of keyphrase extraction: • window size W for constructing the word graph • the number of topics K learned by LDA • different settings of preference values pz(w) • damping factor λ of TPR • Except for the parameter under investigation, we set the parameters to the following values: W = 10, K = 1000, λ = 0.3 and pz(w) = pr(z|w)

EXPERIMENTS • Window Size W • In experiments on NEWS, W ranges from 5 to 20, as shown in Table 1. • Similarly, when W ranges from 2 to 10, the performance on RESEARCH does not change much, but it becomes poor when W = 20. • RESEARCH documents (121 words) are much shorter than NEWS documents (704 words), so with a large W: • the graph becomes fully connected • the weights of links tend to be equal

EXPERIMENTS • The Number of Topics K • We demonstrate the influence of the number of topics K of LDA models in Table 2. • The influence is similar on RESEARCH • It indicates that LDA is appropriate for obtaining topics of words and documents for TPR to extract keyphrases.

EXPERIMENTS • Damping Factor λ • Damping factor λ of TPR reconciles the influences of graph walks

EXPERIMENTS • Preference Values • In Table 3 we show the influence when the number of keyphrases M = 10 on NEWS. • pr(w|z) assigns preference values according to how frequently words appear in the given topic. • pr(z|w) prefers those words that are focused on the given topic.

EXPERIMENTS • Comparing with Baseline Methods • We select three baseline methods to compare with TPR • TFIDF • PageRank • TFIDF and PageRank do not use topic information • LDA • computes the ranking score for each word using the topical similarity between the word and the document. • The LDA baseline is calculated using cosine similarity, which performs the best.
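
As an illustration of the LDA baseline, the sketch below scores each word by the cosine similarity between its topic distribution and the document's topic distribution. Representing a word by a vector of pr(z|w) values and the document by pr(z|d) is an assumption, and the function names and numbers are hypothetical toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def lda_baseline_scores(word_topics, doc_topics):
    """word_topics: {word: [pr(z|w) for each topic]}; doc_topics: [pr(z|d) for each topic]."""
    return {word: cosine(dist, doc_topics) for word, dist in word_topics.items()}

doc_topics = [0.8, 0.2]
word_topics = {"plo": [0.9, 0.1], "weather": [0.1, 0.9]}
baseline = lda_baseline_scores(word_topics, doc_topics)
print(sorted(baseline, key=baseline.get, reverse=True))  # ['plo', 'weather']
```

This score depends only on topic distributions, not on how often a word occurs in the document or how it is linked in the word graph.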

EXPERIMENTS • In Tables 4 and 5 we compare the four methods on both NEWS and RESEARCH. • The improvements of TPR are all statistically significant, tested with bootstrap re-sampling at 95% confidence. • LDA performs equal to or better than TFIDF and PageRank under precision/recall/F-measure. • The performance of LDA under MRR is much worse than that of TFIDF and PageRank

EXPERIMENTS • In Figures 3 and 4 we show the precision-recall relations of the four methods on NEWS and RESEARCH. • Each point on the precision-recall curve is evaluated at a different number of extracted keyphrases M

EXPERIMENTS • In Table 6 we show an example of keyphrases extracted using TPR from a news article with the title “Arafat Says U.S. Threatening to Kill PLO Officials” • Top 3 topics: • Palestine • Israel • terrorism

EXPERIMENTS • TFIDF • only considered the frequency • highly ranked the phrases with “PLO” which appeared about 16 times in this article • LDA • without considering the frequency • failed to extract keyphrase “political assassination”, in which the word “assassination” occurred 8 times in this article.

RELATED WORK • Supervised methods • regard keyphrase extraction as a classification task (Turney, 1999) • need a manually annotated training set, which is time-consuming to build • Clustering techniques on word graphs for keyphrase extraction (Grineva et al., 2009; Liu et al., 2009) • performed well on short abstracts but poorly on long articles • Topical PageRank with random jumps between topics (Nie et al., 2006) • did not help improve the performance of keyphrase extraction • Peter D. Turney. 1999. Learning to extract keyphrases from text. National Research Council Canada, Institute for Information Technology, Technical Report ERB-1057. • M. Grineva, M. Grinev, and D. Lizorkin. 2009. Extracting key terms from noisy and multi-theme documents. In Proceedings of WWW, pages 661–670. • Lan Nie, Brian D. Davison, and Xiaoguang Qi. 2006. Topical link analysis for web search. In Proceedings of SIGIR, pages 91–98.

CONCLUSION • We propose a new graph-based framework, Topical PageRank • We investigate the influence of various parameters on TPR • Future work • We plan to obtain topics using other machine learning methods and from other knowledge bases • We will consider topic information in other graph-based ranking algorithms such as HITS (Kleinberg, 1999). • We will investigate the influence of corpus selection in training LDA for keyphrase extraction using TPR.

RELATED WORK • Topical link analysis for web search (Nie et al., 2006) • when the surfer follows a graph link from vertex wi to wj, the ranking score of wi on topic z has a higher probability of passing to the same topic of wj and a lower probability of passing to a different topic of wj.
