Automatic Keyphrase Extraction by Bridging Vocabulary Gap
Xinxiong Chen, Tsinghua University
2013-04-26
THUNLP, Tsinghua University (http://nlp.csai.tsinghua.edu.cn)
Main Idea
• Vocabulary gap: appropriate keyphrases are not always statistically significant in the given document, and may not appear in it at all.
• Use word alignment models from statistical machine translation to learn translation probabilities between the words in documents and the words in keyphrases.
Introduction – Keyphrase
• What is a keyphrase?
  • A set of terms selected from a document as a short summary of that document.
Introduction – Keyphrase Extraction
• Why keyphrase extraction?
  • Digital libraries
  • Information retrieval
• Goal: automatically extract keyphrases from documents
• Unsupervised
Example
• A news article (translated from Chinese)
Example
• Existing unsupervised methods:
• TFIDF: Nuclear bombs, Iran, Israeli, enriched uranium, speech
• TextRank: Iran, Israeli, chief, Nuclear bombs, Military
  • Builds a word graph using a fixed-size co-occurrence window
  • Uses PageRank to decide which words are more important
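Below is a minimal sketch (not from the slides) of the window-based TextRank baseline just described: build a co-occurrence graph over words that appear within a fixed-size window and rank the words with PageRank. The window size, the networkx dependency, and the function name are illustrative assumptions.

```python
# Hedged sketch of the window-based TextRank baseline:
# co-occurrence graph + PageRank over words.
import networkx as nx

def textrank_keywords(words, window=5, top_k=5):
    graph = nx.Graph()
    for i, w in enumerate(words):
        # Connect each word to the words co-occurring within the window.
        for v in words[i + 1 : i + window]:
            if v != w:
                graph.add_edge(w, v)
    scores = nx.pagerank(graph)  # importance score of each word
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```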
Example
• LDA: Iran, England, America, Nation, Speech
  • Learns topics from documents
• ExpandRank: Iran, enriched uranium, Israeli, atomic energy, Lebanon
  • Finds the k nearest neighbor documents to build word graphs
Idea – Association
• When a word is mentioned, it reminds people of other words.
  • iPhone – Apple
  • Nuclear bombs – Nuclear weapon
• What is the association probability between "Nuclear bombs" and "Nuclear weapon"?
Idea – SMT for Keyphrase Extraction
• Both the content and the keyphrases are parallel summaries of the same news article.
• Unsupervised: use the title (or a summary) in place of manually assigned keyphrases.
• Estimate the translation probabilities between the words in the content and the words in the title.
• Use word alignment models to learn the translation.
Translation Probability
• Example: Nuclear bombs
  • Nuclear bombs: 0.515757
  • Liquid: 0.0871815
  • Nuclear Weapon: 0.0808868
  • Military Action: 0.0239178
  • Israeli Military: 0.0215988
  • Miniaturization: 0.0118
  • Possible: 0.0113688
  • enriched uranium: 0.0100252
Keyphrase Extraction Using WAM
• Given a news article, rank keyphrases by computing the scores.
  • Iran, Israeli, chief, Nuclear bombs, Military, …
  • Iran, Israeli, chief, Nuclear bombs, Nuclear weapon, Military, speech
Word Trigger Method (WTM)
• Three steps:
  • Preparing translation pairs
  • Learning a translation model (IBM Model-1)
  • Extracting keyphrases for a given resource
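A minimal sketch of the IBM Model-1 EM training behind the second step, assuming each translation pair is simply a (document-word list, annotation-word list) tuple; the function and variable names are illustrative, not from the paper.

```python
# Minimal IBM Model-1 EM sketch: learn Pr(annotation word a | document word d).
from collections import defaultdict
from itertools import product

def train_ibm_model1(pairs, iterations=10):
    """pairs: iterable of (doc_words, ann_words); returns t[(a, d)] ~ Pr(a | d)."""
    t = defaultdict(lambda: 1e-6)
    # Uniform initialization over all co-occurring word pairs.
    for doc_words, ann_words in pairs:
        for d, a in product(doc_words, ann_words):
            t[(a, d)] = 1.0
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(a, d)
        total = defaultdict(float)   # normalizer per document word d
        for doc_words, ann_words in pairs:
            for a in ann_words:
                # Each annotation word is softly aligned to all document words.
                z = sum(t[(a, d)] for d in doc_words)
                for d in doc_words:
                    frac = t[(a, d)] / z
                    count[(a, d)] += frac
                    total[d] += frac
        for (a, d), c in count.items():
            t[(a, d)] = c / total[d]
    return t
```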
Translation Pairs
• Length imbalance problem
• Unable to list all tags on the annotation side
• Tags may have different importance for the resource
Content-Title Pairs
• Length imbalance problem
  • Unable to list all tags on the annotation side
  • Tags may have different importance for the resource
• Sampling method
  • Tag weighting type: TF_t, TF-IRF_t
  • Length ratio
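One plausible sketch of the sampling step used to ease the length imbalance: draw content words in proportion to a weight (plain TF here, standing in for TF-IRF) until the sampled side is length_ratio times the title length. The exact sampling scheme and the default values are assumptions for illustration.

```python
# Hedged sketch of content-side sampling to balance content-title pair lengths.
import random
from collections import Counter

def sample_content(content_words, title_words, length_ratio=3.0, seed=0):
    rng = random.Random(seed)
    target_len = max(1, int(length_ratio * len(title_words)))
    weights = Counter(content_words)      # TF weighting of content words
    vocab = list(weights)
    if not vocab:
        return []
    probs = [weights[w] for w in vocab]
    # Draw words with probability proportional to their weight.
    return rng.choices(vocab, weights=probs, k=target_len)
```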
Learning Translation Probabilities
• IBM Model-1 as the WAM algorithm
• Asymmetric: Pr_d2a(t|w) (document-to-annotation) and Pr_a2d(t|w) (annotation-to-document)
• Linear combination: Pr(t|w) = λ · Pr_d2a(t|w) + (1 − λ) · Pr_a2d(t|w)
• When λ = 1 or λ = 0, it reduces to Pr_d2a(t|w) or Pr_a2d(t|w), respectively
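A tiny sketch of the linear combination above, assuming both directional models are stored as dictionaries mapping (t, w) to a probability; names are illustrative.

```python
# Hedged sketch of the λ-interpolation of the two directional translation models.
def combined_prob(t, w, prob_d2a, prob_a2d, lam=0.5):
    return lam * prob_d2a.get((t, w), 0.0) + (1.0 - lam) * prob_a2d.get((t, w), 0.0)
```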
Tag Suggestion Using Triggered Words
• Given a description, rank tags by computing their scores.
• Trigger power of the word w in the content:
  • TF-IRF_w
  • TextRank score
  • Their product
Keyphrase Extraction Using Triggered Words
• Given a document, rank keyphrases by computing the scores.
• Use the translation probabilities from the words in the document to the keyphrases.
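A hedged sketch of the ranking step shared by the last two slides: score each candidate keyphrase t by summing Pr(t | w) over document words w, weighted by the trigger power of w (plain TF here as a stand-in for TF-IRF or TextRank). The names and the exact weighting are assumptions.

```python
# Hedged sketch: rank candidate keyphrases by translation probability
# weighted with a simple trigger power of each document word.
from collections import Counter

def rank_keyphrases(doc_words, candidates, trans_prob, top_k=5):
    """trans_prob: dict mapping (t, w) -> Pr(t | w)."""
    tf = Counter(doc_words)
    total = sum(tf.values())
    scores = {}
    for t in candidates:
        scores[t] = sum(
            (tf[w] / total) * trans_prob.get((t, w), 0.0) for w in tf
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```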
Emphasize Tags Appearing in Content for WTM (EWTM)
• Emphasize the tags that also appear in the content.
• I_t(w): indicator function used to emphasize tags appearing in the content
  • Equals 1 when t = w
  • Equals 0 when t ≠ w
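One plausible way to realize EWTM is to interpolate the indicator I_t(w) with the translation probability inside the score; the interpolation weight gamma and this exact form are assumptions, not the paper's formula.

```python
# Hedged sketch of an EWTM-style score that boosts tags appearing in the content.
def ewtm_score(t, doc_word_weights, trans_prob, gamma=0.5):
    """doc_word_weights: dict w -> weight; trans_prob: dict (t, w) -> Pr(t | w)."""
    score = 0.0
    for w, weight in doc_word_weights.items():
        indicator = 1.0 if t == w else 0.0   # I_t(w)
        score += weight * ((1 - gamma) * trans_prob.get((t, w), 0.0) + gamma * indicator)
    return score
```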
Experiments
• Dataset: 13,702 news articles from www.163.com
• Evaluation metrics: precision, recall, and F-measure
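A small sketch of the evaluation metrics, assuming extracted and gold-standard keyphrases are compared by exact string match.

```python
# Precision, recall, and F-measure over extracted vs. gold-standard keyphrase sets.
def precision_recall_f1(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f1
```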
Experiment Results
Parameters – Length Ratio
• The length ratio: content length / title length
Application
• SINA app (http://app.thunlp.org/weibo)
• Now has more than 2 million registered users