
A Topic Detection and Tracking method combining NLP with Suffix Tree Clustering

Author: Yaohong JIN. Source: International Conference on Computer Science and Electronics Engineering (ICCSEE). Date: 2013/10/7. Presenter: 曹昌林.


Presentation Transcript


  1. A Topic Detection and Tracking method combining NLP with Suffix Tree Clustering Author : Yaohong JIN Source : International Conference on Computer Science and Electronics Engineering (ICCSEE), Date : 2013/10/7 Presenter : 曹昌林

  2. Outline • Introduction • CLUSTERING ALGORITHM • TOPIC DETECTION AND TRACKING ALGORITHM • Conclusion

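  3. TDT (Topic Detection and Tracking) • An information-processing technique • Can be used to identify the main topics and to track the topics that follow from them • Applied to news mining, where the focus of a topic shifts over time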

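  4. suffix tree • The suffix tree T of a string S containing m words is a tree with exactly m leaf nodes; every edge is labeled with a non-empty substring of S, and no two edges leaving the same node may carry labels that begin with the same word. ex: bananas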

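  5. suffix tree clustering (1) • Collecting n strings into a single suffix tree gives what the slide calls a suffix tree group (a generalized suffix tree). • Each leaf node is labeled (j, i); concatenating the edge labels along the path from the root to that leaf yields the suffix of string j (0 < j ≤ n) that starts at position i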

  6. suffix tree clustering (2) • ex: S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }
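To make the (j, i) leaf labels concrete, the sketch below enumerates the word-level suffixes of the three example strings and then collects every phrase shared by at least two strings. It is a brute-force illustration, not a compact suffix tree and not the paper's implementation: it takes quadratic time per string and also lists phrases (such as "cat" alone) that a real suffix tree would fold into a longer edge. The function names `word_suffixes` and `shared_phrases` are hypothetical.

```python
from collections import defaultdict

docs = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]

def word_suffixes(docs):
    """Yield ((j, i), suffix): the suffix of string j starting at word i (both 1-based)."""
    for j, doc in enumerate(docs, start=1):
        words = doc.split()
        for i in range(len(words)):
            yield (j, i + 1), tuple(words[i:])

def shared_phrases(docs, min_docs=2):
    """Map each phrase to the set of strings containing it, keeping shared phrases only."""
    phrase_docs = defaultdict(set)
    for (j, _i), suffix in word_suffixes(docs):
        # Every prefix of a suffix is a phrase that occurs in string j.
        for end in range(1, len(suffix) + 1):
            phrase_docs[suffix[:end]].add(j)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

leaves = dict(word_suffixes(docs))    # one entry per leaf label (j, i)
clusters = shared_phrases(docs)       # candidate clusters: phrase -> strings

for phrase, members in sorted(clusters.items()):
    print(" ".join(phrase), sorted(members))
```

Note that string 1 shows up under both "cat ate" and "ate cheese", which illustrates how one document can belong to several clusters.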

  7. Outline • Introduction • CLUSTERING ALGORITHM • TOPIC DETECTION AND TRACKING ALGORITHM • Conclusion

  8. CLUSTERING ALGORITHM

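  9. Feature Selection(1) • An NLP algorithm is used to select the more meaningful words for clustering • A stop-word table filters out high-frequency words (such as "the", "I", "a") • TF-IDF is used to compute the weight of each word and to filter out commonly used words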
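The two filters on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the stop-word table is a three-word toy list, the TF-IDF variant (relative term frequency times the natural log of N/df) is one common choice rather than the one the paper necessarily uses, and the `min_weight` cutoff is a hypothetical parameter.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "i", "a"}  # toy stop-word table (assumption)

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for words in tokenized:
        df.update(set(words))  # document frequency: count each word once per document
    weights = []
    for words in tokenized:
        tf = Counter(words)
        total = len(words)
        weights.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return weights

def select_features(docs, min_weight=0.1):
    """Keep words whose TF-IDF weight reaches min_weight (hypothetical cutoff)."""
    return [{w for w, s in doc.items() if s >= min_weight} for doc in tf_idf(docs)]

docs = ["the market news", "a sports news report", "I read the weather news"]
print(select_features(docs))
```

Here "news" occurs in every document, so its weight is zero and it is filtered out along with the stop words.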

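  10. Feature Selection(2) • STC is initialized so that it can track word sequences of any length • Every word is tagged with its part of speech and its meaning • Nouns, verbs, and meanings are selected as the keywords of a document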

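  11. Suffix Tree Clustering • The output of feature selection is fed into STC • Punctuation marks in the text and their positional relationships are retained • Advantages: a document can appear in multiple clusters, and inserting any sentence into the tree takes only linear time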

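  12. Scoring Clusters(1) • Each day's news headlines are distributed into a series of clusters • The importance of a cluster depends on how many articles contain the topic and how many media outlets cover it; a topic that is high on both receives the most attention • Using the formula on the next slide, the 50 highest-scoring clusters are selected as the source for TDT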

  13. Scoring Clusters(2) • [Scoring formula: an image on the original slide, not preserved in this transcript] • The formula is defined in terms of: the importance of the topic; the number of articles in the topic; the total number of articles in the day; the number of media outlets in which the topic is involved; the total number of media outlets in the corpus
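The selection of the top 50 clusters can be sketched as follows. The slide's actual formula is not preserved in this transcript, so the product of the article ratio and the media ratio used below is only a placeholder assumption chosen so that a topic scoring high on both ranks highest. The `Cluster` class and all field and function names are hypothetical.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Cluster:
    label: str
    n_articles: int  # number of articles in the topic
    n_media: int     # number of media outlets in which the topic is involved

def importance(cluster, total_articles, total_media):
    # ASSUMPTION: placeholder score, NOT the formula from the paper.
    article_ratio = cluster.n_articles / total_articles
    media_ratio = cluster.n_media / total_media
    return article_ratio * media_ratio

def top_clusters(clusters, total_articles, total_media, m=50):
    """Return the m highest-scoring clusters of the day (m is 50 on the slides)."""
    return heapq.nlargest(
        m, clusters, key=lambda c: importance(c, total_articles, total_media)
    )

day = [Cluster("a", 10, 2), Cluster("b", 40, 8), Cluster("c", 5, 9)]
best = top_clusters(day, total_articles=100, total_media=10, m=2)
print([c.label for c in best])
```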

  14. Outline • Introduction • CLUSTERING ALGORITHM • TOPIC DETECTION AND TRACKING ALGORITHM • Conclusion

  15. TOPIC DETECTION AND TRACKING ALGORITHM(1) • Suppose A = {a1, a2, ..., an} is the set of topics in one period of time. Initially A is an empty set. • B = {bi1, bi2, ..., bim} is the set of clusters in one day, where i is the ith day and m is 50 • Step 1: initialize the topic set A • Step 2: if A is an empty set, add all the elements of B into A

  16. TOPIC DETECTION AND TRACKING ALGORITHM(2) • Step 3: compute the similarity of each pair (ak, bij) • Step 4: if a cluster bij is similar to ak, bij is linked with ak (this procedure is tracking), and bij is called a sub-topic of ak • Step 5: if bij is not similar to any element of A, bij is a new topic and is added into A (this procedure is detection) • Step 6: generate a description for each topic
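Steps 1 to 5 can be sketched as a single loop over the daily cluster sets. This is an illustration under assumptions, not the author's code: the similarity test is passed in as a function, a cluster is linked to the first similar topic only, clusters from the same day are not compared with each other, and Step 6 (description generation) is left out. The name `detect_and_track` and the dictionary layout are hypothetical.

```python
def detect_and_track(daily_clusters, is_similar):
    """daily_clusters: list of days, each a list of clusters (B for day i).

    Returns the topic set A as a list of {"seed": cluster, "subtopics": [...]}.
    """
    topics = []  # Step 1: A starts as an empty set
    for day in daily_clusters:
        if not topics:
            # Step 2: A is empty, so every cluster of the day becomes a topic
            topics.extend({"seed": b, "subtopics": []} for b in day)
            continue
        new_topics = []
        for b in day:
            # Step 3: compare b with each existing topic a
            match = next((a for a in topics if is_similar(a["seed"], b)), None)
            if match is not None:
                match["subtopics"].append(b)  # Step 4: tracking
            else:
                new_topics.append({"seed": b, "subtopics": []})  # Step 5: detection
        topics.extend(new_topics)
    return topics

# Usage with clusters as word sets and a toy similarity test (any shared word).
days = [
    [{"election", "vote"}, {"flood", "rain"}],
    [{"vote", "result"}, {"football"}],
]
topics = detect_and_track(days, lambda a, b: len(a & b) > 0)
for t in topics:
    print(sorted(t["seed"]), [sorted(s) for s in t["subtopics"]])
```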

  17. TOPIC DETECTION AND TRACKING ALGORITHM(3) • The difficulty of the TDT algorithm above is computing the similarity of clusters, because the focus of a topic gradually shifts over time • Similarity computing has to take this shifting into account • A new description has to be generated from the list of topics whenever a topic is linked by other topics

  18. Similarity of two Clusters(1) • The Vector Space Model (VSM) is used to represent the content of a cluster • In addition to the label of the cluster, the top K words are added to the vector • The K words are extracted from the nodes of the suffix tree by the Mutual Information algorithm • K is set to 50

  19. Similarity of two Clusters(2) • The Jaccard distance is used to measure the correlation of the vectors of two clusters • It is computed from the number of words that appear in both clusters and the total number of words in the two clusters

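  20. Similarity of two Clusters(3) • [Threshold conditions: formula images on the original slide, not preserved in this transcript] • In the first case the two clusters are similar and can be linked • In the second case they are not similar, and a new topic has to be added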
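A minimal sketch of the similarity test, assuming each cluster vector is reduced to its set of words (the label plus the top K words). It uses the standard Jaccard coefficient, shared words divided by distinct words in either cluster, and reads "total number of words in two clusters" as the size of the union. The threshold value and its direction are assumptions, since the conditions on the slide are not preserved.

```python
def jaccard(words_a, words_b):
    """Jaccard coefficient of two word collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        return 0.0  # two empty clusters share nothing
    return len(a & b) / len(union)

def is_similar(words_a, words_b, threshold=0.5):
    # ASSUMPTION: 0.5 is a hypothetical threshold, not a value from the paper.
    return jaccard(words_a, words_b) >= threshold

a = {"cat", "ate", "cheese"}
b = {"ate", "cheese", "too"}
print(jaccard(a, b), is_similar(a, b))
```

A function like `is_similar` is what decides between linking a cluster to an existing topic (tracking) and adding it as a new topic (detection).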

  21. Description Generation • Semantic analysis based on the Hierarchical Network of Concepts theory (HNC theory) is used to extract the description from the labels • Words with the same meaning, or related by hyponymy, have to be filtered out, and nouns are retained in the list preferentially • The common phrase is then extracted from the remaining word list

  22. Outline • Introduction • CLUSTERING ALGORITHM • TOPIC DETECTION AND TRACKING ALGORITHM • Conclusion

  23. Conclusion • Advantage • The method can track topics effectively • Drawback • The different aspects of a topic were revealed correctly but not linked with each other • The ambiguity of topic detection and tracking was not handled very well • Semantic analysis technology is to be combined with TDT to deal with the ambiguity of topic detection and tracking
