A Topic Detection and Tracking method combining NLP with Suffix Tree Clustering

Author: Yaohong JIN • Source: International Conference on Computer Science and Electronics Engineering (ICCSEE) • Date: 2013/10/7 • Presenter: 曹昌林

Outline • Introduction • CLUSTERING ALGORITHM • TOPIC DETECTION AND TRACKING ALGORITHM • Conclusion

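TDT (Topic Detection and Tracking) • An information-processing technique • Can be used to identify major topics and to track the topics that extend from them • Applied in news mining, where the focus of a topic shifts over time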

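suffix tree • The suffix tree T of a string S containing m words is a tree with exactly m leaf nodes; every edge is labeled with a non-empty substring of S, and no two edges leaving the same node may have labels that begin with the same word. e.g. bananas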

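suffix tree clustering (generalized suffix tree)(1) • Combining n strings into a single suffix tree gives a generalized suffix tree. • Each leaf node is labeled (j, i); concatenating the edge labels along the whole path from the root to that leaf gives the suffix of string j (0 < j ≤ n) that starts at position i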

suffix tree clustering (generalized suffix tree)(2) • e.g. S = { "cat ate cheese", "mouse ate cheese too", "cat ate mouse too" }
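The example set S can be explored with a small word-level phrase index. This is a minimal sketch, not the linear-time suffix tree construction: it enumerates every prefix of every suffix (quadratic in sentence length), which corresponds to the nodes of an uncompressed suffix trie. The helper names `index_phrases` and `shared_phrases` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def index_phrases(docs):
    """Map each phrase (tuple of words) to the set of 1-based ids of the
    documents that contain it. Naive enumeration, not linear time."""
    phrase_docs = defaultdict(set)
    for j, doc in enumerate(docs, start=1):
        words = doc.split()
        for i in range(len(words)):                   # suffix starting at word i
            for end in range(i + 1, len(words) + 1):  # every prefix of that suffix
                phrase_docs[tuple(words[i:end])].add(j)
    return phrase_docs


def shared_phrases(phrase_docs, min_docs=2):
    """Keep phrases that occur in at least `min_docs` documents."""
    return {
        " ".join(phrase): sorted(ids)
        for phrase, ids in phrase_docs.items()
        if len(ids) >= min_docs
    }


S = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
base_clusters = shared_phrases(index_phrases(S))
for phrase, ids in sorted(base_clusters.items()):
    print(f"{phrase!r}: documents {ids}")
```

Each shared phrase is a candidate cluster of the documents that contain it, which is why one document can land in several clusters.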

CLUSTERING ALGORITHM

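Feature Selection(1) • An NLP algorithm selects the more meaningful words for clustering • A stop word table filters out high-frequency words (such as "the", "I", "a") • TF-IDF computes the weight of each word and filters out commonly used words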
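The two filters above can be sketched as follows. This is an illustration under stated assumptions: the stop word table holds only the three words quoted on the slide, TF-IDF uses the plain `tf * log(N / df)` form, and the cut-off `min_weight` is a hypothetical parameter, none of which is confirmed as the author's setup.

```python
import math
from collections import Counter

# Tiny illustrative table; the real stop word table is not given in the slides.
STOP_WORDS = {"the", "i", "a"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


def select_features(docs, min_weight=0.0):
    """Keep the words whose TF-IDF weight exceeds `min_weight` (assumed cut-off)."""
    return [
        [w for w, score in doc_weights.items() if score > min_weight]
        for doc_weights in tf_idf(docs)
    ]


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
features = select_features(docs)
print(features)
```

In this toy corpus "ate" occurs in every document, so its IDF is zero and it is filtered out as a commonly used word.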

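Feature Selection(2) • Initialize STC to track word sequences of any length • Tag every word with its part of speech and its meaning • Select the nouns, verbs, and meanings as the key words of the document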

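Suffix Tree Clustering • The output of feature selection is fed into STC • The punctuation marks in the text and their positional relationships are preserved • Advantages: a document can appear in multiple clusters, and inserting any sentence into the tree takes only linear time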

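Scoring Clusters(1) • Each day's news headlines are distributed into a series of clusters • The importance of a cluster depends on how many articles contain the topic and how many media outlets cover it; a topic that is high on both receives the most attention • Using the formula on the next slide, the 50 highest-scoring clusters are selected as the source for TDT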

Scoring Clusters(2) • The formula uses the following quantities: • the importance of the topic • the number of articles in the topic • the total number of articles in the day • the number of media outlets in which the topic is involved • the total number of media outlets in the corpus.
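The selection of the top 50 clusters can be sketched as below. The scoring formula itself is not reproduced in these slides, so the `importance` function is purely an assumption: it multiplies the share of the day's articles by the share of media outlets, which is only one plausible way to reward topics that are high on both. The dictionary keys `label`, `articles`, and `media` are also hypothetical.

```python
def importance(n_articles, total_articles, n_media, total_media):
    """ASSUMED score: article share times media share.
    The formula used in the paper is not shown in the slides."""
    return (n_articles / total_articles) * (n_media / total_media)


def top_clusters(clusters, total_articles, total_media, limit=50):
    """Return the labels of the `limit` highest-scoring clusters."""
    ranked = sorted(
        clusters,
        key=lambda c: importance(
            c["articles"], total_articles, c["media"], total_media
        ),
        reverse=True,
    )
    return [c["label"] for c in ranked[:limit]]


clusters = [
    {"label": "storm", "articles": 10, "media": 2},
    {"label": "election", "articles": 40, "media": 8},
    {"label": "merger", "articles": 20, "media": 5},
]
best = top_clusters(clusters, total_articles=100, total_media=10, limit=2)
print(best)
```

Any other monotone combination of the two ratios could be swapped into `importance` without changing the selection step.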

TOPIC DETECTION AND TRACKING ALGORITHM(1) • Suppose A = {a1, a2, ……, an} is the set of topics in one time period. Initially A is an empty set. • Bi = {bi1, bi2, ……, bim} is the set of clusters in one day, where i denotes the i-th day and m is 50 • Step 1: initialize the topic set A; • Step 2: if A is the empty set, add all the elements of Bi into A;

TOPIC DETECTION AND TRACKING ALGORITHM(2) • Step 3: compute the similarity of each pair (ak, bij); • Step 4: if a cluster bij is similar to ak, bij is linked with ak (this procedure is tracking), and bij is called a sub-topic of ak; • Step 5: if bij is not similar to any element of A, bij is a new topic and is added to A (this procedure is detection); • Step 6: generate a description for each topic.
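Steps 1 to 5 can be sketched as a single loop over days. This is a minimal illustration under assumptions: clusters are represented as sets of words, the similarity test is passed in by the caller, each new cluster is compared with the cluster that founded a topic, and Step 6 (description generation) is left out. The function name `detect_and_track` is hypothetical.

```python
def detect_and_track(daily_clusters, similar, topics=None):
    """daily_clusters: list of days, each a list of clusters (sets of words).
    similar(a, b) -> bool is supplied by the caller.
    Returns a list of topics; each topic is a list of linked clusters,
    the first entry being the cluster that founded the topic."""
    topics = [] if topics is None else topics           # Step 1
    for day in daily_clusters:
        if not topics:                                   # Step 2
            topics.extend([cluster] for cluster in day)
            continue
        for cluster in day:                              # Step 3
            linked = False
            for topic in topics:
                if similar(topic[0], cluster):           # Step 4: tracking
                    topic.append(cluster)
                    linked = True
            if not linked:                               # Step 5: detection
                topics.append([cluster])
    return topics


day1 = [{"election", "vote", "poll"}, {"storm", "flood", "rain"}]
day2 = [{"election", "vote", "result"}, {"merger", "bank", "deal"}]

# Placeholder similarity: two clusters match if they share at least two words.
topics = detect_and_track([day1, day2], similar=lambda a, b: len(a & b) >= 2)
for topic in topics:
    print(len(topic), "cluster(s):", sorted(topic[0]))
```

Comparing against the founding cluster keeps the sketch short; comparing against the most recent sub-topic instead would be one way to follow a topic whose focus shifts over time.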

TOPIC DETECTION AND TRACKING ALGORITHM(3) • The difficulty of the TDT algorithm above lies in computing the similarity of clusters, because the focus of a topic gradually shifts over time • Similarity computation has to take this shifting phenomenon into account • A new description has to be generated from the list of linked topics whenever a topic is linked by other topics

Similarity of two Clusters(1) • The Vector Space Model (VSM) represents the content of a cluster • In addition to the label of the cluster, the top K words are added to the vector • The K words are extracted from the nodes of the suffix tree by the Mutual Information algorithm • K is set to 50
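Building such a cluster vector can be sketched as follows. Two assumptions are made here: words are ranked by pointwise mutual information with the cluster, computed from document counts, and the counts come from tokenized documents rather than from suffix tree nodes as in the slides. The names `top_k_words` and `cluster_vector` are hypothetical.

```python
import math
from collections import Counter


def top_k_words(cluster_docs, all_docs, k=50):
    """Rank the words of a cluster by pointwise mutual information (PMI)
    with the cluster. Every document is a list of words, and
    cluster_docs is a subset of all_docs."""
    n_all, n_cluster = len(all_docs), len(cluster_docs)
    df_all = Counter(w for doc in all_docs for w in set(doc))
    df_cluster = Counter(w for doc in cluster_docs for w in set(doc))
    # PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), estimated from counts.
    pmi = {
        w: math.log((df_cluster[w] * n_all) / (df_all[w] * n_cluster))
        for w in df_cluster
    }
    # Highest PMI first; ties broken alphabetically for a stable order.
    return sorted(pmi, key=lambda w: (-pmi[w], w))[:k]


def cluster_vector(label_words, cluster_docs, all_docs, k=50):
    """Cluster label words plus the top-k words, as a set of terms."""
    return set(label_words) | set(top_k_words(cluster_docs, all_docs, k))


all_docs = [
    ["cat", "ate", "cheese"],
    ["mouse", "ate", "cheese", "too"],
    ["cat", "ate", "mouse", "too"],
]
cheese_cluster = all_docs[:2]  # the two documents sharing "ate cheese"
print(top_k_words(cheese_cluster, all_docs, k=2))
```

Representing the vector as a set of terms is enough for a set-overlap similarity; a weighted vector would be needed for cosine similarity.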

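Similarity of two Clusters(2) • The Jaccard similarity measures the correlation of the two cluster vectors: Sim(ak, bij) = |W(ak) ∩ W(bij)| / |W(ak) ∪ W(bij)|, where W(x) is the set of words in the vector of cluster x • |W(ak) ∩ W(bij)| is the number of words that appear in both clusters • |W(ak) ∪ W(bij)| is the total number of distinct words in the two clusters.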

Similarity of two Clusters(3) • A sufficiently high similarity means the two clusters are similar and can be linked • A low similarity means they are not similar, and a new topic has to be added
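The similarity test can be sketched with the standard Jaccard ratio, the number of shared words divided by the number of distinct words in the two clusters. The threshold value is not preserved in these slides, so `threshold=0.5` below is a placeholder assumption, and `is_same_topic` is a hypothetical helper name.

```python
def jaccard(words_a, words_b):
    """Shared words divided by distinct words across both clusters."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0  # two empty clusters are treated as unrelated
    return len(a & b) / len(a | b)


def is_same_topic(words_a, words_b, threshold=0.5):
    """ASSUMPTION: 0.5 is a placeholder, not the threshold from the paper."""
    return jaccard(words_a, words_b) >= threshold


monday = {"election", "vote", "poll", "result"}
tuesday = {"election", "vote", "result", "recount"}
weather = {"storm", "flood"}

print(jaccard(monday, tuesday))        # 3 shared words out of 5 distinct
print(is_same_topic(monday, tuesday))  # linked: tracking
print(is_same_topic(monday, weather))  # not linked: a new topic is detected
```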

Description Generation • Semantic analysis based on the Hierarchical Network of Concepts theory (HNC theory) extracts the description from the labels. • Words with the same meaning or in a hyponymy relation have to be filtered out, and nouns are retained in the list with priority • The common phrase has to be extracted from the remaining word list

Conclusion • Advantage: the method can track topics effectively • Drawbacks: the different aspects of a topic were revealed correctly but not linked with each other; the ambiguity of topic detection and tracking was not handled very well • Future work: combine semantic analysis technology with TDT to deal with the ambiguity of topic detection and tracking
