Clustering Documents using the 3-Gram Graph Representation Model. SUPER Social sensors for secUrity Assessments and Proactive EmeRgencies management. National Technical University of Athens.
Clustering Documents using the 3-Gram Graph Representation Model
SUPER: Social sensors for secUrity Assessments and Proactive EmeRgencies management
National Technical University of Athens
John Violos, Konstantinos Tserpes, Athanasios Papaoikonomou, Magdalini Kardara, Theodora Varvarigou
PCI 2014, 18th Panhellenic Conference in Informatics, 3/10/2014
SUPER
• Hurricane Sandy, 2012: 20 million tweets; 10 pictures per second on Instagram
• Virginia, U.S., 2011 earthquake (5.8 Richter): 40,000 tweets in the first minute
Topic Communities
Detect topic communities in social networks, using the texts of users, the social graph, and user actions (likes, follows).
Text Clustering
Users write texts about:
• Interests
• Habits
• Events in their life
Clustering texts into topics => clustering their writers into topic communities.
LDA (Latent Dirichlet Allocation)
What is its weakness? It is a bag-of-words model, so it ignores the order of the words.
Sequence of Words
The sequence of words is valuable information. Furthermore, derivatives of a word are similar words.
We need a representation model that:
• Keeps the information of the word sequence.
• Captures the similarity between derivatives of words.
A good solution is the N-gram graph!
Overview: Basic Steps
Input: a corpus of texts and the number of clusters k.
• Build the N-gram graph that represents the corpus.
• Build the N-gram graph that represents each text.
• Partition the corpus graph into k subgraphs.
• Compare each text graph with all partitions.
• Allocate each text to the cluster with the highest comparison result.
Output: k clusters, each with the texts allocated to it.
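The last two steps (comparison and allocation) can be sketched as follows. This is a minimal illustration, not the authors' Java implementation: graphs are reduced to plain sets of edges, and the names `edge_overlap` and `allocate` as well as the overlap score are assumptions made for the example.

```python
def edge_overlap(text_edges, partition_edges):
    """Hypothetical similarity: shared edges divided by the smaller edge count."""
    if not text_edges or not partition_edges:
        return 0.0
    shared = len(text_edges & partition_edges)
    return shared / min(len(text_edges), len(partition_edges))


def allocate(text_graphs, partition_graphs, similarity=edge_overlap):
    """Put every text into the cluster whose partition scores highest.

    text_graphs: dict mapping a text name to its set of edges.
    partition_graphs: list of k edge sets, one per partition of the corpus graph.
    Ties go to the partition with the lowest index.
    """
    clusters = {index: [] for index in range(len(partition_graphs))}
    for name, edges in text_graphs.items():
        scores = [similarity(edges, part) for part in partition_graphs]
        best = max(range(len(scores)), key=scores.__getitem__)
        clusters[best].append(name)
    return clusters


if __name__ == "__main__":
    partitions = [
        {("oil", "il_"), ("il_", "l_p")},
        {("whe", "hea"), ("hea", "eat")},
    ]
    texts = {
        "doc1": {("oil", "il_")},
        "doc2": {("hea", "eat"), ("eat", "at_")},
    }
    print(allocate(texts, partitions))
```

Passing the similarity as a parameter keeps the allocation step independent of which graph comparison function is chosen.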
N-Grams (1)
What is an N-gram? An N-gram is a contiguous sequence of N items from a given sequence of text. The items can be phonemes, syllables, letters, or words. In our research we use letters and N = 3.
Example: “home_phone” → “hom”, “ome”, “me_”, “e_p”, “_ph”, “pho”, “hon”, “one”
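The example can be reproduced with a few lines of Python. The function name `char_ngrams` is chosen for this illustration only.

```python
def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of text, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("home_phone"))
    # ['hom', 'ome', 'me_', 'e_p', '_ph', 'pho', 'hon', 'one']
```

A text of length L yields L - N + 1 N-grams, so the 10-character example produces 8 trigrams.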
N-Grams (2)
N-grams are used in many applications:
• Approximate string matching.
• Finding likely candidates for the correct spelling of a misspelled word.
• Language identification.
• Species identification from a small sequence of DNA.
N-Gram Graph
• Nodes are all the N-grams of a text.
• Edges join only neighboring N-grams.
• A threshold defines which N-grams count as neighbors, and therefore how many edges are added.
• Edges can be weighted or unweighted, directed or undirected.
Example of 3-Gram Graph
The 3-gram graph of “home_phone”. In this example the graph is:
• Undirected
• Weighted
• Each node is a 3-gram
• The neighbor threshold is 3
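A graph with these properties could be built as sketched below. The slides do not spell out how the threshold is applied, so this sketch assumes that each 3-gram is linked to the next `window` 3-grams that follow it, and that an edge weight counts how often the pair occurs within that distance. The name `ngram_graph` is hypothetical.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build an undirected, weighted n-gram graph as a Counter of edges.

    Keys are sorted pairs of n-grams (so direction is ignored); values are
    the number of times the pair occurs at most `window` positions apart.
    The neighbourhood rule is an assumption made for this sketch.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        last = min(i + window, len(grams) - 1)
        for j in range(i + 1, last + 1):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


if __name__ == "__main__":
    graph = ngram_graph("home_phone")
    for (left, right), weight in sorted(graph.items()):
        print(left, right, weight)
```

Storing each edge as a sorted pair is one simple way to make the graph undirected; a directed variant would keep the pair in text order instead.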
Graph Comparison
How do we compare two N-gram graphs?
• Containment Similarity (unweighted)
• Value Similarity (weighted)
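The slides name the two measures without giving formulas. The sketch below uses the definitions commonly given in the N-gram graph literature, which is an assumption here: containment similarity is the share of common edges relative to the smaller graph, and value similarity sums the ratio of the smaller to the larger weight over common edges and divides by the size of the larger graph. Graphs are plain dicts mapping an edge to its weight.

```python
def containment_similarity(g1, g2):
    """Common edges divided by the edge count of the smaller graph (weights ignored)."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def value_similarity(g1, g2):
    """Sum of min/max weight ratios over common edges, over the larger edge count."""
    if not g1 or not g2:
        return 0.0
    total = 0.0
    for edge, weight in g1.items():
        if edge in g2:
            other = g2[edge]
            total += min(weight, other) / max(weight, other)
    return total / max(len(g1), len(g2))


if __name__ == "__main__":
    a = {("hom", "ome"): 2, ("ome", "me_"): 1}
    b = {("hom", "ome"): 1, ("pho", "hon"): 1, ("hon", "one"): 1}
    print(containment_similarity(a, b))  # 0.5
    print(value_similarity(a, b))
```

Both functions return 1.0 for identical graphs and 0.0 for graphs with no common edge; edge weights are assumed to be positive.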
Graph Partitioning
• k subgraphs, with the minimum number of edges between them. A graph partition can represent a topic.
• There are many graph partitioning algorithms:
  • Kernighan–Lin algorithm
  • Edge betweenness centrality
  • Fast Kernel-based Multilevel Algorithm for Graph Clustering
Fast Kernel-based Multilevel Algorithm
• Start from a random initial partitioning.
• For each node i, compute the cost of assigning node i to each cluster.
• Assign node i to the cluster with the minimum cost.
• Iterate until no node changes cluster.
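The loop described above can be sketched as follows. This is a deliberately simplified stand-in, not the kernel-based multilevel algorithm itself: the cost of placing a node in a cluster is assumed to be the negative total weight of its edges into that cluster, there is no coarsening phase, and nothing prevents a cluster from becoming empty. The names `random_partition` and `refine_partition` are hypothetical.

```python
import random
from collections import defaultdict


def random_partition(nodes, k, seed=None):
    """Random initial partitioning: give each node a cluster index in [0, k)."""
    rng = random.Random(seed)
    return {node: rng.randrange(k) for node in nodes}


def refine_partition(edges, assignment, k, max_iters=100):
    """Move nodes to their cheapest cluster until no node changes cluster.

    edges: dict mapping an undirected pair (u, v) to a positive weight.
    assignment: dict mapping every node to its initial cluster index.
    Cost of a cluster for a node (an assumption of this sketch): minus the
    total weight of the node's edges into that cluster.
    """
    neighbours = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbours[u][v] = weight
        neighbours[v][u] = weight

    assignment = dict(assignment)  # do not modify the caller's dict
    for _ in range(max_iters):
        changed = False
        for node in sorted(assignment):
            link = [0.0] * k
            for other, weight in neighbours[node].items():
                link[assignment[other]] += weight
            cost = [-value for value in link]
            best = min(range(k), key=cost.__getitem__)
            # Move only on a strict improvement, so ties keep the node in place.
            if cost[best] < cost[assignment[node]]:
                assignment[node] = best
                changed = True
        if not changed:
            break
    return assignment


if __name__ == "__main__":
    # Two tightly connected triangles joined by one weak edge.
    edges = {
        ("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5,
        ("d", "e"): 5, ("d", "f"): 5, ("e", "f"): 5,
        ("c", "d"): 1,
    }
    start = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}
    print(refine_partition(edges, start, k=2))
```

Because the outcome depends on the starting assignment, different seeds passed to `random_partition` can lead to different final partitions.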
Experimental Results
Reuters-21578: the most widely used test collection for text categorization research.
• Data set: 18457 documents belonging to 428 labels.
• Multi-label documents: each document belongs to 0 - 29 labels.
• The complete method was implemented using Java SE.
Experimental Results
• 3-Gram Graph: recognizes the clusters that include many documents.
• LDA: produces small clusters, like broken parts of the gold-standard clusters.
Precision & Recall
Advantages of 3-Gram Graph Clustering
• The method can capture the sequence of words.
• Derivatives of a word are not handled as different words.
• Big clusters can be recognized and more documents can be assigned to them.
• It supports partial document matching and soft membership.
• It can capture the writing characteristics of a writer.
General Notes
• Our method is not case sensitive.
• Punctuation and numbers are omitted.
• The partitions depend on the random initial clustering.
• The maximum number of nodes of a 3-gram graph is 19683.
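The node limit matches 27 to the power of 3, which suggests an alphabet of 27 symbols. The sketch below assumes these are the 26 lowercase Latin letters plus one word separator, written `_` as in the “home_phone” example; the slides do not state the alphabet explicitly. The function name `normalise` is hypothetical.

```python
import re


def normalise(text):
    """Lowercase, drop punctuation and digits, and join words with '_'.

    Note: under this assumed alphabet, non-ASCII letters are dropped too.
    """
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    return re.sub(r"\s+", "_", letters_only.strip())


ALPHABET_SIZE = 26 + 1            # letters a-z plus the separator (assumption)
MAX_NODES = ALPHABET_SIZE ** 3    # every possible 3-gram over that alphabet

if __name__ == "__main__":
    print(normalise("Home Phone, 2014!"))  # home_phone
    print(MAX_NODES)                       # 19683
```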
Future Work
• Experiment with 4-grams, 5-grams, 6-grams.
• Experiment with various threshold sizes.
• Experiment with various graph similarity functions.
• Experiment with various graph partitioning algorithms.
• Remove stop words.
• Filter out edges which do not provide useful information.
SUPER
Thank you for your attention!