National Technical University of Athens

Clustering Documents using the 3-Gram Graph Representation Model
SUPER: Social sensors for secUrity Assessments and Proactive EmeRgencies management
National Technical University of Athens
John Violos, Konstantinos Tserpes, Athanasios Papaoikonomou, Magdalini Kardara, Theodora Varvarigou
PCI 2014, 18th Panhellenic Conference in Informatics, 3/10/2014


SUPER
• Hurricane Sandy, 2012
• 20 million tweets
• 10 pictures per second on Instagram
• Virginia, U.S., 2011: earthquake of 5.8 on the Richter scale
• 40,000 tweets in the first minute

Topic Communities
Detect topic communities in social networks.
• Texts of users
• Social graph
• Actions (likes, follows)

Text Clustering
Users write texts about:
• Interests
• Habits
• Events in their lives
Cluster texts into topics => cluster their writers into topic communities.

LDA (Latent Dirichlet Allocation)
What is the weakness? It is a bag-of-words model, so it ignores the order of the words.

Sequence of Words
The sequence of words is valuable information. Furthermore, derivatives of a word are similar words.
We need a representation model that:
• Keeps the information of the word sequence.
• Captures the similarity between derivatives of words.
A good solution is N-Gram Graphs!

Overview: Basic Steps
• Input: corpus of texts, number of clusters k.
• Build the N-gram graph that represents the corpus.
• Build the N-gram graph that represents each text.
• Partition the corpus graph into k subgraphs.
• Compare each text graph with every partition.
• Allocate each text to the cluster with the highest comparison result.
• Output: k clusters, each with the texts it contains.

N-Grams (1)
What is an N-gram? An N-gram is a contiguous sequence of N items from a given sequence of text. The items can be phonemes, syllables, letters or words. In our research we use letters and N=3.
An example: “home_phone” gives “hom”, “ome”, “me_”, “e_p”, “_ph”, “pho”, “hon”, “one”.
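
The extraction in the example above can be sketched in a few lines of Python. This is only an illustration: the function name `char_ngrams` is hypothetical and not taken from the presentation (the slides state that the authors' implementation used Java SE).

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all contiguous character n-grams of `text`, in reading order."""
    # A text shorter than n characters yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("home_phone"))
# ['hom', 'ome', 'me_', 'e_p', '_ph', 'pho', 'hon', 'one']
```

A string of length L yields L - N + 1 N-grams, so the 10-character example produces 8 trigrams.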

N-Grams (2)
N-grams are used in many applications:
• Approximate string matching.
• Finding likely candidates for the correct spelling of a misspelled word.
• Language identification.
• Species identification from a small sequence of DNA.

N-Gram Graph
• Nodes are all the N-grams of a text.
• Edges join only neighboring N-grams.
• A threshold defines how many edges are added.
• Edges can be weighted or unweighted, directed or undirected.

Example of 3-Gram Graph
In this example the graph is:
• Undirected
• Weighted
• Each node is a 3-gram
• The threshold of neighbor nodes is 3
[Figure: the 3-gram graph of “home_phone”.]
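
A minimal sketch of how such a graph could be built, under two assumptions the slides do not spell out: the threshold is read as "link each 3-gram to the next 3 3-grams in the text", and an edge weight counts how often the two 3-grams occur within that window. The name `ngram_graph` is hypothetical.

```python
from collections import Counter


def ngram_graph(text: str, n: int = 3, window: int = 3) -> Counter:
    """Build an undirected, weighted n-gram graph as {edge: weight}.

    An edge is a frozenset of two n-grams (so direction is ignored).
    Assumption: the weight is the number of times the two n-grams
    appear within `window` positions of each other.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        # Link the current n-gram to its next `window` neighbors.
        for neighbor in grams[i + 1:i + 1 + window]:
            edges[frozenset((gram, neighbor))] += 1
    return edges


graph = ngram_graph("home_phone")
print(len(graph))                          # 18 edges
print(graph[frozenset(("hom", "ome"))])    # 1
```

For “home_phone” all eight 3-grams are distinct, so every edge has weight 1; in longer texts repeated neighborhoods accumulate higher weights. A repeated 3-gram that neighbors itself would produce a one-element edge here, which a fuller implementation might want to filter out.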

Graph Comparison
How do we compare two N-gram graphs?
• Containment Similarity (unweighted)
• Value Similarity (weighted)
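
The slide names the two measures without giving formulas. The sketch below assumes the definitions commonly used with N-gram graphs: containment similarity is the share of common edges relative to the smaller graph, and value similarity sums the min/max weight ratio of each common edge and divides by the size of the larger graph. Graphs are plain dictionaries mapping an edge to a positive weight; the function names are hypothetical.

```python
def containment_similarity(g1: dict, g2: dict) -> float:
    """Fraction of shared edges, relative to the smaller graph (weights ignored)."""
    if not g1 or not g2:
        return 0.0
    shared = g1.keys() & g2.keys()
    return len(shared) / min(len(g1), len(g2))


def value_similarity(g1: dict, g2: dict) -> float:
    """Like containment, but each shared edge counts by how close its weights are.

    Assumes all weights are positive.
    """
    if not g1 or not g2:
        return 0.0
    shared = g1.keys() & g2.keys()
    total = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in shared)
    return total / max(len(g1), len(g2))


a = {("ab", "bc"): 2, ("bc", "cd"): 1}
b = {("ab", "bc"): 1, ("bc", "cd"): 1, ("cd", "de"): 1}
print(containment_similarity(a, b))  # 1.0: every edge of the smaller graph is in the larger
print(value_similarity(a, b))        # 0.5: (1/2 + 1/1) / 3
```

Both measures lie between 0 and 1, which makes the scores of one text against the k partitions directly comparable.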

Graph Partitioning
• Split the graph into k subgraphs with the minimum number of edges between them.
• A graph partition can represent a topic.
• There are many graph partitioning algorithms:
• Kernighan–Lin algorithm
• Using the edge betweenness centrality
• Fast Kernel-based Multilevel Algorithm for Graph Clustering

Fast Kernel-based Multilevel Algorithm
• Start from a random initial partitioning.
• For each node i, compute the cost of node i belonging to each cluster.
• Assign node i to the cluster with the minimum cost.
• Iterate until no node changes cluster.
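
The loop described above can be sketched as follows. This is a simplified illustration, not the kernel-based algorithm itself: the slide does not define the cost, so the sketch assumes that the cost of placing a node in a cluster is the negative of the total edge weight linking it to that cluster, and it omits the multilevel coarsening step. The name `partition` and its parameters are hypothetical.

```python
import random


def partition(nodes, weights, k, seed=0, max_rounds=100):
    """Split `nodes` into k clusters by iterative reassignment.

    `weights` maps (u, v) pairs to positive edge weights (undirected);
    every u and v must appear in `nodes`.
    Assumed cost of node v in cluster c: minus the edge weight from v to c.
    """
    rng = random.Random(seed)
    label = {v: rng.randrange(k) for v in nodes}  # random initial partitioning

    # Symmetric adjacency lists for fast neighbor lookup.
    adj = {v: {} for v in nodes}
    for (u, v), w in weights.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w

    for _ in range(max_rounds):
        moved = False
        for v in nodes:
            pull = [0.0] * k  # edge weight from v towards each cluster
            for u, w in adj[v].items():
                if u != v:
                    pull[label[u]] += w
            best = max(range(k), key=lambda c: pull[c])
            # Move only on strict improvement, so the cut weight always decreases.
            if pull[best] > pull[label[v]]:
                label[v] = best
                moved = True
        if not moved:  # no node changed cluster
            break
    return label


edges = {("a", "b"): 5, ("b", "c"): 5, ("a", "c"): 5,
         ("d", "e"): 5, ("e", "f"): 5, ("d", "f"): 5, ("c", "d"): 1}
print(partition(list("abcdef"), edges, k=2, seed=1))
```

Because the starting point is random, different seeds can end in different local optima, and a cluster may even end up empty; running several seeds and keeping the partition with the smallest cut is one common way to reduce that sensitivity.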

Experimental Results
• Reuters-21578: the most widely used test collection for text categorization research.
• Data set: 18457 documents belonging to 428 labels.
• Multi-label documents: each document belongs to 0 to 29 labels.
• The complete method was implemented using Java SE.

Experimental Results
• 3-Gram Graph: recognizes the clusters that include many documents.
• LDA: produces small clusters that look like broken parts of the gold standard clusters.

[Figure: Precision & Recall]

Advantages of the 3-Gram Graphs Clustering
• The method can capture the sequence of words.
• Derivatives of a word are not handled as different words.
• Big clusters can be recognized and more documents can be assigned to them.
• It supports partial document matching and soft membership.
• It can capture the writing characteristics of a writer.

General notes
• Our method is not case sensitive.
• Punctuation and numbers are omitted.
• The partitions depend on the initial random clustering.
• The maximum number of nodes of a 3-gram graph is 19683.
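
The figure 19683 equals 27 to the power of 3, which is consistent with an alphabet of 26 lowercase letters plus one word separator. The sketch below makes that assumption explicit and shows one possible reading of the preprocessing notes; the function names are hypothetical, and using “_” as the separator follows the “home_phone” example rather than any stated rule.

```python
import re
import string

# Assumption: 26 lowercase Latin letters plus "_" as the word separator.
ALPHABET = string.ascii_lowercase + "_"


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and digits, join words with '_'."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", "", lowered)  # keep letters and whitespace
    return "_".join(letters_only.split())


def max_nodes(n: int = 3) -> int:
    """Upper bound on distinct n-grams over ALPHABET, i.e. on graph nodes."""
    return len(ALPHABET) ** n


print(normalize("Home Phone, 2014!"))  # home_phone
print(max_nodes(3))                    # 19683
```

The bound grows quickly with N (27^4 = 531441), which is worth keeping in mind when moving to longer N-grams.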

Future work
• Experiment with 4-grams, 5-grams and 6-grams.
• Experiment with various threshold sizes.
• Experiment with various graph similarity functions.
• Experiment with various graph partitioning algorithms.
• Remove stop words.
• Filter out edges that do not provide useful information.

Thank you for your attention!
