Semantic Wordfication of Document Collections Presenter: Yingyu Wu
Outline • Introduction • ProjCloud Technique • Results and Comparisons • Discussion and Limitations • Conclusion
Introduction • Word Cloud
Two issues with word clouds: • (1) Existing methods do not yet provide an intuitive visual representation that links the words in the layout to the documents they are meant to represent. • (2) Constructing word clouds inside general polygons while preserving the semantic relationships between words.
Contributions: • A novel word cloud-based visualization technique named ProjCloud. • (1) It combines multidimensional projection and word clouds, which makes it possible to visualize the similarity among documents together with their corresponding word clouds and extends the exploratory capabilities of word clouds. • (2) A new approach for building word clouds inside polygons while still preserving the semantic relationships among keywords. • (3) A mechanism based on spectral sorting that arranges words according to their semantic relationships and highlights the most important words in the cloud.
ProjCloud Technique • Overview of the sequence of steps
Steps: • (1) The document collection is mapped into the visual space using a multidimensional projection technique (LSP). • (2) The points in the visual space are clustered, and each cluster defines a polygon. Two versions: automatic and user-interactive. • (3) Keywords (the most frequent words) are extracted, and their relevance is computed to guide the semantics-preserving placement of words. • (4) In the scaling step, keywords are sized based on their relevance and on the area of the containing polygon. • (5) An optimization algorithm then generates the word cloud.
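Step (3) can be illustrated with a minimal sketch that picks the most frequent words of one cluster. The function name `top_keywords`, the tokenizer, and the tiny stop-word list are assumptions made for illustration; the slides do not say how ProjCloud tokenizes or filters text.

```python
import re
from collections import Counter

# Hypothetical stop-word list, only for this illustration.
DEFAULT_STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is"})

def top_keywords(documents, k=5, stopwords=DEFAULT_STOPWORDS):
    """Return the k most frequent non-stop-word tokens in a cluster's documents."""
    counts = Counter()
    for text in documents:
        # Lowercase the text and keep runs of letters as tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(k)]
```

For example, `top_keywords(["word cloud layout", "word cloud of documents", "word projection"], k=2)` keeps `word` and `cloud`.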
Keyword Relevance and Semantic Relation • Let M be the document × term frequency matrix. • The covariance matrix C is obtained from M. • Build a graph G in which each node corresponds to a keyword and an edge eij connects two keywords (Wi and Wj) if and only if the covariance Cij is among the k largest. • Assuming that edge eij has weight Cij, the Fiedler vector assigns to each keyword Wi a scalar value ai that minimizes: $\sum_{e_{ij} \in G} C_{ij}\,(a_i - a_j)^2$, subject to $\sum_i a_i = 0$ and $\sum_i a_i^2 = 1$. • If Cij is large, that is, if Wi and Wj are closely related, the two keywords receive similar values.
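The graph construction and the Fiedler vector can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the k largest covariances are taken globally over all keyword pairs, non-positive covariances are skipped so that edge weights stay positive, and the resulting graph is assumed to be connected (otherwise the Fiedler vector is not unique). The function names are hypothetical.

```python
import numpy as np

def covariance_graph(M, k):
    """Weighted adjacency matrix from a documents x terms frequency matrix M."""
    C = np.cov(np.asarray(M, dtype=float), rowvar=False)  # term-by-term covariance
    n = C.shape[0]
    rows, cols = np.triu_indices(n, k=1)          # every keyword pair i < j
    top = np.argsort(C[rows, cols])[::-1][:k]     # indices of the k largest covariances
    W = np.zeros((n, n))
    for i, j in zip(rows[top], cols[top]):
        if C[i, j] > 0:                           # assumption: keep positive weights only
            W[i, j] = W[j, i] = C[i, j]
    return W

def fiedler_vector(W):
    """Eigenvector of the graph Laplacian for its second-smallest eigenvalue."""
    L = np.diag(W.sum(axis=1)) - W                # unnormalized Laplacian D - W
    _, vecs = np.linalg.eigh(L)                   # eigenvalues come back in ascending order
    return vecs[:, 1]
```

The sign of the returned vector is arbitrary, so only differences between entries are meaningful: strongly connected keywords end up with close values.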
The most relevant keyword: • Let Cijmax be the largest covariance in C, with Wi and Wj the corresponding words. • The most relevant keyword is Wi if the average covariance between Wi and Wk (k = 1, 2, ..., n) is larger than the average covariance between Wj and Wk; otherwise it is Wj. • Once the most relevant keyword (Wr) is found, the keywords are sorted in increasing order according to • In ProjCloud, the order given by the Fiedler vector dictates the position of the words in the cloud.
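The selection rule for the most relevant keyword can be sketched as below. Two readings are assumed here and are not confirmed by the slides: the largest covariance is searched among off-diagonal entries only, and the average for a keyword is taken over its full row of C (k = 1, ..., n, including the keyword's own variance). The name `most_relevant_keyword` is hypothetical.

```python
import numpy as np

def most_relevant_keyword(C):
    """Index of the most relevant keyword given a covariance matrix C."""
    C = np.asarray(C, dtype=float)
    off_diag = C.copy()
    np.fill_diagonal(off_diag, -np.inf)           # ignore variances when searching the maximum
    i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    # Compare the average covariance of the two candidates with all keywords.
    return int(i) if C[i].mean() > C[j].mean() else int(j)
```

With `C = [[4, 3, 1], [3, 2, 0], [1, 0, 5]]` the largest off-diagonal covariance links keywords 0 and 1, and keyword 0 has the larger row average, so it is selected.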
Sizing keywords • (1) Each keyword is represented by its bounding box. • (2) The font size of each keyword is set to a scaled value within the interval [fmin, fmax], initially [12, 50]. • (3) If the sum of the areas of all keyword bounding boxes is smaller than the area of polygon P, fmax is increased and the values are re-scaled. This process is repeated until the sum of the keyword areas exceeds the area of P.
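The re-scaling loop can be sketched as follows. Several details are assumptions made for illustration: relevance is mapped linearly into [fmin, fmax], a bounding box is approximated as `0.6 * size * len(word)` wide and `size` high, and fmax grows by a fixed `step`. None of these specifics come from the slides, and the function names are hypothetical.

```python
def scale_fonts(relevance, f_min, f_max):
    """Linearly map relevance values into the font-size interval [f_min, f_max]."""
    lo, hi = min(relevance.values()), max(relevance.values())
    if hi == lo:                                  # all keywords equally relevant
        return {w: f_max for w in relevance}
    return {w: f_min + (r - lo) / (hi - lo) * (f_max - f_min)
            for w, r in relevance.items()}

def box_area(word, size, char_aspect=0.6):
    """Rough bounding-box area; char_aspect is an assumed width/height ratio per character."""
    return (char_aspect * size * len(word)) * size

def fit_font_sizes(relevance, polygon_area, f_min=12.0, f_max=50.0, step=1.0):
    """Increase f_max until the summed box areas exceed the polygon area."""
    sizes = scale_fonts(relevance, f_min, f_max)
    while sum(box_area(w, s) for w, s in sizes.items()) <= polygon_area:
        f_max += step
        sizes = scale_fonts(relevance, f_min, f_max)
    return sizes, f_max
```

A call such as `fit_font_sizes({"cloud": 1.0, "word": 0.5, "layout": 0.0}, polygon_area=20000.0)` returns the final sizes together with the enlarged fmax; the least relevant keyword stays at fmin while the most relevant one takes the final fmax.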
Discussion and Limitations • ProjCloud depends heavily on the clustering process. • If the clustering performs poorly, the word cloud becomes very hard to fit and read. • Empty space is left between clusters.