
Semantic Wordfication of Document Collections

Semantic Wordfication of Document Collections. Presenter: Yingyu Wu. Outline. Introduction ProjCloud Technique Results and Comparisons Discussion and Limitations Conclusion. Introduction. Word Cloud. Two issues of word cloud:


Presentation Transcript


  1. Semantic Wordfication of Document Collections Presenter: Yingyu Wu

  2. Outline • Introduction • ProjCloud Technique • Results and Comparisons • Discussion and Limitations • Conclusion

  3. Introduction • Word Cloud

  4. Two issues of word clouds: • (1) Existing methods do not yet provide an intuitive visual representation that links the words in the layout to the documents they are meant to represent. • (2) Constructing word clouds inside general polygons while preserving the semantic relationships between words.

  5. Contributions: • A novel word cloud-based visualization technique, named ProjCloud. • (1) It combines multidimensional projection and word clouds, which makes it possible to visualize the similarity among documents together with their corresponding word clouds and extends the exploratory capabilities of word clouds. • (2) A new approach for building word clouds inside polygons while still preserving the semantic relationship among keywords. • (3) A mechanism based on spectral sorting that arranges words according to their semantic relationship and highlights the most important words in the cloud.

  6. ProjCloud Technique • Overview of the sequence of steps

  7. Steps: • (1) The document collection is mapped into the visual space using a multidimensional projection technique (LSP). • (2) Points in the visual space are clustered (polygons). Two versions: automatic and user-interactive. • (3) Keywords are extracted (most frequent words) and their relevance is computed in order to guide the semantic-preserving placement of words. • (4) The scaling step takes place: keywords are sized based on their relevance and on the area of the containing polygon. • (5) The optimization algorithm runs to generate the word cloud.
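Step (3) can be illustrated with a minimal sketch of "most frequent words" extraction for the documents of one cluster. The function name `top_keywords`, the lowercase letter-only tokenizer, and the optional `stopwords` set are assumptions made for illustration; the slides do not say how ProjCloud tokenizes or filters text.

```python
from collections import Counter
import re


def top_keywords(documents, n=5, stopwords=frozenset()):
    """Return the n most frequent words across the documents of one cluster."""
    counts = Counter()
    for text in documents:
        # Assumption: a token is any run of ASCII letters, compared in lowercase.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    # most_common sorts by descending count; ties keep first-seen order.
    return [word for word, _ in counts.most_common(n)]


cluster_docs = ["Word cloud layout", "word cloud", "the word"]
print(top_keywords(cluster_docs, n=2, stopwords={"the"}))
```

The keywords returned here would then feed the relevance and sizing steps described on the following slides.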

  8. Keyword Relevance and Semantic Relation • Let M be the document × term frequency matrix. • A covariance matrix C is obtained from M. • Build a graph G where each node corresponds to a keyword and an edge eij connects two keywords (Wi and Wj) if and only if the covariance Cij is among the k largest. • Assuming that edge eij has weight Cij, the Fiedler vector assigns a scalar value ai to each keyword Wi that minimizes: $\sum_{e_{ij}} C_{ij}\,(a_i - a_j)^2$ subject to $\sum_i a_i = 0$ and $\sum_i a_i^2 = 1$. • If Cij is large, that is, Wi and Wj are closely related, the two keywords receive similar values.
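The construction on this slide can be sketched with NumPy as follows. A Fiedler vector is the eigenvector of the graph Laplacian belonging to its second-smallest eigenvalue, and it minimizes the weighted sum of squared differences of the values across edges. The function names, the clamping of negative covariances to zero, and the toy matrix are assumptions for illustration, not details taken from the slides; if the k-edge graph is disconnected, the returned vector is not a meaningful ordering.

```python
import numpy as np


def fiedler_from_covariance(C, k):
    """One scalar per keyword, from the graph built on the k largest covariances."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    iu, ju = np.triu_indices(n, k=1)            # distinct keyword pairs (i < j)
    order = np.argsort(C[iu, ju])[::-1][:k]     # positions of the k largest covariances
    W = np.zeros((n, n))
    for i, j in zip(iu[order], ju[order]):
        # Assumption: negative covariances are clamped so edge weights stay non-negative.
        W[i, j] = W[j, i] = max(C[i, j], 0.0)
    laplacian = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(laplacian)      # eigenvalues come back in ascending order
    return eigvecs[:, 1]                        # eigenvector of the second-smallest eigenvalue


def fiedler_from_documents(M, k):
    """M is a documents x terms frequency matrix; columns are keywords."""
    return fiedler_from_covariance(np.cov(M, rowvar=False), k)


# Toy covariance: keywords 0-1 and 2-3 are strongly related, 1-2 only weakly.
C = [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.2, 0.05],
     [0.1, 0.2, 1.0, 0.8],
     [0.0, 0.05, 0.8, 1.0]]
a = fiedler_from_covariance(C, k=3)
print(abs(a[0] - a[1]) < abs(a[1] - a[2]))
```

In the toy example the strongly related pair ends up closer in value than the weakly related pair, which is the behaviour the last bullet of the slide describes. The sign of an eigenvector is arbitrary, so only differences between values are compared.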

  9. The most relevant keyword: • Cijmax is the largest covariance in C, and Wi and Wj are the corresponding words. • The most relevant keyword is Wi if the average covariance between Wi and Wk (k = 1, 2, 3, ..., n) is larger than the corresponding average for Wj; otherwise it is Wj. • Once we have the most relevant keyword (Wr), the keywords are sorted in increasing order according to • In ProjCloud, the order given by the Fiedler vector dictates the position of words in the cloud.
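The selection rule for the most relevant keyword can be sketched in plain Python. The function name is hypothetical, and two details are assumptions: the largest covariance is searched among distinct keyword pairs only (the diagonal holds variances), and each average is taken over the full row of C, matching k = 1, ..., n.

```python
def most_relevant_keyword(C):
    """Return the index r of the most relevant keyword Wr for covariance matrix C."""
    n = len(C)
    # Pair (i, j), i < j, holding the largest off-diagonal covariance.
    i, j = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: C[pair[0]][pair[1]],
    )
    # Average covariance of each keyword with all keywords W1..Wn.
    avg = [sum(row) / n for row in C]
    return i if avg[i] > avg[j] else j


C = [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.2, 0.05],
     [0.1, 0.2, 1.0, 0.8],
     [0.0, 0.05, 0.8, 1.0]]
r = most_relevant_keyword(C)  # largest pair is (0, 1); keyword 1 has the larger average
```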

  10. Sizing keywords • (1) Bounding boxes. • (2) The size of each keyword is set to a scaled value that fits in the interval [fmin, fmax] (12, 50). • (3) If the sum of the areas of all keyword bounding boxes is smaller than the area of polygon P, fmax is increased and the values are re-scaled. This process is repeated until the sum of the keyword areas exceeds the area of P.
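The sizing loop can be sketched as below. Several pieces are assumptions made for illustration: relevance is mapped linearly onto [fmin, fmax], a bounding box is approximated from the word length and a fixed character aspect ratio, and fmax grows by a fixed `step`. The slides only state the interval (12, 50) and the stopping condition.

```python
def scale_sizes(relevance, fmin, fmax):
    """Map relevance values linearly onto the font-size interval [fmin, fmax]."""
    lo, hi = min(relevance), max(relevance)
    span = (hi - lo) or 1.0  # equal relevances: every keyword gets fmin
    return [fmin + (r - lo) / span * (fmax - fmin) for r in relevance]


def box_area(word, size, aspect=0.6):
    """Assumed bounding box: each character is aspect * size wide and size tall."""
    return len(word) * aspect * size * size


def fit_sizes(words, relevance, polygon_area, fmin=12.0, fmax=50.0,
              step=2.0, max_iter=1000):
    """Increase fmax until the summed box areas reach the polygon area."""
    sizes = scale_sizes(relevance, fmin, fmax)
    for _ in range(max_iter):
        total = sum(box_area(w, s) for w, s in zip(words, sizes))
        if total >= polygon_area:
            break
        fmax += step
        sizes = scale_sizes(relevance, fmin, fmax)
    return sizes, fmax


words = ["cloud", "word", "layout"]
sizes, fmax = fit_sizes(words, [3, 1, 2], polygon_area=100000.0)
```

The `max_iter` guard is there because the loop cannot make progress when all relevance values are equal under this linear mapping.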

  11. The Optimization Problem

  12. Results

  13. Comparisons

  14. Discussion and Limitations • ProjCloud is largely dependent on the clustering process. • If the clustering performs poorly, the word cloud becomes very hard to fit and read. • Empty space between clusters.

  15. Conclusion

  16. Thank you
