Efficient Visualization of Document Streams
Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič, Prof. Dr. Nada Lavrač
Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia
Discovery Science, Canberra, October 2010
Outline • Motivation • Original algorithm • Document corpus visualization pipeline • Our modified algorithm • Visualization of document streams • Experiments (speed tests) • Conclusions and further work DS 2010
Motivation
Goal: Visualization of Document Streams
[Figure labels: Document stream; Outdated documents]
Corpus Visualization Pipeline (Paulovich et al., 2006)
[Figure: Document corpus; Corpus preprocessing; k-means clustering; Stress majorization; Neighborhoods computation; Least-squares interpolation; Layout]
Corpus Visualization Pipeline • Corpus preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation
Corpus preprocessing • Tokenization • Stop-word removal • Lemmatization • n-grams
Output: sparse TF-IDF vectors in a high-dimensional space
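The preprocessing steps listed above can be illustrated with a minimal Python sketch. The stop-word list, the `max_n` parameter and all function names are assumptions made for illustration and do not come from the slides; lemmatization is left out because it needs a language-specific resource.

```python
import re

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, max_n=2):
    """Return all n-grams of length 1..max_n, joined with spaces."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def preprocess(text, max_n=2):
    """Tokenize, remove stop words, and expand to n-grams (no lemmatization)."""
    return ngrams(remove_stop_words(tokenize(text)), max_n)
```

The resulting terms would then be counted and weighted to form the sparse vectors.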
Corpus Visualization Pipeline: k-means clustering • Iterative method
Corpus Visualization Pipeline: Stress majorization • Iterative method • High-dimensional → 2D
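As a rough illustration of an iterative projection to 2D, the sketch below applies the standard unweighted stress-majorization update (the Guttman transform) to a small matrix of target distances. It is not taken from the slides: the function names, the dense distance matrix and the fixed iteration count are assumptions, and the talk applies the step only to a small set of control points.

```python
import math

def stress(layout, target):
    """Sum of squared differences between 2D distances and target distances."""
    n = len(layout)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (math.dist(layout[i], layout[j]) - target[i][j]) ** 2
    return total

def majorization_step(layout, target):
    """One Guttman-transform update with unit weights."""
    n = len(layout)
    new_layout = []
    for i in range(n):
        sx = sy = 0.0
        for j in range(n):
            if i == j:
                continue
            d = math.dist(layout[i], layout[j])
            if d > 0.0:  # coincident points contribute nothing
                ratio = target[i][j] / d
                sx += ratio * (layout[i][0] - layout[j][0])
                sy += ratio * (layout[i][1] - layout[j][1])
        new_layout.append((sx / n, sy / n))
    return new_layout

def stress_majorization(target, init, iterations=100):
    """Iterate from an initial 2D layout; stress never increases between steps."""
    layout = list(init)
    for _ in range(iterations):
        layout = majorization_step(layout, target)
    return layout
```

Because the routine takes its starting layout as an argument, a previous layout can be passed in instead of a random one.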
Corpus Visualization Pipeline: Neighborhoods
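Corpus Visualization Pipeline: Least-squares interpolation • Iterative method
• \( p_i = \frac{1}{|N_p|} \sum_{r \in N_p} p_r \)
• \( p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) p_r = (0, 0), \quad k = |N_p| \)
• \( c_i = (x_i^*, y_i^*) \)
• \( AX = B \), solved as \( \arg\min_X \|AX - B\|^2 \)

\[
\underbrace{\begin{bmatrix}
1 & \cdots & -\tfrac{1}{k} & \cdots & -\tfrac{1}{k} \\
\vdots & \ddots & & & \vdots \\
-\tfrac{1}{k} & \cdots & & \cdots & 1 \\
1 & & & & \\
& & \ddots & & \\
& & & 1 &
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix}
(x_1, y_1) \\ (x_2, y_2) \\ \vdots \\ (x_{n-1}, y_{n-1}) \\ (x_n, y_n)
\end{bmatrix}}_{X}
=
\underbrace{\begin{bmatrix}
(0, 0) \\ \vdots \\ (0, 0) \\ (x_1^*, y_1^*) \\ \vdots \\ (x_r^*, y_r^*)
\end{bmatrix}}_{B}
\]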
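To make the system AX = B concrete, here is a small pure-Python sketch. It builds one neighborhood row per point (1 on the diagonal, -1/k at each neighbor) and one row per control point, then minimizes ||AX - B||² through the normal equations and Gauss-Jordan elimination. The direct solver is swapped in for brevity; the slide mentions an iterative method. The function names and the dictionary-based inputs are assumptions.

```python
def build_system(n, neighbors, controls):
    """neighbors: {i: [neighbor indices]}; controls: {i: (x*, y*)}."""
    A, B = [], []
    for i in range(n):
        row = [0.0] * n
        row[i] = 1.0
        k = len(neighbors[i])
        for r in neighbors[i]:
            row[r] -= 1.0 / k          # p_i + sum(-1/k * p_r) = (0, 0)
        A.append(row)
        B.append((0.0, 0.0))
    for i, (x, y) in controls.items():
        row = [0.0] * n
        row[i] = 1.0                   # pins point i to its control position
        A.append(row)
        B.append((x, y))
    return A, B

def solve_least_squares(A, B):
    """Minimize ||AX - B||^2 via the normal equations (A^T A) X = A^T B."""
    m, n = len(A), len(A[0])
    # Augmented matrix [A^T A | A^T B], with two right-hand-side columns (x, y).
    M = [[sum(A[r][i] * A[r][j] for r in range(m)) for j in range(n)]
         + [sum(A[r][i] * B[r][c] for r in range(m)) for c in range(2)]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [(M[i][n], M[i][n + 1]) for i in range(n)]
```

For example, with three points in a chain where points 0 and 2 are control points, `build_system(3, {0: [1], 1: [0, 2], 2: [1]}, {0: (0.0, 0.0), 2: (2.0, 0.0)})` yields a 5 × 3 system whose solution places point 1 midway between the two control points.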
Stream Visualization Pipeline
[Figure: Document stream; Buffer (FIFO); Preprocessing; k-means clustering; Stress majorization; Neighborhoods computation; Least-squares interpolation; Outdated documents]
Stream Visualization Pipeline • Preprocessing • k-means clustering • Stress majorization • Neighborhoods • Least-squares interpolation
Preprocessing • TF-IDF weights • TF: the number of times the term occurs in the document • DF: the number of documents in the corpus containing the term • IDF: log(|D| / DF) • Not possible to compute IDF from (infinite) real-time streams
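The TF, DF and IDF definitions above translate directly into a few lines of Python. This is a minimal sketch for a finite corpus; the function name `tf_idf` and the dictionary representation of sparse vectors are assumptions, and no vector normalization is applied.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))        # DF: each document counts once per term
    n_docs = len(docs)                # |D|
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)          # TF: occurrences of the term in the document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

Note that |D| and DF are both computed over the whole corpus, which is exactly what an unbounded stream does not provide.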
Stream Visualization Pipeline: Preprocessing
[Figure labels: TF vectors; Vocabulary DF values; TF-IDF vector]
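One way to read this slide is that TF vectors are stored for the documents currently in the buffer, DF values are maintained for the buffer's vocabulary, and TF-IDF vectors are derived from the two. The sketch below follows that reading; the class name `StreamingTfIdf`, its methods and the eviction policy are assumptions rather than the authors' implementation.

```python
import math
from collections import Counter, deque

class StreamingTfIdf:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()     # TF vectors of the documents in the FIFO buffer
        self.df = Counter()       # DF values over the buffer's vocabulary

    def add(self, tokens):
        """Add a document; return the evicted (outdated) TF vector, if any."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(list(tf))            # +1 for every distinct term
        if len(self.buffer) <= self.capacity:
            return None
        old = self.buffer.popleft()
        self.df.subtract(list(old))         # -1 for every distinct term
        for term in old:
            if self.df[term] == 0:
                del self.df[term]           # term left the vocabulary
        return old

    def vectors(self):
        """TF-IDF vectors of all buffered documents, using current DF values."""
        n = len(self.buffer)
        return [{t: c * math.log(n / self.df[t]) for t, c in tf.items()}
                for tf in self.buffer]
```

With this design, IDF is always computed over the finite buffer, so it stays well defined even though the stream itself is unbounded.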
Stream Visualization Pipeline: k-means clustering • Warm start!
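A warm start means that an iterative method begins from the previous batch's result instead of from scratch. The following sketch shows the idea for k-means using plain Lloyd iterations on 2D points. The function name, the Euclidean distance and the toy data are assumptions for illustration; the pipeline itself clusters sparse TF-IDF vectors.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm from the given initial centroids.

    Returns (centroids, iterations until the assignments stopped changing).
    """
    assign = None
    for it in range(1, max_iter + 1):
        new_assign = [min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:
            return centroids, it
        assign = new_assign
        updated = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])   # keep an empty cluster's centroid
        centroids = updated
    return centroids, max_iter

# Cold start on the first buffer contents, seeded with the first two points.
old_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cold_centroids, cold_iters = kmeans(old_points, old_points[:2])

# One document leaves the buffer and one arrives; reuse the previous centroids.
new_points = old_points[1:] + [(11, 11)]
warm_centroids, warm_iters = kmeans(new_points, cold_centroids)
```

When the buffer changes only slightly between batches, the previous centroids are already close to a solution, so the warm-started run typically needs fewer iterations.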
Stream Visualization Pipeline: Stress majorization • Warm start!
Stream Visualization Pipeline: Neighborhoods • Remove outdated instances • Add new instances …
Stream Visualization Pipeline: Least-squares interpolation • Remove outdated instances • Add new instances … • Warm start!
[Figure: matrix form of the system AX = B as instances are removed and added; unknowns (x1, y1), …, (xn, yn); control points (x1*, y1*), …, (xr*, yr*)]
Speed Tests
• First 30,000 news items from Reuters Corpus Vol. 1 (“natural” rate: 1.4 news items / minute)
• Experimental setting
• Maximum rate?
• 10 news items in a batch (u = 10)
• Buffer capacity: nQ = 5,000 news items
• 100 control points, 30 + 30 neighbors
Speed Tests
• Processing delay: ~9 sec + 4 sec to form a batch
• Exit delay: ~4 sec
• Exit frequency: ~1/4 batches per sec (2.5 docs / sec)
[Figure: stream visualization pipeline with Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation, Outdated documents]
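The reported figures fit together arithmetically, as the short calculation below shows. The variable names are illustrative, the total delay assumes the two delay components simply add up, and the comparison with the "natural" Reuters rate is an added back-of-the-envelope step rather than a number from the slides.

```python
batch_size = 10            # u: documents per batch
exit_interval_s = 4.0      # one batch leaves the pipeline about every 4 sec
processing_delay_s = 9.0   # time a batch spends in the pipeline
batch_forming_s = 4.0      # time to collect 10 documents at the maximum rate

throughput = batch_size / exit_interval_s             # documents per second
total_delay = processing_delay_s + batch_forming_s    # seconds from arrival to layout
natural_rate = 1.4 / 60.0                             # Reuters "natural" rate, docs per second
speedup = throughput / natural_rate                   # headroom over the natural rate

print(f"throughput: {throughput} docs/sec")
print(f"total delay: {total_delay} sec")
print(f"headroom over natural rate: {speedup:.0f}x")
```

The 4 sec batch-forming time is consistent with the throughput: at 2.5 docs / sec, a batch of 10 documents takes 4 sec to fill.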
Conclusions and Further Work
• Conclusions
  • Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes)
  • Tricks: warm start, pipelining, parallelization
• Further work
  • Performance at different nQ and u?
  • Optimize k-means (done!) and k-NN (easy)
  • Find use cases, perform user studies
    • Decision making in financial domain (FIRST)
    • Press clipping (media monitoring)