
On Compression-Based Text Classification

Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2). 1) University of Maryland and 2) Polytechnic University. ECIR-05, Santiago de Compostela, Spain. March 2005.


Presentation Transcript


  1. On Compression-Based Text Classification
     Yuval Marton (1), Ning Wu (2) and Lisa Hellerstein (2)
     1) University of Maryland and 2) Polytechnic University
     ECIR-05, Santiago de Compostela, Spain. March 2005

  2. Compression for Text Classification??
     • Proposed in the last ~10 years. Not well understood why it works.
     • Compression is stupid! Slow! Non-standard!
     • Using compression tools is easy…
     • Does it work? (Controversy. Mess.)
     Yuval Marton, Ning Wu, and Lisa Hellerstein ECIR-05

  3. Overview
     • What is text classification? (problem setting)
     • Compression-based text classification
     • Classification procedures (+ Do it yourself!)
     • Compression methods (RAR, LZW, and gzip)
     • Experimental evaluation
     • Why?? (Compression as a char-based method)
     • Influence of sub/super/non-word features
     • Conclusions and future work

  4. Text Classification
     • Given a training corpus (labeled documents).
     • Learn how to label new (test) documents.
     • Our setting:
       • Single-class: each document belongs to exactly one class.
       • 3 topic classification and 3 authorship attribution tasks.

  5. Classification by Compression
     • Compression programs build a model or dictionary of their input (language modeling).
     • Better model → better compression
     • Idea:
       • Compress a document using different class models.
       • Label with the class achieving the highest compression rate.
     • Minimum Description Length (MDL) principle: select the model with the shortest length of model + data.

  6. Standard MDL (Teahan & Harper)
     [Diagram: for each class i = 1..n, documents D1, D2, D3 are concatenated into Ai, which yields model Mi; the test document T is compressed with each model.]
     • Concat. training data → Ai
     • Compress Ai → model Mi (FREEZE!)
     • Compress T using each Mi
     • Assign T to its best compressor
     … and the winner is…
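
The procedure on this slide can be approximated with Python's standard `zlib` module. This is only a rough sketch, not the PPM-based setup of Teahan and Harper: it stands in a DEFLATE preset dictionary (`zdict`) for the frozen class model, and zlib only uses the last 32 KB of such a dictionary. The names `frozen_size` and `mdl_classify` and the toy training data are hypothetical.

```python
import zlib

def frozen_size(model, doc):
    """Size of doc when deflated with `model` preloaded as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=model)
    return len(comp.compress(doc) + comp.flush())

def mdl_classify(train, test_doc):
    """Assign test_doc to the class whose fixed dictionary compresses it best."""
    # Concat. training data -> Ai; Ai itself plays the role of the model Mi here
    # (assumption: a real PPM model would be trained once and then frozen).
    models = {label: b"".join(docs) for label, docs in train.items()}
    return min(models, key=lambda label: frozen_size(models[label], test_doc))

# Toy data, purely illustrative.
train = {
    "animals": [b"the cat sat on the mat. ", b"the dog chased the cat. "],
    "numbers": [b"1234567890 0987654321 ", b"1122334455 5544332211 "],
}
predicted = mdl_classify(train, b"the cat sat on the mat. ")
```

The training text is never recompressed per test document, which is the practical appeal of a frozen model.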

  7. Do it yourself
     Five minutes on how to classify text documents, e.g., according to their topic or author, using only off-the-shelf compression tools (such as WinZip, gzip, or RAR)…

  8. AMDL (Khmelev / Kukushkina et al. 2001)
     [Diagram: for each class i = 1..n, documents D1, D2, D3 are concatenated into Ai; the test document T is appended to give AiT.]
     • Concat. training data → Ai
     • Concat. Ai and T → AiT
     • Compress each Ai and AiT
     • Subtract compressed file sizes: vi = |AiT| - |Ai|
     • Assign T to class i w/ min vi
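
The AMDL steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `zlib.compress` stands in for the off-the-shelf tools (gzip, RAR) used in the talk, documents are held in memory as bytes rather than files, and the function name and toy data are hypothetical.

```python
import zlib

def amdl_classify(train, test_doc, compress=zlib.compress):
    """Return (label, scores) where scores[i] = |Ai T| - |Ai| in compressed bytes."""
    scores = {}
    for label, docs in train.items():
        a_i = b"".join(docs)                      # Concat. training data -> Ai
        a_i_t = a_i + test_doc                    # Concat. Ai and T -> AiT
        scores[label] = len(compress(a_i_t)) - len(compress(a_i))
    # Assign T to the class with the smallest size increase.
    return min(scores, key=scores.get), scores

# Toy data, purely illustrative.
train = {
    "animals": [b"the cat sat on the mat. ", b"the dog chased the cat. "],
    "numbers": [b"1234567890 0987654321 ", b"1122334455 5544332211 "],
}
label, scores = amdl_classify(train, b"the cat sat on the mat. ")
```

Any function mapping bytes to bytes can be passed as `compress`, which makes it easy to swap compressors without touching the procedure.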

  9. BCN (Benedetto et al. 2002)
     [Diagram: each training document Dj (D1..D3 in class 1, D4..D6 in class 2, D7..D9 in class n) is paired with the test document T to give DjT.]
     • Like AMDL, but concat. each doc Dj with T → DjT
     • Compress each Dj and DjT
     • Subtract compressed file sizes: vDT = |DjT| - |Dj|
     • Assign T to class i of doc Dj with min vDT
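
BCN differs from AMDL only in scoring each training document separately, which makes it a nearest-neighbour rule. A minimal sketch, again assuming `zlib.compress` as a stand-in compressor and in-memory bytes instead of files (the function name and toy data are hypothetical):

```python
import zlib

def bcn_classify(train, test_doc, compress=zlib.compress):
    """Return the class of the single document Dj minimizing |Dj T| - |Dj|."""
    best_label, best_score = None, None
    for label, docs in train.items():
        for d_j in docs:
            # One pair of compressions per training document, not per class.
            v = len(compress(d_j + test_doc)) - len(compress(d_j))
            if best_score is None or v < best_score:
                best_label, best_score = label, v
    return best_label

# Toy data, purely illustrative.
train = {
    "animals": [b"the cat sat on the mat. ", b"the dog chased the cat. "],
    "numbers": [b"1234567890 0987654321 ", b"1122334455 5544332211 "],
}
predicted = bcn_classify(train, b"the cat sat on the mat. ")
```

The inner loop runs once per training document, so the number of compressor calls grows with corpus size rather than with the number of classes.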

  10. Compression Methods
     • Gzip: Lempel-Ziv compression (LZ77).
       - “Dictionary”-based
       - Sliding window, typically 32K.
     • LZW (Lempel-Ziv-Welch)
       - Dictionary-based (16 bit).
       - Fills up on big corpora (typically after ~300KB).
     • RAR (proprietary shareware)
       - PPMII variant on text.
       - Markov model, n-gram frequencies.
     • Model size: gzip 32K; LZW 16 bit (~300K); RAR (almost) unlimited.
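
The effect of a 32K sliding window is easy to observe directly. The sketch below repeats a 40,000-byte incompressible block and compares how much the compressed size grows under `zlib` (DEFLATE, 32 KB window) and under `lzma`, which is used here only as a convenient large-window stand-in from the standard library; it is not RAR or PPM. The block size and seed are arbitrary choices.

```python
import lzma
import random
import zlib

# A pseudo-random block larger than 32 KB, so it cannot be compressed on its own.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(40_000))

def growth(compress):
    """Compressed size of block+block relative to the block alone."""
    return len(compress(block + block)) / len(compress(block))

zlib_growth = growth(zlib.compress)   # repeat lies beyond the 32 KB window
lzma_growth = growth(lzma.compress)   # repeat is within the default dictionary

print(f"zlib: x{zlib_growth:.2f}  lzma: x{lzma_growth:.2f}")
```

A growth factor near 2 means the compressor learned nothing from the first copy, which is the situation gzip faces when the training text Ai is much longer than its window.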

  11. Previous Work
     • Khmelev et al. (+Kukushkina): Russian authors. Thaper: LZ78, char- and word-based PPM.
     • Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good.
     • Benedetto et al.: gzip good for authors. Goodman: gzip bad!
     • Khmelev and Teahan: RAR (PPM).
     • Peng et al.: Markov Language Models.

  12. Compression Good or Bad?
     Scoring: we measured accuracy (micro-averaged):
     Accuracy = (total # correct classifications) / (total # tests)
     Why? Single-class labels, no tuning parameters.
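
As a small worked example of the scoring rule, micro-averaged accuracy pools all test documents regardless of class. The labels below are made up for illustration.

```python
def micro_accuracy(gold, predicted):
    """Fraction of test documents whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

gold = ["sport", "sport", "sport", "politics"]
pred = ["sport", "sport", "politics", "politics"]
acc = micro_accuracy(gold, pred)   # 3 correct out of 4 tests -> 0.75
```

Because every document counts once, large classes weigh more than small ones in this score.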

  13. AMDL Results

  14. RAR is a Star!
     • RAR is the best-performing method on all but the small Reuters-9 corpus.
     • Poor performance of gzip on large corpora is due to its 32K sliding window.
     • Poor performance of LZW: the dictionary fills up after ~300KB; other reasons too.

  15. RAR on Standard Corpora - Comparison
     • 90.5% for RAR on 20news:
       - 89.2% Language Modeling (Peng et al. 2004)
       - 86.2% Extended NB (Rennie et al. 2003)
       - 82.1% PPMC (Teahan and Harper 2001)
     • 89.6% for RAR on Sector:
       - 93.6% SVM (Zhang and Oles 2001)
       - 92.3% Extended NB (Rennie et al. 2003)
       - 64.5% Multinomial NB (Ghani 2001)

  16. AMDL vs. BCN
     • Gzip / BCN good. Due to processing each doc separately with T (1-NN).
     • Gzip / AMDL bad.
     • BCN was slow. Probably due to more sys calls and disk I/O.

  17. Why Good?!
     • Compression tools are character-based. (Stupid, remember?)
     • Better than word-based? WHY? Can they capture
       • sub-word
       • word
       • super-word
       • non-word
       features?

  18. Pre-processing “the more – the better!”
     • STD: no change to input.
     • NoP: remove punctuation; replace white spaces (tab, line, parag & page breaks) with spaces. Example: the more the better
     • WOS: NoP + Word Order Scrambling. Example: better the the more
     • RSW: NoP + random-string words. Example: dqf tmdw dqf lkwe
     …and more…
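
The three transformations can be sketched as follows. Several details are assumptions rather than the authors' exact procedure: "punctuation" is taken to be any character that is not a letter, digit, underscore or whitespace; whitespace runs are collapsed to one space; and RSW draws a random lowercase string of 3 to 6 letters per distinct word, reusing it for every occurrence (as in the slide's example, where both occurrences of "the" map to the same string). The function names are hypothetical.

```python
import random
import re
import string

def nop(text):
    """NoP: drop non-word, non-space characters; collapse whitespace to single spaces."""
    return " ".join(re.sub(r"[^\w\s]", "", text).split())

def wos(text, rng):
    """WOS: NoP followed by a random permutation of the words."""
    words = nop(text).split()
    rng.shuffle(words)
    return " ".join(words)

def rsw(text, rng):
    """RSW: NoP, then replace each distinct word by its own random string."""
    words = nop(text).split()
    codes, used = {}, set()
    for word in words:
        while word not in codes:
            length = rng.randint(3, 6)  # assumed length range
            candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if candidate not in used:   # keep distinct words distinct
                codes[word] = candidate
                used.add(candidate)
    return " ".join(codes[w] for w in words)

sample = "the more – the better!"
print(nop(sample))                     # the more the better
print(wos(sample, random.Random(0)))   # same words, shuffled order
print(rsw(sample, random.Random(0)))   # same word pattern, random strings
```

RSW keeps word identity and word order but removes all sub-word regularities, while WOS does the opposite for word order, which is what lets the two be used to separate sub-word from super-word effects.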

  19. Non-words: Punctuation
     • Intuition:
       • Punctuation usage is characteristic of writing style (authorship attribution).
     • Results:
       • Accuracy remained the same, or even increased, in many cases.
       • RAR is insensitive to punctuation removal.

  20. Super-words: word seq. Order Scrambling (WOS)
     • WOS removes punctuation and scrambles word order.
     • WOS leaves sub-word and word info intact. Destroys super-word relations.
     • RAR: accuracy declined in all but one corpus → seems to exploit word seq. (n-grams?). Advantage over state-of-the-art bag-of-words methods such as SVM.
     • LZW & gzip: no consistent accuracy decline.

  21. Summary
     • Compared effectiveness of compression for text classification (compression methods x classification procedures).
     • RAR (PPM) is a star – under AMDL.
       - BCN (1-NN) slow(er) and never better in accuracy.
       - Compression good (Teahan and Harper).
       - Character-based Markov models good (Peng et al.)
     • Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploit sub/super/non-word features.
       - RAR benefits from super-word info.
       - Suggests word-based methods might benefit from it too.

  22. Future Research
     • Test / confirm results on more and bigger corpora.
     • Compare to state-of-the-art techniques:
       • Other compression / character-based methods.
       • SVM
       • Word-based n-gram language modeling (Peng et al).
     • Word-based compression?
     • Use Standard MDL (Teahan and Harper).
       • Faster, better insight.
     • Sensitivity to class training data imbalance
     • When is throwing away data desirable for compression?

  23. Thank you!
