On Compression-Based Text Classification
Yuval Marton (1), Ning Wu (2), and Lisa Hellerstein (2)
1) University of Maryland  2) Polytechnic University
ECIR-05, Santiago de Compostela, Spain. March 2005
Compression for Text Classification??
• Proposed over the last ~10 years; not well understood why it works.
• Compression is stupid! slow! Non-standard!
• But using compression tools is easy.
• Does it work? (Controversy. Mess.)
Overview
• What is text classification (problem setting)
• Compression-based text classification
• Classification procedures (+ do it yourself!)
• Compression methods (RAR, LZW, and gzip)
• Experimental evaluation
• Why?? (Compression as a character-based method)
• Influence of sub-word / super-word / non-word features
• Conclusions and future work
Text Classification
• Given a training corpus (labeled documents).
• Learn how to label new (test) documents.
• Our setting:
  • Single-class: each document belongs to exactly one class.
  • 3 topic classification and 3 authorship attribution tasks.
Classification by Compression
• Compression programs build a model or dictionary of their input (language modeling).
• Better model → better compression.
• Idea:
  • Compress a document using each class's model.
  • Label it with the class achieving the highest compression rate.
• Minimum Description Length (MDL) principle: select the model with the shortest length of model + data.
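In symbols (our notation; the slides state this only in words):

  assign T to class  ĉ = argmin_i |C(T | M_i)|

where M_i is the model built from class i's training data and |C(T | M_i)| is the compressed length of T under M_i. The MDL view also charges for the model itself: pick the hypothesis minimizing |M_i| + |C(data | M_i)|.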
Standard MDL (Teahan & Harper): freeze the model
• Concatenate the training data of each class i into Ai.
• Compress Ai to build a model Mi, then freeze it.
• Compress T using each Mi.
• Assign T to its best compressor ("…and the winner is…").
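A minimal sketch of this frozen-model procedure in Python. It uses zlib's preset-dictionary feature as a crude stand-in for the PPM models used in the actual experiments; the directory layout (train/<class>/*.txt) and the 32 KB dictionary truncation are our assumptions.

```python
import zlib
from pathlib import Path

def build_model(class_dir: str) -> bytes:
    """Concatenate a class's training documents into Ai.
    Here the 'model' is just the raw text; zlib only uses its last 32 KB."""
    return b"".join(p.read_bytes() for p in sorted(Path(class_dir).glob("*.txt")))

def compressed_len(text: bytes, model: bytes) -> int:
    """Length of `text` compressed by a compressor primed on the frozen `model`."""
    c = zlib.compressobj(level=9, zdict=model[-32768:])  # zlib caps the dictionary at 32 KB
    return len(c.compress(text) + c.flush())

def classify(test_doc: bytes, models: dict[str, bytes]) -> str:
    # Assign the test document to the class whose model compresses it best.
    return min(models, key=lambda cls: compressed_len(test_doc, models[cls]))

models = {cls: build_model(f"train/{cls}") for cls in ["class1", "class2"]}
print(classify(Path("test/doc.txt").read_bytes(), models))
```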
Do it yourself: five minutes on how to classify text documents, e.g. by topic or author, using only off-the-shelf compression tools (such as WinZip or RAR)…
AMDL (Khmelev / Kukushkina et al. 2001)
• Concatenate the training data of each class i into Ai.
• Concatenate Ai and T into AiT.
• Compress each Ai and each AiT.
• Subtract compressed file sizes: vi = |AiT| - |Ai|.
• Assign T to the class i with minimal vi.
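A sketch of AMDL with gzip as the black-box compressor (any off-the-shelf tool could be substituted; the directory layout is an assumption):

```python
import gzip
from pathlib import Path

def gz_len(data: bytes) -> int:
    """Compressed size |x| under gzip, used purely as a black box."""
    return len(gzip.compress(data, compresslevel=9))

def amdl_classify(test_doc: bytes, class_dirs: dict[str, str]) -> str:
    scores = {}
    for cls, d in class_dirs.items():
        Ai = b"".join(p.read_bytes() for p in sorted(Path(d).glob("*.txt")))
        # v_i = |AiT| - |Ai|: extra bytes needed to encode T after the class data
        scores[cls] = gz_len(Ai + test_doc) - gz_len(Ai)
    return min(scores, key=scores.get)

print(amdl_classify(Path("test/doc.txt").read_bytes(),
                    {"class1": "train/class1", "class2": "train/class2"}))
```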
BCN (Benedetto et al. 2002)
• Like AMDL, but concatenate each training document Dj with T into DjT.
• Compress each Dj and each DjT.
• Subtract compressed file sizes: vDjT = |DjT| - |Dj|.
• Assign T to the class i of the document Dj with minimal vDjT.
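The BCN variant needs only a small change: work per document and let the single best-matching training document (1-NN) decide. A sketch under the same assumptions as above:

```python
import gzip
from pathlib import Path

def gz_len(data: bytes) -> int:
    return len(gzip.compress(data, compresslevel=9))

def bcn_classify(test_doc: bytes, class_dirs: dict[str, str]) -> str:
    """1-NN: the training document that T 'sticks to' best determines the class."""
    best_cls, best_v = None, float("inf")
    for cls, d in class_dirs.items():
        for p in sorted(Path(d).glob("*.txt")):
            Dj = p.read_bytes()
            v = gz_len(Dj + test_doc) - gz_len(Dj)  # v_DjT = |DjT| - |Dj|
            if v < best_v:
                best_cls, best_v = cls, v
    return best_cls
```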
Compression Methods
• Gzip: Lempel-Ziv compression (LZ77). "Dictionary"-based; sliding window typically 32K.
• LZW (Lempel-Ziv-Welch): dictionary-based (16-bit); dictionary fills up on big corpora (typically after ~300KB).
• RAR (proprietary shareware): PPMII variant on text; Markov model over n-gram frequencies; (almost) unlimited memory.
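To plug in an external tool such as RAR rather than a Python library, one can shell out and read the resulting archive size. A rough sketch; the `rar a <archive> <file>` invocation and its availability on the PATH are assumptions about the installed tool, and flags may vary by version:

```python
import subprocess
import tempfile
from pathlib import Path

def rar_len(data: bytes) -> int:
    """Compressed size of `data` under the external RAR tool (assumed installed)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.txt"
        arc = Path(tmp) / "out.rar"
        src.write_bytes(data)
        # 'rar a <archive> <file>' adds the file to a new archive.
        subprocess.run(["rar", "a", str(arc), str(src)],
                       check=True, stdout=subprocess.DEVNULL)
        return arc.stat().st_size
```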
Previous Work
• Khmelev et al. (+ Kukushkina): Russian authors. Thaper: LZ78, character- and word-based PPM.
• Frank et al.: compression (PPM) bad for topic. Teahan and Harper: compression (PPM) good.
• Benedetto et al.: gzip good for authors. Goodman: gzip bad!
• Khmelev and Teahan: RAR (PPM).
• Peng et al.: Markov language models.
Compression Good or Bad?
Scoring: we measured accuracy = (total # of correct classifications) / (total # of tests), i.e. micro-averaged accuracy.
Why? Single-class labels, no tuning parameters.
RAR is a Star!
• RAR is the best-performing method on all but the small Reuters-9 corpus.
• Poor performance of gzip on large corpora is due to its 32K sliding window.
• Poor performance of LZW: its dictionary fills up after ~300KB, among other reasons.
RAR on Standard Corpora: Comparison
• 90.5% for RAR on 20news, vs.:
  - 89.2% Language Modeling (Peng et al. 2004)
  - 86.2% Extended NB (Rennie et al. 2003)
  - 82.1% PPMC (Teahan and Harper 2001)
• 89.6% for RAR on Sector, vs.:
  - 93.6% SVM (Zhang and Oles 2001)
  - 92.3% Extended NB (Rennie et al. 2003)
  - 64.5% Multinomial NB (Ghani 2001)
AMDL vs. BCN
• Gzip / BCN good, thanks to processing each document separately with T (1-NN).
• Gzip / AMDL bad.
• BCN was slow, probably due to more system calls and disk I/O.
Why Good?!
• Compression tools are character-based. (Stupid, remember?)
• Better than word-based? WHY?
• Can they capture sub-word, word, super-word, or non-word features?
Pre-processing: "the more – the better!"
• STD: no change to input → the more – the better!
• NoP: remove punctuation; replace whitespace (tab, line, paragraph & page breaks) with spaces → the more the better
• WOS: NoP + word-order scrambling → better the the more
• RSW: NoP + random-string words → dqf tmdw dqf lkwe
…and more…
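A sketch of the NoP / WOS / RSW transforms on a single document (the whitespace handling, random seeding, and the lengths of the random-string words are our simplifications):

```python
import random
import re
import string

def nop(text: str) -> str:
    """NoP: strip punctuation, collapse all whitespace (tabs, line/paragraph breaks) to spaces."""
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def wos(text: str, seed: int = 0) -> str:
    """WOS: NoP plus word-order scrambling (keeps sub-word/word info, destroys super-word info)."""
    words = nop(text).split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def rsw(text: str, seed: int = 0) -> str:
    """RSW: NoP plus mapping each distinct word to a fixed random lowercase string."""
    rng = random.Random(seed)
    mapping = {}
    out = []
    for w in nop(text).split():
        if w not in mapping:
            mapping[w] = "".join(rng.choice(string.ascii_lowercase)
                                 for _ in range(rng.randint(3, 6)))
        out.append(mapping[w])
    return " ".join(out)

print(nop("the more – the better!"))  # 'the more the better'
print(wos("the more – the better!"))  # scrambled order, e.g. 'better the the more'
print(rsw("the more – the better!"))  # random-string words, e.g. 'dqf tmdw dqf lkwe'
```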
Non-words: Punctuation
• Intuition: punctuation usage is characteristic of writing style (authorship attribution).
• Results:
  • Accuracy remained the same, or even increased, in many cases.
  • RAR is insensitive to punctuation removal.
Super-words: Word-Order Scrambling (WOS)
• WOS removes punctuation and scrambles word order.
• WOS leaves sub-word and word info intact, but destroys super-word relations.
• RAR: accuracy declined on all but one corpus → it seems to exploit word sequences (n-grams?). An advantage over state-of-the-art bag-of-words methods such as SVM.
• LZW & gzip: no consistent accuracy decline.
Summary
• Compared the effectiveness of compression for text classification (compression methods × classification procedures).
• RAR (PPM) is a star, under AMDL.
  - BCN (1-NN) is slow(er) and never better in accuracy.
  - Compression good (Teahan and Harper).
  - Character-based Markov models good (Peng et al.).
• Introduced pre-processing testing techniques: novel ways to test how compression (and other character-based methods) exploit sub-word / super-word / non-word features.
  - RAR benefits from super-word info.
  - Suggests word-based methods might benefit from it too.
Future Research
• Test / confirm results on more and bigger corpora.
• Compare to state-of-the-art techniques:
  • Other compression / character-based methods.
  • SVM.
  • Word-based n-gram language modeling (Peng et al.).
• Word-based compression?
• Use standard MDL (Teahan and Harper): faster, better insight.
• Sensitivity to class training-data imbalance.
• When is throwing away data desirable for compression?