Text Mining Tools

Text Mining Tools 22C:196 Text Retrieval & Text Mining Seminar

Tools • WordNet • MxTerminator • LingPipe • Stanford TP Tools • Stanford-NER • SVMLight • Rainbow Toolkit • Manjal

WordNet • http://wordnet.princeton.edu/ • English lexical database • Developed at Princeton Univ. by George A. Miller and colleagues • Organized as synsets • Cognitive synonym sets • Synsets for nouns, verbs, adjectives and adverbs

WordNet • Synsets interlinked via lexical and conceptual-semantic relations • Network of meaningfully related concepts and words • Available online and can also be freely downloaded • Perl and Java packages available to interface with WordNet
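
To make the "network of synsets" idea concrete, here is a minimal sketch in Python. The data is a tiny hand-written toy, not read from WordNet: the synset ids (`car.n` and so on), the simplified hypernym chain, and the helper names `synonyms` and `hypernym_chain` are all assumptions made for illustration.

```python
from collections import deque

# Toy, hand-written data for illustration only; not loaded from WordNet.
# Each synset is a set of words that share one meaning.
SYNSETS = {
    "car.n": {"car", "auto", "automobile"},
    "motor_vehicle.n": {"motor vehicle", "automotive vehicle"},
    "vehicle.n": {"vehicle"},
}

# Directed "is-a" (hypernym) links between synset ids (simplified chain).
HYPERNYMS = {
    "car.n": ["motor_vehicle.n"],
    "motor_vehicle.n": ["vehicle.n"],
    "vehicle.n": [],
}


def synonyms(word):
    """Return all words sharing a synset with `word`, excluding the word itself."""
    found = set()
    for members in SYNSETS.values():
        if word in members:
            found |= members
    found.discard(word)
    return found


def hypernym_chain(synset_id):
    """Follow hypernym links breadth-first; return the synset ids reached, in order."""
    seen, order, queue = {synset_id}, [], deque([synset_id])
    while queue:
        current = queue.popleft()
        for parent in HYPERNYMS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order
```

Calling `synonyms("car")` looks up the words grouped with "car", and `hypernym_chain("car.n")` walks the more general concepts above it. The real database has many more relation types and far more synsets.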

WordNet • WordNet 2.0 installed on sulu and geordi • Command-line interface • Example • /usr/local/WordNet-2.0/bin/wn <w> -over • Provides overview of the various senses of <w> • /usr/local/WordNet-2.0/bin/wn <w> -synsn • Provides list of synonyms for the noun senses of <w>

MxTerminator • http://www.id.cbs.dk/~dh/corpus/tools/MXTERMINATOR.html • Java sentence boundary detection tool • Algorithm described in • J.C. Reynar and A. Ratnaparkhi. A Maximum Entropy Approach to Identifying Sentence Boundaries. 1997.

MxTerminator • Installed on sulu and geordi • Command-line interface • Requires two parameters • Trained model directory • Text File to parse • Syntax • /usr/local/mxterminator/mxterminator ‘modeldir’ < ‘textfile’ • Comes with pre-trained model • /usr/local/mxterminator/eos.project

MxTerminator • New models can be trained • trainmxterminator <projectdir> <traindata> • <projectdir> is the newly created model directory • <traindata> is training data with one sentence per line • Package also includes mxpost • Part-of-speech tagger • /usr/local/mxterminator/mxpost ‘modeldir’ < ‘wordfile’ • Pre-built model - /usr/local/mxterminator/tagger.project • wordfile - contains words; one sentence per line

LingPipe • http://www.alias-i.com/lingpipe/ • Suite of Java libraries for different kinds of analyses • Sentence detection • Part-of-speech tagging • Named-entity extraction • Phrase extraction • Entity co-reference • Spell checker • Clustering • Chinese language support

LingPipe • Also contains tools for database text mining • Works directly off a database such as MySQL • Package contains demos, tutorials, pre-trained models and javadoc • Widely used in the text mining community • Especially for general and biomedical named-entity recognition • Website has links to blogs and a developer discussion forum

Stanford TP Tools • http://nlp.stanford.edu/software/index.shtml • Variety of text processing tools • Made available by Stanford NLP group • All tools are implemented in Java • Freely downloadable

Stanford TP Tools • Parser • POS Tagger • Named Entity Recognizer • Chinese word segmenter • Classifier • Tregex and Tsurgeon • Matching patterns in trees

Stanford-NER • Based on CRFs • Contains demo programs • 4 pre-built models • 3-class basic model trained on US and UK newswire data from CoNLL, MUC and ACE • Labels PERSON, ORGANIZATION and LOCATION • 4-class model trained on CoNLL training data • Additionally labels MISC • 2 distsim versions of the above models, which are more accurate

Stanford-NER • Example • java -mx600m -cp ./stanford-ner.jar:. stanfordNER ner-eng-ie.crf-3-all2006-distsim.ser.gz "text" • Advanced distsim model • Example • java -mx300m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -textFile sample.txt • Default basic model

SVMLight • http://svmlight.joachims.org/ • C support-vector-machine implementation by Thorsten Joachims • Does classification, regression and ranking • Many other functions • Estimate error-rate and precision and recall directly • Freely downloadable • Instructions on website

SVMLight • Contains 2 main executable files • svm_learn (learn model from training set) • svm_classify (classify test set) • Input file contains weighted term vectors • Strategy: index doc files using Lucene or SMART and obtain term vectors • Example (one document per line: class label, then term-id:weight pairs) • -1 1:0.43 3:0.12 9284:0.2 • +1 1:0.20 3:0.14 9284:0.97 • Use different kernel functions • Support for linear and non-linear kernels
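
As a rough sketch of the input format above, the Python below turns weighted term vectors into lines like the slide's example. The function names `to_svmlight_line` and `write_examples` are made up for illustration and are not part of SVMLight. The sketch assumes term ids are positive integers, handles only the binary labels -1 and +1, and writes features in ascending id order.

```python
def to_svmlight_line(label, weights):
    """Format one document as '<label> <id>:<weight> ...' with ascending term ids.

    `weights` maps an integer term id to its weight; zero weights are skipped
    because the format is sparse.
    """
    if label not in (-1, 1):
        raise ValueError("this sketch handles only the labels -1 and +1")
    pairs = sorted((term_id, w) for term_id, w in weights.items() if w != 0)
    features = " ".join(f"{term_id}:{w:g}" for term_id, w in pairs)
    return f"{label:+d} {features}".rstrip()


def write_examples(path, documents):
    """Write (label, weights) pairs to `path`, one document per line."""
    with open(path, "w", encoding="ascii") as handle:
        for label, weights in documents:
            handle.write(to_svmlight_line(label, weights) + "\n")
```

A file written by `write_examples` could then serve as the example_file argument of svm_learn or svm_classify, provided the term ids come from the same index for training and test data.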

SVMLight • Syntax: • svm_learn [options] example_file model_file • svm_classify [options] example_file model_file output_file • Example data included in distribution

Rainbow Toolkit • http://www.cs.cmu.edu/~mccallum/bow/rainbow/ • Part of the Bow toolkit • http://www.cs.cmu.edu/~mccallum/bow/ • Text Classification tool • Supports 4 classification methods • Naïve Bayes (default) • TFIDF/Rocchio • K-nearest neighbor • Probabilistic Indexing

Rainbow Toolkit • Building a model • rainbow -d ./model --index <modeldir> --use-stemming --skip-html • <modeldir> contains individual folders (with text files) for each class • Model is stored in ./model • Test model • rainbow -d ./model --test-set=0.4 --test=3 • Train-test split is 0.6/0.4; 3 iterations
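
The 0.6/0.4 split repeated over 3 iterations can be pictured with the Python sketch below. It is not Rainbow's implementation. The names `random_split` and `repeated_splits` and the fixed seed are assumptions, and whether Rainbow draws the test set per class or over the whole collection is not addressed here.

```python
import random


def random_split(items, test_fraction, rng):
    """Shuffle a copy of `items` and return (train, test) lists."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def repeated_splits(items, test_fraction=0.4, trials=3, seed=0):
    """Draw `trials` independent random train/test splits of `items`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [random_split(items, test_fraction, rng) for _ in range(trials)]
```

Each trial trains on its own 60% and is scored on the remaining 40%, so accuracy can be averaged over trials rather than depending on one lucky or unlucky split.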

Rainbow Toolkit • Test model • rainbow -d ./model --test-set=0.5 --test=1 • Specify test set • Half chosen randomly • rainbow -d ./model --test-files <testdir> • Classify previously unseen files in <testdir>

Rainbow Toolkit • Formatted output • rainbow-stats • Example • rainbow -d ./model --test-set=0.4 --test=2 | rainbow-stats • Reports confusion matrix, percent accuracy and standard error
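
For readers unfamiliar with these statistics, the Python sketch below computes them from (actual, predicted) label pairs. It does not parse Rainbow's output and is not how rainbow-stats is implemented. The function names are made up, and treating the standard error as the standard error of the mean accuracy across trials is an assumption.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev


def confusion_matrix(pairs):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(pairs)


def percent_accuracy(pairs):
    """Percentage of pairs whose predicted label equals the actual label."""
    pairs = list(pairs)
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return 100.0 * correct / len(pairs)


def summarize(trials):
    """Return (mean accuracy, standard error of the mean) over several trials."""
    scores = [percent_accuracy(trial) for trial in trials]
    error = stdev(scores) / sqrt(len(scores)) if len(scores) > 1 else 0.0
    return mean(scores), error
```

Off-diagonal entries of the confusion matrix, such as `("a", "b")`, show which classes get mistaken for which, something a single accuracy figure hides.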

Manjal • Online demo
