Text Mining with R and the tm package

Agenda
• Motivation
• Preliminaries
• Operations
• Demo
• Thoughts
• System prerequisites
• Resources
• References

Motivation
• Exciting new possibilities for working with unstructured text data (tweets, news articles/feeds, customer complaints)
• Research categories:
  • Machine learning
  • Data mining
  • Sentiment analysis
  • ...
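
Preliminaries
• Some terminology
  • Document: a single unit of text (e.g., one tweet or article)
  • Corpus: a collection of documents
  • Term-document matrix: term counts, with one row per term and one column per document
  • Dissimilarity matrix: pairwise distances between documents (or terms)
• We will see some of these in the demo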

Typical TM Operations
• Import
• Preprocessing
  • Stop words
  • White space
  • Punctuation
  • (to) Lower case
  • Numeric removal
  • ... Other “mappings”
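
The preprocessing steps listed on this slide can be sketched in a few lines. This is a plain Python illustration using only the standard library, not the tm API; the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical and chosen only for the example.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on"}

def preprocess(text, stop_words=STOP_WORDS):
    """Apply the slide's preprocessing steps in a fixed order."""
    text = text.lower()                                    # (to) lower case
    text = text.translate(
        str.maketrans("", "", string.punctuation))         # punctuation removal
    text = re.sub(r"\d+", "", text)                        # numeric removal
    words = [w for w in text.split() if w not in stop_words]  # stop words
    return " ".join(words)                                 # white space collapsed

docs = ["The 3 cats sat on the mat.", "Dogs   and cats: 2 of a kind!"]
cleaned = [preprocess(d) for d in docs]
print(cleaned)
```

The order of the steps matters: lower-casing first means a lower-case stop word list also catches "The", and splitting on white space at the end collapses the gaps left by the earlier removals.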

Typical TM Operations (cont’d)
• Metadata management
  • per document
  • per corpus
• Term-document matrix preparation
• Distance/nearness calculations
• Plotting
• ...
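
To make the term-document matrix and distance steps concrete, here is a minimal sketch in plain Python (standard library only, Python 3.8+ for `math.dist`). It is not how tm implements these operations; the helper names `term_document_matrix` and `dissimilarity_matrix` are hypothetical, the input is assumed to be a non-empty list of already preprocessed documents, and Euclidean distance is just one of several possible measures.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(d.split()) for d in docs]
    terms = sorted(set().union(*counts))          # vocabulary, alphabetical
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

def dissimilarity_matrix(matrix):
    """Pairwise Euclidean distance between the document columns."""
    n_docs = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_docs)]
    return [[math.dist(a, b) for b in cols] for a in cols]

# Assumed input: documents that have already been preprocessed.
docs = ["cats sat mat", "dogs cats kind", "cats sat mat"]
terms, tdm = term_document_matrix(docs)
dist = dissimilarity_matrix(tdm)

for term, row in zip(terms, tdm):
    print(f"{term:5s} {row}")
print(dist)
```

Identical documents end up at distance zero, and the resulting square matrix is the kind of input that clustering or plotting steps would typically consume.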

Thoughts
• Package documentation
• Overlap/misalignment with other packages
• Integration with “big data” facilities

System Prerequisites
• Suggested
  • Weka (for lazy classifiers)
  • GraphViz (for plot())
  • Snowball (for stemDocument())
  • Seriation (for dissplot())
• Optional
  • Antiword (to read Word documents)
  • pdftotext (to read PDF documents)

Resources
• Antiword
  • http://www.winfield.demon.nl/
• pdftotext
  • poppler.freedesktop.org
• Rgraphviz
  • http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html
• Seriation
  • http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=seriation:dissplot
• Weka
  • http://sourceforge.net/projects/weka/

References
• Ingo Feinerer (2012). tm: Text Mining Package. R package version 0.5-7.1.
• Jeff Gentry, Li Long, Robert Gentleman, Seth, Florian Hahne, Deepayan Sarkar and Kasper Hansen (). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 1.32.0.
• Ingo Feinerer. An introduction to text mining in R. R News, 8(2):19-22, October 2008.
• Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.

Contact Information
• Kent Manley
• GMU STAT 763, Spring 2012