Text Mining SAS-L Topics
Larry Hoyle, Policy Research Institute, University of Kansas
SAS-L topics
• Read each weekly topic list from http://www.listserv.uga.edu/archives/sas-l.html
• Parse topic, HTMLdecode
• Strip “Re: ”
    /* strip variations of re: (case-insensitive) */
    topicRE = prxparse('/^ *re *: *(.*)/i');
    if prxmatch(topicRE, topic) then do;
       topic = prxposn(topicRE, 1, topic);
    end;
• Proc SQL to aggregate topic counts across weeks
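The aggregation step in the last bullet above might look like the following minimal sketch. The slides do not show the actual query, so the dataset and column names here (work.weekly_topics, week, topic, n_posts, work.topic_counts) and the sample rows are hypothetical, chosen only to illustrate grouping by the cleaned topic across weekly pages.

```sas
/* Hypothetical input: one row per topic line per weekly page.      */
/* The names and sample values are assumptions, not from the slides. */
data work.weekly_topics;
   infile datalines dsd truncover;
   length week $5 topic $200;
   input week $ topic $ n_posts;
   datalines;
0501a,Proc SQL question,4
0501b,Proc SQL question,3
0501b,ODS RTF margins,2
;
run;

/* Merge threads that span several weeks: one output row per topic. */
proc sql;
   create table work.topic_counts as
   select topic,
          count(distinct week) as n_weeks,      /* weekly pages the thread appears on */
          sum(n_posts)         as total_posts   /* posts summed across those weeks    */
   from work.weekly_topics
   group by topic
   order by total_posts desc;
quit;
```

Grouping on the topic text only works after the “Re:” prefixes have been stripped, since otherwise replies would count as separate threads.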
SAS-L 2005
• 35324 thread/topic lines in the HTML files
• 7081 threads after merging across weeks and a little cleaning
Web scraping with %tmfilter
    options noxwait;

    %macro aweek(week=0501a);
       x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week";
       x "md C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredposts\&week";
       libname sugi31 'C:\ddrive\projects\sugs\sugi31\SASLBOF\datasets';
       %tmfilter(
          dataset=sugi31.SL&week.,
          dir=C:\ddrive\projects\sugs\sugi31\SASLBOF\posts\&week,
          destdir=C:\ddrive\projects\sugs\sugi31\SASLBOF\filteredPosts\&week,
          URL=http://listserv.uga.edu/cgi-bin/wa?A1=ind&week.%NRSTR(&L=sas-l),
          depth=1,
          links=sugi31.SL&week.L,
          norestrict=' ',
          numchars=2000)
    %mend aweek;

    %aweek(week=0501a);
    %aweek(week=0501b);
Parse date and sender
• Should parse this out