Scraping the Web with SAS
Tom Kari, Tom Kari Consulting
OASUS, June 12 2013
Google is wonderful, but…
• The first page is full of junk!
• I can’t tell how many pages I’m getting from each site.
• I KNOW the page I want is in here somewhere, how can I find it?
• I’m not using SAS when I use Google!
• How can I keep ALL the results to analyze?
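The Basics
    /* Point a fileref at the web page using the URL access method */
    filename HTML_In url "http://www.dolphinsdance.ca";

    data URL_Retrieval_Results;
      length HTML_Rec $32767;
      infile HTML_In lrecl=32767;
      input;                  /* read one record into the input buffer */
      HTML_Rec = _infile_;    /* keep the raw HTML text of that record */
    run;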
The Process
• What goes in the reference to Google?
• Get results from Google
• How do I find the web sites listed by Google?
• Extract the web sites
• Figure out how to get 1000 web site listings
• Post-process the results (SAS data management)
How to send a search to Google?
• In Internet Explorer:
  • F12 to open Developer Tools
  • Network > Start Capturing
  • Enter your search string
  • Stop Capturing
  • Dig around in the results

    http://www.google.ca/s?gs_rn=14&gs_ri=psy-ab&cp=41&gs_id=a&xhr=t&q=beautiful%20vacation%20resort%20puerto%20vallarta&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.47008514,d.dmg&fp=5ad817295c2c0080&biw=1123&bih=374&tch=1&ech=1&psi=8xOlUdWjBOT84AO1iYCwDw.1369773041400.1

    http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta&start=1
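As a rough sketch of how the short form of the search URL could be assembled in a DATA step: the variable names Phrase, Query and URL are made up for illustration, and only the space-to-plus substitution is handled here, not other special characters.

```sas
data _null_;
  length Phrase Query URL $500;
  Phrase = 'beautiful vacation resort puerto vallarta';

  /* translate(source, to, from): turn each blank inside the phrase into a plus sign */
  Query = translate(strip(Phrase), '+', ' ');

  /* Single quotes keep SAS from treating &start as a macro variable */
  URL = cats('http://www.google.ca/search?q=', Query, '&start=1');

  put URL=;
run;
```

The log should show the same short URL as on the slide above.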
Get Results from Google
    /* %nrstr stops SAS from treating &start as a macro variable reference */
    filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";

    data GoogleResults;
      length HTML_Rec $32767;
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
    run;

32,767 bytes
How do I find the web sites listed by Google?
    <div id="res"><div id="topstuff"></div><div id="search"><div id="ires"><ol>
    <li class="g"><h3 class="r"><a href="/url?q=http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-Puerto_Vallarta.html&sa=U&ei=bhmlUbyUHPKw0QHk1YFg&ved=0CCYQFjAAOAE&usg=AFQjCNFLqCMjy4b4raYjbA8nvqHjJARGlA">Dreams <b>Puerto Vallarta Resort</b> & Spa - All-inclusive <b>Resort</b> Reviews <b>...</b></a></h3>
    <div class="s"><div class="kv" style="margin-bottom:2px"><cite>www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_<b>Puerto</b>_<b>Vallarta</b>_<b>Resort</b>_Spa-<b>Puerto</b>_<b>Vallarta</b>.html</cite>
    <span class="flc"> - <a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:gaglProuhbkJ:http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-
How do I find the web sites listed by Google? (cont’d)
The magic of PRX routines!

"Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step. Pattern matching also enables you to make several substitutions in a string in one step. You do this by using the PRX functions and CALL routines in the DATA step. For example, you can search for multiple occurrences of a string and replace those strings with another string. You can search for a string in your source file and return the position of the match. You can find words in your file that are doubled."
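To make the quoted description a bit more concrete, here is a minimal sketch of two of the things it mentions: returning the position of a match, and finding doubled words. The sample sentence and the variable names (Text, Pos, Fixed) are made up for illustration.

```sas
data _null_;
  length Text Fixed $200;
  Text = 'The page I want is is in here somewhere';

  /* prxmatch returns the position of the first match, or 0 if there is none */
  Pos = prxmatch('/somewhere/', Text);

  /* \b(\w+) \1\b matches a word that is immediately repeated; keep one copy */
  Fixed = prxchange('s/\b(\w+) \1\b/$1/', -1, Text);

  put Pos= / Fixed=;
run;
```

Here the doubled "is is" should be collapsed to a single "is" in Fixed. The -1 argument asks prxchange to replace every occurrence rather than just the first.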
Extract the web sites
    filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";

    data GoogleHTMLResult;
      retain prxid;
      length HTML_Rec CiteData $32767;
      /* Compile the pattern once, on the first iteration */
      if _n_=1 then prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&)/o');
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
      /* Position and length of the first result URL in this record */
      call prxsubstr(prxid,HTML_Rec,HTML_Pos,HTML_Len);
      /* A position of 0 means no match, so only extract when one was found */
      if HTML_Pos > 0 then CiteData=substr(HTML_Rec,HTML_Pos,HTML_Len);
      output;
    run;
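call prxsubstr reports only the first match in a record. If one record holds several result links, one possible approach (not necessarily the one used in the talk's example files) is to loop with call prxnext. The sketch below runs against a made-up, shortened HTML string with two links; the data set name AllMatches and the example.com / example.org addresses are placeholders.

```sas
data AllMatches;
  length HTML_Rec CiteData $2000;
  keep CiteData;

  /* Made-up sample record; single quotes keep &sa from being read as a macro variable */
  HTML_Rec = '<h3 class="r"><a href="/url?q=http://example.com/a&sa=U">A</a></h3>'
          || '<h3 class="r"><a href="/url?q=http://example.org/b&sa=U">B</a></h3>';

  prxid = prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&)/');

  Start = 1;
  Stop  = length(HTML_Rec);

  /* prxnext moves Start past each match; Pos comes back as 0 when nothing is left */
  call prxnext(prxid, Start, Stop, HTML_Rec, Pos, Len);
  do while (Pos > 0);
    CiteData = substr(HTML_Rec, Pos, Len);
    output;
    call prxnext(prxid, Start, Stop, HTML_Rec, Pos, Len);
  end;
run;
```

With this sample, AllMatches should end up with one row per link, holding the address that follows /url?q= up to the first ampersand.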
Figure out how to get 1000 web site listings
Quirks to remember
• Many characters can’t appear in Google search strings, so they must be encoded (spaces to +, etc.)
• Ampersands in your URL need %nrstr, or the request will fail in SAS
• To read from a new URL with infile in SAS, you need a new DATA step. This is easy with a macro loop.
• Every now and then it fails: “ERROR: Invalid reply received from the HTTP server. Use the debug option for more info.” Beats me!
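The macro loop mentioned in the quirks could look roughly like the sketch below. This is not the code from “Example 4”; the macro name GetPages, its parameters, the Page1, Page2, … data set names, and the assumption that each page holds 10 results (so start goes 0, 10, 20, …) are all made up for illustration. The query is expected to be passed in already encoded, with + in place of spaces.

```sas
%macro GetPages(query=, pages=3);
  %local i start;
  %do i = 1 %to &pages;
    /* Assumed paging scheme: 10 results per page */
    %let start = %eval((&i - 1) * 10);

    /* %nrstr keeps the literal &start= from being read as a macro variable */
    filename HTML_In url "http://www.google.ca/search?q=&query%nrstr(&start=)&start";

    /* A new URL needs a new DATA step, so each page gets its own step */
    data Page&i;
      length HTML_Rec $32767;
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
      Page = &i;
    run;

    filename HTML_In clear;
  %end;
%mend GetPages;

%GetPages(query=beautiful+vacation+resort+puerto+vallarta, pages=3)
```

Under the 10-per-page assumption, 1000 listings would mean pages=100. Since the HTTP error above shows up now and then, it is probably worth checking the log for pages that came back empty.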
Figure out how to get 1000 web site listings (cont’d)
Code is in “Example 4 Extract 1000 URLs”
Post-process the results
• Count how many times each URL appears
• For each unique URL, retain the page and index where it first appears
• Create a nice-looking HTML page
• Code is in “Example 5 Post-processed”
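The three post-processing steps could be sketched as follows. This is not the code from “Example 5”; the Results data set, its made-up rows, the variable names (Page, ListPos, CiteData, Count, FirstPage, FirstPos) and the output file name url_summary.html are all assumptions for illustration.

```sas
/* Made-up sample of extracted results: one row per listing */
data Results;
  length CiteData $200;
  input Page ListPos CiteData $;
  datalines;
1 1 http://example.com/a
1 2 http://example.org/b
2 1 http://example.com/a
3 4 http://example.org/b
3 5 http://example.net/c
;
run;

/* Within each URL, put the earliest page and position first */
proc sort data=Results;
  by CiteData Page ListPos;
run;

/* One row per unique URL: how often it appears and where it first shows up */
data URL_Summary;
  set Results;
  by CiteData;
  retain FirstPage FirstPos;
  if first.CiteData then do;
    Count = 0;
    FirstPage = Page;
    FirstPos = ListPos;
  end;
  Count + 1;
  if last.CiteData then output;
  keep CiteData Count FirstPage FirstPos;
run;

/* Write the summary as an HTML page (hypothetical file name) */
ods html file="url_summary.html";
proc print data=URL_Summary noobs;
run;
ods html close;
```

With the sample rows, http://example.com/a and http://example.org/b should each get a count of 2 with a first appearance on page 1, and http://example.net/c a count of 1 on page 3.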
Appendix: PRX parse strings
    prxid=prxparse('parse string');

    /(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&)/o

Labels called out on the slide: outer control, non-captured group, any-of, one or more, as-is, escaped, grouping.