
Scraping the Web with SAS

Tom Kari, Tom Kari Consulting. OASUS, June 12, 2013.


Presentation Transcript


  1. Scraping the Web with SAS. Tom Kari, Tom Kari Consulting. OASUS, June 12, 2013.

  2. Google is wonderful, but…
     • The first page is full of junk!
     • I can’t tell how many pages I’m getting from each site.
     • I KNOW the page I want is in here somewhere, how can I find it?
     • I’m not using SAS when I use Google!
     • How can I keep ALL the results to analyze?

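  3. The Basics
     data URL_Retrieval_Results;
       length HTML_Rec $32767;
       /* Point a fileref at the web page using the URL access method */
       filename HTML_In url "http://www.dolphinsdance.ca";
       infile HTML_In lrecl=32767;
       /* Read one line of the page and keep it as a character variable */
       input;
       HTML_Rec = _infile_;
     run;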

  4. The Process
     • What goes in the reference to Google?
     • Get results from Google
     • How do I find the web sites listed by Google?
     • Extract the web sites
     • Figure out how to get 1000 web site listings
     • Post-process the results (SAS data management)

  5. How to send a search to Google?
     • In Internet Explorer:
       • F12 to open Developer Tools
       • Network → Start Capturing
       • Enter your search string
       • Stop Capturing
       • Dig around in the results
     http://www.google.ca/s?gs_rn=14&gs_ri=psy-ab&cp=41&gs_id=a&xhr=t&q=beautiful%20vacation%20resort%20puerto%20vallarta&es_nrs=true&pf=p&output=search&sclient=psy-ab&oq=&gs_l=&pbx=1&bav=on.2,or.r_qf.&bvm=bv.47008514,d.dmg&fp=5ad817295c2c0080&biw=1123&bih=374&tch=1&ech=1&psi=8xOlUdWjBOT84AO1iYCwDw.1369773041400.1
     http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta&start=1
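The short URL above can be built from a plain search string inside SAS. The following is a minimal sketch, not code from the presentation: the data set name `Encoded_Search` and the variable names are made up for illustration, and it assumes Google accepts `+` in place of `%20` for a space, as the two URLs on the slide suggest.

```sas
data Encoded_Search;
  length query $200 encoded $600 search_url $1000;
  query = 'beautiful vacation resort puerto vallarta';
  /* URLENCODE escapes characters that cannot appear in a URL; a space becomes %20 */
  encoded = urlencode(strip(query));
  /* Switch to the + form used in the short URL on the slide */
  encoded = tranwrd(encoded, '%20', '+');
  /* Single quotes keep the macro processor away from the ampersand */
  search_url = cats('http://www.google.ca/search?q=', encoded, '&start=1');
  put search_url=;
run;
```

Building the string in a DATA step sidesteps the ampersand problem, because nothing in single quotes is scanned for macro references.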

  6. Get Results from Google
     data GoogleResults;
       length HTML_Rec $32767;
       filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";
       infile HTML_In lrecl=32767;   /* 32,767 bytes */
       input;
       HTML_Rec = _infile_;
     run;

  7. How do I find the web sites listed by Google?
     <div id="res"><div id="topstuff"></div><div id="search"><div id="ires"><ol><li class="g"><h3 class="r"><a href="/url?q=http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-Puerto_Vallarta.html&amp;sa=U&amp;ei=bhmlUbyUHPKw0QHk1YFg&amp;ved=0CCYQFjAAOAE&amp;usg=AFQjCNFLqCMjy4b4raYjbA8nvqHjJARGlA">Dreams <b>Puerto Vallarta Resort</b> &amp; Spa - All-inclusive <b>Resort</b> Reviews <b>...</b></a></h3><div class="s"><div class="kv" style="margin-bottom:2px"><cite>www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_<b>Puerto</b>_<b>Vallarta</b>_<b>Resort</b>_Spa-<b>Puerto</b>_<b>Vallarta</b>.html</cite><span class="flc"> - <a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:gaglProuhbkJ:http://www.tripadvisor.ca/Hotel_Review-g150793-d481596-Reviews-Dreams_Puerto_Vallarta_Resort_Spa-

  8. How do I find the web sites listed by Google? (cont’d)
     The magic of PRX routines!
     "Pattern matching enables you to search for and extract multiple matching patterns from a character string in one step. Pattern matching also enables you to make several substitutions in a string in one step. You do this by using the PRX functions and CALL routines in the DATA step. For example, you can search for multiple occurrences of a string and replace those strings with another string. You can search for a string in your source file and return the position of the match. You can find words in your file that are doubled."
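The three uses named in that quotation can be tried on a short string before going anywhere near Google. This is an illustrative sketch only; the data set name `PRX_Demo`, the sample sentence, and the variable names are all invented for the example.

```sas
data PRX_Demo;
  length text changed $80;
  text = 'the the cat sat on the mat';
  /* Replace every occurrence of "cat" or "mat" in one step (-1 means all matches) */
  changed = prxchange('s/\b(cat|mat)\b/rug/', -1, text);
  /* Position of the first match, or 0 when there is none */
  sat_pos = prxmatch('/sat/', text);
  /* A doubled word: a word, some whitespace, then the same word again */
  doubled_pos = prxmatch('/\b(\w+)\s+\1\b/', text);
  put changed= sat_pos= doubled_pos=;
run;
```

PRXCHANGE and PRXMATCH accept the pattern directly as a string, which is convenient for one-off checks; PRXPARSE, used on the next slide, compiles the pattern once and returns an identifier to reuse.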

  9. Extract the web sites
     data GoogleHTMLResult;
       retain prxid;
       /* Compile the pattern once, on the first iteration */
       if _n_=1 then prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');
       length HTML_Rec $32767;
       filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=1";
       infile HTML_In lrecl=32767;
       input;
       HTML_Rec = _infile_;
       /* Position and length of the first match; both are 0 when nothing matches */
       call prxsubstr(prxid,HTML_Rec,HTML_Pos,HTML_Len);
       if HTML_Pos > 0 then CiteData=substr(HTML_Rec,HTML_Pos,HTML_Len);
       output;
     run;
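The pattern can be checked offline against a hard-coded line before it is pointed at live results. The sketch below reuses the parse string from the slide; the data set name `Pattern_Check` and the test line (an `example.com` address shaped like the markup on slide 7) are assumptions made for illustration.

```sas
data Pattern_Check;
  length HTML_Rec $400 CiteData $200;
  retain prxid;
  if _n_=1 then prxid=prxparse('/(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o');
  /* Hypothetical test line, not real Google output */
  HTML_Rec = '<li class="g"><h3 class="r"><a href="/url?q=http://www.example.com/resort.html&amp;sa=U">Example</a></h3>';
  call prxsubstr(prxid, HTML_Rec, HTML_Pos, HTML_Len);
  /* Only call SUBSTR when there was a match, since a position of 0 is invalid */
  if HTML_Pos > 0 then CiteData = substr(HTML_Rec, HTML_Pos, HTML_Len);
  put HTML_Pos= HTML_Len= CiteData=;
run;
```

The look-behind and look-ahead groups anchor the match without becoming part of it, so `CiteData` should hold only the address between `/url?q=` and the first `&amp`.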

  10. Figure out how to get 1000 web site listings
      Quirks to remember
      • Many characters can’t appear in Google search strings, so they must be encoded (spaces to +, etc.)
      • Ampersands in your URL need %nrstr, or they will fail in SAS
      • To use a new URL infile in SAS, you need a new data step. This is easy with a macro loop.
      • Every now and then it fails – “ERROR: Invalid reply received from the HTTP server. Use the debug option for more info.” Beats me!
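A macro loop along these lines gives each page of results its own fileref and its own data step. This is a rough sketch and not the code from “Example 4”: the macro name `get_pages`, the `offset` macro variable, the `Page_n` data set names, and above all the assumption that `start` advances by 10 per page are guesses made for illustration.

```sas
%macro get_pages(pages=3);
  %local i offset;
  %do i = 1 %to &pages;
    /* Assumed paging rule: 10 listings per page, counting from start=1 */
    %let offset = %eval((&i - 1) * 10 + 1);
    /* %nrstr keeps SAS from treating the URL parameter &start as a macro variable */
    filename HTML_In url "http://www.google.ca/search?q=beautiful+vacation+resort+puerto+vallarta%nrstr(&start)=&offset";
    /* A new data step for every new URL */
    data Page_&i;
      length HTML_Rec $32767;
      infile HTML_In lrecl=32767;
      input;
      HTML_Rec = _infile_;
      Page = &i;
    run;
    filename HTML_In clear;
  %end;
%mend get_pages;
```

Under these assumptions, `%get_pages(pages=100)` would request 100 pages, and `data All_Pages; set Page_:; run;` would stack the results. Given the intermittent HTTP error mentioned above, it may be worth checking the log for pages that came back empty.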

  11. Figure out how to get 1000 web site listings (cont’d)
      Code is in “Example 4 Extract 1000 URLs”

  12. Post-process the results
      • Count how many times each URL appears
      • For each unique URL, retain the page and index where it first appears
      • Create a nice-looking HTML page
      • Code is in “Example 5 Post-processed”
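The first two bullets can be sketched with a sort and BY-group processing. This is not the code from “Example 5”: the data set names `AllResults`, `Sorted`, and `URL_Summary`, the variable names, and the five sample rows are all hypothetical stand-ins for the extracted listings.

```sas
/* Hypothetical extracted results: one row per listing */
data AllResults;
  length URL $200;
  input Page Result_Index URL $;
  datalines;
1 1 http://www.example.com/a
1 2 http://www.example.com/b
2 1 http://www.example.com/a
3 4 http://www.example.com/a
2 7 http://www.example.com/b
;
run;

/* Within each URL, the earliest page and index sort to the top */
proc sort data=AllResults out=Sorted;
  by URL Page Result_Index;
run;

data URL_Summary;
  set Sorted;
  by URL;
  retain First_Page First_Index;
  if first.URL then do;
    Times_Listed = 0;
    First_Page = Page;
    First_Index = Result_Index;
  end;
  /* Sum statement: Times_Listed is retained automatically */
  Times_Listed + 1;
  if last.URL then output;
  keep URL Times_Listed First_Page First_Index;
run;
```

For the HTML page, one option is to wrap a PROC PRINT of `URL_Summary` between `ods html file=...;` and `ods html close;`.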

  13. Tom Kari, Tom Kari Consulting

  14. Appendix: PRX parse strings
      prxid=prxparse('parse string');
      /(?<=<h3 class="r"><a href="\/url\?q=)[[:alnum:]-\._~:\/\?#\[\]@!\$''\(\)\*\+,;=]+(?=&amp)/o
      Diagram labels on the slide: outer control, non-captured group, any-of, one or more, as-is, escaped, grouping.
