Web Scraping Using Nutch and Solr 3/3

Solr Extracting Data • Start this session with a fully indexed Solr repository • Movie cAiYBD4BQeE showed installation • Movie Th5Scvlyt-E showed Nutch web crawl • This movie will show how to • Extract data from Solr • Extract to XML or CSV • Prepare the extract for loading into a data warehouse • This movie assumes you know Linux

Solr Extracting Data • Progress so far, greyed out area yet to be examined

Checking Solr Data • Data should have been indexed in Solr • In Solr Admin window • Set 'Core Selector' = collection1 • Click 'Query' • In Query window set fl field = url • Click Execute Query • The result (next slide) shows the filtered list of urls in Solr

Checking Solr Data

How To Extract • How could we get at Solr data? • In admin console via query • Via http solr select • Via curl -o call using solr http select • Which data formats suit this purpose? • XML • Comma-separated values (CSV)

How To Extract • We want to extract two columns from Solr • tstamp, url • We want to extract as csv (the csv in the call below could be xml) • We want to extract to a file • So we will use an http call • http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv • We will also use a curl call • curl -o <csv file> '<http call>'
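The http call above differs only in its fl (field list) and wt (output format) parameters, so it can be assembled from those two values. The following is a minimal sketch, assuming the same host, port and path as the slide; the function name solr_select_url is hypothetical and not part of Solr.

```shell
#!/bin/bash
# Hypothetical helper (not part of Solr): builds the select URL from the slide.
# $1 = comma-separated field list, e.g. tstamp,url
# $2 = response format, csv or xml (defaults to csv)
solr_select_url() {
  local fields="$1"
  local format="${2:-csv}"
  printf 'http://localhost:8983/solr/select?q=*:*&fl=%s&wt=%s\n' "$fields" "$format"
}

# Print the CSV and XML variants of the same query.
solr_select_url tstamp,url csv
solr_select_url tstamp,url xml
```

With a running Solr instance, the result can be passed to curl in the same way as on the slide, for example curl -o result.csv "$(solr_select_url tstamp,url csv)". The double quotes matter, because the URL contains * and & characters that the shell would otherwise interpret.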

How To Extract • Create a bash file in the Solr install directory • cd solr-4-2-1/extract ; touch solr_url_extract.bash • chmod 755 solr_url_extract.bash • Add contents to bash file • #!/bin/bash • curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv' • mv result.csv result.csv.$(date +"%Y%m%d.%H%M%S") • Now run the bash script • ./solr_url_extract.bash
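The script above renames result.csv even if the curl call wrote nothing. One possible refinement, sketched here under the assumption that an empty or missing file should be treated as a failed extract, is to guard the rename step. The function name stamp_file is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: append a timestamp to a file name, as the mv line in
# the slide does, but refuse to rename a missing or empty file.
stamp_file() {
  local src="$1"
  if [ ! -s "$src" ]; then
    echo "stamp_file: $src is missing or empty" >&2
    return 1
  fi
  local dest
  dest="$src.$(date +%Y%m%d.%H%M%S)"
  # Print the new name on success so the caller can log or reuse it.
  mv -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

In the extract script this could follow the curl line as stamp_file result.csv, so that a failed download leaves no timestamped file behind.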

Check Output • Now we check whether we have data • ls -l shows • result.csv.20130506.124857 • Check the content, wc -l shows 11 lines • Check the content, head -2 shows • tstamp, url • 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search? DateRange=7& ... • Congratulations, you have extracted data from Solr • It's in CSV format ready to be loaded into a data warehouse
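The manual checks above (head and wc -l) can be combined into one small function for repeated extracts. This is a sketch only: check_extract is a hypothetical name, and the expected header is passed in as a parameter rather than assumed, since the exact header text depends on the Solr output.

```shell
#!/bin/bash
# Hypothetical helper: verify the first line of an extract and report how
# many data rows follow it.
# $1 = extract file, $2 = expected header line
check_extract() {
  local file="$1" expected_header="$2"
  local header total
  header="$(head -n 1 "$file")"
  if [ "$header" != "$expected_header" ]; then
    echo "check_extract: unexpected header: $header" >&2
    return 1
  fi
  # Data rows = total lines minus the header line.
  total="$(wc -l < "$file")"
  echo "$((total - 1))"
}
```

For a file like the one on the slide, where wc -l reports 11 lines, a call such as check_extract result.csv.20130506.124857 "<header line>" would print 10 when the header matches.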

Possible Next Steps • Choose more fields to extract from data • Allow Nutch crawl to go deeper • Allow Nutch crawl to collect a lot more data • Look at facets in Solr data • Load CSV files into Data Warehouse Staging schema • Next movie will show next step in progress

Contact Us • Feel free to contact us at • www.semtech-solutions.co.nz • info@semtech-solutions.co.nz • We offer IT project consultancy • We are happy to hear about your problems • You can just pay for those hours that you need • To solve your problems
