A short presentation (part 2 of 3) describing the use of the open source tools Nutch and Solr to crawl the web and process the resulting data.
Web Scraping Using Nutch and Solr - Part 2
• The following example assumes that you have
  • Watched “Web Scraping Using Nutch and Solr” (part 1)
  • The above movie's identity is cAiYBD4BQeE
  • Set up a Linux-based Nutch/Solr environment
  • Run the web scrape shown in that movie
• Now we will
  • Clean up that environment
  • Web scrape a parameterised URL
  • View the URLs in the data
Empty the Nutch Database
• Clean up the Nutch crawl database
  • We previously ran apache-nutch-1.6/nutch_start.sh
  • The script contained the -dir crawl option
  • This created the apache-nutch-1.6/crawl directory
  • Which contains our Nutch data
• Clean it up with
  • cd apache-nutch-1.6 ; rm -rf crawl
  • Only do this because it contained dummy data!
• The next run of the script will recreate the directory
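The clean-up step above can be sketched as a small guarded script. The directory layout (apache-nutch-1.6 under the current directory) is an assumption taken from the slides; adjust the path to match your install. The mkdir line only stands in for an existing crawl directory so the sketch is self-contained.

```shell
# Sketch: remove the Nutch crawl directory, but only if it exists.
# 'apache-nutch-1.6' under the current directory is an assumed path.
NUTCH_DIR="apache-nutch-1.6"
mkdir -p "$NUTCH_DIR/crawl"         # stand-in for an existing crawl dir (demo only)

if [ -d "$NUTCH_DIR/crawl" ]; then
    rm -rf "$NUTCH_DIR/crawl"       # quoted and checked first, so the rm is contained
    echo "crawl data removed"
fi
```

Checking for the directory first keeps a mistyped path from silently doing nothing (or worse, deleting the wrong thing).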
Empty the Solr Database
• Clean the Solr index via an HTTP update request
• Bookmark this command
• Only use it if you need to empty your data
• Run the following with curl (while the Solr server is running)
  • curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" -d '<delete><query>*:*</query></delete>'
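A guarded version of that delete request is sketched below; it assumes Solr is on the default localhost:8983 from the slides and checks reachability first, since posting a delete-all query is not something to fire blindly.

```shell
# Sketch: empty the Solr index over HTTP, assuming Solr on localhost:8983.
SOLR_UPDATE="http://localhost:8983/solr/update"

if curl -s -o /dev/null --max-time 2 "$SOLR_UPDATE?commit=true"; then
    # delete every document, then commit in the same request
    curl -s "$SOLR_UPDATE?commit=true" \
         -H "Content-Type: text/xml" \
         -d '<delete><query>*:*</query></delete>'
else
    echo "Solr not reachable at $SOLR_UPDATE - nothing deleted"
fi
```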
Set up Nutch
• Now we will do something more complex
  • Web scrape a URL that has parameters, i.e.
  • http://<site>/<function>?var1=val1&var2=val2
• This web scrape will
  • Have extra URL characters '?=&'
  • Need a greater search depth
  • Need better URL filtering
• Remember that you need permission before scraping a third-party web site
Nutch Configuration
• Change the seed file for Nutch
  • apache-nutch-1.6/urls/seed.txt
• In this instance I will use a URL of the form
  • http://somesite.co.nz/Search?DateRange=7&industry=62
  • (this is not a real URL – just an example)
• Change the conf/regex-urlfilter.txt entry, i.e.
  • # skip URLs containing certain characters
  • -[*!@]
  • # accept anything else
  • +^http://([a-z0-9]*\.)*somesite.co.nz\/Search
• This will only consider somesite Search URLs
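The two configuration edits above can be sketched as a script that writes both files. It works on a scratch copy of the directory tree (a deliberate assumption, so nothing in a real install is touched); in practice you would edit the files under your actual apache-nutch-1.6 directory.

```shell
# Sketch: write the seed list and url filter into a scratch Nutch tree.
# The mktemp scratch directory is so the demo never touches a real install.
NUTCH="$(mktemp -d)/apache-nutch-1.6"
mkdir -p "$NUTCH/urls" "$NUTCH/conf"

# Seed URL with query-string parameters (example site from the slides)
cat > "$NUTCH/urls/seed.txt" <<'EOF'
http://somesite.co.nz/Search?DateRange=7&industry=62
EOF

# Filter: skip junk characters, then accept only somesite.co.nz Search urls
cat > "$NUTCH/conf/regex-urlfilter.txt" <<'EOF'
# skip URLs containing certain characters
-[*!@]
# accept anything else
+^http://([a-z0-9]*\.)*somesite.co.nz/Search
EOF

grep -c 'Search' "$NUTCH/urls/seed.txt"   # prints 1: one matching seed line
```

The quoted heredocs ('EOF') matter here: they stop the shell from expanding the & and regex characters inside the file contents.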
Run Nutch
• Now run Nutch using the start script
  • cd apache-nutch-1.6 ; ./nutch_start.bash
• Monitor for errors in the Solr admin log window
• The Nutch crawl should end with
  • crawl finished: crawl
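The slides reference nutch_start.bash without showing it; the sketch below is one plausible shape for it, using the Nutch 1.6 era `bin/nutch crawl` command. The depth and topN values are assumptions, not taken from the slides, and the script bails out politely if the Nutch directory is not present.

```shell
#!/bin/bash
# nutch_start.bash -- a sketch of the crawl start script
# (Nutch 1.6 'bin/nutch crawl' syntax; depth/topN values are assumptions)
cd apache-nutch-1.6 2>/dev/null || { echo "apache-nutch-1.6 not found"; exit 0; }

# crawl the seed list in urls/, write state to -dir crawl,
# index results into the local Solr, 3 levels deep, 50 links per level
bin/nutch crawl urls -dir crawl \
    -solr http://localhost:8983/solr/ \
    -depth 3 -topN 50
```

The -dir crawl option is what creates the apache-nutch-1.6/crawl directory cleaned up earlier.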
Checking the Data
• The data should have been indexed in Solr
• In the Solr admin window
  • Set 'Core Selector' = collection1
  • Click 'Query'
  • In the query window set the fl field = url
  • Click 'Execute Query'
• The result (next slide) shows the filtered list of URLs in Solr
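The same check can be run from the command line instead of the admin window. The sketch below assumes the default core name collection1 and the Solr select handler, and probes reachability before querying.

```shell
# Sketch: list the indexed urls via Solr's select handler,
# assuming the core is named 'collection1' on localhost:8983.
QUERY="http://localhost:8983/solr/collection1/select?q=*:*&fl=url&rows=10&wt=json"

if curl -s -o /dev/null --max-time 2 "http://localhost:8983/solr/"; then
    curl -s "$QUERY"      # q=*:* matches everything; fl=url returns only the url field
else
    echo "Solr not reachable on localhost:8983"
fi
```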
Results
• Congratulations, you have completed your second crawl
  • With parameterised URLs
  • With more complex URL filtering
  • With a Solr query search
Contact Us
• Feel free to contact us at
  • www.semtech-solutions.co.nz
  • info@semtech-solutions.co.nz
• We offer IT project consultancy
• We are happy to hear about your problems
  • You can pay for just the hours that you need
  • To solve your problems