A short presentation ( part 1 of 3 ) describing the use of the open source projects Apache Nutch and Apache Solr to crawl a web site and index the resulting data.
Web Scraping Using Nutch and Solr • A simple example of using open source code • Web scrape a single web site - ours • Environment and code • CentOS 6.2 ( Linux ) • Apache Nutch 1.6 • Solr 4.2.1 • Java 1.6
Nutch and Solr Architecture • Nutch crawls urls and feeds the page content to Solr • Solr indexes the content for searching
Where to get source code • Nutch • http://nutch.apache.org • Solr • http://lucene.apache.org/solr • Java • http://java.com
Installing Source - Nutch • Nutch is delivered as • apache-nutch-1.6-bin.tar ( 64M ) • apache-nutch-1.6-src.tar ( 20M ) • Copy each tar file to your desired location • Extract each tar file with • tar xvf <tar file> • The source ( src ) tar file is optional
Installing Source - Solr • Solr is delivered as • solr-4.2.1.zip ( 116M ) • Copy the file to your desired location • Extract it with • unzip <zip file>
Configuring Nutch Part 1 • Assuming we will crawl a single web site • Ensure that JAVA_HOME is set • cd apache-nutch-1.6 • Edit agent name in conf/nutch-site.xml <property> <name>http.agent.name</name> <value>Nutch Spider</value> </property> • mkdir -p urls ; cd urls ; touch seed.txt
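The directory and seed-file setup above can be done as one short shell sequence, run from inside the apache-nutch-1.6 directory (the seed url itself is the one added on the next slide):

```shell
# Create the seed directory and write the start URL for the crawl.
mkdir -p urls
echo "http://www.semtech-solutions.co.nz" > urls/seed.txt

# Confirm the seed list contents.
cat urls/seed.txt
```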
Configuring Nutch Part 2 • Add the following url ( ours ) to seed.txt • http://www.semtech-solutions.co.nz • Change the url filtering in conf/regex-urlfilter.txt: change the line • # accept anything else • +. • To be • +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/ • This restricts the crawl to urls from our own site
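The filter pattern can be sanity-checked outside Nutch with grep -E, which accepts the same extended-regex syntax (a quick sketch only; this is not how Nutch itself applies the filter):

```shell
# URLs on our own site should match the filter; anything else should not.
pattern='^http://([a-z0-9]*\.)*semtech-solutions.co.nz/'

# Prints 1: the URL matches.
echo "http://www.semtech-solutions.co.nz/about" | grep -cE "$pattern"

# Prints 0 (no match), so the fallback message is printed.
echo "http://www.example.com/" | grep -cE "$pattern" || echo "filtered out"
```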
Configuring Solr Part 1 • cd solr-4.2.1/example/solr/collection1/conf • Add the extra fields that Nutch needs to schema.xml, after the _version_ field
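The field list itself did not survive the slide export. The fields below are the ones the Nutch 1.x / Solr integration conventionally uses, based on the schema.xml that ships in Nutch's own conf directory; treat this as a sketch and verify the names and types against your copy:

```xml
<!-- Extra fields Nutch writes when indexing into Solr; add after the _version_ field.
     Names and types follow the schema.xml bundled with Nutch (verify against your copy). -->
<field name="host"    type="string" stored="false" indexed="true"/>
<field name="segment" type="string" stored="true"  indexed="false"/>
<field name="digest"  type="string" stored="true"  indexed="false"/>
<field name="boost"   type="float"  stored="true"  indexed="false"/>
<field name="tstamp"  type="date"   stored="true"  indexed="false"/>
<field name="url"     type="string" stored="true"  indexed="true"/>
<field name="content" type="text_general" stored="true" indexed="true"/>
<field name="title"   type="text_general" stored="true" indexed="true"/>
<field name="anchor"  type="string" stored="true" indexed="true" multiValued="true"/>
```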
Start Solr Server – Part 1 • Within solr-4.2.1/example • Run the following command • java -jar start.jar • Now try to access the Solr admin web page • http://localhost:8983/solr/admin • You should now see the admin web site • ( see next page )
Start Solr Server – Part 2 • Solr Admin web page
Run Nutch / Solr • We are ready to crawl our first web site • Go to the apache-nutch-1.6 directory • Run the following commands • touch nutch_start.bash • chmod 755 nutch_start.bash • vi nutch_start.bash • Add the following text to the file #!/bin/bash bin/nutch crawl urls -solr http://localhost:8983/solr/ \ -dir crawl -depth 3 -topN 3
Run Nutch / Solr • Now run the nutch bash file • ./nutch_start.bash • Select the Logging option on the admin console • Monitor the Logging console for errors • The crawl should finish without errors, showing the line • Crawl finished: crawl • in the crawl window
Check Crawled Data • Now we check the data that we have crawled • In the Admin Console window • Set Core Selector to collection1 • Select the Query option • Click the Execute Query button • You should now see some of the data that you have crawled
Crawled Data • Crawled data in solr query
Crawled Data • That's your first simple crawl completed • Further reading at • http://nutch.apache.org • http://lucene.apache.org/solr • Now you can • Add more urls to your seed.txt • Increase the depth and breadth of the link search via the options • -depth • -topN • Modify your url filtering
Contact Us • Feel free to contact us at • www.semtech-solutions.co.nz • info@semtech-solutions.co.nz • We offer IT project consultancy • We are happy to hear about your problems • You pay only for the hours that you need to solve them