A short presentation (part 2 of 3) describing the use of the open source tools Nutch and Solr to crawl the web and process the resulting data.
Web Scraping Using Nutch and Solr - Part 2
• The following example assumes that you have
  • Watched “Web Scraping Using Nutch and Solr” (part 1)
  • The above movie's identity is cAiYBD4BQeE
  • Set up a Linux-based Nutch/Solr environment
  • Run the web scrape shown in that movie
• Now we will
  • Clean up that environment
  • Web scrape a parameterised URL
  • View the URLs in the data
Empty the Nutch Database
• Clean up the Nutch crawl database
  • We previously ran apache-nutch-1.6/nutch_start.sh
  • The script contained the -dir crawl option
  • This created the apache-nutch-1.6/crawl directory
  • Which contains our Nutch data
• Clean it up with
  • cd apache-nutch-1.6 ; rm -rf crawl
  • Only do this because it contained dummy data!
• The next run of the script will recreate the directory
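The clean-up step above can be sketched as a small guarded script. The directory layout (apache-nutch-1.6 under the current directory) is an assumption taken from the slides; adjust the path to match your install. The mkdir line only stands in for an existing crawl directory so the sketch is self-contained.

```shell
# Sketch: remove the Nutch crawl directory, but only if it exists.
# 'apache-nutch-1.6' under the current directory is an assumed path.
NUTCH_DIR="apache-nutch-1.6"
mkdir -p "$NUTCH_DIR/crawl"         # stand-in for an existing crawl dir (demo only)

if [ -d "$NUTCH_DIR/crawl" ]; then
    rm -rf "$NUTCH_DIR/crawl"       # quoted and checked first, so the rm is contained
    echo "crawl data removed"
fi
```

Checking for the directory first keeps a mistyped path from silently doing nothing (or worse, deleting the wrong thing).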
Empty the Solr Database
• Clean the Solr index via an HTTP update request
• Bookmark this command
• Only use it if you need to empty your data
• Run the following with curl (while the Solr server is running)
  • curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" -d '<delete><query>*:*</query></delete>'
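A guarded version of that delete request is sketched below; it assumes Solr is on the default localhost:8983 from the slides and checks reachability first, since posting a delete-all query is not something to fire blindly.

```shell
# Sketch: empty the Solr index over HTTP, assuming Solr on localhost:8983.
SOLR_UPDATE="http://localhost:8983/solr/update"

if curl -s -o /dev/null --max-time 2 "$SOLR_UPDATE?commit=true"; then
    # delete every document, then commit in the same request
    curl -s "$SOLR_UPDATE?commit=true" \
         -H "Content-Type: text/xml" \
         -d '<delete><query>*:*</query></delete>'
else
    echo "Solr not reachable at $SOLR_UPDATE - nothing deleted"
fi
```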
Set up Nutch
• Now we will do something more complex
  • Web scrape a URL that has parameters, i.e.
  • http://<site>/<function>?var1=val1&var2=val2
• This web scrape will
  • Have extra URL characters '?=&'
  • Need a greater search depth
  • Need better URL filtering
• Remember that you need permission before scraping a third-party web site
Nutch Configuration
• Change the seed file for Nutch
  • apache-nutch-1.6/urls/seed.txt
• In this instance I will use a URL of the form
  • http://somesite.co.nz/Search?DateRange=7&industry=62
  • (this is not a real URL – just an example)
• Change the conf/regex-urlfilter.txt entry, i.e.
  • # skip URLs containing certain characters
  • -[*!@]
  • # accept anything else
  • +^http://([a-z0-9]*\.)*somesite.co.nz\/Search
• This will only consider somesite Search URLs
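The two configuration edits above can be sketched as a script that writes both files. It works on a scratch copy of the directory tree (a deliberate assumption, so nothing in a real install is touched); in practice you would edit the files under your actual apache-nutch-1.6 directory.

```shell
# Sketch: write the seed list and url filter into a scratch Nutch tree.
# The mktemp scratch directory is so the demo never touches a real install.
NUTCH="$(mktemp -d)/apache-nutch-1.6"
mkdir -p "$NUTCH/urls" "$NUTCH/conf"

# Seed URL with query-string parameters (example site from the slides)
cat > "$NUTCH/urls/seed.txt" <<'EOF'
http://somesite.co.nz/Search?DateRange=7&industry=62
EOF

# Filter: skip junk characters, then accept only somesite.co.nz Search urls
cat > "$NUTCH/conf/regex-urlfilter.txt" <<'EOF'
# skip URLs containing certain characters
-[*!@]
# accept anything else
+^http://([a-z0-9]*\.)*somesite.co.nz/Search
EOF

grep -c 'Search' "$NUTCH/urls/seed.txt"   # prints 1: one matching seed line
```

The quoted heredocs ('EOF') matter here: they stop the shell from expanding the & and regex characters inside the file contents.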
Run Nutch
• Now run Nutch using the start script
  • cd apache-nutch-1.6 ; ./nutch_start.bash
• Monitor for errors in the Solr admin log window
• The Nutch crawl should end with
  • crawl finished: crawl
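The slides reference nutch_start.bash without showing it; the sketch below is one plausible shape for it, using the Nutch 1.6 era `bin/nutch crawl` command. The depth and topN values are assumptions, not taken from the slides, and the script bails out politely if the Nutch directory is not present.

```shell
#!/bin/bash
# nutch_start.bash -- a sketch of the crawl start script
# (Nutch 1.6 'bin/nutch crawl' syntax; depth/topN values are assumptions)
cd apache-nutch-1.6 2>/dev/null || { echo "apache-nutch-1.6 not found"; exit 0; }

# crawl the seed list in urls/, write state to -dir crawl,
# index results into the local Solr, 3 levels deep, 50 links per level
bin/nutch crawl urls -dir crawl \
    -solr http://localhost:8983/solr/ \
    -depth 3 -topN 50
```

The -dir crawl option is what creates the apache-nutch-1.6/crawl directory cleaned up earlier.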
Checking the Data
• The data should have been indexed in Solr
• In the Solr admin window
  • Set 'Core Selector' = collection1
  • Click 'Query'
  • In the query window set the fl field = url
  • Click 'Execute Query'
• The result (next slide) shows the filtered list of URLs in Solr
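The same check can be run from the command line instead of the admin window. The sketch below assumes the default core name collection1 and the Solr select handler, and probes reachability before querying.

```shell
# Sketch: list the indexed urls via Solr's select handler,
# assuming the core is named 'collection1' on localhost:8983.
QUERY="http://localhost:8983/solr/collection1/select?q=*:*&fl=url&rows=10&wt=json"

if curl -s -o /dev/null --max-time 2 "http://localhost:8983/solr/"; then
    curl -s "$QUERY"      # q=*:* matches everything; fl=url returns only the url field
else
    echo "Solr not reachable on localhost:8983"
fi
```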
Results
• Congratulations, you have completed your second crawl
  • With parameterised URLs
  • With more complex URL filtering
  • With a Solr query search
Contact Us
• Feel free to contact us at
  • www.semtech-solutions.co.nz
  • info@semtech-solutions.co.nz
• We offer IT project consultancy
• We are happy to hear about your problems
  • You can pay for just the hours that you need
  • To solve your problems