A short presentation ( part 1 of 3 ) describing the use of the open source projects Apache Nutch and Apache Solr to crawl a web site and index the resulting data.
Web Scraping Using Nutch and Solr • A simple example of using open source code • Web scrape a single web site - ours • Environment and code • CentOS 6.2 ( Linux ) • Apache Nutch 1.6 • Solr 4.2.1 • Java 1.6
Nutch and Solr Architecture • Nutch crawls urls and feeds the page content to Solr • Solr indexes the content for searching
Where to get source code • Nutch • http://nutch.apache.org • Solr • http://lucene.apache.org/solr • Java • http://java.com
Installing Source - Nutch • Nutch is delivered as • apache-nutch-1.6-bin.tar ( 64M ) • apache-nutch-1.6-src.tar ( 20M ) • Copy each tar file to your desired location • Extract each tar file with • tar xvf <tar file> • The source ( src ) tar file is optional
Installing Source - Solr • Solr is delivered as • solr-4.2.1.zip ( 116M ) • Copy the file to your desired location • Extract it with • unzip <zip file>
Configuring Nutch Part 1 • Assuming we will crawl a single web site • Ensure that JAVA_HOME is set • cd apache-nutch-1.6 • Edit agent name in conf/nutch-site.xml <property> <name>http.agent.name</name> <value>Nutch Spider</value> </property> • mkdir -p urls ; cd urls ; touch seed.txt
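The directory and seed-file setup above can be done as one short shell sequence, run from inside the apache-nutch-1.6 directory (the seed url itself is the one added on the next slide):

```shell
# Create the seed directory and write the start URL for the crawl.
mkdir -p urls
echo "http://www.semtech-solutions.co.nz" > urls/seed.txt

# Confirm the seed list contents.
cat urls/seed.txt
```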
Configuring Nutch Part 2 • Add the following url ( ours ) to seed.txt • http://www.semtech-solutions.co.nz • Change the url filtering in conf/regex-urlfilter.txt: change the line • # accept anything else • +. • To be • +^http://([a-z0-9]*\.)*semtech-solutions.co.nz/ • This restricts the crawl to urls from our own site
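The filter pattern can be sanity-checked outside Nutch with grep -E, which accepts the same extended-regex syntax (a quick sketch only; this is not how Nutch itself applies the filter):

```shell
# URLs on our own site should match the filter; anything else should not.
pattern='^http://([a-z0-9]*\.)*semtech-solutions.co.nz/'

# Prints 1: the URL matches.
echo "http://www.semtech-solutions.co.nz/about" | grep -cE "$pattern"

# Prints 0 (no match), so the fallback message is printed.
echo "http://www.example.com/" | grep -cE "$pattern" || echo "filtered out"
```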
Configuring Solr Part 1 • cd solr-4.2.1/example/solr/collection1/conf • Add the extra fields that Nutch needs to schema.xml, after the _version_ field
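The field list itself did not survive the slide export. The fields below are the ones the Nutch 1.x / Solr integration conventionally uses, based on the schema.xml that ships in Nutch's own conf directory; treat this as a sketch and verify the names and types against your copy:

```xml
<!-- Extra fields Nutch writes when indexing into Solr; add after the _version_ field.
     Names and types follow the schema.xml bundled with Nutch (verify against your copy). -->
<field name="host"    type="string" stored="false" indexed="true"/>
<field name="segment" type="string" stored="true"  indexed="false"/>
<field name="digest"  type="string" stored="true"  indexed="false"/>
<field name="boost"   type="float"  stored="true"  indexed="false"/>
<field name="tstamp"  type="date"   stored="true"  indexed="false"/>
<field name="url"     type="string" stored="true"  indexed="true"/>
<field name="content" type="text_general" stored="true" indexed="true"/>
<field name="title"   type="text_general" stored="true" indexed="true"/>
<field name="anchor"  type="string" stored="true" indexed="true" multiValued="true"/>
```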
Start Solr Server – Part 1 • Within solr-4.2.1/example • Run the following command • java -jar start.jar • Now try to access the Solr admin web page • http://localhost:8983/solr/admin • You should now see the admin web site • ( see next page )
Start Solr Server – Part 2 • Solr Admin web page
Run Nutch / Solr • We are ready to crawl our first web site • Go to the apache-nutch-1.6 directory • Run the following commands • touch nutch_start.bash • chmod 755 nutch_start.bash • vi nutch_start.bash • Add the following text to the file #!/bin/bash bin/nutch crawl urls -solr http://localhost:8983/solr/ \ -dir crawl -depth 3 -topN 3
Run Nutch / Solr • Now run the nutch bash file • ./nutch_start.bash • Select the Logging option on the admin console • Monitor the Logging console for errors • The crawl should finish without errors, showing the line • Crawl finished: crawl • in the crawl window
Check Crawled Data • Now we check the data that we have crawled • In the Admin Console window • Set Core Selector to collection1 • Select the Query option • Click the Execute Query button • You should now see some of the data that you have crawled
Crawled Data • Crawled data in solr query
Crawled Data • That's your first simple crawl completed • Further reading at • http://nutch.apache.org • http://lucene.apache.org/solr • Now you can • Add more urls to your seed.txt • Increase the depth and breadth of the link search via the options • -depth • -topN • Modify your url filtering
Contact Us • Feel free to contact us at • www.semtech-solutions.co.nz • info@semtech-solutions.co.nz • We offer IT project consultancy • We are happy to hear about your problems • You pay only for the hours that you need to solve them