Discover the overview, components, and advantages of YouSeer, a complete and flexible open source search engine that integrates Heritrix and Solr. Learn about its architecture, workflow, and demo setup.
Introduction to YouSeer Partha Mukherjee pom5109@ist.psu.edu
Outline • Overview • YouSeer components • Heritrix • Solr • Demo
Overview • YouSeer is a complete and powerful open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene. • Java-based, and runs successfully on Windows. Requirements • 512 MB RAM, 6.5 GB of hard disk space • Java 1.6 (Java 1.5 also works)
Advantages of YouSeer • Built on top of scalable components – Tested on 23M documents, while Solr and Heritrix can scale to billions • Very flexible, and easy to extend • Modifying the index and the ingestion module is easy • The crawler supports complicated crawling policies
YouSeer Components • Heritrix: – The Internet Archive’s crawler – Reported to scale up to 1B documents – Written in Java, and has a web interface • Apache Solr: – Open source enterprise search server based on Lucene – Has a REST-like API – Supports caching, distributed search, and index replication
YouSeer Architecture [Diagram; component labels: WWW, Heritrix, Storage (File System), Middleware, Apache Solr, DB, Apache Tomcat, Cache, Request]
Heritrix Workflow • 1) Choose a URI from among those scheduled • 2) Fetch that URI • 3) Analyze or archive the results • 4) Select discovered URIs of interest, and add them to those scheduled • 5) Note that the URI is done, and repeat
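The five steps above can be sketched as a small loop over an in-memory queue. This is only an illustration of the control flow, not Heritrix's implementation: the function crawl, its parameters (fetch, extract_links, in_scope, max_uris) and the toy PAGES table are all hypothetical names introduced here, and nothing is fetched from the network.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, in_scope, max_uris=100):
    """Run the five-step loop over a simple in-memory frontier.

    All names here are illustrative; Heritrix's real frontier is far richer.
    """
    scheduled = deque(seeds)   # URIs waiting to be fetched
    seen = set(seeds)          # every URI that was ever scheduled
    done = []                  # finished URIs, in processing order
    while scheduled and len(done) < max_uris:
        uri = scheduled.popleft()             # 1) choose a scheduled URI
        body = fetch(uri)                     # 2) fetch that URI
        links = extract_links(uri, body)      # 3) analyze the result
        for link in links:                    # 4) schedule URIs of interest
            if link not in seen and in_scope(link):
                seen.add(link)
                scheduled.append(link)
        done.append(uri)                      # 5) note the URI is done, repeat
    return done


# Toy "web": each page maps to the links it contains (assumed data).
PAGES = {
    "http://a/": ["http://a/1", "http://b/"],
    "http://a/1": ["http://a/"],
}

order = crawl(
    ["http://a/"],
    fetch=lambda uri: PAGES.get(uri, []),
    extract_links=lambda uri, body: body,
    in_scope=lambda uri: uri.startswith("http://a/"),
)
print(order)
```

The in_scope callback stands in for a crawl policy: here it keeps the crawl on one host, so http://b/ is discovered but never scheduled.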
Heritrix Crawl Result • By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files • By default, Heritrix writes compressed versions of the ARC files • The compression is done with gzip • Each record (which contains a document) is gzipped separately • All gzipped records are concatenated to make up a file of multiple gzipped members
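To make the "multiple gzipped members" layout concrete, the sketch below builds a small byte string from two separately gzipped records and then walks it member by member using Python's standard zlib module. It covers only the gzip layer: it does not parse ARC record headers, and the helper name iter_gzip_members and the sample records are assumptions made for illustration.

```python
import gzip
import zlib


def iter_gzip_members(data):
    """Yield the decompressed payload of each gzip member in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payload = decomp.decompress(data) + decomp.flush()
        yield payload
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data


# Two stand-in "records", each compressed on its own and then concatenated.
records = [b"record one", b"record two"]
blob = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(blob))
print(members)
```

Because each record is its own member, a reader can start decompressing at any member boundary without reading the whole file, while an ordinary gzip reader still sees the file as one continuous stream.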
Apache Solr • Very popular distribution of Lucene • Easy to configure and optimize – All modifications are made in the XML files – No need to touch the code • The index has a schema, similar to a database schema – Think of the index as a table in a database, for which you have to define the columns
Solr Schema Example
• <field name="url" type="string" indexed="true" stored="true"/>
• <field name="title" type="text" indexed="true" stored="true"/>
• <field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
• <field name="creationDate" type="date" indexed="true" stored="true"/>
• <field name="rating" type="sint" indexed="true" stored="true"/>
• <field name="published" type="boolean" indexed="true" stored="true"/>
• <field name="content" type="text" indexed="true" stored="true"/>
• <field name="all" type="text" indexed="true" stored="true" multiValued="true"/>
Solr Documents • Solr accepts well-formed XML documents • <add> <doc> – <field name="URL">www.cnn.com</field> – <field name="title">CNN Breaking News – Obama wins</field> – <field name="content">Barack Obama is the 44th president of the USA</field> – <field name="pubDate">2008-11-06T23:59:59.999Z</field> • </doc> </add>
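A document in the <add><doc> shape shown above can be generated programmatically rather than by string concatenation, which also takes care of escaping characters such as & and <. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name build_add_xml and the sample field values are assumptions, and it is not the code YouSeer itself uses.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Return a Solr-style <add><doc>...</doc></add> string.

    `fields` is a list of (name, value) pairs; a list rather than a dict
    so that a multi-valued field can simply appear more than once.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


xml_text = build_add_xml([
    ("URL", "www.cnn.com"),
    ("title", "News & Weather"),
    ("pubDate", "2008-11-06T23:59:59.999Z"),
])
print(xml_text)
```

The resulting string is what would be sent in the body of a request to the Solr update URL; the field names have to match fields defined in the index schema.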
YouSeer workflow • Waits for the crawled documents to be written • Iterates over the compressed files and processes the documents • Extracts the textual content of each document and parses its metadata • Generates an XML file as output – Each custom extractor appends its result to this file • This XML file is submitted to the index
Demo: Configuration • The Solr schema is already configured in your installation • Solr is installed on Tomcat • The Heritrix web interface listens on port 8080 by default, the same port as the Apache Tomcat server • So change it to another port number – e.g. ./heritrix -p 9000
Demo • Download the virtual machine image from http://sourceforge.net/projects/youseer/files/VM/youseer.0.1/fedora-11-i386.zip/download – Unzip fedora-11-i386.zip – The virtual image is a Linux VMware image • To run the VM, you need to download and install VMware Player from: http://www.vmware.com/products/player/ – Double-click the VMware virtual machine configuration icon
Demo • Log in to YouSeer with the password “heritrixsolr”. – You are now in a virtual Linux environment hosted on Windows. • When leaving the VM environment – Log out from youseer (“youseer -> quit”) – Shut down the VM (“shutdown”) – Press Ctrl + Alt to return to your local machine.
Demo • Getting ready to start Heritrix (the crawler) – In the VM, open a terminal – Go to the apps directory (cd apps) – There you will find the solr, tomcat, heritrix-1.14.3, etc. applications – Don’t forget to start the Solr server before running Heritrix • Go to apache-solr…/example/ • Locate the jar file “start.jar” and run it (e.g. java -jar start.jar) • Solr should keep running the whole time.
Demo • Now open another terminal, or another tab in the same terminal – Go to heritrix-1.14.3 under /home/apps. – Run the Heritrix application with the following command line arguments • ./heritrix -p XXXX --admin=nameX:passwordX • Now open the browser in the VM and enter the URL – http://localhost:XXXX – The Heritrix UI appears (username = nameX, password = passwordX)
Demo: Heritrix • Configure your first job • The most important parameter is the user agent, under the job’s configuration settings – It must contain a valid URL and email address – Enter http://www.psu.edu – And your OWN email address – Do not run more than 5 threads • This avoids overloading the machine and crashing the system.
Demo • ARC files are written to: – ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs • To start Tomcat, enter start-tomcat – Solr will start automatically • The YouSeer ingestion module (middleware) is located under: – ~/youseer/release • Add a folder entry to the Apache web server configuration file – This allows cached copies of documents to be retrieved from the ARC files – Use the Solr URL to post the documents – Specify the number of worker threads that process the documents – java -jar YouSeer.jar [IndexURL] [Path_ARCfiles] [Cached_virtual_Folder] [Number_of_Threads] [wait_Time]
Demo • To index documents crawled by Heritrix: – Navigate to ~/youseer/release – Run: java -jar YouSeer.jar http://localhost:8983/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0 • The arguments are, in order: – The Solr update URL – The full path to the ARC files – The virtual directory that maps to the cached files – The number of threads (please keep it below 5) – The wait time between retries
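Since the five arguments are positional, it is easy to swap two of them by accident. The sketch below assembles the same command line from named parameters and enforces the "fewer than 5 threads" advice. The wrapper build_ingest_command is a hypothetical helper written for this illustration; only the argument order and the example values come from the slide above.

```python
def build_ingest_command(index_url, arc_path, cache_dir, threads=1, wait_time=0):
    """Return the YouSeer.jar command line as an argument list.

    Hypothetical convenience wrapper; the argument order follows the slide:
    IndexURL, ARC path, cached virtual folder, thread count, wait time.
    """
    if not 1 <= threads < 5:
        raise ValueError("keep the number of threads between 1 and 4")
    if wait_time < 0:
        raise ValueError("wait time cannot be negative")
    return [
        "java", "-jar", "YouSeer.jar",
        index_url,        # Solr update URL
        arc_path,         # full path to the ARC files
        cache_dir,        # virtual directory mapped to the cached files
        str(threads),     # number of worker threads
        str(wait_time),   # wait time between retries
    ]


cmd = build_ingest_command(
    "http://localhost:8983/solr/update",
    "/absolute/path/to/arc/files",
    "/cachingDirectory",
)
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run with the working directory set to the folder that contains YouSeer.jar.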
Comments • YouSeer records which ARC files have been processed in a database, named submitted.db by default – If you want to re-ingest the documents, update the submitted.db file • Map the virtual directory within the Tomcat configuration with a context entry – <Context path="/cached" docBase="/heritrix-1.14.3/jobs/JOB_NAME/arcs" crossContext="false" debug="0" reloadable="true"/> • The search interface: – http://localhost:8080/youseerui
References • http://youseer.sourceforge.net/doc/Tutorial.pdf • http://crawler.archive.org/articles/user_manual/ • http://lucene.apache.org/solr/tutorial.html To download the components separately: • https://sourceforge.net/projects/youseer/ • https://sourceforge.net/projects/archive-crawler/files/archive-crawler%20(heritrix%201.x)/ • http://www.apache.org/dyn/closer.cgi/lucene/solr