Introduction to YouSeer

Partha Mukherjee, pom5109@ist.psu.edu

Outline
• Overview
• YouSeer components
• Heritrix
• Solr
• Demo

Overview
• YouSeer is a complete and powerful open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene.
• Java-based; runs successfully on Windows

Requirements
• 512 MB RAM, 6.5 GB of hard disk space
• Java 1.6 (Java 1.5 also works)

Advantages of YouSeer
• Built on top of scalable components
– Tested on 23M documents; Solr and Heritrix themselves can scale to billions
• Very flexible and easy to extend
• The index and the ingestion module are easy to modify
• The crawler supports complicated crawling policies

YouSeer Components
• Heritrix:
– The Internet Archive’s crawler
– Reported to scale up to 1B documents
– Written in Java, and has a web interface
• Apache Solr:
– Open source enterprise search server based on Lucene
– Has a REST-like API
– Supports caching, distributed search, and index replication

YouSeer Architecture
[Architecture diagram. Component labels: WWW, Heritrix, Storage (File System), Middleware, DB, Apache Solr, Apache Tomcat, Cache, Request]

Heritrix Workflow
• 1) Choose a URI from among those scheduled
• 2) Fetch that URI
• 3) Analyze or archive the results
• 4) Select discovered URIs of interest and add them to those scheduled
• 5) Note that the URI is done, and repeat
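The five workflow steps can be sketched as a loop over a queue of scheduled URIs. This is only an illustration in Python, not Heritrix code: the `crawl` function, the in-memory `FAKE_WEB` link table that stands in for real fetching, and the `in_scope` policy are all hypothetical.

```python
from collections import deque

# Hypothetical stand-in for the web: each URI maps to the URIs it links to.
FAKE_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://other.example/x"],
    "http://example.org/b": ["http://example.org/"],
}

def in_scope(uri):
    """Assumed crawl policy: stay on example.org."""
    return uri.startswith("http://example.org/")

def crawl(seeds):
    scheduled = deque(seeds)   # URIs waiting to be fetched
    seen = set(seeds)          # URIs that have ever been scheduled
    done = []                  # finished URIs, in completion order
    while scheduled:
        uri = scheduled.popleft()          # 1) choose a scheduled URI
        links = FAKE_WEB.get(uri, [])      # 2) "fetch" it (no network here)
        # 3) analysis or archiving of the result would happen here
        for link in links:                 # 4) schedule discovered URIs of interest
            if in_scope(link) and link not in seen:
                seen.add(link)
                scheduled.append(link)
        done.append(uri)                   # 5) note the URI as done, then repeat
    return done

order = crawl(["http://example.org/"])
print(order)
```

The `seen` set is what keeps the loop from revisiting a URI that several pages link to; the out-of-scope link is discovered but never scheduled.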

Heritrix Crawl Result
• By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files
• By default, the ARC files are written compressed
• The compression is done with gzip
• Each record (which contains a document) is gzipped separately
• All gzipped records are concatenated to make up a file of multiple gzipped members
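The "multiple gzipped members" layout can be illustrated with Python's standard `gzip` and `zlib` modules. The sketch below only demonstrates the container idea: the records are made-up byte strings, real ARC records also carry their own header lines, and `split_gzip_members` is a hypothetical helper name.

```python
import gzip
import zlib

def split_gzip_members(data):
    """Return the decompressed payload of each gzip member found in `data`."""
    payloads = []
    while data:
        # wbits=31 makes zlib expect a gzip header and trailer.
        decomp = zlib.decompressobj(31)
        payload = decomp.decompress(data) + decomp.flush()
        payloads.append(payload)
        data = decomp.unused_data   # bytes that belong to the next member
    return payloads

# Compress each record on its own, then concatenate the members into one blob.
records = [b"record one", b"record two", b"record three"]
blob = b"".join(gzip.compress(r) for r in records)

members = split_gzip_members(blob)
print(members)
```

Because every record is its own gzip member, a reader can decompress one record at a time instead of inflating the whole file.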

Apache Solr
• Very popular distribution of Lucene
• Easy to configure and optimize
– All modifications are made in XML files
– No need to touch the code
• The index has a schema, similar to a database schema
– Think of the index as a table in a database whose columns you have to define

Solr Schema Example
<field name="url" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="creationDate" type="date" indexed="true" stored="true"/>
<field name="rating" type="sint" indexed="true" stored="true"/>
<field name="published" type="boolean" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="all" type="text" indexed="true" stored="true" multiValued="true"/>
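To make the "index as a table with defined columns" analogy concrete, the sketch below reads a few of the field definitions and checks a document against them. It is a toy check written for illustration, not how Solr itself validates documents; the `<fields>` wrapper string, `load_columns`, and `check_document` are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# A few field definitions from the schema example, wrapped so they parse as one document.
SCHEMA_FIELDS = """
<fields>
  <field name="url" type="string" indexed="true" stored="true"/>
  <field name="title" type="text" indexed="true" stored="true"/>
  <field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
  <field name="rating" type="sint" indexed="true" stored="true"/>
</fields>
"""

def load_columns(schema_xml):
    """Map each field name to (type, multiValued) as declared in the schema."""
    columns = {}
    for field in ET.fromstring(schema_xml).iter("field"):
        multi = field.get("multiValued", "false") == "true"
        columns[field.get("name")] = (field.get("type"), multi)
    return columns

def check_document(doc, columns):
    """Return a list of problems in `doc`, a dict of field name -> list of values."""
    problems = []
    for name, values in doc.items():
        if name not in columns:
            problems.append("unknown field: " + name)
        elif len(values) > 1 and not columns[name][1]:
            problems.append("field is not multiValued: " + name)
    return problems

columns = load_columns(SCHEMA_FIELDS)
good = {"url": ["http://www.psu.edu"], "keywords": ["penn", "state"]}
bad = {"title": ["a", "b"], "author": ["x"]}
print(check_document(good, columns))
print(check_document(bad, columns))
```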

Solr Documents
• Solr accepts well-formed XML documents
<add>
  <doc>
    <field name="URL">www.cnn.com</field>
    <field name="title">CNN Breaking News – Obama wins</field>
    <field name="content">Barack Obama is the 44th president of the USA</field>
    <field name="pubDate">2008-11-06T23:59:59.999Z</field>
  </doc>
</add>
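An `<add>` message of this shape can be generated rather than written by hand, which also takes care of escaping characters such as `&` and `<`. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; `build_add_message` is a hypothetical helper, the sample title is invented to show escaping, and posting the result to Solr is not shown.

```python
import xml.etree.ElementTree as ET

def build_add_message(docs):
    """Serialize a list of {field name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value   # special characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")

message = build_add_message([
    {"URL": "www.cnn.com", "title": "Q&A <live>"},
])
print(message)
```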

YouSeer workflow
• Waits for the crawled documents to be written
• Iterates over the compressed files and processes the documents
• Extracts the textual content of each document and parses its metadata
• Generates an XML file as output
– Each custom extractor appends its result to this file
• This XML file is submitted to the index
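The extraction step can be pictured with a small HTML example. The sketch below is not YouSeer's extractor: it uses Python's standard `html.parser` to pull a title and the visible text out of a page, and the `TextAndTitle` class, the `extract` function, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

class TextAndTitle(HTMLParser):
    """Collect the <title> text and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0   # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())

def extract(html):
    """Return the fields an indexer might want: page title and text content."""
    parser = TextAndTitle()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "content": " ".join(parser.chunks)}

page = ("<html><head><title>Penn State</title><style>p{}</style></head>"
        "<body><p>Hello</p><p>world</p></body></html>")
fields = extract(page)
print(fields)
```

Each extracted field would then become one `<field>` element in the XML file that is submitted to the index.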

Demo: Configuration
• The Solr schema is already configured in your installation
• Solr is installed on Tomcat
• The Heritrix web interface listens on port 8080 by default, the same port as the Apache Tomcat server
• So change it to some other port number
– e.g. ./heritrix -p 9000
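Before picking an alternative port, it can help to check that nothing else is already listening on it. The sketch below is a generic check using Python's standard `socket` module; `port_is_free` is a hypothetical helper and is not part of Heritrix or Tomcat.

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a new socket can bind host:port, i.e. no one is using it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True

# Demonstration: occupy a port chosen by the OS, then test it while busy and after release.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
busy_port = listener.getsockname()[1]

while_busy = port_is_free(busy_port)
listener.close()
after_close = port_is_free(busy_port)
print(while_busy, after_close)
```

In the demo setup one would call `port_is_free(9000)` before starting Heritrix with that port.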

Demo
• Download the virtual machine image from http://sourceforge.net/projects/youseer/files/VM/youseer.0.1/fedora-11-i386.zip/download
– Unzip fedora-11-i386.zip
– The virtual image is a Linux VMware image
• To run the VM, download and install VMware Player from http://www.vmware.com/products/player/
– Double-click the VMware virtual machine configuration icon

Demo
• Log in to YouSeer with the password "heritrixsolr".
– You are now in a virtual Linux environment running inside Windows.
• When leaving the VM environment
– Log out from youseer ("youseer -> quit")
– Shut down the VM ("shutdown")
– Press Ctrl + Alt to return to your local machine.

Demo
• Starting Heritrix (the crawler)
– In the VM, open a terminal
– Go to the apps directory (cd apps)
– There you will find the solr, tomcat, heritrix-1.14.3, etc. applications
– Don’t forget to start the Solr server before running Heritrix
• Go to apache-solr…/example/
• Locate the jar file "start.jar" and run it.
• Solr should keep running the whole time.

Demo
• Now open another terminal, or another tab in the same terminal
– Go to heritrix-1.14.3 under /home/apps.
– Run Heritrix with the following command-line arguments
• ./heritrix -p XXXX --admin=nameX:passwordX
• Now open the browser in the VM and go to
– http://localhost:XXXX
– The Heritrix UI appears (username = nameX, password = passwordX)

Demo: Heritrix

Demo: Heritrix
• Configure the first job
• The most important parameter is the user agent, under the configuration settings
– Enter a valid URL and email address
– Enter http://www.psu.edu
– And your OWN email address
– Do not run more than 5 threads
• This avoids overloading the machine and crashing the system.

Demo: Heritrix

Demo: Features of Heritrix

Demo: More features

Demo: Heritrix

Demo
• ARC files are written to:
– ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs
• To start Tomcat, enter start-tomcat
– Solr will start automatically
• The YouSeer ingestion module (middleware) is located under:
– ~/youseer/release
• Add a folder entry to the Apache web server configuration file
– It is used to retrieve cached copies of documents from the ARC files
– Use the URL of Solr to post the documents
– Specify the number of worker threads that process the documents
– java -jar YouSeer.jar [IndexURL] [Path_ARCfiles] [Cached_virtual_Folder] [Number_of_Threads] [wait_Time]

Demo
• To index the documents crawled by Heritrix:
– Navigate to ~/youseer/release
– Run: java -jar YouSeer.jar http://localhost:8983/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0
• The arguments, in order:
– Solr URL
– The full path to the ARC files
– The virtual directory that maps to the cached files
– Number of threads (please keep it below 5)
– Waiting time between retries
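Since the five positional arguments are easy to mix up, a small wrapper can assemble the command line in the documented order. This is a sketch only: `youseer_command` is a hypothetical helper, its checks simply encode the advice on the slide, and nothing here launches Java.

```python
def youseer_command(index_url, arc_path, cache_dir, threads=1, wait_time=0,
                    jar="YouSeer.jar"):
    """Build the argument list for the ingestion module in the documented order."""
    if not 1 <= threads < 5:
        raise ValueError("keep the number of threads between 1 and 4")
    if not arc_path.startswith("/"):
        raise ValueError("the path to the ARC files must be absolute")
    return ["java", "-jar", jar, index_url, arc_path, cache_dir,
            str(threads), str(wait_time)]

cmd = youseer_command(
    "http://localhost:8983/solr/update",
    "/absolute/path/to/arc/files",
    "/cachingDirectory",
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from within the `~/youseer/release` directory.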

Comments
• YouSeer records which ARC files have been processed in a database (default name: submitted.db)
– If you want to re-ingest the documents, update the submitted.db file
• Map the virtual directory within the Tomcat configuration
– <Context path="/cached" docBase="/heritrix-1.14.3/jobs/JOB_NAME/arcs" crossContext="false" debug="0" reloadable="true"/>
• The search interface:
– http://localhost:8080/youseerui
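The idea of remembering processed ARC files, and of forgetting one so that it is ingested again, can be sketched with SQLite. The actual format of submitted.db is not described here, so the table layout and the helper names below (`open_tracker`, `pending`, `mark_done`, `forget`) are purely assumptions for illustration.

```python
import sqlite3

def open_tracker(path=":memory:"):
    """Open (or create) a tracking database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS submitted (arc_file TEXT PRIMARY KEY)")
    return conn

def pending(conn, arc_files):
    """Return the ARC files that have not been recorded as processed."""
    done = {row[0] for row in conn.execute("SELECT arc_file FROM submitted")}
    return [f for f in arc_files if f not in done]

def mark_done(conn, arc_file):
    conn.execute("INSERT OR IGNORE INTO submitted (arc_file) VALUES (?)", (arc_file,))
    conn.commit()

def forget(conn, arc_file):
    """Remove a file from the record so the next run ingests it again."""
    conn.execute("DELETE FROM submitted WHERE arc_file = ?", (arc_file,))
    conn.commit()

conn = open_tracker()
files = ["a.arc.gz", "b.arc.gz"]
mark_done(conn, "a.arc.gz")
first = pending(conn, files)    # a.arc.gz is skipped
forget(conn, "a.arc.gz")
second = pending(conn, files)   # a.arc.gz is pending again
print(first, second)
```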

Shots

Test case (http://pike.psu.edu)

Test case (pike)

References
• http://youseer.sourceforge.net/doc/Tutorial.pdf
• http://crawler.archive.org/articles/user_manual/
• http://lucene.apache.org/solr/tutorial.html

Want to download the components separately?
• https://sourceforge.net/projects/youseer/
• https://sourceforge.net/projects/archive-crawler/files/archive-crawler%20(heritrix%201.x)/
• http://www.apache.org/dyn/closer.cgi/lucene/solr

THANK YOU
