
Introduction to YouSeer


RoyLauris

Presentation Transcript


    Slide 1:Introduction to YouSeer. Madian Khabsa

    Slide 2:Outline. Overview; YouSeer components; Heritrix; Solr; Demo.

    Slide 3:Overview. YouSeer is a complete open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene. It is Java-based and reported to run successfully on Windows (needs confirmation).

    Slide 4:Search Engine: Basic Workflow

    Slide 5:Advantages of YouSeer. Built on top of scalable components: tested on 23M documents, while Solr and Heritrix can scale to billions. Very flexible and easy to extend: modifying the index and the ingestion module is easy. The crawler supports complicated crawling policies.

    Slide 6:YouSeer Components. Heritrix: the Internet Archive’s crawler; reported to scale up to 1B documents; written in Java, with a web interface. Apache Solr: an open source enterprise search server based on Lucene; has a REST-like API; supports caching, distributed search, and index replication.

    Slide 7:YouSeer Architecture

    Slide 8:Heritrix Workflow. 1) Choose a URI from among all those scheduled. 2) Fetch that URI. 3) Analyze or archive the results. 4) Select discovered URIs of interest, and add them to those scheduled. 5) Note that the URI is done, and repeat. Source: “An Introduction to Heritrix. An open source archival quality web crawler”, Gordon Mohr et al.
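
The five steps on this slide can be sketched as a small loop. This is a toy illustration in Python, not Heritrix code: the function name `crawl`, its parameters, and the in-memory `FAKE_WEB` dictionary are all hypothetical, and a real crawler adds politeness delays, per-host queues, and persistent state.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], Iterable[str]],
          in_scope: Callable[[str], bool],
          max_uris: int = 100) -> dict[str, str]:
    """Toy version of the five-step loop; returns {uri: fetched content}."""
    scheduled = deque(seeds)      # URIs waiting to be fetched
    seen = set(scheduled)         # every URI that was ever scheduled
    archive: dict[str, str] = {}
    while scheduled and len(archive) < max_uris:
        uri = scheduled.popleft()              # 1) choose a scheduled URI
        content = fetch(uri)                   # 2) fetch it
        archive[uri] = content                 # 3) analyze or archive the result
        for link in extract_links(content):    # 4) schedule discovered URIs of interest
            if in_scope(link) and link not in seen:
                seen.add(link)
                scheduled.append(link)
        # 5) the URI is done (it stays in `seen`), so repeat
    return archive


# Hypothetical three-page "web": each page body is a space-separated list of links.
FAKE_WEB = {
    "http://a.example/": "http://a.example/1 http://b.example/",
    "http://a.example/1": "http://a.example/",
    "http://b.example/": "",
}
pages = crawl(
    ["http://a.example/"],
    fetch=lambda uri: FAKE_WEB.get(uri, ""),
    extract_links=str.split,
    in_scope=lambda uri: uri.startswith("http://a.example/"),
)
```

The `in_scope` callback stands in for the crawl policy: links outside the chosen prefix are discovered but never scheduled.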

    Slide 9:Heritrix Architecture

    Slide 10:Heritrix Crawl Result. By default, Heritrix writes everything it crawls to disk as compressed version 1 Internet Archive ARC files. The compression is done with gzip: each record (which contains a document) is gzipped on its own, and all gzipped records are concatenated to make up one file of multiple gzipped members.
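
To make the "file of multiple gzipped members" idea concrete, here is a minimal Python sketch that compresses records one at a time, concatenates them, and splits them apart again. The helper names `write_members` and `read_members` are hypothetical, and the byte strings are stand-ins for real ARC records.

```python
import gzip
import zlib


def write_members(records: list[bytes]) -> bytes:
    """Gzip each record on its own and concatenate the resulting members."""
    return b"".join(gzip.compress(record) for record in records)


def read_members(blob: bytes) -> list[bytes]:
    """Split a multi-member gzip stream back into its individual records."""
    records = []
    while blob:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(16 + zlib.MAX_WBITS)
        records.append(member.decompress(blob) + member.flush())
        blob = member.unused_data   # bytes that follow this member
    return records


blob = write_members([b"record one\n", b"record two\n"])
records = read_members(blob)
whole = gzip.decompress(blob)   # the gzip module reads all members as one stream
```

Because every member is a complete gzip stream, a reader that knows a record's byte offset can start decompressing there without touching the earlier records.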

    Slide 11:ARC Record Example
        http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202
        HTTP/1.0 200 Document follows
        Date: Mon, 04 Nov 1996 14:21:06 GMT
        Server: NCSA/1.4.1
        Content-type: text/html
        Last-modified: Sat,10 Aug 1996 22:33:11 GMT
        Content-length: 30
        <HTML>Hello World!!!</HTML>
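
The first line of an ARC record is a space-separated header. The sketch below splits it into named fields. The field names are assumptions based on a reading of the ARC file format description, not something stated on the slide. Note that the example record carries ten fields, whereas a version 1 header is expected to carry five (URL, IP address, archive date, content type, length), so the parser accepts both layouts.

```python
# Assumed field layouts; the names are illustrative labels, not an official API.
V1_FIELDS = ["url", "ip", "archive_date", "content_type", "length"]
V2_FIELDS = ["url", "ip", "archive_date", "content_type", "result_code",
             "checksum", "location", "offset", "filename", "length"]


def parse_arc_header(line: str) -> dict[str, str]:
    """Split a URL-record header line into named, space-separated fields."""
    values = line.strip().split(" ")
    for names in (V1_FIELDS, V2_FIELDS):
        if len(values) == len(names):
            return dict(zip(names, values))
    raise ValueError(f"unexpected number of fields: {len(values)}")


header = parse_arc_header(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 "
    "text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202"
)
```

In both layouts the last field is the length in bytes of the content that follows the header line, which is what lets a reader skip from one record to the next.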

    Slide 12:Some Heritrix features. Override the crawling policy per domain. Include or exclude specific media types, domains, and file sizes. Limit the number of threads, the connection bandwidth, and the number of writing threads. Obey or ignore robots.txt: PLEASE OBEY ROBOTS.TXT.
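
What "obeying robots.txt" means in practice can be shown with Python's standard `urllib.robotparser`. This is only an illustration of the rule check; Heritrix performs its own robots handling, and the rules, URLs, and the `my-crawler` agent name below are made up.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes one directory to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse() takes the file as a list of lines


def allowed(url: str, user_agent: str = "my-crawler") -> bool:
    """True if the rules above permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


public_ok = allowed("http://site.example/index.html")
private_ok = allowed("http://site.example/private/report.html")
```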

    Slide 13:Apache Solr. Very popular distribution of Lucene. Easy to configure and optimize: all modifications are made in the XML files, with no need to touch the code. The index has a schema, similar to a database schema: think of the index as a table in a database whose columns you have to define.

    Slide 14:Solr Schema Example
        <field name="url" type="string" indexed="true" stored="true"/>
        <field name="title" type="text" indexed="true" stored="true"/>
        <field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
        <field name="creationDate" type="date" indexed="true" stored="true"/>
        <field name="rating" type="sint" indexed="true" stored="true"/>
        <field name="published" type="boolean" indexed="true" stored="true"/>
        <field name="content" type="text" indexed="true" stored="true"/>
        <field name="all" type="text" indexed="true" stored="true" multiValued="true"/>

    Slide 15:Solr Documents. Solr accepts well-formed XML documents:
        <add>
          <doc>
            <field name="URL">www.cnn.com</field>
            <field name="title">CNN Breaking News – Obama wins</field>
            <field name="content">Barack Obama is the 44th president of the USA</field>
            <field name="pubDate">2008-11-06T23:59:59.999Z</field>
          </doc>
        </add>
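
A document in this shape can be produced and submitted programmatically. The Python sketch below builds the same kind of message with the standard library and defines, without calling it, a function that posts it to Solr's update handler. The function names are hypothetical, and the update URL would be something like the `http://localhost:90XX/solr/update` address used later in the demo.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(fields: dict[str, str]) -> bytes:
    """Serialize one document as an <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(update_url: str, payload: bytes) -> int:
    """POST the XML to a Solr update handler and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=payload,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = build_add_xml({
    "URL": "www.cnn.com",
    "title": "CNN Breaking News – Obama wins",
    "pubDate": "2008-11-06T23:59:59.999Z",
})
```

Building the XML through a library rather than string concatenation takes care of escaping characters such as `&` and `<` in crawled text. Added documents normally become searchable only after a commit is sent to the same handler.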

    Slide 16:Solr Features. Modify the ranking function by editing an XML file, or along with your HTTP query request. Supports query suggestions (autocomplete). Provides spelling corrections from the terms in the index. Supports wildcard queries (example: porch*, ferr*) and range queries (example: popularity:[10 TO *]). See http://wiki.apache.org/solr/SolrQuerySyntax
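
Wildcard and range queries contain characters such as `*`, `[`, and spaces that must be percent-encoded when sent over HTTP. The following Python sketch builds a request URL for Solr's `/select` handler. The helper name `solr_query_url` is hypothetical, and the base URL keeps the `90XX` placeholder from the demo rather than a real port.

```python
from urllib.parse import urlencode


def solr_query_url(base: str, query: str, rows: int = 10) -> str:
    """Build a /select URL with the query string percent-encoded."""
    return f"{base}/select?{urlencode({'q': query, 'rows': rows})}"


# "90XX" is a placeholder: substitute the port of your own Solr instance.
wildcard_url = solr_query_url("http://localhost:90XX/solr", "porch*")
range_url = solr_query_url("http://localhost:90XX/solr", "popularity:[10 TO *]")
```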

    Slide 17:YouSeer workflow. Waits for the crawled documents to be written. Iterates over the compressed files and processes the documents. Extracts the textual content of each document and parses its metadata. Generates an XML file as output; each custom extractor appends its result to this file. This XML file is submitted to the index.
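
The idea that each custom extractor appends its result to one output document can be sketched as a chain of functions. This is a Python illustration of the pattern only: YouSeer itself is Java, and the `Extractor` type, the two example extractors, and `build_doc` are hypothetical names rather than YouSeer's actual interfaces.

```python
import xml.etree.ElementTree as ET
from typing import Callable

# An extractor takes the document text and returns {field name: value}.
Extractor = Callable[[str], dict[str, str]]


def title_extractor(text: str) -> dict[str, str]:
    """Hypothetical extractor: use the first non-empty line as the title."""
    for line in text.splitlines():
        if line.strip():
            return {"title": line.strip()}
    return {}


def content_extractor(text: str) -> dict[str, str]:
    """Hypothetical extractor: index the whole text as the content field."""
    return {"content": text}


def build_doc(url: str, text: str, extractors: list[Extractor]) -> ET.Element:
    """Run every extractor in order and append its fields to one <doc>."""
    doc = ET.Element("doc")
    ET.SubElement(doc, "field", name="url").text = url
    for extract in extractors:
        for name, value in extract(text).items():
            ET.SubElement(doc, "field", name=name).text = value
    return doc


doc = build_doc("http://www.psu.edu", "Penn State\nWelcome.",
                [title_extractor, content_extractor])
```

Adding support for a new kind of metadata then means writing one more extractor and appending it to the list, without touching the others.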

    Slide 18:Document formats supported

    Slide 19:Demo: Configuration. The Solr schema is already configured in your installation. Solr is installed on Tomcat. The Heritrix web interface listens on port 91XX, where XX is your team number; e.g. for team 3, Heritrix listens on 9103.

    Slide 20:Demo. The server is ist441.ist.psu.edu. The Heritrix port is blocked by ITS, so you need SSH tunneling. For Linux/Mac users: ssh -L 91XX:localhost:91XX teamX@ist441 (where XX is your team number). Navigate to localhost:91XX to access Heritrix.
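
The port rule and the tunnel command can be combined in a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the login is assumed to be `team` followed by the unpadded team number, which the slides do not spell out.

```python
def heritrix_port(team: int) -> int:
    """Port 91XX, where XX is the two-digit team number (team 3 -> 9103)."""
    if not 0 <= team <= 99:
        raise ValueError("team number must fit in two digits")
    return 9100 + team


def tunnel_command(team: int, host: str = "ist441") -> list[str]:
    """Argument list for forwarding the team's Heritrix port over SSH."""
    port = heritrix_port(team)
    # Assumption: the account name is "team" + team number, e.g. team3.
    return ["ssh", "-L", f"{port}:localhost:{port}", f"team{team}@{host}"]


command = tunnel_command(3)
```

Joining the list with spaces gives the line to type in a terminal; passing it to `subprocess.run` would open the tunnel from a script.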

    Slide 21:Demo: Tunneling for Windows users. Install PuTTY from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html and double-click putty.exe. Enter the host name ist441.ist.psu.edu; the connection type is SSH. Navigate to Connection->SSH->Tunnels. Set the source port to 91XX and the destination to localhost:91XX, then press Add. Press Open. See http://www.ytechie.com/2008/05/set-up-a-windows-ssh-tunnel-in-10-minutes-or-less.html

    Slide 22:Demo: Putty http://www.ytechie.com/2008/05/set-up-a-windows-ssh-tunnel-in-10-minutes-or-less.html

    Slide 23:Demo: Putty

    Slide 24:Demo: Heritrix. After connecting to the server, go to ~/crawler/heritrix-1.14.3/bin and execute ./heritrix --admin=teamX:Password and then navigate to the Heritrix interface from your browser. Log in, and create your first job. See http://crawler.archive.org/articles/user_manual/tutorial.html

    Slide 25:Demo: Heritrix. The most important parameter is the user agent, under configurations: enter a valid URL (enter http://www.psu.edu) and your OWN email address. Do not run more than 5 threads. Do not crawl publishers’ websites.

    Slide 26:Demo: Heritrix. Heritrix log in screen

    Slide 27:Demo: Heritrix. Enter the Seed URLs

    Slide 28:Demo: Heritrix. Change the Agent URL

    Slide 29:Demo: Heritrix. If everything goes well, after you submit you should see this screen

    Slide 30:Demo. The directory ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs is where ARC files are written. To start Tomcat, enter start-tomcat; Solr will start automatically. The YouSeer ingestion module is located under ~/youseer/release; its SubmitterConfig.xml provides the file types you want to index, the database name, and the mapping to the index schema.

    Slide 31:Demo. To index documents crawled by Heritrix, navigate to ~/youseer/release and run YouSeer.jar with five arguments, in order: the Solr URL; the full path to the ARC files; the virtual directory which maps to the cached files; the number of threads (please keep it at 1); and the waiting time between retries. Run: java -jar YouSeer.jar http://localhost:90XX/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0
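
Because the five arguments are positional, it is easy to swap two of them by accident. The Python sketch below wraps the command from this slide in a function with named parameters. The function and parameter names are hypothetical, the paths are the placeholders from the slide, and the unit of the retry wait is not stated in the slides.

```python
def youseer_command(solr_update_url: str,
                    arc_dir: str,
                    cache_dir: str,
                    threads: int = 1,
                    retry_wait: int = 0) -> list[str]:
    """Argument list for the ingestion run, in the order the slide gives."""
    return ["java", "-jar", "YouSeer.jar",
            solr_update_url,    # Solr URL
            arc_dir,            # full path to the ARC files
            cache_dir,          # virtual directory that maps to the cached files
            str(threads),       # number of threads (the demo asks for 1)
            str(retry_wait)]    # waiting time between retries


command = youseer_command(
    "http://localhost:90XX/solr/update",   # replace XX with your team number
    "/absolute/path/to/arc/files",
    "/cachingDirectory",
)
```

To launch it from a script, the list could be passed to `subprocess.run` with the working directory set to the folder that contains YouSeer.jar.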

    Slide 32:Comments. YouSeer tracks which ARC files have been processed in a database whose default name is submitted.db; if you want to re-ingest the documents, remove the db file. Check the log file. The search interface: http://ist441.ist.psu.edu:90XX/youseerui

    Slide 33:Q & A

    Slide 34:References
        http://youseer.sourceforge.net/doc/Tutorial.pdf
        http://crawler.archive.org/articles/user_manual/
        https://webarchive.jira.com/wiki/download/attachments/5441/Mohr-et-al-2004.pdf
        http://www.packtpub.com/solr-1-4-enterprise-search-server?utm_source=http%3A%2F%2Flucene.apache.org%2Fsolr%2F&utm_medium=spons&utm_content=pod&utm_campaign=mdb_000275
        http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide?sc=AP
