Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org) V1.0 27/07/05
Contents • History • Goals • Architecture • Implementation • Why not Perl? • Conclusions • Credits • Recommended reading
History (of my work in the area of information retrieval) • First primitive, pathetic stone-age search engine: 1,000 documents in the “index” (1997, Perl) • Second engine using proper inverted indexing for Jungle.com: 500,000 products indexed (Perl + Java, 2002) • Current: 50,000,000 pages indexed with a lot more to go (to be revealed, 2005)
Goals • Build a distributed WWW search engine capable of dealing with at least 1 billion web pages, based on the principles of SETI@Home and D.NET • See that the chosen implementation language (more on this later) fits the purpose – or, more likely, learn how to make it work • Eventually make some money out of it
Architecture • Data collection (crawling) • Indexing: turning text into numbers • Merging: turning indexed barrels into single searchable index • Searching: locating documents for given keywords
Data collection (crawling) • Base – issues URLs to crawl and receives compressed pages; in the future it will also coordinate distributed indexing • Distributed crawlers – receive lists of URLs to crawl, crawl them and send back compressed data Note: this stage is optional if you already have data to index, e.g. a list of products with their descriptions
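To make the base/crawler split concrete, here is a minimal Python sketch of one crawler cycle (the real project is ~90k lines of C#; this is purely illustrative). The fetch function is injected so the sketch needs no real network; a real crawler would use HTTP with politeness delays and robots.txt handling.

```python
# Sketch of one crawler cycle: the base issues a list of URLs, the
# crawler fetches each page and sends the whole batch back compressed.
import json
import zlib

def crawl_batch(urls, fetch):
    """Fetch every URL in the batch and return a compressed payload."""
    pages = {url: fetch(url) for url in urls}
    # Compress the batch before uploading it to the base node.
    return zlib.compress(json.dumps(pages).encode("utf-8"))

# Stand-in fetcher for the sketch (no real network involved).
fake_fetch = lambda url: f"<html>page at {url}</html>"

payload = crawl_batch(
    ["http://example.com/a", "http://example.com/b"], fake_fetch
)
# `payload` is what a crawler node would send back to the base.
```

Compressing whole batches rather than single pages is what keeps the upload bandwidth of volunteer crawlers manageable.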
Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
Indexing Indexing is the process of turning words into numbers and creating an inverted index. Documents: Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs): Birmingham – 0, Perl – 1, Mongers – 2, City – 3 Inverted index (each WordID has a list of, ideally sorted, DocIDs): 0 -> 0, 1 1 -> 0, 2 2 -> 0 3 -> 1, 2 Note: if you use a database then it makes sense to have a clustered index on WordID
Merging Individual indexed barrels -> single searchable index Note: this stage is not necessary if just one barrel is used, as there is no need to remap IDs from local to their global equivalents.
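A sketch of why merging needs the ID remap: each barrel was indexed with its own local lexicon, so the same word can have different WordIDs in different barrels. The barrel data below is hypothetical and the DocIDs are assumed to be already global; this is a Python illustration, not the project's C# code.

```python
# Merge per-barrel indexes into one global lexicon and index,
# remapping each barrel's local WordIDs to global WordIDs.
barrels = [
    # (local lexicon, local inverted index) per barrel -- hypothetical data
    ({"Birmingham": 0, "Perl": 1}, {0: [0, 1], 1: [0]}),
    ({"Perl": 0, "City": 1},       {0: [2],    1: [1, 2]}),
]

global_lexicon = {}
global_index = {}

for local_lexicon, local_index in barrels:
    for word, local_id in local_lexicon.items():
        # Same word in two barrels maps to one global WordID.
        global_id = global_lexicon.setdefault(word, len(global_lexicon))
        merged = global_index.setdefault(global_id, [])
        merged.extend(local_index[local_id])
        merged.sort()   # keep posting lists sorted after merging

print(global_lexicon)   # {'Birmingham': 0, 'Perl': 1, 'City': 2}
print(global_index)     # {0: [0, 1], 1: [0, 2], 2: [1, 2]}
```

Note how "Perl" is WordID 1 in one barrel and 0 in the other, yet ends up with a single global WordID – that remap is exactly what a single-barrel setup avoids.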
Searching Searching is the process of finding documents that contain the words from a search query. Search query: “Birmingham Perl” -> WordIDs: 0, 1 Lexicon (maps words to their numeric WordIDs): Birmingham – 0, Perl – 1, Mongers – 2, City – 3 Inverted index (lists DocIDs for each WordID): 0 -> 0, 1 1 -> 0, 2 2 -> 0 3 -> 1, 2 Result: the intersection of the DocIDs present in both lists (implementing boolean AND logic) – Doc #0 Documents: Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Note: if you use a database then it makes sense to cluster on WordID
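The search step can be sketched as follows (a Python illustration of the boolean-AND lookup, not the project's C# code), reusing the lexicon and inverted index from the indexing slide:

```python
# Boolean-AND search: map each query word to its WordID, fetch its
# posting list, and intersect the sorted DocID lists.
lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
inverted_index = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}

def search(query):
    """Return the DocIDs containing every word of the query."""
    postings = [inverted_index[lexicon[w]] for w in query.split()]
    result = postings[0]
    for plist in postings[1:]:
        # Intersection of sorted lists = boolean AND over documents.
        result = [doc_id for doc_id in result if doc_id in plist]
    return result

print(search("Birmingham Perl"))   # [0] -- only Doc #0 contains both words
```

With sorted posting lists a production engine would use a linear merge (or skip lists) instead of the membership test above, but the AND semantics are the same.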
Implementation • Microsoft .NET C# ported to Linux using Mono (http://www.mono-project.com) • ~90k lines of code (minimal copy/paste) written from scratch • Low level of dependencies (SharpZipLib/SQLite/NPlot)
Why not Perl? (using C# instead) • Not strong in the GUI department • Hard to deal with multi-threading and asynchronous sockets • OOP is more of a hack • Lax compile-time checks due to not being strictly typed • Fear of performance bottlenecks forcing a move to C++ • Hard to profile for performance analysis • Managed memory lacks support for pointers (?) • Poor exception handling • I wanted something new :)
Conclusions • Still work in progress, but some conclusions can be made already: • The inverted indexing approach helps achieve fast searches • It's tough to build one – don't try if you ain't going to see it through! • The crawler is one tough piece of code – 6 months, vs 2 months spent on searching • .NET C# is a decent language suitable for heavy-duty tasks like this
Credits • R&D: Alex Chudnovsky <alexc@majestic12.co.uk> • Pioneers*: FiddleAbout, dazza12, lazytom, Mordac, linuxbren, Cyber911, www.vanginkel.info, Vari, ASB, SEOBy.org, arni, japonicus, webstek.info | Pimpel, DimPrawn, Zyron, partys-bei-uns.de, jake, bull at webmasterworld, nada, dodgy4, sri-heinz * Volunteers running the crawler who had crawled at least 1 mln URLs as of 27/07/05
Recommended reading • “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page of Google (http://www-db.stanford.edu/~backrub/google.html) • “Managing Gigabytes”, Ian H. Witten et al., ISBN 1-55860-570-3
Join! • Join the project (unmetered broadband required!): majestic12.co.uk Your name could be here!