Building a scalable distributed WWW search engine … NOT in Perl! Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org) V1.0 27/07/05
Contents • History • Goals • Architecture • Implementation • Why not Perl? • Conclusions • Credits • Recommended reading
History (of my work in the area of information retrieval) • First primitive, pathetic stone-age search engine: 1,000 documents in the “index” (1997, Perl) • Second engine using proper inverted indexing for Jungle.com: 500,000 products indexed (Perl + Java, 2002) • Current: 50,000,000 pages indexed with a lot more to go (to be revealed, 2005)
Goals • Build a distributed WWW search engine capable of dealing with at least 1 billion web pages, based on the principles of SETI@Home and D.NET • See that the chosen implementation language (more on this later) fits the purpose – or, more likely, learn how to make it work • Eventually make some money out of it
Architecture • Data collection (crawling) • Indexing: turning text into numbers • Merging: turning indexed barrels into single searchable index • Searching: locating documents for given keywords
Data collection (crawling) • Base – issues URLs to crawl and receives compressed pages; in the future it will also coordinate distributed indexing • Distributed crawlers – receive lists of URLs to crawl, crawl them and send back compressed data Note: this stage is optional if you already have data to index, e.g. a list of products with their descriptions
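To make the base/crawler split concrete, here is a minimal Python sketch of one crawler cycle (the real project is ~90k lines of C#; this is purely illustrative). The fetch function is injected so the sketch needs no real network; a real crawler would use HTTP with politeness delays and robots.txt handling.

```python
# Sketch of one crawler cycle: the base issues a list of URLs, the
# crawler fetches each page and sends the whole batch back compressed.
import json
import zlib

def crawl_batch(urls, fetch):
    """Fetch every URL in the batch and return a compressed payload."""
    pages = {url: fetch(url) for url in urls}
    # Compress the batch before uploading it to the base node.
    return zlib.compress(json.dumps(pages).encode("utf-8"))

# Stand-in fetcher for the sketch (no real network involved).
fake_fetch = lambda url: f"<html>page at {url}</html>"

payload = crawl_batch(
    ["http://example.com/a", "http://example.com/b"], fake_fetch
)
# `payload` is what a crawler node would send back to the base.
```

Compressing whole batches rather than single pages is what keeps the upload bandwidth of volunteer crawlers manageable.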
Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
Indexing Indexing is the process of turning words into numbers and creating an inverted index. Documents: Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Lexicon (maps words to their numeric WordIDs): Birmingham – 0, Perl – 1, Mongers – 2, City – 3 Inverted index (each WordID has a list of, ideally sorted, DocIDs): 0 -> 0, 1 1 -> 0, 2 2 -> 0 3 -> 1, 2 Note: if you use a database then it makes sense to have a clustered index on WordID
Merging Individual indexed barrels -> single searchable index Note: this stage is not necessary if just one barrel is used, as there is no need to remap IDs from local to their global equivalents.
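A sketch of why merging needs the ID remap: each barrel was indexed with its own local lexicon, so the same word can have different WordIDs in different barrels. The barrel data below is hypothetical and the DocIDs are assumed to be already global; this is a Python illustration, not the project's C# code.

```python
# Merge per-barrel indexes into one global lexicon and index,
# remapping each barrel's local WordIDs to global WordIDs.
barrels = [
    # (local lexicon, local inverted index) per barrel -- hypothetical data
    ({"Birmingham": 0, "Perl": 1}, {0: [0, 1], 1: [0]}),
    ({"Perl": 0, "City": 1},       {0: [2],    1: [1, 2]}),
]

global_lexicon = {}
global_index = {}

for local_lexicon, local_index in barrels:
    for word, local_id in local_lexicon.items():
        # Same word in two barrels maps to one global WordID.
        global_id = global_lexicon.setdefault(word, len(global_lexicon))
        merged = global_index.setdefault(global_id, [])
        merged.extend(local_index[local_id])
        merged.sort()   # keep posting lists sorted after merging

print(global_lexicon)   # {'Birmingham': 0, 'Perl': 1, 'City': 2}
print(global_index)     # {0: [0, 1], 1: [0, 2], 2: [1, 2]}
```

Note how "Perl" is WordID 1 in one barrel and 0 in the other, yet ends up with a single global WordID – that remap is exactly what a single-barrel setup avoids.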
Searching Searching is the process of finding documents that contain the words from a search query. Search query: “Birmingham Perl” -> WordIDs: 0, 1 Lexicon (maps words to their numeric WordIDs): Birmingham – 0, Perl – 1, Mongers – 2, City – 3 Inverted index (lists DocIDs for each WordID): 0 -> 0, 1 1 -> 0, 2 2 -> 0 3 -> 1, 2 Result: the intersection of the DocIDs present in both lists (implementing boolean AND logic) – Doc #0 Documents: Doc #0: Birmingham Perl Mongers Doc #1: Birmingham City Doc #2: Perl City Note: if you use a database then it makes sense to cluster on WordID
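The search step can be sketched as follows (a Python illustration of the boolean-AND lookup, not the project's C# code), reusing the lexicon and inverted index from the indexing slide:

```python
# Boolean-AND search: map each query word to its WordID, fetch its
# posting list, and intersect the sorted DocID lists.
lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
inverted_index = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}

def search(query):
    """Return the DocIDs containing every word of the query."""
    postings = [inverted_index[lexicon[w]] for w in query.split()]
    result = postings[0]
    for plist in postings[1:]:
        # Intersection of sorted lists = boolean AND over documents.
        result = [doc_id for doc_id in result if doc_id in plist]
    return result

print(search("Birmingham Perl"))   # [0] -- only Doc #0 contains both words
```

With sorted posting lists a production engine would use a linear merge (or skip lists) instead of the membership test above, but the AND semantics are the same.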
Implementation • Microsoft .NET C# ported to Linux using Mono (http://www.mono-project.com) • ~90k lines of code (minimal copy/paste) written from scratch • Low level of dependencies (SharpZipLib/SQLite/NPlot)
Why not Perl? (using C# instead) • Not strong in the GUI department • Hard to deal with multi-threading and asynchronous sockets • OOP is more of a hack • Lax compile-time checks due to not being strictly typed • Fear of performance bottlenecks forcing a move to C++ • Hard to profile for performance analysis • Managed memory lacks support for pointers (?) • Poor exception handling • I wanted something new :)
Conclusions • Still work in progress, but some conclusions can be made already: • The inverted indexing approach helps achieve fast searches • It's tough to build one – don't try if you ain't going to see it through! • The crawler is one tough piece of code – 6 months, vs 2 months spent on searching • .NET C# is a decent language suitable for heavy-duty tasks like this
Credits • R&D: Alex Chudnovsky <alexc@majestic12.co.uk> • Pioneers*: FiddleAbout, dazza12, lazytom, Mordac, linuxbren, Cyber911, www.vanginkel.info, Vari, ASB, SEOBy.org, arni, japonicus, webstek.info | Pimpel, DimPrawn, Zyron, partys-bei-uns.de, jake, bull at webmasterworld, nada, dodgy4, sri-heinz * Volunteers running the crawler who had crawled at least 1 mln URLs as of 27/07/05
Recommended reading • “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page of Google (http://www-db.stanford.edu/~backrub/google.html) • “Managing Gigabytes”, Ian H. Witten et al., ISBN 1-55860-570-3
Join! • Join the project (unmetered broadband required!): majestic12.co.uk Your name could be here!