1 / 13

Building a Vertical Search Site

Explore the architecture of a vertical search site using Apache software like Httpd, Lucene, Nutch, Solr, Xerces, and more. Discover the lessons learned and insights shared by Ken Krugler, the CTO/co-founder of Krugle.

schaffer
Download Presentation

Building a Vertical Search Site

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Building a VerticalSearch Site (using lots of Apache software, of course)

  2. Just the Facts, Ma’am • Ken Krugler - CTO/co-founder of Krugle • We use lots of Apache S/W at krugle.org • Httpd, Lucene, Nutch, Solr, Xerces, etc. • I’ll describe our architecture • And the sometimes painful lessons learned

  3. Three Faces of Krugle • Free public site - http://www.krugle.org • Partner sites • http://sourceforge.krugle.com • http://developerworks.krugle.com • http://aws.krugle.com • Enterprise appliance

  4. Krugle.org free public site • Search code, projects, & technical web pages • 150,000 projects • 2.5billion lines of code • 40million web pages

  5. Krugle.org Architecture (web) • Web tier runs Apache • Also mod_perl • “glue” for Javascript to backend RESTful API • Partner APIs • “Dirty” side of system

  6. Krugle.org Architecture (API) • API server uses Resin • Webapps provide RESTful API services • Filer is big disk array • LightTPD, NFS • Searchers run Hadoop, Lucene

  7. Krugle.org Architecture (CPI) • Page crawl uses Nutch • Code crawl uses bits of Nutch, custom stuff • Fuzzy parsers created using ANTLR • Project data in MySQL, pushed to Solr • Code index is Lucene

  8. Krugle partner sites • IBM developerWorks • Sourceforge.net • Amazon Web Services • Yahoo! Dev Network • Collabnet

  9. Krugle Architecture (partners) • Higher level API • Wraps RESTful API • Handled in web tier • Big chunks of Perl • LightTPD cache

  10. Krugle enterprise server • Krugle inside firewall • Talks to major SCMs • SCM Comment search • Includes public site info

  11. Krugle Architecture (enterprise) • Collapses web tier, API server, code searchers, filer, and DB server • Separate admin system (DB, GUI, code crawler, configuration) as Jetty-hosted webapp

  12. RESTful API • HTTP requests, XML responses • Works well with Perl middleware • Some load/memory issues • Solr integration challenges • Integration test challenges

  13. Key Lessons • If it isn’t broke, don’t upgrade • There’s always a newer version • That includes the build system • Be prepared to pay for free software • Motivating project contributors to do things • Moderation in architectural abstraction • There’s always a higher and lower option

More Related