The riddles of the Sphinx. Full-text engine anatomy atlas
Who are you? • Sphinx – FOSS full-text search engine • Good at playing ball • Good at not playing ball • Good at passing the ball to a team-mate • Good at many other “inferior” games • “faceted” search, geosearch, snippet extraction, multi-queries, IO throttling, and 10-20 other interesting directives
What are you here for? • What will not be covered? • No entry-level “what’s that Sphinx and what’s in it for me” overview • No long quotes from the documentation • No C++ architecture details • What will be? • How does it generally work inside • How things can be optimized • How things can be parallelized
Total workflow • Indexing first • Searching second • There are data sources (what to fetch, where from) • There are indexes • What data sources to index • How to process the incoming text • Where to put the results
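A minimal sphinx.conf sketch of that wiring; the source, index, table, and attribute names here are made up for illustration:
source src_products
{
    type            = mysql
    sql_host        = localhost
    sql_user        = test
    sql_pass        =
    sql_db          = shop
    # what to fetch, and where from
    sql_query       = SELECT id, vendor_id, price, title, body FROM products
    sql_attr_uint   = vendor_id
    sql_attr_float  = price
}

index products
{
    # what data sources to index
    source          = src_products
    # how to process the incoming text
    morphology      = stem_en
    charset_type    = utf-8
    # where to put the results
    path            = /var/data/sphinx/products
    docinfo         = extern
}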
How indexing works • In two acts, with an intermission • Phase 1 – collecting documents • Fetch the documents (loop over the sources) • Split the documents into words • Process the words (morphology, *fixes) • Replace the words with their wordid’s (CRC32/64) • Emit a number of temp files
How indexing works • Phase 2 – sorting hits • Hit (occurrence) is a (docid, wordid, wordpos) record • Input is a number of partially sorted (by wordid) hit lists • The incoming lists are merge-sorted • Output is essentially a single fully sorted hit list • Intermezzo • Collect and sort MVA values • Sort ordinals • Sort extern attributes
Dumb & dumber • The index format is… simple • Several sorted lists • Dictionary (the complete list of wordid’s) • Attributes (only if docinfo=extern) • Document lists (for each keyword) • Hit lists (for each keyword) • Everything is laid out linearly, good for IO
How searching works • For each local index • Build a list of candidates (documents that satisfy the full-text query) • Filter (the analogy is WHERE) • Rank (compute the documents’ relevance values) • Sort (the analogy is ORDER BY) • Group (the analogy is GROUP BY) • Merge the results from all the local indexes
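The same pipeline as seen from the PHP client; a sketch only, with made-up index and attribute names:
$client = new SphinxClient ();
$client->SetServer ( "localhost", 3312 );
$client->SetFilter ( "vendor_id", array ( 123, 124, 125 ) );  // filter (the WHERE analogy)
$client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );      // sort (the ORDER BY analogy)
$client->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR );        // group (the GROUP BY analogy)
// the full-text query builds the candidates list; filtering, ranking,
// sorting and grouping then run over those candidates
$result = $client->Query ( "laptop", "products" );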
1. Searching cost • Building the candidates list • 1 keyword = 1+ IO (document list) • Boolean operations on document lists • Cost is proportional (~) to the lists’ lengths • That is, to the sum of all the keyword frequencies • In case of phrase/proximity/etc. search, there also are operations on hit lists – approx. 2x IO/CPU • Bottom line – “The Who” are really bad
2. Filtering cost • docinfo=inline • Attributes are inlined in the document lists • ALL the values are duplicated MANY times! • Immediately accessible after disk read • docinfo=extern • Attributes are stored in a separate list (file) • Fully cached in RAM • Hashed by docid + binary search • Simple loop over all filters • Cost ~ number of candidates and filters
3. Ranking cost • Direct – depends on the ranker • Accounting for keyword positions – • Helps relevance • But costs extra resources – double impact! • Cost ~ number of results • Most expensive – phrase proximity + BM25 • Cheapest – none (weight=1) • Indirect – induced in the sorting
4. Sorting cost • Cost ~ number of results • Also depends on the sorting criteria (documents will be supplied in @id asc order) • Also depends on max_matches • The more the max, the worse the server feels • 1-10K is acceptable, 100K is way too much • 10-20 is not enough (makes little sense)
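Via the PHP API, per-query max_matches is the third argument of SetLimits(); a sketch, with the 1000 value being the kind of number the slide calls acceptable:
// page of 20 results, but let the server keep at most 1000 best matches in its sort buffer
$client->SetLimits ( 0, 20, 1000 );
$result = $client->Query ( "laptop", "products" );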
5. Grouping cost • Grouping is internally a kind of sorting • Cost affected by the number of results, too • Cost affected by max_matches, too • Additionally, the max_matches setting affects @count and @distinct precision
How to optimize queries • Partitioning the data • Choosing ranking vs. sorting mode • Filters vs. keywords • Filters vs. manual MTF • Multi queries • Last line of defense – Three Big Buttons
1. Partitioning the data • Swiss army knife, for different tasks • Bound by indexing time? • Partition, re-index the recent changes only • Bound by filtering? • Partition, search the needed indexes only • Bound by CPU/HDD? • Partition, move out to different cores/HDDs/boxes
1a. Partitioning vs. indexing • Vital to keep the balance right • Under-partition – and indexing will be slow • Over-partition – and searching will be slow • 1-10 indexes – work reasonably well • Some users are fine with 50+ (30+24...) • Some users are fine with 2000+ (!!!)
1b. Partitioning vs. filtering • Totally, 100% dependent on production query statistics • Analyze your very own production logs • Add comments if needed (3rd arg to Query()) • Justified only if the amount of processed data is going to decrease significantly • Move out last week’s documents – yes • Move out English-only documents – no (!)
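A sketch of that query-comment trick: the 3rd argument of Query() is a free-form string that ends up in the searchd query log, so production statistics can be split per feature (the index name and the comment text here are illustrative):
// tag this query in the query log as coming from the "recent only" search form
$result = $client->Query ( "laptop", "products_lastweek", "form=recent" );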
1c. Partitioning vs. CPU/HDD • Use a distributed index, explicitly map the chunks to physical devices • Point searchd “at itself”:
index dist1
{
    type  = distributed
    local = chunk01
    agent = localhost:3312:chunk02
    agent = localhost:3312:chunk03
    agent = localhost:3312:chunk04
}
1c. How to find CPU/HDD bottlenecks • Three standard tools • vmstat – what’s the CPU busy with? how busy is it? • oprofile – who exactly eats the CPU? • iostat – how busy is the HDD? • Also use logs, also use the searchd --iostats option • Normally everything is clear (us/sy/bi/bo…), but! • Caveat – the HDD might be iops-bound • Caveat – CPU load from Sphinx might be induced and “hidden” in sy
2. Ranking • Can now be very different (so-called rankers in extended2 mode) • Default ranker – phrase + BM25, accounts for keyword positions – not for free • Sometimes it’s ok to use a simpler ranker • Sometimes @weight is ignored entirely (searching for ipod, sorting by price) • Sometimes you can save on the ranker
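A sketch of trading ranking quality for speed with the PHP API, for the ipod-sorted-by-price case above (the index name is illustrative):
$client->SetMatchMode ( SPH_MATCH_EXTENDED2 );
// the default SPH_RANK_PROXIMITY_BM25 ranker analyzes keyword positions; since @weight
// is ignored here anyway, the trivial ranker (constant weight=1) skips that work entirely
$client->SetRankingMode ( SPH_RANK_NONE );
$client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );
$result = $client->Query ( "ipod", "products" );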
3. Filters vs. keywords • Well-known trick • When indexing, add a special, fake keyword to the document (_authorid123) • When searching, add it to the query • Obvious questions • What’s faster, what’s better? • Simple answer • Count the change before moving away from the cashier
3. Filters vs. keywords • Cost of searching ~ keyword frequencies • Cost of filtering ~ number of candidates • Searching – CPU+IO, filtering – CPU only • Fake keyword frequency = filter value selectivity • Frequent value + few candidates → bad! • Rare value + many candidates → good!
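Both variants as a sketch, following the slide’s _authorid123 example (attribute and index names are illustrative):
// variant A: attribute filter; CPU only, cost ~ number of full-text candidates
$client->SetFilter ( "author_id", array ( 123 ) );
$resultA = $client->Query ( "laptop", "posts" );

// variant B: fake keyword; CPU+IO, cost ~ how frequent _authorid123 is in the index
$client->ResetFilters ();
$resultB = $client->Query ( "laptop _authorid123", "posts" );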
4. Filters vs. manual MTF • Filters are looped over sequentially • In the order specified by the app! • Narrowest filter – better at the start • Widest filter – better at the end • Does not matter if you use fake keywords • Exercise to the reader – why?
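A sketch of manual filter ordering through the PHP API (attribute names and values are illustrative):
// narrowest filter first: it discards most candidates cheaply
$client->SetFilter ( "forum_id", array ( 1818 ) );
// widest filter last: it only loops over what is left
$client->SetFilterRange ( "author_id", 1, 100000 );
$result = $client->Query ( "xbox", "links" );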
5. Multi-queries • Any queries can be sent together in a batch • Always saves on the network roundtrip • Sometimes allows the optimizer to trigger • Especially important and frequent case – different sorting/grouping modes • 2x+ optimization for “faceted” searches
5. Multi-queries
$client = new SphinxClient ();
$q = "laptop"; // coming from website user
$client->SetSortMode ( SPH_SORT_EXTENDED, "@weight desc" );
$client->AddQuery ( $q, "products" );
$client->SetGroupBy ( "vendor_id", SPH_GROUPBY_ATTR );
$client->AddQuery ( $q, "products" );
$client->ResetGroupBy ();
$client->SetSortMode ( SPH_SORT_EXTENDED, "price asc" );
$client->SetLimits ( 0, 10 );
$client->AddQuery ( $q, "products" );
$result = $client->RunQueries ();
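RunQueries() returns one result set per AddQuery() call, in order: here $result[0] holds the relevance-sorted matches, $result[1] the per-vendor groups for the facet, and $result[2] the price-sorted slice; a query that fails reports its error inside its own entry rather than failing the whole batch.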
6. Three Big Buttons • If nothing else helps… • Cutoff (see SetLimits()) • Forcibly stops searching after the first N matches • Per-index, not overall • MaxQueryTime (see SetMaxQueryTime()) • Forcibly stops searching after M milliseconds • Per-index, not overall
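Both buttons through the PHP API, as a sketch (the 1000-match and 50 ms thresholds are arbitrary illustration values):
// keep paging as usual, but stop collecting after the first 1000 matches per index
// (the 4th SetLimits argument is the cutoff)
$client->SetLimits ( 0, 20, 1000, 1000 );
// and/or give up after 50 milliseconds per index
$client->SetMaxQueryTime ( 50 );
$result = $client->Query ( "laptop", "products" );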
6. Three Big Buttons • If nothing else helps… • Consulting • We can notice the unnoticed • We can implement the unimplemented
Combat mission • Got ~160M cross-links • Needed misc reports (by domains → group-by)
*************************** 1. row ***************************
        domain_id: 440682
          link_id: 15
         url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750
           url_to: http://xbox360achievements.org/content/view/101/114/
           anchor: NULL
     from_site_id: 9835
    from_forum_id: 1818
   from_author_id: 282
  from_message_id: 2586
message_published: 2006-09-30 00:00:00
...
Tackling – one • Partitioned the data • 8 boxes, 4x CPU, ~5M links per CPU • Used Sphinx • In theory, we could have used MySQL • In practice, way too complicated • Would have resulted in 15-20M+ rows/CPU • Would have resulted in “manual” aggregation code
Tackling – two • Extracted “interesting parts” of the URL when indexing, using a UDF • Replaced the SELECT with a full-text query
*************************** 1. row ***************************
            url_from: http://www.insidegamer.nl/forum/viewtopic.php?t=40750
  urlize(url_from,0): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php insidegamer$nl$forum$viewtopic.php$t=40750
  urlize(url_from,1): www$insidegamer$nl insidegamer$nl insidegamer$nl$forum insidegamer$nl$forum$viewtopic.php
Tackling – three • 64 indexes • 4 searchd instances per box, by CPU/HDD count • 2 indexes (main+delta) per CPU • All searched in parallel • Web box queries the main instance at each box • Main instance queries itself and the other 3 copies • Using 4 instances, because of startup/update • Using plain HDDs, because of IO stepping
Results • The precision is acceptable • “Rare” domains – precise results • “Frequent” domains – precision within 0.5% • Average query time – 0.125 sec • 90% queries – under 0.227 sec • 95% queries – under 0.352 sec • 99% queries – under 2.888 sec