Efficient full-text search in databases

Efficient full-text search in databases Andrew Aksyonoff, Peter Zaitsev Percona Ltd. shodan (at) shodan.ru

Search in databases? • Databases are continually growing • “everyone” has got 1M records • 10-100M record databases are not that rare • 1B+ record databases which require full-text search do exist (most prominent example is Google) • Open-source DBMS are widely used • We will talk about MySQL • “The word on the street” is that other DBMSes have similar problems • Unfortunately, built-in solutions are not good enough for full-text search • And especially so, if there is something beyond “just” full-text search required…

Types of special requirements • “Just”search is a key requirement, but… • Amazing, but it happens rather rarely (in DBMS world) • Rather a Web-search engine task • Additional sorting is frequently required • On a value different from relevance – for instance, on product price • Additional filtering is frequently required • For instance, by product category, or posting author ID • Match grouping is frequently required • For instance, by date, or by data source (eg. site) ID • What do built-in solutions offer?

Built-in MySQL FTS • Pro – built-in, updates “instantly” • Con – scales poorly • Con – ignores word positions • This causes ranking issues • This causes phrase search to be slow • Con – only1 FT index per query (columns…) • Con – does not interoperate with other indexes • I.e.WHERE, ORDER/GROUP BY, LIMITclauses would be handled separately and “manually” • Conclusion – it is often unacceptable

External engines shootout • We tested a number of well-known (to us) open-sourcesolutions • Let the vendors advertise commercial solutions themselves  • MySQL FTS • mnoGoSearch, http://mnogosearch.org/ • Designed for Web, but can do databases too (htdb) • Lucene, http://lucene.apache.org/ • PopularJava full-text searchlibrary • Sphinx, http://sphinxsearch.com/ • Designed for full-text search in databases from day one

Benchmarking results • ~3.5Mrecords, ~5 GBtext (from Wikipedia) • mnoGoSearchdropped out of a race • more details in EuroOscon‘2006 talk by Peter Zaitsev

Existing solutions • mnoGoSearch • Con – indexing and searching time issues • FATAL – did not complete indexing 5 GBin 24hours • Lucene • Pro – “instant” index updates • Pro – wildcard, fuzzysearches • Con – integration cost (this is Java library) • Con – filtering implementation (searching speed) • Con – no support for grouping • Sphinx • Con – “monolithic”indexes • Pro – everything else 

Sphinx – overview • External solution for database search • Two principal programs • Indexer, used for re-indexing FTindexes • Searchd,search daemon • Easy integration • Built-in support for MySQL, PostgreSQL • ProvidesAPIsfor PHP, Python, Perl, Ruby,etc • Provides MySQL Storage Engine • High speed • Indexing speed – 4-10 MB/sec • Searching speed– avg 20-30 ms/q @ 5 GB, 3.5M docs

Sphinx – ideology • Indexes locally available databases • “A-la SQL” document structure supported from day one • Up to 256 full-text fields • Any amount of attributes (integer/timestamp/etc) • “Fast re-indexing instead of slow searching” • Non-updateable index format – was initially chosen to maximize searching speed • But then it turned out – that re-indexing is very fast, too • In case of partial updates – we can still use re-indexing“partial”(delta) indexes once per Nminutes

Sphinx – searching • Quality • Always accounts for word positions, not just frequencies • Scalability • Up to 50-100 GB per 1 CPU • Supports distributed searches • Distributed indexes are fully transparent to client application • Examples • Boardreader.com –500M+ records, 550+GBtext, 12CPU cluster • Mininova.org – not manyrecords (less than 1M), but 2-3Msearches per day

Sphinx –advanced features • Sorting • On any attribute combination, SQL-like syntax • Filtering matches with a condition • Performed at earliest possible searching stage – for speed • Attributes are always either kept in RAM, or copied multiple times all over the index in required order – for speed • Fun fact – sometimes full scan of all matches and filtering those on Sphinx side are times faster than corresponding MySQL SELECT query – and are used in production instead…

Sphinx –advanced features • Grouping • On any attribute • Performed in fixed RAM • Performed approximately (!) • Performed quite efficiently (compared to MySQLetc) • Query words highlighting • Special service, which needs document bodies and the query passed to it • MySQL Storage Engine • Can be used for especially complex queries on MySQL side which can not be run fully on Sphinx side • Can be used to simplify integration

Conclusions • Large and very large databases require external solutions for full-text search • There is a number of requirements to such solutions beyond “just” searching (filtering,grouping,etc) • There is a number of open-sourcesolutions with different degrees of matching these requirements • For most tasks, try Sphinx, http://sphinxsearch.com/

Efficient full-text search in databases