Migrating a Real-time News Classification Engine to Luwak/Lucene

Berlin Buzzwords 2017 • June 12, 2017 • Marvin Justice, Software Developer, News Classification • mjustice3@bloomberg.net

Outline • Rules based news classification at Bloomberg • Legacy engine and the OTL query language • Luwak • OTL Luwak • Query parser • Tokenizers • Custom queries and scoring • Integration into news pipeline • OTL Solr • Current status and performance

News at Bloomberg • Bloomberg Professional Service (aka “the Terminal”) • Almost unlimited number of individual “functions” • Several of the most used are news related: TOP, CN, NSE, etc. • Bloomberg ingests >1M news stories daily • Solr collections containing ~500M stories • 10M client searches/day • Useful to tag stories (AAPL, GLD, MNA, …) for search purposes • “Building a real-time news search engine,” R. Aiyengar, Berlin Buzzwords 2016

Rules based news classification • Part of the news ingest pipeline • Avg. ~15 stories/sec in live mode, spikes of 100s of stories/sec • Stories processed (serial and parallel steps) and indexed to Solr • 100Ks of OTL rules in 24 languages • Templated or hand crafted • Maintained by Content Indexing Team (domain experts) • Each story passed through entire rule set as it’s being ingested • Sharding and lots of hardware help, but a more sophisticated approach is needed

Legacy Classification Engine • Provided by Verity (acquired by Autonomy in 2006) • In use at Bloomberg for ~10 years • Classification engine provided as 32 bit native library • Rules are written in the OTL query language • Quite fast, mean latency of ~25ms/story • Max latency not so good • Verity product has been EOL’d by current owner HP

Outline Topic Language • Term and Phrase leaves • Also <PHRASE> operator • Basic wildcards plus charset regexes • Boolean operators: <AND>, <OR>, <NOT> • Proximity queries: <NEAR/n>, <STARTSZONE/n> • Frequency queries: <ATLEAST/n>, <COUNT/n> • Miscellaneous: <SOUNDEX>, <THESAURUS>, etc. (not used at Bloomberg) • Absolute scoring system with scoring operators: <ANY>, <MANY>, etc. • Several OTL operators have no direct Lucene analog

OTL Example
* 0.75 <In>
/zonespec = "`headline`"
** <Phrase>
*** <Any>
**** "sram"
**** "dram"
**** "nvram"
*** <Any>
**** "memory"
**** "chip"
q = headline:("sram memory" OR "dram memory" OR "nvram memory" OR "sram chip" OR "dram chip" OR "nvram chip")^0.75
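The equivalence between the OTL tree and the expanded query in the example above can be sketched in a few lines of plain Java. This is only an illustration of the cross-product expansion, assuming a `<Phrase>` whose children are all `<Any>` groups of single terms; the class and method names are hypothetical and are not part of the OTL Luwak code base, which builds Lucene Query objects directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OTL Luwak or Lucene.
final class PhraseAnyExpansion {

    // Expands a <Phrase> over <Any> groups into every concrete phrase,
    // taking one alternative from each group, in group order.
    static List<String> expand(List<List<String>> anyGroups) {
        List<String> phrases = new ArrayList<>();
        phrases.add("");
        for (List<String> group : anyGroups) {
            List<String> next = new ArrayList<>();
            for (String prefix : phrases) {
                for (String term : group) {
                    next.add(prefix.isEmpty() ? term : prefix + " " + term);
                }
            }
            phrases = next;
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String> phrases = expand(List.of(
                List.of("sram", "dram", "nvram"),
                List.of("memory", "chip")));
        // Six phrases: 3 alternatives x 2 alternatives.
        System.out.println(phrases);
    }
}
```

Three alternatives in the first group and two in the second give the six phrases listed in the `q = headline:(...)` form, although in a different order.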

Project Requirements • Goal: replace legacy engine • Minimal burden for Content Indexers • Replacement must interpret OTL input • Replacement needs to replicate existing scoring • Performance (latency, load) must be as good as or better than legacy • Hardware not a limiting factor (to a point) • Legacy system’s mean latency is 25ms • Primary build is the load test • Provide a way for Content Indexers to test new rules • Legacy has an authoring environment; no need to replace it in the first pass • Do need a way to generate custom collections and run rules against them

Luwak • “Simply put, it allows you to define a set of search queries and then monitor a stream of documents for any that might match these queries” • Alan Woodward & Flax • http://www.flax.co.uk/ • https://berlinbuzzwords.de/14/session/turning-search-upside-down-search-queries-documents • Based on “query index” and “presearcher” • Stories are turned into Lucene queries and run against the query index to quickly (<10ms) weed out rules that have no chance of matching • ParallelMatcher for queries that survive preselection • Already used by Bloomberg to provide news alerts • D. Collins https://www.youtube.com/watch?v=IOL-6ns7M8k
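The presearcher idea can be illustrated without Luwak itself. The following is a deliberately simplified plain-Java sketch, not Luwak's API: it assumes each rule is registered under a set of terms, at least one of which must occur in any story the rule can match, and it uses whitespace tokenization. The class name, method names, and rule ids are all hypothetical.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of preselection; not the Luwak Presearcher API.
final class PresearcherSketch {

    // "Query index": term -> ids of rules that mention the term.
    private final Map<String, Set<String>> ruleIdsByTerm = new HashMap<>();

    // Registers a rule under terms, at least one of which must appear
    // in a story for the rule to have any chance of matching.
    void addRule(String ruleId, Collection<String> terms) {
        for (String term : terms) {
            ruleIdsByTerm.computeIfAbsent(term, t -> new TreeSet<>()).add(ruleId);
        }
    }

    // Turns the story into a bag of terms and looks each one up,
    // returning only the rules worth running in full.
    Set<String> candidates(String story) {
        Set<String> result = new TreeSet<>();
        for (String token : story.toLowerCase(Locale.ROOT).split("\\s+")) {
            Set<String> ids = ruleIdsByTerm.get(token);
            if (ids != null) {
                result.addAll(ids);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PresearcherSketch presearcher = new PresearcherSketch();
        presearcher.addRule("MEMORY_CHIPS", List.of("sram", "dram", "nvram"));
        presearcher.addRule("GOLD", List.of("gold", "bullion"));
        System.out.println(presearcher.candidates("Samsung raises DRAM prices"));
    }
}
```

Only the surviving candidates would then be handed to the full matcher, which is what keeps the per-story cost manageable with 100Ks of rules.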

OTL Luwak • OTL parser • Analysis chains (tokenizers, filters, etc.) • Custom queries and scoring • Luwak extensions • Integration into ingest pipeline • Hire Flax to get started!

OTL Parser • ANTLR 4 parser generator • grammar file => lexer, parser, listener • QueryBuilder class does a top-down walk of parse tree • No intermediate text representation: OTL is translated directly into a Lucene Query object

Analysis Chains • Legacy system (non-CJ) • Whitespace tokenize queries • Document tokenizer similar to Lucene’s StandardTokenizer but not exact match • Customizable via config file, e.g., @handle need not be split • Token filtering similar to LowerCaseFilter, ASCIIFoldingFilter • OTL Luwak (non-CJ) • Whitespace tokenize queries • Fork StandardTokenizer => BBStandardTokenizer • Separate fork for BBKoreanTokenizer • LowerCaseFilter, ASCIIFoldingFilter (with exclusions) • Handful of custom filters (especially Korean)
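To make the query-side chain concrete, here is a rough plain-Java approximation of "whitespace tokenize, lowercase, fold to ASCII with exclusions". It is a sketch under assumptions, not the Lucene filters themselves: folding is approximated by Unicode NFD decomposition plus removal of combining marks, which covers accented Latin letters but far less than ASCIIFoldingFilter, and the exclusions are modeled as a set of characters to leave untouched, since the slide does not say what form they take. All names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical approximation of the chain; not Lucene's analysis API.
final class AnalysisChainSketch {

    // Whitespace tokenize, lowercase, then fold each token.
    static List<String> analyze(String text, Set<Character> keepUnfolded) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(fold(raw.toLowerCase(Locale.ROOT), keepUnfolded));
        }
        return tokens;
    }

    // Strips diacritics character by character (BMP, precomposed input
    // assumed), skipping any character listed in keepUnfolded.
    static String fold(String token, Set<Character> keepUnfolded) {
        StringBuilder out = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (keepUnfolded.contains(c)) {
                out.append(c);
                continue;
            }
            String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            out.append(decomposed.replaceAll("\\p{M}+", ""));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(analyze("Soci\u00e9t\u00e9 G\u00e9n\u00e9rale @Handle", Set.of()));
        System.out.println(analyze("M\u00fcnchen caf\u00e9", Set.of('\u00fc')));
    }
}
```

Because the tokenizer only splits on whitespace, a token such as `@handle` passes through intact, which mirrors the "need not be split" behaviour mentioned for the legacy configuration.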

Chinese/Japanese Tokenizer • For Chinese/Japanese, legacy uses a dictionary tokenizer from Basis Tech • Tried to obtain a Solr-pluggable version, to no avail • Lucene’s ICUTokenizer comes closest (also dictionary based) • ICU trac #11996, #11999 had to be fixed first • Retokenize OTL queries according to ICU dictionary • Big improvement but still not good enough • Customize ICU dictionary? • Can “fix” individual rules • Still an open question how well we can reproduce legacy behavior • Ultimate fallback is for Content Indexers to adjust CJ rules

Custom Queries and Scoring • Operators w/o Lucene analog • AtLeastQuery, SpanAtLeastQuery, SpanMinMatchQuery, SpanStartsQuery, … • Scoring operators • DisjunctionAccrueQuery, ConjunctionMinQuery, … • Tweak Lucene’s DisjunctionScorer.freq():
    for (DisiWrapper w = subMatches.next; w != null; w = w.next) {
-     freq += 1;
+     freq += w.scorer.freq();
    }
• Tweak Lucene NearSpans slop • Modify sort order in Lucene’s SpanPositionQueue • LUCENE-7398 “Nested Span Queries are Buggy” • Other minor tweaks
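The effect of the DisjunctionScorer.freq() tweak can be seen on a small example. The sketch below is plain Java, not Lucene code: it assumes the per-clause frequencies for the current document are already available as an array, and the class and method names are hypothetical.

```java
// Hypothetical illustration of the freq() change; not Lucene's DisjunctionScorer.
final class DisjunctionFreqSketch {

    // Behaviour before the tweak: every matching sub-clause counts once.
    static int stockFreq(int[] subClauseFreqs) {
        int freq = 0;
        for (int f : subClauseFreqs) {
            if (f > 0) {
                freq += 1;
            }
        }
        return freq;
    }

    // Behaviour after the tweak: sum the frequencies of the matching sub-clauses.
    static int tweakedFreq(int[] subClauseFreqs) {
        int freq = 0;
        for (int f : subClauseFreqs) {
            if (f > 0) {
                freq += f;
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // Three clauses match, occurring 3, 1 and 2 times; a fourth does not match.
        int[] freqs = {3, 1, 2, 0};
        System.out.println(stockFreq(freqs));   // prints 3
        System.out.println(tweakedFreq(freqs)); // prints 6
    }
}
```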

[Figure: SpanFirstQuery vs. SpanStartsQuery on a multi-valued “attachment” field (Attachment 1, Attachment 2). With SpanFirstQuery, “luwak” appears in both attachments and a single match (i) is marked. With SpanStartsQuery, a START token precedes each attachment and two matches (i1, i2) are marked. OTL rule shown: * <STARTSZONE/5> /zonespec = “`attachment`” ** “luwak”]
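One reading of the SpanFirstQuery / SpanStartsQuery comparison above is that a "first n positions" check over a multi-valued field only sees the start of the first value, whereas the custom query should see the start of every value. The plain-Java sketch below illustrates that reading only. It assumes `<STARTSZONE/n>` means "within the first n tokens of a value", which the slides do not spell out, and it does not use Lucene's span API; all names and sample tokens are hypothetical.

```java
import java.util.List;

// Hypothetical illustration; not Lucene's SpanFirstQuery or the custom SpanStartsQuery.
final class StartsZoneSketch {

    // Field-level check: the term must occur in the first n positions of the
    // concatenated values, so only the first value can ever satisfy it.
    static boolean firstPositionsOfField(List<List<String>> values, String term, int n) {
        int position = 0;
        for (List<String> value : values) {
            for (String token : value) {
                if (position >= n) {
                    return false;
                }
                if (token.equals(term)) {
                    return true;
                }
                position++;
            }
        }
        return false;
    }

    // Value-level check: the term may occur in the first n tokens of any value.
    static boolean startsAnyValue(List<List<String>> values, String term, int n) {
        for (List<String> value : values) {
            int limit = Math.min(n, value.size());
            for (int i = 0; i < limit; i++) {
                if (value.get(i).equals(term)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> attachments = List.of(
                List.of("the", "annual", "report", "for", "fiscal", "2016"),
                List.of("luwak", "release", "notes"));
        System.out.println(firstPositionsOfField(attachments, "luwak", 5)); // prints false
        System.out.println(startsAnyValue(attachments, "luwak", 5));        // prints true
    }
}
```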

Custom Slop • OTL rule: * <Near/4> (slop = 3) ** “colorless” ** “green” ** “ideas” • Document, positions 0 to 5: “colorless blah green blah blah ideas” • NearSpansUnordered: maxEndPositionCell.endPosition() - minPositionCell.startPosition() - totalSpanLength = 6 - 0 - 3 = 3 => match (bad) • Modified formula: maxPositionCell.startPosition() - minPositionCell.startPosition() - 1 = 5 - 0 - 1 = 4 => no match (good)
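The two slop formulas on this slide can be checked with a few lines of plain Java. This is a sketch of the arithmetic only, not Lucene's NearSpansUnordered: each span is assumed to be a `{start, end}` pair with an exclusive end, and the class and method names are hypothetical.

```java
// Hypothetical illustration of the two slop formulas; not Lucene code.
final class SlopSketch {

    // Stock formula: maxEnd - minStart - totalSpanLength.
    static int stockSlop(int[][] spans) {
        int minStart = Integer.MAX_VALUE;
        int maxEnd = Integer.MIN_VALUE;
        int totalLength = 0;
        for (int[] span : spans) {
            minStart = Math.min(minStart, span[0]);
            maxEnd = Math.max(maxEnd, span[1]);
            totalLength += span[1] - span[0];
        }
        return maxEnd - minStart - totalLength;
    }

    // Modified formula: maxStart - minStart - 1.
    static int modifiedSlop(int[][] spans) {
        int minStart = Integer.MAX_VALUE;
        int maxStart = Integer.MIN_VALUE;
        for (int[] span : spans) {
            minStart = Math.min(minStart, span[0]);
            maxStart = Math.max(maxStart, span[0]);
        }
        return maxStart - minStart - 1;
    }

    public static void main(String[] args) {
        // "colorless blah green blah blah ideas": matches at positions 0, 2 and 5.
        int[][] spans = {{0, 1}, {2, 3}, {5, 6}};
        int allowedSlop = 3;
        System.out.println(stockSlop(spans) <= allowedSlop);    // prints true  (3 <= 3)
        System.out.println(modifiedSlop(spans) <= allowedSlop); // prints false (4 > 3)
    }
}
```

The stock formula subtracts the length of every matched term, so each additional clause loosens the effective window; the modified one measures only the distance between the outermost start positions.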

Extending Luwak • We can run stock Luwak • Presearcher builds a query index • Uses QueryExtractors • We add custom extractors: AtLeastQueryExtractor, etc. • QueryDecomposer can improve performance • We extend to OTLQueryDecomposer • Highlighting uses a SpanRewriter • We extend to OTLSpanRewriter

Ingest Pipeline Integration • Pipeline consists of Bloomberg Application Services (BAS) • Traditionally native code (BAS Java available recently) • Legacy classifier is a BAS service with the Verity library linked in • oryx: native BAS accessing OTL Luwak via JNI • Reuse much of existing infrastructure from legacy’s BAS • Classification farm of Linux machines split across multiple data centers • Rules sharded 4 ways • Replicated across each machine of the farm • Fed by “brokers” (intelligent routers) • Luwak’s ParallelMatcher thread count: 2 • Max 20 simultaneous stories per shard replica

OTL Solr • Primarily for Content Indexers but also useful for debugging this project • Solr + OTL Luwak - Luwak => bbsolr-6.3.0-otl • Solr modified to have per-language analysis chains • Bloomberg Terminal UI (internal only) • On-demand custom collections • Send either OTL or Solr syntax queries • Results highlighted with hits

Current Status • Parallel streams for legacy and OTL Luwak • Thai turned on for new engine • Side-by-side comparisons • >99% accuracy for non-CJ languages • CJ >90% but still not good enough • Performance • Solr 4.8 looked hopeless, Solr 5.3 saves the day! • Improved latency: 20ms average vs 25ms for legacy • Significantly better latency for large stories: a few seconds vs a few tens of seconds • Handles load as well as or better than legacy • Does use 2X hardware • Startup is slower than legacy

Q&A
