Migrating a Real-time News Classification Engine to Luwak/Lucene

Explore the migration of a legacy news classification engine at Bloomberg to Luwak/Lucene, including OTL query language, custom queries, performance, and integration into the news pipeline. Learn about project requirements and the role of Luwak in streamlining the process.


Presentation Transcript


  1. Migrating a Real-time News Classification Engine to Luwak/Lucene
     Berlin Buzzwords 2017, June 12, 2017
     Marvin Justice, Software Developer, News Classification
     mjustice3@bloomberg.net

  2. Outline
     • Rules based news classification at Bloomberg
     • Legacy engine and the OTL query language
     • Luwak
     • OTL Luwak
       • Query parser
       • Tokenizers
       • Custom queries and scoring
       • Integration into news pipeline
     • OTL Solr
     • Current status and performance

  3. News at Bloomberg
     • Bloomberg Professional Service (aka “the Terminal”)
     • Almost unlimited number of individual “functions”
     • Among the most used are several news-related functions: TOP, CN, NSE, etc.
     • Bloomberg ingests >1M news stories daily
     • Solr collections containing ~500M stories
     • 10M client searches/day
     • Useful to tag stories (AAPL, GLD, MNA, …) for search purposes
     • “Building a real-time news search engine,” R. Aiyengar, Berlin Buzzwords 2016

  4. Rules based news classification
     • Part of the news ingest pipeline
     • Avg. ~15 stories/sec in live mode, spikes of hundreds of stories/sec
     • Stories processed (serial and parallel steps) and indexed to Solr
     • 100Ks of OTL rules in 24 languages
     • Templated or hand crafted
     • Maintained by Content Indexing Team (domain experts)
     • Each story is passed through the entire rule set as it is being ingested
     • Sharding and lots of hardware help, but a more sophisticated approach is needed

  5. Legacy Classification Engine
     • Provided by Verity (acquired by Autonomy in 2006)
     • In use at Bloomberg for ~10 years
     • Classification engine provided as a 32-bit native library
     • Rules are written in the OTL query language
     • Quite fast, mean latency of ~25ms/story
     • Max latency not so good
     • Verity product has been EOL’d by current owner HP

  6. Outline Topic Language
     • Term and Phrase leaves
     • Also <PHRASE> operator
     • Basic wildcards plus charset regexes
     • Boolean operators: <AND>, <OR>, <NOT>
     • Proximity queries: <NEAR/n>, <STARTSZONE/n>
     • Frequency queries: <ATLEAST/n>, <COUNT/n>
     • Miscellaneous: <SOUNDEX>, <THESAURUS>, etc. (not used at Bloomberg)
     • Absolute scoring system with scoring operators: <ANY>, <MANY>, etc.
     • Several OTL operators have no direct Lucene analog

  7. OTL Example
     • OTL rule:
         * 0.75 <In>
         /zonespec = "`headline`"
         ** <Phrase>
         *** <Any>
         **** "sram"
         **** "dram"
         **** "nvram"
         *** <Any>
         **** "memory"
         **** "chip"
     • Equivalent Lucene query:
         q = headline:("sram memory" OR "dram memory" OR "nvram memory" OR "sram chip" OR "dram chip" OR "nvram chip")^0.75
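The expansion on this slide can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names `expand_phrase` and `to_lucene` are hypothetical, and the query is built as a string only for display (the parser slide states that the real translation produces a Lucene Query object directly, with no intermediate text).

```python
from itertools import product


def expand_phrase(alternatives):
    """Expand a <Phrase> whose children are <Any> groups into concrete phrases.

    `alternatives` holds one list of words per phrase position.
    """
    return [" ".join(words) for words in product(*alternatives)]


def to_lucene(field, alternatives, boost):
    """Render the expanded phrases as a boosted disjunction (hypothetical helper)."""
    clauses = " OR ".join(f'"{phrase}"' for phrase in expand_phrase(alternatives))
    return f"{field}:({clauses})^{boost}"


# The two <Any> groups from the slide.
groups = [["sram", "dram", "nvram"], ["memory", "chip"]]
query = to_lucene("headline", groups, 0.75)
print(query)
```

The clauses come out in a different order than on the slide, which does not matter for a disjunction.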

  8. Project Requirements
     • Goal: replace legacy engine
     • Minimal burden for Content Indexers
       • Replacement must interpret OTL input
       • Replacement needs to replicate existing scoring
     • Performance (latency, load) must be as good or better
       • Hardware not limiting factor (to a point)
       • Legacy system’s mean latency is 25ms
       • Primary build is the load test
     • Provide a way for Content Indexers to test new rules
       • Legacy has authoring environment, don’t need to replace it in first pass
       • Do need a way to generate custom collections and run rules against them

  9. Luwak
     • “Simply put, it allows you to define a set of search queries and then monitor a stream of documents for any that might match these queries”
     • Alan Woodward & Flax
       • http://www.flax.co.uk/
       • https://berlinbuzzwords.de/14/session/turning-search-upside-down-search-queries-documents
     • Based on “query index” and “presearcher”
       • Stories are turned into Lucene queries and run against the query index to quickly (<10ms) weed out rules that have no chance of matching
       • ParallelMatcher for queries that survive preselection
     • Already used by Bloomberg to provide news alerts
       • D. Collins https://www.youtube.com/watch?v=IOL-6ns7M8k
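The query-index-plus-presearcher idea can be modelled with a toy inverted index over stored queries. The sketch below is not Luwak's API: `ToyMonitor` and its methods are hypothetical, and each rule is assumed to be a plain conjunction of terms.

```python
from collections import defaultdict


class ToyMonitor:
    """Toy model of a query index with a presearcher (not Luwak's API)."""

    def __init__(self):
        self.queries = {}                   # rule id -> set of required terms
        self.term_index = defaultdict(set)  # term -> ids of rules that mention it

    def register(self, rule_id, required_terms):
        terms = set(required_terms)
        self.queries[rule_id] = terms
        for term in terms:
            self.term_index[term].add(rule_id)

    def presearch(self, tokens):
        """Cheap pass: keep only rules sharing at least one term with the story."""
        candidates = set()
        for token in set(tokens):
            candidates |= self.term_index.get(token, set())
        return candidates

    def match(self, text):
        """Full pass: run only the surviving candidates against the story."""
        tokens = text.lower().split()
        token_set = set(tokens)
        return sorted(r for r in self.presearch(tokens) if self.queries[r] <= token_set)


monitor = ToyMonitor()
monitor.register("AAPL", ["apple", "iphone"])
monitor.register("GLD", ["gold"])
monitor.register("MNA", ["merger"])

story = "Apple unveils new iPhone as gold slips"
print(monitor.presearch(story.lower().split()))  # MNA is weeded out before full matching
print(monitor.match(story))
```

With hundreds of thousands of rules, the point of the first pass is that most rules are never evaluated against a given story.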

  10. OTL Luwak
     • OTL parser
     • Analysis chains (tokenizers, filters, etc.)
     • Custom queries and scoring
     • Luwak extensions
     • Integration into ingest pipeline
     • Hire Flax to get started!

  11. OTL Parser
     • ANTLR 4 parser generator
     • grammar file => lexer, parser, listener
     • QueryBuilder class does a top-down walk of the parse tree
     • No intermediate text representation: OTL is translated directly into a Lucene Query object
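To make the parse-tree idea concrete, here is a hand-rolled toy that reads the star-prefixed outline from the OTL example slide into a tree. It stands in for the ANTLR 4 generated parser only as an illustration: the `Node` class and `parse_outline` function are hypothetical, and the line layout of the rule (one node per line, depth given by the number of leading `*`, `/` lines as modifiers of the preceding node) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    depth: int
    text: str
    modifiers: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse_outline(source):
    """Build a tree from star-prefixed outline lines (toy parser, assumed layout)."""
    root = None
    stack = []  # path from the root down to the most recently added node
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):
            if not stack:
                raise ValueError("modifier before any node")
            stack[-1].modifiers.append(line)
            continue
        depth = len(line) - len(line.lstrip("*"))
        if depth == 0:
            raise ValueError(f"expected leading '*': {line!r}")
        node = Node(depth, line[depth:].strip())
        # Pop back up to this node's parent.
        while stack and stack[-1].depth >= depth:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            root = node
        stack.append(node)
    return root


RULE = '''
* 0.75 <In>
/zonespec = "`headline`"
** <Phrase>
*** <Any>
**** "sram"
**** "dram"
**** "nvram"
*** <Any>
**** "memory"
**** "chip"
'''

tree = parse_outline(RULE)
phrase = tree.children[0]
print(tree.text, tree.modifiers, [child.text for child in phrase.children])
```

A top-down walk over such a tree is where a builder would emit one query object per node.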

  12. Analysis Chains
     • Legacy system (non-CJ)
       • Whitespace tokenize queries
       • Document tokenizer similar to Lucene’s StandardTokenizer but not exact match
       • Customizable via config file, e.g., @handle need not be split
       • Token filtering similar to LowerCaseFilter, ASCIIFoldingFilter
     • OTL Luwak (non-CJ)
       • Whitespace tokenize queries
       • Fork StandardTokenizer => BBStandardTokenizer
       • Separate fork for BBKoreanTokenizer
       • LowerCaseFilter, ASCIIFoldingFilter (with exclusions)
       • Handful of custom filters (especially Korean)
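The query-side chain (whitespace tokenization, lowercasing, ASCII folding with exclusions) can be approximated as follows. This is a rough sketch under stated assumptions: folding is imitated with Unicode NFKD decomposition rather than Lucene's ASCIIFoldingFilter mapping table, "exclusions" is assumed to mean characters that are left unfolded, and both function names are hypothetical.

```python
import unicodedata


def fold_ascii(token, exclusions=frozenset()):
    """Strip combining marks from each character unless it is excluded (approximation)."""
    folded = []
    for ch in token:
        if ch in exclusions:
            folded.append(ch)
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(folded)


def analyze_query(text, exclusions=frozenset()):
    """Whitespace tokenize, lowercase, then fold: the query-side chain on the slide."""
    return [fold_ascii(token.lower(), exclusions) for token in text.split()]


# \u00e9 is e-acute, \u00f1 is n-tilde; escapes keep the example encoding-independent.
print(analyze_query("Soci\u00e9t\u00e9 G\u00e9n\u00e9rale @Handle"))
print(analyze_query("El Ni\u00f1o", exclusions=frozenset("\u00f1")))
```

Because queries are split on whitespace only, a token such as `@handle` survives intact on the query side, which is why the document tokenizer has to be configurable to keep it whole as well.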

  13. Chinese/Japanese Tokenizer
     • For Chinese/Japanese, the legacy engine uses a dictionary tokenizer from Basis Tech
     • Tried to obtain a Solr pluggable version to no avail
     • Lucene’s ICUTokenizer comes closest (also dictionary based)
     • ICU trac #11996, #11999 had to be fixed first
     • Retokenize OTL queries according to ICU dictionary
     • Big improvement but still not good enough
     • Customize ICU dictionary?
     • Can “fix” individual rules
     • Still an open question how well we can reproduce legacy behavior
     • Ultimate fallback is for Content Indexers to adjust CJ rules

  14. Custom Queries and Scoring
     • Operators w/o Lucene analog
       • AtLeastQuery, SpanAtLeastQuery, SpanMinMatchQuery, SpanStartsQuery, …
     • Scoring operators
       • DisjunctionAccrueQuery, ConjunctionMinQuery, …
     • Tweak Lucene’s DisjunctionScorer.freq()

           for (DisiWrapper w = subMatches.next; w != null; w = w.next) {
         -   freq += 1;
         +   freq += w.scorer.freq();
           }

     • Tweak Lucene NearSpans slop
     • Modify sort order in Lucene’s SpanPositionQueue
     • LUCENE-7398 “Nested Span Queries are Buggy”
     • Other minor tweaks
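The effect of the DisjunctionScorer.freq() tweak can be shown with a small model. In the stock loop each matching sub-clause adds 1; in the tweaked loop each adds its own frequency. The sketch below is an illustration only: it works on plain token lists rather than Lucene scorers, the function names are hypothetical, and treating every sub-clause uniformly (including the first) is an assumption, since the slide shows only the loop body.

```python
from collections import Counter


def sub_clause_freqs(tokens, terms):
    """Frequency of each disjunct in the document (0 when the term is absent)."""
    counts = Counter(tokens)
    return [counts[term] for term in terms]


def stock_freq(freqs):
    """Stock behaviour: every matching sub-clause counts once."""
    return sum(1 for f in freqs if f > 0)


def tweaked_freq(freqs):
    """Tweaked behaviour: every matching sub-clause contributes its own frequency."""
    return sum(f for f in freqs if f > 0)


tokens = "chip chip memory chip".split()
freqs = sub_clause_freqs(tokens, ["chip", "memory", "sram"])
print(freqs, stock_freq(freqs), tweaked_freq(freqs))
```

One plausible reading is that frequency-style operators need the total number of occurrences rather than the number of distinct alternatives that matched; the slide does not state the motivation.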

  15. [Diagram: SpanFirstQuery vs. SpanStartsQuery on a multi-valued `attachment` field holding Attachment 1 and Attachment 2]
     • SpanFirstQuery panel: token positions 0 to 1999 across the two attachments, with “luwak” occurring in each
     • SpanStartsQuery panel: token positions 0 to 2001, with a START token ahead of “luwak” in each attachment
     • OTL rule shown:
         * <STARTSZONE/5>
         /zonespec = "`attachment`"
         ** "luwak"
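The difference between the two panels can be modelled on token lists. This sketch is an interpretation of the diagram, not the actual SpanStartsQuery: it assumes a marker token is indexed at the start of every field value and that a hit means the term lies within `n` positions after a marker. The marker string and all function names are hypothetical.

```python
START = "\u0001START"  # hypothetical marker token, assumed never to occur in real text


def index_values(values, mark_starts):
    """Concatenate the values of a multi-valued field into one token stream."""
    tokens = []
    for value in values:
        if mark_starts:
            tokens.append(START)
        tokens.extend(value.split())
    return tokens


def span_first(tokens, term, n):
    """Positions of `term` within the first n positions of the whole field."""
    return [i for i, token in enumerate(tokens[:n]) if token == term]


def span_starts(tokens, term, n):
    """Positions of `term` within n positions after any START marker."""
    hits = []
    for i, token in enumerate(tokens):
        if token == term and START in tokens[max(0, i - n):i]:
            hits.append(i)
    return hits


attachments = ["luwak is a coffee", "the luwak library"]
plain = index_values(attachments, mark_starts=False)
marked = index_values(attachments, mark_starts=True)
print(span_first(plain, "luwak", 2))    # only the first attachment can match
print(span_starts(marked, "luwak", 2))  # the start of each attachment can match
```

The two extra positions in the second panel (0 to 2001 instead of 0 to 1999) are consistent with one added marker per attachment.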

  16. Custom Slop
     • OTL rule:
         * <Near/4>   (slop = 3)
         ** "colorless"
         ** "green"
         ** "ideas"
     • Document, token positions 0 to 5: “colorless blah green blah blah ideas”
     • NearSpansUnordered: maxEndPositionCell.endPosition() - minPositionCell.startPosition() - totalSpanLength = 6 - 0 - 3 = 3 => match (bad)
     • Modified formula: maxPositionCell.startPosition() - minPositionCell.startPosition() - 1 = 5 - 0 - 1 = 4 => no match (good)
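The arithmetic on this slide can be checked with a short script. It is a sketch of the two formulas only, assuming spans are `(start, end)` pairs with an exclusive end as in Lucene; the function names are hypothetical.

```python
def stock_unordered_slop(spans):
    """NearSpansUnordered formula from the slide: max end - min start - total span length."""
    max_end = max(end for _, end in spans)
    min_start = min(start for start, _ in spans)
    total_length = sum(end - start for start, end in spans)
    return max_end - min_start - total_length


def modified_slop(spans):
    """Modified formula from the slide: max start - min start - 1."""
    max_start = max(start for start, _ in spans)
    min_start = min(start for start, _ in spans)
    return max_start - min_start - 1


# "colorless blah green blah blah ideas": the three terms sit at positions 0, 2 and 5.
spans = [(0, 1), (2, 3), (5, 6)]
slop = 3  # <Near/4> on the slide

print(stock_unordered_slop(spans), stock_unordered_slop(spans) <= slop)  # within slop: match
print(modified_slop(spans), modified_slop(spans) <= slop)                # exceeds slop: no match
```

The stock formula counts only the gaps between terms, so three terms spread over six positions still fit; the modified formula measures the distance between the first and last term regardless of how many terms fall in between.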

  17. Extending Luwak
     • We can run stock Luwak
     • Presearcher builds a query index
       • Uses QueryExtractors
       • We add custom extractors: AtLeastQueryExtractor, etc.
     • QueryDecomposer can improve performance
       • We extend to OTLQueryDecomposer
     • Highlighting uses a SpanRewriter
       • We extend to OTLSpanRewriter
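Query decomposition splits a disjunction into its parts so that each part can be indexed and run separately. The toy below illustrates the idea on nested tuples; it is not Luwak's QueryDecomposer or the OTLQueryDecomposer, and the tuple representation is an assumption made for the example.

```python
def decompose(query):
    """Split nested top-level ORs into independently indexable sub-queries."""
    if isinstance(query, tuple) and query[0] == "OR":
        parts = []
        for child in query[1:]:
            parts.extend(decompose(child))
        return parts
    return [query]


rule = ("OR", "gold", ("AND", "apple", "iphone"), ("OR", "merger", "acquisition"))
print(decompose(rule))
```

After decomposition, a story that mentions only "gold" needs to be run against one small sub-query instead of the whole rule.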

  18. Ingest Pipeline Integration
     • Pipeline consists of Bloomberg Application Services
       • Traditionally native code (BAS Java available recently)
       • Legacy classifier is a BAS service with linked in Verity library
     • oryx: native BAS accessing OTL Luwak via JNI
       • Reuse much of existing infrastructure from legacy’s BAS
     • Classification farm of Linux machines split across multiple data centers
       • Rules sharded 4 ways
       • Replicated across each machine of the farm
       • Fed by “brokers” (intelligent routers)
       • Luwak’s ParallelMatcher thread count 2
       • 20 max simultaneous stories per shard replica

  19. OTL Solr
     • Primarily for Content Indexers but also useful for debugging this project
     • Solr + OTL Luwak - Luwak => bbsolr-6.3.0-otl
     • Solr modified to have per language analysis chains
     • Bloomberg Terminal UI (internal only)
     • On-demand custom collections
     • Send either OTL or Solr syntax queries
     • Results highlighted with hits

  20. Current Status
     • Parallel streams for legacy and OTL Luwak
       • Thai turned on for new engine
     • Side-by-side comparisons
       • >99% accuracy for non-CJ languages
       • CJ >90% but still not good enough
     • Performance
       • Solr 4.8 looked hopeless, Solr 5.3 saves the day!
       • Improved latency: 20ms average vs 25ms for legacy
       • Significantly better latency for large stories: a few seconds vs a few tens of seconds
       • Handles load as well as or better than legacy
       • Does use 2X hardware
       • Startup is slower than legacy

  21. Q&A