Charlie Hull, Managing Director, Flax, 1st November 2012


Presentation Transcript


  1. Search, plus building taxonomy, autoclassification and media monitoring tools with open source software Charlie Hull Managing Director, Flax 1st November 2012 charlie@flax.co.uk www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @FlaxSearch

  2. Who are Flax? • Search engine specialists with decades of experience • Developers, innovators and strategists based in Cambridge, UK • Technology agnostic – but open source exponents • UK Authorized Partner of Lucid Imagination • Customers include Reed Specialist Recruitment, Mydeco, NLA, Durrants Ltd, Financial Times, MediaMiser, MySkreen, Accenture, University of Cambridge, Cabinet Office... • Come to

  3. Who am I? • Wrote my first saleable software at age 14 • Electronic engineer, Windows device driver developer, mobile networks, pro sound, racing cars... • Muscat (Bayesian search) • Helped build a half-billion-page web search engine • Co-founder and CEO of Flax

  5. What I'll cover today • Search – the state of play • Clade – an open source taxonomy based classifier • Flax Media Monitor • Some other crazy ideas • Conclusions

  6. Search – the state of play • Types of search project: • Website search • Intranet search • Database search • Closed source engines either: • Sold up • Repositioned • In trouble! • Open source engines: • Apache Lucene/Solr • ElasticSearch • (ish!) Attivio/Lucidworks/...

  9. Let's talk about something more interesting... • Clade: classifying data into a taxonomy with a search engine • Developed as a proof of concept • Based on Apache Solr & Stanford NLP • Written in jQuery & Python • Caveats: • We don't know much about library science! • Something like this may already exist (not that we could find it...) • This is an alpha version only
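
As a rough illustration of classifying data into a taxonomy with search-style matching, here is a minimal Python sketch. It is not Clade's code: it uses no Solr or Stanford NLP, and the taxonomy, node names, keywords and scoring rule are all hypothetical. Each taxonomy node carries a few keywords, and a document is scored against each node by counting keyword occurrences.

```python
import re
from collections import Counter

# Hypothetical taxonomy: node path -> keywords describing that node.
TAXONOMY = {
    "Science/Physics": ["quantum", "particle", "energy"],
    "Science/Biology": ["cell", "gene", "species"],
    "Sport/Football": ["goal", "league", "striker"],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, taxonomy=TAXONOMY, min_score=1):
    """Return (node, score) pairs, best first.

    The score is simply the number of times a node's keywords occur
    in the text (an assumed, deliberately naive ranking).
    """
    counts = Counter(tokenize(text))
    scores = {
        node: sum(counts[keyword] for keyword in keywords)
        for node, keywords in taxonomy.items()
    }
    hits = [(node, score) for node, score in scores.items() if score >= min_score]
    # Sort by descending score, then by node name for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))
```

Calling `classify("The striker scored a goal in the league")` files the text under the football node only. A real search engine would replace the keyword counting with proper relevance scoring.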

  10. Clade demo....

  11. What Clade doesn't do (yet) • Talk standard taxonomy formats • Output anything • Multiple users • Rules-based classification • Look pretty. Try it out at http://www.flax.co.uk/the_software

  12. Media monitoring • Standard search – few queries over many documents • Monitoring search – many queries over each document • Customers' interests manually turned into queries • Humans probably still have the final say on relevance • Eventual result is a list of articles emailed to (or even printed for) customers
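
The "many queries over each document" pattern can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the Flax Media Monitor implementation: the `Monitor` class is hypothetical, and each stored query is just a set of terms that must all appear in the document. Each query is filed under one of its terms, so an incoming article is only checked against queries that could possibly match it.

```python
import re
from collections import defaultdict


class Monitor:
    """Matches each incoming document against many stored queries."""

    def __init__(self):
        self.queries = {}                # query id -> set of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries filed under it

    def add_query(self, query_id, terms):
        """Store a query whose terms must ALL occur in a matching document."""
        required = {term.lower() for term in terms}
        if not required:
            raise ValueError("a query needs at least one term")
        self.queries[query_id] = required
        # File the query under a single term: it can only match
        # documents that contain that term.
        self.by_term[min(required)].add(query_id)

    def match(self, text):
        """Return the sorted ids of all stored queries matching the text."""
        doc_terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Verify each candidate fully; the filing step is only a prefilter.
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)
```

With queries `telecoms = {mobile, merger}` and `tourism = {beach, hotel}` stored, an article about a mobile merger is only ever tested against the first query, which is the point of the prefilter when there are tens of thousands of stored queries.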

  13. Media monitoring - parameters • Tens of thousands of stored expressions or keywords • Can't rewrite these so must use same syntax! • Hundreds of thousands of articles to monitor every day • Source data can sometimes be scanned & OCR'd • False positives cost human operator time: false negatives cost customers! • Traditional approach is brute force using standard search engine software

  14. Media monitoring – a Keyword ("PALM BEACH COUNTY" W/48 (("TOURIS*" OR "TOUR" OR "TOURS" OR "TRAVEL*" OR "HOLIDAY*" OR "HOL" OR "HOLS" OR "HOTEL*" OR "VISIT*" OR "TRIP" OR "TRIPS" OR "DAYTRIP*" OR "BEACH" OR "!BEACHES" OR "COAST" OR "!COASTLINE*" OR "ABTA" OR "DAY TRIP*" OR "SUITE" OR "SUITES" OR "A%CCOMMODATION" OR "BED AND !BREAKFAST" OR "B&B" OR "BED & !BREAKFAST" OR "!BREAKFAST AND BOARD" OR "FULL BOARD" OR "HALF BOARD" OR "ALL !INCLUSIVE" OR "THINGS TO DO" OR "HOSP?TALITY" OR "SHORT BREAK*" OR "!WEEKEND BREAK" OR "CITY BREAK*" OR "!SIGHTSEE*" OR "!VACATION*" OR "E%XCURSION*" OR "FLY* WITH" OR "FLY* THERE" OR "FLY* DRIVE" OR "!GETAWAY" OR "!BACKPACK*" OR "BACK PACK*" OR "!ECOTOURIS*" OR "!WATERSPORT*" OR "WATER SPORT*" OR "FESTIVAL*" OR "RESORT* & SPA" OR "RESORT* AND SPA" OR "WHALE WATCH*" OR "GET THERE" OR "WHERE TO STAY" OR "GETTING THERE" OR "STAYCATION*" OR "VILLA" OR "VILLAS" OR "AIRPORT*" OR "SPA" OR "SPAS" OR "OUTDOOR EVENT*" OR "OUTDOOR ADVENTURE*" OR "OUTDOOR PURSUIT*" OR "OUTDOOR ACTIVIT*" OR "CLIMBING WALL*" OR "CLIMBING CENTRE*" OR "ROCK CLIMB*" OR "WHITE WATER RAFTING") OR ("PLACES" W/4 ("TO STAY" OR "TO SEE" OR "TO EAT")) OR (("FLIGHT*" OR "FLY" OR "FLYING" OR "CRUISE*") W/4 ("OFFER" OR "AVAILABLE" OR "DEPART*" OR "FROM" OR "TRANSFER*"))))

  15. Media monitoring – another Keyword ((("!MOBILE PHONE*" OR "PHONE MAST*" OR "HANDSET*" OR "CELL* PHONE*" OR "3G" OR "GPRS" OR "G.P.R.S" OR "!GENERAL !RADIO PACKET SERVICE*" OR "GSM" OR "G.S.M" OR "!GLOBAL SYSTEM FOR !MOBILE COMM*" OR "HSDPA" OR "H.S.D.P.A" OR "HIGH SPEED DOWNLINK !PACKET ACCESS" OR "HSUPA" OR "H.S.U.P.A" OR "HIGH SPEED !UPLINK !PACKET ACCESS" OR "UMTS" OR "U.M.T.S" OR "MVNO" OR "M.V.N.O" OR "SMS" OR "SHORT MESSAGE !SERVICE*" OR "MMS" OR "!MULTIMEDIA MESSAGE !SERVICE*" OR "!MOBILES" OR "!CELLPHONE*" OR "!TELECOM*" OR "!LANDLINE*" OR "!TELEPHONE*" OR "PHONE*" OR "!TELEKOM*" OR "TELCO*" OR "VODAFONE" OR "T-MOBILE" OR "TMOBILE" OR "!TELEFONICA" OR "BT" OR "!MOBILE USER*" OR "TEXT MESSAG*" OR "SMARTPHONE*" OR "!VIRGIN !MEDIA*" OR "CABLE & !WIRELESS" OR "CABLE AND !WIRELESS") W/48 (("PROFIT*" OR "LOSS*" OR "BAN" OR "BANNED" OR "PREMIUM RATE*" OR "FINANC*" OR "!REFINANC*" OR "OFFICE OF FAIR TRADING" OR "MERGER*" OR "!ACQUISIT*" OR "ACQUIR*" OR "TAKEOVER*" OR "BUYOUT*" OR "BUY-OUT*" OR "NEW PRODUCT*" OR "INVEST*" OR "SHARES" OR "MARKET*" OR "ACCOUNT*" OR "MONEY" OR "CASH*" OR "SECURIT*" OR "!ENTERPRIS*" OR "!BUSINESS*" OR "PRICE*" OR "JOINT*" OR "NEW VENTURE*" OR "PRICING" OR "COST*" OR "CHAIRM?N" OR "APPOINT*" OR "!EXECUTIVE" OR "SALE*" OR "SELL*" OR "FULL YEAR" OR "REGULAT*" OR "!DIRECTIVE*" OR "LAW" OR "LAWS" OR "!LEGISLAT*" OR "GREEN PAPER" OR "WHITE PAPER*" OR "!MEDIAWATCH" OR "MORAL*" OR "ETHIC*" OR "ADVERT*" OR "AD" OR "ADS" OR "MARKETING" OR "!COMPLAIN*" OR "MIS-SOLD" OR "MIS-SELL*" OR "SPONSOR" OR "COSTCUT*" OR "COST CUT*" OR "CUT* COST*" OR "FIBRE OPTIC*" OR "TAX" OR "TAXES" OR "TAXED" OR "EXPAND*" OR "!EXPANSION" OR "EMPLOY*" OR "STAFF" OR "WORKER*" OR "SPOKESM?N" OR "DEBUT" OR "BRAND*" OR "DIRECTOR*") OR (("FAIR" OR "UNFAIR" OR "%UNSCRUPULOUS" OR "NOT FAIR" OR "UNJUST*" OR "!PENALISE*") W/12 ("CHARG*" OR "TARIFF*" OR "PRICE PLAN*" OR "GLOBAL")))) AND NOT ("EXPRESS OFFER" OR "TIMES OFFER" OR "READER OFFER" OR (("CALLS COST") W/6 ("FROM A LANDLINE" OR "FROM LANDLINE*" OR "BT LANDLINE*"))))

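The slides do not define the Keyword syntax, so the following Python sketch rests on two labelled assumptions: that `W/n` means "within n words, in either order", and that a trailing `*` is a prefix wildcard. The other operators seen above (`!`, `%`, `?`) and multi-word phrases are not handled. The function names are hypothetical.

```python
import re


def tokenize(text):
    """Uppercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[A-Z0-9]+", text.upper())


def term_positions(tokens, pattern):
    """Positions where a single-word pattern matches.

    Assumption: a trailing * means "any token starting with this prefix".
    """
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [i for i, token in enumerate(tokens) if token.startswith(prefix)]
    return [i for i, token in enumerate(tokens) if token == pattern]


def within(tokens, left_terms, right_terms, n):
    """True if any left term occurs within n tokens of any right term.

    Assumption: W/n is unordered, so only the absolute distance matters.
    """
    left = [i for p in left_terms for i in term_positions(tokens, p)]
    right = [i for p in right_terms for i in term_positions(tokens, p)]
    return any(abs(i - j) <= n for i in left for j in right)
```

For example, with `tokens = tokenize("Vodafone confirmed the takeover of a rival handset maker")`, the check `within(tokens, ["VODAFONE", "HANDSET*"], ["TAKEOVER*", "MERGER*"], 48)` succeeds, while `within(tokens, ["VODAFONE"], ["MAKER"], 3)` fails because the two words are eight tokens apart.
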
  16. Flax Media Monitor • Based on a modification of Apache Lucene/Solr • Runs a separate Solr server for archiving • Consumes XML articles & keywords • Outputs matches as XML • REST API for status & configuration • Allows you to test new Keywords on old content

  17. Flax Media Monitor demo...

  18. Flax Media Monitor - performance • For simple keywords (<20 terms): • 70,000 keywords applied per second to an article • Tested on a MacBook • 20 times faster than previous implementation • For more complex keywords (some run to three pages!) • 20,000 keywords applied in 0.5 seconds • Approx 2000 docs/hour • Can be scaled horizontally for high load (and needs a lot less hardware) • Archive can store tens to hundreds of millions of articles

  19. Some other crazy ideas... • Combine media monitoring with Clade: very fast expression-based classification! • We can parse syntax from other search engines... • How to store rapidly changing classification data in a search engine index: 1. Re-index all documents affected (expensive) 2. Store the classifications somewhere else: how about a Lucene codec backed by a NoSQL Database? http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/
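
Option 2 above, storing classifications outside the index, can be illustrated with a small Python sketch. It is not the Lucene codec approach described in the linked post: a plain dictionary stands in for the NoSQL database, and `SideStore` and `filter_hits` are hypothetical names. The point is only that a classification change touches the side store and nothing is re-indexed.

```python
class SideStore:
    """Keeps rapidly changing classifications outside the search index."""

    def __init__(self):
        self._labels = {}  # document id -> set of taxonomy nodes

    def set_labels(self, doc_id, labels):
        """Replace a document's classifications without re-indexing it."""
        self._labels[doc_id] = set(labels)

    def labels(self, doc_id):
        """Return the current classifications (empty set if none stored)."""
        return self._labels.get(doc_id, set())


def filter_hits(hits, store, wanted):
    """Keep search hits whose current classification includes `wanted`.

    `hits` is a list of document ids in rank order, as a search engine
    might return them; the order is preserved.
    """
    return [doc_id for doc_id in hits if wanted in store.labels(doc_id)]
```

Reclassifying a document is then a single `set_labels` call, and the next `filter_hits` over the same search results reflects the change immediately.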

  20. Conclusions • Search isn't just about “search” • Taxonomy management is ready for open source • Media monitoring can be done at low cost for high volume with open source • Could classification be a special case of monitoring? • It's all much more fun than 'vanilla' search!

  21. Thank you! Any questions? charlie@flax.co.uk www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @FlaxSearch
