London Amazon CloudSearch Meetup Jon Handler

Agenda
• CloudSearch technical overview (Jon Handler, Amazon CloudSearch Solution Architect)
• NakedWines and CloudSearch (Matt Reid, Developer at NakedWines)
• Searching Wikipedia with Amazon CloudSearch (Iain Fletcher, Search Technologies)
• Building UI with CloudSearch (Stefan Olafsson, Co-Founder, Twigkit)

What is Search Shoes

Do You Want Search With That? • Build your own – database, home-rolled, site search • Open source • Legacy enterprise search

Search Challenges • Complex, expertise required • Costly, often with up-front expenditure • Long time to market, innovation and experimentation are slowed • Operational overhead is undifferentiated work

Amazon CloudSearch • Pay for infrastructure you need when you need it • Low cost • No need to guess capacity • Experiment fast with low risk • We do the undifferentiated heavy lifting • Go global in minutes

Amazon CloudSearch Architecture
[Diagram: a search domain exposes a DOCUMENT SERVICE, a SEARCH SERVICE, and a CONFIG SERVICE. The console and command line tools reach them through the Doc Svc API, Search API, and Config API; the diagram also labels AWS Query and DNS / Load Balancing.]

Automatic Scaling
[Diagram: a grid of search instances. DATA (document quantity and size) adds index partitions 1 to n; TRAFFIC (search request volume and complexity) adds copies 1 to n of each partition.]

Compute, Storage, Load Balancing, Security
[Diagram: the same grid of search instances, holding index partitions 1 to n in copies 1 to n.]

Text Search

Highly Relevant Results

Faceted Drilldown

Integer Range Searching

Complex Queries

Search Query Processing
Query, then Matching, Filtering, Ranking, Sorting
[Diagram: document IDs 123, 564, and 726 flowing through the stages.]
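The four stages named on the slide can be illustrated with a toy in-memory pipeline. This is only a sketch of the idea, not how CloudSearch is implemented: the `Doc` class, the `price` filter, and the term-frequency score are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: int
    text: str
    price: int


def search(docs, term, max_price):
    """Toy pipeline: match, filter, rank, sort. Returns document IDs."""
    # Matching: keep documents whose text contains the term as a whole word.
    matched = [d for d in docs if term in d.text.lower().split()]
    # Filtering: narrow the matches with a non-text constraint.
    filtered = [d for d in matched if d.price <= max_price]
    # Ranking: compute a score per document (here, a simple term count).
    scored = [(d.text.lower().split().count(term), d) for d in filtered]
    # Sorting: order by score, highest first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d.doc_id for _, d in scored]


docs = [
    Doc(123, "kindle cover", 1500),
    Doc(564, "kindle paperwhite kindle reader", 11900),
    Doc(726, "kindle kindle kindle bundle", 25000),
    Doc(900, "running shoes", 6000),
]
result = search(docs, "kindle", max_price=20000)
print(result)
```

Document 726 matches but is removed by the filter, and 900 never matches, so only 564 and 123 reach the ranking stage.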

Reference Architecture

Create An Amazon CloudSearch Domain

Text fields for matching user terms Result enabled to retrieve source data

Literal fields for Faceting Facet enabled to retrieve facets Search enabled for narrowing

Integer fields for ranking, narrowing

Configure the Domain

Data Preparation and Upload
[Diagram: Extract, then SDF Batch, then POST to Amazon CloudSearch, then Search Documents.]

CloudSearch SDF
[
  {
    "type": "add",
    "id": "b007oznzg0",
    "version": 1,
    "lang": "en",
    "fields": {
      "title": "Kindle Paperwhite",
      "description": "World's most advanced e-reader",
      "category": ["Electronics", "eBook Readers"],
      "price": 11900
    }
  },
  ...
]
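A batch in this shape can be assembled from plain dictionaries and serialized with the standard `json` module. The helper names `make_add` and `make_delete` are hypothetical, chosen for this sketch; the keys and sample values are the ones shown on the slides.

```python
import json


def make_add(doc_id, version, fields, lang="en"):
    """Build one SDF 'add' operation (helper name is hypothetical)."""
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": lang,
        "fields": fields,
    }


def make_delete(doc_id, version):
    """Build one SDF 'delete' operation; it carries no fields."""
    return {"type": "delete", "id": doc_id, "version": version}


batch = [
    make_add(
        "b007oznzg0",
        1,
        {
            "title": "Kindle Paperwhite",
            "description": "World's most advanced e-reader",
            "category": ["Electronics", "eBook Readers"],
            "price": 11900,
        },
    ),
    make_delete("tt0434409", 1337648735),
]

# The serialized batch is what would be sent as the request body.
body = json.dumps(batch)
print(body)
```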

Document Service API
http(s)://<document service endpoint>/2011-02-01/documents/batch
Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com
[
  {
    "type": "add",
    "id": "b007oznzg0",
    "version": 1,
    "lang": "en",
    "fields": {
      "title": "Kindle Paperwhite",
      "description": "World's most advanced e-reader",
      "category": ["Electronics", "eBook Readers"],
      "price": 11900
    }
  },
  {
    "type": "delete",
    "id": "tt0434409",
    "version": 1337648735
  }
]
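As a rough sketch, the request above could be prepared with Python's standard `urllib`. The function name `build_batch_request` and the endpoint `doc.example-domain.us-east-1.cloudsearch.amazonaws.com` are placeholders invented for this example, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_VERSION = "2011-02-01"  # version segment shown in the slide's URL


def build_batch_request(doc_endpoint, batch):
    """Prepare (but do not send) a document batch upload request."""
    body = json.dumps(batch).encode("utf-8")
    url = f"https://{doc_endpoint}/{API_VERSION}/documents/batch"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


# Hypothetical endpoint; a real one comes from your own search domain.
request = build_batch_request(
    "doc.example-domain.us-east-1.cloudsearch.amazonaws.com",
    [{"type": "delete", "id": "tt0434409", "version": 1337648735}],
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
# urlopen fills in Content-Length from the body when the request is sent.
```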

Search Service API
http(s)://<search service endpoint>/2011-02-01/search?
• Simple searches
  • q= text
• Boolean combination of fields
  • bq= (or field:'value1' (and field:'value2' field:'value3'))
• Faceting
  • facet= comma separated list of facet fields
• Pagination
  • start=, size=
• Customized ranking
  • rank= sort results based on the rank expression provided
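The parameters listed above combine into an ordinary query string. The sketch below builds such a URL with `urllib.parse`; the function name `build_search_url`, the default page size, and the endpoint are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(search_endpoint, q=None, bq=None, facets=(),
                     start=0, size=10, rank=None):
    """Assemble a search URL from the parameters named on the slide."""
    params = {}
    if q:
        params["q"] = q                      # simple text search
    if bq:
        params["bq"] = bq                    # Boolean combination of fields
    if facets:
        params["facet"] = ",".join(facets)   # comma separated facet fields
    params["start"] = start                  # pagination offset
    params["size"] = size                    # page size
    if rank:
        params["rank"] = rank                # rank expression to sort by
    return f"https://{search_endpoint}/2011-02-01/search?{urlencode(params)}"


# Hypothetical endpoint and field names.
url = build_search_url(
    "search.example-domain.us-east-1.cloudsearch.amazonaws.com",
    bq="(and title:'kindle' category:'Electronics')",
    facets=["category"],
    rank="-text_relevance",
)
print(url)
print(parse_qs(urlsplit(url).query))
```

`urlencode` takes care of escaping the spaces, quotes, and parentheses in the `bq` expression.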

Search Results
{
  "rank": "-text_relevance",
  "match-expr": "(label 'kindle paperwhite')",
  "hits": {
    "found": 204,
    "start": 0,
    "hit": [
      { "id": "sontsst12cf5f88b42" },
      { "id": "sopvopr12ab017f082" },
      { "id": "sorzrpw12ac468a13b" }
    ]
  },
  ...
}
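A client typically pulls the document IDs out of `hits.hit` and uses `found` and `start` to decide whether to request another page. The sketch below works on a trimmed copy of the response shape from the slide, with the elided parts left out so that it parses; `summarize_hits` is a hypothetical helper name.

```python
import json

# Trimmed sample in the shape shown on the slide (elided parts omitted).
response_text = """
{
  "rank": "-text_relevance",
  "match-expr": "(label 'kindle paperwhite')",
  "hits": {
    "found": 204,
    "start": 0,
    "hit": [
      {"id": "sontsst12cf5f88b42"},
      {"id": "sopvopr12ab017f082"},
      {"id": "sorzrpw12ac468a13b"}
    ]
  }
}
"""


def summarize_hits(text):
    """Return (ids on this page, start offset of next page, more pages?)."""
    hits = json.loads(text)["hits"]
    ids = [hit["id"] for hit in hits["hit"]]
    next_start = hits["start"] + len(ids)
    has_more = next_start < hits["found"]
    return ids, next_start, has_more


ids, next_start, has_more = summarize_hits(response_text)
print(ids, next_start, has_more)
```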

Customizing Ranking
• Rank expressions
  • Compute a score for each document
  • &rank=<function>
  • E.g. recency based

Customizing Ranking With Queries
• Define rank expressions in your query
  • &rank-recency=text_relevance + (1 / (2012 - year)) * 100
  • &rank=-recency
• Uses
  • A/B testing
  • User-customized searches
  • Geo-searching
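To see what the recency expression does, it can be evaluated by hand in Python. This is a local illustration of the arithmetic only; the sample documents and their `text_relevance` values are invented, and the guard for `year == 2012` is an assumption, since the slide does not say how the service treats a zero denominator.

```python
def recency_rank(text_relevance, year, current_year=2012):
    """Evaluate text_relevance + (1 / (2012 - year)) * 100 locally."""
    if year == current_year:
        # The expression divides by zero here; the slide does not say
        # how the service handles it, so this sketch simply refuses.
        raise ValueError("expression is undefined when year == current_year")
    return text_relevance + (1 / (current_year - year)) * 100


# Invented sample documents.
docs = [
    {"id": "a", "text_relevance": 200, "year": 2008},  # 200 + 25
    {"id": "b", "text_relevance": 150, "year": 2011},  # 150 + 100
    {"id": "c", "text_relevance": 230, "year": 2002},  # 230 + 10
]

# "&rank=-recency" asks for descending order of the expression.
ordered = sorted(
    docs,
    key=lambda d: recency_rank(d["text_relevance"], d["year"]),
    reverse=True,
)
ordered_ids = [d["id"] for d in ordered]
print(ordered_ids)
```

The newest document, "b", moves to the top even though its text relevance is the lowest, which is the effect a recency boost is meant to have.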

IMDb Data Demo

Pricing • Get started for just $2.40/day; $75/month • AWS Calculator http://calculator.s3.amazonaws.com/calc5.html Free Trial

Wrap Up • Powerful search is a critical component of today's applications • Amazon CloudSearch makes adding search easy • Create a domain, POST documents, GET search results

Resources and Q&A
• Amazon CloudSearch Overview Page http://aws.amazon.com/cloudsearch/
• FAQs
• Community Forum
• Documentation & Getting Started Tutorial (IMDb)
• Contact our EU business development team
  • http://aws.amazon.com/contact-us

Thank You Jon Handler / handler@amazon.com

Searching Wikipedia with Amazon CloudSearch • Iain Fletcher • ifletcher@searchtechnologies.com

Search Engine Expertise • Microsoft SharePoint/FAST • Google Search Appliance • Solr • Amazon CloudSearch • LucidWorks • Attivio • Exalead • Autonomy • MarkLogic • elasticsearch • Vivisimo • Sinequa • Hadoop • Sphinx • …..

400+ Customers

Agenda • Project Background • High-level Architecture • Summary & Observations

Project Background • Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch • Decision to use Wikipedia as a convenient data set for testing purposes

High-level Architecture

Indexing
• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
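The project used Aspire for these steps. Purely as an illustration of what splitting, date normalization, and field mapping involve, here is a small standard-library Python sketch. It is not the Aspire pipeline: the sample XML is a simplified stand-in for a dump file, the output field names (`title`, `year`, `body`) and the `wiki_` ID prefix are hypothetical, and the output is shaped like the JSON "add" operation shown earlier in the deck rather than the XML form.

```python
import io
import xml.etree.ElementTree as ET

# Simplified stand-in for a dump file (real dumps are far larger).
SAMPLE = """<mediawiki>
  <page>
    <title>Search engine</title>
    <id>12</id>
    <revision>
      <timestamp>2012-08-15T10:00:00Z</timestamp>
      <text>  A search engine retrieves documents.  </text>
    </revision>
  </page>
  <page>
    <title>London</title>
    <id>34</id>
    <revision>
      <timestamp>2011-03-02T08:30:00Z</timestamp>
      <text>London is the capital of England.</text>
    </revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}page', if present."""
    return tag.rsplit("}", 1)[-1]


def iter_documents(stream):
    """Stream the file and yield one 'add' operation per <page>."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        values = {}
        for child in elem.iter():
            # Keep the first occurrence of each tag, in document order.
            values.setdefault(local_name(child.tag), (child.text or "").strip())
        yield {
            "type": "add",
            "id": f"wiki_{values['id']}",          # hypothetical ID scheme
            "version": 1,
            "lang": "en",
            "fields": {
                "title": values["title"],
                "year": int(values["timestamp"][:4]),  # date normalization
                "body": values["text"],                # whitespace cleanup
            },
        }
        elem.clear()  # free the page's subtree before reading the next one


documents = list(iter_documents(io.BytesIO(SAMPLE.encode("utf-8"))))
print(documents)
```

Parsing incrementally with `iterparse` and clearing each page after use keeps memory flat, which matters when the input is a multi-gigabyte file.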

Aspire in Brief • Based on Apache Felix / OSGi • Thread-safe, multi-threaded, distributable • Any number of pipelines, conditional branching • Plug-in components individually testable & upgradable • In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA. • Tested with Elasticsearch and SP 2013

XML Input

Indexing • Streaming Wikipedia Dump Files directly into CloudSearch • 500 docs/second achieved without much effort • Using 4 x XL instances of CloudSearch • 1 x XL EC2 instance for Aspire

Searching • Amazon CloudSearch provides a RESTful/XML interface for search purposes • For the Wikipedia project, we needed a UI • Chose to use Twigkit • Wrote a Java API for CloudSearch • The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html

Searching • Supports navigators and relevancy customization • E.g. a “PageRank” style link analysis was performed • Limits set high: E.g. retrieve 500,000 results in a single list, delivered in just a few seconds • Hugely useful for analysis applications • So, what does it look like?

wikipedia.searchtechnologies.com
