London Amazon CloudSearch Meetup • Jon Handler
Agenda • CloudSearch technical overview (Jon Handler, Amazon CloudSearch Solution Architect) • NakedWines and CloudSearch (Matt Reid, Developer at NakedWines) • Searching Wikipedia with Amazon CloudSearch (Iain Fletcher, Search Technologies) • Building UI with CloudSearch (Stefan Olafsson, Co-Founder, Twigkit)
What is Search? • Shoes
Do You Want Search With That? • Build your own – database, home-rolled, site search • Open source • Legacy enterprise search
Search Challenges • Complex, expertise required • Costly, often with up-front expenditure • Long time to market, innovation and experimentation are slowed • Operational overhead is undifferentiated work
Amazon CloudSearch • Pay for infrastructure you need when you need it • Low cost • No need to guess capacity • Experiment fast with low risk • We do the undifferentiated heavy lifting • Go global in minutes
Amazon CloudSearch Architecture (diagram) • Search Domain • AWS Query • DNS / Load Balancing • Document Service: Doc Svc API • Search Service: Search API • Config Service: Config API • Clients: Console, Command Line Tools
Automatic Scaling (diagram) • DATA: document quantity and size, spread across index partitions 1 to n • TRAFFIC: search request volume and complexity, spread across copies 1 to n of each partition • Each partition copy sits on a search instance
Compute • Storage • Load Balancing • Security (diagram: the same grid of search instances, index partitions 1 to n by copies 1 to n)
Search Query Processing (diagram) • Query • Matching • Filtering • Ranking • Sorting
Text fields for matching user terms • Result enabled to retrieve source data
Literal fields for Faceting • Facet enabled to retrieve facets • Search enabled for narrowing
Data Preparation and Upload (diagram labels) • SDF Batch • Extract • POST • Amazon CloudSearch • Search Documents
CloudSearch SDF [{"type":"add", "id": "b007oznzg0", "version": 1, "lang": "en", "fields": { "title":"KindlePaperwhite", "description":"World's most advanced e-reader", "category": ["Electronics","eBook Readers"], "price":11900 } }, ...]
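A minimal Python sketch of assembling a batch in the shape shown on this slide. The helper name `sdf_add` is hypothetical (it is not part of any AWS library); the field names and values are the ones from the slide.

```python
import json

def sdf_add(doc_id, version, fields, lang="en"):
    """Return one SDF 'add' operation as a plain dict (hypothetical helper)."""
    return {"type": "add", "id": doc_id, "version": version,
            "lang": lang, "fields": fields}

# One document, copied from the slide. Multi-valued fields are JSON arrays.
batch = [
    sdf_add("b007oznzg0", 1, {
        "title": "KindlePaperwhite",
        "description": "World's most advanced e-reader",
        "category": ["Electronics", "eBook Readers"],
        "price": 11900,
    }),
]

# An SDF batch is a JSON array of operations.
body = json.dumps(batch)
print(body)
```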
Document Service API http(s)://< document service endpoint >/2011-02-01/documents/batch Accept: application/json Content-Length: 1176 Content-Type: application/json Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com [{"type": "add","id":"b007oznzg0","version": 1,"lang": "en","fields": {"title":"KindlePaperwhite","description":"World's most advanced e-reader","category":["Electronics","eBook Readers"],"price":11900} }, { "type": "delete", "id": "tt0434409", "version": 1337648735 } ]
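As a rough sketch, the request above can be prepared with Python's standard library. The endpoint below is a placeholder, not a real domain, and `build_batch_request` is a hypothetical helper name. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Placeholder: substitute the document service endpoint of your own domain.
DOC_ENDPOINT = "doc-endpoint.example.com"

def build_batch_request(endpoint, batch):
    """Build (but do not send) a POST of an SDF batch to the documents/batch resource."""
    body = json.dumps(batch).encode("utf-8")
    url = "https://%s/2011-02-01/documents/batch" % endpoint
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )

# The delete operation from the slide.
batch = [{"type": "delete", "id": "tt0434409", "version": 1337648735}]
req = build_batch_request(DOC_ENDPOINT, batch)
print(req.get_method(), req.full_url)

# Sending requires a live domain:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status, resp.read())
```

`urlopen` fills in `Content-Length` from the body, so it does not need to be set by hand.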
Search Service API http(s)://< search service endpoint>/2011-02-01/search? • Simple searches • q= text • Boolean combination of fields • bq= (or field:'value1' (and field:'value2' field:'value3')) • Faceting • facet= comma separated list of facet fields • Pagination • start=, size= • Customized ranking • rank= sort results based on the rank expression provided
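The parameters listed above can be combined into a request URL along these lines. This is a sketch under assumptions: the endpoint is a placeholder, `search_url` is a hypothetical helper, and the field names `title` and `category` are borrowed from the SDF example rather than from any particular domain configuration.

```python
from urllib.parse import urlencode

# Placeholder: substitute the search service endpoint of your own domain.
SEARCH_ENDPOINT = "search-endpoint.example.com"

def search_url(endpoint, params):
    """Build a GET URL for the search resource; urlencode handles the escaping."""
    return "https://%s/2011-02-01/search?%s" % (endpoint, urlencode(params))

# Simple text search with a facet and pagination.
simple = search_url(SEARCH_ENDPOINT, {
    "q": "kindle", "facet": "category", "start": 0, "size": 10,
})

# Boolean query over fields, with an explicit rank.
boolean = search_url(SEARCH_ENDPOINT, {
    "bq": "(and title:'kindle' category:'Electronics')",
    "rank": "-text_relevance",
})

print(simple)
print(boolean)
```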
Search Results {"rank": "-text_relevance", "match-expr": "(label 'kindle paperwhite')", "hits": { "found": 204, "start": 0, "hit": [ { "id": "sontsst12cf5f88b42" }, { "id": "sopvopr12ab017f082" }, { "id": "sorzrpw12ac468a13b" } ] }, ... }
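A small Python sketch of reading such a response. The sample text is a trimmed copy of the slide's response (the elided fields are left out), and `summarize` is a hypothetical helper. It pulls out the document ids and works out whether another page is available from `found` and `start`.

```python
import json

# Trimmed copy of the response on the slide.
response_text = """
{"rank": "-text_relevance",
 "match-expr": "(label 'kindle paperwhite')",
 "hits": {"found": 204, "start": 0,
          "hit": [{"id": "sontsst12cf5f88b42"},
                  {"id": "sopvopr12ab017f082"},
                  {"id": "sorzrpw12ac468a13b"}]}}
"""

def summarize(text):
    """Return (ids on this page, start offset of the next page, whether more hits remain)."""
    hits = json.loads(text)["hits"]
    ids = [hit["id"] for hit in hits["hit"]]
    next_start = hits["start"] + len(ids)
    return ids, next_start, next_start < hits["found"]

ids, next_start, has_more = summarize(response_text)
print(ids, next_start, has_more)
```

The `next_start` value is what a client would pass as `start=` to fetch the following page.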
Customizing Ranking • Rank expressions • Compute a score for each document • &rank=<function> • E.g. recency based
Customizing Ranking With Queries • Define rank expressions in your query • &rank-recency=text_relevance + (1 / (2012 - year)) * 100 • &rank=-recency • Uses • A/B testing • User-customized searches • Geo-searching
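To see what the recency expression does, its arithmetic can be mirrored in plain Python. This is only an illustration of the formula on the slide, not of how the service evaluates it. The three documents and their `text_relevance` values are made up, and the leading minus in `rank=-recency` is treated here as descending order.

```python
def recency(text_relevance, year, current_year=2012):
    """Mirror of the slide's expression: text_relevance + (1 / (2012 - year)) * 100."""
    # Note: this Python version raises ZeroDivisionError when year == current_year.
    return text_relevance + (1 / (current_year - year)) * 100

# Hypothetical documents with made-up relevance scores.
docs = [
    {"id": "a", "text_relevance": 200, "year": 2011},
    {"id": "b", "text_relevance": 250, "year": 2008},
    {"id": "c", "text_relevance": 180, "year": 2010},
]

by_relevance = sorted(docs, key=lambda d: d["text_relevance"], reverse=True)
by_recency = sorted(docs, key=lambda d: recency(d["text_relevance"], d["year"]),
                    reverse=True)

print([d["id"] for d in by_relevance])  # ['b', 'a', 'c']
print([d["id"] for d in by_recency])    # ['a', 'b', 'c']
```

The 2011 document gains 100 points and overtakes the more relevant 2008 document, which gains only 25.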
Pricing • Get started for just $2.40/day; $75/month • AWS Calculator http://calculator.s3.amazonaws.com/calc5.html • Free Trial
Wrap Up • Powerful search is a critical component of today's applications • Amazon CloudSearch makes adding search easy • Create a domain, POST documents, GET search results
Resources and Q&A • Amazon CloudSearch Overview Page http://aws.amazon.com/cloudsearch/ • FAQs • Community Forum • Documentation & Getting Started Tutorial (IMDb) • Contact our EU business development team • http://aws.amazon.com/contact-us
Thank You Jon Handler / handler@amazon.com
Searching Wikipedia with Amazon CloudSearch • Iain Fletcher • ifletcher@searchtechnologies.com
Search Engine Expertise • Microsoft SharePoint/FAST • Google Search Appliance • Solr • Amazon CloudSearch • LucidWorks • Attivio • Exalead • Autonomy • MarkLogic • elasticsearch • Vivisimo • Sinequa • Hadoop • Sphinx • …..
Agenda • Project Background • High-level Architecture • Summary & Observations
Project Background • Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch • Decision to use Wikipedia as a convenient data set for testing purposes
Indexing • Wikipedia provides content in a series of large XML files • Amazon CloudSearch ingests XML in a specified form • Various content processing tasks to perform • Splitting into individual documents • Date normalization • Metadata extraction & mapping • Cleanup, etc. • We used Aspire for these tasks
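The project used Aspire for these steps. Purely as an illustration of the same tasks, here is a Python stand-in (not Aspire, and not the project's code) that splits a dump into pages, normalizes the timestamp, and maps metadata to fields. The sample is a simplified, made-up fragment; real dump files also carry an XML namespace on every tag. The field names `title`, `body`, `year`, `updated` and the `wp` id prefix are assumptions.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Simplified, made-up fragment in the general shape of a Wikipedia dump.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>London</title>
    <id>17867</id>
    <revision>
      <timestamp>2012-05-01T12:30:00Z</timestamp>
      <text>  London is the capital of England.  </text>
    </revision>
  </page>
  <page>
    <title>Search engine</title>
    <id>4059023</id>
    <revision>
      <timestamp>2012-04-15T08:00:00Z</timestamp>
      <text>A search engine retrieves information.</text>
    </revision>
  </page>
</mediawiki>"""

def normalize_date(timestamp):
    """Turn an ISO-8601 UTC timestamp into (year, epoch seconds)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.year, int(parsed.timestamp())

def pages_to_sdf(stream, version=1):
    """Yield one 'add' operation per <page>, parsing the stream incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "page":
            continue
        year, updated = normalize_date(elem.findtext("revision/timestamp"))
        yield {
            "type": "add",
            "id": "wp" + elem.findtext("id"),
            "version": version,
            "lang": "en",
            "fields": {
                "title": elem.findtext("title"),
                "body": (elem.findtext("revision/text") or "").strip(),  # cleanup
                "year": year,
                "updated": updated,
            },
        }
        elem.clear()  # release the page's subtree once it has been converted

docs = list(pages_to_sdf(io.BytesIO(SAMPLE_DUMP)))
for doc in docs:
    print(doc["id"], doc["fields"]["title"], doc["fields"]["year"])
```

Using `iterparse` keeps memory use flat, which matters when the input is a series of large files rather than this two-page sample.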
Aspire in Brief • Based on Apache Felix / OSGi • Thread-safe, multi-threaded, distributable • Any number of pipelines, conditional branching • Plug-in components individually testable & upgradable • In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA. • Tested with Elasticsearch and SP 2013
Indexing • Streaming Wikipedia Dump Files directly into CloudSearch • 500 docs/second achieved without much effort • Using 4 x XL instances of CloudSearch • 1 x XL EC2 instance for Aspire
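Streaming documents into the document service means grouping a continuous flow of operations into batches. The sketch below shows one way to do that in Python; it is not how Aspire does it. The `max_bytes` limit is a caller-supplied assumption (check the service's documented batch size limit), and the sample operations are made up.

```python
import json

def batches(operations, max_bytes):
    """Group an iterable of SDF operations into JSON array strings of at most max_bytes.

    A single operation larger than max_bytes is still emitted, alone in its own batch.
    """
    current, size = [], 2  # 2 bytes for the surrounding "[" and "]"
    for op in operations:
        encoded = json.dumps(op)
        op_size = len(encoded.encode("utf-8")) + 1  # +1 for the separating comma
        if current and size + op_size > max_bytes:
            yield "[" + ",".join(current) + "]"
            current, size = [], 2
        current.append(encoded)
        size += op_size
    if current:
        yield "[" + ",".join(current) + "]"

# Made-up operations standing in for converted Wikipedia pages.
ops = [{"type": "add", "id": "doc%d" % i, "version": 1, "lang": "en",
        "fields": {"title": "Title %d" % i}} for i in range(10)]

out = list(batches(ops, max_bytes=200))
for body in out:
    print(len(body.encode("utf-8")), "bytes,", len(json.loads(body)), "operations")
```

Because `operations` can be a generator, batches are produced as documents arrive, without holding the whole data set in memory.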
Searching • Amazon CloudSearch provides a RESTful/XML interface for search purposes • For the Wikipedia project, we needed a UI • Chose to use Twigkit • Wrote a Java API for CloudSearch • The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
Searching • Supports navigators and relevancy customization • E.g. a “PageRank” style link analysis was performed • Limits set high: E.g. retrieve 500,000 results in a single list, delivered in just a few seconds • Hugely useful for analysis applications • So, what does it look like?