
Geoffrey Hendrey @ geoffhendrey



Presentation Transcript


  1. Geoffrey Hendrey @geoffhendrey Architecture for real-time ad-hoc query on distributed filesystems

  2. Motivation • Big Data is more opaque than small data • Spreadsheets choke • BI tools can’t scale • Small samples often fail to replicate issues • Engineers, data scientists, analysts need: • Faster “time to answer” on Big Data • Rapid “find, quantify, extract” • Solve “I don’t know what I don’t know” • This is NOT about looking up items in a product catalog (i.e. not a consumer search problem)

  3. Scaling search with classic sharding
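The slide title refers to the classic pattern of partitioning an index across shards and fanning each query out to all of them. A minimal sketch of that pattern follows, assuming documents are hash-partitioned by id and each shard holds a tiny in-memory inverted index. The names `shard_for`, `build_shards`, and `search` are illustrative assumptions and do not come from the talk.

```python
import zlib
from collections import defaultdict

def shard_for(doc_id: str, num_shards: int) -> int:
    """Deterministically route a document to a shard by hashing its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def build_shards(docs: dict, num_shards: int) -> list:
    """Build one tiny inverted index (term -> set of doc ids) per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        index = shards[shard_for(doc_id, num_shards)]
        for term in text.lower().split():
            index[term].add(doc_id)
    return shards

def search(shards: list, term: str) -> list:
    """Scatter the query to every shard, then gather and merge the hits."""
    hits = set()
    for index in shards:
        hits |= index.get(term.lower(), set())
    return sorted(hits)
```

Every query touches every shard, so the search cluster has to grow along with the data, which is the cost the following slides set out to avoid.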

  4. Classic “side system” approach • Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster • [Diagram labels: Hadoop, Search Cluster, “?????”]

  5. Classic “search toolkit” • Built around fulltext use case • Inverted indexes optimized for on-the-fly ranking of results • TF-IDF • Okapi BM25 • Yet never able to fully realize Google-style search capability • Issues: • Phrase detection • Pseudo synonymy • Open loop architecture

  6. Big data ad-hoc query • Not typically a fulltext “document search” problem • Data is structured, mixed structured, and denormalized • Log lines • JSON records • CSV files • Hadoop native formats (SequenceFile) • Ranking is explicit (ORDER BY), not relevance based • Sometimes “needle in haystack” (support, debugging) • Sometimes “haystack in haystack” (summary analytics, segmentation)

  7. Dremel MPP query execution tree

  8. Finer points of Dremel architecture • MapReduce friendly • In-situ approach is DFS friendly • Excels at aggregation. Not so much for needle-in-haystack. • Column storage format accelerates MapReduce (less extraneous data pushed through) • But in some regards still a “side system” • Applications must explicitly store their data in a columnar format • “Massive” is both a benefit and a hazard • Complex (operationally and WRT query execution) • Queries can execute quickly…on huge clusters
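The point about column storage pushing less extraneous data through an aggregation can be illustrated with a small sketch. It assumes a plain in-memory layout (a dict of lists) and rows that all share the same keys; it is not Dremel's actual storage format, and `to_columns` and `sum_column` are hypothetical names.

```python
def to_columns(rows: list) -> dict:
    """Pivot row-oriented records (list of dicts) into column-oriented lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def sum_column(columns: dict, name: str):
    """Aggregate one column; the other columns are never read."""
    return sum(columns[name])
```

With row storage, summing one field means reading every field of every record. With the pivoted layout only the requested column is scanned, which is also why applications have to store their data in that layout up front.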

  9. Crawled In-Situ Index Architecture • [Diagram labels: Hadoop Data Crawl Application HDFS SimpleSearch MapReduce In-situ Index]

  10. Benefits of a crawled in-situ index • No changes to application data format • CSV • JSON • SequenceFile • Clear “separation of concerns” between data and index • Indexes become “disposable”: easily built, easily thrown away • There is no “side system” that needs to be maintained • Use the MapReduce “hammer” to pound a nail

  11. Architect for Elasticity • [Diagram labels: Crawl Application AWS S3 EC2 M1.large Elastic MapReduce JetS3t HTTP Index] • Interesting: you don’t actually need to have Hadoop installed…

  12. Declarative Crawl Indexing • { "filter":"column[4]==\"athens\"" } • [Diagram labels: Hadoop Data Crawl Application Parse.json HDFS SimpleSearch MapReduce In-situ Index] • Indexer reads declarative instructions from in-situ file • “pull” vs. traditional “push” indexing approach
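A minimal sketch of how a crawler might act on a declarative instruction like the one on this slide. The talk does not describe the filter grammar or the indexer's API, so everything here is an assumption: only the single form `column[N]=="value"` is recognized, the data is taken to be CSV, and `compile_filter` and `crawl` are hypothetical names.

```python
import csv
import io
import json
import re

# Only the single form shown on the slide is handled: column[N]=="value"
_FILTER_RE = re.compile(r'^\s*column\[(\d+)\]\s*==\s*"([^"]*)"\s*$')

def compile_filter(instructions_json: str):
    """Turn the declarative instruction into a row predicate."""
    expr = json.loads(instructions_json)["filter"]
    match = _FILTER_RE.match(expr)
    if match is None:
        raise ValueError(f"unsupported filter expression: {expr!r}")
    col, wanted = int(match.group(1)), match.group(2)
    return lambda row: len(row) > col and row[col] == wanted

def crawl(csv_text: str, instructions_json: str) -> list:
    """Return the zero-based row numbers of CSV records that pass the filter."""
    keep = compile_filter(instructions_json)
    reader = csv.reader(io.StringIO(csv_text))
    return [i for i, row in enumerate(reader) if keep(row)]
```

Because the instruction is a file sitting next to the data, the crawler pulls it at index time. The application never has to push records into a search system.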

  13. Thin index • Index size is small because data is a holistic part of the system • Data does not need to be “put into” the search system and replicated in the index • [Diagram labels: Data Crawl MapReduce HDFS Data Index In-situ Index]
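One way to picture a thin index is as a map from values to positions in the original file, so that records are read from the data itself rather than from a copy held in the index. The sketch below assumes delimited text and byte offsets as the pointer type; the talk does not specify the real index layout, and `build_thin_index` and `fetch` are hypothetical names.

```python
from collections import defaultdict

def build_thin_index(data: bytes, column: int, delimiter: bytes = b",") -> dict:
    """Map each value of one column to the byte offsets of matching lines."""
    index = defaultdict(list)
    offset = 0
    for line in data.splitlines(keepends=True):
        fields = line.rstrip(b"\r\n").split(delimiter)
        if column < len(fields):
            index[fields[column]].append(offset)
        offset += len(line)
    return dict(index)

def fetch(data: bytes, offset: int) -> bytes:
    """Read one record from the original data at a stored offset."""
    end = data.find(b"\n", offset)
    record = data[offset:] if end == -1 else data[offset:end]
    return record.rstrip(b"\r")
```

The index holds only keys and offsets, so it stays small relative to the data and can be thrown away and rebuilt by another crawl.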

  14. Lazy data loading • [Diagram labels: Data Crawl Execution Runtime HDFS Lazy Pull MapReduce Data LRU Index Cache Index Lazy Pull]
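The diagram pairs a “lazy pull” with an LRU index cache. A minimal sketch of that combination, assuming index segments are fetched by key through a caller-supplied `loader` function. The class name `LazyLRUCache` and the loader interface are illustrative assumptions, not the product's API.

```python
from collections import OrderedDict

class LazyLRUCache:
    """Load index segments on first use and evict the least recently used."""

    def __init__(self, loader, capacity: int):
        self._loader = loader        # called only on a cache miss
        self._capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key in self._items:
            self._items.move_to_end(key)      # mark as most recently used
            return self._items[key]
        value = self._loader(key)             # lazy pull from storage
        self._items[key] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)   # evict least recently used
        return value
```

Nothing is loaded until a query asks for it, and memory use is bounded by `capacity` regardless of how much index exists on the filesystem.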

  15. Column Oriented Approach

  16. Contact Info • Email: geoff@vertascale.com • Private Beta: http://vertascale.com
