Architecture for real-time ad-hoc query on distributed filesystems
Geoffrey Hendrey @geoffhendrey
Motivation
• Big Data is more opaque than small data
  • Spreadsheets choke
  • BI tools can’t scale
  • Small samples often fail to replicate issues
• Engineers, data scientists, and analysts need:
  • Faster “time to answer” on Big Data
  • Rapid “find, quantify, extract”
  • To solve “I don’t know what I don’t know”
• This is NOT about looking up items in a product catalog (i.e., not a consumer search problem)
Classic “side system” approach
• Definition of KLUDGE: “a system and especially a computer system made up of poorly matched components” –Merriam-Webster
[Diagram: a Hadoop cluster connected to a separate search cluster by an unknown (“?????”) link]
Classic “search toolkit”
• Built around the fulltext use case
• Inverted indexes optimized for on-the-fly ranking of results
  • TF-IDF
  • Okapi BM25
• Yet never able to fully realize Google-style search capability
• Issues:
  • Phrase detection
  • Pseudo-synonymy
  • Open-loop architecture
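For context on what these inverted indexes are tuned for, here is a minimal TF-IDF scorer over a toy corpus of log lines. This is an illustrative sketch only (the corpus and the +1 smoothing constant are assumptions, not anything from the toolkits above); it shows the kind of on-the-fly relevance computation that ad-hoc structured query does not actually need:

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Score one term in one document with plain TF-IDF.

    corpus is a list of token lists; doc_tokens is one of them.
    """
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)          # document frequency
    idf = math.log(len(corpus) / (1 + df))            # +1 avoids div-by-zero
    return tf * idf

# Toy "documents" (log lines), invented for illustration.
corpus = [
    "error connecting to database".split(),
    "user logged in".split(),
    "database timeout error".split(),
]
score = tf_idf("timeout", corpus[2], corpus)          # rare term, positive score
```

A fulltext engine computes such scores per query to rank results; BM25 refines the same idea with length normalization and saturation.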
Big Data ad-hoc query
• Not typically a fulltext “document search” problem
• Data is structured, mixed-structure, and denormalized:
  • Log lines
  • JSON records
  • CSV files
  • Hadoop native formats (SequenceFile)
• Ranking is explicit (ORDER BY), not relevance-based
• Sometimes “needle in haystack” (support, debugging)
• Sometimes “haystack in haystack” (summary analytics, segmentation)
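The contrast between explicit and relevance-based ranking can be sketched in a few lines of Python; the CSV fields and values here are invented for illustration:

```python
import csv, io

# Hypothetical CSV log data (field names are assumptions, not real data).
raw = """host,status,latency_ms
web1,200,12
web2,500,340
web3,200,8
web1,500,1200
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# "Needle in haystack": filter down to the error records.
errors = [r for r in rows if r["status"] == "500"]

# Explicit ranking (ORDER BY latency_ms DESC) -- a deterministic sort key,
# not a computed relevance score.
ranked = sorted(errors, key=lambda r: int(r["latency_ms"]), reverse=True)
```

The user states the ordering; nothing like TF-IDF is involved.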
Finer points of the Dremel architecture
• MapReduce-friendly
• In-situ approach is DFS-friendly
• Excels at aggregation; not so much at needle-in-haystack
• Column storage format accelerates MapReduce (less extraneous data pushed through)
• But in some regards still a “side system”:
  • Applications must explicitly store their data in a columnar format
• “Massive” is both a benefit and a hazard:
  • Complex (operationally and with respect to query execution)
  • Queries can execute quickly… on huge clusters
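A toy sketch of why columnar layout pushes less extraneous data through MapReduce: an aggregation touches only the columns it references. The layouts and field names below are illustrative, not Dremel's actual storage format:

```python
# Row layout: every whole record is read even if only one field is needed.
rows = [
    {"user": "alice", "country": "gr", "spend": 10},
    {"user": "bob",   "country": "us", "spend": 25},
    {"user": "carol", "country": "gr", "spend": 5},
]

# Column layout: each field stored contiguously; a query scans only
# the columns it references.
columns = {
    "user":    ["alice", "bob", "carol"],
    "country": ["gr", "us", "gr"],
    "spend":   [10, 25, 5],
}

# SELECT SUM(spend) WHERE country = 'gr' reads two columns, not three.
total = sum(s for c, s in zip(columns["country"], columns["spend"])
            if c == "gr")
```

The row scan must deserialize the unused "user" field for every record; the columnar scan never touches it.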
Crawled In-Situ Index Architecture
[Diagram: the Crawl Application reads Hadoop data on HDFS; a SimpleSearch MapReduce job builds an in-situ index alongside the data]
Benefits of a crawled in-situ index
• No changes to application data format:
  • CSV
  • JSON
  • SequenceFile
• Clear “separation of concerns” between data and index
• Indexes become “disposable”: easily built, easily thrown away
• There is no “side system” that needs to be maintained
• Use the MapReduce “hammer” to pound a nail
Architect for Elasticity
[Diagram: the Crawl Application indexes data in AWS S3 over HTTP via JetS3t, running Elastic MapReduce on EC2 m1.large instances]
• Interesting: you don’t actually need to have Hadoop installed…
Declarative Crawl Indexing
[Diagram: the Crawl Application reads Hadoop data on HDFS along with an in-situ instruction file (parse.json), e.g. { "filter": "column[4]==\"athens\"" }; a SimpleSearch MapReduce job builds the in-situ index]
• Indexer reads declarative instructions from an in-situ file
• “Pull” vs. traditional “push” indexing approach
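A minimal sketch of the “pull” model, assuming the parse.json instruction shown on the slide. Here eval() over a restricted namespace stands in for whatever expression compiler the real indexer uses, and the sample row is invented:

```python
import json

# Hypothetical in-situ instruction file, mirroring the slide's parse.json.
instructions = json.loads('{"filter": "column[4]==\\"athens\\""}')

def matches(column, expr):
    """Evaluate the declarative filter against one parsed row.

    Sketch only: a production indexer would compile the expression
    rather than eval() it, even with builtins stripped.
    """
    return eval(expr, {"__builtins__": {}}, {"column": column})

# Invented CSV-style row; the indexer decides per-row whether to index it.
row = ["2013-01-07", "GET", "/search", "200", "athens"]
should_index = matches(row, instructions["filter"])
```

The key inversion: the indexer pulls its configuration from a file sitting next to the data, rather than the application pushing records into a search system.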
Thin index
[Diagram: the crawl MapReduce job writes a small in-situ index next to the data on HDFS]
• Index size is small because the data is a holistic part of the system
• Data does not need to be “put into” the search system and replicated in the index
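One way to picture a thin index: store only (offset, length) pointers into the original file, never copies of the records. This is an illustrative sketch, not the actual SimpleSearch index format:

```python
from collections import defaultdict

def build_thin_index(path):
    """Map each token to the (byte offset, length) of lines containing it.

    The index holds only pointers into the data file, so it stays small
    and the raw file remains the single authoritative copy of the data.
    """
    index = defaultdict(list)
    with open(path, "rb") as f:
        offset = 0
        for line in f:
            for token in line.split():
                index[token].append((offset, len(line)))
            offset += len(line)
    return index

def lookup(path, index, token):
    """Lazily pull matching records back out of the original file."""
    results = []
    with open(path, "rb") as f:
        for offset, length in index.get(token, []):
            f.seek(offset)
            results.append(f.read(length))
    return results
```

Because only offsets are stored, throwing the index away loses nothing: it can always be rebuilt from the data, which is what makes it “disposable.”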
Lazy data loading
[Diagram: at query time, the execution runtime lazily pulls index segments from HDFS into an LRU index cache, and lazily pulls matching data via MapReduce]
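The lazy-pull-plus-LRU-cache idea can be sketched as follows. The hdfs_read callable is a stand-in for a real HDFS (or S3) ranged read so the sketch stays runnable without a cluster; the eviction policy is the generic LRU the slide names:

```python
from collections import OrderedDict

class LRUIndexCache:
    """Pull index segments on first use; evict the least recently used."""

    def __init__(self, hdfs_read, capacity=2):
        self.hdfs_read = hdfs_read   # injected fetch function (stand-in for HDFS)
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, segment):
        if segment in self.cache:
            self.cache.move_to_end(segment)   # mark as recently used
            return self.cache[segment]
        data = self.hdfs_read(segment)        # lazy pull on cache miss
        self.cache[segment] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return data
```

Nothing is loaded until a query actually touches it, so cold segments never consume memory, and repeated queries over the same segments hit the cache instead of HDFS.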
Contact Info
Email: geoff@vertascale.com
Private beta: http://vertascale.com