James Taylor jtaylor@salesforce

James Taylor jtaylor@salesforce.com Phoenix We put the SQL back in NoSQL

Agenda • What is Phoenix? • Why SQL? • What is next? • Q&A Completed

What is Phoenix? • SQL layer on top of HBase • Delivered as an embedded JDBC driver • Targets low latency queries over HBase data • Columns modeled as multi-part row key and key values • Versioned schema repository • Query engine transforms SQL into puts, delete, scans • Uses native HBase APIs instead of Map/Reduce • Brings the computation to the data: • Aggregate, insert, delete datathrough coprocessors • Push predicates through custom filters • 100% Java • Open source here: https://github.com/forcedotcom/phoenix Completed

Why SQL? • Broaden HBase adoption • Give folks an API they already know • Reduce the amount of code users need to write • SELECT TRUNC(date,'DAY’), AVG(cpu_usage) • FROM web_stat • WHERE domain LIKE 'Salesforce%’ • GROUP BY TRUNC(date,'DAY') • Performance optimizations transparent to the user • Aggregation • Stats gathering • Secondary indexing • Leverage existing tooling • SQL client Completed

But I can’t surface x,y,z in SQL… Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • CREATE TABLE web_stat ( • domain VARCHAR NOT NULL, • feature VARCHAR NOT NULL, • date DATE NOT NULL, • usage BIGINT, • active_visitor INTEGER, • CONSTRAINT pk PRIMARY KEY (domain, feature, date) • ); Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang custom function Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang custom function • Derive class from ScalarFunction • Add annotation to define name, args, and types • Implement evaluate method • Register function • (blog on this coming soon: http://phoenix-hbase.blogspot.com/) Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Set CURRENT_SCN property on connection to earlier timestamp • Queries will see only rows before timestamp • Schema in-place at that point in time will be used Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Declare new new child entity as nested table • Prefix column qualifier of nested entities with: • table name + child primary key + child column name • Restrict join to be only through parent/child relation • Execute query by scanning nested child rows • TBD: https:/github.com/forcedotcom/phoenix/issues/19 Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • “Salt” row key on upsert by mod-ing with cluster size • Query for fully qualified key by inserting salt byte • Range scan by concatenating results of scan over all possible salt bytes • Or alternately • Define column used for hash to derive row key prefix • TBD: https://github.com/forcedotcom/phoenix/issues/74 Completed

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Surface the HBase put-and-increment functionality through the standard SQL sequence support • TBD: https://github.com/forcedotcom/phoenix/issues/18

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Support the standard SQL TABLESAMPLE clause • Implement filter that uses a skip next hint • Base next key on the table stats “guide posts” • TBD: https://github.com/forcedotcom/phoenix/issues/22

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Declare columns at query time

But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Declare columns at query time • SELECT col1,col2,col3 • FROM my_table(col2 VARCHAR, col3 INTEGER) • WHERE col3 > 10 • TBD: https://github.com/forcedotcom/phoenix/issues/9

Conclusion • Phoenix fits the 80/20 use case rule • Let us know what you’d like to see added • Get involved – we need your help! • Think about how your new feature can be surfaced in SQL

Thank you! Questions/comments?

Query Processing Product Metrics HTable ORG_ID DATE FEATURE Row Key TXNS Key Values IO_TIME RESPONSE_TIME • Scan • Start key: ORG_ID (:1) + DATE (:2) • End key: ORG_ID (:1) + DATE (:3) • Filter • Filter: IO_TIME > 100 • Aggregation • Intercepts scan on region server • Builds map of distinct FEATURE values • Returns one row per distinct group • Client does final merge SELECT feature, SUM(txns) FROM product_metrics WHERE org_id = :1 AND date >= :2 AND date <= :3 AND io_time > 100 GROUP BY feature

Phoenix Query Optimizations • Start/stop key of scan based on AND-ed columns • Through SUBSTR, ROUND, TRUNC, LIKE • Parallelized on client by chunking over start/stop key of scan • Aggregation on region-servers through coprocessor • Inline for GROUP BY over row key ordered columns • In memory map per group otherwise • WHERE clause executed through custom filters • Incremental evaluation with early termination • Evaluated through byte pointers • IN and OR over same column (in progress) • Becomes batched get or filter with next row hint • Top N queries (future) • Through coprocessor keeping top N rows • TABLESAMPLE (future) • Becomes filter with next row hint Completed

Phoenix Performance

Phoenix Performance Completed

James Taylor jtaylor@salesforce

James Taylor jtaylor@salesforce

Presentation Transcript

Taylor

James Taylor jtaylor@salesforce.com

James Taylor @ JamesPlusPlus http://phoenix- hbase.blogspot.com /

James Taylor

Salesforce

James Kast Andrew Kuharske Timothy Lenberg Daniel Taylor

Taylor

Taylor

By Harry James, James Blair, Alix Taylor, Rebecca Willder

By: Taylor Mara James Juel Max Smith

Rachel Cavin , Manager James Dodds , Assistant Manager Taylor Dorn, GIS Analyst

Salesforce

Rachel Cavin , Manager James Dodds , Assistant Manager Taylor Dorn, GIS Analyst

James Taylor Head of Policy

Salesforce

Taylor

Salesforce

Salesforce

salesforce

Salesforce

SALESFORCE

Salesforce