280 likes | 294 Views
James Taylor jtaylor@salesforce.com. Phoenix. We put the SQL back in NoSQL. Agenda. What is Phoenix? Why SQL? What is next? Q&A. Completed. What is Phoenix?. SQL layer on top of HBase Delivered as an embedded JDBC driver Targets low latency queries over HBase data
E N D
James Taylor jtaylor@salesforce.com Phoenix We put the SQL back in NoSQL
Agenda • What is Phoenix? • Why SQL? • What is next? • Q&A Completed
What is Phoenix? • SQL layer on top of HBase • Delivered as an embedded JDBC driver • Targets low latency queries over HBase data • Columns modeled as multi-part row key and key values • Versioned schema repository • Query engine transforms SQL into puts, delete, scans • Uses native HBase APIs instead of Map/Reduce • Brings the computation to the data: • Aggregate, insert, delete datathrough coprocessors • Push predicates through custom filters • 100% Java • Open source here: https://github.com/forcedotcom/phoenix Completed
Why SQL? • Broaden HBase adoption • Give folks an API they already know • Reduce the amount of code users need to write • SELECT TRUNC(date,'DAY’), AVG(cpu_usage) • FROM web_stat • WHERE domain LIKE 'Salesforce%’ • GROUP BY TRUNC(date,'DAY') • Performance optimizations transparent to the user • Aggregation • Stats gathering • Secondary indexing • Leverage existing tooling • SQL client Completed
But I can’t surface x,y,z in SQL… Completed
But I can’t surface x,y,z in SQL… Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • CREATE TABLE web_stat ( • domain VARCHAR NOT NULL, • feature VARCHAR NOT NULL, • date DATE NOT NULL, • usage BIGINT, • active_visitor INTEGER, • CONSTRAINT pk PRIMARY KEY (domain, feature, date) • ); Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang custom function Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang custom function • Derive class from ScalarFunction • Add annotation to define name, args, and types • Implement evaluate method • Register function • (blog on this coming soon: http://phoenix-hbase.blogspot.com/) Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Set CURRENT_SCN property on connection to earlier timestamp • Queries will see only rows before timestamp • Schema in-place at that point in time will be used Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Declare new new child entity as nested table • Prefix column qualifier of nested entities with: • table name + child primary key + child column name • Restrict join to be only through parent/child relation • Execute query by scanning nested child rows • TBD: https:/github.com/forcedotcom/phoenix/issues/19 Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • “Salt” row key on upsert by mod-ing with cluster size • Query for fully qualified key by inserting salt byte • Range scan by concatenating results of scan over all possible salt bytes • Or alternately • Define column used for hash to derive row key prefix • TBD: https://github.com/forcedotcom/phoenix/issues/74 Completed
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Surface the HBase put-and-increment functionality through the standard SQL sequence support • TBD: https://github.com/forcedotcom/phoenix/issues/18
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Support the standard SQL TABLESAMPLE clause • Implement filter that uses a skip next hint • Base next key on the table stats “guide posts” • TBD: https://github.com/forcedotcom/phoenix/issues/22
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Declare columns at query time
But I can’t surface x,y,z in SQL… • Define multi-part row keys • Implement my whizz-bang built-in function • Run snapshot in time queries • Nest child entities inside of a row • Prevent hot spotting on writes • Increment atomic counter • Sample table data • Declare columns at query time • SELECT col1,col2,col3 • FROM my_table(col2 VARCHAR, col3 INTEGER) • WHERE col3 > 10 • TBD: https://github.com/forcedotcom/phoenix/issues/9
Conclusion • Phoenix fits the 80/20 use case rule • Let us know what you’d like to see added • Get involved – we need your help! • Think about how your new feature can be surfaced in SQL
Thank you! Questions/comments?
Query Processing Product Metrics HTable ORG_ID DATE FEATURE Row Key TXNS Key Values IO_TIME RESPONSE_TIME • Scan • Start key: ORG_ID (:1) + DATE (:2) • End key: ORG_ID (:1) + DATE (:3) • Filter • Filter: IO_TIME > 100 • Aggregation • Intercepts scan on region server • Builds map of distinct FEATURE values • Returns one row per distinct group • Client does final merge SELECT feature, SUM(txns) FROM product_metrics WHERE org_id = :1 AND date >= :2 AND date <= :3 AND io_time > 100 GROUP BY feature
Phoenix Query Optimizations • Start/stop key of scan based on AND-ed columns • Through SUBSTR, ROUND, TRUNC, LIKE • Parallelized on client by chunking over start/stop key of scan • Aggregation on region-servers through coprocessor • Inline for GROUP BY over row key ordered columns • In memory map per group otherwise • WHERE clause executed through custom filters • Incremental evaluation with early termination • Evaluated through byte pointers • IN and OR over same column (in progress) • Becomes batched get or filter with next row hint • Top N queries (future) • Through coprocessor keeping top N rows • TABLESAMPLE (future) • Becomes filter with next row hint Completed
Phoenix Performance Completed