310 likes | 442 Views
Inside SQL Server Polybase. Bob Ward, Principal Architect, Microsoft. https://aka.ms/bobsql https://aka.ms/bobwardms https://aka.ms/bobsqldemos. Session learning objectives. At the end of this session, you should be better able to… Understand what is Polybase and how to use it.
E N D
Inside SQL Server Polybase Bob Ward, Principal Architect, Microsoft https://aka.ms/bobsql https://aka.ms/bobwardms https://aka.ms/bobsqldemos
Session learning objectives At the end of this session, you should be better able to… • Understand what is Polybase and how to use it. • Understand how Polybase works to know its capabilities for data virtualization • Learn how to use Polybase to build a data hub with SQL Server
“It’s all about Data Virtualization” What is SQL Server Polybase? Distributed compute engine integrated with SQL Server Query data where it lives using T-SQL Distributed, scalable query performance Manual/deploy with SQL Server Auto deploy/optimize with Big Data Clusters Analytics Apps T-SQL SQL Server PolyBase external tables NoSQL ODBC Relational databases Big data Excel Cosmos DB HDFS Intelligence over all data
The Journey to Data Virtualization in SQL Server 2019 Linux support is coming David Dewitt Rimma Nehme SQL Server 2019 is in Preview Some details subject to change
Using Polybase in SQL Server: T-SQL EXTERNAL TABLE Only for HDFS Login and password Not simple without BDC WWI SQL And join to any other table or external table metadata Results streamed HDFS INSERT only for HDFS Data lives here Cosmos DB
control and execution The SQL Server Polybase Architecture Data flow All editions EE only All editions “Compute” node “Head” node “Compute” node DW dbs DW dbs DW dbs SQL Engine SQL Engine SQL Engine Need more scale? Add compute nodes tempdb tempdb tempdb Polybase Engine Polybase Engine Polybase Engine ……... Polybase Data Movement Service Polybase Data Movement Service Polybase Data Movement Service Scan or pushdown Scale out with partitions shuffle mpdwsvc.exe Your data sources HDFS Cosmos DB
Polybase and Other Connectors \binn\Polybase\ODBC Drivers LOCATION “string” in EXTERNAL DATA SOURCE Built-in Connectors (use ODBC) sqlserver oracle teradata mongodb SQL Server, Azure SQL Database, Azure SQL Data Warehouse No client software install required Scale out with partitions MongoDB or CosmosDB (using MongoDB API) 64bit ODBC 3.0+ compliant ODBC Connector odbc You install the driver Ex. SAP HANA (HDBCODBC Driver)
External Tables vs Linked Servers * Insert into HDFS allowed
SQL Server 2019: Data Virtualization Modern StockItems Legacy Suppliers Mobile App Orders WWI SQL SQL Server 2019 WideWorldimporters Accounts Receivable Customers from Acquisition Order Reviews
Lessons Learned COLLATE required for character columns. Compat may be required ORACLE case sensitive for LOCATION = <instance>.<schema>.<table> LOCATION for SQL Server is <db>.<schema>.<table> EXTERNAL tables don’t support these types (they may be more) VARCHAR(MAX) GEOGRAPHY Computed Columns JSON MongoDB (CosmosDB) observations Be careful of types in your document LOCATION = <database>.<collection> Need to dive in EXTERNAL TABLE compatibility (Ex. Row Level Security)
SQL Server 2019 Big Data Clusters and Polybase The Problem Customers want to integrate with Big Data and other data sources easily Polybase is not simple to setup, configure, maintain, and scale elastically Customers may not have a Hadoop cluster or want to build one Polybase “classic” (MapReduce) could be better The Solution Kubernetes and containers to deploy and scale elastically Everything pre-installed including HDFS cluster Build a control plane to help manage and monitor Enhance SQL Server to read from HDFS natively Provide a data mart for cached results Introduce Spark and Notebooks for Data scientists
mpdwsvc.exe uses SQLPAL Polybase in SQL Server 2019 Big Data Clusters Control Plane • Controller Svc • Azure FSM Engine • Kibana • Grafana Customapps SQL BI Analytics • Elastic Search • InfluxDB • Configuration Store (SQL Server) Polybase head node In Linux container Cluster Polybase compute nodes In Linux containers SQL Server master instance Spark Spark Spark SQL Server SQL Server SQL Server Compute pool Compute pool Directly read from HDFS Compute pool • SQL Compute Node • SQL Compute Node • SQL Compute Node • SQL Compute Node • SQL Compute Node • … • HDFS Data Node • HDFS Data Node • HDFS Data Node “Built-in” Data Sources Storage pool Data mart External data sources • … MapReduce Not used • SQL Data Node • SQL Data Node • Storage • Storage Kubernetes pod IoT data Persistent storage Node Node Node Node Node Node Node
Storage and Data Pools Data Sources CREATE EXTERNAL DATA SOURCE SqlDataPool WITH (LOCATION = 'sqldatapool://service-mssql-controller:8080/datapools/default’); CREATE EXTERNAL TABLE… WITH ( DATA_SOURCE = SqlDataPool, DISTRIBUTION = ROUND_ROBIN ); CREATE EXTERNAL DATA SOURCE SqlStoragePool WITH (LOCATION = 'sqlhdfs://service-mssql-controller:8080’); CREATE EXTERNAL TABLE … WITH ( DATA_SOURCE = SqlStoragePool, LOCATION = '/clickstream_data', FILE_FORMAT = csv_file ); Preinstalled in model of Master Instance mssql-controller REST endpoint
Azure Data Studio and Data Virtualization Using PROSE for intelligent import and schema detection
Session takeaways • Polybase = Data Virtualization = Reduced Need for ETL • Polybase provides distributed read scale performance • Big Data Clusters automate the deployment of Polybase • Download and try it yourself • Sign up for EAP for SQL Server 2019 Big Data Clusters
Questions? aka.ms/SQLBits19
https://aka.ms/bobsql https://aka.ms/bobwardms https://aka.ms/bobsqldemos Session resources SQL Server 2019 Polybase documentation SQL Server Big Data Clusters documentation Polybase demos on GitHub Azure SQL Database Elastic Query documentation Loading data into Azure SQL Data Warehouse with Polybase