Practical Query Governance & Data Security UPDATED! TCC 2013 akrinsky@tableausoftware.com
Agenda Terms Security filtering Managing big data Drill-down design patterns Limiter design patterns
Terms Data Governance Data Security Query Governance
“Data Governance” • “The discipline embodies a convergence of data quality, data management, data policies, business process management, and risk management surrounding the handling of data in an organization.” • Wikipedia
In-scope • Data Security • Authentication • Data encryption • Row-level security • Query Governance
Data Security • Data encryption • Use the database: Tableau will not decrypt • Tableau provides two methods for authenticating to Server • AD security • Local security • v8 introduced datasource filters • Immutable • Table row-level filters on saved datasources • Security is enforced through SQL WHERE clauses: it must be modeled in the data
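The "security enforced through SQL WHERE clauses" point can be sketched in plain SQL. Below is a minimal, hypothetical Python/sqlite3 sketch (the table, column, and user names are made up for illustration) of binding row-level access to the session user's identity, the role USERNAME() plays inside a Tableau datasource filter:

```python
import sqlite3

# Hypothetical schema: a fact table plus an entitlement table that
# maps usernames to the regions they may see.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, sales INTEGER);
    INSERT INTO orders VALUES ('East', 100), ('West', 200), ('East', 50);
    CREATE TABLE entitlements (username TEXT, region TEXT);
    INSERT INTO entitlements VALUES ('alice', 'East'), ('bob', 'West');
""")

def secure_query(username):
    # The security rule is modeled in the query itself: fact rows are
    # inner-joined to the entitlement rows for this user only, so a
    # user with no entitlements sees no rows at all.
    return conn.execute(
        """SELECT o.region, SUM(o.sales)
           FROM orders o
           JOIN entitlements e ON e.region = o.region
           WHERE e.username = ?
           GROUP BY o.region""",
        (username,),
    ).fetchall()

print(secure_query("alice"))  # [('East', 150)]
print(secure_query("bob"))    # [('West', 200)]
```

The same shape carries over to the later "user security design pattern" slide: the filter lives in the join, not in the view, so an author cannot remove it from a workbook.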
Datasource Filters Like common filters, but applied directly to the datasource object. Useful: session security expressions can bind rules to user identities. Saved with the datasource. Can be published with the datasource.
Session security expressions • USERNAME() • Username of the logged-in user; otherwise the Windows username. Authors may impersonate other server users. • ISMEMBEROF(<string literal>) • FULLNAME() • USERDOMAIN() • Tableau Server domain or Windows domain • ISFULLNAME(<string literal>) • ISUSERNAME(<string literal>)
Data Server • Control for IT: • Users cannot override calculations • Users cannot edit joins or connection information • Users cannot write SQL • Users cannot alter or republish data sources • Authentication can be fixed or prompted • Flexibility for users • Can access remotely over HTTP • Can write new calculations • Can blend desktop data • No need to download data/extracts • No drivers to install • Can leverage power of Tableau Server
Group/Role vs. User Filtering Superstore Tables 1:0 or many Restrict to the regions the user manages: [Users].[Manager]=USERNAME() ISMEMBEROF() expects a STRING literal here (not a dynamic field reference), so THIS WILL NOT WORK: IsMemberOf([Users].[Region])
Dynamic group filtering does NOT work Does not work! Works!
Use IsMemberOf() for role-based permissions • ISMEMBEROF("<ROLE>") • Use group membership for capability/ROLE-based security: • ISMEMBEROF("HR") • ISMEMBEROF("FINANCE") • ISMEMBEROF("ADMIN")
Use username() for dynamic security <NAMEFIELD>=USERNAME() for user- or group-level dynamic security
User security design pattern (Diagram: inner-join the fact table, 1:many, to one or more security tables, Table 1 … Table N, on USERNAME.)
Blend possible, but not recommended • Filtering on secondary dimensions supported in v8. • But… • Defined per view: can be easily defeated or misapplied by the author. • Can explode the memory footprint if the security table is too large. (Diagram: fact table blended on USERNAME with a security table in another datasource.)
Impersonation different for published datasource Note that when publishing the datasource, Desktop user impersonation will stop working. To test different users, edit datasource connection.
Big Data: what can possibly go wrong? The data. An innocent Tableau user! Just drag the data source and…
When too much data becomes a problem • Reports take too long to render • Report interactions are too sluggish • Tableau (often Server) becomes unstable • Memory issues may be latent: you may not notice a problem until reports stop processing (spinners on one or more panes of a dashboard) • Long-running queries are outraging the DBA
What is a lot of marks? Tableau renders images fast. 1 million marks is fast, fluid, natural. 5 million? 1 trillion? Can you make sense of 1 million marks???
What is a lot of data? • Report complexity, query working-set size, and cache settings affect the memory footprint. • Render time is not always a good indicator of memory utilization (e.g. table calcs render fast). • v8.0 VizQL processes are 32-bit (2 GB or 4 GB per process). • 8.1 with 64-bit (16 TB) practically erases limits. • As a rule of thumb: • 100 MB is a lot of data for a view to process. • If caching is enabled, 10 MB per view adds up quickly.
Sizing query results Result size is a function of aggregation, data types, and the cross product of dimensions and rows. Database profiling tools can often tell you exactly how much data is being transferred/requested. Check the size/cardinality of dimensions using "describe field" -> "load"
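As a back-of-envelope illustration of sizing, a result's footprint is roughly rows times bytes per row. The per-type byte widths below are assumptions for illustration only; real widths vary by database, driver, and encoding:

```python
# Assumed per-value widths (hypothetical; check your driver's docs).
TYPE_BYTES = {"int": 8, "float": 8, "date": 8, "str(32)": 32}

def estimate_result_mb(row_count, column_types):
    """Rough estimate of a query result's size in MB."""
    row_bytes = sum(TYPE_BYTES[t] for t in column_types)
    return row_count * row_bytes / 1024 / 1024

# A 1M-row result with two dimensions and two measures:
mb = estimate_result_mb(1_000_000, ["str(32)", "date", "float", "float"])
print(f"{mb:.0f} MB")  # 53 MB -- the same order of magnitude as the
                       # 80 MB Client Statistics example on the next slide
```

This is only an order-of-magnitude check; the profiling tools mentioned above give the real number.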
Specifics: MSSQL "Client Statistics" Example: a 1-million-row fact table query = 80 MB
How to blow up a VizQL process? • You're rarely going to see it on million-row datasets, but as results get larger, it's possible… • Drag-drop a high-cardinality dimension in a DW • e.g. SKU, names or IDs • (Inadvertently) tell Tableau to NOT aggregate huge queries. • Blending, per se, will not constrain the size of the secondary query. • Table calculations are applied AFTER data is returned. • A "top N" filter on index() requires all rows to be retrieved FIRST.
Caution: your report may run fine in Desktop! Yes, memory-hogging reports may “just work” on your laptop. Desktop = dedicated server, per user. Desktop processing is easier: no HTML to render. Server does not always release memory (especially if “refresh less often” is enabled)
Use common sense Can you make sense of 1 million points? How many rows does your report require to render 1,000 marks? If queries require 10x as many rows, or more, you may have a report design issue. If your data set has 1 million rows or less, it probably doesn’t matter. If it’s more, consider defensive maneuvers.
Take precautions • Defensive data source definition • Be mindful of high-cardinality dimensions • Avoid index() filters (post query) • Avoid blends (post query) • Or keep secondary data sources small • By default, show no/few records • Use data source filters/parameters • Summary tables • Consider extracts • Use summary-detail design pattern • In-database query governance
Summary Tables • Summarize data you will query often using aggregate functions (SUM, AVG, MIN, MAX) and GROUP BY on dimensions. • Create summary tables in the database (materialized views make this easy) • Extracts are a great option (up to 1 billion rows) • Remember to use the “aggregate” option on creation. • You are generally “safe” with extracts (memory-backed file architecture)
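The summary-table recipe above can be sketched in SQL (here via Python/sqlite3; the table and column names are hypothetical): pre-aggregate with aggregate functions and GROUP BY so routine queries scan one row per dimension combination instead of one per transaction:

```python
import sqlite3

# Hypothetical fact table, summarized once so common queries stay small.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('East','A',10),('East','A',20),('East','B',5),
        ('West','A',7),('West','B',3),('West','B',9);
    -- The summary table: one row per (region, product) instead of
    -- one row per sale (in a real warehouse, a materialized view).
    CREATE TABLE sales_summary AS
        SELECT region, product,
               SUM(amount) AS total_amount,
               COUNT(*)    AS n_rows
        FROM sales
        GROUP BY region, product;
""")
print(conn.execute(
    "SELECT region, product, total_amount FROM sales_summary ORDER BY 1, 2"
).fetchall())
# [('East', 'A', 30.0), ('East', 'B', 5.0), ('West', 'A', 7.0), ('West', 'B', 12.0)]
```

An aggregated Tableau extract plays the same role: the "aggregate" option on creation performs this GROUP BY for you.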
Summary-Detail filter action Make sure to “Exclude all values” when not selected
Database Top N • Syntax varies by database • MySQL, Postgres, Vertica, Pivotal, Netezza syntax • SELECT column_name(s) FROM table_name LIMIT number • Oracle syntax • SELECT column_name(s) FROM table_name WHERE ROWNUM <= number; • SQL Server / MS Access / Excel syntax • SELECT TOP number column_name(s) FROM table_name;
Excel example SELECT TOP 1000
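The LIMIT flavor above can be exercised against any LIMIT-supporting database; here is a small sketch using sqlite3 (which ships with Python and uses the MySQL/Postgres-style LIMIT clause):

```python
import sqlite3

# Build a 10,000-row table, then fetch a capped result: only `LIMIT`
# rows ever leave the database, regardless of the table's size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

rows = conn.execute("SELECT id FROM t ORDER BY id LIMIT 1000").fetchall()
print(len(rows))  # 1000
```

Pushing the cap into the database like this is exactly what the in-Tableau Top N options on the next slide struggle to guarantee.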
Top N in Tableau • Simple T/F filter on the number of rows • Does not work! • Top N on a dimension • Works! • But not immutable • Sets & a T/F calculated field • Works! • Can bind to a datasource filter (immutable) • Using index() < top N • Works! • But does not filter.
But remember… order of evaluation (# = order of evaluation). Are we filtering on a pre-limited subset???
5. Where you would want a LIMIT function (if other filters exist)
4. Filter Shelf (no context): Top N on dim
3. Context Filters
2. Datasource Filters: Top N with sets
1. Custom SQL or DB view: Top N or LIMIT in SQL (where you can add one easily)
What about 100% Custom SQL filters? To get accurate results, you can't apply these filters AFTER the limit has been applied. OK if the limit is logical; less OK if it is arbitrary.
4. Filter Shelf (no context)
3. Context Filters
2. Datasource Filters
1. Custom SQL or DB view
A note about context filters… • By default, context filters write results to temp tables • With big data… writing temp tables can be slow (depending on the database) • On ODBC and some databases, you can disable temp table generation through the datasource XML. <customization name='CAP_CREATE_TEMP_TABLES' value='no' /> <customization name='CAP_SELECT_INTO' value='no' />
So… We can use Top N for governance, but then we won't necessarily get the right results. The methods that do get the right results are either defeatable by the author (not bound to the datasource, so not immutable) or do not actually limit rows (table calcs). Can you constrain your query non-arbitrarily?
Key insights so far… • Be thoughtful about exposing and authoring with large dimensions without summarization • Consider non-arbitrary filters/parameters (as opposed to an arbitrary Top N) to constrain dimensions to a sensible domain • Datasource parameters (new in 8.0) • Custom SQL, views • Stored procedures (new in 8.1) • SAP BW variables • Leave runaway query governance to the database itself, if your database supports it. • Then, detect when the database is truncating results and indicate it if necessary.
In-database query governance (# = order of evaluation)
5. In-database query governor for the Tableau user: truncates but does not introduce inaccuracies
4. Filter Shelf (no context)
3. Context Filters
2. Datasource Filters
1. Custom SQL or DB view
In-database query governors • Elephant in the room: • Tableau does not have a mid-tier query governor to wrap the final SQL with a LIMIT • And most, but not all, query governors are resource-constrained, not row-constrained • Bullet-proof, but the functionality does not exist for all databases, I think. • Initial SQL coming to more databases in 8.2 (supported in Teradata today) • Work with your DBA to set up resource limits • I welcome your feedback: akrinsky@tableausoftware.com