Practical Query Governance & Data Security UPDATED! TCC 2013 akrinsky@tableausoftware.com
Agenda Terms Security filtering Managing big data Drill-down design patterns Limiter design patterns
Terms Data Governance Data Security Query Governance
“Data Governance” • “The discipline embodies a convergence of data quality, data management, data policies, business process management, and risk management surrounding the handling of data in an organization.” • Wikipedia
In-scope • Data Security • Authentication • Data encryption • Row-level security • Query Governance
Data Security • Data encryption • Use the database: Tableau will not decrypt • Tableau provides two methods for authenticating to Server • AD security • Local security • v8 introduced datasource filters • Immutable • Table row-level filters on saved datasources • Security is enforced through SQL WHERE clauses: it must be modeled in the data
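The "security enforced through SQL WHERE clauses" point can be sketched in plain SQL. Below is a minimal, hypothetical Python/sqlite3 sketch (the table, column, and user names are made up for illustration) of binding row-level access to the session user's identity, the role USERNAME() plays inside a Tableau datasource filter:

```python
import sqlite3

# Hypothetical schema: a fact table plus an entitlement table that
# maps usernames to the regions they may see.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, sales INTEGER);
    INSERT INTO orders VALUES ('East', 100), ('West', 200), ('East', 50);
    CREATE TABLE entitlements (username TEXT, region TEXT);
    INSERT INTO entitlements VALUES ('alice', 'East'), ('bob', 'West');
""")

def secure_query(username):
    # The security rule is modeled in the query itself: fact rows are
    # inner-joined to the entitlement rows for this user only, so a
    # user with no entitlements sees no rows at all.
    return conn.execute(
        """SELECT o.region, SUM(o.sales)
           FROM orders o
           JOIN entitlements e ON e.region = o.region
           WHERE e.username = ?
           GROUP BY o.region""",
        (username,),
    ).fetchall()

print(secure_query("alice"))  # [('East', 150)]
print(secure_query("bob"))    # [('West', 200)]
```

The same shape carries over to the later "user security design pattern" slide: the filter lives in the join, not in the view, so an author cannot remove it from a workbook.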
Datasource Filters Like common filters, but applied directly to the datasource object. Useful: session security expressions can bind rules to user identities. Saved with the datasource. Can be published with the datasource.
Session security expressions • USERNAME() • Username of the logged-in user; otherwise the Windows username. Authors may impersonate other server users. • ISMEMBEROF(<string literal>) • FULLNAME() • USERDOMAIN() • Tableau Server domain or Windows domain • ISFULLNAME(<string literal>) • ISUSERNAME(<string literal>)
Data Server • Control for IT: • Users cannot override calculations • Users cannot edit joins or connection information • Users cannot write SQL • Users cannot alter or republish data sources • Authentication can be fixed or prompted • Flexibility for users • Can access remotely over HTTP • Can write new calculations • Can blend desktop data • No need to download data/extracts • No drivers to install • Can leverage power of Tableau Server
Group/Role vs. User Filtering Superstore Tables 1:0 or many Restrict to the regions the user manages: [Users].[Manager]=USERNAME() ISMEMBEROF() expects a STRING literal here (not a dynamic field reference), so THIS WILL NOT WORK: IsMemberOf([Users].[Region])
Dynamic group filtering does NOT work Does not work! Works!
Use IsMemberOf() for role-based permissions • ISMEMBEROF("<ROLE>") • Use group membership for capability/ROLE-based security: • ISMEMBEROF("HR") • ISMEMBEROF("FINANCE") • ISMEMBEROF("ADMIN")
Use username() for dynamic security <NAMEFIELD>=USERNAME() for user- or group-level dynamic security
User security design pattern (Diagram: inner-join the fact table, 1:many, to one or more security tables, Table 1 … Table N, on USERNAME.)
Blend possible, but not recommended • Filtering on secondary dimensions supported in v8. • But… • Defined per view: can be easily defeated or misapplied by the author. • Can explode the memory footprint if the security table is too large. (Diagram: fact table blended on USERNAME with a security table in another datasource.)
Impersonation different for published datasource Note that when publishing the datasource, Desktop user impersonation will stop working. To test different users, edit datasource connection.
Big Data: what can possibly go wrong? The data. An innocent Tableau user! Just drag the data source and…
When too much data becomes a problem • Reports take too long to render • Report interactions are too sluggish • Tableau (often Server) becomes unstable • Memory issues may be latent: you may not notice a problem until reports stop processing (spinners on one or more panes of a dashboard) • Long-running queries are outraging the DBA
What is a lot of marks? Tableau renders images fast. 1 million marks is fast, fluid, natural. 5 million? 1 trillion? Can you make sense of 1 million marks???
What is a lot of data? • Report complexity, query working-set size, and cache settings affect the memory footprint. • Render time is not always a good indicator of memory utilization (e.g. table calcs render fast). • v8.0 VizQL processes are 32-bit (2 GB or 4 GB per process). • 8.1 with 64-bit (16 TB) practically erases limits. • As a rule of thumb: • 100 MB is a lot of data for a view to process. • If caching is enabled, 10 MB per view adds up quickly.
Sizing query results Result size is a function of aggregation, data types, and the cross product of dimensions and rows. Database profiling tools can often tell you exactly how much data is being transferred/requested. Check the size/cardinality of dimensions using "describe field" -> "load"
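As a back-of-envelope illustration of sizing, a result's footprint is roughly rows times bytes per row. The per-type byte widths below are assumptions for illustration only; real widths vary by database, driver, and encoding:

```python
# Assumed per-value widths (hypothetical; check your driver's docs).
TYPE_BYTES = {"int": 8, "float": 8, "date": 8, "str(32)": 32}

def estimate_result_mb(row_count, column_types):
    """Rough estimate of a query result's size in MB."""
    row_bytes = sum(TYPE_BYTES[t] for t in column_types)
    return row_count * row_bytes / 1024 / 1024

# A 1M-row result with two dimensions and two measures:
mb = estimate_result_mb(1_000_000, ["str(32)", "date", "float", "float"])
print(f"{mb:.0f} MB")  # 53 MB -- the same order of magnitude as the
                       # 80 MB Client Statistics example on the next slide
```

This is only an order-of-magnitude check; the profiling tools mentioned above give the real number.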
Specifics: MSSQL "Client Statistics" Example: a 1-million-row fact table query = 80 MB
How to blow up a VizQL process? • You're rarely going to see it on million-row datasets, but as results get larger, it's possible… • Drag-drop a high-cardinality dimension in a DW • e.g. SKU, names or IDs • (Inadvertently) tell Tableau to NOT aggregate huge queries. • Blending, per se, will not constrain the size of the secondary query. • Table calculations are applied AFTER data is returned. • A "top N" filter on index() requires all rows to be retrieved FIRST.
Caution: your report may run fine in Desktop! Yes, memory-hogging reports may “just work” on your laptop. Desktop = dedicated server, per user. Desktop processing is easier: no HTML to render. Server does not always release memory (especially if “refresh less often” is enabled)
Use common sense Can you make sense of 1 million points? How many rows does your report require to render 1,000 marks? If queries require 10x as many rows, or more, you may have a report design issue. If your data set has 1 million rows or less, it probably doesn’t matter. If it’s more, consider defensive maneuvers.
Take precautions • Defensive data source definition • Be mindful of high-cardinality dimensions • Avoid index() filters (post query) • Avoid blends (post query) • Or keep secondary data sources small • By default, show no/few records • Use data source filters/parameters • Summary tables • Consider extracts • Use summary-detail design pattern • In-database query governance
Summary Tables • Summarize data you will query often using aggregate functions (SUM, AVG, MIN, MAX) and GROUP BY on dimensions. • Create summary tables in the database (materialized views make this easy) • Extracts are a great option (up to 1 billion rows) • Remember to use the “aggregate” option on creation. • You are generally “safe” with extracts (memory-backed file architecture)
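The summary-table recipe above can be sketched in SQL (here via Python/sqlite3; the table and column names are hypothetical): pre-aggregate with aggregate functions and GROUP BY so routine queries scan one row per dimension combination instead of one per transaction:

```python
import sqlite3

# Hypothetical fact table, summarized once so common queries stay small.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('East','A',10),('East','A',20),('East','B',5),
        ('West','A',7),('West','B',3),('West','B',9);
    -- The summary table: one row per (region, product) instead of
    -- one row per sale (in a real warehouse, a materialized view).
    CREATE TABLE sales_summary AS
        SELECT region, product,
               SUM(amount) AS total_amount,
               COUNT(*)    AS n_rows
        FROM sales
        GROUP BY region, product;
""")
print(conn.execute(
    "SELECT region, product, total_amount FROM sales_summary ORDER BY 1, 2"
).fetchall())
# [('East', 'A', 30.0), ('East', 'B', 5.0), ('West', 'A', 7.0), ('West', 'B', 12.0)]
```

An aggregated Tableau extract plays the same role: the "aggregate" option on creation performs this GROUP BY for you.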
Summary-Detail filter action Make sure to “Exclude all values” when not selected
Database Top N • Syntax varies by database • MySQL, Postgres, Vertica, Pivotal, Netezza syntax • SELECT column_name(s) FROM table_name LIMIT number • Oracle syntax • SELECT column_name(s) FROM table_name WHERE ROWNUM <= number; • SQL Server / MS Access / Excel syntax • SELECT TOP number column_name(s) FROM table_name;
Excel example SELECT TOP 1000
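The LIMIT flavor above can be exercised against any LIMIT-supporting database; here is a small sketch using sqlite3 (which ships with Python and uses the MySQL/Postgres-style LIMIT clause):

```python
import sqlite3

# Build a 10,000-row table, then fetch a capped result: only `LIMIT`
# rows ever leave the database, regardless of the table's size.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10_000)])

rows = conn.execute("SELECT id FROM t ORDER BY id LIMIT 1000").fetchall()
print(len(rows))  # 1000
```

Pushing the cap into the database like this is exactly what the in-Tableau Top N options on the next slide struggle to guarantee.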
Top N in Tableau • Simple T/F filter on the number of rows • Does not work! • Top N on a dimension • Works! • But not immutable • Sets & a T/F calculated field • Works! • Can bind to a datasource filter (immutable) • Using index() < top N • Works! • But does not filter.
But remember… order of evaluation (# = order of evaluation). Are we filtering on a pre-limited subset???
5. Where you would want a LIMIT function (if other filters exist)
4. Filter Shelf (no context): Top N on dim
3. Context Filters
2. Datasource Filters: Top N with sets
1. Custom SQL or DB view: Top N or LIMIT in SQL (where you can add one easily)
What about 100% Custom SQL filters? To get accurate results, you can't apply these filters AFTER the limit has been applied. OK if the limit is logical; less OK if it is arbitrary.
4. Filter Shelf (no context)
3. Context Filters
2. Datasource Filters
1. Custom SQL or DB view
A note about context filters… • By default, context filters write results to temp tables • With big data… writing temp tables can be slow (depending on the database) • On ODBC and some databases, you can disable temp table generation through the datasource XML. <customization name='CAP_CREATE_TEMP_TABLES' value='no' /> <customization name='CAP_SELECT_INTO' value='no' />
So… We can use Top N for governance, but then we won't necessarily get the right results. The methods that do get the right results are either defeatable by the author (not bound to the datasource, so not immutable) or do not actually limit rows (table calcs). Can you constrain your query non-arbitrarily?
Key insights so far… • Be thoughtful about exposing and authoring with large dimensions without summarization • Consider non-arbitrary filters/parameters (as opposed to an arbitrary Top N) to constrain dimensions to a sensible domain • Datasource parameters (new in 8.0) • Custom SQL, views • Stored procedures (new in 8.1) • SAP BW variables • Leave runaway query governance to the database itself, if your database supports it. • Then, detect when the database is truncating results and indicate it if necessary.
In-database query governance (# = order of evaluation)
5. In-database query governor for the Tableau user: truncates but does not introduce inaccuracies
4. Filter Shelf (no context)
3. Context Filters
2. Datasource Filters
1. Custom SQL or DB view
In-database query governors • Elephant in the room: • Tableau does not have a mid-tier query governor to wrap the final SQL with a LIMIT • And most, but not all, query governors are resource-constrained, not row-constrained • Bullet-proof, but the functionality does not exist for all databases, I think. • Initial SQL coming to more databases in 8.2 (supported in Teradata today) • Work with your DBA to set up resource limits • I welcome your feedback: akrinsky@tableausoftware.com