The Thinking Person’s Guide to Data Warehouse Design
Robin Schumacher, Director of Product Strategy
www.enterprisedb.com
Building a Logical Design • Transitioning to the Physical • Monitoring for Success
Building a Logical Design
Why Care About Design…?
What is the key component for success?* In other words, what you do with your PostgreSQL server, in terms of physical design, schema design, and performance design, will be the biggest factor in whether a BI system hits the mark…
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
The #1 Cause of Database Downtime…? Bad Design… (Source: Oracle Corporation)
Simple reporting databases [diagram: end users and application servers read from a reporting database, fed from the OLTP database via ETL/replication] “Just use the same design on a different box…”
Data Warehouse Horror Story Number One…
Datatypes are defined more generically, not targeted at a particular database engine, but still choose them carefully • Entities aren’t necessarily designed for performance • Redundancy is avoided, but simplicity is still a goal • Bottom line: make sure your data is correctly represented and easily understood (there is a new class of user today)
Manual horizontal partitioning: a modeling technique for overcoming large data volumes
Pros and cons of manual partitioning
Pros • Less I/O if the design holds up • Easy to prune obsolete data • Possibly less object contention
Cons • More tables to manage • More referential integrity to manage • More indexes to manage • Joins are often needed to satisfy query requests • A redesign is often needed, because the rows/columns you thought you’d be accessing together change; ad-hoc query traffic is hard to predict (a sketch of the technique follows)
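A minimal sketch of manual horizontal partitioning, assuming hypothetical yearly sales tables and a UNION ALL view that presents them as one logical table:

  -- Hypothetical example: one table per year, split by hand
  CREATE TABLE sales_2009 (sale_id bigint, sale_date date, amount numeric);
  CREATE TABLE sales_2010 (sale_id bigint, sale_date date, amount numeric);

  -- A view hides the split from query writers
  CREATE VIEW sales AS
    SELECT * FROM sales_2009
    UNION ALL
    SELECT * FROM sales_2010;

  -- Pruning a year: redefine the view without sales_2009, then
  DROP TABLE sales_2009;  -- far cheaper than a bulk DELETE

The cons above show up quickly: every new year means a new table, new indexes, and a view redefinition.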
Use a modeling tool • Don’t use third normal form • Partition manually, but… • Let the database do the heavy lifting
Transitioning to a Physical Design
How do I scale…? SQL or NoSQL…? Should I worry about high availability…? Index or not…? Is sharding a good idea…? How should I partition my data…? Row or column database…?
What technologies should you be looking at?*
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
Whether you choose NoSQL, sharding, MPP databases, or something similar, divide and conquer is your best friend • You can scale up and divide and conquer only to a point before you hit disk, memory, or other limitations • Scaling up and out together is the best future-proofing methodology
Row or Column-Based Database? A column-oriented architecture looks the same on the surface but stores data differently from legacy row-based databases…
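A hedged illustration of why the difference matters, using a hypothetical wide fact table: an analytic aggregate usually touches only a column or two.

  -- Hypothetical 50-column fact table
  SELECT SUM(amount)
  FROM   sales
  WHERE  sale_date >= DATE '2010-01-01';

  -- A row store reads every column of every qualifying row from disk;
  -- a column store reads only sale_date and amount, often a small
  -- fraction of the table's total I/O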
Example: Column DB vs. “Leading” row DB • InfiniDB took up 22% less space • InfiniDB loaded data 22% faster • InfiniDB total query times were 65% lower • InfiniDB average query times were 59% lower • Notice that the queries are not only faster but also more predictable * Tests run on a standalone machine: 16 CPUs, 16GB RAM, CentOS 5.4, with 2TB of raw data
Hybrid Row / Column Databases Some vendors now give you a choice of table format – row or column – based on expected access patterns.
What about NoSQL options? • Standard model is not relational • Typically don’t use SQL to access the data • Take up more space than column databases • Lack special optimizers / features to reduce I/O
What about NoSQL options? • They really are row-oriented architectures that store data in ‘column families’, which are expected to be accessed together (remember logical vertical partitioning?); individual columns cannot be accessed independently • Will be faster for individual insert and delete operations • Will normally be faster for single-row requests • Will lag in typical analytic / data warehouse use cases
PostgreSQL Specific - Partitioning • Main goal: reduce I/O via partitioning • Partitioning in PostgreSQL is somewhat more laborious than in other RDBMSs • Consider it when table size exceeds memory capacity • The partitioning key is ‘key’ for many reasons • We have seen > 90% response-time reductions when it is done right • Partitioning also makes pruning data more efficient than typical DELETE operations
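As a sketch of what ‘laborious’ means in practice: at this writing, PostgreSQL partitions through table inheritance plus CHECK constraints (the table and ranges below are hypothetical):

  CREATE TABLE measurements (
      logdate  date NOT NULL,
      reading  numeric
  );

  -- Child tables carry CHECK constraints that define each range
  CREATE TABLE measurements_2010q1 (
      CHECK (logdate >= DATE '2010-01-01' AND logdate < DATE '2010-04-01')
  ) INHERITS (measurements);

  CREATE TABLE measurements_2010q2 (
      CHECK (logdate >= DATE '2010-04-01' AND logdate < DATE '2010-07-01')
  ) INHERITS (measurements);

  -- With constraint exclusion on, the planner skips partitions whose
  -- CHECK constraints rule them out
  SET constraint_exclusion = partition;
  SELECT SUM(reading)
  FROM   measurements
  WHERE  logdate BETWEEN DATE '2010-02-01' AND DATE '2010-02-28';

Routing inserts to the correct child still requires a trigger or rule, which is a large part of the extra labor.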
PostgreSQL Specific - GridSQL • One option for a divide-and-conquer strategy • It does have limitations in its PostgreSQL feature and syntax support • One EnterpriseDB customer is running well with 8TB
What About Indexing? • If query patterns are known and predictable, and the data is relatively static, indexing isn’t that difficult • In a very ad-hoc environment, indexing becomes harder: analyze your SQL traffic and index the best you can • Over-indexing a table that is frequently loaded / refreshed / updated can severely impact load and DML performance. Test dropping and re-creating indexes versus doing in-place loads and DML (see the sketch below), but realize that any queries running meanwhile are impacted by the dropped indexes • Remember that a benefit of (most) column databases is that they do not need or use indexes
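A sketch of the drop-and-recreate test described above; the names are hypothetical, and remember that queries running during the load lose the index:

  -- Before the bulk load
  DROP INDEX IF EXISTS idx_sales_date;

  COPY sales FROM '/data/sales_batch.csv' WITH CSV;

  -- Rebuild afterward; compare the total elapsed time against
  -- loading with the index left in place
  CREATE INDEX idx_sales_date ON sales (sale_date);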
Optimizing for Data Loads • The two biggest killers of load performance are (1) very wide row-based tables and (2) many indexes / foreign keys on a table • Column-based tables typically load faster than row-based tables with load utilities, but they experience slower insert/delete rates • Move the data as close to the database as possible; having applications on remote machines manipulate data and send it across the wire a row at a time is perhaps the worst way to load data • It is often good to create staging tables, then use a procedural language to do data modifications and/or create flat files for high-speed loaders (see the staging sketch below) • PostgreSQL COPY is much faster than INSERT; EnterpriseDB’s EDB*Loader is faster than COPY
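A minimal sketch of the staging-table pattern (all names hypothetical): bulk-load raw rows with COPY, then clean and move them with one set-based statement instead of row-at-a-time traffic from a remote application.

  CREATE TABLE sales_staging (raw_date text, raw_amount text);

  -- High-speed bulk load into the staging table
  COPY sales_staging FROM '/data/sales_batch.csv' WITH CSV;

  -- One set-based statement to clean and move the data
  INSERT INTO sales (sale_date, amount)
  SELECT raw_date::date, raw_amount::numeric
  FROM   sales_staging
  WHERE  raw_amount ~ '^[0-9.]+$';  -- discard malformed rows

  TRUNCATE sales_staging;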
Optimizing for Data Loads • Increasing maintenance_work_mem to a large value (e.g., > 1GB) helps speed index and foreign key constraint creation • Turning autovacuum off can help speed load operations • Turning off synchronous commit (synchronous_commit) can improve load efficiency, but generally use it only for all-or-nothing load jobs • Minimizing checkpoint I/O is a good idea (set checkpoint_segments to 100-200 and checkpoint_timeout to a higher value, such as an hour)
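A hedged sketch of those settings: the first two can be set per load session, while the rest live in postgresql.conf (the values shown are the slide’s suggestions, not universal defaults):

  -- Per session, before the load
  SET maintenance_work_mem = '1GB';  -- speeds index and FK builds
  SET synchronous_commit = off;      -- only for all-or-nothing loads

  -- In postgresql.conf (requires a reload)
  --   autovacuum = off               # re-enable after the load
  --   checkpoint_segments = 100      # fewer, larger checkpoints
  --   checkpoint_timeout = 1h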
Monitoring and Tuning the Design
Bottleneck Analysis • Workload Analysis • Ratio Analysis
Bottleneck Analysis • The focus of this methodology is answering the question “What am I waiting on?” • With community PostgreSQL, unfortunately, it can be difficult to determine latency inside the database server • Lock contention is rarely an issue in data warehouses • You can use EnterpriseDB’s wait interface • Problems found in bottleneck analysis translate into better lock handling in the application, partitioning improvements, better indexing, or storage engine replacement
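On community PostgreSQL of this vintage, the closest thing to a wait interface is joining pg_stat_activity to pg_locks; a rough sketch (column names such as procpid and current_query vary across PostgreSQL versions):

  -- Sessions blocked waiting on a lock
  SELECT a.procpid, a.current_query, l.mode, l.granted
  FROM   pg_stat_activity a
  JOIN   pg_locks l ON l.pid = a.procpid
  WHERE  NOT l.granted;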
Workload Analysis • The focus of this methodology is answering three questions: (1) Who’s logged on? (2) What are they doing? (3) How is my machine handling it? • Monitor both active and inactive sessions; keep in mind that even idle connections take up resources • I/O and ‘hot objects’ are a key area of analysis • The key focus should be SQL statement monitoring and collection, something that goes beyond standard pre-production EXPLAIN analysis
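Two starting-point queries for those questions, using the standard PostgreSQL statistics views (column names vary by version):

  -- Who's logged on, and what are they doing?
  SELECT usename, client_addr, current_query
  FROM   pg_stat_activity;

  -- 'Hot objects': the tables doing the most physical reads
  SELECT relname, heap_blks_read, idx_blks_read
  FROM   pg_statio_user_tables
  ORDER  BY heap_blks_read DESC
  LIMIT  10;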
The Pain of Slow SQL*
* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009.
Data Warehouse Horror Story Number Two…
Workload Analysis • SQL analysis basically becomes bottleneck analysis, because you’re asking where your SQL statement is spending its time • Once you have collected and identified your ‘top SQL’, the next step is tracing and interrogating each SQL statement to understand its execution (see below) • Historical analysis matters too; a query that ran fine with 5 million rows may tank with 50 million, or with more concurrent users • Design changes usually involve data file striping, indexing, partitioning, or additional parallel processing
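The basic interrogation tool in PostgreSQL is EXPLAIN ANALYZE, which executes the statement and reports where the time goes (the query is hypothetical):

  EXPLAIN ANALYZE
  SELECT c.region, SUM(s.amount)
  FROM   sales s
  JOIN   customers c ON c.cust_id = s.cust_id
  WHERE  s.sale_date >= DATE '2010-01-01'
  GROUP  BY c.region;

Compare estimated versus actual row counts at each plan node; large gaps usually point to stale statistics or a design problem.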
Workload Analysis • The pgstatspack utility available to the community can provide some statistics for workload analysis • The pgfouine log analyzer tool can help you identify bad SQL • EnterpriseDB packages a utility that duplicates Oracle Automatic Workload Repository (AWR) reports, and shows hot objects, top wait events, and much more • A new SQL Profiler utility will be available from EnterpriseDB in the first half of 2011 that will help in tracing and analyzing SQL statement execution
Ratio Analysis • The least useful of the performance analysis methods • May be OK for a general rule of thumb about how various resources are being used • Do not be misled by ratios; for example, a high cache hit ratio is sometimes meaningless, since a database can be brought to its knees by excessive logical I/O • Design changes from ratios typically involve altering configuration parameters and sometimes indexing
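For what it is worth, the classic cache-hit ratio takes one query to compute; just remember the caveat above that a high value can hide excessive logical I/O:

  SELECT datname,
         round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 1)
           AS cache_hit_pct
  FROM   pg_stat_database;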
Conclusions
Conclusions • Design is the #1 contributor to the overall performance and availability of a system • With PostgreSQL, you have greater flexibility and opportunity than ever before to build well-designed data warehouses • With PostgreSQL, you now have more options and features available than ever before • All of this translates into data warehouses that are future-proof: they can run as fast as you’d like (hopefully) and store as much data as you need (ditto)
The Thinking Person’s Guide to Data Warehouse Design Thanks…! www.enterprisedb.com