3 Database Design (for IQ-M)
Introduction • This section has been re-vamped for the 12.4.3 course • I have removed all the design bits that are not absolutely necessary for IQ • This is to allow me to add more information directly related to IQ • So this is not a Data Warehouse Design or Methodology course, I have colleagues that do it much better than me!
Aggregation • This was a dirty word in IQ-M • Not anymore • It is not sensible to trawl through 10 billion rows of data when a simple aggregation table can reduce the queried data to 10 thousand rows – or less • Even with IQ-M, using aggregation correctly can improve performance by as much as 100 times
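As a minimal sketch of the idea (sales_fact, sales_monthly_agg and their columns are hypothetical names), a summary table built once after each load can answer most reporting queries instead of the full fact table:

CREATE TABLE sales_monthly_agg (
    store_key    UNSIGNED INT    NOT NULL,
    month_key    UNSIGNED INT    NOT NULL,
    total_sales  NUMERIC(18,2)   NOT NULL,
    sale_count   UNSIGNED BIGINT NOT NULL
);

INSERT INTO sales_monthly_agg
SELECT store_key, month_key, SUM(sale_amount), COUNT(*)
FROM sales_fact
GROUP BY store_key, month_key;

-- Reports then read sales_monthly_agg (thousands of rows)
-- rather than sales_fact (billions of rows)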
A Statement Just because ASIQ-M is fast does not give us the right to generate slow databases or applications
Primary Key/Foreign Key • IQ-M does not enforce Primary Key/Foreign Key relationships – but it will do so soon (12.5, Q2 2002) • The optimiser does use the PK/FK relationship for query planning • Only specify this relationship if it really does exist in the data • Incorrect specification can result in bad query plans (performance degradation) and possibly outright errors • ASA does modify a join that is defined as PK/FK to an ANSI NATURAL join – this can cause problems with orphan rows
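A minimal sketch of the declarations (customer_dim and sales_fact are hypothetical names); the constraints are only declared because the relationship genuinely holds in the data:

CREATE TABLE customer_dim (
    customer_key   UNSIGNED INT NOT NULL,
    customer_name  VARCHAR(60),
    PRIMARY KEY (customer_key)              -- builds the specialised unique HG index
);

CREATE TABLE sales_fact (
    sale_key      UNSIGNED BIGINT NOT NULL,
    customer_key  UNSIGNED INT    NOT NULL,
    sale_amount   NUMERIC(18,2),
    PRIMARY KEY (sale_key),
    FOREIGN KEY (customer_key) REFERENCES customer_dim (customer_key)
);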
Key Specification • In a Data Warehouse the production key is not, generally, used as the warehouse key • It is better practice to use a generated (surrogate) key • Make this key an Unsigned INT or BIGINT • This is absolutely the most efficient key datatype in IQ-M
Primary Keys • In IQ-M a Primary Key is an ANSI standard Primary Key • It is UNIQUE • It must not be null • If specified as a table or column constraint then a specialised form of the HG index is created
Foreign Keys • Always generate an HG index on a Foreign Key • If the relationship is 1:1 then declare the Foreign Key column as UNIQUE • This will force auto-generation of a unique HG index • Again, try to specify join columns as Unsigned INT or BIGINT
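A sketch of the index DDL (table, column and index names are hypothetical; exact syntax may differ slightly between IQ versions):

-- Many:1 foreign key: build an HG index on the joining column
CREATE HG INDEX sales_customer_hg ON sales_fact (customer_key);

-- 1:1 relationship: declare the column UNIQUE instead,
-- which auto-generates a unique HG index
ALTER TABLE order_header ADD UNIQUE (invoice_key);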
A digression on Datatypes • There are some very important issues concerning datatypes • We have discussed the actions of the indexes – there are areas where an index can be forced to run slowly if the datatype is specified wrongly • Always consider the requirements for the datatype • Incorrect datatype specification is as bad as incorrect index selection
Signed vs. Unsigned • If you don't need signed data in an int or bigint – use UNSIGNED • This will speed up the accessing of the HNG index, sometimes doubling the performance • HNG stores negative data as 1s complement • This means SUM(), AVG() etc. run quickly • But range checks require another set of scans • If we stored as 2s complement then • Range checks would run with 1 scan • But SUM() and AVG() would be slower!!
Other Datatype Issues • Signed vs. Unsigned does not affect the other indexes to any great degree • But… • The selection of datatypes does • We have already discussed keys but some other areas are worth commenting on…
Long Varchar() - 1 • A long varchar() is defined as a varchar() with a length greater than 255. • If you can avoid this please try to • Only the FP and WD (word) indexes are allowed • No enumerated indexes or HNG • We have seen a number of customers who use varchar(1024) as Primary Keys • Please DO NOT DO THIS!!
Long Varchar() - 2 • Long varchar() columns are stored as 256 byte chunks, so using 4 bytes in a varchar(32000) only uses 256 bytes (one chunk) • By default these 256 byte chunks are memset (initialized to a constant value to improve compression) • There is an upgrade option to memset existing 12.4.0 varchar() – this is worth doing, if you have the time!
Char() vs. Varchar() • Always, if you can, use char() • Generally this will improve performance, at the modest cost of storing some small number of extra bytes • Query performance on retrieval of char() vs. varchar() indicates that there can be a 2-3% performance hit per column, and we have seen 10% degradation on single columns
Memset - 1 Initialize_Memory_To_Ones_On_Allocation • Some slides ago I mentioned memset • This is a function of IQ that allows varchar() – and other objects – to have their unused portions set to ones on initialization • This will improve the performance of the compression algorithms (usually) • It may possibly slow down loads – but I have not found this to be the case
Memset - 2 • To memset existing varchar() when upgrading to 12.4.2, set the following option Convert_Varchar_To_1242 • To bring the HG indexes to 12.4.2 standard when upgrading a pre-12.4.2 database, set the option Convert_HG_To_1242 • In the 12.4.3 upgrade, both these options happen by default
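As a sketch only, these would be set like any other database option; the 'ON' values are my assumption – check the 12.4.2 upgrade notes for the documented values:

SET OPTION PUBLIC.Initialize_Memory_To_Ones_On_Allocation = 'ON';  -- memset new allocations
SET OPTION PUBLIC.Convert_Varchar_To_1242 = 'ON';                  -- memset existing 12.4.0 varchar()
SET OPTION PUBLIC.Convert_HG_To_1242 = 'ON';                       -- bring HG indexes to 12.4.2 standard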
Float, Real and Double • Unless you really need them – please do not use • FLOAT • REAL • DOUBLE • They can only have Flat FP indexes – no others • They do not store "exact" values – only approximate • Please try to use • NUMERIC • DECIMAL
NUMERIC and DECIMAL • Numeric and Decimal are aliases of each other • Any numeric or decimal with a precision of less than 12 will be stored as an INT (with conversions) • Any numeric or decimal with a precision of between 12 and 20 will be stored as a BIGINT (with conversions)
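So, as a small illustration (table and column names are hypothetical), the two columns below get INT-like and BIGINT-like internal storage respectively:

CREATE TABLE price_example (
    unit_price  NUMERIC(10,2),   -- precision < 12: stored internally as an INT
    total_value NUMERIC(18,4)    -- precision 12-20: stored internally as a BIGINT
);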
Join Columns • You must generate the database schema with the table join columns having the same datatype. • INT, UINT and BIGINT are best, but the column datatypes for each join must be the same • Conversion cost is horrendous
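A brief sketch (hypothetical names): both sides of the join below are UNSIGNED INT, so no datatype conversion is paid at join time:

CREATE TABLE store_dim   (store_key UNSIGNED INT NOT NULL, store_name VARCHAR(40));
CREATE TABLE daily_sales (store_key UNSIGNED INT NOT NULL, sale_amount NUMERIC(18,2));

SELECT d.store_name, SUM(f.sale_amount)
FROM store_dim d, daily_sales f
WHERE d.store_key = f.store_key          -- identical datatypes: no conversion
GROUP BY d.store_name;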
Case and Collation Sequences • In terms of RAW performance the fastest IQ database is one where CASE is set to RESPECT and the collation sequence is BINARY (ISO_bineng) • This is probably not suitable for the general application of the database or warehouse server • CASE set to IGNORE is the next fastest, then changes in the collation sequence • The performance hits can be quite high (around 10-20% - we think!)
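For reference, a sketch of where these are chosen – at database creation time (file names and the exact clause layout are placeholders; check the CREATE DATABASE syntax for your version):

CREATE DATABASE 'warehouse.db'
    IQ PATH 'warehouse.iq'
    CASE RESPECT                 -- fastest: case-sensitive comparisons
    COLLATION 'ISO_BINENG';      -- binary collation sequence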
String Searches • String searches such as substr(col_name, 1, 3) are really very slow, because they rely on FP searches • With low cardinality (1 and 2 byte FP) data the search is faster, but this can still be a restriction • Create a new column which is the first 3 characters of the col_name column, then search on this • This way there is no function call, so no projection, so the optimiser can use a fast index, LF or HG (or, if it is a range query, an HNG)
Telephone Numbers • A classic example of the above is the telephone number +1-301-896-1733
+1 -> Country Code
301 -> Area Code
896 -> Sub Area Code
1733 -> Local Number
• Make this 4 columns (actually 5 - the whole number), then searches use fast indexes
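A minimal sketch (hypothetical names): store the number whole and split, and put fast LF/HG indexes on the split columns that are actually searched:

CREATE TABLE phone_numbers (
    full_number    CHAR(15),    -- the whole number, for display
    country_code   CHAR(2),
    area_code      CHAR(3),
    sub_area_code  CHAR(3),
    local_number   CHAR(4)
);

CREATE LF INDEX phone_cc_lf   ON phone_numbers (country_code);  -- low cardinality
CREATE HG INDEX phone_area_hg ON phone_numbers (area_code);

-- This now uses a fast index rather than a substr() scan of the FP
SELECT COUNT(*) FROM phone_numbers WHERE area_code = '301';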
Date time • As with telephone numbers, try storing a datetime as a series of columns (or a dimension table) • Try creating columns DD MM YY HH MM SS DoWeek DoYear Quarter etc.
Date vs. Datetime • A slightly better solution to the above can be considered in the light of the 1 and 2 byte FP indexes • Try storing the date part of a datetime as a date and the time part as hh mm ss • So: Datetime -> date_col, hh_col, mm_col, ss_col
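A sketch of the split (hypothetical names); the small hour/minute/second columns then qualify for the compact 1-byte FP treatment:

CREATE TABLE event_times (
    event_date  DATE,       -- date part of the original datetime
    event_hh    TINYINT,    -- 0-23
    event_mm    TINYINT,    -- 0-59
    event_ss    TINYINT     -- 0-59
);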
Loading Dates • There is NO default date or datetime format for loads into IQ • The format must be explicitly set for the load/insert to get the best performance • However, some formats are conversion-enhanced
Enhanced Conversion Formats
• DD/MM/YYYY   DD.MM.YYYY   DD-MM-YYYY
• HH:NN:SS   HHNNSS
• HH:NN:SS.S   HH:NN:SS.SS   HH:NN:SS.SSS   HH:NN:SS.SSSS   HH:NN:SS.SSSSS
• YYYYMMDD HHNNSS
• YYYY-MM-DD HH:NN:SS
• YYYY-MM-DD HH:NN:SS.S   YYYY-MM-DD HH:NN:SS.SS   YYYY-MM-DD HH:NN:SS.SSS   YYYY-MM-DD HH:NN:SS.SSSS   YYYY-MM-DD HH:NN:SS.SSSSS
Date Load • So it is better to use Col1 DATE('YYYY-MM-DD') • than Col1 ASCII(10) • The performance enhancement can be as much as a 100-fold speed-up in loads (for small tables)
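So a load specification along these lines (file name and load options are placeholders; check the LOAD TABLE syntax for your version) gets the enhanced conversion:

LOAD TABLE sales_fact
    ( sale_date    DATE('YYYY-MM-DD'),
      sale_amount  ASCII(12) )
FROM '/data/sales.dat'
ESCAPES OFF
QUOTES OFF;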
UNION • In IQ-M 12.4.3 the UNION clause has very few disadvantages • Generally UNIONs are all processed in parallel • so if you have a low user count they work well • The delete question (next slide) can also now be solved • Do not use DISTINCT in the UNION clause, or in the SELECT statement
UNION and Delete • If you are storing a fixed (in time) amount of data, e.g. 6 months • Then every month you delete 1/6th of the data in the table • This is expensive • It is better to split the fact table into 6 one-month tables • At the end of the month you truncate the oldest table • And possibly rename the table sets • Remember: for Multiplex, a table rename is DDL and hence can only be done in simplex mode!
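A sketch of the rolling-window idea (hypothetical names): six monthly tables presented through a UNION ALL view, so the monthly "delete" becomes a cheap truncate of the oldest table:

CREATE VIEW sales_6m AS
    SELECT * FROM sales_m1 UNION ALL
    SELECT * FROM sales_m2 UNION ALL
    SELECT * FROM sales_m3 UNION ALL
    SELECT * FROM sales_m4 UNION ALL
    SELECT * FROM sales_m5 UNION ALL
    SELECT * FROM sales_m6;

-- End of month: empty the oldest month in one cheap operation
TRUNCATE TABLE sales_m1;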
What is wrong here?
SELECT col1, col2
FROM table1, table2, table3
WHERE table1.col1 < table2.col1
AND table1.col1 > table3.col1
Cartesian Joins • These are expensive – they involve the join of every row in one table to every row in a second table. • Table A 1,000,000 rows • Table B 100,000 rows • Worktable 100,000,000,000 rows
Select * from T, R where T.a = 10 – Cartesian
Select * from T, R where T.a between R.b and T.b – Cartesian
Select * from T, R where ABS(T.a * R.b) = T.b – Cartesian
• But
Select * from T, R where ABS(T.a * T.b) = R.b – Not Cartesian
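For contrast, a minimal sketch (the shared key column T.k/R.k is hypothetical): adding a genuine equality join predicate lets the optimiser drive the join on it instead of building the full Cartesian product:

-- Cartesian: every T row is paired with every R row before filtering
Select * from T, R where T.a between R.b and T.b

-- Not Cartesian: the equality predicate drives the join;
-- the range condition is then applied to the joined rows
Select * from T, R where T.k = R.k and T.a between R.b and T.b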