Explore the world of data warehousing, a specialized database for storing and analyzing large volumes of historical data to support decision-making processes in organizations. Learn about dimensions, variables, hypercubes, and data mining techniques. Discover the importance of understanding user needs and decision-making processes when developing a data warehouse.
Week 8 A Few Thoughts on Data Warehousing
Some Thoughts on Data Warehousing A Data Warehouse can be thought of as a special type of database - generally a very large database (source of data). It holds more data, which is processed by Analytical and Statistical software. The data, or much of it, will probably be historical (also known as Legacy data). And some of the data probably won’t be of the same quality and content - or even consistent.
Some Thoughts on Data Warehousing A more technical definition would be ‘A data warehouse is a subject-orientated, integrated, time-variant and nonvolatile set of data which supports decision making processes’. Subject databases are designed around the essential entities of a business (for example customers) rather than applications (for example Sales, Insurance..) The reference to time-variant means that data sets are organised by time periods (months, quarters..)
Some Thoughts on Data Warehousing A data warehouse is a snapshot of an organisation at a given or particular time. This snapshot is created by the extraction of data from existing systems. This extracted data must be transformed, cleaned and loaded into the Data Warehouse AND regular snapshots must be made to maintain the relevance and usefulness of the data warehoused.
Some Thoughts on Data Warehousing To process, develop and present results from what is normally a large amount of data (many gigabytes or terabytes is realistic), there is the need to use ‘data management aids’. These aids consist of: Extraction tools; Analysis tools (such as OLAP and OLTP, which we’ll meet in more detail later); Application tools. Some of the workhorses are DSS, EIS, OLAP, SQL, Data Mining.
Some Thoughts on Data Warehousing There are some interesting terms in Data Warehousing. Dimensions: A spreadsheet is a 2 dimensional display of data - so are most relational databases. However, Data Warehousing tools can extend this association to a ‘multi dimension’ or a ‘variable dimension’. Do you remember the nominal dimension, the ordinal dimension and the continuous dimension in statistics ?
Some Thoughts on Data Warehousing Just in case you don’t: A nominal variable is an unordered category such as a region or suburb. An ordinal variable is an ordered category such as an age group. A continuous variable (normally) has a numeric value such as income or passenger-kilometres.
Some Thoughts on Data Warehousing There is a term ‘hypercube’, which is a smart way of displaying (visualisation - which will appear soon) multi-dimensional data - or, more correctly, data in a multi-dimension form. A hypercube is a combination of several types of dimensions. An example: Identifier dimensions could be product and store (which are both nominal) and the variable dimensions could be sales and customers. You should know the terms ‘dependent’ and ‘independent’ variables.
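A small sketch may help make the hypercube idea concrete. It uses the example from the overhead (product and store as identifier dimensions, sales and customers as variable dimensions). The rows, the function names `build_cube` and `roll_up`, and the dictionary layout are all assumptions made for illustration; real warehouse tools store cubes quite differently.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, store, sales, customers).
# Product and store are the nominal identifier dimensions; sales and
# customers are the variable dimensions. All values are made up.
rows = [
    ("tea", "north", 120.0, 8),
    ("tea", "south", 80.0, 5),
    ("coffee", "north", 200.0, 12),
    ("tea", "north", 30.0, 2),
]

def build_cube(rows):
    """Aggregate rows into cells keyed by (product, store)."""
    cube = defaultdict(lambda: {"sales": 0.0, "customers": 0})
    for product, store, sales, customers in rows:
        cell = cube[(product, store)]
        cell["sales"] += sales
        cell["customers"] += customers
    return dict(cube)

def roll_up(cube, axis):
    """Collapse the cube onto one identifier dimension (0=product, 1=store)."""
    totals = defaultdict(float)
    for key, cell in cube.items():
        totals[key[axis]] += cell["sales"]
    return dict(totals)

cube = build_cube(rows)
sales_by_product = roll_up(cube, 0)
sales_by_store = roll_up(cube, 1)
```

Each cell of the cube is addressed by one value from every identifier dimension, and rolling up simply sums the cells along the dimension that is being ignored.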
Some Thoughts on Data Warehousing Very powerful forms of analysis are possible when both dimensions are continuous. And just to finish off this brief introduction to terms: Data mining is the search for relationships and global patterns which exist in large databases, but are obscured or hidden in what are normally large (or enormous) amounts of data. Data mining is also associated with advanced machine learning technologies which are used to discover pockets of knowledge hidden in the mass of data.
Some Thoughts on Data Warehousing And just to finish off this introduction, and give you some feel of what ‘large amounts of data’ are:- One company in the United States of America - a catalog publisher, reputedly has collected data on 30 million households The company uses Quadstone (data mining), SAS, SPSS, KnowledgeSeeker. It runs about 100 marketing campaigns each year
Some Thoughts on Data Warehousing There are a number of aspects to consider when developing a database, and many of these are applicable to a Data Warehouse. The major hurdle is the establishment and agreement by Users of the need to build a Data Warehouse This can be expanded into these ‘supporter’ groups 1. Understand the users by • Management information needs • Area of Business operation • Responsibilities • Computer literacy
Some Thoughts on Data Warehousing This (almost automatically) suggests a Business Management hierarchy - where no one person is responsible for everything. And on this presumption we move on to 2. Determine what decisions users need to make OR would like to make if only they had sufficient supporting information 3. Determine what decisions they have made and are now trying to assess the impact of (can you remember a number of ‘high level’ decisions made in Industry which have been disastrous - but this knowledge arrived too late to reverse or recover lost markets ?)
Some Thoughts on Data Warehousing 3. Who will most likely make informed decisions using the data warehouse as the source. It is not accurate to assume that because a data warehouse is available, all managers or decision makers will use it effectively (or at all) 4. Who could be ‘likely’ or potential new users ? This simple question opens up the Competitive Appraisal or Enterprise Management Appraisal environment. How effective and accurate are these Appraisals - how current are they ? Who conducted them - how skilled were they ? - what level of management skills ?
Some Thoughts on Data Warehousing Another way of stating this is ‘Who has serious potential in the Organisation’ Does this conflict with Privacy and Discrimination principles ? 5. Select the most efficient, effective, processible, and reliable data from when (?) to the most current period to be the content of the Data Warehouse 6. Make life enjoyable for the users • user screens • software icon or template based • match these with the skills of the users
Some Thoughts on Data Warehousing 7. Match the processes with the current and adaptive skills of the users - people do develop advanced skills 8. Ensure that the data used is accurate, and can be trusted and consistently recognised across all departments, areas, states, countries of the Company 9. Monitor the results and particularly the reactions of the users to the information advantages they are receiving 10. Search, or listen attentively for other sources of data for the particular Data Warehouse
Some Thoughts on Data Warehousing 11. Be alert to changes in Management - and the effect this could or will have on a Data Warehouse (either an established one, or one being developed. - Senior Management changes do happen quickly at times - especially after a General Meeting where shareholders can either be Happy or Not Happy) 12. Keep the users, and particularly the ‘Upper End of Management’ users satisfied And, make sure there is an energetic, resourceful and knowledgeable Project Leader (and read the paper on Team Construction at the website).
Some Thoughts on Data Warehousing There are some impediments to the design, development and use of a Data Warehouse 1. Not surprisingly, there are only a few organisations which are geared to, or are working towards total centralisation 2. The development of a Data Warehouse invariably requires the integration of a number of technologies which are not compatible at many levels. This is due, in part, to the ongoing developments and upgrades which are inherent in and to any user of Information Technology
Some Thoughts on Data Warehousing 3. The ‘we want it now’ syndrome - which if allowed to rush the design, development and implementation invariably leads to a poor design, low quality but rapid development, and a Warehouse which doesn’t match expectations 4. The effect of changes both during the development and at the operational level. Conditions do change, and requirements also change and the overall effect is ongoing planned and directed maintenance 5. Data Warehouses process large amounts of data and use, in most cases, high level analysis and reporting tools. This is not a good environment for rapid or immediate response
Some Thoughts on Data Warehousing 6. Another dimension which has occurred in the past 2 to 3 years is the emphasis on ‘customer information’ which depends on normal IT systems to capture this data - this is a developing skill and early data will not be as complete nor revealing as ‘newer’ systems 7. The tracking, storing and prediction of customer behaviour may require different analysis tools. 8. Current databases provide ‘On Line Analytical Processing’ capabilities for on-line real time applications - the major effort is to associate this with legacy data in a meaningful and seamless manner - and maintain high processing rates
Some Thoughts on Data Warehousing A survey run by the Cutter Consortium in Arlington MA, 2002 revealed that 27% of the surveyed number of users and developers of Data Warehouses feel confident with data warehouse technology (142 companies worldwide were canvassed) And 41% have experienced data warehouse project failures And only 15% claim that their data warehousing efforts have been a major success
Some Thoughts on Data Warehousing So now that the groundwork has been laid, let’s move onto some more ‘ponderables’ of what are regarded as ‘Design Constraints’ - another name for non-avoidable requirements . 1. Visualisation • easily and quickly digested (understood) • simple but informative • recognisable information (names …..) • intuitive
Some Thoughts on Data Warehousing So what do these contribute to a Data Warehouse ? That is not the right question - It should be, What should design provide to make the results of a Data Warehousing process • understandable • easy to reprocess • easy to explore • easy to partition or to hand over to other tools
Some Thoughts on Data Warehousing If you look at many of the ‘advanced systems’ available you would probably be convinced that one of the main design objectives is to make the ‘system’ as complex as possible. That’s not good. Many designs are complex and intricate - great skill is needed to source, process, analyse, locate, redimension. Features are not solutions - features are an aid to arrive at a solution.
Some Thoughts on Data Warehousing Comprehension Smart or overfull screens have these results • they divert the attention of serious users • they confuse people • they make a process difficult to operate Screen displays should have much free space
Some Thoughts on Data Warehousing Screen displays should provide less choice - not all possible choices (or menu items) on one screen. (Many web sites suffer from the overfull problem). ‘Smart’ dashboards do not appeal to many users Multiple mouse clicks on the same screen mean a complex screen.
Some Thoughts on Data Warehousing Delivery Speed: Very much associated with the ‘We must have it now!’ syndrome. Acceptable delay tolerance from the end user is <= 0. Designers cannot expect users to accept a long delay even if • the results are complicated • the amount of data to be processed is large • the process algorithms are complex and/or recursive
Some Thoughts on Data Warehousing Cost of ‘Implementation’ - from an idea to production. These costs include: Labour (staff) costs during the design stage; Delay costs during the design stage, and this is before any useful results are produced - possibly a good case for a Return on Investment or Rate of Return calculation ? There are other incidental costs such as Business staff and management time.
Some Thoughts on Data Warehousing Costs are also connected with the size of the Data Warehouse. A 10 table design may not incur much complexity. A 100 table design is almost certain to. A 1000 table design is guaranteed to (and may not even be successful).
Some Thoughts on Data Warehousing Technology Costs - Hardware and Software These items should be scaled to the ‘size’ of the known requirements And they should be able to be extended beyond the first implementation - if it is successful (nothing generates more demands than a successful Data Warehouse - probably due to a ‘wait and see’ attitude) Hardware should be periodically discarded and replaced with newer and more powerful versions
Some Thoughts on Data Warehousing Software is the key to fast development, user-friendly systems, and fast delivery of data to the users for their queries of the Data Warehouse applications. A Data Warehouse works on a 2 phase basis 1. An ‘Extract, Transform and Load’ process (also known as ETL) 2. Front office user queries and reports. Much of the design work is focussed on the processes which will provide relevant and accurate data from the sources so that queries and reports can be produced.
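The two phases can be sketched in a few lines. This is only an illustration under stated assumptions: the source rows, the field names, the `sales` table and the cleaning rules are all invented, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows as they might arrive from a source system; the
# field names and values are illustrative only.
source_rows = [
    {"customer": " Alice ", "amount": "120.50", "month": "2004-01"},
    {"customer": "BOB", "amount": "n/a", "month": "2004-01"},
    {"customer": "alice", "amount": "30", "month": "2004-02"},
]

def extract():
    """Phase 1a: pull raw rows from the source (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Phase 1b: clean names, convert amounts, drop rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real process would log or quarantine this row
        cleaned.append((row["customer"].strip().lower(), row["month"], amount))
    return cleaned

def load(conn, rows):
    """Phase 1c: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, month TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# Phase 2: a front office query against the loaded data.
report = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
```

Note how the query in phase 2 only sees data that survived the transform step, which is why so much of the design effort goes into phase 1.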
Some Thoughts on Data Warehousing Daily Management and Administrative Costs This is NOT a NEW phenomenon which occurred with the emergence of Data Warehouses It is not avoidable but it is manageable and can be controlled Items which are found here are Routine loading of data into Fact and Dimension tables Production of standard reports
Some Thoughts on Data Warehousing Unexpected but Necessary Amendments to either Data or Processing. Included here are annoying aspects such as: Late arriving facts; Late arriving dimensions; Corrections to existing data. Other Amendments include: New dimensions; New Dimension Attributes; New facts; The details or summary structure of the data source.
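One commonly described way of coping with a late arriving dimension is to load the fact anyway against a placeholder dimension row, and to fill in the real attributes when they turn up. The sketch below assumes that approach; these overheads do not prescribe it, and the names `customer_dim`, `sales_facts`, `load_fact` and `load_dimension` are hypothetical.

```python
# Hypothetical in-memory stand-ins for a dimension table and a fact table.
customer_dim = {}   # customer_id -> {"name": ..., "placeholder": ...}
sales_facts = []    # list of (customer_id, amount)

def load_fact(customer_id, amount):
    """Load a fact; if its dimension row has not arrived yet, add a placeholder."""
    if customer_id not in customer_dim:
        customer_dim[customer_id] = {"name": "UNKNOWN", "placeholder": True}
    sales_facts.append((customer_id, amount))

def load_dimension(customer_id, name):
    """Load (or late-fill) a dimension row, replacing any placeholder."""
    customer_dim[customer_id] = {"name": name, "placeholder": False}

load_fact(42, 99.0)                 # fact arrives before its customer record
was_placeholder = customer_dim[42]["placeholder"]
load_dimension(42, "Acme Pty Ltd")  # the late arriving dimension row
```

The fact is never rejected or delayed, and reports run in the meantime show the customer as UNKNOWN rather than silently losing the sale.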
Some Thoughts on Data Warehousing Prevention of Inappropriate or Irrelevant Results. Such results can be avoided, and this depends on the ‘correct target’ being selected, which in turn means extensive and continuous business requirements gathering at the beginning and throughout the life of the warehouse - which presupposes that a data warehouse application may either outlive its purpose or be replaced by a newer, different model.
Some Thoughts on Data Warehousing Many years ago a chap called Hamlet (in one of William Shakespeare’s plays (of the same name)) was given these words ‘To be, or not to be That is the question’ (and lots more of course). The current paraphrase of that is ‘To centralise or not to centralise That is the question’ The complete quotation from Hamlet is at the end of these overheads
Some Thoughts on Data Warehousing There are solid arguments to ‘Centralise’ and there are solid arguments ‘Not to Centralise’. Centralisation of data (as in a Centralised Data Warehouse) invariably assumes ‘perfect information, control and communication’ An alternative is to create data marts (specialised smaller centres of data) and to tie, or associate, these with an architecture which allows them to work together effectively (much like a distributed database).
Some Thoughts on Data Warehousing As with many developments, there are a number of ‘inescapables’ (much like fees and taxes). An unavoidable reality is the recognition that the ideal model is an ideal, not a practicality nor reality. A realistic model is referred to as ‘descriptive’ - the ideal model is classed as ‘normative’. So with that little bit of Business philosophy, let’s have a quick look at reality in designing and implementing a Data Warehouse
Some Thoughts on Data Warehousing The opportunities for building a ‘perfect design’ from a centralised position are very limited (in calculus this would be represented by $s_n \to 0$). The major elements of skills, understanding, foresight, wisdom and sufficient time are not normally available in abundance. We construct, and live with, designs in separate areas (departments) and we learn the requirements of the many users slowly and sometimes painfully. We also need time and patience to understand and recognise the impact of data
Some Thoughts on Data Warehousing We develop things incrementally It’s not surprising then that the weight of experience indicates that a Data Warehouse is basically a decentralised system and may have no ‘core matter’ at all Don’t get discouraged - our national road systems were not constructed overnight Neither were our water, power, gas systems nor our Universities - and they are subject to continuous change - so why should there be the unnatural expectation for a Data Warehouse ?
Some Thoughts on Data Warehousing Another level of complexity is that of many, and incompatible technologies It is normal for organisations to have multiple ‘business’ systems which were developed at different time periods and on different computers and with different applications software platforms To integrate these requires much skill at the communications and applications levels. XML is the current ‘hope’ for clear and multiple different data forms. Curl is another new object oriented language which has appeared - and there will be others …..
Some Thoughts on Data Warehousing Most users expect what is known as ‘rapid deployment’, which means a short time from inception to delivery of output. Oracle Corporation (and others) suggest that a minimum 6 week period from acceptance and recognition of the need for a Data Warehouse is industry acceptable. What would probably happen if the ‘wait period’ was longer ? If the deployment period is say 12 months, what are the risks of - budget reallocation ? - loss of organisation competitive edge ? - no more work for the Data Warehouse contractor ?
Some Thoughts on Data Warehousing Continuing Change One aspect of Business is predictable - change This makes long range assumptions both dangerous (particularly if they are not updated) and unreliable Timeliness of Data The Data Warehouse should ideally reflect both atomic data (the most persistent, lowest level of data) and real time - which is an expectation that latest data will be less than 60 minutes old, or even 0 minutes old. There is a term ‘lockstep’ which refers to the ‘in step’ condition of operational systems and Data Warehouse systems.
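The 60 minute expectation can be expressed as a simple check on how far the warehouse lags behind the operational systems. This is a minimal sketch: the function names, the timestamps and the idea of recording a single ‘last loaded’ time are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Assumed target taken from the text: latest data should be under 60 minutes old.
MAX_AGE = timedelta(minutes=60)

def data_age(last_loaded, now):
    """Return how far the warehouse lags behind the operational system."""
    return now - last_loaded

def is_timely(last_loaded, now, max_age=MAX_AGE):
    """True when the most recent load is younger than the allowed age."""
    return data_age(last_loaded, now) < max_age

# Hypothetical timestamps for illustration.
now = datetime(2004, 3, 3, 12, 0)
recent_load = datetime(2004, 3, 3, 11, 15)
stale_load = datetime(2004, 3, 3, 9, 30)
```

Here `is_timely(recent_load, now)` holds because the lag is 45 minutes, while `is_timely(stale_load, now)` does not. A lag of zero would correspond to the ‘lockstep’ condition.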
Some Thoughts on Data Warehousing The Optimists probably will look for ‘data ahead of its time’, that is before an event happens We currently have ‘futures’ and ‘futures trading’ so is ‘future data’ all that unrealistic ? Seamless Integration/Connection The final requirement is for the Data Warehouse to be able to connect the real-time data without prejudice (seamless) with the static historical data (which hopefully will not alter). This will give the appearance of a total time base
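One simple way to give users the appearance of a total time base is a view which joins the historical and the current data end to end. The sketch below assumes two hypothetical tables, `sales_history` and `sales_today`, in an in-memory SQLite database; it illustrates the idea and is not a description of any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Static historical data and the current, still changing data.
CREATE TABLE sales_history (day TEXT, amount REAL);
CREATE TABLE sales_today (day TEXT, amount REAL);
INSERT INTO sales_history VALUES ('2004-03-01', 100.0), ('2004-03-02', 150.0);
INSERT INTO sales_today VALUES ('2004-03-03', 40.0);

-- The view hides the seam: users query one continuous time base.
CREATE VIEW sales_all AS
    SELECT day, amount FROM sales_history
    UNION ALL
    SELECT day, amount FROM sales_today;
""")

timeline = conn.execute(
    "SELECT day, amount FROM sales_all ORDER BY day"
).fetchall()
```

Users who query `sales_all` do not need to know, or care, which rows are historical and which arrived a few minutes ago.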
Some Thoughts on Data Warehousing The next few overheads will try to explain some current techniques and software which are employed to both minimise the time factor in producing outputs, and also improve or maintain the quality and scope of data There will also be some thoughts about a ‘new technique’ known as ‘Executive Dashboards’
Some Thoughts on Data Warehousing A New Term - OLTP databases OLTP is the acronym for ‘On Line Transaction Processing’ which links in nicely with the previous overheads regarding the ‘we want it now’ catch cry from Business Managers and Customers (we’ve overlooked these people have we ?) So now we have OLTP databases and Data Warehouses Big deal ? What’s the difference or differences, if any ? Is it just another expensive IT tool ?
Some Thoughts on Data Warehousing OLTP databases run business applications developed around • Siebel • SAP • PeopleSoft – now part of Oracle (and others) They are developed for multiple users, many of whom are remote from the processing centre (as if that is something unusual with our current capabilities with communications) These users require (sometimes ‘demand’) • very rapid response times • very high levels of accuracy of data (why ?)
Some Thoughts on Data Warehousing In these conditions, it is reasonable to suggest that there could be thousands of concurrent users (you remember concurrency ?) many, or perhaps all, of whom want to press the Enter key (or Submit a process with the more menu function driven Client devices) and expect • an immediate response • an accurate response
Some Thoughts on Data Warehousing Let us digress a bit and ponder on what is going on at the database level There will be a series of processing events which must occur (quickly, but in a predetermined sequence) 1. Business programming logic and SQL statements (like the SQL you have met) must be executed - this is not avoidable 2. Response data from the appropriate database or databases must be gathered - this will involve at least one CPU, some (or much) sorting, and also the unavoidable Input/Output (I/O) time.
Some Thoughts on Data Warehousing When all the detail events have been completed and the business function is performed, the production unit of work is committed, or rolled back (you have met these conditions earlier). Just to make things a bit more entertaining, some OLTP database applications employ batch processing and others will service decision-support queries (with the associated extensive browsing and highly complex calculations required in many cases). And just to query the slogan ‘Life’s Good’ (or LG), OLTP databases are expected to run thousands of transactions per minute.
Some Thoughts on Data Warehousing OLTP database applications also perform • Insert • Update • Delete functions - and you have had some contact with transactions which affect more than 1 row in a table, and more than 1 table in a database (known as contention management and locking technology). So you will appreciate that OLTP is a high work level environment - so to expect fast output adds to the transaction load
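A unit of work which touches more than one table, and is then either committed or rolled back, might look like the sketch below. The `stock` and `orders` tables, the `place_order` function and the rule that stock cannot go negative are all assumptions for illustration, with an in-memory SQLite database standing in for an OLTP system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (item TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))"
)
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
conn.execute("INSERT INTO stock VALUES ('widget', 5)")
conn.commit()

def place_order(conn, item, qty):
    """Record an order and reduce stock as one unit of work."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (item, qty))
        conn.execute("UPDATE stock SET qty = qty - ? WHERE item = ?", (qty, item))
        conn.commit()      # both changes become permanent together
        return True
    except sqlite3.IntegrityError:
        conn.rollback()    # neither change survives
        return False

first = place_order(conn, "widget", 3)    # enough stock, so it commits
second = place_order(conn, "widget", 4)   # would make stock negative

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute(
    "SELECT qty FROM stock WHERE item = 'widget'"
).fetchone()[0]
```

The second order is inserted and then undone by the rollback, so the two tables never disagree with each other.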
Some Thoughts on Data Warehousing Let’s look at Data Warehouse Databases They assist in Business decision making processes - and of course some (or many) of these are complex Normally, Data Warehouse databases do not run transactions at high rates However, they do respond to complex business questions relevant to the available data These ‘queries’ are delivered to the database via complex SQL or by user friendly query tools
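By way of contrast with the short OLTP transactions, a warehouse query typically joins a table of facts to several dimension tables and summarises the result. The sketch below assumes a small, hypothetical layout (`fact_sales`, `dim_store`, `dim_date`) with invented data in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_sales (store_id INTEGER, date_id INTEGER, amount REAL);
INSERT INTO dim_store VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_date VALUES (10, '2004-Q1'), (11, '2004-Q2');
INSERT INTO fact_sales VALUES
    (1, 10, 100.0), (1, 10, 50.0), (2, 10, 70.0), (1, 11, 30.0);
""")

# A business question: what were total sales per region per quarter?
query = """
SELECT s.region, d.quarter, SUM(f.amount) AS total, COUNT(*) AS n
FROM fact_sales AS f
JOIN dim_store AS s ON s.store_id = f.store_id
JOIN dim_date AS d ON d.date_id = f.date_id
GROUP BY s.region, d.quarter
ORDER BY s.region, d.quarter
"""
result = conn.execute(query).fetchall()
```

Nothing is inserted, updated or deleted by the query; it reads many rows and returns a few, which is the opposite of the OLTP pattern.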