1 / 29

by Professor Ronald J Norman of Grossmont College, CA, USA

CR32 Knowledge Management and Adaptive Systems 09: Data Warehousing based on an online presentation by Ronald J Norman http://www.grossmont.edu/ronaldnorman/. by Professor Ronald J Norman of Grossmont College, CA, USA.

kenaz
Download Presentation

by Professor Ronald J Norman of Grossmont College, CA, USA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CR32 Knowledge Management and Adaptive Systems09: Data Warehousingbased on an online presentation by Ronald J Normanhttp://www.grossmont.edu/ronaldnorman/

  2. by Professor Ronald J Norman of Grossmont College, CA, USA • Prof Norman used these slides in his data mining course based on Data Mining Techniques (Second Edition) By Michael J. A. Berry and Gordon S. Linoff2004 John Wiley & Sons • Management-oriented textbook... http://www.data-miners.com/companion/dmt.html

  3. Introduction • Data, data, data…everywhere! • Information…that’s another story! • Especially, the right information @ the right time! • Data warehousing’s goal is to make the right information available @ the right time • Data warehousing is a data store (eg., a filestore or database of some sort) and a process for bringing together disparate data from throughout an organization for decision-support purposes

  4. Introduction • Data warehouses are natural allies for data mining (work together well) • Data mining can help fulfill some of the goal of data warehouses – right information @ the right time • Relational database management systems (RDBMS), such as Oracle, DB2, Sybase, Informix, Focus, SQL Server, etc. can be used for data warehousing; or just store as text/HTML

  5. Definitions of a Data Warehouse “A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process” 1. - W.H. Inmon “A copy of transaction data, specifically structured for query and analysis” 2. - Ralph Kimball

  6. CW corpus as a Data Warehouse Subject-oriented: English terminology on WWW Integrated: harvested from many sources, into a single standard format and file-store Time-variant : WWW pages change! Non-volatile : corpus is a static snap-shot 1. - W.H. Inmon Copy of transaction data: cache structured for query and analysis: raw text yields word-frequency list 2. - Ralph Kimball

  7. Data Warehouse • For organizational learning to take place, data from many sources must be gathered together and organized in a consistent and useful way – hence, Data Warehousing (DW) • DW allows an organization to archive snapshots of its data, and what it has noticed about its data • Data Mining techniques make use of the data in a Data Warehouse

  8. Data Warehouse Enterprise “Database” Customers Orders Transactions Vendors Etc… Etc… • Data Miners: • “Farmers” – they know • “Explorers” - unpredictable Copied, organized summarized Data Warehouse Data Mining

  9. Data Warehouse • A data warehouse is a copy of transaction data specifically structured for querying, analysis, reporting, and more rigorous data mining • Note that the data warehouse contains a copy of the transactions which are not updated or changed later by the transaction system • Also note that this data is specially structured, and may have been transformed when it was copied into the data warehouse

  10. Data Mart • A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse. • A Data Mart typically reflects the business rules of a specific business unit within an enterprise. • Which English dominates the WWW, UK or US: each student captured a Data Mart for 1 domain.

  11. Data Mart Data Mart Data Mart Data Warehouse to Data Mart Decision Support Information Data Warehouse Decision Support Information Decision Support Information

  12. open source Data Warehouses • A company may keep its DW private! • Large data-sets are valuable Gold Standards for research and development • Some Universities host “public” DWs • Eg ICAME: International Computer Archive of Modern English • ICAME also runs CORPORA forum • Martin Krallinger etc on UK v US English: http://www.uib.no/mailman/public/corpora-archive/2006-November/003387.html

  13. Other Data repositories • UPenn: Linguistic Data Consortium http://www.ldc.upenn.edu/ • European equivalents: • ELRA http://www.elra.info/ • ELDA http://www.elda.org/ • Leeds Electronic Text Centre http://etext.leeds.ac.uk/ • Leeds Centre for Translation Studies http://corpus.leeds.ac.uk/

  14. Generic Architecture of Data (synonym) Transaction data

  15. Transaction (Operational) Data • Operational (production) systems create (massive number of) transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll roads, web site “hits”, web site text, etc… • Transactions are the base level of data – the raw material for understanding customer behavior • Unfortunately, operational systems change, eg new formats, due to changing business needs • Data warehousing strategies need to be aware of operational system changes

  16. Operational Summary Data Summaries are for a specific time period and utilize the transaction data for that time period Other Examples???

  17. Decision Support Summary Data • The data that are used to help make decisions about the business • Financial Data, such as: • Income Statements (Profit & Loss) • Balance Sheets (Assets – Liabilities = Net Worth) • Sales summaries • Other examples??? • Data warehouses maintain this type of data, however financial data “of record” (for audit purposes) usually comes from databases and not the data warehouse (confusing???) • Generally, it is a bad idea to use the same system for analytic and operational purposes

  18. Database Schema • Database schema defines the structure of data, not the values of the data (e.g., first name, last name = structure; Ron Norman = values of the data) • In RDBMS: • Columns = fields = attributes (A,B,C) • Rows = records = tuples (1-7)

  19. Describes data in a way that is familiar to business users Describes the data the way it will be stored in an RDBMS which might be different than the way the logical shows it Logical & Physical Database Schema

  20. Metadata • General definition: Data about data !!! • Examples: • A library’s card catalog (metadata) describes publications (data) • A file system maintains permissions (metadata) about files (data) • A form of system documentation including: • Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.) • Description of the contents of each field (e.g., start date) • Date when data were loaded • Indication of currency of the data (last updated) • Mappings between systems (e.g., A.this = B.that) • Invaluable, otherwise have to research to find it

  21. Business Rules • Highest level of abstraction from operational (transaction) data • Describes why relationships exist and how they are applied • Examples: • Need to have 3 forms of ID for credit • Only allow a maximum daily withdrawal of $200 • After the 3rd log-in attempt, lock the log-in screen • Accept no bills larger than $20 • Others???

  22. OLAP – Online Analytical Processing • A definition: • Data representation for ease of visualization • OLAP goes beyond SQL with its analysis capabilities • Key feature of OLAP: Relevant multi-dimensional views such as products, time, geography

  23. OLAP Architecture

  24. General Architecture for Data Warehousing • Source systems • Extraction, (Clean), Transformation, & Load (ETL) • Central repository • Metadata repository • Data marts • Operational feedback • End users: analysis, OLAP, Data-Mining

  25. DM vs. OLAP Data Mining: can handle complex data types of the attributes and their aggregations a more automated process Online Analytic Processing (visualization): restricted to a small number of dimension and measure types user-controlled process CS490D 25

  26. DM + visualization Data Mining: can handle complex data types of the attributes and their aggregations “reduces” data to smaller number of patterns Visualization: restricted to a small number of patterns user-controlled process to select patterns which are “interesting” or “useful” CS490D 26

  27. Q: Is it a Data Warehouse? • Is ANY data-set a Data Warehouse? • SIS? • Library Catalogue? • VLE? • Text in a textbook?

  28. Definitions of a Data Warehouse “A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process” 1. - W.H. Inmon “A copy of transaction data, specifically structured for query and analysis” 2. - Ralph Kimball

  29. CW corpus as a Data Warehouse Subject-oriented: English terminology on WWW Integrated: harvested from many sources, into a single standard format and file-store Time-variant : WWW pages change! Non-volatile : corpus is a static snap-shot 1. - W.H. Inmon Copy of transaction data: cache structured for query and analysis: raw text yields word-frequency list 2. - Ralph Kimball

More Related