600 likes | 721 Views
CHAPTER 4. Data Management: Warehousing, Access, and Visualization. In this Chapter, the outlines are:. MSS foundation Many new concepts Object-oriented databases Intelligent databases Data warehouse Data mining Online analytical processing Multidimensionality
E N D
CHAPTER 4 Data Management: Warehousing, Access, and Visualization
In this Chapter, the outlines are: • MSS foundation • Many new concepts • Object-oriented databases • Intelligent databases • Data warehouse • Data mining • Online analytical processing • Multidimensionality • Internet / Intranet / Web All right Reserved YAO Zhong, School of E&M, BUAA
4.1 Opening Vignette: Data Warehouse and DSS at Group Health Cooperative Group Health Cooperative is a large nonprofit HMO based in Seattle, Washington. The company owns hospital, inpatient centers, and primary care centers. The company also has contracts with many health care providers and also acts as an insurance company. In 1997, it served over 550,000 enrollees. A stream of 2-3 million records of data is processed monthly. Managing such a vast amount of data is a very difficult task. Even more difficult is the use of data for decision support. Before the use of DSS, costs were escalating, customer services were ineffective, use of resources was poor, and so was the quality of some service. The company realized that the only solution was to develop a All right Reserved YAO Zhong, School of E&M, BUAA
a comprehensive database and DSS approach to cover the whole spectrum of health services. Such an approach would allow for effective and efficient data-driven decision making. The basic idea was to create a single database (called a data warehouse) that would support DSS by linking data related to cost, efficiency of resource use, outcomes, and health status in a comprehensive corporate information system. The data come from existing data collection applications (TPS), such as clinic registration, laboratory, and pharmacy. An attempt was made to avoid data redundancy and to combine the data into meaningful information for decision making. The system was initiated in 1989 and it is constantly being updated and improved. It is used to generate periodic reports such as All right Reserved YAO Zhong, School of E&M, BUAA
Population-based reports summarized by clinic and by practice • Productivity reports • Utilization management reports • Reports by consumer groups or payer groups • Statistical reports, such as by age and gender The data warehouse is also used for many DSS, EIS, and MIS applications such as: • Allocating costs down to the level of use by the patient • Holistic cost approach to answer queries such as how cost reduction in one area affects costs in another areas. • Cost comparisons for negotiating prices with business partners • Comprehensive query system • Creating an EIS for monitoring key indicators such as cost per patient day in a hospital All right Reserved YAO Zhong, School of E&M, BUAA
Once the data warehouse was completed and data-based reporting and access tools were in place, the possibilities were endless. For example, the number of inpatient days is an important performance indicator. It was reduced by 7% by sending patients to outpatient services, resulting in millions of dollars in savings. Performance measurements are now generated periodically. Trend analysis is performed as well as comparisons with other HMOs. Considering these comparisons, target and plans for improvement were established. An example of the success of the DSS was the winning of a military contract valued at about $1 billion over a 5-year period. A database used specifically for the bid on this contract was created in only 2 days because it was extracted from the data warehouse. The warehouse was created with the SAS data warehouse software (from the SAS Institute Inc.), which is compatible with SAS’s DSS tools. Each group gets an individually tailored report. Such reports contain data from many All right Reserved YAO Zhong, School of E&M, BUAA
sources and because of the data warehouse ,it is possible to generate the reports rapidly and accurately. What this opening vignette demonstrated is how a large company uses a centralized database to support decision making. All right Reserved YAO Zhong, School of E&M, BUAA
4.2 Data Warehousing, Access, Analysis, and Visualization What to do with all the data that organizations collect, store, and use?(Information overload!) Solution • Data warehousing • Data access • Data mining • Online analytical processing (OLAP) • Data visualization • Data sources All right Reserved YAO Zhong, School of E&M, BUAA
4.3 The Nature and Sources of Data • Data: data items about things, events, activities, and transactions are recorded, classified, and stored, but are not organized to convey any specific meaning. It can be numeric, alphanumeric, figures, sounds, or image. • Information: Data organized to convey meanings. It confirms something the recipient knows or may have “surprise” value by telling something not know.The recipient interprets the meaning and draws inferences and conclusions. An applications process data items so that the results are meaningful for an intended action or decisions. All right Reserved YAO Zhong, School of E&M, BUAA
4.3 The Nature and Sources of Data • Knowledge: Data items organized and processed to convey understanding, experience, accumulated learning, and expertise.A set of data items processed to extract critical implications and to reflect past experience and expertise provides the recipient with organizational knowledge and has very high potential value. • DSS Database, data warehouse, may include data, information, and knowledge. All right Reserved YAO Zhong, School of E&M, BUAA
4.3 The Nature and Sources of Data • Data forms: • Documents • Pictures • Maps • Sound • Animation • Video • Can be hard or soft All right Reserved YAO Zhong, School of E&M, BUAA
4.3 The Nature and Sources of Data • Internal Data are stored in one or more places. These data are about people, products, service, and process. For example, data about the employees and their pay are usually stored in the corporate database. Data about equipment and machinery may be stored in the maintenance department database. • External There are many sources of external data. They range from commercial databases to data collected by sensors and satellites. Data are available on CD-ROM, on the Internets, as films, and as music or voices. Pictures, diagrams, atlases, and TV are also source of data. Government reports (either computerized or not) are a major source of external data. All right Reserved YAO Zhong, School of E&M, BUAA
4.3 The Nature and Sources of Data • Personal Data The MSS users or other corporate employees may contribute their own expertise by creating personal data. These include subjective estimates of sales opinions about what the competitor are like to do or interpretations of new articles. All right Reserved YAO Zhong, School of E&M, BUAA
4.4 Data Collection, Problems, and Quality • Methods for collecting data Raw data can collected manually or by instruments and sensors. Representative data collection methods are • time studies (during observation), • surveys (using questionnaire), • observation (e.g.using video cameras), and • soliciting information from experts (e.g. Using interview) • Data Problems (Table 4.1) All computer-based systems depend on data. The quality and integrity of the data are critical for the MSS to avoid the GIGO syndrome. MSS depend on data because complied data that make up information and knowledge are the heart of any decision-making system. All right Reserved YAO Zhong, School of E&M, BUAA
4.4 Data Collection, Problems, and Quality • Quality: determines usefulness of data • Intrinsic data quality • Accessibility data quality • Representation data quality All right Reserved YAO Zhong, School of E&M, BUAA
4.4 Data Collection, Problems, and Quality • Uniformity • Version • Completeness check • Conformity check • Genealogy check (drill down) All right Reserved YAO Zhong, School of E&M, BUAA
4.5 The Internet and Commercial Database Services • For external data • The Internet: major supplier of external data, Decision maker can access the home pages of vendor, clients, and competitors, view and download information, or conduct research. • Commercial Data Banks:sell access to specialized databasesAn online database service sell access to specialized DB. Such a service can add external information to the MSS in a timely manner and at a reasonable cost. All that is necessary to retrieve data from such a service is a company terminal, modem , telephone, password, and some service fees. All right Reserved YAO Zhong, School of E&M, BUAA
4.5 The Internet and Commercial Database Services Several thousand service are currently available, most of are accessible via Internet. Example, • CompuServe and The Source • CompuStat • Dow Jones Information Service • Interactive Data Corporation • Lockheed Information Service • Mead Data Central. All right Reserved YAO Zhong, School of E&M, BUAA
4.5 The Internet and Commercial Databases Servers • Use Web Browsers to: • Access vital information by employees and customers • Implement executive information systems • Implement group decision support systems (GDSS) • Database management systems provide data in HTML, on Web servers directly The big three relational database management system vendors-Informix, Oracle, Sybase and DB2-have reworked their core database products to accommodate a world of client/server, browser/server and Internet/intranet application All right Reserved YAO Zhong, School of E&M, BUAA
4.5 The Internet and Commercial Databases Servers • Use Web Browsers to: Besides, Powersoft is also readying software - The Internet Toolkit for Powerbuilder 5.0, and add-on to the PowerBuilder Windows application tool-for developing Web-enabled C/S application Web-site and database integration suppliers include: • Spider Technology, • Haht Software, • Next Software Inc. • NetObject Inc. • OneWave Corp. All right Reserved YAO Zhong, School of E&M, BUAA
4.6 Database Management Systems in DSS • DBMS: Software program for entering (or adding) information into a database; updating, deleting, manipulating, storing, and retrieving information • A DBMS + modeling language to develop DSS • DBMS to handle LARGE amounts of information • DSS is often working with both data and models • Small DSS can be build by either enhanced DBMS or spreadsheet. • Spreadsheet vs. DBMS? All right Reserved YAO Zhong, School of E&M, BUAA
4.7 Database Organization and Structure • Relational databases • This form is prominent in DBMS areas. 2D tabular to present the data logical structure. • Easy learn and maintains. • Hierarchical databases • Organizing data items in a top-down form, creating logical links between related data items. It looks like a tree. • Network databases • Network form to represent data logical structure. • Advantage is save store space. All right Reserved YAO Zhong, School of E&M, BUAA
4.7 Database Organization and Structure • Object-oriented databases • Complex application involves complex database, such as CIMS. • Any forms of above three types of databases can not satisfy the requirements. • OODBMS is based on the principle of OOP. OODMBS combine the characteristics of an OOP such as C++ or SmallTalk Language with a mechanism for data storage and access. It allows one to analyze data at a conceptual level that emphasizes the natural relationships between objectives. Abstraction is used to establish inheritance hierarchies and objective encapsulation allows the database designer to store both conventional data and procedural code within the same objects. All right Reserved YAO Zhong, School of E&M, BUAA
4.7 Database Organization and Structure • Object-oriented databases (see Leung, 2002) • OODBMS defines data as objects and encapsulates data along with their relevant structure and behaviors. The system uses a hierarchy of classes and subclasses of objects. Structure, in terms of relationships, and behavior, in terms of methods and procedures, are contained within an objects. • OODMBS especially useful in distributed DSS for very complex application • Multimedia-Based Database • MMDBMS manage data in a variety of formats, in addition to the standard text or numeric fields. These formats include images such as digitized photos or forms of bit-mapped graphics such maps or .pic files, hypertext images, video clips, sound, and virtual reality. All right Reserved YAO Zhong, School of E&M, BUAA
4.7 Database Organization and Structure At best, no more than 15 percent of all corporate information is digitized according to Gartner Group, Inc. At least 85% of all corporate information resides outside the computer in documents, maps, photos, images, and videotapes. For organizations to build application to take advantage of rich data types, the DBMS must accommodate them. Oracle, Informix, and Sybase store rich multimedia data types as binary large objects. Most PC systems (as clients) are capable of supporting the display or playback of files in these formats. It is logical, but not easy, to expand database management capabilities to include these objects into management support systems. All right Reserved YAO Zhong, School of E&M, BUAA
4.7 Database Organization and Structure • Document-based Database Organizations are drowning(溺死)in paper. To alleviate the paper storage and shuffling, document-based database were developed. These are also knows as electronic document management (EDM) systems. They are used for information dissemination, form storage and management, shipment, expert license processing, and workflow automation. • Distributed database A group of data have been distributed physically on different sites in various computer networks and logically belong to one system. The connected sites in the network is of individual process capabilities and can execute the local applications. At the same time, each site also can running the system wide applications through the communication subsystem. All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Definition: (SAS Institute, 1996) • Physical separation of a company’s operational decision support environments. • At the heart of many companies lies a store of operational data, usually derived from critical mainframe-based online transaction processing (OLTP) systems, such as order entry applications. The OLTP systems are built with COBOL and they operate in Customer Information Control System (CICS) environment. OLTP systems for financial and inventory management and control also produce operational data. In this case, data access, application logic tasks, and data representation logic are tightly coupled together, usually in nonrelational database. These nonrelational data stores are not very conductive to data retrieval for All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing decision support application. Deriving information for decision support analysis from an operational data store can be a self-defeating activity that requires too much time and programming expertise. It may negatively affect the performance of the critical transactional system. However, equally critical to a company’s success, decision support information must be made accessible to management. • W.H.Inmon(1992)definition:“data warehouse is a set of supporting subject-oriented, integrated, time-dependent and permanent decision process. • Purpose: to establish a data repository making operational data accessible in a form that is readily acceptable for decision support and EIS application. • As part of this new accessibility, a process must transform detail-level operational data to a relational form, which makes them more amenable to decision support processing. All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Only data needed for decision support come from the TPS (operational environment). As data pass into the data warehouse, they are transformed and integrated into a consistent structure. • Data warehouses allows for the storage of metadata, which can include data summaries that are easier to search for and index. • The data are then placed directly into a data repository at the current level of detail, where they are eventually summarized, archived into the older detail data level, or purged. • Moving DSS information off the mainframe presents a company with an opportunity to restructure its DSS strategy. The company can reinvent the way that can shape and form their DSS data. All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Any EIS requires having good summarized data in different forms. Traditional EIS and DSS application often failed because the underlying data were difficult or impossible to access. • Also, manipulation of the data has traditionally been a part of the EIS, instead of a preliminary data preparation activity. • Data warehousing (information warehousing): solves the data access problem • Data warehouse combines various data sources into a single resource for end-user access • End users perform ad hoc query, reporting analysis and visualization All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Data warehousing involves combining a variety of technology vendor’s product into an integrated solution. • There can be several data warehouse in one company. • DW benefits: • Increase in knowledge worker productivity • Supports all decision makers’ data requirements • Provide ready access to critical data • Insulates operation databases from ad hoc processing • Provides high-level summary information • Provides drill down capabilities • Improved business knowledge • Competitive advantage All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Enhances customer service and satisfaction • Facilitates decision making • Help streamline business processes • DW Architecture and Process • two-tier architecture • three-tier architecture (figure 4.3) Data from internal (legacy) sources and external sources are extracted, scrubbed, filtered, and summarized via special software before insertion in the data warehouse. The data then are processed again and deposited in an additional special MD database (3 tiers), organized for easy MD presentation. The DSS and EIS users can query the new server and perform analysis. In a two-tier architecture, there is no MD database or server. All right Reserved YAO Zhong, School of E&M, BUAA
Data Warehouse Architecture and Process Repository EIS Legacy System DB Server EIS/DSS Server Data Acquisition software DSS External System MultiD DB Data warehouse EIS Three tiers Data Warehouse Architecture by McFadden and Watson [1996] All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Data Warehouse Components • Large physical database: This is an actual, physical database into which all the data for the data warehouse are gathered, along with the metadata and the processing logic used to scrub, organize, package, and preprocess the data for end-user access. • Logical data warehouse: contains all the metadata, business rules, and processing logic required to scrub, organize, package, and preprocess the data. In addition, it contains the information required to find and access the actual data, wherever they actually reside. All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Data mart: A data mart is a subset of the enterprise-wide data warehouse. Typically it performs the role of a departmental, regional, or functional data warehouse. As part of the iterative data warehouse process, the organization builds a series of data marts over time and eventually links them via an enterprise-wide logical data warehouse. • The Metadata* : Data about data stored within the DW. • Decision support systems (DSS) and executive information system (EIS): These are not data warehouse but applications that use the data warehouse. All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Data Warehouse Suitability: • Data warehousing is most appropriate for organizations where: • Data are stored in different systems. • An information-based approach to management is in use. • There is large, diverse customer base. • The same data are represented differently in different systems. • Data are stored in highly technical, difficult to decipher formats. All right Reserved YAO Zhong, School of E&M, BUAA
4.8 Data Warehousing • Characteristics of Data Warehousing 1. Data organized by detailed subject with information relevant for decision support 2. Integrated data :data in different locations may be enclosed differently. For example, gender data may be encoded 0 and 1 in one place, and m and f in another. DB2, ORACLE, Informix, Sybase, SQL Server, Access, etc. all are integrated into one DW. 3. Time-variant data: for 5-10 years data can used for tends, forecasting, and comparison. 4. Non-volatile data: Once entered into DW, data are not changed or updated. All right Reserved YAO Zhong, School of E&M, BUAA
4.9 OLAP: Data Access and Mining, Querying, and Analysis • Online analytical processing (OLAP) • DSS and EIS computing done by end-users in online systems • Versus online transaction processing (OLTP)OLTP is performed by end-users, whereas OLAP is done by IS professionals. OLAP’s activities are generating queries, requesting ad hoc reports, conducting statistical analyses, and building multimedia applications. To facilitate OLAP, data warehouse is necessary. OLTP with database. All right Reserved YAO Zhong, School of E&M, BUAA
4.9 OLAP: Data Access and Mining, Querying, and Analysis • OLAP Activities • Generating queries • Requesting ad hoc reports • Conducting statistical and other analyses • Developing multimedia applications All right Reserved YAO Zhong, School of E&M, BUAA
4.9 OLAP: Data Access and Mining, Querying, and Analysis All right Reserved YAO Zhong, School of E&M, BUAA
4.9 OLAP: Data Access and Mining, Querying, and Analysis • OLAP uses the data warehouse and a set of tools, usually with multidimensional capabilities • Query tools • Spreadsheets • Data mining tools • Data visualization tools • Two types realized modes • MOLAP (Multidimensional OLAP): EXPRESS (MIT) and System W (Comshare) • ROLAP: (Relational OLAP): Metaphor All right Reserved YAO Zhong, School of E&M, BUAA
4.9 OLAP: Data Access and Mining, Querying, and Analysis • Using SQL for Querying • SQL (Structured Query Language)Data language English-like, nonprocedural, very user friendly languageFree formatExample:SELECT Name, SalaryFROM EmployeesWHERE Salary >2000 All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Data mining is a term used to describe knowledge discovery in databases, knowledge extraction, data archaeology, data exploration, data pattern processing, data dredging, information harvesting, and software. • Data Mining Applications: • Knowledge discovery in databases • Knowledge extraction • Data archeology • Data exploration • Data pattern processing • Data dredging • Information harvesting All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Major Data Mining Characteristics and Objectives • Data are often buried deep • Client/server architecture • Sophisticated new tools--including advanced visualization tools--help to remove the information “ore” • End-user miner empowered by data drills and other power query tools with little or no programming skills • Often involves finding unexpected results • Tools are easily combined with spreadsheets, etc. • Parallel processing for data mining All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Data Mining Application Areas • Marketing (Retailing and sales) • Banking • e-Commerce • Manufacturing and production • Brokerage and securities trading • Insurance • Computer hardware and software • Government and defense • Airlines • Health care • Broadcasting • Law enforcement All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Intelligent Data Mining • Use intelligent search to discover information within data warehouses that queries and reports cannot effectively reveal • Find patterns in the data and infer rules from them • Use patterns and rules to guide decision making and forecasting • Five common types of information that can be yielded by data mining: 1) association, 2) sequences, 3) classifications, 4) clusters, and 5) forecasting All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Main Tools Used in intelligent Data Mining • Case-based Reasoning: Using historical cases, the case-base reasoning approach can be used to recognize patterns. • Neural Computing: Neural computing is a machine learning approach by which historical data can be examined for pattern recognition. Thus, one can go through large databases identify potential customers of a new product. (a brief introduction, doesn’t requirement ) All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Intelligent Agents:one of the most promising approaches to retrieving information from databases, especially external ones, is the use of intelligent agents. As vast amounts of information are becoming available through the Internet, finding the right information is becoming more difficult. IA is an autonomous agent, which is a system situation within and a part of an environment that senses thatenvironment and acts in on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future. • Other Tools • Decision trees • Rule induction • Data visualization All right Reserved YAO Zhong, School of E&M, BUAA
4.10 Data Mining • Often used technologies in data mining • Associate rules • Clustering • Artificial Neural Networks • Decision trees • Multivariate regression All right Reserved YAO Zhong, School of E&M, BUAA