Unit Five: The nature and sources of data
• Data: items about things, events, activities, and transactions that are recorded, classified, and stored but are not organized to convey any specific meaning.
• Data items can be numeric, alphanumeric, figures, sounds, or images.
• Information: data that have been organized in a manner that gives them meaning for the recipient. Information confirms something the recipient already knows, or may have surprise value by revealing something not known.
• Knowledge: data items and/or information organized and processed to convey understanding, experience, accumulated learning, and expertise that are applicable to a current problem or activity.
• Knowledge can be the application of data and information in making decisions.
• Internal data: stored in one or more places; they are about people, products, services, and processes (for example, student data stored in a university DB).
• External data: come from many sources, such as commercial DBs and data collected by sensors and satellites. They are available on CD, DVD, and the Internet, and from statistical bureaus, banks, and chambers of commerce.
Data collection problems and quality:
• The need to collect data from internal and external sources makes building an MSS complicated.
• In some cases it is necessary to collect raw data in the field.
• In other cases it is necessary to get data from people or to find them on the Internet.
• Data must be validated and filtered.
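The validation and filtering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`student_id`, `age`, `collected`), the plausible age range, and the cutoff date are all hypothetical and do not come from the unit.

```python
from datetime import date

# Hypothetical raw records, e.g. gathered from questionnaires and a web source.
RAW_RECORDS = [
    {"student_id": "S001", "age": 21, "collected": date(2024, 3, 1)},
    {"student_id": "", "age": 19, "collected": date(2024, 3, 2)},       # missing id
    {"student_id": "S003", "age": 230, "collected": date(2024, 3, 2)},  # implausible age
    {"student_id": "S004", "age": 24, "collected": date(2019, 1, 5)},   # not timely
]

def is_valid(record, oldest_allowed):
    """Return True when the record passes all three assumed checks."""
    has_id = bool(record["student_id"].strip())
    plausible_age = 15 <= record["age"] <= 100   # assumed limits
    timely = record["collected"] >= oldest_allowed
    return has_id and plausible_age and timely

def filter_records(records, oldest_allowed):
    """Split records into an accepted list and a rejected list."""
    accepted = [r for r in records if is_valid(r, oldest_allowed)]
    rejected = [r for r in records if not is_valid(r, oldest_allowed)]
    return accepted, rejected

accepted, rejected = filter_records(RAW_RECORDS, oldest_allowed=date(2023, 1, 1))
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Keeping the rejected records, rather than silently dropping them, makes it possible to review why the data failed validation.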
Methods for collecting raw data:
1- Manually: observations, questionnaires, interviews, soliciting information from experts.
2- Sensors and scanners, for example for biometrics.
Data problems:
1- Data are not correct: generated carelessly or entered inaccurately.
2- Data are not timely: the methods for generating data are not fast enough to meet the need for data.
3- Data are not measured properly: gathered inconsistently with the purposes of the analysis.
4- Needed data do not exist: no one ever stored the data that are needed now.
Data quality: quality determines the usefulness of data as well as the quality of the decisions based on them.
Data quality (DQ) problems fall into four categories:
1- Contextual DQ: relevancy, timeliness, completeness.
2- Intrinsic DQ: accuracy, objectivity, believability, reputation.
3- Accessibility DQ: access security.
4- Representation DQ: interpretability, ease of understanding, consistent representation.
Data Integrity
• Older filing systems may lack integrity. If a change is made in a file in one place, it may not be made in the copy held in another place or department, which results in conflicting data.
• Data integrity considers the following issues:
1- Uniformity check: during data capture, ensures that data are within specified limits.
2- Version check: performed when data are transformed through the use of metadata, to ensure that the format of the original data has not been changed.
3- Completeness check: ensures that the summaries are correct and that all values needed to create a summary are included.
4- Conformity check: ensures that during data analysis and reporting, correlations are run between the value reported and previous values for the same numbers. Sudden changes can indicate a basic change in the business, an analysis error, or bad data.
5- Genealogy check (drill down): traces data back to their source through their various transformations.
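Three of the checks above (uniformity, completeness, conformity) lend themselves to a short sketch. The function names and the 50% change threshold are assumptions chosen for illustration; a real system would take its limits from business rules.

```python
def uniformity_check(value, low, high):
    """Uniformity: the captured value must lie within the specified limits."""
    return low <= value <= high

def completeness_check(summary_total, detail_values):
    """Completeness: no detail value is missing and the summary matches them."""
    return None not in detail_values and summary_total == sum(detail_values)

def conformity_check(current, previous, max_relative_change=0.5):
    """Conformity: flag a sudden change relative to the previous reported value.

    The 0.5 (50%) threshold is an assumed default, not a standard.
    """
    if previous == 0:
        return current == 0
    return abs(current - previous) / abs(previous) <= max_relative_change

print(uniformity_check(150, 0, 100))             # value outside the limits
print(completeness_check(60, [10, None, 30]))    # a detail value is missing
print(conformity_check(300, 100))                # sudden jump from 100 to 300
```

A failed conformity check does not prove the data are bad; as noted above, it may also reflect a genuine change in the business.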
Data Access and Integration
• How can data be reached in their storage area?
• Data can be accessed from sources such as:
* Relational database tables, XML documents, electronic data messages, COBOL records, and the Internet, which makes thousands of databases all over the world accessible through the Web.
* Commercial data banks, which are online database services that sell access to specialized databases. They can add external data to an MSS in a timely manner and at reasonable cost. One example is a GIS (geographic information system).
DBMS
• Supplements the standard operating system by allowing for greater integration of data, complex file structures, quick retrieval and changes, and better data security.
• It is a set of software programs for adding information to a DB and for updating, deleting, manipulating, storing, and retrieving that information.
DB types
1- Relational
2- Hierarchical
3- Network
4- Object-oriented DB (OODB):
• An MSS application may require access to complex data, which may include pictures that cannot be handled by the previous types.
• An OODB may be used to handle pictures and other graphical representations.
• It is based on object-oriented programming (OOP): it combines object-oriented concepts, often modeled with UML, with mechanisms for data storage and access.
• An OODBMS allows data to be analyzed at a conceptual level that emphasizes the natural relationships between objects, using encapsulation and inheritance.
• An OODBMS defines data as objects and encapsulates data along with their relevant structure and behavior.
5- Multimedia-based DB:
• An MMDBMS manages data in a variety of formats in addition to text and numbers.
• Other formats include images such as digitized photographs, forms of bit-mapped graphics such as maps or .PIC files, hypertext images, video clips, sounds, and virtual reality (multidimensional images).
Data Warehousing
• A data warehouse is one or several databases that contain the information needed for tactical or strategic decisions.
• It is a collection of data designed to support decision making.
• It contains data that present a coherent picture of business conditions at a single point in time.
Data warehousing can be used for:
1- Supporting decision making.
2- Analyzing large amounts of data from various sources to provide fast results that support critical processes.
* Organizations (public and private) continuously collect data, information, and knowledge and store them in computerized systems.
As the amount of data increases:
1- Updating, retrieving, using, and removing information becomes complicated.
2- The number of data users increases as a result of improved reliability and availability of network access.
• The warehouse gets data from external and internal sources and organizes them consistently with the organization's needs.
• The data warehouse has access to all information relevant to the organization, which can come from internal or external sources.
Data Warehouse characteristics
1- Subject-oriented:
• Data are organized by detailed subject (for example, customer or policy type in an insurance company).
• Data contain only information relevant for decision support.
• This enables users to determine how their business is performing and why it is performing that way.
• It provides a more comprehensive view of the organization than an operational DB, which is oriented toward products and handles transactions.
2- Integrated:
• Data at different source locations may be encoded differently. For example, a person's gender may be encoded as 0 or 1 in one place and as F or M in another. In the data warehouse they are scrubbed (cleaned) into one format, which makes them standard and consistent.
3- Time variant (time series):
• Data are historical and do not necessarily reflect current status.
• Data are kept for several years and are used for trends, forecasting, and comparison.
• Time is the one important dimension that all data warehouses must support.
• Data for analysis from multiple sources contain multiple time points (daily, weekly, monthly views).
4- Nonvolatile:
• Once entered into the warehouse, data are read-only; they cannot be changed or updated.
• Obsolete data are discarded, and changes are recorded as new data.
5- Summarized: operational data are aggregated into summaries.
6- Not normalized: data in a data warehouse are not normalized and are highly redundant.
7- Sources: all data are present, both internal and external.
8- Metadata: data about data are included in the data warehouse.
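Two of the characteristics above, integrated and nonvolatile, can be illustrated with a minimal sketch. Everything here is hypothetical: the `GENDER_CODES` mapping (including which of 0 and 1 stands for which gender, which only the source system can define), the `Warehouse` class, and the sample customer rows.

```python
# Assumed mapping from source-specific codes to one warehouse standard.
# Which of 0/1 means "F" is an assumption; the source system defines it.
GENDER_CODES = {"0": "F", "1": "M", "F": "F", "M": "M"}

def scrub_gender(raw):
    """Map a source code to the standard code, or None if unrecognized."""
    return GENDER_CODES.get(str(raw).strip().upper())

class Warehouse:
    """Append-only store: rows are added, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, customer_id, gender, city, as_of):
        # Scrub on the way in so every stored row uses the same encoding.
        self._rows.append({
            "customer_id": customer_id,
            "gender": scrub_gender(gender),
            "city": city,
            "as_of": as_of,
        })

    def history(self, customer_id):
        """Return every recorded version of a customer, oldest first."""
        return [r for r in self._rows if r["customer_id"] == customer_id]

wh = Warehouse()
wh.load(7, 0, "City A", "2023-01")    # source system 1 encodes gender as 0/1
wh.load(7, "f", "City B", "2024-01")  # a move is recorded as a new row
for row in wh.history(7):
    print(row)
```

Because a change is stored as a new row with its own `as_of` value, the earlier state stays available for trend analysis, which ties the nonvolatile property to the time-variant one.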
Metadata
• Describes the structure of, and some meaning about, the data, thereby contributing to their effective or ineffective use.
• It is the key to making users comfortable with the technology.
• Metadata involve knowledge; capturing that knowledge and making it accessible throughout the organization has become an important success factor.
• With metadata and a metadata repository, organizations can improve their use of information and their application development processes.
• Businesses benefit from metadata as follows:
1- Reduction of IT-related problems.
2- Increased system value to the business.
3- Improved business decision making.
• Business metadata comprise information that increases our understanding of the traditional (structured) data reported.
• Their primary purpose is to provide context to the data, enriching information and leading to knowledge. The context does not have to be the same for all users.
• They assist in the conversion of data and information into knowledge.
Data about data. Metadata describes how and when and by whom a particular set of data was collected, and how the data is formatted. Metadata is essential for understanding information stored in data warehouses and has become increasingly important in XML-based Web applications.
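As a small illustration of "how, when, and by whom" a data set was collected and how it is formatted, a metadata record might be sketched as below. The `DatasetMetadata` class, its field names, and the sample values are hypothetical and stand in for whatever a real metadata repository stores.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DatasetMetadata:
    """Hypothetical metadata record describing one data set."""
    name: str
    collected_by: str      # by whom
    collected_on: date     # when
    method: str            # how
    column_formats: dict   # column name -> format description

def describe(meta):
    """Render the metadata as one human-readable line."""
    columns = ", ".join(f"{col} ({fmt})" for col, fmt in meta.column_formats.items())
    return (f"{meta.name}: collected by {meta.collected_by} on "
            f"{meta.collected_on.isoformat()} via {meta.method}; columns: {columns}")

enrollment_meta = DatasetMetadata(
    name="student_enrollment",
    collected_by="registrar office",
    collected_on=date(2024, 9, 1),
    method="online form",
    column_formats={"student_id": "text", "credits": "integer"},
)
print(describe(enrollment_meta))
```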
Data Warehouse architecture and process
• The architecture can have one, two, or three layers.
• A DWH can be divided into three parts:
1- The DWH itself, which contains the data and associated software.
2- Data acquisition software, which extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the DWH.
3- Client software, which allows users to access and analyze data in the warehouse.
The three-layer architecture contains:
1- The operational systems, which hold the data and the software for data acquisition, in one layer (server).
2- The DWH as another layer.
3- A third layer that includes the decision support/business intelligence/business analytics engine and the client.
Its advantage is that it separates the functions of the data warehouse, which eliminates resource constraints and makes it possible to create data marts easily.
In the two-layer architecture, the DSS engine is on the same platform as the warehouse, which makes it more economical than the three-layer structure.
Some issues to consider when selecting an architecture:
1- Which DBMS should be used? Most DWHs are built using a relational DBMS such as Oracle or SQL Server, which support client-server and Web-server architectures.
2- Will parallel processing or partitioning be used? Parallel processing enables multiple CPUs to process data warehouse query requests at the same time. Partitioning splits the DB tables into smaller ones to improve access efficiency.
3- Will data migration tools be used to load the DWH?
4- What tools will be used to support data retrieval and analysis?
Data Warehouse Development
• The process of migrating data to a DWH involves extracting data from all relevant sources.
• Data sources consist of the following:
1- Files extracted from online transaction processing (OLTP) systems.
2- Spreadsheets.
3- Personal DBs (MS Access).
4- External files.
A DWH contains a number of business rules that define the following:
1- How the data will be used.
2- Summarization rules.
3- Standardization of encoded attributes.
4- Calculation rules.
* Data quality issues need to be corrected before the data are loaded into the DWH.
• One of the well-defined benefits of a DWH is that these rules can be stored in a metadata repository and applied to the DWH centrally.
• In OLTP, rules are scattered all over the system.
• The load process into the DWH can be performed either by:
1- Data transformation tools, which provide a graphical user interface (GUI) to help in developing and maintaining business rules.
2- Programs or utilities developed to load the data warehouse, using programming languages such as PL/SQL, C++, or .NET.
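The idea of keeping business rules in one central place and applying them during the load can be sketched as follows. The `RULES` dictionary stands in for a metadata repository; its structure, the status codes, and the column names are assumptions made for this example only.

```python
# Hypothetical central rule repository: standardization and calculation rules
# live in one place instead of being scattered across programs.
RULES = {
    "standardize": {"status": {"A": "active", "I": "inactive"}},
    "calculate": {"total": lambda row: row["quantity"] * row["unit_price"]},
}

def transform(row, rules=RULES):
    """Apply the central rules to one source row and return a new row."""
    out = dict(row)
    # Standardization of encoded attributes.
    for column, mapping in rules["standardize"].items():
        out[column] = mapping.get(out[column], out[column])
    # Calculation rules for derived columns.
    for column, formula in rules["calculate"].items():
        out[column] = formula(out)
    return out

def load(rows, warehouse, rules=RULES):
    """Transform each source row and append it to the warehouse list."""
    for row in rows:
        warehouse.append(transform(row, rules))
    return warehouse

source_rows = [
    {"status": "A", "quantity": 3, "unit_price": 4},
    {"status": "I", "quantity": 1, "unit_price": 10},
]
warehouse = load(source_rows, [])
print(warehouse)
```

Changing a rule in `RULES` changes how every later load behaves, which is the practical benefit of central rule storage that the text contrasts with OLTP.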
Several issues affect whether to buy a data transformation tool or build the transformation process in-house:
1- Transformation tools are costly.
2- They may take a long time to learn.
3- It is difficult to measure how the IT organization is doing until it has learned to use the tool.
Benefits of transformation tools:
1- They simplify the maintenance of the organization's DWH.
2- They are effective in detecting and scrubbing (removing) bad data.
Star Schema
• DWH design is based on dimensional modeling.
• Dimensional modeling is a retrieval-based model that supports a high volume of query access.
• The star schema is how dimensional modeling is implemented.
• A star schema contains a central fact table, which contains:
1- The attributes needed to perform decision analysis.
2- Descriptive attributes used for query reporting.
3- Foreign keys that link to the dimension tables.
Decision analysis attributes consist of:
A- Performance measures.
B- Operational metrics.
C- Aggregate measures.
D- Other metrics needed to analyze the organization's performance.
• The fact table addresses what the DWH supports for decision analysis.
• Dimension tables contain attributes that describe the data contained in the fact table.
• Dimension tables address how the data will be analyzed.
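A star schema can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are hypothetical; the point is only the shape: one fact table holding measures and foreign keys, surrounded by dimension tables that describe those measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: describe how the facts can be analyzed.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    units_sold INTEGER,
    revenue INTEGER
);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
INSERT INTO fact_sales VALUES (1, 10, 3, 30), (1, 11, 2, 20), (2, 10, 1, 50);
""")

def revenue_by_category(connection):
    """Join the fact table to a dimension and aggregate a measure."""
    return connection.execute("""
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()

print(revenue_by_category(conn))
```

The query follows the usual pattern for this design: the fact table supplies what is measured (`revenue`), and the dimension table supplies how it is grouped (`category`).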
• The grain of a data warehouse defines the highest level of detail it supports. The grain indicates whether the DWH is highly summarized or also includes detailed transaction data.
• If the grain is defined too high, the warehouse may not support detailed requests to drill down into the data.
• Drill-down analysis is the process of probing beyond a summarized value to investigate each of the detailed transactions that make up the summary.
• A low level of granularity results in more data being stored in the DWH.
• Larger amounts of detail may affect query performance, making response times longer.
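The relationship between grain and drill-down can be shown with a few lines of Python. The transactions and field names below are made up for illustration; the sketch assumes the warehouse keeps transaction-level detail, which is what makes the drill-down possible.

```python
from collections import defaultdict

# Hypothetical transaction-level (fine-grained) data.
TRANSACTIONS = [
    {"id": 1, "month": "2024-01", "amount": 40},
    {"id": 2, "month": "2024-01", "amount": 60},
    {"id": 3, "month": "2024-02", "amount": 25},
]

def summarize_by_month(transactions):
    """Roll detail up to a monthly summary (a coarser grain)."""
    totals = defaultdict(int)
    for t in transactions:
        totals[t["month"]] += t["amount"]
    return dict(totals)

def drill_down(transactions, month):
    """Return the detailed transactions behind one monthly summary value."""
    return [t for t in transactions if t["month"] == month]

print(summarize_by_month(TRANSACTIONS))
print(drill_down(TRANSACTIONS, "2024-01"))
```

If only the monthly totals had been stored, `drill_down` would have nothing to return; that is the trade-off between a coarse grain (less storage, faster queries) and a fine grain (more storage, full detail).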
Implementing a Data Warehouse
• DWH projects can be identified as either data centric or application centric.
• Data-centric warehouse:
- Based on a data model that is independent of any application.
- Designed to support a variety of user needs and applications.
- Supports flexibility, since the organization's information needs change constantly. The more dynamic the business, the more its data needs will change.
• Application-centric warehouse:
- Designed to support a single initiative or a small set of initiatives.
- Preferred for independent data mart development.
- Provides a more focused scope, which increases the chance of a successful DWH implementation.
- Its disadvantage is that critical data needs may be missed during the initial development, so multiple iterations are necessary.
Factors that play a big role in the successful implementation of a DWH can be categorized into organizational issues, project issues, and technical issues. The factors are:
1- Management support
2- Champion
3- Resources
4- User participation
5- Team skills
6- Source systems
7- Development technology
• Implementing a Web-based DWH (Webhousing) makes it easier to access large amounts of data, but it is difficult to determine the hard benefits of a DWH.
• Hard benefits: benefits to the organization that can be expressed in monetary terms (organizations have priorities when it comes to money).
• A project champion helps ensure that the DWH project will receive the resources necessary for successful implementation.
• Resources can be costly: a DWH may require high-end processors and a large increase in direct-access storage devices, and Web-based warehouses have special security requirements.
User participation:
- Users participate in data modeling and access modeling.
• During data modeling, expertise is required to determine the following:
1- What data are needed?
2- What are the business rules for the data?
3- What aggregations and calculations are needed?
• Access modeling is needed to:
1- Determine how data are to be retrieved from the DWH.
2- Help in the physical definition of the warehouse by determining which data need indexing.
3- Indicate whether data marts are needed to facilitate information retrieval.
• Team skills require in-depth knowledge of DB technologies and development tools.
• Source systems and development technology refer to the many inputs and processes used to load and maintain the DWH.
• Ubiquitous
Best practices for DWH implementation
• The project must fit with corporate strategy and business objectives.
• There must be complete buy-in to the project (executives, managers, users).
• Manage expectations.
• The DWH must be built incrementally.
• The project must be managed by both IT and managers.
• Load only cleaned, high-quality data.
• Do not overlook training requirements.
DWH Risks
• There are many risks in a DWH project. They are serious because DWHs are large-scale, expensive projects. Some risks are:
• The quality of the source data is not known.
• Skills are not in place.
• Inadequate budget.
• Lack of supporting software.
• Weak sponsor or loss of sponsor.
• Users are not computer literate.
• Unrealistic user expectations.
• Key people may leave the project.
• Too much new technology.
• Team geography, language, and culture.