Data Mining Lecture 2: DBMS, DW, OLAP, and Data Preprocessing
Data Life Cycle Process (Continued) • The result: generating knowledge
Methods for Collecting Raw Data • Collection can take place: • in the field • from individuals • via manual methods (time studies, surveys, observations, contributions from experts) • using instruments and sensors • through transaction processing systems (TPS) • via electronic transfer • from a web site (clickstream) The task of data collection is fairly complex, which can create data-quality problems requiring validation and cleansing of the data.
The Need for Data Analysis • Managers must be able to track daily transactions to evaluate how the business is performing • By tapping into the operational database, management can develop strategies to meet organizational goals • Data analysis can provide information about short-term tactical evaluations and strategies
The Data Warehouse • Benefits of a data warehouse: • The ability to access data quickly, since they are located in one place • The ability for end users to access data easily and frequently with Web browsers. A data warehouse is a repository of subject-oriented historical data that is organized to be accessible in a form readily acceptable for analytical processing activities (such as data mining, decision support, querying, and other applications).
The Data Warehouse (Continued) • Characteristics of data warehousing: • Time variant. The data are kept for many years so they can be used for trends, forecasting, and comparisons over time. • Nonvolatile. Once entered into the warehouse, data are not updated. • Relational. Typically the data warehouse uses a relational structure. • Client/server. The data warehouse uses the client/server architecture mainly to give end users easy access to its data. • Web-based. Data warehouses are designed to provide an efficient computing environment for Web-based applications.
Conceptual Modeling of Data Warehouses • Modeling data warehouses: dimensions & measures • Star schema: a fact table in the middle connected to a set of dimension tables • Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into sets of smaller dimension tables, forming a shape similar to a snowflake • Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, hence also called a galaxy schema
Example of Star Schema
• Sales fact table (center): time_key, item_key, branch_key, location_key (foreign keys); units_sold, dollars_sold, avg_sales (measures)
• Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year); item (item_key, item_name, brand, type, supplier_type); branch (branch_key, branch_name, branch_type); location (location_key, street, city, province_or_state, country)
Example of Snowflake Schema
• Sales fact table (center): time_key, item_key, branch_key, location_key (foreign keys); units_sold, dollars_sold, avg_sales (measures)
• Dimension tables: time (time_key, day, day_of_the_week, month, quarter, year); item (item_key, item_name, brand, type, supplier_key); branch (branch_key, branch_name, branch_type); location (location_key, street, city_key)
• Normalized sub-dimension tables: supplier (supplier_key, supplier_type); city (city_key, city, province_or_state, country)
Example of Fact Constellation
• Sales fact table: time_key, item_key, branch_key, location_key (foreign keys); units_sold, dollars_sold, avg_sales (measures)
• Shipping fact table: time_key, item_key, shipper_key, from_location, to_location (foreign keys); dollars_cost, units_shipped (measures)
• Shared dimension tables: time (time_key, day, day_of_the_week, month, quarter, year); item (item_key, item_name, brand, type, supplier_type); branch (branch_key, branch_name, branch_type); location (location_key, street, city, province_or_state, country)
• Additional dimension table: shipper (shipper_key, shipper_name, location_key, shipper_type)
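To make the star layout above concrete, here is a minimal sketch using Python's built-in sqlite3, covering just two of the dimension tables from the figure (time and item). The column names follow the example; the in-memory database and the sample roll-up query are illustrative assumptions, not a prescribed warehouse design.

```python
import sqlite3

# In-memory database purely for illustration; a real warehouse would live in a
# dedicated DBMS, but the star layout is the same: one central fact table whose
# foreign keys point at the surrounding dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT,
                           day_of_the_week TEXT, month TEXT, quarter TEXT,
                           year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT, supplier_type TEXT);
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        item_key     INTEGER REFERENCES item_dim(item_key),
        units_sold   INTEGER,
        dollars_sold REAL
    );
""")

# A typical OLAP-style query against the star: roll sales up to (year, brand).
rows = con.execute("""
    SELECT t.year, i.brand, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.year, i.brand
""").fetchall()
```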
The Data Cube • One intersection might be the quantities of a product sold by specific retail locations during certain time periods. • Another might be sales volume by department, by day, by month, by year for a specific region. • Cubes provide faster: • Queries • Slices and dices of the information • Roll-ups • Drill-downs Multidimensional databases (sometimes called OLAP databases) are specialized data stores that organize facts by dimensions, such as geographical region, product line, salesperson, and time. The data in these databases are usually preprocessed and stored in data cubes.
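A small pandas sketch of the idea, with invented region/product/quarter dimensions and a sales measure: the base cuboid is a group-by over all dimensions, a slice fixes one dimension, and a roll-up re-aggregates over fewer dimensions.

```python
import pandas as pd

# Toy fact data: region, product, and quarter are the dimensions, sales the measure.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "sales":   [100, 150, 80, 120],
})

# Base cuboid: the measure aggregated over all three dimensions.
base_cuboid = sales.groupby(["region", "product", "quarter"])["sales"].sum()

# Slice: fix one dimension (here, quarter == "Q1").
q1_slice = sales[sales["quarter"] == "Q1"]

# Roll-up: drop the product dimension and re-aggregate at a coarser level.
rollup = sales.groupby(["region", "quarter"])["sales"].sum()
```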
Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time; item; location; supplier
• 2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
• 3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
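Since the lattice is simply the power set of the dimensions, the 2^4 = 16 cuboids above can be enumerated directly; a short illustrative sketch:

```python
from itertools import combinations

dims = ["time", "item", "location", "supplier"]

# Every subset of the dimensions is one cuboid: the empty subset is the
# apex cuboid ("all") and the full set is the base cuboid.
for k in range(len(dims) + 1):
    for cuboid in combinations(dims, k):
        print(f"{k}-D:", cuboid if cuboid else ("all",))
```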
Transactional vs. Analytical Data Processing Transactional processing takes place in operational systems (TPS) that provide the organization with the capability to perform business transactions and produce transaction reports. The data are organized mainly in a hierarchical structure and are centrally processed. This is done primarily for fast and efficient processing of routine, repetitive data. A supplementary activity to transaction processing is called analytical processing, which involves the analysis of accumulated data. Analytical processing, sometimes referred to as business intelligence, includes data mining, decision support systems (DSS), querying, and other analysis activities. These analyses place strategic information in the hands of decision makers to enhance productivity and make better decisions, leading to greater competitive advantage.
Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation
Why Data Preprocessing? • Data in the real world is a mess • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • noisy: containing errors or outliers • inconsistent: containing discrepancies in codes or names • No quality data, no quality mining results • Quality decisions must be based on quality data • Data warehouse needs consistent integration of quality data
Why Data Preprocessing? (Continued) • Just as manufacturing and refining transform raw materials into finished products, raw data must be transformed before it can be used for data mining • ECTL – extract, clean, transform, load – is the process/methodology for preparing data for data mining • The goal: an ideal data mining environment
Data Types • Variable Measures • Categorical variables (e.g., CA, AZ, UT…) • Ordered variables (e.g., course grades) • Interval variables (e.g., temperatures) • True numeric variables (e.g., money) • Dates & Times • Fixed-Length Character Strings (e.g., Zip Codes) • IDs and Keys – used for linkage to other data in other tables • Names (e.g., Company Names) • Addresses • Free Text (e.g., annotations, comments, memos, email) • Binary Data (e.g., audio, images)
Multi-Dimensional Measure of Data Quality • A well-accepted multidimensional view: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility • Broad categories: • intrinsic, contextual, representational, and accessibility.
Major Tasks in Data Preprocessing • Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration • Integration of multiple databases, data cubes, or files • Data transformation • Normalization and aggregation • Data reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization • Part of data reduction but with particular importance, especially for numerical data
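As a small illustration of the transformation step, here is a sketch of min-max and z-score normalization in pandas; the income column and its values are made up for the example.

```python
import pandas as pd

# Made-up attribute; in practice this would come from the integrated data set.
df = pd.DataFrame({"income": [25_000, 48_000, 120_000, 60_000]})

# Min-max normalization: rescale the attribute to [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income_minmax"] = (df["income"] - lo) / (hi - lo)

# z-score normalization: center on the mean, scale by the standard deviation.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()
```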
What the Data Should Look Like • All data mining algorithms want their input in tabular form: rows and columns, as in a spreadsheet or database table (for example, an SPSS data file)
What the Data Should Look Like • Customer signature: a continuous "snapshot" of customer behavior • Each row represents a customer and whatever might be useful for data mining
What the Data Should Look Like • The columns • Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of product A, B, C) • Contain the results of calculations referred to as derived variables (e.g., total sales $)
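A minimal sketch of building one derived variable from raw columns; the customer-signature columns below are invented for illustration.

```python
import pandas as pd

# Invented customer-signature rows: one row per customer, raw sales per product.
sig = pd.DataFrame({
    "customer_id": [101, 102],
    "sales_A":     [120.0, 0.0],
    "sales_B":     [30.0, 75.0],
    "sales_C":     [0.0, 40.0],
})

# Derived variable: total sales across the product columns.
sig["total_sales"] = sig[["sales_A", "sales_B", "sales_C"]].sum(axis=1)
```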
What the Data Should Look Like • Columns with One Value: often not very useful • Columns with Almost Only One Value • Columns with Unique Values • Columns Correlated with the Target Variable (synonyms of the target variable)
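A rough pandas sketch of how such columns might be screened before modeling; the thresholds (99% single value, |correlation| > 0.99) and the function name are arbitrary choices for illustration, not a standard recipe.

```python
import pandas as pd

def screen_columns(df: pd.DataFrame, target: str) -> None:
    """Print columns that are unlikely to help a model, or that leak the target."""
    for col in df.columns:
        if col == target:
            continue
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= 1:
            print(col, ": only one value - candidate to drop")
        elif df[col].value_counts(normalize=True).iloc[0] > 0.99:
            print(col, ": almost only one value - usually drop")
        elif n_unique == len(df):
            print(col, ": unique per row (an ID or key) - drop for modeling")
        elif (pd.api.types.is_numeric_dtype(df[col])
              and pd.api.types.is_numeric_dtype(df[target])
              and abs(df[col].corr(df[target])) > 0.99):
            print(col, ": near-synonym of the target - likely leakage, drop")
```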
Data Cleaning • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data
Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • data that were inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data not considered important at the time of entry • failure to register history or changes of the data • Missing data may need to be inferred.
How to Handle Missing Data? • Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably • Fill in the missing value manually: tedious + infeasible? • Use a global constant to fill in the missing value: e.g., "unknown", a new class?! • Use the attribute mean to fill in the missing value • Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter • Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
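A brief pandas sketch of three of the strategies above (global constant, attribute mean, class-conditional mean); the table and its values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "class":  ["gold", "gold", "silver", "silver"],
    "income": [50_000, None, 30_000, None],
})

# Global constant (e.g., a sentinel meaning "unknown").
filled_const = df["income"].fillna(-1)

# Attribute mean over all samples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean within the sample's own class (the "smarter" variant).
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```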
Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to • faulty data collection instruments • data entry problems • data transmission problems • technology limitations • inconsistency in naming conventions • Other data problems that require data cleaning • duplicate records • incomplete data • inconsistent data
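A short sketch of handling two of the problems listed above (duplicate records and a crude outlier check); the data and the 3-standard-deviation threshold are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount":      [20.0, 20.0, 25.0, 9_999.0],
})

# Duplicate records: keep only the first occurrence of identical rows.
deduped = df.drop_duplicates()

# Crude outlier flag: values more than 3 standard deviations from the mean.
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
outliers = df[z.abs() > 3]
```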