1 / 67

Data Mining Lecture 2: DBMS, DW, OLAP, and Data Preprocessing

Data Mining Lecture 2: DBMS, DW, OLAP, and Data Preprocessing. Contrasting Database and File Systems. An Example of a Simple Relational Database. The Relational Schema for the SaleCo Database. The Entity Relationship Model. The Development of Data Models.

samira
Download Presentation

Data Mining Lecture 2: DBMS, DW, OLAP, and Data Preprocessing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data MiningLecture 2: DBMS, DW, OLAP, and Data Preprocessing

  2. Contrasting Database and File Systems

  3. An Example of a Simple Relational Database

  4. The Relational Schema for the SaleCo Database

  5. The Entity Relationship Model

  6. The Development of Data Models

  7. The Relational Schema for the TinyCollege Database

  8. The Database System Environment

  9. Data Warehouse

  10. Data Life Cycle ProcessContinued The result - generating knowledge

  11. Methods for Collecting Raw Data • Collection can take place • in the field • from individuals • via manually methods • time studies • Surveys • Observations • contributions from experts • using instruments and sensors • Transaction processing systems (TPS) • via electronic transfer • from a web site (Clickstream) The task of data collection is fairly complex. Which can create data-quality problem requiring validation and cleansing of data.

  12. The Need for Data Analysis • Managers must be able to track daily transactions to evaluate how the business is performing • By tapping into the operational database, management can develop strategies to meet organizational goals • Data analysis can provide information about short-term tactical evaluations and strategies

  13. Transforming Operational Data Into Decision Support Data

  14. The Data Warehouse • Benefits of a data warehouse are: • The ability to reach data quickly, since they are located in one place • The ability to reach data easily and frequently by end users with Web browsers. Adata warehouseis a repository of subject-oriented historical data that is organized to be accessible in a form readily acceptable for analytical processing activities (such as data mining, decision support, querying, and other applications).

  15. The Data WarehouseContinued • Characteristics of data warehousing are: • Time variant. The data are kept for many years so they can be used for trends, forecasting, and comparisons over time. • Nonvolatile. Once entered into the warehouse, data are not updated. • Relational. Typically the data warehouse uses a relational structure. • Client/server. The data warehouse uses the client/server architecture mainly to provide the end user an easy access to its data. • Web-based. Data warehouses are designed to provide an efficient computing environment for Web-based applications

  16. The Data WarehouseContinued

  17. Conceptual Modeling of Data Warehouses • Modeling data warehouses: dimensions & measures • Star schema: A fact table in the middle connected to a set of dimension tables • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

  18. item time item_key item_name brand type supplier_type time_key day day_of_the_week month quarter year location branch location_key street city province_or_street country branch_key branch_name branch_type Example of Star Schema Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

  19. supplier item time item_key item_name brand type supplier_key supplier_key supplier_type time_key day day_of_the_week month quarter year city location branch city_key city province_or_street country location_key street city_key branch_key branch_name branch_type Example of Snowflake Schema Sales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_sales Measures

  20. item time item_key item_name brand type supplier_type time_key day day_of_the_week month quarter year location location_key street city province_or_street country shipper branch shipper_key shipper_name location_key shipper_type branch_key branch_name branch_type Example of Fact Constellation Shipping Fact Table time_key Sales Fact Table item_key time_key shipper_key item_key from_location branch_key to_location location_key dollars_cost units_sold units_shipped dollars_sold avg_sales Measures

  21. The Data Cube • One intersection might be the quantities of a product sold by specific retail locations during certain time periods. • Another matrix might be Sales volume by department, by day, by month, by year for a specific region • Cubes provide faster: • Queries • Slices and Dices of the information • Rollups • Drill Downs Multidimensional databases(sometimes called OLAP) are specialized data stores that organize facts by dimensions, such as geographical region, product line, salesperson, time. The data in these databases are usually preprocessed and stored in data cubes.

  22. Three-Dimensional View of Sales

  23. Cube: A Lattice of Cuboids all 0-D(apex) cuboid time item location supplier 1-D cuboids time,item time,location item,location location,supplier 2-D cuboids time,supplier item,supplier time,location,supplier time,item,location 3-D cuboids item,location,supplier time,item,supplier 4-D(base) cuboid time, item, location, supplier

  24. Operational vs. Multidimensional View of Sales

  25. Creating a Data Warehouse

  26. OLTP and OLAP

  27. Transactional vs. Analytical Data Processing Transactionalprocessing takes place in operational systems (TPS) that provide the organization with the capability to perform business transactions and produce transaction reports. The data are organized mainly in a hierarchical structure and are centrally processed. This is done primarily for fast and efficient processing of routine, repetitive data. A supplementary activity to transaction processing is calledanalytical processing, which involves the analysis of accumulated data. Analytical processing, sometimes referred to as business intelligence, includes data mining, decision support systems (DSS), querying, and other analysis activities. These analyses place strategic information in the hands of decision makers to enhance productivity and make better decisions, leading to greater competitive advantage.

  28. OLTP vs. OLAP

  29. OLAP Client/Server Architecture

  30. OLAP Server Arrangement

  31. OLAP Server with Multidimensional Data Store Arrangement

  32. OLAP Server With Local Mini Data Marts

  33. Data Mining: Extraction of Knowledge From Data

  34. Review: Data-Mining Phases

  35. Data Preprocessing

  36. Data Preprocessing • Why preprocess the data? • Data cleaning • Data integration and transformation • Data reduction • Discretization and concept hierarchy generation

  37. Why Data Preprocessing? • Data in the real world is a mess • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • noisy: containing errors or outliers • inconsistent: containing discrepancies in codes or names • No quality data, no quality mining results • Quality decisions must be based on quality data • Data warehouse needs consistent integration of quality data

  38. Cont’d • Just as manufacturing and refining are about transformation of raw materials into finished products, so too with data to be used for data mining • ECTL – extraction, clean, transform, load – is the process/methodology for preparing data for data mining • The goal: ideal DM environment

  39. Data Types • Variable Measures • Categorical variables (e.g., CA, AZ, UT…) • Ordered variables (e.g., course grades) • Interval variables (e.g., temperatures) • True numeric variables (e.g., money) • Dates & Times • Fixed-Length Character Strings (e.g., Zip Codes) • IDs and Keys – used for linkage to other data in other tables • Names (e.g., Company Names) • Addresses • Free Text (e.g., annotations, comments, memos, email) • Binary Data (e.g., audio, images)

  40. Multi-Dimensional Measure of Data Quality • A well-accepted multidimensional view: • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility • Broad categories: • intrinsic, contextual, representational, and accessibility.

  41. Major Tasks in Data Preprocessing • Data cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration • Integration of multiple databases, data cubes, or files • Data transformation • Normalization and aggregation • Data reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data discretization • Part of data reduction but with particular importance, especially for numerical data

  42. Forms of data preprocessing

  43. What the Data Should Look Like • All data mining algorithms want their input in tabular form – rows & columns as in a spreadsheet or database table i.e. Give a sample file of SPSS

  44. What the Data Should Look Like • Customer Signature • Continuous “snapshot” of customer behavior Each row represents the customer and whatever might be useful for data mining

  45. What the Data Should Look Like • The columns • Contain data that describe aspects of the customer (e.g., sales $ and quantity for each of product A, B, C) • Contain the results of calculations referred to as derived variables (e.g., total sales $)

  46. What the Data Should Look Like • Columns with One Value - Often not very useful • Columns with Almost Only One Value • Columns with Unique Values • Columns Correlated with Target Variable (synonyms with the target variable) 1. 2. 3.

  47. Data Cleaning • Data cleaning tasks • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data

  48. Missing Data • Data is not always available • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data • Missing data may be due to • equipment malfunction • inconsistent with other recorded data and thus deleted • data not entered due to misunderstanding • certain data may not be considered important at the time of entry • not register history or changes of the data • Missing data may need to be inferred.

  49. How to Handle Missing Data? • Ignore the tuple: usually done when class label is missing (assuming the tasks in classification—not effective when the percentage of missing values per attribute varies considerably. • Fill in the missing value manually: tedious + infeasible? • Use a global constant to fill in the missing value: e.g., “unknown”, a new class?! • Use the attribute mean to fill in the missing value • Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter • Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree

  50. Noisy Data • Noise: random error or variance in a measured variable • Incorrect attribute values may due to • faulty data collection instruments • data entry problems • data transmission problems • technology limitation • inconsistency in naming convention • Other data problems which requires data cleaning • duplicate records • incomplete data • inconsistent data

More Related