CSE 8392 SPRING 1999 DATA MINING: PART I Professor Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 (214) 768-3087 fax: (214) 768-3085 email: mhd@seas.smu.edu www: http://www.seas.smu.edu/~mhd January 1999
CSE 8392 SPRING 1999 OUTLINE • Course Objective: To examine Data Mining concepts. A database perspective (rather than AI or statistics) is taken. • I. Introduction and Related Topics • II. Core Topics • III. Advanced Topics • IV. Case Studies • V. Student Presentations • VI. Summary and Future Trends CSE 8392 Spring 1999
INTRODUCTION AND RELATED TOPICS • Section Objective: Provide an introduction to data mining concepts. Briefly examine related concepts and background topics. • Historical Perspective • Gleaning Knowledge from the Data • User expectations increase as the amount and sophistication of collected data increase. • Reality vs. Extracted Data [Figure: Physical View (Reality, Information Need) vs. Database View (Data, Query)]
Related Topics (to be covered) • Knowledge Discovery • Information Retrieval • Fuzzy Sets • Data Warehousing and OLAP • Dimensional Modeling
Data Mining Overview • What is Data Mining? • Definition: Fayyad, p. 9 • A.k.a. • Exploratory data analysis • Unsupervised pattern recognition • Data driven discovery • Deductive learning • Data Mining determines patterns in the data • Non-trivial • Valid • Novel • Potentially useful • Interesting • General and simple • Understandable
DM Techniques (R[1]) • DM involves many different algorithms to accomplish different tasks. All share the following components. • Model (a model must be fit to the data) • Function/Purpose • Representation • Preference Criteria (how to choose one model over another) • Search Algorithm (how to search the data) • Example (Loan Data, Fig 1.1, p6 in Fayyad): • Model: Classification, Linear Function • Preference: What best fits the data? (Fig 1.2 or 1.4) • Search Algorithm: Linear search of database
DM Model Functions (R[1]) • Classification - Map data into predefined groups • Regression - Map data to a real-valued prediction variable • Clustering - Map data into groups defined by the data itself • Summarization - Map subsets of data into a simple description • Dependency Modeling - Identify dependencies among data items • Link Analysis - Identify other relationships among data (association rules) • Sequence Analysis - Identify sequential patterns in data
DM Historical Perspective • Late 70’s: Spreadsheet analysis • 80’s: Transactional databases support data storage and retrieval • Early 90’s: Growing interest in end user support (a.k.a. decision support) • Issue: transactional databases are not designed for decision support • Mid 90’s: Dedicated data warehouses for decision support and multidimensional analysis • Late 90’s: Proliferation; new concepts (data marts) • DM Tools: Neovista, Red Brick
Data Mining Metrics • Berson, Tables 17-1,17-2,17-3, p 347 • Accuracy • Clarity • Dirty Data • Dimensionality • Raw Data (Preprocessing) • RDBMS embedding • Scalability • Speed • Validation
DM Issues • Overfitting • Outliers • Closed World Assumption • Database schemas and database models • Algorithms for data mining • Interpretation and visualization of results • Size of databases • Multimedia data, Spatio-Temporal Data • Changing data • Integration • DM Applications • Market basket analysis • Stock analysis and selection • Fraud detection and prevention • Crisis prediction and prevention
KNOWLEDGE DISCOVERY IN DATABASES (KDD) • “Overall process of discovering useful knowledge from data.” (p28 in R[1]) • Defn: R[1] p 30 • Steps: Fig 1, p29 R[1] (Fig 1.3 in Fayyad) • Data Mining is one step in the KDD process • The KDD objective is usually not clear or exact. It may take time with the customer to understand their needs. • Data usually has problems and needs cleaning • Incorrect/missing data • Extract from multiple sources and compare • Delete anomalous data and sources • Different data types/metrics
FUZZY SETS and LOGIC • Set membership is described by a real-valued membership function with range [0,1] • Ex: Set of all tall people • Crisp set membership function: f(x) = 1 if height(x) > 6 ft., and 0 otherwise; a fuzzy membership function instead assigns intermediate values to heights near 6 ft. • Note that this is a simple classification problem. As in the loan example, the results are not exact. • Basis of many classification and clustering approaches • In a conventional DB, how do you retrieve all tall people? • Three-valued logic: True, False, Maybe • Multi-valued logic: More than 2 values
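The "tall people" example can be sketched in code to contrast a crisp set with a fuzzy one. This is a minimal illustration, not taken from the course materials: the linear ramp shape and the 5.5 ft and 6.5 ft breakpoints are assumptions chosen only to show a gradual transition, and the names and heights are hypothetical.

```python
def crisp_tall(height_ft: float) -> int:
    """Crisp membership: 1 if strictly taller than 6 ft, else 0."""
    return 1 if height_ft > 6.0 else 0


def fuzzy_tall(height_ft: float, low: float = 5.5, high: float = 6.5) -> float:
    """Fuzzy membership as a linear ramp (assumed shape).

    Returns 0.0 at or below `low`, 1.0 at or above `high`,
    and a proportional value in between.
    """
    if height_ft <= low:
        return 0.0
    if height_ft >= high:
        return 1.0
    return (height_ft - low) / (high - low)


# Hypothetical people and heights, for illustration only.
heights = {"Ann": 5.4, "Bob": 6.0, "Cy": 6.6}
for name, h in heights.items():
    print(name, crisp_tall(h), fuzzy_tall(h))
```

With the crisp function, a person exactly 6 ft tall is simply "not tall"; the fuzzy version gives that person a partial degree of membership, which is the smoother transition between classes that fuzzy logic is meant to provide.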
Fuzzy Logic • Reasoning with uncertainty • Extends multivalued logic; allows user to communicate using imprecise concepts, e.g. • “good” and “bad” • “close to” and “far away” • Avoids brittleness of rule-based reasoning by introducing a degree of set membership • Allows for smoother transition between classification sets in the domain • Example • Berson figure 16.2, page 325
INFORMATION RETRIEVAL • Store and retrieve documents based on fuzzy queries • Predecessor of web-based access • Ex: Store information about all articles in all IEEE Transactions journals and retrieve all documents dealing with heaps. • Overview • Conventional IR Systems • Query Structures (Keywords) • Matching (Multivalued logic) • Measures • Text Analysis Techniques • IR Related Topics
Conventional IR Systems • Library card catalogs • Documents (Library Science) • Formatted • Unformatted (Text) • Mixed • Document Surrogates • Identifiers • Titles, names, and dates • Abstracts, extracts, reviews • Summaries of Numerical Data • Image Descriptions
IR Queries • Query Structures • Matching Criteria • Boolean Queries • Vector • Fuzzy • Natural Language • Logical combination of keywords • Weight associated with keywords • Similarity measures
Similarity Measures • Document Vector: • Different Measures: • Salton and McGill, Introduction to Modern Information Retrieval, 1984, McGraw-Hill, pp201-204. • Similarity uses: • Document-Document • Query-Query • Document-Query
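The slide's formulas did not survive extraction, so the sketch below shows three widely used vector similarity measures (cosine, Dice, and Jaccard coefficients) in one common weighted-vector form. Whether these are exactly the measures given on the cited pages is an assumption, and the example vectors are hypothetical.

```python
import math


def inner(x, y):
    """Inner product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Cosine coefficient: inner product over the product of vector lengths."""
    denom = math.sqrt(inner(x, x) * inner(y, y))
    return inner(x, y) / denom if denom else 0.0


def dice(x, y):
    """Dice coefficient (weighted-vector form)."""
    denom = inner(x, x) + inner(y, y)
    return 2 * inner(x, y) / denom if denom else 0.0


def jaccard(x, y):
    """Jaccard coefficient (weighted-vector form)."""
    denom = inner(x, x) + inner(y, y) - inner(x, y)
    return inner(x, y) / denom if denom else 0.0


# Hypothetical binary term vectors over a four-term vocabulary.
doc = [1, 1, 0, 1]
query = [1, 0, 0, 1]
print(cosine(doc, query), dice(doc, query), jaccard(doc, query))
```

The same functions apply unchanged to document-document, query-query, and document-query comparisons, since each side is just a term vector.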
IR Document/Query Matching • Matching Process • Relevance and Similarity Measures • Boolean based matching • Logical match • Vector based matching • Threshold match • Probabilistic Match • P(relevant) = n / N, where n = number of relevant documents and N = total number of documents • Fuzzy Matching • Proximity Matching • Weighting • Relative Importance of Items
IR Matching • Scaling • Impact of Sample Size • Clustering • Centroids • Measures • Precision • Recall
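Precision and recall can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the function name and document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    Empty denominators yield 0.0 rather than raising.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Retrieving more documents tends to raise recall while lowering precision, which is why the two measures are usually reported together.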
IR Indexing • Text Analysis • Indexing is the assignment of keywords or terms that represent document content • Originally a library science problem that has grown with the advent of web based searches • Indexing types • Automated vs. manual • Controlled vs. uncontrolled • Single term vs. terms in context • Deep vs. shallow
IR Indexing • General Steps • 1. Assignment of terms or concepts capable of representing content • 2. Assignment to each term a weight or value • Indexing • Vector based • Start with excerpts, remove high frequency words • Stop list • Thesaurus • Compute discrimination values of terms
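The two general steps (assign terms, then assign each term a weight) can be sketched as follows. This is a simplified illustration under stated assumptions: the stop list is a tiny hypothetical one, no thesaurus is applied, and normalized term frequency is used as the weight in place of the discrimination values mentioned on the slide.

```python
from collections import Counter

# Hypothetical stop list of high-frequency words to drop.
STOP_LIST = {"a", "an", "and", "the", "of", "in", "is", "to"}


def index_document(text):
    """Return a {term: weight} vector for one document.

    Step 1: tokenize, lowercase, strip simple punctuation, drop stop words.
    Step 2: weight each remaining term by its relative frequency.
    """
    terms = [w.strip(".,;:").lower() for w in text.split()]
    terms = [t for t in terms if t and t not in STOP_LIST]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


print(index_document("The heap is a tree and the heap is fast"))
```

A fuller indexer would replace the frequency weight with a measure of how well each term separates documents from one another, which is the role of the discrimination value.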
IR Retrieval • Retrieval or Classification • Vector based • Same starting point as with indexing • Compute weighting factors • Assign to each document a weighted term vector • Similarity Measures • Measure similarity between document/query • Results normalized to range between 0 - 1
IR Retrieval • Inverse Document Frequency • Assumes a term's importance is proportional to its occurrence frequency within a document, and inversely proportional to the number of documents in which the term occurs. • Also used for similarity measurement • Inverted Indexing of Documents • Concept Hierarchy • DAG of concepts • Follow nodes from general to more specific • Tag articles with low-level concepts so that each may be distinguished from ancestors
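An inverted index and an inverse document frequency weight can be sketched together, since the index already records how many documents contain each term. The log(N / df) form used here is one common variant and is an assumption, as are the function names and the three tiny documents.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def idf(term, index, n_docs):
    """Inverse document frequency: log(N / df); 0.0 for unseen terms."""
    df = len(index.get(term, ()))
    return math.log(n_docs / df) if df else 0.0


# Hypothetical document collection.
docs = {1: "heap sort", 2: "heap tree", 3: "hash table"}
index = build_inverted_index(docs)
print(sorted(index["heap"]))          # documents containing "heap"
print(idf("heap", index, len(docs)))  # common term, lower weight
print(idf("hash", index, len(docs)))  # rare term, higher weight
```

A term that appears in fewer documents receives a larger weight, which reflects the idea that rare terms discriminate between documents better than common ones.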
IR Related Topics • Information Retrieval Related Topics • Text Analysis • Fuzzy Sets • Extending Databases • Hypertext • Digital Libraries • Data Mining • Web based browsers
DATA WAREHOUSING AND OLAP • Preparations for Mining: Data Warehousing • Extracting the data (from RDBMS) • Storing the data • Data warehouse or data mart • Cleansing the data • Mining the data • Often with multidimensional queries • Definition • Blend of technologies • Integration • Enables Strategic Use of Data • Architecture • Figure 6.1, page 116
DW Migration • Migration from Relational Database to Data Warehouse • Differences (Relational vs. Data Warehouse) • Procedure for Migration • Extraction • Cleanup • Transformation • Migration • Issues • Multiple sources • Database Heterogeneity • Data Heterogeneity
DW Design • Data Warehouse Design Considerations - Nine Step Method: • Subject Matter • Fact Table contents • Dimensioning • Fact Selection • Precalculations • Rounding out dimension table • Duration selection • What about change? • Query priorities • Technical Considerations • Hardware • Communications Infrastructure • Data Structures
More on DW • Benefits • Development of strategic information and resources • Hypothesis testing • Knowledge discovery • Data Marts • Definition: a mini data warehouse for data mining • Directed at a partition of data • Dedicated user group • May be physically separate • Drivers • Urgent user requirements • Small budget • Absence of sponsor • Decentralization • Smaller project size
DIMENSIONAL MODELING • Dimensional Modeling • Describes relationships in the data that will be mined • Relatively new concept, still developing • A technique for visualizing data models • Schema (Star and Snowflake) • Facts - A collection of related data items, consisting of measures and context data • Dimensions - A collection of members or units of the same type of view. Axis for modeling. Sets the context for the facts. • Measures - Numeric attribute of fact (What is stored about sales data) • Focus - Tends to be on numeric data • MD Analysis vs. DM - Figure 4, R[3]
Data Cube • Way to visualize facts and dimensions • Hypercube (more than 3 dimensions) • May be nested • Figure 13.1, p249, Berson • Figure 15,R[3]
Star Schema • [Figure: Sales Facts table surrounded by Time, Customer, Part No., Salesperson, and Product dimension tables] • Contains a large fact table and a surrounding set of dimension tables • A.k.a. constellation or multistar model • Figure 9.1, p171, Berson • Following from Figure 18, R[3]
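A star schema can be sketched with plain dictionaries: each fact row carries a measure plus foreign keys that point into dimension tables. The sketch below keeps only two of the dimensions named on the slide (Time and Product); all keys, attribute names, and amounts are hypothetical.

```python
# Dimension tables: surrogate key -> descriptive attributes (hypothetical data).
time_dim = {1: {"month": "Jan"}, 2: {"month": "Feb"}}
product_dim = {10: {"name": "widget"}, 11: {"name": "gadget"}}

# Fact table: one row per sale, with foreign keys and a numeric measure.
sales_facts = [
    {"time_id": 1, "product_id": 10, "amount": 100.0},
    {"time_id": 1, "product_id": 11, "amount": 50.0},
    {"time_id": 2, "product_id": 10, "amount": 75.0},
]


def sales_by(dim, key, attr):
    """Total the `amount` measure grouped by one dimension attribute.

    `dim` is a dimension table, `key` the fact-table foreign key that
    points into it, and `attr` the dimension attribute to group on.
    """
    totals = {}
    for row in sales_facts:
        label = dim[row[key]][attr]
        totals[label] = totals.get(label, 0.0) + row["amount"]
    return totals


print(sales_by(time_dim, "time_id", "month"))
print(sales_by(product_dim, "product_id", "name"))
```

The fact table stays narrow and numeric, while the descriptive context lives in the dimension tables, so the same facts can be summarized along any dimension with one lookup per row.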
Snowflake Schema • [Figure: Sales Facts table with Time, Customer, Part No., Salesperson, and Product dimensions, extended by Week, Month, Location, and Manager dimensions] • Sometimes dimensions have hierarchies among themselves • N:1 relationships among members of a dimension may be subdivided • Decomposition yields a snowflake-like schema
OLAP (On-Line Analytical Processing) • Multidimensional database • Allows user to analyze data using elaborate, multidimensional, complex views • MOLAP - Multidimensional OLAP. Supported by specialized DBMS/software systems. (Data structures, temporal) • May not be general enough for other uses • Access limited and optimized for OLAP processing • Fig 13.3 p 253, Berson • ROLAP - Relational OLAP. Underlying data stored in a traditional (relational) DBMS and accessed by a traditional query language (SQL). • Layer on top of DBMS. Middleware. • May have poor performance for OLAP applications • Fig 13.4 p 254, Berson
OLAP Operations • Move view of facts down/up dimensions • Drill Down • Roll Up • Figure 3, R[3] • Figure 16, R[3] • Look at data by partitioning the cube • Slice - Look at a subcube, obtained by fixing one dimension, to get more specific data • Dice - Select a subcube by restricting two or more dimensions (rotating the cube to look at another dimension is usually called pivot) • Figure 17, R[3]
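Roll up and slice can be sketched on a tiny cube stored as a dictionary keyed by (month, city, product). The cube contents, the month-to-quarter hierarchy, and the function names are all hypothetical. In this sketch a slice fixes a single value on one axis, and drill down is simply the move from the rolled-up view back to the finer-grained cells.

```python
from collections import defaultdict

# Hypothetical sales cube: (month, city, product) -> units sold.
cube = {
    ("Jan", "Dallas", "widget"): 100,
    ("Jan", "Austin", "widget"): 40,
    ("Feb", "Dallas", "widget"): 70,
    ("Feb", "Dallas", "gadget"): 30,
}

# Hypothetical hierarchy on the time dimension.
QUARTER = {"Jan": "Q1", "Feb": "Q1"}


def roll_up_time(cells):
    """Roll up the time dimension from month to quarter, summing the measure."""
    out = defaultdict(int)
    for (month, city, product), value in cells.items():
        out[(QUARTER[month], city, product)] += value
    return dict(out)


def slice_cube(cells, axis, value):
    """Keep only the cells whose coordinate on `axis` equals `value`."""
    return {k: v for k, v in cells.items() if k[axis] == value}


print(roll_up_time(cube))
print(slice_cube(cube, 1, "Austin"))
```

Rolling up trades detail for a coarser summary, while slicing narrows attention to one part of the cube without changing its level of detail.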