Data Mining: Current Status and Directions

What is Data Mining? • Data mining (also called knowledge discovery in databases) • Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories • The goal is to understand and use data, and to make the data itself something of value and strategic importance

Data is everywhere! • Relational databases—A commodity of every enterprise • POS (Point of Sales): Transactional DBs are often terabytes in size • Legacy databases • Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases • Time-series data (e.g., stock trading) and temporal data • Text (documents, emails) and multimedia databases • WWW: A huge, hyper-linked, dynamic, global information system

The potential for Data Mining Is Everywhere, too! • Knowledge to be mined • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. • Techniques utilized • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc. • Applications adapted • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining: A Confluence of Multiple Disciplines • Database technology • Statistics • Machine learning (AI) • Visualization • Information science • Other disciplines

Multi-Dimensional Data Analysis • Data warehousing: integration from heterogeneous or semi-structured databases • Multi-dimensional modeling of data: star & snowflake schemas (in Relational DBMS) • Efficient and scalable computation of data cubes or iceberg cubes (in MDDB) • OLAP (on-line analytical processing): drilling, dicing, slicing, etc. • Discovery-driven (data driven) exploration of data cubes
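
The OLAP operations named above can be illustrated with a minimal sketch. The fact rows, the dimension names (region, product, quarter), and the helper names `roll_up` and `slice_cube` are all hypothetical and chosen only for illustration; a real OLAP engine would work on precomputed cuboids rather than scanning raw rows.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions plus one measure ("amount").
sales = [
    {"region": "East", "product": "Laptop", "quarter": "Q1", "amount": 100},
    {"region": "East", "product": "Mouse", "quarter": "Q1", "amount": 20},
    {"region": "West", "product": "Laptop", "quarter": "Q1", "amount": 150},
    {"region": "West", "product": "Laptop", "quarter": "Q2", "amount": 120},
]

def roll_up(facts, dims):
    """Aggregate the measure over every dimension not listed in dims."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)

def slice_cube(facts, dim, value):
    """Keep only the rows where one dimension is fixed to a single value."""
    return [row for row in facts if row[dim] == value]

by_region = roll_up(sales, ["region"])
q1_by_product = roll_up(slice_cube(sales, "quarter", "Q1"), ["product"])
```

Here `by_region` is a roll-up to the region level, and `q1_by_product` combines a slice (quarter fixed to Q1) with a roll-up by product.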

Data Cubes

Data cube dimensional hierarchy

Creating Multi-dimensional data warehouses • Start with standard normalized relational database tables.

Data warehouse ‘STAR’ Schema In order to reduce the number of joins that must be performed, data is reformatted into ‘fact’ tables. Fact tables typically consist of many foreign keys
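
As a rough illustration of a fact table made up of foreign keys, the sketch below builds a tiny star schema in an in-memory SQLite database. The table and column names (`fact_sales`, `dim_product`, `dim_store`, `units`) and the sample rows are assumptions made for this example and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold the descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
-- The fact table holds foreign keys to each dimension plus a measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER REFERENCES dim_store(store_id),
    units INTEGER
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Mouse');
INSERT INTO dim_store VALUES (10, 'Boston'), (20, 'Denver');
INSERT INTO fact_sales VALUES (1, 10, 3), (2, 10, 5), (1, 20, 2);
""")

# One join from the fact table to a dimension answers "units sold per city".
units_by_city = conn.execute("""
    SELECT s.city, SUM(f.units)
    FROM fact_sales AS f
    JOIN dim_store AS s ON f.store_id = s.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
```

Each analysis question needs only one join per dimension it touches, which is the point of keeping the fact table at the center of the schema.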

Data Warehouse ‘Snowflake’ Schema Very similar to the star schema. Can you tell what this schema lets us see that the star schema did not?

Making optimal use of storage space • Many cuboids can be materialized by analyzing another cuboid as opposed to the entire data set • Example: Consider analyzing sales based on the dimensions of Route, Source, and Time. The number of rows in each view is given in millions (M): • Route, Source, Time: 6 M • Route, Time: 6 M • Route, Source: 0.8 M • Source, Time: 6 M • Time: 0.1 M • Route: 0.2 M • Source: 0.01 M • None • Materialization of all views would require roughly 19.1 million rows

Dependent Cuboids • Selective materialization in this case can reduce the number of stored rows by 12 million • Assume that ‘Part’ can be further partitioned into ‘Size’ and ‘Color’, and that ‘Customer’ can be partitioned into ‘Individual’, ‘State’, and ‘Country’ • Cuboid lattice, top level: Part, Supplier, Customer (6 M) • Second level: Part, Supplier (0.8 M); Supplier, Customer (6 M); Part, Customer (6 M) • Partitioned cuboids: Part (Color), Customer (State); Part (Size), Customer (State); Part (Color), Customer (Country); Part (Size), Customer (Country); Part (Color), Customer (Individual); Part (Size), Customer (Individual) • Bottom level: Part; Customer
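
The 12 million figure can be checked with a short sketch using the row counts from the Route, Source, Time example. The selection rule used here (skip any view that has exactly as many rows as a view containing all of its dimensions) is an assumption made for illustration; it is one simple heuristic, not necessarily the selection method the slides have in mind.

```python
# Row counts per view, taken from the Route/Source/Time example.
view_rows = {
    ("Route", "Source", "Time"): 6_000_000,
    ("Route", "Time"): 6_000_000,
    ("Route", "Source"): 800_000,
    ("Source", "Time"): 6_000_000,
    ("Time",): 100_000,
    ("Route",): 200_000,
    ("Source",): 10_000,
}

def redundant_views(rows):
    """Views that are no smaller than some view containing all their dimensions.

    Assumed heuristic: such a view saves no work when answering queries,
    so it can be computed from the larger view on demand instead of stored.
    """
    redundant = []
    for dims, count in rows.items():
        for parent_dims, parent_count in rows.items():
            if set(dims) < set(parent_dims) and count == parent_count:
                redundant.append(dims)
                break
    return redundant

total_rows = sum(view_rows.values())
skipped = redundant_views(view_rows)
saved_rows = sum(view_rows[v] for v in skipped)
```

With these counts, the two 6 M two-dimensional views are skipped, which accounts for the 12 million rows saved out of roughly 19.1 million.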

Association and Frequent Pattern Analysis • Objective is to find patterns in the tendency of items to be found together. • A typical 2-item association rule output will generally look something like this: • Computer → Software (7%, 72%) • This is telling you that 7% (a.k.a. support level) of your sales transactions involved computers AND software, and that 72% (a.k.a. confidence level) of all computer sales also involved the sale of software.

Association and Frequent Pattern Analysis • Associations can also be found among 3, 4, or more item sets, for example: • (Computers, Software) → Mouse Pad (8%, 65%) • This tells you that 8% of transactions involved computers, software, and mouse pads, and that 65% of transactions involving computers and software also involved the purchase of a mouse pad.
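
The two percentages attached to a rule can be computed directly from a list of transactions. The sketch below assumes each transaction is a set of item names; the five sample transactions and the function names `support` and `confidence` are made up for illustration and are not the figures quoted in the slides.

```python
def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

# Hypothetical sample data.
transactions = [
    {"computer", "software", "mouse pad"},
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"software"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
three_item_confidence = confidence(
    transactions, {"computer", "software"}, {"mouse pad"}
)
```

In this sample, computers and software appear together in 2 of 5 transactions (support), and 2 of the 3 computer transactions include software (confidence).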

Association and Frequent Pattern Analysis • The problem with unguided associative analysis is that the number of associations can be enormous. • Consider a store like L.L. Bean trying to identify meaningful associations. The output could number in the millions. • In order to “filter” the output, users will frequently set parameters for confidence and support thresholds.
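
Threshold filtering can be sketched as follows. This is a brute-force enumeration of 2-item rules over a handful of hypothetical baskets, meant only to show how the support and confidence thresholds prune the output; practical tools use far more efficient frequent-pattern algorithms, and the function name `two_item_rules` is an assumption of this example.

```python
from itertools import permutations

def two_item_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for every
    2-item rule that meets both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        has_a = sum(1 for t in transactions if a in t)
        has_both = sum(1 for t in transactions if a in t and b in t)
        if has_a == 0:
            continue
        sup = has_both / n
        conf = has_both / has_a
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules

# Hypothetical baskets for an outdoor retailer.
baskets = [
    {"boots", "socks"},
    {"boots", "socks"},
    {"boots", "tent"},
    {"socks"},
    {"tent", "lantern"},
]

unfiltered = two_item_rules(baskets, 0.0, 0.0)
filtered = two_item_rules(baskets, 0.3, 0.6)
```

Even with four items the unfiltered output lists every ordered pair, while the thresholds leave only the boots and socks rules.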

Visualization of association rules in MineSet 3.0

Clustering and Outlier Analysis • The attribute of interest is plotted on a graph whose axes represent the dimensions of interest. Cluster analysis is frequently two-dimensional, but does not have to be. • The objective of the data mining algorithm is to find cluster centers that maximize the distance between cluster centers while minimizing the distance between each point in a cluster and the center of that cluster. • The center of the cluster typically defines the cluster (e.g. males between 30 and 35 years old with incomes between 50K and 75K) and axes are usually parametric rather than continuous
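
One common way to search for cluster centers is k-means (Lloyd's algorithm), sketched below. The slides do not name a specific algorithm, so this choice is an assumption; note also that k-means only minimizes point-to-center distance within clusters and does not explicitly maximize the distance between centers. The (age, income in thousands) points and starting centers are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(p, centers[i]),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            mean_point(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customers as (age, income in thousands).
customers = [(30, 50), (32, 55), (34, 60), (50, 120), (52, 125), (55, 130)]
centers, clusters = kmeans(customers, [customers[0], customers[-1]])
```

The resulting centers summarize each group, in the same spirit as describing a cluster by an age range and an income range.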

Clustering Analysis • Can include user-specified constraints (e.g. no cluster has fewer than 1,000 customers)

Sequential Patterns and Time-Series Analysis • Trend analysis • Trend movement vs. cyclic variations, seasonal variations and random fluctuations • Similarity search in time-series database • Handling gaps, scaling, etc. • Indexing methods and query languages for time-series • Sequential pattern mining • Various kinds of sequences, various methods • Periodicity analysis • Full periodicity, partial periodicity, cyclic association rules
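
Separating the trend movement from short-term fluctuations is often done by smoothing. The sketch below uses a simple moving average, which is one basic technique among many and is an assumption of this example; the series values and the window size are made up.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values to smooth out
    short-term fluctuations and expose the underlying trend."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices with some noise around a rising trend.
prices = [10, 12, 11, 13, 15, 14, 16]
trend = moving_average(prices, 3)
```

The smoothed series rises steadily even though the raw values move up and down from day to day.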

Data Mining Industry and Applications • Industry has grown rapidly over the past few years • From research prototypes to data mining products, languages, and standards • IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, etc. • A few data mining languages and standards (esp. MS OLEDB for Data Mining). • Application achievements in many domains • Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

The data mining industry • Data mining is growing rapidly • R & D has seen huge increases • Applications have been broadened substantially • But not as rapidly as some may have hoped. Why not? • Value is easy to objectively measure • It is difficult to sell on hype alone, although they try! • Not off-the-shelf in nature • Need training, understanding, and customization • Definite learning curve associated with effective use • Benefit of effective use not seen immediately

Trends in data mining • Web mining (and incorporating data from outside the organization into the analysis of internal data) • Towards integrated data mining environments and tools • “Vertical” (or application-specific) data mining • Invisible data mining • Towards intelligent, efficient, and scalable data mining methods

Web Mining: A Rapidly Expanding area in Data Mining • Mine what the Web search engine finds • Automatic classification of Web documents • Discovery of authoritative Web pages, Web structures and Web communities • Meta-Web Warehousing: Web yellow page service • Web usage mining

Mining What the Web Search Engine Finds • Current Web search engines: • keyword-based, return too many answers, often of low quality, still miss a lot, are not customized, etc. • Data mining will help: • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies • better search primitives: user preferences/hints • linkage analysis: authoritative pages and clusters • customization: home page + Weblog + user profiles • Identification of “hub” pages
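
Linkage analysis for authoritative and hub pages can be sketched with the HITS iteration, a well-known method for this purpose. The slides do not say which algorithm is intended, so treating HITS as the method is an assumption, and the page names and link graph below are hypothetical.

```python
import math

def hits(links, iterations=20):
    """Iteratively score pages: a good authority is linked to by good hubs,
    and a good hub links to good authorities."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {
            p: sum(hub[q] for q in pages if p in links.get(q, ()))
            for p in pages
        }
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["docs", "blog", "shop"],
    "directory": ["docs", "blog"],
    "blog": ["docs"],
}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this small graph the page with the most incoming links from good hubs ends up as the top authority, and the page linking to the most authorities ends up as the top hub.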

A Layered Meta-Web Architecture • Layer n: more generalized descriptions • ... • Layer 1: generalized descriptions • Layer 0

Importance of Constructing Multi-Layer Meta Web • Benefits of Multi-Layer Meta-Web: • Multi-dimensional Web info summary analysis • Approximate and intelligent query answering • Web high-level query answering (WebSQL, WebML) • Web content and structure mining • Observing the dynamics/evolution of the Web • Is it realistic to construct such a meta-Web? • It benefits even if it is partially constructed • The benefit may justify the cost of tool development, standardization, and partial restructuring

Web Usage (Click-Stream) Mining • Web-log provides rich information about Web dynamics • Multidimensional Web-log analysis: • disclose potential customers, users, markets, etc. • Plan mining (mining general Web accessing regularities): • Web linkage adjustment, performance improvements • Trend analysis: • Dynamics of the Web: what has been changing? • Customized to individual users
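
A very small example of mining access regularities from a Web log is to count how often one page is followed by another within each user's click stream. The record layout (user, timestamp, page), the sample clicks, and the function name `transition_counts` are assumptions made for this sketch; real logs would first need parsing and session splitting.

```python
from collections import Counter, defaultdict

# Hypothetical click-stream records: (user, timestamp, page).
clicks = [
    ("u1", 1, "/home"), ("u1", 2, "/products"), ("u1", 3, "/cart"),
    ("u2", 1, "/home"), ("u2", 2, "/products"),
    ("u3", 5, "/home"), ("u3", 6, "/about"),
]

def transition_counts(records):
    """Count page-to-page transitions within each user's ordered clicks."""
    pages_by_user = defaultdict(list)
    for user, _, page in sorted(records, key=lambda r: (r[0], r[1])):
        pages_by_user[user].append(page)
    counts = Counter()
    for pages in pages_by_user.values():
        counts.update(zip(pages, pages[1:]))
    return counts

transitions = transition_counts(clicks)
most_common_path = transitions.most_common(1)[0]
```

Frequent transitions such as the one found here are the kind of regularity that could inform Web linkage adjustment or caching decisions.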

Intelligent Tools for Data Mining • Integration of users and mining algorithms paves the way to intelligent mining • Smart interface brings intelligence • Easy to use, understand and manipulate • One picture may be worth 1,000 words • Visual and audio data mining • Towards self-tuning, self-managing, self-triggering data mining
