Discover hidden patterns from large datasets for credit risk assessment, healthcare, and more with data mining techniques. Learn about data mining's role, importance, and challenges in extracting valuable insights. Explore the process of knowledge discovery in databases (KDD).
DATA MINING
Data mining deals with the discovery of hidden knowledge, unexpected patterns and new rules from large data sets.
Data Mining - By Dr. S. C. Shirwaikar
Examples of information extracted using a query language
• List customers who use a credit card to purchase more than Rs 1000 worth of groceries
• List patients who have had at least one heart attack
Examples of what data mining is used for
• Develop a general profile of credit card customers
• Determine patients whose lifestyle makes them prone to a heart attack in the near future
• Differentiate poor credit risk customers from good ones
Data mining differs from usual query processing in many ways
• The query cannot be well formed or precisely stated, because what you are looking for is usually hidden
• Data in operational databases may not be sufficient. Data from various sources needs to be integrated and processed before quality mining can be done
• The output is not just a subset of the data; it is analysed and presented as a pattern
Data explosion problem
• The explosive growth of data: from terabytes to petabytes
• Progress in hardware technology has led to automated data collection tools, storage media and affordable computers
• Progress in database technology, especially relational technology, has led to powerful database systems
• Tremendous amounts of data are stored in databases, data warehouses and other information repositories
• The quantity of data in the world roughly doubles every year
• Distribution and sharing of data is possible
• Due to the internet, hundreds of megabytes of data are distributed around the world
• Heterogeneous data sources can be shared using Open Database Connectivity (ODBC) tools
• Data exchange and integration through XML technology
Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras
More data means less information
• We are drowning in data, but starving for knowledge!
Computers against computers
Automated data collection tools and the mechanical production and reproduction of data force us to develop mechanical methods for filtering, selecting and interpreting data.
Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Data mining: a misnomer?
Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Knowledge discovery in databases (KDD) is a multistep process of finding useful information and patterns in data. Data mining is one step of KDD: the use of algorithms to extract patterns.
Steps of KDD
1. Selection (data extraction): obtain data from heterogeneous data sources such as databases, data warehouses, the World Wide Web or other information repositories
2. Preprocessing (data cleaning): incomplete, noisy and inconsistent data must be cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected
3. Transformation (data integration): combine data from multiple sources into a coherent store. Data can be encoded in common formats, normalized or reduced
4. Data mining: apply algorithms to the transformed data and extract patterns
5. Pattern interpretation/evaluation
• Pattern evaluation: evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns
• Knowledge presentation: present the mined knowledge; visualization techniques can be used
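The first three KDD steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout, the two hypothetical sources and the helper names `clean` and `min_max_normalize` are invented for the example and do not come from the slides. Cleaning here simply ignores incomplete records, which is one of the options the slides mention.

```python
# Hypothetical records gathered from two sources (assumed data, not from the slides).
source_a = [{"id": 1, "age": 25, "income": 30000},
            {"id": 2, "age": None, "income": 52000}]
source_b = [{"id": 3, "age": 47, "income": 81000},
            {"id": 4, "age": 36, "income": 45000}]


def clean(records):
    """Preprocessing: ignore records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def min_max_normalize(records, field):
    """Transformation: rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero when all values are equal
    return [{**r, field: (r[field] - low) / span} for r in records]


selected = source_a + source_b                       # 1. selection from both sources
cleaned = clean(selected)                            # 2. preprocessing
transformed = min_max_normalize(cleaned, "income")   # 3. transformation

for record in transformed:
    print(record)
```

The transformed records would then be handed to a mining algorithm (step 4) and the resulting patterns evaluated (step 5).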
Visualization Techniques
Knowledge discovery process
KDD is the nontrivial extraction of implicit, previously unknown and potentially useful knowledge from data.
[Figure: the KDD process. Labels, from bottom to top: Operational Databases, Selection, Data Cleaning, Data Integration, Data Preprocessing, Data Warehouses, Data Transformation, Data Mining, Pattern Evaluation, Knowledge]
Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses or other information repositories.
The architecture of a typical data mining system may have the following major components
• Database, data warehouse, World Wide Web or other information repository: data cleaning and data integration techniques may be performed on the data
• Database or data warehouse server: responsible for fetching the relevant data based on the user's data mining request
[Figure: architecture of a typical data mining system. Labels, from top to bottom: Graphical User Interface, Pattern Evaluation, Knowledge Base, Data Mining Engine, Database or Data Warehouse Server, data cleaning, integration and selection, Database, Data Warehouse, World Wide Web, Other Info Repositories]
• Data mining engine: a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis and outlier analysis
• Knowledge base: the domain knowledge used to guide the search or to evaluate the interestingness of the resulting patterns
• Pattern evaluation module: applies interestingness measures to filter the discovered patterns
• Graphical user interface: lets the user specify a data mining query
Why Data Mining? Potential Applications
• Data analysis and decision support
  - Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
• Other applications
  - Text mining (news groups, email, documents) and Web mining
  - Stream data mining
  - Bioinformatics and bio-data analysis
Data mining algorithms
All algorithms attempt to fit a model as closely as possible to the data being examined. The model is based on an analysis of the attributes of a training data set and is then evaluated using a test data set.
A data model can be
• Descriptive: characterizes and explores properties of the current data
• Predictive: performs inference on the current data to make predictions about future data
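The fit-then-evaluate idea can be illustrated with a deliberately small sketch. Everything here is assumed for the example: the labelled records are made up, the hold-out rule (every fourth record) is arbitrary, and the "model" is just the mean attribute value per class, which is one simple choice among many rather than a method prescribed by the slides.

```python
# Hypothetical labelled records: (attribute value, class label).
data = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.5, "low"),
        (6.0, "high"), (6.5, "high"), (7.0, "high"), (7.5, "high")]

# Hold out every fourth record as the test set; the rest is training data.
train = [rec for i, rec in enumerate(data) if i % 4 != 3]
test = [rec for i, rec in enumerate(data) if i % 4 == 3]


def fit(training_set):
    """Build the model: the mean attribute value of each class."""
    sums, counts = {}, {}
    for x, label in training_set:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Assign the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


model = fit(train)
correct = sum(predict(model, x) == label for x, label in test)
accuracy = correct / len(test)
print(f"model={model}, test accuracy={accuracy:.2f}")
```

Evaluating on records that were not used for fitting is what separates an estimate of predictive quality from a mere description of the training data.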
[Figure: data mining models and tasks]
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
• Classification: maps data into predefined groups or classes. It uses supervised learning: in a learning phase the algorithm builds a classifier from a training data set containing data attributes and the associated class labels
• Regression: maps data to a real-valued prediction variable. The algorithm tries to find the function (linear or non-linear) that best fits the training data
• Time series analysis: the value of an attribute is examined as it varies over time. It can be used to determine similarities, classify behavior or predict future values
• Prediction: predicts future values using regression, time series analysis or other approaches
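As a concrete instance of regression, the sketch below fits a straight line by ordinary least squares and uses it to predict a new value. The function name `fit_line` and the sample points are assumptions made for illustration; the slides do not prescribe a particular fitting method.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept  # predicted value for an unseen x = 5
print(slope, intercept, prediction)
```

With noisy real data the points will not lie on the line, and the fitted function is the one that minimizes the sum of squared vertical errors.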
• Clustering: finds similarities between data according to the characteristics found in the data and groups similar data objects into clusters
  - Unsupervised learning: no predefined classes
  - Interpretability and usability: results should be comprehensible and usable; a domain expert is required
• Summarization: maps data into subsets with simple descriptions. It extracts or derives representative, summary-type information
• Association rules: discover relationships among data. Used in market basket analysis to find items frequently purchased together
• Sequence discovery: discovers sequential patterns in data, such as the order in which items are purchased or data is accessed
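For association rules, the two standard measures are support (the fraction of transactions containing an itemset) and confidence (how often the consequent appears among transactions containing the antecedent). The sketch below computes both for a made-up set of baskets; the transactions and function names are assumptions for illustration only.

```python
# Hypothetical market baskets (assumed data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Estimated probability of consequent given antecedent."""
    return support(antecedent | consequent) / support(antecedent)


print("support({bread, milk}) =", support({"bread", "milk"}))
print("confidence(bread => milk) =", confidence({"bread"}, {"milk"}))
```

A rule such as bread => milk is usually reported only when both measures exceed thresholds chosen by the analyst.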
Influence from many disciplines
[Figure: disciplines that feed into Data Mining: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, Other Disciplines]
Depending on the data mining approach, techniques from other disciplines may be applied, such as
• Information retrieval
• Artificial intelligence
• Neural networks
• Fuzzy set theory
• Knowledge representation
• Logic programming
• High performance computing
Data mining issues
• Human interaction: interfaces are required with both domain and technical experts. The variety of databases and of users leads to numerous data mining techniques. What is required is not known in advance, so the extraction process needs to be interactive
• Interpretation of results: experts are required and interpretability is a problem. Background knowledge or domain expertise is essential to guide the discovery process
• Visualization of results: visualization helps, but multi-dimensional data is problematic. The discovered knowledge should be expressed in the form of trees, tables, graphs, charts, curves, etc.
Data mining issues (continued)
• Large datasets: scalability is a problem, since algorithms do not scale well to massive real-world datasets. Sampling and parallelization are effective tools
• High dimensionality: a conventional database may contain many different attributes, not all of which are relevant. This increases complexity and reduces efficiency (the curse of dimensionality). Data reduction and dimensionality reduction are the remedies
• Multimedia data: found, for example, in GIS databases; conventional data mining algorithms prove ineffective on it
• Missing data: it is not always possible to ignore missing data, but during preprocessing data mining algorithms can be used to replace missing data with estimates
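Replacing missing data with estimates can be as simple as substituting the mean of the known values. The sketch below shows that baseline; it is an assumed, minimal stand-in for the estimation the slide mentions, and the function name `impute_mean` and the sample values are hypothetical. More elaborate estimates (for example, predicting the value from other attributes) follow the same pattern.

```python
def impute_mean(values):
    """Replace each missing entry (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot estimate a mean when every value is missing")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical attribute column with two missing entries.
ages = [20, None, 40, 30, None]
filled = impute_mean(ages)
print(filled)
```

Mean substitution keeps the column average unchanged but shrinks its variance, which is one reason it is treated as a starting point rather than a final answer.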
Data mining issues (continued)
• Irrelevant data: data is reduced by removing irrelevant data
• Noisy data and outliers: invalid or incorrect data leads to poor-quality data mining. Outliers are very different from the rest of the data and do not fit well into the derived model
• Changing data: data warehouses contain non-volatile data. Dynamic data is uploaded and the algorithms are then reapplied
• Integration: KDD requests are one-time needs; data mining functions are now integrated into traditional database systems
• Applications: the challenge lies in making effective use of the output of a mining algorithm rather than in the complexity of the algorithm itself
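One common way to flag values that "do not fit" is a z-score test: measure how many standard deviations each value lies from the mean and report those beyond a threshold. The sketch below assumes that technique, a threshold of 2.0 and made-up readings; none of these choices comes from the slides.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
print(find_outliers(readings))
```

Whether a flagged value is noise to be removed or a genuinely interesting case (as in fraud detection) is a decision for the domain expert.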
Data mining metrics
How can the effectiveness of the data mining process be measured?
- The KDD process is expensive. The return on investment is the saving achieved by decisions that use the results
- It is difficult to measure and quantify
- It is measured, for example, as an increase in sales or a reduction in advertising cost
Social implications of data mining
Two sides of the coin
- Data mining can be used to improve customer service and satisfaction
- Data mining can conflict with one's right to privacy
- Data mining is omnipresent and invisible, and it affects everyone. Profiling is used to label typical characteristics
Data mining should follow certain guidelines
• The Organisation for Economic Co-operation and Development (OECD) established a set of international guidelines referred to as fair information practices
• Purpose specification and use limitation: the use of collected data should not exceed the stated purpose
• Openness: individuals have the right to know the nature of the data collected about them
• Security safeguards: data should be protected from loss, unauthorized access, destruction, use, modification or disclosure
Data mining should follow certain guidelines (continued)
• Individual participation: the individual has the right to have the data erased, completed or corrected
• Privacy-preserving data mining
  - Secure multiparty computation: data values are encoded so that no party can learn another party's data values
  - Data obscuration: the actual data is distorted by aggregation or by adding random noise. A reconstruction algorithm is essential for recovering the distribution of the original data
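A classic toy example of secure multiparty computation is a secure sum built on additive secret sharing: each party splits its private value into random shares, so that only the grand total can be reconstructed. The sketch below assumes that specific scheme, which the slides do not name; the modulus, function names and input values are hypothetical. It uses Python's `random` module for reproducibility, which is not cryptographically secure, so a real deployment would need a proper randomness source and protocol.

```python
import random

MODULUS = 2 ** 31  # assumed bound; the true total must be smaller than this


def make_shares(secret, n_parties, rng):
    """Split secret into n random shares that add up to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values, rng):
    """Compute the total without any party seeing another party's raw value."""
    n = len(private_values)
    # all_shares[i][j] is the share that party i sends to party j.
    all_shares = [make_shares(v, n, rng) for v in private_values]
    # Each party j adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS


rng = random.Random(0)  # fixed seed, for a repeatable demonstration only
total = secure_sum([120, 75, 310], rng)
print(total)
```

Each individual share is uniformly random on its own, so a single share reveals nothing about the value it came from; only the combination of all partial sums yields the total.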
BOOKS
• Data Mining: Introduction and Advanced Topics by Margaret H. Dunham and Sridhar, Pearson Education, ISBN 81-7758-785-4
• Data Mining Techniques by Arun K Pujari, Universities Press (India) Limited, ISBN 81-7371-380-4
• Data Mining by Pieter Adriaans and Dolf Zantinge, Pearson Education Asia, Addison Wesley Longman (Singapore), ISBN 81-7808-425-2
• Data Mining Techniques for Marketing, Sales and Customer Relationship Management by Michael J. A. Berry and Gordon S. Linoff, Wiley-dreamtech India Pvt. Ltd., ISBN 81-265-0517-6
• Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 81-312-0535-5