180 likes | 419 Views
DATA MINING. Presented by Harendra Kumar Gupta. What Data Mining is ?. Data Mining is the process of applying the techniques like- neural network, clustering, genetic algorithm, decision trees, support vector machine to data
E N D
DATA MINING Presented by Harendra Kumar Gupta
What Data Mining is ?....... • Data Mining is the process of applying the techniques like- neural network, clustering, genetic algorithm, decision trees, support vector machine to data - and how we can extract patterns from the data ? • Data mining is the concept of extraction of hidden predictive information from large datasets.
Origin of Data Mining • Draws ideas from machine learning ,pattern recognition, statistics, and data base systems • Traditional techniques may be unsuitable due to -Enormity of data -High dimensionality of data -Heterogeneous ,distributed nature
Simple Examples of Data Mining • How many cars a person owns basically you looking at data like - …..How far is his office from his home ? -……..What is his income? -….How many children does he have? etc.. • A company uses data mining to find out what adverting schemes are effective for their competitors.
Pre-processing 1)-Before using the data mining algorithms, target data set is assembled be used. -----As data mining can only uncover patterns already present in the data, the target dataset must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. 2)-The target set is then cleaned. Cleaning removes the observations with noise and missing data. 3)-The clean data are reduced into feature vector, one vector per observation.( A feature vector is a summarized version of the raw data observation)
Contd… For example- a black and white image of a face which is 100px by 100px would contain 10,000 bits of raw data. This might be turned into a feature vector by locating the eyes and mouth in the image. Doing so would reduce the data for each vector from 10,000 bits to three codes for the locations, dramatically reducing the size of the dataset to be mined, and hence reducing the processing effort. The features selected will depend on what the objective is; obviously, selecting the "right" features is fundamental to successful data mining.
Division of feature vectors • The feature vectors are divided into two sets, 1)- the "training set" and The training set is used to "train" the data mining algorithms 2)- the "testing set". the testing set is used to verify the accuracy of any patterns found.
Process of Data Mining • Data mining commonly involves five classes of tasks 1)-Clustering 2)-Classification 3)-Regression 4)-Association rule learning 5)-Deviation detection
Clustering Clustering is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data. Ex: -To find areas of ocean that have similar or likely similar significant impact on the earth’s climate
Classification • Classification is the task of generalizing known structure to apply to new data. • Common algorithms include decision tree learning, naïve Bayesian classification, neural networks and support vector machine and nearest neighbor . Ex: An email program might attempt to classify an email as legitimate or spam.
Regression • Regression attempts to find a function which models the data with the least error. Ex: -….Predicting wind velocity as a function of temperature, humidity, air pressure etc. -….Predicting sales amount of new product based on advertising expenditure.
Association Rule Learning • It searches for relationships between variables. Ex: a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis
Deviation Detection • It is used for detecting significant deviation from normal behavior . Ex: -In credit card detection -In network intrusion detection
Result Validation • The final step is to verify the patterns produced by the data mining algorithms. • Not all patterns found by the data mining algorithms are necessarily valid. • It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set, this is called over fitting. • To overcome this, the evaluation uses a testing set of data which the data mining algorithm was not trained on.
Contd….. • The learnt patterns are applied to this test set and the resulting output is compared to the desired output. Ex: a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a testing set of sample emails. Once trained, the learnt patterns would be applied to the testing set of emails which it had not been trained on, the accuracy of these patterns can then be measured from how many emails they correctly classify
Contd…. If the learnt patterns do not meet the desired standards, then it is necessary to reevaluate and change the preprocessing and data mining. If the learnt patterns do meet the desired standards then the final step is to interpret the learnt patterns and turn them into knowledge.
Summary 1). How we can avoid choosing the least valuable variable? 2)- Data being analyzed may not be the representative of whole domain and so may not contain the examples certain critical relationship and behavior that exists across other part of domain. How to overcome this problem? 3)-What should be the optimum division of feature vector.? 4)-What will be the affect of cleaning target data on pattern extracted? 5)-What are the advantages of data mining over traditional approaches?