350 likes | 373 Views
Evaluation of MineSet 3.0. By Rajesh Rathinasabapathi S Peer Mohamed Raja. Guided By Dr. Li Yang. MineSet. Introduction Problem in existing analytical tools MineSet Client/Server Architecture MineSet Enterprise Manager. INTRODUCTION. Product of Silicon Graphics Inc. Supported by
E N D
Evaluation of MineSet 3.0 By Rajesh Rathinasabapathi S Peer Mohamed Raja Guided By Dr. Li Yang
MineSet • Introduction • Problem in existing analytical tools • MineSet Client/Server Architecture • MineSet Enterprise Manager
INTRODUCTION • Product of Silicon Graphics Inc. • Supported by • Windows NT 4.0 (Server & Client) • Windows 95 & 98 (Client) • Memory varies with the size of data • 64MB RAM • 1024 768 Resolution with 65K colors • IRIX 6.4 and above (for server parellelization)
MineSet • Helps to pinpoint and understand the complex patterns, relationships, and anomalies that are implicitly present in your data.
Problems in Existing Tools • You must specify directly any relationships between data elements. • ExampleQuery for all the sales by region. • Presupposes you have an idea that sales vary by region. • Relationships may be uncovered that you did not know existed.
MineSet Client/Server Architecture • Client and Server can be on a same systemor on a different system • Server responsibilities • Accessing Data files • Data Transformations (Data Mover) • Running Mining operations (Classification, Association, etc…) • Generating visualization files
MineSet Client/Server Architecture • Client’s Responsibility Providing GUI • Integration with other systems • Support for ODBC complaint databasefor example SQL Server, DB2, Oracle ,Sybase • Open Architecture allows you to coexist with other tools. For example SAS • inegrate with web using hotlinks • Custom Algorithms
MineSet Enterprise Manager • MineSet Tool Manager • MineSet 3D Visualizer • MineSet Cluster Visualizer • MineSet Record Visualizer • MineSet Statistics Visualizer
MineSet Tool Manager • Data Access and Data Transformation. • Data Destinations.
Basic Transformations • Adding New Columns • Removing existing Columns • Aggregation • Filtering • Sampling • Binning • Apply Classifier
Adding New Columns Addition of new columns is possible to the existing dataset. Columns added can be derived from existing column by using expressions.
Removing Existing Columns Removing columns that are not persistent, are redundant, or contain obvious, uninteresting predictors.
Aggregation Grouping records together and finding the sum, maximum, minimum, or average.
Filtering Visualization To view strongest rules or the most profitable customer segments
Sampling Sampling the data to get a random subset of the data
Binning Breaking up of continuous range of data into discrete segments
Data Mining Tools Association. Classification. Cluster. Regression. Column Importance.
Column Importance Column importance helps one to discover which are the most important columns in predicting different values for a label column one chooses. This unlike clustering lets one to decide which label one will use to determine the importance of columns.
Column Importance Options when finding column importance • One can specify Num of columns to find. • Either to use weights or not. • Specify the weight. • No of additional importance columns. • Specify purity of the columns present on right or left.
Association Rules Options • Height –Bars. • Height – Disks. • Color – Bars. • Color – Disks. • Label – Bars. Confidence (1-100) Support (1-100) Use weights or not. Unlimited items per rule / the no of items per rule.
Association Rules Interpreting association rules in Scatter Visualizer: • The LHS represents items in this axis. • The RHS represents items in this axis. • Bar height corresponds to support. • Bar colors represent lift. • By pointing on the object on the bar one can get the specifications of the bar.
Clustering Single K-means Default method Iterative K-means
Clustering Single K-means Default method In single K means clustering one specifies the number of clusters Iterative K-means In iterative K – means one specifies the minimum, maximum no of clusters
Clustering Options present in creating clusters are : The distance measure (Euclidean / Manhattan). The number of iterations. The Random seeds. The Random seeds.
Clustering · The orders in which attributes are displayed represent the importance of the attributes. · The population shows the default settings. · Every column represents the different cluster. · On clicking each column at the top its attribute importance is shown. · Each box represents the max, min, median and deviation of the values in them.
Classifier Classification is the task of assigning a discrete label value to an unlabeled record Different modes : Classifier and Error Classifier Only Estimate Error Learning Curve
Classifier Classifier Mode Classifier mode uses all the available data to build the classifier. It is useful when you are not concerned with error estimation. Classifier and Error It uses the Holdout Error Estimation. Instead of using all the data to build the model, you can hold out the part of the data as a training set to induce the classifier. The classifier and error mode automatically partitions the data set into independent training and test subsets. Holdout ratio/ Random seed.
Classifier Error Estimate It uses the Cross Validation Error Estimation. Cross-validation is used for building the final classifier or for small datasets. Cross-validation is a method for getting a more precise estimate of error. Learning Curve The Learning Curve shows the error of the classifier generated by an inducer in proportion to the number of records used to create the classifier.
Classifier The classification process can be induced by the following methods Decision Tree Option Tree Evidence Decision Table