Mining data with PolyAnalyst

Mining data with PolyAnalyst Your Knowledge Partner™ www.megaputer.com

Outline • Data Mining in BI chain • PolyAnalyst overview • Learning algorithms • Additional features • Future developments

Data Mining in BI chain

DM in Decision Making [Diagram: Data → Knowledge → Decision → Action] Consider a fragment of the BI chain: • Data is what we can capture and store • Knowledge is what provides for informed decisions • Problem: How to get from Data to Knowledge? • Solution: Data Mining (Machine Learning)

Data Mining • "Data Mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from databases that is used to make crucial business decisions." -- G. Piatetsky-Shapiro, KDNuggets editor www.kdnuggets.com • Valid • Novel • Actionable • Comprehensible

Data Mining vs. OLAP • OLAP: helps prove or reject your hypotheses by dissecting data along different dimensions, but you have to guess the answer first! • Data Mining: automatically develops and tests numerous hypotheses by learning from historical data; analyzes raw data

Business Intelligence Chain • Consider direct marketing automation • Analyze data • Integrate applications

Data Mining Tasks • Predicting • Classifying • Clustering • Segmenting • Explaining • Associating • Visualizing • Link Analysis • Text Mining

Fields of application

What makes DM hard? • Unfamiliar concept and lack of experience • Results require interpretation by an analyst • Poor integration in existing applications • Difficulty processing very large databases • Necessity to learn a new application • High cost

Megaputer response • Challenge: Unfamiliar concept and lack of experience. Response: Collaborative Appliance Program, which combines Megaputer analysts' expertise in data mining with the customer's knowledge of the business project • Challenge: Results require interpretation by an analyst. Response: Simple reporting and batch processing capabilities • Challenge: Poor integration in existing applications. Response: Easy scoring of external data with a few mouse clicks • Challenge: Difficulty processing very large databases. Response: In-Place Data Mining • Challenge: Necessity to learn a new application. Response: An SDK of easy-to-integrate PolyAnalyst COM components • Challenge: High cost. Response: Flexible licensing mechanism

PolyAnalyst overview

What is PolyAnalyst? • Multi-strategy data mining suite • The largest selection of ML algorithms for diverse business tasks • Structured data and text processing tools • Ease of use: friendly data manipulation and visualization • Deep integration: applying models to external databases through the OLE DB protocol; exporting models to XML; COM components • Best price/performance ratio

Key differentiators of PolyAnalyst • Integrated analysis of structured (numeric and categorical) and unstructured (text) data • Visual analytical interface that is easy to learn and operate • The largest selection of powerful machine learning algorithms • Mouse-driven application of predictive models to data in any external system through a standard OLE DB link • Simple integration with external applications: SDK of COM components • In-Place Data Mining capabilities for processing huge databases • Step-by-step tutorials based on real-world case studies • Rich data manipulation and visualization tools • Reusable analytical scripts for batch-mode data mining • The best price/performance ratio

PolyAnalyst customer base: 300+ installations [Figure: sample customers]

PolyAnalyst workplace [Screenshot with labeled areas: control buttons; project navigation tree; data and results pane; exploration engine report fragment; objects and collections represented by icons; PolyAnalyst log journal]

PolyAnalyst provides • Access to data held in a database or data warehouse (numerical, categorical, yes/no, and date variables) • Data manipulation and visualization • 14 machine learning algorithms • Convenient reporting and output of results • Integration with external applications

PolyAnalyst machine learning algorithms

“Probably one of the most impressive characteristics of PolyAnalyst is the sheer number of data mining tasks it can tackle.” -- Mario Apicella, Technology Analyst, InfoWorld Test Center, July 3, 2000

Learning algorithms • Find Laws (SKAT algorithm) • Cluster (Localization of anomalies) • Find Dependencies (n-dimensional distributions) • Classify (Fuzzy logic modeling) • Decision Tree (Information Gain criterion) • PolyNet Predictor (GMDH-Neural Net hybrid) • Market Basket Analysis (Association rules) • Memory Based Reasoning (k-NN + GA) • Linear Regression (Stepwise and rule-enriched) • Discriminate (Unsupervised classification) • Summary Statistics (Data summarization) • Link Analysis (Visual correlation analysis) • Text Mining (Semantic text analysis)

Cluster (FC) • Identifies clusters of similar records • Selects the best variables for clustering • Suggests the number of clusters • Separates clusters of records into new data sets for further investigation (preprocessing for other algorithms)

Cluster (continued) [Figure: groups of similar records]

Cluster (continued) • Based on analyzing distributions in hypercubes of all variables rather than on measuring distances between points • Hence independent of rescaling of the variable axes • Finds only clusters actually present in the data, against a background of uniformly distributed cases
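The hypercube idea can be illustrated with a small sketch. This is not PolyAnalyst's Cluster algorithm, whose details the slides do not give; it only assumes that each variable is cut into equal-width bins over its observed range, and that a cell counts as dense when it holds several times more records than a uniform background would put there. The names `cell_of`, `dense_cells`, `bins`, and `factor` are hypothetical.

```python
from collections import Counter

def cell_of(record, lows, highs, bins):
    """Map a record to its grid cell: one equal-width bin index per variable."""
    cell = []
    for x, lo, hi in zip(record, lows, highs):
        if hi == lo:                      # constant variable: a single bin
            cell.append(0)
            continue
        idx = int((x - lo) / (hi - lo) * bins)
        cell.append(min(idx, bins - 1))   # the maximum value falls in the last bin
    return tuple(cell)

def dense_cells(records, bins=4, factor=4.0):
    """Return cells holding at least `factor` times the uniform expectation."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    counts = Counter(cell_of(r, lows, highs, bins) for r in records)
    expected = len(records) / bins ** len(columns)   # uniform background per cell
    return {cell: n for cell, n in counts.items() if n >= factor * expected}

# Illustrative data: a tight group near the origin plus scattered background cases.
records = [(1, 1), (1, 2), (2, 1), (2, 2),
           (9, 40), (20, 5), (30, 30), (40, 20)]
rescaled = [(x * 1000, y) for x, y in records]   # same data, first axis rescaled

found = dense_cells(records)
found_rescaled = dense_cells(rescaled)
```

Because the bins are defined relative to each variable's own range, multiplying one axis by a constant leaves every record in the same cell, which is the rescaling independence the slide refers to.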

Classify (CL) • Fuzzy-logic based classification • The membership function is modeled by Find Laws, PolyNet Predictor, or Linear Regression (LR) • Provides record scoring, with Lift and Gain charts used for visualization • Assigns records to one of two classes and reports the classification rule used

Classify (continued) • The PolyAnalyst Lift chart illustrates the increase in response to a campaign based on the discovered model instead of random mailing [Chart: % of maximal possible response; curves: Targeted mailing, Mass mailing] • The PolyAnalyst Gain chart helps optimize the profit obtained in a direct marketing campaign [Chart: Profit ($); curves: Targeted mailing, Mass mailing]
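As a rough illustration of what the two charts plot, the sketch below ranks customers by model score and accumulates responses and profit as the mailing goes deeper into the ranked list. The scores, responses, revenue per response, and cost per mailing are made-up values, the function names are hypothetical, and the data is assumed to contain at least one responder; this is not PolyAnalyst's charting code.

```python
def rank_by_score(scores, responded):
    """Order the 0/1 response flags from the highest model score to the lowest."""
    pairs = sorted(zip(scores, responded), key=lambda p: p[0], reverse=True)
    return [r for _, r in pairs]

def lift_curve(scores, responded):
    """Share of all responders reached after mailing the top n scored customers."""
    ranked = rank_by_score(scores, responded)
    total = sum(ranked)
    curve, hits = [], 0
    for r in ranked:
        hits += r
        curve.append(hits / total)
    return curve

def gain_curve(scores, responded, revenue_per_response, cost_per_mail):
    """Profit after mailing the top n scored customers."""
    ranked = rank_by_score(scores, responded)
    curve, hits = [], 0
    for n, r in enumerate(ranked, start=1):
        hits += r
        curve.append(hits * revenue_per_response - n * cost_per_mail)
    return curve

# Illustrative data: ten customers, four of whom respond.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
responded = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

targeted = lift_curve(scores, responded)
mass = [n / len(scores) for n in range(1, len(scores) + 1)]  # random mailing baseline
profit = gain_curve(scores, responded, revenue_per_response=20, cost_per_mail=5)
best_depth = profit.index(max(profit)) + 1   # mailing depth with the highest profit
```

Plotting `targeted` against `mass` gives the shape of the Lift chart, and the peak of `profit` marks the mailing depth a Gain chart would suggest.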

Decision Tree (DT) • Intuitively classifies cases into selected categories • Based on the Information Gain splitting criterion • The fastest algorithm in PolyAnalyst • Scales linearly with increasing number of records

Decision Tree (continued) [Screenshot: classification tree with node characteristics]
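The Information Gain criterion named on the Decision Tree slide can be written down in a few lines. The sketch below uses the textbook definition (entropy of the target minus the weighted entropy after splitting on an attribute); the attribute names and records are invented for illustration, and nothing here is taken from PolyAnalyst's implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the records on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Illustrative records: "owner" separates the classes perfectly, "region" not at all.
rows = [
    {"region": "north", "owner": "yes"},
    {"region": "south", "owner": "yes"},
    {"region": "north", "owner": "no"},
    {"region": "south", "owner": "no"},
]
labels = [1, 1, 0, 0]

gains = {a: information_gain(rows, labels, a) for a in ("region", "owner")}
best_split = max(gains, key=gains.get)
```

A tree builder would split on `best_split` and repeat the calculation inside each resulting node.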

Decision Forest (DF) • The most efficient classification algorithm for tasks with multiple target categories • Transforms the task of categorizing data records into N classes into N tasks of categorizing records into two classes • Develops the best collection of N classification trees, with leaves containing the probabilities of classifying records into the corresponding classes • Scales linearly with increasing number of records
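The N-class to N two-class decomposition described on this slide can be sketched as follows. To keep the example short, each "tree" is assumed to be a single split on one categorical attribute, with every leaf storing the probability of the class in question; real trees would be deeper, and the slides do not say how PolyAnalyst grows or combines them. All names are hypothetical.

```python
from collections import defaultdict

def binary_targets(labels):
    """Turn one N-class target into N yes/no targets (class k vs. the rest)."""
    return {k: [label == k for label in labels] for k in sorted(set(labels))}

def fit_stump(values, target):
    """One-split 'tree': each leaf (attribute value) stores P(target is True)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for v, t in zip(values, target):
        totals[v] += 1
        hits[v] += t          # True counts as 1, False as 0
    return {v: hits[v] / totals[v] for v in totals}

def fit_forest(values, labels):
    """Fit one two-class model per target category."""
    return {k: fit_stump(values, t) for k, t in binary_targets(labels).items()}

def classify(forest, value):
    """Pick the class whose two-class model gives the highest leaf probability."""
    return max(forest, key=lambda k: forest[k].get(value, 0.0))

# Illustrative data: one categorical attribute, three target classes.
values = ["red", "red", "green", "green", "blue", "blue", "blue"]
labels = ["A", "A", "B", "B", "C", "C", "A"]

forest = fit_forest(values, labels)
prediction = classify(forest, "blue")
```

Here `forest` holds three two-class models, one per category, and a record is assigned to the category whose model reports the highest probability.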

Link Analysis (LK) • Reveals pairs of correlated objects • Used in Fraud Detection, Text Analysis and other correlation analysis tasks

Text Analysis (TA) • Extracts key concepts from natural language notes • Tags individual records with the main encountered concepts • Recognizes synonyms and other semantic relations • Can perform user-focused or unsupervised analysis • Integrates the analysis of text with the power of other machine learning algorithms of PolyAnalyst • Facilitates categorization of textual documents

Text Analysis (continued)

Basket Analysis (BA) • Used in Retailing, Fraud Detection and Medicine • Identifies groups of products that sell well together in transactional data • Finds directed association rules for each of these groups • Groups baskets containing similar sets of products • Rules are characterized by Support, Confidence, and Improvement • Based on new mathematics: works 10 to 50 times faster than traditional algorithms

Basket Analysis (continued) [Screenshots: groups of products sold together well; directed association rules]

Basket Analysis (continued) • Works with both transactional and flat data formats • Easily finds many-to-one rules • “I would like to continue working together with Megaputer on other CTP customers’ projects (mainly Swedish and Danish Banks).” -- Olof Goransson, Senior Data Consultant, CTP Skandinavien AB
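The three rule measures listed for Basket Analysis can be computed directly from transactional data. The sketch below assumes the usual definitions of support and confidence, and takes "improvement" to mean the ratio of a rule's confidence to the support of its consequent (often called lift elsewhere); the slides do not define it, so that reading is an assumption. The transactions and function names are illustrative.

```python
def baskets_from_transactions(transactions):
    """Group (transaction_id, product) rows into one set of products per basket."""
    baskets = {}
    for tid, product in transactions:
        baskets.setdefault(tid, set()).add(product)
    return list(baskets.values())

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def rule_measures(baskets, antecedent, consequent):
    """Support, confidence, and improvement of the rule antecedent -> consequent."""
    both = support(baskets, set(antecedent) | set(consequent))
    confidence = both / support(baskets, antecedent)
    improvement = confidence / support(baskets, consequent)  # assumed definition
    return both, confidence, improvement

# Illustrative transactional data: one row per product sold.
transactions = [
    (1, "bread"), (1, "butter"),
    (2, "bread"), (2, "butter"), (2, "milk"),
    (3, "bread"),
    (4, "milk"),
    (5, "butter"), (5, "bread"),
]
baskets = baskets_from_transactions(transactions)

# A one-to-one rule and a many-to-one rule.
s, c, i = rule_measures(baskets, {"butter"}, {"bread"})
s2, c2, i2 = rule_measures(baskets, {"bread", "milk"}, {"butter"})
```

An improvement above 1 means the consequent appears more often in baskets containing the antecedent than in baskets overall.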

Find Laws (FL) • Models relationships hidden in data • Presents discovered knowledge explicitly • Searches the space of all possible hypotheses • “The unique Find Laws algorithm along with an easy to use interface made PolyAnalyst the only choice for our environment.” -- James Farkas, Senior Navigation Engineer, The Boeing Company

Find Laws (continued) • FL is based on Megaputer’s unique Symbolic Knowledge Acquisition Technology (SKAT) • A good introduction to SKAT: PC AI magazine, January 1999, pp. 48-52

Find Dependencies (FD) • Determines most influential variables • Detects multi-dimensional dependencies • Predicts target variable in a table format • Used as preprocessing for FL

Find Dependencies (continued) [Table: predicted Sales per Employee]

PolyNet Predictor (PN) • Predicts values of continuous attributes • Hybrid GMDH-Neural Network method • Works well with large amounts of data • The best network architecture is built automatically

Memory Based Reasoning (MB) • Classifies records into multiple categories • Based on identifying similar cases in the previous history • Uses Genetic Algorithms to find the most suitable metric for the problem
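The role of the metric in Memory Based Reasoning can be seen in a minimal k-nearest-neighbor sketch. The slide says a genetic algorithm searches for the most suitable metric; that search is not reproduced here. Instead, the per-variable weights are set by hand to show why the choice matters. The data, the weighted Euclidean distance, and the function names are assumptions made for illustration.

```python
from collections import Counter

def weighted_distance(a, b, weights):
    """Euclidean distance with one non-negative weight per variable."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5

def knn_classify(history, query, weights, k=3):
    """Vote among the k most similar past cases under the given metric."""
    nearest = sorted(
        history, key=lambda case: weighted_distance(case[0], query, weights)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative history: the first variable determines the class,
# the second is noise on a much larger scale.
history = [
    ((1.0, 50.0), "low"), ((1.2, 48.0), "low"), ((0.9, 47.0), "low"),
    ((3.0, 10.0), "high"), ((3.1, 90.0), "high"), ((2.9, 70.0), "high"),
]
query = (2.8, 49.0)

unweighted = knn_classify(history, query, weights=(1.0, 1.0))
tuned = knn_classify(history, query, weights=(1.0, 0.0))  # stand-in for a tuned metric
```

With equal weights the noisy second variable dominates the distance and the query lands among the "low" cases; down-weighting it recovers the "high" neighbors, which is the kind of improvement a metric search is meant to find.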

Discriminate (DS) • Determines what features of a selected data set distinguish it from the rest of the data • Requires no target variable • Can be powered by Find Laws, PolyNet Predictor, or Linear Regression

Linear Regression (LR) • Correctly incorporates categorical and yes/no variables in the analysis • Stepwise Linear Regression: only influential variables are included • Can be used as a preprocessing and benchmarking module

PolyAnalyst features in more detail

Data Analysis Project Workflow • Access data • Understand, clean and transform data • Run machine learning analysis • Visualize, report and share results • Integrate results in existing business process

Data Access • ODBC-compliant databases: Oracle, DB2, Informix, Sybase, MS SQL Server, etc. • Dedicated access: IBM Visual Warehouse, Oracle Express • OLE DB (can do In-Place Data Mining) • CSV or DBF files • Data can be appended to the project when necessary

Data cleansing and manipulation • SQL querying through OLE DB • Record selection according to multiple criteria • Union, intersection, or complement of data sets • Aggregation of categorical values • Visual drill-through • Filtering of exceptional records • Split into n-tile percentage intervals • Random sampling
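Two of the manipulation steps listed above, the n-tile split and random sampling, are simple enough to sketch. The code below is a generic illustration rather than PolyAnalyst's behavior: it assumes n-tiles are formed by rank into groups of nearly equal size, and that sampling is done without replacement. The function names and the fixed seed are hypothetical choices.

```python
import random

def split_into_ntiles(values, n):
    """Split values into n groups of (nearly) equal size by rank."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[i * size // n:(i + 1) * size // n] for i in range(n)]

def random_sample(records, fraction, seed=0):
    """Draw a fraction of the records without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    k = round(fraction * len(records))
    return rng.sample(records, k)

# Illustrative data: twelve purchase amounts.
amounts = [12, 3, 7, 1, 9, 5, 11, 2, 8, 4, 10, 6]

quartiles = split_into_ntiles(amounts, 4)
sample = random_sample(amounts, 0.25)
```

Each quartile can then be saved as its own data set, for example to compare the top spenders against everyone else.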

Visualization • Histograms • Line and scatter plots with zoom and drill-through capabilities • Snake charts • Interactive 3D charts • Interactive Rule-graphs with sliders for visualizing multi-variable relations • Frequency charts for categorical, integer, or yes/no variables • Lift and Gain charts for marketing applications

Histograms and Frequencies • A histogram displays the distribution of numerical variables • A Frequencies chart displays the distribution of categorical and yes/no variables

2D charts and Rule-graphs • Sliders help visualize the effects of other variables in models with more than two dimensions • The Find Laws model (red line) for the dependence of a product's market share on price predicts a dramatic change in the formula when the product goes on promotion

Snake charts • Quickly compare several data sets qualitatively on all their attributes [Chart: compared data sets “High” and “Low” plotted across all variables]
