Mining data with PolyAnalyst Your Knowledge Partner™ www.megaputer.com
Outline • Data Mining in BI chain • PolyAnalyst overview • Learning algorithms • Additional features • Future developments
Data Mining in BI chain
DM in Decision Making (Data, Knowledge, Decision, Action) Consider a fragment of the BI chain: • Data is what we can capture and store • Knowledge is what provides for informed decisions • Problem: How to get from Data to Knowledge? • Solution: Data Mining (Machine Learning)
Data Mining • "Data Mining is the process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge from databases that is used to make crucial business decisions." -- G. Piatetsky-Shapiro, KDNuggets editor www.kdnuggets.com • Valid • Novel • Actionable • Comprehensible
Data Mining vs. OLAP • OLAP: helps prove or reject your hypotheses by dissecting data along different dimensions, but you have to guess the answer first! • Data Mining: automatically develops and tests numerous hypotheses by learning from historical data; analyzes raw data
Business Intelligence Chain • Consider direct marketing automation • Analyze data • Integrate applications
Data Mining Tasks • Predicting • Classifying • Clustering • Segmenting • Explaining • Associating • Visualizing • Link Analysis • Text Mining
What makes DM hard? • Unfamiliar concept and lack of experience • Results require interpretation by an analyst • Poor integration in existing applications • Difficulty processing very large databases • Necessity to learn a new application • High cost
Megaputer response • Challenge: Unfamiliar concept and lack of experience. Response: Collaborative Appliance Program, which combines Megaputer analysts' expertise in data mining with the customer's knowledge of the business project • Challenge: Results require interpretation by an analyst. Response: Simple reporting and batch processing capabilities • Challenge: Poor integration in existing applications. Response: Easy scoring of external data with a few mouse clicks • Challenge: Difficulty processing very large databases. Response: In-Place Data Mining • Challenge: Necessity to learn a new application. Response: An SDK of easy-to-integrate PolyAnalyst COM components • Challenge: High cost. Response: Flexible licensing mechanism
PolyAnalyst overview
What is PolyAnalyst? • Multi-strategy data mining suite • The largest selection of ML algorithms for diverse business tasks • Structured data and text processing tools • Ease-of-use: • friendly data manipulation and visualization • Deep integration • Applying models to external DB through the OLE DB protocol • Exporting models to XML • COM components • Best Price/Performance ratio
Key differentiators of PolyAnalyst • Integrated analysis of structured (numeric and categorical) and unstructured (text) data • Easy to learn and operate visual analytical interface • The largest selection of powerful machine learning algorithms • Mouse-driven application of predictive models to data in any external system through a standard OLE DB link • Simple integration with external applications: SDK of COM components • In-Place Data Mining capabilities for processing huge databases • Step-by-step tutorials based on real-world case studies • Rich data manipulation and visualization tools • Reusable analytical scripts for batch-process data mining • The best Price/Performance ratio
PolyAnalyst Customer base: 300+ installations. Sample customers
PolyAnalyst workplace [Screenshot labels: Control buttons; Project navigation tree; Data and Results pane; Exploration engine report fragment; Objects and Collections represented by icons; PolyAnalyst log journal]
PolyAnalyst provides • Access to data held in a database or data warehouse (numerical, categorical, yes/no, and date attributes) • Data manipulation and visualization • 14 machine learning algorithms • Convenient reporting and output of results • Integration with external applications
PolyAnalyst machine learning algorithms
“Probably one the most impressive characteristic of PolyAnalyst is the sheer number of data mining tasks it can tackle.” Mario Apicella Technology Analyst InfoWorld Test Center July 3, 2000
Learning algorithms • Find Laws (SKAT algorithm) • Cluster (Localization of anomalies) • Find Dependencies (n-dimensional distributions) • Classify (Fuzzy logic modeling) • Decision Tree (Information Gain criterion) • PolyNet Predictor (GMDH-Neural Net hybrid) • Market Basket Analysis (Association rules) • Memory Based Reasoning (k-NN + GA) • Linear Regression (Stepwise and rule-enriched) • Discriminate (Unsupervised classification) • Summary Statistics (Data summarization) • Link Analysis (Visual correlation analysis) • Text Mining (Semantic text analysis)
Cluster (FC) • Identifies clusters of similar records • Selects best variables for clustering • Suggests the number of clusters • Separates clusters of records in new data sets for further investigation - preprocessing for other algorithms
Cluster (continued) Groups of similar records
Cluster (continued) • Based on analyzing distributions in hypercubes of all variables rather than on measuring distances between points • Hence independent of rescaling of the variable axes • Finds only clusters actually present in the data, against a background of uniformly distributed cases
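The slides do not spell out the clustering algorithm, so the following is only a rough sketch of the general idea of working with cell counts instead of point-to-point distances. It assumes a fixed grid of `bins` cells per axis and a density threshold `factor` times the count expected under a uniform background; the function name and both parameters are hypothetical and are not PolyAnalyst's API.

```python
from collections import Counter
from itertools import product

def grid_clusters(points, bins=4, factor=2.0):
    """Label points by dense grid cells; -1 marks background (illustration only)."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell(p):
        # Cell index depends only on the relative position within each axis
        # range, so multiplying an axis by a constant does not change it.
        idx = []
        for d in range(dims):
            span = (highs[d] - lows[d]) or 1.0
            i = int((p[d] - lows[d]) / span * bins)
            idx.append(min(i, bins - 1))
        return tuple(idx)

    counts = Counter(cell(p) for p in points)
    expected = len(points) / bins ** dims  # per-cell count under a uniform background
    dense = {c for c, n in counts.items() if n > factor * expected}

    # Merge neighbouring dense cells into clusters with a flood fill.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], set()
        while stack:
            c = stack.pop()
            group.add(c)
            for step in product((-1, 0, 1), repeat=dims):
                nb = tuple(a + b for a, b in zip(c, step))
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(group)

    labels = []
    for p in points:
        c = cell(p)
        labels.append(next((i for i, g in enumerate(clusters) if c in g), -1))
    return labels
```

Because the cells are defined relative to each variable's own range, rescaling an axis leaves the cluster labels unchanged, which is the property the slide highlights.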
Classify (CL) • Fuzzy-logic based classification • The membership function is modeled by Find Laws, PolyNet Predictor, or LR • Provides record scoring, with Lift and Gain charts used for visualization • Assigns records to one of two classes and reports the classification rule used
Classify (continued) • The PolyAnalyst Lift chart illustrates the increase in the response to a campaign when mailing is targeted with the discovered model instead of being random [Chart: % of maximal possible response; Targeted mailing vs. Mass mailing] • The PolyAnalyst Gain chart helps optimize the profit obtained in a direct marketing campaign [Chart: Profit ($); Targeted mailing vs. Mass mailing]
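As a hedged illustration of what the two charts plot, the sketch below ranks records by model score and computes, at a few mailing depths, the share of all responders reached, the lift over random mailing, and the campaign profit. The function names and the `revenue_per_response` and `cost_per_mail` parameters are assumptions for this example, not part of PolyAnalyst.

```python
def _ranked(scores, responded):
    """Responses ordered from highest to lowest model score."""
    return [r for _, r in sorted(zip(scores, responded), key=lambda t: -t[0])]

def cumulative_gain(scores, responded, steps=4):
    """Share of all responders captured in the top k/steps of ranked records."""
    ranked = _ranked(scores, responded)
    total, n = sum(ranked), len(ranked)
    return [sum(ranked[: n * k // steps]) / total for k in range(1, steps + 1)]

def lift(gains):
    """Ratio of targeted response to the response expected from random mailing."""
    steps = len(gains)
    return [g / ((k + 1) / steps) for k, g in enumerate(gains)]

def profit_curve(scores, responded, revenue_per_response, cost_per_mail, steps=4):
    """Profit when mailing only the top k/steps of ranked records."""
    ranked = _ranked(scores, responded)
    n = len(ranked)
    profits = []
    for k in range(1, steps + 1):
        cutoff = n * k // steps
        profits.append(sum(ranked[:cutoff]) * revenue_per_response
                       - cutoff * cost_per_mail)
    return profits

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
responded = [1, 1, 1, 0, 0, 1, 0, 0]
gains = cumulative_gain(scores, responded)         # [0.5, 0.75, 1.0, 1.0]
profits = profit_curve(scores, responded, 10, 1)   # [18, 26, 34, 32]
```

In this toy data the profit peaks when the top three quarters of the ranked list are mailed, which is the kind of cutoff a profit chart is meant to expose.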
Decision Tree (DT) • Intuitively classifies cases into selected categories • Based on the Information Gain splitting criterion • The fastest algorithm in PolyAnalyst • Scales linearly with increasing number of records
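The Information Gain criterion can be sketched as follows: the gain of an attribute is the entropy of the class labels minus the weighted entropy remaining after splitting on that attribute. This is the textbook definition; how PolyAnalyst applies it internally is not described in the slides, and the record layout (dictionaries keyed by attribute name) is an assumption for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the records on one attribute."""
    n = len(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

rows = [
    {"region": "N", "gender": "M"},
    {"region": "N", "gender": "F"},
    {"region": "S", "gender": "M"},
    {"region": "S", "gender": "F"},
]
labels = ["yes", "yes", "no", "no"]
# "region" separates the classes perfectly, "gender" not at all,
# so a tree builder would choose "region" for the first split.
best = max(["region", "gender"], key=lambda a: information_gain(rows, labels, a))
```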
Decision Tree (continued) [Screenshot labels: Node characteristics; Classification tree]
Decision Forest (DF) • The most efficient classification algorithm for tasks with multiple target categories • Transforms the task of categorizing data records into N classes into N tasks of categorizing records into two classes • Develops the best collection of N classification trees, with leaves containing the probabilities of classifying records in the corresponding classes • Scales linearly with increasing number of records
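The decomposition of an N-class task into N two-class tasks can be sketched generically as below. The two-class model here is a deliberately trivial stand-in (a single split on one categorical attribute, with a probability in each leaf); the real Decision Forest trees are not described in the slides, and all function names are hypothetical.

```python
from collections import defaultdict

def fit_binary(rows, is_target, attribute):
    """Toy two-class model: P(target | attribute value), one leaf per value."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row, t in zip(rows, is_target):
        totals[row[attribute]] += 1
        hits[row[attribute]] += t
    return {value: hits[value] / totals[value] for value in totals}

def fit_one_vs_rest(rows, labels, attribute):
    """Train one two-class model per category: 'this class' vs. 'all others'."""
    return {
        c: fit_binary(rows, [int(label == c) for label in labels], attribute)
        for c in sorted(set(labels))
    }

def predict(models, row, attribute):
    """Pick the category whose two-class model reports the highest probability."""
    return max(models, key=lambda c: models[c].get(row[attribute], 0.0))

rows = [{"colour": "red"}, {"colour": "red"}, {"colour": "green"}, {"colour": "blue"}]
labels = ["apple", "apple", "pear", "plum"]
models = fit_one_vs_rest(rows, labels, "colour")
guess = predict(models, {"colour": "green"}, "colour")  # "pear"
```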
Link Analysis (LK) • Reveals pairs of correlated objects • Used in Fraud Detection, Text Analysis and other correlation analysis tasks
Text Analysis (TA) • Extracts key concepts from natural language notes • Tags individual records with the main encountered concepts • Recognizes synonyms and other semantic relations • Can perform user-focused or unsupervised analysis • Integrates the analysis of text with the power of other machine learning algorithms of PolyAnalyst • Facilitates categorization of textual documents
Basket Analysis (BA) • Is used in Retailing, Fraud Detection and Medicine • Identifies groups of products that sell well together in transactional data • Finds directed association rules for each of these groups • Groups baskets containing similar sets of products • Rules are characterized by Support, Confidence, and Improvement • Based on new mathematics: works 10 to 50 times faster than traditional algorithms
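The three rule measures can be illustrated with a minimal sketch. It assumes the usual textbook definitions, with Improvement taken to mean what is often called lift (confidence divided by the overall frequency of the consequent); the slides do not give PolyAnalyst's exact formulas, and the function name is hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and improvement of the rule antecedent -> consequent.

    Assumes at least one basket contains the antecedent and one the consequent.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(antecedent <= b for b in baskets)          # baskets with antecedent
    n_c = sum(consequent <= b for b in baskets)          # baskets with consequent
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    support = n_both / n                   # how often the whole rule occurs
    confidence = n_both / n_a              # P(consequent | antecedent)
    improvement = confidence / (n_c / n)   # how much better than chance
    return support, confidence, improvement

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]
support, confidence, improvement = rule_metrics(baskets, {"bread"}, {"butter"})
# support = 0.5, confidence = 2/3, improvement = 4/3
```

An improvement above 1 means the antecedent makes the consequent more likely than it is in the data overall.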
Basket Analysis (continued) [Screenshot labels: Groups of products that sell well together; Directed Association Rules]
Basket Analysis (continued) • Works with both transactional and flat data formats • Easily finds many-to-one rules “I would like to continue working together with Megaputer on other CTP customers’ projects (mainly Swedish and Danish Banks).” -- Olof Goransson, Senior Data Consultant, CTP Skandinavien AB
Find Laws (FL) • Models relationships hidden in data • Presents discovered knowledge explicitly • Searches the space of all possible hypotheses “The unique Find Laws algorithm along with an easy to use interface made PolyAnalyst the only choice for our environment.” -- James Farkas, Senior Navigation Engineer, The Boeing Company
Find Laws (continued) • FL is based on Megaputer’s unique Symbolic Knowledge Acquisition Technology (SKAT) • A good introduction to SKAT: PCAI magazine, January 1999, pp. 48-52
Find Dependencies (FD) • Determines most influential variables • Detects multi-dimensional dependencies • Predicts target variable in a table format • Used as preprocessing for FL
Find Dependencies (continued) Predicted Sales per Employee
PolyNet Predictor (PN) • Predicts values of continuous attributes • Hybrid GMDH-Neural Network method • Works well with large amounts of data • The best network architecture is built automatically
Memory Based Reasoning (MB) • Performs classification into multiple categories • Based on identifying similar cases in historical data • Uses Genetic Algorithms to find the most suitable metric for the problem
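A minimal sketch of the nearest-neighbour part of this idea follows. It assumes a weighted Euclidean metric whose per-variable `weights` stand in for whatever metric the genetic algorithm would select; the genetic search itself is not implemented, and the function and parameter names are hypothetical.

```python
from collections import Counter
from math import sqrt

def knn_classify(history, query, k=3, weights=None):
    """Classify `query` by majority vote among its k most similar past cases."""
    weights = weights or [1.0] * len(query)

    def distance(a, b):
        # Weighted Euclidean distance; a weight of 0 ignores that variable.
        return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    nearest = sorted(history, key=lambda rec: distance(rec[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.9), "low"),
    ((5.0, 5.1), "high"),
    ((4.8, 5.3), "high"),
    ((5.2, 4.9), "high"),
]
by_first = knn_classify(history, (1.0, 5.0), weights=[1.0, 0.0])   # "low"
by_second = knn_classify(history, (1.0, 5.0), weights=[0.0, 1.0])  # "high"
```

The same query lands in different classes depending on the weights, which is why searching for a suitable metric matters for this kind of method.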
Discriminate (DS) • Determines what features of a selected data set distinguish it from the rest of the data • Requires no target variable • Can be powered by Find Laws, PolyNet Predictor, or Linear Regression
Linear Regression (LR) • Incorporates categorical and yes/no variables in the analysis correctly • Stepwise Linear Regression: only influential variables included • Can be used as a preprocessing and benchmarking module
PolyAnalyst features in more detail
Data Analysis Project Workflow • Access data • Understand, clean and transform data • Run machine learning analysis • Visualize, report and share results • Integrate results in existing business process
Data Access • ODBC-compliant databases: Oracle, DB2, Informix, Sybase, MS SQL Server, etc. • Dedicated access: IBM Visual Warehouse, Oracle Express • OLE DB (can do In-Place Data Mining) • CSV or DBF files • Data can be appended to the project when necessary
Data cleansing and manipulation • SQL querying through OLE DB • Records selection according to multiple criteria • Union, intersection, or complement of data sets • Categorical values aggregation • Visual Drill-through • Exceptional records filtering • Split into n-tile percentage intervals • Random sampling
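Two of the operations listed above, splitting into n-tile intervals and random sampling, can be sketched in a few lines. This is a generic illustration with hypothetical function names and a fixed `seed` for repeatability, not a description of how PolyAnalyst implements them.

```python
import random

def ntile_split(values, n):
    """Split sorted values into n groups of (nearly) equal size."""
    ordered = sorted(values)
    size = len(ordered)
    return [ordered[size * i // n : size * (i + 1) // n] for i in range(n)]

def random_sample(records, fraction, seed=0):
    """Draw a random subset containing roughly `fraction` of the records."""
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * fraction))

quartiles = ntile_split([7, 1, 5, 3, 8, 2, 6, 4], 4)  # [[1, 2], [3, 4], [5, 6], [7, 8]]
sample = random_sample(list(range(100)), 0.1)         # 10 distinct records
```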
Visualization • Histograms • Line and scatter plots with zoom and drill-through capabilities • Snake charts • Interactive 3D-charts • Interactive Rule-graphs with sliders for visualizing multi-variable relations • Frequency charts for categorical, integer, or yes/no variables • Lift and Gain charts for marketing applications
Histograms and Frequencies • Histogram displays the distribution of numerical variables • Frequencies chart displays the distribution of categorical and yes/no variables
2D charts and Rule-graphs • Sliders help visualize the effects of other variables in models with more than two dimensions • The Find Laws model (red line) for the dependence of a product's market share on price predicts a dramatic change in the formula when the product goes on promotion
Snake-charts • Quickly compare several datasets qualitatively on all their attributes [Chart labels: Compared data sets “High” and “Low”; All variables]