Integrating Data Mining with SQL Databases: OLE DB for Data Mining Presenter: Lei Chen
Overview • Background knowledge • Overview and design philosophy • Basic components • Operations on data model • Concluding remarks
What is Data Mining? • Definition: Discovery of useful summaries of data. • Some examples of applications: • Decision trees constructed from bank-loan histories to produce algorithms that decide whether to grant a loan. • Patterns of traveler behavior mined to manage the sale of discounted seats on planes, rooms in hotels, etc. • “Diapers and Beer” in supermarkets for increasing sales. • Comparison of genotypes with and without a condition allowed the discovery of genes that account for diseases.
Why Do We Need OLE DB? Microsoft created OLE DB to address the following needs. • A vast amount of critical information is found outside traditional corporate databases: in file systems, in personal systems such as Microsoft Access, in spreadsheets and e-mail, and even on the World Wide Web. • To take advantage of the benefits of database technology, such as queries, transactions, and security, businesses have traditionally had to move the data from its original container into a database management system (DBMS). This process is expensive and redundant. • Furthermore, businesses need to exploit the advantages of database technology not only when accessing data within a DBMS but also when accessing data in any other type of information container.
What is OLE DB? • Object Linking and Embedding for Databases is a means Microsoft uses for accessing different types of data stores in a uniform manner. OLE DB is now an open technology, available royalty-free on many operating systems. • OLE DB is a set of Component Object Model (COM) interfaces that provide applications with uniform access to data stored in diverse information sources, and that also provide the ability to implement additional database services. • OLE DB is the way to access data in a Microsoft COM environment. • Reference: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/oledb/htm/oledboverview_of_ole_db.asp
Overview • Background knowledge • Overview and design philosophy • Basic components • Operations on data model • Concluding remarks
Why OLE DB for Data Mining? • An industry standard is critical for data mining development, usage, interoperability, and exchange. • The objects, data types, properties, and programming model of OLE DB naturally cater to these needs. • Building mining applications over relational databases is nontrivial: • It needs different customized data mining algorithms and methods. • It requires significant work on the part of application builders. • Data providers: all structured data sources, i.e., data sources that support OLE DB. • Data consumers: all development tools or languages requiring access to a broad range of data.
Motivation of OLE DB for DM • Facilitate deployment of data mining models • Generate data mining models • Store, maintain and refresh models as data are updated • Programmatically use the model on other data sets • Browse models • Enable enterprise application developers to participate in building data mining solutions
Features of OLE DB for DM • Independent of any particular provider or software • Not specialized to any specific mining model • Structured to cater to all well-known mining models • Part of the upcoming release of Microsoft SQL Server 2000 • Proposes no new data mining algorithms; instead it suggests an infrastructure to “plug in” any algorithm
OLE DB for DM: Overview • Core relational engine exposes OLE DB as a language-based API • Analysis server exposes OLE DB OLAP and OLE DB DM • Maintain the SQL metaphor • Reuse existing notions [Architecture diagram: Data mining applications → OLE DB DM → Analysis Server → OLE DB → RDB engine]
Key Operations to Support Data Mining Models • Define a mining model • Attributes to be predicted • Attributes to be used for prediction • Algorithm used to build the model • Populate a mining model from training data • Predict attributes for new data • Browse a mining model for reporting and visualization
A Data Mining Model is Analogous to a Table in SQL • Create a data mining model object • CREATE MINING MODEL [model_name] • Insert training data into the model and train it • INSERT INTO [model_name] • Use the data mining model • SELECT relation_name.[id], [model_name].[predict_attr] • Consult the DMM content in order to make predictions and browse statistics obtained by the model • Use DELETE to empty/reset the model • Predictions on datasets: a prediction join between a model and a data set (table) • Deploy a DMM by just writing SQL queries, as sketched below!
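Putting these operations together, a minimal lifecycle sketch in the OLE DB for DM SQL dialect (it reuses the [Age Prediction] model defined on a later slide; the parts elided as “...” are spelled out there, and the DELETE/DROP forms follow the fundamental-operations list on the next slide):

    CREATE MINING MODEL [Age Prediction] (...) USING [Decision_Trees_101]   %define the model
    INSERT INTO [Age Prediction] (...) SHAPE {...}                          %populate and train it
    SELECT t.[Customer ID], [Age Prediction].[Age]
    FROM [Age Prediction] PREDICTION JOIN (...) AS t ON ...                 %predict on new data
    DELETE FROM [Age Prediction]                                            %empty/reset the model
    DROP MINING MODEL [Age Prediction]                                      %remove the model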
Overview • Background knowledge • Overview and design philosophy • Basic components • Operations on data model • Concluding remarks
Two Basic Components Beyond Traditional OLE DB • Cases/caseset: the input data • A table or nested tables (for hierarchical data) • Data mining model (DMM): a special type of table • A caseset and meta-information are associated with a DMM when it is created • Saves the mining algorithm and the resulting abstraction instead of the data itself • Fundamental operations: CREATE, INSERT INTO, PREDICTION JOIN, SELECT, DELETE FROM, and DROP
Flattened Representation of Caseset Problem: lots of replication! A sketch follows.
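Here is a sketch of what the flattened rowset might look like for the customer/purchase example used later in this deck (the values are illustrative, not from the paper):

    Customer ID  Gender  Age  Product Name  Quantity  Product Type
    1            Male    35   TV            1         Electronics
    1            Male    35   VCR           1         Electronics
    1            Male    35   Ham           2         Food

Every per-customer attribute (Gender, Age) is repeated once for each purchased product, which is the replication problem noted above.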
Logical Nested Table Representation of Caseset • Caseset: a set of cases. • Use the Data Shaping Service to generate a hierarchical rowset; a syntax sketch follows.
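A minimal sketch of the Data Shaping Service syntax, following the same SHAPE/APPEND/RELATE pattern used by the training example later in this deck:

    SHAPE
        {SELECT [Customer ID], [Gender], [Age] FROM Customers}
    APPEND
        ({SELECT [CustID], [Product Name], [Quantity], [Product Type] FROM Sales}
         RELATE [Customer ID] TO [CustID])
    AS [Product Purchases]

The parent rowset (Customers) gains a nested column [Product Purchases] holding each customer's matching Sales rows, so the per-customer attributes are stored only once.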
Defining A Data Mining Model • The name of the model • The algorithm and parameters • The algorithm for prediction using this model • The columns of caseset and the relationships among columns • “Source columns” and “prediction columns”
Example

CREATE MINING MODEL [Age Prediction]        %Name of model
(
    [Customer ID]       LONG   KEY,                              %source column
    [Gender]            TEXT   DISCRETE,                         %source column
    [Age]               DOUBLE DISCRETIZED() PREDICT,            %prediction column
    [Product Purchases] TABLE                                    %source column
    (
        [Product Name]  TEXT   KEY,                              %source column
        [Quantity]      DOUBLE NORMAL CONTINUOUS,                %source column
        [Product Type]  TEXT   DISCRETE RELATED TO [Product Name]   %source column
    )
)
USING [Decision_Trees_101]                  %Mining algorithm used
Content Type of Columns: Column Specifiers • KEY • ATTRIBUTE • RELATION (RELATED TO clause) • QUALIFIER (OF clause) • PROBABILITY: [0, 1] • VARIANCE • SUPPORT • PROBABILITY-VARIANCE • ORDER • TABLE
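A hypothetical fragment showing a few of these specifiers in column declarations. The KEY, RELATED TO, and TABLE forms appear in the example above; the PROBABILITY ... OF form is an assumption extrapolated from the specifier list, so treat it as a sketch rather than confirmed syntax:

    [Product Name]      TEXT   KEY,                                  %KEY
    [Product Type]      TEXT   DISCRETE RELATED TO [Product Name],   %RELATION (RELATED TO)
    [Age Probability]   DOUBLE PROBABILITY OF [Age],                 %QUALIFIER (assumed OF syntax)
    [Product Purchases] TABLE (...)                                  %nested TABLE column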
Content Type of Columns: Attribute Types • DISCRETE: e.g., “area code” • ORDERED: e.g., a ranking of skill level (say, one to five) • CYCLICAL: e.g., “day of week” • CONTINUOUS: e.g., “salary” • DISCRETIZED: e.g., “age” • SEQUENCE_TIME: a measurement for “time”
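A hedged sketch of how these attribute types might appear in column declarations (the column names, data types, and empty DISCRETIZED() argument list are illustrative assumptions, not from the paper):

    [Area Code]     TEXT   DISCRETE,
    [Skill Level]   LONG   ORDERED,
    [Day Of Week]   LONG   CYCLICAL,
    [Salary]        DOUBLE CONTINUOUS,
    [Age]           DOUBLE DISCRETIZED(),   %bucketing parameters may be supplied here
    [Purchase Time] DATE   SEQUENCE_TIME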
Overview • Background knowledge • Overview and design philosophy • Basic components • Operations on data model • Concluding remarks
Populating a DMM • Once a mining model is defined, the next step is to populate it by consuming a caseset that satisfies its specification; in other words, pull the information into a single rowset and use the INSERT INTO statement. • Consume a case using the data mining model. • Use the SHAPE statement to create the nested table from the input data. • Train the model using the data and the algorithm specified in the CREATE syntax.
Example: Populating a DMM

INSERT INTO [Age Prediction]
(
    [Customer ID], [Gender], [Age],
    [Product Purchases]([Product Name], [Quantity], [Product Type])
)
SHAPE
    {SELECT [Customer ID], [Gender], [Age]
     FROM Customers ORDER BY [Customer ID]}
APPEND
    ({SELECT [CustID], [Product Name], [Quantity], [Product Type]
      FROM Sales ORDER BY [CustID]}
     RELATE [Customer ID] TO [CustID])
AS [Product Purchases]
Using the Data Model to Predict • Prediction join • Prediction on a dataset D using a DMM M • Different from the join statement in SQL • The DMM acts as a “truth table” • The SELECT statement associated with PREDICTION JOIN specifies the values extracted from the DMM
Example: Using a DMM in Prediction

SELECT t.[Customer ID], [Age Prediction].[Age]
FROM [Age Prediction]
PREDICTION JOIN
    (SHAPE
        {SELECT [Customer ID], [Gender]
         FROM Customers ORDER BY [Customer ID]}
     APPEND
        ({SELECT [CustID], [Product Name], [Quantity]
          FROM Sales ORDER BY [CustID]}
         RELATE [Customer ID] TO [CustID])
     AS [Product Purchases]) AS t
ON  [Age Prediction].[Gender] = t.[Gender]
AND [Age Prediction].[Product Purchases].[Product Name]
      = t.[Product Purchases].[Product Name]
AND [Age Prediction].[Product Purchases].[Quantity]
      = t.[Product Purchases].[Quantity]
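The specification also defines prediction functions that may appear in the SELECT list; the sketch below assumes a PredictProbability function from the OLE DB for DM function set, so treat the exact call as illustrative:

    SELECT t.[Customer ID],
           [Age Prediction].[Age],
           PredictProbability([Age]) AS [Age Probability]   %confidence of the predicted age (sketch)
    FROM [Age Prediction] PREDICTION JOIN (...) AS t ON ... %same join as in the example above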
Browsing a DMM • What is in a DMM? • Rules, formulas, trees, etc. • Browsing the DMM • Visualization
Concluding Remarks • This paper focuses on the problem of integrating data mining with relational databases, rather than on KDD algorithms. • OLE DB for DM integrates data mining and database systems. • A good standard for mining application builders. • Possible improvements: • Extend the functionality of OLE DB for DM? • Design more concrete language primitives?
Finally: Why OLE DB for Data Mining? The disadvantage of other data mining languages is that they cannot handle arbitrary mining models or integrate a relational database API with mining applications, so Microsoft proposed OLE DB for DM. • DMQL: A Data Mining Query Language for Relational Databases • MSQL (Imielinski & Virmani ’99) • MineRule (Meo, Psaila & Ceri ’96) • Query flocks based on Datalog syntax (Tsur et al. ’98) • CRISP-DM (CRoss-Industry Standard Process for Data Mining)
Questions and comments? • Thank You!
OLE DB (sometimes written as OLEDB or OLE-DB) • OLE DB, Object Linking and Embedding for Databases, is a means Microsoft uses for accessing different types of data stores in a uniform manner. With this technology Microsoft has separated the data store from the application that needs access to it, because different applications need access to different types and sources of data and do not necessarily want to know how to access that functionality with technology-specific methods. OLE DB is conceptually divided into consumers and providers: the consumers are the applications that need access to the data, and the provider is the software component that exposes an OLE DB interface through the use of the Component Object Model (COM). • OLE DB is part of the MDAC (Microsoft Data Access Components) stack and serves as its database access interface technology. MDAC is a group of Microsoft technologies that interact as a framework giving programmers a uniform and comprehensive way to develop applications that access almost any data store. OLE DB providers can be created to access data stores as simple as a text file or spreadsheet and as complex as Oracle or SQL Server databases. However, because different data store technologies have different capabilities, an OLE DB provider may not implement every interface available in OLE DB. The capabilities that are available are implemented through COM objects: an OLE DB provider maps the data store technology's functionality to a particular COM interface. Microsoft calls such an interface "provider-specific", since it may not be applicable depending on the database technology involved. Providers may also augment the capabilities of a data store; these capabilities are known as services in Microsoft parlance.
ADO Microsoft ADO (ActiveX Data Objects) is a Component Object Model (COM) object for accessing data sources. It provides a layer between programming languages and OLE DB (a means of accessing data stores, whether they are databases or otherwise, in a uniform manner), which allows a developer to write programs that access data without knowing how the database is implemented; the database needs to be known only to open the connection. No knowledge of SQL is required to access a database when using ADO, although one can use ADO to execute arbitrary SQL commands; the disadvantage of doing so is that it introduces a dependency on the particular database.
Component Object Model • Component Object Model, or COM, is a Microsoft platform for software componentry. It is used to enable cross-application communication and dynamic object creation in any programming language that supports the technology. COM is often used in the software development world as an umbrella term that encompasses the OLE, ActiveX, COM+ and DCOM technologies. COM has been around since 1993; however, Microsoft only started emphasizing the name around 1997. • Although it has been implemented on several platforms, COM is primarily used with Microsoft Windows. It is expected to be replaced, at least to some extent, by the Microsoft .NET framework.
Component Object Model • The object model for all Windows software • A proven, de facto industry standard • Becoming available on non-Windows platforms • Unix, via Bristol and Mainsoft products • Software AG has 18 ports in progress for Unix and mainframe platforms
Data warehouse • A data warehouse is, primarily, a record of an enterprise's past transactional and operational information, stored in a database designed to favour efficient data analysis and reporting (especially OLAP). Data warehousing is not meant for current "live" data. • Data warehouses often hold large amounts of information, sometimes subdivided into smaller logical units called dependent data marts. • Usually, two basic ideas guide the creation of a data warehouse: • Integration of data from distributed and differently structured databases, which facilitates a global overview and comprehensive analysis in the data warehouse. • Separation of the data used in daily operations from the data used in the data warehouse for reporting, decision support, analysis, and controlling. • Periodically, one imports data from enterprise resource planning (ERP) systems and other related business software systems into the data warehouse for further processing. It is common practice to "stage" data prior to merging it into a data warehouse: to "stage data" means to queue it for preprocessing, usually with an ETL tool. The preprocessing program reads the staged data (often from a business's primary OLTP databases), performs qualitative preprocessing or filtering (including denormalization, if deemed necessary), and writes it into the warehouse.
ODBC • (Pronounced as separate letters.) Short for Open DataBase Connectivity, a standard database access method developed by the SQL Access Group in 1992. The goal of ODBC is to make it possible to access any data from any application, regardless of which database management system (DBMS) is handling the data. ODBC manages this by inserting a middle layer, called a database driver, between an application and the DBMS. The purpose of this layer is to translate the application's data queries into commands that the DBMS understands. For this to work, both the application and the DBMS must be ODBC-compliant; that is, the application must be capable of issuing ODBC commands and the DBMS must be capable of responding to them. Since version 2.0, the standard has supported SAG SQL.