Specification of distributed data mining workflows with DataMiningGrid

Specification of distributed data mining workflows with DataMiningGrid karthikreddy.nknalla@lakeheadu.ca

In this presentation • Benefits of grid-based technology from a data miner’s perspective. • Extension of the original DataMiningGrid system by a set of new components. • Evaluation of the DataMiningGrid system from a problem-oriented data mining perspective. • The capability of the DataMiningGrid workflow environment. • Demonstration of the capability of the DataMiningGrid system using case studies.

Introduction • Grid computing is the use of computer resources from multiple locations to reach a common goal. • Data mining techniques can be efficiently deployed in a grid environment. • An important benefit of embedding data mining into a grid environment is scalability. • Data mining is computationally intensive: when searching for patterns, most algorithms perform a costly search or an optimization routine that scales between O(n) and O(n^3) with the size n of the input data.

Improving scalability can build on the fact that a number of algorithms are able to compute results on subsets of the data in such a way that the results on the subsets can be merged or aggregated to give an overall result. • A well-known instance is the k-nearest-neighbor algorithm. • Another example is bootstrapping. • For such algorithms it is relatively easy to decompose the overall computation into a number of independent or nearly independent sub-computations.
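As a hedged illustration of the merge step for k-nearest-neighbor, the sketch below assumes the data are split into partitions, each partition returns its own k closest points to a query, and the partial lists are merged by keeping the k smallest distances overall. The function names and the toy data are illustrative assumptions and are not part of the DataMiningGrid system.

```python
import heapq
import math

def local_knn(partition, query, k):
    """Return the k nearest (distance, label) pairs found in one partition."""
    scored = [(math.dist(point, query), label) for point, label in partition]
    return heapq.nsmallest(k, scored)

def merge_knn(partial_results, k):
    """Merge per-partition candidates into the global k nearest neighbors."""
    all_candidates = [pair for partial in partial_results for pair in partial]
    return heapq.nsmallest(k, all_candidates)

# Toy data, split across two hypothetical sites.
site_a = [((0.0, 0.0), "a"), ((5.0, 5.0), "b")]
site_b = [((1.0, 0.0), "a"), ((0.0, 2.0), "b"), ((9.0, 9.0), "b")]
query = (0.0, 0.5)
k = 3

# Each site computes its candidates independently; only k pairs per site move.
partials = [local_knn(part, query, k) for part in (site_a, site_b)]
merged = merge_knn(partials, k)

# Reference result computed on the merged data set, for comparison only.
centralized = local_knn(site_a + site_b, query, k)
```

The merged result matches the centralized one because every global nearest neighbor is necessarily among the k nearest neighbors of its own partition.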

Distributed computing potentially offers great savings in processing time and excellent scale-up with the number of machines involved. • A second benefit of grid technology is the ability to handle problems that are inherently distributed, e.g. when data sets are distributed across organizational boundaries and cannot be merged easily because of organizational barriers (e.g. firewalls, data policies, etc.) or because of the amount of data involved.

DataMiningGrid environment • The objective of the DataMiningGrid project was to develop a generic system facilitating the development and deployment of grid-enabled data mining applications. • The main outcome is the collection of software components constituting the DataMiningGrid system. • It is based on a service-oriented architecture (SOA) designed to meet the requirements of modern distributed data mining scenarios. • The DataMiningGrid system forms an up-to-date platform for distributed data mining.

General architecture • The DataMiningGrid system is based on a layered architecture comprising the following layers: • hardware and software resources, • grid middleware, • DataMiningGrid High-Level Services and • client. • The main user interface of the system is the Triana Workflow Editor. • An application is defined as a stand-alone executable file that can be started from the command line.

Scalability • Scalability of distributed data mining is one of the key requirements for the DataMiningGrid system. • The scalability characteristics of the DataMiningGrid have already been evaluated elsewhere on the basis of different scenarios, e.g. a text-classification scenario and a scenario running the Weka algorithms. • The experiments demonstrated that grid-enabled applications in the DataMiningGrid system can reach very good scalability and that the flexibility of the system does not result in a significant performance overhead.

Workflow environment • The workflow editor used to design and execute complex workflows in the DataMiningGrid system is based on Triana. It allows composing workflows in a drag-and-drop manner. • Triana was extended by special components which allow access to and interaction with the DataMiningGrid grid environment. • The figure below gives an overview of the user interface.

A component inside a Triana workflow is called a unit. Each unit, which can be seen as a wrapper component, refers to special operations. • The Triana units are grouped in a treelike structure. In the user interface, units are split into several subgroups referring to their functionality, e.g. applications, data resources, execution, provenance and security.

Operations for workflow construction • A workflow in a data mining application is a series of operations consisting of data access and preparation, modeling, visualization, and deployment. • It typically forms a series of data transformations, starting with the input data and ending with a data mining model as final output. • Workflows in Triana are data driven. Control structures are not part of the workflow language but are handled by special components: for example, looping and branching are handled by specific Triana units.

Workflow Operations • In the DataMiningGrid system the client-side components are implemented as extensions to the Triana Workflow Editor and Manager. A workflow can be constructed by using • (a) the standard units provided by Triana and • (b) the DataMiningGrid extensions. • By using and combining these units, workflows can perform many different operations, such as • Chaining • Looping • Branching • Shipping algorithms • Shipping data • Parameter variation • Parallelization

Chaining • A typical data mining task spans a series of steps from data preparation to analysis and visualization that have to be executed in sequence. • Chaining is the concatenation of the execution of different algorithms in a serial manner. • The output of the previous task can be used as input for the next task. Different tasks can run on different machines. • It is up to the user to decide whether it makes sense to run a second task on the same or on a different machine, which could mean transferring the results of the first task over the network.
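Chaining can be pictured as function composition, where each task consumes the previous task's output. The sketch below is plain Python rather than a Triana workflow; the three step functions are hypothetical stand-ins for grid-enabled applications, and everything runs on one machine.

```python
from functools import reduce

def prepare(rows):
    """Data preparation: drop records with missing values."""
    return [r for r in rows if None not in r]

def model(rows):
    """Modeling: compute the mean of each column as a trivial 'model'."""
    columns = list(zip(*rows))
    return [sum(col) / len(col) for col in columns]

def report(means):
    """Visualization stand-in: format the model as text."""
    return ", ".join(f"{m:.1f}" for m in means)

def chain(steps, data):
    """Run the steps serially; each output becomes the next input."""
    return reduce(lambda result, step: step(result), steps, data)

raw = [(1.0, 2.0), (3.0, None), (3.0, 6.0)]
summary = chain([prepare, model, report], raw)
```

In a grid setting each step could run on a different machine, in which case the intermediate result passed between steps would have to travel over the network.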

Looping • The DataMiningGrid system provides different ways of performing loops. • Triana contains a Loop unit that controls repeated execution of a subworkflow. • It also provides loop functionality when grouping units. • The DataMiningGrid components do not directly provide loops but parameter sweeps. • Depending on the kind of loop to be performed, one or more of these choices are possible.

Branching • In a workflow there are different possibilities for branching without a condition. • After each application execution the Execution unit returns a URI as a reference to the results location in the grid. • To set up a branch, the GridURI output node of the Execution unit has to be cloned. • In addition, Triana contains a Duplicator unit that enables workflow branching. This unit can be used if there is no clonable output node at a unit. • The Duplicator unit’s function is to duplicate any object received at the input node.
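The effect of duplicating an output so that two branches can consume it can be sketched as follows. This is not the API of the Triana Duplicator unit; `duplicate`, the two branch functions, and the example URI are hypothetical, and the sketch only shows one result reference being fanned out to two independent consumers.

```python
import copy

def duplicate(obj, copies=2):
    """Return independent copies of an object, one per outgoing branch."""
    return [copy.deepcopy(obj) for _ in range(copies)]

def branch_statistics(result_ref):
    """First branch: describe a statistics job that reads the result."""
    return {"task": "statistics", "input": result_ref["uri"]}

def branch_visualization(result_ref):
    """Second branch: describe a visualization job that reads the result."""
    return {"task": "visualization", "input": result_ref["uri"]}

# Hypothetical reference to a results location in the grid.
result = {"uri": "gsiftp://example.org/results/job-1"}

left, right = duplicate(result)
jobs = [branch_statistics(left), branch_visualization(right)]
```

Both branches receive the same reference, so neither depends on the other and they can proceed independently.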

Shipping algorithms • Shipping of algorithms means sending the algorithm to the machine where the data it operates on are located. • The option to ship algorithms to data allows for flexibility in the selection of machines and reduces the overhead in setting up the data mining environment. This is especially important when the data naturally exist in a distributed manner and it is not possible to merge them. • The executable file that belongs to the application is transferred to the execution machine at each application execution. • If the algorithm is to be shipped to the data to avoid copying files among different sites, the machine where the data are located has to be selected as the execution machine.

Shipping data • Shipping of data means sending the data to the machine where the algorithm that processes them is running. • Each time an application is executed in the DataMiningGrid environment, the input data for the application are copied to a local work directory on the execution machine. • If only the data and not the algorithm are to be shipped, the machine where the job should run can be specified.
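The difference between the two shipping modes comes down to which machine is chosen for execution and, consequently, what has to cross the network. The sketch below is a simplified model under stated assumptions: the host names, the `plan_execution` helper, and the transfer list are hypothetical and do not reflect the actual DataMiningGrid job description.

```python
def plan_execution(data_host, chosen_host, ship_algorithm):
    """Decide where a job runs and what must be sent between hosts.

    ship_algorithm=True: run where the data already are.
    ship_algorithm=False: run on chosen_host and move the data there.
    """
    execution_host = data_host if ship_algorithm else chosen_host
    # Assumption: the executable always has to be sent to the execution host.
    network_transfers = ["executable"]
    if execution_host != data_host:
        # The input data live elsewhere, so they must be shipped as well.
        network_transfers.append("input data")
    return {
        "execution_host": execution_host,
        "network_transfers": network_transfers,
    }

to_data = plan_execution("site-a", "site-b", ship_algorithm=True)
to_algorithm = plan_execution("site-a", "site-b", ship_algorithm=False)
```

Shipping the algorithm keeps the (usually larger) data in place, which is the relevant case when data cannot be moved across organizational boundaries.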

Parameter variation • Parameter variation means executing the same algorithm with different input data and different parameter settings. • The DataMiningGrid system provides the possibility of using parameter sweeps, which means that a loop or a list can be specified for each option of the application, as well as a loop for the input files or directories. • With this approach it is possible to submit hundreds of jobs at the same time.
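A parameter sweep amounts to expanding every combination of option values and input files into one job each. The following minimal sketch assumes a hypothetical application with two options, `k` (given as a loop) and `metric` (given as a list); the option names, file names, and the `expand_sweep` helper are illustrative only.

```python
from itertools import product

def expand_sweep(options, input_files):
    """Yield one job description per combination of option values and input file."""
    names = list(options)
    for input_file in input_files:
        for values in product(*(options[name] for name in names)):
            yield {"input": input_file, **dict(zip(names, values))}

# Hypothetical options: a loop over k and a list of distance metrics.
options = {
    "k": range(1, 10, 2),                  # loop: 1, 3, 5, 7, 9
    "metric": ["euclidean", "manhattan"],  # list
}
input_files = ["part-1.csv", "part-2.csv", "part-3.csv"]

jobs = list(expand_sweep(options, input_files))
```

Here 3 input files, 5 values of `k`, and 2 metrics expand to 30 jobs, which shows how a few loops quickly produce hundreds of submissions.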

Parallelization • Parallelization means distributing subtasks of an algorithm and executing them in parallel. • The system supports the concurrent execution of several applications. • However, the DataMiningGrid environment does not support the parallelization of the execution of a single algorithm. • If the application itself takes care of the parallelization process, then an integration is possible.
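The distinction between running several applications concurrently and parallelizing one algorithm can be sketched with Python's standard `concurrent.futures` module. The example is a local analogy, not grid code: `run_application` and the job list are hypothetical, and each job is executed as an indivisible unit.

```python
from concurrent.futures import ThreadPoolExecutor

def run_application(job):
    """Stand-in for one application execution; returns the job name and result."""
    return job["name"], sum(job["values"])

jobs = [
    {"name": "job-1", "values": [1, 2, 3]},
    {"name": "job-2", "values": [10, 20]},
    {"name": "job-3", "values": [7]},
]

# Each job is an independent application run; the jobs themselves are not split.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(run_application, jobs))
```

Splitting the work inside a single `run_application` call would be the application's own responsibility, which mirrors the statement that single-algorithm parallelization has to be handled by the application itself.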

Extensibility • In the context of the DataMiningGrid system, there are two types of extensibility: on the client side (local extensibility) and in the grid environment, e.g. the inclusion of new grid-enabled applications. • Local extensibility: a local extension of the DataMiningGrid system is a client-side extension. • The Triana workflow editor, which is the main client of the system, can be easily extended by new workflow components. • Such components, implemented as Triana units, could, for instance, be viewer or inspection components, which may or may not directly interact with the grid environment.

Extensibility of the grid environment • The requirement for extensibility of the grid environment entails the following: • Extensibility without platform modification: A data miner who wants to make use of a grid-enabled data mining platform typically does not have any knowledge about the details of the underlying system. • Therefore, he or she does not want to change, and might not even be capable of changing, any components or program code of the data mining platform. • Additionally, the data miner may not want to be dependent on a grid platform developer. • Extensibility without algorithm modification: The data miner can reuse his or her favorite algorithms and combine them with third-party methods to address complex data mining tasks.

Related Work • Today several environments for grid-enabled data mining exist, e.g. Anteater, Discovery Net, GridMiner and Knowledge Grid. • All of these systems are, to a varying degree, capable of distributed data mining in the grid and have already been comprehensively reviewed and compared from an architectural perspective. • It should be noted that the conversion of an existing algorithm into a distributed one was in all cases straightforward. Additional components were needed only for splitting a large data set into pieces, gathering results or performing a simple vote. • The DataMiningGrid addresses the data miner’s needs by providing an extension mechanism that makes it possible to grid-enable a favorite algorithm without writing any code, simply by providing metadata that can even be specified via a webpage.
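To make the idea of grid-enabling a command-line application through metadata more concrete, the sketch below assembles a command line from a declarative description. The field names, the executable, and the `build_command` helper are all hypothetical; they do not reproduce the DataMiningGrid metadata schema and serve only to show that the algorithm itself stays untouched.

```python
import shlex

# Hypothetical application description; the field names are illustrative
# and do not reproduce the DataMiningGrid metadata schema.
app_description = {
    "executable": "java -jar classifier.jar",
    "options": [
        {"name": "trees", "flag": "-n", "default": "10"},
        {"name": "depth", "flag": "-d", "default": "5"},
    ],
    "input_flag": "-i",
}

def build_command(description, input_file, overrides=None):
    """Assemble a command line from the description and user-chosen values."""
    overrides = overrides or {}
    command = shlex.split(description["executable"])
    for option in description["options"]:
        value = overrides.get(option["name"], option["default"])
        command += [option["flag"], str(value)]
    command += [description["input_flag"], input_file]
    return command

command = build_command(app_description, "data.csv", {"trees": 50})
```

Because the description is data rather than code, adding another application under this model would mean writing another description, not changing the platform or the algorithm.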

Issues • The success of grid-enabled data mining technology depends on whether it is taken up in real-world data mining projects. For this to happen, a number of open issues have to be addressed. • First, case studies, success stories and systematic evaluations of one or more systems from a data mining perspective are rare so far. These case studies are needed to convince data miners to consider the use of grid technology for their real-world data mining applications. • Second, a comparison of grid-based platforms with more traditional platforms, as well as with new developments in distributed computing, is needed. • Third, on the architectural side, the combination of grid technology with mobile devices is a potentially very promising area, especially in scenarios where the data mining is done inside these mobile devices themselves.

Conclusion • A desirable future extension of the DataMiningGrid system is full compatibility with the CRISP-DM standard (CRoss-Industry Standard Process for Data Mining), a process model for data mining projects. • In general, in the DataMiningGrid system each single data mining task performed refers to the execution of a grid-enabled application, which results in separate job executions in the grid. • The workflow focuses on the grid-based aspects. • However, it is possible to map the grid-enabled applications in the DataMiningGrid system to the CRISP-DM tasks. • The DataMiningGrid system in principle covers all CRISP-DM phases, although it was not designed to follow the CRISP-DM structure and does not yet support the different tasks by templates or wizards. • Providing this CRISP-DM perspective would lead to a view of the data mining process that is more natural for the data miner.


Thank You. Any Questions?
