Grids, Grid Technologies and Data Mining
Peter Brezany
Institut für Softwarewissenschaft, Universität Wien
E-mail: brezany@par.univie.ac.at
Grid and Grid Technologies
Grid computing has emerged as an important field, distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation. The Grid itself is meant to connect computing resources over wide-area networks. Internet computing and Grid technologies promise to change the way we tackle complex problems. Harnessing these new technologies effectively will transform scientific disciplines ranging from high-energy physics to the life sciences. The Grid research field can be further divided into two subdomains:
- Computational Grid: a natural extension of the earlier cluster computer
- Data Grid: efficient management, placement, and replication of large amounts of data; once the data are in place, computational tasks can be run.
Data Mining on (Data) Grids
Data mining on the Grid (DMG): finding data patterns in an environment with geographically distributed data and computation, that is, an environment with special data management, data placement, and data replication. A good DMG algorithm analyzes data in a distributed fashion with modest data communication overhead. A typical DMG algorithm involves local data analysis followed by the generation of a global data model. Huge data volumes are involved, so high-performance I/O is needed.
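The "local analysis, then global model" pattern can be illustrated with a minimal Python sketch. It is not taken from the slides: the site names, the transactions, and the choice of item support as the "model" are all assumptions made for illustration. Each site reduces its data to item counts, and only those small summaries are sent on to be merged.

```python
from collections import Counter

def local_analysis(transactions):
    """Runs at one site: summarize the local data as item counts."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # count each item once per transaction
    return counts, len(transactions)

def build_global_model(local_results, min_support):
    """Runs at the coordinating site: merge summaries, keep frequent items."""
    total = Counter()
    n = 0
    for counts, size in local_results:
        total += counts
        n += size
    return {item: c / n for item, c in total.items() if c / n >= min_support}

# Hypothetical data held at two sites; the raw records never leave the site.
sites = {
    "A": [["x", "y"], ["x"], ["y", "z"]],
    "B": [["x", "z"], ["x", "y"]],
}
local_results = [local_analysis(data) for data in sites.values()]
model = build_global_model(local_results, min_support=0.5)
```

Only the per-site counters cross the network, which is what keeps the communication overhead modest compared with shipping the raw transactions.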
Application Examples
- Finding out how the emergence of hepatitis C depends on weather patterns: this requires access to a large hepatitis C database at one location and an environmental database at another location.
- Two major financial organizations want to cooperate. They need to share data patterns relevant to the data mining task, but they do not want to share the data itself because it is sensitive, so combining the databases may not be feasible.
- A major multinational corporation wants to analyze customer transaction records in order to develop successful business strategies quickly. It has thousands of establishments throughout the world, and collecting all the data in a centralized data warehouse, followed by analysis using existing commercial data mining software, takes too long.
- Telemedical applications: see the next two slides.
Components of Telemedical Applications
[Figure labels: Database, Database, Raw Medical Data, Derived Medical Data, Reconstructed Medical Data, Web]
Telemedical Collaboration - Example
A patient living in a remote village has a heart problem. An EEG is taken by the local doctor, and all the patient’s details are stored in the doctor’s PC-based telemedical system. MRI and CT scans are taken in different departments of a general hospital and stored in the telemedical DB. A consultant compiles a report and saves it in the DB. If necessary, a 3D ultrasound scan is taken in a specialized clinic and a further report is compiled. If complicated surgery is required, an external specialist using Virtual Reality techniques defines how the surgery should be planned. The resulting operation is recorded on video, e.g., for education. Data mining support/assistance is needed.
Grid Computing Concept
Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, and trust relationships.
Grid Computing Concept (2)
The term “the Grid” was coined in the mid-1990s to denote a proposed distributed computing infrastructure for science and engineering. The aim is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. Resources: computers, files, data to computers, sensors, networks, laboratory equipment, etc. Sharing is highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing forms a virtual organization (VO).
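To make "what is shared, who is allowed to share, and the conditions under which sharing occurs" concrete, here is a minimal Python sketch. The `SharingRule` class, its fields, the host name, and the CPU-hour limit are hypothetical and only illustrate the idea; real Grid middleware expresses such policies quite differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SharingRule:
    resource: str               # what is shared
    allowed_members: frozenset  # who is allowed to share
    max_cpu_hours: int          # a condition under which sharing occurs

def may_use(rule, member, requested_cpu_hours):
    """Return True only if both the member and the request satisfy the rule."""
    return (member in rule.allowed_members
            and requested_cpu_hours <= rule.max_cpu_hours)

# Hypothetical rule published by a resource provider for one VO.
rule = SharingRule("cluster.example.org", frozenset({"alice", "bob"}),
                   max_cpu_hours=100)
```

In this toy model, the VO is simply the set of members and providers bound together by such rules.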
Grid Computing Concept (3)
Grid technologies complement rather than compete with existing distributed computing technologies. For example, CORBA focuses on enabling resource sharing within a single organization. Grid technologies focus on dynamic, cross-organizational sharing.
Grid Communities and Applications: Home Computers Evaluate AIDS Drugs
• Community =
  • 1000s of home computer users
  • Philanthropic computing vendor (Entropia)
  • Research group (Scripps)
• Common goal = advance AIDS research
The Nature of Grid Architecture
A Grid architecture identifies fundamental system components, specifies the purpose and function of these components, and indicates how these components interact with one another. Interoperability is the central issue to be addressed. In a network environment, interoperability means common protocols. The Grid architecture is first and foremost a protocol architecture, with protocols defining the basic mechanisms by which VO users and resources negotiate, establish, manage, and exploit sharing relationships. Standard protocols make it easy to define standard services that provide enhanced capabilities and to construct Application Programming Interfaces and Software Development Kits.
The Nature of Grid Architecture (2)
Just as the Web revolutionized information sharing by providing a universal protocol and syntax (HTTP and HTML) for information exchange, so we require standard protocols and syntaxes for general resource sharing. A Grid protocol definition specifies
- how distributed system elements interact with one another in order to achieve a specified behavior, and
- the structure of the information exchanged during this interaction.
The Nature of Grid Architecture (3)
A Grid service is defined solely by the protocol that it speaks and the behaviors that it implements. There are standard Grid services for:
- access to computation
- access to data
- resource discovery
- coscheduling (mechanisms for coordinating operations across multiple resources)
- data replication, etc.
The definition of the above services allows us to enhance the services offered to VO participants and also to abstract away resource-specific details.
The Nature of Grid Architecture (4) Why do we also consider Application Programming Interfaces (APIs) and Software Development Kits (SDKs)? There is more to VOs than interoperability, protocols and services. Developers must be able to develop sophisticated applications in complex and dynamic execution environments. Users must be able to operate these applications. Standard abstractions, APIs, and SDKs can accelerate code development, enable code sharing, and enhance application portability. Summary: identification and definition of 1. protocols 2. services 3. APIs and SDKs.
Grid Architecture
The architecture is organized into layers (see the next slide). Components within each layer share common characteristics but can build on capabilities and behaviors provided by any lower layer. Resource and Connectivity protocols facilitate the sharing of individual resources. They are designed so that they can be implemented on top of a diverse range of resource types, defined at the Fabric layer, and can in turn be used to construct a wide range of global services and application-specific behaviors at the Collective layer.
Layered Grid Architecture (By Analogy to Internet Architecture)
Grid layers, top to bottom:
- Application
- Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
- Resource: “Sharing single resources”: negotiating access, controlling use
- Connectivity: “Talking to things”: communication (Internet protocols) & security
- Fabric: “Controlling things locally”: access to, & control of, resources
Internet Protocol Architecture layers, top to bottom: Application, Transport, Internet, Link
Fabric: Interface to Local Control
The Grid Fabric layer provides the resources to which shared access is mediated. Fabric components implement the local resource-specific operations that occur as a result of sharing operations at higher levels. At a minimum, resources should implement enquiry mechanisms that permit discovery of their structure and state, and resource management mechanisms that provide some control of delivered quality of service.
Fabric: Interface to Local Control (2)
A resource-specific characterization of capabilities:
- Computational resources: Mechanisms for starting programs and for monitoring and controlling the execution of the resulting processes.
- Storage resources: Mechanisms for putting and getting files. Enquiry functions for determining hardware and software characteristics and information about available space and utilization.
- Network resources: Mechanisms that provide control over the resources allocated to network transfers. Enquiry functions to determine network characteristics and load.
- Code repositories: Managing versioned source and object code.
- Catalogs: Catalog query and update operations.
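As a rough illustration of the storage case, the sketch below shows a toy resource that offers put/get mechanisms plus an enquiry function reporting its state. The class name, method names, and dictionary keys are hypothetical and not part of any Grid toolkit; the "files" live in memory purely to keep the example self-contained.

```python
class InMemoryStorageResource:
    """Toy storage resource: put/get files plus an enquiry function."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self._files = {}  # file name -> bytes

    def used_bytes(self):
        return sum(len(data) for data in self._files.values())

    def put(self, name, data):
        # Overwriting an existing file releases its old space first.
        old_size = len(self._files.get(name, b""))
        if self.used_bytes() - old_size + len(data) > self.capacity_bytes:
            raise OSError("not enough space")
        self._files[name] = data

    def get(self, name):
        return self._files[name]

    def enquire(self):
        """Enquiry function: report structure and state of the resource."""
        used = self.used_bytes()
        return {
            "capacity_bytes": self.capacity_bytes,
            "used_bytes": used,
            "free_bytes": self.capacity_bytes - used,
        }

store = InMemoryStorageResource(capacity_bytes=10)
store.put("a.dat", b"12345")
state = store.enquire()
```

Higher layers would only see `put`, `get`, and `enquire`, not how the space is managed locally.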
Connectivity: Communicating Easily and Securely
The Connectivity layer defines the core communication and authentication protocols required for Grid-specific network transactions. Communication protocols enable the exchange of data between Fabric layer resources. Authentication protocols build on communication services to provide cryptographically secure mechanisms for verifying the identity of users and resources.
Connectivity (2)
Authentication solutions for VO environments should have the following characteristics:
- Single sign-on: Users must be able to “log on” (authenticate) just once and then have access to multiple Grid resources defined by the Fabric layer, without further user intervention.
- Delegation: A user must be able to endow a program with the ability to run on that user’s behalf, so that the program is able to access the resources on which the user is authorized.
- Integration with various local security solutions: Grid security solutions must be able to interoperate with various local security solutions.
- User-based trust relationships: If a user has the right to use sites A and B, the user should be able to use sites A and B together without requiring that A’s and B’s security administrators interact.
Resource: Sharing Single Resources
The Resource layer defines protocols (and APIs and SDKs) for secure initiation, monitoring, and control of sharing operations on individual resources. The primary classes of Resource layer protocols:
- Information protocols are used to obtain information about the structure and state of a resource, e.g., its configuration, current load, and usage policy.
- Management protocols are used to negotiate access to a shared resource, specifying, for example, resource requirements and the operations to be performed, such as process creation or data access. A protocol may support monitoring the status of an operation and controlling (e.g., terminating) the operation.
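The two protocol classes can be sketched as request/response messages handled by a single resource. This is a toy model under stated assumptions: the message fields (`op`, `job_id`, `program`), the operation names, and the slot-based load model are invented for illustration and do not correspond to a real Grid protocol.

```python
import itertools

class ComputeResource:
    """Toy resource answering information and management requests."""

    def __init__(self, name, total_slots):
        self.name = name
        self.total_slots = total_slots
        self._jobs = {}                 # job id -> state
        self._ids = itertools.count(1)

    def _running(self):
        return sum(1 for state in self._jobs.values() if state == "running")

    def handle(self, request):
        op = request["op"]
        if op == "info":                # information protocol
            return {"name": self.name,
                    "total_slots": self.total_slots,
                    "free_slots": self.total_slots - self._running()}
        if op == "submit":              # management: process creation
            if self._running() >= self.total_slots:
                return {"ok": False, "reason": "no free slots"}
            job_id = next(self._ids)
            self._jobs[job_id] = "running"
            return {"ok": True, "job_id": job_id}
        if op == "status":              # management: monitoring
            return {"state": self._jobs.get(request["job_id"], "unknown")}
        if op == "terminate":           # management: control
            if self._jobs.get(request["job_id"]) == "running":
                self._jobs[request["job_id"]] = "terminated"
                return {"ok": True}
            return {"ok": False, "reason": "job not running"}
        return {"ok": False, "reason": "unknown operation"}

node = ComputeResource("node1", total_slots=1)
first = node.handle({"op": "submit", "program": "analyze"})
second = node.handle({"op": "submit", "program": "analyze"})  # resource is full
node.handle({"op": "terminate", "job_id": first["job_id"]})
info = node.handle({"op": "info"})
```

The second submission is refused because the single slot is taken; after the first job is terminated, the information request reports the slot as free again.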
Collective: Coordinating Multiple Resources
The Collective layer contains protocols and services (and APIs and SDKs) that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources. This layer can, e.g., implement:
- Directory services, which allow VO participants to discover the existence and/or properties of VO resources.
- Co-allocation, scheduling, and brokering services, which allow VO participants to request the allocation of one or more resources for a specific purpose and the scheduling of tasks on the appropriate resources.
- Monitoring and diagnostics services, which support the monitoring of VO resources for failure, adversarial attack (“intrusion detection”), overload, and so forth.
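A directory service and a simple broker built on top of it might look like the following Python sketch. The `Directory` class, the `broker` function, the property names (`cpus`, `disk_gb`), and the site names are hypothetical; the matching rule (every requested property must meet a numeric minimum) is a deliberate simplification.

```python
class Directory:
    """Toy directory service: resources register their properties."""

    def __init__(self):
        self._entries = {}

    def register(self, name, **properties):
        self._entries[name] = properties

    def find(self, **required):
        """Names of resources whose numeric properties meet all minimums."""
        return sorted(
            name for name, props in self._entries.items()
            if all(props.get(key, 0) >= value for key, value in required.items())
        )

def broker(directory, count, **required):
    """Co-allocation: return `count` matching resources, or None if too few."""
    matches = directory.find(**required)
    return matches[:count] if len(matches) >= count else None

directory = Directory()
directory.register("siteA", cpus=64, disk_gb=500)
directory.register("siteB", cpus=8, disk_gb=2000)
directory.register("siteC", cpus=32, disk_gb=1000)
allocation = broker(directory, 2, cpus=16)
```

The broker never talks to a specific resource type; it relies only on what the directory reports, which is what makes it a Collective-layer service in this picture.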
Collective (2)
- Data replication services support the management of VO storage (and perhaps also network and computing) resources to maximize data access performance with respect to metrics such as response time, reliability, and cost.
- Grid-enabled programming systems enable familiar programming models to be used in Grid environments, e.g., Grid-enabled implementations of the Message Passing Interface (MPI).
- Software discovery services discover and select the best software implementation and execution platform based on the parameters of the problem being solved.
- Community authorization servers enforce community policies governing resource access.
- Collaboratory services support the coordinated exchange of information within potentially large user communities.
Applications
Applications are constructed in terms of, and by calling upon, services defined at any layer. Effective application development can often benefit from the use of higher-level languages and frameworks (e.g., the Common Component Architecture, CORBA, etc.). These higher-level systems can build on protocols, services, and APIs provided within the Grid architecture.
Protocols, Services, and Interfaces Occur at Each Level
[Figure: stack, top to bottom]
- Applications
- Languages/Frameworks
- Collective Service APIs and SDKs, Collective Service Protocols, Collective Services
- Resource APIs and SDKs, Resource Service Protocols, Resource Services
- Connectivity APIs, Connectivity Protocols
- Local Access APIs and Protocols
- Fabric Layer
Data Grid
The need for Data Grids stems from the fact that scientific applications such as data analysis in High Energy Physics, climate modeling, or earth observation are very data-intensive, and a large community of researchers all around the globe wants fast access to the data. Future Data Grid applications: Medical Grids and E-Business Grids. Grid Data Warehousing and Grid Data Mining: a challenging new field.
Storage Model
Two different kinds of files:
- Master files (owned by their creators)
- Replica files. There may be many replicas of a master file. Replicas are owned by, managed by, and may be deleted by, the Grid.
The notion of replicas is new, and critical in a Grid environment. Example: Before a DataGrid job can run at site A, data at site B may need to be copied to site A. This data may then be used by subsequent jobs at site A, or may be needed by jobs at site C, which has a better network connection to site A than to site B. For this reason, the data should be kept at site A as long as possible. The ReplicaManager keeps track of all replica data so that the replica selection service can select the optimal replica to use for a given job, or request the creation of a new replica.
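The site A/B/C example can be traced with a small Python sketch. The class below is a toy stand-in, not the actual DataGrid ReplicaManager: its methods, the file name `run42.dat`, and the transfer costs are assumptions chosen so that site C is closer to A than to B.

```python
class ReplicaManager:
    """Toy replica catalog with cost-based replica selection."""

    def __init__(self, transfer_cost):
        # transfer_cost[(src, dst)]: hypothetical cost of copying src -> dst
        self.transfer_cost = transfer_cost
        self.replicas = {}  # logical file name -> set of sites holding a copy

    def register(self, lfn, site):
        self.replicas.setdefault(lfn, set()).add(site)

    def select(self, lfn, job_site):
        """Return (source site, cost) of the cheapest copy for a job."""
        def cost(src):
            return 0 if src == job_site else self.transfer_cost[(src, job_site)]
        best = min(sorted(self.replicas[lfn]), key=cost)
        return best, cost(best)

    def stage(self, lfn, job_site):
        """Copy the data to the job site if needed and record the new replica."""
        source, cost = self.select(lfn, job_site)
        if source != job_site:
            self.register(lfn, job_site)
        return source, cost

costs = {("B", "A"): 10, ("B", "C"): 30, ("A", "C"): 5}
manager = ReplicaManager(costs)
manager.register("run42.dat", "B")        # master file lives at site B
first = manager.stage("run42.dat", "A")   # job at A: data copied from B
second = manager.stage("run42.dat", "A")  # later job at A: local replica
third = manager.stage("run42.dat", "C")   # job at C: A is cheaper than B
```

Keeping the replica at site A pays off twice here: the second job at A needs no transfer, and the job at C fetches from A at a lower cost than from B.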
SQLDatabaseService
This service allows very large amounts of metadata held in any type of local or remote RDBMS to be stored, retrieved, and queried efficiently. The database can be used for the implementation of catalogs.
GridMiner
A Framework for Data Mining on Grids
A new research field
Architecture of a Data Mining System
[Figure labels: Graphical user interface; Pattern evaluation; Data mining engine; Knowledge base; Database or data warehouse server; Data cleaning, data integration; Filtering; Database; Data warehouse]
Decomposition of a Knowledge Discovery Process
- Preprocessing
  - data cleaning
  - data transformation
  - data reduction
- Data mining (e.g., association rules)
  - find frequent itemsets
  - generate association rules
- Evaluation of discovered patterns
- Graphical User Interface
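The two data mining steps named above, finding frequent itemsets and generating association rules, can be sketched in a few lines of Python. This is a brute-force enumeration for small examples, not the pruned level-wise search a production miner would use; the transactions, thresholds, and the `max_size` limit are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset of up to max_size items; keep the frequent ones."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def association_rules(freq, min_confidence):
    """Generate rules lhs -> rhs with confidence = support(all) / support(lhs)."""
    rules = []
    for itemset, support in freq.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so freq[lhs] is always present.
                confidence = support / freq[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules

transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
freq = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(freq, min_confidence=0.9)
```

With these inputs, five itemsets reach 50% support, and the only rule with confidence of at least 0.9 is butter -> bread, since every transaction containing butter also contains bread.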
Our Philosophy
- Data mining systems can be decomposed into a set of communicating components: a distributed component architecture.
- Placement of data-processing functionalities is critical.
- Grid data mining research is tightly coupled to the ongoing work on parallel I/O for Grids (e.g., the Armada project at Dartmouth College, USA).
Basic Grid Data Mining Models
1. Local data analysis followed by the generation of a global data model, adapting distributed data mining techniques. No data replication.
2. Data mining system components are optimally located on the Grid. No dynamic data replication.
3. Data mining system components are optimally located on the Grid. Dynamic data replication is considered.
Data Storage and the Components
[Figure: Sites A, B, C, and D each with Preprocessing and Local DM; Site E: Construction of the Global Model, GUI]