Grid Computing Research and Applications. Sornthep Vannarat Large scale Simulation Research Laboratory National Electronics and Computer Technology Center. Outline. Introduction to Grid computing Open Grid Service Architecture Bioinformatics applications on Grid Information Grid project
Grid Computing Research and Applications Sornthep Vannarat Large scale Simulation Research Laboratory National Electronics and Computer Technology Center
Outline • Introduction to Grid computing • Open Grid Service Architecture • Bioinformatics applications on Grid • Information Grid project • GEO Grid project • Knowledge Grid • Web 2.0 and Grid computing • Grid activities at NECTEC
Large scale Simulation Research Laboratory Understand, innovate and manage problems through computer simulations • Understand: computer simulation leads to the discovery of new knowledge, which is essential for developing advanced technology for the economy and for people's quality of life • Innovate: applying computer simulation to engineering design leads to products with higher quality and capability, and to production processes that are efficient and save energy and raw materials • Manage problems: for environmental problems and disasters, computer simulation can help predict changes and the effects of various factors, leading to an understanding of the problem and supporting good planning
Main activities • Building high-performance computing systems and large-scale data storage systems • Studying and applying virtualization middleware • Developing infrastructure and middleware for integrating computer systems and data • Developing software for building models • Applying computer modeling to build knowledge, to support engineering design, and to manage and solve problems Related technologies • Cluster computing, Grid computing, large-scale data storage systems • Distributed processing, Web Services, XML, Java programming • Numerical computation, the finite element method (FEM) and computational fluid dynamics (CFD)
What is Grid computing? • Next-generation computing platform and global cyberinfrastructure for solving large-scale problems in science, engineering, and business • Grid Café [http://gridcafe.web.cern.ch/gridcafe/] • Web is a service for sharing information over the Internet, the Grid is a service for sharing computer power and data storage capacity over the Internet • Ian Foster • 1998: Computational Grid is a hardware and software infrastructure that provides dependable, consistent, and pervasive access to high-end computational capabilities • 2000: Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations • 2002: Grid is a system that (1) coordinates resources that are NOT subject to centralized control (2) uses standard, open, general purpose protocols and interfaces (3) delivers non-trivial qualities of service
Status of Grid computing • A promising work in progress • Usable, but only with a lot of effort • WISDOM: • EGEE Docking project • Find new inhibitors for proteins produced by Plasmodium falciparum • Over 46 million docking simulations in 6 weeks using 1,700 computers in 15 countries, equivalent to 80 CPU-years • Beyond computing power
Types of Grids • Computing grid • Data/storage grid • Information grid • Instrument grid • Access grid
The Grid Problem • Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources (from “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”) • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals -- assuming the absence of… • central location, • central control, • omniscience, • existing trust relationships.
Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Sharing always conditional: issues of trust, policy, negotiation, payment, … • Coordinated problem solving • Beyond client-server: distributed data analysis, computation, collaboration, … • Dynamic, multi-institutional virtual orgs • Community overlays on classic org structures • Large or small, static or dynamic
Challenges • To provide seamless access • Heterogeneous environments • Multiple administrative domains and autonomy issues • Scalability • Dynamicity/adaptability
Grid computing middleware • “Global Grids and Software Toolkits: A Study of Four Grid Middleware Technologies”, Parvin Asadzadeh et al. • UNICORE • Uniform Interface to Computing Resources • Ready-to-run Grid system including client and server software • UNICORE 6.0.1 released 26 Nov 2007: WSRF-based implementation • Globus Toolkit • Developed by Globus Alliance • Open source software toolkit used for building grids with services written in a combination of C and Java • GT 4.0.5: OGSA, WSRF-based • Legion, Gridbus • EGEE’s gLite
One View of Requirements • Adaptation • Intrusion detection • Resource management • Accounting & payment • Fault management • System evolution • Etc. • Etc. • … • Identity & authentication • Authorization & policy • Resource discovery • Resource characterization • Resource allocation • (Co-)reservation, workflow • Distributed algorithms • Remote data access • High-speed data transfer • Performance guarantees • Monitoring
Layered Grid Architecture (figure: Grid layers shown beside the Internet Protocol Architecture) • Application • Collective: “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services • Resource: “Sharing single resources”: negotiating access, controlling use • Connectivity: “Talking to things”: communication (Internet protocols) & security • Fabric: “Controlling things locally”: access to, & control of, resources • Internet Protocol Architecture layers: Application, Transport, Internet, Link
Open Grid Services Architecture • Service-oriented architecture • Key to virtualization, discovery, composition, local-remote transparency • Leverage industry standards • Internet, Web services • Distributed service management • A “component model for Web services” • A framework for the definition of composable, interoperable services “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
Web Services • XML-based distributed computing technology • Web service = a server process that exposes typed ports to the network • Described by the Web Services Description Language, an XML document that contains • Type of message(s) the service understands & types of responses & exceptions it returns • “Methods” bound together as “port types” • Port types bound to protocols as “ports” • A WSDL document completely defines a service and how to access it • WSRF
Extension of WS • Lifecycle management • Stateful • Subscribable
Writing Grid Service • Define the interface with WSDL, wsrp • Implement the service (Java) • Define the deployment parameters (WSDD, JNDI) • Compile GAR file (Ant) • Deploy service (GT4)
Notification • Polling and pushing • WS-Topics: topic trees • WS-BaseNotification: subscribe, notify • WS-BrokeredNotification: broker
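The subscribe/notify pattern over a topic tree can be illustrated with a small in-process sketch. This is not WS-Notification itself: the class name `NotificationProducer`, the `/`-separated topic paths, and the rule that a subscriber to a parent topic also receives messages for its child topics are all assumptions made for illustration.

```python
from collections import defaultdict


class NotificationProducer:
    """Toy in-process stand-in for a notification producer (hypothetical API)."""

    def __init__(self):
        # topic path -> list of callbacks taking (topic, message)
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback for a topic such as 'jobs' or 'jobs/failed'."""
        self._subscribers[topic].append(callback)

    def notify(self, topic, message):
        """Push a message to subscribers of the topic and of every ancestor topic."""
        parts = topic.split("/")
        delivered = 0
        for depth in range(1, len(parts) + 1):
            ancestor = "/".join(parts[:depth])
            for callback in self._subscribers.get(ancestor, []):
                callback(topic, message)
                delivered += 1
        return delivered


received = []
producer = NotificationProducer()
producer.subscribe("jobs", lambda topic, msg: received.append(("all-jobs", topic, msg)))
producer.subscribe("jobs/failed", lambda topic, msg: received.append(("failures", topic, msg)))

failed_count = producer.notify("jobs/failed", "job 17 exited with an error")
done_count = producer.notify("jobs/done", "job 18 finished")
```

Pushing replaces polling here: the subscriber registers once and is called back, instead of repeatedly asking the service whether anything changed. A broker, in this picture, would be a third party holding the subscriber table on behalf of several producers.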
Lifecycle management • Creation operation: factory service • Access and destroy operations: instance service • Destroy operation • Immediate • Scheduled (lease based)
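The difference between immediate and scheduled (lease-based) destruction can be sketched as follows. The class `ResourceFactory` and its method names are hypothetical and are not the GT4 or WSRF API; time is passed in explicitly as a number of seconds so that the behaviour is easy to follow.

```python
import itertools


class ResourceFactory:
    """Toy factory that creates resource instances and tracks their leases."""

    def __init__(self):
        self._ids = itertools.count(1)
        # resource id -> termination time in seconds (None means no lease)
        self._termination = {}

    def create(self, now, lease_seconds=None):
        """Creation operation: return the id of a new resource instance."""
        resource_id = f"resource-{next(self._ids)}"
        self._termination[resource_id] = (
            None if lease_seconds is None else now + lease_seconds
        )
        return resource_id

    def destroy(self, resource_id):
        """Immediate destruction, requested explicitly by the client."""
        del self._termination[resource_id]

    def renew(self, resource_id, now, lease_seconds):
        """Extend the lease; a client that stops renewing lets the resource expire."""
        self._termination[resource_id] = now + lease_seconds

    def sweep(self, now):
        """Scheduled destruction: remove every resource whose lease has run out."""
        expired = [
            rid for rid, deadline in self._termination.items()
            if deadline is not None and deadline <= now
        ]
        for rid in expired:
            del self._termination[rid]
        return expired

    def alive(self):
        return sorted(self._termination)


factory = ResourceFactory()
leased = factory.create(now=0, lease_seconds=60)
unleased = factory.create(now=0)
```

The appeal of a lease is that resources held by clients that crash or disconnect are reclaimed without any explicit destroy call.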
Monitoring & Discovery (figure; labels only) • Clients (e.g., WebMDS) • Index services hosted in GT4 containers • WS-ServiceGroup registration & WSRF/WSN access • Adapter: custom protocols for non-WSRF entities • Automated registration in container: GridFTP, RFT, GRAM • User
Security • Privacy • Integrity • Authentication • Authorization • Non-repudiation
PKI • Public Key Infrastructure • Key-based encryption • Symmetric and asymmetric encryption • Public and private keys • Digital signature • Digital certificate • Certificate Authority (CA)
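How a digital signature uses the private and public keys can be shown with a deliberately toy RSA sketch. Everything here is an assumption for illustration: the key is built from two known Mersenne primes, there is no padding, and the key is far too small to be secure. Real grid deployments rely on vetted cryptographic libraries and X.509 certificates issued by a CA.

```python
import hashlib

# Toy key material (assumption): two known Mersenne primes. NOT secure.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                              # public modulus
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (needs Python 3.8+)


def digest_as_int(message: bytes) -> int:
    """Hash the message and reduce it below the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Sign with the private key: only the key owner can produce this value."""
    return pow(digest_as_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Verify with the public key: anyone can check the signature."""
    return pow(signature, E, N) == digest_as_int(message)


signature = sign(b"submit job 42")
```

A digital certificate then adds one more signature: the CA signs the binding between a user's identity and that user's public key, so a verifier only needs to trust the CA.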
GSI • Grid Security Infrastructure • Transport and message-level security • Authorization schemes • Credential delegation and single sign-on • Different levels of security: container, service, and resource
OGSA-DAI • An extensible framework for data access and integration • Expose heterogeneous data resources to a grid through web services • Interact with data resources • Queries and updates • Data transformation / compression • Data delivery • Application-specific functionality • A base for higher-level services • Federation, mining, visualisation,… • Open Grid Forum DAIS Working Group • DAIS (Database Access and Integration) specifications • OGSA-DAI to be a reference implementation of DAIS
OGSA-DAI functionality • Interaction with data resources • Relational – MySQL, SQL Server, DB2, PostGres, Oracle • XMLDB – eXist, Xindice • Files – text, binary, indexed • SQL multi-resources – aggregation of OGSA-DAI services exposing relational resources • Transformation and compression • ZIP, GZIP, XSLT, ResultSet-to-WebRowSet, ResultSet-to-CSV, … • WebRowSet projection, frequency distribution, random sample, … • Delivery • Local file, HTTP, SMTP, SOAP attachments, GridFTP, other OGSA-DAI services • Resource creation and destruction • Document-oriented interface – service interface is resource agnostic
Bioinformatics and Grid • Bioinformatics applications often require high-performance computing and large data handling • Tools: bioinformatics tools and web services • Data: • Public databases • Biological knowledge: ontology and meta data • unpublished data • Grid computing meets the requirements • Computing Grids • Data Grids • Knowledge Grids
Computing Grid • High throughput computing • Thousands of small independent tasks • Grid computing vs. cluster computing • Both aim at parallel and distributed computing • They differ in network latency and robustness • Task failures are much more frequent in grid computing • Two types of high-throughput computing • numerical processing • symbolic processing
High throughput numerical processing • Systems biology aims at modeling of biological dynamics in molecules, cells, organs and individuals • Huge computational power is needed for • molecular folding • molecular docking • spatiotemporal molecular interaction • kinetic parameter estimation • Problem decomposition techniques • parameter sweep • stochastic modeling
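A parameter sweep decomposes one large study into many independent jobs, one per parameter combination. The sketch below is only an illustration: `simulate` is a hypothetical stand-in (the steady-state fraction of a two-state model with rates `k_on` and `k_off`), and a local thread pool stands in for grid worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(k_on, k_off):
    """Hypothetical stand-in for one independent job.

    Returns the steady-state fraction in the bound state of a two-state
    model A <-> B with forward rate k_on and backward rate k_off.
    """
    return k_on / (k_on + k_off)


def sweep(k_on_values, k_off_values, max_workers=4):
    """Run one job per parameter combination and collect the results."""
    grid = list(product(k_on_values, k_off_values))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Jobs share no state, so they can run in any order on any worker.
        results = list(pool.map(lambda params: simulate(*params), grid))
    return dict(zip(grid, results))


results = sweep(k_on_values=[1.0, 3.0], k_off_values=[1.0, 2.0])
```

Because the jobs are independent, a failed job can simply be resubmitted without affecting the others, which matters when task failures are frequent.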
WISDOM • EGEE Docking project • Find new inhibitors for proteins produced by Plasmodium falciparum • Over 46 million docking simulations in 6 weeks • 1,700 computers in 15 countries • Equivalent to 80 CPU-years
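The WISDOM figures can be put in proportion with some back-of-the-envelope arithmetic. The sketch assumes a 365-day year and treats "over 46 million" as exactly 46 million, so the per-docking time is an upper estimate; the derived numbers are not stated in the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600       # assumption: non-leap year
SECONDS_PER_WEEK = 7 * 24 * 3600

cpu_seconds = 80 * SECONDS_PER_YEAR      # "equivalent to 80 CPU-years"
dockings = 46_000_000                    # "over 46 million docking simulations"
wall_seconds = 6 * SECONDS_PER_WEEK      # "6 weeks"
computers = 1_700

cpu_seconds_per_docking = cpu_seconds / dockings
average_busy_cpus = cpu_seconds / wall_seconds
average_busy_fraction = average_busy_cpus / computers

print(f"{cpu_seconds_per_docking:.1f} CPU-seconds per docking")
print(f"{average_busy_cpus:.0f} CPUs busy on average")
print(f"{average_busy_fraction:.0%} of the 1,700 computers")
```

Under these assumptions each docking costs a little under a minute of CPU time, and the sustained load corresponds to roughly 700 CPUs, well below the 1,700 computers involved.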
DIANE • Enhanced version of WISDOM • Light-weight framework • Search for drugs for predicted variants of H5N1 • 2 million docking complexes with a size of 600 gigabytes • 2,000 grid worker nodes in 17 countries
Limitations of EGEE Infrastructure • Experiences from virtual screening projects • Overall grid efficiency about 50 percent • Major sources of failure • Server license failure 23% • Workload management failure 10% • Site failure 9%
Study of kinetic pathways • Estimation of ODEs for modeling of metabolic pathways and signal transduction pathways • Genetic algorithms: • Estimating optimal parameter fitting to biological experimental results • High degrees of parallelism (multiple trials with initial conditions) • Parameter-parameter dependencies: • Calculating moment parameters, such as AUC, MRT, VRT
High throughput symbolic processing • Sequence analysis: Homology searches, Genome comparisons, Genome-wide analyses • Sequencing data are expected to increase more rapidly • High-throughput DNA sequencing technologies • Metagenomic projects • Human resequencing projects • Genome sequencing projects on other species • Requires large databases such as DNA and protein sequence • Sharing and updating of biological databases on the grid are of key importance
Sharing biological databases • Become more and more difficult and intractable • Automatic updating of databases is necessary • Concerns • Duplicated database copying • Disk overflow • Unexpected shutdown • Version management • File checksum integrity verification • Parallel and pipelined mechanisms for high-throughput data transfer
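File checksum verification, one of the concerns listed above, can be sketched with the Python standard library. The choice of SHA-256 and the function names are assumptions; the slides do not say which checksum the grid tools use.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_replica(path, expected_hex):
    """Return True if the local copy matches the checksum published with the database."""
    return file_sha256(path) == expected_hex.lower()
```

A site would run `verify_replica` after each transfer and re-fetch the file on a mismatch, which also catches copies truncated by disk overflow or an unexpected shutdown.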
EGEE Framework • EGEE provides a general framework for sharing replicas of biological databases, represented by • Physical File Name (PFN) • Logical File Name (LFN) • Globally Unique Identifier (GUID) • Replica Manager System (RMS) • Replica Metadata Catalog (RMC) • Replica Location Service (RLS) • Figure: several LFNs (LFN-1, LFN-2, LFN-3) map through the RMC to one GUID, which maps through the RLS to several PFNs (PFN-1, PFN-2)
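The two-level naming scheme (logical names resolve to one GUID, which resolves to the physical replicas) can be sketched with two dictionaries. The class `ReplicaCatalog`, its methods, and the example file names are hypothetical and are not the EGEE API.

```python
import uuid


class ReplicaCatalog:
    """Toy model: the RMC maps LFN -> GUID, the RLS maps GUID -> PFNs."""

    def __init__(self):
        self._rmc = {}   # logical file name -> GUID
        self._rls = {}   # GUID -> list of physical file names

    def register(self, lfn, pfn):
        """Register a new file under a fresh GUID with its first replica."""
        guid = str(uuid.uuid4())
        self._rmc[lfn] = guid
        self._rls[guid] = [pfn]
        return guid

    def add_alias(self, new_lfn, existing_lfn):
        """Give the same file another logical name."""
        self._rmc[new_lfn] = self._rmc[existing_lfn]

    def add_replica(self, lfn, pfn):
        """Record another physical copy of the file."""
        self._rls[self._rmc[lfn]].append(pfn)

    def resolve(self, lfn):
        """Return every physical location for a logical name."""
        return list(self._rls[self._rmc[lfn]])


catalog = ReplicaCatalog()
guid = catalog.register("lfn:/bio/db/example.fasta", "srm://site-a.example/db/example.fasta")
catalog.add_alias("lfn:/bio/db/latest.fasta", "lfn:/bio/db/example.fasta")
catalog.add_replica("lfn:/bio/db/example.fasta", "srm://site-b.example/db/example.fasta")
```

Because users only see logical names, replicas can be added, moved, or removed without changing the names that jobs refer to.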
GADU • Genome Analysis and Database Update system • Automated, scalable, high-throughput computational workflow engine • Executes bioinformatics tools (BLAST, BLOCKS, PFam, Chisel and InterPro) • Public databases (NCBI RefSeq, PIR, InterPro and KEGG)
Homology Search • Grid BLAST implementations have been developed and reported • Prestaging of sequence databases to minimize the runtime overhead of transferring large sequence databases • Database updates that keep data consistent across the data grid • Dynamic load balancing of query sequences • Assembling the results from distributed jobs
Genome Comparison • Most promising life science applications for grid computing • Expandable and flexible large scale computing facility is needed • E.g. Investigation of horizontal gene transfer among 354,606 ORFs extracted from more than 100 microbial genomes • Used 229 CPUs located in 5 institutions • Number of pair-wise sequence comparisons ∝ N²
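The quadratic growth can be made concrete for the example above. The sketch assumes an all-against-all comparison in which each unordered pair of ORFs is compared once, giving N(N-1)/2 comparisons; the slides state only the proportionality, and the even split across CPUs is likewise an idealisation.

```python
import math


def pairwise_comparisons(n):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return math.comb(n, 2)


orfs = 354_606
cpus = 229

total_pairs = pairwise_comparisons(orfs)
pairs_per_cpu = total_pairs / cpus                     # assumes an even split
growth_when_doubled = pairwise_comparisons(2 * orfs) / total_pairs

print(f"{total_pairs:.3e} comparisons in total")
print(f"{pairs_per_cpu:.3e} comparisons per CPU")
print(f"doubling N multiplies the work by about {growth_when_doubled:.2f}")
```

Under these assumptions the study involves on the order of 6 × 10^10 comparisons, and doubling the number of ORFs roughly quadruples the work, which is why an expandable computing facility is needed.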
Integration of bioinformatics services • Resourceome • Uniform and secure interface • Providing workflows • Using Metadata and ontology • Metadata, ontology, XML: fill the semantic gap of heterogeneous databases • Framework: OGSA based on WSRF
RbsB in Different Formats • DDBJ • SWISS-PROT • PDB
BioPfuga • Workflow system integrating application programs • Separates application programs into smaller parts • Standardizes the data format for transferring data between different application programs
Bioinformatics workflow • Necessary for end-users of bioinformatics web/grid services • Taverna provides a workflow language and graphical user interface for: building, running and editing of workflows • Semantic indexing system of bioinformatics services has become essential for choosing resources • Searching functionally similar bioinformatics workflows is also important • Bioinformatics ontology is essential for automatic generation of bioinformatics workflows
Secure Data Access • Many bioinformatics databases are public and freely available • But access to the data needs to be strictly controlled in distributed collaborative research (for example, clinical data) • Public Key Infrastructure (PKI) is the predominant method for enforcing authentication • The Virtual Organization for Trials and Epidemiological Studies (VOTES) project uses Internet2 Shibboleth technology
Information Grid • An open and flexible infrastructure that facilitates the integration of any information, anywhere, across heterogeneous data sources in a grid environment • 3 essential components • MDL: Marker Description Language • Information Services • Information Brokers