Future of Database Systems 2: XML Databases and Grid-based Digital Libraries • University of California, Berkeley • School of Information Management and Systems • SIMS 257: Database Management
Lecture Outline • Review • Future of Database Systems • XML and DBMS • Grid-Based Digital Libraries • Data Grids • Grid-based IR • DBMS and usability
Radio has no future. Heavier-than-air flying machines are impossible. X-rays will prove to be a hoax. • William Thomson (Lord Kelvin), 1899
This “Telephone” has too many shortcomings to be seriously considered as a means of communication. The device is inherently of no value to us. • Western Union, Internal Memo, 1876
I think there is a world market for maybe five computers • Thomas Watson, Chair of IBM, 1943
By the turn of this century, we will live in a paperless society. • Roger Smith, Chair of GM, 1986
I predict the internet… will go spectacularly supernova and in 1996 catastrophically collapse. • Bob Metcalfe (3Com founder and inventor of Ethernet), 1995
Accomplishments of DBMS Research • DBMS are now used in almost every computing environment to create, organize and maintain large collections of information, and this is largely due to the results of the DBMS research community’s efforts, in particular: • Relational DBMS • Transaction management • Distributed DBMS
Next Generation Database Systems • Where are we going from here? • Hardware is getting faster and cheaper • DBMS technology continues to improve and change • OODBMS • ORDBMS • Bigger challenges for DBMS technology • Medicine, design, manufacturing, digital libraries, sciences, environment, planning, etc...
Examples • NASA EOSDIS • Estimated 10^16 bytes (about 10 petabytes) • Computer-aided design • The Human Genome • Department store tracking • Mining non-transactional data (e.g. scientific data, text data?) • Insurance company • Multimedia DBMS support
New Features • New Data types • Rule Processing • New concepts and data models • Problems of Scale • Parallelism/Grid-based DB • Tertiary Storage vs Very Large-Scale Disk Storage • Heterogeneous Databases • Memory Only DBMS
Coming to a Database Near You… • Browsability • User-defined access methods • Security • Steering long processes • Federated Databases • IR capabilities • XML • The Semantic Web(?)
Some things to consider • Bandwidth will keep increasing and getting cheaper (and go wireless) • Processing power will keep increasing • Moore’s law: Number of circuits on the most advanced semiconductors doubling every 18 months • Memory and Storage will keep getting cheaper (and probably smaller) • “Storage law”: Worldwide digital data storage capacity has doubled every 9 months for the past decade • Put it all together and what do you have? • “The ideal database machine would have a single infinitely fast processor with infinite memory with infinite bandwidth – and it would be infinitely cheap (free)” : David DeWitt and Jim Gray, 1992
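To make the two doubling rates concrete, the short Python sketch below compounds them over a ten-year horizon. The horizon and the helper name `growth_factor` are illustrative assumptions, not figures from the lecture.

```python
def growth_factor(months_elapsed, doubling_period_months):
    """Multiplicative growth after months_elapsed for a given doubling period."""
    return 2 ** (months_elapsed / doubling_period_months)

DECADE_MONTHS = 120  # assumed horizon: ten years

# Moore's law as quoted on the slide: doubling every 18 months
circuits = growth_factor(DECADE_MONTHS, 18)
# "Storage law" as quoted on the slide: doubling every 9 months
storage = growth_factor(DECADE_MONTHS, 9)

print(f"Circuits per chip after a decade: about {circuits:.0f}x")
print(f"Storage capacity after a decade: about {storage:.0f}x")
```

Because the storage doubling period is half the circuit doubling period, storage growth over any interval is the square of circuit growth: roughly a hundredfold versus roughly ten-thousandfold over a decade.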
Lecture Outline • Review • Future of Database Systems • XML and DBMS • Grid-Based Digital Libraries • Data Grids • Grid-based IR • DBMS and usability
Standards: XML/SQL • As part of SQL3, an extension called XML/SQL is being created to provide a mapping from XML to DBMS (the published standard names this part SQL/XML) • The (draft) standard is very complex, but the ideas are actually pretty simple • Suppose we have a table called EMPLOYEE that has columns EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY
Standards: XML/SQL • That table can be mapped to:
<EMPLOYEE>
  <row>
    <EMPNO>000020</EMPNO>
    <FIRSTNAME>John</FIRSTNAME>
    <LASTNAME>Smith</LASTNAME>
    <BIRTHDATE>1955-08-21</BIRTHDATE>
    <SALARY>52300.00</SALARY>
  </row>
  <row> … etc. …
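The EMPLOYEE mapping above can be sketched in a few lines of Python. This is only an illustration of the row-to-element idea using the standard library's `xml.etree.ElementTree`; the function name `table_to_xml` is an assumption, and the sketch does not implement the draft standard itself (no schema generation, no type mapping).

```python
import xml.etree.ElementTree as ET

def table_to_xml(table_name, columns, rows):
    """Map each row to a <row> element with one child element per column."""
    root = ET.Element(table_name)
    for row in rows:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return root

# Column names and the sample row are taken from the slide
columns = ["EMPNO", "FIRSTNAME", "LASTNAME", "BIRTHDATE", "SALARY"]
rows = [("000020", "John", "Smith", "1955-08-21", "52300.00")]

employee_xml = ET.tostring(table_to_xml("EMPLOYEE", columns, rows), encoding="unicode")
print(employee_xml)
```

In practice the rows would come from a database cursor rather than a literal list; the mapping logic stays the same.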
Standards: XML/SQL • In addition, the standard says that XML Schemas must be generated for each table, and it also allows relations to be managed by nesting records from tables in the XML • It is unclear whether this has actually been implemented by anyone • There is actually something very similar in the Cheshire II interface to RDBMS
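The nesting idea can be sketched as follows: rows from a child table are placed inside the element for their parent row. The DEPARTMENT table, its columns, and the function name `nest_rows` are hypothetical and chosen only for illustration; the draft standard's actual nesting rules are not reproduced here.

```python
import xml.etree.ElementTree as ET

def nest_rows(parent_name, parent_row, child_name, child_rows):
    """Build a parent element and nest each child row inside it."""
    parent_el = ET.Element(parent_name)
    for column, value in parent_row.items():
        ET.SubElement(parent_el, column).text = str(value)
    for child_row in child_rows:
        child_el = ET.SubElement(parent_el, child_name)
        for column, value in child_row.items():
            ET.SubElement(child_el, column).text = str(value)
    return parent_el

# Hypothetical DEPARTMENT row with a one-to-many link to EMPLOYEE rows
department = {"DEPTNO": "D11", "DEPTNAME": "Planning"}
employees = [{"EMPNO": "000020", "LASTNAME": "Smith"}]

nested_xml = ET.tostring(
    nest_rows("DEPARTMENT", department, "EMPLOYEE", employees),
    encoding="unicode",
)
print(nested_xml)
```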
Lecture Outline • Review • Future of Database Systems • XML and DBMS • Grid-Based Digital Libraries • Data Grids • Grid-based IR • DBMS and usability
Grid-based Digital Libraries • So what’s this Grid thing anyhow? • Data Grids and Distributed Storage • Grid-Based IR • Grid-Based Digital Libraries • Note: This lecture borrows heavily from presentations by Ian Foster (Argonne National Laboratory & University of Chicago), Reagan Moore and others from San Diego Supercomputer Center
The Grid: On-Demand Access to Electricity • [Figure axes: quality and economies of scale vs. time] Source: Ian Foster
By Analogy, A Computing Grid • Decouple production and consumption • Enable on-demand access • Achieve economies of scale • Enhance consumer flexibility • Enable new devices • On a variety of scales • Department • Campus • Enterprise • Internet Source: Ian Foster
Not Exactly a New Idea … • “The time-sharing computer system can unite a group of investigators …. one can conceive of such a facility as an … intellectual public utility.” • Fernando Corbato and Robert Fano, 1966 • “We will perhaps see the spread of ‘computer utilities’, which, like present electric and telephone utilities, will service individual homes and offices across the country.” • Len Kleinrock, 1967 Source: Ian Foster
But, Things are Different Now • Networks are far faster (and cheaper) • Faster than computer backplanes • “Computing” is very different than pre-Net • Our “computers” have already disintegrated • E-commerce increases size of demand peaks • Entirely new applications & social structures • We’ve learned a few things about software Source: Ian Foster
Computing isn’t Really Like Electricity • I import electricity but must export data • “Computing” is not interchangeable but highly heterogeneous: data, sensors, services, … • This complicates things; but also means that the sum can be greater than the parts • Real opportunity: Construct new capabilities dynamically from distributed services • Raises three fundamental questions • Can I really achieve economies of scale? • Can I achieve QoS across distributed services? • Can I identify apps that exploit synergies? Source: Ian Foster
Why the Grid? (1) Revolution in Science • Pre-Internet • Theorize &/or experiment, alone or in small teams; publish paper • Post-Internet • Construct and mine large databases of observational or simulation data • Develop simulations & analyses • Access specialized devices remotely • Exchange information within distributed multidisciplinary teams Source: Ian Foster
Why the Grid? (2) Revolution in Business • Pre-Internet • Central data processing facility • Post-Internet • Enterprise computing is highly distributed, heterogeneous, inter-enterprise (B2B) • Business processes increasingly computing- & data-rich • Outsourcing becomes feasible => service providers of various sorts Source: Ian Foster
New Opportunities Demand New Technology • “Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations” Source: Ian Foster
Building an Open Grid • Open Standards • Open Source • Open Infrastructure • [Diagram: the three combine to form an Open Grid]
Grids and Open Standards • [Figure axes: increased functionality and standardization vs. time] • Custom solutions • Globus Toolkit: de facto standards (X.509, LDAP, FTP, …; GGF: GridFTP, GSI) • Open Grid Services Architecture: Web services; GGF: OGSI, … (+ OASIS, W3C); multiple implementations, including Globus Toolkit • App-specific services
Open Grid Services Architecture • Service-oriented architecture • Key to virtualization, discovery, composition, local-remote transparency • Leverage industry standards • Internet, Web services • Distributed service management • A “component model for Web services” • A framework for the definition of composable, interoperable services “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Foster, Kesselman, Nick, Tuecke, 2002
Realizing a Service-Oriented Architecture: How Do I • Create, name, manage, discover services? • Render resources, data, sensors as services? • Negotiate service level agreements? • Express & negotiate policy? • Organize & manage service collections? • Establish identity, negotiate authentication? • Manage VO membership & communication? • Compose services efficiently? • Achieve interoperability?
Web Services • XML-based distributed computing technology • Web service = a server process that exposes typed ports to the network • Described by the Web Services Description Language (WSDL), an XML document that contains • Type of message(s) the service understands & types of responses & exceptions it returns • “Methods” bound together as “port types” • Port types bound to protocols as “ports” • A WSDL document completely defines a service and how to access it
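As a rough illustration of how messages, operations and port types fit together, the sketch below parses a heavily trimmed WSDL 1.1 document with Python's standard library and lists the operations grouped under each port type. The service, message and namespace names (`EmployeeLookup`, `urn:example:employee`, and so on) are hypothetical, and the bindings and ports that a complete WSDL document would contain are omitted.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}  # WSDL 1.1 namespace

# Hypothetical, trimmed document: types, bindings and ports are left out
WSDL_DOC = """
<definitions name="EmployeeLookup"
             xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="urn:example:employee">
  <message name="GetEmployeeRequest"/>
  <message name="GetEmployeeResponse"/>
  <portType name="EmployeePortType">
    <operation name="GetEmployee">
      <input message="tns:GetEmployeeRequest"/>
      <output message="tns:GetEmployeeResponse"/>
    </operation>
  </portType>
</definitions>
"""

def list_operations(wsdl_text):
    """Return {port type: {operation: (input message, output message)}}."""
    root = ET.fromstring(wsdl_text.strip())
    port_types = {}
    for port_type in root.findall("wsdl:portType", WSDL_NS):
        operations = {}
        for operation in port_type.findall("wsdl:operation", WSDL_NS):
            operations[operation.get("name")] = (
                operation.find("wsdl:input", WSDL_NS).get("message"),
                operation.find("wsdl:output", WSDL_NS).get("message"),
            )
        port_types[port_type.get("name")] = operations
    return port_types

operations_by_port_type = list_operations(WSDL_DOC)
print(operations_by_port_type)
```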
Open Grid Services Infrastructure • Client introspection: What port types? What policy? What state? • Interfaces: GridService (required); other standard interfaces: factory, notification, collections • Naming: a Grid Service Handle is mapped to a Grid Service Reference by handle resolution • State exposed as service data elements (data access) • Lifetime management: explicit destruction, soft-state lifetime • Implementation runs in a hosting environment/runtime (“C”, J2EE, .NET, …)
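The two lifetime-management options named on this slide, explicit destruction and soft-state lifetime, can be sketched as below. This is a toy model of the general soft-state idea (a service expires unless a client keeps renewing it), not the OGSI interface; the class names, method names and the handle string are all assumptions made for illustration.

```python
import time

class SoftStateService:
    """Hypothetical service instance that expires unless clients renew it."""

    def __init__(self, handle, lifetime_seconds, clock=time.monotonic):
        self.handle = handle
        self._clock = clock
        self.termination_time = clock() + lifetime_seconds
        self.destroyed = False

    def extend_lifetime(self, lifetime_seconds):
        """Keepalive: move the termination time further into the future."""
        if self.is_expired():
            raise RuntimeError("service has already expired")
        self.termination_time = self._clock() + lifetime_seconds

    def destroy(self):
        """Explicit destruction requested by a client."""
        self.destroyed = True

    def is_expired(self):
        return self.destroyed or self._clock() >= self.termination_time


class FakeClock:
    """Manually advanced clock so the example runs without sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
service = SoftStateService("hypothetical-handle", lifetime_seconds=30, clock=clock)

clock.advance(20)
service.extend_lifetime(30)          # renewed at t=20, now expires at t=50
clock.advance(25)
alive_after_renewal = not service.is_expired()   # t=45, still alive
clock.advance(10)
expired_without_renewal = service.is_expired()   # t=55, no renewal arrived
print(alive_after_renewal, expired_without_renewal)
```

The appeal of soft state is that a crashed or disconnected client needs no cleanup protocol: resources it was holding are reclaimed once the renewals stop.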
The Grid as Enabler of 21st Century Science • Entirely new approaches to enquiry based on • Deep analysis of huge quantities of data • Interdisciplinary collaboration • Large-scale simulation • Smart instrumentation • Made possible by an infrastructure that enables access to, and integration of, resources & services without regard for location
Grid Infrastructure • Broadly deployed services in support of fundamental collaborative activities • Formation & operation of virtual organizations • Authentication, authorization, discovery, … • Services, software, and policies enabling on-demand access to critical resources • Computers, databases, networks, storage, software services,… • Operational support for 24x7 availability • Integration with campus and commercial infrastructures
The Foundations are Being Laid • [Map of sites: Edinburgh, Glasgow, DL, Newcastle, Belfast, Manchester, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Soton] • [Map legend: Tier0/1 facility, Tier2 facility, Tier3 facility; 10 Gbps link, 2.5 Gbps link, 622 Mbps link, other link]
Data Grid Problem • “Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data” • Note that this problem: • Is common to many areas of science • Overlaps strongly with other Grid problems
Data Grids for High Energy Physics • [Diagram: tiered data flow from the detector down to physicist workstations] • There is a “bunch crossing” every 25 nsecs; there are 100 “triggers” per second; each triggered event is ~1 MByte in size • Online System: ~PBytes/sec in, ~100 MBytes/sec out • Offline Processor Farm: ~20 TIPS, ~100 MBytes/sec to Tier 0 • Tier 0: CERN Computer Centre • Tier 0 to Tier 1: ~622 Mbits/sec, or Air Freight (deprecated) • Tier 1: FermiLab (~4 TIPS), France Regional Centre, Germany Regional Centre, Italy Regional Centre • Tier 1 to Tier 2: ~622 Mbits/sec • Tier 2: Caltech and four other Tier2 Centres, ~1 TIPS each • Tier 2 to institutes: ~622 Mbits/sec • Institutes: ~0.25 TIPS, each with a physics data cache; ~1 MBytes/sec to workstations • Tier 4: physicist workstations (Pentium II 300 MHz) • HPSS storage shown at five centres • 1 TIPS is approximately 25,000 SpecInt95 equivalents • Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server • Image courtesy Harvey Newman, Caltech
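The rates quoted on this slide can be checked with a little arithmetic. The sketch below uses only the slide's numbers, plus two labeled assumptions: continuous year-round operation (real experiments run for only part of the year) and decimal units (1 PByte = 10^9 MBytes).

```python
# Figures from the slide
BUNCH_CROSSING_NS = 25
TRIGGERS_PER_SECOND = 100
EVENT_SIZE_MBYTES = 1
WAN_LINK_MBITS_PER_SECOND = 622

# Assumptions for illustration only
SECONDS_PER_YEAR = 365 * 24 * 3600   # continuous operation, no downtime
MBYTES_PER_PBYTE = 1e9               # decimal units

crossings_per_second = 1e9 / BUNCH_CROSSING_NS
kept_fraction = TRIGGERS_PER_SECOND / crossings_per_second
recorded_mbytes_per_second = TRIGGERS_PER_SECOND * EVENT_SIZE_MBYTES
yearly_pbytes = recorded_mbytes_per_second * SECONDS_PER_YEAR / MBYTES_PER_PBYTE
wan_link_mbytes_per_second = WAN_LINK_MBITS_PER_SECOND / 8

print(f"Bunch crossings per second: {crossings_per_second:.0f}")
print(f"Fraction of crossings recorded: {kept_fraction:.1e}")
print(f"Recorded stream: {recorded_mbytes_per_second} MBytes/sec")
print(f"Recorded per year (assumed continuous): {yearly_pbytes:.2f} PBytes")
print(f"One 622 Mbits/sec link carries: {wan_link_mbytes_per_second:.2f} MBytes/sec")
```

Two things stand out: the trigger keeps only a few crossings per million, and a single 622 Mbits/sec link carries less than the 100 MBytes/sec recorded stream, which is consistent with the emphasis on regional centres and institute-level caching.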
Data Intensive Issues Include … • Harness [potentially large numbers of] data, storage, network resources located in distinct administrative domains • Respect local and global policies governing what can be used for what • Schedule resources efficiently, again subject to local and global constraints • Achieve high performance, with respect to both speed and reliability • Catalog software and virtual data
Data Intensive Computing and Grids • The term “Data Grid” is often used • Implies a distinct infrastructure, which it isn’t; but easy to say • Data-intensive computing shares numerous requirements with collaboration, instrumentation, computation, … • Security, resource mgt, info services, etc. • Important to exploit commonalities as very unlikely that multiple infrastructures can be maintained • Fortunately this seems easy to do!
Examples of Desired Data Grid Functionality • High-speed, reliable access to remote data • Automated discovery of “best” copy of data • Manage replication to improve performance • Co-schedule compute, storage, network • “Transparency” with respect to delivered performance • Enforce access control on data • Allow representation of “global” resource allocation policies
A Model Architecture for Data Grids • [Diagram: replica selection workflow] • Application sends an attribute specification to the Metadata Catalog, which returns a logical collection and logical file name • The Replica Catalog maps these to multiple locations • Replica Selection picks the selected replica using performance information & predictions (MDS, NWS) • Data is fetched over GridFTP (control channel and data channel) • Replica Location 1, Replica Location 2, Replica Location 3: disk caches, tape library, disk array Source: Arcot Rajasekar (SDSC)
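The replica-selection step in this architecture can be sketched as a lookup followed by a ranking. The catalog contents, the bandwidth predictions and the function name `select_replica` are all hypothetical; a real deployment would query a replica catalog service and a performance monitor rather than in-memory dictionaries.

```python
def select_replica(logical_name, replica_catalog, predicted_mbytes_per_sec):
    """Pick the replica location with the highest predicted transfer rate."""
    locations = replica_catalog.get(logical_name, [])
    if not locations:
        raise LookupError(f"no replicas registered for {logical_name!r}")
    # Locations with no prediction are ranked last
    return max(locations, key=lambda loc: predicted_mbytes_per_sec.get(loc, 0.0))

# Hypothetical replica catalog: logical file name -> physical locations
replica_catalog = {
    "collection/run42.dat": [
        "replica-location-1",
        "replica-location-2",
        "replica-location-3",
    ],
}

# Hypothetical predictions, standing in for a performance information service
predicted_mbytes_per_sec = {
    "replica-location-1": 12.0,
    "replica-location-2": 45.5,
    "replica-location-3": 3.2,
}

best = select_replica("collection/run42.dat", replica_catalog, predicted_mbytes_per_sec)
print(best)
```

Keeping the logical name separate from the physical locations is what lets the selection policy change (for example, to account for cost or access rights) without touching the application.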
Data Grid Requirements • Seamless access to data and information stored at local and remote sites • Virtualization of data, collection and meta information • Handle Dataset Scaling – size & number • Integrate Data Collections & Associated Metadata • Handle Multiplicity of Platforms, Resource & Data Types • Handle Seamless Authentication • Handle Access Control • Provide Auditing Facilities • Handle Legacy Data & Methods Source: Arcot Rajasekar (SDSC)
SRB as a Solution • The Storage Resource Broker is middleware • It virtualizes resource access • It mediates access to distributed heterogeneous resources • It uses a MetaCATalog (MCAT) to facilitate the brokering • It integrates data and metadata • [Diagram: Application → SRB Server (with MCAT; HRM) → distributed storage resources (database systems, archival storage systems, file systems, ftp, http, …): DB2, Oracle, Illustra, ObjectStore; HPSS, ADSM, UniTree; UNIX, NTFS, HTTP, FTP] Source: Arcot Rajasekar (SDSC)
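The brokering idea can be sketched as a catalog that maps logical names to a storage type and physical path, plus one driver per storage type. This is a toy model of the general pattern only, not the SRB or MCAT API; every class, method, path and metadata field below is an assumption made for illustration, and the storage back ends are in-memory stand-ins.

```python
class StorageBroker:
    """Hypothetical broker: one logical namespace over several storage back ends."""

    def __init__(self):
        self._drivers = {}   # resource type -> function(physical_path) -> bytes
        self._catalog = {}   # logical name -> (resource type, physical path, metadata)

    def register_driver(self, resource_type, read_fn):
        self._drivers[resource_type] = read_fn

    def register_object(self, logical_name, resource_type, physical_path, **metadata):
        self._catalog[logical_name] = (resource_type, physical_path, metadata)

    def find(self, **criteria):
        """Return logical names whose metadata match every criterion."""
        return sorted(
            name
            for name, (_, _, metadata) in self._catalog.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        )

    def read(self, logical_name):
        """Fetch an object by logical name; the caller never sees where it lives."""
        resource_type, physical_path, _ = self._catalog[logical_name]
        return self._drivers[resource_type](physical_path)


# In-memory stand-ins for a file system and an archival store (assumptions)
fake_filesystem = {"/data/run1.dat": b"file-system bytes"}
fake_archive = {"tape07:0042": b"archived bytes"}

broker = StorageBroker()
broker.register_driver("filesystem", fake_filesystem.__getitem__)
broker.register_driver("archive", fake_archive.__getitem__)
broker.register_object("/collection/run1", "filesystem", "/data/run1.dat", experiment="demo")
broker.register_object("/collection/run2", "archive", "tape07:0042", experiment="demo")

matches = broker.find(experiment="demo")
payloads = [broker.read(name) for name in matches]
print(matches, payloads)
```

The application queries by metadata and reads by logical name; which back end serves each object is resolved entirely inside the broker.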