The Cloud and databases: Issues
What kind of data management is a good fit with the cloud? • Analytical data management: data attributes • Far more reads than writes, so security and privacy are less of an issue • Tend to have far greater data needs, so there is a need for more servers • The size of the data set grows over time and does not stabilize, so it is a better fit with expanding cloud server availability • Analytical applications often want data from multiple sources, and availability is much better in a cloud environment
More on analytical processing • Analytical data management: system attributes • Shared nothing works better when access is mostly reads • ACID transactions do not need to be enforced, as there is no need for a single, global state for all users • Generally, statistical results are okay even if some very secure data is not discovered
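A minimal sketch of the shared-nothing, read-mostly pattern described on this slide, assuming each node holds a disjoint slice of the data and shares no state with the others. The function names and sample values are hypothetical and are not taken from the slides. Each node summarizes only its local rows, and a coordinator merges the small summaries, so no global lock or transaction is involved.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Summary that one node sends to the coordinator."""
    count: int = 0
    total: float = 0.0


def scan_partition(rows):
    """Runs on one node: aggregate only the rows stored locally."""
    partial = Partial()
    for value in rows:
        partial.count += 1
        partial.total += value
    return partial


def combine(partials):
    """Runs on the coordinator: merge the per-node summaries into a mean."""
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return total / count if count else None


# Hypothetical data: three nodes, each owning a disjoint slice.
partitions = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
partials = [scan_partition(rows) for rows in partitions]

mean_all = combine(partials)

# If node 1 is unreachable (or its data is too secure to expose),
# the statistic is still computable from the remaining nodes.
mean_without_node_1 = combine([partials[0], partials[2]])
```

The second call illustrates the last bullet: the result shifts slightly when one slice is missing, which is often acceptable for statistical work.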
What is needed for a new generation of cloud dbs? • Focus on making use of broad parallelism and on a shifting/expanding set of servers • Looser notion of fault tolerance, as there is often no need to restart a query when it is interrupted or when one of its branches is killed • Need to be able to operate on data in multiple formats, encryptions, attribute domains, namespaces, schemas, database products – heterogeneity! • Must be able to sit underneath business intelligence systems
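The looser notion of fault tolerance can be sketched as follows, assuming a query fans out into independent branches and that a killed branch is simply left out of the answer instead of forcing a restart. The branch names, the `fail` flag, and `tolerant_sum` are hypothetical illustrations, not an existing system's API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_branch(name, rows, fail=False):
    """One branch of the query; `fail` simulates the branch being killed."""
    if fail:
        raise RuntimeError(f"branch {name} was killed")
    return sum(rows)


def tolerant_sum(branches):
    """Sum over all branches that finish; record the ones that were lost."""
    total, answered, lost = 0, [], []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            pool.submit(run_branch, name, rows, fail): name
            for name, rows, fail in branches
        }
        for future in as_completed(futures):
            name = futures[future]
            try:
                total += future.result()
                answered.append(name)
            except RuntimeError:
                # No restart: the query carries on without this branch.
                lost.append(name)
    return total, sorted(answered), sorted(lost)


# Hypothetical branches: (name, local rows, simulate failure?)
branches = [("a", [1, 2, 3], False), ("b", [4, 5], True), ("c", [6], False)]
total, answered, lost = tolerant_sum(branches)
```

Returning the list of lost branches alongside the total lets a business intelligence layer on top decide whether the partial answer is good enough.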
Hybrid databases: is this the answer? • Folks don’t want to learn/buy/program new data management products • But folks do want commercial grade systems with professional support • Would make the transition from transaction apps to analytical apps easier – like with relational data warehousing • But would we end up with an inelegant mess?
What about Object Databases? A return? • Blending a host language with a query language makes sense when queries involve complex calculations • It is easy to extend an o-o language with statistical procedures • The encapsulation of o-o languages is a good match with the wide and independent distribution of data in a cloud environment • O-O procedures could be built and deployed by distributed volunteers
More on O-O DBs • Partial results could be maintained and kept up to date, with batch updating of raw data only infrequently • We know how to build multiple language interfaces to accommodate multiple o-o languages • O-O databases are a good match with service-based interfaces – see diagram on page 29
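A minimal sketch of an object that encapsulates a maintained partial result, assuming the statistic of interest is a running mean and variance. The class name `RunningStats` and its methods are hypothetical. The per-value update is Welford's method and the merge step is the standard pairwise combination of two summaries, so raw data can arrive in infrequent batches and independently held objects can be combined without rescanning anything.

```python
class RunningStats:
    """Encapsulates a partial result; callers never touch the raw data."""

    def __init__(self):
        self._n = 0
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the mean

    def add_batch(self, values):
        """Fold one batch of raw values into the partial result."""
        for x in values:
            self._n += 1
            delta = x - self._mean
            self._mean += delta / self._n
            self._m2 += delta * (x - self._mean)

    def merge(self, other):
        """Absorb a partial result maintained independently elsewhere."""
        if other._n == 0:
            return
        n = self._n + other._n
        delta = other._mean - self._mean
        self._m2 += other._m2 + delta * delta * self._n * other._n / n
        self._mean += delta * other._n / n
        self._n = n

    @property
    def count(self):
        return self._n

    @property
    def mean(self):
        return self._mean if self._n else None

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self._n if self._n else None


# Hypothetical usage: two sites keep their own objects, merged on demand.
site_a = RunningStats()
site_a.add_batch([2, 4, 4, 4])
site_b = RunningStats()
site_b.add_batch([5, 5, 7, 9])
site_a.merge(site_b)
```

Because only the three internal numbers cross the network during a merge, this kind of object also fits behind a service-based interface.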
Object-oriented dbs: relevant research & dev. • Adaptive query processing and optimization in real time • Parallel and distributed database technology • Massively parallel systems • Shared nothing systems • Data management stream technology
Problem: most business data right now is in a relational format • We don’t have truly massively parallel and distributed query models for relational data • We don’t have truly massively parallel and distributed data partitioning for relational data • To perform efficient and fluid analytical processing of data in the cloud, we would need to create new links quickly, but we won’t have a focused, fixed schema as we do in standard relational systems • Object extensions to relational systems don’t include method encapsulation, only expanded domains
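To make the partitioning bullet concrete, here is a minimal sketch of hash partitioning relational rows across nodes, assuming rows are plain dictionaries and the node count is fixed. The table, column names, and helper functions are hypothetical. It shows only the basic idea, not the massively parallel partitioning the slide says is still missing: with a fixed node count every row with the same key lands on the same node, but changing the node count would move most rows.

```python
import zlib
from collections import defaultdict


def node_for(key, node_count):
    """Stable hash of the key, so every process agrees on placement."""
    return zlib.crc32(str(key).encode("utf-8")) % node_count


def partition(rows, key_column, node_count):
    """Assign each row to a node according to its partitioning key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_column], node_count)].append(row)
    return nodes


# Hypothetical relational table of orders.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 20},
    {"order_id": 2, "customer_id": "c2", "amount": 35},
    {"order_id": 3, "customer_id": "c1", "amount": 15},
    {"order_id": 4, "customer_id": "c3", "amount": 50},
]
orders_by_node = partition(orders, "customer_id", node_count=3)

# Which nodes hold rows for each customer; each set has exactly one node,
# so a join or group-by on customer_id needs no cross-node traffic.
nodes_per_customer = defaultdict(set)
for node, rows in orders_by_node.items():
    for row in rows:
        nodes_per_customer[row["customer_id"]].add(node)
```

`zlib.crc32` is used instead of Python's built-in `hash` because string hashing is randomized per process, which would make placement disagree between machines.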
More cloud issues: centralized control? • Is the cloud trusted or anonymous? • Trusted, provider-specific commercial cloud solutions are much safer, centrally managed, and optimized as a single network, not as a mesh of networks • In many environments, even trusted, centralized environments, many machines are not properly managed and are controlled by immediate users • People don’t like their machines being co-opted, and so trust is not enough to guarantee dependability
More on the cloud: Other applications? • Is analytical processing the only likely application? • There are many data sharing applications • There are many applications for selling access to bulk data • Data mining is a more focused form of analytical processing, but demands a very precise level of heterogeneity resolution and integration in the case of most medical and financial applications (and others)
Data mining • Kinds of data (from Data Mining by Han and Kamber) • Relational dbs • Data warehouses • Transaction processing systems • Object-relational dbs • Time sequence and temporal dbs • Spatial dbs • Text dbs • Multimedia dbs • Legacy dbs • Data streams • The Web…
Heterogeneity in databases: data mining implications • Note how broad the “Web” is on the previous slide • Includes countless hand-rolled dbs • Includes databases hidden by web development frameworks like Ruby on Rails • Includes data accessible only via specific APIs • Includes data accessible via XML and XPath, XQuery technology • Includes data stored in proprietary databases for applications like CAD, finance, animation, geography • The heterogeneity problem will only be solved by widespread collaboration on unifying standards
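A small sketch of what heterogeneity resolution involves in practice, assuming three sources expose the same kind of record as CSV, JSON, and XML with different field names and units. All field names, sample values, and adapter functions are hypothetical. Each adapter maps its source into one shared record shape before any mining takes place; the XML adapter uses the limited XPath subset supported by Python's standard library.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms


def from_csv(text):
    """Source 1: CSV with a `patient_id` column, weight already in kg."""
    return [
        {"id": row["patient_id"], "weight_kg": float(row["weight_kg"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_json(text):
    """Source 2: JSON API that reports weight in pounds."""
    return [
        {"id": str(item["id"]), "weight_kg": round(item["weight_lb"] * LB_TO_KG, 2)}
        for item in json.loads(text)
    ]


def from_xml(text):
    """Source 3: XML, navigated with an XPath-style expression."""
    root = ET.fromstring(text)
    return [
        {"id": node.get("id"), "weight_kg": float(node.findtext("weight"))}
        for node in root.findall("./patient")
    ]


# Hypothetical payloads from the three sources.
CSV_SOURCE = "patient_id,weight_kg\np1,70.5\n"
JSON_SOURCE = '[{"id": "p2", "weight_lb": 150}]'
XML_SOURCE = '<patients><patient id="p3"><weight>82</weight></patient></patients>'

records = from_csv(CSV_SOURCE) + from_json(JSON_SOURCE) + from_xml(XML_SOURCE)
```

Every new source needs its own hand-written adapter here, which is the cost that shared standards would remove.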
More on the Cloud: the future of transaction processing? • Will the rigidly centralized notion of OLTP survive? • Corporations are adapting to the cloud incrementally and using middleware to leverage their own clouds • With global business comes global data processing across time zones, which is often managed in a widely distributed fashion • There are large corporations that handle financial and retail transactions for other companies • Are people warming to the idea of managing their personal and small business data in the cloud, including document and other services?
But the cloud is process-centric and not data-centric • Is the process vs. data centric issue about to reawaken? • The process folks kind of lost… • Data is seen more and more as a valuable resource, even if it is only “sold” indirectly • More of us are buying multimedia data • There are actually 3 models: process-centric, data-centric, and encapsulated • Some argue that the cloud is actually an encapsulated model and that, in fact, data movement is difficult to optimize due to the dynamic nature of the network • Object-oriented databases…?