1. Computing in the Clouds Aaron Weiss
3. Cloud Computing The next big thing
But what is it?
Web-based applications (thin client)
Utility computing – grid that charges rates for processing time
Distributed or parallel computing designed to scale for efficiency
Also called:
On-demand computing, software as a service, Internet as platform
4. Data Centers Decades ago – computing power in mainframes in computer rooms
Personal computers changed that
Now, network data centers with centralized computing are back in vogue
But: no longer a hub-and-spoke
Although Google is famous for innovating web searching, Google’s architecture is as much a revolution
Instead of few expensive servers, use many cheap servers (1/2M servers in ~ 12 locations)
5. With thin, wide network
Derive more from scale of the whole than any one part – no hub
Cloud – robust and self-healing
Uses too much power
Cheaper power solutions we’ve talked about earlier in class
Heavy utilization of virtualization
Single server multiple OS instances, minimize CPU idle time
6. CloudOS (VMware and Cisco)
Instead of each server running own copy of OS (Current Google model)
Should have single OS that treats everything in data center as another resource
Network channels to coordinate events
Cloud more cohesive entity
7. Entire user interface resides in single window
Provide all facilities of OS inside a browser
Program must continue running even as number of users grows
Communication model is many-to-many
8. To move applications to cloud must master multiple languages and operating environments
In many cloud applications, back-end process relies on relational DB so part of code in SQL
Client side in JavaScript or embedded within HTML documents
Server application in between written in scripting language
9. Distributed Computing Speed of cloud depends on delegation
Break up into subtasks
Retrieving results of search
DB query – parse results, construct result sets, format results, etc.
If tasks small enough, simultaneous
Dependencies? Complex
Distributed computing not new
SETI@home, Folding@home
Hadoop – Apache Foundation
No need for creating specialized custom software
Distributes petabyte-scale data projects across 1000s of nodes
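A minimal sketch of the "break up into subtasks" idea from this slide, assuming the search data is already split into independent shards. The names `search_shard` and `distributed_search` are hypothetical, and a local thread pool stands in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, term):
    """Subtask: scan one shard of documents for the term."""
    return [doc for doc in shard if term in doc.lower()]

def distributed_search(shards, term, workers=4):
    """Send the query to every shard at once, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda s: search_shard(s, term), shards)
        # The subtasks have no dependencies, so merging is plain concatenation
        # in shard order.
        return [doc for part in partials for doc in part]

# Hypothetical data: two shards of document titles.
shards = [
    ["Cloud computing basics", "Mainframe history"],
    ["Thin clients", "Utility cloud pricing"],
]
results = distributed_search(shards, "cloud")
```

Because each shard is scanned independently, the work can run simultaneously. Subtasks that depend on each other's output would need extra coordination, which is the complexity the slide warns about.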
10. A Utility Grid In past, pay for cost of cycles used
Today most organizations create own data centers
But cost to run
Use 99% of capacity only 10% of time
In Web service, lots of hosting providers
Typically do not replicate distributed computing
11. Amazon, Google, etc. should scale up data centers, create business models to support third party use
Amazon EC2 fee based public use 10/07
Customers create virtual image of SW environment
Create instance of machine in Amazon’s cloud
Appears to user as dedicated server
Customers choose configuration
Customers can create/destroy at will
If surge in visitors, additional instances on demand
If slows down, terminate extra instances
Charges $0.10 per instance hour based on compute units regardless of underlying hardware
Data cost $0.10 to $0.18 per Gig
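A rough sketch of the pricing arithmetic behind creating and terminating instances on demand, using the $0.10 per instance hour rate from this slide. The demand profile and the per-instance capacity are made-up assumptions, and the function names are hypothetical. Data transfer charges are ignored.

```python
import math

RATE_CENTS_PER_INSTANCE_HOUR = 10  # $0.10, the rate quoted on the slide

def instances_needed(requests_per_hour, capacity_per_instance):
    """Smallest number of instances that covers the load (at least one)."""
    return max(1, math.ceil(requests_per_hour / capacity_per_instance))

def on_demand_cost_cents(hourly_demand, capacity_per_instance):
    """Pay only for the instances running in each hour."""
    instance_hours = sum(
        instances_needed(d, capacity_per_instance) for d in hourly_demand
    )
    return instance_hours * RATE_CENTS_PER_INSTANCE_HOUR

def fixed_peak_cost_cents(hourly_demand, capacity_per_instance):
    """Keep enough instances for the peak running all the time."""
    peak = max(instances_needed(d, capacity_per_instance) for d in hourly_demand)
    return peak * len(hourly_demand) * RATE_CENTS_PER_INSTANCE_HOUR

# Assumed day: quiet for 22 hours, then a two-hour surge in visitors.
demand = [100] * 22 + [1000] * 2
elastic_cents = on_demand_cost_cents(demand, capacity_per_instance=100)
fixed_cents = fixed_peak_cost_cents(demand, capacity_per_instance=100)
```

Under these assumptions the elastic setup uses 42 instance hours ($4.20) while provisioning for the peak uses 240 ($24.00). The gap is the same effect as owning a data center that is near full capacity only a small fraction of the time.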
12. Google and IBM apply similar cloud utility model to CS education
Provide CS students access to distributed computing environment
In future businesses will not need to invest in a data center
13. Software as Service Move all processing power to the cloud and carry ultralight input device
Already happening?
E-mail on Internet, then Web
Google Docs
Implications for Microsoft, software as purchasable local application
Windows Live (Microsoft’s cloud)
Adobe web-based Photoshop
14. Cloud
Paradigm shift and disruptive force
Google and Apple will pair
Lightweight mobile device by Apple tapping into Google’s cloud
But
Failed thin clients of past
Larry Ellison in the 90s had trouble creating cost-effective thin clients
Difficult to produce powerless thin client at low enough cost
Yet, Non-thin-clients can fail, SW needs care
15. Networks will need to be robust
In U.S. broadband quality poor
Broadband advances slow, bottleneck for clouds
Privacy ???
What if 3rd party has your data and government subpoenas them? Do you even know?
Can you lose access to your info if you don’t pay bill?
Vendor lock-in – need certain client to access cloud operator
Not open like the Internet today
16. Partly Cloudy New name, same familiar computing models?
New because integrates models of centralized computing, utility computing, distributed computing and software as service
Power shifts from processing unit to network
Processors commodities
Network connects all
17. Cloud computing leaving relational databases behind Joab Jackson, 9/08
Government Computer News
18. “One thing you won’t find underlying a cloud initiative is a relational database.
And this is no accident: Relational databases are ill-suited for use within cloud computing environments”
Geir Magnusson, VP 10Gen, on-demand platform service provider
19. DBs specifically designed to work in cloud computing
Google – BigTable
Amazon – SimpleDB
10Gen – Mongo
AppJet – AppJetDB
Oracle open-source - Berkeley DB
MySQL for Web - Drizzle
20. Characteristics of Cloud DBs Run in distributed environments
None are transactional in nature
Sacrifice advanced querying capability for faster performance
Queried using object calls instead of SQL
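To illustrate "queried using object calls instead of SQL", here is a toy in-memory store. It is only a sketch: the class and method names (`KeyValueStore`, `put`, `get`, `find`) are invented for illustration and are not the API of BigTable, SimpleDB, Mongo or any other product listed above.

```python
class KeyValueStore:
    """Toy store: items are schemaless attribute bags looked up by key."""

    def __init__(self):
        self._items = {}

    def put(self, key, **attributes):
        # Create the item if needed, then merge in the new attributes.
        self._items.setdefault(key, {}).update(attributes)

    def get(self, key):
        # Return a copy so callers cannot mutate stored state.
        return dict(self._items.get(key, {}))

    def find(self, **criteria):
        # Equality filter only: no joins, no aggregates, no transactions.
        return sorted(
            key for key, item in self._items.items()
            if all(item.get(attr) == value for attr, value in criteria.items())
        )

store = KeyValueStore()
store.put("user:1", name="Ada", city="London")
store.put("user:2", name="Grace", city="New York")
store.put("user:3", name="Alan", city="London")
london_users = store.find(city="London")
```

The narrow interface reflects the trade the slide describes: giving up advanced querying keeps each call simple enough to serve quickly.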
21. Very large relational DBs like Oracle are implemented in data centers
DB material spread across different locations
Executing complex queries over vast locations can slow response time
Difficult to design and maintain an architecture to replicate data
Instead: Data targeted in a clustered fashion
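One possible reading of "data targeted in a clustered fashion" is that each key is routed to a single node, so a lookup touches one location rather than many. The sketch below assumes simple hash partitioning. The node names and helper functions are hypothetical, and real systems may use different placement schemes.

```python
import hashlib

def node_for(key, nodes):
    """Pick the node that owns a key using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def partition(keys, nodes):
    """Group keys by owning node."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement

# Assumed cluster of three nodes and a handful of keys.
nodes = ["node-a", "node-b", "node-c"]
keys = ["user:1", "user:2", "user:3", "order:17", "order:42"]
placement = partition(keys, nodes)
```

A read for one key then goes straight to `node_for(key, nodes)`, avoiding a complex query that spans locations.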
22. The Claremont Report on Database Research SIGMOD 2008
23. What is it? May, 2008 prominent DB researchers, architects, users, pundits met in Berkeley, CA at Claremont Resort
Seventh meeting in 20 years
Report based on discussion of new directions in DBs
24. Turning point in DB Research New opportunities for technical advances, impact on society, etc.
1. Big Data
not only traditional enterprises, but also e-science, digital entertainment, natural language processing, social network analysis
Design new custom data management
solutions from simpler components
25. 2. Data analysis as profit center
Barriers between IT dept. and business units dropping
Data is the business
Data capture, integration, etc. keys to efficiency and profit
BI vendors - $10B (only front-end)
Also need better analytics, sophisticated analysis
non-technical decision makers want data
26. 3. Ubiquity of structured and unstructured data
Structured data – extracted from text, SW logs, sensors and deep web crawl
Semi-structured – blogs, Web 2.0 communities, instant messaging
Publish and curate structured data
Develop techniques to extract useful data, enable deeper explorations, connect datasets
27. 4. Expanded developer demands
Adoption of relational DBMS and query languages has grown
MySQL, PostgreSQL, Ruby on Rails
Less interest in SQL, view DBMS as too much to learn relative to other open source components
Need new programming models for Data management
28. 5. Architectural Shifts in computing
Computing substrates for DM are shifting
Macro: Rise of cloud computing
Democratizes access to parallel clusters
Micro: shift from increasing chip clock speed to increasing number of cores, threads
Changes in memory hierarchy
Power consumption
New DM technologies
29. Research Opportunities Impact of DB research has not evolved beyond traditional DBs
Reformation
Reform data centric ideas for new applications and architectures
Synthesis
Data integration, information extraction, data privacy
Some topics not mentioned, because still part of significant effort
Must continue with these efforts
Also must continue with
Uncertain data, data privacy and security, e-science, human-centric interactions, social networks, etc.
30. DB Engines Big-market relational DBs have well-known limitations
Peak performance:
OLTP with lots of small, concurrent transactions debit/credit workloads
OLAP with few read-mostly, large join and aggregation queries
Bad for:
Text indexing, serving web pages, media delivery
31. DB engine technology could be useful in sciences and Web 2.0 applications, but not in current bundled DB systems
Petabytes of storage and 1000s processors, but current DB cannot scale
Need schema evolution, versioning, etc
Currently, many DB engine startup companies
32. 1. Broaden range for multi-purpose DBs
2. Design special purpose DBs
Topics in DB engine area:
Systems for clusters of many processors
Exploit remote RAM and Flash as persistent media
Treat query optimization and data layout as a continuous task
Compress and encrypt data integrated with data layout and optimization
Embrace non-relational DB models
Trade off consistency/availability for performance
Design power-aware DBMS
33. Declarative programming for emerging platforms
Programmer productivity is important
Non-expert must be able to write robust code
Data Centric programming techniques
Map reduce – language and data parallelism
Declarative languages – Datalog
Enterprise application programming – Ruby on Rails, LINQ
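The map-reduce style mentioned on this slide can be sketched in a few lines. This is a single-process illustration of the programming model only, with hypothetical function names. A framework such as Hadoop would run the map and reduce calls in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["the cloud", "the grid and the cloud"])
```

The programmer writes only `map_phase` and `reduce_phase`. Each call sees one document or one key, so the data parallelism comes from the model rather than from hand-written coordination code.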
34. New challenges – programming across multiple machines
Data independence valuable, no assumptions about where data stored
XQuery for declarative programming?
Also need language design, efficient compilers, optimize code across parallel processors and vertical distribution of tiers
Need more expressive languages
Attractive syntax, development tools, etc
Data management – not only storage service, but programming paradigm
35. Interplay of Structured and Unstructured Data Data behind forms – Deep Web
Data items in HTML
Data in Web 2.0 services (photo, video sites)
Transition from traditional DBs to managing structured, semi-structured and unstructured data in enterprises and on the web
Challenge of managing dataspaces
36. On the web
Vertical search engines
Domain independent technology for crawling
Within the enterprise
Discover relationships between structured and unstructured data
37. Extract structure and meaning from un- and semi-structured data
Information extraction technology – pull entities and relationships from unstructured text
Need: apply and manage predictions from independent extractors
Algorithms to determine correctness of extraction
Join with IR and ML communities
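A deliberately naive sketch of pulling entities and relationships out of unstructured text, using one hand-written pattern. The pattern, the relation names and the sample sentence (with made-up company names) are all assumptions. Production extractors rely on far richer IR and ML techniques.

```python
import re

# Matches "<Capitalized name> <relation phrase> <Capitalized name>".
RELATION_PATTERN = re.compile(
    r"\b([A-Z][a-zA-Z]+) (acquired|partnered with) ([A-Z][a-zA-Z]+)\b"
)

def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    return [
        (m.group(1), m.group(2), m.group(3))
        for m in RELATION_PATTERN.finditer(text)
    ]

text = "Yesterday Acme acquired Globex. Later Initech partnered with Hooli."
relations = extract_relations(text)
```

A single pattern like this will miss many phrasings and can produce wrong triples. That is why the slide calls for ways to manage predictions from several independent extractors and to judge their correctness.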
38. Better DB technology needed to manage data in context
Discover implicit relationships, maintain context through storage and computation
Query and derive insight from heterogeneous data
Answer keyword queries over heterogeneous data sources
Analysis to extract semantics
Cannot assume have semantic mappings or domain is known
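As a small illustration of answering keyword queries over heterogeneous sources, the sketch below indexes a structured row and free-text documents in the same inverted index, without assuming any schema mapping between them. The item identifiers, sample data and function names are hypothetical.

```python
from collections import defaultdict
import re

def tokens(value):
    """Lowercased alphanumeric tokens of any value."""
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))

def build_index(items):
    """items: (item_id, content) pairs; content is a dict row or free text."""
    index = defaultdict(set)
    for item_id, content in items:
        values = content.values() if isinstance(content, dict) else [content]
        for value in values:
            for tok in tokens(value):
                index[tok].add(item_id)
    return index

def keyword_query(index, query):
    """Return ids of items containing every keyword in the query."""
    postings = [index.get(tok, set()) for tok in tokens(query)]
    return sorted(set.intersection(*postings)) if postings else []

items = [
    ("row:7", {"name": "Claremont Resort", "city": "Berkeley"}),
    ("doc:2", "Researchers met in Berkeley to discuss databases."),
    ("doc:3", "Notes on cloud pricing."),
]
index = build_index(items)
hits = keyword_query(index, "berkeley")
```

The query returns both the table row and the text document. It is a best-effort answer that treats all sources uniformly and knows nothing about the domain.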
39. Develop algorithms to provide best-effort services on loosely integrated data
Pay as you go as semantic relationships discovered
Develop index structures to support querying hybrid data
New notions of correctness and consistency
40. Innovate on creating data collections
Ad-hoc communities to collaborate
Schema will be dynamic
Consensus to guide users
Need visualization tools to create data that are easy to use
Result of tools may be easier to extract info
41. Cloud Data Services Infrastructures providing software and computing facilities as a service
Efficient for applications
Limit up-front capital expenses
reduce cost of ownership over time
Services hosted in a data center
Shared commodity hardware for computation and storage
42. Cloud services available today Application services (salesforce.com)
Storage services (Amazon S3)
Compute services (Google App Engine, Amazon EC2)
Data services (Amazon SimpleDB, SQL Server Data Services, Google’s Datastore)
43. Cloud data services offer API more restricted than traditional DBs
Minimalist query languages, limited consistency
More predictable services
Difficult if had to provide full-function SQL data service
Manageability important in cloud environments
Limited human intervention
High workloads
Variety of shared infrastructures
44. No DBA or system admin
Automatically by platform
Large variations in workloads
Economical to use more resources for short bursts
Service tuning depends upon virtualization
HW virtual machines as programming interface (EC2)
Multi-tenant hosting many independent schemas in single managed DBMS (salesforce.com)
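One simple way to host several tenants in a single managed DBMS is a shared table keyed by a tenant identifier. The sketch below uses Python's built-in `sqlite3` module purely for illustration. The table layout, tenant names and the `accounts_for` helper are assumptions, and nothing here describes how salesforce.com actually implements multi-tenancy.

```python
import sqlite3

# One database instance shared by all tenants.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (tenant_id TEXT NOT NULL, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts (tenant_id, name) VALUES (?, ?)",
    [("acme", "Alice"), ("acme", "Bob"), ("globex", "Carol")],
)

def accounts_for(tenant_id):
    """Every query is scoped to one tenant, so tenants never see each other's rows."""
    rows = conn.execute(
        "SELECT name FROM accounts WHERE tenant_id = ? ORDER BY name",
        (tenant_id,),
    )
    return [name for (name,) in rows]

acme_accounts = accounts_for("acme")
```

The platform manages one DBMS instead of one per customer. In return, every access path has to enforce the tenant filter.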
45. Need for manageability
Adaptive online techniques
New architectures and APIs
Depart from SQL and transaction semantics when possible
SQL DBs cannot scale to thousands of nodes
Different transactional implementation techniques or different storage semantics?
46. Query processing and optimization
Cannot exhaustively search plan space if 1000s of sites
More work needed to understand scaling realities
Data security and privacy
No longer physical boundaries of machines or networks
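Returning to the query optimization point on this slide, a back-of-the-envelope count shows why exhaustive plan search breaks down at scale. The model is a simplifying assumption: only left-deep join orders are counted, and each join operator may be placed on any site.

```python
import math

def left_deep_join_orders(num_tables):
    """Number of left-deep join orders for the given number of tables."""
    return math.factorial(num_tables)

def site_assignments(num_operators, num_sites):
    """Ways to place each operator on one of the available sites."""
    return num_sites ** num_operators

# Assumed query: 5 tables, 5 operators to place, 1000 candidate sites.
plans = left_deep_join_orders(5) * site_assignments(5, 1000)
```

Even this small query yields 120 orderings times 10^15 placements, so an optimizer has to prune or sample the plan space rather than enumerate it.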
47. New scenarios
Specialized services with pre-loaded data sets (stock prices, weather)
Combine data from private and public domains
Reaching across clouds (scientific grids)
Federated cloud architectures
48. Mobile applications and virtual worlds Manage massive amounts of diverse user-created data, synthesize intelligently and provide real-time services
Mobile space
Large user bases
Emergence of mobile search and social networks
Timely information to users depending on locations, preferences, social circles, extraneous factors and context in which they operate
Synthesize user input and behavior to determine location and intent
49. Virtual worlds – Second Life
Began as simulations for multiple users
Blur distinction with real-world
Co-space, for both virtual and physical worlds
Events in physical captured by sensors, materialized in virtual
Events in virtual can affect physical
Need to process heterogeneous data streams
Balance privacy against sharing personal real-time info
Virtual actors require large-scale parallel programs
Efficient storage, data processing, power sensitive
50. Moving Forward DB research community doubled in size in the last decade
Increasing technical scope makes it difficult to keep track of field
Review load for papers growing
Quality of reviews decreasing over time
Need more technical books, blogs, wikis
Open source software development in DB
Competition: system components for cloud computing
Large-scale information extraction