6. Big Data and Cloud Computing http://en.wikipedia.org/wiki/Big_data
What is big data? Big data is data that cannot be analyzed using a traditional relational database – there is so much of it! Companies that develop the database platforms to analyze big data will make (are making) a fortune! Big data is the next technology problem looking for a solution!
Using a Traditional Database (RDBMS)
• Storing large data in a traditional database: it is easier to get the data in than out.
• Most RDBMSs are designed for efficient transaction processing: adding, updating, searching for, and retrieving small amounts of information.
• Data is acquired in a transactional fashion.
Then, what is the problem with “Big Data”?
The trouble comes …
• Managing massive amounts of accumulated data, collected over months or years
• Learning something from the data
• Naturally, we want the answer in seconds or minutes
Primarily, it is about the analysis of large data sets.
Big Data Cloud
• Source Data
  • Log Files
    • Event Logs / Operating System (OS)-Level
    • Appliance / Peripherals
    • Analyzers / Sniffers
  • Multimedia
    • Image Logs
    • Video Logs
  • Web Content Management (WCM)
    • Web Logs
    • Search Engine Optimization (SEO)
    • Web Metadata
Data in the cloud
• Storing the data
  • Google BigTable, Amazon S3, NoSQL (Cassandra, MongoDB), etc.
• Processing the data
  • MapReduce (Hadoop), Mahout, etc.
Big Data Cloud
• Cloud-Based Big Data Solutions
  • DBaaS
    • Amazon Web Services (AWS): DynamoDB, SimpleDB, Relational Database Service (RDS): Oracle 11g / MySQL
    • Google App Engine: Datastore
    • Microsoft SQL Azure
  • Processing
    • AWS Elastic MapReduce (EMR)
    • Google App Engine MapReduce: Mapper API
    • Microsoft: Apache Hadoop for Azure
Cloud File Systems
Traditional distributed file systems (DFSs) need modifications. As in traditional DFSs, we need:
- performance
- scalability
- reliability
- availability
Differences:
- Component failures are the norm (large number of commodity machines).
- Files are huge (>>100 GB).
- Appending new data at the end of files is better than overwriting existing data.
- High, sustained bandwidth is more important than low latency.
Examples of Cloud DFSs: Google File System (GFS), Amazon Simple Storage Service (S3), Hadoop Distributed File System (HDFS).
Storage as a Service • Many companies already provide access to massive data sets as a service (e.g. Amazon, Google) • Provide access to raw storage as a service • Advantages: • Already know how to manage storage clusters • More reliable than personal storage • Available anywhere • Disadvantages: • Security?
The Cloud Scales: Amazon S3 Growth
S3 = Simple Storage Service
Overview
As the Internet reaches every corner of the world and technology keeps advancing, the amount of digital data generated by the web and digital devices grows exponentially. According to estimates by The Economist, a total of 1,200 exabytes of data were generated in 2010, and 7,900 exabytes are predicted by 2015 (an exabyte is equal to one billion gigabytes). The amount of data, the speed at which it is generated, and the variety of data formats raise new challenges, not only in technology, but also in all other fields where data is a critical resource for making decisions and predictions.
• As defined by McKinsey Global Institute, “Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze” and it “is the next frontier for innovation, competition, and productivity.”
• In March 2012, the US government announced a Big Data Initiative with $200 million in new funding to support research in improving “the ability to extract knowledge and insights from large and complex collections of digital data.”
• Big Data has moved to centre stage for advancing research, technology, and productivity, not only in mathematics, statistics, natural sciences, computer science, technology, and business, but also in humanities, medical studies and social sciences.
Big Data - What is it? The term big data refers to collections of data sets so large and complex that it becomes difficult to process them using traditional database management tools or data processing applications. Big data is difficult to work with using most conventional relational database management systems, and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers. http://www.zdnet.com/blog/virtualization/what-is-big-data/1708 http://queue.acm.org/detail.cfm?id=1563874
Some Trends in Computing
• The Data Deluge is a clear trend in commercial (Amazon, e-commerce), community (Facebook, Search) and scientific applications.
• Lightweight clients: smartphones and tablets with sensors.
• Multicore processors are reawakening parallel computing.
• Clouds offer cheaper, greener, easier-to-use IT for (some) applications.
Internet of Things and the Cloud
• It is projected that there will be 24 billion devices on the Internet by 2020.
- Most will be small sensors that send streams of information into the cloud, where they will be processed, integrated with other streams, and turned into knowledge that will help our lives in a multitude of small and big ways. At least, that is the hype!
• The cloud will become increasingly important as a controller of, and resource provider for, the Internet of Things – the use of computers to enrich every aspect of everyday life.
Our Data-driven World
• Science: databases from astronomy, genomics, environmental data, transportation data, …
• Humanities and Social Sciences: scanned books, historical documents, social interaction data, …
• Business and Commerce: corporate sales, stock market transactions, census, airline traffic, …
• Entertainment: Internet images, Hollywood movies, MP3 files, …
• Medicine: MRI and CT scans, patient records, …
Data-rich World Data capture and collection: - Highly instrumented environment - Sensors and Smart Devices - Networks Data storage: - Seagate 1 TB Barracuda @ $68.74 from Amazon.com
Cloud Computing Modalities “Can we outsource our IT software and hardware infrastructure?” “We have terabytes of click-stream data – what can we do with it?” • Hosted Applications and services • Pay-as-you-go model • Scalability, fault-tolerance, elasticity, and self-manageability • Very large data repositories • Complex analysis • Distributed and parallel data processing
Data in the Cloud: Platforms for Data Analysis
Data Warehousing, Data Analytics and Decision-Support Systems
- Used to manage and control the business.
- Transactional data: historical or point-in-time.
- Optimized for inquiry rather than update.
- Use of the system is loosely defined and can be ad hoc.
- Used by managers and analysts to understand the business and make judgments.
Data Analytics in the Web Context Now, data capture at the user interaction level - In contrast to the client transaction level in the Enterprise context As a consequence the amount of data increases significantly Greater need to analyze such data to understand user behaviours
Data Analytics in the Cloud Scalability to large data volumes: - Scan 100 TB on 1 node @ 50 MB/sec = 23 days - Scan on 1000-node cluster = 33 minutes Divide-and-Conquer (i.e. data partitioning - sharding) Cost-efficiency: - Commodity nodes (cheap, but unreliable) - Commodity network - Automatic fault-tolerance (fewer administrators) - Easy to use (fewer programmers)
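The scan-time figures above can be checked with a short calculation. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), perfectly even partitioning, and no coordination overhead; the function name `scan_seconds` is illustrative and not part of the original slides.

```python
# Back-of-the-envelope scan times for the figures on the slide.
# Assumptions: decimal units, perfectly even data partitioning,
# and zero coordination or network overhead between nodes.

def scan_seconds(data_bytes: float, mb_per_sec_per_node: float, nodes: int = 1) -> float:
    """Seconds to scan data_bytes when every node reads its share in parallel."""
    bytes_per_node = data_bytes / nodes
    return bytes_per_node / (mb_per_sec_per_node * 10**6)

DATA = 100 * 10**12  # 100 TB

single = scan_seconds(DATA, 50)                # one node at 50 MB/sec
cluster = scan_seconds(DATA, 50, nodes=1000)   # 1000-node cluster

print(f"1 node:     {single / 86_400:.1f} days")     # 1 node:     23.1 days
print(f"1000 nodes: {cluster / 60:.1f} minutes")     # 1000 nodes: 33.3 minutes
```

In practice a real cluster would fall short of this ideal speed-up because of stragglers, skewed partitions, and failures, which is why automatic fault-tolerance appears on the same slide.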
Limits to Computation
• Processor cycles are cheap and getting cheaper
• What limits application of infinite cores?
1. Data: inability to get data to processor when needed
2. Power: cost rising and will dominate
• Attributes that need most innovation
– Infinite cores require infinite power
– Getting data to processors in time to use next cycle
• Caches, multi-threading, …
– All techniques consume power
• More memory lanes drive bandwidth, but more pins cost power
• Power and data movement remain key constraints
Platforms for Large-scale Data Analysis Parallel DBMS technologies - Proposed in the late eighties - Matured over the last two decades Multi-billion dollar industry: Proprietary DBMS engines intended as Data Warehousing solutions for very large enterprises Map Reduce - Pioneered by Google – the theory - Popularized by Yahoo! (via Hadoop implementation)
Parallel DBMS technologies • Popularly used for more than two decades • Research Projects • Commercial: Multi-billion dollar industry but access to only a privileged few • Relational Data Model • Indexing • Familiar SQL interface • Advanced query optimization • Well understood and well studied
MapReduce • Overview: • Data-parallel programming model • An associated parallel and distributed implementation for commodity clusters • Pioneered by Google • Processes 20 PB of data per day • Popularized by open-source Hadoop project • Used by Yahoo!, Facebook, Amazon, and the list is growing …
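The data-parallel programming model can be illustrated with a word count, the usual introductory example. This is a minimal single-process sketch, not the Google or Hadoop implementation: the names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical, and the grouping step that a real cluster performs across machines (the shuffle) is simulated here with an in-memory dictionary.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run a map phase, group values by key, then run a reduce phase."""
    grouped = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            grouped[key].append(value)  # stands in for the distributed shuffle
    # Reduce: each key's list of values is collapsed to a single result.
    return {key: reducer(key, values) for key, values in grouped.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)

lines = ["big data in the cloud", "data in the cloud scales"]
counts = map_reduce(lines, word_mapper, count_reducer)
print(counts["data"], counts["cloud"], counts["big"])  # 2 2 1
```

Because the mapper sees one record at a time and the reducer sees one key at a time, both phases can in principle be spread over many commodity machines without changing the user-written functions.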
Defining big data Big data refers to any data that cannot be analyzed by a traditional database due to three typical characteristics: - High volume, high velocity and high variety. High volume: big data’s sheer volume slows down traditional database racks. High velocity: big data often streams in at high speed and can be time-sensitive. High variety: big data tends to be a mix of several data types, typically with an element of unstructured data (e.g. video), which is difficult to analyze.
As the big data industry evolves, four trends are emerging.
1. Unstructured data: Data is moving from structured to unstructured format, raising the costs of analysis. This creates a highly lucrative market for analytical search engines that can interpret this unstructured data.
2. Open source: Proprietary database standards are giving way to new, open source big data technology platforms such as Hadoop. This means that barriers to entry may remain low for some time.
3. Cloud: Many corporations are opting to use cloud services to access big data analytical tools instead of building expensive data warehouses themselves.
- This implies that most of the money in big data will be made from selling hybrid cloud-based services rather than selling big databases.
4. M2M: In the future, a growing proportion of big data will be generated by machine-to-machine (M2M) communication using sensors.
- M2M data, much of which is business-critical and time-sensitive, could give telecom operators a way to profit from the big data boom.
Today, 90% of data warehouses hold less than 5 terabytes of data. Yet Twitter alone produces over 7 terabytes of data every day! As a result of this data deluge, the database industry is going through a significant transformation.
The first businesses that had to deal with big data were the leading Internet companies such as Google, Yahoo and Amazon. Google and Yahoo, for example, run search engines which have to gather unstructured data – like web pages – and process them within milliseconds to produce search rankings. Worse, they have to deal with millions of concurrent users all submitting different search queries at once. So Google and Yahoo developers designed entirely new database platforms to deal with this type of unstructured query at lightning speed.
They built everything themselves, from the physical infrastructure to the storage and processing layers. • Their technique was to scale out horizontally (rather than vertically), adding more nodes to the database network. • Horizontal scaling out involves breaking down large databases and distributing them across multiple servers. • These innovations resulted in the first “distributed databases” and provided the foundation for two of today’s most advanced database technology standards, commonly referred to as NoSQL and Hadoop.
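Horizontal scaling as described above, breaking a large database into pieces and spreading them across servers, might be sketched as follows. This is an illustrative toy under stated assumptions, not how Google or Yahoo built their systems: `shard_for` and `ShardedStore` are hypothetical names, each "server" is just an in-memory dictionary, and keys are placed by a simple hash modulo the number of shards.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (assumed placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ShardedStore:
    """Toy key-value store whose data is split across several 'servers'."""

    def __init__(self, num_shards: int):
        # Each dict stands in for one database node.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(num_shards=4)
for i in range(1000):
    store.put(f"user:{i}", i)

print(store.get("user:42"))             # 42
print([len(s) for s in store.shards])   # records held by each shard
```

With plain modulo placement, changing the number of shards relocates most keys, so systems that add nodes frequently tend to use schemes such as consistent hashing instead; that refinement is outside what the slide covers.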