Why Database
Music Collection • What information would you like to store about your collection?
List of information for song collection • Song Name • Song Artist • Length of Song • Album • Album Artist • Genre • Rating • Number of plays • Year Released • Album Artwork • Size of File • Kind of File • Bit Rate • Sample Rate • Date Modified • Last Played • Location of File • Track Number • Number of Tracks on Album
Themes • Song Information • Song Name • Length of Song • Genre • Rating • Number of plays • File Information • Size of File • Kind of File • Bit Rate • Sample Rate • Date Modified • Last Played • Location • Album Information • Album Name • Year Released • Artwork • Artist Information • Artist Name
Connecting • Song Information • Song Name • Length of Song • Genre • Rating • Number of plays • Album ID • Artist ID • File ID • Album Information • Album ID • Album Name • Year Released • Artwork • File Information • File ID • Size of File • Kind of File • Bit Rate • Sample Rate • Date Modified • Last Played • Location • Artist Information • Artist ID • Artist Name
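The "Connecting" slide can be sketched as four tables, with the song table holding the IDs that point at the other three. This is only an illustration using Python's built-in sqlite3 module: the table names, column names, the extra song_id key, and the sample row are all assumptions made for the example, not part of the slides.

```python
import sqlite3

# Hypothetical schema adapted from the slide's four groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, artist_name TEXT);
CREATE TABLE album (album_id INTEGER PRIMARY KEY, album_name TEXT,
                    year_released INTEGER, artwork BLOB);
CREATE TABLE song_file (file_id INTEGER PRIMARY KEY, size_bytes INTEGER,
                        kind TEXT, bit_rate INTEGER, sample_rate INTEGER,
                        date_modified TEXT, last_played TEXT, location TEXT);
CREATE TABLE song (song_id INTEGER PRIMARY KEY, song_name TEXT,
                   length_seconds INTEGER, genre TEXT, rating INTEGER,
                   plays INTEGER,
                   album_id INTEGER REFERENCES album(album_id),
                   artist_id INTEGER REFERENCES artist(artist_id),
                   file_id INTEGER REFERENCES song_file(file_id));
""")

# Made-up sample data: one artist, one album, one file, one song.
conn.execute("INSERT INTO artist VALUES (1, 'Example Artist')")
conn.execute("INSERT INTO album VALUES (1, 'Example Album', 2012, NULL)")
conn.execute(
    "INSERT INTO song_file VALUES "
    "(1, 4200000, 'MP3', 256, 44100, '2013-01-15', '2013-01-16', '/music/example.mp3')"
)
conn.execute("INSERT INTO song VALUES (1, 'Example Song', 215, 'Rock', 5, 12, 1, 1, 1)")


def song_details(connection):
    """Follow the three IDs stored on each song to rebuild the full picture."""
    return connection.execute("""
        SELECT s.song_name, ar.artist_name, al.album_name, f.kind
        FROM song s
        JOIN artist ar ON ar.artist_id = s.artist_id
        JOIN album al ON al.album_id = s.album_id
        JOIN song_file f ON f.file_id = s.file_id
    """).fetchall()


rows = song_details(conn)
print(rows)
```

Storing the artist and album once and referring to them by ID means a change such as a corrected album name is made in one place rather than on every song.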
Big Data IST210 Week 2, Lecture 2
Big Data Summary • http://www.youtube.com/watch?v=eEpxN0htRKI
What is Big Data? • In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using traditional relational database management systems. • RDBMS: Relational Database Management System
Challenges: • capturing • storing • searching • sharing • analyzing • visualization
Data types with size issues: • scientific models/simulations (biology, astrophysics) • genetic studies • traffic • internet searching • business information (order management to stock data)
What’s making so much data? • ubiquitous computing (an area of study for those interested) • more people carrying data-generating devices (mobile phones with Facebook, GPS, cameras, etc.)
Just how big are we talking? • In 2012 we hit the capability of creating and storing 2.5 quintillion bytes of data PER DAY (2.5 x 10^18 bytes, or 2.5 billion gigabytes) • 90% of the world's data was created in the last two years • The human genome, when it was originally mapped, took 10 years to process. It can now be done in a week (as of 2012). • Walmart handles 1 million+ transactions per hour and needs to store these for analysis to determine what products sell where, etc.
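The unit conversions behind these figures can be checked with a few lines of arithmetic. This is a small sketch that assumes decimal gigabytes (10^9 bytes) and treats the Walmart figure as exactly 1 million transactions per hour; the variable names are made up for the example.

```python
# Figures taken from the slide; units are the assumptions noted above.
BYTES_PER_DAY = 25 * 10**17        # 2.5 x 10^18 bytes
BYTES_PER_GIGABYTE = 10**9         # decimal gigabyte (assumption)

gigabytes_per_day = BYTES_PER_DAY // BYTES_PER_GIGABYTE

TRANSACTIONS_PER_HOUR = 1_000_000  # lower bound from the slide
transactions_per_day = TRANSACTIONS_PER_HOUR * 24

print(f"{gigabytes_per_day:,} GB per day")
print(f"{transactions_per_day:,} transactions per day")
```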
Where is the problem? • When trying to get useful information out of the huge volume of data (drinking from a fire hose), traditional RDBMS queries aren't sufficient. • Why? Even if you could store all of this data for one example (all tweets in a week, for instance), searching it with traditional tools to find out whether a particular topic was trending would take so long that the result would be meaningless by the time it was computed. • Big Data solutions, then, consider how to store this data in novel ways to make it more accessible, and how to perform analysis on it.
Where is the problem? • This now quite commonly includes massively parallel software running on anywhere from hundreds to thousands of servers, which could themselves be virtual machines on growing server farms. • The overall idea of "big data" includes not only storage and analysis, but also how to shape the data, what to store, how to store it, and how to search, share and visualize it. • There is so much demand, right now, for understanding how to handle the massive amounts of data and make it useful that the industry is now more than $100 billion in size and growing at about 10% per year, about twice as fast as other software technology.
Changing how we store data: • To be practically useful, Big Data analytics requires rethinking data storage. • Instead of older SAN storage farms or data warehouses, data is moving onto direct-attached storage (DAS), such as solid state disks or large SATA disks attached directly to parallel processing nodes. • This brings the huge amounts of data closer to large processing capabilities in order to perform more timely analytics.
Activity/Discussion: • http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation • What do you take away from this reading?
Structured Storage • A Column (not the same as a column in a relational database) • A Super Column • A Column Family
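The slide only names the three building blocks, so the sketch below assumes they mean what they mean in Cassandra-style column stores: a column is a name/value/timestamp triple, a super column is a named group of columns, and a column family holds rows that each carry their own set of columns. The class names, methods, and sample data are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """A single name/value pair plus a timestamp (not a relational column)."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """A named group of columns."""
    name: str
    columns: dict = field(default_factory=dict)

    def add(self, column):
        self.columns[column.name] = column


@dataclass
class ColumnFamily:
    """Rows keyed by a row key; each row holds its own set of columns."""
    name: str
    rows: dict = field(default_factory=dict)

    def insert(self, row_key, column):
        self.rows.setdefault(row_key, {})[column.name] = column

    def get(self, row_key, column_name):
        return self.rows[row_key][column_name].value


songs = ColumnFamily("songs")
songs.insert("song1", Column("song_name", "Example Song", 1))
songs.insert("song1", Column("genre", "Rock", 1))
songs.insert("song2", Column("song_name", "Another Song", 2))  # no genre stored

file_info = SuperColumn("file_info")
file_info.add(Column("kind", "MP3", 1))
file_info.add(Column("bit_rate", "256", 1))

print(songs.get("song1", "genre"))
print(sorted(file_info.columns))
```

Unlike a relational table, the two rows here do not share a fixed set of columns: the second row simply has no genre.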
Getting Information Out of Structured Storage - Map Reduce • Map – distribute the task among multiple computers • Reduce – take the results from each computer and combine them
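The two steps can be sketched with the trending-topic example from earlier in the lecture. This is a toy, single-machine illustration: each chunk of tweets stands in for the data held on one computer, and the function names and sample tweets are made up for the example. A real MapReduce framework would run the map step on separate machines in parallel.

```python
from collections import Counter
from functools import reduce


def map_chunk(tweets):
    """Map step: count hashtags in the tweets held by one computer."""
    counts = Counter()
    for tweet in tweets:
        for word in tweet.lower().split():
            if word.startswith("#"):
                counts[word] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: combine the partial counts from two computers."""
    return left + right


# Each inner list stands in for the tweets stored on one computer.
chunks = [
    ["Loving #bigdata today", "#python is fun"],
    ["More #bigdata news", "lunch time"],
    ["#BigData everywhere", "#python again"],
]

partial_counts = [map_chunk(chunk) for chunk in chunks]
totals = reduce(reduce_counts, partial_counts, Counter())
print(totals.most_common(1))
```

Because each chunk is counted independently, adding more computers shortens the map step; only the small partial counts have to be brought together in the reduce step.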
IBM considers Big Data: • Big data spans four dimensions: Volume, Velocity, Variety, and Veracity. • Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information. • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis • Convert 350 billion annual meter readings to better predict power consumption
IBM considers Big Data: • Big data spans four dimensions: Volume, Velocity, Variety, and Veracity. • Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value. • Scrutinize 5 million trade events created each day to identify potential fraud • Analyze 500 million daily call detail records in real-time to predict customer churn faster
IBM considers Big Data: • Big data spans four dimensions: Volume, Velocity, Variety, and Veracity. • Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together. • Monitor 100’s of live video feeds from surveillance cameras to target points of interest • Exploit the 80% data growth in images, video and documents to improve customer satisfaction
IBM considers Big Data: • Big data spans four dimensions: Volume, Velocity, Variety, and Veracity. • Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions. • How can you act upon information if you don’t trust it? • Establishing trust in big data presents a huge challenge as the variety and number of sources grows.
Discussion • What do you think? • Opinions of all this?
Two weeks down, 14 to go! • Next week • NoSQL and traditional RDBMS • One lecture • Lab • Next Thursday (1/24), class will be office hours. No attendance will be taken. • Homework Assignment 3 is due Tuesday at 11:59