Big Data Analytics Lecture Series. Kalapriya Kannan IBM Research Labs July, 2013. What is the aim of the course. Focus is on “Systems” and applications for cloud-based storage and processing of BIG DATA . Big Data - Definition Big Data - Analytics Big Data - Storage (HDFS)
Big Data Analytics Lecture Series Kalapriya Kannan IBM Research Labs July, 2013
What is the aim of the course? The focus is on “systems” and applications for cloud-based storage and processing of BIG DATA. • Big Data - Definition • Big Data - Analytics • Big Data - Storage (HDFS) • Big Data - Computing (Map/Reduce) • Big Data - Database (HBase) • Big Data - Graph DB (Titan) • Big Data - Streaming (Storm)
Pre-Requisites • “Nothing” – All of you are equally qualified. • A virtual machine, run through either VMware Player or VirtualBox • Acknowledgements: • IBM material, examples, machines, etc. • IBM external talks, publicly available material, and the authors of the same. • Various Internet material – Thanks to the “Internet” • Apache documentation and examples
Mantra “Learning is not restricted to listening; it is actively asking relevant questions”
After 6 hrs of lecture • Get convinced about “Big Data” • Understand why we need a different paradigm. • Ascertain with confidence the need to look at data computing in a different way. • Realize the potential of big data • All of you are skilled enough to get into it. • What we will not do • Research why things have evolved into the current trends. • Try to be hands-on – but not guaranteed
Introduction to Big Data Kalapriya Kannan IBM Research Labs July, 2013
What are we going to understand? • What is Big Data? • Why did we end up here? • To whom does it matter? • Where is the money? • Are we ready to handle it? • What are the concerns? • Tools and technologies • Is Big Data <=> Hadoop?
Simple to start • What is the maximum file size you have dealt with so far? • Movies/files/streaming video that you have used? • What have you observed? • What is the maximum download speed you get? • Simple computation • How much time does it take just to transfer the data?
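The “simple computation” on this slide can be sketched in a few lines. The file sizes and link speeds below are hypothetical values chosen for illustration, not figures from the lecture, and the calculation ignores protocol overhead and congestion, so real transfers will be slower.

```python
def transfer_time_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move size_bytes over a link, ignoring all overhead."""
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    # Link speeds are quoted in bits per second, file sizes in bytes.
    return size_bytes * 8 / link_bits_per_second


GB = 10**9    # decimal gigabyte
TB = 10**12   # decimal terabyte
MBIT = 10**6  # megabit per second

# Hypothetical examples: a 4 GB movie on a 20 Mbit/s home connection,
# and a 1 TB dataset on a 100 Mbit/s link.
movie_seconds = transfer_time_seconds(4 * GB, 20 * MBIT)
dataset_seconds = transfer_time_seconds(1 * TB, 100 * MBIT)

print(f"4 GB at 20 Mbit/s:  {movie_seconds / 60:.1f} minutes")
print(f"1 TB at 100 Mbit/s: {dataset_seconds / 3600:.1f} hours")
```

Under these assumptions the 1 TB dataset needs roughly 22 hours just to move, before any processing starts, which is one reason to bring computation to the data rather than the other way round.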
What is big data? • “Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is ‘big data.’”
Huge amount of data • There are huge volumes of data in the world: • From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data. • In 2011, the same amount was created every two days. • In 2013, the same amount of data is created every 10 minutes.
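To make the growth on this slide concrete, the figures can be turned into data-creation rates. This is a rough sketch that takes the slide's numbers at face value and assumes decimal units (1 exabyte = 10^18 bytes).

```python
EXABYTE = 10**18                      # decimal exabyte, in bytes
GIGABYTE = 10**9

# 5 billion gigabytes is 5 exabytes.
volume_bytes = 5 * 10**9 * GIGABYTE
assert volume_bytes == 5 * EXABYTE

period_2011_seconds = 2 * 24 * 3600   # "every two days"
period_2013_seconds = 10 * 60         # "every 10 minutes"

rate_2011 = volume_bytes / period_2011_seconds   # bytes per second
rate_2013 = volume_bytes / period_2013_seconds   # bytes per second

# Same volume in a shorter period: the speed-up is the ratio of the periods.
speedup = period_2011_seconds / period_2013_seconds

print(f"2011: {rate_2011 / 10**12:.1f} TB/s")
print(f"2013: {rate_2013 / 10**12:.1f} TB/s")
print(f"Growth factor 2011 -> 2013: {speedup:.0f}x")
```

Going from “every two days” to “every 10 minutes” is a 288-fold increase in the rate of data creation over two years.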
Big data spans three dimensions: Volume, Velocity and Variety • Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information. • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis • Convert 350 billion annual meter readings to better predict power consumption • Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value. • Scrutinize 5 million trade events created each day to identify potential fraud • Analyze 500 million daily call detail records in real time to predict customer churn faster • The latest I have heard is that a delay of 10 nanoseconds is too much. • Variety: Big data is any type of data, structured or unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together. • Monitor 100s of live video feeds from surveillance cameras to target points of interest • Exploit the 80% data growth in images, video and documents to improve customer satisfaction
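The daily totals quoted for Volume and Velocity are easier to reason about as sustained per-second rates. The sketch below assumes the load is spread evenly over 24 hours, which real workloads are not, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(daily_total: float) -> float:
    """Average per-second rate for a quantity given as a daily total."""
    return daily_total / SECONDS_PER_DAY


# Daily figures taken from the slide.
trade_events_per_s = per_second(5_000_000)       # trade events
call_records_per_s = per_second(500_000_000)     # call detail records
tweet_mb_per_s = per_second(12 * 10**12) / 10**6  # 12 TB of Tweets, in MB/s

print(f"Trade events:  {trade_events_per_s:,.0f} per second")
print(f"Call records:  {call_records_per_s:,.0f} per second")
print(f"Tweet volume:  {tweet_mb_per_s:,.0f} MB per second")
```

Even as averages, these are thousands of records and over a hundred megabytes every second, continuously, which is why the slide stresses processing data as it streams in.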
Finally… “Big data” is similar to “small data”, but bigger. • But because the data is bigger, it requires different approaches: techniques, tools, architecture • … with an aim to solve new problems, or old problems in a better way
To whom does it matter? • Research community • Business community - New tools, new capabilities, new infrastructure, new business models, etc. • By sector: Financial Services..
The Social Layer in an Instrumented Interconnected World • 4.6 billion camera phones worldwide • 30 billion RFID tags today (1.3B in 2005) • 12+ TBs of tweet data every day • 100s of millions of GPS-enabled devices sold annually • ? TBs of data every day • 2+ billion people on the Web by end 2011 • 25+ TBs of log data every day • 76 million smart meters in 2009… 200M by 2014
What does Big Data trigger? • From “Big Data and the Web: Algorithms for Data Intensive Scalable Computing”, Ph.D Thesis, Gianmarco
BIG DATA is not just HADOOP • Federated Discovery and Navigation: Understand and navigate federated big data sources • Hadoop File System, MapReduce: Manage & store huge volume of any data • Data Warehousing: Structure and control data • Stream Computing: Manage streaming data • Text Analytics Engine: Analyze unstructured data • Integration, Data Quality, Security, Lifecycle Management, MDM: Integrate and govern all data sources
Types of tools typically used in a Big Data scenario • Where is the processing hosted? • Distributed servers/cloud • Where is the data stored? • Distributed storage (e.g., Amazon S3) • What is the programming model? • Distributed processing (MapReduce) • How is the data stored and indexed? • High-performance schema-free database • What operations are performed on the data? • Analytic/semantic processing (e.g., RDF/OWL)
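As a small illustration of the MapReduce programming model named above, here is a single-process sketch of the classic word count. It is not Hadoop code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for this example, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data is big", "data about data"])
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, both steps can be spread over many nodes without the programmer writing any distribution logic.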
When dealing with Big Data is hard • When the operations on data are complex: • E.g., simple counting is not a complex problem. • Modeling and reasoning with data of different kinds can get extremely complex. • Good news with big data: • Often, because of the vast amount of data, modeling techniques can get simpler (e.g., smart counting can replace complex model-based analytics)… • …as long as we can deal with the scale.
Time for thinking • What do you do with the data? • Let's take an example: • “From application developers to video streamers, organizations of all sizes face the challenge of capturing, searching, analyzing, and leveraging as much as terabytes of data per second, too much for the constraints of traditional system capabilities and database management tools.”
Why Big-Data? • Key enablers for the appearance and growth of ‘Big-Data’ are: • Increase in storage capabilities • Increase in processing power • Availability of data