AGENDA
• Buzzword
What is Big Data?
• Big Data refers to massive, often unstructured data sets that are beyond the processing capabilities of traditional database management tools.
• It can occupy terabytes or petabytes of storage in diverse formats, including text, video, sound and images.
• Traditional relational database management systems cannot cope with data at this scale.
• Examples: user updates on Facebook; clicks across the internet.
• Next: the 3 V's of big data; structured vs unstructured data.
Volume
• Volume refers to the huge amount of data being generated every minute.
• 90% of the data we have now was created in just the past 2 years.
• IP traffic is expected to grow to 4X its current level by 2015.
• 3 billion people are expected to be online by 2015.
• Notes: 2.7 zettabytes; Hadron Collider experiment.
Velocity
• Velocity refers to the speed at which new data is generated and moves around.
• It covers real-time systems such as online banking.
• These systems need low response times.
• "In-Memory Analytics" technology is used to deal with data in motion.
• Notes: 90k YouTube, 45k Google per second.
Variety
• Variety refers to the many data types we can now use.
• Earlier, the focus was on neat, structured data kept as tables in an RDBMS.
• 80% of the data available now is unstructured.
• Data types are heterogeneous, ranging from text to video, audio and pictures.
• Notes: portable devices, sensors and social media. How do we gain? Video.
Transform problems into possibilities
• Next: Big Data Analytics.
Big Data Analytics
• It is the process of examining large amounts of data of a variety of types (big data) to uncover hidden patterns, unknown correlations and other real-time insights.
• Uses of Big Data Analytics: Google Search recommendations, Satyamev Jayate.
• Future scope: reading genes to help cure deadly diseases such as cancer.
• Next: types of analytics.
Leading Technologies
• Relational databases struggle to store and process Big Data.
• As a result, a new class of big data technology has emerged and is being used in many big data analytics environments.
• The technologies associated with big data analytics include:
• Hadoop
• MapReduce
• NoSQL
Hadoop
• Hadoop is an open-source, Java-based programming framework.
• It stores and processes large data sets.
• It runs in a distributed computing environment.
• Components of Hadoop:
• HDFS (Hadoop Distributed File System)
• MapReduce
HDFS (Hadoop Distributed File System)
• HDFS stores data in a distributed, scalable and fault-tolerant way.
• The NameNode holds metadata about the data stored on the DataNodes.
• The DataNodes hold the actual data in the form of blocks and can communicate with each other.
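The split between NameNode metadata and DataNode blocks can be illustrated with a small in-memory sketch. This is not the HDFS API: `place_blocks`, `read_file`, the 8-byte block size and the round-robin placement are all hypothetical simplifications (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
BLOCK_SIZE = 8      # bytes per block; hypothetical and tiny, for illustration only
REPLICATION = 3     # number of copies kept of each block

def place_blocks(filename, data, datanodes,
                 block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split data into blocks and store each block on `replication` DataNodes.

    Assumes replication <= len(datanodes).
    Returns (namenode_metadata, datanode_storage).
    """
    namenode = {filename: []}                    # metadata only: block ids and locations
    storage = {node: {} for node in datanodes}   # the block bytes live on the DataNodes
    for offset in range(0, len(data), block_size):
        number = offset // block_size
        block_id = f"{filename}#blk{number}"
        # Round-robin placement: a simplification, not the real HDFS policy.
        targets = [datanodes[(number + i) % len(datanodes)]
                   for i in range(replication)]
        for node in targets:
            storage[node][block_id] = data[offset:offset + block_size]
        namenode[filename].append((block_id, targets))
    return namenode, storage

def read_file(filename, namenode, storage, failed=()):
    """Reassemble a file from its blocks, skipping DataNodes listed in `failed`."""
    parts = []
    for block_id, locations in namenode[filename]:
        live = [node for node in locations if node not in failed]
        if not live:
            raise IOError(f"all replicas of {block_id} are unavailable")
        parts.append(storage[live[0]][block_id])
    return b"".join(parts)

nodes = ["dn1", "dn2", "dn3", "dn4"]
meta, store = place_blocks("tweets.txt", b"big data is the new oil", nodes)
print(read_file("tweets.txt", meta, store, failed={"dn1"}))
```

The file can still be read with `dn1` marked as failed, because every block has replicas on other nodes.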
Hadoop SQL
• Any questions?
Benefits of Hadoop
• HDFS copies each block of a file to several nodes (three by default), not to every node in the cluster.
• Doesn't that seem like a waste of space?
• It is not a waste, for 2 reasons:
• If one node fails, the system keeps working because another copy of the data is still available.
• A query is spread over the nodes, so parallel processing returns results faster.
• E.g. count all the words in my Twitter history to check what I talk about the most.
• The query is split across multiple servers by some criterion (here, months), and the results are consolidated.
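The Twitter example can be sketched as follows. This is a single-machine illustration, not Hadoop: worker threads stand in for separate servers, and `count_words`, `most_common_words` and the sample tweets are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(tweets):
    """Count the words in one partition (one month of tweets)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

def most_common_words(tweets_by_month, top=3):
    """Process each month independently, then consolidate the partial counts."""
    # Each month is an independent task; threads stand in for cluster nodes here.
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_words, tweets_by_month.values()))
    total = Counter()
    for counts in partial_counts:
        total.update(counts)          # consolidation step
    return total.most_common(top)

tweets_by_month = {
    "2013-01": ["big data is big", "hadoop stores data"],
    "2013-02": ["data is the new oil"],
}
print(most_common_words(tweets_by_month, top=1))
```

Because no month depends on another, the partitions can be processed in any order or on any machine, and only the small per-month counts need to be brought together at the end.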
MapReduce Algorithm
• MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.
• In the previous example, the Twitter data was processed on different hosts, partitioned by month.
• Hadoop provides an open-source implementation of MapReduce.
• A job is a combination of 2 user-supplied functions, written in Java in Hadoop: Mapper() and Reducer().
• Example: checking the popularity of a piece of text.
• Next: word count.
Mapper() and Reducer()
• The Mapper function processes each input split and emits key-value pairs as input to the Reducer.
• Mapper(split_filename, file_contents):
    for each word in file_contents:
        emit(word, 1)
• The Reducer function combines the values emitted by the Mappers for each key and produces the output.
• Reducer(word, values):
    sum = 0
    for each value in values:
        sum = sum + value
    emit(word, sum)
• Can anyone think of any disadvantages?
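The Mapper and Reducer pseudocode above can be tried out with a minimal single-process sketch. It is not Hadoop code: `run_job` is a hypothetical stand-in for the framework, which in a real cluster groups the mapper output by key (the shuffle step) and runs the functions on many nodes.

```python
from collections import defaultdict

def mapper(split_name, contents):
    """Emit (word, 1) for every word in one input split."""
    for word in contents.split():
        yield word.lower(), 1

def reducer(word, values):
    """Sum all the counts emitted for one word."""
    return word, sum(values)

def run_job(splits):
    """Hypothetical driver: map every split, group by key, then reduce."""
    grouped = defaultdict(list)
    for name, contents in splits.items():
        for key, value in mapper(name, contents):
            grouped[key].append(value)      # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))

splits = {
    "part-0": "big data big ideas",
    "part-1": "data beats opinion",
}
print(run_job(splits))
```

The mapper never needs to see more than its own split, and the reducer never needs more than the values for one word, which is what lets the framework distribute both phases.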
Disadvantages of Hadoop
• Hadoop had 2 major disadvantages when it was first developed; both have since been addressed and turned into strengths.
• HDFS depended on a single NameNode.
• Solution: a standby NameNode can take over if the active NameNode fails (HDFS High Availability, introduced in Hadoop 2). The Secondary NameNode only checkpoints metadata and is not a failover node.
• MapReduce is a Java framework and did not support SQL queries.
• Solution: Facebook developed Hive, which lets analysts run SQL-like queries over data stored in Hadoop.
NoSQL
• Not only SQL.
• A non-relational database management system.
• Used where no fixed schema is required and data is scaled horizontally.
• 4 categories of NoSQL databases:
• Key-value pair
• Columnar databases
• Graph databases
• Document databases
NoSQL Categories
KEY-VALUE PAIR
• Keys are used to fetch values stored as opaque data blocks.
• Works like a hash map.
• Tremendously fast.
• Drawback: no provision for content-based queries.
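A toy in-memory store shows both the speed and the drawback. `KeyValueStore` and its methods are hypothetical and do not correspond to any particular product.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = bytes(value)

    def get(self, key, default=None):
        # Average-case O(1) hash lookup: this is why key-value stores are fast.
        return self._data.get(key, default)

    def scan_for(self, needle):
        # The store knows nothing about what is inside a value, so a content
        # query degenerates into reading every value: O(n) in the number of keys.
        return [key for key, value in self._data.items() if needle in value]

kv = KeyValueStore()
kv.put("user:1", b'{"name": "ann", "city": "Pune"}')
kv.put("user:2", b'{"name": "raj", "city": "Delhi"}')
print(kv.get("user:1"))
print(kv.scan_for(b"Delhi"))
```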
DOCUMENT DATABASE
• Again a key-value store, but the value is a document.
• Documents do not follow a fixed schema.
• Documents can be nested.
• Queries can be based on content as well as keys.
• Use case: blogging websites.
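A minimal sketch of the idea, using plain Python dictionaries as documents. `DocumentStore`, `get_path` and the sample blog posts are hypothetical; real document databases add indexes so that content queries do not have to scan every document.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'author.name' into nested dictionaries."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

class DocumentStore:
    """Toy document store: key -> schemaless, possibly nested document."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, doc):
        self._docs[key] = doc

    def get(self, key):
        return self._docs.get(key)            # lookup by key

    def find(self, path, value):
        # Lookup by content: the store can see inside each document.
        return [key for key, doc in self._docs.items()
                if get_path(doc, path) == value]

posts = DocumentStore()
posts.insert("post:1", {"title": "Big Data", "author": {"name": "ann"}})
posts.insert("post:2", {"title": "Hadoop", "author": {"name": "raj"}, "tags": ["hdfs"]})
print(posts.find("author.name", "raj"))
```

The two posts have different fields, which is acceptable because no schema is enforced.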
COLUMNAR DATABASE
• Works on attributes (columns) rather than tuples (rows).
• The key is the column name and the value is the contiguous run of that column's values.
• Best for aggregation queries.
• Typical query: select (1 or 2 columns' values) where (the same or another column's value) = some value.
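The layout difference can be sketched in a few lines. `to_columns`, `sum_where` and the sample rows are hypothetical, and the sketch assumes every row has the same fields.

```python
rows = [
    {"user": "ann", "country": "IN", "clicks": 12},
    {"user": "bob", "country": "US", "clicks": 5},
    {"user": "raj", "country": "IN", "clicks": 7},
]

def to_columns(rows):
    """Pivot row-oriented records into one contiguous list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def sum_where(columns, target, filter_column, filter_value):
    """SELECT SUM(target) WHERE filter_column = filter_value.

    Only two columns are read; the others are never touched.
    """
    return sum(t for t, f in zip(columns[target], columns[filter_column])
               if f == filter_value)

columns = to_columns(rows)
print(columns["clicks"])
print(sum_where(columns, "clicks", "country", "IN"))
```

In a row store the same aggregation would read every field of every row, which is why the columnar layout suits queries that touch only a few attributes.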
GRAPH DATABASES
• A collection of nodes and edges.
• Nodes represent data, while edges represent the links between them.
• The most dynamic and flexible category.
• Next: BASE vs ACID properties.
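A small sketch of nodes, labelled edges and a traversal query. The `Graph` class, the `follows` label and the sample users are hypothetical and are not tied to any graph database product.

```python
from collections import deque

class Graph:
    """Toy property graph: nodes carry data, edges carry a label."""

    def __init__(self):
        self.nodes = {}    # node id -> properties
        self.edges = {}    # node id -> list of (label, neighbour id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target):
        self.edges.setdefault(source, []).append((label, target))
        self.edges.setdefault(target, [])

    def reachable(self, start, label):
        """Breadth-first search following only edges with the given label."""
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for edge_label, neighbour in self.edges.get(current, []):
                if edge_label == label and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(start)
        return seen

g = Graph()
for name in ("ann", "bob", "raj", "eve"):
    g.add_node(name, kind="user")
g.add_edge("ann", "follows", "bob")
g.add_edge("bob", "follows", "raj")
g.add_edge("eve", "follows", "ann")
print(sorted(g.reachable("ann", "follows")))
```

New node kinds and edge labels can be added at any time without changing a schema, which is the flexibility the slide refers to.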
CONCLUSION
• Data is the new oil.
• Without Big Data analysis, companies are deaf and dumb, mere wanderers on the web, like cattle on the highway!
• Thank you! Keep dreaming BIG :D