IE 5331 – Grad Survey Paper – Issues and Current Research in BIG DATA • Abisheik Kar Ramachandran • Akhila Thota • 11/19/2013
BIG DATA • What is BIG DATA?
Large – Is it really LARGE? • Google processes more than 20 PB a day (2012) • Facebook has 2.5 PB of user data + 500 TB/day (2012) • 300 million photos are uploaded to Facebook every day • eBay has 7.5 PB of user data + 50 TB/day (5/2009) • Nearly 35 zettabytes (ZB) of data projected for 2020! • Need an analogy?
Large – Is it really LARGE? • 35 ZB is enough data to fill a stack of DVDs reaching halfway to Mars.
BIG DATA – Issues and Current Research: Characteristics of Big Data
Volume • Volume represents the amount of data. • The amount of data is projected to grow 44-fold from 2009 to 2020.
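As a rough illustration of what a 44x increase over that period means per year, the sketch below computes the implied compound annual growth rate. It assumes steady year-over-year growth and takes the 35 ZB figure quoted earlier as the 2020 end point; both are simplifying assumptions, not claims from the slides.

```python
# Compound annual growth implied by a 44x increase between 2009 and 2020.
growth_factor = 44        # figure from the slide
years = 2020 - 2009       # 11 years

# Assumption: growth is spread evenly, so factor = (1 + rate) ** years.
annual_rate = growth_factor ** (1 / years) - 1
print(f"Implied annual growth: {annual_rate:.1%}")

# Assumption: the 35 ZB figure quoted earlier is the 2020 end point.
volume_2020_zb = 35
volume_2009_zb = volume_2020_zb / growth_factor
print(f"Implied 2009 volume: {volume_2009_zb:.2f} ZB")
```

Under these assumptions the data volume grows by roughly 41% per year, starting from about 0.8 ZB in 2009.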
Velocity • Velocity represents the speed at which data is created, accessed, or streamed. • A few decades ago, real-time streaming was beyond our imagination; thanks to advances in technology, data is now streamed in real time. • Data is generated fast and must be processed fast.
Variety • Variety represents the range of data types, much of it unstructured. • Examples: text files, audio files, video files.
BIG DATA – Issues and Current Research: Big Data Issues
Big Data Issues • General Issues • Fundamental Issues • Storage & Transport issues • Processing issues • Management issues • Design Issues • High Availability • Privacy Issues
General Issues • Handling a wide range of unstructured data, combined with its sheer size, poses a major challenge in big data. • Since we are talking about zettabytes of data, efficient mechanisms to store and retrieve it are vital. • In an RDBMS, data has to be stored in tables and retrieved using queries; an RDBMS is designed to handle structured queries. • In contrast, big data is a huge collection of unstructured data that cannot readily be defined in tables.
Fundamental Issues – Storage & Transport Issues • Because an enormous amount of data is created each second, storing it becomes a major issue. • Storage media cannot keep up with the growth in data size. • To illustrate, processing an exabyte of data on a single system would need nearly 25,000 disks. • Over current communication networks with a transfer rate of 1 gigabyte per second and an effective sustainable rate of 80%, transferring an exabyte of data would take nearly 2,800 hours.
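The storage and transfer estimates above are simple divisions, and the sketch below reproduces that arithmetic. It assumes decimal units and a hypothetical 4 TB disk capacity, which the slide does not state. Under these assumptions the results come out considerably larger than the figures quoted on the slide, which suggests the quoted numbers rest on different assumptions; the last call shows one combination (a petabyte over a link sustaining 100 MB/s) that gives roughly 2,800 hours.

```python
# Decimal (SI) units, in bytes.
MEGABYTE = 10**6
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18


def disks_needed(data_bytes, disk_bytes):
    """Number of disks needed to hold data_bytes (ceiling division)."""
    return -(-data_bytes // disk_bytes)


def transfer_hours(data_bytes, link_bytes_per_s, efficiency):
    """Hours to move data_bytes over a link at a sustained efficiency."""
    return data_bytes / (link_bytes_per_s * efficiency) / 3600


# Assumption: 4 TB per disk (hypothetical; not stated on the slide).
print(disks_needed(EXABYTE, 4 * TERABYTE))

# The slide's stated link: 1 GB/s at 80% sustainable rate.
print(round(transfer_hours(EXABYTE, 1 * GIGABYTE, 0.8)))

# Assumption: a petabyte over a link that sustains 100 MB/s.
print(round(transfer_hours(PETABYTE, 100 * MEGABYTE, 1.0)))
```

Whichever assumptions are used, the conclusion of the slide holds: at this scale neither a single system's disks nor a single network link is practical.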
Fundamental Issues – Processing Issues • Because modern technology generates huge amounts of data, processing that data becomes essential. • Processing data in the exabyte range effectively would require extensive parallel processing capabilities and efficient algorithms.
Management Issues • A decade ago, a drive holding about 1,000 MB of data was read at a rate of 4 MB/sec. • Today, read speeds have risen from 4 MB/sec to 100 MB/sec. • Even at this speed, a single system would take far too long to read zettabytes of data. • Reading the disks continuously also leads to another potential problem: hardware failure.
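To put these read rates in perspective, the sketch below converts them into read times. The 1 TB data size and the 100 parallel drives are illustrative assumptions, not figures from the slides; the point is only that splitting data across many drives shortens the read time proportionally.

```python
MB = 10**6
TB = 10**12


def read_time_seconds(data_bytes, rate_bytes_per_s, drives=1):
    """Time to read data split evenly across `drives` drives read in parallel."""
    return data_bytes / (rate_bytes_per_s * drives)


# Older drive from the slide: about 1,000 MB read at 4 MB/sec.
old = read_time_seconds(1000 * MB, 4 * MB)

# Assumption: a 1 TB data set read at 100 MB/sec from a single drive.
one_drive = read_time_seconds(1 * TB, 100 * MB)

# Assumption: the same 1 TB spread over 100 drives read in parallel.
many_drives = read_time_seconds(1 * TB, 100 * MB, drives=100)

print(f"old drive:  {old / 60:.1f} minutes")
print(f"one drive:  {one_drive / 3600:.1f} hours")
print(f"100 drives: {many_drives / 60:.1f} minutes")
```

Capacity has grown much faster than read speed, so reading a full modern drive takes hours rather than minutes. Spreading the data over many drives restores short read times, at the cost of more hardware that can fail.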
BIG DATA – Issues and Current Research: Available Technologies to Handle These Issues
Google File System (GFS) • Designed by Google to handle its big data. • GFS has two types of nodes: • Master node • Chunk nodes (chunkservers) • Data is divided into chunks and stored on the chunk nodes. • The master node stores the metadata for all file chunks held on the chunk nodes.
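The division of labor between the master node and the chunk nodes can be sketched with a toy in-memory model. This is not the GFS API: the class names, the 8-byte chunk size, and the round-robin placement are all hypothetical choices for illustration, replication is omitted, and for brevity the master here also moves the data, whereas the slide only says it holds metadata.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # bytes; hypothetical and tiny, real chunks are far larger


@dataclass
class ChunkNode:
    """Stores the actual chunk contents."""
    name: str
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytes


@dataclass
class MasterNode:
    """Keeps only metadata: which chunks make up a file and where they live."""
    nodes: list
    metadata: dict = field(default_factory=dict)  # filename -> [(chunk_id, node name)]

    def write(self, filename, data):
        locations = []
        for offset in range(0, len(data), CHUNK_SIZE):
            index = offset // CHUNK_SIZE
            chunk_id = f"{filename}#{index}"
            node = self.nodes[index % len(self.nodes)]  # round-robin placement
            node.chunks[chunk_id] = data[offset:offset + CHUNK_SIZE]
            locations.append((chunk_id, node.name))
        self.metadata[filename] = locations

    def read(self, filename):
        by_name = {node.name: node for node in self.nodes}
        return b"".join(
            by_name[name].chunks[chunk_id]
            for chunk_id, name in self.metadata[filename]
        )


nodes = [ChunkNode("chunk-node-1"), ChunkNode("chunk-node-2"), ChunkNode("chunk-node-3")]
master = MasterNode(nodes)
payload = b"big data needs distributed storage"
master.write("report.txt", payload)

print(master.metadata["report.txt"])           # chunk ids and their locations
print(master.read("report.txt") == payload)    # file reassembled from chunks
```

The master's metadata table is small compared with the data itself, which is why a single master can coordinate many chunk nodes.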
Hadoop • Hadoop is open-source software for distributed computing. • It can query a large data set and return results faster, using a reliable and scalable architecture. • In a traditional non-distributed architecture, data is stored on one server, and every client program accesses this central server to retrieve data. • That architecture is not reliable: if the main server fails, the data has to be restored from backup. • In Hadoop, every server has its own local computation and storage.
Hadoop (architecture diagram) • A user submits jobs to the master node, which runs the job tracker. • Slave nodes 1 through N each run a task tracker that manages its workers.
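The job tracker and task tracker roles can be sketched as a toy word count in plain Python. This is not the Hadoop API: the class names, the sample lines, and the word-count task are hypothetical, and the loop runs the slave nodes one after another, whereas a real cluster would run them in parallel on separate machines.

```python
from collections import Counter


class TaskTracker:
    """Runs a task on one slave node against the data stored on that node."""

    def __init__(self, name, local_lines):
        self.name = name
        self.local_lines = local_lines  # data kept local to this node

    def run(self, task):
        return task(self.local_lines)


class JobTracker:
    """Runs on the master node: sends the task to each task tracker and merges results."""

    def __init__(self, task_trackers):
        self.task_trackers = task_trackers

    def submit(self, task):
        total = Counter()
        for tracker in self.task_trackers:  # sequential here, parallel in a real cluster
            total.update(tracker.run(task))
        return total


def count_words(lines):
    """The task each worker runs on its local data."""
    return Counter(word for line in lines for word in line.lower().split())


trackers = [
    TaskTracker("slave-1", ["big data is big"]),
    TaskTracker("slave-2", ["data moves fast"]),
    TaskTracker("slave-N", ["big data has variety"]),
]
result = JobTracker(trackers).submit(count_words)
print(result["big"], result["data"], result["fast"])
```

The computation moves to where the data is stored, and only the small per-node counts travel back to the master for merging.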
Design Issues and High Availability • The major questions that arise when designing a big data system include the following: • What data is relevant to the system? • How much data is needed to predict the result successfully? • What is the value of the data in the decision-making process? • High Availability • The system must remain available to the user. • Distributed systems offer a good solution, but they require more power.
Privacy Issues • The growth of social media and its use of big data have raised privacy concerns. • Photos, location trackers, etc. • Location-based tagging and photo-sharing sites focused on geotagging • Cyber security • Mobile tracking • And many more
Future Work • Big data is still an emerging field, and much can be done to improve efficiency. • Systems such as Google File System and Hadoop have taken a step toward solving these issues. • Efficient algorithms must be developed so that big data can be processed faster. • In Hadoop, if the job tracker machine fails, all currently running jobs fail. There must be a way to handle such scenarios.