Best Hadoop Institutes: Kelly Technologies is the best Hadoop training institute in Bangalore, providing Hadoop courses taught by real-time faculty.
WWW.KELLYTECHNO.COM

INTRODUCTION TO HADOOP

What is Big Data?

Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. Gartner defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. According to IBM, 80% of data captured today is unstructured, coming from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.

Big data spans three dimensions: Volume, Velocity and Variety.

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
- Convert 350 billion annual meter readings to better predict power consumption.

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
- Scrutinize 5 million trade events created each day to identify potential fraud.
- Analyze 500 million daily call detail records in real time to predict customer churn faster.

Variety: Big data is any type of data, structured or unstructured, such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
- Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
- Exploit the 80% data growth in images, video and documents to improve customer satisfaction.

What does Hadoop solve?

Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data.
However, since 80% of this data is "unstructured", it must be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis. Hadoop is the core platform for structuring Big Data, and solves the problem of making it useful for analytics purposes.

The Importance of Big Data and What You Can Accomplish

The real issue is not that you are acquiring large amounts of data. It's what you do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smarter business decision making. For instance, by combining big data and high-powered analytics, it is possible to:
- Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
- Optimize routes for many thousands of package delivery vehicles while they are on the road.
- Analyze millions of SKUs to determine prices that maximize profit and clear inventory.
- Generate retail coupons at the point of sale based on the customer's current and past purchases.
- Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
- Recalculate entire risk portfolios in minutes.
- Quickly identify customers who matter the most.
- Use clickstream analysis and data mining to detect fraudulent behavior.

Challenges

Many organizations are concerned that the amount of amassed data is becoming so large that it is difficult to find the most valuable pieces of information. What if your data volume gets so large and varied you don't know how to deal with it? Do you store all your data? Do you analyze it all?
How can you find out which data points are really important? How can you use that data to your best advantage?

Until recently, organizations have been limited to using subsets of their data, or they were constrained to simplistic analyses because the sheer volumes of data overwhelmed their processing platforms. But what is the point of collecting and storing terabytes of data if you can't analyze it in full context, or if you have to wait hours or days to get results? On the other hand, not all business questions are better answered by bigger data. You now have two choices:
- Incorporate massive data volumes in analysis. If the answers you're seeking will be better provided by analyzing all of your data, go for it. High-performance technologies that extract value from massive amounts of data are here today. One approach is to apply high-performance analytics to the massive amounts of data using technologies such as grid computing, in-database processing and in-memory analytics.
- Determine upfront which data is relevant. Traditionally, the trend has been to store everything (some call it data hoarding) and only when you query the data do you discover what is relevant. We now have the ability to apply analytics on the front end to determine relevance based on context. This type of analysis determines which data should be included in analytical processes and what can be placed in low-cost storage for later use if needed.

Technologies

A number of recent technology advancements enable organizations to make the most of big data and big data analytics:
- Cheap, abundant storage.
- Faster processors.
- Affordable open source, distributed big data platforms, such as Hadoop.
- Parallel processing, clustering, MPP, virtualization, large grid environments, high connectivity and high throughputs.
- Cloud computing and other flexible resource allocation arrangements.
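To make the idea of a distributed platform structuring unstructured data a little more concrete, here is a minimal single-process sketch of the map, shuffle and reduce pattern that Hadoop popularized. It is not Hadoop's API: the sample records and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

# Hypothetical sample of unstructured text records (invented for illustration).
RECORDS = [
    "flu outbreak reported in New York",
    "vaccine shipment delayed in New York",
    "flu vaccine shipment arrived",
]


def map_phase(record):
    """Emit a (word, 1) pair for every word in one text record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, giving one structured count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Run the three phases in sequence on a list of text records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    counts = word_count(RECORDS)
    for word in sorted(counts):
        print(word, counts[word])
```

Because each record is mapped independently and each key is reduced independently, both steps can be split across machines, which is the property that lets this pattern scale to large data volumes.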
The goal of all organizations with access to large data collections should be to harness the most relevant data and use it for better decision making.

Three Enormous Problems Big Data Tech Solves

But what’s less commonly talked about is why Big Data is such a problem beyond size and computing power. The reasons behind the conversation are the truly interesting part and need to be understood.
Here you go… there are three trends that are driving the discussion and should be made painfully clear instead of lost in all the hype:

We’re digitizing everything. This is big data’s volume. It comes from unlocking hidden data in common things all around us that were known before but weren’t quantified, stored, compared and correlated. Suddenly, there’s enormous value in the patterns of what was recently hidden from our view. Patterns offer understanding and a chance to predict what will happen next. Each of these is important, and together they are remarkably powerful.

There’s no time to intervene. This is big data’s velocity. All of that digital data creates massive historical records but also rich streams of information that are flowing constantly. When we take the patterns discovered in historical information and compare them to everything happening right now, we can either make better things happen or prevent the worst. This is revenue generating and life saving and all of the other wonderful things we hear about, but only if we have the systems in place to see it happening in the moment and do something about it. We can’t afford enough human watchers to do this, so the development of big data systems is the only way to get to better things when the data gives humans insufficient time to intervene.

Variation creates instability. This is big data’s variety. Data was once defined by what we could store and relate in tables of columns and rows. A world that’s digitized ignores those boundaries and is instead full of both structured and unstructured data. That creates a very big problem for systems that were built upon the old definition, which comprise just about everything around us. Suddenly, there’s data available that can’t be consumed or generated by a database. We either ignore that information or it ends up in places and formats that are unreadable to older systems.
Gone is the ability to correlate unstructured information with that vast historical (but highly structured) data. When we can’t analyze and correlate well, we introduce instability into our world. We’re missing the big picture unless we build systems that are flexible and don’t require reprogramming the logic for every unexpected change (and there will be many).

There you have it… the underlying reasons that big data matters and isn’t just hype (though there’s plenty of that). The digitization, lack of time for intervention and instability that big data creates lead us to develop whole new ways of managing information that go well beyond Hadoop and distributed computing. It’s why big data presents such enormous challenge and opportunity for software vendors and their customers, but only if these three challenges are the drivers and not opportunism.

BI vs. Big Data vs. Data Analytics By Example

Business Intelligence (BI) encompasses a variety of tools and methods that can help organizations make better decisions by analyzing “their” data. Therefore, Data Analytics falls under BI. Big Data, if used for the purpose of Analytics, falls under BI as well.
Let’s say I work for the Centers for Disease Control and my job is to analyze the data gathered from around the country to improve our response time during flu season. Suppose we want to know about the geographical spread of flu for the last winter (2012). We run some BI reports and they tell us that the state of New York had the most outbreaks. Knowing that, we might want to better prepare the state for the next winter. These types of queries examine past events, are the most widely used, and fall under the Descriptive Analytics category.

Now, we just purchased an interactive visualization tool and I am looking at a map of the United States depicting the concentration of flu in different states for the last winter. I click on a button to display the vaccine distribution. There it is: I visually detect a direct correlation between the intensity of the flu outbreak and the late shipment of vaccines. I notice that the shipments of vaccine for the state of New York were delayed last year. This gives me a clue to investigate further and determine whether the correlation is causal. This type of analysis falls under Diagnostic Analytics (discovery).

We go to the next phase, which is Predictive Analytics (PA). PA is what most people in the industry refer to as Data Analytics. It gives us the probability of different outcomes and it is future-oriented. US banks have been using it for things like fraud detection. The process of distilling intelligence is more complex and requires techniques like Statistical Modeling. Back to our example: I hire a Data Scientist to help me create a model and apply the data to the model in order to identify causal relationships and correlations as they relate to the spread of flu for the winter of 2013. Note that we are now talking about the future.
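The descriptive and diagnostic steps in this example can be sketched in a few lines of Python. The records below are invented purely for illustration and are not real flu or vaccine data; the function names outbreaks_by_state and pearson are hypothetical helpers. The first function answers the descriptive question (which state had the most outbreaks), and the second computes a Pearson correlation coefficient between shipment delay and outbreak count, which is one common way to quantify the kind of relationship spotted on the map.

```python
from collections import Counter
from math import sqrt

# Hypothetical records: (state, outbreaks reported, vaccine shipment delay in days).
# These numbers are made up for the sketch and are not real data.
REPORTS = [
    ("New York", 120, 21),
    ("Texas", 80, 10),
    ("Ohio", 45, 4),
    ("Florida", 60, 7),
    ("Oregon", 20, 1),
]


def outbreaks_by_state(reports):
    """Descriptive step: total the reported outbreaks per state."""
    totals = Counter()
    for state, outbreaks, _delay in reports:
        totals[state] += outbreaks
    return totals


def pearson(xs, ys):
    """Diagnostic step: Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    worst_state, worst_count = outbreaks_by_state(REPORTS).most_common(1)[0]
    print("Most outbreaks:", worst_state, worst_count)

    outbreaks = [row[1] for row in REPORTS]
    delays = [row[2] for row in REPORTS]
    print("Correlation between delay and outbreaks:", round(pearson(delays, outbreaks), 3))
```

A coefficient close to 1 only says the two series move together. As the example notes, it is a clue for further investigation, not proof that late shipments caused the outbreaks.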
I can use my visualization tool to play around with some variables such as demand, vaccine production rate and quantity, to weigh the pluses and minuses of different decisions about how to prepare for and tackle the potential problems in the coming months.

The last phase is Prescriptive Analytics, which integrates our tried-and-true predictive models into our repeatable processes to yield desired outcomes. An automated risk reduction system based on real-time data received from the sensors in a factory would be a good example.

Finally, here is an example of Big Data. Suppose it’s December 2013 and it happens to be a bad year for the flu epidemic. A new strain of the virus is wreaking havoc, and a drug company has produced a vaccine that is effective in combating the virus. But the problem is that the company can’t produce it fast enough to meet the demand. Therefore, the Government has to prioritize its shipments. Currently the Government has to wait a considerable amount of time to gather the data from around the country, analyze it, and take action. The process is slow and inefficient. The contributing factors are:
- Not having computer systems fast enough to gather and store the data (velocity).
- Not having computer systems that can accommodate the volume of data pouring in from all of the medical centers in the country (volume).
- Not having computer systems that can process images such as X-rays (variety).

Big Data technology changed all of that. It solved the velocity-volume-variety problem. We now have computer systems that can handle “Big Data”. The Centers for Disease Control may receive the data from hospitals and doctors’ offices in real time, and Data Analytics software that sits on top of a Big Data computer system could generate actionable items that give the Government the agility it needs in times of crisis.