Big Data Analytics. A Presentation by Meg Monsen, Michael Leonard, and Eric Zeng. Agenda: Big Data Analytics and its Objectives, Financial Impact, Structured vs Unstructured Data, Users of Big Data, Relevant Technologies (Hadoop, MongoDB), Coding Examples, Future of Analytics.
Big Data Analytics A Presentation by Meg Monsen, Michael Leonard, and Eric Zeng
Agenda • Big Data Analytics and its Objectives • Financial Impact • Structured vs Unstructured Data • Users of Big Data • Relevant Technologies (Hadoop, MongoDB) • Coding Examples • Future of Analytics
What is Big Data and why does it matter? • Defining Big Data Analytics • Examining large sets of data • Discovering patterns and trends • Data warehouses alone are insufficient • Purposes • Uncovering hidden needs of customers • Improving operational efficiency
Big Data & Operational Efficiency • “By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.” – IBM • Core Objectives • Gain • Analyze • Apply • Optimize
Financial Impact of Big Data • High cost of poor data quality • $3.1 trillion annually in the US • 10-25% of US business revenues • Opportunities for qualified analysts (salaries) • Business Analyst: $66,000 • Data Analyst: $60,000 • Data Scientist: $113,000
Dimensions of Big Data • Essential Characteristics: • Volume - Data Quantity • Velocity - Data Speed • Variety - Data Types
Structured vs. Unstructured Data
Structured Data • Represented as text • Transactional data, formal reports, accounting records of sales and costs • Relational databases / data warehouse • SQL
Unstructured Data • May be textual or non-textual • Mobile usage, click stream activity, social media responses, genomic data • No structured database / data lake • NoSQL (Not only SQL), SQL Batch Queries
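The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sales rows, the social media posts, and all field names are invented sample data, not figures from the presentation. Structured rows can be aggregated directly, while free text has to have structure extracted from it first.

```python
import re
from collections import Counter

# Structured data: every record has the same fixed fields,
# like rows in a relational table (sample values are made up).
sales_rows = [
    {"sku": "A100", "units": 3, "price": 9.50},
    {"sku": "B200", "units": 1, "price": 20.00},
    {"sku": "A100", "units": 2, "price": 9.50},
]
# Aggregation is direct, much like SELECT SUM(units * price) in SQL.
revenue = sum(row["units"] * row["price"] for row in sales_rows)

# Unstructured data: free text with no fixed fields (made-up posts).
posts = [
    "Love the new phone! #happy",
    "Battery died again... #angry #battery",
    "Great camera #happy",
]
# Structure (here, hashtags) must be extracted before any analysis.
hashtags = Counter(
    tag.lower() for post in posts for tag in re.findall(r"#(\w+)", post)
)

print(revenue)
print(hashtags.most_common())
```

The extraction step is where most of the extra work with unstructured data goes; a regular expression is the simplest possible stand-in for it.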
Illustrative Example • Inventory Analyst • Insurance Actuary
Interpretations • Structured Data • Big Data Analytics
Users of Big Data • Device manufacturers, ERP providers, and consulting firms comprise 7 of the top 10 users of Big Data • Based on a survey of large corporations conducted by Dell in 2014… • 55% now follow a Big Data strategy • 60% of Big Data projects involve a cloud • 32% involve real-time or near real-time processing • 22% use a data lake • 20% of projects are carried out by outside consultants
Hadoop • Free, Java-based programming framework • Provides distributed storage and processing of large data sets • Started from a Google File System paper published in October 2003 • Development was furthered by Apache • Named after Doug Cutting’s son’s toy elephant (logo!)
When to Use (and Not Use) Hadoop
YES! • Analytics • Search • Data Retention • Log File Processing • Analysis of Text, Image, Audio, and Video Content • Recommendation systems like in E-Commerce Websites
NO! • Low-latency or near real-time data access • Large number of small files to process • Multiple write scenarios requiring arbitrary writes between files
Hadoop Framework • Hadoop Common: Contains the libraries and utilities used by the other modules • Hadoop Distributed File System (HDFS): Storage with high bandwidth • Hadoop YARN: Resource-management platform • Hadoop MapReduce: Programming model for data processing
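The MapReduce programming model named above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's Java API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are assumptions made for illustration. On a real cluster the map and reduce steps would run in parallel across machines, with the framework handling the shuffle.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Made-up log lines, echoing the log file processing use case.
log_lines = ["error disk full", "warning disk slow", "error network down"]
counts = word_count(log_lines)
print(counts)
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, neither needs to see the whole data set, which is what lets the model scale across many nodes.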
MongoDB = “The database for giant ideas” • Cross-platform, document-oriented database • Open-source • Development began in 2007; written to handle specific problems encountered at DoubleClick • Classified as a NoSQL database
MongoDB Example Also, we can practice! http://www.w3resource.com/mongodb-exercises/#PracticeOnline
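To make "document-oriented" concrete without needing a running server, here is a small sketch in plain Python. It does not use MongoDB or its drivers: the `customers` list, its field names, and the `find` helper are all hypothetical stand-ins. It shows only the core idea that records are JSON-like documents and that documents in the same collection need not share the same fields.

```python
# A "collection" of JSON-like documents (made-up sample data).
# Documents may nest values and may carry different fields.
customers = [
    {"_id": 1, "name": "Ana", "city": "Austin",
     "orders": [{"sku": "A100", "qty": 2}]},
    {"_id": 2, "name": "Ben", "city": "Boston"},
    {"_id": 3, "name": "Cy", "city": "Austin",
     "loyalty": {"tier": "gold"}},
]

def find(collection, query):
    """Hypothetical helper: return the documents whose top-level
    fields equal every key/value pair in the query document."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in query.items())
    ]

# The query is itself a document describing the fields to match.
austin = find(customers, {"city": "Austin"})
names = [doc["name"] for doc in austin]
print(names)
```

With a real MongoDB server and the PyMongo driver, the comparable call would be along the lines of `collection.find({"city": "Austin"})`, where the query is likewise expressed as a document.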