Big Data’s Virtualization Journey
Andrew Yu, Sr. Director, Big Data R&D, VMware
Big Data: Not Just for the Web Giants – Now the Intelligent Enterprise
Real-time analysis allows instant understanding of market dynamics. Retailers can gain an intimate understanding of their customers' needs and use direct, targeted marketing.
• Market Segment Analysis
• Personalized Customer Targeting
The Emerging Pattern of Big Data Systems: Retail Example
• Analytics
• Real-Time Processing
• Data Science
• Machine Learning
• Real-Time Streams
• Parallel Data Processing
• Exa-scale Data Store
• Cloud Infrastructure
A single GE jet engine produces 10 terabytes of data in one hour, or about 90 petabytes per year. This data enables early detection of faults and common-mode failures, and provides product engineering feedback.
• Post Mortem
• Proactively Maintained
• Connected Product
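The yearly figure can be checked with simple arithmetic. The sketch below assumes the engine runs continuously and that decimal units apply (1 PB = 1000 TB); neither assumption is stated on the slide, and the `duty_cycle` parameter is a hypothetical knob added for illustration.

```python
# Rough check of the "10 TB per hour, about 90 PB per year" figure.
# Assumptions (not from the slide): continuous operation, decimal units.
TB_PER_HOUR = 10
HOURS_PER_YEAR = 24 * 365
TB_PER_PB = 1000


def yearly_petabytes(tb_per_hour, duty_cycle=1.0):
    """Return petabytes produced per year for a given hourly rate.

    duty_cycle is the fraction of the year the engine is running
    (hypothetical parameter, 1.0 means it never stops).
    """
    return tb_per_hour * HOURS_PER_YEAR * duty_cycle / TB_PER_PB


if __name__ == "__main__":
    print(yearly_petabytes(TB_PER_HOUR))  # 87.6
```

Under these assumptions the result is 87.6 PB per year, which rounds to the 90 PB quoted on the slide.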
Storage: Plan for Peta-scale Data Storage and Processing
• Analytics Rapidly Outgrows Traditional Data Size by 100x
(Chart axis: PB of Data)
Cloud Infrastructure Supports Mixed Big Data Workloads
• Workloads: Machine Learning, Real-Time Analytics, Hadoop
• Cloud Infrastructure: Compute, Storage/Availability, Network/Security, Management
Cloud Infrastructure Supports Multiple Tenants
• Tenants: Web User Analytics, Historical Customer Behavior, Financial Analysis
• Cloud Infrastructure: Compute, Storage/Availability, Network/Security, Management
Software-defined Datacenter: Compute
The Core Values of Virtualization Apply to Big Data
• Agility / rapid deployment
• Isolation for resource control and security
• Lower capex
• Operational efficiency
(Diagram: Compute, Storage/Availability, Network/Security, Management)
Strong Isolation between Workloads is Key
• Hungry Workload 1
• Reckless Workload 2
• Nosy Workload 3
(Diagram: workloads running on Cloud Infrastructure)
Consolidation of Workloads: Higher Utilization
(Diagram: Hadoop 1, Hadoop 2, HBase)
• Without virtualization
  • Independent Hadoop clusters each have access to only a fraction of the total physical resources
• Consolidate and virtualize
  • The consolidated cluster has access to the entire pool of physical resources
  • For common use cases, latency on priority jobs is reduced on the consolidated cluster
  • Multiple HDFS instances are striped across all physical hosts
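The latency argument can be illustrated with a toy model. This is a minimal sketch, assuming a perfectly parallel job, idle neighbouring workloads, and made-up numbers (`TOTAL_SLOTS`, `CLUSTERS`, and `WORK_SLOT_HOURS` are hypothetical values, not taken from the slides).

```python
# Toy model: a priority job on an isolated cluster versus a consolidated pool.
# All numbers are hypothetical and chosen only to make the arithmetic visible.
TOTAL_SLOTS = 90        # task slots across all physical hosts
CLUSTERS = 3            # e.g. Hadoop 1, Hadoop 2, HBase
WORK_SLOT_HOURS = 180   # total work in the priority job


def completion_hours(work_slot_hours, slots_available):
    """Time to finish a perfectly parallel job on the given number of slots."""
    return work_slot_hours / slots_available


# Without virtualization: the job only sees its own cluster's share.
isolated = completion_hours(WORK_SLOT_HOURS, TOTAL_SLOTS // CLUSTERS)

# Consolidated: the job can use the whole pool while the other workloads idle.
consolidated = completion_hours(WORK_SLOT_HOURS, TOTAL_SLOTS)

if __name__ == "__main__":
    print(isolated, consolidated)  # 6.0 2.0
```

In practice the gain depends on how busy the other workloads are and on the scheduler's priority policy; the model only shows the upper bound when the rest of the pool is idle.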
Big Data Mix of Workloads
• Compute layer
  • Hadoop batch analysis
  • HBase real-time queries
  • NoSQL: Cassandra, Mongo, etc.
  • Big SQL: Impala, Pivotal HAWQ
  • Other: Spark, Shark, Solr, Platfora, etc.
• File System/Data Store
• Virtualization
• Hosts
Software-defined Datacenter: Storage
Requirements of Next Generation Storage
• 10x lower cost of storage
• Support a variety of application types
• Handle explosive data growth
• Solve the privacy and security issues
(Diagram: Compute, Storage/Availability, Network/Security, Management)
Software-defined Storage Enables Fundamental Economics
• Traditional SAN/NAS
• Scale-out NAS: Isilon, NTAP
• Distributed Object Storage: HDFS, MapR, Ceph
(Chart axis: Petabytes Deployed)
Big Data Using Local Disks
• Servers with local disks
  • 16-24 core server
  • 12-24 SATA disks, 2-4 TB each
  • 10 GbE adapter
• High-performance 10 GbE switch per rack
• iSCSI/NFS shared storage for vMotion, etc.
(Diagram: hosts connected to a top-of-rack switch)
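The disk counts above translate into per-server capacity as follows. This is a rough sizing sketch: the 3x replication factor is an assumption (it is the usual HDFS default, but the slides do not state it), and the 1 PB target is a hypothetical example.

```python
import math


def raw_tb(disks, tb_per_disk):
    """Raw capacity of one server in terabytes."""
    return disks * tb_per_disk


def servers_needed(target_usable_tb, disks, tb_per_disk, replication=3):
    """Servers required to hold target_usable_tb of data.

    replication=3 is an assumption, not a figure from the slides.
    """
    usable_per_server = raw_tb(disks, tb_per_disk) / replication
    return math.ceil(target_usable_tb / usable_per_server)


if __name__ == "__main__":
    print(raw_tb(12, 2), raw_tb(24, 4))   # 24 96 (TB raw per server)
    print(servers_needed(1000, 12, 2))    # 125 servers for 1 PB usable
    print(servers_needed(1000, 24, 4))    # 32 servers for 1 PB usable
```

The estimate ignores space reserved for temporary data and the operating system, so real deployments would need some headroom on top of these counts.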
Big Data Storage
Scale-out Network Storage
• Hadoop protocol
• Snapshots
• POSIX apps
• Full NFS access
• Replication
• Erasure coding
(Diagram: Elastic Compute, Scale-out Network Storage)
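Replication and erasure coding differ mainly in how much raw capacity they consume per byte stored. The comparison below is a sketch under assumed parameters: the slides do not name a replication factor or an erasure-coding layout, so the 3 copies and the (10, 4) and (6, 3) layouts are illustrative choices only.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data with full replication."""
    return float(copies)


def erasure_overhead(data_blocks, parity_blocks):
    """Raw bytes stored per byte of user data with a (data, parity) code."""
    return (data_blocks + parity_blocks) / data_blocks


if __name__ == "__main__":
    # Illustrative layouts, not taken from the slides.
    print(replication_overhead(3))   # 3.0
    print(erasure_overhead(10, 4))   # 1.4
    print(erasure_overhead(6, 3))    # 1.5
```

Lower overhead is the appeal of erasure coding; the trade-off is extra computation and network traffic when a lost block has to be reconstructed.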
Customer Success: Hadoop as a Service at FedEx
• Elastic vSphere Cluster
• Mixed Workloads
• vSphere
• Existing Rack Mount Servers
• Scale-out Isilon Cluster
• Shared Data
• NAS + Hadoop
Storage Configuration for Data/Compute Separation with Isilon
(Diagram labels: Hadoop Virtual Nodes 1-3 on a virtualization host; Job-tracker and Task-trackers; Ext4 file systems on VMDKs; OS image VMDKs; NN and data node on Isilon; temp data on shared SAN/NAS storage)
Breakthrough Use Cases
• Web Log Analysis
  • Initial exploration focused on detecting mobile devices accessing the website.
  • Analysis of 570 billion web server log entries took approximately 9 minutes to complete on a small cluster.
• ZIP Code Analysis
  • Analysis of data to determine which ZIP codes are the highest source or destination for shipments.
• Shipment Analysis
  • Analysis of shipment information to determine patterns that may delay a package.
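The first two use cases can be sketched in a few lines of single-process Python. This is only an illustration of the logic, not the Hadoop job that was actually run: the log layout (user agent as the last quoted field, as in the common "combined" web server format), the `MOBILE_PATTERN` keywords, and the function names are all assumptions.

```python
import re
from collections import Counter

# Hypothetical keyword list; real device detection is usually more thorough.
MOBILE_PATTERN = re.compile(r"Mobile|Android|iPhone|iPad", re.IGNORECASE)


def is_mobile(user_agent):
    """Return True if the user-agent string looks like a mobile device."""
    return bool(MOBILE_PATTERN.search(user_agent))


def count_mobile(log_lines):
    """Count mobile versus other requests.

    Assumes the user agent is the last double-quoted field on each line.
    Lines without any quoted field are skipped.
    """
    counts = Counter()
    for line in log_lines:
        quoted = re.findall(r'"([^"]*)"', line)
        if not quoted:
            continue
        counts["mobile" if is_mobile(quoted[-1]) else "other"] += 1
    return counts


def top_zip_codes(shipments, n=3):
    """Rank ZIP codes by how often they appear as origin or destination.

    shipments is an iterable of (origin_zip, destination_zip) pairs.
    """
    counts = Counter()
    for origin, destination in shipments:
        counts[origin] += 1
        counts[destination] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample_log = [
        '1.2.3.4 - - [01/Jan/2014:00:00:01 +0000] "GET / HTTP/1.1" 200 512 '
        '"-" "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0 like Mac OS X)"',
        '5.6.7.8 - - [01/Jan/2014:00:00:02 +0000] "GET /track HTTP/1.1" 200 1024 '
        '"-" "Mozilla/5.0 (Windows NT 6.1; WOW64)"',
    ]
    print(count_mobile(sample_log))
    print(top_zip_codes([("38101", "10001"), ("38101", "94105")], n=1))
```

On a cluster, the same per-line classification and per-key counting map naturally onto a map step and a reduce step, which is what makes this kind of analysis scale to billions of log entries.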
Cloud Infrastructure is Ready for Big Data – Are You?