
Hamid Djam Principal Architect Business Intelligence & Analytics

Business Intelligence & Big Data Analytics. Hamid Djam Principal Architect Business Intelligence & Analytics.


Presentation Transcript


  1. Business Intelligence & Big Data Analytics Hamid Djam Principal Architect Business Intelligence & Analytics

  2. Disclaimer • EMC makes no representation and undertakes no obligations with regard to product planning information, anticipated product characteristics, performance specifications, or anticipated release dates (collectively, “Roadmap Information”). Roadmap Information is provided by EMC as an accommodation to the recipient solely for purposes of discussion and without intending to be bound thereby. Roadmap Information is EMC Restricted Confidential and is provided under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.

  3. Why A Complete Big Data Analytics Stack Matters • Big Data is the new source for economic value • The clearest path to competitive advantage • The ultimate manifestation of fact-based decision making • The net new catalyst for business innovation and workplace evolution • The driving force of a new computing paradigm: data computing

  4. New Realities: Your Data Rules the World

  5. Challenges in Today’s DW Environments… Traditional solutions cannot meet new challenges • Critical business insight is outside the enterprise data warehouse because traditional DW solutions cannot absorb data fast enough • 100s of data marts • ‘Shadow’ databases • Data is everywhere and growing • 44x data growth by 2020 • The enterprise data warehouse holds only 10% of data • Data marts and ‘personal databases’ (e.g. Access, Excel) make up 90% of corporate data • Source: IDC Digital Universe Study, sponsored by EMC, May 2010

  6. DW Challenges Resolved With BI as a Service • Long project duration • Gap in understanding business requirements • Business creating their own data marts • Inconsistent data between IT systems and business systems • BUSINESS: Speed, Agility, Flexibility, Change, Short term • IT: Stability, Security, Control, Standards, Long term • Reference: Nine Secrets to Building an Agile, Adaptable BI Environment, TDWI

  7. EMC IT: Offering IT-as-a-Service • Desktop-as-a-service: Virtual Desktops, Client Devices • Software-as-a-service (Enterprise Applications): MDM, CRM, Governance/risk/compliance, Apps, ERP, Business intelligence, Security, Info. Lifecycle Mgmt, Ent. Content Mgmt • Platform-as-a-service (Application Platforms): Integration, Web server, Application Server, Runtime environments, Development tools, App. frameworks • Database Platform: Greenplum, SQL Server, Oracle, … • Infrastructure-as-a-service (vBlock Infrastructure): Network, Compute, Storage & backup

  8. Information Management Core Disciplines • Data Integration: guarantees data availability where and when it is required; movement and transformation of enterprise information; interconnectivity of IT portfolio; standardized formats and service interfaces (SOA) • Master Data Management: identification and deduplication of shared master data; cross-referencing and disambiguation; hierarchy management; data governance framework and stewardship processes • Content Management: unstructured data storage and management; workflow-based publishing & versioning services; tie-in to enterprise portal and user identity / security strategies • Data Governance: framework and organization to ensure management of data as a strategic corporate asset; data stewardship; policies and procedures, monitoring and measuring • Business Intelligence: data warehouse methodology, from envisioning to deployment; business use-case- or function-specific data marts / reporting solutions; moving with agility from reactive to predictive capability • Information Quality: assurance that trustworthy data is accessible at time of demand; standardization & cleansing; business data rule enforcement; stale data refresh; augmentation from external sources

  9. Building The Industry’s Only Complete Big Data Analytics “Stack” • Analytic Toolsets (Business Analytics, BI, Statistics, etc.) • Greenplum Chorus: Enterprise Collaboration Platform for Data • Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics • Greenplum HD Hadoop, Enterprise & Community Editions: Enterprise Analytics Platform for Unstructured Data • Greenplum Database, Enterprise & Community Editions: World’s Most Scalable MPP Database Platform

  10. GREENPLUM DATABASE Industry-Leading Massively Parallel Processing (MPP) Performance

  11. Building The Industry’s Only Complete Big Data Analytics “Stack” • Analytic Toolsets (Business Analytics, BI, Statistics, etc.) • Greenplum Chorus: Enterprise Collaboration Platform for Data • Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics • Greenplum HD Hadoop, Enterprise & Community Editions: Enterprise Analytics Platform for Unstructured Data • Greenplum Database, Enterprise & Community Editions: World’s Most Scalable MPP Database Platform

  12. EMC Greenplum Database Is Purpose-built for Big Data • EMC Greenplum is a shared-nothing, massively parallel processing (MPP) data warehouse system • Core principle of data computing is to move the processing dramatically closer to the data and to the people • Fast Data Loading • Extreme Performance & Elastic Scalability • Unified Data Access

  13. Massively Parallel Processing And Linear Performance Scalability • Greenplum 4.0: Database Architecture • SQL • MapReduce • Master Servers: query planning & dispatch • Network Interconnect • Segment Servers: query processing & data storage • External Sources: loading, streaming, etc.
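
The division of labor on this slide (a master that plans and dispatches, segments that store and process their own slice of the data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the segment count, the `crc32` hash, and every function name below are hypothetical stand-ins, not Greenplum's actual distribution function or planner.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical segment count, chosen only for illustration


def segment_for(key: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a distribution key to a segment using a stable hash (stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def load(rows, num_segments: int = NUM_SEGMENTS):
    """Scatter rows across segments; each segment keeps only its own slice."""
    segments = defaultdict(list)
    for customer, amount in rows:
        segments[segment_for(customer, num_segments)].append((customer, amount))
    return segments


def segment_sum(rows):
    """Work done independently on one segment: a per-customer partial sum."""
    partial = defaultdict(float)
    for customer, amount in rows:
        partial[customer] += amount
    return partial


def master_query(segments):
    """Master role: run the same step on every segment, then gather results."""
    result = {}
    for seg_rows in segments.values():
        # A customer hashes to exactly one segment, so keys never collide here.
        result.update(segment_sum(seg_rows))
    return result


rows = [("acme", 10.0), ("globex", 5.0), ("acme", 2.5), ("initech", 7.0)]
totals = master_query(load(rows))
```

Because rows with the same key always land on the same segment, each segment can aggregate without talking to the others, which is the shared-nothing property the deck refers to.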

  14. Platform Independence Delivers Choice and Flexibility • Data Computing Appliance: optimized price/performance; minimum time-to-value; ideal for production environments • Software-Only: on your x86 hardware; flexibility for any workload; ideal for Q/A or DR • Virtualized Infrastructure: pool resources; elastic scalability; ideal for test & development

  15. Mature Enterprise Platform • CLIENT ACCESS & TOOLS • Client Access: ODBC, JDBC, OLEDB, etc. • 3rd Party Tools: BI tools, ETL tools, data mining, etc. • Admin Tools: GP Performance Monitor, pgAdmin3 for GPDB • PRODUCT FEATURES • Loading & Ext. Access: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access • Storage & Data Access: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes (Btree, Bitmap, etc.) • Language Support: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics • GPDB ADAPTIVE SERVICES: Multi-Level Fault Tolerance, Online System Expansion, Workload Management • CORE MPP ARCHITECTURE: Shared-Nothing MPP, Parallel Query Optimizer, Polymorphic Data Storage™, Parallel Dataflow Engine, gNet™ Software Interconnect, MPP Scatter/Gather Streaming™

  16. EMC GREENPLUM HD Delivering Enterprise-Ready Apache Hadoop

  17. Building The Industry’s Only Complete Big Data Analytics “Stack” • Analytic Toolsets (Business Analytics, BI, Statistics, etc.) • Greenplum Chorus: Enterprise Collaboration Platform for Data • Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics • Greenplum HD Hadoop, Enterprise & Community Editions: Enterprise Analytics Platform for Unstructured Data • Greenplum Database, Enterprise & Community Editions: World’s Most Scalable MPP Database Platform

  18. Greenplum HD – Enterprise Ready Hadoop Platform for Unstructured Data • Greenplum Hadoop is faster, more dependable, and easier to use • Faster to address the growth of unstructured data • EMC reliable for the Enterprise • Easier to use with existing systems and tools

  19. Why Hadoop? • With the massive growth of unstructured data, the open-source software Apache Hadoop has quickly become an important new data platform and technology • We've seen this first-hand with customers deploying Hadoop alongside Greenplum databases

  20. Why EMC Greenplum HD? • EMC has the technical depth, expertise and critical mass in building the scalable and reliable distributed data processing systems necessary to drive technical innovation into Hadoop • Hadoop needs to become “mission critical” and “easier to use and manage” • HDFS optimizations, workload management, job scheduling, systems management, etc. • Fault tolerance: eliminate single points of failure (SPOF) for the NameNode, JobTracker and other key components underlying Hadoop

  21. Greenplum HD: Hadoop Software Distributions • Introducing Greenplum HD, enterprise-ready Apache Hadoop software distributions • Community Edition software: 100% open source • Enterprise Edition software: advanced features; 100% API compatible

  22. Greenplum HD Data Computing Appliance • Introducing the world’s first high-performance, purpose-built, data co-processing Hadoop appliance • Combining Hadoop and Greenplum Database in one appliance

  23. THE ANSWER MACHINE. DATA IN. DECISIONS OUT. Introducing the Greenplum Data Computing Appliance

  24. Building The Industry’s Only Complete Big Data Analytics “Stack” • Analytic Toolsets (Business Analytics, BI, Statistics, etc.) • Greenplum Chorus: Enterprise Collaboration Platform for Data • Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics • Greenplum HD Hadoop, Enterprise & Community Editions: Enterprise Analytics Platform for Unstructured Data • Greenplum Database, Enterprise & Community Editions: World’s Most Scalable MPP Database Platform

  25. Key Architectural Principles • Keep it simple • Build on standard hardware components • Performance comes from our software architecture • Best of breed x86 and Ethernet networking technologies • Benefit from broad ecosystem innovation • Make it modular for easy scaling • SAN connectivity designed in • Focus on Data Computing, not Data Warehousing • Greenplum Database • SAS Analytics • Hadoop

  26. DCA Functional Components • Administrative Switch • Free Functional Block • 8 Segment Servers • Free Functional Block • 2 10GE Switches • 2 GPDB Master Servers • Free Functional Block • Free Functional Block • 4 GPDB Segment Servers

  27. Scale to Multiple Racks In Granular Quarter Rack Increments • 1st Rack + Expansion Rack + . . . • Add ¼ rack increments to each rack

  28. High Availability Built-In • Master server data protection: HW RAID protection for drive failures; replicated transaction logs for server failure • On master server failure: standby server activated; administrator alerted • Segment server data protection: HW RAID protection for drive failures; mirrored segments for server failures • On segment server failure: mirrored segments take over with no loss of service; fast online differential recovery

  29. GPDB HA Groups And Segment Mirrors • Diagram labels: GPDB HA Group (four times), Set of Active Segment Instances • Segment Server 1: P1 P2 P3 M6 M8 M10 • Segment Server 2: P4 P5 P6 M1 M9 M11 • Segment Server 3: P7 P8 P9 M2 M4 M12 • Segment Server 4: P10 P11 P12 M3 M5 M7 • The numbers of primary and mirror instances shown above are for illustration purposes only. Each Segment Server in a DCA actually supports a total of 12 instances (6 primaries and 6 mirrors)

  30. DCA Can Sustain Up to Four Server Failures Per Rack, One Per HA Group • Diagram labels: GPDB HA Group (four times), Set of Active Segment Instances • Segment Server 1: P1 P2 P3 M6 M8 M10 • Segment Server 2: P4 P5 P6 M1 M9 M11 • Segment Server 3: P7 P8 P9 M2 M4 M12 • Segment Server 4: P10 P11 P12 M3 M5 M7 • The numbers of primary and mirror instances shown above are for illustration purposes only. Each Segment Server in a DCA actually supports a total of 12 instances (6 primaries and 6 mirrors)
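
The mirror placement in the diagram can be checked mechanically. The sketch below copies the primary/mirror layout from the slide and tests which server failures leave every segment with at least one live copy. It treats the four illustrated servers as a single mirror set, which is an assumption: the transcript does not preserve how the diagram draws HA group boundaries. The function names are hypothetical.

```python
# Layout transcribed from the slide: server number -> instances hosted on it.
LAYOUT = {
    1: ["P1", "P2", "P3", "M6", "M8", "M10"],
    2: ["P4", "P5", "P6", "M1", "M9", "M11"],
    3: ["P7", "P8", "P9", "M2", "M4", "M12"],
    4: ["P10", "P11", "P12", "M3", "M5", "M7"],
}
NUM_SEGMENTS = 12  # P1..P12, each paired with mirror M1..M12


def host_of(instance: str) -> int:
    """Return the server that hosts the given primary or mirror instance."""
    for server, instances in LAYOUT.items():
        if instance in instances:
            return server
    raise KeyError(instance)


def survives(failed_servers: set) -> bool:
    """True if every segment keeps its primary or its mirror on a live server."""
    for n in range(1, NUM_SEGMENTS + 1):
        live_hosts = {host_of(f"P{n}"), host_of(f"M{n}")} - failed_servers
        if not live_hosts:
            return False
    return True


# No mirror shares a server with its own primary, so any single failure is safe.
single_failures_ok = all(survives({server}) for server in LAYOUT)

# Losing servers 1 and 2 together takes out both P1 and M1.
double_failure_ok = survives({1, 2})
```

In this layout any one server can fail without data becoming unavailable, while losing two servers from the same set can remove both copies of a segment, which is consistent with the slide's limit of one failure per HA group.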

  31. EMC Dial-Home and Remote Support Built-In • EMC Premium Support • ESRS secure IP connection enabled for DCA racks • Automatic dial-home for DCA HW and SW failures • 24x7 remote technical support and troubleshooting • Online support triggers FRU parts shipment • Four-hour onsite support objective • Diagram label: EMC Support via FTPS or ESRS

  32. Customer Support Services: EMC Greenplum Warranty and Premium Maintenance • One-Year Limited HW Warranty • Secure Self-Help: 24x7 access to eService support tools including knowledgebase and forums • Remote Technical Support: technical support and remote troubleshooting during normal business hours • Replacement parts shipped for next-business-day arrival • Premium Maintenance • Remote Technical Support: 24x7 technical support and remote troubleshooting; customer-managed case severity level; installation of platform operating system updates • Onsite Support: installation of replacement parts; four-hour response objective • Proactive Service: secure remote monitoring for hardware; notification of engineering technical advisories; built-in tools maximize stability and performance • Secure Self-Help: 24x7 access to eService support tools including knowledgebase, forums, and appropriately licensed software updates

  33. EMC Effect: Rapidly Expanding Portfolio

  34. Data Computing Appliance (DCA) • Purpose-built, highly scalable next generation data warehousing appliance • Architecturally integrates database, compute, storage, and network into an enterprise-class, easy-to-implement system. • Balanced for best price/performance ratio • Available in quarter-, half-, three-quarter-, full-, and multi-rack configurations

  35. High Capacity DCA • Suitable for large-database customers with PB scalability in mind • Triples the data capacity in a rack • Reduced rack space, power, and cooling needs per unit of data • Lowest price-per-unit-data warehouse appliance • Available in quarter-, half-, three-quarter-, full-, and multi-rack configurations

  36. Application Specific Configurations • Database • Hadoop

  37. Seamless Infrastructure Integration • Data Protection • Big Data Loading & Staging • Storage Expansion • Disaster Recovery

  38. Seamless Infrastructure Integration • Isilon Scale-Out Storage for Big Data Staging • EMC Data Domain: Efficient Backup & Restore • EMC VMAX SRDF and EMC Data Domain Replication for Disaster Recovery • EMC VMAX SAN Mirror for Advanced Storage Management

  39. Efficient Backup/Restore with EMC Data Domain • Data Domain deduplication is a great fit for Greenplum datasets • Drastic reduction in backup storage requirement • Back up all segment servers in parallel directly to Data Domain • With Greenplum’s deduplication-friendly compressed data streams, achieve effective backup rates of up to 6 TB/hr
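
As a rough sizing aid, the quoted peak rate translates into a best-case backup window. The helper below is a hypothetical back-of-the-envelope calculation, not an EMC sizing tool; the 36 TB dataset is an invented example, and since the slide says "up to" 6 TB/hr, the result is a lower bound on elapsed time.

```python
def backup_window_hours(dataset_tb: float, rate_tb_per_hr: float = 6.0) -> float:
    """Best-case hours needed to back up dataset_tb at a sustained rate."""
    if dataset_tb < 0 or rate_tb_per_hr <= 0:
        raise ValueError("dataset size must be >= 0 and rate must be > 0")
    return dataset_tb / rate_tb_per_hr


# Hypothetical 36 TB dataset at the slide's quoted peak rate of 6 TB/hr.
hours = backup_window_hours(36.0)
```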

  40. DCA SAN Mirror (H1 2011) • Default DCA configuration has Segment Primaries and Segment Mirrors on internal storage • SAN Mirror offloads Segment Mirrors to SAN storage • Doubles effective capacity of a DCA • Foundation of SAN leverage: seamless off-host backups; data replication • No performance impact: primaries on internal storage; SAN sized for load and failed segment server • Diagram labels: P1 … P96, M1 … M96
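
The "doubles effective capacity" claim follows from simple accounting: by default, internal disk holds both a primary and a mirror copy of every segment, so only half of it holds unique data. A minimal sketch of that arithmetic follows, with a hypothetical `Dca` class and an invented 100 TB figure. It ignores RAID overhead, compression, and the extra SAN headroom the slide mentions for load and failed servers.

```python
from dataclasses import dataclass


@dataclass
class Dca:
    """Toy capacity model; internal_tb is usable internal storage (assumed)."""
    internal_tb: float
    san_mirror: bool = False

    def primary_capacity_tb(self) -> float:
        # Without SAN Mirror, half of internal storage is spent on mirrors.
        return self.internal_tb if self.san_mirror else self.internal_tb / 2

    def san_tb_for_mirrors(self) -> float:
        # With SAN Mirror, the SAN must hold one mirror copy of all primaries.
        return self.primary_capacity_tb() if self.san_mirror else 0.0


default = Dca(internal_tb=100.0)
offloaded = Dca(internal_tb=100.0, san_mirror=True)
ratio = offloaded.primary_capacity_tb() / default.primary_capacity_tb()
```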

  41. GREENPLUM CHORUS The World’s First Enterprise Data Cloud Platform

  42. Building The Industry’s Only Complete Big Data Analytics “Stack” • Analytic Toolsets (Business Analytics, BI, Statistics, etc.) • Greenplum Chorus: Enterprise Collaboration Platform for Data • Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics • Greenplum HD Hadoop, Enterprise & Community Editions: Enterprise Analytics Platform for Unstructured Data • Greenplum Database, Enterprise & Community Editions: World’s Most Scalable MPP Database Platform

  43. Greenplum Chorus • Greenplum’s Enterprise Data Cloud Platform (EDC), enabling: • Self-service provisioning • Data services • Collaborative analytics • Customers deploy Chorus along with VMware and the Greenplum Database to create an agile and self-service analytic infrastructure • Chorus can significantly accelerate the time and ease with which companies extract value and insight from their data

  44. Spin up new projects rapidly with self-service provisioning. • Provision instances, both single-node and multi-node. • Provision sandboxes as new databases or schemas. • Import data easily from anywhere in the cloud.

  45. Data is now discoverable, self-documenting, and shared. • Browse schemas and explore data with powerful search and visualization tools. • Attach documents, ask questions, add comments, and build a living data dictionary. • Define data sets, share them with the team, and schedule imports.

  46. Create a collaborative environment for deep analytics on big data. • Create project workspaces with shared files, data, documentation and workflows. • Execute workflows directly in the sandbox, and then track changes to work and results over time. • Control permissions to protect private data. • Publish functions and documentation, to promote common standards and techniques. • Import functions from libraries of in-database analytics functions. • Collaborate within projects, share information across teams.

  47. THANK YOU
