
Big Data Distilled Separating the hype from reality

Big Data Distilled: Separating the hype from reality. Mike King, Technical Fellow, FedEx Services. November 8, 2012, Midsouth DAMA.


Presentation Transcript


  1. Big Data Distilled: Separating the hype from reality • Mike King, Technical Fellow, FedEx Services • November 8, 2012 • Midsouth DAMA

  2. What is Big Data? • Applying analytics to construct a model to predict an outcome where two or more of the dimensions (the V’s and the C below) exist AND your existing solutions can’t solve it. • The dimensions - 4 V’s, 1 C • Volume • Velocity • Variety • Variability • Complexity

  3. The Market • Growing fast • Lots of players • Small and nimble • Large • Changing fast • Hype • Contenders and pretenders • Commercials are deceiving

  4. Marketshare

  5. Why do we need it? • Competitive Intelligence • Joining dissimilar data • Linking data • Adding context to data • Discovery • Diapers • Pregnancy • To supplement our BI/DW • Table stakes

  6. Use Cases? • Customer analysis • Sentiment • Defection • Cannibalization • Cross selling • Network analysis • M2M • Fraud detection • Risk management • Text analytics • Social media analytics • Log analysis

  7. Apache Hadoop • Batch • Open Source • Components • HDFS • DB • HBase • Cassandra • Map/Reduce • Hive • Pig • Mahout • Chukwa • Avro • ZooKeeper
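
Slide 7 lists Map/Reduce as a Hadoop component without showing what the pattern does. The following is a minimal sketch of the idea in plain Python, applied to log lines (one of the use cases on slide 6). It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for illustration, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key (a framework does this between phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three steps in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for real log lines.
    logs = ["error disk full", "warn disk slow", "error net down"]
    print(word_count(logs))
```

In a real cluster the map and reduce steps would run in parallel on many nodes and the grouping step would move data between them; the sketch only shows the shape of the computation.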

  8. Solutions • Which stack/distribution? • Varying components • Apples & oranges • Types • Partial • Overlapping • Complementary • Substitute • Fast pace of change • Flux of partnerships

  9. Dealing with vendors, choices • Decide what your requirements are • Don’t let them tell you what you need • Beware bait and switch • Extras • Some are looking to sell • Professional Services • Other Software • All solutions are incomplete • Many solutions are lacking • Multiple…is one enough? • Switching is possible • Low cost? • Beware • Proprietary components • Solutions that have already been fixed….Apache nn • Hammer and nail

  10. My Big Data Vendors • MapR • Kaggle • Karmasphere • Hadapt • Datameer • Lucid Works • 1010data? • Splunk • SAS • IBM • Oracle • Hortonworks • Cloudera • EMC • Teradata • Amazon • Microsoft • HP

  11. Not My Big Data Vendors • Pentaho • Palantir • Kalido • Composite • Couchbase • MarkLogic • StoredIQ • Syncsort • DataStax? • IBI • Informatica • SAP • 10gen • Talend • Denodo • Tableau • TIBCO • ParAccel

  12. What’s missing? • Collaboration • Directory, dictionary, metadata • Context • Relevance, value • DQ • Search • Security • Performance • Monitoring • Management tools • Governance • Backup

  13. Counterintuitive & Anti-dogma Notions • Size matters • But not unitarily • Smaller is better • Sampling • Quality matters • GIGO • All data must have structure to be consumed • There is no unstructured data!
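
Slide 13 argues that a smaller, sampled data set can be better than the full volume. The slide does not name a sampling technique, so the following is an assumption for illustration: reservoir sampling (Algorithm R) is one common way to draw a fixed-size uniform sample from a stream too large to hold in memory. The function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Only k items are held in memory at a time, so the stream can be
    arbitrarily long. If the stream has fewer than k items, all of them
    are returned.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with
            # probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Hypothetical input: one million record ids, sampled down to ten.
    print(reservoir_sample(range(1_000_000), 10, seed=42))
```

A model can then be built and tested on the sample first, before deciding whether the full data set is worth processing.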

  14. Myths • You don’t need a DBA • Schemaless • B.D. is just for unstructured data • Your unstructured data has lots of value • It’s separate from your other BI stuff incl. • OLAP • DW • Datamarts • Analytics • NoSQL

  15. Prerequisites • Many varied skill sets are needed • DBA • Sysadmin • BI analytics • Math (statistics) • Programming • Reading • Training • Scope

  16. Training options • Read books • Add some blogs to your feeds • Follow some of the right people on Twitter • Search #bigdata #nosql #datascience …. • Online training • Big Data University (free) • EMC, Hortonworks, Cloudera, Karmasphere • Tutorials • Conferences • Get a degree • NC State, Stanford, Northwestern, Syracuse, UCSD

  17. Suggestions • Start small • Conduct triage on your possible sources • It should be integrated w/ the DW • Silos are bad… think spreadmarts • Grow your own Data Scientists • Move disparate LOB analysts into a single org • Train and cross train • Limit the BD user population • Design is still required • Mind and mine your structured data first • Get more training

  18. Don’t • Make your NoSQL DB the system of record • Put all your data in Hadoop… to start • Ignore open source • Connect your garden variety query tools to Hadoop • Open it up to everyone • Keep data indefinitely • Get heavy handed on security

  19. Other items • Cloud • SIEM • Tools to complement your solution(s) • Which db(s) to use? • For what? • External tables • NoSQL DBs • Persist Map/Reduce results in your db • Storage • Servers • x86 Linux • External data sources

  20. Trends • March 2012 article by Munish Gupta • SaaS for analytics • Crowdsourcing • Data analysis libraries • NoSQL market shakeup • Additionally from the article • RDBMSs will not make a comeback • Other • More diverse sources • More data • More jobs • More choices, solutions, products, services, etc… • Query tools - yek

  21. Links of interest • http://wikibon.org/wiki/v/Enterprise_Big-data • My diigo bookmarks on Big Data • http://www.diigo.com/user/morpheus/bigdata 266 • Curt Monash’s Blog … http://www.dbms2.com • http://www.keithrozario.com/2012/07/opensource-gold-the-greatest-crowdsourcing-story-ever-told.html • http://www.analyticbridge.com/ • http://gigaom.com/data/ • This deck • http://92lobos.wikispaces.com/file/detail/Big+Data+Distilled.pptx • Future B.D. items • http://92lobos.wikispaces.com/bigdata

  22. Contact • Mike.King@fedex.com • mikeking60@gmail.com • @redleg60 • Feel free to drop me a note with any questions
