Big Data Distilled: Separating the hype from reality • Mike King, Technical Fellow, FedEx Services • November 8, 2012 • Midsouth DAMA
What is Big Data? • Applying analytics to construct a model to predict an outcome where two or more dimension([VC])s exist AND your existing solutions can’t solve it. • The dimensions - 4 V’s, 1 C • Volume • Velocity • Variety • Variability • Complexity
The Market • Growing fast • Lots of players • Small and nimble • Large • Changing fast • Hype • Contenders and pretenders • Commercials are deceiving
Why do we need it? • Competitive Intelligence • Joining dissimilar data • Linking data • Adding context to data • Discovery • Diapers • Pregnancy • To supplement our BI/DW • Table stakes
Use Cases? • Customer analysis • Sentiment • Defection • Cannibalization • Cross selling • Network analysis • M2M • Fraud detection • Risk management • Text analytics • Social media analytics • Log analysis
Apache Hadoop • Batch • Open Source • Components • HDFS • DB • HBase • Cassandra • Map/Reduce • Hive • Pig • Mahout • Chukwa • Avro • ZooKeeper
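To make the Map/Reduce bullet concrete, here is a minimal sketch of the map, shuffle, and reduce phases as a word count in plain Python. It runs in a single process and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample log lines are illustrative assumptions, not part of the original deck.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical log lines, used only to exercise the sketch.
    logs = ["error disk full", "warn disk slow", "error net down"]
    print(word_count(logs))
```

In a real cluster the map and reduce steps run in parallel on many nodes and the framework performs the shuffle; the batch nature noted on the slide comes from each job reading its full input before results are available.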
Solutions • Which stack/distribution? • Varying components • Apples & oranges • Types • Partial • Overlapping • Complementary • Substitute • Fast pace of change • Flux of partnerships
Dealing with vendors, choices • Decide what your requirements are • Don’t let them tell you what you need • Beware bait and switch • Extras • Some are looking to sell • Professional Services • Other Software • All solutions are incomplete • Many solutions are lacking • Multiple…is one enough? • Switching is possible • Low cost? • Beware • Proprietary components • Solutions that have already been fixed….Apache nn • Hammer and nail
My Big Data Vendors • MapR • Kaggle • Karmasphere • Hadapt • Datameer • Lucid Works • 1010data? • Splunk • SAS • IBM • Oracle • Hortonworks • Cloudera • EMC • Teradata • Amazon • Microsoft • HP
Not My Big Data Vendors • Pentaho • Palantir • Kalido • Composite • Couchbase • MarkLogic • StoredIQ • Syncsort • DataStax? • IBI • Informatica • SAP • 10gen • Talend • Denodo • Tableau • Tibco • ParAccel
What’s missing? • Collaboration • Directory, dictionary, metadata • Context • Relevance, value • DQ • Search • Security • Performance • Monitoring • Management tools • Governance • Backup
Counterintuitive & Anti-dogma Notions • Size matters • But not unitarily • Smaller is better • Sampling • Quality matters • GIGO • All data must have structure to be consumed • There is no unstructured data!
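One way to act on the "Smaller is better" and "Sampling" bullets is to draw a fixed-size uniform sample from a source too large to hold in memory. The sketch below uses reservoir sampling (Algorithm R); the slide does not name a specific technique, so this choice, the function name `reservoir_sample`, and its parameters are assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Uses Algorithm R: the stream is read once and only k items are kept
    in memory, so the total size does not need to be known in advance.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Hypothetical stream of one million record ids.
    print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

A model built on a well-drawn sample is often cheaper to iterate on than one built on the full data set, which is the point of the "size matters, but not unitarily" bullet.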
Myths • You don’t need a DBA • Schemaless • B.D. is just for unstructured data • Your unstructured data has lots of value • It’s separate from your other BI stuff incl. • OLAP • DW • Datamarts • Analytics • NoSQL
Prerequisites • Many varied skill sets are needed • DBA • Sysadmin • BI analytics • Math (statistics) • Programming • Reading • Training • Scope
Training options • Read books • Add some blogs to your feeds • Follow some of the right people on Twitter • Search #bigdata #nosql #datascience …. • Online training • Big Data University (free) • EMC, Hortonworks, Cloudera, Karmasphere • Tutorials • Conferences • Get a degree • NC State, Stanford, Northwestern, Syracuse, UCSD
Suggestions • Start small • Conduct triage on your possible sources • It should be integrated w/ the DW • Silos are bad….think spreadmarts • Grow your own Data Scientists • Move disparate LOB analysts into a single org • Train and cross train • Limit the BD user population • Design is still required • Mind and mine your structured data first • Get more training
Don’t • Make your NoSQL DB the system of record • Put all your data in Hadoop…to start • Ignore open source • Connect your garden variety query tools to Hadoop • Open it up to everyone • Keep data indefinitely • Get heavy handed on security
Other items • Cloud • SIEM • Tools to complement your solution(s) • Which DB(s) to use? • For what? • External tables • NoSQL DBs • Persist MapReduce results in your DB • Storage • Servers • x86 Linux • External data sources
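The bullet about persisting MapReduce results in your database can be sketched as follows: a batch job's aggregated output is loaded into a relational table so ordinary SQL tools query the small result set rather than the cluster. SQLite (from the Python standard library) stands in for whatever database you actually use; the table name `word_counts`, the helper names, and the sample counts are hypothetical.

```python
import sqlite3


def persist_counts(conn, counts):
    """Store aggregated (key, count) results in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    # INSERT OR REPLACE lets a rerun of the batch job overwrite old totals.
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (word, n) VALUES (?, ?)",
        counts.items(),
    )
    conn.commit()


def top_words(conn, limit=3):
    """Query the persisted results with plain SQL."""
    return conn.execute(
        "SELECT word, n FROM word_counts ORDER BY n DESC, word ASC LIMIT ?",
        (limit,),
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical output of a batch aggregation job.
    persist_counts(conn, {"disk": 2, "error": 2, "net": 1})
    print(top_words(conn))
```

This keeps everyday query tools pointed at the database instead of at Hadoop itself, which fits the deck's advice not to connect garden-variety query tools directly to the cluster.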
Trends • March 2012 article by Munish Gupta • SaaS for analytics • Crowdsourcing • Data analysis libraries • NoSQL market shakeup • Additionally from the article • RDBMSs will not make a comeback • Other • More diverse sources • More data • More jobs • More choices, solutions, products, services, etc… • Query tools - yek
Links of interest • http://wikibon.org/wiki/v/Enterprise_Big-data • My diigo bookmarks on Big Data • http://www.diigo.com/user/morpheus/bigdata 266 • Curt Monash’s Blog … http://www.dbms2.com • http://www.keithrozario.com/2012/07/opensource-gold-the-greatest-crowdsourcing-story-ever-told.html • http://www.analyticbridge.com/ • http://gigaom.com/data/ • This deck • http://92lobos.wikispaces.com/file/detail/Big+Data+Distilled.pptx • Future B.D. items • http://92lobos.wikispaces.com/bigdata
Contact • Mike.King@fedex.com • mikeking60@gmail.com • @redleg60 • Feel free to drop me a note with any questions