Federal Big Data Working Group Meetup Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup February 18, 2014
Mission Statement • Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.
Co-organizers • Brand Niemann and Kate Goodier • Kate Goodier, Host: Xcelerate Solutions offices in Tysons Corner: • Capacity about 50, with Skype and WiFi available. The Silver Line Spring Hill Metro Stop (planned to open in March) is across the street (Route 7 and Spring Hill Road). • Directions to the building are easy and they have open underground parking: • See photo on Web Site from the Xcelerate Solutions Office looking south to the Spring Hill Road Silver Line Metro Station (planned to open in March 2014). • Logistics: • Refreshments, restrooms, etc.
Suggested Format • 6:30 p.m. Tutorials (I will start with the Proposed GMU Course, and hope that others will offer to do tutorials as well) and Refreshments • Continue Data Science Tutorial: Class 3, Recent Tutorial, and Modus Operandi Semantic Knowledge Base • 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group) • Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group • 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?) • Evolution of Semantic Technologies - The Value of Merging Smart Data With Big Data: Eric Little, Modus Operandi and Department of Defense Metadata Engineers • White Paper “Making Big Data Small” using Semantics & Advanced Analytics for NITRD: Jeff Lessner, Modus Operandi • 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work) • 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)
Next Meetups • NIST Data Science Symposium, March 4-5, 9 a.m. • Hosted at NIST (Gaithersburg, MD) • Registration closes February 21st (free) • We have a poster presentation at 2:45-4 p.m. • Fourth Meetup: March 4, 6:30 p.m. • Hosted at NSF (Ballston, VA) • Welcome by NIH Program Director, Dr. Peter Lyster • Brief demo of NIH Semantic Medline/YarcData by Tom Rindflesch and Aaron Bossett • Presentation by Drs. George Strawn and Barend Mons on A Data Fairport and Semantic Scientific Publishing • Discussions and Networking • Fifth Meetup: March 18, 6:30 p.m. • Continue Data Science Tutorial: Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases • Bigdata SYSTAP, Bryan Thompson, SYSTAP • Discussions and Networking • Sixth Meetup: April 1, 6:30 p.m., Seventh Meetup: April 15, 6:30 p.m., Eighth Meetup: May 4, 6:30 p.m. and Ninth Meetup: May 18, 6:30 p.m. • 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)
Overview • Practical Data Science for Data Scientists: • 2/4 Asking and Answering Questions About Data • Chapters 5 & 6 • Two Book Review Tutorials: • Thinking with Data: • Recall the Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12 • Data Science for Business: • Introduction to Data Science for NYU’s new MS in Data Science and adopted by more than twenty other universities for programs in nine countries • Two Data Science Client Applications: • GIS Inc. – EPA Waterways • Semantic Verses - Data Science for Business Data Science • Data Science @ Capital One: • Data Science Story and Invitation • Senior Data Scientist Position
Practical Data Science for Data Scientists Providing On-Line Class With Private Tutoring Class 3 http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists
Resources • Required Textbook • Doing Data Science: • http://shop.oreilly.com/product/0636920028529.do • Free Sampler: • http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF) • Optional Supplemental Reading: • Data Science Starter Kit: • http://shop.oreilly.com/category/get/data-science-kit.do • DC Data Community: • http://datacommunitydc.org/blog/about/ • DC Data Community Calendar: • http://datacommunitydc.org/blog/calendar/ • Technology Requirements • Internet and Free Tools like Spotfire Cloud: • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest • NodeXL: • http://nodexl.codeplex.com/
Class 3 • 1/28 Finding, Cleaning, Analyzing, and Visualizing Data • Discuss Reading: Chapters 5 and 6, Present and Discuss Team Homework Exercise, Hands-on Class Exercise, and Team Homework Exercise. • My Resources: • AOL Government Stories • Hands-on Class Exercise: • Media 6 Degrees Exercise • Media 6 Degrees kindly provided a dataset that is perfect for exploring logistic regression models and evaluating how good the models are: dds_ch5_binary-class-dataset • See Spotfire Web Player: Chapter 5 Logistic Regression Media 6 Degrees and Spotfire File
Discuss Reading • Chapter 5: • In this chapter, we’re talking about logistic regression, but there are other classification algorithms available, including decision trees (which we’ll cover in Chapter 7), random forests (Chapter 7), and support vector machines and neural networks (which we aren’t covering in this book). • Chapter 6: • The main topics for this chapter are time series, financial modeling, fancy-pants regression, and building a GetGlue-like recommendation system to address the problem of content discovery within the movie and TV space.
Present and Discuss Team Homework Exercise • Select One, But Please Present Both: • Jake’s Exercise: Naive Bayes for Article Classification: NYT Data Set (31 CSV files, 151 MB) Already Used • A Spam Filter for Individual Words: To do this yourself, go online and download Enron emails • My Note: Because of the difficulty with these data sets, I provided the Two Data Science Client Applications.
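My Note: For teams that find the NYT and Enron data sets too unwieldy, the idea behind the word-based spam filter can be tried on a few lines of text first. The following is a minimal sketch of a naive Bayes classifier with Laplace smoothing, assuming whitespace tokenization. The messages in TOY_MESSAGES and the function names are made up for illustration; they are not taken from the Enron emails or from the textbook's code.

```python
import math
from collections import Counter

# Hypothetical toy messages; a real exercise would use the Enron emails.
TOY_MESSAGES = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("win a free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project status meeting", "ham"),
]


def train(messages):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_posterior(text, label, word_counts, doc_counts, vocab, alpha=1.0):
    """Unnormalized log P(label | text) under the naive independence assumption."""
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    total_words = sum(word_counts[label].values())
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        # Laplace (add-alpha) smoothing avoids log(0) for unseen class/word pairs.
        likelihood = (word_counts[label][word] + alpha) / (
            total_words + alpha * len(vocab)
        )
        score += math.log(likelihood)
    return score


def classify(text, model):
    """Return the label with the larger log posterior."""
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label, *model))


model = train(TOY_MESSAGES)
print(classify("free money", model))
print(classify("monday meeting", model))
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once the vocabulary grows to the size of a real email corpus.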
Hands-on Class Exercise • Media 6 Degrees Exercise: • Media 6 Degrees kindly provided a dataset that is perfect for exploring logistic regression models and evaluating how good the models are: dds_ch5_binary-class-dataset • See Spotfire Web Player Chapter 5 Logistic Regression Media 6 Degrees and Spotfire File • See Spotfire User's Guide for Data Science: • Logistic Regression Method • How to Use the Evaluation Page
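My Note: Outside of Spotfire, the same two steps (fit a logistic regression, then evaluate how good it is) can be sketched in a few lines. This is a minimal illustration under stated assumptions: one made-up numeric feature and made-up binary labels stand in for the Media 6 Degrees dataset, and plain batch gradient descent stands in for whatever fitting method Spotfire uses internally.

```python
import math

# Hypothetical stand-in data: one numeric feature and a binary outcome.
raw_feature = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit bias and slope for one feature by batch gradient descent on log loss."""
    bias, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(bias + slope * x) - y for x, y in zip(xs, ys)]
        bias -= learning_rate * sum(errors) / n
        slope -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return bias, slope


def evaluate(bias, slope, xs, ys, threshold=0.5):
    """Confusion-matrix counts and accuracy at a given probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for x, y in zip(xs, ys):
        predicted = 1 if sigmoid(bias + slope * x) >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    counts["accuracy"] = (counts["tp"] + counts["tn"]) / len(ys)
    return counts


features = standardize(raw_feature)
bias, slope = fit_logistic(features, labels)
metrics = evaluate(bias, slope, features, labels)
print(bias, slope, metrics)
```

The toy labels are deliberately not perfectly separable, so the fitted weights stay finite and the confusion matrix has some errors to look at. On real data the evaluation should be run on held-out rows rather than the training rows used here.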
Chapter 5 Logistic Regression Media 6 Degrees Web Player
Team Homework Exercise • Exercise: GetGlue and Timestamped Event Data • GetGlue kindly provided a dataset for us to explore their data, which contains timestamped events of users checking in and rating TV shows and movies. • Raw data is 11 GB (once it’s uncompressed) and could not be imported into Spotfire. • Get the Data: Go to Yahoo! Finance and download daily data from a stock that has at least eight years of data, making sure it goes from earlier to later. • If you don’t know how to do it, Google it. Yahoo: http://finance.yahoo.com/q/hp?s=%5EO...torical+Prices (CSV) See Spotfire Web Player and File • Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week
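My Note: Once the daily prices are downloaded, a common first step in financial modeling is to sort the rows from earlier to later and convert prices to log returns. The sketch below assumes each row is an ISO date string plus a closing price; the prices are invented for illustration and are not real Yahoo! Finance data, and the function names are hypothetical.

```python
import math
from datetime import date
from statistics import stdev

# Hypothetical (date, close) rows, deliberately out of order.
toy_prices = [
    ("2014-01-06", 101.0),
    ("2014-01-02", 100.0),
    ("2014-01-03", 102.0),
    ("2014-01-07", 99.0),
    ("2014-01-08", 103.0),
]


def log_returns(rows):
    """Sort rows from earlier to later and return (date, log return) pairs."""
    ordered = sorted(rows, key=lambda row: date.fromisoformat(row[0]))
    returns = []
    for (_, previous_close), (day, close) in zip(ordered, ordered[1:]):
        returns.append((day, math.log(close / previous_close)))
    return returns


def rolling_volatility(returns, window=3):
    """Sample standard deviation of log returns over a trailing window."""
    values = [r for _, r in returns]
    result = []
    for end in range(window, len(values) + 1):
        day = returns[end - 1][0]
        result.append((day, stdev(values[end - window:end])))
    return result


returns = log_returns(toy_prices)
volatility = rolling_volatility(returns, window=3)
for day, value in returns:
    print(day, round(value, 5))
```

Log returns add up across days, so their sum over the whole period equals the log of the last price divided by the first. That makes a quick sanity check that the rows really were sorted from earlier to later.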
Chapter 6 Timestamps and Financial Modeling Web Player
Two Book Review Tutorials • Thinking with Data: • Recall the Borne Ultimatum: Data Literacy for All! Teach Learning From Data K-12: • Thinking with Data: Book Review Tutorial • In-depth look at many of the same topics in Data Science for Business, with a greater focus on the high-level technical ideas. • Data Science for Business: • Introduction to Data Science for NYU’s new MS in Data Science and adopted by more than twenty other universities for programs in nine countries: • Data Science for Business: Book Review Tutorial • Used in Semantic Verses - Data Science for Business Pilot.
Two Data Science Client Applications 1 • GIS Inc. – EPA Waterways: • We’ve been following your work with the Voyager implementation at the National Geospatial-Intelligence Agency (NGA). I’m doing some work for the Williams Company (Oil and Gas) and we have just launched Voyager at 5 of their office locations. My team was wondering if you might have any time to relay any lessons learned with your implementation at NGA? • Semantic Community: • I use Voyager like a Geographic Clearinghouse to find GIS data and then Spotfire 6.0 to analyze it. • What is the spatial relationship between Williams Co. current and planned activities and EPA Waterways data?
Answer the Questions About EPA Waterways • Where did we find the data? • Online most recent (Better than Voyager this time) • Where did we store the data? • Shape files & Excel spreadsheets (ultimately Spotfire) • What did we find when we analyzed the data? • See Spotfire dashboards • What is our data story and product? • See Spotfire dashboards and TIBCO Spotfire 6 for Data Science Documentation
Where did we find the data? • The Environmental Protection Agency has maintained public databases on the condition of rivers, lakes and streams for decades. But until about a year ago, anyone who wanted to get at that data faced a labyrinthine process, either devising search queries to try to navigate the databases or resorting to a Freedom of Information Act request. • My Note: We improved on that! Web Site
Where did we store the data? The Data Ecosystem! http://semanticommunity.info/@api/deki/files/28161/EPAWaterways.xlsx
What did we find when we analyzed the data? Web Player
What is our data story and product? • Data Ecosystem: • My Note: I downloaded, inventoried, and imported these to Spotfire, which resulted in a 2.5 GB Spotfire file that I then reduced three times, to 1.7 GB, 0.7 GB, and finally 0.3 GB, and published the 0.3 GB Spotfire file to the Web Player. • Individual Tabs: • 303(d) Listed Impaired Waters • 305(b) Waters • 2002 Impaired Waters • Watershed Boundaries 2002 • Impaired Waters with TMDL • 2009 Beaches • Water Quality Standards Program
Two Data Science Client Applications 2 • Semantic Verses - Data Science for Business Data Science: • “Magnet is the only engine that treats topics as semantic objects, which gives it a competitive edge since the identification of ‘key topics’ is generally considered to be the main feature of any semantic engine.” • Source: Walid S. Saba, PhD, AI/NLP Scientist, February 2014. • Produced a Data Science for Business Knowledge Base in MindTouch, Excel, and Spotfire: • Structured Mashup with everything treated as an object with a well-defined URL for the Glossary (taxonomy) and Table of Contents (thesaurus), integrated in an Information Model! • Allows one to construct a natural language front-end for enterprise data (and big data) integration across multiple sources.
Where did we store the data? The Data Ecosystem: 491 rows by 16 columns 94 rows by 12 columns 23 rows by 2 columns so far! http://semanticommunity.info/@api/deki/files/28363/DataScienceforBusiness.xlsx
TIBCO Spotfire 6 for Data Science Data Science Recipe for Data Science Cooking http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science#Story
TIBCO Spotfire 6 for Data Science The Data Relationships tool is used for investigating the relationships between different column pairs. The Linear regression and the Spearman R options allow you to compare numerical columns, the Anova option will help you determine how well a category column categorizes values in a (numerical) value column, the Kruskal-Wallis option is used to compare sortable columns to categorical columns, and the Chi-square option helps you to compare categorical columns. http://semanticommunity.info/Data_Science/TIBCO_Spotfire_6_for_Data_Science
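My Note: To make two of these options concrete, here is a minimal sketch of a Spearman rank correlation (for two numerical columns) and a chi-square statistic (for two categorical columns). These are textbook formulas written out by hand, not Spotfire's implementation, and the column values are made-up toy data.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman R: the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def chi_square(col_a, col_b):
    """Chi-square statistic and degrees of freedom for two categorical columns."""
    n = len(col_a)
    observed = Counter(zip(col_a, col_b))
    totals_a, totals_b = Counter(col_a), Counter(col_b)
    statistic = 0.0
    for a in totals_a:
        for b in totals_b:
            expected = totals_a[a] * totals_b[b] / n
            statistic += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(totals_a) - 1) * (len(totals_b) - 1)
    return statistic, dof


# Hypothetical toy columns for illustration only.
rho = spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
statistic, dof = chi_square(
    ["foul", "foul", "none", "none"],
    ["poisonous", "poisonous", "edible", "edible"],
)
print(rho, statistic, dof)
```

Spearman R only looks at rank order, so a monotone but nonlinear relationship still scores close to 1, which is why it complements the linear regression option. Turning the chi-square statistic into a p-value needs the chi-square distribution with `dof` degrees of freedom, which is left out of this sketch.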
What did we find when we analyzed the data? My Note: This and the next slide show why it is impossible to distinguish between poisonous and non-poisonous mushrooms.
Data Science @ Capital One • Data Science Story and Invitation: • The classic story of little Signet Bank from the 1990s provides a case in point for Data and Data Science Capability as a Strategic Asset. Fairbank and Morris became Chairman and CEO, and President and COO, respectively, and proceeded to apply data science principles throughout the business, not just in customer acquisition but in retention as well. You may not have heard of little Signet Bank, but if you’re reading this book you’ve probably heard of the spin-off: Capital One. Fairbank and Morris’s new company grew to be one of the largest credit card issuers in the industry with one of the lowest charge-off rates. My Note: I invited their data science lead to present. • Senior Data Scientist Position: • Basic Qualifications: • Bachelor’s Degree • 2 years of experience in Hadoop • 7 years of experience with data mining, machine learning, statistical modeling tools and underlying algorithms • 5 years of experience with relational databases and SQL • 5 years of experience working with large, unstructured (terabyte or larger) data sets
Preview of What You Are Going To Hear • Remarks by Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group • Evolution of Semantic Technologies - The Value of Merging Smart Data With Big Data: Eric Little, Modus Operandi and Department of Defense Metadata Engineers • White Paper “Making Big Data Small” using Semantics & Advanced Analytics for NITRD: Jeff Lessner, Modus Operandi