Federal Big Data Working Group Meetup. Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup March 18, 2014.
Mission Statement • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House. Co-organizers: Brand Niemann and Kate Goodier
Joint NSF-NIH Biomedical Big Data Research Meetup “Thanks again for a wonderful gathering of deep thinkers at the NIH-NSF Big Data event -- that was terrific. Great line up of speakers.” http://semanticommunity.info/Data_Science/Euretos_BRAIN#Story
Scientific Data: A View from the US • Dr. George Strawn, Director, NITRD/NCO and co-chair of the Federal Big Data Senior Steering Work Group: • Public access mandated for "scientific results" supported by the U.S. government • Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP • Digital Object Architecture: • An "hour glass" for data? (As the Internet was an hour glass for networks: TCP/IP at the narrow point; many applications above, many implementations below) • One result will be to make the scientific record into a first class scientific object http://semanticommunity.info/@api/deki/files/28467/GeorgeStrawn01132014.ppt
Activities • White House OSTP - MIT Big Data Privacy Workshop: • Story and Network Analysis of Tweets: • April 1st Meetup with Kate Goodier and Marc Smith • NIST Data Science Symposium: • Poster and Story: • Data Science Team Pilot with Information Services Office • White Paper for NIST and NITRD: • “Making Big Data Small” using Data Science and Semantics: • “Thanks again for your effort in putting this program together!” • Information Visualization MOOC: • Story and Course Work: • Forming Teams to Work with Clients for the Remaining 7 Weeks • DARPA Big Mechanism: • Story and Pilot: • April 15th Meetup with Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning)
Agenda • 6:30 p.m. Tutorials (Proposed GMU Course) and Refreshments • Continue Data Science Tutorial: Class 4 and Graph Databases and Bigdata SYSTAP Literature Survey of Graph Databases • 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group) • 7:15 p.m. Featured Presentation/Demonstration (where did you get the data, where did you store the data, and what were your results?) • Bryan Thompson, Chief Scientist of SYSTAP, LLC will speak about their SYSTAP open source graph database platform. Highlights will include support for highly available replication clusters as well as their recent work with accelerated graph processing on GPUs at 3 billion traversed edges per second. • See CSHALS 2014: Tech Talk and Poster in Wiki • 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work) • 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)
Next Meetups • Sixth Meetup: April 1, 6:30 p.m. • Network Analytics and Visualization of Big Data Privacy Workshop Tweets, Dr. Marc A. Smith, Chief Social Scientist, Connected Action Consulting Group, and Remarks by the President on Review of Signals Intelligence, Dr. Kate Goodier, Information Architect, Xcelerate Solutions • Seventh Meetup: April 15, 6:30 p.m. • DARPA Big Mechanism, Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning) • Eighth Meetup: May 6, 6:30 p.m. • Federating Big Data for Big Innovation, Dr. Jeanne Holm, Data.gov Evangelist • Ninth Meetup: May 18, 6:30 p.m. • The Science Behind Data Science, Ruhollah Farchtchi, Director of Big Data, UNISYS • 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning)
Overview • Practical Data Science for Data Scientists: • 2/11 Specific Data Science Tools and Applications 1 • Chapters 7 & 8 • Data Science for VIVO & Information Visualization MOOC (no time to cover): • 7 Weeks of Course Work with Sci2 Tools • Forming Teams to Work with Clients for Next 7 Weeks • NodeXL and Sci2 for Data Science (no time to cover): • NodeXL: A free, open-source template for Microsoft® Excel® that makes it easy to explore network graphs. • Sci2: A modular tool for science of science research & practice on scholarly datasets.
Practical Data Science for Data Scientists Providing On-Line Class With Private Tutoring Class 4 http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists
Resources • Required Textbook • Doing Data Science: • http://shop.oreilly.com/product/0636920028529.do • Free Sampler: • http://cdn.oreillystatic.com/oreilly/booksamplers/9781449358655_sampler.pdf (PDF) • Optional Supplemental Reading: • Data Science Starter Kit: • http://shop.oreilly.com/category/get/data-science-kit.do • DC Data Community: • http://datacommunitydc.org/blog/about/ • DC Data Community Calendar: • http://datacommunitydc.org/blog/calendar/ • Technology Requirements • Internet and Free Tools like Spotfire Cloud: • https://spotfire.cloud.tibco.com/tsc/#!/compproductrequest • NodeXL: • http://nodexl.codeplex.com/ My Note: Current Focus
Class 4 • 2/11 Specific Data Science Tools and Applications 1 • Discuss Reading: Chapters 7 and 8, Present and Discuss Team Homework Exercises, Hands-on Class Exercise, and Team Homework Exercise. • My Resources: • http://semanticommunity.info/Data_Science/Free_Data_Visualization_and_Analysis_Tools • http://semanticommunity.info/Data_Science/KDD_Cup • http://www.kdnuggets.com/datasets/ • Hands-on Class Exercise: • SAS and SAS Public Data Sets • See SAS-Spotfire Web Player and Spotfire File, SAS Exercises-Spotfire Web Player and Spotfire File, and SAS Public Data Sets-Spotfire Web Player and Spotfire File • Exercise: Build Your Own Recommendation System
Discuss Reading • Chapter 7: • How do companies extract meaning from the data they have? In this chapter we hear from two people with very different approaches to that question—namely, William Cukierski from Kaggle and David Huffaker from Google. • Chapter 8: • This is the most difficult chapter in the book for me to teach since I do not understand the Python code at the end and have never built a Recommendation Engine myself. I would welcome some help here.
Present and Discuss Team Homework Exercise • Get the Data: Go to Yahoo! Finance and download daily data from a stock that has at least eight years of data, making sure it goes from earlier to later. If you don’t know how to do it, Google it. • Yahoo: http://finance.yahoo.com/q/hp?s=%5EO...torical+Prices (CSV) • See Spotfire Web Player and File
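The "earlier to later" requirement in the homework above can be checked in a few lines. The following is a minimal sketch, not the class solution: it assumes the downloaded CSV has a `Date` column in `YYYY-MM-DD` format, and the sample rows are made up to stand in for a real download.

```python
import csv
import io
from datetime import datetime


def load_prices_oldest_first(csv_text, date_column="Date", date_format="%Y-%m-%d"):
    """Parse a daily-price CSV and return its rows sorted from earliest to latest date."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: datetime.strptime(row[date_column], date_format))
    return rows


def spans_at_least_years(rows, years, date_column="Date", date_format="%Y-%m-%d"):
    """Check that the sorted rows cover at least `years` years (365-day approximation)."""
    first = datetime.strptime(rows[0][date_column], date_format)
    last = datetime.strptime(rows[-1][date_column], date_format)
    return (last - first).days >= years * 365


# Made-up sample rows, newest first, standing in for a real download.
sample_csv = """Date,Close
2014-03-14,10.5
2010-01-04,8.0
2005-06-01,6.25
"""

prices = load_prices_oldest_first(sample_csv)
has_eight_years = spans_at_least_years(prices, 8)
```

With a real file, the text passed in would come from reading the downloaded CSV; the column name and date format may need adjusting to match it.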
Chapter 6 Timestamps and Financial Modeling Web Player
Hands-on Class Exercise • SAS and SAS Public Data Sets: • SAS-Spotfire Web Player and Spotfire File, • SAS Exercises-Spotfire Web Player and Spotfire File, and • SAS Public Data Sets-Spotfire Web Player and Spotfire File • Exercise: Build Your Own Recommendation System • I would welcome some help here.
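As one possible starting point for the recommendation exercise, here is a minimal sketch of a user-based nearest-neighbor recommender. It is chosen for brevity and is not necessarily the approach taken by the textbook's Python code. The `ratings` dictionary, the user and item names, and the function names are all hypothetical.

```python
from math import sqrt

# Hypothetical ratings (user -> {item: rating on a 1-5 scale}); not data from the class.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "D": 4},
    "cat": {"A": 1, "C": 5, "D": 2},
}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse rating dicts; unrated items count as 0."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(user, ratings, k=2):
    """Rank unrated items by the similarity-weighted average rating of the k nearest users."""
    target = ratings[user]
    neighbors = sorted(
        ((cosine_similarity(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for sim, name in neighbors:
        for item, rating in ratings[name].items():
            if item not in target:  # only suggest items the user has not rated
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item]
                 for item in scores if weights[item] > 0}
    return sorted(predicted.items(), key=lambda pair: pair[1], reverse=True)


suggestions = recommend("ann", ratings)
```

In this toy data the only item "ann" has not rated is "D", and its predicted rating leans toward the rating given by the more similar user, "bob". A real exercise would swap in actual ratings and compare predictions against held-out ones.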
SAS Public Data Sets-Spotfire Tutorial Web Player
Team Homework Exercise • Read in next week's reading: Data Visualization for the Rest of Us: • See my Slides and Web Player. • Start to create your own Hubway Data Visualization Challenge and eventually submit it for your class project and the challenge (now closed but still accepting submissions) if you want. • Form Teams (Same or New), Ask Me Questions, and Prepare to Present Next Week
A Data Science Big Mechanism for DARPA • DARPA wants to help the DoD get to the essence of cause and effect for cancer from reading the medical literature. • The Federal Big Data Working Group Meetup has also been doing that with Semantic Medline - YarcData and Euretos BRAIN (Bio Relations and Intelligence Network). • See the video for Cancer Immunotherapy (21 minutes), which Science magazine named the biggest breakthrough of 2013 at the end of that year and which Dr. Tom Rindflesch (the inventor of Semantic Medline) identified from Semantic Medline as a very important breakthrough in early 2013!
Data Science Data Mining Process • Business Understanding: • Broad Agency Announcement (PDF) and Slide Presentation (PPT) • Data Understanding: • Semantic Medline, Open Catalog, CSHALS* 2014, and “Starter kit” (to be provided) • Data Preparation: • Knowledge Base of the Above • Modeling: • Semantic Medline, Data Papers, and NanoPublications • Evaluation: • Searchability, Discovery, and Reasoning • Deployment: • Story and Knowledge Base in MindTouch, Excel, Spotfire, and Be Informed * Conference on Semantics in Healthcare and Life Sciences
The Initial Knowledge Base-Data Ecosystem http://semanticommunity.info/Data_Science/A_Data_Science_Big_Mechanism_for_DARPA
Where did we find some structured data? http://www.darpa.mil/opencatalog/
Where did we store the structured data? http://semanticommunity.info/@api/deki/files/28732/DARPA.xlsx
Modeling: Approaches • Semantic Medline • Semantic MEDLINE Query: mesothelioma and Data Science for VIVO • Data Papers: • Sepublica 2014: The Semantics for e-science in an intelligent Big Data Context • http://sepublica.mywikipaper.org/ • Nanopublications: • The smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author. • http://nanopub.org/wordpress/?page_id=65
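The nanopublication idea above (an assertion that is uniquely identified and attributed to its author) can be sketched as a small data structure. This is an illustration only, not the schema defined at nanopub.org: the `Assertion` class, its field names, the hash-based identifier, and the `example.org` URLs are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """One subject-predicate-object statement attributed to an author."""
    subject: str
    predicate: str
    obj: str
    author: str

    def identifier(self):
        """Derive a stable identifier by hashing the statement together with its author."""
        payload = "\x1f".join([self.subject, self.predicate, self.obj, self.author])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration only; not taken from Semantic Medline or BRAIN.
example = Assertion(
    subject="http://example.org/substance/X",
    predicate="http://example.org/relation/interacts_with",
    obj="http://example.org/substance/Y",
    author="http://example.org/people/author-1",
)
same_claim_other_author = Assertion(
    example.subject, example.predicate, example.obj,
    "http://example.org/people/author-2",
)
```

Because the author is part of the hashed payload, the same claim made by two authors yields two distinct identifiers, which keeps attribution attached to each assertion.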
How did we store the unstructured data? Well-defined URLs Knowledge and Glossary Relational and Graph Linked Data Footnote and References Metadata and Data Sources Ready for NodeXL & Spotfire http://semanticommunity.info/@api/deki/files/28470/BRAIN.xlsx
Modeling: Examples Dr. Barend Mons: BRAIN Dr. Tom Rindflesch: Semantic Medline Most Recent: 500 citations, Start Date: 01/01/1900, End Date: 11/30/2013, 3169 predications extracted. Summarized for Substance Interactions
What did we find when we analyzed the data? Web Player
What is our data story and product? • Data Ecosystem: • BRAIN.xlsx • DARPA.xlsx • Individual Tabs: • DARPA Open Catalog: • Bigdata SYSTAP is Category: Infrastructure and License: GPLv2 • DARPA Big Mechanism Knowledge Base: • DARPA Big Mechanism Knowledge Base by Function (21) • DARPA Big Mechanism Knowledge Base by Number of References (175) • BRAIN Knowledge Base and Examples: • BRAIN Knowledge Base by Function (References) • Data Fairport Conference Dropbox Files by Type (PPTX) • Data Science for VIVO & IVMOOC • Citations by Publisher (APS) • Total Award Amount ($) by Principal Investigator (Geoffrey Fox)
Graph Databases 12 Leading BI Tools and Analytic Platforms I Tested for OMB Absent: Bigdata SYSTAP Virtuoso YarcData Etc. http://semanticommunity.info/Data_Science/Graph_Databases#Story http://semanticommunity.info/Data_Science/Graph_Databases/Tutorial
Bigdata SYSTAP Literature Survey of Graph Databases Awarded Best Paper in 2004! And 10 Years Later….. http://semanticommunity.info/Data_Science/Bigdata_SYSTAP_Literature_Survey_of_Graph_Databases#Story