
Federal Big Data Working Group Meetup

Federal Big Data Working Group Meetup. Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup April 15, 2014.


Presentation Transcript


  1. Federal Big Data Working Group Meetup Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup April 15, 2014

  2. Mission Statement • Federal: Supports the Federal Big Data Initiative, but is not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy, which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House. Co-organizers: Brand Niemann and Kate Goodier

  3. April 1st Meetup: NodeXL and Security and Privacy of Federal Big Data • Dr. Marc Smith, Social Media Research Foundation, and Dr. Kate Goodier, Xcelerate Solutions. • How was the Meetup? • Marc was a fabulous speaker. Wow! • Excellent presentations and discussions with members and guests. Thank you Marc and Kate! • Excellent presentations. NodeXL was made very accessible -- lots of samples and references available in the materials and Kate's Privacy presentation was thought provoking. • Awesome! Great topics and speakers. http://www.meetup.com/Federal-Big-Data-Working-Group/events/172360982/

  4. Charting Collections of Connections in Social Media: Creating Maps and Measures with NodeXL • Dr. Marc Smith is a sociologist specializing in the social organization of online communities and computer-mediated interaction. Smith leads the Connected Action consulting group and lives and works in Silicon Valley, California. Smith co-founded the Social Media Research Foundation, a non-profit devoted to open tools, data, and scholarship related to social media research. http://www.slideshare.net/Marc_A_Smith/think-link-network-insights-with-no-programming-skills Sixth Meetup, Tuesday April 1, 2014, 6:30 p.m. NodeXL and SCI2 for Data Science

  5. Security and Privacy of Information and Federal Big Data • Last Meetup: • Federal big data is different • Clearly understanding when to use big data and why • The Security and Privacy implications for the federal government • Big Data Best Practice Use Cases • 3 Key Questions: • What are the key technology differences between: • a non-big-data database and • a big data database? • What are the security and privacy implications of big data for federal application development? • What is the Use Case for the federal government? • How to transition to big data database technology – best practice use cases • This Meetup: • Cognitive Metadata: the killer enabler for Federal Big Data Security and Privacy in the Clouds • It’s all about the metadata http://semanticommunity.info/@api/deki/files/28893/XcelerateFederalBigData04012014.pptx

  6. March 18th Meetup: Continue Data Science Tutorial • Practical Data Science for Data Scientists: • 2/11 Specific Data Science Tools and Applications 1 • Chapters 7 & 8 • Data Science for VIVO & Information Visualization MOOC (no time to cover): • 7 Weeks of Course Work with Sci2 Tools • Forming Teams to Work with Clients for Next 7 Weeks • NodeXL and Sci2 for Data Science: • NodeXL: A free, open-source template for Microsoft® Excel® that makes it easy to explore network graphs. • Sci2: A modular tool for science of science research & practice on scholarly datasets (no time to cover). • Continue Data Science Tutorial: Network Analytics and Visualization of Big Data Privacy Workshop Tweets, Dr. Marc A. Smith, Chief Social Scientist, Connected Action Consulting Group • May 6th Meetup: Continue Data Science Tutorial: Practical Data Science for Data Scientists (Class 5)

  7. Practical Data Science for Data Scientists Providing On-Line Class With Private Tutoring Class 5 http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

  8. Activities • DARPA Big Mechanism: • Story and Pilot for Future Meetup with Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning) • White Paper for NIH, NIST and NITRD: "Making Big Data Small" using Data Science and Semantics: • Data Science Team Pilot with Information Services Office: NIST Scientific Data for Data Science (completed) • Earth Science Information Partnership (ESIP) Earth Science Data Analytics, March 20: • Teleconference Presentation on Data Analytics in Data Science (Next April 17th) • EarthCube Summit at the Open Geospatial Consortium Technical Committee Meeting, March 25: • Presentations to be posted and mined for June 24-26 Conference in Washington, DC • OMG Semantics - Crossing the Chasm Workshop, March 26: • Presentations to be posted and mined for April 15th Meetup

  9. Activities • Ontology Summit Hackathon: Ontology Design Patterns and Semantic Abstractions in Ontology Integration, March 29: • theDataMap as a Data Paper and Ontology (this Meetup) • 2014 VIVO Conference, August 6-8, Austin, Texas: • Submitted Abstract: Data Science for VIVO and the IV MOOC • NSF Data Science Funding: • Xiaoming Huo, Program Director, Statistics, Computational and Data-enabled Science & Engineering, National Science Foundation, Division of Mathematical Sciences (invited) • High School Student Interest in Data Science: • Tien Comlekoglu, Langley High School (invited) • Mary Galvin Blogs: • Data Community DC: Deep Learning Inspires Deep Thinking

  10. Data Science Data Papers & Browsers • Recall that Dr. Marc Smith said we have browsers of Web pages and referred to NodeXL on the Web as a "browser of a web of data" • Semantic Community has been doing Data Papers in MindTouch (an advanced wiki) to turn them into data in an Excel spreadsheet for use in Spotfire Cloud, essentially a data browser! Examples are: • NIST Scientific Data for Data Science (IV MOOC) • State Health Databases (Meetup & Ontology Summit 2014) • Data Science for FIBO (Meetup & Ontology Summit 2014) • White Paper for NIH, NIST and NSF/NITRD: "Making Big Data Small" using Data Science and Semantics (in process) • UN Open Data / Open Government Workshop, Abu Dhabi, April 26-28 • CODATA Workshop on Big Data for International Scientific Programmes, Beijing, June 8-9 • EarthCube All-Hands Meeting, Washington, DC, June 24-26 • Data Transparency Summit, April 29, 2014

  11. NIST Scientific Data for Data Science: Story • In the broader context, NIST and other agencies need to support the following Federal Government Initiatives: • Big Data • Digital Government Strategy • Public access mandated for "scientific results" supported by the U.S. government • Federal agencies have submitted their "initial plans" for public access to scientific data to OSTP • Digital Object Architecture: One result will be to make the scientific record into a first-class scientific object • The author has suggested that all of these can be addressed with agency digital content by following the Data Mining Standard (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, & Deployment) • The NIST Digital Archives (NDA) present images of NIST Museum artifacts and full-text NIST publications: • NIST Publications (one complete paper as a data paper) • Standards • Library Collections • Library Catalog • FAQs • NIST/NBS History (one complete paper as a data paper) • This helps NIST scientists disseminate their research results, manage references, and navigate the NIST editorial policies and review process, and helps the public better find and reuse their content. http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science#Story

  12. NIST Scientific Data for Data Science: NIST Digital Archive http://nistdigitalarchives.contentdm.oclc.org/

  13. NIST Scientific Data for Data Science: Knowledge Base in MindTouch http://semanticommunity.info/Data_Science/NIST_Scientific_Data_for_Data_Science

  14. NIST Scientific Data for Data Science: Knowledge Base in Excel Spreadsheet http://semanticommunity.info/@api/deki/files/28860/NISTDataScience.xlsx

  15. NIST Scientific Data for Data Science: NIST Data Science Spotfire Cover Page Web Player

  16. Rules for Knowledge Base • Few or many Web links • Authoritative reference links or not • Static graphics versus interactive dashboard • Text versus real data • Cardinal assertions versus redundant assertions • Data and metadata together or separate • Normalization or not • RDF first or last or not at all • Other?

  17. State Health Databases: Flows from State Health Databases Latanya Sweeney, Professor, Harvard University: http://thedatamap.org/

  18. State Health Databases: Introduction 1 • When it comes to health data, trust begins with the doctor-patient relationship. Without that trust, patients will not give useful information and may risk poor treatment. The patient needs to make his data transparent to physicians and hospitals. What is not transparent are all the places where the data may go. This is important because it is difficult to establish harm when data sharing is hidden. • Using publicly available information acquired through breach notices and FOIA requests, we begin to track all the places a typical but hypothetical patient's data may go. The map shows our progress so far. Each node represents a category of entities (e.g., companies and agencies) and the lines between them represent documented flows of personal health information. If the line is dashed, the information is shared without explicit personal identity. If the line is solid, the explicit name of the person is shared. You may click on any node to see actual names of entities and links to evidence of sharing depicted by edges of the node. http://semanticommunity.info/Data_Science/Big_Data_Privacy_Workshop#Latanya_Sweeney.2C_Professor.2C_Harvard_University
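The node-and-line description on slide 18 maps naturally onto a small directed graph. The following is a minimal sketch in Python, not theDataMap's actual data or code: the `Flow` class, the helper names, and every flow listed in `FLOWS` are hypothetical and invented purely for illustration. It shows how a solid line (name shared) versus a dashed line (no explicit identity) could be recorded, and how one might list every category a patient's data can reach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One documented flow of health data between two categories of entities."""
    source: str
    target: str
    identified: bool  # True = solid line (name shared); False = dashed line


# Hypothetical flows, invented for illustration only (not taken from theDataMap).
FLOWS = [
    Flow("Patient", "Physician", True),
    Flow("Physician", "Hospital", True),
    Flow("Hospital", "State Health Database", False),
    Flow("State Health Database", "Researcher", False),
]


def line_style(flow):
    """Return how the flow would be drawn on the map."""
    return "solid" if flow.identified else "dashed"


def reachable(start, flows):
    """Return every category the data can reach from `start` via documented flows."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for flow in flows:
            if flow.source == node and flow.target not in seen:
                seen.add(flow.target)
                stack.append(flow.target)
    return seen


if __name__ == "__main__":
    for flow in FLOWS:
        print(f"{flow.source} -> {flow.target} ({line_style(flow)})")
    print(sorted(reachable("Patient", FLOWS)))
```

Keeping the identity flag on the edge rather than on the node mirrors the slide's description, where the same category can receive identified data from one source and de-identified data from another.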

  19. State Health Databases: Flows Not Covered by HIPAA Latanya Sweeney, Professor, Harvard University: http://thedatamap.org/

  20. State Health Databases: Introduction 2 • What is surprising about the image is the number of entities and some of the relationships. • Another surprise is that of the hundreds of flows of personal health data documented on the map, only half are actually covered by HIPAA. Most or all of the same data are available elsewhere, and not necessarily under any regime. This becomes important in understanding and assessing risks and remedies. A recipient of the data in the case study may have many other sources on which to link the data than those described. http://semanticommunity.info/Data_Science/Big_Data_Privacy_Workshop#Latanya_Sweeney.2C_Professor.2C_Harvard_University

  21. theDataMap as a Data Paper: Story • I selected theDataMap because it looks and functions like an ontology but is actually static graphics, not interactive networks (see table below). In addition, the paper Survey of Publicly Available State Health Databases is a PDF file in which the tables are images (not data tables) and are not as current as the web tables, so it is a challenge to make it a Data Paper. • Recently, Professor Barry Smith called the author's attention to the fact that there is an ontology for Data Mining called the OntoDM ontology, which I applied as follows: • Upper-level Ontology: The graphic above and the Knowledge Base Ontology • Mid-level ontologies: The 5 state data tables • Domain ontology: The Categories Linked Data Table • The results are shown below in screen captures and an interactive Spotfire Dashboard of the Spotfire visualizations of the theDataMap Data Science spreadsheet.

  22. theDataMap as a Data Paper: Knowledge Base in MindTouch http://semanticommunity.info/Data_Science/State_Health_Databases

  23. theDataMap as a Data Paper: Knowledge Base in Excel Spreadsheet http://semanticommunity.info/@api/deki/files/28873/TheDataMapDataScience.xlsx

  24. theDataMap as a Data Paper: DataMap Map of Categories with Linked Data in Spotfire Web Player

  25. Data Science for FIBO: Stories • What is the Difference Between a Data Dictionary, Ontology, and Vocabulary?: • Frank Guerino, Chairman, The International Foundation for Information Technology (IF4IT) • FIBO Ontologies and Be Informed Metamodels for Financial Services Applications: • Email conversations with original Data Science Team: Andrea Westerinen, Leo Obrst, Dennis Wisnosky, Mike Bennett, Elisa Kendall, Kees van Mansom, and Mills Davis • Semantics - Crossing the Chasm OMG Workshop: • Also included a session outlining the work done by the OMG Finance Domain Task Force in creating a Financial Industry Business Ontology standard and a Healthcare Track on What's Working Today and Where We Need to Go with Standards in Semantics for Healthcare Services. My Note: See next slides.

  26. Standards and Semantics for Biomedicine: Outline • Standard vocabularies • SNOMED CT • RxNorm • LOINC • Semantics across standards • Unified Medical Language System • Slide 26 Integrating Subdomains 1 • Slide 27 Integrating Subdomains 2 • Slide 28 Terminology integration • NCBO BioPortal • CTS2 – Common Terminology Services • Data elements • Common data elements at NIH • Information models • Clinical research data • Clinical information modeling initiative (CIMI) • Document markup standards • Clinical Document Architecture • Exchanging information with patients: Blue Button • Biomedical standards and semantics in action • Getting involved • Health IT Standards Committee • Standards and Interoperability (S&I) Framework • IHE • Semantic Web – Health Care and Life Sciences • Linked Open Data Cloud • Medical Ontology Research My Note: The Data Science Program and Team for Semantic Medline led by Dr. Olivier Bodenreider, Staff Scientist, Lister Hill National Center for Biomedical Communications, Bethesda, Maryland – USA! Source: http://mor.nlm.nih.gov/pubs/pres/20140326-OMG_Semantics.pdf

  27. Standards and Semantics for Biomedicine: Integrating Subdomains Source: http://mor.nlm.nih.gov/pubs/pres/20140326-OMG_Semantics.pdf

  28. Data Science for FIBO: Conclusions • In conclusion, I did the following: • Integrated ontology, semantic web, and "big data" to address the goal of the 2014 Ontology Summit • Made the background information and two OMG FIBO Standards documents "data papers" by using the OntoDM ontology and rules for building three knowledge bases as follows: • Overall, which corresponds to the Upper-level ontology • Business Ontology Foundations (268), which correspond to the Mid-level Ontologies • Ontology Business Entities (176), which correspond to the Domain ontology • Put the 114 tables in the two OMG FIBO Standards documents into linked data format • Combined the 114 tables with 969 columns and 2129 rows into an Excel spreadsheet and Spotfire dashboard
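The last two steps on slide 28 (putting many tables into one linked format and combining them into a single spreadsheet) can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how the FIBO tables were actually processed, so the "long" one-value-per-record layout, the function names, and the tiny `TABLES` sample are all hypothetical stand-ins, and CSV text is used in place of an Excel workbook so that only the Python standard library is needed.

```python
import csv
import io

# Hypothetical stand-ins for extracted document tables (not the real FIBO tables).
# Each table is a list of rows; each row maps a column name to a cell value.
TABLES = {
    "Table 1": [
        {"Term": "Contract", "Definition": "An agreement"},
        {"Term": "Party", "Definition": "A participant"},
    ],
    "Table 2": [
        {"Entity": "Bank", "Type": "Legal entity", "Source": "FIBO"},
    ],
}


def to_long_format(tables):
    """Flatten all tables into one list of (table, row, column, value) records."""
    records = []
    for table_name, rows in tables.items():
        for row_number, row in enumerate(rows, start=1):
            for column, value in row.items():
                records.append(
                    {"table": table_name, "row": row_number,
                     "column": column, "value": value}
                )
    return records


def summarize(tables):
    """Return (number of tables, total distinct columns per table, total rows)."""
    n_columns = sum(len({c for row in rows for c in row}) for rows in tables.values())
    n_rows = sum(len(rows) for rows in tables.values())
    return len(tables), n_columns, n_rows


def to_csv(records):
    """Serialize the combined records as CSV text, loadable by a spreadsheet tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["table", "row", "column", "value"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    print(summarize(TABLES))
    print(to_csv(to_long_format(TABLES)))
```

A long layout like this lets tables with different columns share one sheet, and a summary function of this kind is one way counts such as "114 tables with 969 columns and 2129 rows" could be tallied.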

  29. Data Science for FIBO: Implications • It may be obvious, but it bears restating: ontology requires strong relationships that can be quantified in order to realize its full benefits, such as data integration and reasoning, but real-world activities and their data usually lack those strong relationships and the ability to quantify them. Data science that follows the data mining process standard to produce semantically linked data is therefore a way to "cross the chasm" with semantic technologies into the mainstream, because semantics are an important part of the data mining process and data science has already crossed the chasm. • Now we are ready to cross the chasm with the upcoming Data Transparency Summit and the government financial data sets that will become available because of the DATA Act. See the Data Transparency Summit knowledge base in development.

  30. Data Science for FIBO: Knowledge Base in Excel Spreadsheet http://semanticommunity.info/@api/deki/files/27961/FIBO.xlsx

  31. Agenda • 6:30 p.m. Brand Niemann, Introduction and Xcelerate Solutions Refreshments • Data Papers in Data Browsers Tutorial: On Your Own (where did you get the data, where did you store the data, and what were your results?) • Data Science for the Financial Industry and Three Approaches to Semantic Normalization and Interoperability (see background research Wiki) • 6:35 p.m. Dr. Kate Goodier, Information Architect, Xcelerate Solutions, Cognitive Metadata: The Killer Enabler for Federal Big Data Security and Privacy in the Clouds • 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group) • 7:10 p.m. Dr. Kate Goodier, Cognitive Metadata (continues) • 7:30 p.m. Cambridge Semantics, Examples of customer use cases in financial services in areas like compliance, data onboarding, and insider trading surveillance. • 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work) • 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

  32. Next Meetups • Eighth Meetup: May 6, 6:30 p.m. • Practical Data Science for Data Scientists: Specific Data Science Tools and Applications 2, Federating Big Data for Big Innovation, Data Science for Datapalooza, and ESIP Earth Sciences Data Analytics. See Climate Change and President Obama's Action Plan Infographic: Where is the data for this? • EPA & NASA Climate/Environmental Data Analytics, Dr. Joan Aron, Global Environmental/Climate Change Scientist (with sample analytics by Brand Niemann) • Federating Big Data for Big Innovation and A Redesigned, Open Source Data.gov: Dr. Jeanne Holm, Data.gov Evangelist • Ninth Meetup: May 20, 6:30 p.m. • Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers. • Graduate Students Working on Semantic Medline-YarcData Projects: GMU Updates Master's Program for Data Science and Sarah Soliman, Rand, and IV MOOC Student Project (invited) • Keynote from Workshop in China, Charles Randall Howard, an Adjunct Professor in the Applied Information Technology Department and Data Scientist • The Science Behind Data Science, Ruhollah Farchtchi, Director of Big Data, UNISYS • 2nd Cloud, SOA, Semantics and Data Science Conference, June (in planning) • Summer Vacation?
