
Federal Big Data Working Group Meetup

Federal Big Data Working Group Meetup. Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup May 6, 2014.


Presentation Transcript


  1. Federal Big Data Working Group Meetup Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup May 6, 2014

  2. Mission Statement • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize like MOOCs (Massive Open Online Courses) being considered by the White House. Co-organizers: Brand Niemann and Kate Goodier

  3. April 15th Meetup: Kate Goodier, Cognitive Metadata, and Cambridge Semantics, Insider Trading • How was the Meetup? • It is fitting that we meet up to discuss data science for financial data and services on Federal Tax Day! • Excellent presentations, slides, and discussions • Cambridge Semantics was super! • This Meetup is one of the best I attend. • Thanks again for the opportunity to speak to your group on Tuesday. We always appreciate the work you do organizing these communities, and we’re happy to help you out as well whenever we can. http://www.meetup.com/Federal-Big-Data-Working-Group/events/174975182/

  4. Federal Big Data and Cognitive Metadata: Dr. Kate Goodier, Xcelerate Solutions • Cognitive metadata is: • Metadata coming from our perception, reasoning, or intuition, such as preference for a type of content. • Very useful for personalization purposes, and conversely, for limiting PII incidents. • Necessary because policy typically requires audit data that supports policy-driven events. • Necessary because of the paradigm shift from secure EA structures to Cloud Architectures. • Cognitive Metadata is a result of data science. • Uses Universally Unique Identifiers (UUIDs) to enable distributed systems to uniquely identify information without significant central coordination. • Employs predictive algorithms from Big Data Machine Learning combined with Natural Language Processing. • Provides Automated Reasoners for Federal PII policy adherence at scale. • Uses Rules Engines to perform Continuous Monitoring. • Maps the Right Content to the Right Policy and provides that ASAP (Advanced Stream and Prediction). • Cognitive Metadata helps support: • Computer Network Defense (CND) data. • New Executive Orders for Classified Data 13526 and Controlled Unclassified Information 13556. • Dynamic data in audit event management. http://semanticommunity.info/@api/deki/files/28996/XcelerateFederalBigDatatheUseCaseforCognitiveMetadata.pptx
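
The UUID point on this slide can be illustrated with a minimal sketch using Python's standard `uuid` module. It is not taken from the presentation: the namespace URL and the record locators are hypothetical names chosen for illustration. A random (version 4) UUID needs no registry at all, while a name-based (version 5) UUID lets two independent systems derive the same identifier for the same item without talking to each other.

```python
import uuid

# Hypothetical namespace for illustration only; any fixed UUID shared by the
# participating systems would serve the same purpose.
CONTENT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/content")


def random_content_id() -> uuid.UUID:
    """Mint a random (version 4) identifier; no central registry is consulted."""
    return uuid.uuid4()


def derived_content_id(locator: str) -> uuid.UUID:
    """Derive a name-based (version 5) identifier from a stable locator.

    Any system that knows the namespace and the locator computes the same UUID.
    """
    return uuid.uuid5(CONTENT_NAMESPACE, locator)


# Two "systems" deriving an ID for the same (hypothetical) record agree.
a = derived_content_id("records/2014/0001")
b = derived_content_id("records/2014/0001")
c = derived_content_id("records/2014/0002")
```

Here `a == b` holds because version 5 UUIDs are deterministic, while `c` differs because its locator differs. Whether random or name-based identifiers suit a given audit or policy workflow is a design choice the slide does not settle.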

  5. Financial Services, Insider Trading: Marty Loughlin, Cambridge Semantics • Examples of customer use cases in financial services in areas like compliance, data onboarding, and insider trading surveillance. • http://semanticommunity.info/@api/deki/files/28997/CSIApril2014.pptx • Semantic University is a free resource for learning about semantics: • http://www.cambridgesemantics.com/semantic-university • Semantic Web in the Enterprise Blog: • http://www.cambridgesemantics.com/blog/ • Cambridge Semantic Meetup and Cambridge Semantic Web Gatherings (Tim Berners-Lee) • http://www.meetup.com/The-Cambridge-Semantic-Web-Meetup-Group/ • http://www.w3.org/wiki/CambridgeSemanticWebGatherings

  6. Activities • Opening Government Data and Creating Public Value: • Faster Administration of Science & Technology Education & Research (FASTER) and the Big Data Senior Steering Group (BD-SSG), April 22 • Behind the Scenes of Really Big Data: Computing on the Whole World: • Data Science DC Meetup, April 22 • White Paper for NASA, NIH, NIST and NITRD: "Making Big Data Small" using Data Science and Semantics: • Framework Meeting, April 25 • Ontology Summit 2014: • Drs. George Strawn, Farnam Jahanian, and Philip Bourne, April 28 • Data Act of 2014: • Data Transparency Summit, April 29 • Federating Big Data: • Michael Stonebraker, March 3, and FedStats.net, July 2012

  7. Can the Scientific Data Be Reused?: Opening Government Data and Creating Public Value • I told Theresa Pardo my conclusion about the value of open government data was that the real demonstrated value is with statistical data which has been publicly available for many years and that I know of only one example where multiple government health data sets were integrated with considerable effort and expertise to win a $100,000 Health Datapalooza competition several years ago. • I told her this was based on my extensive history with this as follows: • I was asked to pilot the original Data.gov which I did using the Census Bureau's Annual Statistical Abstract as a best practice example; • Years later I then was asked by Data.gov to pilot My Data.gov for how I would do it after Data.gov had encountered one problem after another; • When I left government service I was paid as a data scientist/data journalist to write stories about the value of Data.gov and its most popular data sets; and • Finally I was retained as a consultant to the Japanese government to design and pilot their Open Government Data Program to benefit from Data.gov's mistakes (proprietary software, poor quality data sets, and lack of budget support) and I recommended they use their statistical data and they did. • Theresa agreed with my points and that the value of Open Government Data has yet to be quantified and that statistical, and now scientific data, are the most promising areas for doing that. http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA/Data_Science_Journal#Story

  8. Behind the Scenes of Really Big Data: Computing on the Whole World • My Announcement at the Meetup was: • The Federal Big Data Working Group Meetup meets on the first and third Tuesdays of the month in Tysons Corner and we are mentoring students and professionals with data science tutorials, preparation of presentations, and writing proposals. • My Comment after the Meetup was: • Kudos to the organizers for hosting such a large group. The speaker should slow down and be more interactive with the audience and present some real data science results on "the whole world" like Facebook's analysis showing the average degree of separation is down from 6 to about 4.2, Recorded Future's analysis of protests and web intelligence, and Marc Smith's uses of NodeXL network graphs in treemaps to discover patterns and what might be done to change them. I also suggest the author look at the presentations we have had in the Federal Big Data Working Group Meetup on the state of the art in big graph computing. http://www.meetup.com/Data-Science-DC/events/175795102/

  9. Framework for White Paper • Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data" • Example: Federal Big Data Working Group Meetup • Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment • Example: Semantic Community Data Science Knowledge Base (Big Data Science for CODATA) • Mine prominent scientific journals for data policy, databases, and data results that can be reused. • Example: CODATA Data Science Journal (509 publications by 9 attributes) • Provide data stories and presentation materials for public education and conferences • Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing • Obtain NSF funding for sustained data science for data publications work over a period of years • Example: Critical Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) • Provide a Data Fairport with “Data Publication in Data Browsers” • Example: Semantic Community Spotfire Cloud Library

  10. Ontology Summit 2014 • Dr. George Strawn: • Interested in Evaluating Digital Objects as a Technology for Implementing "Research Objects" (a precise technology for a fuzzy concept), See Data Science for VIVO • Dr. Farnam Jahanian: • Interested in Attending Meetups of Big Data Scientists • Dr. Philip Bourne: • Interested in our Approach to Data Science Community Building and Products, May 7 • Dr. Brand Niemann: • Data Science and Knowledge Modeling (e.g. Be Informed) automate Ontology Development for Applications (e.g. Healthcare.gov)

  11. Data Act of 2014 • Data Transparency Summit, April 29, 2014: • Passing the Digital Accountability and Transparency Act was one thing. Implementing it will be a much bigger challenge. The success of the open government measure Congress passed on Monday, which the president has pledged to sign, depends on ensuring the executive branch implements the law’s mandates on schedule, lawmakers said on Tuesday. • Some White House officials remain concerned Congress’ implementation plan for the DATA Act, which requires standardized coding for federal grant and contract spending, is too quick, Sen. Mark Warner, D-Va., said. • The new law begins with a two-year pilot program during which the Treasury Department and the Office of Management and Budget will develop uniform coding for federal spending data and develop ways to publish it in machine readable and downloadable formats. • The goal is that people inside and outside of government can use the data to spot inefficiencies, duplication, waste and fraud in federal spending and to suggest alternatives. • Data Transparency Coalition Pilot, January 4, 2013: • Semantic Community showed that the Federal Digital Government Strategy accomplishes the Data Act (Hudson Hollister, Executive Director agreed) http://www.nextgov.com/cio-briefing/2014/04/congress-feds-youre-hook-spending-transparency/83383/

  12. Federating Big Data: Michael Stonebraker • At the recent White House-MIT Big Data Privacy Workshop, Mike Stonebraker, Adjunct Professor, MIT CSAIL, presented the “State of the Art of Big Data Technology” (video, 8 minutes; PowerPoint slides available) and said: • "Where I do see a problem, an Achilles heel, it is going to be this: if your data is coming at you from too many places and formats, there is a mature technology that has been used for 20 years in getting data into data warehouses that scales to 20 or 30, or I'll give you maybe 50 data sources. But if you want to integrate a lot of data sources, think about Data.gov, it is a zillion data sets, each relatively dirty and not integrated with anything else, so if you want to integrate Data.gov into a single, coherent data system you have a big problem. So how do you integrate 1000's of data sources? Lots of people want to do this."

  13. Federating Big Data: Overview • For Data Sets it is called a Catalog: • Example: Data.gov • For Data Elements it is called a Data Dictionary: • Example: I have many at Semantic Community because it is where a Data Scientist should start! • For Data Sets and Data Elements it is called a Data Ecosystem: • Example: I also have many at Semantic Community using Spotfire • For Merging Data Elements in Data Sets it is called Integrating: • Example: I can use Spotfire Information Designer for that
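
The relationship between a Data Dictionary and Integrating can be sketched in a few lines of plain Python. This is an illustrative assumption, not the Spotfire Information Designer workflow the slide refers to: the source names, column names, and values below are all made up. The data dictionary maps each source's local column names to shared Data Element names, and integration then merges rows on a common key.

```python
from typing import Dict, List

Row = Dict[str, object]

# Hypothetical data dictionary: for each source, map the local column name
# to a harmonized Data Element name.
DATA_DICTIONARY: Dict[str, Dict[str, str]] = {
    "source_a": {"ST": "state", "POP": "population"},
    "source_b": {"state_name": "state", "med_inc": "median_income"},
}


def harmonize(source: str, rows: List[Row]) -> List[Row]:
    """Rename columns to harmonized element names; drop undefined columns."""
    mapping = DATA_DICTIONARY[source]
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]


def integrate(key: str, *tables: List[Row]) -> List[Row]:
    """Merge harmonized tables on a shared key, keeping rows from every table."""
    merged: Dict[object, Row] = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Made-up illustrative values, not real statistics.
source_a = [{"ST": "X", "POP": 100}, {"ST": "Y", "POP": 200}]
source_b = [{"state_name": "X", "med_inc": 10}, {"state_name": "Z", "med_inc": 30}]

combined = integrate(
    "state",
    harmonize("source_a", source_a),
    harmonize("source_b", source_b),
)
```

The merge keeps keys that appear in only one source, so gaps stay visible instead of being silently dropped. Columns with no entry in the data dictionary are discarded by `harmonize`, which mirrors the problem of undefined Data Elements: they cannot be integrated until someone defines them.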

  14. Federating Big Data: Data.gov Catalog History • Phases: • Phase 1: They used proprietary software and there was no “data catalog”: • I used “Beautiful Soup” code to extract one in spreadsheet format. • Phase 2: They provided a catalog as an item in the catalog: • It was in spreadsheet format • Phase 3: They used vendor software (Socrata) that provided a catalog download: • It was in spreadsheet format • Phase 4: They use Open Source software (CKAN) that provides a catalog download: • It was in spreadsheet format • My Note: The Data Elements in the Data Sets are mostly undefined (acronyms) because the metadata is incomplete
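
The Phase 1 step of scraping a catalog into spreadsheet format might look roughly like the sketch below. The original code is not shown in the slides, so this is an assumption throughout: it swaps Beautiful Soup for the standard library's `html.parser` to stay dependency-free, and the `class="dataset"` markup and sample HTML are hypothetical, not Data.gov's actual page structure.

```python
import csv
import io
from html.parser import HTMLParser


class CatalogLinkParser(HTMLParser):
    """Collect (title, url) pairs from <a class="dataset"> links.

    The "dataset" class is a hypothetical marker for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None   # href of the dataset link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            classes = (attr_map.get("class") or "").split()
            if "dataset" in classes:
                self._href = attr_map.get("href")
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(("".join(self._text).strip(), self._href))
            self._href = None


def catalog_to_csv(html: str) -> str:
    """Return the extracted catalog as CSV text with a header row."""
    parser = CatalogLinkParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "url"])
    writer.writerows(parser.rows)
    return out.getvalue()


# Hypothetical listing page used only to exercise the parser.
SAMPLE_HTML = """
<ul>
  <li><a class="dataset" href="/d/1">Example Dataset One</a></li>
  <li><a href="/about">About</a></li>
  <li><a class="dataset" href="/d/2">Example Dataset Two</a></li>
</ul>
"""

csv_text = catalog_to_csv(SAMPLE_HTML)
```

The resulting CSV opens directly in a spreadsheet program. A real listing page would need its own selector logic, and pagination would have to be handled separately.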

  15. Federating Big Data: Example • Census Bureau’s Annual Statistical Abstract: • About 40 Chapters (PDF) with about 1500 Data Sets (Excel) • The Data Sets are in a standardized format with metadata • The Data Elements are usually harmonized within the Chapters and even across the Chapters to some extent by Subject Matter Experts where possible • The Semantic Community version is Federated in MindTouch and Spotfire so a Web Browser Search (e.g. Find in Google Chrome and Spotfire) provides Context, Metadata, and Data: • My Note: This is not possible with the way Data.gov is designed and implemented. • My Note: I did a pilot for SEMIC.EU several years ago to show how to integrate the U.S. Annual Statistical Abstract and the EU Annual EuroStat at the Data Element Level! • FedStats.Net is a Statistical Data Publication in a Data Browser!

  16. FedStats.net: MindTouch http://semanticommunity.info/FedStats.net

  17. FedStats.net: New Tables http://semanticommunity.info/FedStats.net#New_Tables

  18. FedStats.net: Semanticommunity.net About 1500 Excel Spreadsheets http://semanticommunity.net/StatAbs2012/ http://semanticommunity.net/StatAbs2012/Agriculture/

  19. FedStats.net: Spotfire Data Browser: Web Player

  20. Agenda • 6:30 p.m. Brand Niemann, Introduction and Continue Data Science Tutorials (Refreshments) • 7:00 p.m. Introductions and Announcements (10 seconds per individual depending on the size of the group) • 7:10 p.m. EPA & NASA Climate/Environmental Data Analytics, Dr. Joan Aron, Global Environmental/Climate Change Scientist (with sample analytics by Brand Niemann) • 7:45 p.m. Federating Big Data for Big Innovation and A Redesigned, Open Source Data.gov, Dr. Jeanne Holm, Data.gov Evangelist • 8:30 p.m. Networking/Individual Demos (talk among yourselves and look at one another's work) • 9:00 p.m. Continue Your Conversations Elsewhere (We need to clear out of the space)

  21. Next Meetups • Ninth Meetup: May 20, 6:30 p.m. • Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers. • Graduate Students Working on Semantic Medline-YarcData Projects: GMU Updates Master's Program for Data Science and Sarah Soliman, Rand, and IV MOOC Student Project • Data Science at GMU, Charles Randall Howard, an Adjunct Professor in the Applied Information Technology Department and Data Scientist • The Science Behind Data Science, Ruhollah Farchtchi, Director of Big Data, UNISYS • June Meetups (June 3 and 17): • Health Datapalooza, June 1-3, International Society for Digital Earth (ISDE) Workshop on Big Data for International Scientific Programmes: Challenges and Opportunities, June 8-9, Big Data for Government, June 16-17, and Earth Cube All-Hands Meeting, June 24-26. • DARPA Big Mechanism: • Story and Pilot for Future Meetup with Mike Megginson, Northrop Grumman, and Fredrik Salvesen, YarcData (in planning) • Summer Vacation?: • Silver Line Spring Hill Metro Station Opens in July?

  22. May 6th Meetup: Continue Data Science Tutorial • Practical Data Science for Data Scientists: • Reading Assignments: • Chapter 9: Data Visualization Projects and Risk • Recall Boston Hubway Data Challenge (see next slides) and Cambridge Semantics (Insider Trading) • Chapter 10: Social Networks and Data Journalism • Recall Dr. Marc Smith (NodeXL) and My Data Stories (see below) • Resources: See 2/18 Specific Data Science Tools and Applications 2 • Team Homework Exercise: • Data Journalism: Critique One of My Health Data Stories/Products (How could it be better? Is there other data I should have included? How would you present it?) • State Health Databases

  23. Practical Data Science for Data Scientists Providing On-Line Class With Private Tutoring Class 5 http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

  24. Web Player
