Federal Big Data Working Group Meetup

Federal Big Data Working Group Meetup Dr. Brand Niemann Director and Senior Data Scientist Semantic Community http://semanticommunity.info/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup June 2, 2014

Mission Statement • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies; • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content; • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products (How was the data collected, Where is it stored, and What are the results?); and • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House. Co-organizers: Brand Niemann and Kate Goodier

What Are We Doing? • Leadership of the Semantic Data Science Team that produced Semantic Medline running on the YarcData Graph Appliance. • Founding and co-organizing of the Federal Big Data Working Group Meetup. • A graduate class prepared for GMU entitled “Practical Data Science for Data Scientists”. • Using the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) to build a Data Science Knowledge Base. • Mining of the Data Science and Digital Earth scientific journals for the CODATA International Workshop on Big Data for International Scientific Programmes (June 8-9, in Beijing). • Participation in the Data FAIRport (Findable, Accessible, Interoperable, and Reusable) with “Data Publication in Data Browsers”. • Providing data stories that persuade and presentation materials for public education conferences like the COM.BigData Conference (August 4-6, in Washington, DC).

How Are We Doing It? • Federating Use Cases: Data Science (Brand Niemann); Environmental and Earth Science (Joan Aron); and Astronomy (Kirk Borne) • Federating Data Publications: Structured Scientific Content (Papers, journals, books, reports, etc.); Data FAIRports (Findable, Accessible, Interoperable, and Reusable); and Data Stories That Persuade (Claims and Evidence) • Federating Solutions & Technologies: Hand-Crafted by Individuals and Teams (Mary Galvin, STEM); Data Mining Standards and Products (Brand Niemann, Data Publications in Data Browsers); Machine Processing (Fredrik Salvesen, Semantic Data Publications on YarcData Graph Appliance); Reading and Reasoning (Kate Goodier and Chuck Rehberg, Semantic Insights on Elsevier Content Text Mining); and Data Curation at Scale (Michael Stonebraker, Tamr on 1000s of Spreadsheets)

Data FAIRPort Final Report, Interview, and Joint Hackathons Started http://datafairport.org/ http://semanticommunity.info/Data_Science/Euretos_BRAIN

May 20th Meetup: Data Science at GMU and Elsevier Research Data Services • How Was the Meetup? • I want to thank everyone who attended. As it was my first meeting I was very impressed with the speakers and the venue. I am looking forward to working with everyone and future meetings. • Anita was just amazing. • The meeting was great. • We Listen and Respond: • John, Please see my Summary Comments in the next slides • David, Thanks for the excellent links, which are very relevant to our work on data publications: • Quickly search and analyze billions of public records published by governments, companies and organizations: http://enigma.io/ • Visual document mining for journalists: http://overview.ap.org/ • RMarkdown language used to create 'living' research documents: http://rpubs.com/dabata/17384 • Orest, Thanks for coming last night. I have high praise for an HP Vertica webcast I participated in recently and wrote about. We would welcome a presentation like that with Conservation International. http://www.meetup.com/Federal-Big-Data-Working-Group/events/181656402/

Fourth Paradigm and Fourth Question • The Fourth Paradigm of Science (1): • First Paradigm. Observation, descriptions of natural phenomena, and experimentation. • Second Paradigm. Theoretical science such as Newton’s laws of motion and Maxwell’s equations. • Third Paradigm. Simulation and modelling, such as in astronomy. • Fourth Paradigm. Data-intensive science that exploits the large volumes of data in new ways for scientific exploration, such as the International Virtual Observatory Alliance in astronomy. • The Fourth Question of Big Data for Science (2): • How was the data collected? • Where is the data stored? • What are the data results? • Does the data story persuade? (1) Bell, G., Hey, T., & Szalay, A. (2009) Beyond the data deluge, Science 323, 6 March 2009, pp. 1297-1298. (2) de Waard, Anita (2014) About Stories, that Persuade With Data, Federal Big Data Working Group Meetup, 20 May, 41 slides.

Anita de Waard: Some Contacts and Links to Projects • Kerstin Lehnert leads their efforts for IEDA: • http://www.ldeo.columbia.edu/user/lehnert • Ariadne is the company Elsevier acquired and Pathway Studio is the product used in the DARPA Big Mechanism Project: • http://www.elsevier.com/online-tools/pathway-studio • GeoDeepDive: • http://hazy.cs.wisc.edu/hazy/geodeepdive/ • GoPubMed: • http://gopubmed.com/web/gopubmed/ • DataUp (Microsoft Excel/California Digital Library): • http://dataup.cdlib.org/ • Elsevier Content Text Mining License: • http://www.elsevier.com/connect/elsevier-updates-text-mining-policy-to-improve-access-for-researchers

My Summary Comments: • Anita said that knowledge is really in the researcher's head and not in the textual papers (their notebooks could be more useful, but there are still problems getting at them). Stories could/should be the best source, especially when based on real facts (data), and on statistics if there is enough data, but only to support statements of the "compared to what?" kind and not absolutes. Computers, natural language processing, Watson, etc. all try to help automate this but have limitations.

My Summary Comments: (continued) • My experience with US EPA Administrator William Ruckelshaus was as follows: You as scientists are to give me the best description of the scientific problem (e.g. acid rain's effects on lakes and streams), and I as the Administrator am to make prescriptions of what to do about that to the President and Congress. You can even tell me we have to collect better data (we did) to accomplish your work and I have to support that, but you as scientists should steer clear of the politics of prescription.

My Summary Comments: (continued) • In our next Meetup we will talk about the use of ontology as a knowledge representation for organizing and relating concepts and then trying to reason and infer new facts. We already had an example (HealthCare.gov) of knowledge modeling tooling (Be Informed) that "automates" ontology development and the essential role of ontology (UMLS) in knowledge discovery in RDF triple stores (Semantic Medline in YarcData). Semantics like ontology have an essential role in Big Data Ecosystems, Federation, and Integration as we will see in the next two Meetups.

'Living' Research Documents http://rpubs.com/dabata/17384

Earth Insights from Big Data HP Vertica and Conservation.org http://www.teamnetwork.org/gridsphere/gridsphere?cid=download http://semanticommunity.info/Data_Science/Earth_Insights_from_Big_Data

Activities • Mentoring: • White House Energy Datapalooza, May 28 (In process with Alexandra Winkler, Knowledge Cities Graduate Student) • Health Datapalooza V, June 1-3, and HHS Fellowship: • Story and Application for HHS 12-month External Entrepreneur Fellowship for Innovative Design, Development and Linkages of Databases • Big Data for Government, June 16-17: • Keynote from Dr. George Strawn and Presentation by Dr. Tom Rindflesch and Semantic Medline/YarcData Team • EarthCube All-Hands Meeting, June 24-26: • ESIP Earth Science Analytics (In process with Joan Aron, Global Environmental/Climate Change Scientist) • Keynote and Panel: COM.BigData 2014, August 4-6: • You can participate and attend

White House Energy Datapalooza http://www.whitehouse.gov/blog/2014/05/28/harnessing-power-data-technology-and-innovation-clean-energy-economy http://semanticommunity.info/Data_Science/Data_Science_for_White_House_Energy_Datapalooza

Data Science for the HHS IDEA LAB My Note: Innovative Design, Development and Linkages of Databases Fellowship: My Tribute to George Thomas http://semanticommunity.info/Data_Science/Data_Science_for_the_HHS_IDEA_LAB#Story

Data Publication in a Data Browser Web Player

Health Datapalooza V • A Hack-a-Thon, but with a Scraper Wiki (MindTouch) to produce a detailed Wiki Table of Contents and multiple Spreadsheet Tables for Spotfire analytics (Data Science for the HHS IDEALAB); • A Code-a-Palooza, but without Code using Spotfire so a very large relational database (Health Datapalooza V Medicare Claims) can be used all in memory for Spotfire analytics; and • A Meetup to mentor and train data scientists and others in creating a series of Data Publications in Data Browsers starting with Health United States 2013. http://healthdatapalooza.org/

Keynote and Panel: COM.BigData 2014 http://www.com-geo.org/conferences/2014/prog_keynotes.htm

Volunteers? • Every panelist could give a short 5-10 minute tech-talk presentation with slides before the discussion. The total length is 90 minutes. • The keynote and panel abstracts will also be published in our proceedings with IEEE. • The panelists invited by you will be offered free full conference passes. • We can also offer 50% discounts to your Meetup group members if they are interested in attending the conference.

ESIP Earth Science Analytics • Use Case Name: Climate Change: Where's the Data? • Provided By: Brand Niemann and Joan Aron, Federal Big Data Working Group Meetup • Brief Description: So the web site says here are the data sets, but are they reusable so one can make the scientific report a data publication? • Key Analytics Needs: Content Analytics, Data Analytics, and Publication Analytics http://wiki.esipfed.org/index.php/Earth_Science_Data_Analytics/ http://semanticommunity.info/Data_Science/Data_Science_for_Climate_Change

CODATA International Workshop on Big Data for International Scientific Programmes • Goal: Find actual data and principal conclusions (e.g. nanopublication) from IJDE title, author(s), and abstract. • Data Preparation: Screen-scrape a sample of how it could be done manually so it can be done more automatically. • Data Selection: Data may be at other locations like from a Google Search for "climate change shapefiles" that found: • http://www.diva-gis.org/Data • http://data.worldbank.org/data-catalog/cckp-ensemble-projections • Data Completion: Screen scrape the rest using the initial pattern developed from experimentation and explore the easier to find data sources to make a selection. http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA#Story
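A minimal sketch of the Data Preparation and Data Completion steps above, assuming a hypothetical page layout: develop an extraction pattern on a small sample, then apply the same pattern to the remaining pages. The markup, the class names (`article`, `title`, `authors`, `abstract`), and the helper names `scrape` and `mentions` are all illustrative assumptions, not the actual layout of the IJDE pages or the method used for the knowledge base. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: these class names are assumptions for
# illustration only, not the real structure of any journal web page.
SAMPLE_PAGE = """
<div class="article">
  <h2 class="title">Example article on climate grids</h2>
  <span class="authors">A. Author; B. Author</span>
  <p class="abstract">We used Google Earth imagery and shapefiles.</p>
</div>
<div class="article">
  <h2 class="title">Second example article</h2>
  <span class="authors">C. Author</span>
  <p class="abstract">A survey of sensor networks.</p>
</div>
"""

FIELDS = {"title", "authors", "abstract"}


class ArticleScraper(HTMLParser):
    """Collects one dict per article, keyed by the fields in FIELDS."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled
        self._field = None    # field whose text is being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "article":
            # A new article block starts a new record.
            self._current = {}
            self.records.append(self._current)
        elif css_class in FIELDS and self._current is not None:
            self._field = css_class
            self._current[css_class] = ""

    def handle_data(self, data):
        # Only keep text that falls inside a recognised field.
        if self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        # Any closing tag ends the field currently being read.
        self._field = None


def scrape(html):
    """Apply the extraction pattern to one page and return cleaned records."""
    parser = ArticleScraper()
    parser.feed(html)
    parser.close()
    return [{k: v.strip() for k, v in r.items()} for r in parser.records]


def mentions(records, phrase):
    """Titles of records whose abstract mentions the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [r["title"] for r in records if needle in r.get("abstract", "").lower()]


if __name__ == "__main__":
    records = scrape(SAMPLE_PAGE)
    for record in records:
        print(record["title"], "|", record["authors"])
    print(mentions(records, "Google Earth"))
```

Once the pattern works on the sample, the same `scrape` function could be run over each remaining page, and a keyword filter such as `mentions` gives a first pass at the "How was the data collected?" question before the records are exported to a spreadsheet index.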

CODATA International Journal of Digital Earth: Knowledge Base Google Chrome: Find “Google Earth” Answer: How was the data collected. http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA/International_Journal_of_Digital_Earth

CODATA International Journal of Digital Earth: Spreadsheet Index Answer: Where the data is stored. http://semanticommunity.info/@api/deki/files/29000/CODATABigDataScience.xlsx

Data Science for Climate Change: Spotfire Data Publication • Answer: This is where the data is stored and the results. • Answer: The data story persuades with more reference links. Web Player

Climate Change: Grid Projections-Average A2 SRES Scenario Web Player

Digital Earth: Big Earth Data and Geospatial Analytics • Digital Earth is a visionary concept popularized by Nobel Laureate and former US Vice President Al Gore for the virtual and three-dimensional representation of the Earth. • Use cases can be interpreted within a broad framework of spatial concepts that provides a better guide to the future of geobrowsers and Digital Earth than current GIS technology. • Now Digital Earth is Geospatial Analytics on Big Data and the questions are where to get that big data and what analytical tools to use. • The International Journal of Digital Earth and a Google Search for "climate change shapefiles" were found to be a good place to start the data mining process. • One significant highlight is the Spotfire visualization of the Climate Change: Grid Projections for Average A2 SRES Scenario superimposed on the global geospatial infrastructure from global to street-level. • The above illustrates the advantages of a closer cooperation between Geoinformatics specialists and scientists involved in Global Change Research. See Data Science for Climate Change

Workshops on Extremely Large Databases: Knowledge Base • CODATA Web Page to • Journals to • Journal of Data Science and • International Journal of Digital Earth to • Workshops of ELDBs to • SciDB.org and • Paradigm4.com to • Michael Stonebraker to • MIT Big Data and • Tamr to • Collaboration! http://semanticommunity.info/Data_Science/Big_Data_Science_for_CODATA/Workshops_on_Extremely_Large_Databases

Agenda • Ontology Summit 2014 Postmortem and Reading & Reasoning with Semantic Insights • 6:30 pm Welcome and Introduction Slides • 6:35 pm Continue Data Science Tutorial: Practical Data Science for Data Scientists: Data Science Students and Careers and Sarah Soliman, Rand, and IV MOOC Student Project (invited) • 7:00 p.m. Brief Member Introductions • 7:10 pm Ontology Summit 2014 Postmortem: Big Data with Semantic Web and Applied Ontology, Brand Niemann See Ontology for Big Data • 7:30 pm Two SIRA-based products: Research Assistant™ and Research Librarian™, Chuck Rehberg, Semantic Insights and Kate Goodier, Xcelerate Solutions (limited beta test in process). See A Data Science Big Mechanism for DARPA • 8:30 p.m. Open Discussion • 8:45 p.m. Networking • 9:00 p.m. Depart http://www.meetup.com/Federal-Big-Data-Working-Group/events/184305652/

Next Meetups • MIT Big Data with Sam Madden and Tamr with Michael Stonebraker • Background: See Workshops on Extremely Large Databases • 6:30 pm Welcome and Introduction • 6:35 pm MIT Big Data Initiative: bigdata@CSAIL and the new Intel Science and Technology Center for Big Data, Sam Madden • 7:10 pm Brief Member Introductions • 7:45 pm Why the current "elephants" are good at nothing, Data Tamer, and data integration issues, Michael Stonebraker • 8:30 p.m. Open Discussion • 8:45 p.m. Networking • 9:00 p.m. Depart • July and August: Once a month to be announced • Silver Line Spring Hill Metro Station Opens in July?

June 2nd Meetup: Continue Data Science Tutorial • Practical Data Science for Data Scientists: • Reading Assignments: • Chapter 13: The Life of a Chief Data Scientist • Claudia Perlich likes to understand something about the world by looking directly at the data. Claudia’s skill set includes 15 years working with data, where she’s developed data intuition by delving into the data generating process, a crucial piece of the puzzle. • Chapter 14: David Crawshaw and Josh Wills • Josh and David were responsible at Google for collecting data (frontend and backend logging), building the massive data pipelines to store and munge the data, and building up the engineering infrastructure to support analysis, dashboards, analytics, A/B testing, and more broadly, data science. • Resources: See 3/4 Specific Data Science Tools and Applications 4 • Team Homework Exercise: • Your brief chapter of what you have learned so far and hope to yet learn in this class.

Practical Data Science for Data Scientists Providing On-Line Class With Private Tutoring Class 7 http://semanticommunity.info/Data_Science/Practical_Data_Science_for_Data_Scientists

Building Data Products: Josh Wills, Senior Director of Data Science The analogy of data science to man-powered flight, with which I am very familiar! http://semanticommunity.info/Cloudera#Solve_the_Right_Problem
