
The Best Way to Get BIG DATA is By Starting Small

Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community, for Johns Hopkins University School of Medicine and Modus Operandi. http://semanticommunity.info/


Presentation Transcript


  1. The Best Way to Get BIG DATA is By Starting Small. Dr. Brand Niemann, Director and Senior Data Scientist, Semantic Community, for Johns Hopkins University School of Medicine and Modus Operandi. http://semanticommunity.info/ http://semanticommunity.info/A_NITRD_Dashboard/Making_the_Most_of_Big_Data#Story http://semanticommunity.info/Modus_Operandi December 12, 2013

  2. BIG DATA • The new Digital Government Strategy is "treating all content as data." So big data = all your content: • But just a small sample to start a pilot. • There are many Big Data Technologies to choose from and many early adopters are finding them more expensive than expected: • Use open source-free trials to pilot. • There are many Big Data Problems to solve that could “boil the ocean”: • Use a data scientist to help build a team and community for a fast, inexpensive, and small semantic data science pilot.

  3. Subcommittee on Networking and Information Technology Research and Development (NITRD Subcommittee) These three activities fostered Semantic Medline on the YarcData Graph Appliance for the White House Big Data Initiative. http://www.nitrd.gov/

  4. Data Science Team Example: Chief Data Science Officer • Chief Data Science Officer: • Dr. George Strawn, Director, White House OSTP NITRD/NCO: Semantic Medline could be the “killer” Semantic Web application for the US Federal Government • Data Science Team: • Dr. Brand Niemann, Lead • Dr. Tom Rindflesch, NLM Semantic Medline Creator • Professor Kirk Borne, George Mason University • Federal Big Data Senior Steering WG Workforce Training Initiative • Tim White, Director, YarcData Federal Global Head • Aaron Bossett, YarcData Federal Solution Architect • Dr. Eric Little, Modus Operandi Chief Scientist

  5. Generic Problems • How to get Big Data: • Unstructured (Natural Language Processing to Graph-RDF Triples) and Structured (Relational-RDF Triples) • Where to store Big Data: • Graph-RDF Triples and Relational • What to show about Big Data: • Statistics, Visualizations, and Network Graphs • Note: RDF Triples make Big Data smaller, smarter, and integrated! • Semantic Medline on the YarcData Graph Appliance is an example of the best content on the best graph data store with the best visualization results so far (in my humble opinion)! • Our Semantic Data Science Team delivered this for the recent White House Big Data Event: See Making the Most of Big Data
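
The "Relational-RDF Triples" idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the identifiers (`drug:A`, `receptor:R1`, `chem:X`), and the helper names `rows_to_triples` and `match` are hypothetical, the text-derived triple is written by hand rather than produced by a real NLP step, and a plain set of tuples stands in for a triple store.

```python
# All identifiers below (drug:A, receptor:R1, chem:X) are hypothetical placeholders.
rows = [
    {"id": "drug:A", "label": "Drug A", "targets": "receptor:R1"},
    {"id": "drug:B", "label": "Drug B", "targets": "receptor:R1"},
]


def rows_to_triples(rows, key="id"):
    """Turn every non-key column of every row into a (subject, predicate, object) triple."""
    triples = set()
    for row in rows:
        subject = row[key]
        for column, value in row.items():
            if column != key and value is not None:
                triples.add((subject, column, value))
    return triples


# Stand-in for triples an NLP step might extract from unstructured text.
text_triples = {("receptor:R1", "binds", "chem:X")}

# Structured and unstructured sources end up in one integrated graph.
graph = rows_to_triples(rows) | text_triples


def match(graph, s=None, p=None, o=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


drugs_on_r1 = [s for s, _, _ in match(graph, p="targets", o="receptor:R1")]
```

Once both sources share the same triple shape, a single pattern match such as `match(graph, p="targets", o="receptor:R1")` works across them, which is one way to read the slide's claim that triples make data "integrated".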

  6. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Work Flow

  7. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Semantic Medline Database Application See More Information: http://skr3.nlm.nih.gov/SemMedDB/MoreInfo.do http://skr3.nlm.nih.gov/SemMedDB/index.jsp

  8. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Visualization and Linking to Original Text

  9. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Bioinformatics Publication My Note: MySQL database for non-commercial use. http://bioinformatics.oxfordjournals.org/content/28/23/3158.short

  10. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Semantic Medline at NIH-NLM • Current: Web-based research tool. • Transition: Current systems re-engineered to leverage Urika (less than 5 days). • Purpose: Build a platform for users to perform increasingly complex analysis. • Immediate Requirement: Replicate current capability. • Future: Allow for increasingly complex analysis. Ability to capture and share analytics in addition to sharing data. Tailor Urika to less complex queries.

  11. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: Graphs and Traditional Technologies • Square peg, round hole: • Current technology does not support efficient representation, storage, and interaction with complex graph structures • Traditional relational models only add to an already complex structure • Traditional hardware approaches do not support efficient access to highly interconnected graphs • You don’t know what you don’t know: • Efficient relational schemas require prior knowledge of the relationships between database fields • Updating and modifying schemas frequently introduces delays and errors • Problems in partitioning the problem: • Distributed computing solutions are good, if your problem can be easily partitioned • Graphs are not predictable; accessing graph nodes across large clusters can be unwieldy at best and does not work at scale
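
The partitioning point can be made concrete with a toy measurement. The sketch below is an assumption-laden illustration, not anything from the YarcData material: the 12-node graph, the two placement rules, and the function name `cut_fraction` are all invented here. It counts the share of edges whose endpoints land on different workers, since each such edge implies a remote access during traversal.

```python
def cut_fraction(edges, assign):
    """Fraction of edges whose two endpoints are assigned to different workers."""
    cut = sum(1 for a, b in edges if assign(a) != assign(b))
    return cut / len(edges)


# Toy graph: a 12-node ring plus three long-range links (hypothetical data).
n = 12
ring = [(i, (i + 1) % n) for i in range(n)]
long_range = [(0, 6), (3, 9), (1, 7)]
edges = ring + long_range

# Two ways to place nodes on 4 workers.
by_modulo = cut_fraction(edges, lambda node: node % 4)   # round-robin placement
by_block = cut_fraction(edges, lambda node: node // 3)   # contiguous blocks of 3
```

On this toy graph, round-robin placement cuts every edge, while contiguous blocks cut 7 of 15. The block layout only helps because the ring structure was known in advance, and every long-range link is still cut, which mirrors the slide's point that unpredictable graph structure is hard to partition well.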

  12. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: The YarcData Approach Business Challenge: … • Large Shared Memory Architecture: Up to 512 TB • XMT2 Massively Multi-Threaded Processors: 128 Threads • Scalable IO: Up to 350 TB per Hour • Real-time, Interactive Analytics on Large Graph Problems

  13. Semantic Medline – YarcData Graph Appliance Application for Federal Big Data Senior Steering WG: New Use Cases • Schizophrenia • Current therapies target dopamine receptors • Not entirely effective • Side effects • Basic research is exploring glutamate and its NMDA receptor • Goal: can we use Semantic MEDLINE to discover that research trend in the scientific literature? • Cancer • With some exceptions, therapy is not effective • Has not progressed significantly in 60 years • Scientific basis • Traditionally – cancer cells • More recently – non-cancer cells (immune system) • Immune system and cancer • Connection noted in 1863 (Virchow) • But not exploited until recently • Goal: look for trends in cancer immunotherapy • Discovery Browsing Method for Exploiting Semantic MEDLINE • Cooperative reciprocity • Between system and human • Issue query • Inspect graph for “interesting” concept • Use selected concept to seed another query • Iterate until satisfied Note: See Two YouTube Video Demos: Schizo (7 minutes) and Cancer (21 minutes)
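
The discovery browsing loop listed on this slide (issue query, inspect, pick a concept, re-seed, iterate) can be sketched as follows. Everything here is a stand-in: the four predications are toy data loosely echoing the schizophrenia use case, `query` is a list filter rather than a call to Semantic MEDLINE, and `scripted_user` replaces the human who judges what is "interesting".

```python
# Toy predications (hypothetical data, not drawn from Semantic MEDLINE).
PREDICATIONS = [
    ("schizophrenia", "ASSOCIATED_WITH", "dopamine"),
    ("schizophrenia", "ASSOCIATED_WITH", "glutamate"),
    ("glutamate", "INTERACTS_WITH", "NMDA receptor"),
    ("NMDA receptor", "ASSOCIATED_WITH", "cognition"),
]


def query(concept):
    """'Issue query': return every predication that mentions the concept."""
    return [t for t in PREDICATIONS if concept in (t[0], t[2])]


def discovery_browse(seed, choose, max_rounds=10):
    """Alternate between system (query) and human (choose) until the human stops."""
    path = [seed]
    current = seed
    for _ in range(max_rounds):
        results = query(current)
        # Concepts in the result graph that have not been visited yet.
        candidates = sorted({c for s, _, o in results for c in (s, o)} - set(path))
        selected = choose(current, candidates)  # "inspect graph for interesting concept"
        if selected is None:                    # "iterate until satisfied"
            break
        path.append(selected)                   # "use selected concept to seed another query"
        current = selected
    return path


def scripted_user(preferences):
    """Stand-in for the human: picks the first preferred concept on offer, else stops."""
    def choose(current, candidates):
        for concept in preferences:
            if concept in candidates:
                return concept
        return None
    return choose


path = discovery_browse("schizophrenia", scripted_user(["glutamate", "NMDA receptor"]))
```

With this scripted user the browse goes from schizophrenia to glutamate to the NMDA receptor and then stops. Passing the human judgment in as a `choose` callback keeps the "cooperative reciprocity" between system and human explicit in the code.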

  14. Modus Operandi: Mantra, Performance, and Vision • Mantra: • Speeding the Discovery, Integration, and Fusion of Information • Performance: • SBIR Phase Three Successes: Wave Exploitation Framework (EF) • Wave EF: Government-off-the-shelf (GOTS) technology for intelligence applications that tackles the difficult problem of processing unstructured and semi-structured data • C4ISR Government Customers: U.S. Air Force, U.S. Army, U.S. Marine Corps, U.S. Navy, DARPA, DTRA, Missile Defense Agency, and Intelligence Agencies • Vision: • Wave All-Source Semantic Fusion Engine: In development to support individual medical researchers/intelligence analysts to work with big data • Semedy (former Ontoprise founders): Reasoner and Triple Store

  15. Modus Operandi: Finding the Right Needle in the Right Haystack • Dyson said. “So a lot of what we’re doing is enabling that by making the data sources accessible and searchable.” • “Our specialization is what we call ‘semantic technology,’ which is just a way of making the data smarter. We enrich the data with various tags to make it easier to find.” • The software also provides what McNeight called data “provenance,” which has to do with the traceability back to the source of the data, the really important aspect for intelligence personnel. • “We don’t make decisions,” McNeight explained. “We just help (the analyst) to make decisions and to find the right data. He may only be interested in a certain person in a certain location at a certain time. We can bring that back to him across multiple databases.” • Source: http://www.spacecoastbusiness.com/modus-operandi-delivers-information-based-intelligence/

  16. Data Science Team Example: President of Modus Operandi • President of Modus Operandi: • Richard McNeight, President, Master’s Degree in Artificial Intelligence & Computer Science, Board of Regents, Florida Institute of Technology University, Recognized for Entrepreneurial Leadership, and Recipient of Florida County Economic Development Grant for Big Medical Data • Data Science Team: • Lee Watkins, Director of Bioinformatics & IT JHMI, and Dr. Brand Niemann, Semantic Community, Co-Leads • Dr. Eric Little, Modus Operandi Chief Scientist, Ontology and Wave All-Source Semantic Fusion Engine Development • Bryan Thompson and Michael Personick, SYSTAP Principals, Bigdata® Platform • Tim Barr, YarcData Medical Informatics, and Aaron Bossett, YarcData Federal Solution Architect • Others to be added as needed • Advisors: • Dr. Tom Rindflesch, NIH/NLM Semantic Medline Creator • Dr. Richard Ford and Dr. Marco Carvalho, Florida Institute of Technology

  17. Wave and the vMDC (virtual metadata catalog, which is a query translator for non-semantic queries) [Diagram labels: Structured, Semi-structured, Unstructured Data; Batch Data; Streaming Data; Wave Ingest; Trust/Provenance Algorithms; Semantic Reasoner; Generated Semantic Graph (RDF); High Performance Triple Store (Rya); vMDC; Accumulo DB] An engine that can ingest any kind of data, transform that data into RDF graphs, then do a lot of semantic coolness with those graphs.

  18. How Wave Drives the BLADE Semantic Wiki and Other Kinds of Analytic Visualizations The wiki is just a way to view the entities in the model, make changes, and see related content without having to type any SPARQL code or really know anything about the backend model structure: just point and click at the content you want to see. [Figure labels: Apps and Visualizations; BLADE 2.0 Wiki]

  19. Possible Scenario • For medicine, the BLADE 2.0 Semantic Wiki would allow different researchers to view the data collectively from within their areas of expertise, but connect them to other areas effortlessly. • This means scientist 1 could be looking up information on a given receptor on a cell, while scientist 2 is looking at proteomic information (perhaps not even knowing it is the underlying substance of that cell/receptor). • Scientist 3 could add some new information about a given compound that shows reactions at the receptor site scientist 1 is studying. • Upon entering that information, scientist 1 would see a new linked piece of data about their receptor related to the compound, and the cool part is scientist 2 would also see information about the connection between their protein structure and that compound. • Scientist 3 would see the information about the protein related to their compound as well (since they were only looking at the receptor-compound connection). • All 3 would basically have new linked information available to pursue if they wanted. • Now imagine being able to do those kinds of joins in near-real-time with a simple tool across the entire corpus of the Semantic Medline data set. Kaboom! • Source: Dr. Eric Little, Chief Scientist and Ontologist
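
The three-scientist scenario amounts to adding one triple to a shared graph and letting everyone's view follow the new links. A minimal sketch, assuming a plain Python set as the shared store and invented identifiers (`receptor:R`, `protein:P`, `compound:X`, `cell:C`) and predicate names; it does not represent how BLADE or Wave actually work internally.

```python
# Shared graph before scientist 3 contributes (all identifiers are hypothetical).
graph = {
    ("receptor:R", "part_of", "cell:C"),          # scientist 1's area
    ("protein:P", "constitutes", "receptor:R"),   # scientist 2's area
}


def linked(graph, start):
    """Every entity reachable from start, following triples in either direction."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, _, o in graph:
            for a, b in ((s, o), (o, s)):
                if a == node and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    seen.discard(start)
    return seen


before = linked(graph, "protein:P")

# Scientist 3 records that a compound reacts at the receptor site.
graph.add(("compound:X", "reacts_at", "receptor:R"))

new_for_scientist_2 = linked(graph, "protein:P") - before
view_for_scientist_3 = linked(graph, "compound:X")
```

After the single `graph.add`, scientist 2's view gains `compound:X` and scientist 3's view already includes `protein:P`, even though neither of them entered that connection directly.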

  20. Knowledge Base: Modus Operandi Web Intelligence in MindTouch Practical Example of How to Get BIG DATA By Starting Small with Structured & Unstructured Data as Relational & RDF Triples Stored in Excel and Visualized in Spotfire. http://semanticommunity.info/Modus_Operandi

  21. Big Data in Memory: Innovation Story • Met Jef Sharp, President, Panève: • Amazing fast access and massive storage – Big Data Supercomputer on My Mobile Device • Johns Hopkins University – Blackbook (CIA Cloud) • I suggested: • Greylock Partners - #2 Data Scientist in the World (DJ Patil, Entrepreneur-in-Residence who built the first formal data science team at LinkedIn) • Works for In-Q-Tel (Robert Ames, Senior VP for Technology, In-Q-Tel) • Works for CIA (Gus Hunt, CTO, CIA) • Who Wants Big Data Supercomputer on Mobile Devices

  22. Future: Possibility Panève’s ZettaLeaf & ZettaTree Products • Scalable single level storage • Panève’s scalable single level storage model collapses the server, network, and storage by removing software and replacing it with memory system primitives. This eliminates all network and network-processing overhead associated with accessing storage and delivers a 10,000X increase in raw performance. http://semanticommunity.info/@api/deki/files/19353/exec_summary_20120916.pdf http://www.paneve.com/technology/
