Analysis In Motion
Kerstin Kleese van Dam et al.
Not Just One Experimental Technique
Chemical Imaging Techniques are:
• Working at different physical and temporal scales
• Utilizing different approaches – scattering, scanning, imaging
• Delivering different insights about the same material or biological system
(Image: Methods – DIAMOND Light Source, UK, 2010)
The Experimental Landscape is Evolving
• Most of our understanding of physical, chemical and biological systems and processes is gained today through inference and interpretation of experimental data and modeling
• New in-situ and in-operando experimental methods enable direct observation through finer-grained imaging (atomic level) and higher repetition rates (nanoseconds)
• Science, however, seeks to control the formation and transformation of materials and biological systems by changing experimental conditions in response to evolving phenomena
Concomitantly Evolving Analytical Demands
• Inference – Post hoc; large human effort in both analysis and interpretation; individual analysis solutions; infrequent collaborations with modelers to understand processes that cannot be observed
• Direct Observation – Continued post hoc analysis; orders of magnitude higher data volumes and rates require more scalable, automated analysis approaches – a challenge given the variability in experimental tools and methods; computational modeling is used as an analysis tool to recreate experimental conditions; human effort is strongly focused on interpretation in a big-data environment
• Control – Shift to real-time analysis directed at decision-making support during the experiment; equal use of data analysis, interpretation and predictive modeling; need to rebalance human-computer interaction in a high-velocity, highly variable data environment
Transition Points
• Effective analysis support for direct imaging requires cultural changes in the community – a move from individual analysis solutions to collaborative development of high-performance analysis pipelines
• Game changer – Effective control of chemical, physical and biological processes can only be achieved through close integration of experimental design with high-performance analytical methods and infrastructures
• Solving the direct-imaging analysis challenge is an important stepping stone toward addressing the process-control challenge
Scalability in the Face of Diversity
• Currently the domain is dominated by home-grown, one-off analysis solutions that are highly manual
• Continued order-of-magnitude increases in data rates and volumes require more performant algorithmic solutions – a higher degree of automation is a must
• Challenge – the large diversity of imaging methodologies prohibits developing customized solutions for all of them
• Data infrastructure services are easily abstracted – data movement, replication, scheduling, workflow, data management – but need to be provided in an easy-to-combine, customizable form
• Analysis methods are viewed as unique; however, deeper investigation reveals that commonalities exist across analysis workflows for different imaging methods
Velo – Collaborative Data Management and Analysis
• Low barrier of entry for HPC systems
• Easy to share data, applications and tools (an AppStore for science)
• Reproducible results
• Collaboration across sites and teams
• Highly customizable
• Less time spent on data orchestration tasks = more time for science
• Reduced project costs; new instances deployed in weeks
Users can easily keep track of their scientific projects, even if data, tools, and collaborators are geographically distributed
Experimental Analysis Component Library
• Analysis processes are made up of recurring components
• Create a library of reusable, highly optimized versions of those components
• Drastically decreased development time and cost (months to weeks or days)
• Ability to leverage existing tools, algorithms, projects and expertise
• Easy for non-domain experts to make useful contributions
(Diagram: example pipelines built from shared components – high-resolution mass spectrometry (LTQ/Orbitrap, Xcalibur, MSI QuickView, ParaView), X-ray tomography (STXM, 3D Viewer, ParaView, ImageJ, DDV), multimodal nano-scale analysis (STXM fluorescence, 2D/3D viewers, ImageJ, ParaView, ViSUS), and X-ray CT data reduction (LaGriT), all drawing on recurring components such as normalization, alignment, scaling limits, smoothing, downsampling, segmentation, isosurfaces, clustering, PCA, watershed, centroid detection, particle counting, boundary detection, filters and custom colormaps)
All tools are built from REXAN library components, many reused for new analysis processes (a sketch of this kind of composition follows).
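The REXAN library itself is not reproduced here; the following is only a minimal sketch of how reusable components might be composed into different imaging pipelines. The function and component names (normalize, downsample, smooth, pipeline) are illustrative, not the actual REXAN API.

```python
# Illustrative sketch only -- names are hypothetical, not the REXAN library API.
from typing import Callable, List
import numpy as np

Component = Callable[[np.ndarray], np.ndarray]

def normalize(img: np.ndarray) -> np.ndarray:
    """Scale intensities to [0, 1]."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img, dtype=float)

def downsample(factor: int) -> Component:
    """Return a component that keeps every `factor`-th pixel in each axis."""
    return lambda img: img[::factor, ::factor]

def smooth(img: np.ndarray) -> np.ndarray:
    """3x3 box filter via shifted averages (keeps the sketch dependency-free)."""
    out = np.copy(img).astype(float)
    out[1:-1, 1:-1] = sum(
        img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    return out

def pipeline(components: List[Component]) -> Component:
    """Compose reusable components into one analysis step."""
    def run(img: np.ndarray) -> np.ndarray:
        for step in components:
            img = step(img)
        return img
    return run

# Two different imaging workflows reuse the same building blocks.
tomography = pipeline([downsample(4), normalize, smooth])
mass_spec = pipeline([normalize, smooth])
```

The point of the sketch is the reuse: each new imaging method composes existing, already-optimized components rather than rewriting them.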
Proof of Concept – Combining New Experimental Techniques and Advanced Analysis: More Accurate Imaging of Lipids, Metabolites and Drugs in Biological Systems
Julia Laskin, Ingela Lanekoff and Mathew Thomas, Pacific Northwest National Laboratory
Nano-DESI and MSI QuickView
• Real-time streaming analysis of emerging results (speed-up from 10 h to seconds)
• User-driven secondary, adaptive analysis during the experiment
• Tool created from the REXAN library (enabling analysis of larger data volumes – 40 GB instead of 100 KB – and reusing many existing components)
Science Impact
The new technique will be essential for the cost-effective production of biofuels and pharmaceuticals.
"The new technique reflects what's in the sample," said Dr. Ingela Lanekoff. "This is a new way of gathering and analyzing data that will feed into a lot of biological applications in the future."
"We were able to distinguish what was really changing in the sample versus what was changing just because of the nature of the matrix effects," said Lanekoff.
"Basically, everything we did relied on MSI QuickView." – Dr. Julia Laskin
Computer Science Challenges posed by the Effective Control of Chemical, Physical and Biological Processes
Analysis In Motion Environment
• Single-pass
  • No access to the data stream beyond the sample
  • Data is forgotten
  • Each model's cache is small relative to the data volume (see the sketch below)
• Many algorithm options
  • The ideal combination of algorithms needs to be determined at runtime
• Cooperative user
  • Important problem knowledge isn't in the training data
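A minimal sketch, assuming a numeric stream, of what "single-pass, data is forgotten" means in practice: running statistics (Welford's update) plus a fixed-size reservoir sample, so each model's cache stays bounded regardless of stream length. The class and parameter names are illustrative.

```python
import random

class SinglePassSummary:
    """Bounded-memory view of an unbounded stream: running moments plus a
    fixed-size reservoir sample. Once an item is processed it is forgotten."""

    def __init__(self, cache_size: int = 1000, seed: int = 0):
        self.cache_size = cache_size
        self.reservoir = []          # fixed-size uniform sample of the stream
        self.n = 0                   # items seen so far
        self.mean = 0.0
        self.m2 = 0.0                # sum of squared deviations (Welford)
        self.rng = random.Random(seed)

    def update(self, x: float) -> None:
        self.n += 1
        # Welford's online mean/variance update -- no access to past data needed.
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        # Reservoir sampling keeps a uniform sample in O(cache_size) memory.
        if len(self.reservoir) < self.cache_size:
            self.reservoir.append(x)
        else:
            j = self.rng.randrange(self.n)
            if j < self.cache_size:
                self.reservoir[j] = x

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```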
Problem Characteristics
• Data rate – 2 GB–200 GB/sec and rising (see the back-of-envelope calculation below)
• Response speed – in step with evolving phenomena, at most several minutes
• Variability – each experiment is different and to some extent unpredictable
• Decision support – analysis of streaming data, interpretation in the context of domain knowledge, and prediction of possible futures
• Adaptivity – analytical methods will need to change during the analysis process in response to discoveries and changing user priorities
• Elasticity – depending on the chosen analysis methods, computational resource use will increase and decrease during the analytical process
• Cognitive barrier – humans are limited in their ability to work effectively in high-volume, high-velocity data environments
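To make the scale concrete, a back-of-envelope calculation using the stated upper rate and an assumed two-minute response window (both figures illustrative):

```python
rate_gb_s = 200          # upper end of the stated data rate, GB/s
response_window_s = 120  # "at most several minutes", here assumed 2 minutes

volume_tb = rate_gb_s * response_window_s / 1000
print(f"Data arriving within one response window: {volume_tb:.0f} TB")
# ~24 TB per decision window -- which is why sub-sampling and
# bounded-cache analysis are unavoidable.
```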
Adaptive Human-in-the-Loop Streaming Data Analysis and Interpretation
• Streaming Data Characterization
• Interpretation through Hypothesis Generation and Testing
• Human-Machine Collaboration
• Dynamic Analysis Infrastructure
Challenges in Streaming Data Characterization
• Scalable feature extraction and sub-sampling
  • Define low time-complexity clustering algorithms to find the minimal sample set for hypothesis generation and testing (see the sketch below)
  • Achieve a similar level of accuracy with potentially several orders of magnitude reduction in the working set for hypothesis generation and testing
• Detection of events in a pre-symptomatic state, even when patterns are evolving
  • Develop evolving models in a partially defined and dynamically changing environment using incremental machine learning approaches and methods
  • Allow input from subject matter experts and/or other algorithms to label and/or define new events or outcomes as they arise in streaming data, effectively replacing the need for large training data sets
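One way to read "low time-complexity clustering to find a minimal sample set" is a single-pass, leader-style clustering whose centroids serve as the working set. This is a sketch under that assumption; the distance threshold and update rule are illustrative choices, not the project's actual algorithm.

```python
import numpy as np

class LeaderClustering:
    """Single pass over the stream: each feature vector either joins the
    nearest existing cluster or starts a new one. The centroids form a small
    working set that can stand in for the full stream."""

    def __init__(self, radius: float):
        self.radius = radius
        self.centroids = []   # list of np.ndarray, one per cluster
        self.counts = []      # points assigned to each cluster

    def update(self, x: np.ndarray) -> int:
        if self.centroids:
            dists = [np.linalg.norm(x - c) for c in self.centroids]
            k = int(np.argmin(dists))
            if dists[k] <= self.radius:
                # Incrementally move the centroid toward the new point.
                self.counts[k] += 1
                self.centroids[k] += (x - self.centroids[k]) / self.counts[k]
                return k
        self.centroids.append(np.array(x, dtype=float))
        self.counts.append(1)
        return len(self.centroids) - 1

    def working_set(self) -> np.ndarray:
        """Minimal sample set: one representative per cluster."""
        return np.vstack(self.centroids)
```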
Challenges in Interpretation through Hypothesis Generation and Testing
• Streaming hypothesis reasoning
  • Create deductive stream-reasoning approaches to narrow down the solution spaces of standing queries and to test user-proposed hypotheses
  • Fixed cache size
  • Low-latency, high-throughput reasoning on ephemeral data is a hard, open problem
  • Mathematically sound propagation of beliefs/uncertainty throughout the system, taking advantage of subjective logic
  • Hypothesis ranking to match the user's interests
• Maintain and access external knowledge graphs
  • Add new facts, remove invalid facts, and extract information that is otherwise implicit in the data
  • Fast access to relevant knowledge to update the knowledge cache at runtime (see the cache sketch below)
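As a concrete illustration of the "fixed cache size" and "fast access to relevant knowledge" constraints, a hedged sketch of an LRU-bounded fact cache sitting in front of a larger external knowledge graph. The `fetch` callback and the string-valued facts are hypothetical placeholders, not the project's actual knowledge-graph interface.

```python
from collections import OrderedDict
from typing import Callable, Optional

class KnowledgeCache:
    """Fixed-size cache of facts in front of an external knowledge graph.
    `fetch` stands in for a (hypothetical) remote graph lookup."""

    def __init__(self, capacity: int, fetch: Callable[[str], Optional[str]]):
        self.capacity = capacity
        self.fetch = fetch
        self.facts = OrderedDict()   # entity -> fact, ordered by recency

    def get(self, entity: str) -> Optional[str]:
        if entity in self.facts:
            self.facts.move_to_end(entity)     # mark as recently used
            return self.facts[entity]
        fact = self.fetch(entity)              # pull from the external graph
        if fact is not None:
            self.put(entity, fact)
        return fact

    def put(self, entity: str, fact: str) -> None:
        self.facts[entity] = fact
        self.facts.move_to_end(entity)
        if len(self.facts) > self.capacity:
            self.facts.popitem(last=False)     # evict least recently used

    def invalidate(self, entity: str) -> None:
        """Remove a fact that new streaming evidence has shown to be invalid."""
        self.facts.pop(entity, None)
```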
Optimize Ensembles of Analysis Methods
• Inductive approaches (data characterization) are good at making guesses (at least as good as the training data), but they can produce false positives
• Deductive methods (hypothesis generation) rarely come to any certain conclusion
• Is there a statistically optimal way to use, and dynamically adapt, ensembles of analytical methods in a continuously changing environment?
• Characterize the effect of individual models in a population, significant interactions between models, and thresholds of performance at runtime
Impact of Ensembles on Result Accuracy
Deductive reasoning (SHyRe) vetoes false positives created by inductive approaches, leading to increased accuracy (a minimal sketch of the veto pattern follows).
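A minimal sketch of the veto pattern described above, not of SHyRe itself: an inductive model proposes scored candidates, and a set of deductive rules rejects any candidate that contradicts known constraints. The candidate structure, threshold and rule functions are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Tuple

Candidate = dict                     # e.g. {"event": "phase_change", ...}
Rule = Callable[[Candidate], bool]   # True if the candidate is consistent with known facts

def ensemble_decisions(
    inductive_scores: Iterable[Tuple[Candidate, float]],
    rules: List[Rule],
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep candidates the inductive model is confident about, unless a
    deductive rule vetoes them; the veto removes false positives."""
    accepted = []
    for candidate, score in inductive_scores:
        if score < threshold:
            continue                  # inductive model not confident enough
        if any(not rule(candidate) for rule in rules):
            continue                  # deductive veto: contradicts known constraints
        accepted.append(candidate)
    return accepted
```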
Challenges in Human-Computer Collaboration
• Biggest challenge – cognitive barriers
• How can we create a collaborative environment between human and computer in which both sides benefit?
• How do we detect and align human cognitive models of the data with the machine-centric mathematical models?
Challenges for the Analytical Infrastructure
• Limited time envelope for the overall analysis process, from data capture on the instrument to insight and decision by the user (at most minutes)
  • Needs fast data streaming from instrument to processor
  • Needs a large number of cores with a low-latency interconnect
  • Support for different programming models – classic MPI, MapReduce, graphs
• The composition of the analysis ensemble will need to change at runtime to adapt to emerging analytical needs
  • Adaptive workflows with predictive capabilities for future execution possibilities and their costs / resource requirements
  • User- and system-driven changes
• Changing analysis ensembles will require elastic resource provisioning at runtime
  • Elasticity of cloud, but performance of HPC (see the sketch below)
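A hedged sketch of what runtime elasticity could look like at the resource-provisioning layer: workers are added or released as the data rate and the active analysis ensemble change. The per-worker rate and the acquire/release calls are placeholders for a real scheduler API, not an existing system.

```python
import math

class ElasticPool:
    """Toy controller that resizes a worker pool to keep pace with the stream.
    `acquire_node` / `release_node` are placeholders for a real scheduler API."""

    def __init__(self, per_worker_rate_gb_s: float, min_workers: int = 1):
        self.per_worker_rate = per_worker_rate_gb_s
        self.min_workers = min_workers
        self.workers = min_workers

    def target_workers(self, stream_rate_gb_s: float) -> int:
        """Workers needed so the ensemble keeps pace with the current data rate."""
        return max(self.min_workers,
                   math.ceil(stream_rate_gb_s / self.per_worker_rate))

    def rebalance(self, stream_rate_gb_s: float) -> None:
        target = self.target_workers(stream_rate_gb_s)
        while self.workers < target:
            self.acquire_node()      # placeholder: request a node at runtime
            self.workers += 1
        while self.workers > target:
            self.release_node()      # placeholder: return a node to the pool
            self.workers -= 1

    def acquire_node(self) -> None: ...
    def release_node(self) -> None: ...
```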
Summary
• The experimental landscape is changing, moving from inference to direct observation and control
• Computer science can have a tremendous impact on this transition; indeed, it will be fundamental to progress
• In particular, the control of physical, chemical and biological processes in experimental environments presents a wide range of computer science challenges, including:
  • Accurate, highly scalable streaming algorithms for data characterization and hypothesis reasoning
  • Overcoming cognitive barriers in high-velocity data environments, creating symbiotic relationships between humans and computers
  • Elastic, adaptive high-performance computing platforms
Kerstin Kleese van Dam
Chief Scientist and Lead, Data Services
kerstin.kleesevandam@pnnl.gov
Co-lead, Chemical Imaging and Analysis in Motion Initiatives
aim.pnnl.gov