
Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS)



Presentation Transcript


  1. Anomaly Detection and Analysis Framework for Terrestrial Observation and Prediction System (TOPS) Petr Votava Ramakrishna Nemani Andrew Michaelis Hirofumi Hashimoto

  2. Outline • Project Overview • Architecture • Knowledge Management System • Control Module

  3. Project Goal Provide a framework for automated anomaly detection and verification in large heterogeneous Earth science data sets, as well as on-demand data analysis, integrated with the Terrestrial Observation and Prediction System (TOPS)

  4. Use Case Scenario Example • Given a global TOPS NPP (Net Primary Productivity) anomaly product, perform the following actions for identified clustered anomalies: • Check for anomalies in the inputs to the NPP product • Climate data: if an anomaly appears in one of the climate datasets, confirm this from another source of the same variable • Satellite data: if the anomaly is in one of the MODIS satellite products (LAI - Leaf Area Index), check the QA (Quality Assurance) flags, and if QA indicates a problem, make that our default position • If QA appears fine, check for other known events that could cause an anomaly in the LAI product, such as fire. Also check the MODIS fire product for any recent fires in the area of interest.

  5. Architecture Overview

  6. Anomaly Detection and Analysis Flow (System Overview) • Anomalies are detected using one of the plug-in algorithms (Anomaly Detection) • Related datasets are analyzed to see if the anomaly is observed in other products as well (Knowledge Management System, Anomaly Confirmation) • If the anomaly is confirmed, it is classified against a set of known anomalies or is classified as new (Anomaly Classification) • A pre-defined workflow that represents an in-depth analysis process is executed for recognized anomaly classes (Anomaly Analysis) • The overall flow is managed by the logic in the Control Module.
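
The flow on this slide can be sketched in a few lines of Python. This is only an illustrative outline, not the TOPS Control Module itself: the `Anomaly` fields, the `run_flow` function, and the callback names `confirm`, `classify`, and `workflows` are all hypothetical stand-ins for the components named above.

```python
from dataclasses import dataclass


@dataclass
class Anomaly:
    product: str      # e.g. "NPP"
    location: str     # hypothetical region label
    magnitude: float  # hypothetical anomaly strength


def run_flow(anomalies, confirm, classify, workflows):
    """Confirm, classify, and analyze each detected anomaly."""
    results = []
    for a in anomalies:
        # Anomaly Confirmation: is it visible in related datasets?
        if not confirm(a):
            results.append((a, "unconfirmed", None))
            continue
        # Anomaly Classification: a known class name, or "new".
        label = classify(a)
        # Anomaly Analysis: only recognized classes have a workflow.
        workflow = workflows.get(label)
        outcome = workflow(a) if workflow else None
        results.append((a, label, outcome))
    return results


detected = [Anomaly("NPP", "region-A", 2.5), Anomaly("NPP", "region-B", 0.4)]
results = run_flow(
    detected,
    confirm=lambda a: a.magnitude > 1.0,
    classify=lambda a: "fire",
    workflows={"fire": lambda a: f"fire analysis for {a.location}"},
)
```

Passing the stages in as callables mirrors the plug-in idea on the slide: the loop stays the same while detection, confirmation, and analysis steps are swapped.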

  7. Objective: Knowledge Management System Develop a Knowledge Management System to store and query information about model hierarchies, dataset hierarchies and their relationships. Capture data and model relationships such as similarity and compatibility so that different components can be dynamically selected and matched during analysis.

  8. Anomaly Detection System Overview

  9. Building Knowledge Management System • Use Case Scenario • Knowledge Representation • Architecture • Development • Knowledge Management System Interface

  10. Use Case Scenario Example • Given a global TOPS NPP (Net Primary Productivity) anomaly product, perform the following actions for identified clustered anomalies: • Check for anomalies in the inputs to the NPP product • Climate data: if an anomaly appears in one of the climate datasets, confirm this from another source of the same variable • Satellite data: if the anomaly is in one of the MODIS satellite products (LAI - Leaf Area Index), check the QA (Quality Assurance) flags, and if QA indicates a problem, make that our default position • If QA appears fine, check for other known events that could cause an anomaly in the LAI product, such as fire. Also check the MODIS fire product for any recent fires in the area of interest.

  11. Use Case Analysis • Typical interaction among members of our group (from scientists to engineers) • There is a lot of knowledge in the previous statements that we want to capture for future use • We also want to automate much of the process to speed up any future analysis so that • When we automatically process the anomaly products, we already have a possible indication of why the anomaly is there • Problem: Things can differ dramatically from one variable to the next AND we may want to try different approaches/hypotheses as well (look at different products for the same variable/event)

  12. Queries • Based on the use case, we need the answers to the following questions: • What are the inputs to NPP? • What are the climate inputs to NPP? • What are the satellite inputs to NPP? • What are the inputs to LAI? • What event(s) can have influence on LAI? • In what product(s) can a fire event be observed? • What products are impacted by a fire event?

  13. Knowledge Capture • We need to be able to capture the knowledge in a persistent machine-readable format • Additionally we would like to be able to share this knowledge among the community • Can provide a basis for data lineage = documentation of the process by which we arrived at our conclusions • The terminology of actions performed during the analysis is standardized • Good for automated data analysis • Preferably we would re-use existing community vocabularies = attach semantic meaning to the data, models and actions • Good for automation • Great for interoperability

  14. Knowledge Representation • To build easily on the existing foundation and have a useful vocabulary that is in sync with the community - create a new taxonomy (ontology) using one of the existing ontology languages and link it with existing ontologies • Two most popular ontology languages: • Resource Description Framework (RDF) • Web Ontology Language (OWL)

  15. Language Selection • RDF • Sets of <subject, predicate, object> triples • NPP rdf:type Data OR BGC rdf:type Model • rdf:type is a predefined RDF construct • Good, but very low-level with limited inference • OWL (our choice of language) • Vocabulary extension of RDF • NPP isDerivedFrom LAI and LAI isDerivedFrom NDVI • isDerivedFrom is a custom predicate • Supports inference through Description Logics (finding implicit information from presented data) • With isDerivedFrom declared transitive, it can be automatically inferred that NPP isDerivedFrom NDVI • Reuse and integration of a large number of existing ontologies: • SWEET, Dublin Core, US Census Ontology, …
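
To make the inference step concrete, here is a minimal Python sketch that treats triples as plain tuples and computes what follows from a transitive predicate. It is not how an OWL reasoner works internally; `infer_transitive` is a hypothetical helper, and treating isDerivedFrom as transitive is the assumption that makes the NPP-to-NDVI conclusion valid.

```python
def infer_transitive(triples, predicate):
    """Return the triples implied by treating `predicate` as transitive."""
    known = {(s, o) for s, p, o in triples if p == predicate}
    closure = set(known)
    changed = True
    while changed:
        changed = False
        # Chain a->b and b->d into a->d until nothing new appears.
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    # Only the newly derived pairs are returned, so asserted and
    # inferred knowledge stay distinguishable.
    return {(s, predicate, o) for s, o in closure - known}


asserted = {
    ("NPP", "rdf:type", "Data"),
    ("BGC", "rdf:type", "Model"),
    ("NPP", "isDerivedFrom", "LAI"),
    ("LAI", "isDerivedFrom", "NDVI"),
}
inferred = infer_transitive(asserted, "isDerivedFrom")
```

Here `inferred` holds the single triple (NPP, isDerivedFrom, NDVI), which was never stated explicitly.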

  16. Knowledge Storage and Access • Standard DBMS (MySQL, PostgreSQL, …) • The ontology is XML encoded, so it could theoretically map easily to database tables, BUT this would not provide any semantics, extensibility or inference capabilities, or support for enhanced queries • On the other hand, very scalable • Ontology Data Stores (Jena, Sesame, …) • Take advantage of the RDF/OWL standard encodings • Provide a way to query the DB that is more suitable for ontologies • Can be configured to support inference • For scalability, can have a standard DB as back end • We chose Sesame for our Knowledge Storage Engine • Good extensibility (including OWL inference component) • Web service interface

  17. Query Interface • Two ways to query an ontology stored in Sesame • SeRQL • SPARQL • Similar capabilities, and we have tested both • Eventually settled on SPARQL, because it is a W3C (World Wide Web Consortium) standard • Support from a large number of tools • Ease of interoperability and integration • Example: SELECT DISTINCT ?input WHERE { tops:NPP tops:isDerivedFrom ?input }

  18. Inference Choices • There are two basic places for knowledge inference • Pre-compute it before ingestion into Sesame using one of the standard reasoners (a software component that computes inferences from a given ontology) • Can lead to a large ontology (things get inlined into a single file) • Can be hard to distinguish the original and the inferred knowledge • In the context of the storage component (on ingest) • Have the ability to distinguish original and inferred data • External ontology can stay untouched (good for development) • We use OWLIM as a Sesame Storage and Inference Layer (SAIL) in order to take advantage of the OWL inference capabilities

  19. Knowledge Management System Architecture

  20. Knowledge Management System Development • Start with the Use Cases • Map the Use Case • We have been using Mind Manager across multiple projects for a while • Use the map to define ontology • Can be done by hand, but we used Swoop and Protégé - two popular open source OWL editors

  21. Use Case Map

  22. Ontology Definition • Define classes • Data, Model, Event, … • Define relationships among classes • isDerivedFrom (Data -> Data) • hasImpactOn (Event -> Data) • Link with existing ontologies • For example, the SWEET ontology defines the term Satellite, so we can define our term Satellite_Data as Data whose source is Satellite; similarly, we can re-use the definition of Fire as a type of Event

  23. Populating Knowledge Management System • Once the ontology is defined, we can populate our system with specific instances • Define LAI as a type of Data with Satellite source • Define NDVI as a type of Data with Satellite source • Define LAI as being derived from NDVI • Define NPP as being derived from LAI • The system will reason that NPP is also derived from NDVI without having to specify it explicitly • This is a trivial example of inference

  24. Finishing Up • Populate the Sesame database • Can be done with custom scripts • Testing • Not performance testing, but a testing for validity • Can the system answer the questions that were identified in the Use Cases

  25. Sample Queries • What are satellite inputs to NPP? SELECT DISTINCT ?input WHERE { tops:NPP tops:isDerivedFrom ?input. ?input tops:hasDataSource ?s. ?s rdf:type sweet:Satellite. } • What data are influenced by a fire event? SELECT DISTINCT ?data WHERE {tops:FireEvent tops:hasImpactOn ?data. }
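
The two queries above are joins over triple patterns. The following Python sketch mimics that evaluation on a small in-memory list of triples, purely to illustrate what the SPARQL engine is asked to do; it does not use Sesame or any SPARQL library. The helpers `match` and `select` are hypothetical, and the instance names (tops:MODIS, tops:Temperature, tops:WeatherNetwork, tops:GroundStation) are assumed for illustration.

```python
def match(triples, pattern, binding):
    """Yield bindings that extend `binding` so `pattern` matches a triple."""
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds on first use and must agree afterwards.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield new


def select(triples, var, patterns):
    """Join all patterns and return the distinct values bound to `var`."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(triples, pattern, b)]
    return sorted({b[var] for b in bindings})


triples = [
    ("tops:NPP", "tops:isDerivedFrom", "tops:LAI"),
    ("tops:NPP", "tops:isDerivedFrom", "tops:Temperature"),
    ("tops:LAI", "tops:hasDataSource", "tops:MODIS"),
    ("tops:MODIS", "rdf:type", "sweet:Satellite"),
    ("tops:Temperature", "tops:hasDataSource", "tops:WeatherNetwork"),
    ("tops:WeatherNetwork", "rdf:type", "tops:GroundStation"),
    ("tops:FireEvent", "tops:hasImpactOn", "tops:LAI"),
]

# What are the satellite inputs to NPP?
satellite_inputs = select(triples, "?input", [
    ("tops:NPP", "tops:isDerivedFrom", "?input"),
    ("?input", "tops:hasDataSource", "?s"),
    ("?s", "rdf:type", "sweet:Satellite"),
])

# What data are influenced by a fire event?
fire_impacted = select(triples, "?data", [
    ("tops:FireEvent", "tops:hasImpactOn", "?data"),
])
```

With this toy data, temperature is dropped from the first answer because its source is not typed as a satellite, which is the filtering the three-pattern query expresses.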

  26. Adding New Knowledge • We plan on populating our system throughout the project • New research information becomes available • Integration of new models and datasets • We would like to get the scientists more involved so that we can have a richer set of information/knowledge available to the rest of the system • The ontology is being developed with Protégé • Good environment, but steeper learning curve • Experimenting with CMap Tools (Good candidate) • Good: Graphical interface is intuitive and easy to use • Questionable: Some problems editing existing ontology created with different tools

  27. Interface • Two native ways to access the system • Java API • For interfacing with our Control Module • Web services • For access from external and web-based systems • Synergy with our NASA ACCESS project • Bringing in semantics with web-based access to data and processing capabilities

  28. Challenges • Needed more background research than anticipated to survey the field and pick the components that fit our requirements • Can be a fairly complex undertaking - we address it by focusing on our requirements and use cases

  29. Status • Ontology definition completed • Knowledge Management System populated with initial knowledge that satisfies our use case scenarios • Sesame server up and running together with the OWLIM reasoner • All initial queries tested against the system using both web-service and Java API access

  30. Backup Slides

  31. Objective: Building the Control Module • Goals • Design Considerations • Interfaces • Data Acquisition • Data Processing • Knowledge Management System

  32. Control Module Function • Provides a logical flow for retrieving similar datasets, matching them with appropriate models and/or execution workflows, and executing these workflows • Given a set of anomalies, perform a sequence of pre-defined high-level actions that will automatically prepare datasets for analysis and further processing • It doesn’t describe the workflows to be executed on each of the datasets, but rather has the ability to execute a pre-defined workflow • Depending on the anomaly classification, we will have a separate system for defining which workflow gets triggered (Year 3)

  33. Dataset Name Resolution • Because the ontology describes the data and their relationships at different levels of abstraction, we need a process to map these into concrete datasets that we can retrieve • For example, LAI (Leaf Area Index) can map to a number of datasets in the system, but we need to be able to distinguish between these datasets in order to retrieve them and provide them to workflows • The KM system can tell us what the possible sources of LAI are, but not which one was used for a specific product - that is product-level metadata available through the product database • Need to map between ontology data identifiers (URI) and TOPS internal data identifiers (URN)

  34. Dataset Mapping • Physical data are stored in traditional relational databases • We have a large tested production system that could be expensive to transform into an ontology based database • Outside the scope of this project • Contains specific links to the data files that are needed during processing • Database id’s are URNs

  35. Dataset Mapping (2) • General data semantics are stored in the ontology-based system (Knowledge Management System) • This contains information that describes the data at a more abstract level • NPP is derived from LAI, or LAI is satellite data • MODIS is a source of LAI data • Doesn’t provide a link to a specific dataset, only information on how one dataset relates to another and how they fit into the community naming conventions • These descriptions are needed to “reason” about the analysis process, to find related data, or data that match inputs to processes

  36. Dataset Name Resolution Example • The system reports an anomaly in the Photosynthesis (GPP) product from MODIS-Terra • This is the TOPS id (URN) in the relational database: • urn:x-tops:def:phenomenon:NASA:gpp:5.0:1 • Compatible with OGC URN naming conventions • Within OWL/RDF, data are referenced using URIs • http://ecocast.arc.nasa.gov/TOPSKnowledgeBase.owl#GPP • Use it to find what we need • Input datasets - LAI, Temperature, … • For each of the datasets, find possible sources (Terra, Aqua, weather network, …) • Convert the combination of dataset (LAI) and source (Terra) from their URIs to the matching TOPS URN • For example: urn:x-tops:def:phenomenon:NASA:lai:5.0:2
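
One simple way to picture the URI-to-URN step is a lookup keyed on the (dataset, source) pair. The Python sketch below assumes such a table; the deck does not say how the mapping is actually stored. The two URNs and the #GPP URI are taken from the slide, while the `#LAI`, `#Terra`, and `#Aqua` fragments and the names `URN_TABLE` and `resolve_urn` are hypothetical.

```python
KB = "http://ecocast.arc.nasa.gov/TOPSKnowledgeBase.owl#"

# (dataset URI, source URI) -> TOPS URN. The URNs come from the slide;
# the "#LAI" and "#Terra" fragments are assumed for illustration.
URN_TABLE = {
    (KB + "GPP", KB + "Terra"): "urn:x-tops:def:phenomenon:NASA:gpp:5.0:1",
    (KB + "LAI", KB + "Terra"): "urn:x-tops:def:phenomenon:NASA:lai:5.0:2",
}


def resolve_urn(dataset_uri, source_uri):
    """Return the TOPS URN for a dataset/source pair, or None if unmapped."""
    return URN_TABLE.get((dataset_uri, source_uri))


lai_terra = resolve_urn(KB + "LAI", KB + "Terra")
lai_aqua = resolve_urn(KB + "LAI", KB + "Aqua")  # not in the table
```

Returning None for an unmapped pair lets the caller decide whether to skip that source or report a gap in the mapping.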

  37. Dataset Name Resolution Notes • Rather than combining the information into the Knowledge Management System, keep it separate • Keeps everything simpler • Easier to add knowledge to the system • One of our main goals is to get other people involved - more feasible, when the representation of the knowledge contains only the level of abstraction needed to be expressive enough • Keep data acquisition and knowledge management at different levels of abstraction • KM System needs to capture information in terms of general data products (i.e. variable dependencies) (NPP and LAI) • Data Management System needs specific information to be able to retrieve the dataset • Need to know which product(s) to request

  38. Control Module Flow • Input is a set of anomalies generated by: • Routine model runs (TOPS, WRF, Landsat Anomaly Pipeline) • With many products, we already generate anomaly information • On-demand anomaly detection runs • Algorithm testing and development • In response to external processing (NWS weather warning, …) • For each of the reported anomalies: • Check the QA information of the dataset if available • Using the KM System, find possible causes and check them against external databases of known events (fire, development, …) • Find related/similar datasets and check for anomalies at the same locations • Similar/related = for example temperature from satellite or ground stations • Re-run the anomaly detection algorithms with different datasets • The datasets could be in different resolutions (MODIS -> Landsat) • Do the same analysis for datasets used to derive the variable(s) where the anomaly is observed • If a dataset is model output, re-run the model with different datasets and re-check for the anomaly, possibly followed by picking a better higher-resolution model with known good performance over the selected region

  39. Design Considerations • Many design considerations for the future components of the system • Anomaly selection • Performance optimization • On-demand process scheduling

  40. Anomaly Selection • How do you pick the top anomalies for analysis? • Have a ranking mechanism • Number of possible ranking heuristics • Size of the area impacted by the anomaly • Potential economic impact (proximity to population center, …) • Potentially a good role for the Knowledge Management System = potential impact given anomaly magnitude and a set of related variables • More likely a combination of the two above (i.e. area size and potential impact) • Can we define a general Ecosystem anomaly score? • Score the number of different anomalies across the region and add them up for all locations • Creates a general anomaly heat-map • In itself probably a separate research topic - we’ll keep it simple, but design the system to accommodate this if needed.
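
As a rough illustration of the two ideas on this slide, the Python sketch below ranks anomalies by a weighted sum of area and potential impact, and builds a heat-map by counting how many anomaly products flag each grid cell. The weights, the normalized `area` and `impact` fields, and the function names are all assumptions; the deck leaves the actual scoring open.

```python
def rank_anomalies(anomalies, area_weight=0.5, impact_weight=0.5):
    """Sort anomalies by a weighted sum of area and impact, highest first."""
    def score(a):
        return area_weight * a["area"] + impact_weight * a["impact"]
    return sorted(anomalies, key=score, reverse=True)


def heat_map(anomaly_cells):
    """Count, per grid cell, how many anomaly products flag that cell."""
    counts = {}
    for cells in anomaly_cells.values():
        for cell in cells:
            counts[cell] = counts.get(cell, 0) + 1
    return counts


# Hypothetical anomalies with area and impact already scaled to 0..1.
candidates = [
    {"id": "a", "area": 0.2, "impact": 0.9},
    {"id": "b", "area": 0.8, "impact": 0.1},
    {"id": "c", "area": 0.5, "impact": 0.7},
]
ranked_ids = [a["id"] for a in rank_anomalies(candidates)]

# Hypothetical grid cells flagged by two anomaly products.
heat = heat_map({"NPP": [(0, 0), (0, 1)], "LAI": [(0, 1)]})
```

Keeping the weights as parameters makes it easy to try other heuristics later without changing the ranking code.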

  41. Performance Optimization • Given a limited set of resources, we may want to perform anomaly verification in a pre-defined sequence of steps that makes better use of those resources: • For example: • First check the data quality flags, if available (cheap) • Check for known events in an external database (fairly cheap) • Re-run anomaly detection on similar/related datasets (could be expensive) • Re-run anomaly detection on derived products (even more expensive) • Re-run a number of models with finer-resolution datasets (could get really expensive) • Need to define the exit criteria from this process
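
The cheapest-first ordering can be sketched as a loop over costed checks. The exit criteria in this Python example (stop at the first explanation, or when the next step would exceed a budget) are assumptions chosen for illustration, since the slide notes they are still to be defined; `verify`, the step names, and the cost units are hypothetical.

```python
def verify(anomaly, steps, budget):
    """Run checks cheapest-first; stop on an explanation or when over budget."""
    spent = 0
    for name, cost, check in sorted(steps, key=lambda s: s[1]):
        # Assumed exit criterion 1: the next step does not fit the budget.
        if spent + cost > budget:
            return {"explanation": None, "stopped_at": name, "spent": spent}
        spent += cost
        explanation = check(anomaly)
        # Assumed exit criterion 2: a check explains the anomaly.
        if explanation is not None:
            return {"explanation": explanation, "stopped_at": name, "spent": spent}
    return {"explanation": None, "stopped_at": None, "spent": spent}


# Each step is (name, relative cost, check function); costs are made up.
steps = [
    ("rerun_related", 50, lambda a: None),
    ("qa_flags", 1, lambda a: None),
    ("known_events", 5, lambda a: "fire" if a.get("near_fire") else None),
]

explained = verify({"near_fire": True}, steps, budget=20)
unexplained = verify({}, steps, budget=20)
```

In the second call the expensive re-run is never started, because the two cheap checks found nothing and the remaining budget does not cover it.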

  42. Interfaces • Data Acquisition • Data Manager component developed under AIST Sensor Web project • Plug-in framework for number of data acquisition protocols (OGC SOS, FTP, TOPS-DB) • Both local and web services capability • Need URI -> URN translation for dataset resolution • Process Execution • Process Manager component developed under AIST Sensor Web project • Based on OGC SensorML protocol • Knowledge Management • Java API and partially operational web services interface

  43. Challenges - Control Module • Decision making in the Control Module can easily cross into research on planning and scheduling • Focus on use cases and simple heuristics first, rather than going in depth into the planning and scheduling problems
