Data Knowledge Base for HENP Experiments: Kurchatov Institute R&D Project. Maria Grigorieva, National Research Center “Kurchatov Institute”. Outline: 2016 kick-off meeting and the first R&D phase; setup of NRC KI/TPU teams; metadata sources in ATLAS; DKB architecture prototype.
Data Knowledge Base for HENP Experiments: Kurchatov Institute R&D Project
Maria Grigorieva, National Research Center “Kurchatov Institute”
Outline
• 2016 kick-off meeting and the first R&D phase
  • Setup of NRC KI/TPU teams
  • Metadata sources in ATLAS
  • DKB architecture prototype
• April–May discussions within DCC and focus topics for 2017
  • Technical discussion highlights
  • Towards the DCC whiteboard
• Project status and plans
Data Knowledge Base kick-off meeting highlights, May 2016
DKB motivation (Torre Wenaus talk, May 2016): “Whether we can/should work to capture and present the whole process from physicist idea ➡ production intent ➡ production request ➡ production status ➡ completion of the full processing chain ➡ available data”
DKB basic consideration
• Organizing metadata in ATLAS so as to provide a holistic view of physics topics, including an integrated representation of all ATLAS documents (papers, drafts, supporting documents, conference notes, Indico meetings, Twiki pages, etc.) and the corresponding data samples.
https://indico.cern.ch/event/527581/
ATLAS metadata sources
• Public results
  • CERN Document Server
  • CERN Twiki
  • Indico
  • GLANCE (Papers and ConfNotes)
• Data management and analysis
  • Production System Database
  • ATLAS Metadata Interface (AMI)
  • Rucio DDM
  • Google Docs (dataset spreadsheets)
  • JIRA
  • ATLAS SVN Repository
  • BigPanDA Monitoring System
• Physics and experiment environment
  • ATLAS Geometry Database
  • ATLAS Conditions Database
  • ATLAS Twiki Pages
All these data sources exist independently. Some metadata are integrated, but in general scientists have to work out the cross relations among metadata, both within each category and between categories, by themselves. DKB is designed to fill this gap in metadata integration by looking for cross references among the metadata stored in the various data sources.
Ontological modelling: main classes of the DKB ontological data model
• First scope: insufficient connectivity between ATLAS public results and the underlying data samples.
  • Data samples used in ATLAS Papers are not specified explicitly.
  • The AMI/GLANCE interface implements this feature, but it does not work for Papers published earlier.
• Each Paper is based on Supporting Documents.
  • Supporting Documents are supposed to describe how the data samples were obtained.
  • Extracting dataset names from the text of these Documents makes it possible to relate datasets to a Paper.
• Experiment-specific metadata extracted from the Documents provide a detailed categorization of Papers and data samples by different parameters.
• AMI and the Production System hold detailed metainformation about campaigns, projects, and other experimental attributes.
  • Adding these metadata to the model extends the categorization of datasets and Papers.
• Metadata sources for DKB:
  • GLANCE: the GLANCE API returns a list of all ATLAS Publications with links to the corresponding Supporting Documents
  • CERN Document Server: the main source of Document metadata
  • ATLAS Supporting Documents content: unstructured PDF converted to a structured JSON view
  • Production System: dataset/container metadata
  • AMI: Campaigns, Projects, Data Taking Periods, …
Metadata extraction from supporting documents
• The PDFAnalyzer tool extracts metadata from the TXT and XML representations of the PDF text, using regular expressions and context analysis.
• PDFAnalyzer extracts:
  • dataset names, by regular expression
  • dataset metadata, from tables
  • experiment-specific metadata, from text
• Returns structured metadata in JSON
• Has a GUI that allows manual correction of the analysis results:
  • Dataset list
  • Dataset attributes in tables
[Figure: TXT and XML inputs]
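The regular-expression step above can be sketched as follows. This is a minimal illustration, not the PDFAnalyzer code: the dataset-name pattern, the energy pattern, the `extract_metadata` helper, and the sample text are all assumptions made for the example, and real ATLAS dataset names may need a different pattern.

```python
import json
import re

# Hypothetical pattern: a project prefix such as "mc15_13TeV" or "data15_13TeV"
# followed by at least three dot-separated fields. This is an assumption made
# for illustration, not the expression used by PDFAnalyzer.
DATASET_RE = re.compile(r"\b(?:mc|data)\d{2}_[A-Za-z0-9]+(?:\.[A-Za-z0-9_\-]+){3,}")

# Hypothetical pattern for a collision energy quoted in TeV.
ENERGY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*TeV")


def extract_metadata(text: str) -> dict:
    """Collect unique dataset names and energies found in plain text."""
    datasets = sorted(set(DATASET_RE.findall(text)))
    energies = sorted({float(value) for value in ENERGY_RE.findall(text)})
    return {"datasets": datasets, "energy_TeV": energies}


# Invented sample text standing in for the TXT representation of a PDF.
sample = (
    "The analysis uses pp collisions at 13 TeV. Signal events come from "
    "mc15_13TeV.410000.ttbar_sample.merge.DAOD_TOPQ1.e3698_s2608 and data "
    "from data15_13TeV.00276262.physics_Main.merge.DAOD_TOPQ1.f620_m1480."
)

result = extract_metadata(sample)
print(json.dumps(result, indent=2))  # structured JSON view of the metadata
```

A pattern this loose will also produce false positives, which is presumably one reason a manual correction step is useful.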
Document analysis statistics
• PDFAnalyzer extracts the following metadata from PDF documents:
  • Energy
  • Data taking year
  • Real data project name (suggestion)
  • Integrated luminosity
  • Collision type
  • Datasets
  • Campaign(s)
  • MC project name (suggestion)
  • Datasets (in table format)
  • Datasets (in plain text format)
Information about the datasets used in the analysis (in plain text or in tables) was found in only 32% of the analyzed documents.
Number of analyzed documents: 402
Kafka Streams for dataflow automation
• Developed:
  • Adapters to reuse existing data processing modules (extracting, transforming, uploading, …) within a Kafka-driven dataflow
  • A management utility to run the dataflow stages
• Formulated basic rules for writing dataflow modules
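The adapter idea can be illustrated without any Kafka client. The sketch below is an assumption about the general shape of such an adapter, not the DKB implementation: a plain Python list stands in for a topic, and `as_stream_stage`, `run_dataflow`, `add_project`, and `mark_ready` are hypothetical names.

```python
import json
from typing import Callable, Iterable, Iterator, List

Record = dict
StreamStage = Callable[[Iterable[str]], Iterator[str]]


def as_stream_stage(func: Callable[[Record], Record]) -> StreamStage:
    """Adapt a record-level function to a stream of JSON messages."""
    def run(messages: Iterable[str]) -> Iterator[str]:
        for raw in messages:
            record = json.loads(raw)  # decode the incoming message
            yield json.dumps(func(record), sort_keys=True)  # encode the result
    return run


def run_dataflow(messages: Iterable[str], stages: List[StreamStage]) -> List[str]:
    """Chain the stages lazily, then drain the final stream into a list."""
    stream: Iterable[str] = messages
    for stage in stages:
        stream = stage(stream)
    return list(stream)


# Two hypothetical pre-existing modules that know nothing about streaming.
def add_project(record: Record) -> Record:
    return {**record, "project": record["dataset"].split(".")[0]}


def mark_ready(record: Record) -> Record:
    return {**record, "status": "ready_for_upload"}


# Invented dataset name used only to exercise the pipeline.
incoming = [json.dumps({"dataset": "mc16_13TeV.123456.example.recon.AOD.e1_s2"})]
results = run_dataflow(
    incoming, [as_stream_stage(add_project), as_stream_stage(mark_ready)]
)
print(results[0])
```

Keeping each module a plain record-to-record function means it can be tested on its own and later attached to a real message broker by swapping only the adapter.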
DKB architecture prototype
[Diagram panels: Consolidation, Virtual Integration, Hybrid Integration]
Web interface for Virtuoso storage
Ontodia is a JavaScript library for visualizing, navigating, and exploring data from the underlying data sources as an interactive graph.
http://bamboo.zoo.computing.kiae.ru/dkb/app/
DKB summary
DKB program code on GitHub: https://github.com/PanDAWMS/dkb
Technology evaluation and system prototype architecture:
• RDF storage: Virtuoso
• Transitional storage: Hadoop
• Metadata streaming & processing: Apache Kafka
• Knowledge Base navigation (Web GUI): Ontodia
• PDF documents as a metadata source
Near-term plans (this year)
• Fully automated metadata streaming
• Federated prototype
  • Installation and configuration of a SPARQL endpoint for Oracle DEFT
• DKB with user authentication
  • CERN SSO (under investigation)
DKB/DCC whiteboard
The DCC group is working on data curation and characterization (through event metadata) to provide reporting to the Computing and Data Preparation groups.
DKB/DCC technical discussions
• Data Knowledge Base Technical Discussion, 18 Apr 2017: https://indico.cern.ch/event/628962/
• Data Knowledge Base / DCC Technical Discussion, 20 Apr 2017: https://indico.cern.ch/event/632634/
• DCC Meeting, 18 May 2017: https://indico.cern.ch/event/640192/
• It was suggested that DKB become part of the DCC Whiteboard as a searchable system, providing flexible search over metadata and over metadata cross relations in the various data sources.
• The DKB team was asked to investigate methods for discovering and categorizing campaigns and datasets. We started by analyzing the metainformation in the PMG and APG Twiki pages and suggested ways to represent it in a structured and searchable form.
MC campaign production details in the APG Twiki
• Production steps configuration:
  • Simulation Strategy
  • Project Name
  • MC Generators
  • ATLAS Software Release
  • Number of Events / Events per File
  • ATLAS Geometry
  • Conditions Tags
  • Production Tags
  • Bunch Spacing
  • Trigger Menu
  • Data Taking Period / Run Numbers
[Diagram: Event Generation, Detector Simulation, Digitization & Reconstruction]
https://twiki.cern.ch/twiki/bin/view/AtlasProtected/AtlasProductionGroupMC15a
Twiki summaries
• Twiki campaign/subcampaign pages contain:
  • Aggregated event reports, categorized by physics category
  • Data sample lists for each physics category
  • Data sample lists broken down by request description, including a set of parameters such as:
    • Generators: Powheg+Pythia8, …
    • Physics channel: W+ in mumu, W- in taumu, Z/gamma* in tau tau, ...
    • Filtration: without lepton filter, two lepton filter, ...
    • Tune software: AZNLO CT10, …
DKB in the DCC whiteboard
• Metainformation in Twiki pages is not always up to date, but it offers the categorization and representation of metadata that suits end users best.
• Production System metadata were enhanced with ‘hashtags’ for tasks (starting from the MC16 campaign), allowing a more detailed search by physics category, physics channel, list of MC generators, etc.
• BigPanDA Monitor has aggregated summaries for datasets, tasks, requests, and so on.
• DKB stores metainformation about ATLAS Papers, the corresponding Supporting Documents, and the underlying dataset metadata in RDF storage.
• DKB provides a flexible search by various sets of parameters, backed by an index storage.
  • For example: ‘Find data samples generated in MC16a for the Wjets physics topic, using Powheg+Pythia8 generators and the W+ in mumu physics channel’
  • A real DKB request could look like a string of hashtags: ‘MC16a, Wjets, Powheg+Pythia8, w+ in mumu’
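A request written as a string of hashtags can be turned into a filter over task hashtags along the following lines. This is a minimal sketch under stated assumptions: the task names and their hashtags are invented, `parse_query` and `find_tasks` are hypothetical helpers, and matching is assumed to be case-insensitive and to require every requested tag.

```python
from typing import Dict, FrozenSet, List


def parse_query(query: str) -> FrozenSet[str]:
    """Split a comma-separated hashtag string into normalized tags."""
    return frozenset(tag.strip().lower() for tag in query.split(",") if tag.strip())


def find_tasks(query: str, tasks: Dict[str, List[str]]) -> List[str]:
    """Return names of tasks whose hashtags include every tag in the query."""
    wanted = parse_query(query)
    return sorted(
        name
        for name, tags in tasks.items()
        if wanted <= {tag.lower() for tag in tags}
    )


# Hypothetical task names and hashtags, invented for illustration only.
TASKS = {
    "task_A": ["MC16a", "Wjets", "Powheg+Pythia8", "W+ in mumu"],
    "task_B": ["MC16a", "Wjets", "Powheg+Pythia8", "W- in taumu"],
    "task_C": ["MC16c", "Diboson", "Sherpa"],
}

print(find_tasks("MC16a, Wjets, Powheg+Pythia8, w+ in mumu", TASKS))
print(find_tasks("mc16a, wjets", TASKS))
```

Dropping tags from the request widens the result, so the same mechanism serves both narrow lookups and broad category browsing.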
Functional prototyping
Prototype No. 1: search for metainformation related to a physics category (SingleTop, Ttbar, Higgs, Exotic, …) for the MC16 campaign in:
• Production System
• BigPanDA Monitoring aggregated reports
• Twiki APG and PMG pages
• CDS Papers & Supporting Documents
DKB in the DCC Whiteboard: functional components
• Index storage (Lucene/ElasticSearch)
• Metadata crawler, which extracts metainformation from the Production System, BigPanDA monitor reports, and Twiki pages, and puts it into the index storage
• User request converter
• User interface
• The BigPanDA Monitoring system is supposed to become the “launching pad” for the DKB/DCC Whiteboard prototype.
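How a crawler and an index storage fit together can be shown with a small in-memory inverted index. This is only a sketch: it does not use Lucene or ElasticSearch, and the `TagIndex` class, the `crawl` function, the source names, and all records are hypothetical stand-ins for the real components.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class TagIndex:
    """In-memory inverted index: tag -> set of document identifiers."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, tags: Iterable[str]) -> None:
        for tag in tags:
            self._postings[tag.strip().lower()].add(doc_id)

    def search(self, tags: Iterable[str]) -> List[str]:
        """Return sorted ids of documents carrying all of the given tags."""
        hits = [self._postings.get(tag.strip().lower(), set()) for tag in tags]
        return sorted(set.intersection(*hits)) if hits else []


def crawl(sources: Dict[str, List[dict]], index: TagIndex) -> None:
    """Walk every source and index each record under a source-prefixed id."""
    for source, records in sources.items():
        for record in records:
            index.add(f"{source}:{record['id']}", record["tags"])


# Hypothetical records standing in for the real metadata sources.
SOURCES = {
    "prodsys": [
        {"id": "task1", "tags": ["MC16a", "Diboson"]},
        {"id": "task2", "tags": ["MC16c", "Diboson"]},
    ],
    "twiki": [{"id": "AtlasProductionGroupMC16a", "tags": ["MC16a", "Diboson"]}],
}

index = TagIndex()
crawl(SOURCES, index)
print(index.search(["Diboson", "MC16a"]))
```

Prefixing each identifier with its source keeps hits from different systems distinguishable in a single result list, which is the cross-source view the whiteboard is meant to give.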
DKB search: Diboson
DKB currently indexes and searches over:
• Physics topics simulation, e.g. “Diboson”, “Diboson in MC16a”
  • MC16a: https://twiki.cern.ch/twiki/bin/view/AtlasProtected/AtlasProductionGroupMC16a
  • MC16c: https://prodtask-dev.cern.ch/prodtask/request_hashtags_main/#/hashtags/&MC16c_CP
DKB/DCC near-term plan
• BigPanDA Monitoring:
  • Reimplement the physics category breakdown report on the MC16a campaign report page, using hashtags from the Production System
  • Provide content blocks to be filled by DKB interfaces
• DKB team:
  • Prepare the backend infrastructure at CERN (currently suggested to be ElasticSearch) and fill it with metadata indexed by hashtags
  • Implement a crawler for indexing the available metadata
Thanks
• This talk drew on presentations, discussions, comments, and input from many people. Thanks to all, including those I’ve missed.
• Mikhail Borodin, Kaushik De, Gancho Dimitrov, Dmitry Golubkov, Alexei Klimentov, Mikhail Korotkov, Dimitry Krasnopevtsev, Siarhei Padolski, Eygene Ryabinkin, Anatoly Tuzovsky, ...
• Special thanks go to Torre Wenaus, who initiated this work, for his ideas about the Data Knowledge Base content design.
This work was funded by the Russian Ministry of Science and Education under contract #14.Z50.31.0024 and by the Russian Foundation for Basic Research under contract #16-37-00246.