The C2M system Paul van der Vet, Peter Geurts, Theo Huibers, Hans Roosendaal, Sjoerd van Tongeren ECCI CTIT, University of Twente, Netherlands p.e.vandervet@utwente.nl
Setting • Scientist working with multiple, heterogeneous resources such as • Databases • Knowledge bases • Programs • Task requires co-operation of resources • Whether resources are in-house or remote makes no difference
SciDashboard™ • Long-term vision: scientist’s dashboard • SciDashboard™ allows scientist to visually: • Select resources • Connect resources • Identify sources and sinks • Specify data transformations along the way • C2M is a first step towards SciDashboard™
Co-operating resources • First problem: format multiplicity • Format multiplicity is unavoidable • Standardisation is a social process with high stakes • No format caters for all needs • Second problem: combining resources • Merging, comparing, deduplicating
Format multiplicity • Chemical example: molecular structure files
Molecular structure files • About 20 formats in daily use, for example: • MDL Molfile (MOL) • Connection table (CT) • Standard Molecular Description file (SMD) • Almost all formats specify plaintext files with record-field structure • Delimiters are often space and newline characters
CT-file ethanol CH3CH2OH
ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
CT-file ethanol CH3CH2OH (atoms numbered)
ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C (1)
0.0000 0.2500 0.0000 C (2)
0.8667 -0.2500 0.0000 O (3)
1 2 1 1
2 3 1 1
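To make the record-field structure concrete, here is a minimal Python sketch of a hand-written reader for the ethanol CT-file above. It is not part of C2M. The names `Atom`, `Bond` and `parse_ct` are hypothetical, and the sketch assumes that the second line holds the atom and bond counts and that the third field of each bond line is the bond order.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    x: float
    y: float
    z: float
    element: str


@dataclass
class Bond:
    first: int   # 1-based index of the first atom
    second: int  # 1-based index of the second atom
    order: int   # assumption: third field is the bond order


def parse_ct(text):
    """Parse a whitespace-delimited CT-file laid out like the slide example."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    name = lines[0].strip()
    # Assumption: line 2 starts with the atom count and the bond count.
    n_atoms, n_bonds = (int(tok) for tok in lines[1].split()[:2])
    atoms = []
    for ln in lines[2:2 + n_atoms]:
        x, y, z, element = ln.split()[:4]
        atoms.append(Atom(float(x), float(y), float(z), element))
    bonds = []
    for ln in lines[2 + n_atoms:2 + n_atoms + n_bonds]:
        first, second, order = (int(tok) for tok in ln.split()[:3])
        bonds.append(Bond(first, second, order))
    return name, atoms, bonds


ETHANOL_CT = """ethanol.ct
3 2
-0.8667 -0.2500 0.0000 C
0.0000 0.2500 0.0000 C
0.8667 -0.2500 0.0000 O
1 2 1 1
2 3 1 1
"""

name, atoms, bonds = parse_ct(ETHANOL_CT)
print(name, [a.element for a in atoms], [(b.first, b.second) for b in bonds])
```

A reader like this is tied to one format, which is the maintenance problem the following slides raise for "roll your own" wrappers.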
MOL-file ethanol CH3CH2OH
ethanol.mol
ChemDraw03070310372D
3 2 0 0 0 0 0 0 0 0999 V2000
-1.2975 -0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0025 0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 -0.3750 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
M END
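The MOL-file carries the same atoms and bonds as the CT-file for ethanol, but with a different header and extra fields. The following Python sketch reads the slide example under stated assumptions: the function name `parse_mol_v2000` is hypothetical, fields are taken to be whitespace-separated as printed on the slide (real Molfiles use fixed-width columns), and the counts line is located by its trailing `V2000` marker rather than by a fixed header length.

```python
def parse_mol_v2000(text):
    """Read header, atoms and bonds from a MOL-file laid out like the slide."""
    lines = text.splitlines()
    # Locate the counts line by its version marker instead of assuming
    # how many header lines come before it.
    counts_idx = next(
        i for i, ln in enumerate(lines) if ln.rstrip().endswith("V2000")
    )
    header = [ln.strip() for ln in lines[:counts_idx] if ln.strip()]
    counts = lines[counts_idx].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    atom_start = counts_idx + 1
    for ln in lines[atom_start:atom_start + n_atoms]:
        x, y, z, element = ln.split()[:4]  # the remaining fields are ignored
        atoms.append((element, float(x), float(y), float(z)))

    bonds = []
    bond_start = atom_start + n_atoms
    for ln in lines[bond_start:bond_start + n_bonds]:
        first, second, order = (int(tok) for tok in ln.split()[:3])
        bonds.append((first, second, order))
    return header, atoms, bonds


ETHANOL_MOL = """ethanol.mol
ChemDraw03070310372D
3 2 0 0 0 0 0 0 0 0999 V2000
-1.2975 -0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0025 0.3750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3000 -0.3750 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
2 3 1 0 0 0 0
M END
"""

header, mol_atoms, mol_bonds = parse_mol_v2000(ETHANOL_MOL)
print(header[0], [a[0] for a in mol_atoms], mol_bonds)
```

The connectivity that comes out matches the CT-file (C-C-O with two single bonds), while the coordinates and the surrounding syntax differ. That is the format multiplicity the talk is about.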
Wrappers • Wrapper tools exist such as • Chemistry: Babel, ChemDraw • Molecular biology: SRS • Bibliography management: EndNote, bp • Disadvantage: adding a new format is impossible or very difficult • “Roll your own” wrappers: awk, perl • Difficult to maintain
Wrapper generators • Basic idea: produce wrapper from high-level description of formats • Often two-step process: A → R → B with R an internal representation • Obvious argument: two-step process takes fewer converters than direct conversion • Disadvantage: R fixed and dedicated
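The "fewer converters" argument can be checked with simple arithmetic. The sketch below assumes one-way converters and that every format must be convertible into every other. The function names are hypothetical.

```python
def direct_converters(n_formats: int) -> int:
    """One-way converters needed when each format converts straight to each other."""
    return n_formats * (n_formats - 1)


def two_step_converters(n_formats: int) -> int:
    """One reader (A -> R) plus one writer (R -> B) per format."""
    return 2 * n_formats


n = 20  # "about 20 formats in daily use"
print(direct_converters(n), two_step_converters(n))  # 380 40
```

Under these assumptions, adding a 21st format costs 40 new direct converters but only 2 converters in the two-step scheme.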
Preparing for middleware • Keyword: modularisation • Stakeholders are responsible for their own specifications, for example: • Content provider offers syntactic format description • User determines internal representation • Internal representation allows combination of resources
The C2M system • C2M: chemical configurable middleware • Implemented in Quintus Prolog • Current state: a wrapper generator • Wrappers produced from high-level specifications of formats and internal representation • Internal representation chosen by user, if desired per task • C2M can be extended to middleware
Current C2M is … • a specification language • for specifying the format of foreign files • for specifying the internal representation • a programming language • for programming wrappers by means of specifications • for inserting copious documentation • a system • for producing wrappers and their documentation
C2M specifications • Two kinds of specifications, each in a file of its own: • Specification of internal representation • Specification of file format • Internal representation: ontology • File format specification: read-only, write-only, or both read and write
Language design principles • Adhere to well-known designs • HTML (tags and tag attributes) • context-free grammar (as in BNF) • functions • Use or mimic well-known symbols • grammar rules: lhs -> rhs1 rhs2 rhs3 (→) or lhs ::= rhs1 rhs2 rhs3 (as in BNF) • instantiation: lhs <- funct(arg1, arg2) (←)
Ontology • Frame system • Tree structure with concepts and attributes • Three kinds of concepts: • concept1 = concept2 concept3 concept4 • concept1 = repeated(concept2) • primitive concepts (leaves) • Leaves hold information
Ontology example
<C2M-SPECIFICATION type="ontology" name="simple-ont">
<ONTOLOGY>
sentence = repeated(word)
</ONTOLOGY>
</C2M-SPECIFICATION>
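As a rough illustration of the tree structure, the three kinds of concepts can be modelled in Python as below. This is a sketch, not C2M's Prolog implementation. The class `Concept` and the function `leaf_names` are hypothetical, and the `molecule` ontology is an invented example that does not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    name: str
    # concept1 = concept2 concept3 concept4
    parts: List["Concept"] = field(default_factory=list)
    # concept1 = repeated(concept2)
    repeated: Optional["Concept"] = None

    def is_leaf(self) -> bool:
        """Primitive concepts have neither parts nor a repeated child."""
        return not self.parts and self.repeated is None


def leaf_names(concept: Concept) -> List[str]:
    """Collect the names of the leaves, which are the nodes that hold information."""
    if concept.is_leaf():
        return [concept.name]
    children = list(concept.parts)
    if concept.repeated is not None:
        children.append(concept.repeated)
    names: List[str] = []
    for child in children:
        names.extend(leaf_names(child))
    return names


# The ontology from the slide: sentence = repeated(word)
word = Concept("word")
sentence = Concept("sentence", repeated=word)

# Hypothetical chemical ontology, invented for illustration only:
# molecule = repeated(atom); atom = x y z element
atom = Concept("atom", parts=[Concept(n) for n in ("x", "y", "z", "element")])
molecule = Concept("molecule", repeated=atom)

print(leaf_names(sentence), leaf_names(molecule))
```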
File format specification • File format specification: grammar + semantic bindings • Grammar specifies structure • System uses grammar to produce parse tree • Semantic bindings map nodes in parse tree onto concepts in internal representation
File format spec example
<C2M-SPECIFICATION type="file-format" name="simple-form">
<READGRAM> .... </READGRAM>
<SBREAD> .... </SBREAD>
....
</C2M-SPECIFICATION>
File format spec: readgram
<READGRAM>
<ULG>
line -> string
line -> sp-string+
sp-string -> spaces string
</ULG>
<LLG>
spaces -> space+
string -> printable-char+
</LLG>
</READGRAM>
File format spec: sbread
<SBREAD>
sentence =^ line
word <- identity(string)
</SBREAD>
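The division of labour between grammar and semantic bindings can be imitated by hand in Python. The sketch below is only an analogy to what a generated wrapper might do, not C2M output. It makes three assumptions: the two `line` rules are read together as "a string followed by zero or more space-prefixed strings"; `printable-char` means a non-space printable ASCII character; and `=^` is read as "corresponds to". The function names `parse_line` and `bind` are hypothetical.

```python
import re

# Assumption: printable-char is any ASCII character from '!' to '~'.
STRING = r"[!-~]+"
# Assumption: a line is one string followed by zero or more "spaces string".
LINE = re.compile(rf"{STRING}(?: +{STRING})*")


def parse_line(text):
    """Grammar step: build a parse tree ('line', [('string', s), ...])."""
    if not LINE.fullmatch(text):
        raise ValueError(f"not a valid line: {text!r}")
    return ("line", [("string", s) for s in re.findall(STRING, text)])


def identity(value):
    """The binding function named on the slide: returns its argument unchanged."""
    return value


def bind(tree):
    """Binding step: sentence =^ line; word <- identity(string)."""
    label, children = tree
    if label != "line":
        raise ValueError("expected a 'line' node")
    words = [identity(s) for kind, s in children if kind == "string"]
    return {"sentence": words}


tree = parse_line("the quick  fox")
result = bind(tree)
print(result)
```

Keeping `parse_line` and `bind` separate mirrors the slides: the grammar could be supplied by the content provider, while the bindings depend on the internal representation the user has chosen.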
Claims C2M is • sufficiently expressive • fully declarative • a literate programming environment (specification and documentation in one) • easy to learn • amenable to division of labour
Claims (contd.) • Compared to ChemDraw and the like, C2M: • Allows for easy addition of new formats • Allows format specifications to be reused • Prepares for true middleware • Compared to “roll-your-own” wrappers, C2M: • Facilitates reuse and adaptation • Facilitates extensive documentation
To be done (short term) • Stabilise system • Experiment • Provide extensive manual and documentation • Prepare system for others to experiment with • But the current version is implemented on a proprietary software platform (Quintus Prolog)
To be done (long term) • Test language by means of user surveys • Develop version 2 • Version x may well be wholly visual • Embed system in larger environment • SciDashboard™ • “Habitable Interfaces”