CRIS and DataSpaces by Epitaxial Growth Keith G Jeffery STFC Anne Asserson UiB With acknowledgement to http://www.mineral-forum.com/
Authors Keith G Jeffery STFC-RAL Anne Asserson UiB
Structure
• History
  • CRIS Interoperation Requirement
  • Classical Interoperation
  • CRIS Interoperation
  • Dataspaces
• Hypothesis
  • Requirement
  • Challenges
  • Solution Concept
• Achievement
  • Schema matching and mapping
  • Other Challenges
• Conclusion
History: CRIS Interoperation Requirement
• Internationally and nationally
  (a) to allow each funding organisation to make strategic funding decisions informed by knowledge of the actions of other funding organisations;
  (b) to provide an international base of reviewers for research proposals;
  (c) to provide comparative metrics on research performance and cost-benefit;
  (d) for researchers to find colleagues in areas peripheral to their own (where they claim they know the key players) on an international basis;
  (e) to encourage international innovation, taking research outputs through to wealth creation.
• For business purposes
  • Funding organisations <-----> research organisations
History: Classical Interoperation Technology
• Significant challenge in ICT / Computer Science
• For structured data sources
  • From the formal schemas, erect a global schema to reduce n(n-1) interoperations to n
• For semi-structured data sources
  • From the XML schema, as above
  • else derive a schema from the tags
• For unstructured data sources
  • Harvesting
  • Possibly advanced knowledge engineering and machine learning techniques
• All require considerable human intervention
• This does not scale for the future internet
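The reduction from n(n-1) to n can be checked with simple arithmetic. The sketch below is only an illustration: it assumes one directed converter per ordered pair of sources in the pairwise case and one mapping per source in the global-schema case, and the function names are invented for the example.

```python
def pairwise_interoperations(n: int) -> int:
    """Directed converters needed when every source maps to every other source."""
    return n * (n - 1)


def global_schema_interoperations(n: int) -> int:
    """Mappings needed when every source maps once to a shared global schema."""
    return n


# Compare the two counts for a few federation sizes.
for n in (3, 10, 50):
    print(n, pairwise_interoperations(n), global_schema_interoperations(n))
```

The gap grows quadratically with the number of sources, which is why the slide stresses that the remaining human effort per mapping is what fails to scale.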
History: CRIS Interoperation Technology
• IDEAS (1984-87), EXIRPTS (1987-89), ERGO (1997-99)
• Categorisation of techniques (2005)
  • Remote Wrapper
  • Local Wrapper
  • Catalog
  • Catalog Plus Pull (ERGO2++)
  • Full CERIF
  • Harvesting
• Confirmed by euroHORCs special group 2008
• Recommended convergence on CERIF to allow evolution to the Full CERIF technique
History: Dataspaces
• The vision behind dataspaces is that the end-user ‘sees’ a space with all relevant information required for a particular purpose, presented in a form suitable for the purpose of the end-user.
• Utilise human effort and knowledge to guide (or actually execute) the creation of a temporary, partial global schema by matching of schemas and mapping (i.e. specifying the wrapper software) for interoperation.
• More recently it has been suggested that machine learning could be employed using manually-created mappings as training data.
• Dataspaces includes the concept of ‘pay as you go’, only requiring the matching and mapping necessary for a particular interoperation instance.
• The incremental nature of building / accessing dataspaces stimulated the concept of epitaxial growth in dataspaces.
Hypothesis: Requirement
[Figure: a local CRIS exchanging queries and results with several remote CRIS, each exchange passing through CERIF]
Hypothesis: Challenges
1. discovery of relevant CRIS;
2. description of each relevant CRIS;
3. matching each description in (2) to CERIF, including
   (a) management of different character sets and multimedia representation;
   (b) management of different languages;
   (c) management of different syntax (data structure);
   (d) management of different semantics (meaning);
4. generating mappings to describe the required conversions on instances under each CRIS schema to/from CERIF;
5. generating conversion software for instances in any source CRIS;
6. managing a query in terms of a local schema (or CERIF) over all CRIS;
7. managing the response (answers);
8. management of change in schemas (3) (but usually only (c) and (d)).
Hypothesis: Solution Concept With acknowledgement to http://www.mineral-forum.com/
Epitaxial Growth
The term epitaxy comes from the Greek roots epi, meaning "above", and taxis, meaning "in ordered manner".
[Figure labels: ions; structurally congruent; crystal structure]
Achievement: Schema Matching and Mapping
• Challenges 3 and 4 (matching and mapping) are addressed using the novel technique.
• We propose epitaxial growth from a canonical CERIF schema to generate the global schema for any group of CRIS.
• In the process, the following are documented for each CRIS:
  • entities and attributes matching CERIF (CRIS = CERIF);
  • entities and attributes in CERIF that are missing in the CRIS (CERIF > CRIS);
  • entities and attributes in the CRIS that are missing in CERIF (CRIS > CERIF).
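The three documented categories amount to a set intersection and two set differences. The sketch below assumes the hard part (language, lexical, syntactic and semantic matching) has already reduced both schemas to comparable element names; the element names and the `classify` function are hypothetical.

```python
def classify(cris: set[str], cerif: set[str]) -> dict[str, set[str]]:
    """Split schema elements into the three categories documented per CRIS."""
    return {
        "CRIS=CERIF": cris & cerif,   # present in both schemas
        "CERIF>CRIS": cerif - cris,   # in CERIF, missing in the CRIS
        "CRIS>CERIF": cris - cerif,   # in the CRIS, missing in CERIF
    }


# Hypothetical, already-normalised element names.
cris_schema = {"person.name", "project.title", "project.localCostCode"}
cerif_schema = {"person.name", "project.title", "project.acronym"}

for category, elements in classify(cris_schema, cerif_schema).items():
    print(category, sorted(elements))
```

Only the "CRIS>CERIF" set feeds the epitaxial growth step; the "CERIF>CRIS" set records what a given CRIS cannot supply.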
Achievement: Schema Matching and Mapping
Schemas of CRIS and CERIF represented as fully connected cyclic graphs.
• lAnguage match (A);
• Lexical match (L);
• sYntactic match (Y);
• sEmantic match (E);
• Reconciliation;
• Epitaxial growth.
[Figure: a graph node carrying the match flags A, L, Y, E]
Achievement: Translation
• If flags AR | AN
• To reach AM
• Utilise E (semantic analysis) to achieve EMD, EMTB, EMTP, EMO
[Figure: a graph node carrying the match flags A, L, Y, E]
Achievement: Improvement
• If flags LR | LN
• To reach LM
• Utilise Y (syntactic analysis for structural positioning) to achieve YM
• Utilise E (semantic analysis for related terms) to achieve EMD, EMTB, EMTP, EMO
[Figure: a graph node carrying the match flags A, L, Y, E]
Achievement: Algorithm (sketch) 1
• AM | LM | YM | [EMD | EMO]: direct mapping
  • the issue of EMTB, EMTP is dealt with below;
• In all other cases R | N may be resolved by a later phase;
• AR | AN flag precludes lexical match unless resolved by ED | ET | EO;
• AM + LR | LN flag precludes a match unless resolved by ED | ET | EO;
• Y processing indicates nodes to attempt to match using ED | ET | EO.
Achievement: Algorithm (sketch) 2
• ED provides for multilingual term equality;
• ET processing is like Y: the processing must cycle over the graph to find the closest YM and then apply ET to find EMTP | EMTB and allocate the flag(s) appropriately;
• EO processing is like Y: the processing must cycle over the graph to find the closest match for the current node by locating EMO for terms in nodes YM-related to the current node, thus determining the semantics of the current node.
Achievement: Epitaxial Growth With acknowledgement to http://www.mineral-forum.com/
Achievement: Epitaxial Growth
• In the general case we now have two graphs with matching and non-matching nodes.
• Most likely there are excess nodes in the CRIS graph (i.e. the CRIS contains all the CERIF elements, probably with different names, plus more entities / attributes suitable for local processing).
Achievement: Epitaxial Growth: CRIS Superset
[Figure: CRIS and CERIF schema graphs side by side, the CRIS graph having more nodes than the CERIF graph; e = entity, a = attribute]
Achievement: Epitaxial Growth: CRIS Subset
[Figure: CRIS and CERIF schema graphs side by side, the CRIS graph having fewer nodes than the CERIF graph; e = entity, a = attribute]
Achievement: Epitaxial Growth: Algorithm (sketch) 1
• For nodes in the CRIS schema that do not have corresponding nodes in the CERIF schema:
  • Their syntactic positions in the graph, related to nodes in the CRIS schema, are analysed
    • starting from the root and working down the graph, left to right at each level;
  • The language is flagged
    • so that this can be the variable value input in the CERIF language attribute;
  • The lexical term for the entity or attribute name is noted, ready for translation to the canonical language for interoperation
    • with the CERIF translation attribute set to automatic;
  • The syntactic structure present in the CRIS but not in CERIF is then added to the CERIF schema as the epitaxial growth ...
Achievement: Epitaxial Growth: Algorithm (sketch) 2
• This is achieved in the following steps:
  • if the additional nodes are attributes under an existing entity ‘X’, a new CERIF entity ‘Xadditional’ is created with the attributes as nodes under it, and ‘Xadditional’ is linked with a cardinality of 1:1 to ‘X’ in CERIF using matched primary keys;
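The additional-attributes step can be illustrated with a small data model. This is a sketch under stated assumptions, not the authors' implementation: the `Entity` and `Link` classes, their fields and the function name are all hypothetical, and the new entity simply reuses the primary key of ‘X’ to express the matched-key 1:1 link.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    primary_key: str
    attributes: list[str] = field(default_factory=list)


@dataclass
class Link:
    left: str
    right: str
    cardinality: str
    key: str


def grow_additional_attributes(x: Entity, extra: list[str]) -> tuple[Entity, Link]:
    """Create 'Xadditional' for CRIS attributes that have no CERIF counterpart."""
    # The new entity shares X's primary key so the two can be joined 1:1.
    additional = Entity(f"{x.name}additional", x.primary_key, list(extra))
    link = Link(x.name, additional.name, "1:1", x.primary_key)
    return additional, link


# Hypothetical CERIF entity 'X' and two attributes found only in the CRIS.
x = Entity("X", "id", ["name"])
new_entity, link = grow_additional_attributes(x, ["localCode", "nickname"])
print(new_entity)
print(link)
```

Keeping the extra attributes in a separate entity leaves the canonical CERIF entity ‘X’ untouched, so other CRIS mapped to the same ‘X’ are unaffected.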
Achievement: Epitaxial Growth: Additional Attributes
[Figure: CRIS entity ‘Y’ split into the part matching CERIF entity ‘X’ (‘Y=X’) and the part not matching (‘Y≠X’); the non-matching part becomes the new CERIF entity ‘Xadditional’, joined to ‘X’ by a new CERIF linking relation]
Achievement: Epitaxial Growth: Algorithm (sketch) 3
• if the additional nodes include an additional entity ‘Y’, linking relations are created in CERIF to link ‘Y’ to existing CERIF entities, and the attributes of ‘Y’ are nodes under ‘Y’;
• This step depends on the complexity of the CRIS: primary/foreign key relationships (with cardinality 1:n) can be analysed to generate the linking relations, and if the CRIS already has n:m cardinality relationships using linking relations these can be mapped directly;
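The additional-entity step can be sketched for the simple 1:n case, where linking relations are derived from foreign keys. Everything here is an assumption made for illustration: schemas are plain dictionaries, the function and field names are hypothetical, and the n:m case (existing linking relations mapped directly) is not covered.

```python
def grow_additional_entity(
    name: str,
    attributes: list[str],
    foreign_keys: dict[str, str],
    cerif_entities: set[str],
) -> tuple[dict, list[dict]]:
    """Add CRIS entity `name` to CERIF and derive its linking relations.

    `foreign_keys` maps a foreign-key attribute of the new entity to the
    name of the entity it references.
    """
    # The attributes of 'Y' stay as nodes under 'Y'.
    entity = {"name": name, "attributes": list(attributes)}
    # One linking relation per foreign key that points at a known CERIF entity.
    links = [
        {"name": f"{name}_{target}", "left": name, "right": target, "derived_from": fk}
        for fk, target in foreign_keys.items()
        if target in cerif_entities
    ]
    return entity, links


entity, links = grow_additional_entity(
    "Y", ["y_id", "label", "x_id"], {"x_id": "X"}, {"X", "Z"}
)
print(entity)
print(links)
```

Foreign keys that reference entities unknown to CERIF are skipped here; in practice those targets would themselves have to be grown first.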
Achievement: Epitaxial Growth: Additional Entity
[Figure: CRIS entity ‘Y’ (‘Y≠X’ for all X), with its links to other entities, becomes the new CERIF entity ‘Y’, connected to other entities by new CERIF linking relations]
Achievement: Epitaxial Growth: Algorithm (sketch) 4
• A similar process is followed for nodes in the CERIF schema not having corresponding nodes in the CRIS schema.
Achievement: Other Challenges
• Challenge 1: DISCOVERY
  • managed by the CRIS community knowledge (e.g. DRIS) and/or by web services searches (UDDI/WSDL).
• Challenge 2: DESCRIPTION
  • the schema of any structured (or semi-structured) CRIS.
• Challenges 3 and 4: MATCHING and MAPPING
  • described above.
• Challenge 5: CONVERSION
  • using technology from Hypermedata (SkKoBeJe1999). From the schema graphs of the CRIS and CERIF, the convertor can be generated to convert instances under one schema to instances under another.
• Challenges 6 and 7: QUERY and RESPONSE
  • using technology from MIPS (JeHuKaWiBeMa94).
• Challenge 8: CHANGE
  • managed by repeating the above as required.
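For Challenge 5, the idea of generating a convertor from a mapping can be shown in miniature. This is not the Hypermedata technology, only a minimal sketch assuming the mapping has been reduced to a flat attribute-renaming table; the function name and all attribute names are hypothetical.

```python
from typing import Callable

Record = dict[str, object]


def make_converter(mapping: dict[str, str]) -> Callable[[Record], Record]:
    """Build a function that renames source attributes to their target names."""

    def convert(instance: Record) -> Record:
        # Attributes without a mapping are dropped from the converted instance.
        return {mapping[k]: v for k, v in instance.items() if k in mapping}

    return convert


# Hypothetical mapping from a local CRIS schema to target attribute names.
to_target = make_converter({"surname": "familyName", "forename": "firstName"})
print(to_target({"surname": "Doe", "forename": "Jane", "office": "R1"}))
```

A real convertor would also handle structural differences (entities split or merged, linking relations) and the language and character-set conversions listed under Challenge 3.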
Conclusion 1
• Questions facing society today can be solved, or partially solved, using research information to guide actions and developments.
• World-scale problems cannot be solved by local-scale research information.
• The base technology to support this and move forward is interoperating (CERIF-)CRIS within a (European) e-infrastructure.
• Current technologies do not scale to the ‘future internet’.
• It is the responsibility of the CRIS community to find, develop and promote a solution for interoperating CRIS that is reliable, scalable, economic and has appropriate security and privacy.
• Interoperating CRIS using epitaxial growth is a promising line of development.
Conclusion 2
And of course, it would be much more effective and efficient ...
... if all research information systems used CERIF as their storage and interoperation format.