400 likes | 476 Views
Keith G Jeffery Director, IT & International Strategy CCLRC keith.g.jeffery@rl.ac.uk. Anne G S Asserson Research Department University of Bergen anne.asserson@fa.uib.no. Exploring the Cloud of Research Information Systematically. Structure. The Problem, Proposition, Requirement
E N D
Keith G Jeffery Director, IT & International Strategy CCLRC keith.g.jeffery@rl.ac.uk Anne G S Asserson Research Department University of Bergen anne.asserson@fa.uib.no Exploring the Cloud of Research Information Systematically
Structure • The Problem, Proposition, Requirement • Past Work, Analysis, Conclusion • Solution • Additional Aspects • Future Work • Conclusion
What I want The Problem ‘magic’ system presentation Research chemist DNA university industry PhD thesis entrepreneur synchrotron Classification scheme protein Journal paper Funding agency doctor conference The universe of information of relevance
Unstructured, semistructured Policy papers Research proposals (part) Publications Structured Research proposals (part) Research information records (commonly metadata) Research datasets Heterogeneity Character set Language Syntax Semantics The Problem: The Universe of Relevant Information Highly relevant relevant Partially relevant
person OrgUnit PERSON RelationshipUsual Relation PK FK PK OrgUnit ORGUNIT
Structured Approach Project Pr-OU Pr-Pe OU-OU Pe-Pe OrgUnit Person Pe-OU Linking relations expressing semantics
Research Text: un- or semi-structured text The research project aims to discover the particular properties of the umouzo worm that cause it to provide local tequila with its characteristic taste. The method will involve chemical analysis of the worm at various stages of its lifecycle related to the tequila: its early life in its usual forest habitat, its middle life when collected and held in artificial conditions as stock before it is placed into the bottle of tequila and finally within the tequila environment at various stages of maturity of the tequila. The research will be conducted by Professor Quoaxocoatl and his Analytical Biochemistry team, belonging to the Departments of Chemistry and Biology at the University of Chicken Itza, Mexico using the latest spectroscopic analysis techniques. The beneficiaries of the research will be the Chicken Itza Tequila company which expect to improve the flavour by understanding the chemical changes in the worms. The research will be conducted following the ethical code for experimentation on animals. The funding requested is 1,000,000 US Dollars over 3 years. Un- or Semi-structured Approach
Research Text: finding / classifying syntax The researchprojectaims to discover the particular properties of the umouzo worm that cause it to provide local tequila with its characteristic taste. The methodwill involve chemical analysis of the worm at various stages of its lifecycle related to the tequila: its early life in its usual forest habitat, its middle life when collected and held in artificial conditions as stock before it is placed into the bottle of tequila and finally within the tequila environment at various stages of maturity of the tequila. The research will be conductedbyProfessor Quoaxocoatland his Analytical Biochemistry team, belonging to the Departments of Chemistry and Biology at the University of Chicken Itza, Mexico usingthe latest spectroscopic analysis techniques. The beneficiaries of theresearch will be the Chicken Itza Tequila companywhich expect to improve the flavour by understanding the chemical changes in the worms. The research will be conductedfollowing the ethical code for experimentation on animals. The funding requested is 1,000,000 US Dollars over 3 years. Un- or Semi-structured Approach
Parsing to a defined schema for research information The research<project>aims to discover the particular properties of the umouzo worm that cause it to provide local tequila with its characteristic taste. </project> <abstract> the method will involve chemical analysis of the worm at various stages of its lifecycle related to the tequila: its early life in its usual forest habitat, its middle life when collected and held in artificial conditions as stock before it is placed into the bottle of tequila and finally within the tequila environment at various stages of maturity of the tequila. </abstract> <person>The research will be<relationship>conducted by </relationship> Professor Quoaxocoatl </person> And <relationship>his </relationship><orgunit> Analytical Biochemistry team </orggunit><relationship> belonging to the</relationship><orgunit> Departments of Chemistry </orgunit>and <orgunit> Biology </orgunit> <relationship>at the </relationship> <orgunit> University of Chicken Itza, Mexico </orgunit> <abstract>using the latest spectroscopic analysis techniques. </abstract> The <relationship> beneficiaries </relationship>of the research will be the <orgunit> Chicken Itza Tequila company </orgunit> which expect to<abstract> improve the flavour by understanding the chemical changes in the worms. </abstract>The research will be conducted<condition> following the ethical code for experimentation on Animals. </condition> The funding requested is<funding> 1,000,000 US Dollars </funding> over<duration>3 years. </duration> Un- or Semi-structured Approach
(1) how to assure quality data so that results are accurate; (2) how to assist the end-user in formulating correctly thequery to obtain the expected results; (3) how to structure the data to obtain the optimal response in terms of performance, recall and relevance including taming heterogeneity. What I want ‘magic’ system Problem: How to build the ‘magic system’
Proposition • 3 questions are related • Solution based on • Structured data (and using it as metadata) • Formal logic (reproducible) • Knowledge engineering techniques • Exposed to end user with • User-friendly assistant • Graphic metaphors
The end-user wants a homogeneous response to a query which may involve functional processing in addition to a simple retrieval. response in a reasonable time with all relevant information (recall) some indication of relevance (how closely he answer matches the query). Example: Total amount of funding by year by university spent on biomedical research For last 10 years by midday today Ensuring all universities are included With computed distance score from ‘biomedical’ in classification scheme Requirement
Real world Decision-making based on the computer system Decision-making quality depends critically on the data quality end-user presented with graphical representations of statistically-reduced or model-enhanced data Requirement: Data Quality
How does an end-user express this in a structured query or find the (relevant, complete) information via Google? Are the years calendar or financial? Are the patents measured by number or value of licences granted? Only publications in peer-reviewed journals? Universities in which countries Requirement: User Query “compare the research performance of my university against others across a range of metrics (products, patents, publications) over years”
Timeliness when required, usually immediately Relevance Of results to query Recall Completeness of results All require structured data Or structured metadata describing semi-structured or unstructured data What is the SQL for I want it now? How to match complex logic of query (with processing) to result set How to know what is in the ‘universe’ For precise matching / measurement / calculation Requirement: Data Structure
Structure • The Problem, Proposition, Requirement • Past Work, Analysis, Conclusion • Solution • Additional Aspects • Future Work • Conclusion
Related Work • Generally relevant • Information management, user interfaces, performance….. • Copious • Related directly to CRIS • limited and mainly in CRIS conferences
integrating heterogeneous distributed databases of CRIS information including open access institutional repositories research datasets represented by structured metadata into a homogeneous canonical form; harvesting semistructured information from the web to create a structured metadata index with limited information in a structured form May be only term and URL which points to the original semistructured data in its native form; Related Work - analysis
Related Work: Conclusions • unstructured or semistructured data is valuable only when indexed by structured metadata • satisfactory responses to end-user queries only produced when the data are structured, or indexed by structured metadata; • satisfactory response to end-user queries over heterogeneous distributed data requires knowledge-based techniques (And KB techniques rely on structured data) • semistructured harvesting/browsing techniques require much human effort and so do not scale • Structured data-based techniques allow for further processing, automating and scaling but more R&D is needed to make it work
Semistructured data indexed Satisfactory user query Basis for advanced KB techniques Scalability Further processing Related Work: Conclusion
Org Desc Org Desc CV CV Proj Desc Proj Desc Publ text Publ text Sci data Sci data Much human effort in selecting and integrating Human effort spent on decision-making Harvesting / browsing CERIF-CRIS
Structure • The Problem, Proposition, Requirement • Past Work, Analysis, Conclusion • Solution • Additional Aspects • Future Work • Conclusion
Solution: Quality Structured Information: Data • Data quality can only be assured if the input or edit of data attribute values is validated and – if necessary – supported with explanation of valid values • To achieve this requires structured data i.e. data arranged as attribute values in a structure to form information • validation techniques applicable to each data attribute and relationships between attributes [GoGlJe93]. • validation relies on first order logic and Boolean algebra. This demands structured data (information). • For CRIS: CERIF • data structured formally as information • Attribute values in controlled vocabularies (domain ontologies)
CERIF can act as metadata to other structured, semistructured and unstructured information Examples: OA IRs and research datasets and software. Being extended to financial, project and personnel information ensures such associated datasets are validated, retrieved and interpreted in a structured and logical context. Solution: Quality Structured Information: Metadata CERIF as metadata Repository Object e.g. document
Solution: Expert Advisor and Query Assistant • Much past work on this area. Process is: • Interact with user to discover not ‘what they say’ but ‘what they mean’ • Enhance the query appropriately for relevance, recall, performance • Replay (in natural language) query to user for confirmation • Execute query including taming heterogeneity • Provide user with answer integrated (automatically according to user preference) in preferred • Character set, Language, Syntax, Semantics, Media / Mode
Solution: Homogeneous View through Knowledge-Based Information Integration • Several known techniques for providing a homogeneous information (syntactic) & knowledge (semantic) view over heterogeneous data • http://epubs.cclrc.ac.uk/work-details?w=33728 • May be a stored view (data warehouse) • Problem of currency • Or on-demand view • Problem of performance • In any case requires • Schema matching • Query rewriting and distribution • Answer conversion and integration
Solution: Explanation • When user receives answer • Even if translated to canonical (CERIF) form • It may not be understandable so need explanation • This is done using knowledge-based techniques • The ‘reverse’ of query assistance / improvement • Using domain ontologies to associate additional information to the answer to explain
Structure • The Problem, Proposition, Requirement • Past Work, Analysis, Conclusion • Solution • Additional Aspects • Future Work • Conclusion
Analysis, Modelling, Visualisation • Decision making today is complex • Multiple parameters • Huge volumes of data, sometimes structured as information • Information of varying reliability • So typically the data is processed to represent in a succinct fashion (eg graph, map, video) • And computer models are used to produce predicted results for comparison with the recorded data • This cannot be done without structured information which is somehow made consistent
Push Technology • Push technology relies on a standard query which is executed when triggered by a (relevant) change in the observed database or an externally defined condition • Therefore it requires structured data (or structured metadata representing un- or semi-structured data)
Now, recall the example query • “compare the research performance of my university against others across a range of metrics (products, patents, publications) over years” • And let us imagine how it would be handled by the proposed system environment
Utilising the Solution: The Researcher • The researcher would find • an easy-to-use assisted interface; • behind which integration of information from multiple sources would be performed automatically, • even bringing into a homogeneous context semistructured information (notably publications and research datasets with associated software) via the structured metadata. • Thus thecloud of research informationbecomes awell-formedset of quality information.
Utilising the Solution: The Research Manager • The research manager in a funding organisation would find • an easy-to-use assisted interface with the required information from multiple heterogeneous sources integrated. • The interface provides appropriate functions for comparing the funding; the explanation engine provides background information on the way in which funding is calculated in each country. • quality information structured in context and with appropriate explanation to assist in interpretation. • interface assistance to formulate the query to ensure the correct year(s) and departments are selected and the count of publications is done correctly according to the criteria (e.g. against a list of acceptable peer-reviewed publication channels).
Utilising the Solution: The Innovative Entrepreneur • The innovative entrepreneur utilises the advanced interface • to formulate a homogeneous query over the heterogeneous national sources of information. • The query is improved by the inference engine and domain ontology to overcome the fact that terminology differs from country to country and she wishes to compare like with like in terms of patents, products and their generated wealth through licences or sales. • More complex is to compare track records in wealth creation of research groups; a graphical representation of value against time is used with annotation to explain the basis of the reasoning leading to the inclusion or exclusion of certain research outputs.
Structure • The Problem, Proposition, Requirement • Past Work, Analysis, Conclusion • Solution • Additional Aspects • Future Work • Conclusion
Future Work • As indicated more R&D required: • Faster and more effective schema integration • generation of software to then automatically manage the distributed heterogeneous queries and the answer homogenisation • Faster and more effective tools for building and maintaining domain ontologies (including consistency checking) • Tools for managing multiple languages and character sets
Structure • The Problem, Proposition, Requirement • Past Work, Analysis, Conclusion • Solution • Additional Aspects • Future Work • Conclusion
Conclusion The widespread use of web browsers gives the impression that information for decision makers is readily available. It is, but is of varying quality and requires heavy human involvement to use it. This does not scale. Structured data, or metadata representing un- or semi-structured data is the only reliable foundation. Upon this, advanced CRIS systems for decision-making can be built. Anne G S Asserson Research Department University of Bergen anne.asserson@fa.uib.no Keith G Jeffery Director, IT & International Strategy CCLRC keith.g.jeffery@rl.ac.uk
Another Entity Entity represented by a relation or table relationship One instance Structured Approach One attribute