Turning information into knowledge: the challenges of integrating diverse information sources Alex Poulovassilis, Birkbeck, U. of London Co-Director of the London Knowledge Lab. The London Knowledge Lab. Institute of Education University of London. Birkbeck College University of London.
Turning information into knowledge: the challenges of integrating diverse information sources • Alex Poulovassilis, Birkbeck, U. of London • Co-Director of the London Knowledge Lab
The London Knowledge Lab • Institute of Education, University of London • Birkbeck College, University of London • Purpose-designed building • Science Research Infrastructure Fund: £6m • Research staff and students: 50 • Location: Bloomsbury • Open: June 2004 • Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies … • Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...
LKL mission • to understand how digital technologies and media are transforming people’s relationships to information, learning and culture at home, work and play • to design, build and evaluate systems, processes and interfaces which enhance learning throughout life • to examine critically the assumptions about knowledge and learning that underlie the different uses of digital technologies • The starting point for our mission is that digital technologies and new media will change how we learn, work, collaborate and communicate
LKL research themes Our research is funded by projects from EU, EPSRC, ESRC, BBSRC, JISC, Wellcome Trust – currently about 25 projects. Four broad themes guide our work and inform our research: • new forms of knowledge • turning information into knowledge • the changing cultures of new media • creating empowering technologies for formal and informal learning
New forms of knowledge • What do children and adults of the twenty-first century need to know? • How can we learn in new and more effective ways? • What kinds of knowledge are emerging in the knowledge economy? • How can this knowledge be made more accessible to more people?
Turning information into knowledge • The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies • How can people benefit from this information in their learning, working and social lives ? • What new techniques are necessary for managing, accessing, integrating and personalising such information ? • How to design and build tools that help people to understand such information and generate new knowledge from it ?
The changing cultures of new media • What are the differences and continuities between ‘old’ media (books, film, TV) and ‘new’ media (internet, computer games, mobile phones)? • How do children and adults use these media in different contexts, both as consumers and producers? • How are they learning in, and from, this convergent media environment? • What are the implications of these developments for formal and informal learning?
Creating empowering technologies for learning • How are equity, participation, learner autonomy, and the structuring of learning impacted by digital technologies and new media? • Which media-enhanced approaches can help people to learn and collaborate? • How can the Internet, and ambient and mobile technologies create new learning opportunities?
Turning information into knowledge – information integration AutoMed (EPSRC) – developing tools for semi-automatic integration of heterogeneous information sources – can handle both structured and semi-structured (RDF/S, XML) data – can handle virtual, materialised and hybrid integration scenarios – application in biological data integration, e-learning, p2p data integration ISPIDER (BBSRC e-Science programme) – developing an integrated platform of proteomic data sources, enabled as Grid and Web services – collaboration with groups at EBI, Manchester, UCL
The AutoMed Project • Partners: Birkbeck and Imperial Colleges • Data integration based on schema equivalence/subsumption • Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level data modelling languages are defined – extensible therefore with new modelling languages • Provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages: • addT(c,q) deleteT(c,q) renameT(c,n,n’) • Also two more primitive transformations for imprecise integration scenarios: • extendT(c,Range q q’) contractT(c,Range q q’)
Features of the AutoMed toolkit • Schema transformations are automatically reversible: • addT/deleteT(c,q) by deleteT/addT(c,q) • extendT(c,Range q1 q2) by contractT(c,Range q1 q2) • renameT(c,n,n’) by renameT(c,n’,n) • Hence bi-directional transformation pathways (more generally transformation networks) are defined between schemas • The queries within transformations allow automatic data and query translation • Schemas may be expressed in a variety of modelling languages
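The automatic reversibility described above can be illustrated with a minimal sketch. It assumes a simplified, hypothetical `Step` record for a primitive transformation; the class, the function names and the IQL-like query strings are illustrative only and are not the AutoMed toolkit's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical, simplified model of a primitive transformation.
# The real toolkit's classes are not shown in these slides.
@dataclass(frozen=True)
class Step:
    kind: str               # "add", "delete", "extend", "contract" or "rename"
    args: Tuple[str, ...]   # (construct, query) or (construct, old_name, new_name)

# add/delete and extend/contract undo each other with the same arguments.
_INVERSE = {"add": "delete", "delete": "add",
            "extend": "contract", "contract": "extend"}

def reverse_step(step: Step) -> Step:
    if step.kind == "rename":
        construct, old, new = step.args
        return Step("rename", (construct, new, old))
    return Step(_INVERSE[step.kind], step.args)

def reverse_pathway(pathway: List[Step]) -> List[Step]:
    # Undo the steps in the opposite order to travel back to the source schema.
    return [reverse_step(s) for s in reversed(pathway)]

# Illustrative pathway; the query strings are made up for this example.
pathway = [
    Step("add", ("<<person>>", "<<staff>> ++ <<student>>")),
    Step("rename", ("<<person>>", "person", "member")),
    Step("contract", ("<<student>>", "Range Void Any")),
]
reversed_pathway = reverse_pathway(pathway)
```

Reversing twice gives back the original pathway, which is what makes the pathways usable in both directions.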
Schema transformation/integration networks [Figure: global schema GS at the top; union-compatible schemas US1, US2, …, USi, …, USn linked to one another by id transformations; local schemas LS1, LS2, …, LSi, …, LSn below, each connected to its USi]
Schema transformation/integration networks (cont’d) • On the previous slide: • GS is a global schema • LS1, …, LSn are local schemas • US1, …, USn are union-compatible schemas • the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository • the transformation pathway between USi and GS is similar • the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
AutoMed architecture • Schema and Transformations Repository (STR) • Wrapper • Schema Transformation and Integration Tools • Global Query Processor • Model Definitions Repository (MDR) • Global Query Optimiser • Model Definition Tool • Schema Evolution Tool
Other data integration approaches: GAV & LAV • Global-As-View (GAV) approach: specify GS constructs by view definitions over LS constructs • Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs
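The difference in direction between GAV and LAV can be seen in a toy example. The sketch below assumes two hypothetical local constructs, `staff` and `student`, and one global construct, `person`; the names and data are invented for illustration.

```python
# Toy local sources (LS constructs): sets of names. Hypothetical data.
staff = {"ann", "bob"}
student = {"cat"}

# GAV: the global construct is defined by a view over the local constructs.
def person_gav(staff, student):
    return {(n, "staff") for n in staff} | {(n, "student") for n in student}

# LAV: each local construct is defined by a view over the global construct.
def staff_lav(person):
    return {n for (n, role) in person if role == "staff"}

def student_lav(person):
    return {n for (n, role) in person if role == "student"}

person = person_gav(staff, student)
```

In this toy case the two styles agree: applying the LAV views to the GAV-derived `person` extent gives back the original sources.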
Evolution problems of GAV and LAV • GAV does not readily support evolution of local schemas e.g. adding a new attribute to a source table may invalidate some of the global view definitions • In LAV, changes to a local schema impact only the derivation rules defined for that schema • But conversely LAV has problems if one wants to evolve the global schema since all the view definitions defining local schema constructs in terms of the global schema would need to be reviewed • These evolution problems are exacerbated in P2P data integration scenarios where there is no distinction between local and global schemas
AutoMed vs GAV/LAV/GLAV • AutoMed’s both-as-view (BAV) schema transformation pathways capture at least the information available from GAV and LAV rules: • add/extend transformations correspond to GAV rules • delete/contract transformations correspond to LAV rules • Thus, GAV and LAV view definitions can be derived from a BAV network • GLAV rules e :- e’ are also captured, by BAV transformations of the form add(T,e); …;del(T,e’) • Thus, any reasoning or processing that is possible using GAV, LAV or GLAV is also possible using BAV
Schema Evolution in BAV • Unlike GAV/LAV/GLAV, BAV readily supports the evolution of both local and global schemas. • The evolution of a global or local schema is specified by a schema transformation pathway T from the old schema S to the new schema S’ • The transformation network and schemas can then be systematically repaired (rather than having to be redefined) [Figure: a pathway T from Global Schema S to New Global Schema S’, and a pathway T from Local Schema S to New Local Schema S’]
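One way to picture such a repair is sketched below. It assumes that a pathway is a plain list of steps and that the pathway from the evolved schema S’ is obtained by prefixing the reverse of T to the existing pathway from S. The `repair` function, the tuple encoding and the example constructs are hypothetical, and any later simplification of the resulting pathway is left out.

```python
from typing import List, Tuple

Step = Tuple[str, ...]   # e.g. ("add", construct, query); hypothetical encoding

INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend"}

def reverse(pathway: List[Step]) -> List[Step]:
    out = []
    for kind, *args in reversed(pathway):
        if kind == "rename":
            construct, old, new = args
            out.append(("rename", construct, new, old))
        else:
            out.append((INVERSE[kind], *args))
    return out

def repair(old_pathway: List[Step], evolution: List[Step]) -> List[Step]:
    """Pathway S' -> target, given S -> target and the evolution T: S -> S'."""
    return reverse(evolution) + old_pathway

# Existing pathway from a local schema S to the global schema (made-up steps).
ls_to_gs = [("add", "<<person>>", "<<staff>>"), ("delete", "<<staff>>", "<<person>>")]
# The local schema evolves: a new construct with an unknown extent is added.
evolution = [("extend", "<<staff,email>>", "Range Void Any")]

new_ls_to_gs = repair(ls_to_gs, evolution)
```

The existing steps are reused unchanged; only the reversed evolution steps are prepended, so nothing has to be redefined by hand.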
Global Query Processing • We handle query language heterogeneity by translation into/from a functional intermediate query language, IQL • A query Q expressed in a high-level query language on a global schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit) • View definitions are derived from the transformation pathways between S and the requested data source schemas • These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
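The substitution step can be sketched as simple view unfolding. The example assumes queries are nested tuples of the form `(operator, operand, ...)` and that `views` maps each global construct to an expression over source constructs; this encoding, and names such as `ls1.staff`, are illustrative and are not IQL syntax.

```python
from typing import Dict, Union

Expr = Union[str, tuple]   # a construct name, or (operator, operand, ...)

def unfold(query: Expr, views: Dict[str, Expr]) -> Expr:
    """Replace every global construct in `query` by its view definition.

    Assumes the view definitions are acyclic.
    """
    if isinstance(query, str):
        # A construct name: unfold it if a view is defined, else keep it.
        return unfold(views[query], views) if query in views else query
    op, *operands = query
    return (op, *(unfold(o, views) for o in operands))

# Hypothetical view definition derived from a transformation pathway.
views = {"person": ("union", "ls1.staff", "ls2.student")}

# Global query: count the extent of person.
global_query = ("count", "person")
source_query = unfold(global_query, views)
```

After unfolding, `source_query` mentions only source schema constructs, so it can be optimised and split into sub-queries for the individual sources.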
Global Query Processing (cont’d) • Query optimisation and query evaluation then occur • During query evaluation, the evaluator submits to wrappers sub-queries that they are able to translate into the local query language. Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources • The wrappers translate sub-query results back into the IQL type system • Further query post-processing then occurs in the IQL evaluator
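A rough sketch of the wrapper stage follows, using only the Python standard library. The two wrapper classes, their `fetch` method and the sample data are assumptions made for illustration; they are not the AutoMed wrapper interface. Each wrapper translates a sub-query into its source's own access method and returns a plain Python list, which stands in for the common IQL type system.

```python
import csv
import io
import sqlite3

class SqlWrapper:
    """Hypothetical wrapper that answers a sub-query by generating SQL."""
    def __init__(self, conn):
        self.conn = conn

    def fetch(self, table, column):
        # Identifiers are assumed to come from trusted schema metadata.
        rows = self.conn.execute(f"SELECT {column} FROM {table}")
        return [row[0] for row in rows]

class FlatFileWrapper:
    """Hypothetical wrapper over a single CSV file held in memory."""
    def __init__(self, text):
        self.text = text

    def fetch(self, table, column):
        # The file holds one table, so `table` is ignored in this sketch.
        return [row[column] for row in csv.DictReader(io.StringIO(self.text))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT)")
conn.executemany("INSERT INTO staff VALUES (?)", [("ann",), ("bob",)])

wrappers = {"ls1": SqlWrapper(conn),
            "ls2": FlatFileWrapper("name\ncat\nann\n")}

# Sub-queries produced by the evaluator: (source, table, column).
sub_queries = [("ls1", "staff", "name"), ("ls2", "student", "name")]
results = [wrappers[src].fetch(table, col) for src, table, col in sub_queries]

# Post-processing in the evaluator: a duplicate-free union of the results.
people = sorted(set().union(*results))
```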
Other AutoMed research at BBK • As well as virtual integration of data sources, we have investigated using AutoMed for materialised data integration, i.e. a data warehousing approach • In particular, Hao Fan has worked on incremental view maintenance, data lineage tracing and schema evolution over AutoMed schema transformation pathways • Lucas Zamboulis has developed semi-automatic techniques for transforming and integrating heterogeneous XML data • In recent work he is investigating the use of correspondences to ontologies to enhance these techniques • Sandeep Mittal is working on update translation and update propagation along AutoMed pathways e.g. in P2P environments
Other AutoMed research at BBK (cont’d) • Dean Williams has been working on extracting structure from unstructured text sources • The aim here is to integrate information extracted from unstructured text with structured information available from other sources • Dean is using existing technology (the GATE tool) for the text annotation and IE part of this work • The information extracted from the text is matched with existing structured information to derive new instance data and perhaps also new schema fragments • AutoMed is being used for the schema and data integration aspects of this project
ISPIDER Project • Partners: Birkbeck, EBI, Manchester, UCL • Aims: • Vast, heterogeneous biological data • Need for interoperability • Need for efficient processing • Development of Proteomics Grid Infrastructure, use existing proteomics resources and develop new ones, develop new proteomics clients for querying, visualisation, workflow etc.
myGrid / DQP / AutoMed • myGrid: collection of services/components allowing high-level integration of data/applications for in-silico experiments in biology • DQP: • OGSA-DAI (Open Grid Services Architecture Data Access and Integration) • Distributed query processing over OGSA-DAI enabled resources • Ongoing research: • AutoMed / DQP interoperability • AutoMed / myGrid interoperability
DQP / AutoMed interoperability • Data sources wrapped with OGSA-DAI • AutoMed OGSA-DAI wrappers extract data sources’ metadata • Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema • IQL queries submitted to this integrated schema are: • Reformulated to IQL queries on the data sources, using the AutoMed transformation pathways • Submitted to DQP for evaluation
Ongoing and future research • Heterogeneous data integration in Grid and P2P environments, with bioinformatics and e-learning as example application domains • Flexible combinations of virtual, materialised or hybrid integration • Flexible query processing in imprecise integration scenarios • P2P query processing over BAV pathways • P2P update processing over BAV pathways • Use of ECA rules and a P2P ECA rule execution engine for flexible update processing and data sharing