Research Information Linked Open Data Store
euroCRIS members meeting, Bonn, May 2013
Overview
• Needs & Drivers
• Information and data sources
  • Structured
  • Unstructured
• Architecture
  • Planned
  • Realised
• Tools
Project
• Partners
  • Knowledge Management unit, EWI
  • IBM Belgium
• Goals
  • Merge all sources into one open environment
  • Apply entity resolution techniques to remove data silos
  • Crawl and analyse the content of full-text elements
  • Build and test the proposed Pilot Architecture
  • Integrate information from structured and unstructured data in one container
  • Build a number of visualisations of the information
  • Develop a roadmap towards the Operational Architecture
• Timing: 4 months starting from January 2013
• Cost: 124k euro
Needs & drivers
• Better information: correct, up to date, complete
• Open FRIS data for services and application development
• Flemish government Open Data policy
• Maximum reuse of components
• Increase strategic intelligence
• Maximum reuse of data
• Policy monitoring: efficient & effective
• Connect data silos
• More information services
• Reduce system costs
Information and Data sources
• Structured Data
  • FRIS research portal database
    • Format: CERIF2006 database
    • Coverage: all universities, 1 university college
  • 4 university OARs
    • Format: MODS records
    • Coverage: X publication records, X full-text resources
  • VABB-SSH: publication monitoring data set on Social Sciences and Humanities
    • Format: MODS records
    • Coverage: all universities
• Semantics and information model
  • Business Semantics Glossary
  • FRIS model: CERIF2006
  • Semantics: Entity Classifications
Information and Data sources
• Unstructured Data
  • All textual information from the structured data
    • Project Abstracts
    • Publication Abstracts
    • Organisation Activity descriptions
  • Full text of Publications
  • Websites
    • Project
    • Researcher
    • Organisation
Links and Locators
• Access to unstructured data
  • Textual elements in CERIF model
    • Project Abstracts
    • Publication Abstracts
    • Organisation Activity descriptions
  • Websites
    • URI fields in CERIF entities
  • Links to full text
    • Resource links in MODS records
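As an illustration of how abstracts and full-text links might be pulled from a MODS record, here is a minimal Python sketch. It assumes the records use the standard MODS v3 namespace with `abstract` and `location/url` elements. The sample record, its URL and the function name `extract_locators` are invented for the example and are not taken from the project.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical MODS record; title, abstract and URL are invented for illustration.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example publication</title></titleInfo>
  <abstract>Example abstract text.</abstract>
  <location>
    <url>http://repository.example.org/record/1/fulltext.pdf</url>
  </location>
</mods>"""


def extract_locators(mods_xml):
    """Return (abstracts, urls) found in a single MODS record."""
    root = ET.fromstring(mods_xml)
    # Textual element: candidate input for content analysis.
    abstracts = [el.text.strip()
                 for el in root.findall("mods:abstract", MODS_NS) if el.text]
    # Resource links: candidate locations of the full text.
    urls = [el.text.strip()
            for el in root.findall("mods:location/mods:url", MODS_NS) if el.text]
    return abstracts, urls


abstracts, urls = extract_locators(SAMPLE_MODS)
print(abstracts)
print(urls)
```

Real repository records may place links elsewhere (for example in identifiers), so the element paths would need checking per source.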
Some numbers
• CERIF records
  • Person: 22.006 (FRIS) + 1.454.208 (OAI, without resolution)
  • Project: 24.634 (FRIS)
  • Organisation: 1.398 (OAI) + 2.022 (FRIS)
  • Publications: 3.596 (FRIS)
• MODS records
  • OARs: 598.035 (OAI) + VABB database
  • Publication full text: 45.294 (OAI)
Planned Architecture
[Diagram components: Structured Data input, Identifiers & Entity Resolution, Content Analysis, Concept Extraction, Semantic control, Operational Store, Triple Store, Visualisation]
OAR Harvesting Architecture
[Diagram components: Crawler management, OAI-PMH Crawler (UGent, UHasselt, …), VABB (XML), MODS to CERIF conversion, CERIF database, D2R transformation]
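The harvesting step relies on OAI-PMH, where a `ListRecords` request is paged with a `resumptionToken`. The sketch below shows only that paging logic in Python. The base URL, the `mods` metadata prefix and the function names are assumptions for illustration; the actual crawler and the prefixes offered by each repository are not described in the slides.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="mods", resumption_token=None):
    """Build a ListRecords request URL.

    Per OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumption token of a ListRecords response, or None on the last page."""
    root = ET.fromstring(response_xml)
    el = root.find("oai:ListRecords/oai:resumptionToken", OAI_NS)
    if el is not None and el.text and el.text.strip():
        return el.text.strip()
    return None


# Hypothetical, heavily trimmed response used to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

BASE = "http://repository.example.org/oai"  # hypothetical repository endpoint
print(list_records_url(BASE))
print(list_records_url(BASE, resumption_token=next_token(SAMPLE_RESPONSE)))
```

A crawler would repeat the request with each returned token until `next_token` yields `None`, handing every page of MODS records to the MODS to CERIF conversion.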
Architecture – Tools & Standards
[Diagram labels: BSG, SBVR, Jena, HTTP, D2R, TDB, REST, Java, SPARQL, SKOS, OWL, RDFS, Web 2.0, Apache, Fuseki, Oracle, Tomcat, RDF, CERIF, Silk, R2R, Sieve, ICA, ICC, Harvester, OAI-PMH, LDIF, UIMA, MODS]
Some numbers
• Entities
  • Projects: 24.634 (FRIS)
  • Persons: 22.006 (FRIS) + 1.454.208 (OAI, without resolution!)
  • Publications: 598.035 (OAI) + 3.596 (FRIS)
    • With full text: 45.294 (OAI)
  • OrgUnit: 1.398 (OAI) + 2.022 (FRIS)
  • Recognised author affiliations from full text: 55.662
• Triple Store
  • Triples FRIS + OAI: 57M
  • Triples text mining (author recognition + lemmas): 144M
  • Still without inference (no inferred triples included)
Visualisations
• Two test visualisations built so far:
  • Word cloud for person
    • http://ewisclod3.vlaanderen.be/words/
  • Persons related to Concepts
    • http://ewisclod3.vlaanderen.be/persons/
• New visualisations will be built on well-defined use cases
  • Tuning the content analytics to the case
  • Supervised learning for specific domains
  • Give a contextual overview of research from the last 10 years on social security issues in Belgium
  • Annual report on research in the domain of renewable energy
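To make the word-cloud idea concrete: a cloud is essentially a ranked list of term weights per person. The Python sketch below derives such weights by plain word counting. This is a simplification and an assumption, since the project's own weights come from its content-analysis pipeline (lemmas in the triple store); the stop-word list, sample texts and function name are invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use lemmatisation.
STOPWORDS = {"the", "of", "and", "in", "a", "on", "for", "to"}


def word_weights(texts, top_n=5):
    """Count non-stop-word tokens across texts and return the top_n (word, count) pairs."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(top_n)


# Hypothetical abstracts attached to one person.
abstracts = [
    "Renewable energy policy in Belgium",
    "Social security and renewable energy",
]
print(word_weights(abstracts, top_n=2))
```

The resulting counts would drive the font size of each term in the cloud.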
Entity resolution
• A few tools tested
  • Silk Link Discovery Framework
    • Used to map authors from the OAR harvest onto Persons from the CERIF sources
    • Experimented with
      • Manual construction of matching rules via the Silk workbench
      • Active learning combined with the Silk generic algorithms
      • Several metrics on the text dimensions (Levenshtein, tf-idf, Jaro, Jaccard) in combination with numerical and temporal dimensions
    • Results still have to be validated in detail
  • Tests with OKKAM are planned
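Two of the text metrics named above can be sketched in a few lines of Python to show what a matching rule compares. This is a from-scratch illustration, not Silk's implementation; the author names and the idea of normalising Levenshtein distance to a 0..1 similarity are assumptions made for the example.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Scale the edit distance to a similarity between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a, b):
    """Jaccard similarity on lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical author strings: one from an OAR record, one from a CERIF Person.
oar_author = "Peeters Jan"
cerif_person = "Jan Peeters"
print(levenshtein_similarity(oar_author, cerif_person))
print(jaccard(oar_author, cerif_person))
```

The example shows why several metrics are combined: a swapped name order scores poorly on character-level edit distance but perfectly on token-level Jaccard.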
Architecture Roadmap Elements (optional)
• Replace D2R with standard: R2RML
• Full-CERIF automatic D2R template generation
• Support incremental CERIF/RDF loading
• Integration of Data Governance Center via the API
• Complete modelling of CERIF and Semantics in Data Governance Center
• Full-CERIF automatic ontology template generation
[Diagram axis: manual → automated]
D2R Views
• FRIS: http://ewisclod3.vlaanderen.be/d2rq/fris/
• OAI-PMH: http://ewisclod3.vlaanderen.be/d2rq/oai/
• Text Mining: http://ewisclod3.vlaanderen.be/d2rq/tm/
SPARQL
• Test page: http://ewisclod3.vlaanderen.be/ewilod/html/sparql-test.html
• Endpoint (query only): http://ewisclod3.vlaanderen.be/ewilod/sparql
RESTful API (GET)
• Resource base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/resource/
• Ontology base URL: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/ontology
Triple store graph URIs
• FRIS: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris
• OAI-PMH: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#oai
• Text Mining: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#tm
• Mappings: http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#ld
LDIF
• Status monitor: http://ewisclod3.vlaanderen.be/ldif/status/
Silk
• Workbench: http://localhost:8080 (via SSH tunnel)
Visualisations
• Index page: http://ewisclod3.vlaanderen.be/ewilod/html/vis/index.html
• The visualisations:
  • http://ewisclod3.vlaanderen.be/persons/
  • http://ewisclod3.vlaanderen.be/words/
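A query against the SPARQL endpoint and one of the named graphs listed above could be prepared as in the Python sketch below. It only builds the GET URL following the SPARQL protocol's `query` parameter and sends nothing, since the pilot server may no longer be reachable. The counting query and the function names are illustrative assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Endpoint and graph URI as listed in the slides.
ENDPOINT = "http://ewisclod3.vlaanderen.be/ewilod/sparql"
FRIS_GRAPH = "http://ewisclod3.vlaanderen.be/ewilod/lod/0.1/graphs#fris"


def count_triples_query(graph_uri):
    """SPARQL query counting all triples in one named graph."""
    return f"SELECT (COUNT(*) AS ?n) WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"


def sparql_get_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = count_triples_query(FRIS_GRAPH)
url = sparql_get_url(ENDPOINT, query)
print(query)
print(url)
```

Note that the `#` in the graph URI is percent-encoded by `urlencode`, so it is not mistaken for a URL fragment.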