Analyzing European Research Competencies in IST: Results from a European SSA Project
Brigitte Jörg (DFKI), Jure Ferlez (IJS), Hans Uszkoreit (DFKI), Mitja Jermol (IJS)
Project Information
• Funding Organization: European Commission
• Funding Program: Sixth Framework Programme (FP6: IST (3rd Call))
• Project Type: Specific Support Action (SSA)
• Duration: 32 Months (April 2005 – November 2007)
• Project Co-ordination: DFKI GmbH
• Technical Co-ordination: Jozef Stefan Institute (IJS)
• Technology Partners: DFKI, IJS, Ontotext, CCLRC
• Project Consortium: 15 partners from EU MS, NMS and ACC
Project Consortium
• Deutsches Forschungszentrum für Künstliche Intelligenz, Germany
• Institute Jozef Stefan, Slovenia
• Ontotext Lab, Sirma AI EAD, Bulgaria
• RTD Talos, Cyprus
• Institute of Information Theory and Automation, Czech Republic
• Archimedes Foundation, Estonia
• Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary
• Institute of Mathematics and Computer Science, University of Latvia, Latvia
• Lithuanian Innovation Centre, Lithuania
• Projects in Motion, Malta
• Technical University of Silesia, Poland
• National Institute for R&D in Informatics, Romania
• Slovak University of Technology, Slovakia
• TUBITAK, Turkey
• The Science and Technology Facilities Council, UK (formerly CCLRC, UK)
Technology Partners
• DFKI (Co-ordinator): “LT World” Portal, Information Extraction, Semantic Web
• Jozef Stefan Institute (Technical Co-ordinator): “Project Intelligence”, Data Mining, Social Network Analysis
• Ontotext: “KIM Semantic Annotation Platform”
• euroCRIS: “CERIF” Standard, Access to Data
Project Objectives
• Set up and populate an information portal on IST research
• Provide information about RTD actors and their experience and expertise
• Provide innovative and automated services
• Promote RTD competencies in specific fields
• Support partner search for IST proposals and commercial projects
Presentation Outline
• Information Repository
• Data Collection
• Data Integration / Data Cleaning
• Evaluation of Results
• Analytic Tools
• Overall Conclusion
Repository Features
• Information Repository (CERIF 2004) containing
  • Organisation
  • Person
  • Project
  • Publications
• Data Collection (CERIF XML) from
  • National CRISs
  • National Collections
  • Web Crawlings
  • Community Support
• Data Integration into ONE single dataset
  • to enable analysis at European Level
• Data Cleaning with
  • Supervised Machine Learning Methods (Active Learning)
Repository Data Analysis
• Duplicate records inherent in single datasets
• Even more duplicate records after merging single datasets
• Most obvious duplicates for organisations and persons
  • no significant number of duplicate projects
  • publications have been ignored
• Duplicate records are a known problem
Formal Problem Definition (Winkler 2006)
• Problem: duplicate detection in record set A
• Given: a set of records in A
• Classify: every pair (a, b) ∈ A × A into M ∪ U, where M is the set of true matches and U is the set of true non-matches
IST World Problem Definition
• Heuristic Analysis of Random Samples: National Datasets / Cordis Datasets
  • most obvious duplicates found inside Cordis FP5 and Cordis FP6 datasets and across Cordis FP5 and FP6 datasets
  • not so many duplicates found in national datasets
  • a lot of duplicate person records across all datasets
  • no duplicate records found in project datasets
  • only some duplicate records across project datasets
  • publications have not been examined
• Decision taken with respect to the IST World scope
  • not touching project records
  • ignore publication records
  • find a solution for person records (IST World Community)
  • concentrate on cleaning organisation records
Problems with Organisation Records
Most entries had slightly different names, caused by additional special characters or character modifications:
• Capitalization, Lowercase Letters
• Blanks, Extra Spaces
• Hyphens
• Quotes
• Comma in Different Places
• Article in Name
• Full Stop in Name
• Incomplete Names
• English Translation
• Word Order
• Language Specific Characters (Jorg instead of Jörg)
• Special Characters (wrong encoding, e.g. &, ?)
• Mixture of Organisation Names and Department Names
• Differences in Addresses
Data Cleaning Application
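Several of the surface variations listed above (case, spacing, hyphens, quotes, punctuation, articles, word order, diacritics) can be neutralised before any comparison. The following is a minimal sketch, not the project's actual cleaning code: the function name `normalize_org_name` and the `ARTICLES` list are hypothetical, and incomplete names, translations and department/organisation mixtures are not addressed.

```python
import re
import unicodedata

# Hypothetical list of articles to drop; not taken from the slides.
ARTICLES = {"the", "der", "die", "das", "le", "la"}

def normalize_org_name(name: str) -> str:
    """Reduce an organisation name to a rough comparison key."""
    # Decompose accented characters and drop the combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, then collapse every run of non-alphanumeric characters
    # (spaces, hyphens, quotes, commas, full stops) into a single space.
    # Letters without a decomposition (e.g. ß, ø) are also treated as separators.
    cleaned = re.sub(r"[^a-z0-9]+", " ", stripped.lower()).strip()
    # Drop articles and sort the tokens so that word order does not matter.
    tokens = [t for t in cleaned.split() if t not in ARTICLES]
    return " ".join(sorted(tokens))

print(normalize_org_name('The "Jožef-Stefan"  Institute,'))
print(normalize_org_name("Institute Jozef Stefan"))
```

Both calls yield the same key, so the two spellings would be grouped as candidate duplicates. A key like this is only a pre-filter; pairs that still differ after normalisation need the similarity measures described on the next slide.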
IST World Dataset Integration
Knowledge about Records (Organisation Names + Location):
(1) Name/Location Strings (Bag of Words)
(2) Word/Character Order (String Kernels)
(3) Spelling Errors (Edit Distance Measure)
(4) Normalization of (1-3)
Organisation Names: Fulltext Indexing, Querying
Human Decision: M = Match, U = Non-Match, - = unknown
Machine Learning (Support Vector Machine): M = Match, U = Non-Match, - = unknown
Machine Decision: M = Match, U = Non-Match
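The three similarity signals named on this slide can be sketched as one feature vector per record pair. This is an illustration under stated assumptions rather than the IST World implementation: the slide does not say which string kernel was used, so a normalised character 3-gram spectrum kernel stands in for it, and all function names are hypothetical. The classifier itself is omitted; in the pipeline described above, vectors like these would be passed to a Support Vector Machine trained on the human M/U decisions.

```python
from collections import Counter
from math import sqrt

def _cosine(ca: Counter, cb: Counter) -> float:
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def bag_of_words_similarity(a: str, b: str) -> float:
    """(1) Compare names as unordered bags of whitespace-separated words."""
    return _cosine(Counter(a.split()), Counter(b.split()))

def spectrum_kernel(a: str, b: str, n: int = 3) -> float:
    """(2) Normalised n-gram spectrum kernel: sensitive to character order."""
    grams_a = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    grams_b = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    return _cosine(grams_a, grams_b)

def edit_distance(a: str, b: str) -> int:
    """(3) Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return prev[-1]

def pair_features(a: str, b: str) -> list[float]:
    """(4) All three signals scaled to [0, 1], where 1 means identical."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b)) or 1
    return [bag_of_words_similarity(a, b),
            spectrum_kernel(a, b),
            1.0 - edit_distance(a, b) / longest]

print(pair_features("Jozef Stefan Institute", "Institute Jozef Stefan"))
print(pair_features("Jozef Stefan Institute", "Josef Stefan Institut"))
```

The first pair differs only in word order, so the bag-of-words score is close to 1 while the order-sensitive scores are lower; the second pair has spelling variations that the edit-distance score tolerates. Combining several signals is what lets a learned classifier separate these cases.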
Evaluation of Results in CORDIS FP6 dataset
• human evaluation of 1000 organisation record pairs
• 30 M correct; 934 U correct
• 1 M incorrect; 35 U incorrect
• 97% precision
• 46% recall
• integration approach worked well
• can be used for large scale integration tasks
• Result: semi-automated identification of 4000 duplicates with high accuracy and a reasonable recall
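The reported precision and recall follow from the four counts on this slide. The short check below assumes that "M incorrect" means a pair the machine labelled as a match that the human evaluators rejected, and that "U incorrect" means a true match the machine labelled as a non-match; under that reading the slide's figures are reproduced. Variable and function names are illustrative.

```python
def precision_recall(true_matches: int, false_matches: int, missed_matches: int):
    """Precision and recall for the match class M."""
    precision = true_matches / (true_matches + false_matches)
    recall = true_matches / (true_matches + missed_matches)
    return precision, recall

# Counts from the human evaluation of 1000 organisation record pairs.
m_correct, u_correct = 30, 934      # machine decision confirmed by humans
m_incorrect, u_incorrect = 1, 35    # machine decision rejected by humans

# The four cells should cover all evaluated pairs.
assert m_correct + u_correct + m_incorrect + u_incorrect == 1000

precision, recall = precision_recall(m_correct, m_incorrect, u_incorrect)
print(f"precision = {precision:.1%}")  # precision = 96.8%
print(f"recall = {recall:.1%}")        # recall = 46.2%
```

Rounded to whole percentages these are the 97% and 46% stated above: of 31 pairs flagged as matches, 30 were correct, while 35 of the 65 true matches in the sample were missed.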
Analytic Tools
• Advanced Tools
  • Collaboration Diagram
  • Competence Diagram
• Experimental Tools
  • Collaboration Trends
  • Competence Trends
  • Consortia Prediction
  • Semantic Search
How to analyze or generate a Diagram
• define a query in the IST World Portal
• get a list of result records matching the query
• generate diagrams based on the results
Competence Diagram
Aim: investigate the thematic range of SSA projects in FP6
Query: IST SSA projects within FP6
Projects (Red Dots): linked with full record in repository
Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE
Competence Diagram
Aim: investigate the thematic range of SSA projects in FP6
Query: IST SSA projects within FP6
Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, …
Configuration of Result Space: 40% of result list, 30 topics
Competence Diagram
Aim: investigate the thematic range of SSA projects in FP6
Query: IST SSA projects within FP6
Views: Themes, Goals
Configuration of Result Space: 40% of result list, 30 topics
Collaboration Diagram
Aim: investigate the collaboration of SSA partners in FP6
Query: IST SSA projects within FP6
Diagram labels: Project, Number of joint partners
Configuration of Result Space: 20% of result list
Evaluation of Analytic Tools
• IST World made it possible to perform the defined tasks
  • for more details see the full paper in the Proceedings
• All analytics depend on the underlying data
• The analytic tools are very powerful
Evaluation of Queries
• Query execution performed in March 2008
• Queried datasets: IST World / Cordis
IST World Portal: http://www.ist-world.org/
CORDIS Search: http://cordis.europa.eu/en/home.html
Results of Query Evaluation
Discovered inconsistencies with Cordis data:
• “FP6” string: 30 of 80 relevant records lacked the string
• “SSA” string: 15 of 208 relevant records lacked the string
• “Specific Support Action” string: 15 of 208 relevant records lacked the string
• Dates (year of the call): not consistently recorded
• Query 1: 22 projects contained the string “Coordination Action”, “Specific Targeted Action”, “Integrated Project”, or others
• An investigation of the results of Query 1 in Cordis revealed: 80 projects of the result list are missing in IST World
Overall Conclusion
• Integration Method:
  • could be further developed
  • test data could be used to generate a better classification model
  • feature generation could be improved by using ontological knowledge
  • transfer learning methods might be helpful for re-use of the learned model
• Evaluation of Large Datasets:
  • very difficult
  • needs expert knowledge
• Analytic Tools:
  • depend on the quality of the underlying data
  • are very powerful for investigation of large datasets
European Research Dataset (entries), January 2008
• European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs
• Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs
• Cyprus: 29 Orgs
• Czech Republic: 183 Orgs, 163 Proj, 164 Exp
• Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs
• Hungary: 2665 Orgs, 1297 Proj, 2425 Exp
• Latvia: 106 Orgs, 830 Proj, 701 Exp
• Lithuania: 102 Orgs
• Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs
• Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs
• Romania: 169 Orgs, 68 Proj, 87 Exp
• Serbia: 60 Orgs, 2278 Exp, 79130 Pubs
• Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp
• Slovakia: 56 Orgs, 432 Proj, 683 Exp
• Turkey: 285 Orgs
• EPRI-start: 286 Orgs, 275 Exp
• Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp
• Community: 61 Orgs, 41 Proj, 435 Exp
Beyond the Project
IST World is online: http://www.ist-world.org/
Registration is free
Create your Competence Map / Collaboration Map
Continuation is planned …