Project Information

Analyzing European Research Competencies in IST – Results from a European SSA Project – Brigitte Jörg, Jure Ferlez, Hans Uszkoreit, Mitja Jermol (DFKI) (IJS) (DFKI) (IJS). Project Information. Funding Organization: European Commission

Presentation Transcript


  1. Analyzing European Research Competencies in IST – Results from a European SSA Project – Brigitte Jörg (DFKI), Jure Ferlez (IJS), Hans Uszkoreit (DFKI), Mitja Jermol (IJS)

  2. Project Information • Funding Organization: European Commission • Funding Program: Sixth Framework Programme (FP6: IST (3rd Call)) • Project Type: Specific Support Action (SSA) • Duration: 32 Months (April 2005 – November 2007) • Project Co-ordination: DFKI GmbH • Technical Co-ordination: Jozef Stefan Institute (IJS) • Technology Partners: DFKI, IJS, Ontotext, CCLRC • Project Consortium: 15 partners from EU MS, NMS and ACC

  3. Project Consortium • Deutsches Forschungszentrum für Künstliche Intelligenz, Germany • Jozef Stefan Institute, Slovenia • Ontotext Lab, Sirma AI EAD, Bulgaria • RTD Talos, Cyprus • Institute of Information Theory and Automation, Czech Republic • Archimedes Foundation, Estonia • Computer and Automation Research Institute, Hungarian Academy of Sciences, Hungary • Institute of Mathematics and Computer Science, University of Latvia, Latvia • Lithuanian Innovation Centre, Lithuania • Projects in Motion, Malta • Technical University of Silesia, Poland • National Institute for R&D in Informatics, Romania • Slovak University of Technology, Slovakia • TUBITAK, Turkey • The Science and Technology Facilities Council, UK (formerly CCLRC, UK)

  4. Technology Partners • DFKI (Co-ordinator): “LT World” Portal, Information Extraction, Semantic Web • Jozef Stefan Institute (Technical Co-ordinator): “Project Intelligence”, Data Mining, Social Network Analysis • Ontotext: “KIM Semantic Annotation Platform” • euroCRIS: “CERIF” Standard, Access to Data

  5. Project Objectives • Set up and populate an information portal on IST research • Provide information about RTD actors and their experience and expertise • Provide innovative and automated services • Promote RTD competencies in specific fields • Support partner search for IST proposals and commercial projects

  6. Presentation Outline • Information Repository • Data Collection • Data Integration / Data Cleaning • Evaluation of Results • Analytic Tools • Overall Conclusion

  7. Repository Features • Information Repository (CERIF 2004) containing: Organisation, Person, Project, Publications • Data Collection (CERIF XML) from: National CRISs, National Collections, Web Crawling, Community Support • Data Integration into one single dataset, to enable analysis at European level • Data Cleaning with supervised machine learning methods (Active Learning)

  8. Repository Data Analysis • Duplicate records inherent in single datasets • Even more duplicate records after merging single datasets • Most obvious duplicates for organisations and persons • no significant number of duplicate projects • publications have been ignored • Duplicate records are a known problem

  9. Formal Problem Definition (Winkler 2006) • Problem: duplicate detection in record set $A$ • Given: a set of records in $A$ • Classify: every pair $(a,b) \in A \times A$ into $M \cup U$, where $M$ is the set of true matches and $U$ is the set of true non-matches
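The classification task on this slide can be illustrated with a minimal sketch. It assumes unordered pairs of distinct records (rather than the full product $A \times A$), and the function `classify_pairs`, the toy records, and the naive match rule are all hypothetical, not the project's implementation.

```python
from itertools import combinations

def classify_pairs(records, is_match):
    """Split all unordered record pairs into matches (M) and non-matches (U)."""
    M, U = [], []
    for a, b in combinations(records, 2):
        (M if is_match(a, b) else U).append((a, b))
    return M, U

# Hypothetical toy records and a deliberately naive match rule
# (case-insensitive equality); a real rule would be learned or hand-tuned.
records = ["DFKI GmbH", "dfki gmbh", "Jozef Stefan Institute"]
M, U = classify_pairs(records, lambda a, b: a.lower() == b.lower())
```

The number of pairs grows quadratically with the number of records, which is one reason a learned classifier rather than manual review is attractive for large datasets.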

  10. IST World Problem Definition • Heuristic Analysis of Random Samples: National Datasets / Cordis Datasets • most obvious duplicates found inside Cordis FP5 and Cordis FP6 datasets and across Cordis FP5 and FP6 datasets • not so many duplicates found in national datasets • a lot of duplicate person records across all datasets • no duplicate records found in project datasets • only some duplicate records across project datasets • publications have not been examined • Decision taken with respect to the IST World scope • not touching project records • ignore publication records • find a solution for person records (IST World Community) • concentrate on cleaning organisation records

  11. Problems with Organisation Records • Most entries had slightly different names caused by additional special characters or character modifications • Capitalization, Lowercase Letters • Blanks, Extra Spaces • Hyphens • Quotes • Comma in Different Places • Article in Name • Full Stop in Name • Incomplete Names • English Translation • Word Order • Language-Specific Characters (Jorg instead of Jörg) • Special Characters (wrong encoding, e.g. &, ?) • Mixture of Organisation Names and Department Names • Differences in Addresses (Figure: Data Cleaning Application)
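Several of the surface variations listed on this slide (capitalization, extra spaces, hyphens, quotes, commas, articles, full stops, word order, accented characters) can be neutralized by a normalization step before comparison. The following is a minimal sketch under stated assumptions: the function `normalize_name` and the `ARTICLES` set are hypothetical, only English articles are handled, and the slides do not say that IST World normalized names this way.

```python
import re
import unicodedata

ARTICLES = {"the", "a", "an"}  # assumption: English articles only

def normalize_name(name: str) -> str:
    # Decompose accented characters and drop the combining marks (Jörg -> Jorg).
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_like = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold, then turn every run of non-alphanumeric characters into one space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", ascii_like.casefold())
    # Drop articles and sort the tokens so that word order no longer matters.
    tokens = [t for t in cleaned.split() if t not in ARTICLES]
    return " ".join(sorted(tokens))
```

Incomplete names, translations, and department/organisation mixtures are not addressed by this kind of normalization. Letters that do not decompose under NFKD (for example "ł") are treated as separators here, which a production version would need to handle explicitly.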

  12. IST World Dataset Integration • Knowledge about Records: Human Decision (M = Match, U = Non-Match, - = unknown) • Organisation Names + Location: (1) Name/Location Strings (Bag of Words); (2) Word/Character Order (String Kernels); (3) Spelling Errors (Edit Distance Measure); (4) Normalization of (1-3) • Organisation Names: Fulltext Indexing, Querying • Machine Decision (M = Match, U = Non-Match) • Machine Learning (Support Vector Machine): M = Match, U = Non-Match, - = unknown
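Two of the feature types named on this slide, bag-of-words overlap and edit distance, can be sketched in plain Python. This is an illustration only: Jaccard overlap is an assumed choice of bag-of-words similarity, the string-kernel feature is omitted, and the function names are hypothetical. A feature vector like the one returned by `pair_features` is the kind of input a Support Vector Machine would be trained on.

```python
def bag_of_words_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # Deletion, insertion, or substitution (free if characters agree).
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def pair_features(a: str, b: str) -> list:
    """Feature vector for one record pair, each value scaled to [0, 1]."""
    longest = max(len(a), len(b)) or 1
    return [bag_of_words_similarity(a, b), 1.0 - edit_distance(a, b) / longest]
```

Scaling the edit distance by the longer string length keeps both features on a comparable range, which generally helps margin-based classifiers.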

  13. Active Learning Application
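The slide gives no detail on the active-learning strategy. One common approach is uncertainty sampling, where the human is asked to label the pairs the current classifier is least sure about. The sketch below assumes signed classifier scores with the decision boundary at 0; `select_for_labeling` and the sample scores are hypothetical, and whether IST World used this strategy is not stated in the slides.

```python
def select_for_labeling(scored_pairs, batch_size):
    """Return the pairs whose classifier score is closest to the boundary (0)."""
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1]))
    return [pair for pair, _score in ranked[:batch_size]]

# Hypothetical (pair, score) tuples; positive means "match", negative "non-match".
scored = [
    (("A", "B"), 2.1),
    (("A", "C"), -0.1),
    (("B", "C"), 0.4),
    (("C", "D"), -1.7),
]
to_label = select_for_labeling(scored, 2)
```

After the human labels the selected pairs, the classifier would be retrained and the selection repeated.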

  14. Evaluation of Results in CORDIS FP6 dataset • human evaluation of 1000 organisation record pairs • 30 M correct; 934 U correct • 1 M incorrect; 35 U incorrect • 97% precision • 46% recall • integration approach worked well • can be used for large scale integration tasks • Result: semi-automated identification of 4000 duplicates with high accuracy and a reasonable recall
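The reported precision and recall follow from the four counts on this slide, assuming that "M incorrect" means a pair wrongly classified as a match (false positive) and "U incorrect" means a true match wrongly classified as a non-match (false negative). A short check under that reading:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Counts from the slide: 30 M correct, 1 M incorrect, 35 U incorrect.
# The 934 correct U decisions (true negatives) enter neither formula.
precision, recall = precision_recall(tp=30, fp=1, fn=35)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

With these counts, precision is 30/31 and recall is 30/65, which round to the 97% and 46% stated on the slide.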

  15. Analytic Tools • Advanced Tools • Collaboration Diagram • Competence Diagram • Experimental Tools • Collaboration Trends • Competence Trends • Consortia Prediction • Semantic Search

  16. How to analyze or generate a Diagram • definition of a query in the IST World Portal • get a list of result records matching the query • generate diagrams based on results

  17. Competence Diagram • Aim: investigate the thematic range of SSA projects in FP6 • Query: IST SSA projects within FP6 • Projects (Red Dots): Linked with Full Record in Repository • Thematic Areas (Blue Clouds): SEMANTIC, HEALTH, LEGAL, CHANGING, ROADMAP, SOFTWARE

  18. Competence Diagram • Aim: investigate the thematic range of SSA projects in FP6 • Query: IST SSA projects within FP6 • Goals (List of Keywords): DEMENTIA, PEOPLE, MEDICAL, STANDARDS, … • Configuration of Result Space: 40% of result list, 30 topics

  19. Competence Diagram • Aim: investigate the thematic range of SSA projects in FP6 • Query: IST SSA projects within FP6 • Themes • Goals • Configuration of Result Space: 40% of result list, 30 topics

  20. Collaboration Diagram • Aim: investigate the collaboration of SSA partners in FP6 • Query: IST SSA projects within FP6 • Project • Number of joint partners • Configuration of Result Space: 20% of result list
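The "number of joint partners" shown in the collaboration diagram can be derived from a project-to-partners mapping. The sketch below is an assumption about how such edge weights might be computed, not a description of the IST World implementation; the function `joint_partner_counts` and the sample projects and organisations are hypothetical.

```python
from itertools import combinations

def joint_partner_counts(project_partners):
    """Map each pair of projects to the number of partners they share."""
    counts = {}
    for (p1, s1), (p2, s2) in combinations(sorted(project_partners.items()), 2):
        shared = len(set(s1) & set(s2))
        if shared:  # keep only project pairs that are actually linked
            counts[(p1, p2)] = shared
    return counts

# Hypothetical query result: projects with their partner organisations.
projects = {
    "Project A": ["Org1", "Org2", "Org3"],
    "Project B": ["Org2", "Org3"],
    "Project C": ["Org4"],
}
edges = joint_partner_counts(projects)
```

Each entry in `edges` would correspond to one weighted link in a diagram; projects without shared partners stay unconnected.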

  21. Evaluation of Analytic Tools • IST World made it possible to perform the defined tasks • for more details see the full paper in the Proceedings • All analytics depend on the underlying data • The analytic tools are very powerful

  22. Evaluation of Queries • Query execution performed in March 2008 • Queried datasets: IST World / Cordis • IST World Portal: http://www.ist-world.org/ • CORDIS Search: http://cordis.europa.eu/en/home.html

  23. Results of Query Evaluation • Discovered inconsistencies with Cordis data: • “FP6” string: 30 of 80 relevant records lacked the string • “SSA” string: 15 of 208 relevant records lacked the string • “Specific Support Action” string: 15 of 208 relevant records lacked the string • Dates (Year of the call): not consistently recorded • Query 1: 22 projects contained the string “Coordination Action”, “Specific Targeted Action”, “Integrated Project”, or others • An investigation of the results of Query 1 in Cordis revealed: 80 projects of the result list are missing in IST World
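The kind of inconsistency described on this slide, relevant records that lack an expected label string, can be detected with a simple containment check. This is a minimal sketch; the record layout, the `text` field, and the function `records_missing_label` are hypothetical and do not reflect the actual Cordis or IST World schema.

```python
def records_missing_label(records, label):
    """Return the records whose text does not contain the label (case-insensitive)."""
    return [r for r in records if label.lower() not in r["text"].lower()]

# Hypothetical records that are all known to be FP6 projects.
records = [
    {"id": 1, "text": "FP6 IST SSA: example project"},
    {"id": 2, "text": "Sixth Framework Programme, Specific Support Action"},
]
missing = records_missing_label(records, "FP6")
```

A string query for "FP6" would silently miss record 2 here, which illustrates why string-based queries can undercount when labels are recorded inconsistently.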

  24. Overall Conclusion • Integration Method: could be further developed; test data could be used to generate a better classification model; feature generation could be improved by using ontological knowledge; transfer learning methods might be helpful for re-use of the learned model • Evaluation of Large Datasets: very difficult; needs expert knowledge • Analytic Tools: depend on the quality of the underlying data; are very powerful for the investigation of large datasets

  25. European Research Dataset (entries, January 2008) • European Research: 55078 Orgs, 30489 Proj, 58164 Exp, 165795 Pubs • Bulgaria: 794 Orgs, 73 Proj, 10940 Exp, 19023 Pubs • Cyprus: 29 Orgs • Czech Republic: 183 Orgs, 163 Proj, 164 Exp • Estonia: 75 Orgs, 1256 Proj, 6726 Exp, 51376 Pubs • Hungary: 2665 Orgs, 1297 Proj, 2425 Exp • Latvia: 106 Orgs, 830 Proj, 701 Exp • Lithuania: 102 Orgs • Malta: 58 Orgs, 27 Proj, 898 Exp, 180 Pubs • Poland: 1451 Orgs, 2179 Proj, 7392 Exp, 16086 Pubs • Romania: 169 Orgs, 68 Proj, 87 Exp • Serbia: 60 Orgs, 2278 Exp, 79130 Pubs • Slovenia: 1723 Orgs, 3748 Proj, 11655 Exp • Slovakia: 56 Orgs, 432 Proj, 683 Exp • Turkey: 285 Orgs • EPRI-start: 286 Orgs, 275 Exp • Cordis FP5+FP6: 48988 Orgs, 20436 Proj, 13941 Exp • Community: 61 Orgs, 41 Proj, 435 Exp

  26. Beyond the Project • IST World is online: http://www.ist-world.org/ • Registration is free • Create your Competence Map / Collaboration Map • Continuation is planned …
