Treatment of Semantic Heterogeneity ...

Treatment of Semantic Heterogeneity using Meta-Data Extraction and Query Translation. Robert Strötgen, Social Science Information Centre, Bonn. euroCRIS 2002, 29th August 2002

Outline • What is semantic heterogeneity? • Meta-Data extraction • Semantic relations • Query translation • Outlook

Project CARMEN • Metadata (Dublin Core Element Set in RDF, “Meta-Maker”, digital signatures) • Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents) • Methods for treating the remaining semantic heterogeneity in CARMEN

Semantic Heterogeneity • Technical heterogeneity (different platforms, databases, formats) is not the focus of CARMEN • Semantic heterogeneity arises when data collections use • different thesauri or classifications for content description • varying metadata or no metadata at all • or when intellectually indexed documents meet completely un-indexed Internet pages

Material: Social Sciences • SOLIS/FORIS vs. Internet documents from the social sciences • SOLIS/FORIS: specialized documentation databases with high-quality content description such as abstracts, controlled keywords and classification • Internet documents: in most cases without any metadata, with high semantic and formal heterogeneity

Extraction of Meta-Data

Meta-Data in Test Corpus • Size: 3,661 documents • File format: only HTML documents • TITLE: • Correct title tags: 96 % • Title, but incorrectly coded: 17.7 % of the rest • KEYWORD: • Correct keyword tags: 25.5 % • ABSTRACT: • Correct description tags: 21 % • Abstract, but incorrectly coded: 39.4 % of the rest

Extraction from HTML files - Some Problems • Missing or irregular use of Meta tags (author, keywords, DC-Tags) • Inconsistent use of semantic HTML tags (title, h1, h2, address etc.) • Irregular formatting style for context information (type size, type style, horizontal orientation etc.) • Missing context information (date, author, institution, etc.) • Use of HTML that does not conform to the specification!

Converting HTML → XML • Advantages: • (syntactical) homogenisation of HTML files • XML allows the use of many existing tools for document analysis, particularly the query language XPath. • Disadvantage: • Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
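The conversion step can be illustrated with a minimal sketch. It is not the converter used in CARMEN (the slides do not name it); it assumes Python's standard `html.parser` and the XPath subset of `xml.etree.ElementTree` as stand-ins. The class `HtmlTreeBuilder`, the wrapper element `document` and the sample page are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Builds a well-formed element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags without an open partner are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child goes into that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(html):
    builder = HtmlTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Invented sample page with upper-case tags and unclosed elements.
SAMPLE = """<HTML><HEAD><TITLE>Migration Research</TITLE>
<META NAME="keywords" CONTENT="migration, labour market">
</HEAD><BODY><H1>Migration Research<P>Unclosed paragraph<BR>second line</BODY></HTML>"""

tree = html_to_tree(SAMPLE)
title = tree.findtext(".//title")
keywords = [m.get("content") for m in tree.findall(".//meta[@name='keywords']")]
print(title, keywords)
```

Once the page is a well-formed tree, every extraction rule becomes a path expression, which is the advantage the slide names. The cost is one full parse per document, paid at gathering time.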

HTML Heuristic: Title (part)
    /* "contains" always means a case-insensitive substring match */
    if (<title> exists && <title> does not contain "untitled" && HMAX exists) {
        if (<title> == HMAX)              { <1> Title[1]   = <title> }
        elsif (<title> contains HMAX)     { <2> Title[0.8] = <title> }
        elsif (HMAX contains <title>)     { <3> Title[0.8] = HMAX }
        else                              { <4> Title[0.8] = <title> + HMAX }
    }
    elsif (<title> exists && S exists)    { <5> Title[0.5] = <title> + S }   /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
    elsif (<title> exists)                { <6> Title[0.5] = <title> }
    elsif (<Hx> exists)                   { <7> Title[0.3] = HMAX }
    elsif (S exists)                      { <8> Title[0.1] = S }
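The title heuristic on this slide can be sketched as a runnable function. Several points are assumptions, because the slide does not define them: the bracketed numbers are read as decimal confidence weights, HMAX is taken to be the text of the highest-level heading found, S is the text of an emphasised item such as `//p/b`, and `+` is read as concatenation with a space. The function name `extract_title` and its helpers are hypothetical.

```python
from typing import Optional, Tuple


def _clean(text: Optional[str]) -> Optional[str]:
    """Treat None and whitespace-only strings as 'does not exist'."""
    if text is None:
        return None
    text = text.strip()
    return text or None


def _contains(haystack: str, needle: str) -> bool:
    """Case-insensitive substring test, as required by the slide."""
    return needle.lower() in haystack.lower()


def extract_title(title: Optional[str],
                  hmax: Optional[str],
                  s: Optional[str]) -> Optional[Tuple[str, float]]:
    """Return (title text, assumed confidence weight) or None."""
    title, hmax, s = _clean(title), _clean(hmax), _clean(s)

    if title and hmax and not _contains(title, "untitled"):
        if title == hmax:                 # rule <1>
            return title, 1.0
        if _contains(title, hmax):        # rule <2>
            return title, 0.8
        if _contains(hmax, title):        # rule <3>
            return hmax, 0.8
        return f"{title} {hmax}", 0.8     # rule <4>
    if title and s:                       # rule <5>
        return f"{title} {s}", 0.5
    if title:                             # rule <6>
        return title, 0.5
    if hmax:                              # rule <7>
        return hmax, 0.3
    if s:                                 # rule <8>
        return s, 0.1
    return None


print(extract_title("IZ: Migration Research", "migration research", None))
print(extract_title(None, None, "Annual Report"))
```

As on the slide, a title containing "untitled" is not discarded. It only loses the high-weight rules <1> to <4> and falls through to rules <5> and <6>.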

Results and Outlook • Extraction of Meta-Data • TITLE: 80 % extracted with medium or high quality • KEYWORDS: nearly 100 % extracted with high quality • ABSTRACTS: 90 % extracted with medium/high quality • Conclusion • In principle transferable to other domains • Expensive maintenance • Only a compromise solution until authors of web pages use Dublin Core or another Meta-Data standard

Semantic Relations • Intellectual transfer relations (cross-concordances) • Tools for creation: SIS-TMS for thesauri, CarmenX for classifications • Statistical transfer relations (co-occurrence analysis)

Cross-Concordances in SIS-TMS

SIS-TMS Correlation Editor

Parallel Corpus

Corpus with Internet Documents • Internet documents from the social sciences are not indexed using a thesaurus or classification

Simulating a Parallel Corpus

Result: Simulated Parallel Corpus

Term-Term-Matrix

Tool: Jester • Java Environment for Statistical TransfERs: support and assistance for creating statistical transfer relations from a parallel corpus
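The step from a parallel corpus via a term-term matrix to statistical transfer relations can be illustrated with a small sketch. It is not Jester's implementation: the slides do not state which association measure is used, so the conditional probability P(free term | descriptor) stands in here as an assumption. The function names, the threshold `min_weight` and the three-document toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence(corpus):
    """Count, per descriptor, in how many documents each free term co-occurs.

    corpus: iterable of (descriptors, free_terms) pairs, one per document.
    """
    pair_counts = defaultdict(Counter)   # the term-term matrix, stored sparsely
    descriptor_counts = Counter()        # document frequency of each descriptor
    for descriptors, free_terms in corpus:
        for d in set(descriptors):
            descriptor_counts[d] += 1
            for t in set(free_terms):
                pair_counts[d][t] += 1
    return pair_counts, descriptor_counts


def transfer_relations(corpus, min_weight=0.5):
    """Keep free terms whose assumed weight P(term | descriptor) >= min_weight."""
    pair_counts, descriptor_counts = cooccurrence(corpus)
    relations = {}
    for d, row in pair_counts.items():
        weighted = ((t, n / descriptor_counts[d]) for t, n in row.items())
        kept = [(t, w) for t, w in weighted if w >= min_weight]
        # Highest weight first; ties broken alphabetically for stable output.
        relations[d] = sorted(kept, key=lambda tw: (-tw[1], tw[0]))
    return relations


# Toy parallel corpus: thesaurus descriptors on the left, free terms on the right.
corpus = [
    ({"Leiharbeit"}, {"zeitarbeit", "arbeitsmarkt"}),
    ({"Leiharbeit"}, {"zeitarbeit", "tarifvertrag"}),
    ({"Migration"}, {"einwanderung", "arbeitsmarkt"}),
]
relations = transfer_relations(corpus, min_weight=0.5)
print(relations["Leiharbeit"])
```

A higher `min_weight` yields fewer transfer terms per descriptor, which is one simple form of user parametrisation.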

Query Transformation

Binding of Query Languages • Pluggable QueryParsers and QueryPrinters for different query languages make reuse in other contexts easy.
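A rough sketch of the plug-in idea follows. The CARMEN software consists of Java classes whose interfaces are not shown on the slides, so every name here is hypothetical: the `Condition` record, the two protocols, both toy query syntaxes and the registries. The sketch only shows how a shared internal representation lets any parser be combined with any printer.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Condition:
    """One field/term pair of the (assumed) internal query representation."""
    field: str
    term: str


class QueryParser(Protocol):
    def parse(self, text: str) -> List[Condition]: ...


class QueryPrinter(Protocol):
    def print_query(self, conditions: List[Condition]) -> str: ...


class FieldColonParser:
    """Parses 'field:term field:term', an invented toy syntax."""

    def parse(self, text: str) -> List[Condition]:
        conditions = []
        for token in text.split():
            field, _, term = token.partition(":")
            conditions.append(Condition(field, term))
        return conditions


class SqlLikePrinter:
    """Prints conditions in an invented SQL-like syntax."""

    def print_query(self, conditions: List[Condition]) -> str:
        return " AND ".join(f"{c.field} = '{c.term}'" for c in conditions)


# Registries: supporting a new query language means adding one entry.
PARSERS: Dict[str, QueryParser] = {"colon": FieldColonParser()}
PRINTERS: Dict[str, QueryPrinter] = {"sql-like": SqlLikePrinter()}


def convert(text: str, source: str, target: str) -> str:
    """Parse with the source language's parser, print with the target's printer."""
    return PRINTERS[target].print_query(PARSERS[source].parse(text))


converted = convert("keyword:Leiharbeit year:2001", "colon", "sql-like")
print(converted)
```

A transfer module would sit between `parse` and `print_query` and rewrite the terms of individual conditions.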

CARMEN Transfer Architecture • Retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service • Exchange of partial queries using XML/XIRQL • Transfer service runs as a servlet on a Tomcat server

Evaluation of Transfer Modules • Retrieval tests using transfer modules (using a corpus of Internet documents indexed with Fulcrum SearchServer) • Limitation: the weight information of the transfer relations is not used • Tested transfer: SOLIS/IZ-Thesaurus → SoWi Internet documents/free terms • Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer • 2 exemplary searches in each of 3 domains (women's studies, migration, sociology of industry)

Exemplary Search: „Dominanz“ • „Dominanz“ (“dominance”): 16 relevant documents • 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste): 14 additional documents, 7 of them relevant (50%, increase of 44%) • Precision: 77%

Exemplary Search: „Leiharbeit“ • „Leiharbeit“ (“temporary work”): 10 relevant documents • 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung): 10 additional documents, 2 of them relevant (20%, increase of 20%) • Precision: 60%
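The percentages in the two exemplary searches can be recomputed with a short sketch. The slides do not give the formulas, so the ones below are an assumption: the share is relevant new documents over all new documents, the increase is relevant new documents over the relevant documents of the original search, and precision is taken over the combined result set with every document of the original search counted as relevant. Under that reading the slide figures are reproduced; the function name is hypothetical.

```python
def transfer_effect(baseline_relevant, added_total, added_relevant,
                    baseline_total=None):
    """Return (share, increase, precision) as fractions between 0 and 1."""
    if baseline_total is None:
        # ASSUMPTION: the original search returned only relevant documents.
        baseline_total = baseline_relevant
    share = added_relevant / added_total            # relevant among new documents
    increase = added_relevant / baseline_relevant   # gain in relevant documents
    precision = ((baseline_relevant + added_relevant)
                 / (baseline_total + added_total))  # over the combined result set
    return share, increase, precision


dominanz = transfer_effect(baseline_relevant=16, added_total=14, added_relevant=7)
leiharbeit = transfer_effect(baseline_relevant=10, added_total=10, added_relevant=2)

for name, values in (("Dominanz", dominanz), ("Leiharbeit", leiharbeit)):
    print(name, [round(100 * v) for v in values])
```

For „Dominanz“ this gives 7/14 = 50%, 7/16 ≈ 44% and 23/30 ≈ 77%. For „Leiharbeit“ it gives 2/10 = 20%, 2/10 = 20% and 12/20 = 60%.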

Results • All exemplary searches using transfers lead to additional relevant documents compared with a search without transfer • Share of relevant documents among all new documents: between 13% and 55% • Transfer terms are not always self-evident (example: „Wüste“ (“desert”)) • Sometimes a very large number of transfer terms (user parametrisation or better algorithms needed)

Outlook (What needs to be done?) • Improvement of the parallel corpora: • Kind of documents • Diversity of document types • Diversity of institutions / web sites • Domain • Corpus size • Comparison of transfers using statistical relations vs. intellectual relations • Improvement of algorithms • Effect of interactive, iterative retrieval and user parametrisation / adjustment • User tests

Exploitation • Services (transfer) • Software (Java classes) • Projects: • Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz) • European Schools Treasury Browser (ETB) • Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex) • Contact: soe@bonn.iz-soz.de
