Multilingual Issues in Information Retrieval and Resource Description Overview. Yuri Demchenko, TERENA demchenko@terena.nl. In this presentation. Multilingual Issues in TERENA Technical Programme Multilinguality: trends and developments Technical Issues/Background
Multilingual Issues in Information Retrieval and Resource Description. Overview
Yuri Demchenko, TERENA, demchenko@terena.nl
In this presentation
• Multilingual Issues in TERENA Technical Programme
• Multilinguality: trends and developments
• Technical Issues/Background
• Data presentation and resource description format
• Standards Overview
• Metadata and Cataloging
• Recent Developments in Subject Gateways and Search Engines (SE)
• Cross-language Information Retrieval
• REIS/TAP Initiatives Multilinguality Framework
TERENA Multilingual Community and TERENA Technical Programme
• TERENA has 43 members from 34 countries speaking 30 languages
• Multilingual issues have always been within the scope of the TERENA Technical Programme
  • WG-i18n - WG on Internationalisation issues
  • C3 Project on messaging transliteration tools
  • MAITS - initiated by WG-i18n
  • Multilingual E-Mail Agent Testing
  • Multilingual issues in Subject Gateways, Section 2.13 in SG Handbook
• Multilingual Support in Internet/IT Applications. Information page - http://www.terena.nl/projects/multiling/
• Liaison with standards (STD) bodies
  • CEN/TC304 Character set technology - http://www.stri.is/TC304/default.html
  • IETF
Multilinguality: trends and developments
• Storing, processing, presentation and exchange of information in many languages
  • Interactive (protocol-based/negotiated) applications and non-interactive (resource description and information presentation)
• Multilingual Search and Retrieval
  • Multilingual Subject Gateways and Search Engines
  • CLIR testing at TREC
• Data Resource Model and Multilinguality
  • One or multiple languages
  • Data format
  • Metadata (not part of Data but part of Resource)
  • References, links
  • Professional Thesauri (Resource Context) - base for multiple languages and language unification
Internet Applications
• Non-interactive application: Electronic Mail
  • Correct Message Composition and Rendering
• Interactive applications
  • WWW: HTTP/HTML
    • Content-Type: text/html; charset=euc-jp
    • <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  • Content Negotiation Protocol
    • Media features, attributes
    • Direct and hop-by-hop communication
• Operational Applications
  • (Internationalised) DNS
  • LDAP and X.500 (language support?)
I18n and ML issues at IETF and other STD bodies
• IETF Architectural Model of Multilingual support in Internet Applications - RFC 2130
  • Language and Charset/Encoding tagging
• Content negotiation framework (IETF/W3C)
  • Point-to-point vs hop-by-hop
  • Message based vs Interactive vs Streaming
• Internationalised DNS (IDN) - Internationalised Domain Names
  • vs E-Mail (SMTP, IMAP)
  • vs Routing (Routing Policy Specification Language (RPSL))
  • vs Network Management (SNMP textual presentation)
  • vs Network Security (TLS and IPSec)
• Content Encoding normalisation (IETF/Unicode)
• LSD-2 - Large Scale Services Deployment
• IMAP language extension
IETF Architectural Model of Multilingual support in Internet Applications
[Figure labels: Resolution Service / Directory (content MD); Presentation, Culture, Locale, Language (on each side); Resource; Content Transfer Agent (on each side); Communication Protocol; Communication/Network]
• User Interface
  • Presentation
  • Culture
  • Locale
  • Language
• On-the-wire
  • Coded Character Set - Repertoire of ISO-10646
  • Character Encoding Scheme - UTF-8 (ml-text), US-ASCII (e-mail), ISO 8859-1
  • Transfer Encoding Scheme (Base64, QP)
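The three on-the-wire layers can be illustrated with a minimal Python sketch. The sample string and the helper name `decode_layers` are assumptions made for illustration; only the layer names come from the slide.

```python
import base64
import quopri

# Coded Character Set: the text is a sequence of ISO 10646 / Unicode characters.
text = "Bemüh'n"

# Character Encoding Scheme: map characters to octets.
utf8_octets = text.encode("utf-8")         # "ü" becomes two octets
latin1_octets = text.encode("iso-8859-1")  # "ü" becomes one octet

# Transfer Encoding Scheme: make the octets safe for 7-bit transport.
b64_wire = base64.b64encode(utf8_octets)
qp_wire = quopri.encodestring(utf8_octets)


def decode_layers(wire: bytes, transfer: str, charset: str) -> str:
    """Undo the transfer encoding first, then the character encoding."""
    if transfer == "base64":
        octets = base64.b64decode(wire)
    elif transfer == "quoted-printable":
        octets = quopri.decodestring(wire)
    else:
        raise ValueError(f"unsupported transfer encoding: {transfer}")
    return octets.decode(charset)
```

A receiver has to be told both labels (transfer encoding and charset) to recover the text, which is why the tagging discussed on the following slides matters.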
Content Negotiation Framework (IETF/W3C)
• Content Negotiation covers three elements
  • Expressing the capabilities of the sender and the data resource to be transmitted
  • Expressing the capabilities of a receiver
  • A protocol by which capabilities are exchanged
• Abstract framework for content negotiation
  (Content) (Transmit.data) (Data document)
  [Author]----->-----[Sender]----->-----[Receiver]----->-----[User]
• Transparent Content Negotiation in HTTP - RFC 2295
• Protocol-independent Content Negotiation Framework - RFC 2703
  • Non-message resource transfer
  • End-to-end vs hop-by-hop negotiation
  • Use of directory and resolution services
• CC/PP exchange protocol based on HTTP Extension Framework (W3C)
  • Composite Capability/Preference Profile: A user side framework for content negotiation
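As a rough illustration of matching receiver capabilities against sender capabilities, the sketch below picks a language variant from an HTTP `Accept-Language` header. This is plain server-driven negotiation, not the RFC 2295 transparent negotiation protocol; the function names and the fallback to the primary subtag are assumptions made for this example.

```python
def parse_accept_language(header: str) -> list:
    """Return (tag, q) pairs from an Accept-Language value, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so equal weights keep the order given by the receiver
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(available: list, header: str, default: str) -> str:
    """Pick the sender variant that best matches the receiver preferences."""
    variants = {a.lower(): a for a in available}
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in variants:
            return variants[tag]
        primary = tag.split("-")[0]  # e.g. "de-CH" falls back to "de"
        if primary in variants:
            return variants[primary]
    return default
```

For example, `negotiate(["en", "de", "fr"], "de-CH, fr;q=0.8, en;q=0.5", "en")` selects the German variant.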
Charset and Language tagging
• MIME types (RFC 2045-2049)
  • text, image, audio, video
• Charset = Character Set + Character Encoding Scheme
• Transfer Encoding Scheme
  • base64
  • quoted-printable
• Other media attributes and features (e.g., resolution, color, language, etc.)
• Language
  • RFC 1766
  • ISO 639-2
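A minimal sketch of these tags on a MIME message, using Python's standard `email` package. The message text is sample data chosen for this example, and base64 is just one of the two transfer encodings listed above.

```python
from email.message import EmailMessage

body = "Habe nun, ach! Philosophie, durchaus studiert mit heißem Bemüh'n."

msg = EmailMessage()
# Charset label plus transfer encoding scheme for the text/plain body.
msg.set_content(body, charset="utf-8", cte="base64")
# Language tag in the RFC 1766 form.
msg["Content-Language"] = "de"

# The tags a receiver would read before rendering the message.
labels = {
    "type": msg.get_content_type(),
    "charset": msg.get_content_charset(),
    "transfer": str(msg["Content-Transfer-Encoding"]),
    "language": str(msg["Content-Language"]),
}
```

Printing `msg.as_string()` shows the resulting headers followed by the base64-encoded body.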
WWW: HTTP/HTML
• The HTTP header includes information about the type of the transferred information and the character encoding for text-based information:
  • Content-Type: text/html; charset=euc-jp
• The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:
  • Content-Language: se
• Character encoding information in the META information of the HTML document:
  • <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
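A receiver can read the charset label from either place. The following is a minimal sketch using only the Python standard library; the helper names and the `iso-8859-1` fallback parameter are assumptions for illustration.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional


def charset_from_content_type(value: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


class MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first META http-equiv=Content-Type tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # HTMLParser lowercases tag and attribute names for us.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-type":
            content = attributes.get("content", "")
            if content:
                self.charset = charset_from_content_type(content)


finder = MetaCharsetFinder()
finder.feed('<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">')
```

When both labels are present and disagree, a client has to decide which one wins; this sketch leaves that policy to the caller.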
XML: Character Set tagging
• A character is an atomic unit of text
  • All ISO 10646 characters + TAB, CR, LF
• The mechanism for encoding can vary from entity to entity
  • All XML processors must accept UTF-8 and UTF-16
• Character Encoding declaration in XML documents or entities (section 4.3.3)
  • EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
  • <?xml encoding='UTF-8'?>
  • <?xml encoding='EUC-JP'?>
• Default Character Set Encoding - UTF-8 and UTF-16
• Autodetection of Character Encoding
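Autodetection can be sketched as follows. This is a simplified illustration, not the full procedure from the XML specification: it checks only the UTF-8 and UTF-16 byte order marks, reads the encoding declaration only when the entity is in an ASCII-compatible encoding, and ignores UTF-32 and EBCDIC. The function name is an assumption.

```python
import codecs
import re

# Encoding declaration at the very start of an ASCII-compatible entity.
_DECL = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the character encoding of an XML entity from its first bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No byte order mark and no declaration: UTF-8 is the default.
    return "utf-8"
```

For example, `sniff_xml_encoding(b"<?xml encoding='EUC-JP'?><doc/>")` yields `"euc-jp"`.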
XML: Language tagging
• Language identification (section 2.12)
  • Labelling language of the whole document, entity or item
• Tag for identification of languages
  • LanguageID ::= Langcode ('-' Subcode)*
  • Langcode ::= ISO639Code | IanaCode | UserCode
• Examples:
  <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
  <p xml:lang="en-GB">What colour is it?</p>
  <p xml:lang="en-US">What color is it?</p>
  <sp who="Faust" desc='leise' xml:lang="de">
    <l>Habe nun, ach! Philosophie,</l>
    <l>Juristerei, und Medizin</l>
    <l>und leider auch Theologie</l>
    <l>durchaus studiert mit heißem Bemüh'n.</l>
  </sp>
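An `xml:lang` value applies to the element and to its content unless a descendant overrides it. The sketch below resolves the effective language of each element with Python's `xml.etree.ElementTree`; the English `<l>` line in the sample is a hypothetical addition used to show an override.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(elem, inherited=None):
    """Yield (tag, effective language) for elem and all its descendants."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<sp who="Faust" desc="leise" xml:lang="de">'
    "<l>Habe nun, ach! Philosophie,</l>"
    '<l xml:lang="en">I have, alas, studied philosophy,</l>'  # hypothetical line
    "</sp>"
)
resolved = list(languages(doc))
```

The first `<l>` inherits `de` from `<sp>`, while the second carries its own `en` tag.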
Unicode Technical Reports
• The Unicode Standard, Version 3.0 - Just published! - http://www.unicode.org/unicode/uni2book/u2.html
• Unicode 2.0 test page - http://www.terena.nl/projects/multiling/euroml/tests/test-ucspages1ucs.html
• Multilingual European Subsets of ISO/IEC 10646-1 - http://www.stri.is/TC304/p10_1998_05_30.pdf
• Unicode Technical Reports
  • UTR #15: Unicode Normalization Forms, Version 18.0; I-D by Martin Duerst
  • UTR #17: Character Encoding Model
  • UTR #16: UTF-EBCDIC
  • UTR #10: Unicode Collation Algorithm
  • UTR #7: Plane 14 Characters for Language Tags
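The normalization forms of UTR #15 matter for indexing and search because the same visible text can be stored as different code point sequences. A minimal sketch with Python's `unicodedata` module; the helper name `same_text` is an assumption.

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to normalization form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ, the normalized strings do not.
raw_equal = composed == decomposed
normalized_equal = same_text(composed, decomposed)
```

An indexer would typically normalize both the stored text and the incoming query to the same form before comparing them.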
Language Definition in DC Metadata set - DC.Language Format
<meta name = "DC.Language" content = "en">
<meta name = "DC.Language" scheme = "rfc1766" content = "en">
<meta name = "DC.Language" scheme = "ISO639-2" content = "eng">
<meta name = "DC.Language" scheme = "rfc1766" content = "en-US">
<meta name = "DC.Language" content = "zh">
<meta name = "DC.Language" content = "ja">
<meta name = "DC.Language" content = "es">
<meta name = "DC.Language" content = "german">
<meta name = "DC.Language" lang = "fr" content = "allemand">
Language Definition in DC Metadata set - Field content language labelling/attributing
• A work in Spanish may be assigned the following metadata:
  <meta name = "DC.Language" scheme = "rfc1766" content = "es">
  <meta name = "DC.Title" lang = "es" content = "La Mesa Verde y la Silla Roja">
  <meta name = "DC.Title" lang = "en" content = "The Green Table and the Red Chair">
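A harvester could collect such elements together with their `lang` and `scheme` attributes. The following is a minimal sketch using the standard `html.parser` module; the class name and the tuple layout are assumptions for illustration.

```python
from html.parser import HTMLParser


class DCMetaCollector(HTMLParser):
    """Collect (name, lang, scheme, content) from Dublin Core META tags."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("DC."):
            self.records.append(
                (name, attributes.get("lang"), attributes.get("scheme"),
                 attributes.get("content"))
            )


collector = DCMetaCollector()
collector.feed(
    '<meta name="DC.Language" scheme="rfc1766" content="es">'
    '<meta name="DC.Title" lang="es" content="La Mesa Verde y la Silla Roja">'
    '<meta name="DC.Title" lang="en" content="The Green Table and the Red Chair">'
)
# Titles keyed by the language of the field content.
titles = {lang: content for name, lang, _, content in collector.records
          if name == "DC.Title"}
```

Here `DC.Language` describes the resource itself, while the `lang` attribute describes the language of each metadata value.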
DC in Multiple Languages
• The reference language of the Int’l DC community is English; however, the semantics of DC elements are in principle expressed equally well in any modern language
• The versions of DC elements in various languages should share a single name space using tokens that look like English words but stand for universal elements - http://purl.org/dc/elements/1.1/
• DC in Multiple Languages Registry project - http://purl.org/dc/groups/languages.htm
  • Uses RDF schemas to share machine-readable tokens for translation of DC terms in multiple languages (26 languages to date)
  • Linkage to and from central DC namespace server
  • Registry as Dictionary/Thesauri - use Interlinguas to link different translations
  • Formal recognition and standardization procedure
Document Description with Unqualified DC and RDF syntax
  <?xml:namespace ns="http://purl.org/metadata/dublin_core_elements" prefix="DC"?>
  <RDF:RDF>
    <RDF:Description RDF:HREF="http://www.biblio.de/buecher/kleist.html">
      <DC:Title XML:lang="de">Das Erdbeben in Chili</DC:Title>
      <DC:Creator>Heinrich von Kleist</DC:Creator>
    </RDF:Description>
  </RDF:RDF>
• XML Encoding (Character set) declaration
  • UTF-8/UTF-16 as default encoding
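The slide shows an early draft syntax. As a hedged sketch, the same description can be generated in the later RDF/XML form with Python's `xml.etree.ElementTree`. The substitutions are assumptions of this example, not the slide's own syntax: the RDF namespace of the 1999 RDF Recommendation, `rdf:about` in place of `RDF:HREF`, and lower-case element names in the `http://purl.org/dc/elements/1.1/` namespace.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XML = "http://www.w3.org/XML/1998/namespace"

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)


def describe(url: str, title: str, title_lang: str, creator: str) -> bytes:
    """Serialise a minimal Dublin Core description as UTF-8 RDF/XML."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": url})
    title_elem = ET.SubElement(desc, f"{{{DC}}}title", {f"{{{XML}}}lang": title_lang})
    title_elem.text = title
    creator_elem = ET.SubElement(desc, f"{{{DC}}}creator")
    creator_elem.text = creator
    # The XML declaration carries the character encoding label.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


record = describe(
    "http://www.biblio.de/buecher/kleist.html",
    "Das Erdbeben in Chili", "de", "Heinrich von Kleist",
)
```

The output carries both tags discussed in this presentation: the encoding in the XML declaration and the language in `xml:lang`.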
Recent Developments in Subject Gateways, Indexing, Searching
• NRENs projects
• Subject gateways
• Commercial Search Engines
• Multilingual Text Retrieval and Processing
  • TUSTEP system - using “fuzzy” multilingual searching
• Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8 Conferences by NIST
Multilingual Subject Gateway (DESIRE)
• Developing multilingual subject gateways (SOSIG as example)
  • SOSIG accepts resources in any language, evaluated for quality
  • Translation should be coherent and checked
  • Different language versions should be equally well maintained
• SOSIG Cataloguing rules
  • TITLE will be displayed in the first language
  • ALTERNATIVE TITLE in other languages
  • DESCRIPTION will mention the different languages in which the resource is available
  • URI of all language versions
  • Labelling URI language
• Library standards for multilingual provision
  • NISO Z39.53 Language codes
  • USMARC Language codes
Multilingual provision in popular Internet Search Engines
• Multilingual SE
  • AltaVista - http://www.altavista.com/ - 28 languages
    • Documents indexed as is
    • Automatic translation - very simple and naive
  • Euroseek - http://www.euroseek.com/ - 30 languages
  • FAST Advanced Search - http://www.alltheweb.com - 31 languages
  • Google - http://www.google.com/ - 11 languages
• Other sites that have dedicated national sites
  • interface language
  • language resources
  • no special language policy
  • Excite - 11 countries
  • Lycos - 23 countries
TUSTEP - TUebingen System of Text Processing Programs
1. File structure
2. Multilingual capabilities
3. Internal data presentation
4. Database publishing/output data presentation
5. CGI
6. Sample implementation
  • http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
  • Try entries like Smith or Meier or...
  • http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery
Cross-Language Information Retrieval (CLIR) testing at TREC-7/TREC-8
• TREC - Text REtrieval Conference - http://trec.nist.gov/
• Cross-Language Information Retrieval (CLIR) technologies
  • Using Intermediary or Interlingual representation
    • Latent Semantic Indexing
    • Generalised Vector Space Model, etc.
  • Computer translation
  • Machine-readable bilingual dictionaries
  • Multilingual Thesauri
• Participants: ETH/Eurospider, IBM, Xerox, Cornell, New Mexico Univ, TNO, others
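Of the approaches listed, the dictionary-based one is the simplest to sketch: translate each query term with a machine-readable bilingual dictionary, then run a monolingual search. The toy dictionary, the documents and the function names below are hypothetical and are not taken from any TREC system.

```python
# Hypothetical toy English -> German dictionary; a term may have several senses.
EN_DE = {
    "green": ["grün"],
    "table": ["Tisch", "Tabelle"],
    "chair": ["Stuhl"],
}

# Hypothetical German document collection.
DOCS = {
    "d1": "Der grüne Tisch",
    "d2": "Ein roter Stuhl und ein Tisch",
    "d3": "Das Erdbeben in Chili",
}


def translate_query(query: str, dictionary: dict) -> list:
    """Replace each term by all its translations; keep unknown terms as is."""
    terms = []
    for term in query.lower().split():
        terms.extend(dictionary.get(term, [term]))
    return terms


def search(docs: dict, terms: list) -> list:
    """Rank documents by the number of distinct query terms they contain."""
    wanted = {t.lower() for t in terms}
    scores = {}
    for doc_id, text in docs.items():
        hits = wanted & set(text.lower().split())
        if hits:
            scores[doc_id] = len(hits)
    return sorted(scores, key=lambda d: (-scores[d], d))


ranking = search(DOCS, translate_query("green chair table", EN_DE))
```

Note that "grün" does not match the inflected form "grüne" in `d1`, which hints at why real systems add stemming and sense disambiguation on top of dictionary lookup.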
REIS Project/Initiative Multilinguality framework - First attempt
• Multiple language indexing
  • multiple language documents/indexes
• Cross-language Searching
  • Automatic Query forwarding based on thesauri or ML dictionary
  • Using “fuzzy” multilingual searching/matching
• Multilingual information retrieval
  • Automatic translation (if requested)
  • Translation Request Protocol
• Internal Data/Indexes presentation
  • Language and Character Encoding tagging
  • XML as internal presentation of data and XML language and charset tagging
  • Text/Charset normalisation (Unicode or TUSTEP-like)
• Metadata and Resource Description
  • DC.Language definition and XML/RDF/DC Language tagging
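One possible reading of "fuzzy" matching combined with Unicode-based normalisation is to fold case and diacritics before comparing index terms. The sketch below is an assumption about how such folding could be done with Python's `unicodedata` module; it does not reproduce TUSTEP's actual method.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a search key with case and diacritics folded away."""
    # Compatibility decomposition separates base letters from combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() also maps "ß" to "ss".
    return stripped.casefold()


def fuzzy_equal(a: str, b: str) -> bool:
    """Treat two terms as matching when their folded keys are equal."""
    return fold(a) == fold(b)
```

With this key, a query typed as `ZURICH` would match an index entry `Zürich`, at the cost of conflating words that differ only in their diacritics.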
Multilinguality Framework for Multilingual Indexing/Search Services
• Yet to be developed