Multilinguality and cross-language searching
Multilingual aspects in Indexing, Searching and Metadata (Resource Description)
Multilinguality in Indexing, Searching and Metadata
Multilingual aspects in Indexing, Searching and Metadata
• IETF Model of Multilingual support in Internet Applications
  • Electronic Mail
  • Interactive applications
• Charset and Language tagging
  • MIME types
  • XML Language and Charset tagging
  • DC language definition
• Metadata and RDF
  • DC.Language
• Existing solutions
  • TUSTEP
  • Search Engines and Subject Gateways
• Multilingual framework for the REIS Project
IETF Model of Multilingual support in Internet Applications
• Electronic Mail
  • Language
  • Character Encoding Scheme
  • Transfer Encoding Scheme
• Interactive applications
  • WWW: HTTP/HTML
    • <META http-equiv="Content-Type" Content="text/html; charset=euc-jp">
  • XML/DOM
  • LDAP and X.500 (?)
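The charset tagging on this slide can be read back programmatically. The following is a minimal Python sketch, assuming only the standard library; the helper names `charset_from_content_type`, `MetaCharsetFinder` and `sniff_meta_charset` are illustrative and not part of any tool named in these slides. It extracts the `charset` parameter from a Content-Type value and from a `<META http-equiv="Content-Type">` element.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Remembers the charset declared in <META http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-type":
            self.charset = charset_from_content_type(attributes.get("content", ""))


def sniff_meta_charset(raw):
    """Look for a META charset in raw bytes, reading only the ASCII part."""
    finder = MetaCharsetFinder()
    # Non-ASCII bytes are replaced; the META element itself is plain ASCII.
    finder.feed(raw.decode("ascii", errors="replace"))
    return finder.charset
```

Once the label is known, the page can be decoded with it, for example `raw.decode(sniff_meta_charset(raw))`, provided the label names a codec the runtime supports.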
XML: Language and Charset tagging
• A character is the atomic unit of text
  • All ISO 10646 characters + TAB, CR, LF
  • The encoding mechanism can vary from entity to entity
  • All XML processors must accept UTF-8 and UTF-16
• Character Encoding in Entities (XML 4.3.3)
  • EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
  • <?xml encoding='UTF-8'?>
  • <?xml encoding='EUC-JP'?>
  • Autodetection of Character Encoding
• Language identification (XML 2.12)
  • Tag for identification of languages
  • LanguageID ::= Langcode ('-' Subcode)*
  • Langcode ::= ISO639Code | IanaCode | UserCode
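The two grammar fragments above translate fairly directly into regular expressions. This is a rough Python sketch, assuming the productions of the XML 1.0 first edition (later editions defer to the RFC language-tag syntax instead); the function names `declared_encoding` and `is_language_id` are made up for this illustration.

```python
import re

# S and EncName as in the XML 1.0 grammar.
S = r"[ \t\r\n]"
ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"

# EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
ENCODING_DECL = re.compile(
    rf"""{S}+encoding{S}*={S}*(?:"(?P<dq>{ENC_NAME})"|'(?P<sq>{ENC_NAME})')"""
)

# LanguageID ::= Langcode ('-' Subcode)*
# Langcode   ::= ISO639Code | IanaCode | UserCode
LANGUAGE_ID = re.compile(
    r"(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)(?:-[A-Za-z]+)*"
)


def declared_encoding(declaration):
    """Return the EncName from an XML or text declaration, or None."""
    match = ENCODING_DECL.search(declaration)
    if match is None:
        return None
    return match.group("dq") or match.group("sq")


def is_language_id(tag):
    """Check a tag against the first-edition LanguageID production."""
    return LANGUAGE_ID.fullmatch(tag) is not None
```

For instance, `declared_encoding("<?xml encoding='EUC-JP'?>")` yields the encoding name, and `is_language_id("en-US")` accepts a two-letter ISO 639 code followed by a subcode.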
Charset and Language tagging
• MIME types
  • text, image, audio, video
• Charset = Character Set + Character Encoding Scheme
• Transfer Encoding Scheme
  • base64
  • quoted-printable
• Language
  • RFC 1766
  • ISO 639-2
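The split between charset and transfer encoding can be made concrete with a short Python sketch. It assumes the standard `base64` and `quopri` modules; the helper names `encode_for_transfer` and `decode_from_transfer` and the sample sentence are illustrative only. The character encoding scheme turns text into bytes, and the transfer encoding scheme then makes those bytes safe for a 7-bit channel.

```python
import base64
import quopri


def encode_for_transfer(text, charset, transfer_encoding):
    """Apply the character encoding first, then the transfer encoding."""
    raw = text.encode(charset)  # character encoding scheme
    if transfer_encoding == "base64":
        body = base64.b64encode(raw)
    elif transfer_encoding == "quoted-printable":
        body = quopri.encodestring(raw)
    else:
        raise ValueError(f"unsupported transfer encoding: {transfer_encoding}")
    headers = {
        "Content-Type": f"text/plain; charset={charset}",
        "Content-Transfer-Encoding": transfer_encoding,
    }
    return headers, body


def decode_from_transfer(body, charset, transfer_encoding):
    """Undo the two layers in reverse order."""
    if transfer_encoding == "base64":
        raw = base64.b64decode(body)
    elif transfer_encoding == "quoted-printable":
        raw = quopri.decodestring(body)
    else:
        raise ValueError(f"unsupported transfer encoding: {transfer_encoding}")
    return raw.decode(charset)
```

With `charset="iso-8859-1"` and quoted-printable, only the non-ASCII bytes are escaped, so the body stays largely readable, while base64 encodes every byte.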
Language Definition in DC Metadata set
• <meta name="DC.language" scheme="rfc1766" content="es">
  • scheme is "rfc1766" or "ISO639-2"
• <meta name="DC.title" lang="es" content="La Mesa y Silla Roja">
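Tags like the two above can be generated from a record rather than typed by hand. The following Python sketch assumes the attribute layout shown on the slide; the function `dc_meta` is a hypothetical helper, not part of any Dublin Core toolkit.

```python
from html import escape


def dc_meta(element, content, scheme=None, lang=None):
    """Render one Dublin Core element as an HTML <meta> tag."""
    attributes = [("name", f"DC.{element}")]
    if scheme is not None:
        attributes.append(("scheme", scheme))
    if lang is not None:
        attributes.append(("lang", lang))
    attributes.append(("content", content))
    # Escape attribute values so quotes and ampersands stay well-formed.
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attributes
    )
    return f"<meta {rendered}>"


print(dc_meta("language", "es", scheme="rfc1766"))
print(dc_meta("title", "La Mesa y Silla Roja", lang="es"))
```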
Multilingual Subject Gateway
• Developing multilingual subject gateways (SOSIG as example)
  • SOSIG accepts resources in any language, subject to quality evaluation
  • Translations should be coherent and checked
  • Different language versions should be equally well maintained
• SOSIG Cataloguing rules
  • TITLE will be displayed in the first language
  • ALTERNATIVE TITLE in other languages
  • DESCRIPTION will mention the different languages in which the resource is available
  • URI of all language versions
  • Labeling URI language
• Library standards for multilingual provision
  • NISO Z39.53 Language codes
  • USMARC Language codes
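One way to picture the cataloguing rules is as a record structure. This Python sketch is an assumption about how such a record might look, not the actual SOSIG or ROADS template; the class names, field names and the example.org URIs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageVersion:
    language: str  # language label for the URI, e.g. "en"
    uri: str       # location of this language version


@dataclass
class GatewayRecord:
    title: str                   # TITLE, in the first language
    title_language: str
    alternative_titles: dict = field(default_factory=dict)  # language -> title
    versions: list = field(default_factory=list)            # all language versions

    def description_note(self):
        """Sentence for DESCRIPTION listing the available languages."""
        languages = ", ".join(version.language for version in self.versions)
        return f"Available in: {languages}"


record = GatewayRecord(
    title="The Red Table and Chair",
    title_language="en",
    alternative_titles={"es": "La Mesa y Silla Roja"},
    versions=[
        LanguageVersion("en", "http://example.org/en/"),
        LanguageVersion("es", "http://example.org/es/"),
    ],
)
print(record.description_note())
```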
Multilingual provision in popular Internet Search Engines
• AltaVista
  • Search in 25 languages
  • Documents indexed as is
  • Automatic translation - very simple and naive
• Other sites that have dedicated national sites
  • interface language
  • language resources
  • no special language policy
  • Euroseek
  • Excite
  • Lycos
  • Infoseek
New Developments in Subject Gateways, Indexing, Searching
• NREN projects
• Subject gateways
• Commercial Search Engines
• Multilingual Text Retrieval and Processing
• TUSTEP system
NREN projects
• Social Science Information Gateway - http://sosig.esrc.bris.ac.uk/
• ROADS Project Software/Documentation Server - http://www.roads.lut.ac.uk/
• CHIP-Pilot (Clearing House for Internet Projects) - http://www.terena.nl/chip/
• IMesh - International Collaboration on Internet Subject Gateways - http://www.desire.org/html/subjectgateways/community/imesh/
• DFN Indexing and Searching projects - http://www.dfn.de/links/suchen.html
• X.500 Directory E-mail Addresses Search (AMBIX-D) - http://ambix.uni-tuebingen.de:8889
• TUSTEP Multilingual Textdata Processing and Fuzzy Searching - http://www.uni-tuebingen.de/zdv/tustep/tdv_eng.html
• IKEM Toolkit - http://bikit.rug.ac.be:80/ikem/
• DRUID Classification Tools, University of Twente - http://twentyone.tpd.tno.nl/druid/
Search Engines news
• CLEVER project at IBM Almaden Research Center - http://www.almaden.ibm.com/cs/k53/clever.html
• Cora Search Engine - http://www.cora.justresearch.com/about.html
• Google Search Engine - http://www.google.com/why_use.html
• Free AltaVista Search Intranet v2.3A Entry Level Software - http://www.altavista.software.digital.com/search/intranet/free_3k/index.asp
• Ultraseek Server for Linux Platforms - http://software.infoseek.com/products/ultraseek/linux/ultrareq.htm
TUSTEP: TUebingen System of Text Processing Programs
• 1. File structure
• 2. Multilingual capabilities
• 3. Internal data presentation
• 4. Database publishing/output data presentation
• 5. CGI
• 6. Sample implementation
  • http://lddv.zdv.uni-tuebingen.de/cgi-bin/opac/zdvlit
    • Try entries like Smith or Meier or...
  • http://lddv.zdv.uni-tuebingen.de/cgi-bin/km/npquery
TUSTEP: File structure
• TUSTEP can handle basically all kinds of (explicitly or implicitly) structured text files
• Special support for XML
• "Databases" (i.e. files with a repeated and regular structure) are only a special case of this
• Fuzzy search and other retrieval actions can then be used to access the data
TUSTEP: Multilingual capabilities
• TUSTEP supports the following scripts:
  • Latin
  • Cyrillic
  • Greek (classical and modern)
  • Hebrew (with support for Yiddish)
  • Arabic
  • Estrangelo
  • Coptic
  • Old Church Slavonic
• More:
  • Phonetics, Egyptian hieroglyphs
  • Allows use of combining diacritics
  • Experimental: Indic scripts and Armenian
TUSTEP: Internal data presentation and transformation
• Internally, TUSTEP uses a script tagging system with transliteration into ASCII, which allows all data to be encoded in a human-readable and easily transmittable form
• TUSTEP has a module for importing from and exporting into the UCS (UTF-8 and UTF-16)
• Example: #r+Novij rafiqnij clovnik ykra^ins^bko%:^i movi#r-
• Transformation module allows use of other tagging systems and other transliteration schemes
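The example suggests that a script run is bracketed by a `#r+` ... `#r-` pair. The Python sketch below splits a string into tagged runs on that basis. It is only an assumption inferred from the single example on this slide, not TUSTEP's documented tagging syntax, and it does not attempt the transliteration itself; the function name `split_script_runs` and the default label `"latin"` are made up for illustration.

```python
import re

# Assumed marker shape, based only on the slide's example: '#', one
# lowercase letter naming the script, then '+' to open or '-' to close.
SCRIPT_TAG = re.compile(r"#([a-z])([+-])")


def split_script_runs(text, default="latin"):
    """Split text into (script_label, segment) pairs at the markers."""
    runs = []
    current = default
    position = 0
    for match in SCRIPT_TAG.finditer(text):
        if match.start() > position:
            runs.append((current, text[position:match.start()]))
        # An opening marker switches script; a closing one returns to default.
        current = match.group(1) if match.group(2) == "+" else default
        position = match.end()
    if position < len(text):
        runs.append((current, text[position:]))
    return runs
```

Each run could then be handed to a script-specific transliteration table before export to UTF-8 or UTF-16.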
TUSTEP: Database publishing
• TUSTEP's typesetting module offers a high-quality, fast and easy way of publishing all or part of the database in paper (or PDF) form
TUSTEP: CGI
• Complete control over input and output forms
• Possibility to configure exactly the kind of search(es), e.g.
  • exact matches only
  • SoundEX
  • "intelligent" fuzzy search
  • "brute" fuzzy search that allows a number of different letters
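The search modes listed above can be illustrated with textbook algorithms. The Python sketch below uses classic American Soundex and Levenshtein edit distance as stand-ins; it is not TUSTEP's own implementation, and the "intelligent" fuzzy search is not modelled. The names `soundex`, `edit_distance` and `search` are illustrative.

```python
SOUNDEX_CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Classic four-character Soundex code (ASCII letters only)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the run, so a repeat is coded again
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def search(query, names, mode="exact", max_distance=1):
    """Return the names matching the query under the chosen mode."""
    if mode == "exact":
        return [n for n in names if n == query]
    if mode == "soundex":
        return [n for n in names if soundex(n) == soundex(query)]
    if mode == "fuzzy":
        return [n for n in names
                if edit_distance(n.lower(), query.lower()) <= max_distance]
    raise ValueError(f"unknown mode: {mode}")
```

Soundex is tuned to English spelling, so for other languages a different phonetic key would be needed; the bounded edit distance is language-neutral but slower on large indexes.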
Multilinguality framework of the project
• Multiple language indexing
  • multiple language documents/indexes
• Cross-language Searching
  • Multiple language indexes/documents
  • Automatic Query forwarding based on thesauri
• Automatic translation
  • Multilingual information retrieval
  • Translation Request Protocol
• Language and Character Encoding tagging
  • XML as internal presentation of data
  • Using XML language and charset tagging
• Metadata
  • DC.Language definition
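Thesaurus-based query forwarding can be sketched in a few lines of Python. The toy thesaurus, its three-language entries and the function `forward_query` are all hypothetical, made up for this illustration; a real gateway would draw on a maintained multilingual thesaurus and handle multi-word terms.

```python
# Toy multilingual thesaurus (illustrative data): concept -> language -> term.
THESAURUS = {
    "table": {"en": "table", "es": "mesa", "de": "Tisch"},
    "chair": {"en": "chair", "es": "silla", "de": "Stuhl"},
}


def forward_query(query, source_lang, target_langs, thesaurus=THESAURUS):
    """Rewrite a query word by word for each target-language index."""
    # Reverse index: source-language term -> concept.
    lookup = {
        terms[source_lang].lower(): concept
        for concept, terms in thesaurus.items()
        if source_lang in terms
    }
    forwarded = {}
    for lang in target_langs:
        translated = []
        for word in query.lower().split():
            concept = lookup.get(word)
            if concept is not None and lang in thesaurus[concept]:
                translated.append(thesaurus[concept][lang])
            else:
                translated.append(word)  # unknown terms pass through unchanged
        forwarded[lang] = " ".join(translated)
    return forwarded


print(forward_query("mesa silla", "es", ["en", "de"]))
```

Each rewritten query would then be sent to the index for its language, and the result sets merged for display.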