Linking legal thesaurii to enable semi-automated multilingual searching
Philip Chung, Graham Greenleaf & Andrew Mowbray, Co-Directors, AustLII
Law via the Internet Conference, Jersey, Channel Islands, September 2013

Outline
• Cross-lingual searching: Issues
  • Document vs query translation
  • Proposed approach to query translation
• Searching in multiple languages (multi-lingual)
  • Extending SINO using the u16a representation
  • Using SINO’s synonym function
• Discussion and future work

Cross-lingual searching
• Cross-lingual searching = retrieval of documents in a language other than the language of the query
• Main motivations:
  • Allow monolingual searchers to be aware of the existence of relevant documents in other languages
  • Assist users who are more familiar with one language to find documents in other languages
  • Avoid the need to enter search queries in different languages
• Key issue: translation
  • Document translation
  • Query translation

Document translation
• Translate documents into the language of the query
  • e.g. translating all documents into English
• Searching can then be done directly in the language of the query
• Users may also be able to read and use the translated file (assuming a good translation)
• However, very resource intensive: it is impossible to translate documents into all languages
• Unless documents are already translated, this approach is not practical for free-access LIIs

Query translation
• Translate the search query into the languages of the documents contained in the system
• Fewer words to be translated
• More flexible (translations may be dynamically generated)
• However, may not be able to handle complex queries
• Retrieved documents may then need to be translated into a language that the user understands
• This approach is more feasible from a free-access LII’s perspective

A possible approach to query translation
• Creating new bilingual mappings of legal terms is too expensive
• Use of existing bilingual dictionaries/glossaries is more practical, where they exist
• For cross-lingual searches across multiple languages, use one language as a ‘link language’
  • e.g. English, to construct mapping tables
• Each term in the query is then expanded based on the equivalent entries in the mapping table
• Search is then conducted over the corpus based on the expanded term(s)
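
The expansion step above can be sketched roughly as follows. This is a minimal illustration, not SINO's implementation: the table layout, the function names `expand_term` and `expand_query`, and the `and`/`or` operator spelling are assumptions made for the example. The sample terms are taken from the glossary examples in this presentation.

```python
from typing import Dict, List

# Hypothetical mapping table keyed by the English (link-language) term.
MAPPING_TABLE: Dict[str, List[str]] = {
    "genocide": ["genosida", "危害種族"],
    "compensation": ["kompensasi", "補償"],
    "crime against humanity": ["kejahatan terhadap kemanusiaan", "反人道罪"],
}


def expand_term(term: str, table: Dict[str, List[str]]) -> List[str]:
    """Return the term followed by its equivalents; unknown terms pass through."""
    return [term] + table.get(term.lower(), [])


def expand_query(terms: List[str], table: Dict[str, List[str]]) -> str:
    """OR together the alternatives for each term, then AND the groups."""
    groups = []
    for term in terms:
        alternatives = expand_term(term, table)
        # Quote multi-word phrases so they stay together in the query string.
        quoted = [f'"{a}"' if " " in a else a for a in alternatives]
        groups.append("(" + " or ".join(quoted) + ")")
    return " and ".join(groups)


print(expand_query(["genocide", "law"], MAPPING_TABLE))
```

Terms with no entry in the table (here `law`) are searched as typed, which is one simple way to handle non-legal words in a query.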

Legal dictionaries relevant to East Asia (in likely order of availability)
• Hong Kong: Chinese (HK) <-> English
  • Official translation dictionary of the Hong Kong government is available
• EuroVoc: 22 European languages <-> English
  • Available for use
• Indonesia: Bahasa Indonesia <-> English
  • Dictionary of basic legal terms developed by AustLII
• Japan: Japanese <-> English
  • Japanese Law Translation dictionary (Nagoya project) is available for third-party use, with various download options
• South Korea: Korean <-> English
  • MOLEG and/or KLRI have developed a dictionary
• Taiwan: Chinese (Tw) <-> English
  • Prof Amy Shee’s group may be developing a dictionary
• Vietnam: Vietnamese <-> English
  • Law Science Institute (Hanoi) has developed a dictionary, but availability is uncertain

Example 1: Indonesian <-> English
Bahasa Indonesia <-> English
• genosida <-> genocide
• kompensasi <-> compensation
• kejahatan terhadap kemanusiaan <-> crime against humanity

Example 2: Chinese <-> English
Chinese (HK) <-> English
• 危害種族 <-> genocide
• 補償 <-> compensation
• 反人道罪 <-> crimes against humanity
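
As a rough sketch of how two such glossaries might be joined through English as the link language, the following groups foreign terms under a normalised English key. The language codes `id` and `zh-HK`, the `normalise` rule (lowercase, collapsed whitespace) and the function names are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def normalise(term: str) -> str:
    """Lowercase and collapse whitespace (a deliberately naive assumption)."""
    return " ".join(term.lower().split())


def build_link_table(
    glossaries: Dict[str, Iterable[Tuple[str, str]]]
) -> Dict[str, Dict[str, Set[str]]]:
    """Map each English term to the foreign terms recorded for each language."""
    table: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for lang, pairs in glossaries.items():
        for foreign, english in pairs:
            table[normalise(english)][lang].add(foreign)
    return table


def equivalents(table, term: str, source_lang: str, target_lang: str) -> List[str]:
    """Find target-language terms that share an English entry with `term`."""
    found: Set[str] = set()
    for by_lang in table.values():
        if term in by_lang.get(source_lang, set()):
            found |= by_lang.get(target_lang, set())
    return sorted(found)


# Entries copied from the two example glossaries in this presentation.
GLOSSARIES = {
    "id": [
        ("genosida", "genocide"),
        ("kompensasi", "compensation"),
        ("kejahatan terhadap kemanusiaan", "crime against humanity"),
    ],
    "zh-HK": [
        ("危害種族", "genocide"),
        ("補償", "compensation"),
        ("反人道罪", "crimes against humanity"),
    ],
}

table = build_link_table(GLOSSARIES)
print(equivalents(table, "genosida", "id", "zh-HK"))
```

With this naive normalisation, "crime against humanity" and "crimes against humanity" fall under different English keys, so the third pair is not linked. Some form of stemming or manual review of the English side would probably be needed when merging real dictionaries.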

EuroVoc
• Mapping table may be extended using EuroVoc
  • The European Union’s official multilingual thesaurus
  • It contains a sub-section for legal terms
• EuroVoc contains terms in 23 EU languages
  • Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish and Swedish
  • plus Serbian
• Link languages
  • English can then be the link language between European languages and some Asian languages in many Asian jurisdictions
  • Portuguese can also be a link language to some Asian jurisdictions and other Asian languages

EuroVoc example
• bg: престъпление против човечеството
• es: crimen contra la humanidad
• cs: trestný čin proti lidskosti
• da: forbrydelse mod menneskeheden
• de: Verbrechen gegen die Menschlichkeit
• et: inimsusvastane kuritegu
• el: έγκλημα κατά της ανθρωπότητας
• en: crime against humanity
• fr: crime contre l'humanité
• it: crimine contro l'umanità
• lv: noziegums pret cilvēci
• lt: nusikaltimas žmogiškumui
• hu: emberiség elleni bűncselekmény
• mt: crime against humanity (under translation)
• nl: misdaad tegen de menselijkheid
• pl: zbrodnia przeciwko ludzkości
• pt: crime contra a humanidade
• ro: crime împotriva umanității
• sk: zločin proti ľudskosti
• sl: zločin proti človečnosti
• fi: rikos ihmisyyttä vastaan
• sv: brott mot mänskligheten
• hr: zločin protiv čovječnosti
• sr: злочин против човечности

Simultaneous searching in multiple languages
• AustLII’s SINO search engine
  • Open source, free-text search engine
  • Speed, flexibility, portability and reliability
    • build performance: 20GB per hour on commodity hardware
    • search performance: single-word searches return in under 0.050 seconds
  • ‘Size is no object’ – trade-off between disk space and speed of retrieval
    • concordance ratio: 55% approx – relatively large, but the concordance is easy to read and minimises unnecessary file input/output
  • Used by many LIIs from around the world: BAILII, PacLII, SAFLII, LIIofIndia, HKLII, NZLII, LiberLII, CyLaw, SamLII

Simultaneous searching in multiple languages (2)
• SINO was developed initially for English and has been extended to other western languages
• Extending SINO to handle UTF-8 encoding for multilingual searching

SINO’s u16a representation
• Any non-ASCII UTF-8 character (e.g. Chinese, Korean, Thai) can be converted into an alphanumeric (flat) representation
• Hexadecimal form – 0 to 9 and A to F
  • Resulting form may be confused with numeric words in western languages
  • ‘春’ is ‘6625’ in hexadecimal form

SINO’s u16a representation (2)
• The characters ‘u16a’ are added to any such representation to create a unique string
  • ‘u16a’ is rare to non-existent in natural language
• These u16a ‘shadow files’ are then used by SINO for searching (as a proxy for the originals)
  • Text in the original language is presented to the user
• Example: bankrupt* or insolven* or การล้มละลาย or kepailitan or pailit or 破產 or 破产 or phá sản
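
A minimal sketch of the idea behind the u16a conversion follows. The slides do not specify the exact token layout, so several details here are assumptions: that ‘u16a’ is a prefix, that each non-ASCII character becomes its own space-separated token, and that the hexadecimal value is the Unicode code point (which matches the ‘春’ = ‘6625’ example). The function name `to_u16a` is hypothetical.

```python
def to_u16a(text: str) -> str:
    """Replace each non-ASCII character with an ASCII token 'u16a' + hex code point.

    Assumed layout only: the real SINO shadow-file format may differ.
    """
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)  # ASCII passes through unchanged
        else:
            # Pad with spaces so each converted character is a separate word.
            pieces.append(f" u16a{ord(ch):04X} ")
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


print(to_u16a("春"))  # u16a6625
```

Because the output is plain ASCII, a search engine built for western languages could index it without changes, and the same conversion would be applied to query terms before searching.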

SINO and synonyms
• Possible implementation of query translation
• Synonyms can be defined via the .sino_synonyms file
  • Consists of zero or more lines, each with a comma-separated list of words and/or phrases
  • For example:
    • unsw, “university of new south wales”
    • small, tiny, little
• Use of a .sino_synonyms file as a starting point for automating cross-lingual searching
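
To illustrate how a mapping table might be turned into synonym lines of the kind shown above, here is a small sketch. It assumes that phrases are wrapped in straight double quotes and that non-ASCII terms can be written as-is; whether SINO expects them in u16a form instead is not stated in the slides. The helper names are hypothetical.

```python
from typing import Iterable, List


def synonym_line(terms: Iterable[str]) -> str:
    """Format one synonym group as a comma-separated line, quoting phrases."""
    parts: List[str] = []
    for term in terms:
        term = term.strip()
        parts.append(f'"{term}"' if " " in term else term)
    return ", ".join(parts)


def render_synonyms(groups: Iterable[List[str]]) -> str:
    """Render all groups, skipping any with fewer than two terms."""
    lines = [synonym_line(g) for g in groups if len(g) > 1]
    return "\n".join(lines) + ("\n" if lines else "")


groups = [
    ["unsw", "university of new south wales"],
    ["genocide", "genosida", "危害種族"],
    ["orphan"],  # no equivalents known, so no line is emitted
]
print(render_synonyms(groups), end="")
```

The rendered text could then be written to a `.sino_synonyms` file, for example with `open(path, "w", encoding="utf-8")`.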

Discussion and future work
• What are the criteria for success of cross-lingual searching?
  • What level of false positives is acceptable?
  • What testing would be most useful?
• Extracting and mapping legal terms from multiple dictionaries
• Developing an interface to manage the addition of new legal terms

Discussion and future work (2)
• What if a search contains non-legal terms?
  • Could automated translations supplement dictionaries?
  • Addition of general (non-legal) terms to dictionaries
    • coverage vs performance
• Possible performance improvement: expand legal terms at concordance time
  • Rather than simply indexing on the words of the original text
  • Include in the concordance the expanded list of legal terms in multiple languages
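
The concordance-time expansion suggested above could look roughly like the toy inverted index below. This is only a sketch under simplifying assumptions: whitespace tokenisation, single-word terms, and a hypothetical `EQUIVALENTS` table. It does not reflect how SINO builds its concordance.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical single-word equivalents, listed in both directions.
EQUIVALENTS: Dict[str, List[str]] = {
    "kepailitan": ["bankruptcy"],
    "bankruptcy": ["kepailitan"],
}


def build_concordance(
    docs: Dict[str, str], equivalents: Dict[str, List[str]]
) -> Dict[str, Set[str]]:
    """Index each word and, additionally, its known equivalents."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
            # Expansion happens here, at index time, not at query time.
            for other in equivalents.get(word, ()):
                index[other].add(doc_id)
    return index


# Made-up documents for illustration only.
docs = {
    "doc-id": "putusan kepailitan",
    "doc-en": "the bankruptcy act",
}
index = build_concordance(docs, EQUIVALENTS)
print(sorted(index["bankruptcy"]))
```

A query for `bankruptcy` then finds both documents without any query rewriting. The cost is a larger concordance, which ties in with the coverage vs performance question raised above.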