Searching law in multiple Asian languages

Searching law in multiple Asian languages. Philip Chung, Executive Director, AustLII; Senior Lecturer in Law, UNSW. Languages spoken. Internet users by language. Issues in comparative law searches across Asia.

Presentation Transcript


  1. Searching law in multiple Asian languages

    Philip Chung, Executive Director, AustLII; Senior Lecturer in Law, UNSW
  2. Languages spoken
  3. Internet users by language
  4. Issues in comparative law searches across Asia

    17 of 28 Asian countries use languages with double-byte encodings for their legal materials. The other 11 use only English, Bahasa Indonesia or Portuguese. Searching is further complicated by multiple encodings of some languages.
  5. Character sets – brief history

    ASCII covered unaccented English letters. It represented every printable character using a number between 32 and 127, so all of them could be stored in 7 bits, with 1 bit to spare in an 8-bit byte.
  6. ANSI and DBCS

    A ‘free-for-all’ of OEM character sets (using codes 128-255) flourished, especially in Europe. Under the ANSI standard, everyone agreed on what is represented below 128 (basically ASCII), with extensions handled using code pages. Double-byte character sets (DBCS) were used for Chinese, Japanese and Korean.
  7. Character sets and encoding schemes

    Most Asian languages are not easily represented by the standard ASCII codes (128 code points). Variants developed: VSCII (a variant of ASCII for Vietnamese, removing some not commonly used ASCII codes, but still 7 bit); for Chinese, Big-5 and GB 2312 (8-bit encoding).
  8. Development of Unicode

    Unicode is a single character set that includes every ‘reasonable’ writing system (incl Klingon). Each ‘letter’ is assigned a number, its code point, eg ‘A’ -> U+0041. Encodings: UTF-16 (2 bytes; traditional, but either BE or LE); UTF-32 (4 bytes; same number of bytes for every character); UTF-8 (popular for web documents).
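
The difference between a code point and its encodings can be seen with a few lines of Python. This is only an illustrative sketch; `describe` is a hypothetical helper name, not something from the presentation, and it uses only the standard `str.encode` and `bytes.hex` methods (Python 3.8 or later for the separator argument).

```python
def describe(ch: str) -> dict:
    """Return the code point of one character and its bytes in common encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8": ch.encode("utf-8").hex(" "),
        "utf16_be": ch.encode("utf-16-be").hex(" "),  # big-endian byte order
        "utf16_le": ch.encode("utf-16-le").hex(" "),  # little-endian byte order
        "utf32_be": ch.encode("utf-32-be").hex(" "),
    }


if __name__ == "__main__":
    for ch in ("A", "春"):
        print(ch, describe(ch))
```

The same code point (for example U+6625 for ‘春’) comes out as three bytes in UTF-8, two in UTF-16 and four in UTF-32, and the two UTF-16 variants differ only in byte order.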
  9. AustLII’s SINO search engine

    SINO is an open source, free text search engine. Its aims are speed, flexibility, portability and reliability. ‘Size is no object’: a trade-off between disk space and speed of retrieval. It was developed initially for English and has been extended to other western languages. It is used by many LIIs from around the world.
  10. Development of SINO for Asian languages

    It is desirable to provide a consistent search facility in both English and one or more Asian languages. AsianLII is a comparative law research system: English (incl translations), enhanced by documents in a country’s own language, with a uniform search interface (operators, relevance ranking).
  11. SINO’s u16a representation

    Any non-ASCII UTF-8 character (eg Chinese, Korean) can be converted into an alphanumeric (flat) representation: a hexadecimal form using 0 to 9 and A to F. On its own, the resulting form may be confused with numeric words in western languages: ‘春’ is ‘6625’ in hexadecimal form.
  12. SINO’s u16a representation (2)

    The characters ‘u16a’ are added to any such representation to create a unique string; ‘u16a’ is rare to non-existent in natural language. These u16a ‘shadow files’ are then used by SINO for searching (as a proxy for the original), while text in the original language is presented to the user.
  13. U16a representation – example 1

    Chinese: 鑑於第539/2001號歐盟委員會理事會條例(Council Regulation(EC)No 539/2001),已藉通知被實施於冰島的國家法律中

    U16a representation: 9451U16A 65BCU16A 7B2CU16A 539/2001 865FU16A 6B50U16A 76DFU16A 59D4U16A 54E1U16A 6703U16A 7406U16A 4E8BU16A 6703U16A 689DU16A 4F8BU16A (Council Regulation ( EC ) No 539/2001 )FF0CU16A 5DF2U16A 85C9U16A 901AU16A 77E5U16A 88ABU16A 5BE6U16A 65BDU16A 65BCU16A 51B0U16A 5CF6U16A 7684U16A 570BU16A 5BB6U16A 6CD5U16A 5F8BU16A 4E2DU16A
  14. U16a representation – example 2

    Thai: กรมตรวจบัญชีสหกรณ์ได้มีหนังสือที่กษ ๐๔๐๑/๑๑๐๐๔

    U16a representation: 0E01U16A 0E23U16A 0E21U16A 0E15U16A 0E23U16A 0E27U16A 0E08U16A 0E1AU16A 0E31U16A 0E0DU16A 0E0AU16A 0E35U16A 0E2AU16A 0E2BU16A 0E01U16A 0E23U16A 0E13U16A 0E4CU16A 0E44U16A 0E14U16A 0E49U16A 0E21U16A 0E35U16A 0E2BU16A 0E19U16A 0E31U16A 0E07U16A 0E2AU16A 0E37U16A 0E2DU16A 0E17U16A 0E35U16A 0E48U16A 0E01U16A 0E29U16A 0E50U16A 0E54U16A 0E50U16A 0E51U16A / 0E51U16A 0E51U16A 0E50U16A 0E50U16A 0E54U16A
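
The conversion illustrated in slides 11 to 14 can be approximated in a few lines of Python. This is a minimal sketch inferred from the two examples, not SINO's actual code: the function name `to_u16a` is hypothetical, and it assumes that each non-ASCII character becomes its hexadecimal code point followed by `U16A`, while runs of ASCII text are passed through unchanged. How SINO itself spaces ASCII punctuation (as in the bracketed English text of example 1) is not reproduced here.

```python
import re

SUFFIX = "U16A"  # marker appended to each hexadecimal form, as in the slide examples

# Match either one non-ASCII character or a whole run of ASCII characters.
_PIECES = re.compile(r"[^\x00-\x7F]|[\x00-\x7F]+")


def to_u16a(text: str) -> str:
    """Flatten text into a space-separated, ASCII-only 'shadow' form (assumed format)."""
    tokens = []
    for match in _PIECES.finditer(text):
        piece = match.group()
        if ord(piece[0]) > 0x7F:
            # One non-ASCII character: at least 4 hex digits of its code point.
            # Characters outside the Basic Multilingual Plane would give 5 or 6 digits.
            tokens.append(f"{ord(piece):04X}{SUFFIX}")
        else:
            stripped = piece.strip()
            if stripped:  # skip runs that are only whitespace
                tokens.append(stripped)
    return " ".join(tokens)


if __name__ == "__main__":
    print(to_u16a("鑑於第539/2001號"))
    print(to_u16a("กรม"))
```

Because every token in the shadow text is plain ASCII, an engine built for western languages can index it without modification, which appears to be the point of the design.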
  15. How universal is the u16a representation?

    In theory, it is of universal application. How well it works differs between languages, particularly in search speed. Storage overhead: it doubles the storage needed for double-byte languages. Chinese: searching is reasonable because there are 5,000 or so (commonly used) unique characters encoded. Big-5 supports 13,053 characters; GB 2312 supports 6,763 characters (cf GB 18030).
  16. U16a representation

    Thai: more difficult because there are only 44 alphabetic characters and no word delimiters. However, searching scales up over 120K docs. Vietnamese: complexity is due to 2 methods of encoding (VSCII and Unicode). It requires more characters than ASCII provides: a combination of Latin characters and an extended character set with diacritics.
  17. Experience of HKLII

    Hong Kong Legal Information Institute. A bilingual system (English and Traditional Chinese). All legislation is in both Chinese and English; 25% of decisions are now written in Chinese. It used two separate search engines: one for English (SINO), one for Chinese (mnoGoSearch). See Fung et al (2011).
  18. HKLII as a testbed

    Indexing speed.
  19. HKLII as a testbed (2)

    Search speed. Top 500 Chinese search phrases. Random sampling when using search connectors.
  20. Searching AsianLII in multiple Asian languages

    English & Thai: bankrupt* or insolven* or การล้มละลาย

    English, Thai, Bahasa Indonesia, Chinese & Vietnamese: bankrupt* or insolven* or การล้มละลาย or kepailitan or pailit or 破產 or 破产 or Phásản

    Thai, Bahasa Indonesia, Chinese, Vietnamese & Korean: การล้มละลาย or kepailitan or pailit or 破產 or 破产 or Phásản or 파산

    Search operators: title(鍾 or chung or zhong)
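
Queries like those on slide 20 are simply per-language search terms joined with the `or` connector. The sketch below shows one way to assemble them; `build_or_query` is a hypothetical helper, not part of SINO or AsianLII, and the only syntax it relies on is the `or` connector and `title(...)` operator shown on the slide.

```python
from typing import Dict, List, Optional


def build_or_query(terms_by_language: Dict[str, List[str]],
                   field: Optional[str] = None) -> str:
    """Join every term from every language with ' or '; optionally wrap in a field operator."""
    terms = [term for terms in terms_by_language.values() for term in terms]
    query = " or ".join(terms)
    return f"{field}({query})" if field else query


if __name__ == "__main__":
    print(build_or_query({
        "English": ["bankrupt*", "insolven*"],
        "Thai": ["การล้มละลาย"],
    }))
    print(build_or_query({"names": ["鍾", "chung", "zhong"]}, field="title"))
```

Keeping the terms grouped by language makes it easy to add or drop a jurisdiction without retyping the whole query.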
  21. Future work

    Word segmentation issue: identifying ‘word’ boundaries without explicit space delimiters. Cross-lingual searching: retrieval of documents in a language other than the language of the query; query translation or document translation; use of bilingual dictionaries and a ‘link language’, eg English.
  22. Word segmentation problem Example (Pun, Chong & Chan 2003): 我們要發展中國家用電器 One way to segment this sentence: Another way:
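
The ambiguity in the slide 22 sentence can be reproduced with a toy dictionary-based segmenter. The sketch below is illustrative only: the lexicon is an assumption made up for this example, and forward and backward maximum matching are standard textbook techniques, not necessarily the method used by Pun, Chong & Chan or by SINO. The two segmentations it yields are not necessarily the ones shown on the original slide.

```python
# Hypothetical mini-lexicon, chosen so that both readings of the sentence are possible.
LEXICON = {"我們", "要", "發展", "發展中", "中國", "國家", "家用", "用", "電器"}


def forward_max_match(text, lexicon, max_len=3):
    """Scan left to right, always taking the longest dictionary word available."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:  # fall back to a single character
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len=3):
    """Scan right to left, always taking the longest dictionary word available."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    return words[::-1]


if __name__ == "__main__":
    sentence = "我們要發展中國家用電器"
    print("/".join(forward_max_match(sentence, LEXICON)))
    print("/".join(backward_max_match(sentence, LEXICON)))
```

With this lexicon the forward scan groups 發展中 and 國家 ("developing countries"), while the backward scan groups 中國 and 家用 ("China" and "household"), so the same character string supports two different readings and two different sets of index terms.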
  23. Thank you Comments / Questions? philip@austlii.edu.au