Searching law in multiple Asian languages. Philip Chung, Executive Director, AustLII; Senior Lecturer in Law, UNSW. Languages spoken. Internet users by language. Issues in comparative law searches across Asia.
Presentation Transcript
Searching law in multiple Asian languages
Philip Chung, Executive Director, AustLII; Senior Lecturer in Law, UNSW
Languages spoken
Internet users by language
Issues in comparative law searches across Asia. 17 of 28 Asian countries use languages with double-byte encodings for their legal materials. The other 11 use only English, Bahasa Indonesia or Portuguese. Searching is further complicated by multiple encodings of some languages.
Character sets: a brief history. ASCII: covered unaccented English letters; represented every printable character using a number between 32 and 127; all codes can be stored in 7 bits, leaving 1 bit to spare in an 8-bit byte.
ANSI and DBCS. ‘Free-for-all’: OEM character sets (using codes 128-255) flourished, especially in Europe. ANSI standard: everyone agreed on what is represented below 128 (basically ASCII), with extensions handled using code pages. Double-byte character sets (DBCS): Chinese, Japanese and Korean.
Character sets and encoding schemes. Most Asian languages are not easily represented by the standard ASCII codes (128 code points). Variants developed: VSCII (a variant of ASCII for Vietnamese, removing some not commonly used ASCII codes, but still 7 bit); Chinese: Big-5 and GB 2312 (double-byte encodings built from 8-bit bytes).
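The effect of having more than one encoding for a language can be illustrated with Python's built-in codecs. This is a minimal sketch, assuming the `big5` and `gb2312` codec names shipped with CPython; the sample character is chosen only for illustration and does not come from the slides.

```python
# The same character is stored as different byte sequences in
# Big-5 and GB 2312, so a search system has to know which encoding
# each document uses before it can index it.
ch = "中"  # illustrative sample character (an assumption)

big5_bytes = ch.encode("big5")
gb_bytes = ch.encode("gb2312")
print(big5_bytes.hex(), gb_bytes.hex())

# Decoding with the wrong codec does not necessarily fail; it can
# silently produce a different character or a replacement mark.
wrong = big5_bytes.decode("gb2312", errors="replace")
right = big5_bytes.decode("big5")
print(repr(wrong), repr(right))
```

Converting every document to Unicode before indexing is one way to avoid this ambiguity.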
Development of Unicode. Unicode: a single character set that aims to include every ‘reasonable’ writing system (even Klingon has been proposed). Each ‘letter’ is assigned a number, its code point, eg ‘A’ -> U+0041. Encodings: UTF-16 (2 bytes per code unit, the traditional form, stored big-endian or little-endian); UTF-32 (4 bytes, the same number for every code point); UTF-8 (variable length, popular for web documents).
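The relationship between a code point and its encodings can be checked directly in Python. The sketch below is illustrative only; the helper name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for ch in ("A", "春"):
    # ord() gives the code point; format it in the usual U+XXXX style.
    print(ch, f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-16 byte order matters: the same code point, two byte layouts.
print("A".encode("utf-16-be"), "A".encode("utf-16-le"))
```

‘A’ needs one byte in UTF-8 while ‘春’ needs three, whereas UTF-32 always uses four bytes per code point.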
AustLII’s SINO search engine. Open source, free-text search engine. Speed, flexibility, portability and reliability. ‘Size is no object’: a trade-off between disk space and speed of retrieval. Developed initially for English and since extended to other western languages. Used by many LIIs around the world.
Development of SINO for Asian languages. It is desirable to provide a consistent search facility in both English and one or more Asian languages. AsianLII as a comparative law research system: English (incl translations), enhanced by documents in a country’s own language. Uniform search interface (operators, relevance ranking).
SINO’s u16a representation. Any non-ASCII UTF-8 character (eg Chinese, Korean) can be converted into an alphanumeric (flat) representation in hexadecimal form, using 0 to 9 and A to F. The resulting form may be confused with numeric words in western languages: ‘春’ is ‘6625’ in hexadecimal form.
SINO’s u16a representation (2). The characters ‘u16a’ are added to any such representation to create a unique string; ‘u16a’ is rare to non-existent in natural language. These u16a ‘shadow files’ are then searched by SINO as a proxy for the original, while text in the original language is presented to the user.
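The conversion described above can be sketched as follows. This is not SINO's actual code: the slides do not say where ‘u16a’ is attached or how tokens are delimited, so the sketch assumes a ‘u16a’ prefix and one space-separated token per non-ASCII character. The function name `to_u16a` is hypothetical.

```python
U16A_TAG = "u16a"  # tag named in the slides; using it as a prefix is an assumption

def to_u16a(text: str) -> str:
    """Build a flat ASCII 'shadow' form of text for indexing."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)  # ASCII passes through unchanged
        else:
            # Hexadecimal code point, tagged so that it cannot be
            # mistaken for an ordinary number such as '6625'.
            pieces.append(f" {U16A_TAG}{ord(ch):04X} ")
    # Collapse the padding so tokens are separated by single spaces.
    return " ".join("".join(pieces).split())

print(to_u16a("春"))
print(to_u16a("bankrupt 破產"))
```

Under these assumptions an engine built for space-delimited western text can index the shadow form unchanged, treating each tagged code point as an ordinary word.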
How universal is the u16a representation? In theory, it is of universal application. How well it works differs between languages, particularly in search speed. Storage overhead: it doubles the storage needed for double-byte languages. Chinese: searching works reasonably well because there are 5,000 or so commonly used unique characters; Big-5 supports 13,053 characters and GB 2312 supports 6,763 characters (cf GB 18030).
u16a representation. Thai: more difficult because there are only 44 alphabetic characters and no word delimiters; however, searching scales to more than 120K documents. Vietnamese: complexity arises from 2 methods of encoding (VSCII and Unicode); the language requires more characters than ASCII provides, combining Latin characters with an extended character set with diacritics.
Experience of HKLII. Hong Kong Legal Information Institute: a bilingual system (English and Traditional Chinese). All legislation is in both Chinese and English; 25% of decisions are now written in Chinese. Used two separate search engines: one for English (SINO), one for Chinese (mnoGoSearch). Fung et al (2011).
HKLII as a testbed Indexing speed
HKLII as a testbed (2). Search speed: top 500 Chinese search phrases; random sampling when using search connectors.
Searching AsianLII in multiple Asian languages
English & Thai: bankrupt* or insolven* or การล้มละลาย
English, Thai, Bahasa Indonesian, Chinese & Vietnamese: bankrupt* or insolven* or การล้มละลาย or kepailitan or pailit or 破產 or 破产 or Phásản
Thai, Bahasa Indonesian, Chinese, Vietnamese & Korean: การล้มละลาย or kepailitan or pailit or 破產 or 破产 or Phásản or 파산
Search operators: title(鍾 or chung or zhong)
Future work. Word segmentation issue: identifying ‘word’ boundaries without explicit space delimiters. Cross-lingual searching: retrieval of documents in a language other than the language of the query; query translation or document translation; use of bilingual dictionaries and a ‘link language’, eg English.
Word segmentation problem Example (Pun, Chong & Chan 2003): 我們要發展中國家用電器 One way to segment this sentence: Another way:
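The ambiguity can be reproduced with a small dictionary-based segmenter. This is a sketch only: the lexicon below is a hypothetical word list assembled for this example, not taken from Pun, Chong & Chan (2003), and the segmentations it prints are not claimed to be the ones shown on the slide.

```python
def segmentations(text: str, lexicon: set) -> list:
    """Return every way to split text into words found in lexicon."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical lexicon (an assumption made for illustration).
LEXICON = {"我們", "要", "發展", "發展中", "中國", "國家", "家用", "用", "電器"}

for words in segmentations("我們要發展中國家用電器", LEXICON):
    print(" / ".join(words))
```

With this lexicon the same character string yields more than one valid reading, which is why a segmenter needs context or statistics, not just a word list, to choose between them.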
Thank you Comments / Questions? philip@austlii.edu.au