Collation Sequences and Related Issues for Pakistani Languages (پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل) • Sarmad Hussain (سرمد حسین) • Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences
Purpose of Presentation • Briefly discuss character sets • Discuss Urdu Collating sequence • Propose a possible Urdu collation sequence • Overview collation of other languages of Pakistan
Sources • Urdu • Akhbar-e-Urdu (Special Supplement on Urdu Software; Jan-Feb. 2002), National Language Authority, Islamabad • Balochi • Fax communication (Sept. 2002), Balochi Academy, Quetta • Pashto • Fax communication (Sept. 2002), Pashto Academy, Peshawar • Punjabi • Punjabi Qaida (Experimental), Punjabi Adabi Board, Lahore • Sindhi • Sindhi Boli (July-Dec. 2001) and SLA Letter Circulation of Sindhi Collation (June 2002), Sindhi Language Authority, Hyderabad
Urdu • آابپتٹثجچحخ دڈذرڑزژ سشصض طظعغ فقکگ لم نوہءیے • Source: اردو قائدہ (Urdu Qaida), Ferozsons, Lahore
Urdu Alphabet: State of Affairs • Are the following letters of Urdu? • آ • أٶ • بھ پھ تھ ۔ ۔ ۔ ... • ں • ة • لھ مھ نھ ںھ وھ • If yes, where are they placed in the alphabet?
Sources • Data from eight dictionaries of Urdu • Feroz-ul-Lughat Jame (فیروزاللغات جامع), Ferozsons, Lahore (FLJ) • Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD) • Farhang-e-Talaffuz (فرہنگِ تلفظ), National Language Authority, Islamabad (FT) • Jadeed Urdu Lughat (جدید اردو لغت), National Language Authority, Islamabad (JUL) • Urdu Lughat (اردو لغت), Urdu Dictionary Board, Karachi (UL) • A Dictionary of Urdu, Classical Hindi and English, Crosby Lockwood and Son, London (1911) (UHE) • Farhang-e-Asifiya (فرہنگ آصفیہ), Delhi (1918) (FA) • Noor-ul-Lughat (نوراللغات), Sang-e-Meel, Lahore (NL)
Urdu Alphabet: State of Affairs • FT, JUL , UL اآببھپپھتتھٹٹھثججھچچھحخ د دھڈڈھذررھڑڑھزژ سشصضطظعغ ف قککھگگھللھممھںںھننھوہءیے • FLJ, NL آابپتٹثجچحخ دڈذرڑزژ س شصضطظعغ فقکگلمںنو ہ ھءیے • UHE, FA , STCD ابپتٹثجچحخ دڈذرڑزژ س شصضطظعغ فقکگلمنو ہ ھءیے
Conclusions: Urdu Character Set • No general agreement on the Urdu character set among dictionary publishers • Standard character set defined by the National Language Authority and the Urdu Dictionary Board • not traditional • not well-publicized • not completely adopted • The GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set • UZT 1.01 will soon be fully represented in Unicode/ISO/IEC 10646
Character Set • Alphabet • Harakat (Aerab) • Other Symbols
“Familiar” Harakaat (Aerab) • Jazm دْ • Zabar دَ • Zer دِ • Pesh دُ • Khari zabar دٰ • Khari zer دٖ • Ulta pesh دٗ • Do zabar دً • Do zer دٍ • Do pesh دٌ • Tashdeed دّ • Noon ghunna ن
“Common” Other Symbols • Numbers: 0 ۰, 1 ١, 2 ٢, 3 ٣, 4, 5 ۵, 6 ٦, 7, 8 ٨, 9 ٩ • Punctuation: ؟ ؛ ٬ - • Honorifics
Logical Sections of UZT 1.01 • Alphabet (80 – 122) • Aerab/diacritics/harakat (66 – 79, 123 – 126) • Other characters • Punctuation and arithmetic symbols (32 – 47, 58 – 65) • Digits (48 – 57) • Special symbols (160 – 176, 192 – 199) • Miscellaneous • Control characters (0 – 31, 127) • Reserved control space (128 – 159, 255) • Reserved expansion space (177 – 191, 200 – 207, 240 – 253) • Vendor area (208 – 239) • Toggle character (254)
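The section boundaries listed above can be expressed as a small lookup table. The following is a minimal Python sketch, assuming the numbers on the slide are decimal code values in a single-byte (0 to 255) table. The names `UZT_SECTIONS` and `uzt_section` are illustrative only and are not part of UZT 1.01.

```python
# Ranges copied from the slide; assumed to be decimal and inclusive.
UZT_SECTIONS = [
    ("control characters", [(0, 31), (127, 127)]),
    ("punctuation and arithmetic symbols", [(32, 47), (58, 65)]),
    ("digits", [(48, 57)]),
    ("aerab", [(66, 79), (123, 126)]),
    ("alphabet", [(80, 122)]),
    ("reserved control space", [(128, 159), (255, 255)]),
    ("special symbols", [(160, 176), (192, 199)]),
    ("reserved expansion space", [(177, 191), (200, 207), (240, 253)]),
    ("vendor area", [(208, 239)]),
    ("toggle character", [(254, 254)]),
]


def uzt_section(code):
    """Return the name of the logical section that contains `code`."""
    for name, ranges in UZT_SECTIONS:
        for low, high in ranges:
            if low <= code <= high:
                return name
    raise ValueError(f"code {code} is outside the 0-255 table")


print(uzt_section(48))   # digits
print(uzt_section(100))  # alphabet
```

Under this reading the ranges cover every value from 0 to 255 exactly once, which is a quick consistency check on the slide.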
Urdu Collation Sequence • How do the following figure in? • Basic Letters • Other Letters • Basic Aerab • Other Aerab • Others • Arguments should be consistent and simple
Character vs. Phoneme • Character = written content = letters • Phoneme = linguistic content • In the word “phone”: 5 characters = p h o n e • 3 phonemes = f o n
Urdu Collating Sequence: Letters • What is the status and sequence of following characters? • ا آ • أٶ • ن ں • ہ ھ • ةہ ت • ی ے
ا آ Variation • FLJ: آب (= ااب), آپ (= ااپ), اب, ایوان • FT, JUL, UL: اب, ایوان, آب (= ااب), آپ (= ااپ) • STCD, UHE, FA, NL: ا, آب, آپ, اب, ایوان • آ = ا ا • stylistic variation of ا ا • adds a character to single alif • not a character in the pure sense
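Treating آ as ا ا amounts to an expansion step before comparison. The sketch below is a hedged illustration in Python: `expand_alif_madd` and `sort_words` are hypothetical names, and plain Unicode code point order stands in for a real Urdu letter order, which happens to be enough for these five words.

```python
ALIF = "\u0627"       # ا
ALIF_MADD = "\u0622"  # آ


def expand_alif_madd(word):
    """Pre-processing step: rewrite آ as ا ا before comparing."""
    return word.replace(ALIF_MADD, ALIF + ALIF)


def sort_words(words):
    # Assumption: code point order is used as a stand-in for the
    # Urdu alphabet order; it is not a full collation table.
    return sorted(words, key=expand_alif_madd)


words = [
    "\u0627\u06CC\u0648\u0627\u0646",  # ایوان
    "\u0622\u067E",                    # آپ
    "\u0627\u0628",                    # اب
    "\u0627",                          # ا
    "\u0622\u0628",                    # آب
]
print(sort_words(words))
```

With the expansion applied, the words come out in the order ا, آب, آپ, اب, ایوان, matching the STCD, UHE, FA, NL column above.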
أ ٶ Status • Not a character in ANY dictionary, including dictionaries by the National Language Authority and the Urdu Dictionary Board • Has the same bearing on collation sequences as ء ا and ء و • Included in UZT 1.01 as per terms of reference given by NLA • May be made by combination of ء followed by ا or و • Should be taken out of UZT 1.01 in its next version
ن ں Variation • FLJ, FT, STCD, NL, FA, UHE: ماں, مان • JUL, UL: مان, ماں • ں is a vowel modifier which nasalizes the vowel but DOES NOT add any “phonemic content” • not a phoneme • is a character • does not represent any other character or combination • written adjacent to ن • lighter goes up! • would come before ن • ما = C V • ماں = C V • مان = C V C
ہ ھ Variation • FLJ, UHE, FA, NL (بھ not a character; ہ then ھ): باپ, بھابی, بہن, بہنگی, بھنگی, بیٹا • STCD (بھ not a character; ھ then ہ): باپ, بھابی, بہن, بھنگی, بہنگی, بیٹا • FT, JUL, UL (بھ a character): باپ, بہن, بہنگی, بیٹا, بھابی, بھنگی
ہ ھ Variation • Just as ں is a vowel modifier, ھ is a consonant modifier and DOES NOT add any “phonemic content” • as with ں, ھ is not a phoneme • written adjacent to ہ • lighter goes up! • would come before ہ • ب = C • بھ = C • بہ = C V C
بھ، پھ، ۔۔۔ Status as “Character” • The Urdu Dictionary Board and National Language Authority assert that these are phonemes, and therefore that the character combination should be made a character • If character combinations which are phonemes are to be promoted to characters, then for consistency the following combinations should also be made characters: یں، وں، اں • However, it is common in languages for character combinations to represent phonemes: ph → f (in English), so پ ھ → پھ (in Urdu) • ھ may remain a character like ں, even if it is not a phoneme • بھ، پھ، ۔۔۔ are not characters but character combinations
ة Status as “Character” • Not a character in ANY dictionary, including dictionaries by the National Language Authority and the Urdu Dictionary Board • Stylistic variation of ت (e.g. STCD, NL, …): زکوة، زکوت • Not a character
ی ے Variation • FLJ, FT, JUL, UL, NL: بی, بی بی, بے, بیابان • STCD, UHE, FA: بی, بے, بیابان, بیبی • Middle ے or ی predicament • بیکار = بے کار • ٹیلیوژن = ٹیلی وژن
ی ے Variation • Like ا، و، ی the character ے is a vowel (phoneme) • unlike ں, ے is not a vowel modifier • ے is different from ں because ے replaces ی (بی → بے), whereas ں adds onto ا (ما → ماں) • placed at the end of the alphabet (based on traditional collation) • Collated as “heavier” than ی at ligature endings but “equal to” ی ligature-medially
Role of Aerab in Sorting • Aerab are ignored in the first (primary) pass of sorting an Urdu string • only characters are considered • بِہار (= بِ ہار) • بَہانہ (= بَ ہانہ) • بِہائ (= بِ ہاءی) • However, aerab are relevant in the second pass, when the first pass gives an exact match • بَن بِن بُن • سَن سِن سُن
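The two passes described above can be sketched as a two-part sort key. This is a minimal Python illustration, not a definitive implementation: the weights in `AERAB_WEIGHT` (zabar < zer < pesh) are an assumption taken from the example order on the slide, the function name `sort_key` is hypothetical, and letters are compared by Unicode code point only as a stand-in for a real Urdu letter order.

```python
# Assumed secondary weights: zabar < zer < pesh.
AERAB_WEIGHT = {
    "\u064E": 1,  # zabar
    "\u0650": 2,  # zer
    "\u064F": 3,  # pesh
}


def sort_key(word):
    """Return (primary, secondary): letters first, aerab as tie-breaker."""
    letters = []  # primary level: characters only, aerab dropped
    marks = []    # secondary level: aerab weights attached to each letter
    for ch in word:
        if ch in AERAB_WEIGHT:
            if marks:  # attach the mark to the preceding letter
                marks[-1] = marks[-1] + (AERAB_WEIGHT[ch],)
        else:
            letters.append(ch)
            marks.append(())
    return (tuple(letters), tuple(marks))


zabar_word = "\u0628\u064E\u0646"  # بَن
zer_word = "\u0628\u0650\u0646"    # بِن
pesh_word = "\u0628\u064F\u0646"   # بُن
print(sorted([pesh_word, zer_word, zabar_word], key=sort_key))
```

Python compares the tuples element by element, so the secondary part is consulted only when the primary parts are identical. Sorting the raw strings by code point instead would place pesh (U+064F) before zer (U+0650), which is why explicit weights are used here.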
Vocalic Aerab - Zabar, Zer, Pesh • بَہَر • بَہِر • بَہُر • بَہ۫ر • بُہ۫ر (UL) • بَیر • بِیَر • بِیر • بیر • FT, FLJ, JUL, UL • بَن • بِن • بُُن • بِیر • بیر • STCD • بَن • بُُن • بِن • سَن • سِن • سُُن
Vocalic Aerab – Khari Zabar • No effect at primary level sorting • اعلا مَوسی • اعلان مُوسی • اعلم • اعلی • No minimal pairs found on secondary level so involvement could not be determined
Consonantal Aerab - Tashdeed • Ignored at primary level (FT, UL, NL, …) • Affects secondary level sorting • “heavier” • lighter goes up • بدی • بدّی • بدّیا • بَرانا • برّانا • بَرایا • پَتا • پَتّا • پِتا
Ligature-Break (Half Space) • Hex 41 (UZT) and Hex 200B (Unicode) • Ignored at primary level and secondary level • ٹیلیوژن ، ٹیلی وژن • ٹیلیفون ، ٹیلی فون • بے کار ، بیکار • But given each pair, which word comes first? • Tertiary level decision • lighter goes up! • single word without break comes first?
Word-Break (Normal Space) • Ignored at primary level ? • American Heritage Dictionary (2nd Collegiate ed.) • black art • black bear • blackberry • black box • blacken • Black Death • black gold • Space ignored at primary level
Word-Break (Normal Space) • FLJ, UL • بانگ • بانگِ درا • بانگ دینا • If sorting is done at word break then 1,3,2 • So sorting ignores word break
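The effect of ignoring the word break can be checked with a short Python sketch. It assumes a hypothetical helper `primary_key` that drops spaces and combining marks (aerab) and then compares by code point, which is only a stand-in for a real letter order. The English list is a subset of the dictionary entries quoted earlier.

```python
import unicodedata


def primary_key(entry):
    """Drop spaces and combining marks, fold case; compare what is left."""
    return "".join(
        ch
        for ch in entry.casefold()
        if not ch.isspace() and not unicodedata.combining(ch)
    )


english = ["black gold", "blackberry", "black box", "black art", "black bear"]
print(sorted(english, key=primary_key))

w1 = "\u0628\u0627\u0646\u06AF"                          # بانگ
w2 = "\u0628\u0627\u0646\u06AF\u0650 \u062F\u0631\u0627"  # بانگِ درا
w3 = "\u0628\u0627\u0646\u06AF \u062F\u06CC\u0646\u0627"  # بانگ دینا
print(sorted([w3, w2, w1], key=primary_key))  # space ignored
print(sorted([w3, w2, w1]))                   # raw comparison, space kept
```

With spaces ignored, "blackberry" falls between "black bear" and "black box", and the Urdu entries keep the dictionary order 1, 2, 3. The raw comparison, where the space takes part, gives 1, 3, 2 for the Urdu entries.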
Conclusions: Urdu Character Set • آابپتٹثجچحخ دڈذرڑزژ سش صض طظعغ فقکگ لمں نوھ ہءیے • Two levels of characters • Core Characters • Non-core characters
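The proposed sequence can be used directly as a primary ordering. The Python sketch below assumes that every character in the list gets its own primary weight by position; the slides leave open whether ں and ھ are full primary letters or only tie-breakers, so this is one possible reading. `ALPHABET`, `RANK` and `primary_key` are illustrative names.

```python
# The 40 characters of the proposed sequence, written as escapes so the
# exact code points are unambiguous.
ALPHABET = (
    "\u0622\u0627\u0628\u067E\u062A\u0679\u062B\u062C\u0686\u062D\u062E"  # آ to خ
    "\u062F\u0688\u0630\u0631\u0691\u0632\u0698"                          # د to ژ
    "\u0633\u0634\u0635\u0636\u0637\u0638\u0639\u063A"                    # س to غ
    "\u0641\u0642\u06A9\u06AF\u0644\u0645\u06BA\u0646"                    # ف to ن
    "\u0648\u06BE\u06C1\u0621\u06CC\u06D2"                                # و to ے
)

RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def primary_key(word):
    """Map each listed character to its position; skip everything else."""
    return tuple(RANK[ch] for ch in word if ch in RANK)


maan = "\u0645\u0627\u0646"         # مان
maan_ghunna = "\u0645\u0627\u06BA"  # ماں
print(sorted([maan, maan_ghunna], key=primary_key))
```

Because ں precedes ن in the sequence, ماں sorts before مان, whereas raw code point order would reverse the pair. Characters outside the list, such as aerab and spaces, are simply skipped at this level.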
Conclusions: Urdu Collating Sequence • Multi-level Complex Problem • Pre-processing: contractions (ب ھ → بھ), insert un-written aerab • Primary Level: characters • Secondary Level: aerab, others (?) • Tertiary Level: ligature break, others (?) • Ignorable: space, secondary aerab (?), symbols (?), others (?)
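The levels listed above can be combined into a single multi-part sort key. This Python sketch is illustrative only and rests on several assumptions: the aerab weights (with tashdeed heaviest) are invented for the example, letters are compared by code point rather than by an Urdu letter table, the ligature break is taken to be U+200B, and the pre-processing step (contractions, inserting un-written aerab) is omitted. `collation_key` is a hypothetical name.

```python
AERAB = {
    "\u064E": 1,  # zabar
    "\u0650": 2,  # zer
    "\u064F": 3,  # pesh
    "\u0651": 4,  # tashdeed, assumed "heavier" than the vowels
}
LIGATURE_BREAK = "\u200B"


def collation_key(text):
    """Build a (primary, secondary, tertiary) key for one string."""
    primary, secondary, tertiary = [], [], []
    for ch in text:
        if ch == " ":
            continue  # ignorable: word break
        if ch == LIGATURE_BREAK:
            if tertiary:
                tertiary[-1] = 1  # tertiary: a break follows this letter
            continue
        if ch in AERAB:
            if secondary:
                secondary[-1] = secondary[-1] + (AERAB[ch],)
            continue
        primary.append(ord(ch))  # stand-in for a real letter weight
        secondary.append(())
        tertiary.append(0)
    return (tuple(primary), tuple(secondary), tuple(tertiary))


badi = "\u0628\u062F\u06CC"                # بدی
baddi = "\u0628\u062F\u0651\u06CC"         # بدّی
baddiya = "\u0628\u062F\u0651\u06CC\u0627"  # بدّیا
print(sorted([baddiya, baddi, badi], key=collation_key))

joined = "\u0679\u06CC\u0644\u06CC\u0648\u0698\u0646"        # ٹیلیوژن
broken = "\u0679\u06CC\u0644\u06CC\u200B\u0648\u0698\u0646"  # ٹیلی وژن
print(collation_key(joined) < collation_key(broken))
```

Each level is consulted only when all earlier levels tie, so the unbroken form sorts first here only because its primary and secondary parts match the broken form. That choice follows the tentative "single word without break comes first?" suggestion and is not a settled rule. Without the step that inserts un-written aerab, pairs such as a marked and an unmarked spelling of the same word may not order as the dictionaries do.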
What Needs to be Done for Urdu • Debate and standardize • Character Set • Develop computational model to implement sorting • Culturally acceptable Collation Element Table to generate sort keys • Standardize and publicize this computational model for Urdu sorting
What Needs to be Done • Take national standards to International forums: Unicode/ISO • Complete similar work for all other local languages of Pakistan • Character set • Script • Collating Sequence
Relevant National and Provincial Government Organizations • National • Urdu and Regional Languages’ Software Development Forum (URLSDF), Ministry of Science and Technology (MoST), Islamabad • National Language Authority (NLA), Islamabad (Urdu) • Pakistan Standards and Quality Control Authority (PSQCA), Karachi • Provincial • Balochi Academy, Quetta • Pashto Academy, Peshawar • Punjabi Adabi Board, Lahore • Sindhi Language Authority (SLA), Hyderabad