
پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل

Collation Sequences and Related Issues for Pakistani Languages. Sarmad Hussain (سرمد حسین), Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences.


Presentation Transcript


  1. پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل • Collation Sequences and Related Issues for Pakistani Languages • سرمد حسین (Sarmad Hussain) • Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences

  2. Purpose of Presentation • Briefly discuss character sets • Discuss the Urdu collating sequence • Propose a possible Urdu collation sequence • Give an overview of collation in other languages of Pakistan

  3. اردو (Urdu)

  4. بلوچی (Balochi)

  5. پشتو (Pashto)

  6. پنجابی (Punjabi)

  7. سندھی (Sindhi)

  8. Sources • Urdu: Akhbar-e-Urdu (Special Supplement on Urdu Software; Jan-Feb. 2002), National Language Authority, Islamabad • Balochi: Fax communication (Sept. 2002), Balochi Academy, Quetta • Pashto: Fax communication (Sept. 2002), Pashto Academy, Peshawar • Punjabi: Punjabi Qaida (Experimental), Punjabi Adabi Board, Lahore • Sindhi: Sindhi Boli (July-Dec. 2001) and SLA Letter Circulation of Sindhi Collation (June 2002), Sindhi Language Authority, Hyderabad

  9. اردو (Urdu) • آابپتٹثجچحخ دڈذرڑزژ سشصض طظعغ فقکگ لم نوہءیے • Source: اردو قائدہ (Urdu Qaida), فیروز سنز (Ferozsons), لاہور (Lahore)

  10. Urdu Alphabet: State of Affairs • Are the following letters of Urdu? • آ • أٶ • بھ پھ تھ ۔ ۔ ۔ ... • ں • ة • لھ مھ نھ ںھ وھ • If yes, where are they placed in the alphabet?

  11. Sources • Data from eight dictionaries of Urdu • فیروزاللغات جامع (Feroz-ul-Lughat Jame), Ferozsons, Lahore (FLJ) • Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD) • فرہنگِ تلفظ (Farhang-e-Talaffuz), National Language Authority, Islamabad (FT) • جدید اردو لغت (Jadeed Urdu Lughat), National Language Authority, Islamabad (JUL) • اردو لغت (Urdu Lughat), Urdu Dictionary Board, Karachi (UL) • A Dictionary of Urdu, Classical Hindi and English, Crosby Lockwood and Son, London (1911) (UHE) • فرہنگ آصفیہ (Farhang-e-Asifia), Delhi (1918) (FA) • نوراللغات (Noor-ul-Lughat), Sang-e-Meel, Lahore (NL)

  12. Urdu Alphabet: State of Affairs • FT, JUL, UL: اآببھپپھتتھٹٹھثججھچچھحخ د دھڈڈھذررھڑڑھزژ سشصضطظعغ ف قککھگگھللھممھںںھننھوہءیے • FLJ, NL: آابپتٹثجچحخ دڈذرڑزژ س شصضطظعغ فقکگلمںنو ہ ھءیے • UHE, FA, STCD: ابپتٹثجچحخ دڈذرڑزژ س شصضطظعغ فقکگلمنو ہ ھءیے

  13. Conclusions: Urdu Character Set • No general agreement on the Urdu character set among dictionary publishers • A standard character set has been defined by the National Language Authority and the Urdu Dictionary Board, but it is • not traditional • not well-publicized • not completely adopted • The Government of Pakistan (GoP) standard for computing, UZT 1.01, implements the NLA-defined character and symbol set • UZT 1.01 will soon be fully represented in Unicode/ISO/IEC 10646

  14. Character Set • Alphabet • Harakat (Aerab) • Other Symbols

  15. “Familiar” Harakaat (Aerab) • Jazm دْ • Zabar دَ • Zer دِ • Pesh دُ • Khari zabar دٰ • Khari zer دٖ • Ulta pesh دٗ • Do zabar دً • Do zer دٍ • Do pesh دٌ • Tashdeed دّ • Noon ghunna ن

  16. “Common” Other Symbols • Numbers: 0 ۰ 1 ١ 2 ٢ 3 ٣ 4 ۴ 5 ۵ 6 ٦ 7 ۷ 8 ٨ 9 ٩ • Punctuation: ؟ ؛ ٬ - • Other symbols: Honorifics

  17. Current GoP Standard: UZT 1.01

  18. Logical Sections of UZT 1.01 • Alphabet (80 – 122) • Aerab/diacritics/harakat (66 – 79, 123 – 126) • Other characters: punctuation and arithmetic symbols (32 – 47, 58 – 65); digits (48 – 57); special symbols (160 – 176, 192 – 199) • Miscellaneous: control characters (0 – 31, 127); reserved control space (128 – 159, 255); reserved expansion space (177 – 191, 200 – 207, 240 – 253); vendor area (208 – 239); toggle character (254)
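
The section boundaries on this slide amount to a lookup table over the 256 code positions. The following is a minimal sketch in Python, assuming the ranges are inclusive decimal values exactly as listed. The function name `uzt_section` and the short section labels are illustrative and are not part of UZT 1.01.

```python
# Inclusive decimal ranges as listed on the slide; labels are shortened here.
UZT_SECTIONS = [
    ((0, 31), "control"),
    ((32, 47), "punctuation/arithmetic"),
    ((48, 57), "digits"),
    ((58, 65), "punctuation/arithmetic"),
    ((66, 79), "aerab"),
    ((80, 122), "alphabet"),
    ((123, 126), "aerab"),
    ((127, 127), "control"),
    ((128, 159), "reserved control"),
    ((160, 176), "special symbols"),
    ((177, 191), "reserved expansion"),
    ((192, 199), "special symbols"),
    ((200, 207), "reserved expansion"),
    ((208, 239), "vendor"),
    ((240, 253), "reserved expansion"),
    ((254, 254), "toggle"),
    ((255, 255), "reserved control"),
]


def uzt_section(code: int) -> str:
    """Return the logical section label for a UZT 1.01 code position."""
    for (low, high), label in UZT_SECTIONS:
        if low <= code <= high:
            return label
    raise ValueError(f"not a single-byte code position: {code}")


print(uzt_section(80))    # first position of the alphabet range
print(uzt_section(0x41))  # hex 41 is decimal 65
```

A collation routine could use such a lookup to decide, per code position, whether a character contributes a letter weight, an aerab weight, or nothing.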

  19. Urdu Collation Sequence • How do the following figure in? • Basic Letters • Other Letters • Basic Aerab • Other Aerab • Others • Arguments should be consistent and simple

  20. Character vs. Phoneme • Character = written content = letters • Phoneme = linguistic content • In the word “phone”: 5 characters = p h o n e • 3 phonemes = f o n

  21. Urdu Collating Sequence: Letters • What is the status and sequence of following characters? • ا آ • أٶ • ن ں • ہ ھ • ةہ ت • ی ے

  22. ا آ Variation • FLJ: آب= ااب آپ= ااپ اب ایوان • FT, JUL, UL: اب ایوان آب= ااب آپ= ااپ • STCD, UHE, FA, NL: ا آب آپ اب ایوان • آ = ا ا • stylistic variation of ا ا • adds a character to single alif • not a character in the pure sense

  23. أٶ Status • Not a character in ANY dictionary, including those by the National Language Authority and the Urdu Dictionary Board • Has the same bearing on collation sequences as ء ا and ء و • Included in UZT 1.01 as per terms of reference given by NLA • May be made by a combination of ء followed by ا ، و • Should be taken out of UZT 1.01 in its next version

  24. ن ں Variation • FLJ, FT, STCD, NL, FA, UHE: ماں مان • JUL, UL: مان ماں • ں is a vowel modifier which nasalizes the vowel but DOES NOT add any “phonemic content” • not a phoneme • is a character • does not represent any other character or combination • written adjacent to ن • lighter goes up! • would come before ن • ما = C V • ماں = C V • مان = C V C

  25. ہ ھ Variation • FLJ, UHE, FA, NL (بھ not a character; ہ then ھ): باپ بھابی بہن بہنگی بھنگی بیٹا • STCD (بھ not a character; ھ then ہ): باپ بھابی بہن بھنگی بہنگی بیٹا • FT, JUL, UL (بھ a character): باپ بہن بہنگی بیٹا بھابی بھنگی
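
The three dictionary orderings on this slide can be reproduced with two small sort-key functions. The following is a sketch only, under these assumptions: the words are encoded with U+06BE for ھ and U+06C1 for ہ, the letter ranks cover only the letters used in the slide's six words, and the function names are illustrative. It is one possible reading of the orderings and not the dictionaries' own stated method.

```python
# Subset of the letter order (ا ب پ ٹ گ ن ھ ہ ی), written as escapes so the
# code points are unambiguous. Letters outside this subset raise KeyError.
ORDER = "\u0627\u0628\u067E\u0679\u06AF\u0646\u06BE\u06C1\u06CC"
RANK = {ch: i for i, ch in enumerate(ORDER)}
DO_CHASHMI_HE = "\u06BE"  # ھ
GOL_HE = "\u06C1"         # ہ


def key_he_equal(word, he_first=True):
    """ہ and ھ share one primary weight; a second level breaks exact ties."""
    primary = tuple(
        RANK[GOL_HE] if ch == DO_CHASHMI_HE else RANK[ch] for ch in word
    )
    heavy = DO_CHASHMI_HE if he_first else GOL_HE
    tie_break = tuple(1 if ch == heavy else 0 for ch in word)
    return primary, tie_break


def key_aspirate_letter(word):
    """Consonant + ھ is one unit ranked just after the plain consonant."""
    units = []
    for ch in word:
        if ch == DO_CHASHMI_HE and units:
            rank, _ = units[-1]
            units[-1] = (rank, 1)  # mark the previous consonant as aspirated
        else:
            units.append((RANK[ch], 0))
    return tuple(units)


BAAP = "\u0628\u0627\u067E"                 # باپ
BHABI = "\u0628\u06BE\u0627\u0628\u06CC"    # بھابی
BEHEN = "\u0628\u06C1\u0646"                # بہن
BAHANGI = "\u0628\u06C1\u0646\u06AF\u06CC"  # بہنگی
BHANGI = "\u0628\u06BE\u0646\u06AF\u06CC"   # بھنگی
BETA = "\u0628\u06CC\u0679\u0627"           # بیٹا
WORDS = [BETA, BHANGI, BAHANGI, BEHEN, BHABI, BAAP]

order_he_then_dch = sorted(WORDS, key=key_he_equal)
order_dch_then_he = sorted(WORDS, key=lambda w: key_he_equal(w, he_first=False))
order_bh_letter = sorted(WORDS, key=key_aspirate_letter)
```

With these keys, the only difference between the first two orderings is the tie-break between بہنگی and بھنگی, while the third moves every بھ word after all plain ب words.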

  26. ہ ھ Variation • Just as ں is a vowel modifier, ھ is a consonant modifier and DOES NOT add any “phonemic content” • as with ں, ھ is not a phoneme • written adjacent to ہ • lighter goes up! • would come before ہ • ب = C • بھ = C • بہ = C V C

  27. بھ، پھ،۔۔۔ Status as “Character” • The Urdu Dictionary Board and the National Language Authority assert that these are phonemes and that the character combination should therefore be made a character • If character combinations which are phonemes are to be promoted to characters, then to be consistent the following combinations should also be made characters: یں، وں ، اں • However, it is common in languages for character combinations to represent phonemes: ph → f (in English), so پ ھ → پھ (in Urdu) • ھ may remain a character like ں, even if it is not a phoneme • بھ ، پھ، ۔۔۔ are not characters but character combinations

  28. ة Status as “Character” • Not a character in ANY dictionary, including those by the National Language Authority and the Urdu Dictionary Board • Stylistic variation of ت (e.g. STCD, NL, …): زکوة زکوت • Not a character

  29. ی ے Variation • FLJ, FT, JUL, UL, NL: بی بی بی بے بیابان • STCD, UHE, FA: بی بے بیابان بیبی • Middle ے or ی predicament • بیکار = بے کار • ٹیلیوژن = ٹیلی وژن

  30. ی ے Variation • Like ا،و،ی the character ے is a vowel (phoneme) • Unlike ں, ے is not a vowel modifier • ے differs from ں because ے replaces ی (بے ، بی) whereas ں adds onto ا (ما ، ماں) • Placed at the end of the alphabet (based on traditional collation) • Collated as “heavier” than ی at ligature endings but “equal to” ی ligature-medially

  31. Role of Aerab in Sorting • Aerab are ignored in the first (primary) pass of sorting an Urdu string • only characters are considered • بِہار (= بِ ہار) • بَہانہ (= بَ ہانہ) • بِہائ (= بِ ہاءی) • However, aerab are relevant in the second pass, when the first pass gives an exact match • بَن بِن بُن • سَن سِن سُن
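
The two passes described on this slide can be folded into a single two-part sort key, because tuples compare element by element. A minimal sketch follows, with several assumptions: only zabar, zer and pesh are treated as aerab, their relative weights (zabar, then zer, then pesh) are illustrative, and raw code points stand in for the real primary letter weights.

```python
ZABAR, ZER, PESH = "\u064E", "\u0650", "\u064F"
AERAB_WEIGHT = {ZABAR: 1, ZER: 2, PESH: 3}  # assumed secondary weights


def sort_key(word):
    # Primary pass: letters only, aerab skipped. Code points are a placeholder
    # for a real letter-weight table.
    primary = tuple(ord(ch) for ch in word if ch not in AERAB_WEIGHT)
    # Secondary pass: only consulted when the primary parts are identical.
    secondary = tuple(AERAB_WEIGHT[ch] for ch in word if ch in AERAB_WEIGHT)
    return primary, secondary


B, N, S = "\u0628", "\u0646", "\u0633"  # ب ن س
words = [B + PESH + N, S + ZABAR + N, B + ZABAR + N, B + ZER + N]
ordered = sorted(words, key=sort_key)
```

Here the three ب words share a primary key and are separated only by their aerab, while the س word sorts last on its letters regardless of its aerab.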

  32. Vocalic Aerab - Zabar, Zer, Pesh • بَہَر • بَہِر • بَہُر • بَہ۫ر • بُہ۫ر (UL) • بَیر • بِیَر • بِیر • بیر • FT, FLJ, JUL, UL • بَن • بِن • بُن • بِیر • بیر • STCD • بَن • بُن • بِن • سَن • سِن • سُن

  33. Vocalic Aerab – Khari Zabar • No effect at primary level sorting • اعلا مَوسی • اعلان مُوسی • اعلم • اعلی • No minimal pairs found on secondary level so involvement could not be determined

  34. Consonantal Aerab - Tashdeed • Ignored at primary level (FT, UL, NL, …) • Affects secondary level sorting • “heavier” • lighter goes up • بدی • بدّی • بدّیا • بَرانا • برّانا • بَرایا • پَتا • پَتّا • پِتا
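
One way to make tashdeed "heavier" at the secondary level is to attach each mark to the letter it follows and compare those per-letter marks from left to right. The sketch below assumes illustrative weights (zabar 1, zer 2, pesh 3, tashdeed 4), uses code points as placeholder primary weights, and drops any mark that has no preceding letter. The function name `split_levels` is hypothetical.

```python
ZABAR, ZER, PESH, TASHDEED = "\u064E", "\u0650", "\u064F", "\u0651"
MARK_WEIGHT = {ZABAR: 1, ZER: 2, PESH: 3, TASHDEED: 4}  # assumed weights


def split_levels(word):
    """Return (primary, secondary): letters, then marks aligned to letters."""
    letters, marks = [], []
    for ch in word:
        if ch in MARK_WEIGHT:
            if marks:  # a mark before any letter is dropped in this sketch
                marks[-1].append(MARK_WEIGHT[ch])
        else:
            letters.append(ord(ch))  # placeholder primary weight
            marks.append([])
    # Sorting each letter's marks makes the key independent of typing order.
    return tuple(letters), tuple(tuple(sorted(m)) for m in marks)


P, T, A = "\u067E", "\u062A", "\u0627"  # پ ت ا
B, D, Y = "\u0628", "\u062F", "\u06CC"  # ب د ی
pata = P + ZABAR + T + A                # پَتا
patta = P + ZABAR + T + TASHDEED + A    # پَتّا
pita = P + ZER + T + A                  # پِتا
ordered = sorted([pita, patta, pata], key=split_levels)
```

Because an unmarked letter contributes an empty tuple, a word without tashdeed sorts ahead of the same word with it, which matches "lighter goes up".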

  35. Ligature-Break (Half Space) • Hex 41 (UZT) and Hex 200B (Unicode) • Ignored at primary level and secondary level • ٹیلیوژن ، ٹیلی وژن • ٹیلیفون ، ٹیلی فون • بے کار ، بیکار • But given each pair, which word comes first? • Tertiary level decision • lighter goes up! • single word without break comes first?
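
A tertiary-level decision of this kind can be sketched by removing the ligature break from the primary key and counting it in a later key component. The sketch assumes the answer to the slide's open question is that the unbroken form comes first. It uses U+200B as the break character because that is the code point named on the slide, leaves out the secondary (aerab) level, and uses code points as placeholder letter weights.

```python
LIGATURE_BREAK = "\u200B"  # code point named on the slide


def sort_key(word):
    # Primary: letters only, with the ligature break removed.
    primary = tuple(ord(ch) for ch in word if ch != LIGATURE_BREAK)
    # Tertiary: fewer breaks is "lighter", so the unbroken form sorts first.
    tertiary = word.count(LIGATURE_BREAK)
    return primary, tertiary


TELI = "\u0679\u06CC\u0644\u06CC"  # ٹیلی
FON = "\u0641\u0648\u0646"         # فون
VIZHAN = "\u0648\u0698\u0646"      # وژن
joined_fon = TELI + FON
broken_fon = TELI + LIGATURE_BREAK + FON
joined_vizhan = TELI + VIZHAN
ordered = sorted([joined_vizhan, broken_fon, joined_fon], key=sort_key)
```

The break only decides between two spellings of the same word. A different word still sorts on its letters, so the broken form of ٹیلی فون stays ahead of ٹیلیوژن.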

  36. Word-Break (Normal Space) • Ignored at primary level ? • American Heritage Dictionary (2nd Collegiate ed.) • black art • black bear • blackberry • black box • blacken • Black Death • black gold • Space ignored at primary level

  37. Word-Break (Normal Space) • FLJ, UL • بانگ • بانگِ درا • بانگ دینا • If sorting were done word by word, the order would be 1, 3, 2 • So sorting ignores the word break
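
The contrast on this slide can be checked with two alternative sort keys: one that compares entries word by word and one that ignores the space. This is a sketch under stated assumptions: each word gets a two-level key (letters, then aerab), the aerab weights are illustrative, and code points stand in for real letter weights.

```python
ZABAR, ZER, PESH = "\u064E", "\u0650", "\u064F"
AERAB_WEIGHT = {ZABAR: 1, ZER: 2, PESH: 3}  # assumed secondary weights


def two_level(text):
    primary = tuple(ord(ch) for ch in text if ch not in AERAB_WEIGHT)
    secondary = tuple(AERAB_WEIGHT[ch] for ch in text if ch in AERAB_WEIGHT)
    return primary, secondary


def key_word_by_word(entry):
    # Each word is fully compared (letters, then aerab) before the next word.
    return tuple(two_level(word) for word in entry.split(" "))


def key_ignoring_space(entry):
    # The space is dropped, so the whole entry is compared as one string.
    return two_level(entry.replace(" ", ""))


BAANG = "\u0628\u0627\u0646\u06AF"                   # بانگ
entry1 = BAANG                                       # بانگ
entry2 = BAANG + ZER + " " + "\u062F\u0631\u0627"    # بانگِ درا
entry3 = BAANG + " " + "\u062F\u06CC\u0646\u0627"    # بانگ دینا
entries = [entry3, entry2, entry1]

by_word = sorted(entries, key=key_word_by_word)
no_space = sorted(entries, key=key_ignoring_space)
```

Word-by-word comparison lets the zer on the first word of entry 2 decide before the second words are examined, giving 1, 3, 2. Ignoring the space compares درا with دینا first and gives the dictionary order 1, 2, 3.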

  38. Conclusions: Urdu Character Set • آابپتٹثجچحخ دڈذرڑزژ سش صض طظعغ فقکگ لمں نوھ ہءیے • Two levels of characters • Core Characters • Non-core characters
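
A letter sequence like the one on this slide can serve directly as a primary-level weight table. The following sketch assumes the usual Urdu code points (for example U+06A9 for ک, U+06C1 for ہ and U+06CC for ی) and ranks any unlisted character after all listed letters, which is a choice made for this example only. Aerab and spaces would need to be removed before such a key is applied.

```python
# Letter order copied from the slide, written as escapes so the code points
# are unambiguous (glyphs shown in the trailing comments).
PROPOSED_ORDER = (
    "\u0622\u0627\u0628\u067E\u062A\u0679\u062B\u062C\u0686\u062D\u062E"  # آ ا ب پ ت ٹ ث ج چ ح خ
    "\u062F\u0688\u0630\u0631\u0691\u0632\u0698"  # د ڈ ذ ر ڑ ز ژ
    "\u0633\u0634\u0635\u0636\u0637\u0638\u0639\u063A"  # س ش ص ض ط ظ ع غ
    "\u0641\u0642\u06A9\u06AF\u0644\u0645\u06BA\u0646"  # ف ق ک گ ل م ں ن
    "\u0648\u06BE\u06C1\u0621\u06CC\u06D2"  # و ھ ہ ء ی ے
)
RANK = {ch: i for i, ch in enumerate(PROPOSED_ORDER)}


def primary_key(word):
    # Unlisted characters sort after every listed letter (sketch assumption).
    return tuple(RANK.get(ch, len(RANK) + ord(ch)) for ch in word)


MAAN_NASAL = "\u0645\u0627\u06BA"  # ماں
MAAN = "\u0645\u0627\u0646"        # مان
by_proposed_order = sorted([MAAN, MAAN_NASAL], key=primary_key)
by_code_point = sorted([MAAN, MAAN_NASAL])
```

The two results differ: the proposed order places ں before ن, whereas plain code-point order places ن (U+0646) before ں (U+06BA). This is why a custom weight table is needed at all.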

  39. Conclusions: Urdu Collating Sequence • Multi-level complex problem • Pre-processing: contractions (ب ھ as بھ), insert un-written aerab • Primary level: characters • Secondary level: aerab, others (?) • Tertiary level: ligature break, others (?) • Ignorable: space, secondary aerab (?), symbols (?), others (?)

  40. What Needs to be Done for Urdu • Debate and standardize • Character Set • Develop computational model to implement sorting • Culturally acceptable Collation Element Table to generate sort keys • Standardize and publicize this computational model for Urdu sorting

  41. What Needs to be Done • Take national standards to International forums: Unicode/ISO • Complete similar work for all other local languages of Pakistan • Character set • Script • Collating Sequence

  42. Relevant National and Provincial Government Organizations • National • Urdu and Regional Languages’ Software Development Forum (URLSDF), Ministry of Science and Technology (MoST), Islamabad • National Language Authority (NLA), Islamabad (Urdu) • Pakistan Standards and Quality Control Authority (PSQCA), Karachi • Provincial • Balochi Academy, Quetta • Pashto Academy, Peshawar • Punjabi Adabi Board, Lahore • Sindhi Language Authority (SLA), Hyderabad

  43. شکریہ (Thank you)
