
Urdu Character Set and Collating Sequence

Sarmad Hussain. مرکزتحقیقاتِ اردو Center for Research in Urdu Language Processing, FAST National University of Computer and Emerging Sciences.


Presentation Transcript


  1. Urdu Character Set and Collating Sequence Sarmad Hussain مرکزتحقیقاتِاردو Center for Research in Urdu Language Processing FAST National University of Computer and Emerging Sciences

  2. Purpose of Presentation • Indicate the “state of affairs” • Character set • Collating sequence • Show what has been done regarding standardization • Identify what needs to be done

  3. Sources • Data from four dictionaries of Urdu • فیروزاللغات جامع، فیروز سنز، لاہور (FLJ) • Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD) • فرہنگِ تلفظ، مقتدرہ قومی زبان، اسلام آباد (FT) • جدید اردو لغت، مقتدرہ قومی زبان، اسلام آباد (JUL)

  4. Character Set • Alphabet • Harakat (Aerab) • Other Symbols

  5. “Typical” Alphabet • آابپتٹثجچحخ دڈذرڑزژ سشصض طظعغ فقکگ لم نوہءیے • Source: اردو قاءدہ، فیروز سنز، لاہور

  6. “Familiar” Harakaat (Aerab) • Jazm دْ • Zabar دَ • Zer دِ • Pesh دُ • Khari zabar دٰ • Khari zer دٖ • Ulta pesh دٗ • Do zabar دً • Do zer دٍ • Do pesh دٌ • Tashdeed دّ • Noon ghunna ن٘

  7. “Common” Other Symbols • Numbers: 0 ۰, 1 ۱, 2 ۲, 3 ۳, 4 ۴, 5 ۵, 6 ۶, 7 ۷, 8 ۸, 9 ۹ • Punctuation: ؟ ؛ ٬ - • Other Symbols: Honorifics ס

  8. Urdu Alphabet: State of Affairs • FT, JUL • اآببھپپھتتھٹٹھثججھچچھحخ ددھڈڈھذررھڑڑھزژ سشصضطظعغ فقککھگگھللھممھںںھننھووھہءیے • FLJ, STCD • آابپتٹثجچحخ دڈذرڑزژ سشصضطظعغ فقکگلمںنو ہھءیے

  9. Current Government of Pakistan (GoP) Standard: UZT 1.01

  10. Logical Sections of UZT 1.01 • Alphabet (80 – 122) • Aerab/diacritics/harakat (66 – 79, 123 – 126) • Other characters • Punctuation and arithmetic symbols (32 – 47, 58 – 65) • Digits (48 – 57) • Special symbols (160 – 176, 192 – 199) • Miscellaneous • Control characters (0 – 31, 127) • Reserved control space (128 – 159, 255) • Reserved expansion space (177 – 191, 200 – 207, 240 – 253) • Vendor area (208 – 239) • Toggle character (254)
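The code ranges on this slide partition the whole 8-bit space, which can be checked mechanically. The following is a minimal sketch, not part of the UZT 1.01 standard itself: the section names and the `uzt_section` helper are hypothetical labels chosen for illustration, and only the numeric ranges come from the slide.

```python
# Inclusive code ranges copied from the "Logical Sections of UZT 1.01" slide.
# The dictionary keys are illustrative names, not identifiers from the standard.
UZT_SECTIONS = {
    "alphabet": [(80, 122)],
    "aerab": [(66, 79), (123, 126)],
    "punctuation_and_arithmetic": [(32, 47), (58, 65)],
    "digits": [(48, 57)],
    "special_symbols": [(160, 176), (192, 199)],
    "control": [(0, 31), (127, 127)],
    "reserved_control": [(128, 159), (255, 255)],
    "reserved_expansion": [(177, 191), (200, 207), (240, 253)],
    "vendor_area": [(208, 239)],
    "toggle": [(254, 254)],
}


def uzt_section(code: int) -> str:
    """Return the logical section that a UZT 1.01 code value falls in."""
    if not 0 <= code <= 255:
        raise ValueError(f"UZT 1.01 is an 8-bit code; got {code}")
    for name, ranges in UZT_SECTIONS.items():
        for low, high in ranges:
            if low <= code <= high:
                return name
    raise ValueError(f"code {code} is not covered by any listed range")


if __name__ == "__main__":
    for code in (48, 70, 100, 254):
        print(code, uzt_section(code))
```

Calling `uzt_section` on every value from 0 to 255 succeeds, so the ranges as listed leave no code value unassigned.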

  11. Conclusions: Standard Urdu Character Set • No general agreement on Urdu Character Set by dictionary publishers • Standard Character Set defined by National Language Authority (NLA) • not well-publicized • not widely adopted • GoP Standard for Computing, UZT 1.01, implements the NLA-defined character and symbol set • Will soon be fully represented in Unicode/ISO 10646

  12. Urdu Collating Sequence: State of Affairs • FT, JUL • اآببھپپھتتھٹٹھثججھچچھحخ ددھڈڈھذررھڑڑھزژ سشصضطظعغ فقککھگگھللھممھںںھننھووھہءیے • FLJ • آابپتٹثجچحخ دڈذرڑزژ سشصضطظعغ فقکگلمںنوہھءیے • STCD • آ ابپتٹثجچحخ دڈذرڑزژ سشصضطظعغ فقکگلمنںوھ ہءیے

  13. ا آ Variation • STCD and FLJ: آب، آپ، اب، ایوان • FT and JUL: اب، ایوان، آب، آپ
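The ا / آ variation amounts to swapping the relative rank of two letters in the collating sequence. The sketch below assumes a hypothetical `make_key` helper in which only the listed letters get explicit ranks and every other letter falls back to Unicode code point order; that fallback is a stand-in for a full alphabet table, not a claim about any dictionary's actual ordering.

```python
import unicodedata

ALEF = "\u0627"        # ا
ALEF_MADDA = "\u0622"  # آ

# The four example words from the slide.
AAB, AAP, AB, AIWAN = "آب", "آپ", "اب", "ایوان"


def make_key(leading_letters):
    """Build a sort key in which `leading_letters` come first, in the given order.

    Letters not listed are ranked after them by code point (an assumption
    made only to keep the example short).
    """
    rank = {ch: i for i, ch in enumerate(leading_letters)}

    def key(word):
        word = unicodedata.normalize("NFC", word)  # compose ا + madda into آ
        return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in word]

    return key


stcd_flj_key = make_key(ALEF_MADDA + ALEF)  # آ before ا
ft_jul_key = make_key(ALEF + ALEF_MADDA)    # ا before آ

if __name__ == "__main__":
    words = [AB, AAP, AIWAN, AAB]
    print(sorted(words, key=stcd_flj_key))
    print(sorted(words, key=ft_jul_key))
```

Under these assumptions the first call reproduces the STCD and FLJ order and the second reproduces the FT and JUL order shown on the slide.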

  14. ن ں Variation • FLJ, FT & STCD: ماں، مان • JUL: مان، ماں

  15. ہھ Variation • FLJ: باپ، بہن، بہنگی، بھابی، بھنگی، بیٹا • STCD: باپ، بھابی، بہن، بھنگی، بہنگی، بیٹا • بانو، بانھ، بانی • FT & JUL: باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی
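The FLJ and FT/JUL orders above differ in whether ھ is a letter of its own or is folded into the preceding consonant (ب + ھ becomes the single unit بھ). A minimal sketch of both treatments follows. `BASE_ORDER` is a hypothetical, trimmed alphabet containing only the letters these six words need, in the relative order of the FLJ sequence; the key functions are illustrative and do not model the STCD order.

```python
DO_CHASHMI_HE = "\u06BE"  # ھ

# Only the letters needed for this example, in FLJ relative order:
# ا ب پ ٹ گ ن ہ ھ ی
BASE_ORDER = "\u0627\u0628\u067E\u0679\u06AF\u0646\u06C1\u06BE\u06CC"
RANK = {ch: i for i, ch in enumerate(BASE_ORDER)}

WORDS = [
    "\u0628\u0627\u067E",              # باپ
    "\u0628\u06C1\u0646",              # بہن
    "\u0628\u06C1\u0646\u06AF\u06CC",  # بہنگی
    "\u0628\u06CC\u0679\u0627",        # بیٹا
    "\u0628\u06BE\u0627\u0628\u06CC",  # بھابی
    "\u0628\u06BE\u0646\u06AF\u06CC",  # بھنگی
]


def letter_key(word):
    """FLJ-style: ھ is an ordinary letter ranked between ہ and ی."""
    return [RANK[ch] for ch in word]


def contraction_key(word):
    """FT/JUL-style: consonant + ھ is one unit sorting right after the consonant."""
    units = []
    for ch in word:
        if ch == DO_CHASHMI_HE and units:
            base_rank, _ = units[-1]
            units[-1] = (base_rank, 1)  # mark the previous consonant as aspirated
        else:
            units.append((RANK[ch], 0))
    return units


if __name__ == "__main__":
    print(sorted(WORDS, key=letter_key))
    print(sorted(WORDS, key=contraction_key))
```

With `letter_key` the بھ words fall between the بہ words and بیٹا, as in FLJ. With `contraction_key` they move after every plain ب word, as in FT and JUL.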

  16. ی ے Variation • FLJ, FT & JUL: بی بی بی بے بیابان • STCD: بی بے بیابان بی بی • Middle “yay” predicament: ے or ی • بیکار = ب ے ک ا ر • ٹیلیوژن = ٹ ی ل ی و ژ ن

  17. Role of Aerab in Sorting • Aerab are ignored in the first (primary) pass of sorting an Urdu string • بِہار (= بِ ہار) • بَہانہ • بِہاءی (= بِ ہاءی) • However, aerab are relevant in the second pass, when the first pass gives an exact match • بَن بِن بُن • سَن سِن سُن

  18. Vocalic Aerab - Zabar, Zer, Pesh • FT, FLJ, JUL • بَن • بِن • بُن • بَیر • بِیر • بیر • بِیر • بیر • STCD • بَن • بُن • بِن • سَن • سِن • سُن
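The two-pass behaviour described on slides 17 and 18 can be sketched as a two-level sort key: letters alone decide the primary comparison, and the vowel marks break ties in a dictionary-specific order. In this sketch, Unicode code point order stands in for real Urdu letter weights (an assumption that happens to work for these examples), and a letter is assumed to carry at most one mark. The function and constant names are hypothetical.

```python
import unicodedata

ZABAR, ZER, PESH = "\u064E", "\u0650", "\u064F"

FT_FLJ_JUL_MARKS = [ZABAR, ZER, PESH]  # order reported for FT, FLJ, JUL
STCD_MARKS = [ZABAR, PESH, ZER]        # order reported for STCD (بن example)

BAN_ZABAR = "\u0628\u064E\u0646"            # بَن
BAN_ZER = "\u0628\u0650\u0646"              # بِن
BAN_PESH = "\u0628\u064F\u0646"             # بُن
BIHAR = "\u0628\u0650\u06C1\u0627\u0631"    # بِہار
BAHANA = "\u0628\u064E\u06C1\u0627\u0646\u06C1"  # بَہانہ


def two_level_key(word, mark_order):
    """Return (primary, secondary): letters first, then per-letter mark weights."""
    weight = {mark: i + 1 for i, mark in enumerate(mark_order)}  # 0 means "no mark"
    primary, secondary = [], []
    for ch in word:
        if unicodedata.combining(ch):
            if secondary:
                # Unknown marks sort after the listed ones.
                secondary[-1] = weight.get(ch, len(weight) + 1)
        else:
            primary.append(ord(ch))  # stand-in for a real Urdu letter weight
            secondary.append(0)
    return (primary, secondary)


def sort_words(words, mark_order):
    return sorted(words, key=lambda w: two_level_key(w, mark_order))


if __name__ == "__main__":
    print(sort_words([BAN_PESH, BAN_ZER, BAN_ZABAR], FT_FLJ_JUL_MARKS))
    print(sort_words([BAN_PESH, BAN_ZER, BAN_ZABAR], STCD_MARKS))
    print(sort_words([BAHANA, BIHAR], FT_FLJ_JUL_MARKS))
```

The last call places بِہار before بَہانہ even though zabar outranks zer, because the letters ر and ن already differ at the primary level and the marks are never consulted.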

  19. Vocalic Aerab – Khari Zabar • No effect at primary level sorting • اعلا مَوسی • اعلان مُوسی • اعلم • اعلی • No minimal pairs found, so secondary-level involvement could not be determined

  20. Consonantal Aerab - Hamza • Ignored at primary level • Minimal pairs not found to determine secondary level effect • مرا • مرٲت • مراتب • مرام • مرآت • باوا • باٶٹا • باون

  21. Consonantal Aerab - Tashdeed • Ignored at primary level • Affects secondary level sorting • “heavier than null” • Interacts with vocalic aerab • بدی • بدّی • بدّیا • بَرانا • برّانا • بَرایا • بدو • بدُّو • بدّیا • (all examples from FT)

  22. Ligature-Break (Half Space) • Ignored at primary level and secondary level • ٹیلیوژن ، ٹیلی وژن • ٹیلیفون ، ٹیلی فون • بے کار ، بیکار • But given each pair, which word comes first? • Tertiary level decision

  23. Word-Break (Normal Space) • Ignored at primary level? • American Heritage Dictionary (2nd Collegiate ed.) • black art • black bear • blackberry • black box • blacken • Black Death • black gold • Space ignored at primary level

  24. Word-Break (Normal Space) - II • FLJ • بانگ • بانگِ درا • بانگ دینا • If sorting were done at word breaks, the order would be 1, 3, 2 • So sorting ignores word break
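One way to read the argument on this slide: if each word were compared in full before moving to the next, the zer on بانگِ would make the second entry heavier than the third at its first word. The sketch below contrasts that reading with a key that drops spaces before comparing. It assumes code point order as a stand-in for letter weights and uses the raw sequence of marks as a crude secondary level; both key functions are hypothetical illustrations, not the dictionary's documented procedure.

```python
import unicodedata

BANG = "\u0628\u0627\u0646\u06AF"                 # بانگ
BANG_E_DARA = BANG + "\u0650 \u062F\u0631\u0627"  # بانگِ درا
BANG_DENA = BANG + " \u062F\u06CC\u0646\u0627"    # بانگ دینا


def level_keys(text):
    """Split text into (letters, marks); letters are compared before marks."""
    letters = [ord(ch) for ch in text if not unicodedata.combining(ch)]
    marks = [ord(ch) for ch in text if unicodedata.combining(ch)]
    return (letters, marks)


def key_ignoring_spaces(entry):
    """Treat the whole entry as one string, with word breaks removed."""
    return level_keys(entry.replace(" ", ""))


def key_word_by_word(entry):
    """Compare the first word completely (letters, then marks) before the next."""
    return [level_keys(word) for word in entry.split()]


if __name__ == "__main__":
    entries = [BANG_DENA, BANG_E_DARA, BANG]
    print(sorted(entries, key=key_ignoring_spaces))
    print(sorted(entries, key=key_word_by_word))
```

Under these assumptions `key_ignoring_spaces` yields the order printed in FLJ (1, 2, 3), while `key_word_by_word` yields 1, 3, 2.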

  25. Conclusions: Urdu Collating Sequence • Multi-level Complex Problem • Pre-processing: contractions (ب ھ → بھ) • Primary level: characters • Secondary level: vocalic aerab; consonantal aerab; interaction of vocalic and consonantal aerab; others (?) • Tertiary level: ligature break; others (?)

  26. What Needs to be Done: Urdu • If required, revisit and revise the Urdu character set • Extensive work on sorting has been done at the linguistic level by NLA and UDB. Need to • Standardize it • Publicize it • Need to develop it at the computational level: build a Collation Element Table to generate sort keys • Standardize it • Publicize it
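A Collation Element Table maps each character to one weight per level, and a sort key is built by concatenating the non-zero weights level by level. The sketch below uses a hypothetical miniature table: the weights are invented for illustration, a weight of 0 means "ignored at this level", and the choice to sort the joined form before the form with a ligature break is arbitrary, since the slides leave that tertiary decision open. It loosely follows the layout of Unicode Collation Algorithm sort keys but is not an implementation of that algorithm.

```python
# Hypothetical miniature Collation Element Table:
# character -> (primary, secondary, tertiary) weights; 0 = ignored at that level.
CET = {
    "\u0628": (1, 0, 0),  # ب
    "\u0679": (2, 0, 0),  # ٹ
    "\u062F": (3, 0, 0),  # د
    "\u0641": (4, 0, 0),  # ف
    "\u0644": (5, 0, 0),  # ل
    "\u0646": (6, 0, 0),  # ن
    "\u0648": (7, 0, 0),  # و
    "\u06CC": (8, 0, 0),  # ی
    "\u064E": (0, 1, 0),  # zabar
    "\u0650": (0, 2, 0),  # zer
    "\u064F": (0, 3, 0),  # pesh
    "\u0651": (0, 4, 0),  # tashdeed ("heavier than null")
    "\u200C": (0, 0, 1),  # zero-width non-joiner used as the ligature break
}

BADI = "\u0628\u062F\u06CC"          # بدی
BADDI = "\u0628\u062F\u0651\u06CC"   # بدّی
TELEPHONE = "\u0679\u06CC\u0644\u06CC\u0641\u0648\u0646"              # ٹیلیفون
TELEPHONE_BROKEN = "\u0679\u06CC\u0644\u06CC\u200C\u0641\u0648\u0646"  # with ligature break


def sort_key(text):
    """Concatenate non-zero weights per level, separated by 0."""
    elements = [CET[ch] for ch in text]
    key = []
    for level in range(3):
        key.extend(e[level] for e in elements if e[level] != 0)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)


if __name__ == "__main__":
    for word in sorted([BADDI, TELEPHONE_BROKEN, BADI, TELEPHONE], key=sort_key):
        print(word, sort_key(word))
```

Because every level is packed into one tuple, comparing two sort keys with `<` performs the primary, secondary, and tertiary comparisons in a single pass.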

  27. What Needs to be Done: Other Languages of Pakistan • Need to work towards standardization of • Character set • Collating sequence • Need to do gap analysis of character sets with Unicode/ISO 10646 for international standardization • Need to develop Collation Element Tables for these languages for sorting

  28. Thank you • Questions?
