Urdu Character Set and Collating Sequence Sarmad Hussain مرکزتحقیقاتِاردو Center for Research in Urdu Language Processing FAST National University of Computer and Emerging Sciences
Purpose of Presentation
• Indicate the “state of affairs”
  • Character set
  • Collating sequence
• Show what has been done regarding standardization
• Identify what needs to be done
Sources
• Data from four dictionaries of Urdu
  • فیروزاللغات جامع ، فیروز سنز، لاہور (FLJ)
  • Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD)
  • فرہنگِ تلفظ ، مقتدرہ قومی زبان ، اسلام آباد (FT)
  • جدید اردو لغت ، مقتدرہ قومی زبان ، اسلام آباد (JUL)
Character Set
• Alphabet
• Harakat (Aerab)
• Other Symbols
“Typical” Alphabet
• آابپتٹثجچحخ دڈذرڑزژ سشصض طظعغ فقکگ لم نوہءیے
• Source: اردو قاءدہ ، فیروز سنز ، لاہور
“Familiar” Harakaat (Aerab)
• Jazm دْ
• Zabar دَ
• Zer دِ
• Pesh دُ
• Khari zabar د
• Khari zer د
• Ulta pesh د
• Do zabar دً
• Do zer دٍ
• Do pesh دُ
• Tashdeed دّ
• Noon ghunna ن
“Common” Other Symbols
• Numbers: 0 ۰ 1 ١ 2 ٢ 3 ٣ 4 5 ۵ 6 ٦ 7 8 ٨ 9 ٩
• Punctuation: ؟ ؛ ٬ -
• Honorifics: ס
Urdu Alphabet: State of Affairs
• FT, JUL
  • اآببھپپھتتھٹٹھثججھچچھحخ ددھڈڈھذررھڑڑھزژ سشصضطظعغ فقککھگگھللھممھںںھننھووھہءیے
• FLJ, STCD
  • آابپتٹثجچحخ دڈذرڑزژ سشصضطظعغ فقکگلمںنو ہھءیے
Current Government of Pakistan (GoP) Standard: UZT 1.01
Logical Sections of UZT 1.01
• Alphabet (80 – 122)
• Aerab/diacritics/harakat (66 – 79, 123 – 126)
• Other characters
  • Punctuation and arithmetic symbols (32 – 47, 58 – 65)
  • Digits (48 – 57)
  • Special symbols (160 – 176, 192 – 199)
• Miscellaneous
  • Control characters (0 – 31, 127)
  • Reserved control space (128 – 159, 255)
  • Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
  • Vendor area (208 – 239)
  • Toggle character (254)
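The ranges listed for UZT 1.01 cover the whole 8-bit code space, so a code value can be mapped to its logical section with a simple table lookup. The Python sketch below is illustrative only: the ranges are copied from this slide rather than from the UZT 1.01 specification, and the names `UZT_SECTIONS` and `uzt_section` are hypothetical.

```python
# Inclusive code ranges per logical section, as listed on the slide.
# Assumption: these values come from the slide, not from the UZT 1.01 text itself.
UZT_SECTIONS = [
    ("control", [(0, 31), (127, 127)]),
    ("punctuation/arithmetic", [(32, 47), (58, 65)]),
    ("digit", [(48, 57)]),
    ("aerab", [(66, 79), (123, 126)]),
    ("alphabet", [(80, 122)]),
    ("reserved control", [(128, 159), (255, 255)]),
    ("special symbol", [(160, 176), (192, 199)]),
    ("reserved expansion", [(177, 191), (200, 207), (240, 253)]),
    ("vendor", [(208, 239)]),
    ("toggle", [(254, 254)]),
]


def uzt_section(code: int) -> str:
    """Return the logical section name for an 8-bit UZT code value."""
    if not 0 <= code <= 255:
        raise ValueError(f"not an 8-bit code: {code}")
    for name, ranges in UZT_SECTIONS:
        for low, high in ranges:
            if low <= code <= high:
                return name
    raise ValueError(f"code {code} is not assigned to any section")


print(uzt_section(48))   # digit
print(uzt_section(100))  # alphabet
```

Keeping the ranges in one table makes it easy to check that every value from 0 to 255 falls in exactly one section.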
Conclusions: Standard Urdu Character Set
• No general agreement on the Urdu character set among dictionary publishers
• Standard character set defined by the National Language Authority (NLA)
  • not well-publicized
  • not widely adopted
• The GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set
  • Will soon be fully represented in Unicode/ISO 10646
Urdu Collating Sequence: State of Affairs
• FT, JUL
  • اآببھپپھتتھٹٹھثججھچچھحخ ددھڈڈھذررھڑڑھزژ سشصضطظعغ فقککھگگھللھممھںںھننھووھہءیے
• FLJ
  • آابپتٹثجچحخ دڈذرڑزژ سشصضطظعغ فقکگلمںنوہھءیے
• STCD
  • آ ابپتٹثجچحخ دڈذرڑزژ سشصضطظعغ فقکگلمنںوھ ہءیے
ا آ Variation
• STCD and FLJ: آب آپ اب ایوان
• FT and JUL: اب ایوان آب آپ
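The two orderings of these four words follow directly from where آ is placed relative to ا in the alphabet. The Python sketch below reproduces both with a rank-by-position sort key. The toy alphabets contain only the letters needed here, and the names (`make_key`, `MADDA_FIRST`, `ALEF_FIRST`) are hypothetical.

```python
# Letters are written as escapes so the logical character order is unambiguous.
ALEF_MADDA = "\u0622"  # alef with madda
ALEF = "\u0627"
BE = "\u0628"
PE = "\u067E"
NOON = "\u0646"
WAW = "\u0648"
YE = "\u06CC"

AAB = ALEF_MADDA + BE                  # first word in the STCD/FLJ list
AAP = ALEF_MADDA + PE
AB = ALEF + BE
AIWAN = ALEF + YE + WAW + ALEF + NOON


def make_key(alphabet):
    """Return a sort key that ranks each character by its position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return lambda word: [rank[ch] for ch in word]


# Toy alphabets: identical except for the relative order of the two alefs.
MADDA_FIRST = [ALEF_MADDA, ALEF, BE, PE, NOON, WAW, YE]  # as in STCD and FLJ
ALEF_FIRST = [ALEF, ALEF_MADDA, BE, PE, NOON, WAW, YE]   # as in FT and JUL

words = [AIWAN, AAP, AB, AAB]
stcd_flj = sorted(words, key=make_key(MADDA_FIRST))
ft_jul = sorted(words, key=make_key(ALEF_FIRST))
print(stcd_flj)
print(ft_jul)
```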
ن ں Variation
• FLJ, FT & STCD: ماں مان
• JUL: مان ماں
ہھ Variation
• FLJ: باپ بہن بہنگی بھابی بھنگی بیٹا
• STCD: باپ بھابی بہن بھنگی بہنگی بیٹا
• FT & JUL: باپ بہن بہنگی بیٹا بھابی بھنگی
• بانو بانھ بانی
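Each of the three orderings of these six words can be reproduced by a different treatment of ھ. The Python sketch below shows one reading that is consistent with the lists, not the dictionaries' documented rules: FLJ treats ھ as a separate letter after ہ, FT and JUL treat an aspirate such as بھ as a single letter following its base, and STCD treats ھ and ہ as equal on the first pass with ھ first on a tie. All function names and romanized labels are hypothetical.

```python
ALEF, BE, PE, TTE = "\u0627", "\u0628", "\u067E", "\u0679"
GAF, NOON, HE, YE = "\u06AF", "\u0646", "\u06C1", "\u06CC"
DOCH = "\u06BE"  # do-chashmi he

WORDS = {
    "baap": BE + ALEF + PE,
    "bahan": BE + HE + NOON,
    "bahangi": BE + HE + NOON + GAF + YE,
    "bhabi": BE + DOCH + ALEF + BE + YE,
    "bhangi": BE + DOCH + NOON + GAF + YE,
    "beta": BE + YE + TTE + ALEF,
}

# Toy alphabet order for the plain letters used above.
BASE_RANK = {ch: i for i, ch in enumerate([ALEF, BE, PE, TTE, GAF, NOON, HE, YE])}
# Same order with do-chashmi he inserted as its own letter right after he.
SEPARATE_RANK = {ch: i for i, ch in
                 enumerate([ALEF, BE, PE, TTE, GAF, NOON, HE, DOCH, YE])}


def key_separate_letter(word):
    """FLJ-like: do-chashmi he is an ordinary letter placed after he."""
    return [SEPARATE_RANK[ch] for ch in word]


def key_aspirate_contraction(word):
    """FT/JUL-like: letter + do-chashmi he forms one letter after its base.
    Assumes do-chashmi he never starts a word."""
    key = []
    for ch in word:
        if ch == DOCH:
            base, _ = key[-1]
            key[-1] = (base, 1)  # aspirated form sorts just after the plain letter
        else:
            key.append((BASE_RANK[ch], 0))
    return key


def key_he_equivalent(word):
    """STCD-like: both he forms are equal at first; do-chashmi he wins ties."""
    primary = [BASE_RANK[HE if ch == DOCH else ch] for ch in word]
    tie_break = [0 if ch == DOCH else 1 for ch in word if ch in (HE, DOCH)]
    return (primary, tie_break)


def order(key):
    return sorted(WORDS, key=lambda name: key(WORDS[name]))


print(order(key_separate_letter))
print(order(key_aspirate_contraction))
print(order(key_he_equivalent))
```

The second key is an instance of the contraction pre-processing step: the pair is merged into one sorting unit before any ranks are compared.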
یے Variation
• FLJ, FT & JUL: بی بی بی بے بیابان
• STCD: بی بے بیابان بی بی
• Middle “yay” predicament: ے or ی
  • بیکار = ب ے ک ا ر
  • ٹیلیوژن = ٹ ی ل ی و ژ ن
Role of Aerab in Sorting
• Aerab are ignored in the first (primary) pass of sorting an Urdu string
  • بِہار (= بِ ہار)
  • بَہانہ
  • بِہاءی (= بِ ہاءی)
• However, aerab are relevant in the second pass, when the first pass gives an exact match
  • بَن بِن بُن
  • سَن سِن سُن
Vocalic Aerab - Zabar, Zer, Pesh
• FT, FLJ, JUL
  • بَن
  • بِن
  • بُن
  • بَیر
  • بِیر
  • بیر
  • بِیر
  • بیر
• STCD
  • بَن
  • بُن
  • بِن
  • سَن
  • سِن
  • سُن
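The two-pass behaviour (aerab ignored on the first pass, consulted only on an exact match) can be expressed as a two-part sort key. The Python sketch below is a minimal illustration under stated assumptions: the letter ranks are a toy subset, the names are hypothetical, and `STCD_BN_ORDER` models only the order of the three بن entries listed for STCD.

```python
ZABAR, ZER, PESH = "\u064E", "\u0650", "\u064F"
ALEF, BE, RE, NOON, HE = "\u0627", "\u0628", "\u0631", "\u0646", "\u06C1"

# Toy primary ranks for the letters used below, in alphabet order.
LETTER_RANK = {ch: i for i, ch in enumerate([ALEF, BE, RE, NOON, HE])}


def two_level_key(word, aerab_order):
    """Primary part: letters only. Secondary part: the mark on each letter (0 = none)."""
    mark_rank = {mark: i + 1 for i, mark in enumerate(aerab_order)}
    primary, secondary = [], []
    for ch in word:
        if ch in mark_rank:
            if secondary:  # a mark belongs to the letter before it
                secondary[-1] = mark_rank[ch]
        else:
            primary.append(LETTER_RANK[ch])
            secondary.append(0)
    return (primary, secondary)


def sort_words(words, aerab_order):
    return sorted(words, key=lambda w: two_level_key(w, aerab_order))


BAN = BE + ZABAR + NOON
BIN = BE + ZER + NOON
BUN = BE + PESH + NOON
BIHAR = BE + ZER + HE + ALEF + RE
BAHANA = BE + ZABAR + HE + ALEF + NOON + HE

FT_ORDER = [ZABAR, ZER, PESH]       # order seen in FT, FLJ, JUL
STCD_BN_ORDER = [ZABAR, PESH, ZER]  # order of the STCD entries for this word only

print(sort_words([BUN, BIN, BAN], FT_ORDER) == [BAN, BIN, BUN])
print(sort_words([BAHANA, BIHAR], FT_ORDER) == [BIHAR, BAHANA])
```

Because Python compares tuples element by element, the secondary part is reached only when the primary parts are identical, which is the second-pass rule.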
Vocalic Aerab - Khari Zabar
• No effect at primary level sorting
  • اعلا اعلان اعلم اعلی
  • مَوسی مُوسی
• No minimal pairs found, so secondary-level involvement could not be determined
Consonantal Aerab - Hamza
• Ignored at primary level
• Minimal pairs not found to determine secondary level effect
  • مرا
  • مرٲت
  • مراتب
  • مرام
  • مرآت
  • باوا
  • باٶٹا
  • باون
Consonantal Aerab - Tashdeed
• Ignored at primary level
• Affects secondary level sorting
  • “heavier than null”
  • Interacts with vocalic aerab
• All examples from FT
  • بدی بدّی بدّیا
  • بَرانا برّانا بَرایا
  • بدو بدُّو بدّیا
Ligature-Break (Half Space)
• Ignored at primary level and secondary level
  • ٹیلیوژن ، ٹیلی وژن
  • ٹیلیفون ، ٹیلی فون
  • بے کار ، بیکار
• But given each pair, which word comes first?
  • Tertiary level decision
Word-Break (Normal Space)
• Ignored at primary level?
• American Heritage Dictionary (2nd Collegiate ed.)
  • black art
  • black bear
  • blackberry
  • black box
  • blacken
  • Black Death
  • black gold
• Space ignored at primary level
Word-Break (Normal Space) - II
• FLJ
  • بانگ
  • بانگِ درا
  • بانگ دینا
• If sorting were done at word breaks, the order of the three entries above would be 1, 3, 2
• So sorting ignores word breaks
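One way to read the 1, 3, 2 argument: if each word were compared completely (letters, then aerab) before moving on to the next word, the zer on the first word of the second entry would push it after the third entry. The Python sketch below contrasts that with comparing the whole entry as one string. This interpretation, the toy letter ranks, and all names are assumptions.

```python
ZER = "\u0650"
ALEF, BE, DAL, RE = "\u0627", "\u0628", "\u062F", "\u0631"
GAF, NOON, YE = "\u06AF", "\u0646", "\u06CC"

# Toy primary ranks for the letters used below, in alphabet order.
RANK = {ch: i for i, ch in enumerate([ALEF, BE, DAL, RE, GAF, NOON, YE])}


def word_key(text):
    """Two-level key: letters first, then per-letter marks (1 = zer, 0 = none)."""
    primary, secondary = [], []
    for ch in text:
        if ch == ZER:
            if secondary:
                secondary[-1] = 1
        else:
            primary.append(RANK[ch])
            secondary.append(0)
    return (primary, secondary)


def key_word_by_word(entry):
    """Word breaks are significant: each word is fully compared before the next."""
    return [word_key(word) for word in entry.split(" ")]


def key_ignore_breaks(entry):
    """Word breaks are ignored: the entry is compared as one string."""
    return word_key(entry.replace(" ", ""))


BANG = BE + ALEF + NOON + GAF
ENTRY_1 = BANG
ENTRY_2 = BANG + ZER + " " + DAL + RE + ALEF
ENTRY_3 = BANG + " " + DAL + YE + NOON + ALEF
entries = [ENTRY_3, ENTRY_2, ENTRY_1]

by_word = sorted(entries, key=key_word_by_word)
ignoring = sorted(entries, key=key_ignore_breaks)
print(by_word == [ENTRY_1, ENTRY_3, ENTRY_2])
print(ignoring == [ENTRY_1, ENTRY_2, ENTRY_3])
```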
Conclusions: Urdu Collating Sequence
• Multi-level complex problem
• Pre-processing
  • Contractions (ب ھ بھ)
• Primary level
  • Characters
• Secondary level
  • Vocalic aerab
  • Consonantal aerab
  • Interaction of vocalic and consonantal aerab
  • Others (?)
• Tertiary level
  • Ligature break
  • Others (?)
What Needs to be Done: Urdu
• If required, revisit and revise the Urdu character set
• Extensive work on sorting has been done at the linguistic level by NLA and UDB. Need to
  • Standardize it
  • Publicize it
• Need development at the computational level to build
  • Collation Element Table to generate sort keys
  • Standardize it
  • Publicize it
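To make the idea of a Collation Element Table concrete, the Python sketch below builds multi-level sort keys in the general style of the Unicode Collation Algorithm: all primary weights, a separator, all secondary weights, a separator, all tertiary weights. The table `CET`, every weight in it, and the function name `sort_key` are invented for illustration and are not an NLA, UZT, or Unicode table.

```python
BE, SEEN, NOON = "\u0628", "\u0633", "\u0646"
ZABAR, ZER, PESH = "\u064E", "\u0650", "\u064F"

# Hypothetical collation elements: (primary, secondary, tertiary).
# A weight of 0 means the character is ignored at that level.
CET = {
    BE: (0x10, 0x20, 0x02),
    SEEN: (0x11, 0x20, 0x02),
    NOON: (0x12, 0x20, 0x02),
    ZABAR: (0, 0x21, 0x02),  # aerab carry no primary weight
    ZER: (0, 0x22, 0x02),
    PESH: (0, 0x23, 0x02),
}


def sort_key(text):
    """Concatenate the non-zero weights level by level, separated by 0."""
    elements = [CET[ch] for ch in text]
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(e[level] for e in elements if e[level])
    return tuple(key)


BN = BE + NOON  # no aerab
BAN = BE + ZABAR + NOON
BIN = BE + ZER + NOON
BUN = BE + PESH + NOON
SAN = SEEN + ZABAR + NOON

ordered = sorted([SAN, BUN, BIN, BN, BAN], key=sort_key)
print(ordered == [BN, BAN, BIN, BUN, SAN])
```

Once keys are generated this way, any two strings can be compared with a plain tuple (or byte) comparison, which is why a single standardized table is enough to make sorting consistent across applications.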
What Needs to be Done: Other Languages of Pakistan
• Need to work towards standardization of
  • Character set
  • Collating sequence
• Need to do gap analysis of character sets with Unicode/ISO 10646 for international standardization
• Need to develop Collation Element Tables for these languages for sorting
Thank you
Questions?