URDU WORD PROCESSING AN INTRODUCTION By: Sharraf Hussain

Classes of Problems in Urdu Word Processing
Any computer application that handles Urdu text must confront two main classes of problems:
1. Contextual Formatting
2. Directional Layout

Contextual Formatting
By contextual formatting we mean the determination of a character's proper presentation form according to its context.
• In Urdu, each character may have several presentation forms.
• The proper form of a character in a text is determined by the presentation forms available for the character itself and for the characters surrounding it.
• It also depends on the current presentation forms of the surrounding characters.

Contextual Joining
Joining Method
• Urdu letters join to their neighbors within a word.
• Each joining letter is represented by four basic contextual forms.
• Rule: The software replaces the simple contextual forms with special graphics designed for the specific context.
• Note: This kind of contextual formatting is only possible in Naskh script.

The Unicode Algorithm for Arabic Characters (Version 2.0)
• Characters are divided into six groups:
  • Right-joining
  • Left-joining
  • Dual-joining
  • Join-causing (e.g. the zero-width joiner)
  • Non-joining (e.g. the zero-width non-joiner and ordinary spacing characters)
    • A non-joiner placed between two letters prevents them from forming a cursive connection with each other when rendered.
  • Transparent (e.g. harkats)
    • Harkats are marks that indicate vowels or other modifications of consonant letters.
• Two subgroups are defined:
  • Right join-causing characters (dual-joining, left-joining and join-causing characters)
  • Left join-causing characters (dual-joining, right-joining and join-causing characters)
• Seven rules are defined based on these classifications.
• http://www.unicode.org/unicode/uni2book/ch08.pdf (page 193)
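
The classification above can be illustrated with a small Python sketch. This is a simplified reading of the joining rules, not the full text of the seven Unicode rules: the table covers only a hand-picked handful of characters, and the names `JOINING_TYPE`, `neighbor` and `contextual_forms` are made up for this illustration. Text is assumed to be in typing (logical) order, so the previous character is the one displayed to the right.

```python
# Joining groups, abbreviated. The table below is a tiny illustrative subset.
R, L, D, C, U, T = "right", "left", "dual", "cause", "non", "transparent"

JOINING_TYPE = {
    "\u0627": R,  # ALEF: right-joining
    "\u062F": R,  # DAL: right-joining
    "\u0628": D,  # BEH: dual-joining
    "\u0644": D,  # LAM: dual-joining
    "\u0645": D,  # MEEM: dual-joining
    "\u200D": C,  # ZERO WIDTH JOINER: join-causing
    "\u200C": U,  # ZERO WIDTH NON-JOINER: non-joining
    "\u064E": T,  # FATHA (a harkat): transparent
}

# A neighbor on the right must be able to connect on its own left side,
# and a neighbor on the left must be able to connect on its own right side.
RIGHT_JOIN_CAUSING = {D, L, C}
LEFT_JOIN_CAUSING = {D, R, C}


def joining_type(ch):
    """Characters missing from the table (spaces, digits, ...) count as non-joining."""
    return JOINING_TYPE.get(ch, U)


def neighbor(text, i, step):
    """Joining type of the nearest neighbor in one direction, skipping transparent marks."""
    j = i + step
    while 0 <= j < len(text) and joining_type(text[j]) == T:
        j += step
    return joining_type(text[j]) if 0 <= j < len(text) else U


def contextual_forms(text):
    """Return one form name per character (None for marks and control characters)."""
    forms = []
    for i, ch in enumerate(text):
        jt = joining_type(ch)
        if jt not in (R, L, D):
            forms.append(None)
            continue
        joins_right = jt in (R, D) and neighbor(text, i, -1) in RIGHT_JOIN_CAUSING
        joins_left = jt in (L, D) and neighbor(text, i, +1) in LEFT_JOIN_CAUSING
        if joins_right and joins_left:
            forms.append("medial")
        elif joins_right:
            forms.append("final")
        elif joins_left:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms
```

For example, `contextual_forms("\u0628\u0627")` (BEH followed by ALEF) gives an initial BEH and a final ALEF, while placing a zero-width non-joiner between two BEH letters leaves both isolated.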

Contextual Formatting Algorithm in the Form of a Finite State Machine
• In this algorithm the separate, last-joining, first-joining and middle-joining presentation forms are designated by the symbols A, B, C and D respectively. In contrast to the Unicode algorithm, there is no need to categorize the characters.
• b[0] and b[1] are the first and second characters of a two-character window.
• The state machine has two states: FIRST_CHAR_STATE and SECOND_CHAR_STATE.
• In each state a character is read and placed in the b[1] buffer.
• In the first state:
  • We first determine the presentation form of b[0].
  • Based on the available presentation forms of b[1], we decide the presentation forms of b[0] and b[1].
  • Finally we move forward one character (i.e. the old b[1] becomes the new b[0]).
• In the second state:
  • We decide on the presentation forms of b[0] and b[1] according to the available (and current) presentation forms of these characters.
  • We again move forward one character.

[Flowchart: the contextual formatting state machine. Starting in FIRST_CHAR_STATE, each step tests which forms b[0] and b[1] have available (for example "b[1] has FORM B?", "b[0] has FORM C?", "b[0] has FORM D?"), assigns FORM(b[0]) and FORM(b[1]), slides the window by one character, and continues in FIRST_CHAR_STATE or SECOND_CHAR_STATE.]
Legend: A = Separate, B = Last-Joining, C = First-Joining, D = Middle-Joining; b[0] = first character, b[1] = second character.
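
The two-state machine can be sketched in Python as follows. This is a reconstruction from the description of the two states, so the exact branching is an assumption rather than the author's definitive algorithm. The table `AVAILABLE_FORMS` and the function `assign_forms` are hypothetical names, and the table lists only a few letters.

```python
FIRST_CHAR_STATE = "FIRST_CHAR_STATE"
SECOND_CHAR_STATE = "SECOND_CHAR_STATE"

# Which presentation forms each character's glyph set provides
# (A = separate, B = last-joining, C = first-joining, D = middle-joining).
# Illustrative subset only; anything not listed is treated as having form A alone.
AVAILABLE_FORMS = {
    "\u0628": "ABCD",  # BEH
    "\u0644": "ABCD",  # LAM
    "\u0645": "ABCD",  # MEEM
    "\u0627": "AB",    # ALEF
    "\u062F": "AB",    # DAL
}


def assign_forms(text):
    """Assign A/B/C/D to every character using a two-character sliding window."""
    forms = ["A"] * len(text)
    state = FIRST_CHAR_STATE
    for i in range(len(text) - 1):
        b0 = AVAILABLE_FORMS.get(text[i], "A")      # forms available to b[0]
        b1 = AVAILABLE_FORMS.get(text[i + 1], "A")  # forms available to b[1]
        if state == FIRST_CHAR_STATE:
            # b[0] is not joined to the character before it.
            if "B" in b1 and "C" in b0:
                forms[i], forms[i + 1] = "C", "B"
                state = SECOND_CHAR_STATE
            else:
                forms[i], forms[i + 1] = "A", "A"
        else:
            # b[0] currently has form B (it is joined to the character before it).
            if "B" in b1 and "D" in b0:
                forms[i], forms[i + 1] = "D", "B"
            else:
                forms[i + 1] = "A"  # b[0] keeps form B and ends the joined run
                state = FIRST_CHAR_STATE
        # The loop increment is the "slide window +1" step.
    return forms
```

With this table, BEH LAM MEEM comes out as C, D, B, and BEH ALEF BEH comes out as C, B, A because ALEF has no middle-joining form. No character categories are consulted, only the list of forms each character has.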

Directional Layout
The computer takes a sequence of right-to-left characters and places each letter in its proper relative position in the text line; this process is called directional layout. This second class of problems arises because the Urdu script is written from right to left.

Urdu Numbers
For right-to-left orientation with automatic counter flow enabled, typing Urdu numerals automatically initiates a special counter flow mode. Counter flow is terminated when you press any other key.
Example: three thousand four hundred fifty-six is displayed as 3456, not 6543.
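
A minimal sketch of counter flow while typing, under the assumption that the cursor always sits at the end of a right-to-left line (its left edge). The function name `type_keys` and the list-based line buffer are illustrative choices, not part of any particular editor.

```python
def type_keys(keys):
    """Simulate typing into a right-to-left line; return the line as displayed, left to right."""
    visual = []      # displayed characters, leftmost first
    counter = None   # insertion offset while counter flow is active
    for ch in keys:
        if ch.isdigit():
            # A numeral starts (or continues) counter flow: digits grow to the right.
            if counter is None:
                counter = 0
            visual.insert(counter, ch)
            counter += 1
        else:
            # Any other key ends counter flow; normal flow adds at the left edge.
            counter = None
            visual.insert(0, ch)
    return "".join(visual)
```

Typing ALEF, BEH, a space and then 3, 4, 5, 6 yields a display of `3456` to the left of the word, with the digits in their natural order, while the keystroke order itself stays 3, 4, 5, 6.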

Design Goals
The term word processing is used here to focus on applications where the production of readable text is the major goal.
1. Minimal burden on the user
2. Maximum transferability of text
3. Minimal idiosyncrasies in the internal text
4. Near typeset print quality

Minimal Burden on the User
• Automatically present Urdu text in its proper format and layout.
• The user should not be burdened with backward typing or confusing "modes".
• The user should be free to concentrate on the text's content instead of its form.

Maximum Transferability of Text
• It must be possible to copy text or numbers into a text editor for any language.
• Text must transfer cleanly within a document editing window, between windows representing different documents, and to other application programs.

Minimal Idiosyncrasies in the Internal Text Sequence
Because other applications may be unaware of the special properties of Urdu text, the internally stored and transferred text sequence must be free of idiosyncrasies related to text directionality, such as substrings stored backwards (especially numbers) or embedded directional commands.

Near Typeset Quality
• Professional-looking text

Ligatures
• A ligature is the graphic formed when neighboring letters fuse together.
• A ligature itself has a contextual joining form.
• Urdu processing software must be able to recognize a letter sequence such as lA in the text and then automatically display or print the correct form of the ligature.

How to Calculate Ligature Forms
• Urdu has 5 letters that share the initial shape of n, differing only in their dots/Tuain, and 4 letters that share the shape of j, differing only in their dots. It follows that the nj ligature is one of 5 x 4 = 20 structurally identical ligatures; each of these has two contextual forms, so there are 5 x 4 x 2 = 40 altogether.
• Note: The shape of the nj ligature is not the same as that of the jn ligature.
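
The count can be checked by enumerating the combinations. The slide does not name the letters or the two contextual forms, so the labels below are placeholders.

```python
from itertools import product

# Placeholder labels: 5 letters shaped like n, 4 letters shaped like j,
# and the two contextual forms each ligature can take.
n_like = [f"n{i}" for i in range(1, 6)]
j_like = [f"j{i}" for i in range(1, 5)]
contextual_forms = ["form-1", "form-2"]

# Pairs sharing one skeleton, then every (first letter, second letter, form) variant.
pairs = list(product(n_like, j_like))
ligatures = list(product(n_like, j_like, contextual_forms))

print(len(pairs), len(ligatures))
```

Because the pairs are ordered, the jn ligatures would form a separate set of the same size.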

Possible Combinations of Ligatures
• The possible ligature combinations are too numerous to all be drawn in advance, and many of them would never occur in real text.
• 16,000 possible ligature combinations have been identified in Urdu so far. (Thanks to Mirza Jamil Ahmed)

Implementation of Ligatures
• Separate the basic skeleton of the ligature from the surrounding dots.
• Assemble each ligature instance dynamically as needed.

Font Busting
• The software first represents the letter sequence nj by a ligature skeleton plus a dot for the n and a dot for the j.
• It then computes the contextual joining form of the skeleton and joins the skeleton into the word.
• Finally, it places the dots at the appropriate "attachment points" stored with the skeleton.
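
The following Python sketch shows one way the skeleton-plus-dots idea could be represented. Everything here is hypothetical: the shape names, the coordinates of the attachment points and the function `bust` are invented for illustration, and only three letters are covered (n and b share a base shape, j has its own).

```python
from dataclasses import dataclass


@dataclass
class Skeleton:
    name: str
    # For each letter slot in the ligature: where dots above or below may attach.
    attachment_points: list


# Dotless base shape and dot pattern per letter (transliterated as on the slides).
BASE_SHAPE = {"n": "tooth", "b": "tooth", "j": "hook"}
DOTS = {"n": (1, "above"), "b": (1, "below"), "j": (1, "below")}

# One skeleton per sequence of base shapes; coordinates are made-up font units.
SKELETONS = {
    ("tooth", "hook"): Skeleton(
        name="tooth+hook",
        attachment_points=[
            {"above": (8, 10), "below": (8, -2)},
            {"above": (3, 6), "below": (3, -4)},
        ],
    ),
}


def bust(letters):
    """Split a letter sequence into a skeleton name and positioned dot marks."""
    shapes = tuple(BASE_SHAPE[c] for c in letters)
    skeleton = SKELETONS[shapes]
    marks = []
    for slot, c in enumerate(letters):
        count, side = DOTS[c]
        marks.append((count, side, skeleton.attachment_points[slot][side]))
    return skeleton.name, marks
```

Here `bust("nj")` and `bust("bj")` reuse the same skeleton and differ only in where the first dot is attached, which is why a single drawn skeleton can serve a whole family of ligatures.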

A Small Problem with Ligatures
• Two ligatures may need to be joined one after the other when the last letter of the first ligature is the same as the starting letter of the second.
• It is then unclear which ligature to select.

Space (Sp-31)
• An invisible character can be typed adjacent to a normal character to trick it into assuming a joining form, and it can be typed between the s and m of smA in order to break up the automatic formation of the sm ligature.
• The invisible character in Urdu is known as a Space (Sp-31). It acts as a break in connected Urdu words. An explicit space is achieved with the Hard Space (Hs-65).
• In Unicode we have the zero-width joiner (U+200D).
• The availability of this simple override allows users to rely on the automatic formatting algorithm while still retaining final control over the appearance of the text.
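
In Unicode text the two overrides are ordinary characters: the zero-width joiner (U+200D) requests a joining form, and the zero-width non-joiner (U+200C) prevents a connection. The sketch below shows them being inserted and removed; the helper names are illustrative, the transliterated string `smA` stands in for the Urdu letters, and no mapping between these code points and Sp-31 or Hs-65 is implied.

```python
import unicodedata

ZWJ = "\u200D"   # requests a joining form from its neighbor
ZWNJ = "\u200C"  # prevents a connection (and hence a ligature) across it


def break_join(text, index):
    """Insert a non-joiner before text[index], e.g. between s and m of 'smA'."""
    return text[:index] + ZWNJ + text[index:]


def force_joining_form(ch):
    """Follow a letter with a joiner so that it takes a joining form on its own."""
    return ch + ZWJ


def strip_overrides(text):
    """Remove both invisible characters, e.g. before comparing or searching text."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


broken = break_join("smA", 1)
print([unicodedata.name(c) for c in broken if c in (ZWJ, ZWNJ)])
```

Since the overrides are invisible, stripping them before comparison keeps a search for `smA` from missing the version in which the user broke up the ligature.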

Write English Along with Urdu
• Automatic mixed directional layout
• Directionality variable
  • Downstream (dominant text)
  • Upstream (insertion text)
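
A rough sketch of mixed directional layout for a line whose dominant (downstream) direction is right to left. It is a deliberately simplified run-based scheme, not the Unicode Bidirectional Algorithm: characters are classed with Python's `unicodedata.bidirectional`, Latin letters and digits form upstream runs, and neutral characters such as spaces join an upstream run only when enclosed by upstream text on both sides. The function name `layout_line` is made up.

```python
import unicodedata
from itertools import groupby

UP, DOWN = "upstream", "downstream"


def layout_line(logical):
    """Convert stored (typing) order into left-to-right display order for an RTL line."""
    kinds = []
    for ch in logical:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("L", "EN", "AN"):   # left-to-right letters and numbers
            kinds.append(UP)
        elif bidi in ("R", "AL"):       # right-to-left letters
            kinds.append(DOWN)
        else:                           # spaces, punctuation: decided below
            kinds.append(None)

    # Resolve each run of neutrals from the strong characters around it.
    i = 0
    while i < len(kinds):
        if kinds[i] is not None:
            i += 1
            continue
        j = i
        while j < len(kinds) and kinds[j] is None:
            j += 1
        before = kinds[i - 1] if i > 0 else DOWN
        after = kinds[j] if j < len(kinds) else DOWN
        fill = UP if before == after == UP else DOWN
        kinds[i:j] = [fill] * (j - i)
        i = j

    # Downstream runs are mirrored; upstream runs keep their internal order.
    runs = []
    for kind, group in groupby(zip(logical, kinds), key=lambda pair: pair[1]):
        run = "".join(ch for ch, _ in group)
        runs.append(run if kind == UP else run[::-1])
    # The runs themselves are laid out from the right edge toward the left.
    return "".join(reversed(runs))
```

The stored text is never modified: an Urdu word, then `abc def`, then another Urdu word is laid out with the first Urdu word at the right edge, the English phrase readable left to right in the middle, and the last Urdu word at the left edge.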

Consequences of Automatic Layout
• The directionality variable controls the direction of the text.
• The text is stored in the backward direction.
• Searching algorithms
• Word-wrapping algorithms