Mahesh D. Kulkarni, Group Coordinator, C-DAC GIST. 4th August 2006. Venue: Hotel Raddison, Noida

WELCOME

Indian Language Domain Name Registration “Issues and Solutions”

Background
• Social and economic growth is catalyzed by the presence of the Internet.
• Development of the Internet has mainly been in English.
• Domain names use only the 26 unaccented Latin letters, the 10 digits (0-9), the hyphen and the dot.
• For the proliferation and preservation of heritage and culture, and for content creation in multiple languages, it is essential to have domain names in multilingual scripts.

Background
• User enters IDN: www.(non-ASCII characters)
• Application (such as a browser) converts it to ASCII Compatible Encoding (ACE): www.xn--3b7vcv67.com
• Registry entry: xn--3b7vcv67.com (ASCII characters)
• Labels shown on the slide: xn--e2br9czb, पेड़, आई, xn--m1be
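The conversion step above can be sketched with Python's built-in "punycode" codec. This is a minimal sketch, not a full IDNA implementation: it assumes the label has already been prepared (it skips NamePrep), and the function names to_ace, from_ace and domain_to_ace are hypothetical names chosen for illustration.

```python
ACE_PREFIX = "xn--"

def to_ace(label: str) -> str:
    """Encode one label as an ACE label; ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    # The stdlib "punycode" codec implements the RFC 3492 encoding only;
    # NamePrep (RFC 3491) is assumed to have been applied beforehand.
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def from_ace(label: str) -> str:
    """Decode an ACE label back to its Unicode form."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

def domain_to_ace(domain: str) -> str:
    """Encode each dot-separated label of a domain name independently."""
    return ".".join(to_ace(part) for part in domain.split("."))

if __name__ == "__main__":
    name = "\u092d\u093e\u0930\u0924"  # the Devanagari word for "India"
    ace = to_ace(name)
    print(ace, "->", from_ace(ace))
```

The registry only ever sees the ASCII form; the browser is responsible for converting in both directions.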

Overview:
• India has one of the largest linguistic diversities in the world.
• 4 major language families, at least 35 different languages and around 2000 dialects.
• Languages belong to either the Indo-Aryan (ca. 74%), the Dravidian (ca. 24%), the Austro-Asiatic (Munda) (ca. 1.2%) or the Tibeto-Burman (ca. 0.6%) families. Some of the languages of the Himalayas are still unclassified.
• India has 22 scheduled languages, and English continues to be the “associate additional official language”.
• The following scripts will be most needed: Assamese, Bangla, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu.

One script :: many languages
• Devanagari: Hindi, Marathi, Konkani, Rajasthani, Sindhi, Nepali, Dogri, Santhali, etc.
• Thus the Devanagari code page can support all languages that use that script.
• Solution:
• Though the contents would reveal the language used, it would be ideal if a special attribute code indicating the language were inserted.

One language :: many scripts
• Konkani is written in Roman, Devanagari, Malayalam and Kannada.
• Sindhi is written in Gurmukhi (Punjabi), Arabi (Perso-Arabic), Devanagari, Gujarati and also Roman.
• Sindhi speakers have adopted the Perso-Arabic script for representing their language. In the case of Konkani, Devanagari is used as the official script.
• Hence it is proposed that the same formula be used when attributing IDNs.
• However, nothing stops a client from wanting his IDN in all the scripts. This can be catered for efficiently by providing a broad-based transliteration facility which would transliterate a name from one Indian script to another.
• Thus a Konkani domain name in Devanagari could be transliterated into Kannada, Malayalam and Roman.
• Solution:
• The best solution is one reached by way of linguistic or political consensus.

The solution: A tool for transliteration from one Indian script to another can be easily deployed. The transliterated data could be presented to the client, who could verify the transliteration and see if it meets his approval. If so, the IDN could be registered in all possible scripts.
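A transliteration tool of this kind can be sketched as follows. The sketch relies on the fact that the Unicode blocks for the major Brahmi-derived scripts share a parallel layout, so a character can be moved between scripts by shifting its code point. It is an illustration under that assumption only: the table name BLOCK_START and the function transliterate are hypothetical, the offset range is deliberately restricted, and letters that have no counterpart in the target script are not handled, which is why the result still needs the client's verification.

```python
# Start of each 128-code-point Unicode block (Roman is not covered here).
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}

# Only shift signs, vowels, consonants, matras and the halanta (offsets
# 0x01..0x4D); dandas, digits and script-specific extras are left alone.
FIRST_OFFSET, LAST_OFFSET = 0x01, 0x4D

def transliterate(text: str, source: str, target: str) -> str:
    """Naively map characters of `source` script onto `target` script."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        offset = ord(ch) - src
        if FIRST_OFFSET <= offset <= LAST_OFFSET:
            out.append(chr(dst + offset))
        else:
            out.append(ch)  # not a mappable character of the source script
    return "".join(out)

if __name__ == "__main__":
    word = "\u0915\u092e\u0932"  # Devanagari ka + ma + la
    for script in ("kannada", "malayalam", "gujarati"):
        print(script, transliterate(word, "devanagari", script))
```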

Alternate mechanism
• ACE, i.e. ASCII Compatible Encoding.
• This is intimately tied to NamePrep (RFC 3491) and PunyCode (RFC 3492) as well as to StringPrep (RFC 3454).
• NamePrep prepares an IDN string, which PunyCode then encodes as 7-bit ASCII data for storage.
• We would like to make a case for the use of ISCII 91 as a parallel code for Brahmi-based scripts.
• ISCII deploys the same encoding for all Brahmi-based scripts.
• The advantage of this is obvious: storage in ISCII will allow the IDN system to transliterate a name on the fly into any Indic script and thereby ensure, at the PunyCode level itself, that a name allotted in one script is also automatically allotted in the other scripts to the same owner. This does away with name squatting across Indic scripts, which would otherwise be a regular feature of IDN allocation in Indic scripts.
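The idea of a single script-independent stored form can be illustrated as below. This sketch does not use ISCII itself: as a stand-in, it assumes the parallel layout of the Unicode Indic blocks and reduces every character to its offset within its block. The function name script_neutral_key is hypothetical.

```python
INDIC_FIRST = 0x0900  # start of the Devanagari block
INDIC_LAST = 0x0D7F   # end of the Malayalam block
BLOCK_SIZE = 0x80     # each of the nine blocks is 128 code points wide

def script_neutral_key(label: str) -> tuple:
    """Return a comparison key that is the same for a name in any Indic script.

    Characters inside the contiguous Devanagari..Malayalam range are reduced
    to their offset within their own block; all other characters are kept
    as-is, tagged so they cannot collide with an offset.
    """
    key = []
    for ch in label:
        cp = ord(ch)
        if INDIC_FIRST <= cp <= INDIC_LAST:
            key.append(("indic", cp % BLOCK_SIZE))
        else:
            key.append(("other", cp))
    return tuple(key)

if __name__ == "__main__":
    devanagari = "\u0915\u092e\u0932"
    kannada = "\u0c95\u0cae\u0cb2"
    # Equal keys mean the second registration would be treated as the same name.
    print(script_neutral_key(devanagari) == script_neutral_key(kannada))
```

A registry that indexes names by such a key could refuse, or reserve for the same owner, a name that is already allotted in another script.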

IDN & THE PROBLEM OF ALLOTTING NAMES • The IDN server which will attribute the domain names is to be automated and hence it is of vital interest that a mechanism of checks and counter-checks be set up to ensure the highest level of security. • Two major issues are at stake. These issues are mainly specific to Indian scripts and the complex nature of their visual rendering.

PROBLEM 1: DOUBLETS The first is the need to ensure that doublets are avoided. Doublets are IDNs which are nearly alike, either as homophones or as close homographs. Thus spelling Maharashtra as महाराष्ट्र माहाराष्ट्र माहराष्ट्र can lead to identity confusion: since the three spellings are all different, the server would attribute all three names as valid IDNs, whereas the original client would not want his IDN to be misused.

PROBLEM 2: SECURITY ISSUES More serious is the willful use of such tactics to perpetrate fraud by misleading a user into believing that he has logged on to a bona fide site, and thus persuading the user to divulge information such as his credit card number.

UNDERLYING THESE PROBLEMS AND ISSUES ARE THREE MAJOR POTENTIAL SECURITY HOLES • HOMOPHONES AND HOMOGRAPHS • SPELLING VARIANTS • SPELLING ERRORS • Each of these will be studied in relation to their pertinence to ensuring maximal security

Homophones and Homographs • These are aural and visual look-alikes and given the phonetic nature of Indian scripts are a potential source of confusion. • A typology of these has been established: • VISUAL LOOK ALIKES • AURAL LOOK ALIKES

Homophones and Homographs Visual Look-Alikes-1 TWO LIGATURES HAVING PRACTICALLY THE SAME FORM Devanagari द्ध ध्द The first ligature is a Half da+ Full dha, the second is a half dha followed by a full da. To an average reader of Hindi, the two forms look practically alike and lead to confusion. A similar situation arises in the case of Gujarati ક કલ ક્લ The first is ka+la The second is ka+halanta+la

Homophones and Homographs Visual Look-Alikes-2 AMBIGUITIES ARISING OUT OF POSSIBLE UNICODE VARIANTS. This is best seen in the case of Nukta characters, which can be generated in two different ways: क़ (U+0958) and क़ (U+0915 U+093C); ख़ (U+0959) and ख़ (U+0916 U+093C); ड़ (U+095C) and ड़ (U+0921 U+093C). In each pair, the first is a single character whereas the second is made up of two characters: the consonant followed by the dot or nukta character. To the naked eye the two look alike, whereas for the machine these would be two different IDNs.
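One way to remove this particular ambiguity is Unicode normalization, sketched below with Python's standard unicodedata module. The precomposed Devanagari nukta letters are excluded from composition in Unicode, so NFC normalization turns both spellings into the consonant followed by the nukta. The function name canonical_label is a hypothetical name for illustration.

```python
import unicodedata

def canonical_label(label: str) -> str:
    """Bring a label to Unicode Normalization Form C before comparing it."""
    return unicodedata.normalize("NFC", label)

if __name__ == "__main__":
    precomposed = "\u0958"        # single-character QA
    decomposed = "\u0915\u093c"   # KA followed by the nukta sign
    print(precomposed == decomposed)                                     # raw strings differ
    print(canonical_label(precomposed) == canonical_label(decomposed))  # normalized forms match
```

Comparing normalized labels at registration time would make the two spellings resolve to one IDN.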

Homophones and Homographs
• Visual Look-Alikes-3
• SIMILAR LOOKING CHARACTERS WITHIN THE SAME CODE-PAGE:
• Within a code-page two characters can look practically alike and create ambiguity. This is especially the case when the font enabled on the client machine is not of high quality; given the size of the characters (normally 10 point), this can lead to confusion. Some examples are given below:
• Devanagari: ङ ड, र ऱ, ऩ न, ॆ े, ॊ ो
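Look-alikes of this kind can be caught by folding each confusable character onto one representative before names are compared. The sketch below uses only the five Devanagari pairs listed on the slide; the choice of which member of a pair is the representative is arbitrary, and the names CONFUSABLE and visual_skeleton are hypothetical.

```python
# Map the less common member of each look-alike pair onto the other one.
CONFUSABLE = str.maketrans({
    "\u0919": "\u0921",  # NGA        -> DDA
    "\u0931": "\u0930",  # RRA        -> RA
    "\u0929": "\u0928",  # NNNA       -> NA
    "\u0946": "\u0947",  # short E sign -> E sign
    "\u094a": "\u094b",  # short O sign -> O sign
})

def visual_skeleton(label: str) -> str:
    """Return a form of the label in which visual look-alikes are merged."""
    return label.translate(CONFUSABLE)

def looks_alike(a: str, b: str) -> bool:
    """True when two labels differ at most by look-alike characters."""
    return visual_skeleton(a) == visual_skeleton(b)

if __name__ == "__main__":
    print(looks_alike("\u0930\u093e\u092e", "\u0931\u093e\u092e"))  # RA vs RRA as first letter
```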

Homophones and Homographs Visual Look-Alikes-4 IDENTICAL CHARACTERS IN UNICODE This is the case with a glyph shared by Urdu and Sindhi. Character U+06A9 is the letter /keheh/ in Urdu, whereas the same symbol in Sindhi represents /kheheh/. Since both fall within the same code page, aural disambiguation is impossible without recourse to the language used.

Homophones and Homographs
• Aural Look-Alikes: Homophones
• Indian languages being phonetic in nature, aural representation is a major issue.
• These problems mainly arise out of the fact that Indian languages are generally typed as they are spoken. Very often they arise out of
• spelling variants and/or
• the ignorance of the user as to the correct spelling of the word.
• A large number of sub-types of problems can emerge from such homophonic representations.

Homophones and Homographs
• Aural Look-Alikes: Homophones-1
• Confusion between the two nasal modifiers (wherever such nasal modifiers exist): Hindi ं ँ Gujarati ં ઁ
• Confusion between two or more similar sounding consonants (normally dental vs. retroflex, for sibilants and laterals): Marathi श ष Gujarati લ ળ
• Confusion arising out of short and long vowels: Tamil ே ெ Gujarati ી િ Hindi ु ू

Homophones and Homographs Aural Look-Alikes: Homophones-2 Absence or presence of a halanta. This is a source of errors even among educated speakers of the language. Proper names tend to be written at times with or without the halanta. Thus the name Shirke in Marathi can be written in the following two ways of which the first is correct, the second not normatively valid but could be accepted: शिर्के शिरके Confusion arising out of the use of the rakar+ “u” matra instead of the ऋ vowel form: क्रुपा vs. कृपा

Homophones and Homographs Aural Look-Alikes: Homophones-3 A remote source of error would be the use of the Visarga or Vowel lengthener to modify an IDN. The Visarga is mainly used in Sanskrit and very rarely in neo Indian Aryan languages. However an IDN with or without the Visarga could create ambiguity. दुखः दुःख

Homophones and Homographs Aural Look-Alikes: Homophones-4 Insertion of a zero width character (ZWJ/ZWNJ) within the name string: शिर‍के शिरके The first has no non-joiner, the second has a non-joiner. Visually both look alike and can lead to confusion.
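A simple guard against this is to remove zero width joiners and non-joiners when building the key used to compare names. The sketch below is for comparison only; it assumes the registered name itself keeps its joiners, since they can legitimately change rendering. The name strip_joiners is hypothetical.

```python
# U+200C is ZERO WIDTH NON-JOINER, U+200D is ZERO WIDTH JOINER.
ZERO_WIDTH = {0x200C: None, 0x200D: None}

def strip_joiners(label: str) -> str:
    """Drop ZWJ/ZWNJ so that visually identical labels compare equal."""
    return label.translate(ZERO_WIDTH)

if __name__ == "__main__":
    plain = "\u0936\u093f\u0930\u0915\u0947"
    with_joiner = "\u0936\u093f\u0930\u200d\u0915\u0947"
    print(plain == with_joiner)                                # raw strings differ
    print(strip_joiners(plain) == strip_joiners(with_joiner))  # keys match
```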

SUB-TYPE II: SPELLING VARIANTS This is best seen in the case of Hindi, where a nasal modifier can substitute for the corresponding half nasal consonant. The word Hindi itself can be written either as: हिन्दी हिंदी Obviously two IDNs based on these spelling variants should not be allowed; they must be resolved to the same norm. A similar situation exists in Marathi in the use of ं (timba) vs. the े /e/ vowel modifier. The first is used in colloquial Marathi under special environments whereas the second is the literary form. A filter which would normalize the two would have to be written. तुझं तुझे Other languages and scripts display similar patterns.
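A filter for the Hindi case could look like the sketch below, which rewrites a half nasal consonant (nasal plus halanta) before another consonant as the nasal modifier. It is deliberately naive and rests on an assumption: a real rule would only fold a nasal that matches the class of the following consonant, so geminates and other legitimate clusters would need exceptions. The name fold_half_nasal is hypothetical.

```python
import re

NASALS = "\u0919\u091e\u0923\u0928\u092e"  # the five nasal consonants
HALANTA = "\u094d"
ANUSVARA = "\u0902"

# A nasal + halanta, but only when a consonant (U+0915..U+0939) follows.
HALF_NASAL = re.compile(f"[{NASALS}]{HALANTA}(?=[\u0915-\u0939])")

def fold_half_nasal(word: str) -> str:
    """Resolve the half-nasal spelling to the nasal-modifier spelling."""
    return HALF_NASAL.sub(ANUSVARA, word)

if __name__ == "__main__":
    with_half_nasal = "\u0939\u093f\u0928\u094d\u0926\u0940"  # "Hindi" with half na
    with_anusvara = "\u0939\u093f\u0902\u0926\u0940"          # "Hindi" with nasal modifier
    print(fold_half_nasal(with_half_nasal) == with_anusvara)
```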

More examples

SUB-TYPE III SPELLING ERRORS • These whether conscious or unconscious could create homographic doublets and need to be detected in order to ensure that the client does not have a spurious IDN competing with his real IDN. Misspellings of words, introversions can all lead to IDN doublets. • A good example is words in Hindi which have Urdu roots and which can admit spellings without Halanta (Urdu norm) and with halanta (Hindi aural norm)

2. PROPOSED RECOMMENDATIONS

Proposed Recommendations
• An action plan has been proposed for ensuring maximum security in the allotment of IDNs in Indian scripts.
• This is in the shape of recommendations arising out of discussions.
• The recommendations are both specific and generic in nature.

Proposed Recommendations: GENERIC STRATEGIES-1
• Creation of Levels:
• Four Levels are provided:
• Level 1: Highest security
• Level 2: Government bodies and institutions (banks, insurance, healthcare, etc.)
• Level 3: Corporates and NGOs
• Level 4: All other users

Proposed Recommendations: GENERIC STRATEGIES-2 The implementation should be tested in TESTBED mode and IDNs should be allotted in a phased manner: Level 1 (Highest security) and Level 2 (Government bodies and institutions) should be permitted to register in the test bed mode. This will also have the advantage of automatically blocking all attempts by “spoofers” and “hackers” to squat on such names. Names belonging to Levels 1 and 2 should be automatically denied to other users. At this stage the automated software for providing variants based on visual and homophonic identities should be set in place.

Proposed Recommendations: GENERIC STRATEGIES-2
• Subsequently Level 3, i.e. corporates and NGOs, should be allowed to register. The software will generate all possible variants of their names as per the rules of the language, and these can be proposed to them. If they so desire they can register all these variants or keep them open, after being overtly warned that leaving them open could lead to spoofing.
• Level 4 can be integrated at the end.
• Phased allotment of IDNs will to a large extent eradicate spoofing and phishing and ensure maximal security.
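The variant-generating software mentioned above can be sketched as a substitution over classes of interchangeable characters. The three classes below are taken from the homophone examples given earlier for Hindi and Marathi and are only a tiny illustrative rule set; the names CLASSES and variants are hypothetical, and a real tool would use the rule lists prepared by the language committees.

```python
from itertools import product

# Each tuple lists characters that users tend to interchange.
CLASSES = [
    ("\u0901", "\u0902"),  # the two nasal modifiers
    ("\u0936", "\u0937"),  # the two similar sounding sibilants
    ("\u0941", "\u0942"),  # short and long u vowel signs
]
ALTERNATIVES = {ch: cls for cls in CLASSES for ch in cls}

def variants(name: str) -> set:
    """Return every spelling obtained by swapping interchangeable characters.

    The result always contains `name` itself. The number of variants grows
    exponentially with the number of substitutable characters.
    """
    choices = [ALTERNATIVES.get(ch, (ch,)) for ch in name]
    return {"".join(combo) for combo in product(*choices)}

if __name__ == "__main__":
    name = "\u0936\u0941\u092d"  # sha + short u sign + bha
    for v in sorted(variants(name)):
        print(v)
```

The generated list is what would be shown to a Level 3 applicant, who can then decide which variants to register or leave open.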

Proposed Recommendations: SPECIFIC ISSUES
1. Code pages of two scripts should not be mixed.
2. As far as possible, numbers (digits) should not be used, unless they acquire a linguistic value such as 365, 24/7 etc. Domain names are not like mail applications, where the name can be followed by a digit.
3. Punctuation marks should be avoided as far as possible. These can also result in confusion, as is the case of the eyelash repha in Marathi: -या र्‍या
4. Although under ideal circumstances correct spelling would be the norm, the first instance of a name registered, even if it is incorrect, would be deemed as registered, and all further variants generated by the software, including the correct one, would be reserved or permitted as per the wish of the sanctioning authority.
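The rule against mixing scripts can be checked automatically. Python's standard library does not expose the Unicode script property, so the sketch below approximates a character's script by the first word of its Unicode character name; this heuristic, the NEUTRAL set and the function names are all assumptions made for illustration.

```python
import unicodedata

# Characters treated as script-neutral: ASCII digits, hyphen, ZWNJ and ZWJ.
NEUTRAL = set("0123456789-\u200c\u200d")

def scripts_used(label: str) -> set:
    """Approximate the set of scripts in a label from Unicode character names.

    For example "DEVANAGARI LETTER KA" yields "DEVANAGARI" and
    "LATIN SMALL LETTER A" yields "LATIN".
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch not in NEUTRAL
    }

def is_single_script(label: str) -> bool:
    """True when all non-neutral characters come from one script."""
    return len(scripts_used(label)) <= 1

if __name__ == "__main__":
    print(is_single_script("\u0915\u092e\u0932"))   # Devanagari only
    print(is_single_script("\u0915\u092e\u0932a"))  # Devanagari mixed with Latin
```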

Proposed Recommendations: SPECIFIC ISSUES-2 5. The whole process to be automated by means of a software which will ensure to the highest degree that the “security holes” are not breached. Given that there would be a large number of applications and that manual processing would not be possible and if possible would result in inordinate delays, automation is a pre-requisite.

Action Plan-1
• Identification of potential zones: potential zones for ensuring security were identified.
• These are:
• Creation of variant lists
• List of potential spelling variants
• List of potential zones of error in terms of misspellings which are not trapped by the variants list.

Explanatory documents and Templates for each of the desired data were provided by CDAC GIST to the concerned • The templates gave examples for each type of requirements in the sample template below:

Report-1
• C-DAC, Pune has been entrusted with the creation of data for three languages: Hindi, Marathi and Urdu.
• As per the agreement, expert committees have been appointed for all three languages. The experts are professors and professionals working in the publishing industry, since these have the linguistic skills and know-how to investigate and create the required data.
• A translation of the three-letter extensions of the names has also been provided. To ensure across-the-board intelligibility, this is in Sanskrit.
• In the slides that follow, samples of the quantum of work accomplished in each of the languages are detailed.

Report-2 Translation of IDN extensions: a sample
1) EDU विद्या
2) GOV सर्वकार
3) IN भारत
4) COM वाणिज्य
5) ORG संस्थन, प्रतिष्ठान
6) MIL स्थल-सेना
7) RES गवेषणा
8) AC शैक्षणिक
9) TRAVEL यात्रा
10) MOBI जंगम
11) NET जाल
12) INT आंताराष्ट्रिय
13) MED औषध
14) AGRI कृषि

Report-1: Marathi • In the case of Marathi, a committee headed by Shri Phadake who has books on “shuddha-lekhan” to his credit has been appointed. • Work has commenced on all the three areas: • Variants list • Spelling Variants • Erroneous Spellings • A large number of rules have been generated and so is the data on spelling variants and misspellings

Report-1: Marathi : Sample image of Variants list

Report-1: Marathi Sample image of Multiple spellings And misspellings

Report -2 Hindi A similar exercise has been carried out for Hindi. Sample files are provided below. Over 100 different rule variants have been identified.

Report -2 Hindi Spelling variants and misspellings for Hindi: over 300 collected at present

Report -3 Urdu • Under the able guidance of Prof Yunus Fahmi, spelling variants, misspellings and variant lists are being created. • Some sample files for variant list and spellings variants are appended

Report -3 Urdu Urdu spelling Variants (over 280 in number)

Report -3 Urdu Urdu spelling Variants in PASCII (over 280 in number)

List of Official languages of India

Nurturing living languages T H A N K Y O U
