Challenges in Testing Multilingual Databases

Gaurav Luthra, NishaBanu Anwar, Swetha Konduru

Abstract
Growing demand for international products has widened markets across the globe. For a product to enter and grow in global markets, cultural differences, language differences and technical requirements must be taken into account. To meet these requirements, organizations adopt Internationalization and Localization for a specific region, country or language. This lets local customers benefit from a globally available product or software. For example, suppose an American bank needs to launch a new application in a country such as China or Germany, where the majority of the population depends on the national language and English usage is minimal. The application then has to be launched in the local language so that it is user friendly, which increases customer satisfaction and revenue for the organization. Although globalization is already in practice, translating and storing data that comes from data sources in different languages remains a challenge. This is done through Unicode, which allows data from any language to be stored in a database in a single character set.

Abstract (cont..)
• The data that is translated, stored in the database and fed back to an application via Unicode has to be validated to ensure its accuracy and quality.
• This document discusses the challenges encountered while validating data that flows from an application to a database for data sources in various languages.
[Figure: population in billions, by popular Internet language]

General background
• Internationalization and Localization in Banking
   • Globalization has come into existence because of recent developments in international financial markets.
   • Internationally active banks have acquired significant market power, and with the increasing number of foreign branches and foreign assets, such global activities make markets riskier for banks to sustain.
   • Moreover, banks with higher shares of foreign investments run through foreign branches have higher market power in the local nation.
   • However, the entire local population cannot be assumed to be comfortable with the English language.
   • This calls for banks to adopt Internationalization and Localization in their business processes.
• Internationalization
   • Internationalization is the process of enabling an application for users who are located in different nations and use different languages.
• Localization
   • Localization is the process of customizing an application for users in a specific location or a specific country.
• Need for testing
   • As businesses and markets cross boundaries to grow, the volume of end users and their data increases rapidly, and so does the risk involved in storing and keeping this data.
   • A proper check therefore has to be maintained to keep all this data safe and secure. This is where testing starts: it checks the quality and consistency of the data flow.

Why this Paper?
• Data warehouse testing is in itself very exhaustive because of the large amount of data involved.
• Its complexity increases when the data to be tested comes from data sources in different languages.
• Care has to be taken that data is not misinterpreted because of the different encodings used.
• Testing data from multilingual sources is therefore both critical and challenging.
• A few of the challenges are listed below:
   • Use of the correct version of Unicode by a database that supports different languages.
   • Correctness of data loaded into a database from different applications using different languages, i.e. the data entered via the front-end application is the same as the data stored in the DB.
   • Data lost during data migration because of improper use of non-Unicode data types.
   • Data misinterpretation caused by different encoding versions during file transactions.

Prerequisites for Testing
• The tester should be familiar with the language for which data is to be tested.
• Check whether the product is locale aware.
• Identify the database that needs to be tested for a particular application.
• Check whether proper standards are being followed for data storage.
• Language translator software should be available to the tester.
• Requirements related to back-end testing should be clear to the tester.

Proposed Solutions

Validating the Version of Unicode used by a database for supporting different languages
• Financial institutions and banks use many different databases to hold their data, and each database supports different Unicode properties.
• Before testing begins, the database must be checked for the Unicode properties it supports.

Validating the Version of Unicode used by a database for supporting different languages (Contd..)
• For example, when a tester validates Japanese data in a Sybase database, a few of the mismatches detected are:
   • content getting truncated
   • mismatch in the currency format
   • mismatch in the address
   • population of nulls, symbols, etc.
• This happens because different Unicode encodings store the characters of a script in different numbers of bytes, so an encoding or column size that suits one script may not suit another.
• It is beneficial to use UTF-8 for storing/retrieving European scripts and UTF-16 for storing/retrieving Asian scripts.
• Most Japanese characters (an Asian script) take 2 bytes in UTF-16 but 3 bytes in UTF-8, so it is beneficial to use a UTF-16 character set rather than UTF-8 for storing/retrieving Japanese content from the database.
• Keeping the above issues in mind, the tester needs to check which Unicode database character set and which Unicode data type are best suited to the particular language.
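The storage point above can be checked with a small Python sketch. It is only an illustration: the sample strings and the 6-byte column limit are assumptions chosen for the example, not values from a real schema, and the helper names are hypothetical.

```python
# Illustrative sample strings (assumptions, not data from a real system).
SAMPLES = {
    "English": "Tokyo",
    "German": "M\u00fcnchen",  # "Munchen" written with u-umlaut (U+00FC)
    "Japanese": "東京都",
}


def encoded_sizes(text: str) -> dict:
    """Return the character count and the byte counts in UTF-8 and UTF-16."""
    return {
        "chars": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }


def fits_column(text: str, byte_limit: int, encoding: str) -> bool:
    """True if the encoded text fits in a column limited to byte_limit bytes."""
    return len(text.encode(encoding)) <= byte_limit


if __name__ == "__main__":
    for language, text in SAMPLES.items():
        print(language, encoded_sizes(text))
    # Hypothetical column that can hold 6 bytes.
    print("UTF-16 fits:", fits_column(SAMPLES["Japanese"], 6, "utf-16-le"))
    print("UTF-8 fits:", fits_column(SAMPLES["Japanese"], 6, "utf-8"))
```

A column sized in bytes for one encoding can therefore truncate the same text when another encoding is used, which is one possible cause of the truncation mismatch listed above.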

Validating the Correctness of data loaded in the DB
• Data fed in via the front end is validated against the data retrieved from the database at the back end.
• The data is verified first in the same language in which it is displayed in the application, and then in English.
• After the data has been validated for accuracy, the common language validation procedure is followed:
   • Check whether the database support features include language support and territory support.
   • Check the database schema and the character set of the table.
   • Test for culture awareness.
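The front-end versus back-end comparison can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `compare_values` and its result labels are hypothetical, and treating values that differ only in Unicode normalization form as equal is an assumption that a project may or may not accept.

```python
import unicodedata
from typing import Optional


def compare_values(front_end: str, stored: Optional[str]) -> str:
    """Classify how a stored value differs from the value entered in the front end.

    The labels returned here are hypothetical; they mirror typical mismatches
    such as nulls, truncation and substituted symbols.
    """
    if stored is None:
        return "null in database"
    # Assumption: values that differ only in normalization form count as equal.
    entered = unicodedata.normalize("NFC", front_end)
    found = unicodedata.normalize("NFC", stored)
    if entered == found:
        return "match"
    if entered.startswith(found):
        return "truncated"
    # More "?" or U+FFFD characters than were entered suggests a lossy conversion.
    if found.count("?") > entered.count("?") or "\ufffd" in found:
        return "replacement characters"
    return "mismatch"


if __name__ == "__main__":
    print(compare_values("東京都", "東京都"))
    print(compare_values("東京都", "東京"))
    print(compare_values("東京都", "???"))
    print(compare_values("東京都", None))
```

In practice the two arguments would come from the application screen and from a query against the table under test; how they are fetched depends on the project and is left out here.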

Validating the Correctness of data loaded in the DB (Contd..)
• Check whether the database support features include language support and territory support.
   • The database should be validated for the particular language, so that users can store, process and retrieve data in their native language.
   • Here the tester needs to look at the "NLS" parameters, which allow a database session to use different cultural settings. E.g. one can set the euro (EUR) as the primary currency and the Japanese yen (JPY) as the secondary currency for a given database session even when the territory is defined as AMERICA.
• Check the database schema and the character set of the table.
   • The schema of the table needs to be checked against the language. It should support Unicode for encoding the different languages and use proper Unicode data types for storing and retrieving data through the database.
   • Validate that the character set is compatible with the database.
   • When a SQL parameter is passed to the database, the data value is converted to the character set that the database uses. For example, in a database that supports only ASCII, Unicode values are converted to ASCII characters.
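The effect of such a character set conversion can be imitated in Python. The sketch below is a simulation only: it uses Python's own ASCII codec with replacement, and a real database may reject the value or substitute characters differently. Both helper names are hypothetical.

```python
def simulate_ascii_store(value: str) -> str:
    """Imitate storing a value in an ASCII-only character set.

    Every character outside ASCII is replaced by "?" (Python's "replace"
    error handler); this stands in for a database's own conversion rules.
    """
    return value.encode("ascii", errors="replace").decode("ascii")


def lost_characters(value: str) -> list:
    """List the characters that have no ASCII representation."""
    return [ch for ch in value if ord(ch) > 127]


if __name__ == "__main__":
    entered = "東京 Tower"
    print(simulate_ascii_store(entered))
    print(lost_characters(entered))
```

A tester can use the same idea in reverse: if a retrieved value contains "?" where the entered value had non-ASCII characters, the character set of the column or session is a likely suspect.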

Validating the Correctness of data loaded in the DB (Contd..)
• Test for culture awareness.
   • The data retrieved needs to be tested for basic formats such as date, address, time and currency.
   • Validation of address format: for example, in English the display order is name, city, state and postal code, but in Japanese the order can be postal code, state, city and name.
   • Validation of date format: verify that the date and time format displayed across the application is based on the client locale, and also validate the date value when it is entered using double-byte numbers.
   • Validation of currency formats: this depends on the locale. E.g. in the American locale a decimal point separates the fractional part of the value, but in the French locale the decimal point is replaced by a comma.
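The three culture-awareness checks above can be sketched in Python. The locale tables below are hand-written assumptions that cover only the examples given above (they are not taken from a locale library), the function names are hypothetical, and the date is assumed to arrive as year-month-day separated by hyphens.

```python
import unicodedata
from datetime import date
from decimal import Decimal

# Hypothetical, hand-written conventions covering only the examples above.
ADDRESS_ORDER = {
    "en_US": ["name", "city", "state", "postal_code"],
    "ja_JP": ["postal_code", "state", "city", "name"],
}
DECIMAL_SEPARATOR = {"en_US": ".", "fr_FR": ","}


def address_fields(address: dict, locale_name: str) -> list:
    """Return the address values in the display order expected for the locale."""
    return [address[field] for field in ADDRESS_ORDER[locale_name]]


def parse_date(text: str) -> date:
    """Parse a year-month-day date that may use double-byte (fullwidth) digits."""
    # NFKC normalization maps fullwidth digits to their ASCII equivalents.
    ascii_text = unicodedata.normalize("NFKC", text)
    year, month, day = (int(part) for part in ascii_text.split("-"))
    return date(year, month, day)


def format_amount(amount: Decimal, locale_name: str) -> str:
    """Format an amount with two decimals and the locale's decimal separator."""
    return f"{amount:.2f}".replace(".", DECIMAL_SEPARATOR[locale_name])


if __name__ == "__main__":
    sample = {"name": "Sato", "city": "Shibuya", "state": "Tokyo",
              "postal_code": "150-0002"}
    print(address_fields(sample, "ja_JP"))
    # The same date written with fullwidth digits (U+FF10 to U+FF19).
    print(parse_date("\uff12\uff10\uff12\uff14-\uff10\uff11-\uff11\uff15"))
    print(format_amount(Decimal("1234.5"), "fr_FR"))
```

A test can then compare the value shown by the application or returned by the database with the value these helpers produce for the client locale.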


Validating Data loss during Data migration due to incorrect Unicode data types
[Figure: data migration from Teradata to Sybase]
• One of the biggest challenges in testing multilingual databases arises during data migration.
• During such a migration the following checks need to be considered:
   • whether the target database supports Unicode characters
   • whether special data types are used
   • which language the data is moving to

Validating Data loss during Data migration due to incorrect Unicode data types (Contd..)

Column            Teradata (SQL Server)   Sybase (SQL Server2000)
Employee number   NUMBER(4)               NUMBER(4)
Employee name     NCHAR(15)               CHAR(15)
Date Of Birth     DATE                    DATE
Salary            NUMBER(8,2)             NUMBER(8,2)
Address           NVARCHAR2(20)           VARCHAR2(20)
Contact Number    NUMBER(12)              NUMBER(12)

• For example, the source (Teradata) uses SQL Server, which defines data types such as nchar, nvarchar and ntext that allow the user to store Unicode text.
• The target database (Sybase) uses SQL Server 2000 and defines the corresponding columns with data types such as char and varchar.
• During such a migration the tester needs to pay special attention to the mapping of the source and target data columns.
• Data is more likely to be lost when different data types are used on the source and target sides, and the table schemas may also be inconsistent.
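The column mapping check described above can be sketched in Python. The column names and data types are copied from the table above; the snake_case keys, the set of types treated as Unicode types and the function names are assumptions made for this illustration.

```python
# Data types assumed to hold Unicode text (assumption for this sketch).
UNICODE_TYPES = {"NCHAR", "NVARCHAR", "NVARCHAR2", "NTEXT"}

# Column definitions taken from the source and target tables shown above.
SOURCE = {
    "employee_number": "NUMBER(4)",
    "employee_name": "NCHAR(15)",
    "date_of_birth": "DATE",
    "salary": "NUMBER(8,2)",
    "address": "NVARCHAR2(20)",
    "contact_number": "NUMBER(12)",
}
TARGET = {
    "employee_number": "NUMBER(4)",
    "employee_name": "CHAR(15)",
    "date_of_birth": "DATE",
    "salary": "NUMBER(8,2)",
    "address": "VARCHAR2(20)",
    "contact_number": "NUMBER(12)",
}


def base_type(type_definition: str) -> str:
    """Strip the length or precision, e.g. 'NCHAR(15)' becomes 'NCHAR'."""
    return type_definition.split("(")[0].strip().upper()


def risky_columns(source: dict, target: dict) -> list:
    """Return (column, reason) pairs where Unicode data could be lost."""
    issues = []
    for column, source_type in source.items():
        target_type = target.get(column)
        if target_type is None:
            issues.append((column, "missing in target"))
        elif (base_type(source_type) in UNICODE_TYPES
              and base_type(target_type) not in UNICODE_TYPES):
            issues.append((column, f"{source_type} -> {target_type}"))
    return issues


if __name__ == "__main__":
    for column, reason in risky_columns(SOURCE, TARGET):
        print(column, reason)
```

The flagged columns are the ones whose migrated values deserve a row-by-row comparison in the target database.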


Validating the encoding versions during file transactions
• Data misinterpretation is a major problem faced while retrieving data from the database.
• Such situations can be avoided by clearly specifying the encoding used for that particular data; otherwise the database uses its default encoding.
• Use of the default encoding by the database may sometimes result in data alteration or misinterpretation, so whichever encoding is to be used should be stated explicitly.
• E.g. UTF-8 is the default encoding for the .NET Framework. If a file encoded in UTF-16 is opened in the .NET Framework without the encoding (UTF-16) being specified, the default encoding (UTF-8) is used, which results in unintelligible text.
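The misinterpretation described above can be reproduced in a few lines of Python. This is a sketch of the general effect rather than of .NET behaviour: the sample text is an assumption, and undecodable bytes are replaced with U+FFFD so that the wrong reading can be displayed instead of raising an error.

```python
def read_with(raw: bytes, encoding: str) -> str:
    """Decode file contents with the given encoding.

    Undecodable bytes become U+FFFD so that a wrong reading is visible
    rather than raising an exception.
    """
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    original = "東京"                 # illustrative sample text
    raw = original.encode("utf-16")  # file contents as written by the sender

    # Reading with the encoding that was actually used restores the text.
    print(read_with(raw, "utf-16") == original)
    # Reading with a default of UTF-8 yields unintelligible text.
    print(repr(read_with(raw, "utf-8")))
```

Recording the encoding alongside each file feed, and passing it explicitly when the file is read, removes the dependence on whatever default the reading side happens to use.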

Common Mistakes
With the solutions to the various challenges known, the following points help a tester who is new to multilingual testing avoid some common mistakes:
• Use standard language translators.
• Know which databases support which languages.
• Be aware of the validation checks for each language, as validation differs from language to language.
• Avoid opening multiple sessions of the same application in different languages at the same time, because the sessions can overlap.

References
• http://www.internetworldstats.com/stats7.htm
• Oracle® Database Globalization Support Guide 10g Release 1 (10.1), Part No. B10749-02, June 2004 (b10749.pdf)
• http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=%2Fcom.ibm.optimd.install.doc%2F01cgintr%2Fopinstall-r-character_formats.html
• Google Translate: http://translate.google.com/?hl=en&tab=wT#
• Infosys Project Experience

Q&A
Nishabanu_anwar@infosys.com
Gaurav_Luthra@infosys.com
Swetha_konduru@infosys.com

Thank You !
