Application Development Guide

National Language Support Considerations

This section describes National Language Support (NLS) issues that you must consider for your applications.

Collating Sequence Overview

Collating Sequences

The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.
Note:Character string data defined with the FOR BIT DATA attribute, and BLOB data, is sorted using the binary sort sequence.

For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.

The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.

Each single-byte character in a database is represented internally as a unique number between 0 and 255 (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character; the assignment of numbers to characters in a set is collectively called a code page. A collating sequence is a mapping between the code point and the desired position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. In the simplest collating sequence, the weights are identical to the code points. This is called the identity sequence.

For example, consider the characters B (X'42') and b (X'62'). If (according to the collating sequence table) they both have a sort weight of X'42' (B), they collate the same. If the sort weight for B is X'9E', and the sort weight for b is X'9D', b will be sorted before B. Actual weights depend on the collating sequence table used, which in turn depends on the code set and locale. Note that a collating sequence table is not the same as a code page table, which defines code points.

Consider the following example. The ASCII characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence in which these characters are sorted consecutively (no intervening characters), you can write: X'41', X'42', ... X'59', X'5A'.

The hexadecimal value of a multi-byte character is also used as the weight. For example, X'8260' and X'8261' are the code points for the double-byte characters A and B. In this case, the collation weights for X'82', X'60', and X'61' are used to sort these two characters according to their code points.

The weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.

Specifying a collating sequence can be simplified if the collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character. This is the method used to specify a collating sequence for the database manager: a string of 256 bytes, where the nth byte (starting with 0) contains the weight of code point n.
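
For illustration, the following C sketch (a hypothetical helper, not part of any DB2 API) builds such a 256-byte sequence in which each lowercase ASCII letter receives the same weight as its uppercase equivalent, while every other code point keeps its identity weight. An array built this way could be supplied in the SQLDBUDC field described under Specifying a Collating Sequence.

   /* Hypothetical sketch: build a 256-byte collating sequence in which
    * lowercase ASCII letters share the weights of the corresponding
    * uppercase letters; all other code points keep identity weights.   */
   static void build_case_insensitive_sequence(unsigned char weights[256])
   {
       int cp;

       /* Start from the identity sequence: the weight of code point n is n. */
       for (cp = 0; cp < 256; cp++) {
           weights[cp] = (unsigned char)cp;
       }

       /* Give 'a'..'z' (X'61'..X'7A') the weights of 'A'..'Z' (X'41'..X'5A'). */
       for (cp = 'a'; cp <= 'z'; cp++) {
           weights[cp] = (unsigned char)(cp - ('a' - 'A'));
       }
   }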

In all cases, DB2 uses the collation table that was specified at database creation time. If you want the multi-byte characters to be sorted the way that they appear in their code point table, you must specify IDENTITY as the collation sequence when you create the database.
Note:For DBCS characters in GRAPHIC fields, the sort sequence is always IDENTITY.

Character Comparisons

Once a collating sequence is established, character comparison is performed by comparing the weights of two characters, instead of directly comparing their code point values.

If weights that are not unique are used, characters that are not identical may compare equally. Because of this, string comparison must be a two-phase process:

  1. Compare the characters in each string based on their weights.
  2. If step 1 yields equality, compare the characters of each string based on their code point values.

If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence, only the second step is performed. In either case, there is a performance benefit.
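
The following C sketch models this two-phase comparison for two strings of equal length; it is illustrative only (the function name is hypothetical, and blank padding of unequal-length strings is ignored).

   #include <stddef.h>

   /* Hypothetical sketch of the two-phase comparison: 'weights' is a
    * 256-byte collating sequence and both strings are 'len' bytes long.
    * Returns <0, 0, or >0, like strcmp().                              */
   static int collate_compare(const unsigned char *s1, const unsigned char *s2,
                              size_t len, const unsigned char weights[256])
   {
       size_t i;

       /* Phase 1: compare by weight. */
       for (i = 0; i < len; i++) {
           if (weights[s1[i]] != weights[s2[i]])
               return (int)weights[s1[i]] - (int)weights[s2[i]];
       }

       /* Phase 2: all weights are equal, so compare by code point. */
       for (i = 0; i < len; i++) {
           if (s1[i] != s2[i])
               return (int)s1[i] - (int)s2[i];
       }
       return 0;
   }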

For more information about character comparisons, refer to the SQL Reference.

Case Independent Comparisons

To perform character comparisons that are independent of case, you can use the TRANSLATE function to select and compare mixed case column data by translating it to uppercase (for purposes of comparison only). Consider the following data:

   Abel
   abels
   ABEL
   abel
   ab
   Ab

The following SELECT statement:

   SELECT c1 FROM T1 WHERE TRANSLATE(c1) LIKE 'AB%'

returns

   ab
   Ab
   abel
   Abel
   ABEL
   abels

You could also use the following statement to create view "v1", make all comparisons against the view in uppercase, and perform table INSERTs in mixed case:

   CREATE VIEW v1 AS SELECT TRANSLATE(c1) FROM T1

At the database level, you can set the collating sequence as part of the sqlecrea - Create Database API. This allows you to decide whether "a" is processed before "A", whether "A" is processed after "a", or whether they are processed with equal weighting. If you give them equal weights, they collate as equal when sorting with the ORDER BY clause; in that case, the only remaining basis on which to order them is their hexadecimal code point value, so "A" will always come before "a".

Thus

   SELECT c1 FROM T1 WHERE c1 LIKE 'ab%'

returns

   ab
   abel
   abels

and

   SELECT c1 FROM T1 WHERE c1 LIKE 'A%'

returns

   Abel
   Ab
   ABEL

The following statement

   SELECT c1 FROM T1 ORDER BY c1

returns

   ab
   Ab
   abel
   Abel
   ABEL
   abels

Thus, you may want to consider using the scalar function TRANSLATE(), as well as sqlecrea. Note that you can only specify a collating sequence using sqlecrea. You cannot specify a collating sequence from the command line processor (CLP). For information about the TRANSLATE() function, refer to the SQL Reference. For information about sqlecrea, refer to the Administrative API Reference.

You can also use the UCASE function as follows, but note that DB2 performs a table scan instead of using an index for the select:

   SELECT * FROM EMP WHERE UCASE(JOB) = 'NURSE'

Collating Sequence Sort Order: EBCDIC and ASCII Example

The order in which data in a database is sorted depends on the collating sequence defined for the database. For example, suppose that database A uses the EBCDIC code page's default collating sequence and that database B uses the ASCII code page's default collating sequence. Sort orders at these two databases would differ, as shown in Figure 19.

Figure 19. Example of How a Sort Order in an EBCDIC-Based Sequence Differs from a Sort Order in an ASCII-Based Sequence

SELECT.....
  ORDER BY COL2
 
EBCDIC-Based Sort        ASCII-Based Sort
 
COL2                     COL2
----                     ----
V1G                      7AB
Y2W                      V1G
7AB                      Y2W 

Similarly, character comparisons in a database depend on the collating sequence defined for that database. So if database A uses the EBCDIC code page's default collating sequence and database B uses the ASCII code page's default collating sequence, the results of character comparisons at the two databases would differ. Figure 20 illustrates the difference.

Figure 20. Example of How a Comparison of Characters in an EBCDIC-Based Sequence Differs from a Comparison of Characters in an ASCII-Based Sequence

SELECT.....
  WHERE COL2 > 'TT3'
 
EBCDIC-Based Results     ASCII-Based Results
 
COL2                     COL2
----                     ----
TW4                      TW4
X72                      X72
39G

If you are creating a federated database, consider specifying that your collating sequence matches the collating sequence at a data source. This approach will maximize "pushdown" opportunities and possibly increase query performance. For more information on the relationship between pushdown analysis, collating sequences, and query performance, refer to the Administration Guide: Implementation.

Specifying a Collating Sequence

The collating sequence for a database is specified at database creation time. Once the database has been created, the collating sequence cannot be changed.

The CREATE DATABASE API accepts a data structure called the Database Descriptor Block (SQLEDBDESC). You can define your own collating sequence within this structure.

To specify a collating sequence for a database, supply the desired values in the SQLEDBDESC structure when you call the CREATE DATABASE API.

The SQLEDBDESC structure contains:

SQLDBCSS
A 4-byte integer indicating the source of the database collating sequence. Valid values are:

SQL_CS_SYSTEM
The collating sequence of the operating system (based on current country code and code page) is used.

SQL_CS_USER
The collating sequence is specified by the value in the SQLDBUDC field.

SQL_CS_NONE
The collating sequence is the identity sequence. Strings are compared byte for byte, starting with the first byte, using a simple code point comparison.
Note:These constants are defined in the SQLENV include file.

SQLDBUDC
A 256-byte field. The nth byte contains the sort weight of the nth character in the code page of the database. If SQLDBCSS is not equal to SQL_CS_USER, this field is ignored.
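
The following C sketch shows the general shape of supplying a user-defined collating sequence through sqlecrea. It assumes the struct sqledbdesc and struct sqlca definitions from the SQLENV and SQLCA include files and the sqlecrea parameter order documented in the Administrative API Reference; verify the field and constant names against your include files. The database name is illustrative, and the remaining descriptor fields (eye catcher, comment, code page, and so on) must still be initialized as that reference describes.

   #include <string.h>
   #include <sqlenv.h>                 /* sqlecrea(), struct sqledbdesc */
   #include <sqlca.h>                  /* struct sqlca                  */

   /* Hypothetical sketch: create a database with a user-defined
    * collating sequence.  'weights' is a 256-byte array such as the
    * case-insensitive sequence built earlier in this section.          */
   long create_db_with_collation(const unsigned char weights[256])
   {
       struct sqledbdesc desc;
       struct sqlca ca;

       memset(&desc, 0, sizeof(desc));
       desc.sqldbcss = SQL_CS_USER;            /* take weights from SQLDBUDC */
       memcpy(desc.sqldbudc, weights, 256);    /* the 256 collating weights  */

       /* Set the remaining descriptor fields (eye catcher, comment, code
        * page, and so on) as described in the Administrative API Reference. */

       /* Parameter order as documented for sqlecrea; verify for your release. */
       sqlecrea("SAMPLEDB",        /* database name (illustrative)        */
                "SAMPLEDB",        /* local database alias                */
                "",                /* default database path               */
                &desc,             /* database descriptor block           */
                NULL,              /* country/code page information       */
                '\0',              /* reserved                            */
                NULL,              /* reserved                            */
                &ca);

       return ca.sqlcode;          /* 0 indicates success                 */
   }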

Sample Collating Sequences

Several sample collating sequences are provided (as include files) to facilitate database creation using the EBCDIC collating sequences instead of the default workstation collating sequence.

The collating sequences in these include files can be specified in the SQLDBUDC field of the SQLEDBDESC structure. They can also be used as models for the construction of other collating sequences.

For information on the include files that contain collating sequences, see the following sections:

Deriving Code Page Values

The application code page is derived from the active environment when the database connection is made. If the DB2CODEPAGE registry variable is set, its value is taken as the application code page. However, it is not necessary to set the DB2CODEPAGE registry variable because DB2 will determine the appropriate code page value from the operating system. Setting the DB2CODEPAGE registry variable to incorrect values may cause unpredictable results.

The database code page is derived from the value specified (explicitly or by default) at the time the database is created. The following list describes how the active environment is determined in different operating environments:

UNIX
On UNIX based operating systems, the active environment is determined from the locale setting, which includes information about language, territory and code set.

OS/2
On OS/2, primary and secondary code pages are specified in the CONFIG.SYS file. You can use the chcp command to display and dynamically change code pages within a given session.

Windows 32-bit operating systems
For all Windows 32-bit operating systems, if the DB2CODEPAGE environment variable is not set, the code page is derived from the ANSI code page setting in the Registry.

For a complete list of environment mappings for code page values, refer to the Administration Guide.

Deriving Locales in Application Programs

Locales are implemented one way on Windows and another way on UNIX based systems. There are two locales on UNIX based systems: the environment locale and the program locale.

On Windows, cultural preferences can be set through Regional Settings on the Control Panel. However, there is no environment locale like the one on UNIX based systems.

When your program is started, it gets a default C locale. It does not get a copy of the environment locale. If you set the program locale to any locale other than "C", DB2 Universal Database uses your current program locale to determine the code page and territory settings for your application environment. Otherwise, these values are obtained from the operating system environment. Note that setlocale() is not thread-safe, and if you issue setlocale() from within your application, the new locale is set for the entire process.
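
For example, a C application that wants DB2 to derive the code page and territory from the environment locale, rather than from the default C locale, could call setlocale() once, early in the process, before connecting (a minimal sketch):

   #include <locale.h>
   #include <stdio.h>

   int main(void)
   {
       /* Adopt the environment locale (LANG, LC_ALL, and so on) for the
        * whole process.  Because setlocale() is not thread-safe and the
        * new locale applies to the entire process, call it once, early,
        * before any database connection is made.                        */
       if (setlocale(LC_ALL, "") == NULL) {
           fprintf(stderr, "the locale specified in the environment is not supported\n");
           return 1;
       }

       /* ... connect to the database and run the application ... */
       return 0;
   }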

How DB2 Derives Locales

On UNIX based systems, the active locale used by DB2 is determined from the LC_CTYPE portion of the locale. For details, see the NLS documentation for your operating system.

National Language Support Application Development

Constant character strings in static SQL statements are converted at bind time, from the application code page to the database code page, and will be used at execution time in this database code page representation. To avoid such conversions if they are not desired, you can use host variables in place of string constants.

If your program contains constant character strings, it is strongly recommended that you precompile, bind, compile, and execute the application using the same code page. For a Unicode database, you should use host variables instead of using string constants. This is because data conversions by the server can occur in both the bind and the execution phases. This could be a concern if constant character strings are used within the program. These embedded strings are converted at bind time based on the code page which is in effect during the bind phase. Seven-bit ASCII characters are common to all the code pages supported by DB2 Universal Database and will not cause a problem. For non-ASCII characters, users should ensure that the same conversion tables are used by binding and executing with the same active code page. For a discussion of how applications determine the active code page, see Deriving Code Page Values.
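
As a sketch of this recommendation, the following embedded SQL C fragment supplies the comparison value through a host variable rather than an embedded constant; it reuses the table T1, with column C1 CHAR(20), from the example under A Note of Caution, and assumes the value is unique so that a single-row SELECT INTO is appropriate.

   EXEC SQL BEGIN DECLARE SECTION;
       char search_value[21];                  /* C1 is CHAR(20) */
       char result_c1[21];
   EXEC SQL END DECLARE SECTION;

   /* The search value is supplied at execution time in the application
    * code page and converted under the active code page of the current
    * connection; no bind-time conversion of an embedded constant occurs. */
   strcpy(search_value, "a-constant");
   EXEC SQL SELECT C1 INTO :result_c1 FROM T1 WHERE C1 = :search_value;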

Any external data obtained by the application will be assumed to be in the application code page. This includes data obtained from a file or from user input. Make sure that data from sources outside the application uses the same code page as the application.

If you use host variables that use graphic data in your C or C++ applications, there are special precompiler, application performance, and application design issues you need to consider. For a detailed discussion of these considerations, see Handling Graphic Host Variables in C and C++. If you deal with EUC code sets in your applications, refer to Japanese and Traditional Chinese EUC and UCS-2 Code Set Considerations for guidelines that you should consider.

Coding SQL Statements

The coding of SQL statements is not language dependent. The SQL keywords must be typed as shown in this book, although they may be typed in uppercase, lowercase, or mixed case. The names of database objects, host variables and program labels that occur in an SQL statement cannot contain characters outside the extended character set supported by your code page. For more information about extended character sets, refer to the SQL Reference.

The server does not convert file names. To code a file name, either use the ASCII invariant set, or provide the path in the hexadecimal values that are physically stored in the file system.

In a multi-byte environment, there are four special characters that do not belong to the invariant character set: the double-byte percent sign, the double-byte underscore, the double-byte space, and the double-byte substitution character.

The code points for each of these characters, by code page, are as follows:

Table 19. Code Points for Special Double-byte Characters
Code Page   Double-Byte Percentage   Double-Byte Underscore   Double-Byte Space   Double-Byte Substitution Character
932         X'8193'                  X'8151'                  X'8140'             X'FCFC'
938         X'8193'                  X'8151'                  X'8140'             X'FCFC'
942         X'8193'                  X'8151'                  X'8140'             X'FCFC'
943         X'8193'                  X'8151'                  X'8140'             X'FCFC'
948         X'8193'                  X'8151'                  X'8140'             X'FCFC'
949         X'A3A5'                  X'A3DF'                  X'A1A1'             X'AFFE'
950         X'A248'                  X'A1C4'                  X'A140'             X'C8FE'
954         X'A1F3'                  X'A1B2'                  X'A1A1'             X'F4FE'
964         X'A2E8'                  X'A2A5'                  X'A1A1'             X'FDFE'
970         X'A3A5'                  X'A3DF'                  X'A1A1'             X'AFFE'
1381        X'A3A5'                  X'A3DF'                  X'A1A1'             X'FEFE'
1383        X'A3A5'                  X'A3DF'                  X'A1A1'             X'A1A1'
13488       X'FF05'                  X'FF3F'                  X'3000'             X'FFFD'

UCS-2 Considerations

For a UCS-2 database, the GRAPHIC space is X'0020', which is different from the "Double-Byte Space" of X'3000' used for CCSID 13488. This difference should be taken into consideration when comparing data from an EUC database to data from a UCS-2 database. Note that in a UCS-2 database, the Unicode representations of the ASCII percent sign and the ASCII underscore are used for pattern matching. The DBCS percent sign and DBCS underscore have no special meaning for a UCS-2 database. The DBCS substitution character is used to replace any EUC non-SBCS character as required. There is no concept of a three- or four-byte substitution character.

Coding Remote Stored Procedures and UDFs

When coding stored procedures that will be running remotely, the following considerations apply:

By default, when you invoke DB2 DARI stored procedures and UDFs, they run under a default national language environment which may not match the database's national language environment. Consequently, using country or code page specific operations, such as the C wchar_t graphic host variables and functions, may not work as you expect. You need to ensure that, if applicable, the correct environment is initialized when you invoke the stored procedure or UDF.

Package Name Considerations in Mixed Code Page Environments

Package names are determined when you invoke the PRECOMPILE PROGRAM command or API. By default, they are generated based on the first 8 bytes of the application program source file name (without the file extension) and are folded to uppercase. Optionally, a name can be explicitly defined. Regardless of the origin of a package name, if you are running in an unequal code page environment, the characters for your package names should be in the invariant character set. Otherwise, you may experience problems related to the modification of your package name: the database manager may not be able to find the package for the application, or a client-side tool may not display the correct name for your package.

A package name modification due to character conversion will occur if any of the characters in the package name are not directly mapped to a valid character in the database code page. In such cases, a substitution character replaces the character that is not converted. After such a modification, the package name, when converted back to the application code page, may not match the original package name. An example of a case where this behavior is undesirable is when you use the DB2 Database Director to list and work with packages: the package names displayed may not match the expected names.

To avoid conversion problems with package names, ensure that only characters are used which are valid under both the application and database code pages.

Precompiling and Binding

At precompile/bind time, the precompiler is the executing application. The code page that was active when the database connection was made, prior to the precompile request, is used for precompiled statements and for any character data returned in the SQLCA.

Executing an Application

At execution time, the active code page of the user application when a database connection is made is in effect for the duration of the connection. All data is interpreted based on this code page; this includes dynamic SQL statements, user input data, user output data, and character fields in the SQLCA.

A Note of Caution

Failure to follow these guidelines may produce unpredictable results. These conditions cannot be detected by the database manager, so no error or warning message will result. For example, suppose a C application contains the following SQL statements, operating against a table T1 with one column defined as C1 CHAR(20):

     (0)  EXEC SQL CONNECT TO GLOBALDB;
     (1)  EXEC SQL INSERT INTO T1 VALUES ('a-constant');
          strcpy(sqlstmt, "SELECT C1 FROM T1 WHERE C1='a-constant'");
     (2)  EXEC SQL PREPARE S1 FROM :sqlstmt;
Where:
    application code page at bind time = x
    application code page at execution time = y
    database code page = z

At bind time, 'a-constant' in statement (1) is converted from code page x to code page z. This conversion can be noted as (x>z).

At execution time, 'a-constant' (x>z) is inserted into the table when statement (1) is executed. However, the WHERE clause of statement (2) will be executed with 'a-constant' (y>z). If the code points in the constant are such that the two conversions (x>z and y>z) yield different results, the SELECT in statement (2) will fail to retrieve the data inserted by statement (1).

Conversion Between Different Code Pages

Ideally, for optimal performance, your applications should always use the same code page as your database. However, this is not always practical or possible. The DB2 products provide support for character conversion that allows your application and database to use different code pages. Characters from one code page must be mapped to the other code page in order to maintain meaning of the data.

When Does Character Conversion Occur?

Character conversion can occur in the following situations:

Character conversion will not occur for:

Character Substitutions During Conversions

When your application converts from one code page to another, it is possible that one or more characters are not represented in the target code page. If this occurs, DB2 inserts a substitution character into the target string in place of the character that has no representation. The replacement character is then considered a valid part of the string. In situations where a substitution occurs, the SQLWARN10 indicator in the SQLCA is set to 'W'.
Note:Any character conversions resulting from using the WCHARTYPE CONVERT precompiler option will not flag a warning if any substitutions take place.
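
A minimal embedded SQL C fragment that checks for this condition might look like the following; the cursor cur1 and host variable :descr are assumed to have been declared and opened elsewhere, and SQLWARN10 corresponds to sqlca.sqlwarn[10] in the C representation of the SQLCA.

   EXEC SQL FETCH cur1 INTO :descr;

   /* SQLWARN10 is set to 'W' when one or more characters were replaced
    * by the substitution character during code page conversion.        */
   if (sqlca.sqlwarn[10] == 'W')
   {
       printf("warning: substitution character(s) inserted during conversion\n");
   }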

Supported Character Conversions

When data conversion occurs, conversion will take place from a source code page to a target code page.

The source code page is determined from the source of the data; data from the application has a source code page equal to the application code page, and data from the database has a source code page equal to the database code page.

The determination of target code page is more involved; where the data is to be placed, including rules for intermediate operations, is considered:

For a list of the code pages supported by DB2 Universal Database, refer to the Administration Guide. The values under the heading "Group" can be used to determine where conversions are supported. Any code page can be converted to any other code page that is listed in the same IBM-defined language group. For example, code page 437 can be converted to 37, 819, 850, 1051, 1252, or 1275.
Note:Character string conversions between multi-byte code pages, for example DBCS and EUC, may result in either an increase or a decrease in the length of the string.

Character Conversion Expansion Factor

When your application successfully connects to a DB2 database server, you should consider the SQLERRD(1) and SQLERRD(2) fields in the returned SQLCA: SQLERRD(1) contains the expansion factor for character conversion from the application code page to the database code page, and SQLERRD(2) contains the expansion factor for conversion from the database code page to the application code page.

Expansion is not a factor for graphic string data in unequal code page situations. Each graphic string always has the same number of characters, regardless of whether the data is in the application or the database code page.

See Unequal Code Page Situations for information on dealing with unequal code page situations.

DBCS Character Sets

Each combined single-byte character set (SBCS) or double-byte character set (DBCS) code page allows for both single- and double-byte character code points. This is usually accomplished by reserving a subset of the 256 available code points of a mixed code table for single-byte characters, with the remainder of the code points either undefined, or allocated to the first byte of double-byte code points. These code points are shown in the following table.

Table 20. Mixed Character Set Code Points
Country   Supported Mixed Code Page   Code Points for Single-Byte Characters   Code Points for First Byte of Double-Byte Characters
Japan     932, 943                    x00-7F, xA1-DF                           x81-9F, xE0-FC
Japan     942                         x00-80, xA0-DF, xFD-FF                   x81-9F, xE0-FC
Taiwan    938 (*)                     x00-7E                                   x81-FC
Taiwan    948 (*)                     x00-80, xFD, xFE                         x81-FC
Korea     949                         x00-7F                                   x8F-FE
Taiwan    950                         x00-7E                                   x81-FE
China     1381                        x00-7F                                   x8C-FE
Korea     1363                        x00-7F                                   x81-FE
China     1386                        x00                                      x81-FE
Note:(*) This is an old code page that is no longer recommended.

Code points not assigned to either of these categories are not defined, and are processed as single-byte undefined code points.

Within each implied DBCS code table, there are 256 code points available as the second byte for each valid first byte. Second byte values can have any value from 0x40 to 0x7E, and from 0x80 to 0xFE. Note that in DBCS environments, DB2 does not perform validity checking on individual double-byte characters.

Extended UNIX Code (EUC) Character Sets

Each EUC code page allows for both single-byte character code points, and up to three different sets of multi-byte character code points. This is accomplished by reserving a subset of the 256 available code points of each implied SBCS code page identifier for single-byte characters. The remainder of the code points is undefined, allocated as an element of a multi-byte character, or allocated as a single-shift introducer of a multi-byte character. These code points are shown in the following tables.

Table 21. Japanese EUC Code Points
Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 x8E xA1-FE n/a n/a
G3 x8E xA1-FE xA1-FE n/a

Table 22. Korean EUC Code Points
Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 n/a n/a n/a n/a
G3 n/a n/a n/a n/a

Table 23. Traditional Chinese EUC Code Points
Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 x8E xA1-FE xA1-FE xA1-FE
G3 n/a n/a n/a n/a

Table 24. Simplified Chinese EUC Code Points
Group 1st Byte 2nd Byte 3rd Byte 4th Byte
G0 x20-7E n/a n/a n/a
G1 xA1-FE xA1-FE n/a n/a
G2 n/a n/a n/a n/a
G3 n/a n/a n/a n/a

Code points not assigned to any of these categories are not defined, and are processed as single-byte undefined code points.

Running CLI/ODBC/JDBC/SQLJ Programs in a DBCS Environment

For details on running Java programs that access DB2 Universal Database in a double-byte character set (DBCS) environment, refer to DB2 Java - DBCS Support. This web page currently contains the following information:

JDBC and SQLJ programs access DB2 using the DB2 CLI/ODBC driver and therefore use the same configuration file (db2cli.ini). The following entries must be added to this configuration file if you run Java programs that access DB2 Universal Database in a DBCS environment:

PATCH1 = 65536
This forces the driver to manually insert a "G" in front of character literals which are in fact graphic literals. This PATCH1 value should always be set when working in a double-byte environment.

PATCH1 = 64
This forces the driver to NULL terminate graphic output strings. This is needed by Microsoft Access in a double-byte environment. If you need to use this PATCH1 value as well, then you would add the two values together (64+65536 = 65600) and set PATCH1=65600. See Note #2 below for more information about specifying multiple PATCH1 values.

PATCH2 = 7
This forces the driver to map all graphic column data types to the char column data type. This is needed in a double-byte environment.

PATCH2 = 10
This setting should only be used in an EUC (Extended UNIX Code) environment. It ensures that the CLI driver provides data for character variables (CHAR, VARCHAR, and so on) in the proper format for the JDBC driver. The data in these character types will not be usable in JDBC without this setting.
Note:
  1. Each of these keywords is set in each database specific stanza of the db2cli.ini file. If you want to set them for multiple databases then you need to repeat them for each database stanza in db2cli.ini.
  2. To set multiple PATCH1 values you add the individual values and use the sum. To set PATCH1 to both 64 and 65536 you would set PATCH1=65600 (64+65536). If you already have other PATCH1 values set then replace the existing number with the sum of the existing number and the new PATCH1 values you want to add.
  3. To set multiple PATCH2 values, specify them in a comma-delimited string (unlike the PATCH1 option). To set PATCH2 values 1 and 7, you would set PATCH2="1,7".
For more information about setting these keywords refer to the Installation and Configuration Supplement.
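
Putting these notes together, a db2cli.ini stanza for a hypothetical database alias SAMPLE used by Java programs in a DBCS environment might look like the following; PATCH1=65600 combines the values 64 and 65536 as described in note 2, and additional PATCH2 values would be added as a comma-delimited string as described in note 3.

   [SAMPLE]
   PATCH1=65600
   PATCH2=7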

Japanese and Traditional Chinese EUC and UCS-2 Code Set Considerations

Extended UNIX Code (EUC) denotes a set of general encoding rules that can support from one to four character sets in UNIX-based operating environments. The encoding rules are based on the ISO 2022 definition for encoding 7-bit and 8-bit data in which control characters are used to separate some of the character sets. EUC is a means of specifying a collection of code sets rather than a code set encoding scheme. A code set based on EUC conforms to the EUC encoding rules but also identifies the specific character sets associated with the specific instances. For example, the IBM-eucJP code set for Japanese refers to the encoding of the Japanese Industrial Standard characters according to the EUC encoding rules. For a list of code pages which are supported, refer to your platform's Quick Beginnings book.

Database and client application support for graphic (pure double-byte character) data, while running under EUC code pages with character encodings that are greater than two bytes in length, is limited. The DB2 Universal Database products implement strict rules for graphic data that require all characters to be exactly two bytes wide. These rules do not allow many characters from both the Japanese and Traditional Chinese EUC code pages. To overcome this situation, support is provided at both the application level and the database level to represent Japanese and Traditional Chinese EUC graphic data using another encoding scheme.

A database created under either Japanese or Traditional Chinese EUC code pages will actually store and manipulate graphic data using the ISO 10646 UCS-2 code set, a double-byte encoding scheme which is a proper subset of the full ISO 10646 standard. Similarly, an application running under those code pages will send graphic data to the database server as UCS-2 encoded data. With this support, applications running under EUC code pages can access the same types of data as those running under DBCS code pages. For additional information regarding EUC environments, refer to the SQL Reference. The IBM-defined code page identifier associated with UCS-2 is 1200, and the CCSID number for the same code page is 13488. Graphic data in an eucJP or eucTW database uses the CCSID number 13488. In a UCS-2 database, use the code page number 1200 for GRAPHIC data.

The ISO 10646 standard specifies the encoding of a number of combining characters that are necessary in several scripts, such as Indic, Thai, Arabic, and Hebrew. These characters can also be used for the productive generation of characters in Latin, Cyrillic, and Greek scripts. However, their presence creates the possibility of alternative codings for the same text. Although the coding is unambiguous and data integrity is preserved, processing text that contains combining characters is more complex. To provide for conformance of applications that choose not to deal with combining characters, ISO 10646 defines three implementation levels:

Level 1.
Does not allow combining characters.

Level 2.
Only a subset of combining characters are allowed.

Level 3.
All combining characters are allowed.

DB2 Universal Database supports the entire set of UCS-2 characters, including all the combining characters, but does not perform any composition or decomposition of characters. For more information on the Unicode standard, refer to The Unicode Standard, Version 2.0, from Addison-Wesley. For more information about UCS-2, refer to ISO/IEC 10646-1 from the International Organization for Standardization.

If you are working with applications or databases using these character sets, you may need to consider dealing with UCS-2 encoded data. When converting UCS-2 graphic data to the application's EUC code page, there is the possibility of an increase in the length of the data. For details of data expansion, see Character Conversion Expansion Factor. When large amounts of data are being displayed, it may be necessary to allocate buffers, convert, and display the data in a series of fragments.

The following sections discuss how to handle data in this environment. For these sections, the term EUC is used to refer only to Japanese and Traditional Chinese EUC character sets. Note that the discussions do not apply to DB2 Korean or Simplified-Chinese EUC support since graphic data in these character sets is represented using the EUC encoding.

Mixed EUC and Double-Byte Client and Database Considerations

The administration of database objects in mixed EUC and double-byte code page environments is complicated by the possible expansion or contraction in the length of object names as a result of conversions between the client and database code page. In particular, many administrative commands and utilities have documented limits to the lengths of character strings which they may take as input or output parameters. These limits are typically enforced at the client, unless documented otherwise. For example, the limit for a table name is 128 bytes. It is possible that a character string which is 128 bytes under a double-byte code page is larger, say 135 bytes, under an EUC code page. This hypothetical 135-byte table name would be considered invalid by such commands as REORGANIZE TABLE if used as an input parameter, despite being valid in the target double-byte database. Similarly, the maximum permitted length of an output parameter may be exceeded after conversion from the database code page to the application code page. This may cause either a conversion error or truncation of the output data.

If you expect to use administrative commands and utilities extensively in a mixed EUC and double-byte environment, you should define database objects and their associated data with the possibility of length expansion past the supported limits in mind. Administering an EUC database from a double-byte client imposes fewer restrictions than administering a double-byte database from an EUC client. Double-byte character strings will always be equal in length to, or shorter than, the corresponding EUC character strings. This will generally lead to fewer problems caused by enforcing the character string length limits.

Note:In the case of SQL statements, validation of input parameters is not conducted until the entire statement has been converted to the database code page. Thus you can use character strings which may be technically longer than allowed when they are represented in the client code page, but which meet the length requirements when represented in the database code page.

Considerations for Traditional Chinese Users

Due to the standards definition for Traditional Chinese, there is a side effect that you may encounter when you convert some characters between double-byte or EUC code pages and UCS-2. There are 189 characters (consisting of 187 radicals and 2 numbers) that, when converted, share the same UCS-2 code point as another character in the code set. When these characters are converted back to double-byte or EUC, they are converted to the code point of the ideograph with which they share the UCS-2 code point, rather than back to the original code point. When displayed, the character appears the same, but has a different code point. Depending on your application's design, you may have to take this behavior into account.

As an example, consider what happens to code point A7A1 in EUC code page 964 when it is converted to UCS-2 and then converted back to the original code page, EUC 964:
Example conversion of EUC to UCS-2 to EUC

Thus, the original code points A7A1 and C4A1 end up as code point C4A1 after conversion.

If you require the code page conversion tables for EUC code page 964 (Traditional Chinese EUC) or 950 (Traditional Chinese Big-5) and UCS-2, see the online Product and Service Technical Library.

Developing Japanese or Traditional Chinese EUC Applications

When developing EUC applications, you need to consider the following items:

For additional considerations for stored procedures, see Considerations for Stored Procedures. Additional language-specific application development issues are discussed in:

Graphic Data Handling

This section discusses EUC application development considerations in order to handle graphic data. This includes handling graphic constants, and handling graphic data in UDFs, stored procedures, DBCLOB files, as well as collation.

Graphic Constants

Graphic constants, or literals, are actually classified as mixed character data as they are part of an SQL statement. Any graphic constants in an SQL statement from a Japanese or Traditional Chinese EUC client are implicitly converted to the graphic encoding by the database server. You can use graphic literals that are composed of EUC encoded characters in your SQL applications. An EUC database server will convert these literals to the graphic database code set which will be UCS-2. Graphic constants from EUC clients should never contain single-width characters such as CS0 7-bit ASCII characters or Japanese EUC CS2 (Katakana) characters.

For additional information on graphic constants, refer to the SQL Reference.

Considerations for UDFs

UDFs are invoked at the database server and are meant to deal with data encoded in the same code set as the database. In the case of databases running under the Japanese or Traditional Chinese code set, mixed character data is encoded using the EUC code set under which the database is created. Graphic data is encoded using UCS-2. This means that UDFs need to recognize and handle graphic data which will be encoded with UCS-2.

For example, suppose you create a UDF called VARCHAR that converts a graphic string to a mixed character string. If the database is created under one of the EUC code sets, the VARCHAR function has to convert a graphic string encoded as UCS-2 to its EUC representation.

Considerations for Stored Procedures

A stored procedure, running under either a Japanese or Traditional Chinese EUC code set, must be prepared to recognize and handle graphic data encoded using UCS-2. When running these code sets, graphic data received or returned through the stored procedure's input/output SQLDA is encoded using UCS-2.

Considerations for DBCLOB Files

There are two important considerations for DBCLOB files:

Collation

Graphic data is sorted in binary sequence. Mixed data is sorted in the collating sequence of the database applied on each byte. For a discussion on sorting sequences, refer to the SQL Reference. Due to the possible difference in the ordering of characters in an EUC code set and a DBCS code set for the same country, different results may be obtained when the same data is sorted in an EUC database and in a DBCS database.

Developing for Mixed Code Set Environments

This section deals with the following considerations related to the increase or decrease in the length of data under certain circumstances, when developing applications in a mixed EUC and DBCS environment:

Unequal Code Page Situations

Depending on the character encoding schemes used by the application code page and the database code page, there may or may not be a change in the length of a string as it is converted from the source code page to the target code page. A change in length is usually associated with conversions between multi-byte code pages with different encoding schemes, for example DBCS and EUC.

A possible increase in length is usually more serious than a possible decrease, since an over-allocation of memory is less problematic than an under-allocation. Application considerations for sending or retrieving data differ depending on where the possible expansion occurs, and need to be dealt with separately. It is also important to note the difference between a best-case and a worst-case situation when an expansion or contraction in length is indicated. Positive values, indicating a possible expansion, give the worst-case multiplying factor. For example, a value of 2 for the SQLERRD(1) or SQLERRD(2) field means that a maximum of twice the string length of storage will be required to handle the data after conversion. This is a worst-case indicator; in this example, the best case is that after conversion the length remains the same.

Negative values for SQLERRD(1) or SQLERRD(2), indicating a possible contraction, also provide the worst-case expansion factor. For example, a value of -1 means that the maximum storage required is equal to the string length prior to conversion. It is indeed possible that less storage may be required, but practically this is of little use unless the receiving application knows in advance how the source data is structured.

To ensure that you always have sufficient storage allocated to cover the maximum possible expansion after character conversion, you should allocate storage equal to the value max_target_length obtained from the following calculation (a C sketch of the complete calculation appears after the steps):

  1. Determine the expansion factor for the data.

    For data transfer from the application to the database:

        expansion_factor = ABS[SQLERRD(1)] 
        if expansion_factor = 0 
           expansion_factor = 1
    

    For data transfer from the database to the application:

        expansion_factor = ABS[SQLERRD(2)] 
        if expansion_factor = 0 
           expansion_factor = 1 
    

    In the above calculations, ABS refers to the absolute value.

    The check for expansion_factor = 0 is necessary because some DB2 Universal Database products return 0 in SQLERRD(1) and SQLERRD(2). These servers do not support code page conversions that result in the expansion or shrinkage of data; this is represented by an expansion factor of 1.

  2. Intermediate length calculation.
        temp_target_length = actual_source_length * expansion_factor
    
  3. Determine the maximum length for target data type.

    Target data type
    Maximum length of type (type_maximum_length)

    CHAR
    254

    VARCHAR
    32 672

    LONG VARCHAR
    32 700

    CLOB
    2 147 483 647
  4. Determine the maximum target length.
     
    (1)   if temp_target_length < actual_source_length
             max_target_length = type_maximum_length 
          else 
    (2)   if temp_target_length > type_maximum_length 
             max_target_length = type_maximum_length 
          else 
    (3)   max_target_length = temp_target_length 
    

    All the above checks are required to allow for overflow which may occur during the length calculation. The specific checks are:

    (1)
    Numeric overflow occurs during the calculation of temp_target_length in step 2.

    If the result of multiplying two positive values together is greater than the maximum value for the data type, the result wraps around and is returned as a value less than the larger of the two values.

    For example, the maximum value of a 2-byte signed integer (which is used for the length of non-CLOB data types) is 32 767. If the actual_source_length is 25 000 and the expansion factor is 2, then temp_target_length is theoretically 50 000. This value is too large for the 2-byte signed integer so it gets wrapped around and is returned as -15 536.

    For the CLOB data type, a 4-byte signed integer is used for the length. The maximum value of a 4-byte signed integer is 2 147 483 647.

    (2)
    temp_target_length is too large for the data type.

    The length of a data type cannot exceed the values listed in step 3.

    If the conversion requires more space than is available in the data type, it may be possible to use a larger data type to hold the result. For example, if a CHAR(250) value requires 500 bytes to hold the converted string, it will not fit into a CHAR value because the maximum length is 254 bytes. However, it may be possible to use a VARCHAR(500) to hold the result after conversion. See Character Conversions Past Data Type Limits for more information.

    (3)
    temp_target_length is the correct length for the result.
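
The following C sketch carries out this calculation for data transferred from the database to the application; the function and parameter names are illustrative, and sqlerrd2 is the SQLERRD(2) value from the SQLCA returned by CONNECT (sqlca.sqlerrd[1] in C, because the documented SQLERRD fields are numbered from 1).

   #include <stdlib.h>    /* labs() */

   /* Hypothetical sketch of steps 1 through 4 above.  'type_maximum_length'
    * is the limit from step 3 (for example, 32672 for VARCHAR).            */
   static long max_target_length(long actual_source_length,
                                 long sqlerrd2,
                                 long type_maximum_length)
   {
       long expansion_factor;
       long temp_target_length;

       /* Step 1: expansion factor (0, from down-level servers, means 1). */
       expansion_factor = labs(sqlerrd2);
       if (expansion_factor == 0)
           expansion_factor = 1;

       /* Step 2: intermediate length. */
       temp_target_length = actual_source_length * expansion_factor;

       /* Step 4: guard against overflow and the data type limit. */
       if (temp_target_length < actual_source_length)      /* (1) overflow      */
           return type_maximum_length;
       if (temp_target_length > type_maximum_length)       /* (2) exceeds limit */
           return type_maximum_length;
       return temp_target_length;                          /* (3) exact length  */
   }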

Using the SQLERRD(1) and SQLERRD(2) values returned when connecting to the database, together with the above calculations, you can determine whether the length of a string will possibly increase or decrease as a result of character conversion. In general, a value of 0 or 1 indicates no expansion; a value greater than 1 indicates a possible expansion in length; a negative value indicates a possible contraction. (Note that values of '0' will only come from down-level DB2 Universal Database products, and that these values are undefined for other database server products.) Table 25 lists the values to expect for various application code page and database code page combinations when using DB2 Universal Database.

Table 25. SQLCA.SQLERRD Settings on CONNECT
Application Code Page   Database Code Page   SQLERRD(1)   SQLERRD(2)
SBCS                    SBCS                 +1           +1
DBCS                    DBCS                 +1           +1
eucJP                   eucJP                +1           +1
eucJP                   DBCS                 -1           +2
DBCS                    eucJP                +2           -1
eucTW                   eucTW                +1           +1
eucTW                   DBCS                 -1           +2
DBCS                    eucTW                +2           -1
eucKR                   eucKR                +1           +1
eucKR                   DBCS                 +1           +1
DBCS                    eucKR                +1           +1
eucCN                   eucCN                +1           +1
eucCN                   DBCS                 +1           +1
DBCS                    eucCN                +1           +1

Expansion at the Database Server

If the SQLERRD(1) entry indicates an expansion at the database server, your application must consider the possibility that length-dependent character data which is valid at the client will not be valid at the database server once it is converted. For example, DB2 products require that column names be no more than 128 bytes in length. It is possible that a character string which is 128 bytes in length encoded under a DBCS code page expands past the 128 byte limit when it is converted to an EUC code page. This means that there may be activities which are valid when the application code page and the database code page are equal, which are invalid when they are different. Exercise caution when you design EUC and DBCS databases for unequal code page situations.

Expansion at the Application

If the SQLERRD(2) entry indicates an expansion at the client application, your application must consider the possibility that length-dependent character data will expand in length after being converted. For example, a row with a CHAR(128) column is retrieved. Under circumstances where the database and application code pages are equal, the length of the data returned is 128 bytes. However, in an unequal code page situation, 128 bytes of data encoded under a DBCS code page may expand past 128 bytes when converted to an EUC code page. Thus, additional storage may have to be allocated in order to retrieve the complete string.

Client-Based Parameter Validation

An important side effect of potential character data expansion or contraction between the client and server involves the validation of data passed between the client application and the database server. In an unequal code page situation, it is possible that data determined to be valid at the client is actually invalid at the database server after character conversion. Conversely, data that is invalid at the client may be valid at the database server after conversion.

Any end-user application or API library has the potential of not being able to handle all possibilities in an unequal code page situation. In addition, while some parameter validation such as string length is performed at the client for commands and APIs, the tokens within SQL statements are not verified until they have been converted to the database's code page. This can lead to situations where it is possible to use an SQL statement in an unequal code page environment to access a database object, such as a table, but it will not be possible to access the same object using a particular command or API.

Consider an application that returns data contained in a table provided by an end-user, and checks that the table name is not greater than 128 bytes long. Now consider the following scenarios for this application:

  1. A DBCS database is created. From a DBCS client, a table (t1) is created with a table name which is 128 bytes long. The table name includes several characters which would be greater than two bytes in length if the string is converted to EUC, resulting in the EUC representation of the table name being a total of 131 bytes in length. Since there is no expansion for DBCS to DBCS connections, the table name is 128 bytes in the database environment, and the CREATE TABLE is successful.
  2. An EUC client connects to the DBCS database. It creates a table (t2) with a table name which is 120 bytes long when encoded as EUC and 100 bytes long when converted to DBCS. The table name in the DBCS database is 100 bytes. The CREATE TABLE is successful.
  3. The EUC client creates a table (t3) with a table name that is 64 EUC characters in length (131 bytes). When this name is converted to DBCS its length shrinks to the 128 byte limit. The CREATE TABLE is successful.
  4. The EUC client invokes the application against each of the tables (t1, t2, and t3) in the DBCS database, which results in:
     t1   The application considers the table name invalid because it is 131 bytes long.
     t2   The application displays the correct results.
     t3   The application considers the table name invalid because it is 131 bytes long.
  5. The EUC client is used to query the DBCS database from the CLP. Although the table name is 131 bytes long on the client, the queries are successful because the table name is 128 bytes long at the server.

Using the DESCRIBE Statement

A DESCRIBE performed against an EUC database will return information about mixed character and GRAPHIC columns based on the definition of these columns in the database. This information is based on the code page of the server, before it is converted to the client's code page.

When you perform a DESCRIBE against a select list item that is resolved in the application context (for example, VALUES SUBSTR(?,1,2)), then for any character or graphic data involved, you should evaluate the returned SQLLEN value along with the returned code page. If the returned code page is the same as the application code page, there is no expansion. If the returned code page is the same as the database code page, expansion is possible. Select list items that are FOR BIT DATA (code page 0), or that are in the application code page, are not converted when returned to the application, so there is no expansion or contraction of the reported length.

EUC Application with DBCS Database

If your application's code page is an EUC code page, and it issues a DESCRIBE against a database with a DBCS code page, the information returned for CHAR and GRAPHIC columns is returned in the database context. For example, a CHAR(5) column returned as part of a DESCRIBE has a value of five for the SQLLEN field. In the case of non-EUC data, you allocate five bytes of storage when you fetch the data from this column. With EUC data, this may not be the case. When the code page conversion from DBCS to EUC takes place, there may be an increase in the length of the data due to the different encoding used for characters for CHAR columns. For example, with the Traditional Chinese character set, the maximum increase is double. That is, the maximum character length in the DBCS encoding is two bytes which may increase to a maximum character length of four bytes in EUC. For the Japanese code set, the maximum increase is also double. Note, however, that while the maximum character length in Japanese DBCS is two bytes, it may increase to a maximum character length in Japanese EUC of three bytes. Although this increase appears to be only by a factor of 1.5, the single-byte Katakana characters in Japanese DBCS are only one byte in length, while they are two bytes in length in Japanese EUC. See Character Conversion Expansion Factor for more information on determining the maximum size.

Possible changes in data length as a result of character conversions apply only to mixed character data. Graphic character data encoding is always the same length, two bytes, regardless of the encoding scheme. To avoid losing the data, you need to evaluate whether an unequal code page situation exists, and whether or not it is between a EUC application and a DBCS database. You can determine the database code page and the application code page from tokens in the SQLCA returned from a CONNECT statement. For more information, see Deriving Code Page Values, or refer to the SQL Reference. If such a situation exists, your application needs to allocate additional storage for mixed character data, based on the maximum expansion factor for that encoding scheme.

DBCS Application with EUC Database

If your application code page is a DBCS code page and issues a DESCRIBE against an EUC database, a situation similar to that in EUC Application with DBCS Database occurs. However, in this case, your application may require less storage than indicated by the value of the SQLLEN field. The worst case in this situation is that all of the data is single-byte or double-byte under EUC, meaning that exactly SQLLEN bytes are required under the DBCS encoding scheme. In any other situation, less than SQLLEN bytes are required because a maximum of two bytes are required to store any EUC character.

Using Fixed or Variable Length Data Types

Due to the possible change in length of strings when conversions occur between DBCS and EUC code pages, you should consider not using fixed length data types. Depending on whether you require blank padding, you should consider changing the SQLTYPE from a fixed length character string, to a varying length character string after performing the DESCRIBE. For example, if an EUC to DBCS connection is informed of a maximum expansion factor of two, the application should allocate ten bytes (based on the CHAR(5) example in EUC Application with DBCS Database).

If the SQLTYPE is fixed-length, the EUC application will receive the column as an EUC data stream converted from the DBCS data (which itself may have up to five bytes of trailing blank pads), with further blank padding if the code page conversion does not cause the data element to grow to its maximum size. If the SQLTYPE is varying-length, the original meaning of the content of the CHAR(5) column is preserved; however, the source five bytes may have a target of between five and ten bytes. Similarly, in the case of possible data shrinkage (DBCS application and EUC database), you should consider working with varying-length data types.

An alternative to either allocating extra space or promoting the data type is to select the data in fragments. For example, to select the same VARCHAR(3000), which may be up to 6000 bytes in length after the conversion, you could perform two separate selects, of SUBSTR(VC3000, 1, LENGTH(VC3000)/2) and SUBSTR(VC3000, LENGTH(VC3000)/2 + 1), into two VARCHAR(3000) application areas (see the sketch below). This method is the only possible solution when the data type is no longer promotable. For example, a CLOB encoded in the Japanese DBCS code page with the maximum length of 2 gigabytes may be up to twice that size when encoded in the Japanese EUC code page. This means that the data will have to be broken up into fragments, since there is no support for a data type in excess of 2 gigabytes in length.
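
A sketch of this fragment technique in embedded SQL C might look like the following; the table name T1 and key column ID are hypothetical, and VC3000 is the VARCHAR(3000) column from the example above.

   EXEC SQL BEGIN DECLARE SECTION;
       struct { short len; char data[3000]; } frag1;   /* VARCHAR(3000) area */
       struct { short len; char data[3000]; } frag2;   /* VARCHAR(3000) area */
       long row_id;
   EXEC SQL END DECLARE SECTION;

   /* Each half of the source column is at most 1500 bytes and therefore,
    * even at the worst-case expansion factor of 2, fits into one of the
    * VARCHAR(3000) application areas after conversion.                   */
   EXEC SQL SELECT SUBSTR(VC3000, 1, LENGTH(VC3000)/2),
                   SUBSTR(VC3000, LENGTH(VC3000)/2 + 1)
       INTO :frag1, :frag2
       FROM T1
       WHERE ID = :row_id;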

Character Conversion String Length Overflow

In EUC and DBCS unequal code page environments, situations may occur after conversion takes place, when there is not enough space allocated in a column to accommodate the entire string. In this case, the maximum expansion will be twice the length of the string in bytes. In cases where expansion does exceed the capacity of the column, SQLCODE -334 (SQLSTATE 22524) is returned.

This leads to situations that may not be immediately obvious or previously considered as follows:

Rules for String Conversions

If you are designing applications for mixed code page environments, refer to the SQL Reference for any of the following situations:

In these situations, conversions may take place to the application code page instead of the database code page.

Character Conversions Past Data Type Limits

In EUC and DBCS unequal code page environments, situations may occur after conversion takes place, when the length of the mixed character or graphic string exceeds the maximum length allowed for that data type. If the length of the string, after expansion, exceeds the limit of the data type, then type promotion does not occur. Instead, an error message is returned indicating that the maximum allowed expansion length has been exceeded. This situation is more likely to occur while evaluating predicates than with inserts. With inserts, the column width is more readily known by the application, and the maximum expansion factor can be readily taken into account. In many cases, this side effect of character conversion can be avoided by casting the value to an associated data type with a longer maximum length. For example, the maximum length of a CHAR value is 254 bytes while the maximum length of a VARCHAR is 32672 bytes. In cases where expansion does exceed the maximum length of the data type, an SQLCODE -334 (SQLSTATE 22524) is returned.

Character Conversions in Stored Procedures

Mixed character or graphic data specified in host variables and SQLDAs in sqleproc() or SQL CALL invocations is converted in situations where the application and database code pages are different. In cases where string length expansion occurs as a result of conversion, you receive an SQLCODE -334 (SQLSTATE 22524) if there is not enough space allocated to handle the expansion. Thus you must be sure to provide enough space for potentially expanding strings when developing stored procedures. You should use varying-length data types with enough space allocated to allow for expansion.

Applications Connected to a Unicode (UCS-2) Database

Note that the information contained in the previous section, Developing for Mixed Code Set Environments, is also applicable to a UCS-2 database.

Applications from any code page environment can connect to a Unicode database. For applications that connect to a Unicode database, the database manager converts character string data between the application code page and the database code page (UTF-8). For a UCS-2 database, GRAPHIC data is always in UCS-2. However, when you use the command line processor to retrieve graphic data, the graphic characters are also converted to the client code page. This conversion allows the command line processor to display graphic characters in the current font. Data loss may occur whenever the database manager converts UCS-2 characters to a client code page. Characters that the database manager cannot convert to a valid character in the client code page are replaced with the default substitution character in that code page.

When DB2 converts characters from a code page to UTF-8, the total number of bytes that represent the characters may expand or shrink, depending on the code page and the code points of the characters. 7-bit ASCII remains invariant in UTF-8, and each ASCII character requires one byte. Non-ASCII UCS-2 characters become two or three bytes each. For more information about UTF-8 conversions, refer to the Administration Guide, or refer to the Unicode standard documents.
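
As an illustration of this rule, the following C sketch computes the number of bytes that a UCS-2 string will occupy after conversion to UTF-8 (a hypothetical helper; characters beyond UCS-2, which require surrogate pairs, are not considered):

   #include <stddef.h>

   /* Hypothetical sketch: UTF-8 length of a UCS-2 string.  Code points
    * below X'0080' take one byte, those below X'0800' take two bytes,
    * and all remaining UCS-2 code points take three bytes.             */
   static size_t utf8_length(const unsigned short *ucs2, size_t nchars)
   {
       size_t i, bytes = 0;

       for (i = 0; i < nchars; i++) {
           if (ucs2[i] < 0x0080)
               bytes += 1;
           else if (ucs2[i] < 0x0800)
               bytes += 2;
           else
               bytes += 3;
       }
       return bytes;
   }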

For applications that connect to a Unicode database, GRAPHIC data is already in Unicode. For applications that connect to DBCS databases, GRAPHIC data is converted between the application DBCS code page and the database DBCS code page. Unicode applications should perform the necessary conversions to and from Unicode themselves, or should set the WCHARTYPE CONVERT option and use wchar_t for graphic data. For more details about this option, see Handling Graphic Host Variables in C and C++.


Footnotes:

1
However, a literal inserted into a column defined as FOR BIT DATA could be converted if that literal was part of an SQL statement which was converted.

