Release Notes


47.1 Introduction

The Unicode standard is a universal character encoding scheme for written characters and text. It defines a character set very precisely, as well as a small number of encodings for it. It defines a consistent way of encoding multilingual text that enables the exchange of text data internationally and creates the foundation for global software.

Two of the encoding schemes provided by Unicode are UTF-16 and UTF-8.

The default encoding scheme is UTF-16, a 16-bit encoding format. UCS-2 is a subset of UTF-16 that uses exactly two bytes to represent each character. UCS-2 is generally accepted as the universal code page capable of representing all the necessary characters from all existing single-byte and double-byte code pages. UCS-2 is registered in IBM as code page 1200.

The other Unicode encoding format is UTF-8, which is byte-oriented and has been designed for ease of use with existing ASCII-based systems. UTF-8 uses a varying number of bytes (usually 1-3, sometimes 4) to store each character. The invariant ASCII characters are stored as single bytes. All other characters are stored using multiple bytes. In general, UTF-8 data can be treated as extended ASCII data by code that was not designed for multi-byte code pages. UTF-8 is registered in IBM as code page 1208.
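The byte counts described above can be checked with a short Python sketch. Python is not part of DB2; it is used here only because its standard codecs implement UTF-8. The helper name `utf8_length` and the sample characters are illustrative assumptions, not anything defined by DB2.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration only.
samples = [
    ("A", "ASCII letter"),
    ("\u00e9", "Latin small e with acute"),
    ("\u20ac", "euro sign"),
    ("\U0001d11e", "musical G clef, outside the 16-bit range"),
]

for ch, description in samples:
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")

# Invariant ASCII characters keep their single-byte ASCII values in UTF-8,
# which is why ASCII-oriented code can often handle UTF-8 data unchanged.
ascii_is_invariant = "DB2".encode("utf-8") == "DB2".encode("ascii")
print(ascii_is_invariant)
```

The four samples take 1, 2, 3, and 4 bytes respectively, and the final comparison prints `True`.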

It is important that applications take into account the requirements of data as it is converted between the local code page, UCS-2 and UTF-8. For example, 20 characters will require exactly 40 bytes in UCS-2 and somewhere between 20 and 60 bytes in UTF-8, depending on the original code page and the characters used.
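The 20-character example can be reproduced with a minimal Python sketch, assuming the characters all fall within the 16-bit range that UCS-2 can represent. The function names `ucs2_bytes` and `utf8_bytes` are hypothetical helpers written for this illustration; they are not DB2 APIs.

```python
def ucs2_bytes(text: str) -> int:
    """Bytes needed to hold text in UCS-2: exactly two per character."""
    if any(ord(ch) > 0xFFFF for ch in text):
        # UCS-2 cannot represent characters beyond the 16-bit range.
        raise ValueError("character outside the range UCS-2 can represent")
    return 2 * len(text)


def utf8_bytes(text: str) -> int:
    """Bytes needed to hold text in UTF-8: one to three per UCS-2 character."""
    return len(text.encode("utf-8"))


# Three 20-character strings, chosen for illustration only.
ascii_text = "A" * 20         # invariant ASCII characters
accented_text = "\u00e9" * 20  # Latin small e with acute
kanji_text = "\u6f22" * 20     # a CJK ideograph

for label, text in [("ASCII", ascii_text),
                    ("accented", accented_text),
                    ("kanji", kanji_text)]:
    print(label, ucs2_bytes(text), utf8_bytes(text))
```

Each string needs 40 bytes in UCS-2, while the UTF-8 sizes are 20, 40, and 60 bytes. An application sizing a UTF-8 buffer for text of unknown content would therefore plan for the upper bound.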

47.1.1 DB2 Unicode Databases and Applications

A DB2 Universal Database for UNIX, Windows, or OS/2 database that is created with a code set of UTF-8 can store data in both UCS-2 and UTF-8 formats. Such a database is referred to as a Unicode database. SQL character data is encoded using UTF-8 and SQL graphic data is encoded using UCS-2. This means that MBCS characters, including both single-byte and double-byte characters, are stored in character columns, and DBCS characters are stored in graphic columns.

The code page of an application may not match the code page that DB2 uses to store data. In a non-Unicode database, when the code pages are not the same, the database manager converts character and graphic (pure DBCS) data that is transferred between client and server. In a Unicode database, the conversion of character data between the client code page and UTF-8 is automatically performed by the database manager, but all graphic (UCS-2) data is passed without any conversion between the client and the server.
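Conceptually, the character-data conversion described above resembles the following Python sketch. It is only an analogy and does not show how the database manager is implemented. The function name `to_utf8` is hypothetical, and Windows code page 1252 is an assumed example of a client code page.

```python
def to_utf8(data: bytes, client_code_page: str) -> bytes:
    """Convert character data from a client code page to UTF-8.

    Hypothetical helper for illustration; not a DB2 interface.
    """
    return data.decode(client_code_page).encode("utf-8")


# "cafe" with an accented final letter, as a client using
# code page 1252 would store it: one byte per character.
client_data = "caf\u00e9".encode("cp1252")

server_data = to_utf8(client_data, "cp1252")

print(len(client_data), client_data.hex())
print(len(server_data), server_data.hex())
```

The four client bytes (`636166e9`) become five UTF-8 bytes (`636166c3a9`), because the accented letter needs two bytes in UTF-8. Converted data can be longer than the original, which matters when sizing host variables and columns.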

Figure 1. Code Page Conversions Performed by the Database Manager


Notes:

  1. When connecting to Unicode databases, if the application sets DB2CODEPAGE=1208, the local code page is UTF-8, so no code page conversion is needed.

  2. When connected to a Unicode database, CLI applications can also receive character data as graphic data, and graphic data as character data.

It is possible for an application to specify a UTF-8 code page, indicating that it will send and receive all graphic data in UCS-2 and character data in UTF-8. This application code page is only supported for Unicode databases.

Other points to consider when using Unicode:

  1. The database code page is determined at the time the database is created, and by default its value is determined from the operating system locale (or code page). The CODESET and TERRITORY keywords can be used to explicitly create a Unicode DB2 database. For example:
    CREATE DATABASE unidb USING CODESET UTF-8 TERRITORY US
    
  2. The application code page also defaults to the local code page, but this can be overridden by UTF-8 in one of two ways:

    
    
  3. Data in GRAPHIC columns takes exactly two bytes for each Unicode character, whereas data in CHAR columns takes from 1 to 3 bytes for each Unicode character. SQL limits in terms of characters for GRAPHIC columns are generally half of those for CHAR columns, but they are equal in terms of bytes. For example, the maximum character length for a CHAR column is 254, and the maximum character length for a GRAPHIC column is 127. For more information, see MAX in the "Functions" chapter of the SQL Reference.
    
    
  4. A graphic literal is differentiated from a character literal by a G prefix. For example:
    SELECT * FROM mytable WHERE mychar = 'utf-8 data'
                            AND mygraphic = G'ucs-2 data'
    

    Note:
    The G prefix is optional for Unicode databases.
    See 41.6.2.4, "Literals in Unicode Databases" for more information and updated support.
    
    
  5. Support for CLI/ODBC and JDBC applications differs from the support for embedded applications. For information specific to CLI/ODBC support, see the CLI Guide and Reference.
    
    
  6. The byte ordering of UCS-2 data may differ between platforms. Internally, DB2 uses big-endian format.
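The byte-ordering difference mentioned in the last point can be seen in a small Python sketch. It assumes characters within the 16-bit range, for which Python's `utf-16-be` and `utf-16-le` codecs produce the same bytes as big-endian and little-endian UCS-2. The helper `swap_byte_order` is a hypothetical function written for this illustration, not a DB2 interface.

```python
def swap_byte_order(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit in UCS-2 data."""
    if len(data) % 2 != 0:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # second byte of each pair moves first
    swapped[1::2] = data[0::2]  # first byte of each pair moves second
    return bytes(swapped)


text = "DB2"
big_endian = text.encode("utf-16-be")     # high-order byte first
little_endian = text.encode("utf-16-le")  # low-order byte first

print(big_endian.hex())
print(little_endian.hex())
print(swap_byte_order(little_endian) == big_endian)
```

The big-endian form is `004400420032` and the little-endian form is `440042003200`. Swapping each byte pair converts one into the other, so the last line prints `True`.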

47.1.2 Documentation Updates

These release notes include updates to the following information on using Unicode with DB2 Version 7.1:

For more information on using Unicode with DB2, refer to the Administration Guide, National Language Support (NLS) appendix: "Unicode Support in DB2 UDB".

