Globalization support in Oracle
This is a concise compilation attempting to explain Oracle’s support of multiple languages. It is geared towards helping a .NET developer understand Oracle’s globalization support and also understand how to program in such an environment. It is not comprehensive but does attempt to answer some basic questions in this area. Oracle data provider for .NET internally uses Oracle’s OCI (Oracle Call Interface) libraries (unless you are using the managed version) and so do many other third-party drivers (DevArt dotConnect for Oracle in native mode, for example). The easiest way to deploy OCI libraries is using the Oracle Instant Client. Hence, it is also important to understand how configure it to support globalization.
Much of the information comes verbatim from Oracle® Database Globalization Support Guide 10g Release 2 (10.2)
In the past, Oracle's globalization support capabilities were referred to as National Language Support (NLS) features. National Language Support is a subset of globalization support. National Language Support is the ability to choose a national language and store data in a specific character set. Globalization support enables you to develop multilingual applications and software products that can be accessed and run from anywhere in the world simultaneously. An application can render content of the user interface and process data in the native users' languages and locale preferences.
The locale-dependent operations are controlled by several parameters and environment variables on both the client and the database server. On the database server, each session started on behalf of a client may run in the same or a different locale as other sessions, and have the same or different language requirements specified.
The database has a set of session-independent NLS parameters that are specified when the database is created. Two of the parameters specify the database character set and the national character set, an alternate Unicode character set that can be specified for
If the client session and the database server specify different character sets, then the database converts character set strings automatically.
When the application connects to a database, a session is created on the server. The new session initializes its NLS environment from NLS instance parameters specified in the initialization parameter file. These settings can be subsequently changed by an
Immediately after the connection has been established, if the
See Also:
After the database is created, you cannot change the character sets, with some exceptions, without re-creating the database.
If all client applications use the same character set, then that character set is usually the best choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.
SQL
You can convert between any two character sets
AL32UTF8 Oracle character set is based on the Unicode UTF-8 character set.
NLS (National Language Support) parameters determine the locale-specific behavior on both the client and the server. NLS parameters can be specified in the following ways:
The
Each component of the
In a UNIX operating system C-shell session, you can specify the value of
The
Setting
When a new database is created during the execution of the
You can find the values for NLS parameters by using:
Unicode 4.0 defines supplementary characters to meet this need. It uses two 16-bit code points (also known as supplementary characters) to represent a single character. This enables an additional 1,048,576 characters to be defined. The Unicode 4.0 standard defines 45,960 supplementary characters.
Adding supplementary characters increases the complexity of Unicode, but it is less complex than managing several different encodings in the same configuration.
UTF-8 is the Unicode encoding supported on UNIX platforms and used for HTML and most Internet browsers. Other environments such as Windows and Java use UCS-2 encoding.
The benefits of UTF-8 are as follows:
The benefits of UCS-2 over UTF-8 are as follows:
One character can be either 2 bytes or 4 bytes in UTF-16. Characters from European and most Asian scripts are represented in 2 bytes. Supplementary characters are represented in 4 bytes. UTF-16 is the main Unicode encoding used by Microsoft Windows 2000.
The benefits of UTF-16 over UTF-8 are as follows:
Table 6-1 Unicode Character Sets Supported by Oracle Database
You can create a Unicode database that enables you to store UTF-8 encoded characters as SQL
If you prefer to implement Unicode support incrementally or if you need to support multilingual data only in certain columns, then you can store Unicode data in either the UTF-16 or UTF-8 encoding form in SQL
To avoid overloading the database server with data conversion operations, Oracle always tries to move them to the client side database access products. In a few cases, data must be converted in the database, which affects performance. This chapter discusses details of the data conversion paths.
The Unicode mode, in which the OCI_UTF16 flag is used with the
When OCI_UTF16ID is specified for both SQL
Some more relevant information about the Instant Client and OCI can be found in the Oracle® Call Interface Programmer's Guide, 10g Release 2 (10.2)
Oracle Database 10g Release 2 library names are used in the table.
To use the Microsoft ODBC and OLEDB driver,
The OCI application operates in Instant Client mode when the OCI shared libraries are accessible through the operating system Library Path variable. In this mode, there is no dependency on
For OCCI, the OCCI Library (
for the JDBC OCI driver, in addition to the three OCI shared libraries, you must also download OCI JDBC Library (for example
All Oracle net naming methods that do not require use of
One application can have several environment handles initialized within the same system environment using different client side character set IDs and national character set IDs.
where
When character set IDs are set through the function
Any Oracle character set ID, except AL16UTF16, can be specified through the
You can retrieve character sets in NLS_LANG and NLS_NCHAR through another function,
I also encourage you to read the previous post related to NLS_LANG and Oracle Instant Client
Much of the information comes verbatim from Oracle® Database Globalization Support Guide 10g Release 2 (10.2)
1 Overview of Globalization Support
Globalization Support Architecture
Oracle's globalization support enables you to store, process, and retrieve data in native languages. It ensures that database utilities, error messages, sort order, and date, time, monetary, numeric, and calendar conventions automatically adapt to any native language and locale.In the past, Oracle's globalization support capabilities were referred to as National Language Support (NLS) features. National Language Support is a subset of globalization support. National Language Support is the ability to choose a national language and store data in a specific character set. Globalization support enables you to develop multilingual applications and software products that can be accessed and run from anywhere in the world simultaneously. An application can render content of the user interface and process data in the native users' languages and locale preferences.
Architecture to Support Multilingual Applications
The database is implemented to enable multitier applications and client/server applications to support languages for which the database is configured.The locale-dependent operations are controlled by several parameters and environment variables on both the client and the database server. On the database server, each session started on behalf of a client may run in the same or a different locale as other sessions, and have the same or different language requirements specified.
The database has a set of session-independent NLS parameters that are specified when the database is created. Two of the parameters specify the database character set and the national character set, an alternate Unicode character set that can be specified for
NCHAR
, NVARCHAR2
, and NCLOB
data. The parameters specify the character set that is used to store text data in the database. Other parameters, such as language and territory, are used to evaluate check constraints. If the client session and the database server specify different character sets, then the database converts character set strings automatically.
When the application connects to a database, a session is created on the server. The new session initializes its NLS environment from NLS instance parameters specified in the initialization parameter file. These settings can be subsequently changed by an
ALTER
SESSION
statement. The statement changes only the session NLS environment. It does not change the local client NLS environment. Immediately after the connection has been established, if the
NLS_LANG
environment setting is defined on the client side, then an implicit ALTER SESSION
statement synchronizes the client and session NLS environments. Using Unicode in a Multilingual Database
Unicode is a universal encoded character set that enables you to store information in any language, using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.Character Set Support
Oracle supports a large number of single-byte, multibyte, and fixed-width encoding schemes that are based on national, international, and vendor-specific standards.See Also:
- Chapter 2, "Choosing a Character Set"
- "Character Sets" for a list of supported character sets
Unicode Support
You can store Unicode characters in an Oracle database in two ways:- You can create a Unicode database that enables you to store UTF-8 encoded characters as SQL
CHAR
datatypes.
- You can support multilingual data in specific columns by using Unicode datatypes. You can store Unicode characters into columns of the SQL
NCHAR
datatypes regardless of how the database character set has been defined. TheNCHAR
datatype is an exclusively Unicode datatype.
2 Choosing a Character Set
Choosing an Oracle Database Character Set
Oracle uses the database character set for:- Data stored in SQL
CHAR
datatypes (CHAR
,VARCHAR2
,CLOB
, andLONG)
- Identifiers such as table names, column names, and PL/SQL variables
- Entering and storing SQL and PL/SQL source code
CREATE
DATABASE
statement. All SQL CHAR
datatype columns (CHAR
, CLOB
,VARCHAR2
, and LONG)
, including columns in the data dictionary, have their data stored in the database character set. In addition, the choice of database character set determines which characters can name objects in the database. SQL NCHAR
datatype columns (NCHAR
, NCLOB
, and NVARCHAR2)
use the national character set. After the database is created, you cannot change the character sets, with some exceptions, without re-creating the database.
Client Operating System and Application Compatibility
The database character set is independent of the operating system because Oracle has its own globalization architecture. For example, on an English Windows operating system, you can create and run a database with a Japanese character set. However, when an application in the client operating system accesses the database, the client operating system must be able to support the database character set with appropriate fonts and input methods. For example, you cannot insert or retrieve Japanese data on the English Windows operating system without first installing a Japanese font and input method. Another way to insert and retrieve Japanese data is to use a Japanese operating system remotely to access the database server.Character Set Conversion Between Clients and the Server
If you choose a database character set that is different from the character set on the client operating system, then the Oracle database can convert the operating system character set to the database character set. Character set conversion has the following disadvantages:- Potential data loss
- Increased overhead
If all client applications use the same character set, then that character set is usually the best choice for the database character set. When client applications use different character sets, the database character set should be a superset of all the client character sets. This ensures that every character is represented when converting from a client character set to the database character set.
Restrictions on Character Sets Used to Express Names
Table 2-5 lists the restrictions on the character sets that can be used to express names.
Name | Single-Byte | Variable Width | Comments |
---|---|---|---|
Column names | Yes | Yes | - |
Schema objects | Yes | Yes | - |
Comments | Yes | Yes | - |
Database link names | Yes | No | - |
Database names | Yes | No | - |
File names (datafile, log file, control file, initialization parameter file) | Yes | No | - |
Instance names | Yes | No | - |
Directory names | Yes | No | - |
Keywords | Yes | No | Can be expressed in English ASCII or EBCDIC characters only |
Choosing a National Character Set
A national character set is an alternate character set that enables you to store Unicode character data in a database that does not have a Unicode database character set.SQL
NCHAR
, NVARCHAR2
, and NCLOB
datatypes have been redefined to support Unicode data only. You can use either the UTF8 or the AL 16UTF16 character set. The default is AL16UTF16. Character Set Conversion in a Monolingual Scenario
Character set conversion may be required in a client/server environment if a client application resides on a different platform than the server and if the platforms do not use the same character encoding schemes. Character data passed between client and server must be converted between the two encoding schemes. Character conversion occurs automatically and transparently through Oracle Net.You can convert between any two character sets
Restricted Multilingual Support
Some character sets support multiple languages because they have related writing systems or scripts. For example, the WE8ISO8859P1 Oracle character set supports the following Western European languages:- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Icelandic
- Italian
- Norwegian
- Portuguese
- Spanish
- Swedish
Unrestricted Multilingual Support
If you need unrestricted multilingual support, then use a universal character set such as Unicode for the server database character set. Unicode has two major encoding schemes:- UTF-16: Each character is either 2 or 4 bytes long.
- UTF-8: Each character takes 1 to 4 bytes to store.
AL32UTF8 Oracle character set is based on the Unicode UTF-8 character set.
3 Setting Up a Globalization Support Environment
NLS (National Language Support) parameters determine the locale-specific behavior on both the client and the server. NLS parameters can be specified in the following ways:
- As initialization parameters on the server
You can include parameters in the initialization parameter file to specify a default session NLS environment. These settings have no effect on the client side; they control only the server's behavior. For example:NLS_TERRITORY = "CZECH REPUBLIC"
As environment variables on the client
You can use NLS environment variables, which may be platform-dependent, to specify locale-dependent behavior for the client and also to override the default values set for the session in the initialization parameter file. For example, on a UNIX system:% setenv NLS_SORT FRENCH
With theALTER SESSION
statement
You can use NLS parameters that are set in anALTER
SESSION
statement to override the default values that are set for the session in the initialization parameter file or set by the client with environment variables.ALTER SESSION SET NLS_SORT = FRENCH;
See Also:
Oracle Database SQL Reference for more information about theALTER
SESSION
statement
In SQL functions
You can use NLS parameters explicitly to hardcode NLS behavior within a SQL function. This practice overrides the default values that are set for the session in the initialization parameter file, set for the client with environment variables, or set for the session by theALTER
SESSION
statement. For example:TO_CHAR(hiredate, 'DD/MON/YYYY', 'nls_date_language = FRENCH')
Table 3-1 Methods of Setting NLS Parameters and Their Priorities
Priority | Method |
---|---|
1 (highest) | Explicitly set in SQL functions |
2 | Set by an ALTER SESSION statement |
3 | Set as an environment variable |
4 | Specified in the initialization parameter file |
5 | Default |
Table 3-2 lists the available NLS parameters. Because the SQL function NLS parameters can be specified only with specific functions, the table does not show the SQL function scope.
Parameter | Description | Default | Scope:I = Initialization Parameter File E = Environment Variable A = ALTER SESSION |
---|---|---|---|
Calendar system | Gregorian | I, E, A | |
SQL, PL/SQL operator comparison | BINARY | I, E, A | |
Credit accounting symbol | Derived from NLS_TERRITORY | E | |
Local currency symbol | Derived from NLS_TERRITORY | I, E, A | |
Date format | Derived from NLS_TERRITORY | I, E, A | |
Language for day and month names | Derived from NLS_LANGUAGE | I, E, A | |
Debit accounting symbol | Derived from NLS_TERRITORY | E | |
ISO international currency symbol | Derived from NLS_TERRITORY | I, E, A | |
NLS_LANG | Language, territory, character set | AMERICAN_AMERICA. US7ASCII | E |
Language | Derived from NLS_LANG | I, A | |
How strings are treated | BYTE | I, E, A | |
Character that separates items in a list | Derived from NLS_TERRITORY | E | |
Monetary symbol for dollar and cents (or their equivalents) | Derived from NLS_TERRITORY | E | |
Reports data loss during a character type conversion | FALSE | I, A | |
Decimal character and group separator | Derived from NLS_TERRITORY | I, E, A | |
Character sort sequence | Derived from NLS_LANGUAGE | I, E, A | |
Territory | Derived from NLS_LANG | I, A | |
Timestamp | Derived from NLS_TERRITORY | I, E, A | |
Timestamp with time zone | Derived from NLS_TERRITORY | I, E, A | |
Dual currency symbol | Derived from NLS_TERRITORY | I, E, A |
Choosing a Locale with the NLS_LANG Environment Variable
A locale is a linguistic and cultural environment in which a system or program is running. Setting theNLS_LANG
environment parameter is the simplest way to specify locale behavior for Oracle software. It sets the language and territory used by the client application and the database server. It also sets the client's character set, which is the character set for data entered or displayed by a client program. NLS_LANG
is set as an environment variable on UNIX platforms. NLS_LANG
is set in the registry on Windows platforms. The
NLS_LANG
parameter has three components: language, territory, and character set. Specify it in the following format, including the punctuation:NLS_LANG = language_territory.charsetFor example, if the Oracle Installer does not populate
NLS_LANG
, then its value by default is AMERICAN_AMERICA.US7ASCII
. The language is AMERICAN
, the territory is AMERICA
, and the character set is US7ASCII
. The values in NLS_LANG
and other NLS parameters are case-insensitive. Each component of the
NLS_LANG
parameter controls the operation of a subset of globalization support features: - language
Specifies conventions such as the language used for Oracle messages, sorting, day names, and month names. Each supported language has a unique name; for example,AMERICAN
,FRENCH
, orGERMAN
. The language argument specifies default values for the territory and character set arguments. If the language is not specified, then the value defaults toAMERICAN
. - territory
Specifies conventions such as the default date, monetary, and numeric formats. Each supported territory has a unique name; for example,AMERICA
,FRANCE
, orCANADA
. If the territory is not specified, then the value is derived from the language value. - charset
Specifies the character set used by the client application (normally the Oracle character set that corresponds to the user's terminal character set or the OS character set). Each supported character set has a unique acronym, for example,US7ASCII
,WE8ISO8859P1
,WE8DEC
,WE8MSWIN1252
, orJA16EUC
. Each language has a default character set associated with it.
Note:
All components of theNLS_LANG
definition are optional; any item that is not specified uses its default value. If you specify territory or character set, then you must include the preceding delimiter [underscore (_) for territory, period (.) for character set]. Otherwise, the value is parsed as a language name.
For example, to set only the territory portion ofNLS_LANG
, use the following format:NLS_LANG=_JAPAN
Specifying the Value of NLS_LANG
In a UNIX operating system C-shell session, you can specify the value of
NLS_LANG
by entering a statement similar to the following:% setenv NLS_LANG FRENCH_FRANCE.WE8ISO8859P1Because
NLS_LANG
is an environment variable, it is read by the client application at startup time. The client communicates the information defined by NLS_LANG
to the server when it connects to the database server. Overriding Language and Territory Specifications
TheNLS_LANG
parameter sets the language and territory environment used by both the server session (for example, SQL command execution) and the client application (for example, display formatting in Oracle tools). Using this parameter ensures that the language environments of both the database and the client application are automatically the same. Should the NLS_LANG Setting Match the Database Character Set?
The
NLS_LANG
character set should reflect the setting of the operating system character set of the client. For example, if the database character set is AL32UTF8 and the client is running on a Windows operating system, then you should not set AL32UTF8 as the client character set in the NLS_LANG
parameter because there are no UTF-8 WIN32 clients. Instead, the NLS_LANG
setting should reflect the code page of the client. For example, on an English Windows client, the code page is 1252. An appropriate setting for NLS_LANG
is AMERICAN_AMERICA.WE8MSWIN1252
. Setting
NLS_LANG
correctly allows proper conversion from the client operating system character set to the database character set. When these settings are the same, Oracle assumes that the data being sent or received is encoded in the same character set as the database character set, so character set validation or conversion may not be performed. This can lead to corrupt data if the client code page and the database character set are different and conversions are necessary. NLS Database Parameters
When a new database is created during the execution of the
CREATE
DATABASE
statement, the NLS-related database configuration is established. The current NLS instance parameters are stored in the data dictionary along with the database and national character sets. The NLS instance parameters are read from the initialization parameter file at instance startup. You can find the values for NLS parameters by using:
NLS Data Dictionary Views
Applications can check the session, instance, and database NLS parameters by querying the following data dictionary views:NLS_SESSION_PARAMETERS
shows the NLS parameters and their values for the session that is querying the view. It does not show information about the character set.NLS_INSTANCE_PARAMETERS
shows the current NLS instance parameters that have been explicitly set and the values of the NLS instance parameters.NLS_DATABASE_PARAMETERS
shows the values of the NLS parameters for the database. The values are stored in the database.
6 Supporting Multilingual Databases with Unicode
What is Unicode?
Oracle started supporting Unicode as a database character set in Oracle Database 7. In Oracle Database 10g, Unicode support has been expanded. Oracle Database 10g, Release 2 supports Unicode 4.0.Supplementary Characters
The first version of Unicode was a 16-bit, fixed-width encoding that used two bytes to encode each character. This allowed 65,536 characters to be represented. However, more characters need to be supported, especially additional CJK ideographs that are important for the Chinese, Japanese, and Korean markets.Unicode 4.0 defines supplementary characters to meet this need. It uses two 16-bit code points (also known as supplementary characters) to represent a single character. This enables an additional 1,048,576 characters to be defined. The Unicode 4.0 standard defines 45,960 supplementary characters.
Adding supplementary characters increases the complexity of Unicode, but it is less complex than managing several different encodings in the same configuration.
Unicode Encodings
The Unicode standard encodes characters in different ways: UTF-8, UCS-2, and UTF-16. Conversion between different Unicode encodings is a simple bit-wise operation that is defined in the Unicode standard.UTF-8 Encoding
UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding and a strict superset of ASCII. This means that each and every character in the ASCII character set is available in UTF-8 with the same code point values. One Unicode character can be 1 byte, 2 bytes, 3 bytes, or 4 bytes in UTF-8 encoding. Characters from the European scripts are represented in either 1 or 2 bytes. Characters from most Asian scripts are represented in 3 bytes. Supplementary characters are represented in 4 bytes.UTF-8 is the Unicode encoding supported on UNIX platforms and used for HTML and most Internet browsers. Other environments such as Windows and Java use UCS-2 encoding.
The benefits of UTF-8 are as follows:
- Compact storage requirement for European scripts because it is a strict superset of ASCII
- Ease of migration between ASCII-based characters sets and UTF-8
UCS-2 Encoding
UCS-2 is a fixed-width, 16-bit encoding. Each character is 2 bytes. UCS-2 is the Unicode encoding used by Java and Microsoft Windows NT 4.0. UCS-2 supports characters defined for Unicode 3.0, so there is no support for supplementary characters.The benefits of UCS-2 over UTF-8 are as follows:
- More compact storage for Asian scripts because all characters are two bytes
- Faster string processing because characters are fixed-width
- Better compatibility with Java and Microsoft clients
UTF-16 Encoding
UTF-16 encoding is the 16-bit encoding of Unicode. UTF-16 is an extension of UCS-2 because it supports the supplementary characters by using two UCS-2 code points for each supplementary character. UTF-16 is a strict superset of UCS-2.One character can be either 2 bytes or 4 bytes in UTF-16. Characters from European and most Asian scripts are represented in 2 bytes. Supplementary characters are represented in 4 bytes. UTF-16 is the main Unicode encoding used by Microsoft Windows 2000.
The benefits of UTF-16 over UTF-8 are as follows:
- More compact storage for Asian scripts because most of the commonly used Asian characters are represented in two bytes.
- Better compatibility with Java and Microsoft clients
Oracle's Support for Unicode
Oracle started supporting Unicode as a database character set in release 7. Table 6-1 summarizes the Unicode character sets supported by Oracle Database.Table 6-1 Unicode Character Sets Supported by Oracle Database
Table 6-1 Unicode Character Sets Supported by Oracle Database
Character Set | Supported in RDBMS Release | Unicode Encoding | Unicode Version | Database Character Set | National Character Set |
---|---|---|---|---|---|
7.2 - 8i | UTF-8 | 1.1 | Yes | No | |
8.0 - 10g | UTF-8 | For Oracle Database release 8.0 through Oracle8i release 8.1.6: 2.1 For Oracle8i Database release 8.1.7 and later: 3.0 | Yes | Yes (Oracle9i Database and Oracle Database 10g only) | |
8.0 - 10g | UTF-EBCDIC | For Oracle8i Database releases 8.0 through 8.1.6: 2.1 For Oracle8i Database release 8.1.7 and later: 3.0 | Yes | No | |
9i - 10g | UTF-8 | Oracle9i Database Release 1: 3.0 Oracle9i Database Release 2: 3.1 Oracle Database 10g, Release 1: 3.2 Oracle Database 10g, Release 2: 4.0 | Yes | No | |
9i - 10g | UTF-16 | Oracle9i Database Release 1: 3.0 Oracle9i Database Release 2: 3.1 Oracle Database 10g, Release 1: 3.2 Oracle Database 10g, Release 2: 4.0 | No | Yes |
Implementing a Unicode Solution in the Database
You can store Unicode characters in an Oracle database in two ways.You can create a Unicode database that enables you to store UTF-8 encoded characters as SQL
CHAR
datatypes (CHAR
, VARCHAR2
, CLOB
, and LONG)
. If you prefer to implement Unicode support incrementally or if you need to support multilingual data only in certain columns, then you can store Unicode data in either the UTF-16 or UTF-8 encoding form in SQL
NCHAR
datatypes (NCHAR
, NVARCHAR2
, and NCLOB
). The SQL NCHAR
datatypes are called Unicode datatypes because they are used only for storing Unicode data. Enabling Multilingual Support with Unicode Databases
The database character set specifies the encoding to be used in the SQLCHAR
datatypes as well as the metadata such as table names, column names, and SQL statements. A Unicode database is a database with a UTF-8 character set as the database character set. There are three Oracle character sets that implement the UTF-8 encoding. The first two are designed for ASCII-based platforms while the third one should be used on EBCDIC platforms. - AL32UTF8
The AL32UTF8 character set supports the latest version of the Unicode standard. It encodes characters in one, two, or three bytes. Supplementary characters require four bytes. It is for ASCII-based platforms. - UTF8
The UTF8 character set encodes characters in one, two, or three bytes. It is for ASCII-based platforms.
The UTF8 character set has supported Unicode 3.0 since Oracle8i release 8.1.7 and will continue to support Unicode 3.0 in future releases of Oracle Database. Although specific supplementary characters were not assigned code points in Unicode until version 3.1, the code point range was allocated for supplementary characters in Unicode 3.0. If supplementary characters are inserted into a UTF8 database, then it does not corrupt the data in the database. The supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes. Oracle recommends that you switch to AL32UTF8 for full support of supplementary characters in the database character set. - UTFE
The UTFE character set is for EBCDIC platforms. It is similar to UTF8 on ASCII platforms, but it encodes characters in one, two, three, and four bytes. Supplementary characters are converted as two 4-byte characters.
7 Programming with Unicode
Overview of Programming with Unicode
Oracle offers several database access products for inserting and retrieving Unicode data. Oracle offers database access products for commonly used programming environments such as Java and C/C++. Data is transparently converted between the database and client programs, which ensures that client programs are independent of the database character set and national character set. In addition, client programs are sometimes even independent of the character datatype, such asNCHAR
or CHAR
, used in the database. To avoid overloading the database server with data conversion operations, Oracle always tries to move them to the client side database access products. In a few cases, data must be converted in the database, which affects performance. This chapter discusses details of the data conversion paths.
Database Access Product Stack and Unicode
The Oracle Call Interface (OCI) is the lowest level API that the rest of the client-side database access products use. It provides a flexible way for C/C++ programs to access Unicode data stored in SQLCHAR
and NCHAR
datatypes. Using OCI, you can programmatically specify the character set (UTF-8, UTF-16, and others) for the data to be inserted or retrieved.OCI Programming with Unicode
OCIEnvNlsCreate() Function for Unicode Programming
TheOCIEnvNlsCreate()
function is used to specify a SQL CHAR
character set and a SQL NCHAR
character set when the OCI environment is created. It is an enhanced version of the OCIEnvCreate()
function and has extended arguments for two character set IDs. The OCI_UTF16ID UTF-16 character set ID replaces the Unicode mode introduced in Oracle9i release 1 (9.0.1). The Unicode mode, in which the OCI_UTF16 flag is used with the
OCIEnvCreate()
function, is deprecated. When OCI_UTF16ID is specified for both SQL
CHAR
and SQL NCHAR
character sets, all metadata and bound and defined data are encoded in UTF-16. Metadata includes SQL statements, user names, error messages, and column names. Thus, all inherited operations are independent of the NLS_LANG
setting,….OCI Unicode Code Conversion
Unicode character set conversions take place between an OCI client and the database server if the client and server character sets are different. The conversion occurs on either the client or the server depending on the circumstances, but usually on the client side.Setting UTF-8 to the NLS_LANG Character Set in OCI
For OCI client applications that support Unicode UTF-8 encoding, use AL32UTF8 to specify theNLS_LANG
character set, unless the database character set is UTF8. Use UTF8 if the database character set is UTF8.Appendix A - Locale Data
There is a detailed reference information about languages, character sets etc in Appendix A - Locale DataSome more relevant information about the Instant Client and OCI can be found in the Oracle® Call Interface Programmer's Guide, 10g Release 2 (10.2)
1 Introduction and Upgrading
OCI Instant Client
Table 1-3 OCI Instant Client Shared Libraries
Linux and UNIX | Description for Linux and UNIX | Windows | Description for Windows |
---|---|---|---|
libclntsh.so.10.1 | Client Code Library | oci.dll | Forwarding functions that applications link with |
libociei.so | OCI Instant Client Data Shared Library | oraociei10.dll | Data and code |
libnnz10.so | Security Library | orannzsbb10.dll | Security Library |
Oracle Database 10g Release 2 library names are used in the table.
To use the Microsoft ODBC and OLEDB driver,
ociw32.dll
must also be copied from ORACLE_HOME\bin. OCI Instant Client Installation Process
The installation process is as simple as- Downloading and installing the Instant Client shared libraries to a directory such as
instantclient_10_2
. - Setting the operating system shared library path environment variable to the directory from step 1. For example, on Linux or UNIX, set the LD_LIBRARY_PATH to
instantclient_10_2
. On Windows, set PATH to locate theinstantclient_10_2
directory.
The OCI application operates in Instant Client mode when the OCI shared libraries are accessible through the operating system Library Path variable. In this mode, there is no dependency on
ORACLE_HOME
and none of the other code and data files provided in ORACLE_HOME
are needed by OCI (except for the tnsnames.ora
file described later). For OCCI, the OCCI Library (
libocci.so.10.1
on Linux or UNIX and oraocci10.dll
on Windows) must also be installed in the Instant Client directory. for the JDBC OCI driver, in addition to the three OCI shared libraries, you must also download OCI JDBC Library (for example
libocijdbc10.so
on Linux or UNIX and oraocijdbc10.dll
on Windows) and ojdbcXY.jar
(where XY is the version number, for example, ojdbc14.jar
).Database Connection Strings for OCI Instant Client
The OCI Instant Client can make remote database connections in all the ways that ordinary SQL clients can. However because the Instant Client does not have theORACLE_HOME
environment and directory structure some of the database naming methods will require additional configuration steps. All Oracle net naming methods that do not require use of
ORACLE_HOME
or TNS_ADMIN
(to locate configuration files such as tnsnames.ora
or sqlnet.ora
) work in the Instant Client mode. In particular, the connect_identifier
in the OCIServerAttach()
call can be specified in the following formats: - A SQL Connect URL string of the form:
[//]host[:port][/service name]
such as://dlsun242:5521/bjava21
- As an Oracle Net connect descriptor. For example:
"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp) (HOST=dlsun242) (PORT=5521))
(CONNECT_DATA=(SERVICE_NAME=bjava21)))"
2 OCI Programming Basics
OCI Globalization Support
Client Character Set Control from OCI
The functionOCIEnvNlsCreate()
enables you to set character set information in applications, independently from NLS_LANG and NLS_NCHAR settings. One application can have several environment handles initialized within the same system environment using different client side character set IDs and national character set IDs.
OCIEnvNlsCreate(OCIEnv **envhp, ..., csid, ncsid);
where
csid
is the value for character set ID, and ncsid
is the value for national character set ID. Either can be 0 or OCI_UTF16ID
. If both are 0, this is equivalent to using OCIEnvCreate()
instead. The other arguments are the same as for the OCIEnvCreate()
call. OCIEnvNlsCreate()
is an enhancement for programmatic control of character sets, because it validates OCI_UTF16ID
. When character set IDs are set through the function
OCIEnvNlsCreate()
, they will replace the settings in NLS_LANG and NLS_NCHAR. In addition to all character sets supported by NLSRTL, OCI_UTF16ID
is also allowed as a character set ID in the OCIEnvNlsCreate()
function, although this ID is not valid in NLS_LANG or NLS_NCHAR. Any Oracle character set ID, except AL16UTF16, can be specified through the
OCIEnvNlsCreate()
function to specify the encoding of metadata, SQL CHAR
data, and SQL NCHAR
data. You can retrieve character sets in NLS_LANG and NLS_NCHAR through another function,
OCINlsEnvironmentVariableGet()
.I also encourage you to read the previous post related to NLS_LANG and Oracle Instant Client