| --- |
| title: Character Set Support Reference |
| --- |
| |
| This topic provides a referene of the character sets supported in HAWQ. |
| |
| The character set support in HAWQ allows you to store text in a variety of character sets, including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding). The default character set is selected while initializing your HAWQ using `hawq init.` It can be overridden when you create a database, so you can have multiple databases each with a different character set. |
| |
| <table style="width:100%;"> |
| <colgroup> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| <col width="16%" /> |
| </colgroup> |
| <thead> |
| <tr class="header"> |
| <th>Name</th> |
| <th>Description</th> |
| <th>Language</th> |
| <th>Server</th> |
| <th>Bytes/Char</th> |
| <th>Aliases</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr class="odd"> |
| <td>BIG5</td> |
| <td>Big Five</td> |
| <td>Traditional Chinese</td> |
| <td>No</td> |
| <td>1-2</td> |
| <td>WIN950, Windows950</td> |
| </tr> |
| <tr class="even"> |
| <td>EUC_CN</td> |
| <td>Extended UNIX Code-CN</td> |
| <td>Simplified Chinese</td> |
| <td>Yes</td> |
| <td>1-3</td> |
| <td></td> |
| </tr> |
| <tr class="odd"> |
| <td>EUC_JP</td> |
| <td>Extended UNIX Code-JP</td> |
| <td>Japanese</td> |
| <td>Yes</td> |
| <td>1-3</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>EUC_KR</td> |
| <td>Extended UNIX Code-KR</td> |
| <td>Korean</td> |
| <td>Yes</td> |
| <td>1-3</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>EUC_TW</td> |
| <td>Extended UNIX Code-TW</td> |
| <td>Traditional Chinese, Taiwanese</td> |
| <td>Yes</td> |
| <td>1-3</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>GB18030</td> |
| <td>National Standard</td> |
| <td>Chinese</td> |
| <td>No</td> |
| <td>1-2</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>GBK</td> |
| <td>Extended National Standard</td> |
| <td>Simplified Chinese</td> |
| <td>No</td> |
| <td>1-2</td> |
| <td>WIN936,Windows936</td> |
| </tr> |
| <tr class="even"> |
| <td>ISO_8859_5</td> |
| <td>ISO 8859-5, ECMA 113</td> |
| <td>Latin/Cyrillic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>ISO_8859_6</td> |
| <td>ISO 8859-6, ECMA 114</td> |
| <td>Latin/Arabic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>ISO_8859_7</td> |
| <td>ISO 8859-7, ECMA 118</td> |
| <td>Latin/Greek</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>ISO_8859_8</td> |
| <td>ISO 8859-8, ECMA 121</td> |
| <td>Latin/Hebrew</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>JOHAB</td> |
| <td>JOHA</td> |
| <td>Korean (Hangul)</td> |
| <td>Yes</td> |
| <td>1-3</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>KOI8</td> |
| <td>KOI8-R(U)</td> |
| <td>Cyrillic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>KOI8R</td> |
| </tr> |
| <tr class="even"> |
| <td>LATIN1</td> |
| <td>ISO 8859-1, ECMA 94</td> |
| <td>Western European</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO88591</td> |
| </tr> |
| <tr class="odd"> |
| <td>LATIN2</td> |
| <td>ISO 8859-2, ECMA 94</td> |
| <td>Central European</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO88592</td> |
| </tr> |
| <tr class="even"> |
| <td>LATIN3</td> |
| <td>ISO 8859-3, ECMA 94</td> |
| <td>South European</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO88593</td> |
| </tr> |
| <tr class="odd"> |
| <td>LATIN4</td> |
| <td>ISO 8859-4, ECMA 94</td> |
| <td>North European</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO88594</td> |
| </tr> |
| <tr class="even"> |
| <td>LATIN5</td> |
| <td>ISO 8859-9, ECMA 128</td> |
| <td>Turkish</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO88599</td> |
| </tr> |
| <tr class="odd"> |
| <td>LATIN6</td> |
| <td>ISO 8859-10, ECMA 144</td> |
| <td>Nordic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO885910</td> |
| </tr> |
| <tr class="even"> |
| <td>LATIN7</td> |
| <td>ISO 8859-13</td> |
| <td>Baltic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO885913</td> |
| </tr> |
| <tr class="odd"> |
| <td>LATIN8</td> |
| <td>ISO 8859-14</td> |
| <td>Celtic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO885914</td> |
| </tr> |
| <tr class="even"> |
| <td>LATIN9</td> |
| <td>ISO 8859-15</td> |
| <td>LATIN1 with Euro and accents</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO885915</td> |
| </tr> |
| <tr class="odd"> |
| <td>LATIN10</td> |
| <td>ISO 8859-16, ASRO SR 14111</td> |
| <td>Romanian</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ISO885916</td> |
| </tr> |
| <tr class="even"> |
| <td>MULE_INTERNAL</td> |
| <td>Mule internal code</td> |
| <td>Multilingual Emacs</td> |
| <td>Yes</td> |
| <td>1-4</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>SJIS</td> |
| <td>Shift JIS</td> |
| <td>Japanese</td> |
| <td>No</td> |
| <td>1-2</td> |
| <td>Mskanji, ShiftJIS, WIN932, Windows932</td> |
| </tr> |
| <tr class="even"> |
| <td>SQL_ASCII</td> |
| <td>unspecified2</td> |
| <td>any</td> |
| <td>No</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>UHC</td> |
| <td>Unified Hangul Code</td> |
| <td>Korean</td> |
| <td>No</td> |
| <td>1-2</td> |
| <td>WIN949, Windows949</td> |
| </tr> |
| <tr class="even"> |
| <td>UTF8</td> |
| <td>Unicode, 8-bit </td> |
| <td>all</td> |
| <td>Yes</td> |
| <td>1-4</td> |
| <td>Unicode</td> |
| </tr> |
| <tr class="odd"> |
| <td>WIN866</td> |
| <td>Windows CP866</td> |
| <td>Cyrillic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ALT</td> |
| </tr> |
| <tr class="even"> |
| <td>WIN874</td> |
| <td>Windows CP874</td> |
| <td>Thai</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>WIN1250</td> |
| <td>Windows CP1250</td> |
| <td>Central European</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>WIN1251</td> |
| <td>Windows CP1251</td> |
| <td>Cyrillic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>WIN</td> |
| </tr> |
| <tr class="odd"> |
| <td>WIN1252</td> |
| <td>Windows CP1252</td> |
| <td>Western European</td> |
| <td>Yes</td> |
| <td><p>1</p></td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>WIN1253</td> |
| <td>Windows CP1253</td> |
| <td>Greek</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>WIN1254</td> |
| <td>Windows CP1254</td> |
| <td>Turkish</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>WIN1255</td> |
| <td>Windows CP1255</td> |
| <td>Hebrew</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>WIN1256</td> |
| <td>Windows CP1256</td> |
| <td>Arabic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="even"> |
| <td>WIN1257</td> |
| <td>Windows CP1257</td> |
| <td>Baltic</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td> </td> |
| </tr> |
| <tr class="odd"> |
| <td>WIN1258</td> |
| <td>Windows CP1258</td> |
| <td>Vietnamese</td> |
| <td>Yes</td> |
| <td>1</td> |
| <td>ABC, TCVN, TCVN5712, VSCII </td> |
| </tr> |
| </tbody> |
| </table> |
| |
| **Note:** |
| - Not all the APIs support all the listed character sets. For example, the JDBC driver does not support MULE\_INTERNAL, LATIN6, LATIN8, and LATIN10. |
| - The SQLASCII setting behaves considerable differently from the other settings. Byte values 0-127 are interpreted according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. If you are working with any nonASCII data, it is unwise to use the SQL\_ASCII setting as a client encoding. SQL\_ASCII is not supported as a server encoding. |
| |
| ## Setting the Character Set<a id="settingthecharacterset"></a> |
| |
| `hawq init` defines the default character set for a HAWQ system by reading the setting of the ENCODING parameter in the gp\_init\_config file at initialization time. The default character set is UNICODE or UTF8. |
| |
| You can create a database with a different character set besides what is used as the system-wide default. For example: |
| |
| ``` pre |
| CREATE DATABASE korean WITH ENCODING 'EUC_KR'; |
| ``` |
| |
| **Note:** Although you can specify any encoding you want for a database, it is unwise to choose an encoding that is not what is expected by the locale you have selected. The LC\_COLLATE and LC\_CTYPE settings imply a particular encoding, and locale-dependent operations (such as sorting) are likely to misinterpret data that is in an incompatible encoding. |
| |
| Since these locale settings are frozen by hawq init, the apparent flexibility to use different encodings in different databases is more theoretical than real. |
| |
| One way to use multiple encodings safely is to set the locale to C or POSIX during initialization time, thus disabling any real locale awareness. |
| |
| ## Character Set Conversion Between Server and Client<a id="charactersetconversionbetweenserverandclient"></a> |
| |
| HAWQ supports automatic character set conversion between server and client for certain character set combinations. The conversion information is stored in the master pg\_conversion system catalog table. HAWQ comes with some predefined conversions or you can create a new conversion using the SQL command CREATE CONVERSION. |
| |
| | Server Character Set | Available Client Sets | |
| |----------------------|--------------------------------------------------------------------------------------------------------------------------------| |
| | BIG5 | not supported as a server encoding | |
| | EUC\_CN | EUC\_CN, MULE\_INTERNAL, UTF8 | |
| | EUC\_JP | EUC\_JP, MULE\_INTERNAL, SJIS, UTF8 | |
| | EUC\_KR | EUC\_KR, MULE\_INTERNAL, UTF8 | |
| | EUC\_TW | EUC\_TW, BIG5, MULE\_INTERNAL, UTF8 | |
| | GB18030 | not supported as a server encoding | |
| | GBK | not supported as a server encoding | |
| | ISO\_8859\_5 | ISO\_8859\_5, KOI8, MULE\_INTERNAL, UTF8, WIN866, WIN1251 | |
| | ISO\_8859\_6 | ISO\_8859\_6, UTF8 | |
| | ISO\_8859\_7 | ISO\_8859\_7, UTF8 | |
| | ISO\_8859\_8 | ISO\_8859\_8, UTF8 | |
| | JOHAB | JOHAB, UTF8 | |
| | KOI8 | KOI8, ISO\_8859\_5, MULE\_INTERNAL, UTF8, WIN866, WIN1251 | |
| | LATIN1 | LATIN1, MULE\_INTERNAL, UTF8 | |
| | LATIN2 | LATIN2, MULE\_INTERNAL, UTF8, WIN1250 | |
| | LATIN3 | LATIN3, MULE\_INTERNAL, UTF8 | |
| | LATIN4 | LATIN4, MULE\_INTERNAL, UTF8 | |
| | LATIN5 | LATIN5, UTF8 | |
| | LATIN6 | LATIN6, UTF8 | |
| | LATIN7 | LATIN7, UTF8 | |
| | LATIN8 | LATIN8, UTF8 | |
| | LATIN9 | LATIN9, UTF8 | |
| | LATIN10 | LATIN10, UTF8 | |
| | MULE\_INTERNAL | MULE\_INTERNAL, BIG5, EUC\_CN, EUC\_JP, EUC\_KR, EUC\_TW, ISO\_8859\_5, KOI8, LATIN1 to LATIN4, SJIS, WIN866, WIN1250, WIN1251 | |
| | SJIS | not supported as a server encoding | |
| | SQL\_ASCII | not supported as a server encoding | |
| | UHC | not supported as a server encoding | |
| | UTF8 | all supported encodings | |
| | WIN866 | WIN866 | |
| | WIN874 | WIN874, UTF8 | |
| | WIN1250 | WIN1250, LATIN2, MULE\_INTERNAL, UTF8 | |
| | WIN1251 | WIN1251, ISO\_8859\_5, KOI8, MULE\_INTERNAL, UTF8, WIN866 | |
| | WIN1252 | WIN1252, UTF8 | |
| | WIN1253 | WIN1253, UTF8 | |
| | WIN1254 | WIN1254, UTF8 | |
| | WIN1255 | WIN1255, UTF8 | |
| | WIN1256 | WIN1256, UTF8 | |
| | WIN1257 | WIN1257, UTF8 | |
| | WIN1258 | WIN1258, UTF8 | |
| |
| To enable automatic character set conversion, you have to tell HAWQ the character set (encoding) you would like to use in the client. There are several ways to accomplish this: |
| |
| - Using the \\encoding command in psql, which allows you to change client encoding on the fly. |
| - Using SET client\_encoding TO. Setting the client encoding can be done with this SQL command: |
| |
| ``` pre |
| SET CLIENT_ENCODING TO 'value'; |
| ``` |
| |
| To query the current client encoding: |
| |
| ``` pre |
| SHOW client_encoding; |
| ``` |
| |
| To return the default encoding: |
| |
| ``` pre |
| RESET client_encoding; |
| ``` |
| |
| - Using the PGCLIENTENCODING environment variable. When PGCLIENTENCODING is defined in the client's environment, that client encoding is automatically selected when a connection to the server is made. (This can subsequently be overridden using any of the other methods mentioned above.) |
| - Setting the configuration parameter client\_encoding. If client\_encoding is set in the master `hawq-site.xml` file, that client encoding is automatically selected when a connection to HAWQ is made. (This can subsequently be overridden using any of the other methods mentioned above.) |
| |
| If the conversion of a particular character is not possible — suppose you chose EUC\_JP for the server and LATIN1 for the client, then some Japanese characters do not have a representation in LATIN1 — then an error is reported. |
| |
| If the client character set is defined as SQL\_ASCII, encoding conversion is disabled, regardless of the server’s character set. The use of SQL\_ASCII is unwise unless you are working with all-ASCII data. SQL\_ASCII is not supported as a server encoding. |
| |
| |