| |
| Issues (and their resolutions) when using gettext for message translation |
| |
| Contents |
| ======== |
| |
| * Windows issues |
| * Automatic character set conversion |
| * Translations on the client |
| * Translations on the server |
| |
| |
| |
| Windows issues |
| ============== |
| |
| On Windows, Subversion is linked against a modified version of GNU gettext. |
| This resolves several issues: |
| |
| - Eliminated need to link against libiconv (which would be the second |
| iconv library, since we already link against apr-iconv) |
| - No automatic charset conversion (guaranteed UTF-8 strings returned by |
| gettext() calls without performance penalties) |
| |
| More in the paragraphs below... |
| |
| |
| Automatic character set conversion |
| ================================== |
| |
| Some gettext implementations automatically convert the strings in the |
| message catalogue to the active system character set. The source encoding |
| is stored in the entry for the "" message id. Its message string looks |
| somewhat like a MIME header and contains a "Content-Type" line whose |
| charset parameter names the encoding. It's typically GNU's gettext which |
| does this. |
| |
| Subversion uses UTF-8 to encode strings internally, which may not be the |
| system's default character encoding. To prevent internal corruption, |
| libsvn_subr:svn_cmdline_init2() explicitly tells gettext to return UTF-8 |
| encoded strings if the implementation provides bind_textdomain_codeset(). |
| |
| Some gettext implementations don't contain automatic string recoding. In |
| order to work with both recoding and non-recoding implementations, the |
| source strings must be UTF-8 encoded. This is achieved by requiring .po |
| files to be UTF-8 encoded. [Note: a pre-commit hook has been installed to |
| ensure this.] |
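| |
| A minimal sketch of the request described above, assuming a libintl that |
| provides bind_textdomain_codeset(). The helper name |
| request_utf8_messages() is hypothetical, and a real build would compile |
| the call only when a configure-time check finds the function: |
| |
```c
#include <libintl.h>
#include <stddef.h>

/* Hypothetical helper: ask gettext to hand back UTF-8 for one text
   domain, whatever the active system character set is.
   Returns 0 on success, -1 if the request was rejected. */
int
request_utf8_messages(const char *domain)
{
  /* bind_textdomain_codeset() returns the codeset now selected for the
     domain, or NULL on failure (for example, out of memory). */
  const char *codeset = bind_textdomain_codeset(domain, "UTF-8");

  return codeset != NULL ? 0 : -1;
}
```
| |
| On a non-recoding implementation there is nothing to call, which is why |
| the .po files themselves have to be UTF-8. |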
| |
| On Windows, Subversion links against a version of GNU gettext which has |
| been modified not to do character conversions. This eliminates the |
| requirement to link against libiconv, which would mean Subversion being |
| linked against two iconv libraries (apr-iconv as well as libiconv). |
| |
| |
| Translations on the client |
| ========================== |
| |
| The translation effort aims to translate most error messages generated on |
| the system on which the user has invoked a Subversion command (svnadmin, |
| svnlook, svndumpfilter, svnversion or svn). |
| |
| This means that in all layers of the libraries strings have been marked for |
| translation, either with _(), N_() or Q_(). |
| |
| Parameters are sprintf-ed straight into error strings at the time they are |
| added to the error structure, so most strings are marked with _() and |
| translated directly into the language for which the client was set up. |
| [Note: The N_() macro marks strings for delayed translation.] |
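| |
| The difference between the markers can be sketched as follows. The macro |
| definitions, the "subversion" domain name and both helper functions are |
| illustrative assumptions, not copies of Subversion's own headers: |
| |
```c
#include <libintl.h>
#include <stdio.h>

/* Illustrative definitions only; the project's real macros may differ. */
#define DOMAIN "subversion"
#define _(x) dgettext(DOMAIN, x)           /* translate now */
#define N_(x) x                            /* mark only, translate later */
#define Q_(s, p, n) dngettext(DOMAIN, s, p, n)  /* pick a plural form */

/* Static initializers cannot call a function, so these strings are only
   marked for extraction with N_() and translated when they are used. */
static const char *const error_formats[] = {
  N_("Path '%s' not found"),
  N_("Path '%s' is locked")
};

/* Translate the format first, then insert the parameter.  After this
   call the message can no longer be looked up in a catalogue. */
int
format_error(char *buf, size_t size, int code, const char *path)
{
  return snprintf(buf, size, _(error_formats[code]), path);
}

/* Q_() chooses between singular and plural before formatting. */
int
format_skipped(char *buf, size_t size, unsigned long count)
{
  return snprintf(buf, size,
                  Q_("Skipped %lu path", "Skipped %lu paths", count),
                  count);
}
```
| |
| With no catalogue installed for the domain, both helpers fall back to the |
| English strings given in the source. |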
| |
| |
| |
| Translations on the server |
| ========================== |
| |
| On systems which define the LC_MESSAGES constant, setlocale() can be used |
| to set string translation for all (error) strings, even those outside |
| the Subversion domain. |
| |
| Windows doesn't define LC_MESSAGES. Instead, GNU gettext uses the |
| environment variables LANGUAGE, LC_ALL, LC_MESSAGES and LANG (in that |
| order) to find out what language to translate to. If none of these are |
| defined, the system and user default locales are queried. Setting one of |
| the aforementioned variables before starting the server keeps Subversion |
| from localizing to the default locale, but messages generated by the |
| system itself are likely to still be in its default locale (they are on |
| Windows). |
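| |
| The lookup order above can be sketched like this. Both function names are |
| hypothetical, and the code mirrors only the order stated in the text, not |
| the full logic of GNU gettext (it does not query the system or user |
| default locales): |
| |
```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: return the first value that is set and non-empty,
   in the order LANGUAGE, LC_ALL, LC_MESSAGES, LANG; NULL if none is. */
const char *
pick_language(const char *language, const char *lc_all,
              const char *lc_messages, const char *lang)
{
  const char *candidates[4];
  size_t i;

  candidates[0] = language;
  candidates[1] = lc_all;
  candidates[2] = lc_messages;
  candidates[3] = lang;

  for (i = 0; i < 4; i++)
    if (candidates[i] != NULL && candidates[i][0] != '\0')
      return candidates[i];

  return NULL;  /* caller would fall back to the default locale */
}

/* Same decision, fed from the process environment. */
const char *
language_from_environment(void)
{
  return pick_language(getenv("LANGUAGE"), getenv("LC_ALL"),
                       getenv("LC_MESSAGES"), getenv("LANG"));
}
```
| |
| Because the environment is read by the process itself, the choice is |
| effectively fixed for the lifetime of a server started this way. |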
| |
| Systems which have the LC_MESSAGES flag or setenv() (Windows has neither) |
| allow languages to be switched at run time, but this cannot be done |
| portably. |
| |
| Any attempt to use setlocale() in an Apache environment may conflict |
| with settings other modules expect to be set up (even when using a |
| prefork MPM). On the svnserve side, having no portable way to change |
| languages dynamically means that the environment has to be set up |
| correctly from the start. Furthermore, the svnserve protocol doesn't |
| yet support content negotiation. |
| |
| In other words, there is no way -- programmatically -- to ensure that |
| messages are served in any specific language using a traditional |
| gettext implementation. Current consensus is that gettext must be |
| replaced on the server side with a more flexible implementation. |
| |
| Server requirement(s): |
| - Language negotiation on a per-client session basis. For a |
| stateless protocol like HTTP, this means per-request. For a |
| stateful protocol like the one used by svnserve, this means |
| per-connection. |
| - Avoid contamination of environment used by other code (e.g. other |
| Apache modules running in the same server as mod_dav_svn). |
| - Allow for propagation of the language to use to hook scripts. |
| - Continue to interoperate with generic HTTP/DAV clients, and stay |
| compatible with SVN clients of various versions (as per existing |
| compatibility rules). |
| |
| I18N module requirement(s): |
| - Cross-platform. |
| - Interoperable with gettext tools (e.g. for .po files). |
| - Non-viral license which allows for any necessary modifications. |
| - gettext-like API (needn't be an exact match). |
| |
| Implementation guidelines: |
| - The L10N API will be uniform across all libraries, clients, and |
| servers. Server-negotiated language will be recorded in either a |
| context baton (e.g. apr_pool_t.userdata), or in thread-local |
| storage (TLS). |
| - Implemented on top of a new gettext-like module with per-struct or |
| per-thread locale mutator functions and storage for name/value |
| pairs (a glorified apr_hash_t). (See implementation from Nicolás |
| Lichtmaier noted below.) |
| - Language chosen by the server will be negotiated based on a ranked |
| list of preferences provided by the client. |
| - Language used by httpd/mod_dav_svn will be derived from the |
| Accept-Language HTTP header, and set up by mod_negotiation (when |
| available), or by mod_dav_svn on a per-request basis. |
| - Language used by svnserve will be derived from additions to the |
| protocol which allow for HTTP-style content negotiation on a |
| per-connection basis. The protocol extension would use the same sort of |
| q-value list found in the Accept-Language header to specify user |
| language preferences. |
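| |
| As a rough illustration of negotiating from a ranked q-value list, the |
| sketch below picks the supported language with the highest q-value. The |
| function negotiate_language() is hypothetical; it matches tags exactly |
| (ignoring case), ignores the "*" wildcard and prefix matching, and |
| assumes the numeric locale uses "." as the decimal point: |
| |
```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison of a length-delimited tag with a string. */
static int
tag_equals(const char *tag, size_t len, const char *name)
{
  size_t i;

  if (strlen(name) != len)
    return 0;
  for (i = 0; i < len; i++)
    if (tolower((unsigned char)tag[i]) != tolower((unsigned char)name[i]))
      return 0;
  return 1;
}

/* Hypothetical helper: return the entry of SUPPORTED with the highest
   q-value in HEADER (e.g. "da, en-gb;q=0.8, en;q=0.7"), or NULL if no
   listed language is supported.  Ties go to the earlier header entry. */
const char *
negotiate_language(const char *header, const char *const *supported,
                   size_t n_supported)
{
  const char *best = NULL;
  double best_q = 0.0;
  const char *p = header;

  while (*p != '\0')
    {
      const char *end = strchr(p, ',');
      const char *semi;
      size_t item_len, tag_len, i;
      double q = 1.0;  /* a tag without a q parameter counts as q=1 */

      if (end == NULL)
        end = p + strlen(p);
      while (p < end && isspace((unsigned char)*p))
        p++;

      item_len = (size_t)(end - p);
      semi = memchr(p, ';', item_len);
      tag_len = semi != NULL ? (size_t)(semi - p) : item_len;
      while (tag_len > 0 && isspace((unsigned char)p[tag_len - 1]))
        tag_len--;

      if (semi != NULL)
        {
          const char *param = semi + 1;
          while (param < end && isspace((unsigned char)*param))
            param++;
          if (end - param >= 2 && tolower((unsigned char)param[0]) == 'q'
              && param[1] == '=')
            q = strtod(param + 2, NULL);
        }

      if (q > best_q)
        for (i = 0; i < n_supported; i++)
          if (tag_equals(p, tag_len, supported[i]))
            {
              best = supported[i];
              best_q = q;
              break;
            }

      p = (*end == ',') ? end + 1 : end;
    }

  return best;
}
```
| |
| The same routine could serve both servers: httpd would feed it the |
| Accept-Language header per request, svnserve the equivalent list once |
| per connection. |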
| |
| Investigation: A brief canvassing of developers (on IRC) indicated that |
| no thorough investigation of existing solutions which might meet the |
| above requirements has been done. This incomplete canvassing may not |
| paint an accurate picture, however. |
| |
| A branch <http://svn.apache.org/repos/asf/subversion/branches/server-l10n/> |
| has been created to explore a solution to the above requirements. While |
| the L10N module is important, how that module is applied to both the |
| server-side and client-side is possibly even more so; an |
| implementation which meets the requirements should not dramatically |
| impact the solution used across the code base for the general L10N |
| API, nor the necessary server-side machinations. |
| |
| Nicolás Lichtmaier wrote something along the lines of the module |
| referenced in the implementation guidelines above |
| <http://svn.haxx.se/dev/archive-2004-04/0788.shtml>, which has been |
| committed to the server-l10n branch. However, it depends upon the GNU |
| gettext .mo format, and the GNU implementation may not be available on |
| all platforms (unless re-implemented). This module will need to be |
| enhanced or replaced, ideally completely obviating the need for |
| linkage against a platform's own gettext implementation. |
| |
| Whether to use TLS or a context baton for the L10N API is under |
| discussion. TLS can provide a more friendly API (albeit somewhat |
| underhanded), while use of a context baton is more resilient to change |
| (e.g. if httpd someday allowed more than one thread to service a |
| request). Here's a sample: |
| - No localization: "A message to localize" |
| - Localization w/ TLS or global preference: _("A message to localize") |
| - Localization w/ a context baton: _("A message to localize", pool) |
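| |
| The two API shapes can be contrasted in a small sketch. Everything here |
| is hypothetical: the l10n_ctx_t struct stands in for whatever baton is |
| chosen (the sample above uses a pool), lookup() stands in for a real |
| catalogue with a single hard-coded Dutch entry, and C11 _Thread_local |
| stands in for the platform's TLS mechanism: |
| |
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical baton carrying the language negotiated for a session. */
typedef struct l10n_ctx_t
{
  const char *language;
} l10n_ctx_t;

/* Stand-in for a catalogue lookup: knows one Dutch entry, otherwise
   returns the untranslated message. */
static const char *
lookup(const char *language, const char *msgid)
{
  if (language != NULL && strcmp(language, "nl") == 0
      && strcmp(msgid, "A message to localize") == 0)
    return "Een bericht om te lokaliseren";
  return msgid;
}

/* TLS flavour: the language is hidden per-thread state. */
static _Thread_local const char *tls_language = NULL;

void
l10n_set_thread_language(const char *language)
{
  tls_language = language;
}

const char *
l10n_tls(const char *msgid)
{
  return lookup(tls_language, msgid);
}

/* Baton flavour: every call site passes the context explicitly. */
const char *
l10n_ctx(const char *msgid, const l10n_ctx_t *ctx)
{
  return lookup(ctx != NULL ? ctx->language : NULL, msgid);
}
```
| |
| The TLS variant keeps call sites as short as plain _(), but silently |
| ties the language to the thread; the baton variant costs an extra |
| argument everywhere and keeps working if a request moves between |
| threads. |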
| |
| Historical note: Original consensus indicated that messages from the |
| server side should stay untranslated for transmission to the client. |
| However, plain client-side localization is not an option, because by then |
| the parameter values have been inserted into the string, meaning that |
| it can't be looked up in the message catalogue anymore. So localization |
| must either occur on the server, or the server must marshal messages as |
| unlocalized/unformatted data structures which the client localizes |
| using additional wrapper APIs for unmarshalling and message formatting, |
| at a significant cost in complexity. Additionally, client and server |
| versions may not match up, meaning that message keys and format string |
| values provided by the server may not correspond to what's available |
| on the client. |
| |
| Paul Querna suggested a variation on this scheme involving requesting |
| (once) and caching the localizations (to the local disk) for each |
| server version, along with sending the message key (for lookup of |
| localized text) and an already formatted text (to use as the default |
| when no localization bundle is available). In addition to the |
| complications mentioned previously, this has the downside of crippling |
| the localization of server-generated messages when no write access to |
| the local disk is available to the client. |