| -*- Text -*- |
| |
| |
| Content |
| ======= |
| |
| * Context |
| * Issue Description |
| * Pre-Resolution State of Affairs |
| - Single platform |
| - Multi-platform: Windows + MacOSX |
| * Proposed Support Library |
| - Assumptions |
| - Options |
| * Proposed Normal Form |
| * Possible Solutions |
| - Normalization of path-input on MacOSX |
| - Normalization of path-input on all platforms |
| - Normalization of path-input in client and server |
| - Comparison routines (client and server side) |
| * Short Term Solution (i.e. before 2.0) |
| * Long Term Solution (i.e. 2.0+) |
| * Short Term Solution Implementation Consequences |
| * Additional Information |
| * References |
| |
| |
| Context |
| ======= |
| |
| Within Unicode, some characters (those with diacritical marks) can be |
| represented in two forms: Normalization Form C (NFC, composed) or |
| Normalization Form D (NFD, decomposed). A string of Unicode characters |
| can contain any mixture of both forms. |
| |
| This problem explicitly does not concern itself with invisible |
| characters, spaces or other characters unlikely to be present in |
| filenames. Please note that this issue also explicitly excludes the |
| NFKC/NFKD (compatibility) normal forms, because they are lossy: they |
| remove, for example, formatting distinctions. |
| |
| Because there are two forms for representing (some) characters in |
| Unicode, it's possible to produce different sequences of codepoints |
| that denote the same sequence of characters [1]. UTF-8, the |
| internal Unicode encoding of choice for Subversion, encodes codepoints |
| in (a series of) bytes (octets). Because the sequences of codepoints |
| specifying a character may differ, so may the resulting UTF-8. Hence, |
| we end up with more than one way to specify the same path. |
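| |
| A minimal sketch of the effect, using Python's standard unicodedata |
| module purely for illustration (it is not part of Subversion, and the |
| variable names are made up for this example): |
| |
```python
import unicodedata

# "e with acute" as one precomposed codepoint (NFC) ...
composed = "\u00e9"
# ... and as "e" followed by COMBINING ACUTE ACCENT (NFD).
decomposed = "e\u0301"

# The same character yields two different UTF-8 byte sequences.
nfc_bytes = composed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")
print(nfc_bytes)  # b'\xc3\xa9'
print(nfd_bytes)  # b'e\xcc\x81'

# Normalizing to a common form makes the two spellings equal again.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
same_after_nfd = unicodedata.normalize("NFD", composed) == decomposed
```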
| |
| The following table specifies behaviour of OSes related to handling of |
| Unicode filenames: |
| |
| OS           Accepts   Gives back |
| ----------   -------   ---------- |
| MacOS X[2]   all       NFD*       |
| Linux        all       <input>    |
| Windows      all       <input>    |
| Others       ?         ?          |
| |
| *) There are some remarks to be made regarding full or partial NFD |
| here, but the essential thing is: if you send in NFC, don't |
| expect it back! |
| |
| |
| Issue Description |
| ================= |
| |
| From the context described above, two problems follow: |
| |
| First, we can't generally depend on the OS to give us back the exact |
| filename we gave it. This is mainly a client side issue, something |
| which might be resolved in the client side libraries (client/subr/wc). |
| |
| Secondly, the same filename may be encoded as different sequences of |
| codepoints. This issue is much broader than the first, especially given |
| the fact that we already have lots of populated repositories "out |
| there". We cannot depend on a filename coming from the operating system |
| to name a different file merely because it is encoded differently from |
| the one in the repository. This has repository (i.e. server-side) impact. |
| |
| |
| Pre-Resolution State of Affairs |
| =============================== |
| |
| This section serves to describe the problems to be expected in different |
| combinations of client/server OSes. As indicated in the table in the |
| context section, Linux and Windows are expected to behave equally. This |
| section therefore leaves out the consideration of Linux as a separate |
| system. |
| |
| The platforms below are strictly client side: the server side problems |
| mentioned in the issue description section relate solely to the repository, |
| which can be located on any server platform. |
| |
| |
| Single platform |
| --------------- |
| |
| This can be multiple MacOSX machines or multiple Windows machines. |
| In this scenario, no interoperability problems are to be expected. |
| |
| |
| Multi-platform: Windows + MacOSX |
| -------------------------------- |
| |
| Consider a filename which contains one or more precomposed (NFC) |
| characters being committed from Windows. When the MacOSX developer |
| updates, a file is written in NFC form, but as stated in the |
| context section, the Mac recodes that to NFD. Comparing what |
| comes from the disk (NFD) with what's in the entries file (NFC) |
| then yields a missing file (the NFC encoded one) and an unversioned |
| file (the NFD encoded one). Both of the filenames look exactly the |
| same to the person reading the Subversion output on the |
| screen. [==> confusion!] |
| |
| Committing a file the other way around might be less problematic, |
| since Windows is capable of storing NFD filenames. |
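| |
| The missing/unversioned pair can be reproduced in a small simulation. |
| This is only a sketch: naive_status is a hypothetical helper, not a |
| Subversion function, and the Mac filesystem is imitated by normalizing |
| the name to NFD with Python's unicodedata module. |
| |
```python
import unicodedata

def naive_status(entries, on_disk):
    """Compare entry names with directory names codepoint for codepoint."""
    missing = sorted(set(entries) - set(on_disk))
    unversioned = sorted(set(on_disk) - set(entries))
    return missing, unversioned

# Name as committed from Windows: precomposed (NFC).
entries = [unicodedata.normalize("NFC", "re\u0301sume\u0301.txt")]
# Name as a Mac-like filesystem hands it back: decomposed (NFD).
on_disk = [unicodedata.normalize("NFD", name) for name in entries]

missing, unversioned = naive_status(entries, on_disk)
# One file is reported twice: once as missing, once as unversioned,
# although both names render identically on screen.
```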
| |
| |
| Proposed Support Library |
| ======================== |
| |
| Assumptions |
| ----------- |
| |
| The main assumption is that we'll keep using APR for character set |
| conversion, meaning that the recoding solution to choose would not |
| need to provide any other functionality than recoding. |
| |
| |
| Options |
| ------- |
| |
| There are two options (that I'm aware of [dionisos]) for choosing a |
| library which supports the required functionality: |
| |
| 1) International Components for Unicode (ICU)[3] -- a library with a |
| very wide range of targeted functions, but with a memory |
| footprint to match. In order to be able to use it, we'd need to |
| trim this library down significantly. |
| |
| 2) utf8proc[4] -- a library for processing UTF-8 encoded Unicode |
| strings, specifically targeted at a limited number of operations |
| to be performed on such strings. It consists of two .c files and |
| a single .h file, with a total source size of 1MB (compiled less |
| than 0.5MB). |
| |
| From these two, under the given assumption, it only makes sense to |
| use utf8proc. |
| |
| |
| Proposed Normal Form |
| ==================== |
| |
| The proposed internal 'normal form' should be NFC, if only because |
| it's the more compact of the two forms: when allocating memory to |
| store a conversion result, it will rarely be necessary to allocate |
| more than the size of the input buffer (the exception being the few |
| characters that Unicode excludes from composition, which expand |
| under NFC). |
| |
| This would give the maximum performance from utf8proc, which requires |
| two recoding runs when the buffer is too small: one to retrieve the |
| required buffer size, the second to actually store the result. |
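| |
| The size argument can be checked for a typical filename. The sketch |
| below uses Python's unicodedata module rather than utf8proc, and |
| utf8_size is a helper made up for this example: |
| |
```python
import unicodedata

def utf8_size(text):
    """Number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))

# Sample filename with three accented letters, written with escapes so
# the source itself does not depend on either normal form.
name = "\u00c5ngstr\u00f6m-caf\u00e9.txt"

nfc_size = utf8_size(unicodedata.normalize("NFC", name))
nfd_size = utf8_size(unicodedata.normalize("NFD", name))

# Each accented letter takes 2 bytes composed, 3 bytes decomposed.
print(nfc_size, nfd_size)  # 20 23
```
| |
| For this sample, recoding NFD input to NFC fits in a buffer the size |
| of the input, so a single recoding run would suffice. |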
| |
| |
| |
| Possible Solutions |
| ================== |
| |
| Several options are available for resolution of this problem, each |
| with its pros and cons, to be outlined below. |
| |
| 1) Normalization of (path) input on MacOSX. Since the Mac seems to be |
| the only platform which mutilates its pathname input to be NFD, |
| this seems like a logical (low impact) solution. |
| |
| 2) Normalization of (path) input on all platforms. Since paths can't |
| differ only in encoding if we standardize on encoding, this seems |
| like a logical (relatively low impact) solution. |
| |
| 3) Normalization of path input in the client and server. On the server |
| side, non-normalized paths may have become part of the repository. |
| We can achieve full in-memory standardization by converting any |
| path coming from the repository as well as the client. |
| |
| 4) Client and server-side path comparison routines. Because paths read |
| from the repository may be used to access said repository, possibly |
| by calculating hash values, paths from the repository can't be |
| munged (repository-side). To eliminate the effect, we acknowledge |
| we're not going to be 'clean': we'll always need path comparison |
| routines. |
| |
| Solution (1) has a very strong CON: it will break all pre-existing |
| MacOSX-only workshops. Consider a client which starts sending NFC |
| encoded paths in an environment where all paths have been NFD encoded |
| until that time - without proper support in the server. This would |
| result in commits with NFC encoded paths to files for which the path |
| in the repository is NFD encoded: breakage. |
| |
| Solution (2) has the same problem as solution (1) on MacOSX, but |
| on the upside it prevents new NFD paths from entering into the repository |
| (for sufficiently broad definitions of 'client' [think mod_dav_svn]). |
| |
| As already stated, solution (3) may prevent paths from being found if |
| the retrieval mechanism is hash-based, meaning it could break any |
| repository backend using hashing to store information about paths. |
| (Don't we store locks in FSFS based on hashing?) |
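| |
| A sketch of that failure mode, under the assumption of a backend that |
| derives its lookup key from a digest of the path's UTF-8 bytes. The |
| storage_key function and the backend dictionary are hypothetical and |
| do not describe how any actual Subversion backend stores its data: |
| |
```python
import hashlib
import unicodedata

def storage_key(repo_path):
    """Hypothetical backend key: a digest of the path's UTF-8 bytes."""
    return hashlib.sha1(repo_path.encode("utf-8")).hexdigest()

# A path that entered the repository in NFD form before normalization.
stored_path = "cafe\u0301.txt"
backend = {storage_key(stored_path): "data for " + stored_path}

# Solution (3) would normalize the path in memory before using it.
normalized_path = unicodedata.normalize("NFC", stored_path)

found_with_original = storage_key(stored_path) in backend
found_with_normalized = storage_key(normalized_path) in backend
print(found_with_original, found_with_normalized)  # True False
```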
| |
| Solution (4) defines no internal standard representation, assuming it's |
| not possible to maintain a clean in-memory state, given all problems |
| found in the earlier solutions. Instead, it requires all path comparisons |
| to be performed using special NFC/NFD encoding aware functions. |
| |
| |
| Short Term Solution |
| =================== |
| |
| Because of our interoperability guarantees, the client and server |
| should be considered separate universes, each of which can use its own |
| (internal) solution. However, the client should at all times use the |
| exact path the server sent it. The same applies the other way around. |
| |
| Given the above, the short term (before 2.0) solution should be to |
| use path comparison routines as stated in solution (4). |
| |
| |
| Long Term Solution |
| ================== |
| |
| The long term (2.0+) solution would be to use solution (2), which ensures |
| recoding of all input paths into the 'normal' normal form (NFC). In that |
| case, specialised path comparison routines will no longer be required |
| (although they might still be desired for other design |
| considerations). |
| |
| |
| Short Term Solution Implementation Consequences |
| =============================================== |
| |
| As stated before, since we don't know whether the other side of the |
| equation might be a pre-normalization-aware client or server until |
| we break backward compat in 2.0, the client and server should be |
| able to talk backward compatibly with a pre-NF-aware 'other side'. |
| |
| Hence, solving this problem means considering the client and the server |
| separate universes, each of which can employ its own internal solution. |
| |
| Implementing option (4) means: |
| |
| A. Comparing file names with entry paths using NFC/NFD aware |
| comparison functions. Then, when there's a match, *use the pathname |
| from the entries file* to communicate with the server; after all, |
| the path might have been added with a different encoding than we |
| got back from the disk. |
| |
| B. Match working copy paths with entries-file paths using NFC/NFD |
| aware comparison functions. On a match, use the entries-file path |
| to communicate with the server. |
| |
| The above means the client has to be very careful to preserve the |
| encoding from the server and use that when talking to the server; |
| otherwise the server may not recognize the path as a versioned entity. |
| |
| Locally however, we can't be sure the filesystem enforces the encoding |
| the server sent to the client, meaning there are (contrived) cases where |
| a file exists in a different encoding locally than in the repository. |
| This means we have to be very careful about how we find our files, and |
| must use the encoding we got from the local filesystem to access them. |
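| |
| The matching rule from A and B can be sketched as follows. The names |
| same_path and match_entry are assumptions made for this example, not |
| existing Subversion functions, and Python's unicodedata module stands |
| in for whatever normalization library is eventually chosen: |
| |
```python
import unicodedata

def same_path(a, b):
    """True if two paths differ at most in NFC/NFD composition."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def match_entry(wc_path, entry_paths):
    """Return the entries-file path naming the same file as wc_path.

    The result is the path exactly as recorded in the entries file,
    which is the spelling to send to the server. Returns None when no
    entry matches.
    """
    for repo_path in entry_paths:
        if same_path(wc_path, repo_path):
            return repo_path
    return None

entry_paths = ["caf\u00e9.txt", "README"]  # as recorded (NFC here)
wc_path = "cafe\u0301.txt"                 # as read back from disk (NFD)

# Talk to the server with repo_path, open the local file with wc_path.
repo_path = match_entry(wc_path, entry_paths)
```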
| |
| Implementation details: |
| |
| * The hash keys in svn_wc_adm_access_t's are hashed on the normalized |
| path encoding, not the repository path, in order to be able to |
| calculate the hash key from both the wc path as well as the repo |
| path. |
| |
| * The same line of reasoning applies to the hash keys in the entries |
| hash. |
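| |
| A sketch of such a hash, assuming NFC as the normalized key encoding. |
| The helpers hash_key, add_entry and find_entry are hypothetical and a |
| plain Python dict stands in for the real hash tables: |
| |
```python
import unicodedata

def hash_key(path):
    """Key shared by every spelling of a path: its NFC form."""
    return unicodedata.normalize("NFC", path)

entries = {}

def add_entry(repo_path, wc_path):
    """Store both spellings under the normalized key."""
    entries[hash_key(repo_path)] = {"repo_path": repo_path,
                                    "wc_path": wc_path}

def find_entry(path):
    """Look up an entry by either the wc path or the repo path."""
    return entries.get(hash_key(path))

add_entry(repo_path="caf\u00e9.txt", wc_path="cafe\u0301.txt")

# Both spellings lead to the same record, and each original spelling
# is preserved for use on its own side.
from_repo = find_entry("caf\u00e9.txt")
from_disk = find_entry("cafe\u0301.txt")
```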
| |
| New conventions: |
| |
| * Variables containing a path as encoded in the local filesystem |
| should contain the (sub)string 'wc_path'. |
| |
| * Variables containing a path as encoded in the repository should |
| contain the (sub)string 'repo_path'. |
| |
| |
| Additional Information |
| ====================== |
| |
| * "UTF-8 NFC/NFD paths issue" dev@ mailing list thread: |
| http://svn.haxx.se/dev/archive-2010-09/0319.shtml |
| |
| |
| References |
| ========== |
| |
| 1) UAX #15: Unicode normalization forms |
| http://unicode.org/reports/tr15/ |
| 2) Apple Technical Q&A: Path encodings in VFS |
| http://developer.apple.com/qa/qa2001/qa1173.html |
| 3) ICU - International Components for Unicode |
| http://www-306.ibm.com/software/globalization/icu/index.jsp |
| 4) utf8proc - a library targeted at processing UTF-8 encoded unicode strings |
| http://www.flexiguided.de/publications.utf8proc.en.html |