|  | -*- Text -*- | 
|  |  | 
|  |  | 
|  | Content | 
|  | ======= | 
|  |  | 
|  | * Context | 
|  | * Issue Description | 
|  | * Pre-Resolution State of Affairs | 
|  | - Single platform | 
|  | - Multi-platform: Windows + MacOS X | 
|  | * Proposed Support Library | 
|  | - Assumptions | 
|  | - Options | 
|  | * Proposed Normal Form | 
|  | * Possible Solutions | 
|  | - Normalization of path-input on MacOS X | 
|  | - Normalization of path-input everywhere | 
|  | - Comparison routines (client side) | 
|  | - Comparison routines (everywhere) | 
|  | * Short Term (ie before 2.0) solution | 
|  | * Long Term Solution (ie 2.0+) | 
|  | * Additional Information | 
|  | * References | 
|  |  | 
|  |  | 
|  | Context | 
|  | ======= | 
|  |  | 
|  | Within Unicode, some characters - with diacritical marks - can be | 
|  | represented in 2 forms: Normal Form Composed (NFC) or Normal Form | 
|  | Decomposed (NFD).  A string of unicode characters can contain any | 
|  | mixture of both forms. | 
|  |  | 
|  | This problem explicitly does not concern itself with invisible | 
|  | characters, spaces or other characters unlikely to be present in | 
|  | filenames.  Please note that this issue is explicitly excluding | 
|  | NFKC/NFKD (compatibility) normal forms, because they remove for | 
|  | example formatting (meaning they are lossy?). | 
|  |  | 
|  | Because there are 2 forms for representing (some) characters in | 
|  | Unicode, it's possible to produce different sequences of codepoints | 
|  | meaning to indicate the same sequence of characters [1].  UTF-8, the | 
|  | internal Unicode encoding of choice for Subversion, encodes codepoints | 
|  | in (a series of) bytes (octets).  Because the sequences of codepoints | 
|  | specifying a character may differ, so may the resulting UTF-8.  Hence, | 
|  | we end up with more than one way to specify the same path. | 
|  |  | 
|  | The following table specifies behaviour of OSes related to handling of | 
|  | Unicode filenames: | 
|  |  | 
|  | OS           Accepts   Gives back | 
|  | ----------   -------   ---------- | 
|  | MacOS X[2]     all        NFD* | 
|  | Linux          all      <input> | 
|  | Windows        all      <input> | 
|  | Others          ?          ? | 
|  |  | 
|  | *) There are some remarks to be made regarding full or partial NFD | 
|  | here, but the essential thing is:  if you send in NFC, don't | 
|  | expect it back! | 
|  |  | 
|  |  | 
|  | Issue Description | 
|  | ================= | 
|  |  | 
|  | From the above issue description, two problems follow: | 
|  |  | 
|  | First, we can't generally depend on the OS to give us back the exact | 
|  | filename we gave it.  This is mainly a client side issue, something | 
|  | which might be resolved in the client side libraries (client/subr/wc). | 
|  |  | 
|  | Secondly, the same filename may be encoded in different codepoints. | 
|  | This issue is much broader than the first, especially given the fact | 
|  | that we already have lots of populated repositories "out there".  We | 
|  | cannot depend on a filename coming from the operating system -- even | 
|  | though different from the one in the repository -- to name a different | 
|  | file.  This has repository (ie. server-side) impact. | 
|  |  | 
|  |  | 
|  | Pre-Resolution State of Affairs | 
|  | =============================== | 
|  |  | 
|  | This section serves to describe the problems to be expected in different | 
|  | combinations of client/server OSes.  As indicated in the table in the | 
|  | context section, Linux and Windows are expected to behave equally. This | 
|  | section therefor leaves out the consideration of Linux as a separate | 
|  | system. | 
|  |  | 
|  | The platforms below are strictly client side: the server side problems | 
|  | mentioned in the issue description section solely relates to the repository, | 
|  | which can be located at any server platform. | 
|  |  | 
|  |  | 
|  | Single platform | 
|  | --------------- | 
|  |  | 
|  | This can be multiple MacOSX machines or multiple Windows machines. | 
|  | In this scenario, no interoperability problems are to be expected. | 
|  |  | 
|  |  | 
|  | Multi-platform: Windows + MacOSX | 
|  | -------------------------------- | 
|  |  | 
|  | Consider a filename which contains one or more precomposed (NFC) | 
|  | characters being committed from Windows.  When the MacOSX developer | 
|  | updates, a file is written in NFC form, but as stated in the | 
|  | context section, Mac recodes that to NFD.  Now, when comparing what | 
|  | comes from the disk (NFD) with what's in the entries file (NFC), | 
|  | results in a missing file (the NFC encoded one) and an unversioned | 
|  | file (the NFD encoded one).  Both of the filenames look exactly the | 
|  | same to the person reading the Subversion output on the | 
|  | screen. [==> confusion!] | 
|  |  | 
|  | Committing a file the other way around might be less problematic, | 
|  | since Windows is capable of storing NFD filenames. | 
|  |  | 
|  |  | 
|  | Proposed Support Library | 
|  | ======================== | 
|  |  | 
|  | Assumptions | 
|  | ----------- | 
|  |  | 
|  | The main assumption is that we'll keep using APR for character set | 
|  | conversion, meaning that the recoding solution to choose would not | 
|  | need to provide any other functionality than recoding. | 
|  |  | 
|  |  | 
|  | Options | 
|  | ------- | 
|  |  | 
|  | There are two options (that I'm aware of [dionisos]) for choosing a | 
|  | library which supports the required functionality: | 
|  |  | 
|  | 1) International Component for Unicode (ICU)[3] -- a library with a | 
|  | very wide range of targeted functions, but with a memory | 
|  | footprint to match.  In order to be able to use it, we'd need to | 
|  | trim this library down significantly. | 
|  |  | 
|  | 2) utf8proc -- a library for processing UTF-8 encoded unicode | 
|  | strings.  A library specifically targeted at a limited number of | 
|  | operations to be performed on UTF-8 encoded strings.  It | 
|  | consists of two .c and a single .h file, with a total source | 
|  | size of 1MB (compiled less than 0.5MB). | 
|  |  | 
|  | From these two, under the given assumption, it only makes sense to | 
|  | use utf8proc. | 
|  |  | 
|  |  | 
|  | Proposed Normal Form | 
|  | ==================== | 
|  |  | 
|  | The proposed internal 'normal form' should be NFC, if only if | 
|  | it were because it's the most compact form of the two:  when allocating | 
|  | memory to store a conversion result, it won't be necessary (ever) to | 
|  | allocate more than the size of the input buffer. | 
|  |  | 
|  | This would give the maximum performance from utf8proc, which requires | 
|  | two recoding runs when the buffer is too small:  one to retrieve the | 
|  | required buffer size, the second to actually store the result. | 
|  |  | 
|  |  | 
|  |  | 
|  | Possible Solutions | 
|  | ================== | 
|  |  | 
|  | Several options are available for resolution of this problem, each | 
|  | with its pros and cons, to be outlined below. | 
|  |  | 
|  | 1) Normalization of (path) input on MacOSX.  Since the Mac seems to be | 
|  | the only platform which mutilates its pathname input to be NFD, | 
|  | this seems like a logical (low impact) solution. | 
|  |  | 
|  | 2) Normalization of (path) input on all platforms.  Since paths can't | 
|  | differ only in encoding if we standardize on encoding, this seems | 
|  | like a logical (relatively low) impact solution. | 
|  |  | 
|  | 3) Normalization of path input in the client and server.  On the server | 
|  | side, non-normalized paths may have become part of the repository. | 
|  | We can achieve full in-memory standardization by converting any | 
|  | path coming from the repository as well as the client. | 
|  |  | 
|  | 4) Client and server-side path comparison routines.  Because paths read | 
|  | from the repository may be used to access said repository, possibly | 
|  | by calculating hash values, paths from can't be munged | 
|  | (repository-side).  To eliminate the effect, we acknowledge we're | 
|  | not going to be 'clean': we'll always need path comparison | 
|  | routines. | 
|  |  | 
|  | Solution (1) has a very strong CON: it will break all pre-existing | 
|  | MacOSX-only workshops.  Consider a client which starts sending NFC | 
|  | encoded paths in an environment where all paths have been NFD encoded | 
|  | until that time - without proper support in the server.  This would | 
|  | result in commits with NFC encoded paths to files for which the path | 
|  | in the repository is NFD encoded: breakage. | 
|  |  | 
|  | Solution (2) has the same problem as solution (1) on MacOSX, but | 
|  | on the upside it prevents new NFD paths from entering into the repository | 
|  | (for sufficiently broad definitions of 'client' [think mod_dav_svn]). | 
|  |  | 
|  | As already stated, solution (3) may prevent paths from being found, if | 
|  | the retrieval mechanism is hash-based.  Meaning this could break any | 
|  | repository backend using hashing to store information about paths. | 
|  | (Don't we store locks in FSFS based on hashing?) | 
|  |  | 
|  | Solution (4) defines no internal standard representation, assuming it's | 
|  | not possible to maintain a clean in-memory state, given all problems | 
|  | found in the earlier solutions.  Instead, it requires all path comparisons | 
|  | to be performed using special NFC/NFD encoding aware functions. | 
|  |  | 
|  |  | 
|  | Short Term Solution | 
|  | =================== | 
|  |  | 
|  | Because of our interoperability guarantees, the client and server | 
|  | should be considered separate universes, each of which can use its own | 
|  | (internal) solution.  However, the client should at all times use the | 
|  | exact path the server sent it.  The same applies the other way around. | 
|  |  | 
|  | Given the above, the short term (before 2.0) solution should be to | 
|  | use path comparison routines as stated in solution (4). | 
|  |  | 
|  |  | 
|  | Long Term Solution | 
|  | ================== | 
|  |  | 
|  | The long term (2.0+) solution would be to use option (2), which ensures | 
|  | recoding of all input paths into the 'normal' normal form (NFC).  In that | 
|  | case, it'll no longer require the use of specialised path comparison | 
|  | routines (although that might still be desired for other design | 
|  | considerations). | 
|  |  | 
|  |  | 
|  | Short Term Solution Implementation Consequences | 
|  | =============================================== | 
|  |  | 
|  | As stated before, since we don't know whether the other side of the | 
|  | equation might be a pre-normalization-aware client or server until | 
|  | we break backward compat in 2.0, the client and server should be | 
|  | able to talk backward compatibly with a pre-NF-aware 'other side'. | 
|  |  | 
|  | Hence, solving this problem means considering the client and the server | 
|  | separate universes, each of which can employ its own internal solution. | 
|  |  | 
|  | Implementing option (4) means: | 
|  |  | 
|  | A. Comparing file names with entry paths using NFC/NFD aware | 
|  | comparison functions. Then, when there's a match, *use the pathname | 
|  | from the entries file* to communicate with the server; after all, | 
|  | the path might have been added with a different encoding than we | 
|  | got back from the disk. | 
|  |  | 
|  | B. Match working copy paths with entries-file paths using NFC/NFD | 
|  | aware comparison functions. On a match, use the entries-file path | 
|  | to communicate with the server. | 
|  |  | 
|  | The above means the client has to be very careful to preserve the | 
|  | encoding from the server and use that when talking to the server | 
|  | otherwise the server may not recognize the path as a versioned entity. | 
|  |  | 
|  | Locally however, we can't be sure the filesystem enforces the encoding | 
|  | the server sent to the client, meaning there are (contrived) cases where | 
|  | a file exists in a different encoding locally than in the repository. | 
|  | Which means we have to be very careful about how we find our files and | 
|  | to use the encoding we got from the local filesystem. | 
|  |  | 
|  | Implementation details: | 
|  |  | 
|  | * The hash keys in svn_wc_adm_access_t's are hashed on the normalized | 
|  | path encoding, not the repository path, in order to be able to | 
|  | calculate the hash key from both the wc path as well as the repo | 
|  | path. | 
|  |  | 
|  | * The same line of reasoning applies to the hash keys in the entries | 
|  | hash. | 
|  |  | 
|  | New conventions: | 
|  |  | 
|  | * Variables containing a path as encoded in the local filesystem | 
|  | should contain the (sub)string 'wc_path'. | 
|  |  | 
|  | * Variables containing a path as encoded in the repository should | 
|  | contain the (sub)string 'repo_path'. | 
|  |  | 
|  |  | 
|  | Additional Information | 
|  | ====================== | 
|  |  | 
|  | * "UTF-8 NFC/NFD paths issue" dev@ mailing list thread: | 
|  | http://svn.haxx.se/dev/archive-2010-09/0319.shtml | 
|  |  | 
|  |  | 
|  | References | 
|  | ========== | 
|  |  | 
|  | 1) UAX #15: Unicode normalization forms | 
|  | http://unicode.org/reports/tr15/ | 
|  | 2) Apple Technical Q&A: Path encodings in VFS | 
|  | http://developer.apple.com/qa/qa2001/qa1173.html | 
|  | 3) ICU - International Component for Unicode | 
|  | http://www-306.ibm.com/software/globalization/icu/index.jsp | 
|  | 4) utf8proc - a library targeted at processing UTF-8 encoded unicode strings | 
|  | http://www.flexiguided.de/publications.utf8proc.en.html |