| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| <HTML> |
| <head> |
| <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1"> |
| <TITLE>URIs in UDK</TITLE> |
| <style type="text/css"> |
| <!-- |
| h1 { text-align:center; margin-top: 0.2cm; text-decoration: none; color: #ffffff; font-size: 6} |
| h2 { margin-top: 0.2cm; margin-bottom: 0.1cm; color: #ffffff; background-color: #666699 } |
| li {margin-bottom: 0.2cm;} |
| dl {margin-bottom: 0.2cm;} |
| dd {margin-bottom: 0.2cm;} |
| dt {margin-bottom: 0.2cm;} |
| --> |
| </style> |
| </head> |
| <body> |
| <TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" bgcolor=#666699 summary=header> |
| <TR> |
| <td> |
| <h1> URIs in UDK </h1> |
| </td><td> |
| <a href="http://www.openoffice.org"><img src="../../../images/open_office_org_logo.gif" alt="OpenOffice.org" align="right" border="0"></a> |
| </TD> |
| </TR> |
| </TABLE> |
| |
| |
| <P>This text describes how Uniform Resource Identifiers (URIs, see |
| <A href="http://www.ietf.org/rfc/rfc2396.txt">RFC 2396</A>) are used within |
| the UDK and within OpenOffice.org in general. URIs encompass both the widely |
| known Uniform Resource Locators (URLs) and the lesser known Uniform Resource |
| Names (URNs), but if you only know the term URL and neither URI nor URN, that's |
| just fine, as the discussion here in fact centers around only a few URL |
| schemes.</P> |
| |
| <H2>Used URL Schemes</H2> |
| |
| <P>Currently, the UDK only uses two URL schemes directly: file URLs and uno |
| URLs. File URLs are defined in |
| <A href="http://www.ietf.org/rfc/rfc1738.txt">RFC 1738</A>, but that |
| definition leaves the semantics somewhat open. The OpenOffice.org code chose to |
| use a certain interpretation of file URLs (described in this text), but that |
| interpretation can be incompatible with the interpretations used by other |
| programs.</P> |
| |
| <P>Uno URLs follow a private scheme that is explained in the document <A HREF="../../../common/man/spec/uno-url.html"><CITE>UNO-Url</CITE></A>.</P> |
| |
| <P>Other URL schemes (http, ftp, etc.) are used within OpenOffice.org (mainly |
| through the class <CODE>INetURLObject</CODE> in the <CODE>tools</CODE> project), |
| and many of them suffer from the same problems as file URLs (see below).</P> |
| |
| <H2>File URL Basics</H2> |
| |
| <P>A file URL consists of encoded data, interspersed with some delimiting |
| characters. Considering the file URL</P> |
| |
| <PRE> |
| <CODE>file://<VAR>host</VAR>/<VAR>seg<SUB>1</SUB></VAR>/<VAR>seg<SUB>2</SUB></VAR>/<VAR>…</VAR>/<VAR>seg<SUB>n</SUB></VAR></CODE> |
| </PRE> |
| |
| <P>the <CODE><VAR>host</VAR></CODE> and <CODE><VAR>seg<SUB>1</SUB></VAR></CODE>, |
| <CODE><VAR>seg<SUB>2</SUB></VAR></CODE>, …, |
| <CODE><VAR>seg<SUB>n</SUB></VAR></CODE> parts are the encoded data, and the |
| rest (the <CODE>file://</CODE> and the single slashes) are syntactic delimiters. |
| The encoded data in the <CODE><VAR>seg<SUB>i</SUB></VAR></CODE> parts are |
| sequences of bytes, written using ASCII characters (that represent the numeric |
| values of the ASCII characters themselves) and <DFN>escape sequences</DFN> (a |
| <CODE>%</CODE> followed by two hexadecimal digits, representing any numeric |
| value in the range 0–255). (The encoded data in the optional |
| <CODE><VAR>host</VAR></CODE> part follows other rules.)</P> |
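| <P>The structure described above can be sketched in a few lines of Python. This |
| is only an illustration, not the OpenOffice.org implementation: the function |
| name <CODE>split_file_url</CODE> and the sample URL are hypothetical, and the |
| <CODE><VAR>host</VAR></CODE> part is returned unprocessed because it follows |
| other rules.</P> |

```python
from urllib.parse import unquote_to_bytes


def split_file_url(url):
    """Split a file URL into its host part and a list of decoded byte segments."""
    prefix = "file://"
    if not url.lower().startswith(prefix):
        raise ValueError("not a file URL: " + url)
    rest = url[len(prefix):]
    # Everything up to the first slash is the (possibly empty) host part.
    host, _, path = rest.partition("/")
    # The single slashes are delimiters; each piece between them is one segment.
    # unquote_to_bytes() turns ASCII characters and %XX escapes into raw bytes.
    segments = [unquote_to_bytes(seg) for seg in path.split("/")]
    return host, segments


host, segments = split_file_url("file://host/seg1/seg%202")
print(host)      # host
print(segments)  # [b'seg1', b'seg 2']
```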
| |
| <P>File URLs are used to locate files on a certain machine, that is, they |
| somehow have to encode (platform-dependent) file system paths as used on that |
| machine in their <CODE><VAR>seg<SUB>i</SUB></VAR></CODE> parts. The problem is |
| that there is no global specification of exactly how file URLs encode file |
| system paths.</P> |
| |
| <P>The strategy chosen by the OpenOffice.org code maps from file system paths |
| (as used in some operating system's interfaces) to file URLs in two steps. In |
| the first step (which is platform dependent), a hierarchical file system path is |
| translated into a sequence of segments, represented in Unicode. In the second |
| step (which is platform independent), those segments are translated into |
| sequences of bytes, using UTF-8, and then concatenated into URLs (encoding the |
| individual bytes as single ASCII characters or as escape sequences).</P> |
| |
| <P>As an example, consider the Windows file system path</P> |
| |
| <PRE> |
| <CODE>C:\directory\other dir\file.txt</CODE> |
| </PRE> |
| |
| <P>This is first translated into the four segments <CODE>C:</CODE>, |
| <CODE>directory</CODE>, <CODE>other dir</CODE>, and <CODE>file.txt</CODE> (all |
| represented using Unicode). Using UTF-8, these segments are then translated |
| into the corresponding byte sequences (represented here as sequences of |
| hexadecimal numbers):</P> |
| |
| <UL> |
| <LI>43 3A <EM>(<CODE>C:</CODE>)</EM> |
| <LI>64 69 72 65 63 74 6F 72 79 <EM>(<CODE>directory</CODE>)</EM> |
| <LI>6F 74 68 65 72 20 64 69 72 <EM>(<CODE>other dir</CODE>)</EM> |
| <LI>66 69 6C 65 2E 74 78 74 <EM>(<CODE>file.txt</CODE>)</EM> |
| </UL> |
| |
| <P>Then, these byte sequences are combined into a single file URL (adding the |
| necessary syntactic delimiters), namely</P> |
| |
| <PRE> |
| <CODE>file:///C:/directory/other%20dir/file.txt</CODE> |
| </PRE> |
| |
| <P>Nothing exciting here (all the bytes from the four sequences are represented |
| as the corresponding ASCII characters, except for the space in <CODE>other |
| dir</CODE>, which is illegal in URLs and is thus escaped as |
| <CODE>%20</CODE>).</P> |
| |
| <P>Similarly, a Unix file system path</P> |
| |
| <PRE> |
| <CODE>/directory/other dir/file.txt</CODE> |
| </PRE> |
| |
| <P>is translated into the URL</P> |
| |
| <PRE> |
| <CODE>file:///directory/other%20dir/file.txt</CODE> |
| </PRE> |
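| <P>The two-step mapping used in these examples can be sketched as follows. The |
| function names are hypothetical, and which ASCII characters are left unescaped |
| is an assumption of this sketch (Python's <CODE>urllib.parse.quote</CODE> |
| defaults, plus the colon), not a statement about the OpenOffice.org code.</P> |

```python
from urllib.parse import quote


def windows_path_to_segments(path):
    """Step 1 (platform dependent) for paths such as C:\\directory\\file.txt."""
    return path.split("\\")


def unix_path_to_segments(path):
    """Step 1 (platform dependent) for absolute paths such as /directory/file.txt."""
    return path.lstrip("/").split("/")


def segments_to_file_url(segments, host=""):
    """Step 2 (platform independent): Unicode segments to UTF-8 bytes to URL."""
    # quote() encodes each segment as UTF-8 and escapes every byte that is not
    # an unreserved ASCII character; ":" is kept literal for drive letters.
    encoded = [quote(seg, safe=":", encoding="utf-8") for seg in segments]
    return "file://" + host + "/" + "/".join(encoded)


win = windows_path_to_segments(r"C:\directory\other dir\file.txt")
print(segments_to_file_url(win))   # file:///C:/directory/other%20dir/file.txt

unix = unix_path_to_segments("/directory/other dir/file.txt")
print(segments_to_file_url(unix))  # file:///directory/other%20dir/file.txt
```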
| |
| <H2>Non-ASCII Characters</H2> |
| |
| <P>On many platforms, file system paths can contain non-ASCII characters. This |
| is typically handled within the operating system by naming files with byte |
| strings, and letting the user choose a character encoding (via a locale) that |
| specifies how these byte strings are to be interpreted.</P> |
| |
| <P>Consider the Unix file system path <CODE>/stränge</CODE> (assuming the |
| user's locale is such that it contains an ä in the supported character |
| repertoire). Again, translation into a file URL is done in two steps: First, |
| the path is split into the single segment <CODE>stränge</CODE> (represented |
| as a Unicode string; some platform-dependent “magic” is needed to |
| convert from the operating system's representation to this Unicode |
| representation). Second, that segment is translated into the (hexadecimal) byte |
| sequence 73 74 72 C3 A4 6E 67 65, and the URL <CODE>file:///str%C3%A4nge</CODE> |
| is constructed from that byte sequence.</P> |
| |
| <P>Other programs may handle file URLs differently, in that they directly use |
| the operating system's byte strings (interpreted in a locale chosen by the user) |
| within the file URL, without going via Unicode and UTF-8. For example, if the |
| user had chosen an ISO 8859-1 locale, the Unix file system path |
| <CODE>/stränge</CODE> (possibly represented as the hexadecimal byte string |
| 2F 73 74 72 E4 6E 67 65 by the operating system) would correspond to the URL |
| <CODE>file:///str%E4nge</CODE>.</P> |
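| <P>The difference between the two interpretations can be reproduced with a |
| short Python sketch. It assumes an ISO 8859-1 locale for the locale dependent |
| case and uses <CODE>urllib.parse.quote</CODE> only as a stand-in for whatever |
| escaping routine a given program uses.</P> |

```python
from urllib.parse import quote

name = "str\u00e4nge"  # "stränge", with ä as the single code point U+00E4

# UTF-8 URLs approach: Unicode string, then UTF-8 bytes, then escape sequences.
utf8_url = "file:///" + quote(name, encoding="utf-8")

# Locale dependent approach, assuming an ISO 8859-1 locale: the bytes of the
# file name in that encoding are escaped directly.
latin1_url = "file:///" + quote(name, encoding="iso-8859-1")

print(utf8_url)    # file:///str%C3%A4nge
print(latin1_url)  # file:///str%E4nge
```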
| |
| <P>If the OpenOffice.org code wants to exchange file URLs with such other |
| programs, the URLs have to be converted back and forth at the interfaces, to |
| avoid any misunderstandings. This problem is typically not noticed when the |
| file system paths contain only ASCII characters, as these are always represented |
| the same within file URLs.</P> |
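| <P>A conversion at such an interface might look like the following sketch. The |
| helper <CODE>recode_file_url</CODE> is hypothetical, and the caller is assumed |
| to know which encoding the other program uses.</P> |

```python
from urllib.parse import quote, unquote_to_bytes


def recode_file_url(url, from_encoding, to_encoding):
    """Re-encode the path segments of a file URL from one encoding to another."""
    prefix = "file://"
    if not url.lower().startswith(prefix):
        raise ValueError("not a file URL: " + url)
    host, _, path = url[len(prefix):].partition("/")
    recoded = []
    for seg in path.split("/"):
        # Escape sequences to raw bytes, raw bytes to text in the old encoding.
        text = unquote_to_bytes(seg).decode(from_encoding)
        # Text to bytes in the new encoding, escaped again where necessary.
        recoded.append(quote(text, safe=":", encoding=to_encoding))
    return "file://" + host + "/" + "/".join(recoded)


print(recode_file_url("file:///str%E4nge", "iso-8859-1", "utf-8"))
# file:///str%C3%A4nge
print(recode_file_url("file:///str%C3%A4nge", "utf-8", "iso-8859-1"))
# file:///str%E4nge
```

| <P>A URL whose segments contain only ASCII characters passes through such a |
| conversion unchanged.</P> |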
| |
| <P>Both approaches (“UTF-8 URLs” and “locale dependent |
| URLs”) have pros and cons, but the main benefit of the OpenOffice.org |
| (i.e., UTF-8 URLs) approach seems to be the stability of how a file URL locates |
| a specific file, regardless of context. Imagine a text processor application |
| that lets you include URLs as hyperlinks within your documents, and imagine that |
| application used locale dependent URLs. When creating a document, a user has |
| specified locale <VAR>X</VAR> for their operating system, and a file URL |
| included in the document is therefore encoded using the conventions of |
| locale <VAR>X</VAR>. Now, the user switches to locale <VAR>Y</VAR> |
| and re-opens the document: The hyperlink is no longer guaranteed to point to |
| the same file, as the file URL is not encoded using the conventions of |
| locale <VAR>Y</VAR>.</P> |
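| <P>The effect of switching locales can be illustrated as follows, assuming that |
| locale <VAR>X</VAR> uses ISO 8859-1 and locale <VAR>Y</VAR> uses |
| ISO 8859-5 (both choices are made up for this example).</P> |

```python
from urllib.parse import unquote

url = "file:///str%E4nge"  # stored in the document under locale X

# The same escape sequence names different characters in the two locales.
name_under_x = unquote(url, encoding="iso-8859-1")
name_under_y = unquote(url, encoding="iso-8859-5")
print(name_under_x == name_under_y)  # False

# A UTF-8 URL is decoded the same way, whatever locale the user has selected.
print(unquote("file:///str%C3%A4nge", encoding="utf-8") == name_under_x)  # True
```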
| |
| <P>Also, with the UTF-8 URLs approach, code that handles file URLs can often be |
| made simpler and platform-independent. A typical scenario is a text edit field |
| allowing the user to enter a (file) URL. Many users are not familiar with the |
| nitty-gritty details of URLs, and will type in things like |
| <CODE>file:///stränge</CODE>. Even though this is not a valid URL (since |
| an <CODE>ä</CODE> is not allowed within a URL), it would be nice if the |
| application were forgiving and handled that input as locating the file |
| <CODE>/stränge</CODE>. With the UTF-8 URLs approach, this is easy, as the |
| text edit can silently convert the input into the correct URL |
| <CODE>file:///str%C3%A4nge</CODE>. With the locale dependent URLs approach, |
| this would be more difficult, as the text edit would have to know which locale |
| is in use, to convert the input into something like |
| <CODE>file:///str%E4nge</CODE>, but only if that locale is ISO 8859-1.</P> |
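| <P>A forgiving text edit of this kind could be sketched as below. The function |
| name <CODE>fix_typed_url</CODE> is hypothetical, and the sketch assumes that a |
| <CODE>%</CODE> typed by the user always starts an escape sequence.</P> |

```python
from urllib.parse import quote


def fix_typed_url(text):
    """Escape characters that are not allowed in a URL, using UTF-8."""
    # Delimiters and "%" are left alone so existing escape sequences survive.
    return quote(text, safe=":/%", encoding="utf-8")


print(fix_typed_url("file:///str\u00e4nge"))   # file:///str%C3%A4nge
print(fix_typed_url("file:///str%C3%A4nge"))  # file:///str%C3%A4nge (unchanged)
```

| <P>No locale lookup is needed, because the target encoding is always |
| UTF-8.</P> |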
| |
| <P>Note that these problems of interpreting non-ASCII characters are not |
| restricted to file URLs. Other URL schemes that do not explicitly state what |
| character encoding to use (like http and ftp URLs) have similar problems. As |
| Richard Gillam puts it in his book <CITE>Unicode Demystified</CITE> |
| (Addison-Wesley, 2003):</P> |
| <BLOCKQUOTE> |
| <P>The industry is converging around always treating escape sequences in |
| URLs as referring to UTF-8 code units. That is, the industry is leaning |
| toward always interpreting <CODE>R%c3%a9sum%c3%a9.html</CODE> to mean |
| <CODE>Résumé.html</CODE> (and always representing |
| <CODE>Résumé.html</CODE> as |
| <CODE>R%c3%a9sum%c3%a9.html</CODE>). If everyone agreed on this system, |
| then you could use illegal URL characters (such as the accented é in |
| our example) in URL references in other kinds of documents (such as HTML or |
| XML files) and know that a universally understood method of transforming |
| them into a legal URL existed. Web browsers or other software could do the |
| reverse, displaying URLs that include escape sequences by using the |
| characters the escape sequences represent (at least in the cases where they |
| represent non-ASCII characters) and allowing you to type them in that |
| way.</P> |
| </BLOCKQUOTE> |
| |
| <H2>The UTF-8 URLs Approach</H2> |
| |
| <P>There are a number of problems specific to the UTF-8 URLs approach:</P> |
| |
| <P>First, the UTF-8 URLs approach has an additional step compared to the locale |
| dependent URLs approach, namely the translation between a system specific |
| representation of some textual entity and a Unicode representation of that |
| entity. In the one direction, a problem might occur when a certain entity |
| represented in the system specific way cannot be represented in Unicode (e.g., |
| because it contains characters that are not present in the Unicode repertoire). |
| Given that Unicode was specifically designed to encompass the repertoires of all |
| legacy character encodings in use today, chances for such a problem should be |
| close to zero. In the other direction, different Unicode strings could |
| translate to the same system specific representation (e.g., because two |
| different Unicode characters are mapped to the same character in the system |
| specific encoding). This leads to two different URLs locating the same |
| resource, something that should not be considered much of a problem, since it is |
| already a widespread phenomenon (think about file URLs differing in the use of |
| upper and lower case letters on a case-insensitive file system, or think about |
| file systems that support links).</P> |
| |
| <P>Another problem stems from the fact that Unicode allows a single |
| “conceptual character” (as interpreted by a user, e.g., the |
| character “ä”) to be represented in different ways. For |
| example, the conceptual character “ä” can be represented as |
| either the single code point</P> |
| |
| <UL> |
| <LI>U+00E4 LATIN SMALL LETTER A WITH DIAERESIS |
| </UL> |
| |
| <P>or as a sequence of the two code points</P> |
| |
| <UL> |
| <LI>U+0061 LATIN SMALL LETTER A |
| <LI>U+0308 COMBINING DIAERESIS |
| </UL> |
| |
| <P>(a so-called “combining character sequence”). Both |
| representations should be considered equivalent, so that the two URLs |
| <CODE>file:///str%C3%A4nge</CODE> (containing a UTF-8 encoded U+00E4) and |
| <CODE>file:///stra%CC%88nge</CODE> (containing a U+0061 represented as |
| ASCII <CODE>a</CODE> followed by a UTF-8 encoded U+0308) should probably be |
| considered equivalent also, and should both denote a file named |
| <CODE>stränge</CODE>. In current versions of OpenOffice.org |
| (“SRC644”), loading a file named <CODE>stränge</CODE> on |
| Windows XP works with the former URL, but fails with the latter.</P> |
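| <P>The two representations, and the URLs they lead to, can be checked with a |
| few lines of Python (an illustration only; the variable names are made |
| up):</P> |

```python
from urllib.parse import quote

precomposed = "str\u00e4nge"  # U+00E4
decomposed = "stra\u0308nge"  # U+0061 followed by U+0308

print("file:///" + quote(precomposed, encoding="utf-8"))  # file:///str%C3%A4nge
print("file:///" + quote(decomposed, encoding="utf-8"))   # file:///stra%CC%88nge

# Compared code point by code point, the two strings are different.
print(precomposed == decomposed)  # False
```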
| |
| <P>Two solutions for this problem seem possible: One solution would be to |
| enhance the (platform dependent) code that maps from Unicode to a system |
| specific character encoding so that it handles combining character sequences |
| correctly. Another solution is to require all URLs to use only one form of |
| Unicode strings (i.e., to use a <DFN>normalization form</DFN>), which would make |
| this problem go away. The obvious choice is to use |
| <A href="http://www.unicode.org/unicode/reports/tr15/">Unicode Normalization |
| Form C</A>, as is also recommended by the W3C's |
| <A href="http://www.w3.org/TR/charmod/"><CITE>Character Model for the World Wide |
| Web 1.0</CITE></A>. Using that solution, only <CODE>file:///str%C3%A4nge</CODE> |
| would be a valid URL to access the file <CODE>stränge</CODE>, while the URL |
| <CODE>file:///stra%CC%88nge</CODE> would be ruled out as invalid.</P> |
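| <P>As a sketch of the second solution, a URL segment can be checked against, or |
| converted to, Normalization Form C with Python's <CODE>unicodedata</CODE> |
| module. The function names are hypothetical, and nothing here reflects how |
| OpenOffice.org itself would implement such a rule.</P> |

```python
import unicodedata
from urllib.parse import quote, unquote


def is_nfc_segment(segment):
    """True if the decoded segment is already in Normalization Form C."""
    text = unquote(segment, encoding="utf-8", errors="strict")
    return unicodedata.normalize("NFC", text) == text


def nfc_segment(segment):
    """Decode one URL segment as UTF-8, apply NFC, and escape it again."""
    text = unquote(segment, encoding="utf-8", errors="strict")
    return quote(unicodedata.normalize("NFC", text), safe=":", encoding="utf-8")


print(is_nfc_segment("str%C3%A4nge"))   # True
print(is_nfc_segment("stra%CC%88nge"))  # False
print(nfc_segment("stra%CC%88nge"))     # str%C3%A4nge
```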
| |
| <P>Following the approach of requiring URLs to use a normalization form, a new |
| problem might show up: Consider an operating system that allows files to be |
| named using some Unicode encoding (any of the UTF-<VAR>n</VAR> encoding |
| forms/schemes), but that is non-compliant enough to allow two different files to |
| have names that a Unicode-compliant system should consider equivalent (e.g., a |
| file named <CODE>stränge</CODE> written using U+00E4 and another file named |
| <CODE>stränge</CODE> written using U+0061 followed by U+0308). Now, the |
| requirement to always use normalized Unicode strings within file URLs makes it |
| impossible to access one of the two files with a URL. (This is another |
| manifestation of the problem already described above, that an entity represented |
| in the system specific way cannot be represented in Unicode or, more |
| specifically, in some Unicode normalization form. Above, that problem was said |
| to be ignorable; here, one can only hope for Unicode-compliant operating systems |
| that do not allow two different files to have equivalent names.)</P> |
| |
| <H2>Drive Letters</H2> |
| |
| <P>In Windows, as well as in related systems like DOS and OS/2, file system |
| paths start with a <DFN>drive letter</DFN>, followed by a colon (e.g., the |
| <CODE>C:</CODE> in <CODE>C:\directory\file.txt</CODE>). That drive letter |
| (together with the following colon) makes up the first segment of file URLs on |
| these systems (as in <CODE>file:///C:/directory/file.txt</CODE>). Historically, |
| however, a vertical bar has also been used instead of the colon, as in |
| <CODE>file:///C|/directory/file.txt</CODE> (note that the vertical bar is not |
| escaped as <CODE>%7C</CODE> in this special case, even though it is an illegal |
| character).</P> |
| |
| <P>The OpenOffice.org code can handle both cases of file URLs (with either a |
| colon or a vertical bar), but URLs generated by the code always follow the |
| “standard” convention of using a colon.</P> |
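| <P>Accepting both spellings while emitting only the colon form could be |
| sketched like this. The function name is hypothetical, and the sketch only |
| covers URLs with an empty <CODE><VAR>host</VAR></CODE> part.</P> |

```python
import re

# One ASCII letter followed by "|" as the complete first segment of the path.
_DRIVE_BAR = re.compile(r"^(file:///[A-Za-z])\|(/|$)", re.IGNORECASE)


def normalize_drive_letter(url):
    """Rewrite file:///C|/... as file:///C:/...; leave other URLs unchanged."""
    return _DRIVE_BAR.sub(r"\1:\2", url)


print(normalize_drive_letter("file:///C|/directory/file.txt"))
# file:///C:/directory/file.txt
print(normalize_drive_letter("file:///C:/directory/file.txt"))
# file:///C:/directory/file.txt
```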
| |
| <H2>Hosts</H2> |
| |
| <P>The file URL scheme allows for an optional <CODE><VAR>host</VAR></CODE> |
| component after the <CODE>file://</CODE> prefix. Specifying |
| <CODE>localhost</CODE> (in any combination of upper and lower case letters) is |
| the same as leaving that component empty: the URL locates a file system path on |
| the current machine.</P> |
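| <P>The equivalence of an empty <CODE><VAR>host</VAR></CODE> component and |
| <CODE>localhost</CODE> can be expressed as a small check (the function name is |
| made up for this sketch):</P> |

```python
def is_local_file_url(url):
    """True if the host part of a file URL is empty or 'localhost'."""
    prefix = "file://"
    if not url.lower().startswith(prefix):
        raise ValueError("not a file URL: " + url)
    # The host part is everything between "file://" and the next slash.
    host = url[len(prefix):].partition("/")[0]
    return host == "" or host.lower() == "localhost"


print(is_local_file_url("file:///directory/file.txt"))           # True
print(is_local_file_url("file://LocalHost/directory/file.txt"))  # True
print(is_local_file_url("file://somewhere/somedir/file.txt"))    # False
```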
| |
| <P>The intended use for this <CODE><VAR>host</VAR></CODE> component was to |
| specify the DNS name (or IPv4/IPv6 address) of a machine, to indicate that the |
| URL locates a file system path on that machine. The problem is that there is no |
| protocol that details how such a remote file should be accessed, so |
| interpretation of file URLs containing a <CODE><VAR>host</VAR></CODE> component |
| was left unspecified.</P> |
| |
| <P>On Unix, the OpenOffice.org code does not support file URLs with a |
| <CODE><VAR>host</VAR></CODE> component. Windows, on the other hand, knows the |
| concept of <DFN>UNC paths</DFN>, file system paths containing the name of a |
| remote machine:</P> |
| |
| <PRE> |
| <CODE>\\somewhere\somedir\file.txt</CODE> |
| </PRE> |
| |
| <P>The machine names used in UNC paths (<CODE>somewhere</CODE> in the above |
| example) have a different structure from DNS names, so, strictly speaking, they |
| could not be used in the <CODE><VAR>host</VAR></CODE> component of file URLs. |
| Nevertheless, many applications on Windows, including the OpenOffice.org code, |
| follow the convention of allowing UNC machine names within file URLs. This |
| means that the above UNC path corresponds to the URL</P> |
| |
| <PRE> |
| <CODE>file://somewhere/somedir/file.txt</CODE> |
| </PRE> |
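| <P>This convention can be sketched as follows. The function name is |
| hypothetical, and the machine name is copied into the URL unchanged, which is |
| an assumption of this sketch rather than documented behavior.</P> |

```python
from urllib.parse import quote


def unc_path_to_file_url(path):
    """Map a UNC path such as \\\\machine\\dir\\file to a file URL with a host."""
    if not path.startswith("\\\\"):
        raise ValueError("not a UNC path: " + path)
    # The first component after the two backslashes is the machine name.
    machine, *segments = path[2:].split("\\")
    encoded = [quote(seg, safe="", encoding="utf-8") for seg in segments]
    return "file://" + machine + "/" + "/".join(encoded)


print(unc_path_to_file_url(r"\\somewhere\somedir\file.txt"))
# file://somewhere/somedir/file.txt
```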
| |
| <TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" summary=footer> |
| <TR> |
| <TD BGCOLOR="#666699"> |
| <P><FONT COLOR="White">Author: <A HREF="mailto:stephan.bergmann@sun.com"><FONT COLOR="White">Stephan Bergmann</FONT></A> (Last modification $Date: 2004/12/08 12:03:50 $).<br/> |
| Copyright 2002 <A HREF="http://www.openoffice.org"><FONT COLOR="White">OpenOffice.org</FONT></A> Foundation. All Rights Reserved.</FONT></P> |
| </TD> |
| </TR> |
| </TABLE> |
| </body> |
| </HTML> |