content/udk/cpp/man/spec/uris.html - openoffice-org - Git at Google

 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <HTML>
 <head>
     <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
     <TITLE>URIs in UDK</TITLE>
 <style type="text/css">
 <!--
 h1 { text-align:center; margin-top: 0.2cm; text-decoration: none; color: #ffffff; font-size: 6; margin-top: 0.2cm}
 h2 { margin-top: 0.2cm; margin-bottom=0.1cm; color: #ffffff; background-color: #666699 }
 li {margin-bottom: 0.2cm;}
 dl {margin-bottom: 0.2cm;}
 dd {margin-bottom: 0.2cm;}
 dt {margin-bottom: 0.2cm;}
 -->
 </style>
 </head>
 <body>
 <TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" bgcolor=#666699 summary=header>
     <TR>
         <td>
             <h1> URIs in UDK </h1>
 		<td></td>
 			<a href="http://www.openoffice.org"><img src="../../../images/open_office_org_logo.gif" alt="OpenOffice.org" align="right" border="0"></a>
         </TD>
     </TR>
 </TABLE>


 <P>This text describes how Uniform Resource Identifiers (URIs, see
 <A href="http://www.ietf.org/rfc/rfc2396.txt">RFC&nbsp;2396</A>) are used within
 the UDK and within OpenOffice.org in general.  URIs encompass both the widely
 known Uniform Resource Locators (URLs) and the lesser known Uniform Resource
 Names (URNs), but if you only know the term URL and neither URI nor URN, that's
 just fine, as the discussion here in fact centers around only a few URL
 schemes.</P>

 <H2>Used URL Schemes</H2>

 <P>Currently, the UDK only uses two URL schemes directly: file URLs and uno
 URLs.  File URLs are defined in
 <A href="http://www.ietf.org/rfc/rfc1738.txt">RFC&nbsp;1738</A>, but that
 definition leaves the semantics somewhat open.  The OpenOffice.org code chose to
 use a certain interpretation of file URLs (described in this text), but that
 interpretation can be incompatible with the interpretations used by other
 programs.</P>

 <P>Uno URLs follow a private scheme that is explained in the document <A HREF="../../../common/man/spec/uno-url.html"><CITE>UNO-Url</CITE></A>.

 <P>Other URL schemes (http, ftp, etc.) are used within OpenOffice.org (mainly
 through the class <CODE>INetURLObject</CODE> in the <CODE>tools</CODE> project),
 and many of them suffer from the same problems as file URLs (see below).</P>

 <H2>File URL Basics</H2>

 <P>A file URL consists of encoded data, interspersed with some delimiting
 characters.  Considering the file URL<P>

 <PRE>
 <CODE>file://<VAR>host</VAR>/<VAR>seg<SUB>1</SUB></VAR>/<VAR>seg<SUB>2</SUB></VAR>/<VAR>&hellip;</VAR>/<VAR>seg<SUB>n</SUB></VAR></CODE>
 </PRE>

 <P>the <CODE><VAR>host</VAR></CODE> and <CODE><VAR>seg<SUB>1</SUB></VAR></CODE>,
 <CODE><VAR>seg<SUB>2</SUB></VAR></CODE>, &hellip;,
 <CODE><VAR>seg<SUB>n</SUB></VAR></CODE> parts are the encoded data, and the
 rest (the <CODE>file://</CODE> and the single slashes) are syntactic delimiters.
 The encoded data in the <CODE><VAR>seg<SUB>i</SUB></VAR></CODE> parts are
 sequences of bytes, written using ASCII characters (that represent the numeric
 values of the ASCII characters themselves) and <DFN>escape sequences</DFN> (a
 <CODE>%</CODE> followed by two hexadecimal digits, representing any numeric
 value in the range 0&ndash;255).  (The encoded data in the optional
 <CODE><VAR>host</VAR></CODE> part follows other rules.)</P>

 <P>File URLs are used to locate files on a certain machine, that is, they
 somehow have to encode (platform-dependent) file system paths as used on that
 machine in their <CODE><VAR>seg<SUB>i</SUB></VAR></CODE> parts.  The problem is
 that there is no global specification of exactly how file URLs encode file
 system paths.</P>

 <P>The strategy chosen by the OpenOffice.org code maps from file system paths
 (as used in some operating system's interfaces) to file URLs in two steps.  In
 the first step (which is platform dependent), a hierarchical file system path is
 translated into a sequence of segments, represented in Unicode.  In the second
 step (which is platform independent), those segments are translated into
 sequences of bytes, using UTF-8, and then concatenated into URLs (encoding the
 individual bytes as single ASCII characters or as escape sequences).</P>

 <P>As an example, consider the Windows file system path</P>

 <PRE>
 <CODE>C:\directory\other dir\file.txt</CODE>
 </PRE>

 <P>This is first translated into the four segments <CODE>C:</CODE>,
 <CODE>directory</CODE>, <CODE>other dir</CODE>, and <CODE>file.txt</CODE> (all
 represented using Unicode).  Using UTF-8, these segments are then translated
 into the corresponding byte sequences (represented here as sequences of
 hexadecimal numbers):</P>

 <UL>
     <LI>43 3A <EM>(<CODE>C:</CODE>)</EM>
     <LI>64 69 72 65 63 74 6F 72 79 <EM>(<CODE>directory</CODE>)</EM>
     <LI>6F 74 68 65 72 20 64 69 72 <EM>(<CODE>other dir</CODE>)</EM>
     <LI>66 69 6C 65 2E 74 78 74 <EM>(<CODE>file.txt</CODE>)</EM>
 </UL>

 <P>Then, these byte sequences are combined into a single file URL (adding the
 necessary syntactic delimiters), namely</P>

 <PRE>
 <CODE>file:///C:/directory/other%20dir/file.txt</CODE>
 </PRE>

 <P>Nothing exciting here (all the bytes from the four sequences are represented
 as the corresponding ASCII characters, except for the space in <CODE>other
 dir</CODE>, which is illegal in URLs and is thus escaped as
 <CODE>%20</CODE>).</P>

 <P>Similarly, a Unix file system path</P>

 <PRE>
 <CODE>/directory/other dir/file.txt</CODE>
 </PRE>

 <P>is translated into the URL</P>

 <PRE>
 <CODE>file:///directory/other%20dir/file.txt</CODE>
 </PRE>

 <H2>Non-ASCII Characters</H2>

 <P>On many platforms, file system paths can contain non-ASCII characters.  This
 is typically handled within the operating system by naming files with byte
 strings, and letting the user choose a character encoding (via a locale) that
 specifies how these byte strings are to be interpreted.</P>

 <P>Consider the Unix file system path <CODE>/str&auml;nge</CODE> (assuming the
 user's locale is such that it contains an &auml; in the supported character
 repertoire).  Again, translation into a file URL is done in two steps:  First,
 the path is split into the single segment <CODE>str&auml;nge</CODE> (represented
 as a Unicode string; some platform-dependent &ldquo;magic&rdquo; is needed to
 convert from the operation system's representation to this Unicode
 representation).  Second, that segment is translated into the (hexadecimal) byte
 sequence 73 74 72 C3 A4 6E 67 65, and the URL <CODE>file:///str%C3%A4nge</CODE>
 is constructed from that byte sequence.</P>

 <P>Other programs may handle file URLs differently, in that they directly use
 the operating system's byte strings (interpreted in a locale chosen by the user)
 within the file URL, without going via Unicode and UTF-8.  For example, if the
 user had chosen a ISO 8859-1 locale, the Unix file system path
 <CODE>/str&auml;nge</CODE> (possibly represented as the hexadecimal byte string
 2F 73 74 72 E4 6E 67 65 by the operating system) would correspond to the URL
 <CODE>file:///str%E4nge</CODE>.</P>

 <P>If the OpenOffice.org code wants to exchange file URLs with such other
 programs, the URLs have to be converted back and forth at the interfaces, to
 avoid any misunderstandings.  This problem is typically not noticed when the
 file system paths contain only ASCII characters, as these are always represented
 the same within file URLs.</P>

 <P>Both approaches (&ldquo;UTF-8 URLs&rdquo; and &ldquo;locale dependent
 URLs&rdquo;) have pros and cons, but the main benefit of the OpenOffice.org
 (i.e., UTF-8 URLs) approach seems to be the stability of how a file URL locates
 a specific file, regardless of context.  Imagine a text processor application
 that lets you include URLs as hyperlinks within your documents, and imagine that
 application used locale dependent URLs.  When creating a document, a user has
 specified locale&nbsp;<VAR>X</VAR> for his operating system, and a file URL
 included in the document is therefore encoded using the conventions of
 locale&nbsp;<VAR>X</VAR>.  Now, the user switches to locale&nbsp;<VAR>Y</VAR>
 and re-opens the document:  The hyperlink is no longer guaranteed to point to
 the same file, as the file URL is not encoded using the conventions of
 locale&nbsp;<VAR>Y</VAR>.</P>

 <P>Also, with the UTF-8 URLs approach, code that handles file URLs can often be
 made simpler and platform-independent.  A typical scenario is a text edit field
 allowing the user to enter a (file) URL.  Many users are not familiar with the
 nitty details of URLs, and will type in things like
 <CODE>file:///str&auml;nge</CODE>.  Even though this is not a valid URL (since
 an <CODE>&auml;</CODE> is not allowed within a URL), it would be nice if the
 application were forgiving and would handle that input as locating the file
 <CODE>/str&auml;nge</CODE>.  With the UTF-8 URLs approach, this is easy, as the
 text edit can silently convert the input into the correct URL
 <CODE>file:///str%C3%A4nge</CODE>.  With the locale dependent URLs approach,
 this would be more difficult, as the text edit would have to know which locale
 is in use, to convert the input into something like
 <CODE>file:///str%E4nge</CODE>, but only in case the ISO 8859-1 locale was
 used.</P>

 <P>Note that these problems of interpreting non-ASCII characters are not
 restricted to file URLs.  Other URL schemes that do not explicitly state what
 character encoding to use (like http and ftp URLs) have similar problems.  As
 Richard Gillam puts it in his book <CITE>Unicode Demystified</CITE>
 (Addison-Wesley, 2003):</P>
 <BLOCKQUOTE>
     <P>The industry is converging around always treating escape sequences in
     URLs as referring to UTF-8 code units.  That is, the industry is leaning
     toward always interpreting <CODE>R%c3%a9sum%c3%a9.html</CODE> to mean
     <CODE>R&eacute;sum&eacute;.html</CODE> (and always representing
     <CODE>R&eacute;sum&eacute;.html</CODE> as
     <CODE>R%c3%a9sum%c3%a9.html</CODE>).  If everyone agreed on this system,
     then you could use illegal URL characters (such as the accented &eacute; in
     our example) in URL references in other kinds of documents (such as HTML or
     XML files) and know that a universally understood method of transforming
     them into a legal URL existed.  Web browsers or other software could do the
     reverse, displaying URLs that include escape sequences by using the
     characters the escape sequences represent (at least in the cases where they
     represent non-ASCII characters) and allowing you to type them in that
     way.</P>
 </BLOCKQUOTE>

 <H2>The UTF-8 URLs Approach</H2>

 <P>There are a number of problems specific to the UTF-8 URLs approach:</P>

 <P>First, the UTF-8 URLs approach has an additional step compared to the locale
 dependent URLs approach, namely the translation between a system specific
 representation of some textual entity and a Unicode representation of that
 entity.  In the one direction, a problem might occur when a certain entity
 represented in the system specific way cannot be represented in Unicode (e.g.,
 because it contains characters that are not present in the Unicode repertoire).
 Given that Unicode was specifically designed to encompass the repertoires of all
 legacy character encodings in use today, chances for such a problem should be
 close to zero.  In the other direction, different Unicode strings could
 translate to the same system specific representation (e.g., because two
 different Unicode characters are mapped to the same character in the system
 specific encoding).  This leads to two different URLs locating the same
 resource, something that should not be considered much of a problem, since it is
 already a wide-spread phenomenon (think about file URLs differing in the use of
 upper and lower case letters on a case-insensitive file system, or think about
 file systems that support links).</P>

 <P>Another problem stems from the fact that Unicode allows a single
 &ldquo;conceptual character&rdquo; (as interpreted by a user, e.g., the
 character &ldquo;&auml;&rdquo;) to be represented in different ways.  For
 example, the conceptual character &ldquo;&auml;&rdquo; can be represented as
 either the single code point</P>

 <UL>
     <LI>U+00E4 LATIN SMALL LETTER A WITH DIARESIS
 </UL>

 <P>or as a sequence of the two code points</P>

 <UL>
     <LI>U+0061 LATIN SMALL LETTER A
     <LI>U+0308 COMBINING DIARESIS
 </UL>

 <P>(a so-called &ldquo;combining character sequence&rdquo;).  Both
 representations should be considered equivalent, so that the two URLs
 <CODE>file:///str%C3%A4nge</CODE> (containing a UTF-8 encoded U+00E4) and
 <CODE>file:///stra%CC%88nge</CODE> (containing a U+0061 represented as
 ASCII&nbsp;<CODE>a</CODE> followed by a UTF-8 encoded U+0308) should probably be
 considered equivalent also, and should both denote a file named
 <CODE>str&auml;nge</CODE>.  In current versions of OpenOffice.org
 (&ldquo;SRC644&rdquo;), loading a file named <CODE>str&auml;nge</CODE> on
 Windows&nbsp;XP works with the former URL, but fails with the latter.</P>

 <P>Two solutions for this problem seem possible:  One solution would be to
 enhance the (platform dependent) code that maps from Unicode to a system
 specific character encoding so that it handles combining character sequences
 correctly.  Another solution is to require all URLs to use only one form of
 Unicode strings (i.e., to use a <DFN>normalization form</DFN>), which would make
 this problem go away.  The obvious choice is to use
 <A href="http://www.unicode.org/unicode/reports/tr15/">Unicode Normalization
 Form&nbsp;C</A>, as is also recommended by the W3C's
 <A href="http://www.w3.org/TR/charmod/"><CITE>Character Model for the World Wide
 Web 1.0</CITE></A>.  Using that solution, only <CODE>file:///str%C3%A4nge</CODE>
 would be a valid URL to access the file <CODE>str&auml;nge</CODE>, while the URL
 <CODE>file:///stra%CC%88nge</CODE> would be ruled out as invalid.</P>

 <P>Following the approach of requiring URLs to use a normalization form, a new
 problem might show up:  Consider an operating system that allows files to be
 named using some Unicode encoding (any of the UTF-<VAR>n</VAR> encoding
 forms/schemes), but that is uncompliant enough to allow two different files to
 have names that a Unicode-compliant system should consider equivalent (e.g., a
 file named <CODE>str&auml;nge</CODE> written using U+00E4 and another file named
 <CODE>str&auml;nge</CODE> written using U+0061 followed by U+0308).  Now, the
 requirement to always use normalized Unicode strings within file URLs makes it
 impossible to access one of the two files with a URL.  (This is another
 manifestation of the problem already described above, that an entity represented
 in the system specific way cannot be presented in Unicode&mdash;or, more
 specifically, in some Unicode normalization form.  Above, that problem was said
 to be ignorable; here, one can only hope for Unicode-compliant operating systems
 that do not allow two different files to have equivalent names.)</P>

 <H2>Drive Letters</H2>

 <P>In Windows, as well as in related systems like DOS and OS/2, file system
 paths start with a <DFN>drive letter</DFN>, followed by a colon (e.g., the
 <CODE>C:</CODE> in <CODE>C:\directory\file.txt</CODE>).  That drive letter
 (together with the following colon) makes up the first segment of file URLs on
 these systems (as in <CODE>file:///C:/directory/file.txt</CODE>).  However,
 historically also a vertical bar has been used instead of the colon, as in
 <CODE>file:///C|/directory/file.txt</CODE> (note that the vertical bar is not
 escaped as <CODE>%7C</CODE> in this special case, even though it is an illegal
 character).</P>

 <P>The OpenOffice.org code can handle both cases of file URLs (with either a
 colon or a vertical bar), but URLs generated by the code always follow the
 &ldquo;standard&rdquo; convention of using a colon.</P>

 <H2>Hosts</H2>

 <P>The file URL scheme allows for an optional <CODE><VAR>host</VAR></CODE>
 component after the <CODE>file://</CODE> prefix.  Specifying
 <CODE>localhost</CODE> (in any combination of upper and lower case letters) is
 the same as leaving that component empty: the URL locates a file system path on
 the current machine.</P>

 <P>The intended use for this <CODE><VAR>host</VAR></CODE> component was to
 specify the DNS name (or IPv4/IPv6 address) of a machine, to indicate that the
 URL locates a file system path on that machine.  The problem is that there is no
 protocol that details how such a remote file should be accessed, so
 interpretation of file URLs containing a <CODE><VAR>host</VAR></CODE> component
 was left unspecified.</P>

 <P>On Unix, the OpenOffice.org code does not support file URLs with a
 <CODE><VAR>host</VAR></CODE> component.  Windows, on the other hand, knows the
 concept of <DFN>UNC paths</DFN>, file system paths containing the name of a
 remote machine:</P>

 <PRE>
 <CODE>\\somewhere\somedir\file.txt</CODE>
 </PRE>

 <P>The machine names used in UNC paths (<CODE>somewhere</CODE> in the above
 example) have another structure than DNS names, so, strictly speaking, they
 could not be used in the <CODE><VAR>host</VAR></CODE> component of file URLs.
 Nevertheless, many applications on Windows, including the OpenOffice.org code,
 follow the convention of allowing UNC machine names within file URLs.  This
 means that the above UNC path corresponds with the URL</P>

 <PRE>
 <CODE>file://somewhere/somedir/file.txt</CODE>
 </PRE>

 <TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" summary=footer>
     <TR>
         <TD BGCOLOR="#666699">
             <P><FONT COLOR="White">Author: <A HREF="mailto:stephan.bergmann@sun.com"><FONT COLOR="White">Stephan Bergmann</FONT></A> (Last modification $Date: 2004/12/08 12:03:50 $).<br/>
 			Copyright 2002 <A HREF="http://www.openoffice.org"><FONT COLOR="White">OpenOffice.org</FONT></A> Foundation.  All Rights Reserved.</FONT></P>
         </TD>
     </TR>
 </TABLE>
 </body>
 </HTML>
	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
	<HTML>
	<head>
	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
	<TITLE>URIs in UDK</TITLE>
	<style type="text/css">
	<!--
	h1 { text-align:center; margin-top: 0.2cm; text-decoration: none; color: #ffffff; font-size: 6; margin-top: 0.2cm}
	h2 { margin-top: 0.2cm; margin-bottom=0.1cm; color: #ffffff; background-color: #666699 }
	li {margin-bottom: 0.2cm;}
	dl {margin-bottom: 0.2cm;}
	dd {margin-bottom: 0.2cm;}
	dt {margin-bottom: 0.2cm;}
	-->
	</style>
	</head>
	<body>
	<TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" bgcolor=#666699 summary=header>
	<TR>
	<td>
	<h1> URIs in UDK </h1>
	<td></td>
	<a href="http://www.openoffice.org"><img src="../../../images/open_office_org_logo.gif" alt="OpenOffice.org" align="right" border="0"></a>
	</TD>
	</TR>
	</TABLE>


	<P>This text describes how Uniform Resource Identifiers (URIs, see
	<A href="http://www.ietf.org/rfc/rfc2396.txt">RFC 2396</A>) are used within
	the UDK and within OpenOffice.org in general. URIs encompass both the widely
	known Uniform Resource Locators (URLs) and the lesser known Uniform Resource
	Names (URNs), but if you only know the term URL and neither URI nor URN, that's
	just fine, as the discussion here in fact centers around only a few URL
	schemes.</P>

	<H2>Used URL Schemes</H2>

	<P>Currently, the UDK only uses two URL schemes directly: file URLs and uno
	URLs. File URLs are defined in
	<A href="http://www.ietf.org/rfc/rfc1738.txt">RFC 1738</A>, but that
	definition leaves the semantics somewhat open. The OpenOffice.org code chose to
	use a certain interpretation of file URLs (described in this text), but that
	interpretation can be incompatible with the interpretations used by other
	programs.</P>

	<P>Uno URLs follow a private scheme that is explained in the document <A HREF="../../../common/man/spec/uno-url.html"><CITE>UNO-Url</CITE></A>.

	<P>Other URL schemes (http, ftp, etc.) are used within OpenOffice.org (mainly
	through the class <CODE>INetURLObject</CODE> in the <CODE>tools</CODE> project),
	and many of them suffer from the same problems as file URLs (see below).</P>

	<H2>File URL Basics</H2>

	<P>A file URL consists of encoded data, interspersed with some delimiting
	characters. Considering the file URL<P>

	<PRE>
	<CODE>file://<VAR>host</VAR>/<VAR>seg<SUB>1</SUB></VAR>/<VAR>seg<SUB>2</SUB></VAR>/<VAR>…</VAR>/<VAR>seg<SUB>n</SUB></VAR></CODE>
	</PRE>

	<P>the <CODE><VAR>host</VAR></CODE> and <CODE><VAR>seg<SUB>1</SUB></VAR></CODE>,
	<CODE><VAR>seg<SUB>2</SUB></VAR></CODE>, …,
	<CODE><VAR>seg<SUB>n</SUB></VAR></CODE> parts are the encoded data, and the
	rest (the <CODE>file://</CODE> and the single slashes) are syntactic delimiters.
	The encoded data in the <CODE><VAR>seg<SUB>i</SUB></VAR></CODE> parts are
	sequences of bytes, written using ASCII characters (that represent the numeric
	values of the ASCII characters themselves) and <DFN>escape sequences</DFN> (a
	<CODE>%</CODE> followed by two hexadecimal digits, representing any numeric
	value in the range 0–255). (The encoded data in the optional
	<CODE><VAR>host</VAR></CODE> part follows other rules.)</P>

	<P>File URLs are used to locate files on a certain machine, that is, they
	somehow have to encode (platform-dependent) file system paths as used on that
	machine in their <CODE><VAR>seg<SUB>i</SUB></VAR></CODE> parts. The problem is
	that there is no global specification of exactly how file URLs encode file
	system paths.</P>

	<P>The strategy chosen by the OpenOffice.org code maps from file system paths
	(as used in some operating system's interfaces) to file URLs in two steps. In
	the first step (which is platform dependent), a hierarchical file system path is
	translated into a sequence of segments, represented in Unicode. In the second
	step (which is platform independent), those segments are translated into
	sequences of bytes, using UTF-8, and then concatenated into URLs (encoding the
	individual bytes as single ASCII characters or as escape sequences).</P>

	<P>As an example, consider the Windows file system path</P>

	<PRE>
	<CODE>C:\directory\other dir\file.txt</CODE>
	</PRE>

	<P>This is first translated into the four segments <CODE>C:</CODE>,
	<CODE>directory</CODE>, <CODE>other dir</CODE>, and <CODE>file.txt</CODE> (all
	represented using Unicode). Using UTF-8, these segments are then translated
	into the corresponding byte sequences (represented here as sequences of
	hexadecimal numbers):</P>

	<UL>
	<LI>43 3A <EM>(<CODE>C:</CODE>)</EM>
	<LI>64 69 72 65 63 74 6F 72 79 <EM>(<CODE>directory</CODE>)</EM>
	<LI>6F 74 68 65 72 20 64 69 72 <EM>(<CODE>other dir</CODE>)</EM>
	<LI>66 69 6C 65 2E 74 78 74 <EM>(<CODE>file.txt</CODE>)</EM>
	</UL>

	<P>Then, these byte sequences are combined into a single file URL (adding the
	necessary syntactic delimiters), namely</P>

	<PRE>
	<CODE>file:///C:/directory/other%20dir/file.txt</CODE>
	</PRE>

	<P>Nothing exciting here (all the bytes from the four sequences are represented
	as the corresponding ASCII characters, except for the space in <CODE>other
	dir</CODE>, which is illegal in URLs and is thus escaped as
	<CODE>%20</CODE>).</P>

	<P>Similarly, a Unix file system path</P>

	<PRE>
	<CODE>/directory/other dir/file.txt</CODE>
	</PRE>

	<P>is translated into the URL</P>

	<PRE>
	<CODE>file:///directory/other%20dir/file.txt</CODE>
	</PRE>

	<H2>Non-ASCII Characters</H2>

	<P>On many platforms, file system paths can contain non-ASCII characters. This
	is typically handled within the operating system by naming files with byte
	strings, and letting the user choose a character encoding (via a locale) that
	specifies how these byte strings are to be interpreted.</P>

	<P>Consider the Unix file system path <CODE>/stränge</CODE> (assuming the
	user's locale is such that it contains an ä in the supported character
	repertoire). Again, translation into a file URL is done in two steps: First,
	the path is split into the single segment <CODE>stränge</CODE> (represented
	as a Unicode string; some platform-dependent “magic” is needed to
	convert from the operation system's representation to this Unicode
	representation). Second, that segment is translated into the (hexadecimal) byte
	sequence 73 74 72 C3 A4 6E 67 65, and the URL <CODE>file:///str%C3%A4nge</CODE>
	is constructed from that byte sequence.</P>

	<P>Other programs may handle file URLs differently, in that they directly use
	the operating system's byte strings (interpreted in a locale chosen by the user)
	within the file URL, without going via Unicode and UTF-8. For example, if the
	user had chosen a ISO 8859-1 locale, the Unix file system path
	<CODE>/stränge</CODE> (possibly represented as the hexadecimal byte string
	2F 73 74 72 E4 6E 67 65 by the operating system) would correspond to the URL
	<CODE>file:///str%E4nge</CODE>.</P>

	<P>If the OpenOffice.org code wants to exchange file URLs with such other
	programs, the URLs have to be converted back and forth at the interfaces, to
	avoid any misunderstandings. This problem is typically not noticed when the
	file system paths contain only ASCII characters, as these are always represented
	the same within file URLs.</P>

	<P>Both approaches (“UTF-8 URLs” and “locale dependent
	URLs”) have pros and cons, but the main benefit of the OpenOffice.org
	(i.e., UTF-8 URLs) approach seems to be the stability of how a file URL locates
	a specific file, regardless of context. Imagine a text processor application
	that lets you include URLs as hyperlinks within your documents, and imagine that
	application used locale dependent URLs. When creating a document, a user has
	specified locale <VAR>X</VAR> for his operating system, and a file URL
	included in the document is therefore encoded using the conventions of
	locale <VAR>X</VAR>. Now, the user switches to locale <VAR>Y</VAR>
	and re-opens the document: The hyperlink is no longer guaranteed to point to
	the same file, as the file URL is not encoded using the conventions of
	locale <VAR>Y</VAR>.</P>

	<P>Also, with the UTF-8 URLs approach, code that handles file URLs can often be
	made simpler and platform-independent. A typical scenario is a text edit field
	allowing the user to enter a (file) URL. Many users are not familiar with the
	nitty details of URLs, and will type in things like
	<CODE>file:///stränge</CODE>. Even though this is not a valid URL (since
	an <CODE>ä</CODE> is not allowed within a URL), it would be nice if the
	application were forgiving and would handle that input as locating the file
	<CODE>/stränge</CODE>. With the UTF-8 URLs approach, this is easy, as the
	text edit can silently convert the input into the correct URL
	<CODE>file:///str%C3%A4nge</CODE>. With the locale dependent URLs approach,
	this would be more difficult, as the text edit would have to know which locale
	is in use, to convert the input into something like
	<CODE>file:///str%E4nge</CODE>, but only in case the ISO 8859-1 locale was
	used.</P>

	<P>Note that these problems of interpreting non-ASCII characters are not
	restricted to file URLs. Other URL schemes that do not explicitly state what
	character encoding to use (like http and ftp URLs) have similar problems. As
	Richard Gillam puts it in his book <CITE>Unicode Demystified</CITE>
	(Addison-Wesley, 2003):</P>
	<BLOCKQUOTE>
	<P>The industry is converging around always treating escape sequences in
	URLs as referring to UTF-8 code units. That is, the industry is leaning
	toward always interpreting <CODE>R%c3%a9sum%c3%a9.html</CODE> to mean
	<CODE>Résumé.html</CODE> (and always representing
	<CODE>Résumé.html</CODE> as
	<CODE>R%c3%a9sum%c3%a9.html</CODE>). If everyone agreed on this system,
	then you could use illegal URL characters (such as the accented é in
	our example) in URL references in other kinds of documents (such as HTML or
	XML files) and know that a universally understood method of transforming
	them into a legal URL existed. Web browsers or other software could do the
	reverse, displaying URLs that include escape sequences by using the
	characters the escape sequences represent (at least in the cases where they
	represent non-ASCII characters) and allowing you to type them in that
	way.</P>
	</BLOCKQUOTE>

	<H2>The UTF-8 URLs Approach</H2>

	<P>There are a number of problems specific to the UTF-8 URLs approach:</P>

	<P>First, the UTF-8 URLs approach has an additional step compared to the locale
	dependent URLs approach, namely the translation between a system specific
	representation of some textual entity and a Unicode representation of that
	entity. In the one direction, a problem might occur when a certain entity
	represented in the system specific way cannot be represented in Unicode (e.g.,
	because it contains characters that are not present in the Unicode repertoire).
	Given that Unicode was specifically designed to encompass the repertoires of all
	legacy character encodings in use today, chances for such a problem should be
	close to zero. In the other direction, different Unicode strings could
	translate to the same system specific representation (e.g., because two
	different Unicode characters are mapped to the same character in the system
	specific encoding). This leads to two different URLs locating the same
	resource, something that should not be considered much of a problem, since it is
	already a wide-spread phenomenon (think about file URLs differing in the use of
	upper and lower case letters on a case-insensitive file system, or think about
	file systems that support links).</P>

	<P>Another problem stems from the fact that Unicode allows a single
	“conceptual character” (as interpreted by a user, e.g., the
	character “ä”) to be represented in different ways. For
	example, the conceptual character “ä” can be represented as
	either the single code point</P>

	<UL>
	<LI>U+00E4 LATIN SMALL LETTER A WITH DIARESIS
	</UL>

	<P>or as a sequence of the two code points</P>

	<UL>
	<LI>U+0061 LATIN SMALL LETTER A
	<LI>U+0308 COMBINING DIARESIS
	</UL>

	<P>(a so-called “combining character sequence”). Both
	representations should be considered equivalent, so that the two URLs
	<CODE>file:///str%C3%A4nge</CODE> (containing a UTF-8 encoded U+00E4) and
	<CODE>file:///stra%CC%88nge</CODE> (containing a U+0061 represented as
	ASCII <CODE>a</CODE> followed by a UTF-8 encoded U+0308) should probably be
	considered equivalent also, and should both denote a file named
	<CODE>stränge</CODE>. In current versions of OpenOffice.org
	(“SRC644”), loading a file named <CODE>stränge</CODE> on
	Windows XP works with the former URL, but fails with the latter.</P>

	<P>Two solutions for this problem seem possible: One solution would be to
	enhance the (platform dependent) code that maps from Unicode to a system
	specific character encoding so that it handles combining character sequences
	correctly. Another solution is to require all URLs to use only one form of
	Unicode strings (i.e., to use a <DFN>normalization form</DFN>), which would make
	this problem go away. The obvious choice is to use
	<A href="http://www.unicode.org/unicode/reports/tr15/">Unicode Normalization
	Form C</A>, as is also recommended by the W3C's
	<A href="http://www.w3.org/TR/charmod/"><CITE>Character Model for the World Wide
	Web 1.0</CITE></A>. Using that solution, only <CODE>file:///str%C3%A4nge</CODE>
	would be a valid URL to access the file <CODE>stränge</CODE>, while the URL
	<CODE>file:///stra%CC%88nge</CODE> would be ruled out as invalid.</P>

	<P>Following the approach of requiring URLs to use a normalization form, a new
	problem might show up: Consider an operating system that allows files to be
	named using some Unicode encoding (any of the UTF-<VAR>n</VAR> encoding
	forms/schemes), but that is uncompliant enough to allow two different files to
	have names that a Unicode-compliant system should consider equivalent (e.g., a
	file named <CODE>stränge</CODE> written using U+00E4 and another file named
	<CODE>stränge</CODE> written using U+0061 followed by U+0308). Now, the
	requirement to always use normalized Unicode strings within file URLs makes it
	impossible to access one of the two files with a URL. (This is another
	manifestation of the problem already described above, that an entity represented
	in the system specific way cannot be presented in Unicode—or, more
	specifically, in some Unicode normalization form. Above, that problem was said
	to be ignorable; here, one can only hope for Unicode-compliant operating systems
	that do not allow two different files to have equivalent names.)</P>

	<H2>Drive Letters</H2>

	<P>In Windows, as well as in related systems like DOS and OS/2, file system
	paths start with a <DFN>drive letter</DFN>, followed by a colon (e.g., the
	<CODE>C:</CODE> in <CODE>C:\directory\file.txt</CODE>). That drive letter
	(together with the following colon) makes up the first segment of file URLs on
	these systems (as in <CODE>file:///C:/directory/file.txt</CODE>). However,
	historically also a vertical bar has been used instead of the colon, as in
	<CODE>file:///C\|/directory/file.txt</CODE> (note that the vertical bar is not
	escaped as <CODE>%7C</CODE> in this special case, even though it is an illegal
	character).</P>

	<P>The OpenOffice.org code can handle both cases of file URLs (with either a
	colon or a vertical bar), but URLs generated by the code always follow the
	“standard” convention of using a colon.</P>

	<H2>Hosts</H2>

	<P>The file URL scheme allows for an optional <CODE><VAR>host</VAR></CODE>
	component after the <CODE>file://</CODE> prefix. Specifying
	<CODE>localhost</CODE> (in any combination of upper and lower case letters) is
	the same as leaving that component empty: the URL locates a file system path on
	the current machine.</P>

	<P>The intended use for this <CODE><VAR>host</VAR></CODE> component was to
	specify the DNS name (or IPv4/IPv6 address) of a machine, to indicate that the
	URL locates a file system path on that machine. The problem is that there is no
	protocol that details how such a remote file should be accessed, so
	interpretation of file URLs containing a <CODE><VAR>host</VAR></CODE> component
	was left unspecified.</P>

	<P>On Unix, the OpenOffice.org code does not support file URLs with a
	<CODE><VAR>host</VAR></CODE> component. Windows, on the other hand, knows the
	concept of <DFN>UNC paths</DFN>, file system paths containing the name of a
	remote machine:</P>

	<PRE>
	<CODE>\\somewhere\somedir\file.txt</CODE>
	</PRE>

	<P>The machine names used in UNC paths (<CODE>somewhere</CODE> in the above
	example) have another structure than DNS names, so, strictly speaking, they
	could not be used in the <CODE><VAR>host</VAR></CODE> component of file URLs.
	Nevertheless, many applications on Windows, including the OpenOffice.org code,
	follow the convention of allowing UNC machine names within file URLs. This
	means that the above UNC path corresponds with the URL</P>

	<PRE>
	<CODE>file://somewhere/somedir/file.txt</CODE>
	</PRE>

	<TABLE WIDTH="100%" BORDER="0" CELLSPACING="0" CELLPADDING="4" summary=footer>
	<TR>
	<TD BGCOLOR="#666699">
	<P><FONT COLOR="White">Author: <A HREF="mailto:stephan.bergmann@sun.com"><FONT COLOR="White">Stephan Bergmann</FONT></A> (Last modification $Date: 2004/12/08 12:03:50 $).<br/>
	Copyright 2002 <A HREF="http://www.openoffice.org"><FONT COLOR="White">OpenOffice.org</FONT></A> Foundation. All Rights Reserved.</FONT></P>
	</TD>
	</TR>
	</TABLE>
	</body>
	</HTML>