
 <HTML>
 <HEAD><TITLE>APR Canonical Filenames</TITLE></HEAD>
 <BODY>
 <h1>APR Canonical Filenames</h1>

 <h2>Requirements</h2>

 <p>APR porters need to address the underlying discrepancies between
 file systems.  To achieve a reasonable degree of security, the
 program depending upon APR needs to know that two paths may be
 compared, and that a mismatch is guaranteed to reflect that the
 two paths do not return the same resource.</p>

 <p>The first discrepancy is in volume roots.  Unix and pure derivatives
 have only one root path, "/".  Win32 and OS2 share root paths of
 the form "D:/", where D: is the volume designation.  However, this can
 be specified as "//./D:/" as well, indicating the D: volume of
 'this' machine.  Win32 and OS2 also may employ a UNC root path,
 of the form "//server/share/" where share is a share-point of the
 specified network server.  Finally, NetWare root paths are of the
 form "server/volume:/", or the simpler "volume:/" syntax for 'this'
 machine.  All these non-Unix file systems accept volume:path,
 without a slash following the colon, as a path relative to the
 current working directory, which APR will treat as ambiguous, that
 is, neither an absolute nor a relative path per se.</p>
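
 <p>The root forms described above can be told apart with a few pattern
 checks.  The following is only a sketch in Python, not APR code: the name
 classify_root and the returned labels are hypothetical, only the Win32
 style roots are covered (NetWare forms are left out), and reporting an
 incomplete root such as "/dir" or "//server" as "ambiguous" is an
 assumption made here.</p>

```python
import re


def classify_root(path):
    """Return (kind, root, rest) for a slash-separated Win32-style path.

    kind is "absolute", "ambiguous" or "relative" (labels are illustrative).
    """
    # "//./D:/" names the D: volume of 'this' machine; reduce it to "D:/".
    m = re.match(r"^//\./([A-Za-z]:)/(.*)$", path)
    if m:
        return ("absolute", m.group(1) + "/", m.group(2))
    # "//server/share/" is a UNC root, complete only once the share is named.
    m = re.match(r"^//([^/]+)/([^/]+)/(.*)$", path)
    if m:
        return ("absolute", "//%s/%s/" % (m.group(1), m.group(2)), m.group(3))
    # "D:/" is a volume root.
    m = re.match(r"^([A-Za-z]:)/(.*)$", path)
    if m:
        return ("absolute", m.group(1) + "/", m.group(2))
    # "D:path" depends on the current directory of D:, so it is ambiguous.
    m = re.match(r"^([A-Za-z]:)(.*)$", path)
    if m:
        return ("ambiguous", m.group(1), m.group(2))
    # Assumption: a leading slash without a complete root is also ambiguous.
    if path.startswith("/"):
        return ("ambiguous", "", path)
    return ("relative", "", path)
```

 <p>For example, classify_root("//./D:/tmp") and classify_root("D:/tmp")
 both report the root "D:/", while classify_root("D:tmp") is reported as
 ambiguous.</p>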

 <p>The second discrepancy is in the meaning of the 'this' directory.
 In general, 'this' must be eliminated from the path where it occurs.
 The syntax "path/./" and "path/" are both aliases to path.  However,
 this isn't file system independent, since the double slash "//" has
 a special meaning on OS2 and Win32 at the start of the path name,
 and is invalid on those platforms before the "//server/share/" UNC
 root path is completed.  Finally, as noted above, "//./volume/" is
 legal root syntax on WinNT, and perhaps others.</p>

 <p>The third discrepancy is in the context of the 'parent' directory.
 When "parent/path/.." occurs, the path must be unwound to "parent".
 It's also critical to simply truncate leading "/../" paths to "/",
 since the parent of the root is root.  This gets tricky on the
 Win32 and OS2 platforms, since the ".." element is invalid before
 the "//server/share/" is complete, and the "//server/share/../"
 sequence is the complete UNC root "//server/share/".  In relative
 paths, leading ".." elements are significant, until they are merged
 with an absolute path.  The relative form must only retain the ".."
 segments as leading segments, to be resolved once merged to another
 relative or an absolute path.</p>
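
 <p>To illustrate the 'this' and 'parent' rules together, here is a small
 Python sketch that resolves "." and ".." segments once the root has
 already been split off.  The function name resolve_dots and its
 root/rest arguments are hypothetical; an empty root is taken to mean a
 relative path.</p>

```python
def resolve_dots(root, rest):
    """Drop '.' segments and unwind '..' without climbing above root.

    root is a complete root such as "/" or "//server/share/", or "" for a
    relative path; rest is the slash-separated remainder.
    """
    kept = []
    for segment in rest.split("/"):
        if segment in ("", "."):
            continue  # 'this' directory or a doubled slash
        if segment == "..":
            if kept and kept[-1] != "..":
                kept.pop()  # unwind "parent/path/.." to "parent"
            elif not root:
                kept.append("..")  # relative: leading '..' stays significant
            # otherwise the path is rooted, and the parent of the root is root
            continue
        kept.append(segment)
    return root + "/".join(kept)
```

 <p>With these rules, resolve_dots("//server/share/", "../a/./b/..")
 yields "//server/share/a", and resolve_dots("", "../x/../../y") yields
 "../../y".</p>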

 <p>The fourth discrepancy occurs with acceptance of alternate character
 codes for the same element.  Path separators are not retained within
 the APR canonical forms.  The OS filesystem and APR (slashed) forms
 can both be returned as strings, to be used in the proper context.
 Win32 and Netware accept both slashes and backslashes as the
 same path separator symbol, whereas Unix accepts only slashes.
 While the APR form of the name strictly uses slashes, always consider
 that there could be a platform that actually accepts slashes as a
 character within a segment name.</p>
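
 <p>Converting between the OS form and the slashed form can be sketched
 as a pair of string replacements.  This is an illustration in Python
 only; to_apr_form and to_os_form are hypothetical names, and the set of
 alternate separators is passed in because it differs per platform.</p>

```python
def to_apr_form(os_path, alternate_separators="\\"):
    """Rewrite every alternate separator as a slash.

    Pass alternate_separators="" on a platform where the backslash is an
    ordinary character within a segment name.
    """
    for separator in alternate_separators:
        os_path = os_path.replace(separator, "/")
    return os_path


def to_os_form(apr_path, os_separator="\\"):
    """Rewrite the slashed form using the separator the OS prefers."""
    return apr_path.replace("/", os_separator)
```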

 <p>The fifth and worst discrepancy plagues Win32, OS2, Netware, and some
 filesystems mounted in Unix.  Case insensitivity can permit the same
 file to slip through in both its proper case and alternate cases.
 Simply changing the case is insufficient for any character set beyond
 ASCII, since various dialectic forms of characters suffer from one-to-many
 or many-to-one translations.  An example would be u-umlaut, which
 might be accepted as a single character u-umlaut, a two character
 sequence u and the zero-width umlaut, the upper case form of the same,
 or perhaps even a capital U alone.  This can be handled in different
 ways depending on the purposes of the APR based program, but the one
 requirement is that the path must be absolute in order to resolve these
 ambiguities.  Methods employed include comparison of device and inode
 file uniquifiers, which is a fairly fast operation, or querying the OS
 for the true form of the name, which can be much slower.  Only the
 acknowledgement of the file names by the OS can validate the equality
 of two different cases of the same filename.</p>
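
 <p>The u-umlaut example can be reproduced with Python's standard
 unicodedata module.  The sketch below shows why lower-casing alone is not
 enough; the fold helper is hypothetical, and its choice of NFC
 normalization plus case folding is an assumption that need not match
 what any particular file system does, which is why the OS remains the
 final authority.</p>

```python
import unicodedata

composed = "\u00fc"        # single character u-umlaut
decomposed = "u\u0308"     # 'u' followed by a zero-width combining umlaut
upper_composed = "\u00dc"  # upper case form of the single character


def fold(name):
    """String-level approximation of a case-insensitive comparison key."""
    return unicodedata.normalize("NFC", name).casefold()


# Lower-casing does not reconcile the one and two character spellings.
print(composed.lower() == decomposed.lower())
# Normalizing first, then folding case, maps all three to one key.
print(fold(composed) == fold(decomposed) == fold(upper_composed))
```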

 <p>The sixth discrepancy, illegal or insignificant characters, is especially
 significant in non-Unix file systems.  Trailing periods are accepted
 but never stored, so trailing periods must be ignored for any
 form of comparison.  And all OSes have certain expectations of what
 characters are illegal (or undesirable due to confusion).</p>
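
 <p>Ignoring trailing periods for comparison might look like the
 following Python sketch.  The helper name comparison_key is
 hypothetical, and leaving "." and ".." (or any name made only of
 periods) untouched is an assumption, since those segments carry their
 own meaning.</p>

```python
def comparison_key(segment):
    """Return the segment with insignificant trailing periods removed."""
    stripped = segment.rstrip(".")
    # Keep names that consist only of periods, such as "." and "..".
    return stripped if stripped else segment
```

 <p>Two segments such as "report.txt." and "report.txt" then share the
 same key, while a leading period, as in ".profile", is preserved.</p>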

 <p>A final warning: canonical functions don't transform or resolve case
 or character ambiguity issues until the path is resolved into an absolute
 path.  The relative canonical path, while useful for URL
 or similar identifiers, cannot be used for testing or comparison of file
 system objects.</p>

 <hr>

 <h2>Canonical API</h2>

 <p>Functions to manipulate the apr_canon_file_t (an opaque type) include:</p>

 <ul>
 <li>Create apr_canon_file_t (from char* path and apr_canon_file_t parent path)
 <li>Merge apr_canon_file_t (from path and parent, both apr_canon_file_t)
 <li>Get char* path of all or some segments
 <li>Get path flags of IsRelative, IsVirtualRoot, and IsAbsolute
 <li>Compare two apr_canon_file_t structures for file equality
 </ul>

 <p>The path is corrected to the file system case only if it is in absolute
 form.  The apr_canon_file_t should be preserved as long as possible and
 used as the parent to create child entries to reduce the number of expensive
 stat and case canonicalization calls to the OS.</p>

 <p>The comparison operation lets APR postpone correction
 of case by simply relying upon the device and inode for equivalence.  The
 stat implementation can establish that two files are the same even when their
 strings are not equivalent, and eliminates the need for the operating
 system to return the proper form of the name.</p>
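
 <p>The device and inode test can be sketched in Python with os.stat.
 The function name same_file is illustrative; Python's standard library
 offers os.path.samefile, which performs the same kind of check.</p>

```python
import os


def same_file(path_a, path_b):
    """True when both paths name the same file system object."""
    a = os.stat(path_a)
    b = os.stat(path_b)
    # Equal device and inode numbers identify one object, however the
    # two path strings happen to be spelled.
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```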

 <p>In any case, returning the char* path, with a flag to request the proper
 case, forces the OS calls to resolve the true names of each segment.  Where
 there is a penalty for this operation and the stat device and inode test
 is faster, case correction is postponed until the char* result is requested.
 On platforms that identify the inode, device, or proper name interchangeably
 with no penalties, this may occur when the name is initially processed.</p>
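
 <p>Postponing case correction until the string is requested could be
 sketched as below.  Everything here is hypothetical: LazyCanonPath and
 true_case are illustrative names, the lookup assumes a Unix-style
 absolute path without "." or ".." segments, and matching by directory
 listing is just one possible way to ask the OS for the stored spelling.</p>

```python
import os


def true_case(abs_path):
    """Spell each segment of abs_path as the directory listing spells it."""
    resolved = "/"
    for segment in [s for s in abs_path.split("/") if s]:
        try:
            names = os.listdir(resolved)
        except OSError:
            names = []  # unreadable or missing directory: keep the spelling
        if segment not in names:
            folded = segment.casefold()
            matches = [n for n in names if n.casefold() == folded]
            if matches:
                segment = matches[0]
        resolved = os.path.join(resolved, segment)
    return resolved


class LazyCanonPath:
    """Illustrative stand-in for a canonical path object."""

    def __init__(self, abs_path):
        self._given = abs_path
        self._true = None  # filled in on the first proper-case request

    def path(self, proper_case=False):
        if not proper_case:
            return self._given  # no OS calls needed
        if self._true is None:
            self._true = true_case(self._given)  # the expensive step, once
        return self._true
```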

 <hr>

 <h2>Unix Example</h2>

 <p>First the simplest case:</p>

 <pre>
 Parse Canonical Name
 accepts parent path as canonical_t
         this path as string

 Split this path Segments on '/'

 For each of this path Segments
   If first Segment
     If this Segment is Empty ([nothing]/)
       Append this Root Segment (don't merge)
       Continue to next Segment
     Else is relative
       Append parent Segments (to merge)
       Continue with this Segment
   If Segment is '.' or empty (2 slashes)
     Discard this Segment
     Continue with next Segment
   If Segment is '..'
     If no previous Segment or previous Segment is '..'
       Append this Segment
       Continue with next Segment
     If previous Segment and previous is not Root Segment
       Discard previous Segment
     Discard this Segment
     Continue with next Segment
   Append this Relative Segment
   Continue with next Segment
 </pre>
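
 <p>The pseudocode above translates fairly directly into Python.  This is
 a minimal sketch under stated assumptions rather than APR code: a
 canonical path is represented as a list of segments whose first element
 is "/" when the path is rooted, the names ROOT, parse_canonical and
 to_string are hypothetical, and treating an empty string as "the parent
 itself" is a choice made here.</p>

```python
ROOT = "/"  # marker segment for the Unix root (representation is an assumption)


def parse_canonical(parent, path):
    """Merge the string path onto the canonical segment list parent."""
    if path == "":
        return list(parent)  # assumption: an empty path names the parent
    result = []
    for index, segment in enumerate(path.split("/")):
        if index == 0:
            if segment == "":
                result.append(ROOT)  # leading slash: absolute, don't merge
                continue
            result.extend(parent)  # relative: start from the parent segments
        if segment in ("", "."):
            continue  # 'this' directory or a doubled slash
        if segment == "..":
            if not result or result[-1] == "..":
                result.append(segment)  # leading '..' stays significant
            elif result[-1] != ROOT:
                result.pop()  # unwind one segment; root's parent is root
            continue
        result.append(segment)
    return result


def to_string(segments):
    """Render a canonical segment list in the slashed form."""
    if segments and segments[0] == ROOT:
        return ROOT + "/".join(segments[1:])
    return "/".join(segments)


print(to_string(parse_canonical([], "/usr/./lib//x/../y")))            # /usr/lib/y
print(to_string(parse_canonical([], "/../etc")))                       # /etc
print(to_string(parse_canonical(["/", "home", "user"], "../docs/a")))  # /home/docs/a
print(to_string(parse_canonical([], "../../a/../b")))                  # ../../b
```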

 </BODY>
 </HTML>