 <?xml version="1.0"?>
 <!DOCTYPE modulesynopsis SYSTEM "../style/modulesynopsis.dtd">
 <?xml-stylesheet type="text/xsl" href="../style/manual.en.xsl"?>
 <!-- $LastChangedRevision$ -->

 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->

 <modulesynopsis metafile="mod_unique_id.xml.meta">

 <name>mod_unique_id</name>
 <description>Provides an environment variable with a unique
 identifier for each request</description>
 <status>Extension</status>
 <sourcefile>mod_unique_id.c</sourcefile>
 <identifier>unique_id_module</identifier>

 <summary>

     <p>This module provides a magic token for each request which is
     guaranteed to be unique across "all" requests under very
     specific conditions. The unique identifier is even unique
     across multiple machines in a properly configured cluster of
     machines. The environment variable <code>UNIQUE_ID</code> is
     set to the identifier for each request. Unique identifiers are
     useful for various reasons which are beyond the scope of this
     document.</p>
 </summary>

 <section id="theory">
     <title>Theory</title>

     <p>First a brief recap of how the Apache server works on Unix
     machines. This feature currently isn't supported on Windows NT.
     On Unix machines, Apache creates several child processes, each
     of which handles one request at a time. Each child can serve
     multiple requests in its lifetime. For the purpose of this
     discussion, the children don't share any data with each other.
     We'll refer to the children as <dfn>httpd processes</dfn>.</p>

     <p>Your website has one or more machines under your
     administrative control; together we'll call them a cluster of
     machines. Each machine may run multiple instances of Apache.
     All of these collectively are considered "the universe", and
     with certain assumptions we'll show that in this universe we
     can generate unique identifiers for each request, without
     extensive communication between machines in the cluster.</p>

     <p>The machines in your cluster should satisfy these
     requirements. (Even if you have only one machine you should
     synchronize its clock with NTP.)</p>

     <ul>
       <li>The machines' times are synchronized via NTP or other
       network time protocol.</li>

       <li>The machines' hostnames all differ, such that the module
       can look up each machine's hostname and receive a different
       IP address for each machine in the cluster.</li>
     </ul>
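
     <p>The second requirement can be checked ahead of time. The
     following is a minimal sketch, not part of the module: the
     function name <code>find_address_collisions</code> and the
     example hostnames and addresses are hypothetical, and the
     mapping of hostname to IP address is assumed to have been
     gathered by whatever lookup your site uses.</p>

```python
from collections import defaultdict


def find_address_collisions(resolved):
    """Return {ip: [hostnames]} for every IP shared by several hosts.

    `resolved` maps each cluster hostname to the IP address that its
    lookup returned (hypothetical input, gathered by other means).
    """
    hosts_by_ip = defaultdict(list)
    for hostname, ip in resolved.items():
        hosts_by_ip[ip].append(hostname)
    # Keep only the addresses that more than one hostname resolved to.
    return {ip: sorted(hosts) for ip, hosts in hosts_by_ip.items()
            if len(hosts) > 1}


cluster = {
    "www1.example.com": "192.0.2.10",
    "www2.example.com": "192.0.2.11",
    "www3.example.com": "192.0.2.10",  # misconfigured: same IP as www1
}
collisions = find_address_collisions(cluster)
```

     <p>An empty result means every machine would contribute a
     different IP address to its identifiers.</p>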

     <p>As far as operating system assumptions go, we assume that
     pids (process ids) fit in 32 bits. If the operating system uses
     more than 32 bits for a pid, the fix is trivial but must be
     performed in the code.</p>

     <p>Given those assumptions, at a single point in time we can
     distinguish any httpd process on any machine in the cluster
     from all other httpd processes. The machine's IP address and
     the pid of the httpd process are sufficient to do this. So in
     order to generate unique identifiers for requests we need only
     distinguish between different points in time.</p>

     <p>To distinguish time we will use a Unix timestamp (seconds
     since January 1, 1970 UTC), and a 16-bit counter. The timestamp
     has only one-second granularity, so the counter is used to
     represent up to 65536 values during a single second. The
     quadruple <em>( ip_addr, pid, time_stamp, counter )</em> is
     sufficient to enumerate 65536 requests per second per httpd
     process. Pids are reused over time, however, and the counter
     also helps to alleviate the problems this causes.</p>
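
     <p>The quadruple can be pictured as a small record. The sketch
     below is illustrative only: the names <code>RequestId</code>
     and <code>ids_for_one_second</code> and the sample values are
     assumptions, not names used by the module. It shows that one
     process can issue 65536 distinct identifiers within a single
     second before the counter wraps around.</p>

```python
from typing import NamedTuple


class RequestId(NamedTuple):
    ip_addr: str     # IP address of the machine
    pid: int         # pid of the httpd process
    time_stamp: int  # whole seconds since the Unix epoch
    counter: int     # 16-bit counter, 0..65535


def ids_for_one_second(ip_addr, pid, time_stamp, first_counter, count):
    """Yield `count` identifiers issued by one process in one second."""
    for offset in range(count):
        yield RequestId(ip_addr, pid, time_stamp,
                        (first_counter + offset) % 65536)


one_second = list(
    ids_for_one_second("192.0.2.10", 4242, 1700000000, 65000, 65536))
distinct = len(set(one_second))  # every identifier differs
```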

     <p>When an httpd child is created, the counter is initialized
     with ( current microseconds divided by 10 ) modulo 65536 (this
     formula was chosen to eliminate some variance problems with the
     low order bits of the microsecond timers on some systems). When
     a unique identifier is generated, the time stamp used is the
     time the request arrived at the web server. The counter is
     incremented every time an identifier is generated (and allowed
     to roll over).</p>
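
     <p>The counter's life cycle might be sketched as follows. This
     is not the module's code: <code>ChildCounter</code> and
     <code>initial_counter</code> are hypothetical names, and
     "current microseconds" is assumed to mean the microsecond part
     of the current time (0 to 999999).</p>

```python
import time

COUNTER_MODULUS = 65536  # the counter is 16 bits wide


def initial_counter(microseconds):
    """Seed value: (current microseconds divided by 10) modulo 65536."""
    return (microseconds // 10) % COUNTER_MODULUS


class ChildCounter:
    """Per-child counter that advances for every identifier issued."""

    def __init__(self, microseconds=None):
        if microseconds is None:
            # Assumption: use the microsecond part of the current time.
            microseconds = time.time_ns() // 1000 % 1_000_000
        self.value = initial_counter(microseconds)

    def next(self):
        """Return the counter for this identifier, then increment it."""
        current = self.value
        self.value = (self.value + 1) % COUNTER_MODULUS  # roll over
        return current
```

     <p>Dividing by 10 discards the lowest decimal digit of the
     microsecond value, which is the part the text describes as
     unreliable on some systems.</p>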

     <p>The kernel generates a pid for each process as it forks the
     process, and pids are allowed to roll over (they're 16 bits on
     many Unixes, but newer systems have expanded to 32 bits). So
     over time the same pid will be reused. However, unless it is
     reused within the same second, it does not destroy the
     uniqueness of our quadruple. That is, we assume the system does
     not spawn 65536 processes in a one-second interval (it may even
     be 32768 processes on some Unixes, but even this isn't likely
     to happen).</p>

     <p>Suppose that time repeats itself for some reason. That is,
     suppose that the system's clock is wrong and it revisits a
     past time (or it is too far forward, is reset correctly, and
     then revisits the future time). In this case we can easily show
     that we can get pid and time stamp reuse. The choice of
     initializer for the counter is intended to help defeat this.
     Note that we really want a random number to initialize the
     counter, but there aren't any readily available random numbers
     on most systems (<em>i.e.</em>, you can't use rand() because
     you need to seed the generator, and can't seed it with the time
     because time, at least at one-second resolution, has repeated
     itself). This is not a perfect defense.</p>

     <p>How good a defense is it? Suppose that one of your machines
     serves at most 500 requests per second (which is a very
     reasonable upper bound at this writing, because systems
     generally do more than just shovel out static files). To do
     that it will require a number of children which depends on how
     many concurrent clients you have. But we'll be pessimistic and
     suppose that a single child is able to serve 500 requests per
     second. Of the 65536 possible starting counter values, about
     1000 are such that two sequences of 500 requests overlap. So
     there is roughly a 1.5% chance that if time (at one-second
     resolution) repeats itself this child will repeat a counter
     value, and uniqueness will be broken. This was a very
     pessimistic example, and with real-world values it's even less
     likely to occur. If your system is such that it's still likely
     to occur, then perhaps you should make the counter 32 bits (by
     editing the code).</p>
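
     <p>The estimate can be reproduced by counting. The sketch
     below assumes, as the example does, a single child that issues
     500 identifiers in the repeated second, and it fixes the first
     run at counters 0 to 499 without loss of generality. The exact
     count is 999, which the text rounds to 1000.</p>

```python
COUNTER_MODULUS = 65536
REQUESTS_PER_SECOND = 500

# Counters used the first time this second was seen: 0..499.
first_run = set(range(REQUESTS_PER_SECOND))


def overlaps(start):
    """True if a run of 500 counters beginning at `start` reuses one.

    Both runs are contiguous and equally long, so they share a value
    exactly when the second run's first or last counter falls inside
    the first run.
    """
    last = (start + REQUESTS_PER_SECOND - 1) % COUNTER_MODULUS
    return start in first_run or last in first_run


colliding_starts = sum(1 for s in range(COUNTER_MODULUS) if overlaps(s))
probability = colliding_starts / COUNTER_MODULUS
print(colliding_starts, f"{probability:.2%}")  # prints: 999 1.52%
```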

     <p>You may be concerned about the clock being "set back" when
     daylight saving time ends. However, this isn't an issue because
     the times used here are UTC, which "always" goes forward. Note
     that x86-based Unixes may need proper configuration for this to
     be true: they should be configured to assume that the
     motherboard clock is on UTC and compensate appropriately. Even
     so, if you're running NTP then your UTC time will be correct
     very shortly after reboot.</p>

     <p>The <code>UNIQUE_ID</code> environment variable is
     constructed by encoding the 112-bit (32-bit IP address, 32-bit
     pid, 32-bit time stamp, 16-bit counter) quadruple using the
     alphabet <code>[A-Za-z0-9@-]</code> in a manner similar to MIME
     base64 encoding, producing 19 characters. The MIME base64
     alphabet is actually <code>[A-Za-z0-9+/]</code>; however,
     <code>+</code> and <code>/</code> need to be specially encoded
     in URLs, which makes them less desirable. All values are
     encoded in network byte ordering so that the encoding is
     comparable across architectures of different byte ordering. The
     actual ordering of the encoding is: time stamp, IP address,
     pid, counter. This ordering has a purpose, but it should be
     emphasized that applications should not dissect the encoding.
     Applications should treat the entire encoded
     <code>UNIQUE_ID</code> as an opaque token, which can be
     compared against other <code>UNIQUE_ID</code>s for equality
     only.</p>
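
     <p>To make the arithmetic concrete (14 bytes in, 19 characters
     out), here is a rough sketch of such an encoding. It is not the
     module's code and is not claimed to reproduce its output
     byte for byte: <code>encode_unique_id</code> is a hypothetical
     name, and using standard base64 with <code>@</code> and
     <code>-</code> substituted for <code>+</code> and
     <code>/</code> is an assumption about what "similar to MIME
     base64" means.</p>

```python
import base64
import socket
import struct


def encode_unique_id(time_stamp, ip_addr, pid, counter):
    """Pack the quadruple in network byte order and encode it.

    Field order follows the text: time stamp, IP address, pid, counter.
    """
    raw = struct.pack("!I4sIH", time_stamp, socket.inet_aton(ip_addr),
                      pid, counter)  # 4 + 4 + 4 + 2 = 14 bytes = 112 bits
    # Base64 with '@' and '-' standing in for '+' and '/'.
    encoded = base64.b64encode(raw, altchars=b"@-")
    # 14 bytes encode to 19 characters plus one '=' of padding.
    return encoded.rstrip(b"=").decode("ascii")


token = encode_unique_id(1700000000, "192.0.2.10", 4242, 65535)
other = encode_unique_id(1700000000, "192.0.2.10", 4242, 0)
same_request = (token == other)  # equality is the only supported test
```

     <p>An application that receives such a token should store and
     compare it as a plain string, without decoding it.</p>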

     <p>The ordering was chosen such that it's possible to change
     the encoding in the future without worrying about collision
     with an existing database of <code>UNIQUE_ID</code>s. The new
     encodings should also keep the time stamp as the first element,
     and can otherwise use the same alphabet and bit length. Since
     the time stamps are essentially an increasing sequence, it's
     sufficient to have a <em>flag second</em> in which all machines
     in the cluster stop serving requests and stop using the old
     encoding format. Afterwards they can resume serving requests
     and begin issuing the new encodings.</p>

     <p>We believe this is a relatively portable solution to this
     problem. It can be extended to multithreaded systems like
     Windows NT, and can grow with future needs. The identifiers
     generated have essentially an infinite lifetime because future
     identifiers can be made longer as required. Essentially no
     communication is required between machines in the cluster (only
     NTP synchronization is required, which is low overhead), and no
     communication between httpd processes is required (the
     communication is implicit in the pid value assigned by the
     kernel). In very specific situations the identifier can be
     shortened, but more information needs to be assumed (for
     example the 32-bit IP address is overkill for any site, but
     there is no portable shorter replacement for it).</p>
 </section>


 </modulesynopsis>