docs/manual/mod/mod_unique_id.html - httpd - Git at Google

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
 <HTML>
 <HEAD>
 <TITLE>Apache module mod_unique_id</TITLE>
 </HEAD>

 <!-- Background white, links blue (unvisited), navy (visited), red (active) -->
 <BODY
  BGCOLOR="#FFFFFF"
  TEXT="#000000"
  LINK="#0000FF"
  VLINK="#000080"
  ALINK="#FF0000"
 >
 <!--#include virtual="header.html" -->
 <H1 ALIGN="CENTER">Module mod_unique_id</H1>

 This module provides a magic token for each request which is guaranteed
 to be unique across "all" requests under very specific conditions.
 The unique identifier is even unique across multiple machines in a
 properly configured cluster of machines.  The environment variable
 <CODE>UNIQUE_ID</CODE> is set to the identifier for each request.
 Unique identifiers are useful for various reasons which are beyond the
 scope of this document.

 <H2>Theory</H2>

 <P>
 First a brief recap of how the Apache server works on Unix machines.
 This feature currently isn't supported on Windows NT.  On Unix machines,
 Apache creates several children, the children process requests one at
 a time.  Each child can serve multiple requests in its lifetime.  For the
 purpose of this discussion, the children don't share any data
 with each other.  We'll refer to the children as httpd processes.

 <P>
 Your website has one or more machines under your administrative control,
 together we'll call them a cluster of machines.  Each machine can
 possibly run multiple instances of Apache.  All of these collectively
 are considered "the universe", and with certain assumptions we'll
 show that in this universe we can generate unique identifiers for each
 request, without extensive communication between machines in the cluster.

 <P>
 The machines in your cluster should satisfy these requirements.
 (Even if you have only one machine you should synchronize its clock
 with NTP.)

 <UL>
 <LI>The machines' times are synchronized via NTP or other network time
     protocol.

 <LI>The machines' hostnames all differ, such that the module can do a
     hostname lookup on the hostname and receive a different IP address
     for each machine in the cluster.
 </UL>

 <P>
 As far as operating system assumptions go, we assume that pids (process
 ids) fit in 32-bits.  If the operating system uses more than 32-bits
 for a pid, the fix is trivial but must be performed in the code.

 <P>
 Given those assumptions, at a single point in time we can identify
 any httpd process on any machine in the cluster from all other httpd
 processes.  The machine's IP address and the pid of the httpd process
 are sufficient to do this.  So in order to generate unique identifiers
 for requests we need only distinguish between different points in time.

 <P>
 To distinguish time we will use a Unix timestamp (seconds since January
 1, 1970 UTC), and a 16-bit counter.  The timestamp has only one second
 granularity, so the counter is used to represent up to 65536 values
 during a single second.  The quadruple <EM>( ip_addr, pid, time_stamp,
 counter )</EM> is sufficient to enumerate 65536 requests per second per
 httpd process.  There are issues however with pid reuse over
 time, and the counter is used to alleviate this issue.

 <P>
 When an httpd child is created, the counter is initialized with (
 current microseconds divided by 10 ) modulo 65536 (this formula was
 chosen to eliminate some variance problems with the low order bits of
 the microsecond timers on some systems).  When a unique identifier is
 generated, the time stamp used is the time the request arrived at the
 web server.  The counter is incremented every time an identifier is
 generated (and allowed to roll over).

 <P>
 The kernel generates a pid for each process as it forks the process, and
 pids are allowed to roll over (they're 16-bits on many Unixes, but newer
 systems have expanded to 32-bits).  So over time the same pid will be
 reused.  However unless it is reused within the same second, it does not
 destroy the uniqueness of our quadruple.  That is, we assume the system
 does not spawn 65536 processes in a one second interval (it may even be
 32768 processes on some Unixes, but even this isn't likely to happen).

 <P>
 Suppose that time repeats itself for some reason.  That is, suppose that
 the system's clock is screwed up and it revisits a past time (or it is
 too far forward, is reset correctly, and then revisits the future time).
 In this case we can easily show that we can get pid and time stamp reuse.
 The choice of initializer for the counter is intended to help defeat this.
 Note that we really want a random number to initialize the counter,
 but there aren't any readily available numbers on most systems (<EM>i.e.</EM>, you
 can't use rand() because you need to seed the generator, and can't seed
 it with the time because time, at least at one second resolution, has
 repeated itself).  This is not a perfect defense.

 <P>
 How good a defense is it?  Well suppose that one of your machines serves
 at most 500 requests per second (which is a very reasonable upper bound
 at this writing, because systems generally do more than just shovel out
 static files).  To do that it will require a number of children which
 depends on how many concurrent clients you have.  But we'll be pessimistic
 and suppose that a single child is able to serve 500 requests per second.
 There are 1000 possible starting counter values such that two sequences
 of 500 requests overlap.  So there is a 1.5% chance that if time (at one
 second resolution) repeats itself this child will repeat a counter value,
 and uniqueness will be broken.  This was a very pessimistic example,
 and with real world values it's even less likely to occur.  If your
 system is such that it's still likely to occur, then perhaps you should
 make the counter 32 bits (by editing the code).

 <P>
 You may be concerned about the clock being "set back" during summer
 daylight savings.  However this isn't an issue because the times used here
 are UTC, which "always" go forward.  Note that x86 based Unixes may need
 proper configuration for this to be true -- they should be configured to
 assume that the motherboard clock is on UTC and compensate appropriately.
 But even still, if you're running NTP then your UTC time will be correct
 very shortly after reboot.

 <P>
 The <CODE>UNIQUE_ID</CODE> environment variable is constructed by
 encoding the 112-bit (32-bit IP address, 32 bit pid, 32 bit time stamp,
 16 bit counter) quadruple using the alphabet <CODE>[A-Za-z0-9@-]</CODE>
 in a manner similar to MIME base64 encoding, producing 19 characters.
 The MIME base64 alphabet is actually <CODE>[A-Za-z0-9+/]</CODE> however
 <CODE>+</CODE> and <CODE>/</CODE> need to be specially encoded in URLs,
 which makes them less desirable.  All values are encoded in network
 byte ordering so that the encoding is comparable across architectures of
 different byte ordering.  The actual ordering of the encoding is: time
 stamp, IP address, pid, counter.  This ordering has a purpose, but it
 should be emphasized that applications should not dissect the encoding.
 Applications should treat the entire encoded <CODE>UNIQUE_ID</CODE> as an
 opaque token, which can be compared against other <CODE>UNIQUE_ID</CODE>s
 for equality only.

 <P>
 The ordering was chosen such that it's possible to change the encoding
 in the future without worrying about collision with an existing database
 of <CODE>UNIQUE_ID</CODE>s.  The new encodings should also keep the time
 stamp as the first element, and can otherwise use the same alphabet and
 bit length.  Since the time stamps are essentially an increasing sequence,
 it's sufficient to have a <EM>flag second</EM> in which all machines in the
 cluster stop serving and request, and stop using the old encoding format.
 Afterwards they can resume requests and begin issuing the new encodings.

 <P>
 This we believe is a relatively portable solution to this problem.  It can
 be extended to multithreaded systems like Windows NT, and can grow with
 future needs.  The identifiers generated have essentially an infinite
 life-time because future identifiers can be made longer as required.
 Essentially no communication is required between machines in the cluster
 (only NTP synchronization is required, which is low overhead), and no
 communication between httpd processes is required (the communication is
 implicit in the pid value assigned by the kernel).  In very specific
 situations the identifier can be shortened, but more information needs
 to be assumed (for example the 32-bit IP address is overkill for any
 site, but there is no portable shorter replacement for it).

 <HR>

 <H2>Directives</H2>

 <CODE>mod_unique_id</CODE> has no directives.

 <!--#include virtual="footer.html" -->
 </BODY>
 </HTML>
	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
	<HTML>
	<HEAD>
	<TITLE>Apache module mod_unique_id</TITLE>
	</HEAD>

	<!-- Background white, links blue (unvisited), navy (visited), red (active) -->
	<BODY
	BGCOLOR="#FFFFFF"
	TEXT="#000000"
	LINK="#0000FF"
	VLINK="#000080"
	ALINK="#FF0000"
	>
	<!--#include virtual="header.html" -->
	<H1 ALIGN="CENTER">Module mod_unique_id</H1>

	This module provides a magic token for each request which is guaranteed
	to be unique across "all" requests under very specific conditions.
	The unique identifier is even unique across multiple machines in a
	properly configured cluster of machines. The environment variable
	<CODE>UNIQUE_ID</CODE> is set to the identifier for each request.
	Unique identifiers are useful for various reasons which are beyond the
	scope of this document.

	<H2>Theory</H2>

	<P>
	First a brief recap of how the Apache server works on Unix machines.
	This feature currently isn't supported on Windows NT. On Unix machines,
	Apache creates several children, the children process requests one at
	a time. Each child can serve multiple requests in its lifetime. For the
	purpose of this discussion, the children don't share any data
	with each other. We'll refer to the children as httpd processes.

	<P>
	Your website has one or more machines under your administrative control,
	together we'll call them a cluster of machines. Each machine can
	possibly run multiple instances of Apache. All of these collectively
	are considered "the universe", and with certain assumptions we'll
	show that in this universe we can generate unique identifiers for each
	request, without extensive communication between machines in the cluster.

	<P>
	The machines in your cluster should satisfy these requirements.
	(Even if you have only one machine you should synchronize its clock
	with NTP.)

	<UL>
	<LI>The machines' times are synchronized via NTP or other network time
	protocol.

	<LI>The machines' hostnames all differ, such that the module can do a
	hostname lookup on the hostname and receive a different IP address
	for each machine in the cluster.
	</UL>

	<P>
	As far as operating system assumptions go, we assume that pids (process
	ids) fit in 32-bits. If the operating system uses more than 32-bits
	for a pid, the fix is trivial but must be performed in the code.

	<P>
	Given those assumptions, at a single point in time we can identify
	any httpd process on any machine in the cluster from all other httpd
	processes. The machine's IP address and the pid of the httpd process
	are sufficient to do this. So in order to generate unique identifiers
	for requests we need only distinguish between different points in time.

	<P>
	To distinguish time we will use a Unix timestamp (seconds since January
	1, 1970 UTC), and a 16-bit counter. The timestamp has only one second
	granularity, so the counter is used to represent up to 65536 values
	during a single second. The quadruple <EM>( ip_addr, pid, time_stamp,
	counter )</EM> is sufficient to enumerate 65536 requests per second per
	httpd process. There are issues however with pid reuse over
	time, and the counter is used to alleviate this issue.

	<P>
	When an httpd child is created, the counter is initialized with (
	current microseconds divided by 10 ) modulo 65536 (this formula was
	chosen to eliminate some variance problems with the low order bits of
	the microsecond timers on some systems). When a unique identifier is
	generated, the time stamp used is the time the request arrived at the
	web server. The counter is incremented every time an identifier is
	generated (and allowed to roll over).

	<P>
	The kernel generates a pid for each process as it forks the process, and
	pids are allowed to roll over (they're 16-bits on many Unixes, but newer
	systems have expanded to 32-bits). So over time the same pid will be
	reused. However unless it is reused within the same second, it does not
	destroy the uniqueness of our quadruple. That is, we assume the system
	does not spawn 65536 processes in a one second interval (it may even be
	32768 processes on some Unixes, but even this isn't likely to happen).

	<P>
	Suppose that time repeats itself for some reason. That is, suppose that
	the system's clock is screwed up and it revisits a past time (or it is
	too far forward, is reset correctly, and then revisits the future time).
	In this case we can easily show that we can get pid and time stamp reuse.
	The choice of initializer for the counter is intended to help defeat this.
	Note that we really want a random number to initialize the counter,
	but there aren't any readily available numbers on most systems (<EM>i.e.</EM>, you
	can't use rand() because you need to seed the generator, and can't seed
	it with the time because time, at least at one second resolution, has
	repeated itself). This is not a perfect defense.

	<P>
	How good a defense is it? Well suppose that one of your machines serves
	at most 500 requests per second (which is a very reasonable upper bound
	at this writing, because systems generally do more than just shovel out
	static files). To do that it will require a number of children which
	depends on how many concurrent clients you have. But we'll be pessimistic
	and suppose that a single child is able to serve 500 requests per second.
	There are 1000 possible starting counter values such that two sequences
	of 500 requests overlap. So there is a 1.5% chance that if time (at one
	second resolution) repeats itself this child will repeat a counter value,
	and uniqueness will be broken. This was a very pessimistic example,
	and with real world values it's even less likely to occur. If your
	system is such that it's still likely to occur, then perhaps you should
	make the counter 32 bits (by editing the code).

	<P>
	You may be concerned about the clock being "set back" during summer
	daylight savings. However this isn't an issue because the times used here
	are UTC, which "always" go forward. Note that x86 based Unixes may need
	proper configuration for this to be true -- they should be configured to
	assume that the motherboard clock is on UTC and compensate appropriately.
	But even still, if you're running NTP then your UTC time will be correct
	very shortly after reboot.

	<P>
	The <CODE>UNIQUE_ID</CODE> environment variable is constructed by
	encoding the 112-bit (32-bit IP address, 32 bit pid, 32 bit time stamp,
	16 bit counter) quadruple using the alphabet <CODE>[A-Za-z0-9@-]</CODE>
	in a manner similar to MIME base64 encoding, producing 19 characters.
	The MIME base64 alphabet is actually <CODE>[A-Za-z0-9+/]</CODE> however
	<CODE>+</CODE> and <CODE>/</CODE> need to be specially encoded in URLs,
	which makes them less desirable. All values are encoded in network
	byte ordering so that the encoding is comparable across architectures of
	different byte ordering. The actual ordering of the encoding is: time
	stamp, IP address, pid, counter. This ordering has a purpose, but it
	should be emphasized that applications should not dissect the encoding.
	Applications should treat the entire encoded <CODE>UNIQUE_ID</CODE> as an
	opaque token, which can be compared against other <CODE>UNIQUE_ID</CODE>s
	for equality only.

	<P>
	The ordering was chosen such that it's possible to change the encoding
	in the future without worrying about collision with an existing database
	of <CODE>UNIQUE_ID</CODE>s. The new encodings should also keep the time
	stamp as the first element, and can otherwise use the same alphabet and
	bit length. Since the time stamps are essentially an increasing sequence,
	it's sufficient to have a <EM>flag second</EM> in which all machines in the
	cluster stop serving and request, and stop using the old encoding format.
	Afterwards they can resume requests and begin issuing the new encodings.

	<P>
	This we believe is a relatively portable solution to this problem. It can
	be extended to multithreaded systems like Windows NT, and can grow with
	future needs. The identifiers generated have essentially an infinite
	life-time because future identifiers can be made longer as required.
	Essentially no communication is required between machines in the cluster
	(only NTP synchronization is required, which is low overhead), and no
	communication between httpd processes is required (the communication is
	implicit in the pid value assigned by the kernel). In very specific
	situations the identifier can be shortened, but more information needs
	to be assumed (for example the 32-bit IP address is overkill for any
	site, but there is no portable shorter replacement for it).

	<HR>

	<H2>Directives</H2>

	<CODE>mod_unique_id</CODE> has no directives.

	<!--#include virtual="footer.html" -->
	</BODY>
	</HTML>