| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
| <HTML> |
| <HEAD> |
| <TITLE>Apache Performance Notes</TITLE> |
| </HEAD> |
| <!-- Background white, links blue (unvisited), navy (visited), red (active) --> |
| <BODY |
| BGCOLOR="#FFFFFF" |
| TEXT="#000000" |
| LINK="#0000FF" |
| VLINK="#000080" |
| ALINK="#FF0000" |
| > |
| <H1>Apache Performance Notes</H1> |
| |
| <P>Author: Dean Gaudet |
| |
| <H3>Introduction</H3> |
| <P>Apache is a general webserver, which is designed to be correct first, and |
| fast second. Even so, it's performance is quite satisfactory. Most |
| sites have less than 10Mbits of outgoing bandwidth, which Apache can |
| fill using only a low end Pentium-based webserver. In practice sites |
| with more bandwidth require more than one machine to fill the bandwidth |
| due to other constraints (such as CGI or database transaction overhead). |
| For these reasons the development focus has been mostly on correctness |
| and configurability. |
| |
| <P>Unfortunately many folks overlook these facts and cite raw performance |
| numbers as if they are some indication of the quality of a web server |
| product. There is a bare minimum performance that is acceptable, beyond |
| that extra speed only caters to a much smaller segment of the market. |
| But in order to avoid this hurdle to the acceptance of Apache in some |
| markets, effort was put into Apache 1.3 to bring performance up to a |
| point where the difference with other high-end webservers is minimal. |
| |
| <P>Finally there are the folks who just plain want to see how fast something |
| can go. The author falls into this category. The rest of this document |
| is dedicated to these folks who want to squeeze every last bit of |
| performance out of Apache's current model, and want to understand why |
| it does some things which slow it down. |
| |
| <P>Note that this is tailored towards Apache 1.3 on Unix. Some of it applies |
| to Apache on NT. Apache on NT has not been tuned for performance yet, |
| in fact it probably performs very poorly because NT performance requires |
| a different programming model. |
| |
| <H3>Hardware and Operating System Issues</H3> |
| |
| <P>The single biggest hardware issue affecting webserver performance |
| is RAM. A webserver should never ever have to swap, swapping increases |
| the latency of each request beyond a point that users consider "fast |
| enough". This causes users to hit stop and reload, further increasing |
| the load. You can, and should, control the <CODE>MaxClients</CODE> |
| setting so that your server does not spawn so many children it starts |
| swapping. |
| |
| <P>Beyond that the rest is mundane: get a fast enough CPU, a fast enough |
| network card, and fast enough disks, where "fast enough" is something |
| that needs to be determined by experimentation. |
| |
| <P>Operating system choice is largely a matter of local concerns. But |
| a general guideline is to always apply the latest vendor TCP/IP patches. |
| HTTP serving completely breaks many of the assumptions built into Unix |
| kernels up through 1994 and even 1995. Good choices include |
| recent FreeBSD, and Linux. |
| |
| <H3>Run-Time Configuration Issues</H3> |
| |
| <H4>HostnameLookups</H4> |
| <P>Prior to Apache 1.3, <CODE>HostnameLookups</CODE> defaulted to On. |
| This adds latency |
| to every request because it requires a DNS lookup to complete before |
| the request is finished. In Apache 1.3 this setting defaults to Off. |
| However (1.3 or later), if you use any <CODE>allow from domain</CODE> or |
| <CODE>deny from domain</CODE> directives then you will pay for a |
| double reverse DNS lookup (a reverse, followed by a forward to make sure |
| that the reverse is not being spoofed). So for the highest performance |
| avoid using these directives (it's fine to use IP addresses rather than |
| domain names). |
| |
| <P>Note that it's possible to scope the directives, such as within |
| a <CODE><Location /server-status></CODE> section. In this |
| case the DNS lookups are only performed on requests matching the |
| criteria. Here's an example which disables |
| lookups except for .html and .cgi files: |
| |
| <BLOCKQUOTE><PRE> |
| HostnameLookups off |
| <Files ~ "\.(html|cgi)$> |
| HostnameLookups on |
| </Files> |
| </PRE></BLOCKQUOTE> |
| |
| But even still, if you just need DNS names |
| in some CGIs you could consider doing the |
| <CODE>gethostbyname</CODE> call in the specific CGIs that need it. |
| |
| <H4>FollowSymLinks and SymLinksIfOwnerMatch</H4> |
| <P>Wherever in your URL-space you do not have an |
| <CODE>Options FollowSymLinks</CODE>, or you do have an |
| <CODE>Options SymLinksIfOwnerMatch</CODE> Apache will have to |
| issue extra system calls to check up on symlinks. One extra call per |
| filename component. For example, if you had: |
| |
| <BLOCKQUOTE><PRE> |
| DocumentRoot /www/htdocs |
| <Directory /> |
| Options SymLinksIfOwnerMatch |
| </Directory> |
| </PRE></BLOCKQUOTE> |
| |
| and a request is made for the URI <CODE>/index.html</CODE>. |
| Then Apache will perform <CODE>lstat(2)</CODE> on <CODE>/www</CODE>, |
| <CODE>/www/htdocs</CODE>, and <CODE>/www/htdocs/index.html</CODE>. The |
| results of these <CODE>lstats</CODE> are never cached, |
| so they will occur on every single request. If you really desire the |
| symlinks security checking you can do something like this: |
| |
| <BLOCKQUOTE><PRE> |
| DocumentRoot /www/htdocs |
| <Directory /> |
| Options FollowSymLinks |
| </Directory> |
| <Directory /www/htdocs> |
| Options -FollowSymLinks +SymLinksIfOwnerMatch |
| </Directory> |
| </PRE></BLOCKQUOTE> |
| |
| This at least avoids the extra checks for the <CODE>DocumentRoot</CODE> |
| path. Note that you'll need to add similar sections if you have any |
| <CODE>Alias</CODE> or <CODE>RewriteRule</CODE> paths outside of your |
| document root. For highest performance, and no symlink protection, |
| set <CODE>FollowSymLinks</CODE> everywhere, and never set |
| <CODE>SymLinksIfOwnerMatch</CODE>. |
| |
| <H4>AllowOverride</H4> |
| |
| <P>Wherever in your URL-space you allow overrides (typically |
| <CODE>.htaccess</CODE> files) Apache will attempt to open |
| <CODE>.htaccess</CODE> for each filename component. For example, |
| |
| <BLOCKQUOTE><PRE> |
| DocumentRoot /www/htdocs |
| <Directory /> |
| AllowOverride all |
| </Directory> |
| </PRE></BLOCKQUOTE> |
| |
| and a request is made for the URI <CODE>/index.html</CODE>. Then |
| Apache will attempt to open <CODE>/.htaccess</CODE>, |
| <CODE>/www/.htaccess</CODE>, and <CODE>/www/htdocs/.htaccess</CODE>. |
| The solutions are similar to the previous case of <CODE>Options |
| FollowSymLinks</CODE>. For highest performance use |
| <CODE>AllowOverride None</CODE> everywhere in your filesystem. |
| |
| <H4>Negotiation</H4> |
| |
| <P>If at all possible, avoid content-negotiation if you're really |
| interested in every last ounce of performance. In practice the |
| benefits of negotiation outweigh the performance penalties. There's |
| one case where you can speed up the server. Instead of using |
| a wildcard such as: |
| |
| <BLOCKQUOTE><PRE> |
| DirectoryIndex index |
| </PRE></BLOCKQUOTE> |
| |
| Use a complete list of options: |
| |
| <BLOCKQUOTE><PRE> |
| DirectoryIndex index.cgi index.pl index.shtml index.html |
| </PRE></BLOCKQUOTE> |
| |
| where you list the most common choice first. |
| |
| <H4>Process Creation</H4> |
| |
| <P>Prior to Apache 1.3 the <CODE>MinSpareServers</CODE>, |
| <CODE>MaxSpareServers</CODE>, and <CODE>StartServers</CODE> settings |
| all had drastic effects on benchmark results. In particular, Apache |
| required a "ramp-up" period in order to reach a number of children |
| sufficient to serve the load being applied. After the initial |
| spawning of <CODE>StartServers</CODE> children, only one child per |
| second would be created to satisfy the <CODE>MinSpareServers</CODE> |
| setting. So a server being accessed by 100 simultaneous clients, |
| using the default <CODE>StartServers</CODE> of 5 would take on |
| the order 95 seconds to spawn enough children to handle the load. This |
| works fine in practice on real-life servers, because they aren't restarted |
| frequently. But does really poorly on benchmarks which might only run |
| for ten minutes. |
| |
| <P>The one-per-second rule was implemented in an effort to avoid |
| swamping the machine with the startup of new children. If the machine |
| is busy spawning children it can't service requests. But it has such |
| a drastic effect on the perceived performance of Apache that it had |
| to be replaced. As of Apache 1.3, |
| the code will relax the one-per-second rule. It |
| will spawn one, wait a second, then spawn two, wait a second, then spawn |
| four, and it will continue exponentially until it is spawning 32 children |
| per second. It will stop whenever it satisfies the |
| <CODE>MinSpareServers</CODE> setting. |
| |
| <P>This appears to be responsive enough that it's |
| almost unnecessary to twiddle the <CODE>MinSpareServers</CODE>, |
| <CODE>MaxSpareServers</CODE> and <CODE>StartServers</CODE> knobs. When |
| more than 4 children are spawned per second, a message will be emitted |
| to the <CODE>ErrorLog</CODE>. If you see a lot of these errors then |
| consider tuning these settings. Use the <CODE>mod_status</CODE> output |
| as a guide. |
| |
| <P>Related to process creation is process death induced by the |
| <CODE>MaxRequestsPerChild</CODE> setting. By default this is 0, which |
| means that there is no limit to the number of requests handled |
| per child. If your configuration currently has this set to some |
| very low number, such as 30, you may want to bump this up significantly. |
| If you are running SunOS or an old version of Solaris, limit this |
| to 10000 or so because of memory leaks. |
| |
| <P>When keep-alives are in use, children will be kept busy |
| doing nothing waiting for more requests on the already open |
| connection. The default <CODE>KeepAliveTimeout</CODE> of |
| 15 seconds attempts to minimize this effect. The tradeoff |
| here is between network bandwidth and server resources. |
| In no event should you raise this above about 60 seconds, as |
| <A HREF="http://www.research.digital.com/wrl/techreports/abstracts/95.4.html" |
| >most of the benefits are lost</A>. |
| |
| <H3>Compile-Time Configuration Issues</H3> |
| |
| <H4>mod_status and ExtendedStatus On</H4> |
| |
| <P>If you include <CODE>mod_status</CODE> |
| and you also set <CODE>ExtendedStatus On</CODE> when building and running |
| Apache, then on every request Apache will perform two calls to |
| <CODE>gettimeofday(2)</CODE> (or <CODE>times(2)</CODE> depending |
| on your operating system), and (pre-1.3) several extra calls to |
| <CODE>time(2)</CODE>. This is all done so that the status report |
| contains timing indications. For highest performance, set |
| <CODE>ExtendedStatus off</CODE> (which is the default). |
| |
| <H4>accept Serialization - multiple sockets</H4> |
| |
| <P>This discusses a shortcoming in the Unix socket API. |
| Suppose your |
| web server uses multiple <CODE>Listen</CODE> statements to listen on |
| either multiple ports or multiple addresses. In order to test each |
| socket to see if a connection is ready Apache uses <CODE>select(2)</CODE>. |
| <CODE>select(2)</CODE> indicates that a socket has <EM>zero</EM> or |
| <EM>at least one</EM> connection waiting on it. Apache's model includes |
| multiple children, and all the idle ones test for new connections at the |
| same time. A naive implementation looks something like this |
| (these examples do not match the code, they're contrived for |
| pedagogical purposes): |
| |
| <BLOCKQUOTE><PRE> |
| for (;;) { |
| for (;;) { |
| fd_set accept_fds; |
| |
| FD_ZERO (&accept_fds); |
| for (i = first_socket; i <= last_socket; ++i) { |
| FD_SET (i, &accept_fds); |
| } |
| rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL); |
| if (rc < 1) continue; |
| new_connection = -1; |
| for (i = first_socket; i <= last_socket; ++i) { |
| if (FD_ISSET (i, &accept_fds)) { |
| new_connection = accept (i, NULL, NULL); |
| if (new_connection != -1) break; |
| } |
| } |
| if (new_connection != -1) break; |
| } |
| process the new_connection; |
| } |
| </PRE></BLOCKQUOTE> |
| |
| But this naive implementation has a serious starvation problem. Recall |
| that multiple children execute this loop at the same time, and so multiple |
| children will block at <CODE>select</CODE> when they are in between |
| requests. All those blocked children will awaken and return from |
| <CODE>select</CODE> when a single request appears on any socket |
| (the number of children which awaken varies depending on the operating |
| system and timing issues). |
| They will all then fall down into the loop and try to <CODE>accept</CODE> |
| the connection. But only one will succeed (assuming there's still only |
| one connection ready), the rest will be <EM>blocked</EM> in |
| <CODE>accept</CODE>. |
| This effectively locks those children into serving requests from that |
| one socket and no other sockets, and they'll be stuck there until enough |
| new requests appear on that socket to wake them all up. |
| This starvation problem was first documented in |
| <A HREF="http://bugs.apache.org/index/full/467">PR#467</A>. There |
| are at least two solutions. |
| |
| <P>One solution is to make the sockets non-blocking. In this case the |
| <CODE>accept</CODE> won't block the children, and they will be allowed |
| to continue immediately. But this wastes CPU time. Suppose you have |
| ten idle children in <CODE>select</CODE>, and one connection arrives. |
| Then nine of those children will wake up, try to <CODE>accept</CODE> the |
| connection, fail, and loop back into <CODE>select</CODE>, accomplishing |
| nothing. Meanwhile none of those children are servicing requests that |
| occurred on other sockets until they get back up to the <CODE>select</CODE> |
| again. Overall this solution does not seem very fruitful unless you |
| have as many idle CPUs (in a multiprocessor box) as you have idle children, |
| not a very likely situation. |
| |
| <P>Another solution, the one used by Apache, is to serialize entry into |
| the inner loop. The loop looks like this (differences highlighted): |
| |
| <BLOCKQUOTE><PRE> |
| for (;;) { |
| <STRONG>accept_mutex_on ();</STRONG> |
| for (;;) { |
| fd_set accept_fds; |
| |
| FD_ZERO (&accept_fds); |
| for (i = first_socket; i <= last_socket; ++i) { |
| FD_SET (i, &accept_fds); |
| } |
| rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL); |
| if (rc < 1) continue; |
| new_connection = -1; |
| for (i = first_socket; i <= last_socket; ++i) { |
| if (FD_ISSET (i, &accept_fds)) { |
| new_connection = accept (i, NULL, NULL); |
| if (new_connection != -1) break; |
| } |
| } |
| if (new_connection != -1) break; |
| } |
| <STRONG>accept_mutex_off ();</STRONG> |
| process the new_connection; |
| } |
| </PRE></BLOCKQUOTE> |
| |
| <A NAME="serialize">The functions</A> |
| <CODE>accept_mutex_on</CODE> and <CODE>accept_mutex_off</CODE> |
| implement a mutual exclusion semaphore. Only one child can have the |
| mutex at any time. There are several choices for implementing these |
| mutexes. The choice is defined in <CODE>src/conf.h</CODE> (pre-1.3) or |
| <CODE>src/include/ap_config.h</CODE> (1.3 or later). Some architectures |
| do not have any locking choice made, on these architectures it is unsafe |
| to use multiple <CODE>Listen</CODE> directives. |
| |
| <DL> |
| <DT><CODE>USE_FLOCK_SERIALIZED_ACCEPT</CODE> |
| <DD>This method uses the <CODE>flock(2)</CODE> system call to lock a |
| lock file (located by the <CODE>LockFile</CODE> directive). |
| |
| <DT><CODE>USE_FCNTL_SERIALIZED_ACCEPT</CODE> |
| <DD>This method uses the <CODE>fcntl(2)</CODE> system call to lock a |
| lock file (located by the <CODE>LockFile</CODE> directive). |
| |
| <DT><CODE>USE_SYSVSEM_SERIALIZED_ACCEPT</CODE> |
| <DD>(1.3 or later) This method uses SysV-style semaphores to implement the |
| mutex. Unfortunately SysV-style semaphores have some bad side-effects. |
| One is that it's possible Apache will die without cleaning up the semaphore |
| (see the <CODE>ipcs(8)</CODE> man page). The other is that the semaphore |
| API allows for a denial of service attack by any CGIs running under the |
| same uid as the webserver (<EM>i.e.</EM>, all CGIs unless you use something |
| like suexec or cgiwrapper). For these reasons this method is not used |
| on any architecture except IRIX (where the previous two are prohibitively |
| expensive on most IRIX boxes). |
| |
| <DT><CODE>USE_USLOCK_SERIALIZED_ACCEPT</CODE> |
| <DD>(1.3 or later) This method is only available on IRIX, and uses |
| <CODE>usconfig(2)</CODE> to create a mutex. While this method avoids |
| the hassles of SysV-style semaphores, it is not the default for IRIX. |
| This is because on single processor IRIX boxes (5.3 or 6.2) the |
| uslock code is two orders of magnitude slower than the SysV-semaphore |
| code. On multi-processor IRIX boxes the uslock code is an order of magnitude |
| faster than the SysV-semaphore code. Kind of a messed up situation. |
| So if you're using a multiprocessor IRIX box then you should rebuild your |
| webserver with <CODE>-DUSE_USLOCK_SERIALIZED_ACCEPT</CODE> on the |
| <CODE>EXTRA_CFLAGS</CODE>. |
| |
| <DT><CODE>USE_PTHREAD_SERIALIZED_ACCEPT</CODE> |
| <DD>(1.3 or later) This method uses POSIX mutexes and should work on |
| any architecture implementing the full POSIX threads specification, |
| however appears to only work on Solaris (2.5 or later), and even then |
| only in certain configurations. If you experiment with this you should |
| watch out for your server hanging and not responding. Static content |
| only servers may work just fine. |
| </DL> |
| |
| <P>If your system has another method of serialization which isn't in the |
| above list then it may be worthwhile adding code for it (and submitting |
| a patch back to Apache). |
| |
| <P>Another solution that has been considered but never implemented is |
| to partially serialize the loop -- that is, let in a certain number |
| of processes. This would only be of interest on multiprocessor boxes |
| where it's possible multiple children could run simultaneously, and the |
| serialization actually doesn't take advantage of the full bandwidth. |
| This is a possible area of future investigation, but priority remains |
| low because highly parallel web servers are not the norm. |
| |
| <P>Ideally you should run servers without multiple <CODE>Listen</CODE> |
| statements if you want the highest performance. But read on. |
| |
| <H4>accept Serialization - single socket</H4> |
| |
| <P>The above is fine and dandy for multiple socket servers, but what |
| about single socket servers? In theory they shouldn't experience |
| any of these same problems because all children can just block in |
| <CODE>accept(2)</CODE> until a connection arrives, and no starvation |
| results. In practice this hides almost the same "spinning" behaviour |
| discussed above in the non-blocking solution. The way that most TCP |
| stacks are implemented, the kernel actually wakes up all processes blocked |
| in <CODE>accept</CODE> when a single connection arrives. One of those |
| processes gets the connection and returns to user-space, the rest spin in |
| the kernel and go back to sleep when they discover there's no connection |
| for them. This spinning is hidden from the user-land code, but it's |
| there nonetheless. This can result in the same load-spiking wasteful |
| behaviour that a non-blocking solution to the multiple sockets case can. |
| |
| <P>For this reason we have found that many architectures behave more |
| "nicely" if we serialize even the single socket case. So this is |
| actually the default in almost all cases. Crude experiments under |
| Linux (2.0.30 on a dual Pentium pro 166 w/128Mb RAM) have shown that |
| the serialization of the single socket case causes less than a 3% |
| decrease in requests per second over unserialized single-socket. |
| But unserialized single-socket showed an extra 100ms latency on |
| each request. This latency is probably a wash on long haul lines, |
| and only an issue on LANs. If you want to override the single socket |
| serialization you can define <CODE>SINGLE_LISTEN_UNSERIALIZED_ACCEPT</CODE> |
| and then single-socket servers will not serialize at all. |
| |
| <H4>Lingering Close</H4> |
| |
| <P>As discussed in |
| <A |
| HREF="http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt" |
| >draft-ietf-http-connection-00.txt</A> section 8, |
| in order for an HTTP server to <STRONG>reliably</STRONG> implement the protocol |
| it needs to shutdown each direction of the communication independently |
| (recall that a TCP connection is bi-directional, each half is independent |
| of the other). This fact is often overlooked by other servers, but |
| is correctly implemented in Apache as of 1.2. |
| |
| <P>When this feature was added to Apache it caused a flurry of |
| problems on various versions of Unix because of a shortsightedness. |
| The TCP specification does not state that the FIN_WAIT_2 state has a |
| timeout, but it doesn't prohibit it. On systems without the timeout, |
| Apache 1.2 induces many sockets stuck forever in the FIN_WAIT_2 state. |
| In many cases this can be avoided by simply upgrading to the latest |
| TCP/IP patches supplied by the vendor, in cases where the vendor has |
| never released patches (<EM>i.e.</EM>, SunOS4 -- although folks with a source |
| license can patch it themselves) we have decided to disable this feature. |
| |
| <P>There are two ways of accomplishing this. One is the |
| socket option <CODE>SO_LINGER</CODE>. But as fate would have it, |
| this has never been implemented properly in most TCP/IP stacks. Even |
| on those stacks with a proper implementation (<EM>i.e.</EM>, Linux 2.0.31) this |
| method proves to be more expensive (cputime) than the next solution. |
| |
| <P>For the most part, Apache implements this in a function called |
| <CODE>lingering_close</CODE> (in <CODE>http_main.c</CODE>). The |
| function looks roughly like this: |
| |
| <BLOCKQUOTE><PRE> |
| void lingering_close (int s) |
| { |
| char junk_buffer[2048]; |
| |
| /* shutdown the sending side */ |
| shutdown (s, 1); |
| |
| signal (SIGALRM, lingering_death); |
| alarm (30); |
| |
| for (;;) { |
| select (s for reading, 2 second timeout); |
| if (error) break; |
| if (s is ready for reading) { |
| read (s, junk_buffer, sizeof (junk_buffer)); |
| /* just toss away whatever is here */ |
| } |
| } |
| |
| close (s); |
| } |
| </PRE></BLOCKQUOTE> |
| |
| This naturally adds some expense at the end of a connection, but it |
| is required for a reliable implementation. As HTTP/1.1 becomes more |
| prevalent, and all connections are persistent, this expense will be |
| amortized over more requests. If you want to play with fire and |
| disable this feature you can define <CODE>NO_LINGCLOSE</CODE>, but |
| this is not recommended at all. In particular, as HTTP/1.1 pipelined |
| persistent connections come into use <CODE>lingering_close</CODE> |
| is an absolute necessity (and |
| <A HREF="http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html"> |
| pipelined connections are faster</A>, so you |
| want to support them). |
| |
| <H4>Scoreboard File</H4> |
| |
| <P>Apache's parent and children communicate with each other through |
| something called the scoreboard. Ideally this should be implemented |
| in shared memory. For those operating systems that we either have |
| access to, or have been given detailed ports for, it typically is |
| implemented using shared memory. The rest default to using an |
| on-disk file. The on-disk file is not only slow, but it is unreliable |
| (and less featured). Peruse the <CODE>src/main/conf.h</CODE> file |
| for your architecture and look for either <CODE>USE_MMAP_SCOREBOARD</CODE> or |
| <CODE>USE_SHMGET_SCOREBOARD</CODE>. Defining one of those two (as |
| well as their companions <CODE>HAVE_MMAP</CODE> and <CODE>HAVE_SHMGET</CODE> |
| respectively) enables the supplied shared memory code. If your system has |
| another type of shared memory, edit the file <CODE>src/main/http_main.c</CODE> |
| and add the hooks necessary to use it in Apache. (Send us back a patch |
| too please.) |
| |
| <P>Historical note: The Linux port of Apache didn't start to use |
| shared memory until version 1.2 of Apache. This oversight resulted |
| in really poor and unreliable behaviour of earlier versions of Apache |
| on Linux. |
| |
| <H4><CODE>DYNAMIC_MODULE_LIMIT</CODE></H4> |
| |
| <P>If you have no intention of using dynamically loaded modules |
| (you probably don't if you're reading this and tuning your |
| server for every last ounce of performance) then you should add |
| <CODE>-DDYNAMIC_MODULE_LIMIT=0</CODE> when building your server. |
| This will save RAM that's allocated only for supporting dynamically |
| loaded modules. |
| |
| <H3>Appendix: Detailed Analysis of a Trace</H3> |
| |
| Here is a system call trace of Apache 1.3 running on Linux. The run-time |
| configuration file is essentially the default plus: |
| |
| <BLOCKQUOTE><PRE> |
| <Directory /> |
| AllowOverride none |
| Options FollowSymLinks |
| </Directory> |
| </PRE></BLOCKQUOTE> |
| |
| The file being requested is a static 6K file of no particular content. |
| Traces of non-static requests or requests with content negotiation |
| look wildly different (and quite ugly in some cases). First the |
| entire trace, then we'll examine details. (This was generated by |
| the <CODE>strace</CODE> program, other similar programs include |
| <CODE>truss</CODE>, <CODE>ktrace</CODE>, and <CODE>par</CODE>.) |
| |
| <BLOCKQUOTE><PRE> |
| accept(15, {sin_family=AF_INET, sin_port=htons(22283), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3 |
| flock(18, LOCK_UN) = 0 |
| sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0 |
| getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 |
| setsockopt(3, IPPROTO_TCP1, [1], 4) = 0 |
| read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60 |
| sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0 |
| time(NULL) = 873959960 |
| gettimeofday({873959960, 404935}, NULL) = 0 |
| stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0 |
| open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4 |
| mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000 |
| writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389 |
| close(4) = 0 |
| time(NULL) = 873959960 |
| write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71 |
| gettimeofday({873959960, 417742}, NULL) = 0 |
| times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747 |
| shutdown(3, 1 /* send */) = 0 |
| oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0}) |
| read(3, "", 2048) = 0 |
| close(3) = 0 |
| sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0 |
| munmap(0x400ee000, 6144) = 0 |
| flock(18, LOCK_EX) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| <P>Notice the accept serialization: |
| |
| <BLOCKQUOTE><PRE> |
| flock(18, LOCK_UN) = 0 |
| ... |
| flock(18, LOCK_EX) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| These two calls can be removed by defining |
| <CODE>SINGLE_LISTEN_UNSERIALIZED_ACCEPT</CODE> as described earlier. |
| |
| <P>Notice the <CODE>SIGUSR1</CODE> manipulation: |
| |
| <BLOCKQUOTE><PRE> |
| sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0 |
| ... |
| sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0 |
| ... |
| sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| This is caused by the implementation of graceful restarts. When the |
| parent receives a <CODE>SIGUSR1</CODE> it sends a <CODE>SIGUSR1</CODE> |
| to all of its children (and it also increments a "generation counter" |
| in shared memory). Any children that are idle (between connections) |
| will immediately die |
| off when they receive the signal. Any children that are in keep-alive |
| connections, but are in between requests will die off immediately. But |
| any children that have a connection and are still waiting for the first |
| request will not die off immediately. |
| |
| <P>To see why this is necessary, consider how a browser reacts to a closed |
| connection. If the connection was a keep-alive connection and the request |
| being serviced was not the first request then the browser will quietly |
| reissue the request on a new connection. It has to do this because the |
| server is always free to close a keep-alive connection in between requests |
| (<EM>i.e.</EM>, due to a timeout or because of a maximum number of requests). |
| But, if the connection is closed before the first response has been |
| received the typical browser will display a "document contains no data" |
| dialogue (or a broken image icon). This is done on the assumption that |
| the server is broken in some way (or maybe too overloaded to respond |
| at all). So Apache tries to avoid ever deliberately closing the connection |
| before it has sent a single response. This is the cause of those |
| <CODE>SIGUSR1</CODE> manipulations. |
| |
| <P>Note that it is theoretically possible to eliminate all three of |
| these calls. But in rough tests the gain proved to be almost unnoticeable. |
| |
| <P>In order to implement virtual hosts, Apache needs to know the |
| local socket address used to accept the connection: |
| |
| <BLOCKQUOTE><PRE> |
| getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| It is possible to eliminate this call in many situations (such as when |
| there are no virtual hosts, or when <CODE>Listen</CODE> directives are |
| used which do not have wildcard addresses). But no effort has yet been |
| made to do these optimizations. |
| |
| <P>Apache turns off the Nagle algorithm: |
| |
| <BLOCKQUOTE><PRE> |
| setsockopt(3, IPPROTO_TCP1, [1], 4) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| because of problems described in |
| <A HREF="http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html">a |
| paper by John Heidemann</A>. |
| |
| <P>Notice the two <CODE>time</CODE> calls: |
| |
| <BLOCKQUOTE><PRE> |
| time(NULL) = 873959960 |
| ... |
| time(NULL) = 873959960 |
| </PRE></BLOCKQUOTE> |
| |
| One of these occurs at the beginning of the request, and the other occurs |
| as a result of writing the log. At least one of these is required to |
| properly implement the HTTP protocol. The second occurs because the |
| Common Log Format dictates that the log record include a timestamp of the |
| end of the request. A custom logging module could eliminate one of the |
| calls. Or you can use a method which moves the time into shared memory, |
| see the <A HREF="#patches">patches section below</A>. |
| |
| <P>As described earlier, <CODE>ExtendedStatus On</CODE> causes two |
| <CODE>gettimeofday</CODE> calls and a call to <CODE>times</CODE>: |
| |
| <BLOCKQUOTE><PRE> |
| gettimeofday({873959960, 404935}, NULL) = 0 |
| ... |
| gettimeofday({873959960, 417742}, NULL) = 0 |
| times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747 |
| </PRE></BLOCKQUOTE> |
| |
| These can be removed by setting <CODE>ExtendedStatus Off</CODE> (which |
| is the default). |
| |
| <P>It might seem odd to call <CODE>stat</CODE>: |
| |
| <BLOCKQUOTE><PRE> |
| stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| This is part of the algorithm which calculates the |
| <CODE>PATH_INFO</CODE> for use by CGIs. In fact if the request had been |
| for the URI <CODE>/cgi-bin/printenv/foobar</CODE> then there would be |
| two calls to <CODE>stat</CODE>. The first for |
| <CODE>/home/dgaudet/ap/apachen/cgi-bin/printenv/foobar</CODE> |
| which does not exist, and the second for |
| <CODE>/home/dgaudet/ap/apachen/cgi-bin/printenv</CODE>, which does exist. |
| Regardless, at least one <CODE>stat</CODE> call is necessary when |
| serving static files because the file size and modification times are |
| used to generate HTTP headers (such as <CODE>Content-Length</CODE>, |
| <CODE>Last-Modified</CODE>) and implement protocol features (such |
| as <CODE>If-Modified-Since</CODE>). A somewhat more clever server |
| could avoid the <CODE>stat</CODE> when serving non-static files, |
| however doing so in Apache is very difficult given the modular structure. |
| |
| <P>All static files are served using <CODE>mmap</CODE>: |
| |
| <BLOCKQUOTE><PRE> |
| mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000 |
| ... |
| munmap(0x400ee000, 6144) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| On some architectures it's slower to <CODE>mmap</CODE> small |
| files than it is to simply <CODE>read</CODE> them. The define |
| <CODE>MMAP_THRESHOLD</CODE> can be set to the minimum |
| size required before using <CODE>mmap</CODE>. By default |
| it's set to 0 (except on SunOS4 where experimentation has |
| shown 8192 to be a better value). Using a tool such as <A |
| HREF="http://www.bitmover.com/lmbench/">lmbench</A> you |
| can determine the optimal setting for your environment. |
| |
| <P>You may also wish to experiment with <CODE>MMAP_SEGMENT_SIZE</CODE> |
| (default 32768) which determines the maximum number of bytes that |
| will be written at a time from mmap()d files. Apache only resets the |
| client's <CODE>Timeout</CODE> in between write()s. So setting this |
| large may lock out low bandwidth clients unless you also increase the |
| <CODE>Timeout</CODE>. |
| |
| <P>It may even be the case that <CODE>mmap</CODE> isn't |
| used on your architecture, if so then defining <CODE>USE_MMAP_FILES</CODE> |
| and <CODE>HAVE_MMAP</CODE> might work (if it works then report back to us). |
| |
| <P>Apache does its best to avoid copying bytes around in memory. The |
| first write of any request typically is turned into a <CODE>writev</CODE> |
| which combines both the headers and the first hunk of data: |
| |
| <BLOCKQUOTE><PRE> |
| writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389 |
| </PRE></BLOCKQUOTE> |
| |
| When doing HTTP/1.1 chunked encoding Apache will generate up to four |
| element <CODE>writev</CODE>s. The goal is to push the byte copying |
| into the kernel, where it typically has to happen anyhow (to assemble |
| network packets). On testing, various Unixes (BSDI 2.x, Solaris 2.5, |
| Linux 2.0.31+) properly combine the elements into network packets. |
| Pre-2.0.31 Linux will not combine, and will create a packet for |
| each element, so upgrading is a good idea. Defining <CODE>NO_WRITEV</CODE> |
| will disable this combining, but result in very poor chunked encoding |
| performance. |
| |
| <P>The log write: |
| |
| <BLOCKQUOTE><PRE> |
| write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71 |
| </PRE></BLOCKQUOTE> |
| |
| can be deferred by defining <CODE>BUFFERED_LOGS</CODE>. In this case |
| up to <CODE>PIPE_BUF</CODE> bytes (a POSIX defined constant) of log entries |
| are buffered before writing. At no time does it split a log entry |
| across a <CODE>PIPE_BUF</CODE> boundary because those writes may not |
| be atomic. (<EM>i.e.</EM>, entries from multiple children could become mixed together). |
| The code does it best to flush this buffer when a child dies. |
| |
| <P>The lingering close code causes four system calls: |
| |
| <BLOCKQUOTE><PRE> |
| shutdown(3, 1 /* send */) = 0 |
| oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0}) |
| read(3, "", 2048) = 0 |
| close(3) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| which were described earlier. |
| |
| <P>Let's apply some of these optimizations: |
| <CODE>-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT -DBUFFERED_LOGS</CODE> and |
| <CODE>ExtendedStatus Off</CODE>. Here's the final trace: |
| |
| <BLOCKQUOTE><PRE> |
| accept(15, {sin_family=AF_INET, sin_port=htons(22286), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3 |
| sigaction(SIGUSR1, {SIG_IGN}, {0x8058c98, [], SA_INTERRUPT}) = 0 |
| getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 |
| setsockopt(3, IPPROTO_TCP1, [1], 4) = 0 |
| read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60 |
| sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0 |
| time(NULL) = 873961916 |
| stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0 |
| open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4 |
| mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400e3000 |
| writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389 |
| close(4) = 0 |
| time(NULL) = 873961916 |
| shutdown(3, 1 /* send */) = 0 |
| oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0}) |
| read(3, "", 2048) = 0 |
| close(3) = 0 |
| sigaction(SIGUSR1, {0x8058c98, [], SA_INTERRUPT}, {SIG_IGN}) = 0 |
| munmap(0x400e3000, 6144) = 0 |
| </PRE></BLOCKQUOTE> |
| |
| That's 19 system calls, of which 4 remain relatively easy to remove, |
| but don't seem worth the effort. |
| |
| <H3><A NAME="patches">Appendix: Patches Available</A></H3> |
| |
| There are |
| <A HREF="http://www.arctic.org/~dgaudet/apache/1.3/"> |
| several performance patches available for 1.3.</A> But they may |
| be slightly out of date by the time Apache 1.3.0 has been released, |
| it shouldn't be difficult for someone with a little C knowledge to |
| update them. In particular: |
| |
| <UL> |
| <LI>A |
| <A HREF="http://www.arctic.org/~dgaudet/apache/1.3/shared_time.patch" |
| >patch</A> to remove all <CODE>time(2)</CODE> system calls. |
| <LI>A |
| <A HREF="http://www.arctic.org/~dgaudet/apache/1.3/mod_include_speedups.patch" |
| >patch</A> to remove various system calls from <CODE>mod_include</CODE>, |
| these calls are used by few sites but required for backwards compatibility. |
| <LI>A |
| <A HREF="http://www.arctic.org/~dgaudet/apache/1.3/top_fuel.patch" |
| >patch</A> which integrates the above two plus a few other speedups at the |
| cost of removing some functionality. |
| </UL> |
| |
| <H3>Appendix: The Pre-Forking Model</H3> |
| |
| <P>Apache (on Unix) is a <EM>pre-forking</EM> model server. The |
| <EM>parent</EM> process is responsible only for forking <EM>child</EM> |
| processes, it does not serve any requests or service any network |
| sockets. The child processes actually process connections, they serve |
| multiple connections (one at a time) before dying. |
| The parent spawns new or kills off old |
| children in response to changes in the load on the server (it does so |
| by monitoring a scoreboard which the children keep up to date). |
| |
| <P>This model for servers offers a robustness that other models do |
| not. In particular, the parent code is very simple, and with a high |
| degree of confidence the parent will continue to do its job without |
| error. The children are complex, and when you add in third party |
| code via modules, you risk segmentation faults and other forms of |
| corruption. Even should such a thing happen, it only affects one |
| connection and the server continues serving requests. The parent |
| quickly replaces the dead child. |
| |
| <P>Pre-forking is also very portable across dialects of Unix. |
| Historically this has been an important goal for Apache, and it continues |
| to remain so. |
| |
| <P>The pre-forking model comes under criticism for various |
| performance aspects. Of particular concern are the overhead |
| of forking a process, the overhead of context switches between |
| processes, and the memory overhead of having multiple processes. |
| Furthermore it does not offer as many opportunities for data-caching |
| between requests (such as a pool of <CODE>mmapped</CODE> files). |
| Various other models exist and extensive analysis can be found in the |
| <A HREF="http://www.cs.wustl.edu/~jxh/research/research.html"> papers |
| of the JAWS project</A>. In practice all of these costs vary drastically |
| depending on the operating system. |
| |
| <P>Apache's core code is already multithread aware, and Apache version |
| 1.3 is multithreaded on NT. There have been at least two other experimental |
| implementations of threaded Apache, one using the 1.3 code base on DCE, |
| and one using a custom user-level threads package and the 1.0 code base, |
| neither are available publically. There is also an experimental port of |
| Apache 1.3 to <A HREF="http://www.mozilla.org/docs/refList/refNSPR/"> |
| Netscape's Portable Run Time</A>, which |
| <A HREF="http://www.arctic.org/~dgaudet/apache/2.0/">is available</A> |
| (but you're encouraged to join the |
| <A HREF="http://dev.apache.org/mailing-lists">new-httpd mailing list</A> |
| if you intend to use it). |
| Part of our redesign for version 2.0 |
| of Apache will include abstractions of the server model so that we |
| can continue to support the pre-forking model, and also support various |
| threaded models. |
| |
| </BODY> |
| </HTML> |