doc/developer-guide/cache-architecture/consistency.en.rst - trafficserver - Git at Google

 .. Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

 .. include:: ../../common.defs

 .. _developer-cache-consistency:

 Cache Tools
 ~~~~~~~~~~~

 Tools and techniques for cache monitoring and inspection.

 * :ref:`The cache inspector <inspecting-the-cache>`.

 Topics to be done
 ~~~~~~~~~~~~~~~~~

 * Resident alternates
 * Object refresh

 Cache Consistency
 ~~~~~~~~~~~~~~~~~

 The cache is completely consistent, up to and including kicking the power cord
 out, if the write buffer on consumer disk drives is disabled. You need to use::

   hdparm -W0

 The cache validates that all the data for the document is available and will
 silently mark a partial document as a miss on read. There is no gentle
 shutdown for Traffic Server. You simply kill the process and the recovery code
 (fsck) is run every time Traffic Server starts up.

 On startup the two versions of the index are checked, and the last valid one is
 read into memory. |TS| then moves forward from the last snapped write
 cursor and reads all the fragments written to disk and updates the directory
 (as in a log-based file system). It stops reading at the write before the last
 valid write header it sees (as a write is not necessarily atomic because of
 sector reordering). Then the new updated index is written to the invalid
 version (in case of a crash during startup) and the system starts.

 .. _volume tagging:

 Volume Tagging
 ~~~~~~~~~~~~~~

 Currently, :term:`cache volumes <cache volume>` are allocated somewhat
 arbitrarily from storage elements. `This enhancement <https://issues.apache.org/jira/browse/TS-1728>`__
 allows :file:`storage.config` to assign :term:`storage units <storage unit>` to
 specific :term:`volumes <cache volume>` although the volumes must still be
 listed in :file:`volume.config` in general and in particular to map domains to
 specific volumes. A primary use case for this is to be able to map specific
 types of content to different storage elements. This can be employed to have
 different storage devices for various types of content (SSD vs. rotational).

 Version Upgrade
 ---------------

 It is currently the case that any change to the cache format will clear the
 cache. This is an issue when upgrading the |TS| version and should be kept in mind.

 .. _cache-key:

 Controlling the cache key
 -------------------------

 The :term:`cache key` is by default the URL of the request. There are two
 possible choices, the original (pristine) URL and the remapped URL. Which of
 these is used is determined by the configuration value
 :ts:cv:`proxy.config.url_remap.pristine_host_hdr`.

 This is an ``INT`` value. If set to ``0`` (disabled) then the remapped URL is
 used, and if it is not ``0`` (enabled) then the original URL is used. This
 setting also controls the value of the ``HOST`` header that is placed in the
 request sent to the :term:`origin server`, using the hostname from the original
 URL if not ``0`` and the host name from the remapped URL if ``0``. It has no
 other effects.

 For caching, this setting is irrelevant if no remapping is done or there is a
 one-to-one mapping between the original and remapped URLs.

 It becomes significant if multiple original URLs are mapped to the same
 remapped URL. If pristine headers are enabled, requests to different original
 URLs will be stored as distinct :term:`objects <cache object>` in the cache. If
 disabled, the remapped URL will be used and there may be collisions. This is
 bad if the contents different, but quite useful if they are the same (as in
 situations where the original URLs are just aliases for the same underlying
 server resource).

 This is also an issue if a remapping is changed because it is effectively a
 time axis version of the previous case. If an original URL is remapped to a
 different server address then the setting determines if existing cached objects
 will be served for new requests (enabled) or not (disabled). Similarly, if the
 original URL mapped to a particular URL is changed then cached objects from the
 initial original URL will be served from the updated original URL if pristine
 headers is disabled.

 These collisions are not by themselves good or bad. An administrator needs to
 decide which is appropriate for their situation and set the value correspondingly.

 If a greater degree of control is desired, a plugin must be used to invoke the
 API calls :c:func:`TSHttpTxnCacheLookupUrlSet()` or  :c:func:`TSCacheUrlSet()`
 to provide a specific :term:`cache key`. The :c:func:`TSCacheUrlSet()` API can
 be called as early as ``TS_HTTP_READ_REQUEST_HDR_HOOK`` but no later than
 ``TS_HTTP_POST_REMAP_HOOK``. It can be called only once per transaction;
 calling it multiple times has no additional effect.

 A plugin that changes the cache key must do so consistently for both cache hit
 and cache miss requests because two different requests that map to the same
 cache key will be considered equivalent by the cache. Use of the URL directly
 provides this and so must any substitute. This is entirely the responsibility
 of the plugin; there is no way for the |TS| core to detect such an occurrence.

 If :c:func:`TSHttpTxnCacheLookupUrlGet()` is called after new cache url set by
 :c:func:`TSHttpTxnCacheLookupUrlSet()` or :c:func:`TSCacheUrlSet()`, it should
 use a URL location created by :c:func:`TSUrlCreate()` as its third input
 parameter instead of getting ``url_loc`` from the client request.

 It is a requirement that the string be syntactically a URL but otherwise it is
 completely arbitrary and need not have any path. For instance, if the company
 Network Geographics wanted to store certain content under its own
 :term:`cache key`, using a document GUID as part of the key, it could use a
 cache key like ::

    ngeo://W39WaGTPnvg

 The scheme ``ngeo`` was picked specifically because it is not a valid URL
 scheme, and so will never collide with any valid URL.

 This can be useful if the URL encodes both important and unimportant data.
 Instead of storing potentially identical content under different URLs (because
 they differ on the unimportant parts) a url containing only the important parts
 could be created and used.

 For example, suppose the URL for Network Geographics content encoded both the
 document GUID and a referral key. ::

    http://network-geographics-farm-1.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H

 We don't want to serve the same content for every possible referrer. Instead,
 we could use a plugin to convert this to the previous example and requests that
 differed only in the referrer key would all reference the same cache entry.
 Note that we would also map the following to the same cache key ::

    http://network-geographics-farm-56.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H

 This can be handy for sharing content between servers when that content is
 identical. Plugins can change the cache key, or not, depending on any data in
 the request header. For instance, not changing the cache key if the request is
 not in the ``doc`` directory. If distinguishing servers is important, that can
 easily be pulled from the request URL and used in the synthetic cache key. The
 implementer is free to extract all relevant elements for use in the cache key.

 While there is no explicit requirement that the synthetic cache key be based on
 the HTTP request header, in practice it is generally necessary due to the
 consistency requirement. Because cache lookup happens before attempting to
 connect to the :term:`origin server`, no data from the HTTP response header is
 available, leaving only the request header. The most common case is the one
 described above where the goal is to elide elements of the URL that do not
 affect the content to minimize cache footprint and improve cache hit rates.
	.. Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	.. include:: ../../common.defs

	.. _developer-cache-consistency:

	Cache Tools
	~~~~~~~~~~~

	Tools and techniques for cache monitoring and inspection.

	* :ref:`The cache inspector <inspecting-the-cache>`.

	Topics to be done
	~~~~~~~~~~~~~~~~~

	* Resident alternates
	* Object refresh

	Cache Consistency
	~~~~~~~~~~~~~~~~~

	The cache is completely consistent, up to and including kicking the power cord
	out, if the write buffer on consumer disk drives is disabled. You need to use::

	hdparm -W0

	The cache validates that all the data for the document is available and will
	silently mark a partial document as a miss on read. There is no gentle
	shutdown for Traffic Server. You simply kill the process and the recovery code
	(fsck) is run every time Traffic Server starts up.

	On startup the two versions of the index are checked, and the last valid one is
	read into memory. \|TS\| then moves forward from the last snapped write
	cursor and reads all the fragments written to disk and updates the directory
	(as in a log-based file system). It stops reading at the write before the last
	valid write header it sees (as a write is not necessarily atomic because of
	sector reordering). Then the new updated index is written to the invalid
	version (in case of a crash during startup) and the system starts.

	.. _volume tagging:

	Volume Tagging
	~~~~~~~~~~~~~~

	Currently, :term:`cache volumes <cache volume>` are allocated somewhat
	arbitrarily from storage elements. `This enhancement <https://issues.apache.org/jira/browse/TS-1728>`__
	allows :file:`storage.config` to assign :term:`storage units <storage unit>` to
	specific :term:`volumes <cache volume>` although the volumes must still be
	listed in :file:`volume.config` in general and in particular to map domains to
	specific volumes. A primary use case for this is to be able to map specific
	types of content to different storage elements. This can be employed to have
	different storage devices for various types of content (SSD vs. rotational).

	Version Upgrade
	---------------

	It is currently the case that any change to the cache format will clear the
	cache. This is an issue when upgrading the \|TS\| version and should be kept in mind.

	.. _cache-key:

	Controlling the cache key
	-------------------------

	The :term:`cache key` is by default the URL of the request. There are two
	possible choices, the original (pristine) URL and the remapped URL. Which of
	these is used is determined by the configuration value
	:ts:cv:`proxy.config.url_remap.pristine_host_hdr`.

	This is an ``INT`` value. If set to ``0`` (disabled) then the remapped URL is
	used, and if it is not ``0`` (enabled) then the original URL is used. This
	setting also controls the value of the ``HOST`` header that is placed in the
	request sent to the :term:`origin server`, using the hostname from the original
	URL if not ``0`` and the host name from the remapped URL if ``0``. It has no
	other effects.

	For caching, this setting is irrelevant if no remapping is done or there is a
	one-to-one mapping between the original and remapped URLs.

	It becomes significant if multiple original URLs are mapped to the same
	remapped URL. If pristine headers are enabled, requests to different original
	URLs will be stored as distinct :term:`objects <cache object>` in the cache. If
	disabled, the remapped URL will be used and there may be collisions. This is
	bad if the contents different, but quite useful if they are the same (as in
	situations where the original URLs are just aliases for the same underlying
	server resource).

	This is also an issue if a remapping is changed because it is effectively a
	time axis version of the previous case. If an original URL is remapped to a
	different server address then the setting determines if existing cached objects
	will be served for new requests (enabled) or not (disabled). Similarly, if the
	original URL mapped to a particular URL is changed then cached objects from the
	initial original URL will be served from the updated original URL if pristine
	headers is disabled.

	These collisions are not by themselves good or bad. An administrator needs to
	decide which is appropriate for their situation and set the value correspondingly.

	If a greater degree of control is desired, a plugin must be used to invoke the
	API calls :c:func:`TSHttpTxnCacheLookupUrlSet()` or :c:func:`TSCacheUrlSet()`
	to provide a specific :term:`cache key`. The :c:func:`TSCacheUrlSet()` API can
	be called as early as ``TS_HTTP_READ_REQUEST_HDR_HOOK`` but no later than
	``TS_HTTP_POST_REMAP_HOOK``. It can be called only once per transaction;
	calling it multiple times has no additional effect.

	A plugin that changes the cache key must do so consistently for both cache hit
	and cache miss requests because two different requests that map to the same
	cache key will be considered equivalent by the cache. Use of the URL directly
	provides this and so must any substitute. This is entirely the responsibility
	of the plugin; there is no way for the \|TS\| core to detect such an occurrence.

	If :c:func:`TSHttpTxnCacheLookupUrlGet()` is called after new cache url set by
	:c:func:`TSHttpTxnCacheLookupUrlSet()` or :c:func:`TSCacheUrlSet()`, it should
	use a URL location created by :c:func:`TSUrlCreate()` as its third input
	parameter instead of getting ``url_loc`` from the client request.

	It is a requirement that the string be syntactically a URL but otherwise it is
	completely arbitrary and need not have any path. For instance, if the company
	Network Geographics wanted to store certain content under its own
	:term:`cache key`, using a document GUID as part of the key, it could use a
	cache key like ::

	ngeo://W39WaGTPnvg

	The scheme ``ngeo`` was picked specifically because it is not a valid URL
	scheme, and so will never collide with any valid URL.

	This can be useful if the URL encodes both important and unimportant data.
	Instead of storing potentially identical content under different URLs (because
	they differ on the unimportant parts) a url containing only the important parts
	could be created and used.

	For example, suppose the URL for Network Geographics content encoded both the
	document GUID and a referral key. ::

	http://network-geographics-farm-1.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H

	We don't want to serve the same content for every possible referrer. Instead,
	we could use a plugin to convert this to the previous example and requests that
	differed only in the referrer key would all reference the same cache entry.
	Note that we would also map the following to the same cache key ::

	http://network-geographics-farm-56.com/doc/W39WaGTPnvg.2511635.UQB_zCc8B8H

	This can be handy for sharing content between servers when that content is
	identical. Plugins can change the cache key, or not, depending on any data in
	the request header. For instance, not changing the cache key if the request is
	not in the ``doc`` directory. If distinguishing servers is important, that can
	easily be pulled from the request URL and used in the synthetic cache key. The
	implementer is free to extract all relevant elements for use in the cache key.

	While there is no explicit requirement that the synthetic cache key be based on
	the HTTP request header, in practice it is generally necessary due to the
	consistency requirement. Because cache lookup happens before attempting to
	connect to the :term:`origin server`, no data from the HTTP response header is
	available, leaving only the request header. The most common case is the one
	described above where the goal is to elide elements of the URL that do not
	affect the content to minimize cache footprint and improve cache hit rates.