<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<HTML>
<BODY>
<P>This package contains internal GemStone classes for implementing
caching on top of the GemFire Distributed System.</P>
<H1>Local Regions</H1>
<P>LocalRegion implements the basic caching mechanism and allows
subclasses to perform message distribution and other specialization
of LocalRegion functionality. A LocalRegion is an implementation of the
JRE <code>Map</code> interface that supports expiration, callbacks, server cache
communication, and so on.</P>
<P><a href="LocalRegion.html">LocalRegion</a> has a <a href="RegionMap.html">RegionMap</a> that
holds the actual data for the region.</P>
<P>Most changes to an entry in a LocalRegion are performed in three steps:</P>
<OL>
<LI>The entry is modified under synchronization using an
<a href="EntryEventImpl.html">EntryEventImpl</a> object. The event
is also queued for later callback invocation under this synchronization.</LI>
<LI>Distribution is allowed to occur outside of synchronization.</LI>
<LI>Synchronization is again obtained on the entry and callbacks are invoked.</LI>
</OL>
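<P>As a rough illustration, the sketch below shows the shape of this
three-step pattern. The names used here (<code>Entry</code>,
<code>EntryEvent</code>, <code>distribute</code>,
<code>invokeCallbacks</code>) are placeholders invented for the example,
not the actual LocalRegion internals.</P>
<pre>
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the three-step update pattern; not the real LocalRegion code.
class ThreeStepUpdateSketch {
  static class Entry { Object value; }
  static class EntryEvent { Object newValue; }

  private final List&lt;EntryEvent> pendingCallbacks = new ArrayList&lt;>();

  void put(Entry entry, EntryEvent event) {
    synchronized (entry) {            // 1. modify the entry under synchronization
      entry.value = event.newValue;
      pendingCallbacks.add(event);    //    and queue the event for later callbacks
    }
    distribute(event);                // 2. distribution occurs outside the entry lock
    synchronized (entry) {            // 3. re-synchronize on the entry and invoke callbacks
      invokeCallbacks(event);
    }
  }

  private void distribute(EntryEvent event) { /* send the update to other members */ }

  private void invokeCallbacks(EntryEvent event) {
    pendingCallbacks.remove(event);   // deliver the queued event to listeners
  }
}
</pre>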
<p>A LocalRegion may also have a <a href="DiskRegion.html">DiskRegion</a>
associated with it for persistence or overflow to disk.</p>
<H1>Distributed Regions</H1>
<p><a href="DistributedRegion.html">DistributedRegion</a> is a subclass
of LocalRegion that interacts with locking and the DistributedSystem to
implement distributed caching. Most DistributedRegion operations are
carried out using subclasses of
<a href="DistributedCacheOperation.html">DistributedCacheOperation</a>.</p>
<H1>Partitioned Regions</H1>
<P>The contents of a partitioned region are spread evenly across
multiple members of a distributed system. From the user's standpoint,
each member hosts a partition of the region and data is moved from
partition to partition in order to provide scalability and high
availability. The actual implementation of partitioned regions
divides each partition into sub-partitions named "buckets". A bucket
may be moved from one partition to another partition in a process called
"migration" when GemFire determines that the partitioned region's data
is not spread evenly across all members. When a bucket reaches a
maximum size, it is split in two and may be migrated to a different
partition.</P>
<P>Data is split among buckets using the Extensible Hashing algorithm,
which hashes data based upon the lower-order bits (the "mask") of the
data's value (in GemFire's case, the <code>Region</code> entry's key).
All partitions of a given region share a <em>directory</em> that
maintains a mapping between a mask and information about the bucket
that holds data matching the mask. When an entry is placed into a
partitioned region, the bucket directory is consulted to determine
which member(s) of the distributed system should be updated. The
Extensible Hashing algorithm is useful when a bucket fills up with data
and needs to be split. Other hashing algorithms require a complete
rebalancing of the partitioned region when a bucket is full. Extensible
Hashing, however, only requires that the full bucket be split into two,
thus allowing the other buckets to be accessed without delay. The
diagram below illustrates bucket splitting with Extensible Hashing.</P>
<P align="center">
<IMG src="{@docRoot}/javadoc-images/extensible-hashing.gif" width="641" height="292">
</P>
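<P>A minimal sketch of the mask-based lookup, assuming an invented
<code>BucketInfo</code> class and directory field rather than the real
metadata classes:</P>
<pre>
// Illustrative sketch of the Extensible Hashing lookup: the lower-order bits of the
// key's hash (the "mask") select a slot in the shared bucket directory.
class ExtensibleHashingSketch {
  static class BucketInfo { int localDepth; /* plus the bucket's locations */ }

  private int globalDepth = 2;                            // directory has 2^globalDepth slots
  private BucketInfo[] directory = new BucketInfo[1 &lt;&lt; globalDepth];

  BucketInfo lookup(Object key) {
    int mask = (1 &lt;&lt; globalDepth) - 1;                // keep only the low-order bits
    return directory[key.hashCode() &amp; mask];
  }
}
</pre>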
<P>A <code>BucketInfo</code> contains metadata about a bucket (most
importantly the locations of all copies of the bucket) that is
distributed to members that access the partitioned region. Changes to
the <code>BucketDirectory</code> metadata are coordinated through
GemFire's distributed lock service. Inside a region partition are
a number of <code>Bucket</code>s that hold the values for keys that
match the bucket's mask, as shown in the diagram below.</P>
<P align="center">
<IMG src="{@docRoot}/javadoc-images/partitioned-regions.gif" width="592" height="483">
</P>
<P>The total size (in bytes) of a bucket is maintained as key/value
pairs are added. It is not necessary for the bucket to store the
value of a region entry as an actual object, so the bucket stores
the value in its serialized <code>byte</code> form. This takes up
less space in the VM's heap and allows its size to be calculated
accurately. The entry's key, however, is used when looking up data in the
bucket and must be deserialized. As an estimate, the size of the key
object is assumed to be the size of the object's serialized
<code>byte</code>s. When an entry's value is replaced via an update
operation, the size of the old value is subtracted from the total size
before the size of the new value is added in. It is assumed that the
key does not change size.</P>
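<P>The following sketch shows the intended bookkeeping, assuming values
arrive already serialized; the class and method names are invented for
the example.</P>
<pre>
import java.util.HashMap;
import java.util.Map;

// Sketch of bucket size accounting: values are stored as serialized bytes, and an
// update subtracts the old value's size before adding the new one. The key is
// assumed not to change size.
class BucketSizeSketch {
  private final Map&lt;Object, byte[]> values = new HashMap&lt;>();
  private long totalBytes;

  void put(Object key, byte[] serializedValue) {
    byte[] old = values.put(key, serializedValue);
    if (old != null) {
      totalBytes -= old.length;          // remove the old value's contribution first
    }
    totalBytes += serializedValue.length;
  }

  long getTotalBytes() { return totalBytes; }
}
</pre>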
<P>When a bucket's size exceeds the "maximum bucket size", it is split
in two based on the Extensible Hashing algorithm: a new
<code>Bucket</code> is created and populated with the key/value
pairs that match its mask, the <code>Bucket</code>'s local depth is
incremented by 1, and the global depth is updated if the new local depth
exceeds the current global depth. The splitting process is repeated
while all of the following conditions are met: the size of either
bucket continues to exceed the "maximum bucket size", the full bucket
has more than one element, and the global depth is less than the
"maximum global depth".</P>
<h3>Primary Bucket</h3>
<P>One bucket instance is selected as the primary. All bucket operations
target the primary and are passed on to the backups from the primary.</P>
<P>Identification of the primary is tracked using metadata in
<code>BucketAdvisor</code>. The following diagram shows the standard
state transitions of the <code>BucketAdvisor</code>:</P>
<P align="center">
<IMG src="{@docRoot}/javadoc-images/BucketAdvisor-state.png" width="640" height="372">
</P>
<H3>Partitioned Region Cache Listeners</H3>
<P>User CacheListeners are registered on the PartitionedRegion. Activity
in the Buckets may fire callbacks on the PartitionedRegion's CacheListeners.
The following figures illustrate the logic and sequence involved.</P>
<P><TABLE border="1">
<TR><TD colspan="2"><B>Definition of Participants</B></TD></TR>
<TR><TD>pr_A1</TD><TD>a pure accessor</TD></TR>
<TR><TD>pr_B1_pri</TD><TD>a datastore which hosts primary for bucket B1</TD></TR>
<TR><TD>pr_B1_c1</TD><TD>a datastore which hosts copy 1 for Bucket B1</TD></TR>
<TR><TD>pr_B1_c2</TD><TD>a datastore which hosts copy 2 for Bucket B1</TD></TR>
<TR><TD>pr_A2_listener<BR>pr_A3_bridge<BR>pr_A4_gateway</TD><TD valign="top">pure accessors with CacheListener, Bridge or Gateway</TD></TR>
</TABLE></P>
<P><TABLE border="1">
<TR><TD colspan="5"><B>Fig. 1 (Flow of Put to CacheListeners)</B></TD></TR>
<TR><TD valign="top"><B>pr_A1</B></TD><TD valign="top"><B>pr_B1_pri</B></TD><TD valign="top"><B>pr_B1_c1</B></TD><TD valign="top"><B>pr_B1_c2</B></TD><TD><B>pr_A2_listener<BR>pr_A3_bridge<BR>pr_A4_gateway</B></TD></TR>
<TR><TD>putMessage1 --&gt;</TD><TD>operateOnPartitionRegion()</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>sync entry</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;update entry</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;UpdateOperation.distribute</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;&nbsp;&nbsp;if bucket add adjunct.recips</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD valign="top">&nbsp;&nbsp;&nbsp;&nbsp;send ----------------------&gt;</TD><TD>--&gt; see <B>Fig. 2</B><BR>&lt;------- reply</TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD valign="top">&nbsp;&nbsp;&nbsp;&nbsp;send ----------------------&gt;</TD><TD valign="top">---------&gt;</TD><TD>--&gt; see <B>Fig. 2</B><BR>&lt;------- reply</TD><TD></TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;&nbsp;&nbsp;if adjunct.recips > 0</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;PutMessage2 (notificationOnly == true)</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD valign="top">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;send --------------------------&gt;</TD><TD valign="top">---------&gt;</TD><TD valign="top">---------&gt;</TD><TD>--&gt; CacheListener fires on pr iff<br>InterestPolicy != CACHE_CONTENT<BR>&lt;------- reply</TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;waitForReplies (from all above msgs)</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>&nbsp;&nbsp;fire local CacheListener on pr</TD><TD></TD><TD></TD><TD></TD></TR>
<TR><TD></TD><TD>release entry sync</TD><TD></TD><TD></TD><TD></TD></TR>
</TABLE></P>
<P><TABLE border="1">
<TR><TD><B>Fig. 2 (Processing of UpdateOperation by non-Primary Bucket Host)</B></TD></TR>
<TR><TD>sync entry</TD></TR>
<TR><TD>&nbsp;&nbsp;update entry</TD></TR>
<TR><TD>&nbsp;&nbsp;CacheListener fires on pr iff InterestPolicy != CACHE_CONTENT</TD></TR>
<TR><TD>release entry sync</TD></TR>
<TR><TD>reply</TD></TR>
</TABLE></P>
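<P>For reference, a minimal sketch of registering a user CacheListener on
a partitioned region, assuming the public API (<code>CacheFactory</code>,
<code>RegionShortcut.PARTITION</code>,
<code>SubscriptionAttributes</code>); as the figures above show, whether
remote events reach the listener depends on the region's
<code>InterestPolicy</code>.</P>
<pre>
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.EntryEvent;
import org.apache.geode.cache.InterestPolicy;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionFactory;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.SubscriptionAttributes;
import org.apache.geode.cache.util.CacheListenerAdapter;

public class PartitionListenerExample {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    RegionFactory&lt;String, String> factory =
        cache.createRegionFactory(RegionShortcut.PARTITION);
    // Ask for all events; with InterestPolicy.CACHE_CONTENT the listener would only
    // hear about entries in locally hosted buckets.
    factory.setSubscriptionAttributes(new SubscriptionAttributes(InterestPolicy.ALL));
    factory.addCacheListener(new CacheListenerAdapter&lt;String, String>() {
      @Override
      public void afterCreate(EntryEvent&lt;String, String> event) {
        System.out.println("created " + event.getKey() + " = " + event.getNewValue());
      }
    });

    Region&lt;String, String> region = factory.create("exampleRegion");
    region.put("key", "value");   // fires afterCreate on listeners per the figures above
    cache.close();
  }
}
</pre>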
<h3>Migration</h3>
<P>Buckets are "migrated" to other members of the distributed system
to ensure that the contents of a partitioned region are evenly spread
across all members of a distributed system that wish to host
partitioned data. After a bucket is split, a migration operation is
triggered. Migration may also occur when a <code>Cache</code> exceeds
its <code>maxPartitionedData</code> threshold and when a new member
that can host partitioned data joins the distributed system. Each
member is consulted to determine how much partitioned region data it
is currently hosting and the {@linkplain
org.apache.geode.cache.Cache#getMaxPartitionedData maximum amount} of
partitioned region data it can host. The largest bucket hosted by the
VM is migrated to the member with the largest percentage of space
available for partitioned data. This ensures that data is spread
evenly among members of the distributed system and that their space
available for partitioned region data fills consistently. Migration will
continue until the amount of partitioned data hosted by the member
initiating the migration falls below the average for all members.
When a member that hosts partitions {@linkplain
org.apache.geode.cache.Cache#close closes} its <code>Cache</code>,
the partitions are migrated to other hosts.</P>
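<P>An illustrative sketch of the target choice: the bucket moves to the
member with the largest fraction of its partitioned-data budget still
free. <code>MemberStats</code> and its fields are invented for the
example.</P>
<pre>
import java.util.Comparator;
import java.util.List;

class MigrationTargetSketch {
  static class MemberStats {
    long hostedBytes;           // partitioned region data currently hosted
    long maxPartitionedData;    // the most this member is willing to host

    double freeFraction() {
      return 1.0 - (double) hostedBytes / (double) maxPartitionedData;
    }
  }

  // Pick the member with the largest share of its space still available.
  static MemberStats chooseTarget(List&lt;MemberStats> members) {
    return members.stream()
        .max(Comparator.comparingDouble(MemberStats::freeFraction))
        .orElseThrow(IllegalStateException::new);
  }
}
</pre>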
<h3>High Availability</h3>
<P>The high availability ({@linkplain
org.apache.geode.cache.PartitionAttributes#getRedundancy
redundancy}) feature of partitioned regions affects the implementation
in a number of ways. When a bucket is created, the implementation
uses the migration algorithm to determine the location(s) of any
redundant copies of the buckets. A warning is logged if there is not
enough room (or not enough members) to guarantee the redundancy of the
partitioned region. When an entry is <code>put</code> into a
redundant partitioned region, the key/value is distributed to each
bucket according to the consistency specified by the region's scope.
That is, if the region is <code>DISTRIBUTED_ACK</code>, the
<code>put</code> operation will not return until it has received an
acknowledgment from each bucket. When a <code>get</code> is performed
on a partitioned region and the value is not already in the
partitioned region's local cache, a targeted <code>netSearch</code>
is performed. When there are redundant copies of the region's
buckets, the <code>netSearch</code> chooses one bucket at random from
which to fetch the value. If the bucket does not respond within a
given timeout, then the process is repeated on another randomly
chosen redundant bucket. If the bucket has been migrated to another
member, then the member operating on the region will re-consult its
metadata and retry the operation. When redundant buckets are migrated
from one machine to another, the implementation is careful to ensure
that multiple copies of a bucket are not hosted by the same member.</P>
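<P>The targeted <code>netSearch</code> retry can be pictured with the
sketch below; <code>BucketCopy</code> and <code>fetch()</code> are
placeholders, not the real implementation.</P>
<pre>
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class NetSearchSketch {
  interface BucketCopy {
    Object fetch(Object key, long timeoutMillis);   // null if no reply within the timeout
  }

  static Object netSearch(List&lt;BucketCopy> redundantCopies, Object key, long timeoutMillis) {
    List&lt;BucketCopy> candidates = new ArrayList&lt;>(redundantCopies);
    Collections.shuffle(candidates);                // try the redundant copies in random order
    for (BucketCopy copy : candidates) {
      Object value = copy.fetch(key, timeoutMillis);
      if (value != null) {
        return value;                               // first copy to answer wins
      }
      // no reply within the timeout: try another randomly chosen copy
    }
    return null;                                    // caller re-consults metadata and retries
  }
}
</pre>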
<H1>System Properties</H1>
<P>All of the system properties used by GemFire are discussed
<a href="properties.html">here</a>.</P>
</BODY>
</HTML>