| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <HTML> |
| |
| <BODY> |
| |
| <P>This package contains internal GemStone classes for implementing |
| caching on top of the GemFire Distributed System.</P> |
| |
| <H1>Local Regions</H1> |
| <P>LocalRegion implements the basic caching mechanism and allows |
| subclasses to perform message distribution and other specializations |
| of LocalRegion functionality. A LocalRegion is an implementation of the |
| <code>java.util.Map</code> interface that supports expiration, callbacks, |
| server cache communication, and so on.</P> |
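| |
| <P>As a brief illustration of the <code>Map</code>-style contract, here is |
| a minimal sketch using the public cache API (the region name is arbitrary):</P> |
| <pre> |
| import org.apache.geode.cache.Cache; |
| import org.apache.geode.cache.CacheFactory; |
| import org.apache.geode.cache.Region; |
| import org.apache.geode.cache.RegionShortcut; |
| |
| public class LocalRegionExample { |
|   public static void main(String[] args) { |
|     // A purely local region; LocalRegion is the implementation class |
|     // behind this public Region handle. |
|     Cache cache = new CacheFactory().create(); |
|     Region&lt;String, String&gt; region = |
|         cache.&lt;String, String&gt;createRegionFactory(RegionShortcut.LOCAL) |
|             .create("example"); |
| |
|     // Region extends java.util.Map, so the familiar operations apply. |
|     region.put("key", "value"); |
|     String value = region.get("key");   // "value" |
|     region.remove("key"); |
| |
|     cache.close(); |
|   } |
| } |
| </pre> |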
| |
| <P><a href="LocalRegion.html">LocalRegion</a> has a <a href="RegionMap.html">RegionMap</a> that |
| holds the actual data for the region.</P> |
| |
| <P>Most changes to an entry in a LocalRegion are performed in three steps:</P> |
| <UL> |
| <LI>The entry is modified under synchronization using an |
| <a href="EntryEventImpl.html">EntryEventImpl</a> object. The event |
| is also queued for later callback invocation under this synchronization.</LI> |
| |
| <LI>Distribution is allowed to occur outside of synchronization.</LI> |
| |
| <LI>Synchronization is again obtained on the entry and callbacks are |
| invoked.</LI> |
| </UL> |
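| |
| <P>A schematic of this pattern in plain Java (the class, field, and method |
| names below are illustrative stand-ins for the real LocalRegion code paths, |
| not the actual implementation):</P> |
| <pre> |
| import java.util.ArrayDeque; |
| import java.util.Queue; |
| |
| // Illustrative sketch of the three-step entry-update pattern. |
| class EntryUpdateSketch { |
|   static class Entry { Object value; } |
| |
|   private final Queue&lt;Object&gt; pendingEvents = new ArrayDeque&lt;&gt;(); |
| |
|   void update(Entry entry, Object newValue, Object event) { |
|     synchronized (entry) {        // step 1: modify under entry synchronization |
|       entry.value = newValue; |
|       pendingEvents.add(event);   // queue the event for later callback invocation |
|     } |
| |
|     distribute(event);            // step 2: distribute outside synchronization |
| |
|     synchronized (entry) {        // step 3: re-synchronize and invoke callbacks |
|       while (!pendingEvents.isEmpty()) { |
|         fireCallback(pendingEvents.poll()); |
|       } |
|     } |
|   } |
| |
|   void distribute(Object event) { /* send to other members */ } |
|   void fireCallback(Object event) { /* deliver to CacheListeners */ } |
| } |
| </pre> |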
| |
| <p>A LocalRegion may also have a <a href="DiskRegion.html">DiskRegion</a> |
| associated with it for persistence or overflow to disk.</p> |
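| |
| <P>On the public API, overflow to disk is configured roughly as follows (a |
| minimal sketch; the capacity value is arbitrary):</P> |
| <pre> |
| import org.apache.geode.cache.Cache; |
| import org.apache.geode.cache.CacheFactory; |
| import org.apache.geode.cache.EvictionAction; |
| import org.apache.geode.cache.EvictionAttributes; |
| import org.apache.geode.cache.Region; |
| import org.apache.geode.cache.RegionShortcut; |
| |
| public class OverflowExample { |
|   public static void main(String[] args) { |
|     Cache cache = new CacheFactory().create(); |
|     // Keep at most 1000 entries in memory; least-recently-used values |
|     // beyond that are written to disk by the backing DiskRegion. |
|     Region region = cache.createRegionFactory(RegionShortcut.LOCAL) |
|         .setEvictionAttributes(EvictionAttributes.createLRUEntryAttributes( |
|             1000, EvictionAction.OVERFLOW_TO_DISK)) |
|         .create("overflow"); |
|     cache.close(); |
|   } |
| } |
| </pre> |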
| |
| |
| <H1>Distributed Regions</H1> |
| <p><a href="DistributedRegion.html">Distributed Regions</a> are a subclass |
| of LocalRegion that interact with locking and the DistributedSystem to |
| implement distributed caching. Most DistributedRegion operations are |
| carried out using subclasses of |
| <a href="DistributedCacheOperation.html">DistributedCacheOperation</a>. |
| |
| |
| <H1>Partitioned Regions</H1> |
| |
| <P>The contents of a partitioned region are spread evenly across |
| multiple members of a distributed system. From the user's standpoint, |
| each member hosts a partition of the region and data is moved from |
| partition to partition in order to provide scalability and high |
| availability. The actual implementation of partitioned regions |
| divides each partition into sub-partitions named "buckets". A bucket |
| may be moved from one partition to another partition in a process called |
| "migration" when GemFire determines that the partitioned region's data |
| is not spread evenly across all members. When a bucket reaches a |
| maximum size, it is split in two and may be migrated to a different |
| partition.</P> |
| |
| <P>Data is split among buckets using the Extensible Hashing algorithm, |
| which hashes data based upon the lower-order bits (the "mask") of the |
| data's value (in GemFire's case, the <code>Region</code> entry's key). |
| All partitions of a given region share a |
| <em>directory</em> that maintains a mapping between a mask and |
| information about the bucket that holds data that applies to the mask. |
| When an entry is placed into a partitioned region, the bucket |
| directory is consulted to determine which member(s) of the distributed |
| system should be updated. The Extensible Hashing algorithm is useful |
| when a bucket fills up with data and needs to be split. Other hashing |
| algorithms require a complete rebalancing of the partitioned region |
| when a bucket is full. Extensible Hashing, however, only requires |
| that the full bucket be split into two, thus allowing the other |
| buckets to be accessed without delay. The diagram below demonstrates |
| bucket splitting with Extensible Hashing.</P> |
| |
| <P align="center"> |
| <IMG src="{@docRoot}/javadoc-images/extensible-hashing.gif" width="641" height="292"> |
| </P> |
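| |
| <P>The directory lookup itself reduces to a bitwise AND on the key's hash. |
| A minimal sketch (the class and field names are illustrative, not GemFire's |
| internals):</P> |
| <pre> |
| // The directory has 2^globalDepth slots; a key's lower-order hash bits |
| // select a slot, and slots may share a bucket while that bucket's local |
| // depth is less than the global depth. |
| class BucketDirectorySketch { |
|   int globalDepth = 2;                  // directory currently has 4 slots |
|   int[] bucketForSlot = {0, 1, 2, 1};   // two slots share bucket 1 |
| |
|   int bucketFor(Object key) { |
|     int mask = (1 &lt;&lt; globalDepth) - 1;  // lower-order bits, e.g. 0b11 |
|     int slot = key.hashCode() &amp; mask; |
|     return bucketForSlot[slot]; |
|   } |
| } |
| </pre> |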
| |
| <P>A <code>BucketInfo</code> contains metadata about a bucket (most |
| importantly the locations of all copies of the bucket) that is |
| distributed to members that access the partitioned region. Changes to |
| the <code>BucketDirectory</code> metadata are coordinated through |
| GemFire's distributed lock service. Inside a region partition are |
| a number of <code>Bucket</code>s that hold the values for keys that |
| match the bucket's mask, as shown in the diagram below.</P> |
| |
| <P align="center"> |
| <IMG src="{@docRoot}/javadoc-images/partitioned-regions.gif" width="592" height="483"> |
| </P> |
| |
| <P>The total size (in bytes) of a bucket is maintained as key/value |
| pairs are added. It is not necessary for the bucket to store the |
| value of a region entry as an actual object, so the bucket stores |
| the value in its serialized <code>byte</code> form. This takes up |
| less space in the VM's heap and allows us to accurately calculate its |
| size. The entry's key, however, is used when looking up data in the |
| bucket and must be deserialized. As an estimate, the size of the key |
| object is assumed to be the size of the object's serialized |
| <code>byte</code>s. When an entry's value is replaced via an update |
| operation, the size of the old value is subtracted from the total size |
| before the size of the new value is added in. It is assumed that the |
| key does not change size.</P> |
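| |
| <P>A sketch of that bookkeeping (illustrative names; JDK serialization |
| stands in here for GemFire's own serialization code):</P> |
| <pre> |
| import java.io.ByteArrayOutputStream; |
| import java.io.IOException; |
| import java.io.ObjectOutputStream; |
| import java.util.HashMap; |
| import java.util.Map; |
| |
| // Values are kept in serialized byte form and the running total is |
| // adjusted on every put; keys are estimated by their serialized size. |
| class BucketSizeSketch { |
|   private final Map&lt;Object, byte[]&gt; entries = new HashMap&lt;&gt;(); |
|   private long totalBytes; |
| |
|   void put(Object key, Object value) throws IOException { |
|     byte[] newBytes = serialize(value); |
|     byte[] oldBytes = entries.put(key, newBytes); |
|     if (oldBytes != null) { |
|       totalBytes -= oldBytes.length;        // update: subtract the old value's size |
|     } else { |
|       totalBytes += serialize(key).length;  // key size assumed constant afterward |
|     } |
|     totalBytes += newBytes.length; |
|   } |
| |
|   long sizeInBytes() { return totalBytes; } |
| |
|   private static byte[] serialize(Object o) throws IOException { |
|     ByteArrayOutputStream bos = new ByteArrayOutputStream(); |
|     try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { |
|       oos.writeObject(o); |
|     } |
|     return bos.toByteArray(); |
|   } |
| } |
| </pre> |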
| |
| <P>When a bucket's size exceeds the "maximum bucket size", it is split |
| in two based on the Extensible Hashing algorithm: a new |
| <code>Bucket</code> is created and populated with the key/value |
| pairs that match its mask, the <code>Bucket</code>'s local depth is |
| incremented by 1, and the global depth is updated if the new local |
| depth exceeds the current global depth. The splitting process is |
| repeated while all of the following conditions are met: the size of |
| either bucket continues to exceed the "maximum bucket size", the full |
| bucket has more than one element, and the global depth is less than |
| the "maximum global depth".</P> |
| |
| <h3>Primary Bucket</h3> |
| |
| <P>One bucket instance is selected as the primary. All bucket operations |
| target the primary and are passed on to the backups from the primary.</P> |
| |
| <P>Identification of the primary is tracked using metadata in |
| <code>BucketAdvisor</code>. The following diagram shows the standard |
| state transitions of the <code>BucketAdvisor</code>:</P> |
| |
| <P align="center"> |
| <IMG src="{@docRoot}/javadoc-images/BucketAdvisor-state.png" width="640" height="372"> |
| </P> |
| |
| <H3>Partitioned Region Cache Listeners</H3> |
| |
| <P>User CacheListeners are registered on the PartitionedRegion. Activity |
| in the Buckets may fire callbacks on the PartitionedRegion's CacheListeners. |
| The following figures illustrate the logic and sequence |
| involved.</P> |
| |
| <P><TABLE border="1"> |
| <TR><TD colspan="2"><B>Definition of Participants</B></TD></TR> |
| <TR><TD>pr_A1</TD><TD>a pure accessor</TD></TR> |
| <TR><TD>pr_B1_pri</TD><TD>a datastore which hosts primary for bucket B1</TD></TR> |
| <TR><TD>pr_B1_c1</TD><TD>a datastore which hosts copy 1 for Bucket B1</TD></TR> |
| <TR><TD>pr_B1_c2</TD><TD>a datastore which hosts copy 2 for Bucket B1</TD></TR> |
| <TR><TD>pr_A2_listener<BR>pr_A3_bridge<BR>pr_A4_gateway</TD><TD valign="top">pure accessors with CacheListener, Bridge or Gateway</TD></TR> |
| </TABLE></P> |
| |
| <P><TABLE border="1"> |
| <TR><TD colspan="5"><B>Fig. 1 (Flow of Put to CacheListeners)</B></TD></TR> |
| <TR><TD valign="top"><B>pr_A1</B></TD><TD valign="top"><B>pr_B1_pri</B></TD><TD valign="top"><B>pr_B1_c1</B></TD><TD valign="top"><B>pr_B1_c2</B></TD><TD><B>pr_A2_listener<BR>pr_A3_bridge<BR>pr_A4_gateway</B></TD></TR> |
| <TR><TD>putMessage1 --></TD><TD>operateOnPartitionRegion()</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD>sync entry</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD> update entry</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD> UpdateOperation.distribute</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD> if bucket add adjunct.recips</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD valign="top"> send ----------------------></TD><TD>--> see <B>Fig. 2</B><BR><------- reply</TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD valign="top"> send ----------------------></TD><TD valign="top">---------></TD><TD>--> see <B>Fig. 2</B><BR><------- reply</TD><TD></TD></TR> |
| <TR><TD></TD><TD> if adjunct.recips > 0</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD> PutMessage2 (notificationOnly == true)</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD valign="top"> send --------------------------></TD><TD valign="top">---------></TD><TD valign="top">---------></TD><TD>--> CacheListener fires on pr iif<br>InterestPolicy != CACHE_CONTENT<BR><------- reply</TD></TR> |
| <TR><TD></TD><TD> waitForReplies (from all above msgs)</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD> fire local CacheListener on pr</TD><TD></TD><TD></TD><TD></TD></TR> |
| <TR><TD></TD><TD>release entry sync</TD><TD></TD><TD></TD><TD></TD></TR> |
| </TABLE></P> |
| |
| <P><TABLE border="1"> |
| <TR><TD><B>Fig. 2 (Processing of UpdateOperation by non-Primary Bucket Host)</B></TD></TR> |
| <TR><TD>sync entry</TD></TR> |
| <TR><TD> update entry</TD></TR> |
| <TR><TD> CacheListener fires on pr iff InterestPolicy != CACHE_CONTENT</TD></TR> |
| <TR><TD>release entry sync</TD></TR> |
| <TR><TD>reply</TD></TR> |
| </TABLE></P> |
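| |
| <P>The guard that appears in both figures reduces to a single policy check. |
| A minimal sketch (the method is illustrative, though |
| <code>InterestPolicy.CACHE_CONTENT</code> is the real constant):</P> |
| <pre> |
| import org.apache.geode.cache.InterestPolicy; |
| |
| // Remote members fire the PartitionedRegion's CacheListeners only when |
| // their subscription interest policy is not CACHE_CONTENT. |
| class ListenerGuardSketch { |
|   static boolean shouldFireListeners(InterestPolicy policy) { |
|     return policy != InterestPolicy.CACHE_CONTENT; |
|   } |
| } |
| </pre> |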
| |
| <h3>Migration</h3> |
| |
| <P>Buckets are "migrated" to other members of the distributed system |
| to ensure that the contents of a partitioned region are evenly spread |
| across all members of a distributed system that wish to host |
| partitioned data. After a bucket is split, a migration operation is |
| triggered. Migration may also occur when a <code>Cache</code> exceeds |
| its <code>maxPartitionedData</code> threshold and when a new member |
| that can host partitioned data joins the distributed system. Each |
| member is consulted to determine how much partitioned region data it |
| is currently hosting and the {@linkplain |
| org.apache.geode.cache.Cache#getMaxPartitionedData maximum amount} of |
| partitioned region data it can host. The largest bucket hosted by the |
| VM is migrated to the member with the largest percentage of space |
| available for partitioned data. This ensures that data is spread |
| evenly among members of the distributed system and that their |
| available space for partitioned region data fills consistently. Migration will |
| continue until the amount of partitioned data hosted by the member |
| initiating the migration falls below the average for all members. |
| When a member that hosts partitions {@linkplain |
| org.apache.geode.cache.Cache#close closes} its <code>Cache</code>, |
| the partitions are migrated to other hosts.</P> |
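| |
| <P>In outline, the target-selection heuristic looks like this (illustrative |
| names; the real accounting uses each member's reported maximum):</P> |
| <pre> |
| import java.util.Comparator; |
| import java.util.List; |
| |
| // Move the largest local bucket to the member with the largest share of |
| // free partitioned-data space until this member is at or below average. |
| class MigrationSketch { |
|   static class Member { |
|     long hostedBytes;   // partitioned data currently hosted |
|     long maxBytes;      // maximum partitioned data it may host |
|     double freeFraction() { return 1.0 - (double) hostedBytes / maxBytes; } |
|   } |
| |
|   void migrate(Member self, List&lt;Member&gt; others, List&lt;Long&gt; localBucketSizes) { |
|     long total = self.hostedBytes; |
|     for (Member m : others) total += m.hostedBytes; |
|     double average = (double) total / (others.size() + 1); |
| |
|     localBucketSizes.sort(Comparator.reverseOrder());   // largest bucket first |
|     int next = 0; |
|     while (self.hostedBytes &gt; average &amp;&amp; next &lt; localBucketSizes.size()) { |
|       long bucket = localBucketSizes.get(next++); |
|       Member target = others.stream()                   // most free space wins |
|           .max(Comparator.comparingDouble(Member::freeFraction)) |
|           .get(); |
|       self.hostedBytes -= bucket; |
|       target.hostedBytes += bucket; |
|     } |
|   } |
| } |
| </pre> |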
| |
| <h3>High Availability</h3> |
| |
| <P>The high availability ({@linkplain |
| org.apache.geode.cache.PartitionAttributes#getRedundancy |
| redundancy}) feature of partitioned regions affects the implementation |
| in a number of ways. When a bucket is created, the implementation |
| uses the migration algorithm to determine the location(s) of any |
| redundant copies of the buckets. A warning is logged if there is not |
| enough room (or not enough members) to guarantee the redundancy of the |
| partitioned region. When an entry is <code>put</code> into a |
| redundant partitioned region, the key/value is distributed to each |
| bucket according to the consistency specified by the region's scope. |
| That is, if the region is <code>DISTRIBUTED_ACK</code>, the |
| <code>put</code> operation will not return until it has received an |
| acknowledgment from each bucket. When a <code>get</code> is performed |
| on a partitioned region and the value is not already in the |
| partitioned region's local cache, a targeted <code>netSearch</code> |
| is performed. When there are redundant copies of the region's |
| buckets, the <code>netSearch</code> chooses one bucket at random from |
| which to fetch the value. If the bucket does not respond within a |
| given timeout, then the process is repeated on another randomly |
| chosen redundant bucket. If the bucket has been migrated to another |
| member, then the member operating on the region will re-consult its |
| metadata and retry the operation. When redundant buckets are migrated |
| from one machine to another, the implementation is careful to ensure |
| that multiple copies of a bucket are not hosted by the same member.</P> |
| |
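| <P>The retry loop for the targeted <code>netSearch</code> can be sketched |
| as follows (all names are illustrative):</P> |
| <pre> |
| import java.util.ArrayList; |
| import java.util.List; |
| import java.util.Random; |
| |
| // Pick a random redundant copy, wait up to the timeout for its reply, |
| // and fall back to another randomly chosen copy on timeout. |
| class NetSearchSketch { |
|   interface BucketHost { |
|     Object fetch(Object key, long timeoutMillis);   // null on timeout |
|   } |
| |
|   private final Random random = new Random(); |
| |
|   Object netSearch(Object key, List&lt;BucketHost&gt; copies, long timeoutMillis) { |
|     List&lt;BucketHost&gt; candidates = new ArrayList&lt;&gt;(copies); |
|     while (!candidates.isEmpty()) { |
|       BucketHost host = candidates.remove(random.nextInt(candidates.size())); |
|       Object value = host.fetch(key, timeoutMillis); |
|       if (value != null) { |
|         return value; |
|       } |
|     } |
|     return null;   // caller re-consults bucket metadata and retries |
|   } |
| } |
| </pre> |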
| |
| <H1>System Properties</H1> |
| |
| <P>All of the system properties used by GemFire are discussed |
| <a href="properties.html">here</a>.</P> |
| |
| </BODY> |
| |
| </HTML> |