| = Aliases |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| |
| SolrCloud has the ability to query one or more collections via an alternative name. These |
| alternative names for collections are known as aliases, and are useful when you want to: |
| |
| . Atomically switch to using a newly (re)indexed collection with zero down time (by re-defining the alias) |
| . Insulate the client programming versus changes in collection names |
| . Issue a single query against several collections with identical schemas |
| |
| There are two types of aliases: standard aliases and routed aliases. Within routed aliases, there are two types: category-routed aliases and time-routed aliases. These types are discussed in this section. |
| |
| It's possible to send collection update commands to aliases, but only to those that either resolve to a single collection |
| or those that define the routing between multiple collections (<<Routed Aliases>>). In other cases update commands are |
| rejected with an error since there is no logic by which to distribute documents among the multiple collections. |
| |
| == Standard Aliases |
| |
| Standard aliases are created and updated using the <<collection-aliasing.adoc#createalias,CREATEALIAS>> command. |
| |
| The current list of collections that are members of an alias can be verified via the |
| <<cluster-node-management.adoc#clusterstatus,CLUSTERSTATUS>> command. |
| |
| The full definition of all aliases including metadata about that alias (in the case of routed aliases, see below) |
| can be verified via the <<collection-aliasing.adoc#listaliases,LISTALIASES>> command. |
| |
| Alternatively this information is available by checking `/aliases.json` in ZooKeeper with either the native ZooKeeper |
| client or in the <<cloud-screens.adoc#tree-view,tree page>> of the cloud menu in the admin UI. |
| |
| Aliases may be deleted via the <<collection-aliasing.adoc#deletealias,DELETEALIAS>> command. |
| When deleting an alias, underlying collections are *unaffected*. |
| |
| [TIP] |
| ==== |
| Any alias (standard or routed) that references multiple collections may complicate relevancy. |
| By default, SolrCloud scores documents on a per-shard basis. |
| |
| With multiple collections in an alias this is always a problem, so if you have a use case for which BM25 or |
| TF/IDF relevancy is important you will want to turn on one of the |
| <<distributed-requests.adoc#distributedidf,ExactStatsCache>> implementations. |
| |
| However, for analytical use cases where results are sorted on numeric, date, or alphanumeric field values, rather |
| than relevancy calculations, this is not a problem. |
| ==== |
| |
| == Routed Aliases |
| |
| To address the update limitations associated with standard aliases and provide additional useful features, the concept of |
| routed aliases has been developed. |
| There are presently two types of routed alias: time routed and category routed. These are described in detail below, |
| but share some common behavior. |
| |
| When processing an update for a routed alias, Solr initializes its |
| <<update-request-processors.adoc#,UpdateRequestProcessor>> chain as usual, but |
| when `DistributedUpdateProcessor` (DUP) initializes, it detects that the update targets a routed alias and injects |
| `RoutedAliasUpdateProcessor` (RAUP) in front of itself. |
| RAUP, in coordination with the Overseer, is the main part of a routed alias, and must immediately precede DUP. It is not |
| possible to configure custom chains with other types of UpdateRequestProcessors between RAUP and DUP. |
| |
| Ideally, as a user of a routed alias, you needn't concern yourself with the particulars of the collection naming pattern |
| since both queries and updates may be done via the alias. |
| When adding data, you should usually direct documents to the alias (e.g., reference the alias name instead of any collection). |
| The Solr server and `CloudSolrClient` will direct an update request to the first collection that an alias points to. |
| Once the server receives the data it will perform the necessary routing. |
| |
| WARNING: It's extremely important with all routed aliases that the route values NOT change. Reindexing a document |
| with a different route value for the same ID produces two distinct documents with the same ID accessible via the alias. |
| All query time behavior of the routed alias is *_undefined_* and not easily predictable once duplicate ID's exist. |
| |
| CAUTION: It is a bad idea to use "data driven" mode (aka <<schemaless-mode.adoc#,schemaless-mode>>) with |
| routed aliases, as duplicate schema mutations might happen concurrently leading to errors. |
| |
| |
| === Time Routed Aliases |
| |
| Time Routed Aliases (TRAs) are a SolrCloud feature that manages an alias and a time sequential |
| series of collections. |
| |
| It automatically creates new collections and (optionally) deletes old ones as it routes documents to the correct |
| collection based on its timestamp. |
| This approach allows for indefinite indexing of data without degradation of performance otherwise experienced due to the |
| continuous growth of a single index. |
| |
| If you need to store a lot of timestamped data in Solr, such as logs or IoT sensor data, then this feature probably |
| makes more sense than creating one sharded hash-routed collection. |
| |
| ==== How It Works |
| |
| First you create a time routed aliases using the <<collection-aliasing.adoc#createalias,CREATEALIAS>> command with the |
| desired router settings. |
| Most of the settings are editable at a later time using the <<collection-aliasing.adoc#aliasprop,ALIASPROP>> command. |
| |
| The first collection will be created automatically, along with an alias pointing to it. |
| Each underlying Solr "core" in a collection that is a member of a TRA has a special core property referencing the alias. |
| The name of each collection is comprised of the TRA name and the start timestamp (UTC), with trailing zeros and symbols |
| truncated. |
| |
| The collections list for a TRA is always reverse sorted, and thus the connection path of the request will route to the |
| lead collection. Using `CloudSolrClient` is preferable as it can reduce the number of underlying physical HTTP requests by one. |
| If you know that a particular set of documents to be delivered is going to a particular older collection then you could |
| direct it there from the client side as an optimization but it's not necessary. `CloudSolrClient` does not (yet) do this. |
| |
| RAUP first reads TRA configuration from the alias properties when it is initialized. As it sees each document, it checks for |
| changes to TRA properties, updates its cached configuration if needed, and then determines which collection the |
| document belongs to: |
| |
| * If RAUP needs to send it to a time segment represented by a collection other than the one that |
| the client chose to communicate with, then it will do so using mechanisms shared with DUP. |
| Once the document is forwarded to the correct collection (i.e., the correct TRA time segment), it skips directly to |
| DUP on the target collection and continues normally, potentially being routed again to the correct shard & replica |
| within the target collection. |
| |
| * If it belongs in the current collection (which is usually the case if processing events as they occur), the document |
| passes through to DUP. DUP does its normal collection-level processing that may involve routing the document |
| to another shard & replica. |
| |
| * If the timestamp on the document is more recent than the most recent TRA segment, then a new collection needs to be |
| added at the front of the TRA. |
| RAUP will create this collection, add it to the alias, and then forward the document to the collection it just created. |
| This can happen recursively if more than one collection needs to be created. |
| + |
| Each time a new collection is added, the oldest collections in the TRA are examined for possible deletion, if that has |
| been configured. |
| All this happens synchronously, potentially adding seconds to the update request and indexing latency. |
| + |
| If `router.preemptiveCreateMath` is configured and if the document arrives within this window then it will occur |
| asynchronously. See <<collection-aliasing.adoc#time-routed-alias-parameters,Time Routed Alias Parameters>> for more information. |
| |
| Any other type of update like a commit or delete is routed by RAUP to all collections. |
| Generally speaking, this is not a performance concern. When Solr receives a delete or commit wherein nothing is deleted |
| or nothing needs to be committed, then it's pretty cheap. |
| |
| ==== Limitations & Assumptions |
| |
| * Only *time* routed aliases are supported. If you instead have some other sequential number, you could fake it |
| as a time (e.g., convert to a timestamp assuming some epoch and increment). |
| + |
| The smallest possible interval is one second. |
| No other routing scheme is supported, although this feature was developed with considerations that it could be |
| extended/improved to other schemes. |
| |
| * The underlying collections form a contiguous sequence without gaps. This will not be suitable when there are |
| large gaps in the underlying data, as Solr will insist that there be a collection for each increment. This |
| is due in part to Solr calculating the end time of each interval collection based on the timestamp of |
| the next collection, since it is otherwise not stored in any way. |
| |
| * Avoid sending updates to the oldest collection if you have also configured that old collections should be |
| automatically deleted. It could lead to exceptions bubbling back to the indexing client. |
| |
| === Category Routed Aliases |
| |
| Category Routed Aliases (CRAs) are a feature to manage aliases and a set of dependent collections |
| based on the value of a single field. |
| |
| CRAs automatically create new collections but because the partitioning is on categorical information rather than continuous |
| numerically based values there's no logic for automatic deletion. This approach allows for simplified indexing of data |
| that must be segregated into collections for cluster management or security reasons. |
| |
| ==== How It Works |
| |
| First you create a category routed alias using the <<collection-aliasing.adoc#createalias,CREATEALIAS>> command with the |
| desired router settings. |
| Most of the settings are editable at a later time using the <<collection-aliasing.adoc#aliasprop,ALIASPROP>> command. |
| |
| The alias will be created with a special place-holder collection which will always be named |
| `myAlias\__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA\__TEMP`. The first document indexed into the CRA |
| will create a second collection named `myAlias__CRA__foo` (for a routed field value of `foo`). The second document |
| indexed will cause the temporary place holder collection to be deleted. Thereafter collections will be created whenever |
| a new value for the field is encountered. |
| |
| CAUTION: To guard against runaway collection creation options for limiting the total number of categories, and for |
| rejecting values that don't match, a regular expression parameter is provided (see <<collection-aliasing.adoc#category-routed-alias-parameters,Category Routed Alias Parameters>> for |
| details). |
| + |
| Note that by providing very large or very permissive values for these options you are accepting the risk that |
| garbled data could potentially create thousands of collections and bring your cluster to a grinding halt. |
| |
| Field values (and thus the collection names) are case sensitive. |
| |
| As elsewhere in Solr, manipulation and |
| cleaning of the data is expected to be done by external processes before data is sent to Solr, with one exception. |
| Throughout Solr there are limitations on the allowable characters in collection names. Any characters other than ASCII |
| alphanumeric characters (`A-Za-z0-9`), hyphen (`-`) or underscore (`_`) are replaced with an underscore when calculating |
| the collection name for a category. For a CRA named `myAlias` the following table shows how collection names would be |
| calculated: |
| |
| |=== |
| |Value |CRA Collection Name |
| |
| |foo |
| |+myAlias__CRA__foo+ |
| |
| |Foo |
| |+myAlias__CRA__Foo+ |
| |
| |foo bar |
| |+myAlias__CRA__foo_bar+ |
| |
| |+FOÓB&R+ |
| |+myAlias__CRA__FO_B_R+ |
| |
| |+中文的东西+ |
| |+myAlias__CRA_______+ |
| |
| |+foo__CRA__bar+ |
| |*Causes 400 Bad Request* |
| |
| |+<null>+ |
| |*Causes 400 Bad Request* |
| |
| |=== |
| |
| Since collection creation can take upwards of 1-3 seconds, systems inserting data in a CRA should be |
| constructed to handle such pauses whenever a new collection is created. |
| Unlike time routed aliases, there is no way to predict the next value so such pauses are unavoidable. |
| |
| There is no automated means of removing a category. If a category needs to be removed from a CRA |
| the following procedure is recommended: |
| |
| // TODO: This should have example instructions |
| . Ensure that no documents with the value corresponding to the category to be removed will be sent |
| either by stopping indexing or by fixing the incoming data stream |
| . Modify the alias definition in ZooKeeper, removing the collection corresponding to the category. |
| . Delete the collection corresponding to the category. Note that if the collection is not removed |
| from the alias first, this step will fail. |
| |
| ==== Limitations & Assumptions |
| |
| * CRAs are presently unsuitable for non-English data values due to the limits on collection names. |
| This can be worked around by duplicating the route value to a *_url safe_* Base64-encoded field |
| and routing on that value instead. |
| |
| * The check for the __CRA__ infix is independent of the regular expression validation and occurs after |
| the name of the collection to be created has been calculated. It may not be avoided and is necessary |
| to support future features. |
| |
| === Dimensional Routed Aliases |
| |
| For cases where the desired segregation of of data relates to two fields and combination into a single |
| field during indexing is impractical, or the TRA behavior is desired across multiple categories, |
| Dimensional Routed aliases may be used. This feature has been designed to handle an arbitrary number |
| and combination of category and time dimensions in any order, but users are cautioned to carefully |
| consider the total number of collections that will result from such configurations. Collection counts |
| in the high hundreds or low 1000's begin to pose significant challenges with ZooKeeper. |
| |
| NOTE: DRA's are a new feature and presently only 2 dimensions are supported. More dimensions will |
| be supported in the future (see https://issues.apache.org/jira/browse/SOLR-13628 for progress) |
| |
| ==== How It Works |
| |
| First you create a dimensional routed alias with the desired router settings for each dimension. See the |
| <<collection-aliasing.adoc#createalias,CREATEALIAS>> command documentation for details on how to specify the |
| per-dimension configuration. Typical collection names will be of the form (example is for category x time example, |
| with 30 minute intervals): |
| |
| myalias__CRA__someCategory__TRA__2019-07-01_00_30 |
| |
| Note that the initial collection will be a throw away place holder for any DRA containing a category based dimension. |
| Name generation for each sub-part of a collection name is identical to the corresponding potion of the component |
| dimension type. (e.g., a category value generating __CRA__ or __TRA__ would still produce an error) |
| |
| WARNING: The prior warning about reindexing documents with different route value applies to every dimension of |
| a DRA. DRA's are inappropriate for documents where categories or timestamps used in routing will change (this of |
| course applies to other route values in future RA types too). |
| |
| As with all Routed Aliases, DRA's impose some costs if your data is not well behaved. In addition to the |
| normal caveats of each component dimension there is a need for care in sending new categories after the DRA has been |
| running for a while. Ordered Dimensions (time) behave slightly differently from Unordered (category) dimensions. |
| Ordered dimensions rely on the iteration order of the collections in the alias and therefore cannot tolerate the |
| generation of collection names out of order. This means that of this is that when an ordered dimension such as time |
| is a component of a DRA and the DRA experiences receipt of a document with a novel category with a time value |
| corresponding to a time slice other than the starting time-slice for the time dimension, several collections will |
| need to be created before the document can be indexed. This "new category effect" is identical to the behavior |
| you would get with a TRA if you picked a start-date too far in the past. |
| |
| For example given a Dimensional[time,category] DRA with start time of 2019-07-01T00:00:00Z the pattern of collections |
| created for 4 documents might look like this: |
| |
| *No documents* |
| |
| *Aliased collections:* |
| |
| // temp avoids empty alias error conditions |
| myalias__TRA__2019-07-01__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP |
| |
| *Doc 1* |
| |
| * time: 2019-07-01T00:00:00Z |
| * category: someCategory |
| |
| *Aliased collections:* |
| |
| // temp retained to avoid empty alias during race with collection creation |
| myalias__TRA__2019-07-01__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP |
| myalias__TRA__2019-07-01__CRA__someCategory |
| |
| *Doc 2* |
| |
| * time: 2019-07-02T00:04:00Z |
| * category: otherCategory |
| |
| *Aliased collections:* |
| |
| // temp can now be deleted without risk of having an empty alias. |
| myalias__TRA__2019-07-01__CRA__someCategory |
| myalias__TRA__2019-07-01__CRA__otherCategory // 2 collections created in one update |
| myalias__TRA__2019-07-02__CRA__otherCategory |
| |
| *Doc 3* |
| |
| * time: 2019-07-03T00:12:00Z |
| * category: thirdCategory |
| |
| *Aliased collections:* |
| |
| myalias__TRA__2019-07-01__CRA__someCategory |
| myalias__TRA__2019-07-01__CRA__otherCategory |
| myalias__TRA__2019-07-02__CRA__otherCategory |
| myalias__TRA__2019-07-01__CRA__thirdCategory // 3 collections created in one update! |
| myalias__TRA__2019-07-02__CRA__thirdCategory |
| myalias__TRA__2019-07-03__CRA__thirdCategory |
| |
| *Doc 4* |
| |
| * time: 2019-07-03T00:12:00Z |
| * category: someCategory |
| |
| *Aliased collections:* |
| |
| myalias__TRA__2019-07-01__CRA__someCategory |
| myalias__TRA__2019-07-01__CRA__otherCategory |
| myalias__TRA__2019-07-02__CRA__otherCategory |
| myalias__TRA__2019-07-01__CRA__thirdCategory |
| myalias__TRA__2019-07-02__CRA__thirdCategory |
| myalias__TRA__2019-07-03__CRA__thirdCategory |
| myalias__TRA__2019-07-02__CRA__someCategory // 2 collections created in one update |
| myalias__TRA__2019-07-03__CRA__someCategory |
| |
| Therefore the sweet spot for DRA's is for a data set with a well standardized set of dimensions that are not changing |
| and where the full set of permutations occur regularly. If a new category is introduced at a later date and |
| indexing latency is an important SLA feature, there are a couple strategies to mitigate this effect: |
| |
| * If the number of extra time slices to be created is not very large, then sending a single document out of band from |
| regular indexing, and waiting for collection creation to complete before allowing the new category to be sent via the |
| SLA constrained process. |
| |
| * If the above procedure is likely to create an extreme number of collections, and the earliest possible document in |
| the new category is known, the start time for the time dimension may be adjusted using the |
| <<collection-aliasing.adoc#aliasprop,ALIASPROP>> command |
| |
| === Improvement Possibilities |
| |
| Routed aliases are a relatively new feature of SolrCloud that can be expected to be improved. |
| Some _potential_ areas for improvement that _are not implemented yet_ are: |
| |
| * *TRAs*: Searches with time filters should only go to applicable collections. |
| |
| * *TRAs*: Ways to automatically optimize (or reduce the resources of) older collections that aren't expected to receive more |
| updates, and might have less search demand. |
| |
| * *CRAs*: Intrinsic support for non-English text via Base64 encoding. |
| |
| * *CRAs*: Supply an initial list of values for cases where these are known before hand to reduce pauses during indexing. |
| |
| * *DRAs*: Support for more than 2 dimensions. |
| |
| * `CloudSolrClient` could route documents to the correct collection based on the route value instead always picking the |
| latest/first. |
| |
| * Presently only updates are routed and queries are distributed to all collections in the alias, but future |
| features might enable routing of the query to the single appropriate collection based on a special parameter or perhaps |
| a filter on the routed field. |
| |
| * Collections might be constrained by their size instead of or in addition to time or category value. |
| This might be implemented as another type of routed alias, or possibly as an option on the existing routed aliases |
| |
| * Compatibility with CDCR. |
| |
| * Option for deletion of aliases that also deletes the underlying collections in one step. Routed Aliases may quickly |
| create more collections than expected during initial testing. Removing them after such events is overly tedious. |
| |
| As always, patches and pull requests are welcome! |
| |
| == Collection Commands and Aliases |
| Starting with version 8.1 SolrCloud supports using alias names in collection commands where normally a |
| collection name is expected. This works only when the following criteria are satisfied: |
| |
| * a request parameter `followAliases=true` is used |
| * an alias must not refer to more than one collection |
| * an alias must not refer to a <<Routed Aliases,Routed Alias>> |
| |
| If all criteria are satisfied then the command will resolve all alias names and operate on the collections the aliases |
| refer to as if it was invoked with the collection names instead. Otherwise the command will not be executed and |
| an exception will be thrown. |
| |
| The `followAliases=true` parameter should be used with care so that the resolved targets are indeed the intended ones. |
| In case of multi-level aliases or shadow aliases (an alias with the same name as an existing collection but pointing |
| to other collections) the use of this option is strongly discouraged because effects may be difficult to |
| predict correctly. |