 = Distributed James Server &mdash; Operator guide
 :navtitle: Operator guide

 This guide aims to be an entry-point to the James documentation for users
 managing a distributed Guice James server.

 It includes:

 * Simple architecture explanations
 * Diagnostics for some common issues
 * Procedures that can be set up to address these issues

 In order to not duplicate information, existing documentation will be
 linked.

 Please note that this product is under active development, should be
 considered experimental and thus targets advanced users.

 == Basic Monitoring

 A toolbox is available to help an administrator diagnose issues:

 * xref:operate/logging.adoc[Structured logging into Kibana]
 * xref:operate/metrics.adoc[Metrics graphs into Grafana]
 * xref:operate/webadmin.adoc#_healthcheck[WebAdmin HealthChecks]

 == Mail processing

 Currently, an administrator can monitor mail processing failures through `ERROR` log
 review. We also recommend watching, in Kibana, the `INFO` logs whose `logger` is
 `org.apache.james.transport.mailets.ToProcessor`. Metrics about
 mail repository size, and the corresponding Grafana boards, are yet to be contributed.

 Furthermore, given the default mailet container configuration, we recommend checking
 that the `cassandra://var/mail/error/` mail repository stays empty.

 WebAdmin exposes all utilities for
 xref:operate/webadmin.adoc#_reprocessing_mails_from_a_mail_repository[reprocessing
 all mails in a mail repository] or
 xref:operate/webadmin.adoc#_reprocessing_a_specific_mail_from_a_mail_repository[reprocessing
 a single mail in a mail repository].

 To prevent unbounded processing that could consume unbounded resources, reprocessing
 can be triggered from a CRON job with the `limit` parameter, for example to reprocess
 at most 10 mails per minute.
 Note that `limit` is only supported when reprocessing all mails.
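
 As a minimal sketch of such a bounded, scheduled reprocessing: the script below
 assumes that WebAdmin listens on `http://localhost:8000` without authentication and
 that the endpoint is `PATCH /mailRepositories/{encodedPath}/mails?action=reprocess`,
 as described in the linked WebAdmin documentation. The `WEBADMIN_URL` variable and
 the function names are illustrative and not part of James.

```shell
#!/bin/sh
# Base URL of WebAdmin (assumption: adjust host, port and authentication to your deployment).
WEBADMIN_URL="${WEBADMIN_URL:-http://localhost:8000}"

# URL-encode the slashes of a mail repository path,
# e.g. var/mail/error/ becomes var%2Fmail%2Ferror%2F
encode_repository_path() {
  printf '%s' "$1" | sed 's|/|%2F|g'
}

# Build the "reprocess all mails" URL.
# $1: mail repository path, $2: maximum number of mails to reprocess
reprocess_url() {
  printf '%s/mailRepositories/%s/mails?action=reprocess&limit=%s' \
    "$WEBADMIN_URL" "$(encode_repository_path "$1")" "$2"
}

# Trigger the reprocessing task (performs a network call).
reprocess_all() {
  curl -s -XPATCH "$(reprocess_url "$1" "$2")"
}
```

 Saved as a script that ends with `reprocess_all var/mail/error/ 10`, this could be
 scheduled every minute with a crontab entry such as
 `* * * * * /path/to/reprocess-errors.sh` (hypothetical path).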

 Also, one can decide to
 xref:operate/webadmin.adoc#_removing_all_mails_from_a_mail_repository[delete
 all the mails of a mail repository] or
 xref:operate/webadmin.adoc#_removing_a_mail_from_a_mail_repository[delete
 a single mail of a mail repository].

 Performance of mail processing can be monitored via the
 https://github.com/apache/james-project/blob/master/grafana-reporting/MAILET-1490071694187-dashboard.json[mailet
 grafana board] and
 https://github.com/apache/james-project/blob/master/grafana-reporting/MATCHER-1490071813409-dashboard.json[matcher
 grafana board].

 === Recipient rewriting

 Given the default configuration, errors (like loops) upon recipient rewriting will lead
 to emails being stored in `cassandra://var/mail/rrt-error/`.

 We recommend checking that this mail repository stays empty.

 If it is not empty, we recommend
 verifying user mappings via the xref:operate/webadmin.adoc#_user_mappings[User Mappings webadmin API]. Once the faulty mapping is identified, break the loop by removing
 the offending Recipient Rewrite Table entries via the
 xref:operate/webadmin.adoc#_removing_an_alias_of_an_user[Delete Alias],
 xref:operate/webadmin.adoc#_removing_a_group_member[Delete Group member],
 xref:operate/webadmin.adoc#_removing_a_destination_of_a_forward[Delete forward],
 xref:operate/webadmin.adoc#_remove_an_address_mapping[Delete Address mapping],
 xref:operate/webadmin.adoc#_removing_a_domain_mapping[Delete Domain mapping]
 or xref:operate/webadmin.adoc#_removing_a_regex_mapping[Delete Regex mapping]
 APIs (as needed).

 The `Mail.error` field can help diagnose the issue as well. Then once
 the root cause has been addressed, the mail can be reprocessed.

 == Mailbox Event Bus

 The administrator of James can define the mailbox
 listeners they want to use by adding them to the
 https://github.com/apache/james-project/blob/master/server/apps/distributed-app/sample-configuration/listeners.xml[listeners.xml]
 configuration file. It is also possible to add your own custom mailbox
 listeners. This allows enhancing the capabilities of James as a Mail
 Delivery Agent. You can get more information about those
 link:config-listeners.html[here].

 Currently, an administrator can monitor listener failures through
 `ERROR` log review. Metrics regarding mailbox listeners can be monitored
 via
 https://github.com/apache/james-project/blob/master/grafana-reporting/MailboxListeners-1528958667486-dashboard.json[mailbox_listeners
 grafana board] and
 https://github.com/apache/james-project/blob/master/grafana-reporting/MailboxListeners%20rate-1552903378376.json[mailbox_listeners_rate
 grafana board].

 Upon exceptions, a bounded number of retries are performed (with
 exponential backoff delays). If after those retries the listener is
 still failing to perform its operation, then the event will be stored in
 the xref:operate/webadmin.adoc#_event_dead_letter[Event Dead Letter]. This
 API allows diagnosing issues, as well as redelivering the events.

 To check whether you have undelivered events in your system, you can first
 run the associated
 xref:operate/webadmin.adoc#_healthcheck[event dead letter health check].
 You can explore Event DeadLetter content through WebAdmin. For
 this, xref:operate/webadmin.adoc#_listing_mailbox_listener_groups[list mailbox listener groups]:
 you will get a list of groups back, allowing
 you to check whether each of them contains failed events by
 xref:operate/webadmin.adoc#_listing_failed_events[listing their failed events].

 If you get failed event IDs back, you can also
 xref:operate/webadmin.adoc#_getting_event_details[check their details].

 An easy way to solve this is then to trigger the
 xref:operate/webadmin.adoc#_redeliver_all_events[redeliver all events]
 task. It will start reprocessing all the failed events registered in
 event dead letters.

 To prevent unbounded processing that could consume unbounded resources, redelivery
 can be triggered from a CRON job with the `limit` parameter, for example to redeliver
 at most 10 events per minute.

 If for some other reason you don’t need to redeliver all events, you
 have more fine-grained operations allowing you to
 xref:operate/webadmin.adoc#_redeliver_group_events[redeliver group events]
 or even just
 xref:operate/webadmin.adoc#_redeliver_a_single_event[redeliver a single event].
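
 The redelivery operations above could be scripted along the following lines. This is
 a sketch only: it assumes WebAdmin on `http://localhost:8000` without authentication
 and the `/events/deadLetter` routes described in the linked WebAdmin documentation.
 The function names are hypothetical.

```shell
#!/bin/sh
# Base URL of WebAdmin (assumption: adjust to your deployment).
WEBADMIN_URL="${WEBADMIN_URL:-http://localhost:8000}"

# List the groups that currently hold events in dead letter (network call).
list_dead_letter_groups() {
  curl -s -XGET "$WEBADMIN_URL/events/deadLetter/groups"
}

# URL redelivering all failed events, bounded by a limit.
# $1: maximum number of events to redeliver
redeliver_all_url() {
  printf '%s/events/deadLetter?action=reDeliver&limit=%s' "$WEBADMIN_URL" "$1"
}

# URL redelivering the failed events of a single group.
# $1: group name, as returned by list_dead_letter_groups
redeliver_group_url() {
  printf '%s/events/deadLetter/groups/%s?action=reDeliver' "$WEBADMIN_URL" "$1"
}

# Trigger a bounded redelivery of all failed events (network call).
redeliver_all() {
  curl -s -XPOST "$(redeliver_all_url "$1")"
}
```

 Calling `redeliver_all 10` from a CRON job that runs every minute would match the
 "10 per minute" example above.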

 == ElasticSearch Indexing

 A projection of messages is maintained in ElasticSearch via a listener
 plugged into the mailbox event bus in order to enable search features.

 You can find more information about ElasticSearch configuration
 link:config-elasticsearch.html[here].

 === Usual troubleshooting procedures

 As explained in the link:#_mailbox_event_bus[Mailbox Event Bus] section,
 processing those events can fail sometimes.

 Currently, an administrator can monitor indexation failures through
 `ERROR` log review. You can also
 xref:operate/webadmin.adoc#_listing_failed_events[list failed events] by
 looking at the group called
 `org.apache.james.mailbox.opensearch.events.ElasticSearchListeningMessageSearchIndex$ElasticSearchListeningMessageSearchIndexGroup`.
 A first on-the-fly solution could be to simply
 link:#_mailbox_event_bus[redeliver those group events with event dead letter].

 If the event storage in dead-letters fails (for instance in the face of
 Cassandra storage exceptions), then you might need to use our WebAdmin
 reIndexing tasks.

 From there, you have multiple choices. You can
 xref:operate/webadmin.adoc#_reindexing_all_mails[reIndex all mails],
 xref:operate/webadmin.adoc#_reindexing_a_mailbox_mails[reIndex mails from a mailbox] or even just
 xref:operate/webadmin.adoc#_reindexing_a_single_mail_by_messageid[reIndex a single mail].

 When checking the result of a reIndexing task, you might find that some
 mails failed to be reindexed. You can still use the task ID to
 xref:operate/webadmin.adoc#_fixing_previously_failed_reindexing[reprocess previously failed reIndexing mails].

 === On the fly ElasticSearch Index setting update

 Sometimes you might need to update index settings. Cases when an
 administrator might want to update index settings include:

 * Scaling out: increasing the shard count might be needed.
 * Changing string analysers, for instance to target another language
 * etc.

 To perform such a procedure, you need to:

 * https://www.elastic.co/guide/en/elasticsearch/reference/7.10/indices-create-index.html[Create
 the new index] with the right settings and mapping
 * James uses two aliases on the mailbox index: one for reading
 (`mailboxReadAlias`) and one for writing (`mailboxWriteAlias`). First
 https://www.elastic.co/guide/en/elasticsearch/reference/7.10/indices-aliases.html[add
 an alias] `mailboxWriteAlias` to that new index, so that James now
 writes to both the old and new indexes, while still reading only from the
 old one
 * Now trigger a
 https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-reindex.html[reindex]
 from the old index to the new one (this actively relies on `_source`
 field being present)
 * When this is done, add the `mailboxReadAlias` alias to the new index
 * Now that the migration to the new index is done, you can
 https://www.elastic.co/guide/en/elasticsearch/reference/7.10/indices-delete-index.html[drop
 the old index]
 * You might also want to modify the James configuration file
 https://github.com/apache/james-project/blob/master/server/apps/distributed-app/sample-configuration/elasticsearch.properties[elasticsearch.properties]
 by setting the parameter `elasticsearch.index.mailbox.name` to the name
 of your new index. This prevents James from re-creating the old index upon
 restart

 _Note_: keep in mind that reindexing can be a very long operation
 depending on the volume of mails you have stored.
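
 The alias and reindex steps of this procedure can be sketched as follows, using the
 ElasticSearch `_aliases` and `_reindex` APIs linked above. The `ES_URL` variable, the
 function names and the index names are assumptions made for illustration. The new
 index must already exist, and dropping the old index is deliberately left as a
 manual step.

```shell
#!/bin/sh
# Base URL of ElasticSearch (assumption: adjust to your deployment).
ES_URL="${ES_URL:-http://localhost:9200}"

# JSON body adding an alias to an index, for the _aliases API.
# $1: index name, $2: alias name
add_alias_body() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}' "$1" "$2"
}

# JSON body copying documents from one index to another, for the _reindex API.
# $1: source index, $2: destination index
reindex_body() {
  printf '{"source":{"index":"%s"},"dest":{"index":"%s"}}' "$1" "$2"
}

# POST a JSON body to an ElasticSearch endpoint; fails on HTTP errors (network call).
es_post() {
  curl -sS -f -XPOST "$ES_URL/$1" -H 'Content-Type: application/json' -d "$2"
}

# Run the migration steps in order, stopping at the first failure.
# $1: old index name, $2: new index name
migrate_index() {
  es_post _aliases "$(add_alias_body "$2" mailboxWriteAlias)" &&
  es_post _reindex "$(reindex_body "$1" "$2")" &&
  es_post _aliases "$(add_alias_body "$2" mailboxReadAlias)"
}
```

 For example, `migrate_index mailbox_v1 mailbox_v2` would run the three steps for two
 hypothetical index names.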

 == Solving Cassandra inconsistencies

 The Cassandra backend uses data duplication to work around Cassandra query
 limitations. However, Cassandra does not provide transactions when writing to
 several tables, which can lead to consistency issues for a given piece of
 data. The consequence could be that the data is in a transient state
 (that should never appear outside of the system).

 Because of the lack of transactions, it is hard to prevent this kind of
 issue. We have developed some features to fix some existing Cassandra
 inconsistency issues that had been reported to James.

 === JMAP message fast view projections

 When you read a JMAP message, some calculated properties are expected to
 be fast to retrieve, like `preview` and `hasAttachment`. James achieves this
 by pre-calculating and storing them in a caching table
 (`message_fast_view_projection`). Missing caches are populated on
 message reads, which temporarily decreases performance.

 ==== How to detect the outdated projections

 You can watch the `MessageFastViewProjection` health check at
 xref:operate/webadmin.adoc#_check_all_components[webadmin documentation].
 It provides a check based on the ratio of missed projection reads.

 ==== How to solve

 Since the MessageFastViewProjection is self-healing, you should be
 concerned only if the health check keeps returning `degraded` for a while.
 In that case, look at the James logs for more
 clues.

 === Mailboxes

 `mailboxPath` and `mailbox` tables share common fields like `mailboxId`
 and mailbox `name`. Creating, renaming or deleting a
 mailbox has to update both the `mailboxPath` and `mailbox` tables.
 Any failure when creating, updating or deleting records in `mailboxPath` or
 `mailbox` can produce inconsistencies.

 ==== How to detect the inconsistencies

 Suspicious `MailboxNotFoundException` entries in your logs can be a symptom.
 Currently, there is no dedicated detection tool: we recommend scheduling
 the SolveInconsistencies task below for the mailbox object on a regular
 basis, avoiding peak traffic, in order to both diagnose and fix
 inconsistencies.

 ==== How to solve

 An admin can run the offline webadmin
 xref:operate/webadmin.adoc#_fixing_mailboxes_inconsistencies[solve Cassandra mailbox object inconsistencies task]
 in order to sanitize their
 mailbox denormalization.

 In order to ensure being offline, stop the traffic on SMTP, JMAP and
 IMAP ports, for example via re-configuration or firewall rules.
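
 Once traffic is stopped, the task could be submitted and awaited as in the sketch
 below. It assumes WebAdmin on `http://localhost:8000`, the
 `POST /mailboxes?task=SolveInconsistencies` route with its confirmation header, a
 `{"taskId":"..."}` response body and the `/tasks/{taskId}/await` route. Check these
 against the linked WebAdmin documentation for your James version. The function names
 are hypothetical.

```shell
#!/bin/sh
# Base URL of WebAdmin (assumption: adjust to your deployment).
WEBADMIN_URL="${WEBADMIN_URL:-http://localhost:8000}"

# Read a JSON response on stdin and print the value of its "taskId" field.
extract_task_id() {
  sed -n 's/.*"taskId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Submit the task, then block until it completes and print its report (network calls).
solve_mailbox_inconsistencies() {
  task_id="$(curl -s -XPOST "$WEBADMIN_URL/mailboxes?task=SolveInconsistencies" \
    -H 'I-KNOW-WHAT-I-M-DOING: ALL-SERVICES-ARE-OFFLINE' | extract_task_id)"
  # Stop here when no task identifier could be read from the response.
  [ -n "$task_id" ] || { echo 'task submission failed' >&2; return 1; }
  curl -s -XGET "$WEBADMIN_URL/tasks/$task_id/await"
}
```

 Awaiting the task before re-opening SMTP, JMAP and IMAP traffic keeps the whole run
 offline.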

 === Mailboxes Counters

 James maintains a per mailbox projection for message count and unseen
 message count. Failures during the denormalization process will lead to
 incorrect results being returned.

 ==== How to detect the inconsistencies

 Incorrect message count/message unseen count could be seen in the
 `Mail User Agent` (IMAP or JMAP). Invalid values are reported in the
 logs as warnings with the following class
 `org.apache.james.mailbox.model.MailboxCounters` and the following
 message prefix: `Invalid mailbox counters`.

 ==== How to solve

 Execute the
 xref:operate/webadmin.adoc#_recomputing_mailbox_counters[recompute Mailbox counters task].
 This task is not concurrent-safe. Concurrent
 increments & decrements will be ignored during a single mailbox
 processing. Re-running this task may eventually return the correct
 result.

 === Messages

 Messages are denormalized and stored in both `imapUidTable` (source of
 truth) and `messageIdTable`. Failure in the denormalization process will
 cause inconsistencies between the two tables.

 ==== How to detect the inconsistencies

 A user can see a message in JMAP but not in IMAP, or mark a message as
 `SEEN` in JMAP while the message flag remains unchanged in IMAP.

 ==== How to solve

 Execute the
 xref:operate/webadmin.adoc#_fixing_message_inconsistencies[solve Cassandra message inconsistencies task]. This task is not
 concurrent-safe. User actions concurrent to the inconsistency fixing
 task could result in new inconsistencies being created. However the
 source of truth `imapUidTable` will not be affected and thus re-running
 this task may eventually fix all issues.

 === Quotas

 A user can monitor the amount of space and message count they are allowed to
 use, and the amount they are effectively using. James relies on an event bus and
 Cassandra to track the quota of a user. Upon Cassandra failure, this
 value can be incorrect.

 ==== How to detect the inconsistencies

 Incorrect quotas could be seen in the `Mail User Agent` (IMAP or JMAP).

 ==== How to solve

 Execute the
 xref:operate/webadmin.adoc#_recomputing_current_quotas_for_users[recompute Quotas counters task]. This task is not concurrent-safe. Concurrent
 operations can result in an invalid quota being persisted. Re-running
 this task may eventually return the correct result.

 === RRT (RecipientRewriteTable) mapping sources

 `rrt` and `mappings_sources` tables store information about address
 mappings. The source of truth is `rrt` and `mappings_sources` is the
 projection table containing all mapping sources.

 ==== How to detect the inconsistencies

 Right now there is no tool for detecting that; we are proposing a
 https://issues.apache.org/jira/browse/JAMES-3069[development plan]. In
 the meantime, the recommendation is to execute the
 `SolveInconsistencies` task below on a regular basis.

 ==== How to solve

 Execute the Cassandra mapping `SolveInconsistencies` task described in the
 xref:operate/webadmin.adoc#_operations_on_mappings_sources[webadmin documentation].

 == Setting Cassandra user permissions

 When a Cassandra cluster is serving more than one James cluster, the
 keyspaces need isolation. It can be achieved by configuring each James server
 with credentials preventing access or modification of other keyspaces.

 We recommend not using the initial admin user of Cassandra, and
 providing a different user with a subset of permissions for each
 application.

 === Prerequisites

 We will use a Cassandra superuser to create roles and grant
 permissions to them. To do that, Cassandra requires you to log in via
 username/password authentication and to enable authorization in the Cassandra
 configuration file.

 For example:

 ....
 echo -e "\nauthenticator: PasswordAuthenticator" >> /etc/cassandra/cassandra.yaml
 echo -e "\nauthorizer: org.apache.cassandra.auth.CassandraAuthorizer" >> /etc/cassandra/cassandra.yaml
 ....

 === Prepare Cassandra roles & keyspaces for James

 ==== Create a role

 Have a look at the
 http://cassandra.apache.org/doc/3.11.11/cql/security.html[cassandra documentation], section `CREATE ROLE`, for more information.

 E.g.

 ....
 CREATE ROLE james_one WITH PASSWORD = 'james_one' AND LOGIN = true;
 ....

 ==== Create a keyspace

 Have a look at the
 http://cassandra.apache.org/doc/3.11.11/cql/ddl.html[cassandra documentation], section `CREATE KEYSPACE`, for more information.

 ==== Grant permissions on created keyspace to the role

 The role to be used by James needs to have full rights on the keyspace
 that James is using. The following assumes the keyspace is named `james_one_keyspace`
 and the role `james_one`.

 ....
 GRANT CREATE ON KEYSPACE james_one_keyspace TO james_one; // Permission to create tables on the appointed keyspace
 GRANT SELECT ON KEYSPACE james_one_keyspace TO james_one; // Permission to select from tables on the appointed keyspace
 GRANT MODIFY ON KEYSPACE james_one_keyspace TO james_one; // Permission to update data in tables on the appointed keyspace
 ....

 *Warning*: The granted role does not have the right to create keyspaces.
 Thus, if you have not created the keyspace beforehand, James server is expected
 to fail to start.

 *Tips*

 Since all the Cassandra roles used by the different James clusters are supposed to
 have the same set of permissions, you can reduce the work by creating a
 base role such as `typical_james_role` with all the necessary
 permissions. After that, for each James cluster, create a new role and grant
 `typical_james_role` to the newly created one. Note that once the
 base role is updated (granting or revoking rights), all roles it was granted to
 are automatically updated.

 E.g.

 ....
 CREATE ROLE james1 WITH PASSWORD = 'james1' AND LOGIN = true;
 GRANT typical_james_role TO james1;

 CREATE ROLE james2 WITH PASSWORD = 'james2' AND LOGIN = true;
 GRANT typical_james_role TO james2;
 ....

 ==== Revoke harmful permissions from the created role

 We want a specific role that cannot describe or query the information of
 other keyspaces or tables used by another application. By default,
 Cassandra allows every created role to describe any
 keyspace and table, and no configuration option changes this behaviour.
 Consequently, you have to accept that your data models are
 still exposed to anyone having credentials to Cassandra.

 For more information, have a look at
 http://cassandra.apache.org/doc/3.11.11/cql/security.html[cassandra documentation] section `REVOKE PERMISSION`.

 Apart from the case above, permissions are not available to a
 specific role unless they are granted by the `GRANT` command. Therefore, if
 you did not grant more permissions than those of the
 link:#_grant_permissions_on_created_keyspace_to_the_role[granting
 section], there is no need to revoke anything.

 == Cassandra table level configuration

 While _Distributed James_ is shipped with default table configuration
 options, these settings should be refined depending on your usage.

 These options are:

 * The https://cassandra.apache.org/doc/latest/operating/compaction.html[compaction algorithms]
 * The https://cassandra.apache.org/doc/latest/operating/bloom_filters.html[bloom filter sizing]
 * The https://cassandra.apache.org/doc/latest/operating/compression.html?highlight=chunk%20size[chunk size]
 * The https://www.datastax.com/blog/2011/04/maximizing-cache-benefit-cassandra[caching options]

 The compaction algorithms allow a tradeoff between background IO upon
 writes and reads. We recommend:

 * Using *Leveled Compaction Strategy* on
 read intensive tables subject to updates. This limits the count of
 SSTables being read at the cost of more background IO. Note that an
 inappropriate use of Leveled Compaction Strategy can cause high garbage
 collection activity.
 * Otherwise use the default *Size Tiered Compaction Strategy*.

 Bloom filters help avoid unnecessary reads on SSTables. This
 probabilistic data structure can tell with certainty that an entry is absent from an SSTable,
 or that it is present with an associated probability. If a
 lot of false positives are noticed, the size of the bloom filters can be
 increased.

 As explained in
 https://thelastpickle.com/blog/2018/08/08/compression_performance.html[this post],
 the chunk size used upon compression allows a tradeoff between reads
 and writes. A smaller size means less effective compression, thus
 more data stored on disk, but allows smaller chunks to be read
 to access data, and favors reads. A bigger size means better
 compression, thus writing less, but it might imply reading bigger
 chunks.

 Cassandra provides a key cache and a row cache. The key cache allows skipping
 the partition index read upon reads, thus performing 1 read to the
 disk instead of 2. Enabling this cache is globally advised. The row cache
 stores the entire row in memory. It can be seen as an optimization, but
 it might consume memory that would otherwise be available, for instance, for the file
 system cache. We recommend turning it off on modern SSD hardware.

 A review of your usage can be conducted using
 https://cassandra.apache.org/doc/latest/tools/nodetool/nodetool.html[nodetool]
 utility. For example `nodetool tablestats \{keyspace\}` allows reviewing
 the number of SSTables, the read/write ratios, bloom filter efficiency.
 `nodetool tablehistograms \{keyspace\}.\{table\}` might give insight about
 read/write performance.

 Table level options can be changed using *ALTER TABLE* for example with
 the https://cassandra.apache.org/doc/latest/tools/cqlsh.html[cqlsh]
 utility. A full compaction might be needed in order for the changes to
 be taken into account.
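
 As an illustration, switching a table to Leveled Compaction Strategy and then
 forcing a full compaction could look like the sketch below. It assumes `cqlsh` and
 `nodetool` are available on the node and already configured with suitable
 credentials. The function names are hypothetical.

```shell
#!/bin/sh
# Print the CQL statement switching a table to Leveled Compaction Strategy.
# $1: keyspace, $2: table
leveled_compaction_cql() {
  printf "ALTER TABLE %s.%s WITH compaction = {'class': 'LeveledCompactionStrategy'};" "$1" "$2"
}

# Apply the change, then force a full compaction of that table
# (requires cqlsh and nodetool access to the cluster).
apply_leveled_compaction() {
  cqlsh -e "$(leveled_compaction_cql "$1" "$2")" && nodetool compact "$1" "$2"
}
```

 A call such as `apply_leveled_compaction james_one_keyspace mailbox` is purely
 illustrative and is not a tuning recommendation for that table.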

 == Mail Queue

 === Fine tune configuration for RabbitMQ

 In order to adapt mail queue settings to the actual traffic load, an
 administrator needs to fine-tune the configuration as explained
 in
 https://github.com/apache/james-project/blob/master/src/site/xdoc/server/config-rabbitmq.xml[rabbitmq.properties].

 Be aware that `MailQueue::getSize` currently performs a browse and
 is thus expensive. Recurring size metric reporting therefore introduces
 performance issues. As such, we advise setting
 `mailqueue.size.metricsEnabled=false`.

 === Managing email queues

 Managing an email queue is an easy task if you follow this procedure:

 * First, xref:operate/webadmin.adoc#_listing_mail_queues[List mail queues]
 and xref:operate/webadmin.adoc#_getting_a_mail_queue_details[get a mail queue details].
 * And then
 xref:operate/webadmin.adoc#_listing_the_mails_of_a_mail_queue[List the mails of a mail queue].

 If you need to clear an email queue, for instance because it only contains spam or
 trash emails, follow this procedure:

 * Delete all mails from the given mail queue with
 xref:operate/webadmin.adoc#_clearing_a_mail_queue[Clearing a mail queue].

 == Updating Cassandra schema version

 A schema version indicates which schema your James server is relying
 on. The schema version number tracks whether a migration is required. For
 instance, when the latest schema version is 2 and the current schema
 version is 1, you may still have data in the deprecated
 Message table in the database. Hence, you need to migrate these messages
 into the MessageV2 table. Once done, you can safely bump the current
 schema version to 2.

 Relying on an outdated schema version prevents you from benefiting from the
 newest performance and safety improvements. Note also a possibly
 unexpected aspect of the way we manage the Cassandra schema: we create new
 tables without asking the admin about it. That means your James version
 always uses the latest tables, but may also take the old
 ones into account if the migration is not done yet.

 === How to detect when we should update Cassandra schema version

 When you see in James logs
 `org.apache.james.modules.mailbox.CassandraSchemaVersionStartUpCheck`
 showing a warning like `Recommended version is versionX`, you should
 perform an update of the Cassandra schema version.

 Also, we keep track of changes needed when upgrading to a newer version.
 You can read these
 https://github.com/apache/james-project/blob/master/upgrade-instructions.md[upgrade
 instructions].

 === How to update Cassandra schema version

 These schema updates can be triggered through webadmin when using the Cassandra
 backend. The steps for updating the Cassandra schema version are:

 * First,
 xref:operate/webadmin.adoc#_retrieving_current_cassandra_schema_version[retrieve
 the current Cassandra schema version]
 * Then
 xref:operate/webadmin.adoc#_retrieving_latest_available_cassandra_schema_version[retrieve
 the latest available Cassandra schema version] to check whether a
 newer version is available
 * Finally, update the current schema version to the latest one by
 xref:operate/webadmin.adoc#_upgrading_to_the_latest_version[upgrading to
 the latest version]

 Otherwise, if you need to run the migrations to a specific version, you
 can use
 xref:operate/webadmin.adoc#_upgrading_to_a_specific_version[Upgrading to a
 specific version].
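
 The three steps above can be sketched as a small script. It assumes WebAdmin on
 `http://localhost:8000`, the `/cassandra/version` routes described in the linked
 WebAdmin documentation, and response bodies of the form `{"version": 2}`. Verify
 these against your James version. The function names are hypothetical.

```shell
#!/bin/sh
# Base URL of WebAdmin (assumption: adjust to your deployment).
WEBADMIN_URL="${WEBADMIN_URL:-http://localhost:8000}"

# Read a JSON response on stdin and print the integer of its "version" field.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeeds (exit status 0) when the current version is behind the latest one.
# $1: current version, $2: latest version
upgrade_needed() {
  [ "$1" -lt "$2" ]
}

# Compare both versions and trigger the upgrade task only when needed (network calls).
upgrade_schema_if_needed() {
  current="$(curl -s -XGET "$WEBADMIN_URL/cassandra/version" | extract_version)"
  latest="$(curl -s -XGET "$WEBADMIN_URL/cassandra/version/latest" | extract_version)"
  # Stop here when either version could not be read.
  [ -n "$current" ] && [ -n "$latest" ] || { echo 'could not read schema versions' >&2; return 1; }
  if upgrade_needed "$current" "$latest"; then
    curl -s -XPOST "$WEBADMIN_URL/cassandra/version/upgrade/latest"
  else
    echo "Schema already at version $current"
  fi
}
```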

 == Deleted Message Vault

 We recommend that the administrator
 xref:#_cleaning_expired_deleted_messages[clean expired deleted messages] from a CRON job to save
 storage volume.

 === How to configure deleted messages vault

 To set up James with the Deleted Messages Vault, you need to follow these
 steps:

 * Enable Deleted Messages Vault by configuring Pre Deletion Hooks.
 * Configure the retention time for the Deleted Messages Vault.

 ==== Enable Deleted Messages Vault by configuring Pre Deletion Hooks

 You need to configure this hook in the
 https://github.com/apache/james-project/blob/master/server/apps/distributed-app/sample-configuration/listeners.xml[listeners.xml]
 configuration file. More details about the configuration, and an example, can be
 found at http://james.apache.org/server/config-listeners.html[Pre
 Deletion Hook Configuration].

 ==== Configuring the retention time for the Deleted Messages Vault

 In order to configure the retention time for the Deleted Messages Vault,
 an administrator needs to fine-tune the configuration as
 explained in
 https://github.com/apache/james-project/blob/master/server/apps/distributed-app/sample-configuration/deletedMessageVault.properties[deletedMessageVault.properties].
 Mails are not retained forever: they are kept for the configured retention
 period (`retentionPeriod` property), which defaults to one year
 when not defined.

 === Restore deleted messages after deletion

 After users have deleted their mails and emptied the trash, the admin can use
 xref:operate/webadmin.adoc#_restore_deleted_messages[Restore Deleted Messages]
 to restore all the deleted mails.

 === Cleaning expired deleted messages

 You can delete all deleted messages older than the configured
 `retentionPeriod` by using
 xref:operate/webadmin.adoc#_deleted_messages_vault[Purge Deleted Messages].
 We recommend calling this API from a CRON job on the 1st day of each
 month.
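
 A monthly purge could be scheduled with a crontab entry like the one generated
 below. This sketch assumes WebAdmin on `http://localhost:8000` without
 authentication and the `DELETE /deletedMessages?scope=expired` route described in
 the linked WebAdmin documentation. The function names are hypothetical.

```shell
#!/bin/sh
# Base URL of WebAdmin (assumption: adjust to your deployment).
WEBADMIN_URL="${WEBADMIN_URL:-http://localhost:8000}"

# URL purging the deleted messages that are older than the retention period.
purge_url() {
  printf '%s/deletedMessages?scope=expired' "$WEBADMIN_URL"
}

# Crontab entry running the purge at midnight on the first day of each month.
purge_crontab_entry() {
  printf "0 0 1 * * curl -s -XDELETE '%s'" "$(purge_url)"
}
```

 The line printed by `purge_crontab_entry` can be reviewed and then added with
 `crontab -e`.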