= Distributed James Server — Architecture
:navtitle: Architecture
This section presents the Distributed Server architecture.
== Storage
In order to deliver its promises, the Distributed Server leverages the following storage strategies:
image::storage.png[Storage responsibilities for the Distributed Server]
* *Cassandra* is used for metadata storage. Cassandra is efficient for a very high workload of small queries following
a known pattern.
* The *blob store* storage interface is responsible for storing potentially large binary data, for instance
email bodies, headers or attachments. Different technologies can be used: *Cassandra*, or S3 compatible *Object Storage*
(S3 or Swift).
* The *OpenSearch* component empowers full text search on emails. It also enables querying data with unplanned access
patterns. However, OpenSearch throughput does not match that of Cassandra, thus its use is avoided for regular workloads.
* *RabbitMQ* enables James nodes of the same cluster to collaborate. It is used to implement connected protocols,
notification patterns, as well as distributed resilient work queues and the mail queue.
* *Tika* (optional) enables text extraction from attachments, thus improving full text search results.
* *link:https://spamassassin.apache.org/[SpamAssassin] or link:https://rspamd.com/[Rspamd]* (optional) can be used for Spam detection and user feedback is supported.
xref:architecture/consistency-model.adoc[This page] further details the Distributed James consistency model.
== Protocols
The following protocols are supported and can be used to interact with the Distributed Server:
* *SMTP*
* *IMAP*
* xref:operate/webadmin.adoc[WebAdmin] REST Administration API
* *LMTP*
* *POP3*
The following protocols should be considered experimental:
* *JMAP* (RFC-8620 & RFC-8621 specifications and known limitations of the James implementation are defined link:https://github.com/apache/james-project/tree/master/server/protocols/jmap-rfc-8621/doc[here])
* *ManagedSieve*
Read more on xref:architecture/implemented-standards.adoc[implemented standards].
== Topology
While it is perfectly possible to deploy homogeneous James instances, with the same configuration and thus the same
protocols and the same responsibilities, one might want to investigate
xref:architecture/specialized-instances.adoc['Specialized instances'].
== Components
This section presents the various components of the Distributed server, providing context about
their interactions, and about their implementations.
=== High level view
Here is a high level view of the various server components and their interactions:
image::server-components.png[Server components mobilized for SMTP & IMAP]
- The SMTP protocol receives a mail, and enqueues it on the MailQueue.
- The MailetContainer will start processing the mail asynchronously and will take business decisions like storing the
email locally in a user mailbox. The behaviour of the MailetContainer is highly customizable thanks to the composability
of Mailets and Matchers.
- The Mailbox component is responsible for storing a user's mails.
- The user can use the IMAP or the JMAP protocol to retrieve and read their mails.
These components are presented in more depth below.
=== Mail processing
Mail processing allows taking business decisions asynchronously on
received emails.
Here are its components:
* The `spooler` takes mail out of the mailQueue and executes mail
processing within the `mailet container`.
* The `mailet container` synchronously executes the user defined logic.
This logic is written through the use of `mailet`, `matcher` and
`processor`.
* A `mailet` represents an action: mail modification, envelope
modification, a side effect, or stop processing.
* A `matcher` represents a condition to execute a mailet.
* A `processor` is a flow of `matcher` and `mailet` pairs executed
sequentially. The `ToProcessor` mailet is a `goto` instruction to start
executing another `processor`.
* A `mail repository` allows storage of a mail as part of its
processing. Standard configuration relies on the following mail
repositories:
** `cassandra://var/mail/error/` : unexpected errors that occurred
during mail processing. Emails impacted by performance related
exceptions, or logical bugs within James code, are typically stored here.
These mails could be reprocessed once the cause of the error is fixed.
The `Mail.error` field can help diagnose the issue. Correlation with
logs can be achieved via the use of the `Mail.name` field.
** `cassandra://var/mail/address-error/` : mail addressed to a
non-existing recipient of a handled local domain. These mails could be
reprocessed once the user is created, for instance.
** `cassandra://var/mail/relay-denied/` : mail for which relay was
denied: missing authentication can, for instance, be a cause. In
addition to preventing disasters upon misconfiguration, a review
of this mail repository can help refine a host spammer blacklist.
** `cassandra://var/mail/rrt-error/` : a runtime error occurred upon Recipient
Rewriting. This is typically due to a loop.
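The relationship between matchers, mailets and processors can be sketched as follows. This is a conceptual illustration written in Python, not the James Java API: the `Mail` class, the `run` function, the `ghost` marker and the matcher and mailet helpers are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Mail:
    """Hypothetical, minimal stand-in for a mail being processed."""
    sender: str
    recipients: List[str]
    state: str = "root"                        # name of the processor to run next
    stored_in: List[str] = field(default_factory=list)


Matcher = Callable[[Mail], bool]               # a condition
Mailet = Callable[[Mail], None]                # an action
Processor = List[Tuple[Matcher, Mailet]]       # ordered (matcher, mailet) pairs

GHOST = "ghost"                                # assumed marker: stop processing


def all_mails(mail: Mail) -> bool:
    return True


def sender_is_local(mail: Mail) -> bool:
    # "example.com" is an illustrative local domain.
    return mail.sender.endswith("@example.com")


def to_processor(name: str) -> Mailet:
    """Plays the role of a `goto`: continue in another processor."""
    def action(mail: Mail) -> None:
        mail.state = name
    return action


def to_repository(url: str) -> Mailet:
    """Records the mail in a simulated mail repository, then stops processing."""
    def action(mail: Mail) -> None:
        mail.stored_in.append(url)
        mail.state = GHOST
    return action


def run(processors: Dict[str, Processor], mail: Mail, max_hops: int = 10) -> Mail:
    """Runs processors until the mail is consumed or a flow ends without a switch."""
    for _ in range(max_hops):
        current = mail.state
        if current == GHOST:
            return mail
        for matcher, mailet in processors[current]:
            if matcher(mail):
                mailet(mail)
                if mail.state != current:      # the mailet redirected or stopped the mail
                    break
        else:
            return mail                        # end of the flow reached
    raise RuntimeError("too many processor switches, probable loop")
```

With a `root` processor made of `(sender_is_local, to_processor("transport"))` followed by `(all_mails, to_repository("cassandra://var/mail/relay-denied/"))`, a mail from a non-local sender ends up in the relay-denied repository, while a local one continues in the `transport` processor.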
=== Mail Queue
An email queue is a mandatory component of SMTP servers. It is a system
that holds emails that are waiting to be processed for
delivery. Email queuing is a form of message queuing, an asynchronous
service-to-service communication. A message queue is meant to decouple a
producing process from a consuming one. An email queue decouples email
reception from email processing. It allows them to communicate without
being connected. As such, the queued emails wait for processing until
the recipient is available to receive them. As James is an email server,
it supports a mail queue as well.
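The decoupling between reception and processing can be illustrated with the `queue` module of the Python standard library. This is only an in-process analogy of the principle, not the distributed James implementation: the `receive`, `process` and `demo` functions, as well as the `None` stop marker, are assumptions made for this sketch.

```python
import queue
import threading


def receive(mail_queue: queue.Queue, mails: list) -> None:
    """Reception side (for instance SMTP): enqueue and return immediately."""
    for mail in mails:
        mail_queue.put(mail)


def process(mail_queue: queue.Queue, processed: list) -> None:
    """Processing side: dequeue mails until the stop marker is seen."""
    while True:
        mail = mail_queue.get()
        if mail is None:                  # stop marker, specific to this sketch
            break
        processed.append(mail.upper())    # placeholder for the real processing


def demo() -> list:
    mail_queue = queue.Queue()
    processed = []
    # Mails are accepted although no consumer is running yet.
    receive(mail_queue, ["mail-1", "mail-2"])
    consumer = threading.Thread(target=process, args=(mail_queue, processed))
    consumer.start()
    mail_queue.put(None)
    consumer.join()
    return processed
```

The producer never waits for the consumer: mails enqueued before the consumer starts are simply processed later, in their arrival order.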
==== Why Mail Queue is necessary
You might often need to check the mail queue to make sure all emails are
delivered properly. First, you need to know why email queues get
clogged. Here are the two core reasons for that:
* Exceeded volume of emails
Some mailbox providers enforce email rate limits on IP addresses. The
limits are based on the sender reputation. If you exceed this rate and
queue too many emails, the delivery speed will decrease.
* Spam-related issues
Another common reason is that your email has been flagged by spam
filters. The filters will let the emails pass gradually to analyze how
the rest of the recipients react to the message. If there is slow
progress, it is okay: your email campaign is being observed and assessed.
If it is stuck, there could be different reasons, including the blocking
of your IP address.
==== Why combining Cassandra, RabbitMQ and Object storage for MailQueue
* RabbitMQ ensures the messaging function, and avoids polling.
* Cassandra enables administrative operations such as browsing and deleting,
using a time series which might require fine performance tuning (see
http://cassandra.apache.org/doc/latest/operating/index.html[Operating
Cassandra documentation]).
* Object Storage stores potentially large binary payloads.
However, the current design does not implement delays. Delays define
the time a mail has to stay in the mail queue before being
dequeued. They are used, for example, for exponential wait delays upon remote
delivery retries.
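To make the notion of exponential wait delays concrete, here is a minimal sketch of how such a delay could be computed. The `retry_delay` function, its one-minute base and its six-hour cap are illustrative assumptions, not James configuration values.

```python
from datetime import timedelta


def retry_delay(attempt: int,
                base: timedelta = timedelta(minutes=1),
                cap: timedelta = timedelta(hours=6)) -> timedelta:
    """Delay before retry number `attempt` (0-based): base * 2**attempt, capped."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(base * (2 ** attempt), cap)
```

With these values, successive retries wait 1, 2, 4, 8 minutes and so on, until the cap is reached.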
=== Mailbox
Storage for emails belonging to users.
Metadata are stored in Cassandra while headers, bodies and attachments are stored
within the xref:#_blobstore[BlobStore].
==== Search index
Emails are indexed asynchronously in OpenSearch via the xref:#_event_bus[EventBus]
in order to empower advanced and fast email full text search.
Text extraction can be set up using link:https://tika.apache.org/[Tika]. It extracts
the text from attachments, which allows searching your emails based on the textual content
of their attachments. In such a case, the OpenSearch indexer will call a Tika server prior
to indexing.
==== Quotas
Current quotas of users are held in a Cassandra projection. Limitations can be defined per
user, per domain or globally.
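The three scopes of limitation can be combined as sketched below. The precedence used here (the most specific definition wins: user, then domain, then global) is an assumption of this example, and the `effective_limit` and `is_over_quota` functions are hypothetical names.

```python
from typing import Optional


def effective_limit(user_limit: Optional[int],
                    domain_limit: Optional[int],
                    global_limit: Optional[int]) -> Optional[int]:
    """Returns the first defined limit, from most to least specific.

    None means that no limitation applies.
    """
    for limit in (user_limit, domain_limit, global_limit):
        if limit is not None:
            return limit
    return None


def is_over_quota(current_usage: int, limit: Optional[int]) -> bool:
    """Compares the current usage (held in the projection) to the limit."""
    return limit is not None and current_usage > limit
```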
==== Event Bus
Distributed James relies on an event bus system to enrich mailbox capabilities. Each
operation performed on the mailbox will trigger related events, that can
be processed asynchronously by potentially any James node on a
distributed system.
Many different kinds of events can be triggered during a mailbox
operation, such as:
* `MailboxEvent`: event related to an operation regarding a mailbox:
** `MailboxDeletion`: a mailbox has been deleted
** `MailboxAdded`: a mailbox has been added
** `MailboxRenamed`: a mailbox has been renamed
** `MailboxACLUpdated`: a mailbox got its rights and permissions updated
* `MessageEvent`: event related to an operation regarding a message:
** `Added`: messages have been added to a mailbox
** `Expunged`: messages have been expunged from a mailbox
** `FlagsUpdated`: messages had their flags updated
** `MessageMoveEvent`: messages have been moved from a mailbox to
another
* `QuotaUsageUpdatedEvent`: event related to quota update
Mailbox listeners can register themselves on this event bus system to be
called when an event is fired, allowing different kinds of extra
operations on the system, like:
* Current quota calculation
* Message indexing with OpenSearch
* Mailbox annotations cleanup
* Ham/spam reporting to Spam filtering system
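The listener registration pattern can be sketched as follows. This example is synchronous and runs in a single process, whereas Distributed James processes events asynchronously across nodes. The `EventBus` and `QuotaListener` classes are simplified, hypothetical stand-ins, and the `size` field carried by the events is an assumption of this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Added:
    mailbox: str
    size: int          # total size of the added messages, in bytes


@dataclass(frozen=True)
class Expunged:
    mailbox: str
    size: int          # total size of the expunged messages, in bytes


class EventBus:
    """Dispatches each event to the listeners registered for its type."""

    def __init__(self) -> None:
        self._listeners = defaultdict(list)

    def register(self, event_type, listener) -> None:
        self._listeners[event_type].append(listener)

    def dispatch(self, event) -> None:
        for listener in self._listeners[type(event)]:
            listener(event)


class QuotaListener:
    """Keeps the current quota usage up to date from mailbox events."""

    def __init__(self) -> None:
        self.used = 0

    def on_added(self, event: Added) -> None:
        self.used += event.size

    def on_expunged(self, event: Expunged) -> None:
        self.used -= event.size
```

A listener only sees the event types it registered for, so new behaviours can be added without modifying the code that performs the mailbox operation.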
==== Deleted Messages Vault
The Deleted Messages Vault is a feature that makes it possible to:
* retain users' deleted messages for some time.
* restore & export deleted messages by various criteria.
* permanently delete some retained messages.
If the Deleted Messages Vault is enabled, when users delete their mails,
and by that we mean when they try to definitively delete them by emptying
the trash, James will retain these mails in the Deleted Messages
Vault before the email or the mailbox is deleted. Only
administrators can interact with this component, via the
xref:webadmin.adoc#_deleted-messages-vault[WebAdmin REST APIs].
However, mails are not retained forever, as you have to configure a
retention period before using it (with a one-year retention by default if
not defined). It is also possible to permanently delete a mail if needed.
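The vault operations described above (retain, search by criteria, permanent deletion and expiration after the retention period) can be sketched with an in-memory model. The class and method names are hypothetical and do not mirror the James code or the WebAdmin routes; only the one-year default retention comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class DeletedMessage:
    user: str
    message_id: str
    subject: str
    deleted_at: datetime


class DeletedMessagesVault:
    """In-memory stand-in for the vault."""

    def __init__(self, retention: timedelta = timedelta(days=365)) -> None:
        self.retention = retention
        self._messages: List[DeletedMessage] = []

    def retain(self, message: DeletedMessage) -> None:
        """Called when a user definitively deletes a message."""
        self._messages.append(message)

    def search(self, user: str, subject_contains: str = "") -> List[DeletedMessage]:
        """Finds retained messages of a user, optionally filtered by subject."""
        return [m for m in self._messages
                if m.user == user and subject_contains in m.subject]

    def delete(self, user: str, message_id: str) -> None:
        """Permanently deletes one retained message."""
        self._messages = [m for m in self._messages
                          if not (m.user == user and m.message_id == message_id)]

    def purge_expired(self, now: datetime) -> int:
        """Drops messages older than the retention period, returns how many."""
        kept = [m for m in self._messages if now - m.deleted_at <= self.retention]
        purged = len(self._messages) - len(kept)
        self._messages = kept
        return purged
```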
=== Data
Storage for domains and users.
Domains are persisted in Cassandra.
Users can be managed in Cassandra, or via LDAP (read only).
=== Recipient rewrite tables
Storage of Recipients Rewriting rules, in Cassandra.
==== Mapping types
James offers various mapping types to better express the intent of your address rewriting logic:
* *Domain mapping*: Rewrites the domain of mail addresses. Use it for technical purposes; users will not
be allowed to use the source in their FROM address headers. Domain mappings can be managed via the CLI and
added via xref:operate/webadmin.adoc#_domain_mappings[WebAdmin].
* *Domain aliases*: Rewrites the domain of mail addresses. Expresses the idea that both domains can be used
interchangeably. Users will be allowed to use the source in their FROM address headers. Domain aliases can
be managed via xref:operate/webadmin.adoc#_get_the_list_of_aliases_for_a_domain[WebAdmin].
* *Forwards*: Replaces the source address by another one. Conveys the intent of forwarding incoming mails
to other users. Listing the forward source in the forward destinations keeps a local copy. Users will not be
allowed to use the source in their FROM address headers. Forwards can
be managed via xref:operate/webadmin.adoc#_address_forwards[WebAdmin].
* *Groups*: Replaces the source address by another one. Conveys the intent of a group registration: the group
address will be swapped for the group member addresses (a feature-poor mailing list). Users will not be
allowed to use the source in their FROM address headers. Groups can
be managed via xref:operate/webadmin.adoc#_address_group[WebAdmin].
* *Aliases*: Replaces the source address by another one. Represents a user-owned mail address, with which
the user can interact as if it were their main mail address. Users will be allowed to use the source in their FROM
address headers. Aliases can be managed via xref:operate/webadmin.adoc#_address_aliases[WebAdmin].
* *Address mappings*: Replaces the source address by another one. Used for technical purposes, this mapping type does
not hold any specific intent. Prefer using one of the above mapping types. Users will not be allowed to use the source
in their FROM address headers. Address mappings can be managed via the CLI or via
xref:operate/webadmin.adoc#_address_mappings[WebAdmin].
* *Regex mappings*: Applies the regex on the supplied address. Users will not be allowed to use the source
in their FROM address headers. Regex mappings can be managed via the CLI or via
xref:operate/webadmin.adoc#_regex_mapping[WebAdmin].
* *Error*: Throws an error upon processing. Users will not be allowed to use the source
in their FROM address headers. Errors can be managed via the CLI.
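Address-level rewriting (forwards, groups, aliases) can be pictured as a recursive resolution of a source address into its final destinations. The sketch below is a simplified model under stated assumptions: the `rewrite` function, the `RewriteLoopError` exception and the dictionary-based rule storage are hypothetical, and domain or regex mappings are left out. It shows the local copy kept when a forward lists its own source, and the loop detection that would lead a mail to the `rrt-error` repository.

```python
from typing import Dict, FrozenSet, List


class RewriteLoopError(Exception):
    """Raised when rewriting rules send an address back to itself."""


def rewrite(address: str,
            mappings: Dict[str, List[str]],
            _seen: FrozenSet[str] = frozenset()) -> List[str]:
    """Resolves `address` into its final destinations, without duplicates."""
    if address in _seen:
        raise RewriteLoopError(address)
    targets = mappings.get(address)
    if targets is None:
        return [address]                   # no rule: the address is final
    resolved: List[str] = []
    for target in targets:
        if target == address:
            candidates = [address]         # source listed as destination: local copy
        else:
            candidates = rewrite(target, mappings, _seen | {address})
        for candidate in candidates:
            if candidate not in resolved:
                resolved.append(candidate)
    return resolved
```

For instance, with a group `team@example.com` containing `alice@example.com` and `bob@example.com`, and a forward from `alice@example.com` to both herself and `alice@other.org`, a mail sent to the group is delivered to the three final addresses.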
=== BlobStore
Stores potentially large binary data.
The Mailbox component, the Mail Queue component and the Deleted Messages Vault
component rely on it.
Supported backends include S3 compatible ObjectStorage (link:https://wiki.openstack.org/wiki/Swift[Swift], S3 API).
Encryption can be configured on top of ObjectStorage.
Blobs can currently be deduplicated in order to reduce storage space. This means that two blobs with
the same content will be stored only once.
The downside is that deletion is more complicated, and a garbage collection needs to be run. A first implementation
based on bloom filters can be used and triggered using the WebAdmin REST API.
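Deduplication and its impact on deletion can be sketched with a content-addressed store, where the blob identifier is derived from the content itself. This is a minimal illustration under stated assumptions: the `DedupBlobStore` class is hypothetical, SHA-256 is an arbitrary choice of hash for this example, and the garbage collection uses an exact set of referenced identifiers instead of the bloom filters mentioned above.

```python
import hashlib
from typing import Dict, Iterable


class DedupBlobStore:
    """In-memory blob store where identical contents share one blob."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def save(self, data: bytes) -> str:
        """Stores the content once and returns its content-derived id."""
        blob_id = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(blob_id, data)
        return blob_id

    def read(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

    def garbage_collect(self, referenced_ids: Iterable[str]) -> int:
        """Removes blobs that no entity references anymore, returns how many."""
        orphans = set(self._blobs) - set(referenced_ids)
        for blob_id in orphans:
            del self._blobs[blob_id]
        return len(orphans)

    def __len__(self) -> int:
        return len(self._blobs)
```

Because several messages may share a blob, deleting a message cannot simply delete its blob: the blob becomes removable only once no remaining reference points to it, which is what the garbage collection pass checks.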
=== Task Manager
Allows controlling and scheduling long running tasks run by other
components. Among other things, it enables scheduling, progress monitoring and
cancellation of long running tasks.
Distributed James leverages a task manager using Event Sourcing and RabbitMQ for messaging.
=== Event sourcing
The link:https://martinfowler.com/eaaDev/EventSourcing.html[Event sourcing] implementation
for the Distributed server stores events in Cassandra. It enables components
to rely on event sourcing techniques for taking decisions.
Some of its usages are:
* Data leak prevention storage
* JMAP filtering rules storage
* Validation of the MailQueue configuration
* Sending email warnings to user close to their quota
* Implementation of the TaskManager
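The principle of event sourcing can be sketched on the task example: the state of a task is not stored, it is derived by replaying the history of its events. The code below is an in-memory illustration, not the James implementation: the `Event` and `EventStore` classes, the event names and the status names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    aggregate_id: str   # for instance a task identifier
    event_id: int       # position of the event in the aggregate history
    kind: str           # hypothetical event names: "created", "started", ...


class EventStore:
    """Append-only, in-memory stand-in for an event store."""

    def __init__(self) -> None:
        self._histories: Dict[str, List[Event]] = {}

    def append(self, event: Event) -> None:
        history = self._histories.setdefault(event.aggregate_id, [])
        if event.event_id != len(history):
            # Another writer appended first: the decision must be retaken.
            raise ValueError("stale event id")
        history.append(event)

    def history(self, aggregate_id: str) -> List[Event]:
        return list(self._histories.get(aggregate_id, []))


STATUS_AFTER = {
    "created": "waiting",
    "started": "in progress",
    "cancel requested": "cancel requested",
    "cancelled": "cancelled",
    "completed": "completed",
    "failed": "failed",
}


def task_status(history: List[Event]) -> str:
    """Derives the current status by replaying the events in order."""
    status = "unknown"
    for event in history:
        status = STATUS_AFTER.get(event.kind, status)
    return status
```

The check on `event_id` in `append` illustrates how an ordered history lets concurrent writers detect that they took a decision on outdated state.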