blob: 31c7be977f8d708ee1f0debeb3d424b3d4fa78a3 [file] [log] [blame] [view]
# James' extensions for Rspamd
This module is for developing and delivering extensions to James for the [Rspamd](https://rspamd.com/) (the spam filtering system)
and [ClamAV](https://www.clamav.net/) (the antivirus engine).
## How to run
- The Rspamd extension requires an extra configuration file `rspamd.properties` to configure Rspamd connection
Configuration parameters:
- `rSpamdUrl` : URL defining the Rspamd's server. Eg: http://rspamd:11334
- `rSpamdPassword` : Password for pass authentication when request to Rspamd's server. Eg: admin
- `rspamdTimeout` : Timeout for HTTP requests called to Rspamd. Default to 15 seconds.
- `perUserBayes` : Use per-user Bayes for mail scanning/feedback. Default to false.
- Declare the `extensions.properties` for this module.
```
guice.extension.module=org.apache.james.rspamd.module.RspamdModule
guice.extension.task=org.apache.james.rspamd.module.RspamdTaskExtensionModule
```
- Declare the Rspamd mailbox listeners in `listeners.xml`. Eg:
```
<listener>
<class>org.apache.james.rspamd.RspamdListener</class>
</listener>
```
This listener can report mails to per-user Bayes by configure `perUserBayes` in `rspamd.properties`.
- Declare the Rspamd mailet for custom mail processing.
You can specify the `virusProcessor` if you want to enable virus scanning for mail. Upon configurable `virusProcessor`
you can specify how James process mail virus. We provide a sample Rspamd mailet and `virusProcessor` configuration:
You can specify the `rejectSpamProcessor`. Emails marked as `rejected` by Rspamd will be redirected to this
processor. This corresponds to emails with the highest spam score, thus delivering them to users as marked as spam
might not even be desirable.
The `rewriteSubject` option allows to rewritte subjects when asked by Rspamd.
This mailet can scan mails against per-user Bayes by configure `perUserBayes` in `rspamd.properties`. This is achieved
through the use of Rspamd `Deliver-To` HTTP header. If true, Rspamd will be called for each recipient of the mail, which comes at a performance cost. If true, subjects are not rewritten.
If true `virusProcessor` and `rejectSpamProcessor` are honnered per user, at the cost of email copies. Default to false.
```xml
<processor state="local-delivery" enableJmx="true">
<mailet match="All" class="org.apache.james.rspamd.RspamdScanner">
<rewriteSubject>true</rewriteSubject>
<virusProcessor>virus</virusProcessor>
<rejectSpamProcessor>spam</rejectSpamProcessor>
<onMailetException>ignore</onMailetException>
</mailet>
<mailet match="IsMarkedAsSpam=org.apache.james.rspamd.status" class="WithStorageDirective">
<targetFolderName>Spam</targetFolderName>
</mailet>
<mailet match="All" class="LocalDelivery"/>
</processor>
<!--Choose one between these two following virus processor, or configure a custom one if you want-->
<!--Hard reject virus mail-->
<processor state="virus" enableJmx="false">
<mailet match="All" class="ToRepository">
<repositoryPath>file://var/mail/virus/</repositoryPath>
</mailet>
</processor>
<!--Soft reject virus mail-->
<processor state="virus" enableJmx="false">
<mailet match="All" class="StripAttachment">
<remove>all</remove>
<pattern>.*</pattern>
</mailet>
<mailet match="All" class="AddSubjectPrefix">
<subjectPrefix>[VIRUS]</subjectPrefix>
</mailet>
<mailet match="All" class="LocalDelivery"/>
</processor>
<!--Store rejected spam emails (with a very high score) -->
<processor state="spam" enableJmx="false">
<mailet match="All" class="ToRepository">
<repositoryPath>cassandra://var/mail/spam</repositoryPath>
</mailet>
</processor>
```
`RSpamdScanner` supports addition `rspamdUrl`, `rspamdPassword`, `rspamdTimeout`, `perUserBayes` properties allowing to
override content defined in `rspamd.properties`, which allows running several instances on distict Rspamd instance. A
possible use case is to use 1 RSpamD cluster on user incoming spam, trained in perUserBayes mode, and another RSpamD
cluster configured to check outgoing email for spams with a tolerant threshold and a specifc configuration.
- Declare the webadmin for Rspamd in `webadmin.properties`
```
extensions.routes=org.apache.james.rspamd.route.FeedMessageRoute
```
How to use admin endpoint, see more at [Additional webadmin endpoints](README.md)
- Declare the Rspamd healthcheck in `healthcheck.properties`
```
additional.healthchecks=org.apache.james.rspamd.healthcheck.RspamdHealthCheck
```
- Docker compose file example: [docker-compose.yml](docker-compose.yml) or [docker-compose-distributed.yml](docker-compose-distributed.yml).
Please configure `ClamAV` integration into `Rspamd` if you want to enable virus scanning.
- The sample-configuration: [sample-configuration](sample-configuration)
- For running docker-compose, first compile this project
```
mvn clean install -DskipTests
```
then run it: `docker-compose up`
## Additional webadmin endpoints
### Report spam messages to Rspamd
#### Use a webadmin task
One can use this route to schedule a task that reports spam messages to Rspamd for its spam classify learning.
This task can be configured to report spam messages to per-user Bayes via `perUserBayes` in `rspamd.properties`.
```bash
curl -XPOST 'http://ip:port/rspamd?action=reportSpam
```
This endpoint has the following param:
- `action` (required): need to be `reportSpam`
- `messagesPerSecond` (optional): Concurrent learns performed for Rspamd, default to 10
- `period` (optional): duration (support many time units, default in seconds), only messages between `now` and `now - duration` are reported. By default,
all messages are reported.
These inputs represent the same duration: `1d`, `1day`, `86400 seconds`, `86400`...
- `samplingProbability` (optional): float between 0 and 1, represent the chance to report each given message to Rspamd.
By default, all messages are reported.
- `classifiedAsSpam` (optional): Boolean, true to only include messages tagged as Spam by Rspamd, false for only
messages tagged as ham by Rspamd. If omitted all messages are included.
- `rspamdTimeout` (optional): duration, Default is 15 seconds. Provide configuration timeout when HTTP request to rspamd for learning.
Will return the task id. E.g:
```
{
"taskId": "70c12761-ab86-4321-bb6f-fde99e2f74b0"
}
```
Response codes:
- 201: Task generation succeeded. Corresponding task id is returned.
- 400: Invalid arguments supplied in the user request.
[More details about endpoints returning a task](https://james.apache.org/server/manage-webadmin.html#Endpoints_returning_a_task).
The scheduled task will have the following type `FeedSpamToRspamdTask` and the following additionalInformation:
```json
{
"errorCount": 1,
"reportedSpamMessageCount": 2,
"runningOptions": {
"messagesPerSecond": 10,
"rspamdTimeoutInSeconds": 15,
"periodInSecond": 3600,
"samplingProbability": 1.0
},
"spamMessageCount": 4,
"timestamp": "2007-12-03T10:15:30Z",
"type": "FeedSpamToRspamdTask"
}
```
### Report ham messages to Rspamd
One can use this route to schedule a task that reports ham messages to Rspamd for its spam classify learning.
This task can be configured to report ham messages to per-user Bayes via `perUserBayes` in `rspamd.properties`.
```bash
curl -XPOST 'http://ip:port/rspamd?action=reportHam
```
This endpoint has the following param:
- `action` (required): need to be `reportHam`
- `messagesPerSecond` (optional): Concurrent learns performed for Rspamd, default to 10
- `period` (optional): duration (support many time units, default in seconds), only messages between `now` and `now - duration` are reported. By default,
all messages are reported.
These inputs represent the same duration: `1d`, `1day`, `86400 seconds`, `86400`...
- `samplingProbability` (optional): float between 0 and 1, represent the chance to report each given message to Rspamd.
By default, all messages are reported.
- `classifiedAsSpam` (optional): Boolean, true to only include messages tagged as Spam by Rspamd, false for only
messages tagged as ham by Rspamd. If omitted all messages are included.
- `rspamdTimeout` (optional): duration, Default is 15 seconds. Provide configuration timeout when HTTP request to rspamd for learning.
Will return the task id. E.g:
```
{
"taskId": "70c12761-ab86-4321-bb6f-fde99e2f74b0"
}
```
Response codes:
- 201: Task generation succeeded. Corresponding task id is returned.
- 400: Invalid arguments supplied in the user request.
[More details about endpoints returning a task](https://james.apache.org/server/manage-webadmin.html#Endpoints_returning_a_task).
The scheduled task will have the following type `FeedHamToRspamdTask` and the following additionalInformation:
```json
{
"errorCount": 1,
"reportedHamMessageCount": 2,
"runningOptions": {
"messagesPerSecond": 10,
"rspamdTimeoutInSeconds": 15,
"periodInSecond": 3600,
"samplingProbability": 1.0
},
"hamMessageCount": 4,
"timestamp": "2007-12-03T10:15:30Z",
"type": "FeedHamToRspamdTask"
}
```
#### Use live reporting
Alternatively, ham/spam can be reported by using a mailbox listener. To do so enable `RspamdListener` within `listeners.xml`
configuration file:
```xml
<listeners>
<listener>
<class>org.apache.james.rspamd.RspamdListener</class>
<async>true</async>
</listener>
</listeners>
```
Note that you can turn off `reportAdded` (which reports incoming messages as Ham) resulting in lesser work:
```xml
<listeners>
<listener>
<class>org.apache.james.rspamd.RspamdListener</class>
<async>true</async>
<configuration>
<reportAdded>false</reportAdded>
</configuration>
</listener>
</listeners>
```
## Apache Kvrocks as Rspamd storage
> **Note**: Kvrocks integration is currently a work-in-progress and under triage on a realistic setup. As of today, the Apache James PMC does not endorse its use in production environments.
The Rspamd extension can use Apache Kvrocks as storage. Apache Kvrocks is a more suitable option for Rspamd storage compared to Redis for several reasons:
- Kvrocks stores data on disk, which is beneficial when dealing with large datasets that may not fit entirely in memory. This ensures that you can handle more extensive spam training data without running into Redis memory limitations.
- Kvrocks is Redis APIs compatible.
We document accordingly the docker compose setup:
- [Apache James + Rspamd + Apache Kvrocks standalone](docker-compose-rspamd-with-kvrocks-standalone.yml)
- [Apache James + Rspamd + Apache Kvrocks Sentinel](docker-compose-rspamd-with-kvrocks-sentinel.yml)
Please note that to make Rspamd work well with Kvrocks Sentinel:
- Configure `slave-read-only no` in `kvrocks.conf` file (allow Rspamd to execute read-only Lua script to get its Bayes statistics against the Kvrocks replicas, which Kvrocks is strict about by default).
- Use Rspamd `3.10` or later.
### Migrate Rspamd data from Redis to Kvrocks
Hereby we document a sample to use [RedisShake](https://github.com/tair-opensource/RedisShake) to migrate data from Redis to Kvrocks.
Sample command:
```bash
docker run --network=emaily \
--entrypoint "/bin/sh" \
-v ${PWD}/sample-configuration/redis-shake/shake.toml:/app/shake.toml \
-e SHAKE_SRC_ADDRESS=redis:6379 \
-e SHAKE_DST_ADDRESS=kvrocks:6379 \
ghcr.io/tair-opensource/redisshake:4.4.0 \
-c "./redis-shake /app/shake.toml"
```