
 # James' extensions for Rspamd

 This module develops and delivers James extensions for [Rspamd](https://rspamd.com/) (the spam filtering system)
 and [ClamAV](https://www.clamav.net/) (the antivirus engine).

 ## How to run

 - The Rspamd extension requires an extra configuration file, `rspamd.properties`, to configure the connection to Rspamd.
 Configuration parameters:
     - `rSpamdUrl`: URL of the Rspamd server. E.g. `http://rspamd:11334`
     - `rSpamdPassword`: Password used to authenticate requests to the Rspamd server. E.g. `admin`
     - `rspamdTimeout`: Timeout for HTTP requests sent to Rspamd. Defaults to 15 seconds.
     - `perUserBayes`: Use per-user Bayes for mail scanning and feedback. Defaults to false.

 - Declare the `extensions.properties` for this module.

 ```
 guice.extension.module=org.apache.james.rspamd.module.RspamdModule
 guice.extension.task=org.apache.james.rspamd.module.RspamdTaskExtensionModule
 ```

 - Declare the Rspamd mailbox listeners in `listeners.xml`. Eg:

 ```
 <listener>
     <class>org.apache.james.rspamd.RspamdListener</class>
 </listener>
 ```

   This listener can report mails to per-user Bayes when `perUserBayes` is enabled in `rspamd.properties`.

 - Declare the Rspamd mailet for custom mail processing.

   You can specify the `virusProcessor` if you want to enable virus scanning for mail. The configured `virusProcessor`
 determines how James handles mails carrying a virus.

   You can specify the `rejectSpamProcessor`. Emails marked as `rejected` by Rspamd will be redirected to this
 processor. These are the emails with the highest spam scores, so delivering them to users, even marked as spam,
 might not be desirable.

   The `rewriteSubject` option allows subjects to be rewritten when Rspamd asks for it.

   This mailet can scan mails against per-user Bayes when `perUserBayes` is enabled in `rspamd.properties`. This is achieved
 through the Rspamd `Deliver-To` HTTP header. If true, Rspamd is called for each recipient of the mail, which comes at a
 performance cost, and subjects are not rewritten. `virusProcessor` and `rejectSpamProcessor` are then honored per user,
 at the cost of email copies. Defaults to false.

   We provide a sample Rspamd mailet configuration, with `virusProcessor` and `rejectSpamProcessor` examples:

 ```xml
 <processor state="local-delivery" enableJmx="true">
     <mailet match="All" class="org.apache.james.rspamd.RspamdScanner">
         <rewriteSubject>true</rewriteSubject>
         <virusProcessor>virus</virusProcessor>
         <rejectSpamProcessor>spam</rejectSpamProcessor>
         <onMailetException>ignore</onMailetException>
     </mailet>
     <mailet match="IsMarkedAsSpam=org.apache.james.rspamd.status" class="WithStorageDirective">
         <targetFolderName>Spam</targetFolderName>
     </mailet>
     <mailet match="All" class="LocalDelivery"/>
 </processor>

 <!--Choose one of the two following virus processors, or configure a custom one if you want-->
 <!--Hard reject virus mail-->
 <processor state="virus" enableJmx="false">
     <mailet match="All" class="ToRepository">
         <repositoryPath>file://var/mail/virus/</repositoryPath>
     </mailet>
 </processor>

 <!--Soft reject virus mail-->
 <processor state="virus" enableJmx="false">
     <mailet match="All" class="StripAttachment">
         <remove>all</remove>
         <pattern>.*</pattern>
     </mailet>
     <mailet match="All" class="AddSubjectPrefix">
         <subjectPrefix>[VIRUS]</subjectPrefix>
     </mailet>
     <mailet match="All" class="LocalDelivery"/>
 </processor>

 <!--Store rejected spam emails (with a very high score) -->
 <processor state="spam" enableJmx="false">
     <mailet match="All" class="ToRepository">
         <repositoryPath>cassandra://var/mail/spam</repositoryPath>
     </mailet>
 </processor>
 ```
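
As an illustration of the per-user mode described above, the following Python sketch builds, without sending them, the scan requests for one mail: a single request by default, or one request per recipient, each with its own `Deliver-To` header, when `perUserBayes` is true. It is not the mailet's implementation. The `/checkv2` endpoint and the `Password` header are assumptions based on Rspamd's HTTP protocol, and `build_scan_requests`, the recipients and the message are hypothetical.

```python
from urllib.request import Request

RSPAMD_URL = "http://rspamd:11334"   # same example value as `rSpamdUrl` above
RSPAMD_PASSWORD = "admin"            # same example value as `rSpamdPassword` above


def build_scan_requests(raw_message: bytes, recipients: list[str], per_user_bayes: bool) -> list[Request]:
    """Build (but do not send) the scan requests for one mail."""
    def request(extra_headers: dict[str, str]) -> Request:
        # Assumption: Rspamd scans messages POSTed to /checkv2 and reads the Password header.
        headers = {"Password": RSPAMD_PASSWORD, **extra_headers}
        return Request(f"{RSPAMD_URL}/checkv2", data=raw_message, headers=headers, method="POST")

    if not per_user_bayes:
        # One scan shared by every recipient.
        return [request({})]
    # One scan per recipient, each identified through the Deliver-To header.
    return [request({"Deliver-To": recipient}) for recipient in recipients]


message = b"Subject: hello\r\n\r\nbody\r\n"  # hypothetical message
scan_requests = build_scan_requests(message, ["alice@example.com", "bob@example.com"], per_user_bayes=True)
for req in scan_requests:
    # urllib stores header names capitalized, hence the "Deliver-to" lookup key.
    print(req.get_method(), req.full_url, req.get_header("Deliver-to"))
```

With two recipients the sketch builds two requests, which is where the extra cost of the per-user mode comes from.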

 `RspamdScanner` also supports the `rspamdUrl`, `rspamdPassword`, `rspamdTimeout` and `perUserBayes` properties, which
 override the values defined in `rspamd.properties`. This allows running several mailet instances against distinct Rspamd
 instances. A possible use case is to use one Rspamd cluster for users' incoming mail, trained in `perUserBayes` mode, and
 another Rspamd cluster configured to check outgoing email for spam with a tolerant threshold and a specific configuration.

 - Declare the webadmin for Rspamd in `webadmin.properties`

 ```
 extensions.routes=org.apache.james.rspamd.route.FeedMessageRoute
 ```
 For how to use the admin endpoints, see [Additional webadmin endpoints](README.md).

 - Declare the Rspamd healthcheck in `healthcheck.properties`

 ```
 additional.healthchecks=org.apache.james.rspamd.healthcheck.RspamdHealthCheck
 ```

 - Docker compose file example: [docker-compose.yml](docker-compose.yml) or [docker-compose-distributed.yml](docker-compose-distributed.yml).

   Please configure `ClamAV` integration into `Rspamd` if you want to enable virus scanning.
 - Sample configuration: [sample-configuration](sample-configuration)
 - To run docker-compose, first compile this project:

 ```
 mvn clean install -DskipTests
 ```
 then run it: `docker-compose up`

 ## Additional webadmin endpoints

 ### Report spam messages to Rspamd

 #### Use a webadmin task

 Use this route to schedule a task that reports spam messages to Rspamd so that it can train its spam classifier.
 The task reports spam messages to per-user Bayes when `perUserBayes` is enabled in `rspamd.properties`.

 ```bash
 curl -XPOST 'http://ip:port/rspamd?action=reportSpam'
 ```

 This endpoint accepts the following parameters:
 - `action` (required): must be `reportSpam`.
 - `messagesPerSecond` (optional): rate at which messages are reported to Rspamd for learning. Defaults to 10.
 - `period` (optional): duration (many time units are supported, seconds by default). Only messages between `now - duration`
 and `now` are reported. By default, all messages are reported.
    These inputs represent the same duration: `1d`, `1day`, `86400 seconds`, `86400`...
 - `samplingProbability` (optional): float between 0 and 1, the probability of reporting each message to Rspamd.
 By default, all messages are reported.
 - `classifiedAsSpam` (optional): boolean. `true` only includes messages tagged as spam by Rspamd, `false` only includes
 messages tagged as ham by Rspamd. If omitted, all messages are included.
 - `rspamdTimeout` (optional): duration. Timeout for the HTTP learning requests sent to Rspamd. Defaults to 15 seconds.

 The endpoint returns the task id. E.g.:
 ```
 {
     "taskId": "70c12761-ab86-4321-bb6f-fde99e2f74b0"
 }
 ```

 Response codes:
 - 201: Task generation succeeded. Corresponding task id is returned.
 - 400: Invalid arguments supplied in the user request.

 [More details about endpoints returning a task](https://james.apache.org/server/manage-webadmin.html#Endpoints_returning_a_task).
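
When several optional parameters are combined, assembling the query string by hand becomes error-prone. The Python sketch below builds the task URL from the parameters listed above. The helper name `build_report_url` and the `http://localhost:8000` address are hypothetical and should be replaced by your own webadmin address.

```python
from urllib.parse import urlencode


def build_report_url(base_url: str, action: str, **options) -> str:
    """Return the webadmin URL scheduling a `reportSpam` or `reportHam` task."""
    if action not in ("reportSpam", "reportHam"):
        raise ValueError(f"unsupported action: {action}")
    params = {"action": action}
    for name, value in options.items():
        if value is None:
            continue  # leave out optional parameters that were not provided
        # Render booleans as `true`/`false` rather than Python's `True`/`False`.
        params[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{base_url}/rspamd?{urlencode(params)}"


url = build_report_url(
    "http://localhost:8000",  # hypothetical webadmin address
    "reportSpam",
    period="1d",
    samplingProbability=0.5,
    messagesPerSecond=20,
)
print(url)
```

The resulting URL can then be sent with a `POST` request, for instance with `curl -XPOST`.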

 The scheduled task will have the following type `FeedSpamToRspamdTask` and the following additionalInformation:

 ```json
 {
   "errorCount": 1,
   "reportedSpamMessageCount": 2,
   "runningOptions": {
     "messagesPerSecond": 10,
     "rspamdTimeoutInSeconds": 15,
     "periodInSecond": 3600,
     "samplingProbability": 1.0
   },
   "spamMessageCount": 4,
   "timestamp": "2007-12-03T10:15:30Z",
   "type": "FeedSpamToRspamdTask"
 }
 ```
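
A monitoring script might turn this payload into a one-line summary. The Python sketch below only reads the fields shown in the example above. Reading `reportedSpamMessageCount` as "reported out of `spamMessageCount`" is an assumption drawn from the field names, not from the task's code.

```python
import json

# The additionalInformation payload from the example above.
ADDITIONAL_INFORMATION = """
{
  "errorCount": 1,
  "reportedSpamMessageCount": 2,
  "runningOptions": {
    "messagesPerSecond": 10,
    "rspamdTimeoutInSeconds": 15,
    "periodInSecond": 3600,
    "samplingProbability": 1.0
  },
  "spamMessageCount": 4,
  "timestamp": "2007-12-03T10:15:30Z",
  "type": "FeedSpamToRspamdTask"
}
"""

info = json.loads(ADDITIONAL_INFORMATION)
options = info["runningOptions"]
summary = (
    f"{info['type']}: reported {info['reportedSpamMessageCount']} of "
    f"{info['spamMessageCount']} spam messages, {info['errorCount']} error(s), "
    f"period {options['periodInSecond']} s"
)
print(summary)
```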

 ### Report ham messages to Rspamd

 Use this route to schedule a task that reports ham messages to Rspamd so that it can train its spam classifier.
 The task reports ham messages to per-user Bayes when `perUserBayes` is enabled in `rspamd.properties`.

 ```bash
 curl -XPOST 'http://ip:port/rspamd?action=reportHam'
 ```

 This endpoint accepts the following parameters:
 - `action` (required): must be `reportHam`.
 - `messagesPerSecond` (optional): rate at which messages are reported to Rspamd for learning. Defaults to 10.
 - `period` (optional): duration (many time units are supported, seconds by default). Only messages between `now - duration`
   and `now` are reported. By default, all messages are reported.
   These inputs represent the same duration: `1d`, `1day`, `86400 seconds`, `86400`...
 - `samplingProbability` (optional): float between 0 and 1, the probability of reporting each message to Rspamd.
   By default, all messages are reported.
 - `classifiedAsSpam` (optional): boolean. `true` only includes messages tagged as spam by Rspamd, `false` only includes
 messages tagged as ham by Rspamd. If omitted, all messages are included.
 - `rspamdTimeout` (optional): duration. Timeout for the HTTP learning requests sent to Rspamd. Defaults to 15 seconds.

 The endpoint returns the task id. E.g.:
 ```
 {
     "taskId": "70c12761-ab86-4321-bb6f-fde99e2f74b0"
 }
 ```

 Response codes:
 - 201: Task generation succeeded. Corresponding task id is returned.
 - 400: Invalid arguments supplied in the user request.

 [More details about endpoints returning a task](https://james.apache.org/server/manage-webadmin.html#Endpoints_returning_a_task).
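
To illustrate why `1d`, `1day`, `86400 seconds` and `86400` are equivalent `period` values, here is a minimal Python sketch of a duration parser that defaults to seconds. It is not the parser used by James: the function name and the list of accepted units are assumptions limited to what the examples need.

```python
import re

# Seconds per unit. An empty unit means "seconds", matching the documented default.
UNIT_SECONDS = {
    "": 1, "s": 1, "second": 1, "seconds": 1,
    "minute": 60, "minutes": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
    "d": 86400, "day": 86400, "days": 86400,
}


def period_in_seconds(raw: str) -> int:
    """Convert inputs such as `1d`, `1day`, `86400 seconds` or `86400` to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", raw)
    if match is None:
        raise ValueError(f"invalid duration: {raw!r}")
    amount, unit = match.groups()
    if unit.lower() not in UNIT_SECONDS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return int(amount) * UNIT_SECONDS[unit.lower()]


for raw in ("1d", "1day", "86400 seconds", "86400"):
    print(raw, "->", period_in_seconds(raw))
```

All four inputs resolve to 86400 seconds, the unit in which the task reports the value back as `periodInSecond`.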

 The scheduled task will have the following type `FeedHamToRspamdTask` and the following additionalInformation:

 ```json
 {
   "errorCount": 1,
   "reportedHamMessageCount": 2,
   "runningOptions": {
     "messagesPerSecond": 10,
     "rspamdTimeoutInSeconds": 15,
     "periodInSecond": 3600,
     "samplingProbability": 1.0
   },
   "hamMessageCount": 4,
   "timestamp": "2007-12-03T10:15:30Z",
   "type": "FeedHamToRspamdTask"
 }
 ```
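
The effect of `samplingProbability` can be pictured as an independent random draw per message. The Python sketch below shows that idea under this assumption. It is not the task's actual code, and `should_report` and the message names are hypothetical.

```python
import random


def should_report(sampling_probability: float, rng: random.Random) -> bool:
    """Decide whether one message is reported, with the given probability."""
    if not 0.0 <= sampling_probability <= 1.0:
        raise ValueError("samplingProbability must be between 0 and 1")
    # rng.random() is uniform in [0.0, 1.0), so 1.0 always reports and 0.0 never does.
    return rng.random() < sampling_probability


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
messages = [f"message-{i}" for i in range(1000)]
reported = [m for m in messages if should_report(0.25, rng)]
print(f"reported {len(reported)} of {len(messages)} messages")
```

With a probability of 0.25, roughly a quarter of the messages are reported, which reduces the learning load on Rspamd for large mailboxes.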

 #### Use live reporting

 Alternatively, ham and spam can be reported by a mailbox listener. To do so, enable `RspamdListener` in the `listeners.xml`
 configuration file:

 ```xml
 <listeners>
     <listener>
         <class>org.apache.james.rspamd.RspamdListener</class>
         <async>true</async>
     </listener>
 </listeners>
 ```

 Note that you can turn off `reportAdded` (which reports incoming messages as ham) to reduce the amount of work performed:


 ```xml
 <listeners>
     <listener>
         <class>org.apache.james.rspamd.RspamdListener</class>
         <async>true</async>
         <configuration>
           <reportAdded>false</reportAdded>
         </configuration>
     </listener>
 </listeners>
 ```

 ## Apache Kvrocks as Rspamd storage

 > **Note**: Kvrocks integration is currently a work in progress and is still being evaluated on a realistic setup. As of today, the Apache James PMC does not endorse its use in production environments.

 The Rspamd extension can use Apache Kvrocks as storage. Apache Kvrocks can be a more suitable option than Redis for Rspamd storage for the following reasons:
 - Kvrocks stores data on disk, which is beneficial for large datasets that may not fit entirely in memory. You can therefore handle more extensive spam training data without running into Redis memory limitations.
 - Kvrocks is compatible with the Redis API.

 The corresponding docker compose setups are provided:
 - [Apache James + Rspamd + Apache Kvrocks standalone](docker-compose-rspamd-with-kvrocks-standalone.yml)
 - [Apache James + Rspamd + Apache Kvrocks Sentinel](docker-compose-rspamd-with-kvrocks-sentinel.yml)

   Please note that, for Rspamd to work well with Kvrocks Sentinel, you need to:
   - Configure `slave-read-only no` in the `kvrocks.conf` file. This allows Rspamd to execute the read-only Lua scripts that fetch its Bayes statistics against the Kvrocks replicas, something Kvrocks is strict about by default.
   - Use Rspamd `3.10` or later.

 ### Migrate Rspamd data from Redis to Kvrocks

 The following sample uses [RedisShake](https://github.com/tair-opensource/RedisShake) to migrate data from Redis to Kvrocks.

 Sample command:
 ```bash
 docker run --network=emaily \
   --entrypoint "/bin/sh" \
   -v ${PWD}/sample-configuration/redis-shake/shake.toml:/app/shake.toml \
   -e SHAKE_SRC_ADDRESS=redis:6379 \
   -e SHAKE_DST_ADDRESS=kvrocks:6379 \
   ghcr.io/tair-opensource/redisshake:4.4.0 \
   -c "./redis-shake /app/shake.toml"
 ```