Copy updates from main ponymail repo

commit: 0f3d3f626e90dd656c461f99c1686494ea5eb19f
author: Sebb <sebb@apache.org> Thu Nov 18 21:01:07 2021 +0000
committer: Sebb <sebb@apache.org> Thu Nov 18 21:01:07 2021 +0000
tree: cfd23a7aa37523feb83e1aa1f97bab5abffc235b
parent: 4aec6b042bb0b91c2b1a81f959957167882cc9d8
diff --git a/source/markdown/docs/API.md b/source/markdown/docs/API.md
index a73213e..1d62f75 100644
--- a/source/markdown/docs/API.md
+++ b/source/markdown/docs/API.md

@@ -33,12 +33,15 @@
     "tid": "06b318af97ca96c115e878c14d0814a53407751c31388410421c1751@1441467256@<dev.any23.apache.org>",
     "list_raw": "<dev.any23.apache.org>"
 }
+
+Note: date and epoch are in UTC
+
 ~~~
 
 
 ### Fetching list data
 Usage:
-`GET /api/stats.lua?list=$list&domain=$domain[&d=$timespan][&q=$query][&header_from=$from][&header_to=$to][&header_subject=$subject][&header_body=$body][&quick][&emailsOnly][&s=$s&e=$e]`
+`GET /api/stats.lua?list=$list&domain=$domain[&d=$timespan][&q=$query][&header_from=$from][&header_to=$to][&header_subject=$subject][&header_body=$body][&quick][&emailsOnly][&s=$s&e=$e][&since=$since][&dfrom=$dfrom&dto=$dto]`
 
 See below for details of [timespan](#Timespans) values
 
@@ -46,7 +49,7 @@
 
     - $list: The list prefix (e.g. `dev`). Wildcards may be used
     - $domain: The list domain (e.g. `httpd.apache.org`). Wildcards may be used
-    - $timespan: A timespan value (see below)
+    - $timespan: A [timespan](#Timespans) value
     - $s: yyyy-mm start of month (day 1)
     - $e: yyyy-mm end of month (last day)
     - $query: A search query (may contain wildcards or negations):
@@ -57,6 +60,13 @@
     - $to: Optional To: address
     - $subject: Optional Subject: line
     - $body: Optional body text
+    - $since: number of seconds since the epoch, defaults to now.
+       Returns '{"changed":false}' if no emails are later than that time, otherwise proceeds with the normal search
+    - $dfrom: days ago to start
+    - $dto: total days to match
+
+Options:
+
     - quick: send statistics only (exclude participants, threadstruct, word-cloud, emails apart from epoch)
     - emailsOnly: return email summaries only (omit thread_struct, top 10 participants and word-cloud)
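
Reviewer sketch: the usage line above mixes valued parameters (`list`, `domain`, `s`, `e`, ...) with bare options (`quick`, `emailsOnly`). The Python helper below shows one way a client might assemble such a URL. The function name `build_stats_url` and the host `lists.example.org` are hypothetical; only the parameter names come from the documentation in this diff.

```python
from urllib.parse import urlencode


def build_stats_url(base, list_prefix, domain, **params):
    """Build a stats.lua URL (hypothetical helper, not part of Pony Mail).

    Keyword arguments set to True are emitted as bare options (e.g. &quick);
    None/False are skipped; everything else becomes key=value.
    """
    query = [("list", list_prefix), ("domain", domain)]
    flags = []
    for key, value in params.items():
        if value is True:
            flags.append(key)
        elif value is not None and value is not False:
            query.append((key, value))
    url = f"{base}/api/stats.lua?{urlencode(query)}"
    if flags:
        url += "&" + "&".join(flags)
    return url


# Example: one month of dev@httpd.apache.org, statistics only.
url = build_stats_url(
    "https://lists.example.org", "dev", "httpd.apache.org",
    s="2016-05", e="2016-05", quick=True,
)
```

Values are percent-encoded by `urlencode`, so wildcards or address-like values in `q` or `header_from` can be passed through as plain strings.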
     
@@ -94,7 +104,35 @@
     "name": "dev",
     "cloud": {...},
     "hits": 25,
-    "thread_struct": {...},
+    "thread_struct":
+    {
+        "nest": 2,
+        "children": {
+            {
+                "children": {
+                    {
+                        "children": {
+                            {
+                                "children": { },
+                                epoch: ...,
+                                tid: ...,
+                                nest: 1
+                            }
+                        },
+                        epoch: ...,
+                        tid: ...,
+                        nest: 2
+                    }
+                },
+                "epoch": 1474883100,
+                "tid": "b1d6446f5cc8f4846454cbabc48ddb08afbb601a77169f8e32e34102@<dev.ponymail.apache.org>",
+                "nest": 2
+            }
+        },
+        epoch: ...,
+        tid: ...,
+        body: ...
+    },
     "max": 5000,
     "searchlist": "<dev.ponymail.info>",
     "list": "dev@ponymail.info",
@@ -167,9 +205,10 @@
 
 ### Fetching notifications for a logged in user
 Usage:
-`GET /api/notifications.lua`
+`GET /api/notifications.lua[?seen=$mid]`
 
-Parameters: `None` (cookie required)
+Parameters: (cookie required)
+  - $mid: id of the message to be marked as having been seen
 
 
 Response example:
@@ -178,6 +217,8 @@
 {
     "notifications": {...}
 }
+or
+{"marked": true}
 ~~~
 
 ### Fetching a month's data as an mbox file
@@ -190,3 +231,21 @@
 TBA
 ~~~
 
+### Get ATOM data for list or email
+
+Usage:
+`GET /api/atom.lua(?list=$lid|?mid=$mid)`
+
+Parameters: (cookie may be required)
+  - $lid: the list id, e.g. dev@ponymail.apache.org
+  - $mid: The email ID (Permalink)
+
+One of the above is required.
+In the case of the list id, data is returned for the last month.
+For email ID, the thread is returned.
+
+Response example:
+
+~~~
+TBA
+~~~
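
Reviewer sketch: the `thread_struct` example added to the stats.lua response in this diff is a tree in which every node carries `tid`, `epoch` and nested `children`. The Python walker below flattens such a tree depth first. It assumes `children` decodes to a list of nodes (it also tolerates a dict of nodes, since the example writes the container with braces); the sample data and the name `walk_thread` are made up for illustration.

```python
def walk_thread(node, depth=0):
    """Yield (depth, tid, epoch) for a node and all descendants, depth first."""
    yield depth, node.get("tid"), node.get("epoch")
    children = node.get("children") or []
    if isinstance(children, dict):
        # Tolerate an object-valued container; only the values are nodes.
        children = children.values()
    for child in children:
        yield from walk_thread(child, depth + 1)


# Hypothetical thread: a root message, one reply, and a reply to that reply.
sample = {
    "tid": "root@example",
    "epoch": 1474883100,
    "children": [
        {
            "tid": "reply-1@example",
            "epoch": 1474883200,
            "children": [
                {"tid": "reply-2@example", "epoch": 1474883300, "children": {}},
            ],
        },
    ],
}

flattened = list(walk_thread(sample))
```

The depth value can drive indentation when rendering a thread view.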

diff --git a/source/markdown/docs/DESIGN-NOTES.md b/source/markdown/docs/DESIGN-NOTES.md
new file mode 100644
index 0000000..1feee34
--- /dev/null
+++ b/source/markdown/docs/DESIGN-NOTES.md

@@ -0,0 +1,76 @@
+# Design Notes
+
+This file is an attempt to summarise some of the design issues.
+
+## Database
+The project uses the ElasticSearch (ES) database to store the mails as individual documents.
+The database stores each mail to each list as a separate document.
+If the same mail was sent to multiple lists, then it exists as multiple documents in the database.
+
+ES requires that each distinct document has a unique id (MID).
+The MID is used to insert the document in the database, and can be used to fetch it.
+
+### Database design
+The mails are stored in two separate ES indexes:
+* "mbox" - this stores information about the document, plus the parsed content, and is used for searching and summary displays.
+* "mbox_source" - this is used to store the raw content of the document.
+The two versions of the document are linked by using the same MID.
+
+### Requirements for the MID
+As mentioned above, each different document must have a unique id (MID).
+This document may arrive as a single mail message, or be loaded from a collection such as an mbox file.
+
+Duplicate database entries can be avoided by ensuring that the same MID is calculated regardless of the input source.
+[If the same message is processed more than once, it does not matter, as only the last instance will be stored.]
+The MID format does not have to be transparent; it can be an opaque hash.
+
+### Generation of the MID
+The same message may be sent to multiple lists, so the message data alone is not sufficient to identify it uniquely.
+The same message may potentially be sent more than once to the same list,
+so the combination of message and listname is also not sufficient to identify a message.
+
+Many messages will have a Message-Id header which is intended to be unique to the message.
+However this may not be the case, and some messages do not have one.
+
+Many mailing list servers will allocate a sequence number or other such id to each message they send.
+This should be unique for the list, assuming that sequence is not reset.
+
+Where the Message-Id and List Server Id both exist, they can be combined to generate a MID.
+[If the List Server Id is known to be unique, then that can potentially be used alone.] 
+
+Where one or other id does not exist, then alternative means need to be used to generate the MID.
+The data used to do so must be present in all supported message sources.
+
+Algorithms for the generator remain TBA
+
+### Permalink requirements
+The application provides Permalinks which can later be used to refer to any document in the database.
+Once published, it is important that such links continue to work.
+
+Links should be portable; i.e. if the raw messages are loaded into a new archive it should be possible
+to support existing published Permalinks.
+
+Multiple links may refer to the same document, however each link should refer to a single document.
+Ideally the Permalink should be relatively short; however that may conflict with the uniqueness requirement.
+
+It may be useful for the Permalink format to be relatively transparent.
+For example, a current ASF mod_mbox link looks like:
+
+http://mail-archives.apache.org/mod_mbox/ponymail-commits/201605.mbox/<1f73b4e0fc1a4fbbbfe4d155293c2f1a@git.apache.org>
+
+This includes a reference to the:
+- mailing list name (ponymail-commits)
+- month when mail was sent (201605.mbox)
+- the Message-Id (<1f73b4e0fc1a4fbbbfe4d155293c2f1a@git.apache.org>)
+
+This information should be sufficient to find the message in just about any mail-archive.
+
+Vendor-specific links, by contrast, may be much shorter, but are only valid for the particular service.
+For example the equivalent Markmail link is:
+http://markmail.org/message/oanktcpxlxkmyora
+
+There may be use cases for both styles of link.
+
+### Permalink design
+TBA
+
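
Reviewer sketch: the design notes above describe the mod_mbox link as carrying the list name, the month and the Message-Id. The Python snippet below extracts those three parts from the example link. The regular expression and the name `parse_mod_mbox_link` are illustrative assumptions; real links may percent-encode the angle brackets, which this sketch does not handle.

```python
import re

# Assumed shape: http(s)://<host>/mod_mbox/<list>/<yyyymm>.mbox/<message-id>
MOD_MBOX_RE = re.compile(
    r"^https?://[^/]+/mod_mbox/"
    r"(?P<list>[^/]+)/"
    r"(?P<month>\d{6})\.mbox/"
    r"(?P<message_id><[^>]+>)$"
)


def parse_mod_mbox_link(url):
    """Split a mod_mbox-style permalink into list, month and Message-Id."""
    match = MOD_MBOX_RE.match(url)
    if match is None:
        raise ValueError(f"not a recognised mod_mbox link: {url!r}")
    return match.groupdict()


parts = parse_mod_mbox_link(
    "http://mail-archives.apache.org/mod_mbox/ponymail-commits/201605.mbox/"
    "<1f73b4e0fc1a4fbbbfe4d155293c2f1a@git.apache.org>"
)
```

An opaque link such as the Markmail one does not match, which is the portability difference the notes point out: it can only be resolved by the service that issued it.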

diff --git a/source/markdown/docs/INSTALLING.md b/source/markdown/docs/INSTALLING.md
index de197ef..0ecd8d3 100644
--- a/source/markdown/docs/INSTALLING.md
+++ b/source/markdown/docs/INSTALLING.md

@@ -14,8 +14,9 @@
 ## Pre-requisites ##
 You will need the following software installed on your machine:
 
-- ElasticSearch >= 2.1
+- ElasticSearch >= 2.1 and < 6.0 (setup.py does not support 6.x+; the code may perhaps run on 6.x)
 - Python 3.x for the archiver plugin (setup.py will handle dependencies) and importer
+- Python `html2text` package (GPLv3) if you wish to archive HTML-only mails (remember to add the `--html2text` command line arg)
 - Apache HTTP Server 2.4.x with mod_lua (see http://modlua.org/gs/installing if you need to build mod_lua manually)
 - Lua >=5.1 with the following modules: cjson, luasec, luasocket
   (Note: Lua 5.3 is not currently supported by httpd mod_lua or luasocket)
@@ -208,4 +209,11 @@
 To enable these headers, set `full_headers` to `true` in the `site/api/lib/config.lua` file.
 
 ### Lastly, a note about Message-ID (MID) generators
+The default MID generator is called 'medium' and digests the message
+body, timestamp and list-ID to generate the MID. There is also a 'short'
+that only digests the body, and a 'full' that uses the entire message as
+a bytestring to generate an ID. Medium is recommended for most setups
+(especially clustered setups), while full can be used for single-machine
+setups.
+N.B. At present, all the generators have issues; see #176, #177 and #178.
 Please see [this paragraph](archiving.html#usingtherightidgenerator) about document ID generators.
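
Reviewer sketch: the added paragraph says the 'medium' generator digests the message body, timestamp and list-ID. The Python function below illustrates that idea only; it is not Pony Mail's actual generator. The choice of SHA-224, the input encoding and the `digest@epoch@list-id` output shape are assumptions, the last one modelled on the `tid` sample in the API.md hunk.

```python
import hashlib


def sketch_medium_mid(body: bytes, epoch: int, list_id: str) -> str:
    """Hypothetical MID: hash of body + timestamp + list id.

    The same inputs always give the same MID, so re-importing a message
    overwrites the stored document rather than duplicating it.
    """
    digest = hashlib.sha224()
    digest.update(body)
    digest.update(str(epoch).encode("ascii"))
    digest.update(list_id.encode("utf-8"))
    return f"{digest.hexdigest()}@{epoch}@{list_id}"


body = b"Hello, list!\n"
mid_dev = sketch_medium_mid(body, 1441467256, "<dev.any23.apache.org>")
mid_users = sketch_medium_mid(body, 1441467256, "<users.any23.apache.org>")
```

Because the list id is part of the digest, the same mail sent to two lists yields two distinct MIDs, matching the one-document-per-list storage model described in the design notes.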