tools/plugins/dkim_id_test.html - incubator-ponymail-foal - Git at Google

 <?xml version="1.0" encoding="utf-8" ?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
 <meta name="generator" content="Docutils 0.16: http://docutils.sourceforge.net/" />
 <title>DKIM-IDs</title>
 <style type="text/css">

 /*
 :Author: David Goodger (goodger@python.org)
 :Id: $Id: html4css1.css 7952 2016-07-26 18:15:59Z milde $
 :Copyright: This stylesheet has been placed in the public domain.

 Default cascading style sheet for the HTML output of Docutils.

 See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
 customize this style sheet.
 */

 /* used to remove borders from tables and images */
 .borderless, table.borderless td, table.borderless th {
   border: 0 }

 table.borderless td, table.borderless th {
   /* Override padding for "table.docutils td" with "! important".
      The right padding separates the table cells. */
   padding: 0 0.5em 0 0 ! important }

 .first {
   /* Override more specific margin styles with "! important". */
   margin-top: 0 ! important }

 .last, .with-subtitle {
   margin-bottom: 0 ! important }

 .hidden {
   display: none }

 .subscript {
   vertical-align: sub;
   font-size: smaller }

 .superscript {
   vertical-align: super;
   font-size: smaller }

 a.toc-backref {
   text-decoration: none ;
   color: black }

 blockquote.epigraph {
   margin: 2em 5em ; }

 dl.docutils dd {
   margin-bottom: 0.5em }

 object[type="image/svg+xml"], object[type="application/x-shockwave-flash"] {
   overflow: hidden;
 }

 /* Uncomment (and remove this text!) to get bold-faced definition list terms
 dl.docutils dt {
   font-weight: bold }
 */

 div.abstract {
   margin: 2em 5em }

 div.abstract p.topic-title {
   font-weight: bold ;
   text-align: center }

 div.admonition, div.attention, div.caution, div.danger, div.error,
 div.hint, div.important, div.note, div.tip, div.warning {
   margin: 2em ;
   border: medium outset ;
   padding: 1em }

 div.admonition p.admonition-title, div.hint p.admonition-title,
 div.important p.admonition-title, div.note p.admonition-title,
 div.tip p.admonition-title {
   font-weight: bold ;
   font-family: sans-serif }

 div.attention p.admonition-title, div.caution p.admonition-title,
 div.danger p.admonition-title, div.error p.admonition-title,
 div.warning p.admonition-title, .code .error {
   color: red ;
   font-weight: bold ;
   font-family: sans-serif }

 /* Uncomment (and remove this text!) to get reduced vertical space in
    compound paragraphs.
 div.compound .compound-first, div.compound .compound-middle {
   margin-bottom: 0.5em }

 div.compound .compound-last, div.compound .compound-middle {
   margin-top: 0.5em }
 */

 div.dedication {
   margin: 2em 5em ;
   text-align: center ;
   font-style: italic }

 div.dedication p.topic-title {
   font-weight: bold ;
   font-style: normal }

 div.figure {
   margin-left: 2em ;
   margin-right: 2em }

 div.footer, div.header {
   clear: both;
   font-size: smaller }

 div.line-block {
   display: block ;
   margin-top: 1em ;
   margin-bottom: 1em }

 div.line-block div.line-block {
   margin-top: 0 ;
   margin-bottom: 0 ;
   margin-left: 1.5em }

 div.sidebar {
   margin: 0 0 0.5em 1em ;
   border: medium outset ;
   padding: 1em ;
   background-color: #ffffee ;
   width: 40% ;
   float: right ;
   clear: right }

 div.sidebar p.rubric {
   font-family: sans-serif ;
   font-size: medium }

 div.system-messages {
   margin: 5em }

 div.system-messages h1 {
   color: red }

 div.system-message {
   border: medium outset ;
   padding: 1em }

 div.system-message p.system-message-title {
   color: red ;
   font-weight: bold }

 div.topic {
   margin: 2em }

 h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
 h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
   margin-top: 0.4em }

 h1.title {
   text-align: center }

 h2.subtitle {
   text-align: center }

 hr.docutils {
   width: 75% }

 img.align-left, .figure.align-left, object.align-left, table.align-left {
   clear: left ;
   float: left ;
   margin-right: 1em }

 img.align-right, .figure.align-right, object.align-right, table.align-right {
   clear: right ;
   float: right ;
   margin-left: 1em }

 img.align-center, .figure.align-center, object.align-center {
   display: block;
   margin-left: auto;
   margin-right: auto;
 }

 table.align-center {
   margin-left: auto;
   margin-right: auto;
 }

 .align-left {
   text-align: left }

 .align-center {
   clear: both ;
   text-align: center }

 .align-right {
   text-align: right }

 /* reset inner alignment in figures */
 div.align-right {
   text-align: inherit }

 /* div.align-center * { */
 /*   text-align: left } */

 .align-top    {
   vertical-align: top }

 .align-middle {
   vertical-align: middle }

 .align-bottom {
   vertical-align: bottom }

 ol.simple, ul.simple {
   margin-bottom: 1em }

 ol.arabic {
   list-style: decimal }

 ol.loweralpha {
   list-style: lower-alpha }

 ol.upperalpha {
   list-style: upper-alpha }

 ol.lowerroman {
   list-style: lower-roman }

 ol.upperroman {
   list-style: upper-roman }

 p.attribution {
   text-align: right ;
   margin-left: 50% }

 p.caption {
   font-style: italic }

 p.credits {
   font-style: italic ;
   font-size: smaller }

 p.label {
   white-space: nowrap }

 p.rubric {
   font-weight: bold ;
   font-size: larger ;
   color: maroon ;
   text-align: center }

 p.sidebar-title {
   font-family: sans-serif ;
   font-weight: bold ;
   font-size: larger }

 p.sidebar-subtitle {
   font-family: sans-serif ;
   font-weight: bold }

 p.topic-title {
   font-weight: bold }

 pre.address {
   margin-bottom: 0 ;
   margin-top: 0 ;
   font: inherit }

 pre.literal-block, pre.doctest-block, pre.math, pre.code {
   margin-left: 2em ;
   margin-right: 2em }

 pre.code .ln { color: grey; } /* line numbers */
 pre.code, code { background-color: #eeeeee }
 pre.code .comment, code .comment { color: #5C6576 }
 pre.code .keyword, code .keyword { color: #3B0D06; font-weight: bold }
 pre.code .literal.string, code .literal.string { color: #0C5404 }
 pre.code .name.builtin, code .name.builtin { color: #352B84 }
 pre.code .deleted, code .deleted { background-color: #DEB0A1}
 pre.code .inserted, code .inserted { background-color: #A3D289}

 span.classifier {
   font-family: sans-serif ;
   font-style: oblique }

 span.classifier-delimiter {
   font-family: sans-serif ;
   font-weight: bold }

 span.interpreted {
   font-family: sans-serif }

 span.option {
   white-space: nowrap }

 span.pre {
   white-space: pre }

 span.problematic {
   color: red }

 span.section-subtitle {
   /* font-size relative to parent (h1..h6 element) */
   font-size: 80% }

 table.citation {
   border-left: solid 1px gray;
   margin-left: 1px }

 table.docinfo {
   margin: 2em 4em }

 table.docutils {
   margin-top: 0.5em ;
   margin-bottom: 0.5em }

 table.footnote {
   border-left: solid 1px black;
   margin-left: 1px }

 table.docutils td, table.docutils th,
 table.docinfo td, table.docinfo th {
   padding-left: 0.5em ;
   padding-right: 0.5em ;
   vertical-align: top }

 table.docutils th.field-name, table.docinfo th.docinfo-name {
   font-weight: bold ;
   text-align: left ;
   white-space: nowrap ;
   padding-left: 0 }

 /* "booktabs" style (no vertical lines) */
 table.docutils.booktabs {
   border: 0px;
   border-top: 2px solid;
   border-bottom: 2px solid;
   border-collapse: collapse;
 }
 table.docutils.booktabs * {
   border: 0px;
 }
 table.docutils.booktabs th {
   border-bottom: thin solid;
   text-align: left;
 }

 h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
 h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
   font-size: 100% }

 ul.auto-toc {
   list-style-type: none }

 </style>
 </head>
 <body>
 <div class="document" id="dkim-ids">
 <h1 class="title">DKIM-IDs</h1>

 <p>The recommended Ponymail ID generator is the DKIM-ID generator. It
 simplifies a message using an algorithm based on DKIM relaxed/simple
 canonicalisation, hashes it with an SHA-256 HMAC, and then encodes the
 truncated digest using base32 with the custom alphabet <tt class="docutils literal"><span class="pre">0-9</span> <span class="pre">b-d</span> <span class="pre">f-h</span>
 <span class="pre">j-t</span> <span class="pre">v-z</span></tt> and the padding stripped.</p>
 <div class="section" id="dkim-ids-test-suite">
 <h1>DKIM-IDs test suite</h1>
 <p>As well as plain Python doctests, we also use the hypothesis package
 to check properties of the DKIM-ID generator algorithm. This has the
 advantage of providing a kind of partial specification as well as
 testing the code. The suite can be run using:</p>
 <pre class="literal-block">
 PYTHONPATH=../tools python3 dkim_id_test.py
 </pre>
 <p>And exported to HTML using docutils and the command:</p>
 <pre class="literal-block">
 HTML=1 PYTHONPATH=../tools \
     python3 dkim_id_test.py &gt; dkim_id_test.html
 </pre>
 <div class="section" id="rfc5322-line-ending-normalisation">
 <h2>RFC5322 line ending normalisation</h2>
 <p>The first step of generating a DKIM-ID is to convert all line endings
 of the input to CRLF by upgrading bare CR and LF characters.</p>
 <blockquote>
 <p>If the message is submitted to the Signer with any local encoding
 that will be modified before transmission, that modification to
 canonical [RFC5322] form MUST be done before signing. In particular,
 bare CR or LF characters (used by some systems as a local line
 separator convention) MUST be converted to the SMTP-standard CRLF
 sequence before the message is signed.</p>
 <p><a class="reference external" href="https://tools.ietf.org/html/rfc6376#section-5.3">https://tools.ietf.org/html/rfc6376#section-5.3</a></p>
 </blockquote>
 <p>We follow the algorithm used in dkim_header in dkim.c in version 2.10
 of libopendkim, the implementation of which is this, reformatted for
 brevity:</p>
 <pre class="literal-block">
 for (p = hdr; p &lt; q &amp;&amp; *p != '\0'; p++) {
   if (*p == '\n' &amp;&amp; prev != '\r') { /* bare LF */
     dkim_dstring_catn(tmphdr, CRLF, 2);
   } else if (prev == '\r' &amp;&amp; *p != '\n') { /* bare CR */
     dkim_dstring_cat1(tmphdr, '\n');
     dkim_dstring_cat1(tmphdr, *p);
   } else { /* other */
     dkim_dstring_cat1(tmphdr, *p);
   }
   prev = *p;
 }
 if (prev == '\r') { /* end CR */
   dkim_dstring_cat1(tmphdr, '\n');
 }
 </pre>
 <p>Our version of this algorithm is called <tt class="docutils literal">rfc5322_endings</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc5322_endings
 </pre>
 <p>It works on bytes and produces bytes.</p>
 <p>We test properties of the DKIM-ID related functions not by formally
 proving them, as there are no mainstream frameworks for formal
 verification of Python (though Nagini may be worth trying), but
 instead by fuzzing with hypothesis as a property checker.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from hypothesis import given
 &gt;&gt;&gt; from hypothesis.strategies import from_regex as regex, text
 </pre>
 <p>The regex producer outputs str instances, and we use it because
 hypothesis does not allow us to use patterns or other smart generation
 with only bytes. Therefore we use the smart str generators and then
 convert the output to bytes using cp1252 or utf-8 encoding as
 necessary.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; def cp1252(text: str) -&gt; bytes:
 ...     return bytes(text, &quot;cp1252&quot;)
 &gt;&gt;&gt; def utf8(text: str):
 ...     return bytes(text, &quot;utf-8&quot;)
 </pre>
 <p>We'll also use our own decorator to make tests run automatically.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; def thesis(hypo, *args):
 ...     def decorator(func):
 ...         func = hypo(*args)(func)
 ...         func()
 ...         return func
 ...     return decorator
 </pre>
 <p>Since <tt class="docutils literal">rfc5322_endings</tt> only converts endings, sequences containing
 neither CR nor LF are unaffected.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, regex(r&quot;\A[^\r\n]*\Z&quot;))
 ... def non_cr_lf_unaffected(text: str) -&gt; None:
 ...     data: bytes = utf8(text)
 ...     assert data == rfc5322_endings(data), repr(data)
 </pre>
 <p>The algorithm is that any LF not preceded with CR will have one
 inserted before it, and likewise for CR not followed by LF. Therefore
 we expect the result to always have the same number of CR and LFs.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\n.&quot;))
 ... def cr_lf_same_cardinality(text: str) -&gt; None:
 ...     data: bytes = rfc5322_endings(utf8(text))
 ...     crs = data.count(b&quot;\r&quot;)
 ...     lfs = data.count(b&quot;\n&quot;)
 ...     assert crs == lfs, repr(data)
 </pre>
 <p>That the number of CRs or LFs will never be reduced.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\n.&quot;))
 ... def cr_lf_no_reduce(text: str) -&gt; None:
 ...     a: bytes = utf8(text)
 ...     b: bytes = rfc5322_endings(a)
 ...     assert b.count(b&quot;\r&quot;) &gt;= a.count(b&quot;\r&quot;), repr(data)
 ...     assert b.count(b&quot;\n&quot;) &gt;= a.count(b&quot;\n&quot;), repr(data)
 </pre>
 <p>That if we delete all CRLF subsequences, there will be no CR or LFs
 remaining in the sequence.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\n.&quot;))
 ... def only_crlf_subsequences(text: str) -&gt; None:
 ...     data: bytes = rfc5322_endings(utf8(text))
 ...     data = data.replace(b&quot;\r\n&quot;, b&quot;.&quot;)
 ...     assert data.count(b&quot;\r&quot;) == 0, repr(data)
 ...     assert data.count(b&quot;\n&quot;) == 0, repr(data)
 </pre>
 <p>That if we split on CR or LF sequences, the input and output will be
 the same.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\nabc. &quot;))
 ... def non_crlf_subsequences(text: str) -&gt; None:
 ...     def split(data: bytes):
 ...         data = data.replace(b&quot;\r&quot;, b&quot;\n&quot;)
 ...         while b&quot;\n\n&quot; in data:
 ...             data = data.replace(b&quot;\n\n&quot;, b&quot;\n&quot;)
 ...         return data.strip(b&quot;\n&quot;).split(b&quot;\n&quot;)
 ...     data: bytes = utf8(text)
 ...     expected = split(data)
 ...     normed: bytes = rfc5322_endings(data)
 ...     assert split(normed) == expected, repr(data)
 </pre>
 <p>And that all of this is equivalent to saying that every CR is now
 followed by LF and every LF is preceded by CR.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\n.&quot;))
 ... def cr_and_lf_pairs(text: str) -&gt; None:
 ...     data: bytes = rfc5322_endings(utf8(text))
 ...     if b&quot;\r&quot; in data:
 ...         datum: bytes
 ...         for datum in data.split(b&quot;\r&quot;)[1:]:
 ...             assert datum.startswith(b&quot;\n&quot;), repr(data)
 ...     if b&quot;\n&quot; in data:
 ...         datum: bytes
 ...         for datum in data.split(b&quot;\n&quot;)[:-1]:
 ...             assert datum.endswith(b&quot;\r&quot;), repr(data)
 </pre>
 <p>Most importantly, the number of CRLFs in the output must be equal to
 the number of CRLFs in the input, plus the number of individual CRs
 and LFs once the CRLFs have been removed.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\n.&quot;))
 ... def crlf_count(text: str) -&gt; None:
 ...     nocrlf = text.replace(&quot;\r\n&quot;, &quot;&quot;)
 ...     expected = text.count(&quot;\r\n&quot;)
 ...     expected += nocrlf.count(&quot;\r&quot;)
 ...     expected += nocrlf.count(&quot;\n&quot;)
 ...     data: bytes = rfc5322_endings(utf8(text))
 ...     assert data.count(b&quot;\r\n&quot;) == expected, repr(text)
 </pre>
 <p>We'll now give a few examples. First, with no CR or LF.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc5322_endings(b&quot;&quot;)
 b''
 &gt;&gt;&gt; rfc5322_endings(b&quot;abc&quot;)
 b'abc'
 </pre>
 <p>All of the following are equivalent to CRLF.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc5322_endings(b&quot;\r&quot;)
 b'\r\n'
 &gt;&gt;&gt; rfc5322_endings(b&quot;\n&quot;)
 b'\r\n'
 &gt;&gt;&gt; rfc5322_endings(b&quot;\r\n&quot;)
 b'\r\n'
 </pre>
 <p>And the following are equivalent to CRLF CRLF.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc5322_endings(b&quot;\r\r&quot;)
 b'\r\n\r\n'
 &gt;&gt;&gt; rfc5322_endings(b&quot;\n\n&quot;)
 b'\r\n\r\n'
 &gt;&gt;&gt; rfc5322_endings(b&quot;\n\r&quot;)
 b'\r\n\r\n'
 </pre>
 </div>
 <div class="section" id="dkim-relaxed-head-canonicalisation">
 <h2>DKIM relaxed head canonicalisation</h2>
 <p>The next important component of DKIM-ID generation is DKIM head
 canonicalisation using the relaxed canonicalisation algorithm. The
 algorithm is not trivial, consisting of five separate steps:</p>
 <blockquote>
 <ul class="simple">
 <li>Convert all header field names (not the header field values) to
 lowercase. For example, convert &quot;SUBJect: AbC&quot; to &quot;subject: AbC&quot;.</li>
 <li>Unfold all header field continuation lines as described in
 [RFC5322]; in particular, lines with terminators embedded in
 continued header field values (that is, CRLF sequences followed by
 WSP) MUST be interpreted without the CRLF. Implementations MUST
 NOT remove the CRLF at the end of the header field value.</li>
 <li>Convert all sequences of one or more WSP characters to a single SP
 character. WSP characters here include those before and after a
 line folding boundary.</li>
 <li>Delete all WSP characters at the end of each unfolded header field
 value.</li>
 <li>Delete any WSP characters remaining before and after the colon
 separating the header field name from the header field value. The
 colon separator MUST be retained.</li>
 </ul>
 <p><a class="reference external" href="https://tools.ietf.org/html/rfc6376#section-3.4.2">https://tools.ietf.org/html/rfc6376#section-3.4.2</a></p>
 </blockquote>
 <p>We'll use hypothesis to check each of these properties in turn. The
 canonicalisation function is called <tt class="docutils literal">rfc6376_relaxed_head</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_relaxed_head
 </pre>
 <p>And to test it, we'll need the lists producer from hypothesis.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from hypothesis.strategies import lists
 &gt;&gt;&gt; chars = text(alphabet=&quot;\x00\t\r\n\f .ABCabc\xc0&quot;).map(cp1252)
 &gt;&gt;&gt; headers = lists(lists(chars, min_size=2, max_size=2))
 </pre>
 <div class="section" id="step-one">
 <h3>Step one</h3>
 <p>Step one is to convert header field names only to lowercase. Since
 other normalisation steps will occur, to test it we need to take only
 the alphabetical octets.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; def alphabetical(data: bytes) -&gt; bytes:
 ...    from typing import Set
 ...    upper: bytes = b&quot;ABCDEFGHIJKLMNOPQRSTUVWXYZ&quot;
 ...    alpha: Set[int] = set(upper + upper.lower())
 ...    return bytes([b for b in data if b in alpha])
 </pre>
 <p>Then we can make a direct comparison.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, headers)
 ... def step_1_field_names_lower(headers) -&gt; None:
 ...     ks = [alphabetical(kv[0]) for kv in headers]
 ...     for i, (k, v) in enumerate(rfc6376_relaxed_head(headers)):
 ...         assert ks[i].lower() == alphabetical(k), repr(headers)
 </pre>
 <p>Including that values use the same case.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, headers)
 ... def step_1_field_values_case(headers) -&gt; None:
 ...     vs = [kv[1] for kv in headers]
 ...     alpha = &quot;ABCDEFGHIJKLMNOPQRSTUVWXYZ&quot;
 ...     cases = set(alpha + alpha.lower())
 ...     for i, (k, v) in enumerate(rfc6376_relaxed_head(headers)):
 ...         assert (set(vs[i]) &amp; cases) == (set(v) &amp; cases), repr(headers)
 </pre>
 </div>
 <div class="section" id="step-two">
 <h3>Step two</h3>
 <p>Step two is to unfold continuations by removing CRLF except at the
 end. This would only produce consistent results if the value is in
 <tt class="docutils literal">rfc5322_endings</tt> normal form, so we extend the step to remove all
 CR or LF, except for a trailing CRLF in the header field value.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;\r&quot;]])
 [[b'', b'']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;\n&quot;]])
 [[b'', b'']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;\r\n&quot;]])
 [[b'', b'\r\n']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;...\r&quot;]])
 [[b'', b'...']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;...\n&quot;]])
 [[b'', b'...']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;...\r\n&quot;]])
 [[b'', b'...\r\n']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;a\rb\r\n&quot;]])
 [[b'', b'ab\r\n']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;a\nb\r\n&quot;]])
 [[b'', b'ab\r\n']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;a\r\nb\r\n&quot;]])
 [[b'', b'ab\r\n']]
 </pre>
 <p>We do this even though, for example, <tt class="docutils literal">b&quot;a\r\nb\r\n&quot;</tt> is not a
 possible header field value because the first CRLF is not followed by
 a space or a tab, meaning that it is not a continuation.</p>
 <p>We apply the CR and LF removal to header field names too, following
 libopendkim, although <tt class="docutils literal">rfc6376_relaxed_head</tt> should never encounter
 CR or LF in a header field name during DKIM-ID generation. The removal
 of CR and LF in header names includes CRLF at the end of a header
 field name, unlike in a header field value where trailing CRLF is
 retained.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;...\r\n&quot;, b&quot;&quot;]])
 [[b'...', b'']]
 </pre>
 <pre class="doctest-block">
 &gt;&gt;&gt; header_text = (text(alphabet=&quot;\x00\t\r\n\f .ABCabc\xc0&quot;)
 ...     .map(cp1252)
 ...     .map(rfc5322_endings))
 &gt;&gt;&gt; wild_headers = lists(lists(header_text, min_size=2, max_size=2))
 </pre>
 <p>The <tt class="docutils literal">wild_headers</tt> producer gives us headers which have not been
 normalised, and can therefore be used to test the extended step,
 e.g. for CR and LF deletion.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_2_cr_lf_deletion(headers) -&gt; None:
 ...     for (k, v) in rfc6376_relaxed_head(headers):
 ...         assert b&quot;\r&quot; not in k, repr(headers)
 ...         assert b&quot;\n&quot; not in k, repr(headers)
 ...         if v.endswith(b&quot;\r\n&quot;):
 ...             v = v[:-2]
 ...         assert b&quot;\r&quot; not in v, repr(headers)
 ...         assert b&quot;\n&quot; not in v, repr(headers)
 </pre>
 <p>We can also test that any trailing CRLF in a header field value is
 retained.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_2_field_values_trailing_crlf(headers) -&gt; None:
 ...     vs = [kv[1] for kv in headers]
 ...     for i, (k, v) in enumerate(rfc6376_relaxed_head(headers)):
 ...         a = vs[i].endswith(b&quot;\r\n&quot;)
 ...         b = v.endswith(b&quot;\r\n&quot;)
 ...         assert a == b, repr(headers)
 </pre>
 </div>
 <div class="section" id="step-three">
 <h3>Step three</h3>
 <p>Step three is to reduce all sequences of spaces or tabs to a single
 space, i.e. all sequences that match <tt class="docutils literal">[ \t]+</tt> must be replaced with
 <tt class="docutils literal">&quot; &quot;</tt>. The RFC sounds like it's saying that step three should be
 applied to both names and values, but may regard the issue as moot
 since WSP is not allowed in header names according to RFC 5322:</p>
 <blockquote>
 <p>[...] A field name MUST be composed of printable US-ASCII characters
 (i.e., characters that have values between 33 and 126, inclusive),
 except colon.</p>
 <p><a class="reference external" href="https://tools.ietf.org/html/rfc5322#section-2.2">https://tools.ietf.org/html/rfc5322#section-2.2</a></p>
 </blockquote>
 <p>Since RFC 6376 says to convert to RFC 5322 normal form first, that
 implies removing all characters outside of the range 33 to 126. It is
 not clear that ignoring characters out of this range, e.g. converting
 &quot;T\x00o&quot; to &quot;To&quot;, has no detrimental security properties. Neither RFC
 4409 section 8 nor RFC 6376 section 3.8 and 8 discuss this issue. The
 latter simply says that &quot;Signers and Verifiers SHOULD take reasonable
 steps to ensure that the messages they are processing are valid&quot;.</p>
 <p>In any case, libopendkim also doesn't delete all characters outside
 the range 33 to 126 in header field names. Instead, it deletes only
 tab, CR, LF, and space. But RFC 6376 also says in step five to delete
 &quot;any WSP characters remaining before and after the colon&quot;, with
 &quot;remaining&quot; being the operative word here. This suggests that it did
 consider the earlier step three to apply to headers too, otherwise the
 WSP characters would not be &quot;remaining&quot; ones. But if it considered the
 earlier step three to apply to header field names, then it must also
 consider that there may be spaces and tabs inside header field names
 even after RFC 5322 normalisation. Hence, we consider that RFC 6376 is
 primarily suggesting to apply RFC 5322 <em>line ending</em> normalisation,
 which notably it introduces by saying &quot;in particular&quot; in section
 5.3. We also consider that it suggests reducing spaces and tabs to a
 single space in step three, answering the question of what to do with
 &quot;T o&quot; (it remains &quot;T o&quot;) and &quot;T\x00o&quot; (it remains &quot;T\x00o&quot;).</p>
 <p>In summary, we follow RFC 6376 as literally as possible, contrary to
 libopendkim in this case, and apply step three to header field names.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;Spaced \t  \t\tKey&quot;, b&quot;Value\r\n&quot;]])
 [[b'spaced key', b'Value\r\n']]
 </pre>
 <p>With this, <tt class="docutils literal">rfc6376_relaxed_head</tt> accepts arbitrary bytes for names
 and values, and deals with them in a consistent and considered way,
 including tab and space other values outside 33 to 126. This also
 includes retaining colon and semicolon, even though they are
 problematic in DKIM signing.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;:&quot;, b&quot;Value\r\n&quot;]])
 [[b':', b'Value\r\n']]
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;;&quot;, b&quot;Value\r\n&quot;]])
 [[b';', b'Value\r\n']]
 </pre>
 <p>In the component of the DKIM-ID generator which uses header
 canonicalisation it's impossible for it to have colon in the header
 name, but it is possible for it to have semicolon. Such a header could
 not be signed using DKIM as it uses semicolon as the separator in the
 list of headers which have been signed, but it will be ignored in
 DKIM-ID generation as long as the defaults are followed or <tt class="docutils literal">&quot;;&quot;</tt> is
 not manually specified as a subset header to keep. Another problematic
 header which is possible is the empty header. The case of a header
 name starting with WSP also doesn't arise, because such lines are
 continuation lines.</p>
 <p>Overall, there should never be a tab in canonicalised header field
 names and values, and there should never be a double space in
 canonicalised header field names and values.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_3_field_values(headers) -&gt; None:
 ...     for (k, v) in rfc6376_relaxed_head(headers):
 ...         assert b&quot;\t&quot; not in k, repr(headers)
 ...         assert b&quot;\t&quot; not in v, repr(headers)
 ...         assert b&quot;  &quot; not in k, repr(headers)
 ...         assert b&quot;  &quot; not in v, repr(headers)
 </pre>
 <p>Internally, the function that performs this step is called
 <tt class="docutils literal">rfc6376_shrink_head</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_shrink_head
 </pre>
 <p>And it should work like a more efficient version of iteratively
 removing double spaces, except that it also strips leading and
 trailing whitespace, which is for steps four and five.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_3_reduce_iterative(headers) -&gt; None:
 ...     for (k, v) in headers:
 ...         kk = k.replace(b&quot;\t&quot;, b&quot; &quot;)
 ...         vv = v.replace(b&quot;\t&quot;, b&quot; &quot;)
 ...         while b&quot;  &quot; in kk:
 ...             kk = kk.replace(b&quot;  &quot;, b&quot; &quot;)
 ...         kk = kk.strip(b&quot; &quot;)
 ...         while b&quot;  &quot; in vv:
 ...             vv = vv.replace(b&quot;  &quot;, b&quot; &quot;)
 ...         vv = vv.strip(b&quot; &quot;)
 ...         assert rfc6376_shrink_head(k) == kk, repr(k)
 ...         assert rfc6376_shrink_head(v) == vv, repr(v)
 </pre>
 <p>This also means that leading whitespace is removed from the beginnings
 of header names. Again this is not a case which could occur during
 DKIM-ID generation, in this case because such a name would have been
 regarded as a continuation, even at the beginning of a message where
 it is regarded as the continuation of the empty name.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot; Key&quot;, b&quot;Value\r\n&quot;]])
 [[b'key', b'Value\r\n']]
 </pre>
 </div>
 <div class="section" id="step-four">
 <h3>Step four</h3>
 <p>Step four says that spaces and tabs at the end of a header field value
 are removed.</p>
 <p>It is possible to give a header field value without a trailing CRLF to
 <tt class="docutils literal">rfc6376_relaxed_head</tt>, and so any trailing tabs or spaces there
 must be removed.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;&quot;, b&quot;Value\t &quot;]])
 [[b'', b'Value']]
 </pre>
 <p>But the RFC 5322 message grammar states that all headers shall end
 with CRLF. An overly literal reading of RFC 6376 therefore implies
 that spaces and tabs are never removed from the end of a field value,
 because the value must always end with CRLF according to RFC 5322. But
 if they were never removed then there would be no need for the step,
 so the implication is that the &quot;end&quot; for the purposes of this step is
 before the trailing CRLF.</p>
 <p>A reading of <tt class="docutils literal">dkim_canon_header_string</tt> in libopendkim suggests that
 it could leave a header ending with space CRLF, but this hasn't been
 tested. We remove the space correctly.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_relaxed_head([[b&quot;Key&quot;, b&quot;Value \r\n&quot;]])
 [[b'key', b'Value\r\n']]
 </pre>
 <p>Indeed, a header field value must never end with space or tab.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_4_field_values_ends(headers) -&gt; None:
 ...     for (k, v) in rfc6376_relaxed_head(headers):
 ...         assert not v.endswith(b&quot; &quot;), repr(headers)
 ...         assert not v.endswith(b&quot;\t&quot;), repr(headers)
 </pre>
 <p>And must never end with space CRLF or tab CRLF.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_4_field_values_ends_2(headers) -&gt; None:
 ...     for (k, v) in rfc6376_relaxed_head(headers):
 ...         assert not v.endswith(b&quot; \r\n&quot;), repr(headers)
 ...         assert not v.endswith(b&quot;\t\r\n&quot;), repr(headers)
 </pre>
 <p>Indeed, it should never be possible to contain, let alone end, with a
 tab anyway after step three since that replaces all sequences of
 spaces and tabs with a single space, leaving no tabs at all in the
 output before it reaches step four.</p>
 </div>
 <div class="section" id="step-five">
 <h3>Step five</h3>
 <p>Step five is to remove spaces and tabs from the end of header names,
 and from the start of header values. Again, all tabs should have been
 removed anyway in step three, so this step could have specified only
 removing spaces.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def step_5_wsp_around_colon(headers) -&gt; None:
 ...     for (k, v) in rfc6376_relaxed_head(headers):
 ...         assert not k.endswith(b&quot; &quot;), repr(headers)
 ...         assert not k.endswith(b&quot;\t&quot;), repr(headers)
 ...         assert not v.startswith(b&quot; &quot;), repr(headers)
 ...         assert not v.startswith(b&quot;\t&quot;), repr(headers)
 </pre>
 </div>
 <div class="section" id="general-properties">
 <h3>General properties</h3>
 <p>We can combine headers in order to check their size.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_join
 </pre>
 <p>This can be used to test one of the general properties of
 <tt class="docutils literal">rfc6376_relaxed_head</tt>, that it never enlarges the data given to it.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def head_never_enlarged(headers) -&gt; None:
 ...     a: bytes = rfc6376_join(headers)
 ...     h: List[List[bytes]] = rfc6376_relaxed_head(headers)
 ...     b: bytes = rfc6376_join(h)
 ...     assert len(a) &gt;= len(b), repr(headers)
 </pre>
 <p>Perhaps the most important general property of canonicalisation is
 that once canonicalised, attempting to canonicalise again produces the
 same data. In other words canonicalisation is absolute, and data
 cannot be canonicalised further.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, wild_headers)
 ... def recanonicalisation_is_identity(headers) -&gt; None:
 ...     a = rfc6376_relaxed_head(headers)
 ...     b = rfc6376_relaxed_head(a)
 ...     assert a == b, repr(headers)
 </pre>
 </div>
 </div>
 <div class="section" id="simple-body-canonicalisation">
 <h2>Simple body canonicalisation</h2>
 <p>The body canonicalisation function is called <tt class="docutils literal">rfc6376_simple_body</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_simple_body
 </pre>
 <p>It maps an empty body to CRLF, and then ensures that there is at most
 one CRLF at the end of the body. Therefore, a consequence is that it
 ensures that the output is never empty.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, chars)
 ... def body_not_empty(body) -&gt; None:
 ...     body_c = rfc6376_simple_body(body)
 ...     assert len(body_c) &gt; 0, repr(body)
 </pre>
 <p>And that the output never ends CRLF CRLF.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, chars)
 ... def body_no_trailing_crlfcrlf(body) -&gt; None:
 ...     body_c = rfc6376_simple_body(body)
 ...     assert not body_c.endswith(b&quot;\r\n\r\n&quot;) &gt; 0, repr(body)
 </pre>
 <p>But it could end non-CR LF CRLF, or CR CRLF if the input were not RFC
 5322 ending normalised.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_simple_body(b&quot;Non-CR\n\r\n&quot;)
 b'Non-CR\n\r\n'
 &gt;&gt;&gt; rfc6376_simple_body(b&quot;CR\r\r\n&quot;)
 b'CR\r\r\n'
 </pre>
 <p>The function enlarges data only when its input is empty.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, chars.filter(lambda b: b != b&quot;&quot;))
 ... def body_enlarging_edge(body) -&gt; None:
 ...     body_c = rfc6376_simple_body(body)
 ...     assert len(body_c) &lt;= len(body), repr(body)
 </pre>
 <p>The prefix of the output up to any trailing CRLF the shared by the input.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, chars)
 ... def body_same_prefix(body) -&gt; None:
 ...     body_c = rfc6376_simple_body(body)
 ...     size_c = len(body_c)
 ...     if body_c.endswith(b&quot;\r\n&quot;):
 ...         size_c -= 2
 ...     assert body[:size_c] == body_c[:size_c], repr(body)
 </pre>
 <p>And any remainder must consist solely of CRLFs in both input and output.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, chars)
 ... def body_suffix_crlfs(body) -&gt; None:
 ...     body_c = rfc6376_simple_body(body)
 ...     size_c = len(body_c)
 ...     if body_c.endswith(b&quot;\r\n&quot;):
 ...         size_c -= 2
 ...     assert not body[size_c:].replace(b&quot;\r\n&quot;, b&quot;&quot;), repr(body)
 ...     assert not body_c[size_c:].replace(b&quot;\r\n&quot;, b&quot;&quot;), repr(body)
 </pre>
 </div>
 <div class="section" id="splitting">
 <h2>Splitting</h2>
 <p>The main parser is called <tt class="docutils literal">rfc6376_split</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_split
 </pre>
 <p>It does not perform canonicalisation. If there is no CRLF header and
 body boundary separator, then it returns None for the body.</p>
 <p>Each header field is defined by RFC 5322 as ending with CRLF which is
 inclusive to that header field. Any CRLF following that indicates the
 start of a body, which may be empty. Therefore, in the case of the
 empty document there are no headers and no body.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;&quot;)
 ([], None)
 </pre>
 <p>In the case of just CRLF there are no headers, since they must contain
 at least one character before their CRLF. RFC 5322 section 2.2 says
 that header fields &quot;are lines beginning with a field name, followed by
 a colon&quot;, which implies at least the presence of a colon, and section
 3.6.8 says &quot;field-name = 1*ftext&quot; which means the name must include at
 least one printable character. As there is nothing after the CRLF in
 the case of just a CRLF, there is an empty body.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;\r\n&quot;)
 ([], b'')
 </pre>
 <p>In the case of CRLF CRLF there are no headers, and there is a body
 which is CRLF.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;\r\n\r\n&quot;)
 ([], b'\r\n')
 </pre>
 <p>And then this pattern repeats.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;\r\n\r\n\r\n&quot;)
 ([], b'\r\n\r\n')
 &gt;&gt;&gt; rfc6376_split(b&quot;\r\n\r\n\r\n\r\n&quot;)
 ([], b'\r\n\r\n\r\n')
 </pre>
 <p>When we have a header, a single trailing CRLF is regarded as part of
 that header. This means that there is no body.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;Key:Value\r\n&quot;)
 ([[b'Key', b'Value\r\n']], None)
 </pre>
 <p>But appending another CRLF to that gives an empty body.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;Key:Value\r\n\r\n&quot;)
 ([[b'Key', b'Value\r\n']], b'')
 </pre>
 <p>As <tt class="docutils literal">rfc6376_split</tt> does not perform canonicalisation, we have the
 edge cases of isolated CRs and LFs. There should never be isolated CRs
 and LFs in DKIM-ID generation because RFC 5322 ending normalisation is
 applied before splitting, but in such cases where the function is
 called with isolated CRs and LFs they are considered as header field
 name or header field value data.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;\r&quot;)
 ([[b'\r', b'']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\n&quot;)
 ([[b'\n', b'']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\n\r\n&quot;)
 ([[b'\n', b'\r\n']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\r\r\n&quot;)
 ([[b'\r', b'\r\n']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\r...\r\n&quot;)
 ([[b'\r...', b'\r\n']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\n...\r\n&quot;)
 ([[b'\n...', b'\r\n']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\n:\n\r\n&quot;)
 ([[b'\n', b'\n\r\n']], None)
 &gt;&gt;&gt; rfc6376_split(b&quot;\n...:\n...\r\n&quot;)
 ([[b'\n...', b'\n...\r\n']], None)
 </pre>
 <p>A header field name without any header field value is just regarded as
 being the same as one with an empty value.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;Key\r\n\r\n&quot;)
 ([[b'Key', b'\r\n']], b'')
 &gt;&gt;&gt; rfc6376_split(b&quot;Key:\r\n\r\n&quot;)
 ([[b'Key', b'\r\n']], b'')
 </pre>
 <p>For greater consistency with how bodies are handled, the former could
 have been interpreted as <tt class="docutils literal">[b'Key', None]</tt>, but this would increase
 the complexity of the code, and lead to the question of where the
 trailing CRLF ought to be stored.</p>
 <p>In some cases, one of the mbox formats may accidentally be passed to
 <tt class="docutils literal">rfc6376_split</tt>, containing a line like this in its headers, usually
 at the start but potentially later in the headers too:</p>
 <blockquote>
 &quot;From MAILER-DAEMON Fri Jul  8 12:08:34 2011&quot;</blockquote>
 <p>Which would be interpreted as a header field whose name is:</p>
 <blockquote>
 &quot;From MAILER-DAEMON Fri Jul  8 12&quot;</blockquote>
 <p>And which could also collect any following continuation line.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot;To:You\r\nFrom Me\r\n More\r\n&quot;)
 ([[b'To', b'You\r\n'], [b'From Me', b'\r\n More\r\n']], None)
 </pre>
 <p>This is safe because even after canonicalisation it is not possible to
 confuse a <tt class="docutils literal">&quot;From &quot;</tt> line with a <tt class="docutils literal">&quot;From:&quot;</tt> header field, unless no
 text follows the <tt class="docutils literal">&quot;From &quot;</tt> and it is followed by a continuation. If
 no text follows the <tt class="docutils literal">&quot;From &quot;</tt> then it is not in one of the mbox
 formats anyway. And if it is followed by a continuation, then
 interpreting it as a From header field is reasonable.</p>
 <p>Similarly to a name without a value, a continuation value without a
 preceding line is treated as though the header field name is empty.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split(b&quot; More\r\n&quot;)
 ([[b'', b' More\r\n']], None)
 </pre>
 <p>An alternative to this would be to treat the line itself as a header
 field name, but then that creates the issue of whether to remove the
 leading whitespace, and whether to parse a colon in it. It would also
 make it inconsistent with all other field names, which must not start
 with a space.</p>
 <p>The type of the body, the second element of the tuple returned from
 <tt class="docutils literal">rfc6376_split</tt>, directly correlates to whether the input starts
 with CRLF or whether CRLF CRLF occurs in the input. If it does so,
 then we say that the input message contains a header and body
 boundary.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; def contains_boundary(data: bytes) -&gt; bool:
 ...     return data.startswith(b&quot;\r\n&quot;) or (b&quot;\r\n\r\n&quot; in data)
 </pre>
 <p>We use a simple subset of all possible inputs to check this
 correlation.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; text_message = (text(alphabet=&quot;\x00\t\r\n\f .:ABCabc\xc0&quot;)
 ...     .map(cp1252))
 </pre>
 <p>Although <tt class="docutils literal">rfc6376_split</tt> should always take input in RFC 5322 ending
 normal form, we test without that normal form.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def body_type_correlation(data) -&gt; None:
 ...     headers, body = rfc6376_split(data)
 ...     body_not_none = (body is not None)
 ...     assert contains_boundary(data) is body_not_none, repr(data)
 </pre>
 <p>If the input is not RFC 5322 normalised, then CR and LF can appear in
 header field names, as already demonstrated. Colon, however, should
 never appear in a header field name.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def no_split_colon(data) -&gt; None:
 ...     headers, body = rfc6376_split(data)
 ...     for (k, v) in headers:
 ...         assert b&quot;:&quot; not in k, repr(data)
 </pre>
 <p>And if the input is RFC 5322 normalised, then colon, CR, and LF should
 never appear in header field names.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def no_normal_split_chars(data) -&gt; None:
 ...     data = rfc5322_endings(data)
 ...     headers, body = rfc6376_split(data)
 ...     for (k, v) in headers:
 ...         assert b&quot;:&quot; not in k, repr(data)
 ...         assert b&quot;\r&quot; not in k, repr(data)
 ...         assert b&quot;\n&quot; not in k, repr(data)
 </pre>
 </div>
 <div class="section" id="canonicalised-splitting">
 <h2>Canonicalised splitting</h2>
 <p>The version of the main parser which performs canonicalisation is
 called <tt class="docutils literal">rfc6376_split_canon</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_split_canon
 </pre>
 <p>It takes <tt class="docutils literal">head_subset</tt>, <tt class="docutils literal">head_canon</tt>, and <tt class="docutils literal">body_canon</tt>
 arguments. The first is a set of bytes, lower case header field names
 to keep when parsing the headers. If <tt class="docutils literal">head_subset</tt> is None, all
 headers are retained, which is useful for testing. The second is a
 boolean of whether to apply <tt class="docutils literal">rfc6376_relaxed_head</tt>, and the third is
 a boolean of whether to apply <tt class="docutils literal">rfc6376_simple_body</tt> and potentially
 modify the headers too for consistency.</p>
 <p>If there was no body, i.e. no header body boundary CRLF in the
 message, then the returned body should be <tt class="docutils literal">None</tt> rather than
 <tt class="docutils literal">b&quot;&quot;</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def body_none(message) -&gt; None:
 ...     boundary = contains_boundary(rfc5322_endings(message))
 ...     headers, body = rfc6376_split_canon(message)
 ...     assert boundary is (body is not None), repr(message)
 </pre>
 <p>We can perform the canonicalisation steps ourselves. We need to import
 <tt class="docutils literal">rfc6376_simple_holistic</tt>, which ensures that headers are augmented
 with CRLF if necessary when there is either no body or an empty body
 but body canonicalisation synthesizes one.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_simple_holistic
 </pre>
 <p>And then DKIM relaxed/simple can be applied consistently.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def manual_canon(message) -&gt; None:
 ...     # uc = uncanonicalised, ec = expected canon, ac = actual canon
 ...     headers_uc, body_uc = rfc6376_split_canon(message)
 ...     headers_ec, body_ec = rfc6376_split_canon(message,
 ...         head_canon=True, body_canon=True)
 ...     headers_ac = rfc6376_relaxed_head(headers_uc)
 ...     headers_ac, body_ac = rfc6376_simple_holistic(headers_ac, body_uc)
 ...     assert headers_ac == headers_ec, repr(message)
 ...     assert body_ac == body_ec, repr(message)
 </pre>
 <p>The header and body canonicalisation steps are optional. Even when
 retaining all headers (which is the default) and performing neither
 kind of canonicalisation (which is also the default), the input
 message is not necessarily the same as the output message, whether RFC
 5322 normalisation were performed or not. This is because, for
 example, the construction of broken headers, i.e. those without
 colons, is fixed in the process.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_split_canon(b&quot;Key&quot;)
 ([[b'Key', b'']], None)
 &gt;&gt;&gt; rfc6376_join(*rfc6376_split_canon(b&quot;Key&quot;))
 b'Key:'
 </pre>
 </div>
 <div class="section" id="reformation">
 <h2>Reformation</h2>
 <p>We call the process of splitting and then joining &quot;reforming&quot;. There
 is a function called <tt class="docutils literal">rfc6376_reformed</tt> that performs this.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_reformed
 </pre>
 <p>Then <tt class="docutils literal">rfc6376_reformed</tt> should be exactly equivalent to using
 <tt class="docutils literal">rfc6376_split</tt> and then <tt class="docutils literal">rfc6376_join</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def normal(message) -&gt; None:
 ...     a = rfc6376_join(*rfc6376_split(message))
 ...     b = rfc6376_reformed(message)
 ...     assert a == b, repr(message)
 </pre>
 </div>
 <div class="section" id="canonicalised-reformation">
 <h2>Canonicalised reformation</h2>
 <p>We can use <tt class="docutils literal">rfc6376_reformed_canon</tt> to canonicalise a message whilst
 reforming it. This function accepts an additional <tt class="docutils literal">lid</tt> parameter to
 specify a list ID, in the RFC 2919 sense, and returns a list ID and
 the canonicalised message. The output list ID will be an empty bytes
 object if the input list ID was in any <tt class="docutils literal"><span class="pre">List-Id</span></tt> header in the input
 message.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_reformed_canon
 </pre>
 <p>Then if we make our own headers, canonicalise them, and then join
 them, we should always get a canonicalised message.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, headers)
 ... def more_manual_canon(headers) -&gt; None:
 ...     headers_c = rfc6376_relaxed_head(headers)
 ...     message_c = rfc6376_join(headers_c)
 ...     assert message_c == rfc6376_reformed_canon(message_c,
 ...         head_canon=True, body_canon=False)[1], repr(message_c)
 </pre>
 </div>
 <div class="section" id="rascals">
 <h2>Rascals</h2>
 <p>DKIM-ID generation uses the standard <tt class="docutils literal">rfc6376_reformed_canon</tt> call
 with <tt class="docutils literal">rfc4871_subset</tt> headers and both head and body
 canonicalised. We refer to this combination as <em>reformed and
 relaxed/simple canonicalisation</em>, or just &quot;rascal&quot; for short. The
 function that performs this is called <tt class="docutils literal">rfc6376_rascal</tt>. Like
 <tt class="docutils literal">rfc6376_reformed_canon</tt>, this function accepts an additional
 <tt class="docutils literal">lid</tt> parameter to specify a list ID, in the RFC 2919 sense, and
 returns a list ID and the canonicalised message.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc6376_rascal
 </pre>
 <p>A missing or empty body is encoded, per RFC 6376 simple body
 canonicalisation, as CRLF. We always perform body canonicalisation if
 <tt class="docutils literal">body_canon</tt> is <tt class="docutils literal">True</tt>, which means that even if there is no body
 (i.e. there was no header and body boundary in the original) there
 will always be body canonicalisation, which means that the body will
 always be non-empty, and will always be appended by <tt class="docutils literal">rfc6376_join</tt>
 after the header and body separator CRLF. This means that there will
 always be a header and body boundary in the rascal output.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def rascal_contains_boundary(data) -&gt; None:
 ...     rascal = rfc6376_rascal(data)[1]
 ...     assert contains_boundary(rascal), repr(data)
 </pre>
 <p>In particular, it means that the empty input document will become CRLF
 CRLF, which is the header and body separator CRLF followed by the
 canonicalised empty body CRLF. Two CRLFs, but with completely
 different roles.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;&quot;)
 (b'', b'\r\n\r\n')
 </pre>
 <p>And, because trailing CRs or LFs are RFC 5322 ending normalised and
 then canonicalised to a single CRLF, it means that any sequence of CRs
 or LFs will be rascaled to CRLF CRLF too.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text(alphabet=&quot;\r\n&quot;).map(utf8))
 ... def normal_crlfs_to_crlf2(data) -&gt; None:
 ...     rascal = rfc6376_rascal(data)[1]
 ...     assert rascal == b&quot;\r\n\r\n&quot;, repr(data)
 </pre>
 <p>Since the input is considered to be a message, arbitrary text without
 metacharacters will usually be regarded as a discardable header field.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;Text&quot;)
 (b'', b'\r\n\r\n')
 </pre>
 <p>This is true even when colon is included, as long as the prefix is not
 one of the standard header field names in <tt class="docutils literal">rfc4871_subset</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;Discarded: Value&quot;)
 (b'', b'\r\n\r\n')
 </pre>
 <p>But if the header is in the subset, it will indeed be retained. In
 this case, holistic canonicalisation ensures that CRLF is appended to
 the header too.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;To: Recipient&quot;)
 (b'', b'to:Recipient\r\n\r\n\r\n')
 </pre>
 <p>In other words this is a header field <tt class="docutils literal">b'to:Recipient\r\n'</tt>,
 followed by a CRLF header and body boundary, followed by the CRLF of
 the canonicalised missing body.</p>
 <p>If there is no header value for a subset header, then it is treated as
 if the header value were empty.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;To&quot;)
 (b'', b'to:\r\n\r\n\r\n')
 &gt;&gt;&gt; rfc6376_rascal(b&quot;To:&quot;)
 (b'', b'to:\r\n\r\n\r\n')
 </pre>
 <p>RFC 6376 says that canonicalisation should, obviously, come before
 signing.</p>
 <blockquote>
 <p>Canonicalization simply prepares the email for presentation to the
 signing or verification algorithm.</p>
 <p><a class="reference external" href="https://tools.ietf.org/html/rfc6376#section-3.4">https://tools.ietf.org/html/rfc6376#section-3.4</a></p>
 </blockquote>
 <p>But a more subtle consequence of this is that subsetting headers also
 comes after canonicalisation, because subsetting is not part of
 canonicalisation - it's part of signing.</p>
 <p>This is important in our expansion of the RFC 6376 algorithm to cover
 all inputs because e.g. it means that header field names with trailing
 whitespace are treated the same as without that whitespace.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;To   \n&quot;)
 (b'', b'to:\r\n\r\n\r\n')
 </pre>
 <p>But a header name with whitespace inside it is not, unlike in the
 libopendkim algorithm, treated the same as one without whitespace
 inside it, for reasons already discussed in the documentation of RFC
 6376 header canonicalisation step three.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; rfc6376_rascal(b&quot;T o\n&quot;)
 (b'', b'\r\n\r\n')
 </pre>
 </div>
 <div class="section" id="header-subsetting">
 <h2>Header subsetting</h2>
 <p>We use a subset of headers specified in RFC 4871. We use RFC 4871 even
 though it was obsoleted by RFC 6376 because the earlier RFC has a more
 extensive list of headers, and the later RFC says anyway that the
 choice of which headers to include is a matter of choice dependent on
 the signing environment. Since DKIM-ID generation does not even
 include signing, our requirements are somewhat different anyway.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import rfc4871_subset
 </pre>
 <p>Whenever the <tt class="docutils literal">rfc4871_subset</tt> headers are specified as the subset to
 be retained, they should indeed be retained in the output of
 <tt class="docutils literal">rfc6376_rascal</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; for k in rfc4871_subset:
 ...     minimal = k + b&quot;:\r\n\r\n\r\n&quot;
 ...     assert minimal == rfc6376_rascal(minimal)[1], repr(minimal)
 </pre>
 <p>Though the subset is loosely called the &quot;RFC 4871 subset&quot;, there is
 one header in <tt class="docutils literal">rfc4871_subset</tt> which RFC 4871 doesn't recommend:
 DKIM-Signature itself.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; b&quot;dkim-signature&quot; in rfc4871_subset
 True
 </pre>
 <p>We include the DKIM-Signature header field in the subset of retained
 headers because then if the sender has signed their message it ought
 to be reflected in the identifier for that message. It would not have
 made sense for RFC 4817 to recommend that header field for signing
 input, because it is itself the signing output! But if, for example,
 there were an widely implemented RFC specifying a precursor to DKIM
 which was later superseded by DKIM, it is reasonable to assume that
 RFC 4817 would have recommended including the output of the precursor
 in the headers to sign, combining the two approaches. Similarly, since
 DKIM is a precursor to DKIM-ID, DKIM-ID is able to include its output
 as an input.</p>
 </div>
 <div class="section" id="custom-base32-encoding">
 <h2>Custom base32 encoding</h2>
 <p>When we have a canonicalised message with subsetted headers, we take
 the SHA-256 HMAC digest of that message and then encode a truncated
 version of it using pibble32, which is base32 with the alphabet <tt class="docutils literal"><span class="pre">0-9</span>
 <span class="pre">b-d</span> <span class="pre">f-h</span> <span class="pre">j-t</span> <span class="pre">v-z</span></tt>, and remove the padding.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import pibble32
 </pre>
 <p>The alphabet used means that the pibble32 output is always lowercase,
 and never contains the letters a, e, i, or u.</p>
 <p>We need the binary producer from hypothesis.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from hypothesis.strategies import binary
 </pre>
 <p>And then we can test these general properties.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, binary())
 ... def pibble32_general(data) -&gt; None:
 ...     encoded = pibble32(data)
 ...     assert encoded == encoded.lower(), repr(data)
 ...     encoded_set = set(encoded)
 ...     assert not (encoded_set &amp; {&quot;a&quot;, &quot;e&quot;, &quot;i&quot;, &quot;u&quot;}), repr(data)
 </pre>
 <p>There may be padding, but only when the data length is not divisible
 by five.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, binary())
 ... def pibble32_padding(data) -&gt; None:
 ...     encoded = pibble32(data)
 ...     no_padding = not encoded.endswith(&quot;=&quot;)
 ...     divisible_by_five = not (len(data) % 5)
 ...     assert no_padding is divisible_by_five, repr(data)
 </pre>
 <p>We strip the padding on the DKIM-ID since it is fixed at a width of
 128 bits, and the pibble32 output is byte aligned anyway, i.e. the
 decoder accepts no other padding than &quot;======&quot;.</p>
 <p>The length of the pibble32 output will always be the same as when
 base32 encoding it.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, binary())
 ... def pibble32_length(data) -&gt; None:
 ...     from base64 import b32encode
 ...     assert len(pibble32(data)) == len(b32encode(data)), repr(data)
 </pre>
 <p>Here are a some specific examples:</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; pibble32(b&quot;&quot;)
 ''
 &gt;&gt;&gt; pibble32(b&quot;\x00&quot;)
 '00======'
 &gt;&gt;&gt; pibble32(b&quot;\x01&quot;)
 '04======'
 &gt;&gt;&gt; pibble32(b&quot;\x02&quot;)
 '08======'
 &gt;&gt;&gt; pibble32(b&quot;\xff&quot;)
 'zw======'
 &gt;&gt;&gt; pibble32(b&quot;\x00\x00\x00\x00\x00&quot;)
 '00000000'
 &gt;&gt;&gt; pibble32(b&quot;\x00\x00\x01\x00\x00&quot;)
 '00002000'
 &gt;&gt;&gt; pibble32(b&quot;\x00\x00\x02\x00\x00&quot;)
 '00004000'
 &gt;&gt;&gt; pibble32(b&quot;\x00\x00\xff\x00\x00&quot;)
 '000hy000'
 &gt;&gt;&gt; pibble32(b&quot;\x00\x00\xff\xff\x00&quot;)
 '000hzzr0'
 &gt;&gt;&gt; pibble32(b&quot;\xff\xff\xff\xff\xff&quot;)
 'zzzzzzzz'
 </pre>
 <p>When the input length is divisible by five, the output length is
 always 8 / 5 of that length.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, binary())
 ... def pibble32_eight_fifths(data) -&gt; None:
 ...     size = len(data)
 ...     resized = size - (size % 5)
 ...     fives = data[:resized]
 ...     assert len(pibble32(fives)) == (resized * 8 / 5), repr(data)
 </pre>
 <p>And when it's not divisible by five, the length is rounded up to the
 next number divisible by five.</p>
 <p>This means that 160 bits of input is multiplied by 8 / 5, which gives
 256 bits, or 32 bytes, of output.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; 160 * 8 // 5
 256
 &gt;&gt;&gt; 256 // 8
 32
 </pre>
 </div>
 <div class="section" id="dkim-id-generation">
 <h2>DKIM-ID generation</h2>
 <p>Once the rascaled version of the message is obtained, it it hashed and
 then pibble32 encoded to form the DKIM-ID. We want to check that the
 output is pibble32 encoded, at least in that its length is correct and
 its alphabet is a subset of what is expected.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; digit = &quot;0123456789&quot;
 &gt;&gt;&gt; lower = &quot;abcdefghijklmnopqrstuvwxyz&quot;
 &gt;&gt;&gt; pibble32_alphabet = (set(digit) | set(lower)) - {&quot;a&quot;, &quot;e&quot;, &quot;i&quot;, &quot;u&quot;}
 </pre>
 <p>We guard against typos in the alphabet by testing expected properties,
 first by checking the digits.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; assert len(digit) == 10
 &gt;&gt;&gt; assert len(set(digit)) == 10
 &gt;&gt;&gt; assert list(digit) == sorted(list(digit))
 &gt;&gt;&gt; assert digit.isdigit()
 </pre>
 <p>Then the lowercase letters.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; assert len(lower) == 26
 &gt;&gt;&gt; assert len(set(lower)) == 26
 &gt;&gt;&gt; assert list(lower) == sorted(list(lower))
 &gt;&gt;&gt; assert lower.isalpha()
 </pre>
 <p>And then the whole alphabet.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; assert len(pibble32_alphabet) == 32
 </pre>
 <p>Now we can test the DKIM-ID output, from function <tt class="docutils literal">dkim_id</tt>.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import dkim_id
 </pre>
 <p>By checking that its output is consistent with the pibble32 encoding.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def consistent_output(data) -&gt; None:
 ...     dkimid: str = dkim_id(data)
 ...     assert len(dkimid) == 32, repr(data)
 ...     assert not (set(dkimid) - pibble32_alphabet), repr(data)
 </pre>
 <p>We can also check that the unpibbled output is the same as the
 SHA-256 HMAC of the rascal.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; from dkim_id import unpibble32
 &gt;&gt;&gt; from hmac import digest as hmac_digest
 &gt;&gt;&gt; &#64;thesis(given, text_message)
 ... def check_hash_digest(data) -&gt; None:
 ...     rascal: bytes = rfc6376_rascal(data)[1]
 ...     digest_e: bytes = hmac_digest(b&quot;&quot;, rascal, &quot;sha256&quot;)[:160 // 8]
 ...     dkimid: str = dkim_id(data)
 ...     digest_a: bytes = unpibble32(dkimid)
 ...     assert digest_a == digest_e, repr(data)
 </pre>
 <p>And here are some example outputs for some simple messages.</p>
 <pre class="doctest-block">
 &gt;&gt;&gt; dkim_id(b&quot;&quot;)
 '8fgp2do75oqo6qd08vs4p7dpp1gj4vjn'
 &gt;&gt;&gt; dkim_id(b&quot;To: You&quot;)
 'wowc4vvd0ftwm0q24106mldg67komfl0'
 &gt;&gt;&gt; dkim_id(b&quot;To: You\r\n&quot;)
 'wowc4vvd0ftwm0q24106mldg67komfl0'
 &gt;&gt;&gt; dkim_id(b&quot;To: You\r\nFrom: Me&quot;)
 'kf7f6zxt7w7k1h1lhxmg9mxngkl5vbcm'
 &gt;&gt;&gt; dkim_id(b&quot;To: You\r\nFrom: Me\r\n\r\nBody&quot;)
 'xx5nf02ptvv92tt73kg7n7o9o5t4ngvd'
 &gt;&gt;&gt; dkim_id(b&quot;To: You\r\nFrom: Me\r\n\r\nBody\r\n&quot;)
 'b752nf3njqs9r5qwmrkh3n2s24y7y33g'
 </pre>
 </div>
 </div>
 </div>
 </body>
 </html>