blob: eae61584fe678f3b399010f41102950bb8cbdfa0 [file] [log] [blame]
<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="generator" content="rustdoc"><meta name="description" content="Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes."><meta name="keywords" content="rust, rustlang, rust-lang, utf8"><title>regex_syntax::utf8 - Rust</title><link rel="preload" as="font" type="font/woff2" crossorigin href="../../SourceSerif4-Regular.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../FiraSans-Regular.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../FiraSans-Medium.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../SourceCodePro-Regular.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../SourceSerif4-Bold.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../SourceCodePro-Semibold.ttf.woff2"><link rel="stylesheet" href="../../normalize.css"><link rel="stylesheet" href="../../rustdoc.css" id="mainThemeStyle"><link rel="stylesheet" href="../../ayu.css" disabled><link rel="stylesheet" href="../../dark.css" disabled><link rel="stylesheet" href="../../light.css" id="themeStyle"><script id="default-settings" ></script><script src="../../storage.js"></script><script defer src="../../main.js"></script><noscript><link rel="stylesheet" href="../../noscript.css"></noscript><link rel="alternate icon" type="image/png" href="../../favicon-16x16.png"><link rel="alternate icon" type="image/png" href="../../favicon-32x32.png"><link rel="icon" type="image/svg+xml" href="../../favicon.svg"></head><body class="rustdoc mod"><!--[if lte IE 11]><div class="warning">This old browser is unsupported and will most likely display funky things.</div><![endif]--><nav class="mobile-topbar"><button class="sidebar-menu-toggle">&#9776;</button><a class="sidebar-logo" href="../../regex_syntax/index.html"><div class="logo-container"><img class="rust-logo" src="../../rust-logo.svg" alt="logo"></div></a><h2></h2></nav><nav class="sidebar"><a class="sidebar-logo" href="../../regex_syntax/index.html"><div class="logo-container"><img class="rust-logo" src="../../rust-logo.svg" alt="logo"></div></a><h2 class="location"><a href="#">Module utf8</a></h2><div class="sidebar-elems"><section><ul class="block"><li><a href="#structs">Structs</a></li><li><a href="#enums">Enums</a></li></ul></section></div></nav><main><div class="width-limiter"><nav class="sub"><form class="search-form"><div class="search-container"><span></span><input class="search-input" name="search" autocomplete="off" spellcheck="false" placeholder="Click or press ‘S’ to search, ‘?’ for more options…" type="search"><div id="help-button" title="help" tabindex="-1"><a href="../../help.html">?</a></div><div id="settings-menu" tabindex="-1"><a href="../../settings.html" title="settings"><img width="22" height="22" alt="Change settings" src="../../wheel.svg"></a></div></div></form></nav><section id="main-content" class="content"><div class="main-heading"><h1 class="fqn">Module <a href="../index.html">regex_syntax</a>::<wbr><a class="mod" href="#">utf8</a><button id="copy-path" onclick="copy_path(this)" title="Copy item path to clipboard"><img src="../../clipboard.svg" width="19" height="18" alt="Copy item path"></button></h1><span class="out-of-band"><a class="srclink" href="../../src/regex_syntax/utf8.rs.html#1-592">source</a> · <a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs">[<span class="inner">&#x2212;</span>]</a></span></div><details class="rustdoc-toggle top-doc" open><summary class="hideme"><span>Expand description</span></summary><div class="docblock"><p>Converts ranges of Unicode scalar values to equivalent ranges of UTF-8 bytes.</p>
<p>This is sub-module is useful for constructing byte based automatons that need
to embed UTF-8 decoding. The most common use of this module is in conjunction
with the <a href="../hir/struct.ClassUnicodeRange.html"><code>hir::ClassUnicodeRange</code></a> type.</p>
<p>See the documentation on the <code>Utf8Sequences</code> iterator for more details and
an example.</p>
<h2 id="wait-what-is-this"><a href="#wait-what-is-this">Wait, what is this?</a></h2>
<p>This is simplest to explain with an example. Let’s say you wanted to test
whether a particular byte sequence was a Cyrillic character. One possible
scalar value range is <code>[0400-04FF]</code>. The set of allowed bytes for this
range can be expressed as a sequence of byte ranges:</p>
<div class="example-wrap"><pre class="language-text"><code>[D0-D3][80-BF]</code></pre></div>
<p>This is simple enough: simply encode the boundaries, <code>0400</code> encodes to
<code>D0 80</code> and <code>04FF</code> encodes to <code>D3 BF</code>, and create ranges from each
corresponding pair of bytes: <code>D0</code> to <code>D3</code> and <code>80</code> to <code>BF</code>.</p>
<p>However, what if you wanted to add the Cyrillic Supplementary characters to
your range? Your range might then become <code>[0400-052F]</code>. The same procedure
as above doesn’t quite work because <code>052F</code> encodes to <code>D4 AF</code>. The byte ranges
you’d get from the previous transformation would be <code>[D0-D4][80-AF]</code>. However,
this isn’t quite correct because this range doesn’t capture many characters,
for example, <code>04FF</code> (because its last byte, <code>BF</code> isn’t in the range <code>80-AF</code>).</p>
<p>Instead, you need multiple sequences of byte ranges:</p>
<div class="example-wrap"><pre class="language-text"><code>[D0-D3][80-BF] # matches codepoints 0400-04FF
[D4][80-AF] # matches codepoints 0500-052F</code></pre></div>
<p>This gets even more complicated if you want bigger ranges, particularly if
they naively contain surrogate codepoints. For example, the sequence of byte
ranges for the basic multilingual plane (<code>[0000-FFFF]</code>) look like this:</p>
<div class="example-wrap"><pre class="language-text"><code>[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]</code></pre></div>
<p>Note that the byte ranges above will <em>not</em> match any erroneous encoding of
UTF-8, including encodings of surrogate codepoints.</p>
<p>And, of course, for all of Unicode (<code>[000000-10FFFF]</code>):</p>
<div class="example-wrap"><pre class="language-text"><code>[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]</code></pre></div>
<p>This module automates the process of creating these byte ranges from ranges of
Unicode scalar values.</p>
<h2 id="lineage"><a href="#lineage">Lineage</a></h2>
<p>I got the idea and general implementation strategy from Russ Cox in his
<a href="https://web.archive.org/web/20160404141123/https://swtch.com/~rsc/regexp/regexp3.html">article on regexps</a> and RE2.
Russ Cox got it from Ken Thompson’s <code>grep</code> (no source, folk lore?).
I also got the idea from
<a href="https://github.com/apache/lucene-solr/blob/ae93f4e7ac6a3908046391de35d4f50a0d3c59ca/lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java">Lucene</a>,
which uses it for executing automata on their term index.</p>
</div></details><h2 id="structs" class="small-section-header"><a href="#structs">Structs</a></h2><div class="item-table"><div class="item-row"><div class="item-left module-item"><a class="struct" href="struct.Utf8Range.html" title="regex_syntax::utf8::Utf8Range struct">Utf8Range</a></div><div class="item-right docblock-short">A single inclusive range of UTF-8 bytes.</div></div><div class="item-row"><div class="item-left module-item"><a class="struct" href="struct.Utf8Sequences.html" title="regex_syntax::utf8::Utf8Sequences struct">Utf8Sequences</a></div><div class="item-right docblock-short">An iterator over ranges of matching UTF-8 byte sequences.</div></div></div><h2 id="enums" class="small-section-header"><a href="#enums">Enums</a></h2><div class="item-table"><div class="item-row"><div class="item-left module-item"><a class="enum" href="enum.Utf8Sequence.html" title="regex_syntax::utf8::Utf8Sequence enum">Utf8Sequence</a></div><div class="item-right docblock-short">Utf8Sequence represents a sequence of byte ranges.</div></div></div></section></div></main><div id="rustdoc-vars" data-root-path="../../" data-current-crate="regex_syntax" data-themes="ayu,dark,light" data-resource-suffix="" data-rustdoc-version="1.66.0-nightly (5c8bff74b 2022-10-21)" ></div></body></html>