| <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="generator" content="rustdoc"><meta name="description" content="github crates-io docs-rs"><meta name="keywords" content="rust, rustlang, rust-lang, unicode_ident"><title>unicode_ident - Rust</title><link rel="preload" as="font" type="font/woff2" crossorigin href="../SourceSerif4-Regular.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../FiraSans-Regular.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../FiraSans-Medium.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../SourceCodePro-Regular.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../SourceSerif4-Bold.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../SourceCodePro-Semibold.ttf.woff2"><link rel="stylesheet" href="../normalize.css"><link rel="stylesheet" href="../rustdoc.css" id="mainThemeStyle"><link rel="stylesheet" href="../ayu.css" disabled><link rel="stylesheet" href="../dark.css" disabled><link rel="stylesheet" href="../light.css" id="themeStyle"><script id="default-settings" ></script><script src="../storage.js"></script><script defer src="../crates.js"></script><script defer src="../main.js"></script><noscript><link rel="stylesheet" href="../noscript.css"></noscript><link rel="alternate icon" type="image/png" href="../favicon-16x16.png"><link rel="alternate icon" type="image/png" href="../favicon-32x32.png"><link rel="icon" type="image/svg+xml" href="../favicon.svg"></head><body class="rustdoc mod crate"><!--[if lte IE 11]><div class="warning">This old browser is unsupported and will most likely display funky things.</div><![endif]--><nav class="mobile-topbar"><button class="sidebar-menu-toggle">☰</button><a class="sidebar-logo" href="../unicode_ident/index.html"><div class="logo-container"><img class="rust-logo" src="../rust-logo.svg" alt="logo"></div></a><h2></h2></nav><nav class="sidebar"><a class="sidebar-logo" href="../unicode_ident/index.html"><div class="logo-container"><img class="rust-logo" src="../rust-logo.svg" alt="logo"></div></a><h2 class="location"><a href="#">Crate unicode_ident</a></h2><div class="sidebar-elems"><ul class="block"><li class="version">Version 1.0.9</li><li><a id="all-types" href="all.html">All Items</a></li></ul><section><ul class="block"><li><a href="#functions">Functions</a></li></ul></section></div></nav><main><div class="width-limiter"><nav class="sub"><form class="search-form"><div class="search-container"><span></span><input class="search-input" name="search" autocomplete="off" spellcheck="false" placeholder="Click or press ‘S’ to search, ‘?’ for more options…" type="search"><div id="help-button" title="help" tabindex="-1"><a href="../help.html">?</a></div><div id="settings-menu" tabindex="-1"><a href="../settings.html" title="settings"><img width="22" height="22" alt="Change settings" src="../wheel.svg"></a></div></div></form></nav><section id="main-content" class="content"><div class="main-heading"><h1 class="fqn">Crate <a class="mod" href="#">unicode_ident</a><button id="copy-path" onclick="copy_path(this)" title="Copy item path to clipboard"><img src="../clipboard.svg" width="19" height="18" alt="Copy item path"></button></h1><span class="out-of-band"><a class="srclink" href="../src/unicode_ident/lib.rs.html#1-269">source</a> · <a id="toggle-all-docs" href="javascript:void(0)" title="collapse all docs">[<span class="inner">−</span>]</a></span></div><details class="rustdoc-toggle top-doc" open><summary class="hideme"><span>Expand description</span></summary><div class="docblock"><p><a href="https://github.com/dtolnay/unicode-ident"><img src="https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github" alt="github" /></a> <a href="https://crates.io/crates/unicode-ident"><img src="https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust" alt="crates-io" /></a> <a href="https://docs.rs/unicode-ident"><img src="https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs" alt="docs-rs" /></a></p> |
| <br> |
| <p>Implementation of <a href="https://www.unicode.org/reports/tr31/">Unicode Standard Annex #31</a> for determining which |
| <code>char</code> values are valid in programming language identifiers.</p> |
| <p>This crate is a better optimized implementation of the older <code>unicode-xid</code> |
| crate. This crate uses less static storage, and is able to classify both |
| ASCII and non-ASCII codepoints with better performance, 2–10× |
| faster than <code>unicode-xid</code>.</p> |
| <br> |
| <h3 id="comparison-of-performance"><a href="#comparison-of-performance">Comparison of performance</a></h3> |
| <p>The following table shows a comparison between five Unicode identifier |
| implementations.</p> |
| <ul> |
| <li><code>unicode-ident</code> is this crate;</li> |
| <li><a href="https://github.com/unicode-rs/unicode-xid"><code>unicode-xid</code></a> is a widely used crate run by the “unicode-rs” org;</li> |
| <li><code>ucd-trie</code> and <code>fst</code> are two data structures supported by the |
| <a href="https://github.com/BurntSushi/ucd-generate"><code>ucd-generate</code></a> tool;</li> |
| <li><a href="https://github.com/RoaringBitmap/roaring-rs"><code>roaring</code></a> is a Rust implementation of Roaring bitmap.</li> |
| </ul> |
| <p>The <em>static storage</em> column shows the total size of <code>static</code> tables that the |
| crate bakes into your binary, measured in 1000s of bytes.</p> |
| <p>The remaining columns show the <strong>cost per call</strong> to evaluate whether a |
| single <code>char</code> has the XID_Start or XID_Continue Unicode property, |
| comparing across different ratios of ASCII to non-ASCII codepoints in the |
| input data.</p> |
| <div><table><thead><tr><th></th><th>static storage</th><th>0% nonascii</th><th>1%</th><th>10%</th><th>100% nonascii</th></tr></thead><tbody> |
| <tr><td><strong><code>unicode-ident</code></strong></td><td>9.75 K</td><td>0.96 ns</td><td>0.95 ns</td><td>1.09 ns</td><td>1.55 ns</td></tr> |
| <tr><td><strong><code>unicode-xid</code></strong></td><td>11.34 K</td><td>1.88 ns</td><td>2.14 ns</td><td>3.48 ns</td><td>15.63 ns</td></tr> |
| <tr><td><strong><code>ucd-trie</code></strong></td><td>9.95 K</td><td>1.29 ns</td><td>1.28 ns</td><td>1.36 ns</td><td>2.15 ns</td></tr> |
| <tr><td><strong><code>fst</code></strong></td><td>133 K</td><td>55.1 ns</td><td>54.9 ns</td><td>53.2 ns</td><td>28.5 ns</td></tr> |
| <tr><td><strong><code>roaring</code></strong></td><td>66.1 K</td><td>2.78 ns</td><td>3.09 ns</td><td>3.37 ns</td><td>4.70 ns</td></tr> |
| </tbody></table> |
| </div> |
| <p>Source code for the benchmark is provided in the <em>bench</em> directory of this |
| repo and may be repeated by running <code>cargo criterion</code>.</p> |
| <br> |
| <h3 id="comparison-of-data-structures"><a href="#comparison-of-data-structures">Comparison of data structures</a></h3><h5 id="unicode-xid"><a href="#unicode-xid">unicode-xid</a></h5> |
| <p>They use a sorted array of character ranges, and do a binary search to look |
| up whether a given character lands inside one of those ranges.</p> |
| |
| <div class="example-wrap"><pre class="rust rust-example-rendered"><code><span class="kw">static </span>XID_Continue_table: [(char, char); <span class="number">763</span>] = [ |
| (<span class="string">'\u{30}'</span>, <span class="string">'\u{39}'</span>), <span class="comment">// 0-9 |
| </span>(<span class="string">'\u{41}'</span>, <span class="string">'\u{5a}'</span>), <span class="comment">// A-Z |
| </span>… |
| (<span class="string">'\u{e0100}'</span>, <span class="string">'\u{e01ef}'</span>), |
| ];</code></pre></div> |
| <p>The static storage used by this data structure scales with the number of |
| contiguous ranges of identifier codepoints in Unicode. Every table entry |
| consumes 8 bytes, because it consists of a pair of 32-bit <code>char</code> values.</p> |
| <p>In some ranges of the Unicode codepoint space, this is quite a sparse |
| representation – there are some ranges where tens of thousands of |
| adjacent codepoints are all valid identifier characters. In other places, |
| the representation is quite inefficient. A characater like <code>µ</code> (U+00B5) |
| which is surrounded by non-identifier codepoints consumes 64 bits in the |
| table, while it would be just 1 bit in a dense bitmap.</p> |
| <p>On a system with 64-byte cache lines, binary searching the table touches 7 |
| cache lines on average. Each cache line fits only 8 table entries. |
| Additionally, the branching performed during the binary search is probably |
| mostly unpredictable to the branch predictor.</p> |
| <p>Overall, the crate ends up being about 10× slower on non-ASCII input |
| compared to the fastest crate.</p> |
| <p>A potential improvement would be to pack the table entries more compactly. |
| Rust’s <code>char</code> type is a 21-bit integer padded to 32 bits, which means every |
| table entry is holding 22 bits of wasted space, adding up to 3.9 K. They |
| could instead fit every table entry into 6 bytes, leaving out some of the |
| padding, for a 25% improvement in space used. With some cleverness it may be |
| possible to fit in 5 bytes or even 4 bytes by storing a low char and an |
| extent, instead of low char and high char. I don’t expect that performance |
| would improve much but this could be the most efficient for space across all |
| the libraries, needing only about 7 K to store.</p> |
| <h5 id="ucd-trie"><a href="#ucd-trie">ucd-trie</a></h5> |
| <p>Their data structure is a compressed trie set specifically tailored for |
| Unicode codepoints. The design is credited to Raph Levien in |
| <a href="https://github.com/rust-lang/rust/pull/33098">rust-lang/rust#33098</a>.</p> |
| |
| <div class="example-wrap"><pre class="rust rust-example-rendered"><code><span class="kw">pub struct </span>TrieSet { |
| tree1_level1: <span class="kw-2">&</span><span class="lifetime">'static </span>[u64; <span class="number">32</span>], |
| tree2_level1: <span class="kw-2">&</span><span class="lifetime">'static </span>[u8; <span class="number">992</span>], |
| tree2_level2: <span class="kw-2">&</span><span class="lifetime">'static </span>[u64], |
| tree3_level1: <span class="kw-2">&</span><span class="lifetime">'static </span>[u8; <span class="number">256</span>], |
| tree3_level2: <span class="kw-2">&</span><span class="lifetime">'static </span>[u8], |
| tree3_level3: <span class="kw-2">&</span><span class="lifetime">'static </span>[u64], |
| }</code></pre></div> |
| <p>It represents codepoint sets using a trie to achieve prefix compression. The |
| final states of the trie are embedded in leaves or “chunks”, where each |
| chunk is a 64-bit integer. Each bit position of the integer corresponds to |
| whether a particular codepoint is in the set or not. These chunks are not |
| just a compact representation of the final states of the trie, but are also |
| a form of suffix compression. In particular, if multiple ranges of 64 |
| contiguous codepoints have the same Unicode properties, then they all map to |
| the same chunk in the final level of the trie.</p> |
| <p>Being tailored for Unicode codepoints, this trie is partitioned into three |
| disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints |
| [0, 0x800), the second [0x800, 0x10000) and the third [0x10000, |
| 0x110000). These partitions conveniently correspond to the space of 1 or 2 |
| byte UTF-8 encoded codepoints, 3 byte UTF-8 encoded codepoints and 4 byte |
| UTF-8 encoded codepoints, respectively.</p> |
| <p>Lookups in this data structure are significantly more efficient than binary |
| search. A lookup touches either 1, 2, or 3 cache lines based on which of the |
| trie partitions is being accessed.</p> |
| <p>One possible performance improvement would be for this crate to expose a way |
| to query based on a UTF-8 encoded string, returning the Unicode property |
| corresponding to the first character in the string. Without such an API, the |
| caller is required to tokenize their UTF-8 encoded input data into <code>char</code>, |
| hand the <code>char</code> into <code>ucd-trie</code>, only for <code>ucd-trie</code> to undo that work by |
| converting back into the variable-length representation for trie traversal.</p> |
| <h5 id="fst"><a href="#fst">fst</a></h5> |
| <p>Uses a <a href="https://github.com/BurntSushi/fst">finite state transducer</a>. This representation is built into |
| <a href="https://github.com/BurntSushi/ucd-generate">ucd-generate</a> but I am not aware of any advantage over the <code>ucd-trie</code> |
| representation. In particular <code>ucd-trie</code> is optimized for storing Unicode |
| properties while <code>fst</code> is not.</p> |
| <p>As far as I can tell, the main thing that causes <code>fst</code> to have large size |
| and slow lookups for this use case relative to <code>ucd-trie</code> is that it does |
| not specialize for the fact that only 21 of the 32 bits in a <code>char</code> are |
| meaningful. There are some dense arrays in the structure with large ranges |
| that could never possibly be used.</p> |
| <h5 id="roaring"><a href="#roaring">roaring</a></h5> |
| <p>This crate is a pure-Rust implementation of <a href="https://roaringbitmap.org/about/">Roaring Bitmap</a>, a data |
| structure designed for storing sets of 32-bit unsigned integers.</p> |
| <p>Roaring bitmaps are compressed bitmaps which tend to outperform conventional |
| compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can |
| be hundreds of times faster and they often offer significantly better |
| compression.</p> |
| <p>In this use case the performance was reasonably competitive but still |
| substantially slower than the Unicode-optimized crates. Meanwhile the |
| compression was significantly worse, requiring 6× as much storage for |
| the data structure.</p> |
| <p>I also benchmarked the <a href="https://crates.io/crates/croaring"><code>croaring</code></a> crate which is an FFI wrapper around the |
| C reference implementation of Roaring Bitmap. This crate was consistently |
| about 15% slower than pure-Rust <code>roaring</code>, which could just be FFI overhead. |
| I did not investigate further.</p> |
| <h5 id="unicode-ident"><a href="#unicode-ident">unicode-ident</a></h5> |
| <p>This crate is most similar to the <code>ucd-trie</code> library, in that it’s based on |
| bitmaps stored in the leafs of a trie representation, achieving both prefix |
| compression and suffix compression.</p> |
| <p>The key differences are:</p> |
| <ul> |
| <li>Uses a single 2-level trie, rather than 3 disjoint partitions of different |
| depth each.</li> |
| <li>Uses significantly larger chunks: 512 bits rather than 64 bits.</li> |
| <li>Compresses the XID_Start and XID_Continue properties together |
| simultaneously, rather than duplicating identical trie leaf chunks across |
| the two.</li> |
| </ul> |
| <p>The following diagram show the XID_Start and XID_Continue Unicode boolean |
| properties in uncompressed form, in row-major order:</p> |
| <table> |
| <tr><th>XID_Start</th><th>XID_Continue</th></tr> |
| <tr> |
| <td><img alt="XID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td> |
| <td><img alt="XID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td> |
| </tr> |
| </table> |
| <p>Uncompressed, these would take 140 K to store, which is beyond what would be |
| reasonable. However, as you can see there is a large degree of similarity |
| between the two bitmaps and across the rows, which lends well to |
| compression.</p> |
| <p>This crate stores one 512-bit “row” of the above bitmaps in the leaf level |
| of a trie, and a single additional level to index into the leafs. It turns |
| out there are 124 unique 512-bit chunks across the two bitmaps so 7 bits are |
| sufficient to index them.</p> |
| <p>The chunk size of 512 bits is selected as the size that minimizes the total |
| size of the data structure. A smaller chunk, like 256 or 128 bits, would |
| achieve better deduplication but require a larger index. A larger chunk |
| would increase redundancy in the leaf bitmaps. 512 bit chunks are the |
| optimum for total size of the index plus leaf bitmaps.</p> |
| <p>In fact since there are only 124 unique chunks, we can use an 8-bit index |
| with a spare bit to index at the half-chunk level. This achieves an |
| additional 8.5% compression by eliminating redundancies between the second |
| half of any chunk and the first half of any other chunk. Note that this is |
| not the same as using chunks which are half the size, because it does not |
| necessitate raising the size of the trie’s first level.</p> |
| <p>In contrast to binary search or the <code>ucd-trie</code> crate, performing lookups in |
| this data structure is straight-line code with no need for branching.</p> |
| </div></details><h2 id="functions" class="small-section-header"><a href="#functions">Functions</a></h2><div class="item-table"><div class="item-row"><div class="item-left module-item"><a class="fn" href="fn.is_xid_continue.html" title="unicode_ident::is_xid_continue fn">is_xid_continue</a></div></div><div class="item-row"><div class="item-left module-item"><a class="fn" href="fn.is_xid_start.html" title="unicode_ident::is_xid_start fn">is_xid_start</a></div></div></div></section></div></main><div id="rustdoc-vars" data-root-path="../" data-current-crate="unicode_ident" data-themes="ayu,dark,light" data-resource-suffix="" data-rustdoc-version="1.66.0-nightly (5c8bff74b 2022-10-21)" ></div></body></html> |