blob: dcfe705e3a4ff3ee52b7322aef84188b40d987d5 [file] [log] [blame]
.. Licensed to the Apache Software Foundation (ASF) under one or more contributor license
agreements. See the NOTICE file distributed with this work for additional information regarding
copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with the License. You may obtain
a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
or implied. See the License for the specific language governing permissions and limitations
under the License.
.. include:: ../common-defs.rst
.. default-domain:: cpp
.. highlight:: cpp
.. |TV| replace:: :code:`TextView`
.. |SV| replace:: :code:`std::string_view`.
.. _string-view: https://en.cppreference.com/w/cpp/string/basic_string_view
********
TextView
********
Synopsis
********
:code:`#include "swoc/TextView.h"`
.. class:: TextView
:libswoc:`Reference documentation <swoc::TextView>`.
This class acts as a view of memory allocated / owned elsewhere and treated as a sequence of 8 bit
characters. It is in effect a pointer and should be treated as such (e.g. care must be taken to
avoid dangling references by knowing where the memory really is). The purpose is to provide string
manipulation that is safer than raw pointers and much faster than duplicating strings.
Usage
*****
|TV| is a subclass of `std::string_view <string-view>`_ and inherits all of its methods. The
additional functionality of |TV| is for easy string manipulation, with an emphasis on fast parsing
of string data. As noted, an instance of |TV| is a pointer and needs to be handled as such. It does
not own the memory and therefore, like a pointer, care must be taken that the memory is not
deallocated while the |TV| still references it. The advantage of this is creating new views and
modifying existing ones is very cheap.
Any place that passes a :code:`char *` and a size is an excellent candidate for using a |TV|. Code
that uses functions such as :code:`strtok` or tracks pointers and offsets internally is an excellent
candidate for using |TV| instead.
Because |TV| is a subclass of :code:`std::string_view` it can be unclear which is a better choice.
In many cases it doesn't matter, since because of this relationship converting between the types is
at most as expensive as a copy of the same type, and in cases of constant reference, can be free. In
general if the string is treated as a block of data, :code:`std::string_view` is a better choice. If
the contents of the string are to be examined / parsed then |TV| is better. For example, if the
string is used simply as a key or a hash source, use :code:`std::string_view`. Contrariwise if the
string may contain substrings of interest such as key / value pairs, then use a |TV|. Although I do
sometimes use |TV| because of the lack of support for instance reuse in |SV| - e.g. no
:code:`assign` or :code:`clear` methods.
When passing |TV| as an argument, it is very debatable whether passing by value or passing by
reference is more efficient. The appropriate conclusion is it's not likely to matter in production
code. My personal heuristic is whether the function will modify the value. If so, passing by value
saves a copy to a local variable therefore it should be passed by value. If the function simply
passes the |TV| on to other functions, then pass by constant reference. This distinction is
irrelevant to the caller, the same code at the call site will work in either case.
As noted, |TV| is designed as a pointer style class. Therefore it has an increment operator which is
equivalent to :code:`std::string_view::remove_prefix`. |TV| also has a dereference operator, which
acts the same way as on a pointer. The difference is the view knows where the end of the view is.
This provides a comfortably familiar way of iterating through a view, the main difference being
checking the view itself rather than a dereference of it (like a C-style string) or a range limit.
E.g. the code to write a simple hash function [#]_ could be
.. code-block:: cpp
void hasher(TextView v) {
size_t hash = 0;
while (v) {
hash = hash * 13 + * v ++;
}
return hash;
}
Because |TV| inherits from :code:`std::string_view` it can also be used as a container for range
:code:`for` loops.
.. code-block:: cpp
void hasher(TextView const& v) {
size_t hash = 0;
for (char c : v) hash = hash * 13 + c;
return hash;
}
The standard functions :code:`strcmp`, :code:`memcmp`, code:`memcpy`, and :code:`strcasecmp` are
overloaded for |TV| so that a |TV| can be used as if it were a C-style string. The size is is taken
from the |TV| and doesn't need to be passed in explicitly.
Basic Operations
================
|TV| is essentially a collection of operations which have been found to be common and useful in
manipulating contiguous blocks of text.
Construction
------------
Constructing a view means creating a view from another object which owns the memory (for creating
views from other views see `Extraction`_). This can be a :code:`char const*` pointer and size, two
pointers, a literal string, a :code:`std::string` or a :code:`std::string_view` although in the last
case there is presumably yet another object that actually owns the memory. All of these constructors
require only the equivalent of two assignment statements. The one thing to be careful of is if a
literal string or C-string is used, the resulting |TV| will drop the terminating nul character from
the view. This is almost always the correct behavior, but if it isn't an explicit size can be used.
A |TV| can be constructed from a null :code:`char const*` pointer or a straight :code:`nullptr`. This
will construct an empty |TV| identical to one default constructed.
|TV| supports a generic constructor that will accept any class that provides the :code:`data` and
:code:`size` methods that return values convertible to :code:`char const *` and :code:`size_t`.
This enables greater interoperability with other libraries, as any well written C++ library with
its own string class will have these methods implemented sensibly.
Searching
---------
Because |TV| is a subclass of :code:`std::string_view` all of its search method work on a |TV|. The
only search methods provided beyond those are :libswoc:`TextView::find_if` and
:libswoc:`TextView::rfind_if` which search the view by a predicate. The predicate takes a single
:code:`char` argument and returns a :code:`bool`. The search terminates on the first character for
which the predicate returns :code:`true`.
Extraction
----------
Extraction is creating a new view from an existing view. Because views cannot in general be expanded
new views will be sub-sequences of existing views. This is the primary utility of a |TV|. As
noted in the `general description <Description>`_ |TV| supports copying or removing prefixes and
suffixes of the view. All of this is possible using the underlying :code:`std::string_view_substr`
but this is frequently much clumsier. The development of |TV| was driven to a large extent by the
desire to make such code much more compact and expressive, while being at least as safe. In particular
extraction methods on |TV| do useful and well defined things when given out of bounds arguments.
This is quite handy when extracting tokens based on separator characters.
The primary distinction is how a character in the view is selected.
* By index, an offset in to the view. These have plain names, such as :libswoc:`TextView::prefix`.
* By character comparison, either a single character or set of characters which is matched against a single
character in the view. These are suffixed with "at" such as :libswoc:`TextView::prefix_at`.
* By predicate, a function that takes a single character argument and returns a bool to indicate a match.
These are suffixed with "if", such as :libswoc:`TextView::prefix_if`.
A secondary distinction is what is done to the view by the methods.
* The base methods make a new view without modifying the existing view.
* The "split..." methods remove the corresponding part of the view and return it. The selected character
is discarded and not left in either the returned view nor the source view. If the selected character
is not in the view, an empty view is returned and the source view is not modified.
* The "take..." methods remove the corresponding part of the view and return it. The selected character
is discarded and not left in either the returned view nor the source view. If the selected character
is not in the view, the entire view is returned and the source view is cleared.
* The "clip..." methods remove the corresponding part of the view and return it. Only those characters
are removed - in contrast to "split..." and "take..." which drop a (presumed) separator. If the
first character doesn't match, the view is not modified and an empty view is returned. These are
very similar to the "trim..." methods described below, the difference what part of the original
view is returned.
.. _`std::string_view::remove_prefix`: https://en.cppreference.com/w/cpp/string/basic_string_view/remove_prefix
.. _`std::string_view::remove_suffix`: https://en.cppreference.com/w/cpp/string/basic_string_view/remove_suffix
This is a table of the affix oriented methods, grouped by the properties of the methods. "Bounded"
indicates whether the operation requires the target character, however specified, to be within the
bounds of the view. A bounded method does nothing if the target character is not in the view. On
this note, the :code:`remove_prefix` and :code:`remove_suffix` are implemented differently in |TV|
compared to :code:`std::string_view`. Rather than being undefined, the methods will clear the view
if the size specified is larger than the contents of the view.
+-----------------+--------+---------+------------------------------------------+
| Operation | Affix | Bounded | Method |
+=================+========+=========+==========================================+
| Copy | Prefix | No | :libswoc:`TextView::prefix` |
| + +---------+------------------------------------------+
| | | Yes | :libswoc:`TextView::prefix_at` |
| + + +------------------------------------------+
| | | | :libswoc:`TextView::prefix_if` |
| +--------+---------+------------------------------------------+
| | Suffix | No | :libswoc:`TextView::suffix` |
| + +---------+------------------------------------------+
| | | Yes | :libswoc:`TextView::suffix_at` |
| + + +------------------------------------------+
| | | | :libswoc:`TextView::suffix_if` |
+-----------------+--------+---------+------------------------------------------+
| Modify | Prefix | No | `std::string_view::remove_prefix`_ |
| | +---------+------------------------------------------+
| | | Yes | :libswoc:`TextView::remove_prefix_at` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::remove_prefix_if` |
| +--------+---------+------------------------------------------+
| | Suffix | No | `std::string_view::remove_suffix`_ |
| | +---------+------------------------------------------+
| | | Yes | :libswoc:`TextView::remove_suffix_at` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::remove_suffix_if` |
+-----------------+--------+---------+------------------------------------------+
| Modify and Copy | Prefix | Yes | :libswoc:`TextView::split_prefix` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::split_prefix_at` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::split_prefix_if` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::clip_prefix_of` |
| | +---------+------------------------------------------+
| | | No | :libswoc:`TextView::take_prefix` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::take_prefix_at` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::take_prefix_if` |
| +--------+---------+------------------------------------------+
| | Suffix | Yes | :libswoc:`TextView::split_suffix` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::split_suffix_at` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::split_suffix_if` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::clip_suffix_of` |
| | +---------+------------------------------------------+
| | | No | :libswoc:`TextView::take_suffix` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::take_suffix_at` |
| | + +------------------------------------------+
| | | | :libswoc:`TextView::take_suffix_if` |
+-----------------+--------+---------+------------------------------------------+
Other
-----
The comparison operators for |TV| are inherited from :code:`std::string_view` and therefore use the
content of the view to determine the relationship.
|TV| provides a collection of "trim" methods which remove leading or trailing characters. These have
similar suffixes with the same meaning as the affix methods. This can be done for a single
character, one of a set of characters, or a predicate. The most common use is with the predicate
:code:`isspace` which removes leading and/or trailing whitespace as needed.
While the plethora of view methods can seem a bit much, all of these are useful in different
situations and exist because of such use cases.
Numeric conversions are provided, in signed (:libswoc:`svtoi`), unsigned (:libswoc:`svtou`), and
floating point (:libswoc:`svtod`) flavors. The integer functions are designed to be "complete" in
the sense that any other string to integer conversion can be mapped to one of these functions. The
floating point conversion is sufficiently accurate - it will return a floating point value that is
within one epsilon of the exact value, but not always the closest. This is fine for general use such
as in configurations, but possibly not quite enough for high precision work.
The standard functions :code:`strcmp`, :code:`strcasecmp`, and :code:`memcmp` are overloaded when
at least of the parameters is a |TV|. The length is taken from the view, rather than being an explicit
parameter as with :code:`strncasecmp`.
When no other useful result can be returned, |TV| methods return a reference to the instance. This
makes chaining methods easy. If a list consisted of colon separated elements, each of which was
of the form "A.B.old" and just the "A.B" part was needed, sans leading white space:
.. literalinclude:: ../../unit_tests/ex_TextView.cc
:lines: 223-227
Parsing with TextView
=====================
Time for some examples demonstrating string parsing using |TV|. There are two major reasons for
developing |TV| parsing.
The first was to minimize the need to allocate memory to hold intermediate results. For this reason, the normal
style of use is a streaming / incremental one, where tokens are extracted from a source one by one
and placed in |TV| instances, with the orignal source |TV| being reduced by each extraction until
it is empty.
The second was to minimize cut and paste coding. Typical C or C++ parsing logic consists mostly of
very generic code to handle pointer and size updates. The point of |TV| is to automate all of that
so the resulting code is focused entirely on the parsing logic, not boiler plate string or view manipulation.
It is a common occurrence to not get such code exactly correct leading to hard to track bugs. Use
of |TV| eliminates those problems.
The minimization of exceptions on sizes beyond the view boundaries was done primarily to help
parsing. It noticeably simplifies the logic if excessive removal or advancement yields an empty
view rather than an exception.
CSV Example
-----------
For example, assume :arg:`value` contains a null terminated string which is expected to be tokens
separated by commas. To handle this generically a function could be written which takes a token
handler and calls it for each token.
.. literalinclude:: ../../unit_tests/ex_TextView.cc
:start-after: doc csv start
:end-before: doc csv end
If :arg:`value` was :literal:`"bob ,dave, sam"` then :arg:`token` would be successively
:literal:`bob`, :literal:`dave`, :literal:`sam`. Each loop iteration is guaranteed to remove text
from :arg:`src` making the loop eventually terminate when all text has been removed, because an
empty :code:`TextView` is :code:`false`. This is a recommended style because :code:`TextView` instances
are very cheap to copy. This is essentially the same as having a current pointer and and end pointer
and checking for :code:`current >= end` except :code:`TextView` does all the work, leading to
simpler and less buggy code.
White space is dropped because of the calls to :code:`ltrim_if` and `rtrim_if`. By calling in the
loop condition, the loop exits if the remaining text is only whitespace and no token is processed.
Alternatively :code:`trim_if` could be used after extraction. The performance will be *slightly*
better because although :code:`trim_if` calls :code:`ltrim_if` and :code:`rtrim_if`, a final
token extraction on trailing whitespace will be avoided. In practice it won't make a difference,
do what's convenient.
It could be tempting to squeeze the code a bit to be
.. literalinclude:: ../../unit_tests/ex_TextView.cc
:start-after: doc csv non-empty start
:end-before: doc csv non-empty end
However this causes a significant behavior difference - the loop terminates on an empty token because
that token will be :code:`false`. That is, this will work if there is a guarantee of no empty tokens
(e.g. adjacent separators).
Key / Value Example
-------------------
A similar case is parsing a list of key / value pairs in a comma separated list. Each pair is
"key=value" where white space is ignored. In this case it is also permitted to have just a keyword
for values that are boolean.
.. literalinclude:: ../../unit_tests/ex_TextView.cc
:start-after: doc kv start
:end-before: doc kv end
.. sidebar:: Verification
`Test code for example <https://github.com/SolidWallOfCode/libswoc/blob/1.4.12/unit_tests/ex_TextView.cc#L73>`__.
The basic list processing is the same as the previous example, extracting each comma separated
element. The resulting element is treated as a "list" with ``=`` as the separator. Note if there is
no ``=`` character then all of the list element is moved to :arg:`key` leaving :arg:`value` empty,
which is the desired result. A bit of extra white space trimming it done in case there was space
next to the ``=``.
Line Processing
---------------
|TV| works well when parsing lines from a file. For this example, :libswoc:`load` will
be used. This method, given a path, loads the entire content of the file into a :code:`std::string`.
This will serve as the owner of the string memory. If it is kept around with the configuration, all
of the parsed strings can be instances of |TV| that reference memory in that :code:`std::string`. If
the density of useful text is sufficiently high, this is a convenient way to handle parsing with
minimal memory allocations.
This example counts the number of code lines in the documenations ``conf.py`` file.
.. literalinclude:: ../../unit_tests/ex_TextView.cc
:lines: 203-217
The |TV| :arg:`src` is constructed from the :code:`std::string` :arg:`content` which contains the
file contents. While that view is not empty, a line is taken each look and leading and trailing
whitespace is trimmed. If this results in an empty view or one where the first character is the
Python comment character ``#`` it is not counted. The newlines are discard by the prefix extraction.
The use of :libswoc:`TextView::take_prefix_at` forces the extraction of text even if there is no
final newline. If this were a file of key value pairs, then :arg:`line` would be subjected to one of
the other examples to extract the values. For all of this, there is only one memory allocation, that
needed for :arg:`content` to load the file contents.
Entity Tag Lists Example
------------------------
An example from actual production code is this example that parses a quoted, comma separated list of
values ("CSV"). This is used for parsing `entity tags
<https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.11>`__ as used for HTTP fields such as
"If-Match" (`14.24 <https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html>`__). This will be a CSV
each where each value is quoted. To make it interesting these quoted strings may contain commas,
which do not count as separators. Therefore the simple approach in previous examples will not work
in all cases. This example also does not use the callback style of the previous examples - instead
the tokens are pulled off in a streaming style with the source :code:`TextView` being passed by
reference in order to be updated by the tokenizer. Further, some callers want the quotes, and some
do not, so a flag to strip quotes from the resulting elements is needed. The final result looks like
.. literalinclude:: ../../unit_tests/ex_TextView.cc
:start-after: "TextView Tokens"
:lines: 2-26
.. sidebar:: Verification
`Test code for example <https://github.com/SolidWallOfCode/libswoc/blob/1.4.12/unit_tests/ex_TextView.cc#L90>`__.
This takes a :code:`TextView&` which is the source view which will be updated as tokens are removed
(therefore the caller must do the empty view check). The other arguments are the separator character
and the "strip quotes" flag. The algorithm is to find the next "interesting" character, which is either
a separator or a quote. Quotes flip the "in quote" flag back and forth, and separators terminate
the loop if the "in quote" flag is not set. This skips quoted separators. If neither is found then
all of the view is returned as the result. Whitespace is always trimmed and then quotes are trimmed
if requested, before the view is returned. In this case keeping an offset of the amount of the source
view processed is the most convenient mechanism for tracking progress. The result is a fairly compact
piece of code that does non-trivial parsing and conversion on a source string, without a lot of
complex parsing state, and no memory allocation.
History
*******
The first attempt at this functionality was in the TSConfig library in the :code:`ts::Buffer` and
:code:`ts::ConstBuffer` classes. Originally intended just as raw memory views,
:code:`ts::ConstBuffer` in particular was repeatedly enhanced to provide better support for strings.
The header was eventually moved from :literal:`lib/tsconfig` to :literal:`lib/ts` and was used in in
various part of the Traffic Server core.
There was then a proposal to make these classes available to plugin writers as they proved handy in
the core. A suggested alternative was `Boost.StringRef
<http://www.boost.org/doc/libs/1_61_0/libs/utility/doc/html/string_ref.html>`_ which provides a
similar functionality using :code:`std::string` as the base of the pre-allocated memory. A version
of the header was ported to Traffic Server (by stripping all the Boost support and cross includes) but in use
proved to provide little of the functionality available in :code:`ts::ConstBuffer`. If extensive
reworking was required in any case, it seemed better to start from scratch and build just what was
useful in the Traffic Server context.
The next step was the :code:`TextView` class which turned out reasonably well. About this time
:code:`std::string_view` was officially adopted for C++17, which was a bit of a problem because
:code:`TextView` was extremely similar in functionality but quite different in interface. Further,
it had a number of quite useful methods that were not in :code:`std::string_view`. To simplify the
use of :code:`TextView` (which was actually called "StringView" then) it was made a subclass of
:code:`std::string_view` with user defined conversions so that two classes could be used almost
interchangeable in an efficient way. Passing a :code:`TextView` to a :code:`std::string_view
const&` is zero marginal cost because of inheritance and passing by value is also no more expensive
than just :code:`std::string_view`.
.. rubric:: Footnotes
.. [#] This is a horrible hash function, do not actually use it.