| .. Licensed to the Apache Software Foundation (ASF) under one or more contributor license |
| agreements. See the NOTICE file distributed with this work for additional information regarding |
| copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with the License. You may obtain |
| a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software distributed under the License |
| is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express |
| or implied. See the License for the specific language governing permissions and limitations |
| under the License. |
| |
| .. include:: ../common-defs.rst |
| |
| .. default-domain:: cpp |
| .. highlight:: cpp |
| .. |TV| replace:: :code:`TextView` |
| .. |SV| replace:: :code:`std::string_view` |
| |
| .. _string-view: https://en.cppreference.com/w/cpp/string/basic_string_view |
| |
| ******** |
| TextView |
| ******** |
| |
| Synopsis |
| ******** |
| |
| :code:`#include "swoc/TextView.h"` |
| |
| .. class:: TextView |
| |
| :libswoc:`Reference documentation <swoc::TextView>`. |
| |
| This class acts as a view of memory allocated / owned elsewhere and treated as a sequence of 8 bit |
| characters. It is in effect a pointer and should be treated as such (e.g. care must be taken to |
| avoid dangling references by knowing where the memory really is). The purpose is to provide string |
| manipulation that is safer than raw pointers and much faster than duplicating strings. |
| |
| Usage |
| ***** |
| |
| |TV| is a subclass of `std::string_view <string-view_>`_ and inherits all of its methods. The |
| additional functionality of |TV| is for easy string manipulation, with an emphasis on fast parsing |
| of string data. As noted, an instance of |TV| is a pointer and needs to be handled as such. It does |
| not own the memory and therefore, like a pointer, care must be taken that the memory is not |
| deallocated while the |TV| still references it. The advantage of this is creating new views and |
| modifying existing ones is very cheap. |
| |
| Any place that passes a :code:`char *` and a size is an excellent candidate for using a |TV|. Code |
| that uses functions such as :code:`strtok` or tracks pointers and offsets internally is an excellent |
| candidate for using |TV| instead. |
| |
| Because |TV| is a subclass of :code:`std::string_view` it can be unclear which is the better choice. |
| In many cases it doesn't matter: because of this relationship, converting between the types is |
| at most as expensive as a copy of the same type and, in the case of a constant reference, can be free. In |
| general if the string is treated as a block of data, :code:`std::string_view` is a better choice. If |
| the contents of the string are to be examined / parsed then |TV| is better. For example, if the |
| string is used simply as a key or a hash source, use :code:`std::string_view`. Contrariwise if the |
| string may contain substrings of interest such as key / value pairs, then use a |TV|. That said, I do |
| sometimes use |TV| simply because :code:`std::string_view` lacks support for instance reuse, e.g. it has no |
| :code:`assign` or :code:`clear` methods. |
| |
| When passing |TV| as an argument, it is debatable whether passing by value or passing by |
| reference is more efficient. The appropriate conclusion is that it's not likely to matter in production |
| code. My personal heuristic is whether the function will modify the value. If so, passing by value |
| saves a copy to a local variable, so it should be passed by value. If the function simply |
| passes the |TV| on to other functions, then pass by constant reference. This distinction is |
| irrelevant to the caller; the same code at the call site will work in either case. |
| |
| As noted, |TV| is designed as a pointer style class. Therefore it has an increment operator which is |
| equivalent to :code:`std::string_view::remove_prefix`. |TV| also has a dereference operator, which |
| acts the same way as on a pointer. The difference is the view knows where the end of the view is. |
| This provides a comfortably familiar way of iterating through a view, the main difference being |
| checking the view itself rather than a dereference of it (like a C-style string) or a range limit. |
| E.g. the code to write a simple hash function [#]_ could be |
| |
| .. code-block:: cpp |
| |
|    size_t hasher(TextView v) { |
|      size_t hash = 0; |
|      while (v) { |
|        hash = hash * 13 + *v++; |
|      } |
|      return hash; |
|    } |
| |
| Because |TV| inherits from :code:`std::string_view` it can also be used as a container for range |
| :code:`for` loops. |
| |
| .. code-block:: cpp |
| |
|    size_t hasher(TextView const& v) { |
|      size_t hash = 0; |
|      for (char c : v) hash = hash * 13 + c; |
|      return hash; |
|    } |
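| |
| For comparison, here is a minimal sketch of the same loop written against plain :code:`std::string_view`, assuming only the standard library. The name :code:`hasher_sv` is hypothetical. It spells out with :code:`empty`, :code:`front`, and :code:`remove_prefix` what the |TV| boolean test, dereference, and increment abbreviate. |
| |
```cpp
#include <cstddef>
#include <string_view>

// Hypothetical name; the same deliberately weak hash, using only std::string_view.
std::size_t hasher_sv(std::string_view v) {
  std::size_t hash = 0;
  while (!v.empty()) {    // TextView: while (v)
    // TextView: *v (the cast avoids sign extension of negative char values)
    hash = hash * 13 + static_cast<unsigned char>(v.front());
    v.remove_prefix(1);   // TextView: ++v
  }
  return hash;
}
```
| |
| The view is passed by value because the loop consumes it, in line with the heuristic described earlier. |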
| |
| The standard functions :code:`strcmp`, :code:`memcmp`, :code:`memcpy`, and :code:`strcasecmp` are |
| overloaded for |TV| so that a |TV| can be used as if it were a C-style string. The size is taken |
| from the |TV| and doesn't need to be passed in explicitly. |
| |
| Basic Operations |
| ================ |
| |
| |TV| is essentially a collection of operations which have been found to be common and useful in |
| manipulating contiguous blocks of text. |
| |
| Construction |
| ------------ |
| |
| Constructing a view means creating a view from another object which owns the memory (for creating |
| views from other views see `Extraction`_). This can be a :code:`char const*` pointer and size, two |
| pointers, a literal string, a :code:`std::string` or a :code:`std::string_view` although in the last |
| case there is presumably yet another object that actually owns the memory. All of these constructors |
| require only the equivalent of two assignment statements. The one thing to be careful of is if a |
| literal string or C-string is used, the resulting |TV| will drop the terminating nul character from |
| the view. This is almost always the correct behavior, but if it isn't, an explicit size can be used. |
| |
| A |TV| can be constructed from a null :code:`char const*` pointer or a straight :code:`nullptr`. This |
| will construct an empty |TV| identical to one default constructed. |
| |
| |TV| supports a generic constructor that will accept any class that provides the :code:`data` and |
| :code:`size` methods that return values convertible to :code:`char const *` and :code:`size_t`. |
| This enables greater interoperability with other libraries, as any well written C++ library with |
| its own string class will have these methods implemented sensibly. |
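| |
| The constructor rules above can be sketched with a small stand-in class. :code:`MiniView` and :code:`view_of` are hypothetical names used only for illustration; this is not the |TV| implementation, just one way such constructors could be written on top of :code:`std::string_view`. |
| |
```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// MiniView is a hypothetical stand-in for illustration; it is not swoc::TextView.
class MiniView : public std::string_view {
public:
  MiniView() = default;

  // C-string or literal: the length stops before the terminating nul.
  // A null pointer yields an empty view rather than undefined behavior.
  MiniView(char const *s) : std::string_view(s ? std::string_view(s) : std::string_view()) {}

  // Pointer and explicit size, e.g. to keep a trailing nul inside the view.
  MiniView(char const *s, std::size_t n) : std::string_view(s, n) {}

  // Generic: any type whose data() and size() can build a std::string_view.
  template <typename C,
            typename = decltype(std::string_view(std::declval<C const &>().data(),
                                                 std::declval<C const &>().size()))>
  MiniView(C const &c) : std::string_view(c.data(), c.size()) {}
};

// A view of a std::string made through the generic constructor.
// The string must outlive the returned view.
inline MiniView view_of(std::string const &s) { return MiniView(s); }
```
| |
| With this sketch :code:`MiniView("abc")` has size 3 while :code:`MiniView("abc", 4)` keeps the nul and has size 4. |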
| |
| Searching |
| --------- |
| |
| Because |TV| is a subclass of :code:`std::string_view` all of its search methods work on a |TV|. The |
| only search methods provided beyond those are :libswoc:`TextView::find_if` and |
| :libswoc:`TextView::rfind_if` which search the view by a predicate. The predicate takes a single |
| :code:`char` argument and returns a :code:`bool`. The search terminates on the first character for |
| which the predicate returns :code:`true`. |
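| |
| As a rough illustration of a predicate search, the following sketch uses only the standard library. The names :code:`index_if` and :code:`is_digit` are hypothetical, and reporting a miss as :code:`npos` is an assumption made here to match the inherited :code:`find` methods; consult the reference documentation for the actual |TV| signatures. |
| |
```cpp
#include <algorithm>
#include <cctype>
#include <string_view>

// Hypothetical helper: index of the first character satisfying pred, or npos if none does.
template <typename F>
std::string_view::size_type index_if(std::string_view v, F const &pred) {
  auto spot = std::find_if(v.begin(), v.end(), pred);
  return spot == v.end() ? std::string_view::npos
                         : static_cast<std::string_view::size_type>(spot - v.begin());
}

// Predicate wrapper; the cast keeps std::isdigit well defined for negative char values.
inline bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
```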
| |
| Extraction |
| ---------- |
| |
| Extraction is creating a new view from an existing view. Because views cannot in general be expanded, |
| new views will be sub-sequences of existing views. This is the primary utility of a |TV|. As |
| noted in the `general description <Description>`_ |TV| supports copying or removing prefixes and |
| suffixes of the view. All of this is possible using the underlying :code:`std::string_view::substr` |
| but this is frequently much clumsier. The development of |TV| was driven to a large extent by the |
| desire to make such code much more compact and expressive, while being at least as safe. In particular |
| extraction methods on |TV| do useful and well defined things when given out of bounds arguments. |
| This is quite handy when extracting tokens based on separator characters. |
| |
| The primary distinction is how a character in the view is selected. |
| |
| * By index, an offset into the view. These have plain names, such as :libswoc:`TextView::prefix`. |
| |
| * By character comparison, either a single character or set of characters which is matched against a single |
| character in the view. These are suffixed with "at" such as :libswoc:`TextView::prefix_at`. |
| |
| * By predicate, a function that takes a single character argument and returns a bool to indicate a match. |
| These are suffixed with "if", such as :libswoc:`TextView::prefix_if`. |
| |
| A secondary distinction is what is done to the view by the methods. |
| |
| * The base methods make a new view without modifying the existing view. |
| |
| * The "split..." methods remove the corresponding part of the view and return it. The selected character |
| is discarded and left in neither the returned view nor the source view. If the selected character |
| is not in the view, an empty view is returned and the source view is not modified. |
| |
| * The "take..." methods remove the corresponding part of the view and return it. The selected character |
| is discarded and left in neither the returned view nor the source view. If the selected character |
| is not in the view, the entire view is returned and the source view is cleared. |
| |
| * The "clip..." methods remove the corresponding part of the view and return it. Only those characters |
| are removed - in contrast to "split..." and "take..." which drop a (presumed) separator. If the |
| first character doesn't match, the view is not modified and an empty view is returned. These are |
| very similar to the "trim..." methods described below, the difference being which part of the original |
| view is returned. |
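| |
| The difference between "split" and "take" when the separator is missing can be sketched with free functions over :code:`std::string_view`. These helpers are hypothetical and borrow the method names only for readability; they follow the behavior described in the list above and are not the library code. |
| |
```cpp
#include <string_view>

// Hypothetical "split": on a match, return the text before sep and drop it and sep from src.
// Without a match, return an empty view and leave src untouched.
inline std::string_view split_prefix_at(std::string_view &src, char sep) {
  auto idx = src.find(sep);
  if (idx == std::string_view::npos) {
    return {};
  }
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx + 1); // also discards the separator
  return token;
}

// Hypothetical "take": identical on a match.
// Without a match, return all of src and clear it.
inline std::string_view take_prefix_at(std::string_view &src, char sep) {
  auto idx = src.find(sep);
  std::string_view token = src.substr(0, idx); // npos selects the whole view
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}
```
| |
| Starting from ``key=value``, a split at ``=`` yields ``key`` and leaves ``value``. A second split finds no separator and changes nothing, whereas a take returns ``value`` and leaves the source empty. |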
| |
| .. _`std::string_view::remove_prefix`: https://en.cppreference.com/w/cpp/string/basic_string_view/remove_prefix |
| .. _`std::string_view::remove_suffix`: https://en.cppreference.com/w/cpp/string/basic_string_view/remove_suffix |
| |
| This is a table of the affix oriented methods, grouped by the properties of the methods. "Bounded" |
| indicates whether the operation requires the target character, however specified, to be within the |
| bounds of the view. A bounded method does nothing if the target character is not in the view. On |
| this note, the :code:`remove_prefix` and :code:`remove_suffix` methods are implemented differently in |TV| |
| compared to :code:`std::string_view`. Rather than being undefined, the methods will clear the view |
| if the size specified is larger than the contents of the view. |
| |
| +-----------------+--------+---------+------------------------------------------+ |
| | Operation | Affix | Bounded | Method | |
| +=================+========+=========+==========================================+ |
| | Copy | Prefix | No | :libswoc:`TextView::prefix` | |
| | + +---------+------------------------------------------+ |
| | | | Yes | :libswoc:`TextView::prefix_at` | |
| | + + +------------------------------------------+ |
| | | | | :libswoc:`TextView::prefix_if` | |
| | +--------+---------+------------------------------------------+ |
| | | Suffix | No | :libswoc:`TextView::suffix` | |
| | + +---------+------------------------------------------+ |
| | | | Yes | :libswoc:`TextView::suffix_at` | |
| | + + +------------------------------------------+ |
| | | | | :libswoc:`TextView::suffix_if` | |
| +-----------------+--------+---------+------------------------------------------+ |
| | Modify | Prefix | No | `std::string_view::remove_prefix`_ | |
| | | +---------+------------------------------------------+ |
| | | | Yes | :libswoc:`TextView::remove_prefix_at` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::remove_prefix_if` | |
| | +--------+---------+------------------------------------------+ |
| | | Suffix | No | `std::string_view::remove_suffix`_ | |
| | | +---------+------------------------------------------+ |
| | | | Yes | :libswoc:`TextView::remove_suffix_at` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::remove_suffix_if` | |
| +-----------------+--------+---------+------------------------------------------+ |
| | Modify and Copy | Prefix | Yes | :libswoc:`TextView::split_prefix` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::split_prefix_at` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::split_prefix_if` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::clip_prefix_of` | |
| | | +---------+------------------------------------------+ |
| | | | No | :libswoc:`TextView::take_prefix` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::take_prefix_at` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::take_prefix_if` | |
| | +--------+---------+------------------------------------------+ |
| | | Suffix | Yes | :libswoc:`TextView::split_suffix` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::split_suffix_at` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::split_suffix_if` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::clip_suffix_of` | |
| | | +---------+------------------------------------------+ |
| | | | No | :libswoc:`TextView::take_suffix` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::take_suffix_at` | |
| | | + +------------------------------------------+ |
| | | | | :libswoc:`TextView::take_suffix_if` | |
| +-----------------+--------+---------+------------------------------------------+ |
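| |
| The out of bounds behavior described before the table, where removing more than the view holds clears the view, can be approximated on a plain :code:`std::string_view` by clamping the count. The helper name :code:`remove_prefix_clamped` is hypothetical. |
| |
```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Hypothetical helper: remove up to n leading characters.
// If n exceeds the size, the view becomes empty instead of invoking undefined behavior.
inline void remove_prefix_clamped(std::string_view &v, std::size_t n) {
  v.remove_prefix(std::min(n, v.size()));
}
```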
| |
| Other |
| ----- |
| |
| The comparison operators for |TV| are inherited from :code:`std::string_view` and therefore use the |
| content of the view to determine the relationship. |
| |
| |TV| provides a collection of "trim" methods which remove leading or trailing characters. These have |
| similar suffixes with the same meaning as the affix methods. This can be done for a single |
| character, one of a set of characters, or a predicate. The most common use is with the predicate |
| :code:`isspace` which removes leading and/or trailing whitespace as needed. |
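| |
| A minimal sketch of predicate based trimming, assuming only the standard library. The function names are hypothetical, and unlike the |TV| methods, which modify the instance, these return a trimmed copy of the view. |
| |
```cpp
#include <cctype>
#include <string_view>

// Hypothetical helper: drop leading characters while pred is true.
template <typename F> std::string_view ltrimmed_if(std::string_view v, F const &pred) {
  while (!v.empty() && pred(v.front())) v.remove_prefix(1);
  return v;
}

// Hypothetical helper: drop trailing characters while pred is true.
template <typename F> std::string_view rtrimmed_if(std::string_view v, F const &pred) {
  while (!v.empty() && pred(v.back())) v.remove_suffix(1);
  return v;
}

// Hypothetical helper: trim both ends.
template <typename F> std::string_view trimmed_if(std::string_view v, F const &pred) {
  return rtrimmed_if(ltrimmed_if(v, pred), pred);
}

// Predicate wrapper; the cast keeps std::isspace well defined for negative char values.
inline bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
```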
| |
| While the plethora of view methods can seem a bit much, all of these are useful in different |
| situations and exist because of such use cases. |
| |
| Numeric conversions are provided, in signed (:libswoc:`svtoi`), unsigned (:libswoc:`svtou`), and |
| floating point (:libswoc:`svtod`) flavors. The integer functions are designed to be "complete" in |
| the sense that any other string to integer conversion can be mapped to one of these functions. The |
| floating point conversion is sufficiently accurate - it will return a floating point value that is |
| within one epsilon of the exact value, but not always the closest. This is fine for general use such |
| as in configurations, but possibly not quite enough for high precision work. |
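| |
| The conversion functions themselves are not reproduced here. As a standard library analogue only, the sketch below converts the leading digits of a view with :code:`std::from_chars` and reports how much text was consumed. The name :code:`to_int` and its parameters are hypothetical and are not the :code:`svtoi` interface. |
| |
```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical helper built on std::from_chars (base 10, optional leading '-').
// Returns the converted value, or 0 if no number could be parsed.
// If parsed is not null it receives the part of src that was consumed.
inline std::intmax_t to_int(std::string_view src, std::string_view *parsed = nullptr) {
  std::intmax_t value = 0;
  char const *first = src.data();
  auto result = std::from_chars(first, first + src.size(), value);
  bool ok = (result.ec == std::errc());
  if (parsed) {
    *parsed = ok ? src.substr(0, static_cast<std::size_t>(result.ptr - first)) : std::string_view();
  }
  return ok ? value : 0;
}
```
| |
| No temporary nul terminated copy is needed because the range is bounded by the view. |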
| |
| The standard functions :code:`strcmp`, :code:`strcasecmp`, and :code:`memcmp` are overloaded when |
| at least one of the parameters is a |TV|. The length is taken from the view, rather than being an explicit |
| parameter as with :code:`strncasecmp`. |
| |
| When no other useful result can be returned, |TV| methods return a reference to the instance. This |
| makes chaining methods easy. If a list consisted of colon separated elements, each of which was |
| of the form "A.B.old" and just the "A.B" part was needed, sans leading white space: |
| |
| .. literalinclude:: ../../unit_tests/ex_TextView.cc |
| :lines: 223-227 |
| |
| Parsing with TextView |
| ===================== |
| |
| Time for some examples demonstrating string parsing using |TV|. There are two major reasons for |
| developing |TV| parsing. |
| |
| The first was to minimize the need to allocate memory to hold intermediate results. For this reason, the normal |
| style of use is a streaming / incremental one, where tokens are extracted from a source one by one |
| and placed in |TV| instances, with the original source |TV| being reduced by each extraction until |
| it is empty. |
| |
| The second was to minimize cut and paste coding. Typical C or C++ parsing logic consists mostly of |
| very generic code to handle pointer and size updates. The point of |TV| is to automate all of that |
| so the resulting code is focused entirely on the parsing logic, not boilerplate string or view manipulation. |
| Such hand written code is commonly not exactly correct, leading to hard to track bugs. Use |
| of |TV| eliminates those problems. |
| |
| The minimization of exceptions on sizes beyond the view boundaries was done primarily to help |
| parsing. It noticeably simplifies the logic if excessive removal or advancement yields an empty |
| view rather than an exception. |
| |
| CSV Example |
| ----------- |
| |
| For example, assume :arg:`value` contains a null terminated string which is expected to be tokens |
| separated by commas. To handle this generically a function could be written which takes a token |
| handler and calls it for each token. |
| |
| .. literalinclude:: ../../unit_tests/ex_TextView.cc |
| :start-after: doc csv start |
| :end-before: doc csv end |
| |
| If :arg:`value` was :literal:`"bob ,dave, sam"` then :arg:`token` would be successively |
| :literal:`bob`, :literal:`dave`, :literal:`sam`. Each loop iteration is guaranteed to remove text |
| from :arg:`src` making the loop eventually terminate when all text has been removed, because an |
| empty :code:`TextView` is :code:`false`. This is a recommended style because :code:`TextView` instances |
| are very cheap to copy. This is essentially the same as having a current pointer and an end pointer |
| and checking for :code:`current >= end` except :code:`TextView` does all the work, leading to |
| simpler and less buggy code. |
| |
| White space is dropped because of the calls to :code:`ltrim_if` and :code:`rtrim_if`. By trimming in the |
| loop condition, the loop exits if the remaining text is only whitespace and no token is processed. |
| Alternatively :code:`trim_if` could be used after extraction. The approach shown performs *slightly* |
| better because, although :code:`trim_if` calls :code:`ltrim_if` and :code:`rtrim_if`, a final |
| token extraction on trailing whitespace is avoided. In practice it won't make a difference; |
| do what's convenient. |
| |
| It could be tempting to squeeze the code a bit to be |
| |
| .. literalinclude:: ../../unit_tests/ex_TextView.cc |
| :start-after: doc csv non-empty start |
| :end-before: doc csv non-empty end |
| |
| However this causes a significant behavior difference - the loop terminates on an empty token because |
| that token will be :code:`false`. That is, this works only if there is a guarantee of no empty tokens |
| (e.g. none produced by adjacent separators). |
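| |
| The effect of testing the token rather than the source can be sketched with the standard library alone. The helpers below are hypothetical and are not the included test code; :code:`take_prefix_at` here is a free function that imitates the behavior described for the method of the same name. |
| |
```cpp
#include <string_view>
#include <vector>

// Hypothetical helper: return the text before sep and drop it and sep from src.
// If sep is absent, return all of src and clear it.
inline std::string_view take_prefix_at(std::string_view &src, char sep) {
  auto idx = src.find(sep);
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}

// Loop on the source: every token is visited, including empty ones.
inline std::vector<std::string_view> tokens_all(std::string_view src) {
  std::vector<std::string_view> out;
  while (!src.empty()) {
    out.push_back(take_prefix_at(src, ','));
  }
  return out;
}

// Loop on the token: the first empty token ends the loop early.
inline std::vector<std::string_view> tokens_until_empty(std::string_view src) {
  std::vector<std::string_view> out;
  for (auto token = take_prefix_at(src, ','); !token.empty(); token = take_prefix_at(src, ',')) {
    out.push_back(token);
  }
  return out;
}
```
| |
| For the input ``a,,b`` the first loop visits three tokens (the middle one empty) while the second stops after ``a``. |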
| |
| Key / Value Example |
| ------------------- |
| |
| A similar case is parsing a list of key / value pairs in a comma separated list. Each pair is |
| "key=value" where white space is ignored. In this case it is also permitted to have just a keyword |
| for values that are boolean. |
| |
| .. literalinclude:: ../../unit_tests/ex_TextView.cc |
| :start-after: doc kv start |
| :end-before: doc kv end |
| |
| .. sidebar:: Verification |
| |
| `Test code for example <https://github.com/SolidWallOfCode/libswoc/blob/1.4.12/unit_tests/ex_TextView.cc#L73>`__. |
| |
| The basic list processing is the same as the previous example, extracting each comma separated |
| element. The resulting element is treated as a "list" with ``=`` as the separator. Note if there is |
| no ``=`` character then all of the list element is moved to :arg:`key` leaving :arg:`value` empty, |
| which is the desired result. A bit of extra white space trimming is done in case there was space |
| next to the ``=``. |
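| |
| The same two level extraction can be sketched without the library. Everything below is hypothetical illustration code written against :code:`std::string_view`, not the example included above; it collects the pairs into a vector instead of invoking a handler. |
| |
```cpp
#include <cctype>
#include <string_view>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string_view, std::string_view>;

// Hypothetical helper: return the text before sep and drop it and sep from src.
// If sep is absent, return all of src and clear it.
inline std::string_view take_prefix_at(std::string_view &src, char sep) {
  auto idx = src.find(sep);
  std::string_view token = src.substr(0, idx);
  src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
  return token;
}

// Hypothetical helper: drop leading and trailing whitespace.
inline std::string_view trim_ws(std::string_view v) {
  auto ws = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!v.empty() && ws(v.front())) v.remove_prefix(1);
  while (!v.empty() && ws(v.back())) v.remove_suffix(1);
  return v;
}

// Parse "key=value, flag, ..." into pairs; a bare keyword gets an empty value.
inline std::vector<KeyValue> parse_pairs(std::string_view src) {
  std::vector<KeyValue> out;
  while (!(src = trim_ws(src)).empty()) {
    std::string_view value = take_prefix_at(src, ',');          // one list element
    std::string_view key   = trim_ws(take_prefix_at(value, '=')); // value keeps the rest
    out.emplace_back(key, trim_ws(value));
  }
  return out;
}
```
| |
| The returned views refer to the memory of the input, so the input must outlive them. |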
| |
| Line Processing |
| --------------- |
| |
| |TV| works well when parsing lines from a file. For this example, :libswoc:`load` will |
| be used. This method, given a path, loads the entire content of the file into a :code:`std::string`. |
| This will serve as the owner of the string memory. If it is kept around with the configuration, all |
| of the parsed strings can be instances of |TV| that reference memory in that :code:`std::string`. If |
| the density of useful text is sufficiently high, this is a convenient way to handle parsing with |
| minimal memory allocations. |
| |
| This example counts the number of code lines in the documentation's ``conf.py`` file. |
| |
| .. literalinclude:: ../../unit_tests/ex_TextView.cc |
| :lines: 203-217 |
| |
| The |TV| :arg:`src` is constructed from the :code:`std::string` :arg:`content` which contains the |
| file contents. While that view is not empty, a line is taken each loop and leading and trailing |
| whitespace is trimmed. If this results in an empty view or one where the first character is the |
| Python comment character ``#`` it is not counted. The newlines are discarded by the prefix extraction. |
| The use of :libswoc:`TextView::take_prefix_at` forces the extraction of text even if there is no |
| final newline. If this were a file of key value pairs, then :arg:`line` would be subjected to one of |
| the other examples to extract the values. For all of this, there is only one memory allocation, that |
| needed for :arg:`content` to load the file contents. |
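| |
| For readers without the library at hand, the counting logic can be approximated with :code:`std::string_view` alone. This is a sketch under the assumption that the file content is already in memory; :code:`count_code_lines` and :code:`trim_ws` are hypothetical names and file loading is left out. |
| |
```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: drop leading and trailing whitespace.
inline std::string_view trim_ws(std::string_view v) {
  auto ws = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!v.empty() && ws(v.front())) v.remove_prefix(1);
  while (!v.empty() && ws(v.back())) v.remove_suffix(1);
  return v;
}

// Count lines that are neither blank nor start with the Python comment character.
inline std::size_t count_code_lines(std::string_view src) {
  std::size_t n = 0;
  while (!src.empty()) {
    auto idx = src.find('\n');
    std::string_view line = src.substr(0, idx); // whole remainder if no newline is left
    src.remove_prefix(idx == std::string_view::npos ? src.size() : idx + 1);
    line = trim_ws(line);
    if (!line.empty() && line.front() != '#') {
      ++n;
    }
  }
  return n;
}
```
| |
| The last line is counted even when the text does not end with a newline, mirroring the "take" behavior described above. |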
| |
| Entity Tag Lists Example |
| ------------------------ |
| |
| An example from actual production code is this one, which parses a quoted, comma separated list of |
| values ("CSV"). This is used for parsing `entity tags |
| <https://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.11>`__ as used for HTTP fields such as |
| "If-Match" (`14.24 <https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html>`__). This will be a CSV |
| where each value is quoted. To make it interesting these quoted strings may contain commas, |
| which do not count as separators. Therefore the simple approach in previous examples will not work |
| in all cases. This example also does not use the callback style of the previous examples - instead |
| the tokens are pulled off in a streaming style with the source :code:`TextView` being passed by |
| reference in order to be updated by the tokenizer. Further, some callers want the quotes, and some |
| do not, so a flag to strip quotes from the resulting elements is needed. The final result looks like |
| |
| .. literalinclude:: ../../unit_tests/ex_TextView.cc |
| :start-after: "TextView Tokens" |
| :lines: 2-26 |
| |
| .. sidebar:: Verification |
| |
| `Test code for example <https://github.com/SolidWallOfCode/libswoc/blob/1.4.12/unit_tests/ex_TextView.cc#L90>`__. |
| |
| This takes a :code:`TextView&` which is the source view which will be updated as tokens are removed |
| (therefore the caller must do the empty view check). The other arguments are the separator character |
| and the "strip quotes" flag. The algorithm is to find the next "interesting" character, which is either |
| a separator or a quote. Quotes flip the "in quote" flag back and forth, and separators terminate |
| the loop if the "in quote" flag is not set. This skips quoted separators. If neither is found then |
| all of the view is returned as the result. Whitespace is always trimmed and then quotes are trimmed |
| if requested, before the view is returned. In this case keeping an offset of the amount of the source |
| view processed is the most convenient mechanism for tracking progress. The result is a fairly compact |
| piece of code that does non-trivial parsing and conversion on a source string, without a lot of |
| complex parsing state, and no memory allocation. |
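| |
| The algorithm just described can also be sketched against :code:`std::string_view`. This is an approximation for illustration, not the production code shown above; the names :code:`next_token` and :code:`trim_ws` are hypothetical and the quote character is assumed to be the double quote. |
| |
```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: drop leading and trailing whitespace.
inline std::string_view trim_ws(std::string_view v) {
  auto ws = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
  while (!v.empty() && ws(v.front())) v.remove_prefix(1);
  while (!v.empty() && ws(v.back())) v.remove_suffix(1);
  return v;
}

// Hypothetical tokenizer: removes the next token from src and returns it.
// Separators inside double quotes do not end the token.
inline std::string_view next_token(std::string_view &src, char sep, bool strip_quotes) {
  constexpr auto npos = std::string_view::npos;
  char const targets[2] = {sep, '"'};
  std::string_view const interesting(targets, 2);
  bool in_quote   = false;
  std::size_t idx = 0; // offset of the source processed so far
  while ((idx = src.find_first_of(interesting, idx)) != npos) {
    if (src[idx] == '"') {
      in_quote = !in_quote; // quotes flip the flag
    } else if (!in_quote) {
      break; // unquoted separator ends the token
    }
    ++idx; // skip the quote or the quoted separator
  }
  std::string_view token = src.substr(0, idx); // whole view if no separator was found
  src.remove_prefix(idx == npos ? src.size() : idx + 1);
  token = trim_ws(token);
  if (strip_quotes) {
    while (!token.empty() && token.front() == '"') token.remove_prefix(1);
    while (!token.empty() && token.back() == '"') token.remove_suffix(1);
  }
  return token;
}
```
| |
| As in the original, the caller loops while the source view is not empty and calls the tokenizer once per iteration. |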
| |
| History |
| ******* |
| |
| The first attempt at this functionality was in the TSConfig library in the :code:`ts::Buffer` and |
| :code:`ts::ConstBuffer` classes. Originally intended just as raw memory views, |
| :code:`ts::ConstBuffer` in particular was repeatedly enhanced to provide better support for strings. |
| The header was eventually moved from :literal:`lib/tsconfig` to :literal:`lib/ts` and was used in |
| various parts of the Traffic Server core. |
| |
| There was then a proposal to make these classes available to plugin writers as they proved handy in |
| the core. A suggested alternative was `Boost.StringRef |
| <http://www.boost.org/doc/libs/1_61_0/libs/utility/doc/html/string_ref.html>`_ which provides a |
| similar functionality using :code:`std::string` as the base of the pre-allocated memory. A version |
| of the header was ported to Traffic Server (by stripping all the Boost support and cross includes) but in use |
| proved to provide little of the functionality available in :code:`ts::ConstBuffer`. If extensive |
| reworking was required in any case, it seemed better to start from scratch and build just what was |
| useful in the Traffic Server context. |
| |
| The next step was the :code:`TextView` class which turned out reasonably well. About this time |
| :code:`std::string_view` was officially adopted for C++17, which was a bit of a problem because |
| :code:`TextView` was extremely similar in functionality but quite different in interface. Further, |
| it had a number of quite useful methods that were not in :code:`std::string_view`. To simplify the |
| use of :code:`TextView` (which was actually called "StringView" then) it was made a subclass of |
| :code:`std::string_view` with user defined conversions so that the two classes could be used almost |
| interchangeably in an efficient way. Passing a :code:`TextView` to a :code:`std::string_view |
| const&` is zero marginal cost because of inheritance and passing by value is also no more expensive |
| than just :code:`std::string_view`. |
| |
| .. rubric:: Footnotes |
| |
| .. [#] This is a horrible hash function, do not actually use it. |