NAME
pcretest - a program for testing Perl-compatible regular
expressions.
SYNOPSIS
pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]
pcretest was written as a test program for the PCRE regular
expression library itself, but it can also be used for
experimenting with regular expressions. This man page
describes the features of the test program; for details of
the regular expressions themselves, see the pcre man page.
OPTIONS
-d Behave as if each regex had the /D modifier (see
below); the internal form is output after compila-
tion.
-i Behave as if each regex had the /I modifier;
information about the compiled pattern is given
after compilation.
-m Output the size of each compiled pattern after it
has been compiled. This is equivalent to adding /M
to each regular expression. For compatibility with
earlier versions of pcretest, -s is a synonym for
-m.
-o osize Set the number of elements in the output vector
that is used when calling PCRE to be osize. The
default value is 45, which is enough for 14 cap-
turing subexpressions. The vector size can be
changed for individual matching calls by including
\O in the data line (see below).
-p Behave as if each regex had the /P modifier; the POSIX
wrapper API is used to call PCRE. None of the
other options has any effect when -p is set.
-t Run each compile, study, and match 20000 times
with a timer, and output resulting time per com-
pile or match (in milliseconds). Do not set -t
with -m, because you will then get the size output
20000 times and the timing will be distorted.
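The arithmetic behind the -o default can be sketched as follows. This is an illustration in Python, not part of pcretest; it assumes that pcre_exec() stores one pair of offsets per substring in the first two-thirds of the output vector and keeps the final third as workspace. The function name max_captures is made up for this example.

```python
def max_captures(osize):
    """Capturing subexpressions that fit in an output vector of osize ints."""
    # Assumption: each substring costs three ints (start, end, workspace).
    pairs = osize // 3
    # Pair 0 holds the offsets of the whole match, so it is not a capture.
    return max(pairs - 1, 0)


print(max_captures(45))
```

With the default of 45 this gives 15 pairs, one of which is used for the whole match, leaving 14 for capturing subexpressions.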
DESCRIPTION
If pcretest is given two filename arguments, it reads from
the first and writes to the second. If it is given only one
filename argument, it reads from that file and writes to
stdout. Otherwise, it reads from stdin and writes to stdout,
and prompts for each line of input, using "re>" to prompt
for regular expressions, and "data>" to prompt for data
lines.
The program handles any number of sets of input on a single
input file. Each set starts with a regular expression, and
continues with any number of data lines to be matched
against the pattern. An empty line signals the end of the
data lines, at which point a new regular expression is read.
The regular expressions are given enclosed in any non-
alphameric delimiters other than backslash, for example
/(a|bc)x+yz/
White space before the initial delimiter is ignored. A regu-
lar expression may be continued over several input lines, in
which case the newline characters are included within it. It
is possible to include the delimiter within the pattern by
escaping it, for example
/abc\/def/
If you do so, the escape and the delimiter form part of the
pattern, but since delimiters are always non-alphameric,
this does not affect its interpretation. If the terminating
delimiter is immediately followed by a backslash, for exam-
ple,
/abc/\
then a backslash is added to the end of the pattern. This is
done to provide a way of testing the error condition that
arises if a pattern finishes with a backslash, because
/abc\/
is interpreted as the first line of a pattern that starts
with "abc/", causing pcretest to read the next line as a
continuation of the regular expression.
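The delimiter rules above can be sketched in a few lines of Python. This is a hedged illustration of the described behaviour, not the code pcretest uses; the name split_pattern_line and the convention of returning None for "read another line" are assumptions of this example.

```python
def split_pattern_line(line):
    """Split one input line into (pattern, modifiers).

    Returns None when no terminating delimiter is found, meaning the
    pattern continues on the next input line.
    """
    text = line.lstrip()              # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty line")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphameric and not a backslash")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2                    # the escape and the escaped character both stay
            continue
        if ch == delim:
            break
        i += 1
    else:
        return None                   # ran off the end: pattern is continued
    pattern = text[1:i]
    rest = text[i + 1:]
    if rest.startswith("\\"):
        pattern += "\\"               # /abc/\ adds a backslash to the pattern
        rest = rest[1:]
    return pattern, rest
```

For example, split_pattern_line("/abc/\\") yields the pattern "abc\" with no modifiers, while split_pattern_line("/abc\\/") returns None because the escaped delimiter does not terminate the pattern.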
PATTERN MODIFIERS
The pattern may be followed by i, m, s, or x to set the
PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
options, respectively. For example:
/caseless/i
These modifier letters have the same effect as they do in
Perl. There are others which set PCRE options that do not
correspond to anything in Perl: /A, /E, and /X set
PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respec-
tively.
Searching for all possible matches within each subject
string can be requested by the /g or /G modifier. After
finding a match, PCRE is called again to search the
remainder of the subject string. The difference between /g
and /G is that the former uses the startoffset argument to
pcre_exec() to start searching at a new point within the
entire string (which is in effect what Perl does), whereas
the latter passes over a shortened substring. This makes a
difference to the matching process if the pattern begins
with a lookbehind assertion (including \b or \B).
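The difference between /g and /G can be imitated with Python's re module, which is used here only as a stand-in for PCRE. Passing a start position to search() corresponds roughly to the startoffset approach of /g, and slicing the subject corresponds to the shortened substring of /G. The subject string and offset are invented for the illustration.

```python
import re

pattern = re.compile(r"\bcat")
subject = "concat cat"
offset = 3   # pretend an earlier match ended here

# /g style: search the entire string, starting at an offset.
# The "n" before offset 3 is still visible, so \b fails there.
g_style = pattern.search(subject, offset)

# /G style: search a shortened copy of the subject.
# The preceding text is gone, so \b succeeds at the start of the copy.
G_style = pattern.search(subject[offset:])

g_start = g_style.start()
G_start = offset + G_style.start()
print(g_start, G_start)   # 7 3
```

The two approaches report different match positions because the word-boundary test looks at the character before the starting point, which only the /g style can see.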
If any call to pcre_exec() in a /g or /G sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY
and PCRE_ANCHORED flags set in order to search for another,
non-empty, match at the same point. If this second match
fails, the start offset is advanced by one, and the normal
match is retried. This imitates the way Perl handles such
cases when using the /g modifier or the split() function.
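The retry sequence described above can be sketched as a loop. This is a simplified model in Python, not pcretest's own code: exec_like is a hypothetical stand-in for pcre_exec() built on the re module, and its handling of notempty is cruder than PCRE_NOTEMPTY, because it simply rejects an empty result instead of backtracking to look for a non-empty alternative.

```python
import re


def exec_like(pattern, subject, start, notempty=False, anchored=False):
    """Rough stand-in for pcre_exec(); an assumption of this sketch."""
    if anchored:
        m = pattern.match(subject, start)
    else:
        m = pattern.search(subject, start)
    if m is not None and notempty and m.end() == m.start():
        return None   # cruder than PCRE_NOTEMPTY, which would backtrack
    return m


def global_matches(regex, subject):
    """Collect (offset, text) for successive matches, as with /g."""
    pattern = re.compile(regex)
    found = []
    start = 0
    retry_nonempty = False   # True right after an empty match
    while start <= len(subject):
        m = exec_like(pattern, subject, start,
                      notempty=retry_nonempty, anchored=retry_nonempty)
        if m is None:
            if not retry_nonempty:
                break             # a normal match failed: nothing more to find
            start += 1            # the second attempt failed: advance by one
            retry_nonempty = False
            continue
        found.append((m.start(), m.group(0)))
        start = m.end()
        retry_nonempty = (m.end() == m.start())
    return found


print(global_matches("a*", "baaac"))
```

For the pattern a* and the subject "baaac", the loop finds an empty string at offset 0, "aaa" at offset 1, and empty strings at offsets 4 and 5.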
There are a number of other modifiers for controlling the
way pcretest operates.
The /+ modifier requests that as well as outputting the sub-
string that matched the entire pattern, pcretest should in
addition output the remainder of the subject string. This is
useful for tests where the subject contains multiple copies
of the same substring.
The /L modifier must be followed directly by the name of a
locale, for example,
/pattern/Lfr
For this reason, it must be the last modifier letter. The
given locale is set, pcre_maketables() is called to build a
set of character tables for the locale, and this is then
passed to pcre_compile() when compiling the regular expres-
sion. Without an /L modifier, NULL is passed as the tables
pointer; that is, /L applies only to the expression on which
it appears.
The /I modifier requests that pcretest output information
about the compiled expression (whether it is anchored, has a
fixed first character, and so on). It does this by calling
pcre_fullinfo() after compiling an expression, and output-
ting the information it gets back. If the pattern is stu-
died, the results of that are also output.
The /D modifier is a PCRE debugging feature, which also
assumes /I. It causes the internal form of compiled regular
expressions to be output after compilation.
The /S modifier causes pcre_study() to be called after the
expression has been compiled, and the results used when the
expression is matched.
The /M modifier causes the size of memory block used to hold
the compiled pattern to be output.
The /P modifier causes pcretest to call PCRE via the POSIX
wrapper API rather than its native API. When this is done,
all other modifiers except /i, /m, and /+ are ignored.
REG_ICASE is set if /i is present, and REG_NEWLINE is set if
/m is present. The wrapper functions force
PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
REG_NEWLINE is set.
The /8 modifier causes pcretest to call PCRE with the
PCRE_UTF8 option set. This turns on the (currently incom-
plete) support for UTF-8 character handling in PCRE, pro-
vided that it was compiled with this support enabled. This
modifier also causes any non-printing characters in output
strings to be printed using the \x{hh...} notation if they
are valid UTF-8 sequences.
DATA LINES
Before each data line is passed to pcre_exec(), leading and
trailing whitespace is removed, and it is then scanned for \
escapes. The following are recognized:
\a alarm (= BEL)
\b backspace
\e escape
\f formfeed
\n newline
\r carriage return
\t tab
\v vertical tab
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\x{hh...} hexadecimal UTF-8 character
\A pass the PCRE_ANCHORED option to pcre_exec()
\B pass the PCRE_NOTBOL option to pcre_exec()
\Cdd call pcre_copy_substring() for substring dd
after a successful match (any decimal number
less than 32)
\Gdd call pcre_get_substring() for substring dd
after a successful match (any decimal number
less than 32)
\L call pcre_get_substring_list() after a
successful match
\N pass the PCRE_NOTEMPTY option to pcre_exec()
\Odd set the size of the output vector passed to
pcre_exec() to dd (any number of decimal
digits)
\Z pass the PCRE_NOTEOL option to pcre_exec()
When \O is used, the size it gives may be higher or lower than the
size set by the -o option (or defaulted to 45); \O applies only to
the call of pcre_exec() for the line in which it appears.
A backslash followed by any other character just escapes that
character. If the very last character is a backslash, it is
ignored. This gives a way of passing an empty line as data,
since a real empty line terminates the data input.
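The escape processing can be approximated by a small decoder. The sketch below is in Python and is not taken from pcretest; the name decode_data_line is hypothetical, the option names are returned as plain strings, and \C, \G, \L, \O and \x{hh...} are deliberately left out. Treating \x with no hex digits as a zero character is an assumption.

```python
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
FLAGS = {"A": "PCRE_ANCHORED", "B": "PCRE_NOTBOL",
         "N": "PCRE_NOTEMPTY", "Z": "PCRE_NOTEOL"}
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"


def decode_data_line(line):
    """Return (subject, options) for one data line."""
    text = line.strip()               # leading and trailing whitespace is removed
    out, options = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            out.append(ch)
            continue
        if i == len(text):
            break                     # a final backslash is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            out.append(chr(SIMPLE[esc]))
        elif esc in FLAGS:
            options.append(FLAGS[esc])
        elif esc in OCTAL:            # \nnn: up to 3 octal digits
            digits = esc
            while len(digits) < 3 and i < len(text) and text[i] in OCTAL:
                digits += text[i]
                i += 1
            out.append(chr(int(digits, 8) & 0xFF))
        elif esc == "x":              # \xhh: up to 2 hex digits
            if i < len(text) and text[i] == "{":
                raise NotImplementedError("\\x{hh...} is not covered here")
            digits = ""
            while len(digits) < 2 and i < len(text) and text[i] in HEX:
                digits += text[i]
                i += 1
            out.append(chr(int(digits, 16) if digits else 0))
        elif esc in "CGLO":
            raise NotImplementedError("\\%s is not covered here" % esc)
        else:
            out.append(esc)           # anything else is just escaped
    return "".join(out), options
```

A data line consisting of a single backslash decodes to an empty subject, which is how an empty string can be passed as data.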
If /P was present on the regex, causing the POSIX wrapper
API to be used, only \B and \Z have any effect, causing
REG_NOTBOL and REG_NOTEOL to be passed to regexec()
respectively.
The use of \x{hh...} to represent UTF-8 characters is not
dependent on the use of the /8 modifier on the pattern. It
is recognized always. There may be any number of hexadecimal
digits inside the braces. The result is from one to six
bytes, encoded according to the UTF-8 rules.
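The one-to-six-byte encoding can be illustrated as follows. This Python sketch assumes the original UTF-8 scheme, which covers values up to 0x7FFFFFFF; the function name encode_utf8_extended is invented for the example and is not a PCRE or pcretest function.

```python
def encode_utf8_extended(cp):
    """Encode a value of up to 31 bits as 1 to 6 bytes under the UTF-8 rules."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value out of range for 6-byte UTF-8")
    if cp < 0x80:
        return bytes([cp])            # 7-bit values are a single byte
    # (exclusive upper limit, total number of bytes)
    limits = [(0x800, 2), (0x10000, 3), (0x200000, 4),
              (0x4000000, 5), (0x80000000, 6)]
    for limit, n in limits:
        if cp < limit:
            break
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))   # continuation byte: 10xxxxxx
        cp >>= 6
    lead = ((0xFF << (8 - n)) & 0xFF) | cp  # n leading 1 bits, then a 0 bit
    return bytes([lead] + tail[::-1])


print(encode_utf8_extended(0x20AC).hex())
```

For values that are valid Unicode characters, the result agrees with Python's own UTF-8 encoder; the 5-byte and 6-byte forms exist only in the older, wider definition of UTF-8.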
OUTPUT FROM PCRETEST
When a match succeeds, pcretest outputs the list of captured
substrings that pcre_exec() returns, starting with number 0
for the string that matched the whole pattern. Here is an
example of an interactive pcretest run.
$ pcretest
PCRE version 2.06 08-Jun-1999
re> /^abc(\d+)/
data> abc123
0: abc123
1: 123
data> xyz
No match
If the strings contain any non-printing characters, they are
output as \0x escapes, or as \x{...} escapes if the /8
modifier was present on the pattern. If the pattern has the
/+ modifier, then the output for substring 0 is followed by
the rest of the subject string, identified by "0+" like
this:
re> /cat/+
data> cataract
0: cat
0+ aract
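The output layout shown in these examples can be mimicked with a short Python function. It uses the re module as a stand-in for PCRE, follows the layout exactly as printed above, and does not escape non-printing characters. The name render and the "<unset>" marker for groups that did not take part in the match are assumptions of this sketch.

```python
import re


def render(regex, data, show_rest=False):
    """Return pcretest-style output lines for one data line."""
    m = re.search(regex, data)
    if m is None:
        return ["No match"]
    lines = []
    for number in range(m.re.groups + 1):
        text = m.group(number)
        # Assumption: mark groups that did not participate in the match.
        lines.append("%d: %s" % (number, "<unset>" if text is None else text))
        if number == 0 and show_rest:
            # The /+ modifier: show what follows the matched substring.
            lines.append("0+ " + data[m.end():])
    return lines


for line in render(r"^abc(\d+)", "abc123"):
    print(line)
```

Calling render("cat", "cataract", show_rest=True) gives the two lines of the /+ example, "0: cat" followed by "0+ aract".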
If the pattern has the /g or /G modifier, the results of
successive matching attempts are output in sequence, like
this:
re> /\Bi(\w\w)/g
data> Mississippi
0: iss
1: ss
0: iss
1: ss
0: ipp
1: pp
"No match" is output only if the first match attempt fails.
If any of the sequences \C, \G, or \L are present in a data
line that is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L
after the string number instead of a colon. This is in addi-
tion to the normal full list. The string length (that is,
the return from the extraction function) is given in
parentheses after each string for \C and \G.
Note that while patterns can be continued over several lines
(a plain ">" prompt is used for continuations), data lines
may not. However, newlines can be included in data by means
of the \n escape.
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
Last updated: 15 August 2001
Copyright (c) 1997-2001 University of Cambridge.