<HTML><HEAD><TITLE>Tcl Built-In Commands - re_syntax manual page</TITLE></HEAD><BODY>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M2" NAME="L851">NAME</A>
|
|
<DL><DD>re_syntax - Syntax of Tcl regular expressions.</DL>
|
|
<DD><A HREF="re_syntax.htm#M3" NAME="L852">DESCRIPTION</A>
|
|
<DD><A HREF="re_syntax.htm#M4" NAME="L853">DIFFERENT FLAVORS OF REs</A>
|
|
<DD><A HREF="re_syntax.htm#M5" NAME="L854">REGULAR EXPRESSION SYNTAX</A>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M6" NAME="L855"><B>*</B></A>
|
|
<DD><A HREF="re_syntax.htm#M7" NAME="L856"><B>+</B></A>
|
|
<DD><A HREF="re_syntax.htm#M8" NAME="L857"><B>?</B></A>
|
|
<DD><A HREF="re_syntax.htm#M9" NAME="L858"><B>{</B><I>m</I><B>}</B></A>
|
|
<DD><A HREF="re_syntax.htm#M10" NAME="L859"><B>{</B><I>m</I><B>,}</B></A>
|
|
<DD><A HREF="re_syntax.htm#M11" NAME="L860"><B>{</B><I>m</I><B>,</B><I>n</I><B>}</B></A>
|
|
<DD><A HREF="re_syntax.htm#M12" NAME="L861"><B>*? +? ?? {</B><I>m</I><B>}? {</B><I>m</I><B>,}? {</B><I>m</I><B>,</B><I>n</I><B>}?</B></A>
|
|
</DL>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M13" NAME="L862"><B>(</B><I>re</I><B>)</B></A>
|
|
<DD><A HREF="re_syntax.htm#M14" NAME="L863"><B>(?:</B><I>re</I><B>)</B></A>
|
|
<DD><A HREF="re_syntax.htm#M15" NAME="L864"><B>()</B></A>
|
|
<DD><A HREF="re_syntax.htm#M16" NAME="L865"><B>(?:)</B></A>
|
|
<DD><A HREF="re_syntax.htm#M17" NAME="L866"><B>[</B><I>chars</I><B>]</B></A>
|
|
<DD><A HREF="re_syntax.htm#M18" NAME="L867"><B>.</B></A>
|
|
<DD><A HREF="re_syntax.htm#M19" NAME="L868"><B>\</B><I>k</I></A>
|
|
<DD><A HREF="re_syntax.htm#M20" NAME="L869"><B>\</B><I>c</I></A>
|
|
<DD><A HREF="re_syntax.htm#M21" NAME="L870"><B>{</B></A>
|
|
<DD><A HREF="re_syntax.htm#M22" NAME="L871"><I>x</I></A>
|
|
</DL>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M23" NAME="L872"><B>^</B></A>
|
|
<DD><A HREF="re_syntax.htm#M24" NAME="L873"><B>$</B></A>
|
|
<DD><A HREF="re_syntax.htm#M25" NAME="L874"><B>(?=</B><I>re</I><B>)</B></A>
|
|
<DD><A HREF="re_syntax.htm#M26" NAME="L875"><B>(?!</B><I>re</I><B>)</B></A>
|
|
</DL>
|
|
<DD><A HREF="re_syntax.htm#M27" NAME="L876">BRACKET EXPRESSIONS</A>
|
|
<DD><A HREF="re_syntax.htm#M28" NAME="L877">ESCAPES</A>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M29" NAME="L878"><B>\a</B></A>
|
|
<DD><A HREF="re_syntax.htm#M30" NAME="L879"><B>\b</B></A>
|
|
<DD><A HREF="re_syntax.htm#M31" NAME="L880"><B>\B</B></A>
|
|
<DD><A HREF="re_syntax.htm#M32" NAME="L881"><B>\c</B><I>X</I></A>
|
|
<DD><A HREF="re_syntax.htm#M33" NAME="L882"><B>\e</B></A>
|
|
<DD><A HREF="re_syntax.htm#M34" NAME="L883"><B>\f</B></A>
|
|
<DD><A HREF="re_syntax.htm#M35" NAME="L884"><B>\n</B></A>
|
|
<DD><A HREF="re_syntax.htm#M36" NAME="L885"><B>\r</B></A>
|
|
<DD><A HREF="re_syntax.htm#M37" NAME="L886"><B>\t</B></A>
|
|
<DD><A HREF="re_syntax.htm#M38" NAME="L887"><B>\u</B><I>wxyz</I></A>
|
|
<DD><A HREF="re_syntax.htm#M39" NAME="L888"><B>\U</B><I>stuvwxyz</I></A>
|
|
<DD><A HREF="re_syntax.htm#M40" NAME="L889"><B>\v</B></A>
|
|
<DD><A HREF="re_syntax.htm#M41" NAME="L890"><B>\x</B><I>hhh</I></A>
|
|
<DD><A HREF="re_syntax.htm#M42" NAME="L891"><B>\0</B></A>
|
|
<DD><A HREF="re_syntax.htm#M43" NAME="L892"><B>\</B><I>xy</I></A>
|
|
<DD><A HREF="re_syntax.htm#M44" NAME="L893"><B>\</B><I>xyz</I></A>
|
|
</DL>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M45" NAME="L894"><B>\d</B></A>
|
|
<DD><A HREF="re_syntax.htm#M46" NAME="L895"><B>\s</B></A>
|
|
<DD><A HREF="re_syntax.htm#M47" NAME="L896"><B>\w</B></A>
|
|
<DD><A HREF="re_syntax.htm#M48" NAME="L897"><B>\D</B></A>
|
|
<DD><A HREF="re_syntax.htm#M49" NAME="L898"><B>\S</B></A>
|
|
<DD><A HREF="re_syntax.htm#M50" NAME="L899"><B>\W</B></A>
|
|
</DL>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M51" NAME="L900"><B>\A</B></A>
|
|
<DD><A HREF="re_syntax.htm#M52" NAME="L901"><B>\m</B></A>
|
|
<DD><A HREF="re_syntax.htm#M53" NAME="L902"><B>\M</B></A>
|
|
<DD><A HREF="re_syntax.htm#M54" NAME="L903"><B>\y</B></A>
|
|
<DD><A HREF="re_syntax.htm#M55" NAME="L904"><B>\Y</B></A>
|
|
<DD><A HREF="re_syntax.htm#M56" NAME="L905"><B>\Z</B></A>
|
|
<DD><A HREF="re_syntax.htm#M57" NAME="L906"><B>\</B><I>m</I></A>
|
|
<DD><A HREF="re_syntax.htm#M58" NAME="L907"><B>\</B><I>mnn</I></A>
|
|
</DL>
|
|
<DD><A HREF="re_syntax.htm#M59" NAME="L908">METASYNTAX</A>
|
|
<DL>
|
|
<DD><A HREF="re_syntax.htm#M60" NAME="L909"><B>b</B></A>
|
|
<DD><A HREF="re_syntax.htm#M61" NAME="L910"><B>c</B></A>
|
|
<DD><A HREF="re_syntax.htm#M62" NAME="L911"><B>e</B></A>
|
|
<DD><A HREF="re_syntax.htm#M63" NAME="L912"><B>i</B></A>
|
|
<DD><A HREF="re_syntax.htm#M64" NAME="L913"><B>m</B></A>
|
|
<DD><A HREF="re_syntax.htm#M65" NAME="L914"><B>n</B></A>
|
|
<DD><A HREF="re_syntax.htm#M66" NAME="L915"><B>p</B></A>
|
|
<DD><A HREF="re_syntax.htm#M67" NAME="L916"><B>q</B></A>
|
|
<DD><A HREF="re_syntax.htm#M68" NAME="L917"><B>s</B></A>
|
|
<DD><A HREF="re_syntax.htm#M69" NAME="L918"><B>t</B></A>
|
|
<DD><A HREF="re_syntax.htm#M70" NAME="L919"><B>w</B></A>
|
|
<DD><A HREF="re_syntax.htm#M71" NAME="L920"><B>x</B></A>
|
|
</DL>
|
|
<DD><A HREF="re_syntax.htm#M72" NAME="L921">MATCHING</A>
|
|
<DD><A HREF="re_syntax.htm#M73" NAME="L922">LIMITS AND COMPATIBILITY</A>
|
|
<DD><A HREF="re_syntax.htm#M74" NAME="L923">BASIC REGULAR EXPRESSIONS</A>
|
|
<DD><A HREF="re_syntax.htm#M75" NAME="L924">SEE ALSO</A>
|
|
<DD><A HREF="re_syntax.htm#M76" NAME="L925">KEYWORDS</A>
|
|
</DL><HR>
|
|
<H3><A NAME="M2">NAME</A></H3>
|
|
re_syntax - Syntax of Tcl regular expressions.
|
|
<H3><A NAME="M3">DESCRIPTION</A></H3>
|
|
A <I>regular expression</I> describes strings of characters.
|
|
It's a pattern that matches certain strings and doesn't match others.
|
|
|
|
<H3><A NAME="M4">DIFFERENT FLAVORS OF REs</A></H3>
|
|
Regular expressions (``RE''s), as defined by POSIX, come in two
|
|
flavors: <I>extended</I> REs (``EREs'') and <I>basic</I> REs (``BREs'').
|
|
EREs are roughly those of the traditional <I>egrep</I>, while BREs are
|
|
roughly those of the traditional <I>ed</I>. This implementation adds
|
|
a third flavor, <I>advanced</I> REs (``AREs''), basically EREs with
|
|
some significant extensions.
|
|
<P>
|
|
This manual page primarily describes AREs. BREs mostly exist for
|
|
backward compatibility in some old programs; they will be discussed at
|
|
the end. POSIX EREs are almost an exact subset of AREs. Features of
|
|
AREs that are not present in EREs will be indicated.
|
|
|
|
<H3><A NAME="M5">REGULAR EXPRESSION SYNTAX</A></H3>
|
|
Tcl regular expressions are implemented using the package written by
|
|
Henry Spencer, based on the 1003.2 spec and some (not quite all) of
|
|
the Perl5 extensions (thanks, Henry!). Much of the description of
|
|
regular expressions below is copied verbatim from his manual entry.
|
|
<P>
|
|
An ARE is one or more <I>branches</I>,
|
|
separated by `<B>|</B>',
|
|
matching anything that matches any of the branches.
|
|
<P>
|
|
A branch is zero or more <I>constraints</I> or <I>quantified atoms</I>,
|
|
concatenated.
|
|
It matches a match for the first, followed by a match for the second, and so on;
|
|
an empty branch matches the empty string.
|
|
<P>
|
|
A quantified atom is an <I>atom</I> possibly followed
|
|
by a single <I>quantifier</I>.
|
|
Without a quantifier, it matches a match for the atom.
|
|
The quantifiers,
|
|
and what a so-quantified atom matches, are:
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M6"><B>*</B></A><DD>
|
|
a sequence of 0 or more matches of the atom
|
|
<P><DT><A NAME="M7"><B>+</B></A><DD>
|
|
a sequence of 1 or more matches of the atom
|
|
<P><DT><A NAME="M8"><B>?</B></A><DD>
|
|
a sequence of 0 or 1 matches of the atom
|
|
<P><DT><A NAME="M9"><B>{</B><I>m</I><B>}</B></A><DD>
|
|
a sequence of exactly <I>m</I> matches of the atom
|
|
<P><DT><A NAME="M10"><B>{</B><I>m</I><B>,}</B></A><DD>
|
|
a sequence of <I>m</I> or more matches of the atom
|
|
<P><DT><A NAME="M11"><B>{</B><I>m</I><B>,</B><I>n</I><B>}</B></A><DD>
|
|
a sequence of <I>m</I> through <I>n</I> (inclusive) matches of the atom;
|
|
<I>m</I> may not exceed <I>n</I>
|
|
<P><DT><A NAME="M12"><B>*? +? ?? {</B><I>m</I><B>}? {</B><I>m</I><B>,}? {</B><I>m</I><B>,</B><I>n</I><B>}?</B></A><DD>
|
|
<I>non-greedy</I> quantifiers,
|
|
which match the same possibilities,
|
|
but prefer the smallest number rather than the largest number
|
|
of matches (see MATCHING)
|
|
<P></DL>
|
|
<P>
|
|
The forms using
|
|
<B>{</B> and <B>}</B>
|
|
are known as <I>bound</I>s.
|
|
The numbers
|
|
<I>m</I> and <I>n</I> are unsigned decimal integers
|
|
with permissible values from 0 to 255 inclusive.
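<P>
The quantifiers above can be sketched as follows. This is only an illustration: it assumes the patterns are handed to Tcl's <B>regexp</B> command inside braces (so that Tcl performs no substitution of its own), and the sample strings are invented.
<PRE>
```tcl
# Greedy versus non-greedy quantifiers on the same input.
set s "aaaa"

# a{2,3} is greedy: it takes as many a's as the bound allows.
puts [regexp -inline {a{2,3}} $s]     ;# prints: aaa

# a{2,3}? is non-greedy: it settles for the minimum.
puts [regexp -inline {a{2,3}?} $s]    ;# prints: aa

# a* may match zero times, so it succeeds even when no a is present.
puts [regexp {a*} "xyz"]              ;# prints: 1
```
</PRE>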
|
|
<P>
|
|
An atom is one of:
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M13"><B>(</B><I>re</I><B>)</B></A><DD>
|
|
(where <I>re</I> is any regular expression)
|
|
matches a match for
|
|
<I>re</I>, with the match noted for possible reporting
|
|
<P><DT><A NAME="M14"><B>(?:</B><I>re</I><B>)</B></A><DD>
|
|
as previous,
|
|
but does no reporting
|
|
(a ``non-capturing'' set of parentheses)
|
|
<P><DT><A NAME="M15"><B>()</B></A><DD>
|
|
matches an empty string,
|
|
noted for possible reporting
|
|
<P><DT><A NAME="M16"><B>(?:)</B></A><DD>
|
|
matches an empty string,
|
|
without reporting
|
|
<P><DT><A NAME="M17"><B>[</B><I>chars</I><B>]</B></A><DD>
|
|
a <I>bracket expression</I>,
|
|
matching any one of the <I>chars</I> (see BRACKET EXPRESSIONS for more detail)
|
|
<P><DT><A NAME="M18"><B>.</B></A><DD>
|
|
matches any single character
|
|
<P><DT><A NAME="M19"><B>\</B><I>k</I></A><DD>
|
|
(where <I>k</I> is a non-alphanumeric character)
|
|
matches that character taken as an ordinary character,
|
|
e.g. \\ matches a backslash character
|
|
<P><DT><A NAME="M20"><B>\</B><I>c</I></A><DD>
|
|
where <I>c</I> is alphanumeric
|
|
(possibly followed by other characters),
|
|
an <I>escape</I> (AREs only),
|
|
see ESCAPES below
|
|
<P><DT><A NAME="M21"><B>{</B></A><DD>
|
|
when followed by a character other than a digit,
|
|
matches the left-brace character `<B>{</B>';
|
|
when followed by a digit, it is the beginning of a
|
|
<I>bound</I> (see above)
|
|
<P><DT><A NAME="M22"><I>x</I></A><DD>
|
|
where <I>x</I> is
|
|
a single character with no other significance, matches that character.
|
|
<P></DL>
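<P>
As a small sketch of capturing and non-capturing parentheses, assuming an invented date string and the variable names <I>whole</I>, <I>year</I>, <I>month</I> and <I>day</I> (none of which come from this page):
<PRE>
```tcl
set date "2024-05-17"

# Capturing parentheses report each submatch in its own variable.
regexp {(\d+)-(\d+)-(\d+)} $date whole year month day
puts "$year $month $day"                        ;# prints: 2024 05 17

# (?:...) groups without reporting, so only one submatch is returned.
puts [regexp -inline {(?:\d+-)+(\d+)} $date]    ;# prints: 2024-05-17 17
```
</PRE>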
|
|
<P>
|
|
A <I>constraint</I> matches an empty string when specific conditions
|
|
are met.
|
|
A constraint may not be followed by a quantifier.
|
|
The simple constraints are as follows; some more constraints are
|
|
described later, under ESCAPES.
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M23"><B>^</B></A><DD>
|
|
matches at the beginning of a line
|
|
<P><DT><A NAME="M24"><B>$</B></A><DD>
|
|
matches at the end of a line
|
|
<P><DT><A NAME="M25"><B>(?=</B><I>re</I><B>)</B></A><DD>
|
|
<I>positive lookahead</I> (AREs only), matches at any point
|
|
where a substring matching <I>re</I> begins
|
|
<P><DT><A NAME="M26"><B>(?!</B><I>re</I><B>)</B></A><DD>
|
|
<I>negative lookahead</I> (AREs only), matches at any point
|
|
where no substring matching <I>re</I> begins
|
|
<P></DL>
|
|
<P>
|
|
The lookahead constraints may not contain back references (see later),
|
|
and all parentheses within them are considered non-capturing.
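<P>
The lookahead constraints can be tried out as below. This is a minimal sketch with made-up input strings; note that the text tested by the lookahead is not part of the reported match.
<PRE>
```tcl
# (?=re) succeeds only where a match for re begins, without consuming it.
puts [regexp -inline {foo(?=bar)} "foobar"]    ;# prints: foo

# (?!re) succeeds only where no match for re begins.
puts [regexp {foo(?!bar)} "foobar"]            ;# prints: 0
puts [regexp {foo(?!bar)} "foobaz"]            ;# prints: 1
```
</PRE>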
|
|
<P>
|
|
An RE may not end with `<B>\</B>'.
|
|
|
|
<H3><A NAME="M27">BRACKET EXPRESSIONS</A></H3>
|
|
A <I>bracket expression</I> is a list of characters enclosed in `<B>[ ]</B>'.
|
|
It normally matches any single character from the list (but see below).
|
|
If the list begins with `<B>^</B>',
|
|
it matches any single character
|
|
(but see below) <I>not</I> from the rest of the list.
|
|
<P>
|
|
If two characters in the list are separated by `<B>-</B>',
|
|
this is shorthand
|
|
for the full <I>range</I> of characters between those two (inclusive) in the
|
|
collating sequence,
|
|
e.g.
|
|
<B>[0-9]</B>
|
|
in ASCII matches any decimal digit.
|
|
Two ranges may not share an
|
|
endpoint, so e.g.
|
|
<B>a-c-e</B>
|
|
is illegal.
|
|
Ranges are very collating-sequence-dependent,
|
|
and portable programs should avoid relying on them.
|
|
<P>
|
|
To include a literal
|
|
<B>]</B>
|
|
or
|
|
<B>-</B>
|
|
in the list,
|
|
the simplest method is to
|
|
enclose it in
|
|
<B>[.</B> and <B>.]</B>
|
|
to make it a collating element (see below).
|
|
Alternatively,
|
|
make it the first character
|
|
(following a possible `<B>^</B>'),
|
|
or (AREs only) precede it with `<B>\</B>'.
|
|
Alternatively, for `<B>-</B>',
|
|
make it the last character,
|
|
or the second endpoint of a range.
|
|
To use a literal
|
|
<B>-</B>
|
|
as the first endpoint of a range,
|
|
make it a collating element
|
|
or (AREs only) precede it with `<B>\</B>'.
|
|
With the exception of these, some combinations using
|
|
<B>[</B>
|
|
(see next
|
|
paragraphs), and escapes,
|
|
all other special characters lose their
|
|
special significance within a bracket expression.
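<P>
The rules for a literal <B>]</B> or <B>-</B> might be exercised like this. It is a sketch only; the input strings and the variable name <I>m</I> are invented for illustration.
<PRE>
```tcl
# A "]" placed first in the list is literal.
regexp {[]x]+} {a]x]b} m
puts $m     ;# prints: ]x]

# A "-" placed last in the list is literal.
regexp {[0-9-]+} {tel 555-0100} m
puts $m     ;# prints: 555-0100

# In AREs, a backslash also makes either character literal.
regexp {[\]\-]+} {a]-]b} m
puts $m     ;# prints: ]-]
```
</PRE>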
|
|
<P>
|
|
Within a bracket expression, a collating element (a character,
|
|
a multi-character sequence that collates as if it were a single character,
|
|
or a collating-sequence name for either)
|
|
enclosed in
|
|
<B>[.</B> and <B>.]</B>
|
|
stands for the
|
|
sequence of characters of that collating element.
|
|
The sequence is a single element of the bracket expression's list.
|
|
A bracket expression in a locale that has
|
|
multi-character collating elements
|
|
can thus match more than one character.
|
|
So (insidiously), a bracket expression that starts with <B>^</B>
|
|
can match multi-character collating elements even if none of them
|
|
appear in the bracket expression!
|
|
(<I>Note:</I> Tcl currently has no multi-character collating elements.
|
|
This information is only for illustration.)
|
|
<P>
|
|
For example, assume the collating sequence includes a <B>ch</B>
|
|
multi-character collating element.
|
|
Then the RE <B>[[.ch.]]*c</B> (zero or more <B>ch</B>'s followed by <B>c</B>)
|
|
matches the first five characters of `<B>chchcc</B>'.
|
|
Also, the RE <B>[^c]b</B> matches all of `<B>chb</B>'
|
|
(because <B>[^c]</B> matches the multi-character <B>ch</B>).
|
|
<P>
|
|
Within a bracket expression, a collating element enclosed in
|
|
<B>[=</B>
|
|
and
|
|
<B>=]</B>
|
|
is an equivalence class, standing for the sequences of characters
|
|
of all collating elements equivalent to that one, including itself.
|
|
(If there are no other equivalent collating elements,
|
|
the treatment is as if the enclosing delimiters were `<B>[.</B>'
|
|
and `<B>.]</B>'.)
|
|
For example, if
|
|
<B>o</B>
|
|
and
|
|
<B>ô</B>
|
|
are the members of an equivalence class,
|
|
then `<B>[[=o=]]</B>', `<B>[[=ô=]]</B>',
|
|
and `<B>[oô]</B>'
|
|
are all synonymous.
|
|
An equivalence class may not be an endpoint
|
|
of a range.
|
|
(<I>Note:</I>
|
|
Tcl currently implements only the Unicode locale.
|
|
It doesn't define any equivalence classes.
|
|
The examples above are just illustrations.)
|
|
<P>
|
|
Within a bracket expression, the name of a <I>character class</I> enclosed
|
|
in
|
|
<B>[:</B>
|
|
and
|
|
<B>:]</B>
|
|
stands for the list of all characters
|
|
(not all collating elements!)
|
|
belonging to that
|
|
class.
|
|
Standard character classes are:
|
|
<P>
|
|
<DL><P><DD>
|
|
<B>alpha</B> A letter.
|
|
<B>upper</B> An upper-case letter.
|
|
<B>lower</B> A lower-case letter.
|
|
<B>digit</B> A decimal digit.
|
|
<B>xdigit</B> A hexadecimal digit.
|
|
<B>alnum</B> An alphanumeric (letter or digit).
|
|
<B>print</B> An alphanumeric (same as alnum).
|
|
<B>blank</B> A space or tab character.
|
|
<B>space</B> A character producing white space in displayed text.
|
|
<B>punct</B> A punctuation character.
|
|
<B>graph</B> A character with a visible representation.
|
|
<B>cntrl</B> A control character.
|
|
</DL>
|
|
<P>
|
|
A locale may provide others.
|
|
(Note that the current Tcl implementation has only one locale:
|
|
the Unicode locale.)
|
|
A character class may not be used as an endpoint of a range.
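<P>
A brief sketch of character classes inside bracket expressions, using invented sample text:
<PRE>
```tcl
# [:digit:] inside a bracket expression matches decimal digits.
puts [regexp -inline {[[:digit:]]+} "abc123def"]             ;# prints: 123

# A leading ^ complements the class.
puts [regexp -inline {[^[:alpha:]]+} "abc123def"]            ;# prints: 123

# Classes can be combined with other bracket expressions in one RE.
puts [regexp -inline {[[:upper:]][[:lower:]]+} "say Hello"]  ;# prints: Hello
```
</PRE>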
|
|
<P>
|
|
There are two special cases of bracket expressions:
|
|
the bracket expressions
|
|
<B>[[:<:]]</B>
|
|
and
|
|
<B>[[:>:]]</B>
|
|
are constraints, matching empty strings at
|
|
the beginning and end of a word respectively.
|
|
A word is defined as a sequence of
|
|
word characters
|
|
that is neither preceded nor followed by
|
|
word characters.
|
|
A word character is an
|
|
<I>alnum</I>
|
|
character
|
|
or an underscore
|
|
(<B>_</B>).
|
|
These special bracket expressions are deprecated;
|
|
users of AREs should use constraint escapes instead (see below).
|
|
<H3><A NAME="M28">ESCAPES</A></H3>
|
|
Escapes (AREs only), which begin with a
|
|
<B>\</B>
|
|
followed by an alphanumeric character,
|
|
come in several varieties:
|
|
character entry, class shorthands, constraint escapes, and back references.
|
|
A
|
|
<B>\</B>
|
|
followed by an alphanumeric character but not constituting
|
|
a valid escape is illegal in AREs.
|
|
In EREs, there are no escapes:
|
|
outside a bracket expression,
|
|
a
|
|
<B>\</B>
|
|
followed by an alphanumeric character merely stands for that
|
|
character as an ordinary character,
|
|
and inside a bracket expression,
|
|
<B>\</B>
|
|
is an ordinary character.
|
|
(The latter is the one actual incompatibility between EREs and AREs.)
|
|
<P>
|
|
Character-entry escapes (AREs only) exist to make it easier to specify
|
|
non-printing and otherwise inconvenient characters in REs:
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M29"><B>\a</B></A><DD>
|
|
alert (bell) character, as in C
|
|
<P><DT><A NAME="M30"><B>\b</B></A><DD>
|
|
backspace, as in C
|
|
<P><DT><A NAME="M31"><B>\B</B></A><DD>
|
|
synonym for
|
|
<B>\</B>
|
|
to help reduce backslash doubling in some
|
|
applications where there are multiple levels of backslash processing
|
|
<P><DT><A NAME="M32"><B>\c</B><I>X</I></A><DD>
|
|
(where <I>X</I> is any character) the character whose
|
|
low-order 5 bits are the same as those of
|
|
<I>X</I>,
|
|
and whose other bits are all zero
|
|
<P><DT><A NAME="M33"><B>\e</B></A><DD>
|
|
the character whose collating-sequence name
|
|
is `<B>ESC</B>',
|
|
or failing that, the character with octal value 033
|
|
<P><DT><A NAME="M34"><B>\f</B></A><DD>
|
|
formfeed, as in C
|
|
<P><DT><A NAME="M35"><B>\n</B></A><DD>
|
|
newline, as in C
|
|
<P><DT><A NAME="M36"><B>\r</B></A><DD>
|
|
carriage return, as in C
|
|
<P><DT><A NAME="M37"><B>\t</B></A><DD>
|
|
horizontal tab, as in C
|
|
<P><DT><A NAME="M38"><B>\u</B><I>wxyz</I></A><DD>
|
|
(where
|
|
<I>wxyz</I>
|
|
is exactly four hexadecimal digits)
|
|
the Unicode character
|
|
<B>U+</B><I>wxyz</I>
|
|
in the local byte ordering
|
|
<P><DT><A NAME="M39"><B>\U</B><I>stuvwxyz</I></A><DD>
|
|
(where
|
|
<I>stuvwxyz</I>
|
|
is exactly eight hexadecimal digits)
|
|
reserved for a somewhat-hypothetical Unicode extension to 32 bits
|
|
<P><DT><A NAME="M40"><B>\v</B></A><DD>
|
|
vertical tab, as in C
|
|
<P><DT><A NAME="M41"><B>\x</B><I>hhh</I></A><DD>
|
|
(where
|
|
<I>hhh</I>
|
|
is any sequence of hexadecimal digits)
|
|
the character whose hexadecimal value is
|
|
<B>0x</B><I>hhh</I>
|
|
(a single character no matter how many hexadecimal digits are used).
|
|
<P><DT><A NAME="M42"><B>\0</B></A><DD>
|
|
the character whose value is
|
|
<B>0</B>
|
|
<P><DT><A NAME="M43"><B>\</B><I>xy</I></A><DD>
|
|
(where
|
|
<I>xy</I>
|
|
is exactly two octal digits,
|
|
and is not a
|
|
<I>back reference</I> (see below))
|
|
the character whose octal value is
|
|
<B>0</B><I>xy</I>
|
|
<P><DT><A NAME="M44"><B>\</B><I>xyz</I></A><DD>
|
|
(where
|
|
<I>xyz</I>
|
|
is exactly three octal digits,
|
|
and is not a
|
|
back reference (see below))
|
|
the character whose octal value is
|
|
<B>0</B><I>xyz</I>
|
|
<P></DL>
|
|
<P>
|
|
Hexadecimal digits are `<B>0</B>'-`<B>9</B>', `<B>a</B>'-`<B>f</B>',
|
|
and `<B>A</B>'-`<B>F</B>'.
|
|
Octal digits are `<B>0</B>'-`<B>7</B>'.
|
|
<P>
|
|
The character-entry escapes are always taken as ordinary characters.
|
|
For example,
|
|
<B>\135</B>
|
|
is
|
|
<B>]</B>
|
|
in ASCII,
|
|
but
|
|
<B>\135</B>
|
|
does not terminate a bracket expression.
|
|
Beware, however, that some applications (e.g., C compilers) interpret
|
|
such sequences themselves before the regular-expression package
|
|
gets to see them, which may require doubling (quadrupling, etc.) the `<B>\</B>'.
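<P>
A short sketch of character-entry escapes follows. It assumes the pattern is written in braces, so that the backslashes reach the regular-expression package unchanged; the test strings are invented.
<PRE>
```tcl
# \x41, \t and \u0042 name "A", a tab and "B" by code.
puts [regexp {^\x41\t\u0042$} "A\tB"]    ;# prints: 1

# \135 is "]" in ASCII, yet it does not close the bracket expression.
puts [regexp {^[a\135]+$} {a]a]}]        ;# prints: 1
```
</PRE>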
|
|
<P>
|
|
Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used
|
|
character classes:
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M45"><B>\d</B></A><DD>
|
|
<B>[[:digit:]]</B>
|
|
<P><DT><A NAME="M46"><B>\s</B></A><DD>
|
|
<B>[[:space:]]</B>
|
|
<P><DT><A NAME="M47"><B>\w</B></A><DD>
|
|
<B>[[:alnum:]_]</B>
|
|
(note underscore)
|
|
<P><DT><A NAME="M48"><B>\D</B></A><DD>
|
|
<B>[^[:digit:]]</B>
|
|
<P><DT><A NAME="M49"><B>\S</B></A><DD>
|
|
<B>[^[:space:]]</B>
|
|
<P><DT><A NAME="M50"><B>\W</B></A><DD>
|
|
<B>[^[:alnum:]_]</B>
|
|
(note underscore)
|
|
<P></DL>
|
|
<P>
|
|
Within bracket expressions, `<B>\d</B>', `<B>\s</B>',
|
|
and `<B>\w</B>'
|
|
lose their outer brackets,
|
|
and `<B>\D</B>', `<B>\S</B>',
|
|
and `<B>\W</B>'
|
|
are illegal.
|
|
(So, for example, <B>[a-c\d]</B> is equivalent to <B>[a-c[:digit:]]</B>.
|
|
Also, <B>[a-c\D]</B>, which is equivalent to <B>[a-c^[:digit:]]</B>, is illegal.)
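<P>
The class shorthands might be used as in this sketch, where the sample string is invented:
<PRE>
```tcl
set s "id_42 = 7"

puts [regexp -inline {\w+} $s]              ;# prints: id_42
puts [regexp -inline {\d+} $s]              ;# prints: 42
puts [regexp -inline {\S+} $s]              ;# prints: id_42

# Inside a bracket expression, \d stands for [:digit:].
puts [regexp -all -inline {[a-z\d]+} $s]    ;# prints: id 42 7
```
</PRE>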
|
|
<P>
|
|
A constraint escape (AREs only) is a constraint,
|
|
matching the empty string if specific conditions are met,
|
|
written as an escape:
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M51"><B>\A</B></A><DD>
|
|
matches only at the beginning of the string
|
|
(see MATCHING, below, for how this differs from `<B>^</B>')
|
|
<P><DT><A NAME="M52"><B>\m</B></A><DD>
|
|
matches only at the beginning of a word
|
|
<P><DT><A NAME="M53"><B>\M</B></A><DD>
|
|
matches only at the end of a word
|
|
<P><DT><A NAME="M54"><B>\y</B></A><DD>
|
|
matches only at the beginning or end of a word
|
|
<P><DT><A NAME="M55"><B>\Y</B></A><DD>
|
|
matches only at a point that is not the beginning or end of a word
|
|
<P><DT><A NAME="M56"><B>\Z</B></A><DD>
|
|
matches only at the end of the string
|
|
(see MATCHING, below, for how this differs from `<B>$</B>')
|
|
<P><DT><A NAME="M57"><B>\</B><I>m</I></A><DD>
|
|
(where
|
|
<I>m</I>
|
|
is a nonzero digit) a <I>back reference</I>, see below
|
|
<P><DT><A NAME="M58"><B>\</B><I>mnn</I></A><DD>
|
|
(where
|
|
<I>m</I>
|
|
is a nonzero digit, and
|
|
<I>nn</I>
|
|
is some more digits,
|
|
and the decimal value
|
|
<I>mnn</I>
|
|
is not greater than the number of closing capturing parentheses seen so far)
|
|
a <I>back reference</I>, see below
|
|
<P></DL>
|
|
<P>
|
|
A word is defined as in the specification of
|
|
<B>[[:<:]]</B>
|
|
and
|
|
<B>[[:>:]]</B>
|
|
above.
|
|
Constraint escapes are illegal within bracket expressions.
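<P>
The word-boundary constraint escapes can be compared side by side. This is an illustrative sketch with an invented sample string.
<PRE>
```tcl
set s "cat concat cats"

# \m and \M together select only the stand-alone word.
puts [regexp -all -inline {\mcat\M} $s]    ;# prints: cat

# \m alone accepts "cat" and the start of "cats".
puts [regexp -all -inline {\mcat} $s]      ;# prints: cat cat

# \M alone accepts "cat" and the end of "concat".
puts [regexp -all -inline {cat\M} $s]      ;# prints: cat cat

# \y matches at either end of a word; -all returns the match count here.
puts [regexp -all {\ycat\y} $s]            ;# prints: 1
```
</PRE>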
|
|
<P>
|
|
A back reference (AREs only) matches the same string matched by the parenthesized
|
|
subexpression specified by the number,
|
|
so that (e.g.)
|
|
<B>([bc])\1</B>
|
|
matches
|
|
<B>bb</B>
|
|
or
|
|
<B>cc</B>
|
|
but not `<B>bc</B>'.
|
|
The subexpression must entirely precede the back reference in the RE.
|
|
Subexpressions are numbered in the order of their leading parentheses.
|
|
Non-capturing parentheses do not define subexpressions.
|
|
<P>
|
|
There is an inherent historical ambiguity between octal character-entry
|
|
escapes and back references, which is resolved by heuristics,
|
|
as hinted at above.
|
|
A leading zero always indicates an octal escape.
|
|
A single non-zero digit, not followed by another digit,
|
|
is always taken as a back reference.
|
|
A multi-digit sequence not starting with a zero is taken as a back
|
|
reference if it comes after a suitable subexpression
|
|
(i.e. the number is in the legal range for a back reference),
|
|
and otherwise is taken as octal.
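<P>
The back-reference example given above can be checked with a short loop. This is a sketch only; the <B>^</B> and <B>$</B> anchors are added here so that each test string must match in full.
<PRE>
```tcl
# \1 must repeat exactly what the first parenthesized subexpression matched.
foreach s {bb cc bc} {
    puts "${s}: [regexp {^([bc])\1$} $s]"
}
# prints:
#   bb: 1
#   cc: 1
#   bc: 0
```
</PRE>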
|
|
<H3><A NAME="M59">METASYNTAX</A></H3>
|
|
In addition to the main syntax described above, there are some special
|
|
forms and miscellaneous syntactic facilities available.
|
|
<P>
|
|
Normally the flavor of RE being used is specified by
|
|
application-dependent means.
|
|
However, this can be overridden by a <I>director</I>.
|
|
If an RE of any flavor begins with `<B>***:</B>',
|
|
the rest of the RE is an ARE.
|
|
If an RE of any flavor begins with `<B>***=</B>',
|
|
the rest of the RE is taken to be a literal string,
|
|
with all characters considered ordinary characters.
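<P>
As a brief sketch of the literal-string director, with invented test strings:
<PRE>
```tcl
# After ***= the "." is an ordinary character, not a wildcard.
puts [regexp {***=a.b} "a.b"]    ;# prints: 1
puts [regexp {***=a.b} "axb"]    ;# prints: 0

# Without the director, "." matches any single character.
puts [regexp {a.b} "axb"]        ;# prints: 1
```
</PRE>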
|
|
<P>
|
|
An ARE may begin with <I>embedded options</I>:
|
|
a sequence
|
|
<B>(?</B><I>xyz</I><B>)</B>
|
|
(where
|
|
<I>xyz</I>
|
|
is one or more alphabetic characters)
|
|
specifies options affecting the rest of the RE.
|
|
These supplement, and can override,
|
|
any options specified by the application.
|
|
The available option letters are:
|
|
<P>
|
|
<DL>
|
|
<P><DT><A NAME="M60"><B>b</B></A><DD>
|
|
rest of RE is a BRE
|
|
<P><DT><A NAME="M61"><B>c</B></A><DD>
|
|
case-sensitive matching (usual default)
|
|
<P><DT><A NAME="M62"><B>e</B></A><DD>
|
|
rest of RE is an ERE
|
|
<P><DT><A NAME="M63"><B>i</B></A><DD>
|
|
case-insensitive matching (see MATCHING, below)
|
|
<P><DT><A NAME="M64"><B>m</B></A><DD>
|
|
historical synonym for
|
|
<B>n</B>
|
|
<P><DT><A NAME="M65"><B>n</B></A><DD>
|
|
newline-sensitive matching (see MATCHING, below)
|
|
<P><DT><A NAME="M66"><B>p</B></A><DD>
|
|
partial newline-sensitive matching (see MATCHING, below)
|
|
<P><DT><A NAME="M67"><B>q</B></A><DD>
|
|
rest of RE is a literal (``quoted'') string, all ordinary characters
|
|
<P><DT><A NAME="M68"><B>s</B></A><DD>
|
|
non-newline-sensitive matching (usual default)
|
|
<P><DT><A NAME="M69"><B>t</B></A><DD>
|
|
tight syntax (usual default; see below)
|
|
<P><DT><A NAME="M70"><B>w</B></A><DD>
|
|
inverse partial newline-sensitive (``weird'') matching (see MATCHING, below)
|
|
<P><DT><A NAME="M71"><B>x</B></A><DD>
|
|
expanded syntax (see below)
|
|
<P></DL>
|
|
<P>
|
|
Embedded options take effect at the
|
|
<B>)</B>
|
|
terminating the sequence.
|
|
They are available only at the start of an ARE,
|
|
and may not be used later within it.
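<P>
Two of the embedded options, <B>i</B> and <B>q</B>, are sketched below with invented inputs.
<PRE>
```tcl
# (?i) turns on case-insensitive matching for the rest of the RE.
puts [regexp {(?i)hello} "Say HELLO"]    ;# prints: 1
puts [regexp {hello} "Say HELLO"]        ;# prints: 0

# (?q) makes the rest of the RE a literal string, so "+" is not a quantifier.
puts [regexp {(?q)a+b} "a+b"]            ;# prints: 1
puts [regexp {(?q)a+b} "aab"]            ;# prints: 0
```
</PRE>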
|
|
<P>
|
|
In addition to the usual (<I>tight</I>) RE syntax, in which all characters are
|
|
significant, there is an <I>expanded</I> syntax,
|
|
available in all flavors of RE
|
|
with the <B>-expanded</B> switch, or in AREs with the embedded x option.
|
|
In the expanded syntax,
|
|
white-space characters are ignored
|
|
and all characters between a
|
|
<B>#</B>
|
|
and the following newline (or the end of the RE) are ignored,
|
|
permitting paragraphing and commenting a complex RE.
|
|
There are three exceptions to that basic rule:
|
|
<DL><P><DD>
|
|
<P>
|
|
a white-space character or `<B>#</B>' preceded by `<B>\</B>' is retained
|
|
<P>
|
|
white space or `<B>#</B>' within a bracket expression is retained
|
|
<P>
|
|
white space and comments are illegal within multi-character symbols
|
|
like the ARE `<B>(?:</B>' or the BRE `<B>\(</B>'
|
|
</DL>
|
|
<P>
|
|
Expanded-syntax white-space characters are blank, tab, newline, and
|
|
any character that belongs to the <I>space</I> character class.
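<P>
The expanded syntax might be used as in the following sketch, which relies on the embedded <B>x</B> option; the variable name <I>re</I> and the date string are invented for illustration.
<PRE>
```tcl
# White space and #-comments inside the RE are ignored under (?x).
set re {(?x)
    (\d{4}) -   # four-digit year, then a hyphen
    (\d{2}) -   # two-digit month, then a hyphen
    (\d{2})     # two-digit day
}
puts [regexp -inline $re "2024-05-17"]    ;# prints: 2024-05-17 2024 05 17
```
</PRE>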
|
|
<P>
|
|
Finally, in an ARE,
|
|
outside bracket expressions, the sequence `<B>(?#</B><I>ttt</I><B>)</B>'
|
|
(where
|
|
<I>ttt</I>
|
|
is any text not containing a `<B>)</B>')
|
|
is a comment,
|
|
completely ignored.
|
|
Again, this is not allowed between the characters of
|
|
multi-character symbols like `<B>(?:</B>'.
|
|
Such comments are more a historical artifact than a useful facility,
|
|
and their use is deprecated;
|
|
use the expanded syntax instead.
|
|
<P>
|
|
<I>None</I> of these metasyntax extensions is available if the application
|
|
(or an initial
|
|
<B>***=</B>
|
|
director)
|
|
has specified that the user's input be treated as a literal string
|
|
rather than as an RE.
|
|
<H3><A NAME="M72">MATCHING</A></H3>
|
|
In the event that an RE could match more than one substring of a given
|
|
string,
|
|
the RE matches the one starting earliest in the string.
|
|
If the RE could match more than one substring starting at that point,
|
|
its choice is determined by its <I>preference</I>:
|
|
either the longest substring, or the shortest.
|
|
<P>
|
|
Most atoms, and all constraints, have no preference.
|
|
A parenthesized RE has the same preference (possibly none) as the RE.
|
|
A quantified atom with quantifier
|
|
<B>{</B><I>m</I><B>}</B>
|
|
or
|
|
<B>{</B><I>m</I><B>}?</B>
|
|
has the same preference (possibly none) as the atom itself.
|
|
A quantified atom with other normal quantifiers (including
|
|
<B>{</B><I>m</I><B>,</B><I>n</I><B>}</B>
|
|
with
|
|
<I>m</I>
|
|
equal to
|
|
<I>n</I>)
|
|
prefers longest match.
|
|
A quantified atom with other non-greedy quantifiers (including
|
|
<B>{</B><I>m</I><B>,</B><I>n</I><B>}?</B>
|
|
with
|
|
<I>m</I>
|
|
equal to
|
|
<I>n</I>)
|
|
prefers shortest match.
|
|
A branch has the same preference as the first quantified atom in it
|
|
which has a preference.
|
|
An RE consisting of two or more branches connected by the
|
|
<B>|</B>
|
|
operator prefers longest match.
|
|
<P>
|
|
Subject to the constraints imposed by the rules for matching the whole RE,
|
|
subexpressions also match the longest or shortest possible substrings,
|
|
based on their preferences,
|
|
with subexpressions starting earlier in the RE taking priority over
|
|
ones starting later.
|
|
Note that outer subexpressions thus take priority over
|
|
their component subexpressions.
|
|
<P>
|
|
Note that the quantifiers
|
|
<B>{1,1}</B>
|
|
and
|
|
<B>{1,1}?</B>
|
|
can be used to force longest and shortest preference, respectively,
|
|
on a subexpression or a whole RE.
|
|
<P>
|
|
Match lengths are measured in characters, not collating elements.
|
|
An empty string is considered longer than no match at all.
|
|
For example,
<B>bb*</B>
matches the three middle characters of `<B>abbbc</B>';
<B>(week|wee)(night|knights)</B>
matches all ten characters of `<B>weeknights</B>';
when
<B>(.*).*</B>
is matched against
<B>abc</B>,
the parenthesized subexpression
matches all three characters; and
when
<B>(a*)*</B>
is matched against
<B>bc</B>,
both the whole RE and the parenthesized
subexpression match an empty string.
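<P>
The four examples above can be reproduced with <B>regexp -inline</B>, which returns the overall match followed by each reported subexpression. This is a sketch for experimentation rather than part of the specification.
<PRE>
```tcl
puts [regexp -inline {bb*} abbbc]
# prints: bbb

puts [regexp -inline {(week|wee)(night|knights)} weeknights]
# prints: weeknights wee knights

puts [regexp -inline {(.*).*} abc]
# prints: abc abc

puts [regexp -inline {(a*)*} bc]
# prints: {} {}
```
</PRE>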
|
|
<P>
|
|
If case-independent matching is specified,
|
|
the effect is much as if all case distinctions had vanished from the
|
|
alphabet.
|
|
When an alphabetic character that exists in multiple cases appears as an
|
|
ordinary character outside a bracket expression, it is effectively
|
|
transformed into a bracket expression containing both cases,
|
|
so that
|
|
<B>x</B>
|
|
becomes `<B>[xX]</B>'.
|
|
When it appears inside a bracket expression, all case counterparts
|
|
of it are added to the bracket expression, so that
|
|
<B>[x]</B>
|
|
becomes
|
|
<B>[xX]</B>
|
|
and
|
|
<B>[^x]</B>
|
|
becomes `<B>[^xX]</B>'.
|
|
<P>
|
|
If newline-sensitive matching is specified, <B>.</B>
|
|
and bracket expressions using
|
|
<B>^</B>
|
|
will never match the newline character
|
|
(so that matches will never cross newlines unless the RE
|
|
explicitly arranges it)
|
|
and
|
|
<B>^</B>
|
|
and
|
|
<B>$</B>
|
|
will match the empty string after and before a newline
|
|
respectively, in addition to matching at beginning and end of string
|
|
respectively.
|
|
ARE
|
|
<B>\A</B>
|
|
and
|
|
<B>\Z</B>
|
|
continue to match beginning or end of string <I>only</I>.
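<P>
Newline-sensitive matching can be observed with the embedded <B>n</B> option, as in this sketch; the two-line sample text and the variable names are invented.
<PRE>
```tcl
set text "one\ntwo"

# Default: ^ and $ anchor only at the ends of the whole string.
puts [llength [regexp -all -inline {^\w+$} $text]]    ;# prints: 0

# (?n): ^ and $ also match after and before each newline.
puts [regexp -all -inline {(?n)^\w+$} $text]          ;# prints: one two

# \A still matches only at the very start of the string.
puts [regexp {(?n)\Atwo} $text]                       ;# prints: 0

# (?n) also stops "." from crossing the newline.
regexp {(?n).+} $text m
puts $m                                               ;# prints: one
```
</PRE>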
|
|
<P>
|
|
If partial newline-sensitive matching is specified,
|
|
this affects <B>.</B>
|
|
and bracket expressions
|
|
as with newline-sensitive matching, but not
|
|
<B>^</B>
|
|
and `<B>$</B>'.
|
|
<P>
|
|
If inverse partial newline-sensitive matching is specified,
|
|
this affects
|
|
<B>^</B>
|
|
and
|
|
<B>$</B>
|
|
as with
|
|
newline-sensitive matching,
|
|
but not <B>.</B>
|
|
and bracket expressions.
|
|
This isn't very useful but is provided for symmetry.
|
|
<H3><A NAME="M73">LIMITS AND COMPATIBILITY</A></H3>
|
|
No particular limit is imposed on the length of REs.
|
|
Programs intended to be highly portable should not employ REs longer
|
|
than 256 bytes,
|
|
as a POSIX-compliant implementation can refuse to accept such REs.
|
|
<P>
The only feature of AREs that is actually incompatible with
POSIX EREs is that
<B>\</B>
does not lose its special
significance inside bracket expressions.
All other ARE features use syntax which is illegal or has
undefined or unspecified effects in POSIX EREs;
the
<B>***</B>
syntax of directors likewise is outside the POSIX
syntax for both BREs and EREs.
<P>
Many of the ARE extensions are borrowed from Perl, but some have
been changed to clean them up, and a few Perl extensions are not present.
Incompatibilities of note include `<B>\b</B>', `<B>\B</B>',
the lack of special treatment for a trailing newline,
the addition of complemented bracket expressions to the things
affected by newline-sensitive matching,
the restrictions on parentheses and back references in lookahead constraints,
and the longest/shortest-match (rather than first-match) matching semantics.
<P>
The matching rules for REs containing both normal and non-greedy quantifiers
have changed since early beta-test versions of this package.
(The new rules are much simpler and cleaner,
but don't work as hard at guessing the user's real intentions.)
<P>
Henry Spencer's original 1986 <I>regexp</I> package,
still in widespread use (e.g., in pre-8.1 releases of Tcl),
implemented an early version of today's EREs.
There are four incompatibilities between <I>regexp</I>'s near-EREs
(`RREs' for short) and AREs.
In roughly increasing order of significance:
<P>
<DL><P><DD>
In AREs,
<B>\</B>
followed by an alphanumeric character is either an
escape or an error,
while in RREs, it was just another way of writing the
alphanumeric.
This should not be a problem because there was no reason to write
such a sequence in RREs.
<P>
<B>{</B>
followed by a digit in an ARE is the beginning of a bound,
while in RREs,
<B>{</B>
was always an ordinary character.
Such sequences should be rare,
and will often result in an error because following characters
will not look like a valid bound.
<P>
In AREs,
<B>\</B>
remains a special character within `<B>[ ]</B>',
so a literal
<B>\</B>
within
<B>[ ]</B>
must be written `<B>\\</B>'.
<B>\\</B>
also gives a literal
<B>\</B>
within
<B>[ ]</B>
in RREs,
but only truly paranoid programmers routinely doubled the backslash.
<P>
AREs report the longest/shortest match for the RE,
rather than the first found in a specified search order.
This may affect some RREs which were written in the expectation that
the first match would be reported.
(The careful crafting of RREs to optimize the search order for fast
matching is obsolete, since AREs examine all possible matches
in parallel and their performance is largely insensitive to their
complexity; but cases where the search order was exploited to deliberately
find a match which was <I>not</I> the longest/shortest will need rewriting.)
</DL>
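<P>
The four differences listed above can be observed directly.
The following is a minimal sketch using the <B>regexp</B> command with its
default ARE syntax; the sample strings and variable names are illustrative
only, and <B>\q</B> is used merely as an example of a backslash sequence
that is assumed not to be a defined escape.

```tcl
# 1. \ followed by an alphanumeric is an escape or an error.
#    \d is an escape; \q is assumed to be undefined, so compilation fails.
set escapeOk  [regexp {^\d$} "7"]               ;# 1
set escapeBad [catch {regexp {\q} "q"}]         ;# 1 (an error was raised)

# 2. { followed by a digit begins a bound.
set bound [regexp {^a{2}$} "aa"]                ;# 1

# 3. \ stays special inside a bracket expression,
#    so a literal backslash is written \\ there.
set backslash [regexp {^[\\]$} "\\"]            ;# 1

# 4. The longest match is reported, not the first alternative that succeeds.
regexp {a|ab} "ab" longest                      ;# longest is "ab"

#    A non-greedy quantifier asks for the shortest match instead.
regexp {ab*?} "abbb" shortest                   ;# shortest is "a"

puts [list $escapeOk $escapeBad $bound $backslash $longest $shortest]
```

A first-match engine would typically report `<B>a</B>' for the alternation
in step 4, which is the behavior that older patterns may have relied on.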
<H3><A NAME="M74">BASIC REGULAR EXPRESSIONS</A></H3>
BREs differ from EREs in several respects. `<B>|</B>', `<B>+</B>',
and
<B>?</B>
are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are
<B>\{</B>
and `<B>\}</B>',
with
<B>{</B>
and
<B>}</B>
by themselves ordinary characters.
The parentheses for nested subexpressions are
<B>\(</B>
and `<B>\)</B>',
with
<B>(</B>
and
<B>)</B>
by themselves ordinary characters.
<B>^</B>
is an ordinary character except at the beginning of the
RE or the beginning of a parenthesized subexpression,
<B>$</B>
is an ordinary character except at the end of the
RE or the end of a parenthesized subexpression,
and
<B>*</B>
is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `<B>^</B>').
Finally,
single-digit back references are available,
and
<B>\&lt;</B>
and
<B>\&gt;</B>
are synonyms for
<B>[[:&lt;:]]</B>
and
<B>[[:&gt;:]]</B>
respectively;
no other escapes are available.
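<P>
A few of these BRE rules can be checked from Tcl.
The sketch below assumes the <B>(?b)</B> embedded option, which marks the
rest of the RE as a BRE when given to the <B>regexp</B> command;
the sample strings and variable names are illustrative only.

```tcl
# + is an ordinary character in a BRE: it matches a literal plus sign
# and does not mean "one or more".
set plusLiteral [regexp {(?b)^a+$} "a+"]        ;# 1
set plusRepeat  [regexp {(?b)^a+$} "aa"]        ;# 0

# Bounds are delimited by \{ and \}.
set bound [regexp {(?b)^a\{2\}$} "aa"]          ;# 1

# Subexpressions use \( and \); bare parentheses are ordinary characters.
set group  [regexp {(?b)^\(ab\)\1$} "abab"]     ;# 1 (single-digit back reference)
set parens [regexp {(?b)^(ab)$} "(ab)"]         ;# 1

# * is ordinary at the beginning of the RE, after a possible leading ^.
set star [regexp {(?b)^*a$} "*a"]               ;# 1

puts [list $plusLiteral $plusRepeat $bound $group $parens $star]
```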
<H3><A NAME="M75">SEE ALSO</A></H3>
<B><A HREF="../TkCmd/regexp.htm">RegExp</A></B>, <B><A HREF="../TkCmd/regexp.htm">regexp</A></B>, <B><A HREF="../TkCmd/regsub.htm">regsub</A></B>, <B><A HREF="../TkCmd/lsearch.htm">lsearch</A></B>, <B><A HREF="../TkCmd/switch.htm">switch</A></B>, <B><A HREF="../TclCmd/text.htm">text</A></B>
<H3><A NAME="M76">KEYWORDS</A></H3>
<A href="../Keywords/M.htm#match">match</A>, <A href="../Keywords/R.htm#regular expression">regular expression</A>, <A href="../Keywords/S.htm#string">string</A>
<HR><PRE>
<A HREF="../copyright.htm">Copyright</A> © 1998 Sun Microsystems, Inc.
<A HREF="../copyright.htm">Copyright</A> © 1999 Scriptics Corporation
<A HREF="../copyright.htm">Copyright</A> © 1995-1997 Roger E. Critchlow Jr.</PRE>
</BODY></HTML>