The RegExpSyntax
structure provides an abstract-syntax-tree
representation of regular expressions. Its main purpose is to
provide communication between different front-ends (implementing
different RE specification languages), and different back-ends
(implementing different compilation/searching algorithms).
It is also possible, however, to use it as a way to directly
specify a regular expression for a back-end engine.
Synopsis
signature REGEXP_SYNTAX
structure RegExpSyntax : REGEXP_SYNTAX
Interface
exception CannotCompile
structure CharSet : ORD_SET where type Key.ord_key = char
datatype syntax
  = Group of syntax
  | Alt of syntax list
  | Concat of syntax list
  | Interval of (syntax * int * int option)
  | MatchSet of CharSet.set
  | NonmatchSet of CharSet.set
  | Char of char
  | Begin
  | End
val optional : syntax -> syntax
val closure : syntax -> syntax
val posClosure : syntax -> syntax
val fromRange : char * char -> CharSet.set
val addRange : CharSet.set * char * char -> CharSet.set
val allChars : CharSet.set
val alnum : CharSet.set
val alpha : CharSet.set
val ascii : CharSet.set
val blank : CharSet.set
val cntl : CharSet.set
val digit : CharSet.set
val graph : CharSet.set
val lower : CharSet.set
val print : CharSet.set
val punct : CharSet.set
val space : CharSet.set
val upper : CharSet.set
val word : CharSet.set
val xdigit : CharSet.set
Description
exception CannotCompile
    This exception is meant to be raised by back-ends when they encounter a feature that they cannot handle.
structure CharSet : ORD_SET where type Key.ord_key = char
    This substructure implements sets of 8-bit characters. Currently it is implemented using sorted lists (i.e., using the ListSetFn functor), but that may be changed in the future.
datatype syntax
    This datatype defines the abstract syntax of regular expressions that is supported by the library. The constructors are defined as follows:
    - Group re :: defines a match group (i.e., it produces a corresponding match-tree node for the input matched by re).
    - Alt[re1, re2, …, ren] :: matches any of re1, re2, …, ren. If the list is empty, then it matches nothing.
    - Concat[re1, re2, …, ren] :: matches the concatenation of re1, re2, …, ren. If the list is empty, then it matches the empty string.
    - Interval(re, n, NONE) :: matches re repeated at least n times.
    - Interval(re, n, SOME m) :: matches re repeated from n to m times.
    - MatchSet cs :: matches a single character that is in the set cs.
    - NonmatchSet cs :: matches a single character that is not in the set cs.
    - Char c :: matches the single character c.
    - Begin :: matches the beginning of the input stream.
    - End :: matches the end of the input stream.
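As an illustration of specifying a regular expression directly as a syntax tree, the following sketch builds a small expression from the constructors and hands it to a back-end. It assumes an SML/NJ session in which the regexp library has been loaded, and it assumes that the library's BackTrackEngine structure (which is not described in this section) offers compile and find with the usual engine signature. The names re, compiled, and matches are hypothetical and exist only for this example.

```sml
(* Assumes SML/NJ with the regexp library loaded, e.g. in the REPL:
 *   CM.make "$/regexp-lib.cm";
 * BackTrackEngine is assumed to be one of the back-ends in that library. *)
structure S = RegExpSyntax

(* Roughly the expression  (ab|c)x{2,3}[^0-9]  written as a syntax tree *)
val re : S.syntax =
      S.Concat [
          (* a match group around the alternation "ab" or "c" *)
          S.Group (S.Alt [S.Concat [S.Char #"a", S.Char #"b"], S.Char #"c"]),
          (* "x" repeated from 2 to 3 times *)
          S.Interval (S.Char #"x", 2, SOME 3),
          (* one character that is not a decimal digit *)
          S.NonmatchSet (S.fromRange (#"0", #"9"))
        ]

(* Hand the tree to a back-end; a back-end that cannot handle some
 * feature is expected to raise S.CannotCompile here. *)
val compiled = BackTrackEngine.compile re

(* matches s is true when the expression is found somewhere in s *)
fun matches s =
      Option.isSome (StringCvt.scanString (BackTrackEngine.find compiled) s)
```

Because the tree is an ordinary datatype value, the same re could be passed unchanged to a different back-end, which is the decoupling that this structure is meant to provide.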
val optional : syntax -> syntax
    optional re is equivalent to Interval(re, 0, SOME 1).
val closure : syntax -> syntax
    closure re is equivalent to Interval(re, 0, NONE).
val posClosure : syntax -> syntax
    posClosure re is equivalent to Interval(re, 1, NONE).
val fromRange : char * char -> CharSet.set
    fromRange (c1, c2) returns the set containing the characters in the range from c1 to c2 (inclusive). This expression raises the Size exception if c2 < c1.
val addRange : CharSet.set * char * char -> CharSet.set
    addRange (cs, c1, c2) adds the set of characters in the range from c1 to c2 (inclusive) to cs. This expression raises the Size exception if c2 < c1.
val allChars : CharSet.set
    is the set of all 8-bit characters.
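The helper functions can be combined as in the following sketch, which builds a character set from two ranges and wraps it in the repetition shorthands. It assumes the regexp library is loaded in an SML/NJ session; the value names (hexLower, hexDigits, signedHex, and so on) are hypothetical and chosen only for this example.

```sml
(* Assumes the SML/NJ regexp library is loaded, e.g. CM.make "$/regexp-lib.cm"; *)
structure S = RegExpSyntax
structure CS = S.CharSet

(* The set [0-9a-f], built from two inclusive ranges *)
val hexLower : CS.set = S.addRange (S.fromRange (#"0", #"9"), #"a", #"f")

val hasC = CS.member (hexLower, #"c")   (* #"c" lies inside the second range *)
val hasG = CS.member (hexLower, #"g")   (* #"g" lies outside both ranges *)

(* [0-9a-f]+ : one or more characters from the set *)
val hexDigits = S.posClosure (S.MatchSet hexLower)

(* -?[0-9a-f]+ : the same, with an optional leading minus sign *)
val signedHex = S.Concat [S.optional (S.Char #"-"), hexDigits]

(* A reversed range is rejected with Size, as described for fromRange *)
val rejected = (S.fromRange (#"z", #"a"); false) handle Size => true
```

Since optional, closure, and posClosure are described as equivalent to Interval forms, a back-end only needs to handle the Interval constructor to support all three.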
POSIX Character Classes
The RegExpSyntax
structure pre-defines the following character sets,
which are part of the POSIX regular-expression standard (plus a couple
of extras):
val alnum : CharSet.set
    is the set of letters and digits.
val alpha : CharSet.set
    is the set of letters.
val ascii : CharSet.set
    is the set of characters c such that 0 <= ord c <= 127.
val blank : CharSet.set
    is the set of #"\t" and space.
val cntl : CharSet.set
    is the set of non-printable characters.
val digit : CharSet.set
    is the set of decimal digits.
val graph : CharSet.set
    is the set of visible characters (does not include space).
val lower : CharSet.set
    is the set of lower-case letters.
val print : CharSet.set
    is the set of printable characters (includes space).
val punct : CharSet.set
    is the set of visible characters other than letters and digits.
val space : CharSet.set
    is the set of #"\t", #"\r", #"\n", #"\v", #"\f", and space.
val upper : CharSet.set
    is the set of upper-case letters.
val word : CharSet.set
    is the set of letters, digits, and #"_".
val xdigit : CharSet.set
    is the set of hexadecimal digits.
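The predefined sets are ordinary CharSet.set values, so they can be used directly in MatchSet and NonmatchSet or adjusted with the ORD_SET operations. The following is a minimal sketch, assuming the regexp library is loaded in an SML/NJ session; the names identStart, ident, spaceIsPrint, and spaceIsGraph are hypothetical.

```sml
(* Assumes the SML/NJ regexp library is loaded, e.g. CM.make "$/regexp-lib.cm"; *)
structure S = RegExpSyntax
structure CS = S.CharSet

(* Identifier: a letter or underscore followed by any number of word
 * characters, roughly [[:alpha:]_][[:alnum:]_]* in POSIX notation *)
val identStart : CS.set = CS.add (S.alpha, #"_")
val ident : S.syntax =
      S.Concat [S.MatchSet identStart, S.closure (S.MatchSet S.word)]

(* The difference between print and graph is the space character *)
val spaceIsPrint = CS.member (S.print, #" ")
val spaceIsGraph = CS.member (S.graph, #" ")
```

Building identStart with CS.add keeps the predefined alpha set untouched, since the set operations return new sets rather than modifying their arguments.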