Introduction
The RegExp Library is designed for flexibility; it allows mixing and matching of different front-end syntax with back-end engines, as well as supporting arbitrary input sources. This flexibility, however, comes at the cost of making some of the simple applications a bit less obvious. This tutorial shows how the RegExp Library can be used in a variety of common applications.
Assembling an Regular Expression Matcher
Before we can do anything else, we must assemble a regular-expression
matcher. For the purposes of this tutorial, we use a combination
of the AwkSyntax
front-end and the BackTrackEngine
back-end.
structure RE = RegExpFn(
structure P = AwkSyntax
structure E = BackTrackEngine)
Match trees
Regular expressions may contain grouping
operators. When a pattern matches a string, these groups induce a nested
tree structure on the matched string.
The MatchTree
structure defines
a polymorphic representation of this structure, along with a
number of utility functions for extracting information from
a match.
structure MT = MatchTree
Example: scanning tokens
The match
function in the REGEXP
signature allows one to distinguish
between a set of possible regular expression matches. One application of
this mechanism is a simple scanner. Let us define a datatype for tokens,
which can be white space, numbers, or identifiers.
datatype tok
= WS | NUM of IntInf.int | ID of string
We can then define the scanner as follows:
fun scanner getc gets = let
fun getMatch cons (MT.Match({pos, len}, _)) = cons (gets (pos, len))
in
RE.match [
("[ \t\n]+", getMatch (fn _ => WS)),
("[0-9]+", getMatch (fn s => NUM(valOf(IntInf.fromString s)))),
("[a-zA-Z][a-zA-Z0-9]*", getMatch ID)
] getc
end
Here the getc
parameter is the standard character reader; we have also included
the gets
parameter, which is a function of type
'strm * int -> string
for getting a string from a stream. For many input sources, the gets
function
has an efficient and direct implementation, but it can also be implemented in
terms of the getc
function as follows:
fun gets getc (strm, n) = let
fun getChrs (0, _, chrs) = String.implodeRev chrs
| getChrs (n, strm, chrs) = (case getc strm
of NONE => raise Fail "empty stream"
| SOME(c, strm) => getChrs (n-1, strm, c::chrs)
(* end case *))
in
getChrs (n, strm, [])
end;
Because this function is only called after the scanner
function has matched
a sequence of n
characters from strm
, the "empty stream"
case will not
occur for well behaving input streams.
Here is an example of using the scanner to tokenize strings, where we use the Basis Library substring type to implement the stream type:
fun tokens s = let
fun gets (ss, n) = Substring.string(Substring.slice (ss, 0, SOME n))
val scan = scanner Substring.getc gets
fun lp (ss, toks) = (case scan ss
of SOME(tok, ss') => lp (ss', tok::toks)
| NONE => List.rev toks
(* end case *))
in
lp (Substring.full s, [])
end;