Regular expressions are pattern matching algorithms for strings found in
many contemporary languages. You can add regular expression
functionality to Gforth with require regexp.fs
.
The classical implementation for this pattern matching is a backtracking
algorithm, which is also necessary if you want to have features like
backreferencing. Gforth implements regular expressions by providing a
language to define backtracking programs for pattern matching. Basic
element is the control structure FORK
… JOIN
, which
is a forward call within a word, and therefore allows to code a
lightweight try and fail control structure.
AHEAD-like control structure: calls the code after JOIN.
THEN-like control structure for FORK
You can program any sort of arbitrary checks yourself by computing a
flag and ?LEAVE
when the check fails. Your regular expression
code is enclosed in ((
and ))
.
start regexp block
end regexp block
Pattern matching in regular expressions have character sets as elements,
so a number of functions allow you to create and modify character sets
(called charclass
). All characters here are bytes, so this
doesn’t extend to unicode characters.
Create a charclass
add a char to the current charclass
remove a char from the current charclass
add a range of chars to the current charclass
add a string of chars to the current charclass
union of charclass class and the current charclass
subtract the charclass class from the current charclass
There are predefined charclasses and tests for them, and generic checks. If a check fails, the next possible alternative of the regular expression is tried, or a loop is terminated.
check addr for membership in charclass class
check addr for not membership in charclass class
check for digit
check for blanks
check for any single charachter
check for not digit
check for not blank
check for particular char
check for particular char
You can certainly also check for start and end of the string, and for whole string constants.
check for string start
check for string end
check for a computed string on the stack (possibly a backreference)
check for string
Loops that check for repeated character sets can be greedy or non-greedy.
greedy zero-or-more pattern
end of greedy zero-or-more pattern
greedy one-or-more pattern
end of greedy one-or-more pattern
non-greedy zero-or-more pattern
end of non-greedy zero-or-more pattern
non-greedy one-or-more pattern
end of non-greedy one-or-more pattern
Example: Searching for a substring really is a non-greedy match of anything in front of it.
search for string
Alternatives are written with
Start of alternatives
separator between alternatives
end of alternatives
You can use up to 9 variables named \1
to \9
to refer to
matched substrings
start of matching variable; variables are referred as \\1–9
end of matching variable
the whole string
Certainly, you can also write code to replace patterns you found.
Start replace pattern region
Start arbitrary replacement code, the code shall compute a string
on the stack and pass it to <<
Replace string from start of replace pattern region with addr u
Replace string from start of replace pattern region with string
start search/replace loop
search end
end search/replace single loop
end search/replace all loop
Examples can be found in test/regexp-test.fs
.