cwal

whcl POSIX Regexes
Login

whcl POSIX Regexes

(⬑Main Module Docs)

POSIX Regexes and Features Common to POSIX and JS Regexes

Jump to:

The Regular Expressions API

This module exposes a single function which compiles regular expression strings in the POSIX regex grammar and returns a module-specific type:

RegexType module string pattern [string compileFlags]

The first parameter is a regular expression pattern in the module's regex dialect.

The second parameter is an optional string describing regex compilation flags, each one a single letter. The POSIX-specific compilation flags are described in a subsection below. An empty string is legal for the flags.

The compilation function throws an exception if the pattern is syntactically invalid or an invalid flag letter is provided.

The compilation function is used like this:

decl -const regcomp [whcl load-module '/path/to/regex-posix.so']
# or, if it's built in to whcl:
decl -const regcomp [whcl install-api regex-posix]
decl regex [regcomp "foo" i]

Minor Achtung: Capture Limits

The POSIX regex API has some peculiar rules regarding capture limits, how it deals with them, and limitations it places on regexes which are compiled with the NOSUB flag.

This library imposes a compile-time limit on the number of capture groups which a regex may have (10, as of this writing, including the full-string match). The POSIX regex will gladly process a regex which has more captures than that limit, but will silently ignore any captures above that limit. This API, however, will trigger an exception if a regex is run with more captures than this limit. We "could" silently ignore such a situation, and that design decision is up for reconsideration later, but practice implies that we should apply the stricter option until/unless we later decide that the laxer approach would be more useful.

Instance Methods

The regex type returned from module is a different type but they have nearly-identical interfaces. Their common methods and behaviours, as well as any significant differences in behaviour, are described below.

destroy

Usage: destroy

Immediately frees all native resources used by this regex. That also happens when garbage collection reaps the regex, but clients may force it immediately with this method. After this method is called, calling any regex methods on this object will throw an exception because the underlying C-level regex instance no longer exists.

each-match

usage: each-match text string|function callback [matchFlags]

For each match of this regex in the given text, this function calls the given callback:

Any return/result value of the callback is ignored.

matchFlags may be any flags accepted by exec or replace, noting that the $ and E replace-specific flags are automatically implied if the callback is a string or Script (because this function would be useless with a string callback without those flags). The E flag is ignored if the callback is a function. The $ flag can be used to provide a callback function access to sub-captures.

Achtung: the POSIX regex C API does not support each-match for regexes which are compiled with the s (NOSUB) flag, and attempting it will trigger an exception.

exec

Usage: exec text [matchFlags]

Returns false if the input string does not match, else it returns a list of the matches (but see the caveat below!). Element 0 in the list is the entire match and each subsequent element is the contents of a captured subexpression. Thus capture number N is element N in the result list. Both flavours of regex have a hard upper limit of captures (see the comments above on this topic), including the whole-match entry.

The second argument is an optional string of letters representing regex-flavor-specific match-time flags. The legal flags and letters for POSIX are listed in a following subsection.

Achtung: the POSIX regexec() C API cannot report the substring position of matches for regexes which are compiled with the s (NOSUB) flag (not even for the whole-match part), so this method will, for such regexes, return true, instead of a list, on a match.

match-all

Usages:

This function has two distinct modes:

In both cases, if no match is found a falsy value is returned.

matchFlags is an optional flag to change how matching works, exactly as described for exec.

Achtung: the POSIX regex C API does not support match-all for regexes which are compiled with the s (NOSUB) flag, and attempting it will trigger an exception.

Minor Achtung: remember that captures for the POSIX regex API are a genuine pain in the butt when compiled with the BASIC flag.

replace

Usages:

Replaces instances of the regex's match in the given string with the given replacement. replace only accepts strings, not buffers, as input text.

Achtung: the POSIX regex C API does not support replace for regexes which are compiled with the s (NOSUB) flag, and attempting it will trigger an exception.

The given replacement value is applied in the part(s) of the text argument which match the regex. High-level replacement values like objects and arrays will be appended in JSON form, but any cycles in such constructs will trigger an exception. If replacement is a function then:

  1. It is passed the complete match text of each match and the result of the call becomes the replacement.
  2. In the context of the callback, this refers to the regex instance.

If replacement is a Script value then it's treated as if it were a string and the E_ match-flags (see below) are in effect. This is far more efficient than passing a string with that flag, as this for only has to be compiled once, instead of on each iteration.

If maxReplacements is passed in, it must be an integer. A value of 0 or less means unlimited, else the number of replacements is limited to the given value.

matchFlags is an optional flag to change how matching works, exactly as described for exec, plus it supports the following string-form flag letters which only work for this routine and each-match:

If the replacement is a Script value then both of those flags are implicit.

TODO: replace the final arg with flags.

Sidebar: we do not export symbols named $0...$N, as would be more conventional vis-a-vis other regex APIs, because: (A) it would be much less efficient to do so. (B) it won't work with our var naming syntax.

When using a function for the replacement, the function is passed the full string of each match. If the $ matchFlag is used then captures are available as described above. Without that flag, the function may nonetheless perform replacement based on capture groups by calling exec on the regex from within the replacement callback, like so:

decl x [regcomp '[ \t]*([a-z]+)[ \t]*(;)?']
# Normalize inputs to upper-case and strip extraneous spaces:
assert 'A;B;C' == [re replace 'a\t  ;\tb; c' [proc -anon {a} {
    decl m [this.exec $a]
    # At this point we know this regex matches the input,
    # so we don't need to bother checking whether m is falsy.
    decl u [m.1.to-upper]
    if {m.2} {set u [$u concat m.2]}
    return $u
}]]

However, that requires running the regex twice on each matching part of the input, so using the $ flag is recommended if captures beyond the whole-match capture are needed.

Here's a functionally equivalent example which uses the $ flag to make the captures available and the E flags to demonstrate using a string as a callback body:

# Using the same regex as above:
assert 'A;B;C' == [re.replace
    'a\t  ;\tb; c'
    {
        decl rc
        if {_.2} {
            set rc [[_.1.to-upper].concat _.2]
        } else {
            set rc [_.1.to-upper]
        }
        eval $rc; # sets result of the outer eval block
    }
    '$E']

Remember that a {...}, when passed to a function, is just a special case of string, not a code block which gets evaluated before calling the function. The E flag then causes that string to be eval'd for each replacement.

Sidebar: no buffers as input? Buffers are not allowed as input here because, when used in conjunction with a replacement callback/eval, it would be possible for the buffer to be modified during traversal, which would lead to Undefined Behaviour. Working around that (by moving its contents out of the way during traversal, similar to how Buffer.eval-contents works) would be rather fidgety. Interestingly, though, passing a buffer as a replacement value is legal because without a callback there is no risk of it being modified during the replacement process. It might seem sane to permit buffer inputs when the replacement is not a callback, but that could still potentially backfire badly when invoked in certain convoluted recursive contexts.

split

Usages:

This works similarly to "astring".split "pattern" except that it splits the first argument on this regex's pattern.

If a limit is 0 or greater, it captures, at most, that many elements, otherwise it captures all it can (just like string.split). (Yes a limit of 0 is valid, but i have no idea why - JS allows it. Note that this differs from string.split, where a limit of 0 means unlimited.)

matchFlags is an optional flag to change how matching works, exactly as described for exec.

Note that, like string.split, this treats matching separators at the start and end of the input as empty entries at the start resp. end of the result list.

Note also that while string.split '' splits the string into its componenent characters, there is no equivalent with this API because the API forbids empty regexes. Splitting on a regex of "." will behave much differently, treating each character as a separator and returning a list of what's between those separators (empty strings).

Achtung: the POSIX regex C API does not support split for regexes which are compiled with the s (NOSUB) flag, and attempting it will trigger an exception.

test

Usage: test text [matchFlags]

Works like exec but returns true if the given text matches the regex, else false. i.e. it does not allocate a data structure for the result, so it's more efficient (but less informative) than exec.

Minor Achtung: Arrays vs Tuples

The regex APIs explicitely do not use the terms "array" or "tuple" when refferring to list-type return values. Any given routine may return either in any context where a list is returned, that type may differ between versions of the modules.

For 99+% of use client-side cases, it doesn't make a difference either way: the two list types are used the same way until/unless the client wants to change their size or sort them or perform some other array-only operation. Element access and foreach iteration are the same for both list types, and those are the only operations normally performed on string-matching APIs of this sort.

FWIW, the "preferred" return type is a tuple, as they're memory-lighter, but that's only possible when a given function knows, in advance, how many slots the result list will need (which isn't always the case).

Big Achtung: Strings vs. Buffers

Most places where the regex APIs accept a string, they also accept a Buffer, but results are undefined if such a buffer contains non-String data. Routines which accept accept callbacks (e.g. to iterate over matches or for implementing dynamic string replacement) disallow buffers as input because it would be disastrous if the buffers were modified by/via such callbacks while this API is traversing its contents.

Properties

Module-level Properties

Each regex module's compilation function has a property named instance-prototype, the prototype which gets assigned to each new regex instance. This can be used to modify the behaviour of regexes without having to first instantiate one to get at its prototype.

An example of such modification would be to extend the string prototype in order to be able to make use of certain regex functionality. For example, the following code replaces string.split with a proxy which can make use of either (or both) of the regex modules when a regex is passed as its first argument:

decl R [whcl install-api regex-posix]
decl S "".__prototype
if {S.split} {
    set S.split proc proxy {/*pattern, limit*/} {
        if {[info is-native argv.0] && argv[0][__prototype] == $P} {
            return [argv[0].split this (argv.1 || -1)]
        } else {
            return [X.apply this $argv]
        }
    } using -scope {
        decl -const P R.instance-prototype
        decl -const X S.split
    }
}
# Now demonstrate it...
# First, split on a regex:
decl m ["a;b;c".split [R " *; *"]]
assert 3 == [m.length]
assert 'c' == m.2
# Now split on a string:
set m ["A;B;C".split ";"]
assert 3 == [m.length]
assert 'C' == m.2

A similar proxy could be used to allow string.replace(pattern,replacement) to accept (regex,function) arguments:

decl -const R [whcl install-api regex-posix]
decl -const S "".__prototype
if {S.replace} {
    set S.replace proc {/*needle, replacement*/} {
        if {[info is-native argv.0] && argv[0][__prototype] == $P} {
            return [argv[0].replace this argv.1]
        } else {
            return [X.apply this $argv]
        }
    } using -scope {
        decl -const P R.instance-prototype
        decl -const X S.replace
    }
}
# Now try it out...
# Replace via regex:
assert abc == ["AbC".replace [R '[A-Z]'] [proc -anon {x} {return [x.to-lower]}]]
# Replace via string:
assert abc == ["Abc".replace A a]

Per-Instance Properties

Each regex instance has the following standard properties assigned to it:

The values of those properties can be used to save and restore a regex for later use.

Regex Flags

Regex Compilation Flags

Regex compilation may be modified by providing a string of single-letter flags. Unknown flags cause an exception to be thrown. The flags for this module are listed below. The upper-case names listed next to each flag are the C-level names for the flags, with the exception of BASIC, which is does not exist at the C level (C-level POSIX regexes default to BASIC mode unless the EXTENDED flag is used, whereas this API does the opposite because basic-style regexes are not terribly useful.)

Regex Match-time Flags

Like compilation flags, flags which change regex matching behaviour at match-time may be provided as a string of letters describing the flag(s):

Symbol Resolution via Callbacks

Normally in whcl, the search for resolving symbols stops at a function call boundary. That makes using certain types of callbacks tedious, in particular it severely limits the use of eval'able strings as callback handlers. The following methods are configured such that symbol resolution is permitted to pass on through them, to simplify implementing client-side callbacks:

When passing callback functions, as opposed to eval'able script snippets, to these methods, the callbacks still need the -xlookup flag if they want to reference symbols which are accessible from the scope in which the above method is called.

See the proc docs for more details about this feature.

Achtung: Locale-dependent Matching

The POSIX regex API uses locale-dependent pattern matching and this module does not set the software's locale because it cannot know if the overlying software has done so, or what effects changing that setting might have on the rest of the app. Thus, unless the software or environment changes the locale, these regexes will use the "C" locale for matching purposes and will likely not match non-ASCII strings or patterns.