cwal

s2: Grammar

s2 Grammar

This section provides an overview of s2's built-in symbols, as well as the rules regarding client-defined symbols (i.e. identifiers).

Terminology

This documentation generally assumes some familiarity with at least one other scripting language, but does not assume a detailed knowledge of any of them. Anyone familiar with one or more scripting environments will have no trouble following along.

Some terminology used throughout this documentation which might be unfamiliar to readers (in no particular order):

Anatomy of an Expression

s2 has no formal grammar chart which tells us what, precisely, is and is not legal. It is an expression parser which linearly reads values and operators, applying the operators to the values and following conventional (C-like) operator precedence rules to govern their processing order, with the exception that it also treats keywords as expressions. Any "misplaced" tokens (values or operators) produce parsing errors, whereas valid ones run through a stack machine. The results, one will find, are rather conventional, and mostly intuitive to anyone who is familiar with any of the more common JavaScript-like languages. With the exception of its symbol lookup rules, there should be few surprises for anyone with experience in at least a couple of programming environments.

Non-symbols and Expression Boundaries

s2 generally treats any whitespace, newlines, and comments as insignificant non-symbols, with the exception that it recognizes end-of-line (EOL) as an expression terminator (a.k.a. EOX: end of expression) in some very few special cases. End-of-file (EOF) is always treated as an expression terminator, so it may cause a syntax error if encountered mid-expression. As a special exception, if the first line of a script starts with "#!", it is assumed to be a Unix shebang line, and is skipped (up to and including the first newline character).
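The shebang rule above is simple enough to sketch in a few lines of C. This is only an illustration of the described behaviour, not s2's actual code: the function name skip_shebang() is hypothetical and is not part of the s2 or cwal APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the s2/cwal API): returns a pointer to
   the first byte a parser should look at. If the input starts with "#!",
   everything up to and including the first newline is skipped. */
const char *skip_shebang(const char *src, size_t len) {
    assert(src != NULL);
    if (len < 2 || src[0] != '#' || src[1] != '!') {
        return src;             /* no shebang line: nothing to skip */
    }
    size_t i = 2;
    while (i < len && src[i] != '\n') {
        ++i;                    /* consume the rest of the first line */
    }
    if (i < len) {
        ++i;                    /* consume the newline itself */
    }
    return src + i;
}
```

A script consisting only of a shebang line with no trailing newline ends up with an empty remainder in this sketch, which matches the "up to and including the first newline" wording.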

In theory s2 handles both common end-of-line styles (\n and \r\n) equivalently, but in truth nobody's tested the latter as of this writing because… well, because Windows is icky. Newlines in string literals will be retained as they are entered in the string (that might or might not be a bug - hasn't been a problem so far, but it could potentially be one).

s2 supports both C- and C++-style comments, which may appear essentially anywhere in code:

// comment until end of line
/* comment until … */

Sidebar: inside of comments, backslash escapes are not processed. That means that a //-style comment which ends in a backslash followed by a newline does not, contrary to the C standard, combine the next line with the comment.

When evaluating, comments are ignored as if they were spaces. Newlines are almost always treated as whitespace, with the exception that some block-level constructs allow (for convenience and convention) a newline to act as an end-of-expression. Here is an example of the effect of semicolons and newlines around an expression:

expr ; // === expr
expr <NEWLINE>;<NEWLINE> // === expr, no matter how many newlines surround ';'
expr ; /* === expr */ ; // === <NULL> see below

A second semicolon "deletes" the current pending expression result value. A single semicolon, along with any number of newlines, or a series of newlines without a semicolon, retains the pending expression's value. This is more notable in eval/scope blocks, where too many semicolons can cause the block to "lose" its result value, so that it evaluates to the undefined value (but that can also be used to avoid propagating a local value out of a scope). While an EOF is implicitly an EOX, only semicolons are counted for purposes of tracking the current expression result. That is, an EOF in place of any of those semicolons except the very last one has no effect on the result of the expression, and replacing the last semicolon with an EOF would cause the script to evaluate to the result of the expression on the last line.
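The semicolon-counting rule can be modelled with a toy C function. This is a simplified reading of the rule as described here, not s2's evaluator: the function name and the one-character "token" encoding ('e' for an expression, ';' for a semicolon, anything else for whitespace, newlines, or comments) are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical, not s2 code). Returns 1 if a result value is
   still pending when the input ends (EOF), or 0 if it has been discarded. */
int has_pending_result(const char *tokens) {
    assert(tokens != NULL);
    int pending = 0;    /* is there a pending expression result? */
    int semicolons = 0; /* semicolons seen since the last expression */
    for (; *tokens != '\0'; ++tokens) {
        switch (*tokens) {
        case 'e':       /* a new expression replaces the pending result */
            pending = 1;
            semicolons = 0;
            break;
        case ';':       /* the first one keeps the result, the second drops it */
            if (++semicolons >= 2) {
                pending = 0;
            }
            break;
        default:        /* spaces, newlines, comments: insignificant here */
            break;
        }
    }
    return pending;
}
```

In this model "e;" and "e\n;\n" both keep their result, while "e;;" loses it, mirroring the three example lines above.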

s2 evaluates parenthesis/brace groups and {script} blocks as atomic tokens, a side-effect of which is that such blocks end (from the perspective of the sub-parser) at a virtual EOF at the closing parenthesis/brace/bracket. Because EOF is treated as an implicit EOX, a semicolon is optional for the final expression at the end of any such construct (and might even be a syntax error, depending on the context).

Design note regarding semicolons: s2 initially used an EOL as an optional expression terminator. This was comfortable to use, but caused several annoying limitations in the placement of operators (namely, that binary and ternary ops required their LHS to be on the same line as the operator). To get around such limitations without requiring special-case lookahead in many places, and to allow the removal of a good deal of special-case handling of EOLs, it was changed to require semicolons to end expressions, with a small handful of cases where it special-cases EOL as EOX because (let's face it) nobody really wants to have to end their if/while/for/scope blocks with a semicolon. This switch, in hindsight, was the right thing to do, as it allows much more flexibility in script coding style than the EOL-as-EOX world does, provides better predictability for those writing script code, and makes the parser more robust.

Identifiers and Strings

s2 supports only UTF-8 inputs. Any input scripts and their script-side strings may contain arbitrary UTF-8. Identifiers in s2 are case-sensitive and may consist of:

s2 expects its UTF-8 inputs to be well-formed. When encountering invalid UTF-8, it will stop or skip whatever it's doing with those bytes, possibly silently, and behaviour with non-UTF-8 encodings is essentially undefined (but "should" never outright crash, corrupt memory, or similar). When client C code creates string values (using cwal_new_string() and the various printf-style string generators), it is up to that client code to ensure that the string is legal UTF-8. cwal provides cwal_utf8_read_char(), which reads a single UTF-8 character from a C-string, and that's the routine the library uses to traverse UTF-8 internally.
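Since client C code is responsible for handing cwal legal UTF-8, a client might validate untrusted bytes before creating a string value from them. The following is a minimal, standalone sketch of such a check. The function is_valid_utf8() is a hypothetical helper written for this example; it is not a cwal function and does not use cwal_utf8_read_char().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of cwal): returns 1 if the len bytes at s
   form well-formed UTF-8, else 0. Rejects truncated sequences, overlong
   encodings, surrogates, and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point legal for this length */
        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */
        if (extra > len - i - 1) return 0;          /* truncated sequence */
        for (size_t k = 1; k <= extra; ++k) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;      /* not a continuation */
            cp = (cp << 6) | (unsigned long)(cc & 0x3F);
        }
        if (cp < min) return 0;                     /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* UTF-16 surrogate */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode range */
        i += extra + 1;
    }
    return 1;
}
```

A client could call this on a buffer and only pass the bytes on to a string constructor when it returns 1, reporting an error otherwise.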

String Syntaxes

s2 supports several string syntaxes: single-quoted, double-quoted, and heredocs.

"double-quoted" and 'single-quoted' strings are functionally identical:

Heredocs come in two forms, the first being <<<IDENTIFIER content IDENTIFIER. They are treated like string literals, but are not subject to any form of unescaping (but see the unescape() string method) and have "trimming" behaviour described below. They behave mostly like heredocs in other languages:

Heredocs have a second syntax which plays better with syntax highlighting and auto-indentation modes: {{{heredoc}}}. Examples:

{{{ blah }}};
{{{
   spaces are stripped
   as for the <<< heredoc form
}}};
{{{: colon works as for the <<<: form. }}}

Built-in Constants

The following constant values are implemented as keywords, and behave like any other values:

The integer values (-1, 0, 1), double values (-1.0, 0.0, 1.0), length-0 strings, and all length-1 ASCII strings (byte values 0 to 127, inclusive) are also constant/shared values which, like the above-listed constants, never require allocation and do not partake in lifetime management. i.e. '', "", and <<<EOF EOF (all empty strings) all refer to the exact same, shared C-level empty string as well as the same cwal-level Value instance (and, despite that, they all have a refcount of 0 because builtins do not partake in lifetime management!).

Sidebar: cwal can, and normally does, compile a larger range of integer (but not double) values in as built-in constants. Looking at the source code right this minute, it seems I went overboard and configured it for the inclusive range [-10,20]. Those interested in the details should take a peek at cwal.c and search for CWAL_BUILTIN_INT_FIRST - a reasonable amount of documentation can be found there.
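The general idea of shared built-in integers can be sketched as follows. This is a conceptual illustration only and does not reflect how cwal.c is actually written: the IntValue type, the SKETCH_INT_* macros, and both functions are hypothetical names invented for this example. Only the range [-10,20] is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical range constants, mirroring the [-10,20] range cited above. */
#define SKETCH_INT_FIRST (-10)
#define SKETCH_INT_LAST  20
#define SKETCH_INT_COUNT (SKETCH_INT_LAST - SKETCH_INT_FIRST + 1)

typedef struct {
    long value;
    int is_builtin; /* 1 = shared static instance, never freed */
} IntValue;

/* Static storage: values in the built-in range never need allocation. */
static IntValue builtin_ints[SKETCH_INT_COUNT];

/* Returns the shared instance for small values, else a fresh allocation
   (or NULL on allocation failure). */
IntValue *int_value_new(long v) {
    if (v >= SKETCH_INT_FIRST && v <= SKETCH_INT_LAST) {
        size_t idx = (size_t)(v - SKETCH_INT_FIRST);
        assert(idx < SKETCH_INT_COUNT);
        builtin_ints[idx].value = v;
        builtin_ints[idx].is_builtin = 1;
        return &builtin_ints[idx];
    }
    IntValue *fresh = malloc(sizeof *fresh);
    if (fresh != NULL) {
        fresh->value = v;
        fresh->is_builtin = 0;
    }
    return fresh;
}

/* Built-ins do not partake in lifetime management, so freeing one is a no-op. */
void int_value_free(IntValue *iv) {
    if (iv != NULL && !iv->is_builtin) {
        free(iv);
    }
}
```

In this sketch two requests for the value 5 yield the same pointer, while two requests for 21 yield distinct allocations that the caller must free.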

Scopes and Scoping

Scopes are containers for variables and also play a major role in the lifetime management of non-variables (temporary/unnamed Values, as well as C-side Values not visible from script code). Scopes are organized in a stack and are created implicitly by almost all block-level constructs, including array- and object-literals. Scopes use objects as property storage for scope-level variables, so variable lookups share the same search performance characteristics as object property lookups.

Except in the case of an abnormal C-level exit (e.g. a C-level assert() failure crashes the app, C's exit() or abort() functions are called, or the s2 client fails to clean up the s2 engine instance), s2 always unwinds its scope stack, even when a syntax error occurs, when the s2 exit or fatal keywords are used, or when a script-side assertion fails. The primary side-effect of this is that all finalizers[1] for all values will get called even in the face of s2-level exit/fatal/assert. This is of note mainly for Native bindings which require proper finalization for correct behaviour (which effectively means any non-trivial types which are worth scripting). Not all scripting engines which allow client-defined finalizers guarantee that they will ever be called, and that was, in fact, one of the catalysts which got the cwal project started (after several years of bikeshedding about the GC model). (Specifically, the Google V8 JavaScript engine essentially never calls them, and its developers justify that behaviour as being a valid speed optimization for their engine. Never mind that it's semantically incorrect to not properly finalize a file or database handle if you expect the underlying resources to behave properly.)
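The unwinding guarantee can be illustrated with a small standalone model of a scope stack whose scopes own values with finalizers. None of this is cwal's real data model or API: Engine, Scope, Owned, and every function below are hypothetical names, and the order in which a scope finalizes its values here is an arbitrary choice of the sketch.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*finalizer_f)(void *state);

typedef struct Owned {          /* one value owned by a scope */
    void *state;
    finalizer_f fin;
    struct Owned *next;
} Owned;

typedef struct Scope {
    Owned *values;              /* values this scope must finalize */
    struct Scope *parent;       /* next scope down the stack */
} Scope;

typedef struct { Scope *top; } Engine;

/* Pushes a new, empty scope. Returns 0 on success, -1 on allocation error. */
int scope_push(Engine *e) {
    assert(e != NULL);
    Scope *s = calloc(1, sizeof *s);
    if (s == NULL) return -1;
    s->parent = e->top;
    e->top = s;
    return 0;
}

/* Makes the current scope responsible for finalizing state. */
int scope_own(Engine *e, void *state, finalizer_f fin) {
    assert(e != NULL);
    if (e->top == NULL) return -1;
    Owned *o = malloc(sizeof *o);
    if (o == NULL) return -1;
    o->state = state;
    o->fin = fin;
    o->next = e->top->values;
    e->top->values = o;
    return 0;
}

/* Pops the top scope, running the finalizer of every value it owns. */
void scope_pop(Engine *e) {
    assert(e != NULL);
    Scope *s = e->top;
    if (s == NULL) return;
    while (s->values != NULL) {
        Owned *o = s->values;
        s->values = o->next;
        if (o->fin != NULL) o->fin(o->state);
        free(o);
    }
    e->top = s->parent;
    free(s);
}

/* What an exit/fatal/assert path would do in this model: pop everything. */
void engine_unwind(Engine *e) {
    while (e->top != NULL) scope_pop(e);
}

/* Example finalizer: counts how often it has been called. */
void count_finalizer(void *state) {
    ++*(int *)state;
}
```

The point of the model is that the error path and the normal path share engine_unwind(), so a finalizer registered in any scope runs no matter why the script stopped.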

Where Scopes are Created

This list describes every(?) place where s2 creates a scope:

It seems that's about it.

Script-fatal Errors vs. Exceptions

s2 is (at the C level) capable of maintaining error state as exceptions (intended primarily for the script level) or non-exception errors (intended solely for the C level), both with script location information, an error code (integer), and a descriptive error string. Non-exception errors stop the execution of the script entirely, whereas exceptions can potentially be caught by script code and recovered from. The current thinking is that we will convert C-level errors to exceptions for all cases which we believe are (at least potentially) recoverable from script code. For example, when a syntax error happens inside of a function imported from a separate script file, we don't necessarily want the whole script to die. These details are being decided on the fly as s2 is developed.
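The distinction between the two kinds of error state can be modelled roughly like this. The sketch only mirrors the ingredients named above (a kind, an integer code, a script location, and a message). ErrState, ErrKind, and both functions are hypothetical and do not correspond to s2's real structures or API.

```c
#include <assert.h>
#include <stdio.h>

typedef enum {
    ERR_NONE = 0,
    ERR_EXCEPTION,  /* may be caught and recovered from by script code */
    ERR_FATAL       /* non-exception error: stops the script entirely */
} ErrKind;

typedef struct {
    ErrKind kind;
    int code;       /* integer error code */
    int line, col;  /* script location information */
    char msg[128];  /* descriptive error string (truncated if too long) */
} ErrState;

/* Records an error, replacing any previous state. */
void err_set(ErrState *e, ErrKind kind, int code,
             int line, int col, const char *msg) {
    assert(e != NULL);
    e->kind = kind;
    e->code = code;
    e->line = line;
    e->col = col;
    snprintf(e->msg, sizeof e->msg, "%s", msg != NULL ? msg : "");
}

/* Models a script-side catch: an exception is cleared and 1 is returned
   (execution may continue). Any other state is left alone and 0 is returned. */
int err_try_catch(ErrState *e) {
    assert(e != NULL);
    if (e->kind == ERR_EXCEPTION) {
        e->kind = ERR_NONE;
        return 1;
    }
    return 0;
}
```

Converting a C-level error into an exception, as discussed above, would in this model amount to recording it with ERR_EXCEPTION instead of ERR_FATAL so that a script-side catch gets a chance to handle it.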

For the most part, s2 is capable of pinpointing error locations at the exact symbols which trigger them, and exceptions provide line/column and stack trace info to help narrow down the search. There is a case or two where it will produce confusing results, namely when a dynamically-constructed string gets eval'd but contains an error. In such a case, the location information refers to a "virtual" script which the client cannot see (and which possibly no longer exists by the time the eval returns), and which (for error reporting purposes) starts at the eval point.

Footnotes

  1. ^ Internally, all value types have a finalizer (one shared by all instances of that type). Client-defined Native bindings and client-defined Function state can also have finalizers which are called when the containing native/function is finalized by cwal.