cwal

whcl: Tokenization and Evaluation
Login

whcl: Tokenization and Evaluation

(⬑Table of Contents)

Tokenization and Evaluation

This document provides an overview of whcl's tokenization and evaluation models.

Tokenization

whcl uses two different tokenizers internally:

  1. The lowest-level tokenizer takes a range of string input and feeds us each subsequent token, recording its type, the range of its bytes, and the line/column number. This one is completely static, requires no dynamic memory, and does not remember tokens after it's processed them. This particular tokenizer is largely grammar-agnostic and is directly derived from the same one used in several predecessor projects. This phase is where we distinguish, among other things, numeric literals and strings.
  2. A higher-level tokenizer uses the first one to completely tokenize an input and store it in a different form: an array of tokens (as distinct from a linked list or tree of tokens). This phase is colloquially called "compilation." Though the tokens are allocated as a single array, they also act as a singly-linked list, but with links running in up to 3 directions.

With that initial compilation out of the way, a post-compile step restructures the tokens based on the needs of the evaluation engine. "Restructure" is used loosely here: the order of the tokens is retained but their links to each other (made using numeric offsets instead of pointers, to reduce memory usage) get redirected to create what amounts to a tree. A non-exhaustive list of things which get transformed this way:

The API provides a function for dumping out the final token tree, and here's a snippet of what that looks like:

$ cat x.whcl
for {decl i 0} {$i < 1} {incr i} {
    if {true} {
        incr i [while {1} {break 1}]
    }
}

$ ./whclsh -e 'pragma dump-tokens -v -eof -f "x.whcl"' 
#1: BIC (#BIC_for)@ 1, 0: len 3 ==> #2 for
#2: SquigglyGroup@ 1, 4: len 10 ==> #7 {decl i 0}
  i#3: BIC (#BIC_decl)@ 1, 5: len 4 ==> #4 decl
  i#4: Identifier@ 1, 10: len 1 ==> #5 i
  i#5: LiteralNumber (#LiteralIntDec)@ 1, 12: len 1 ==> #6 0
  i#6: EOF@ 1, 13: len 1 ==> #0
#7: SquigglyGroup@ 1, 15: len 8 ==> #12 {$i < 1}
  i#8: IdentifierDeref@ 1, 16: len 2 ==> #9 $i
  i#9: OpCmpLT@ 1, 19: len 1 ==> #10 <
  i#10: LiteralNumber (#LiteralIntDec)@ 1, 21: len 1 ==> #11 1
  i#11: EOF@ 1, 22: len 1 ==> #0
#12: SquigglyGroup@ 1, 24: len 8 ==> #16 {incr i}
  i#13: BIC (#BIC_incr)@ 1, 25: len 4 ==> #14 incr
  i#14: Identifier@ 1, 30: len 1 ==> #15 i
  i#15: EOF@ 1, 31: len 1 ==> #0
#16: SquigglyGroup@ 1, 33: len 62 ==> #40 {
    if {true} {
        incr i [while {1} {break 1}]
    }
}
  i#17: EOL@ 1, 34: len 1 ==> #18
  i#18: BIC (#BIC_if)@ 2, 4: len 2 ==> #19 if
  i#19: SquigglyGroup@ 2, 7: len 6 ==> #22 {true}
    i#20: BIV (#BIV_true)@ 2, 8: len 4 ==> #21 true
    i#21: EOF@ 2, 12: len 1 ==> #0
  i#22: SquigglyGroup@ 2, 14: len 44 ==> #38 {
        incr i [while {1} {break 1}]
    }
    i#23: EOL@ 2, 15: len 1 ==> #24
    i#24: BIC (#BIC_incr)@ 3, 8: len 4 ==> #25 incr
    i#25: Identifier@ 3, 13: len 1 ==> #26 i
    i#26: BraceGroup@ 3, 15: len 21 ==> #36 [while {1} {break 1}]
      i#27: BIC (#BIC_while)@ 3, 16: len 5 ==> #28 while
      i#28: SquigglyGroup@ 3, 22: len 3 ==> #31 {1}
        i#29: LiteralNumber (#LiteralIntDec)@ 3, 23: len 1 ==> #30 1
        i#30: EOF@ 3, 24: len 1 ==> #0
      i#31: SquigglyGroup@ 3, 26: len 9 ==> #35 {break 1}
        i#32: BIC (#BIC_break)@ 3, 27: len 5 ==> #33 break
        i#33: LiteralNumber (#LiteralIntDec)@ 3, 33: len 1 ==> #34 1
        i#34: EOF@ 3, 34: len 1 ==> #0
      i#35: EOF@ 3, 35: len 1 ==> #0
    i#36: EOL@ 3, 36: len 1 ==> #37
    i#37: EOF@ 4, 4: len 1 ==> #0
  i#38: EOL@ 4, 5: len 1 ==> #39
  i#39: EOF@ 5, 0: len 1 ==> #0
#40: EOL@ 5, 1: len 1 ==> #41
#41: EOF@ 6, 0: len 0 ==> #0

The token format in this dump (subject to change at any time):

The numerous "EOF" tokens are how the tokenizer marks the end of a given logical block. When restructuring a block construct into a parent/child relationship, we replace its closing token, e.g. }, with a virtual EOF. Those EOF tokens are critical to how the eval engine works.

The "BIC" tokens are references to builtin commands.

Evaluation

With the above tree of tokens in place...

From the top, down, evaluation looks more or less like this:

First, we find the first "command," or the first "command-looking" token in the script. We evaluate that token to ensure that it is_is_ a command. If not, trigger an error.

Then we figure out its arguments: all tokens between the command and the next end-of-expression (EOX), an EOX being any of:

  1. A newline
  2. A semicolon
  3. An EOF token

That collection process is simply traversing the in-memory tokens

After we've collected all of the arguments, we have to evaluate each of the tokens. How this is done depends on whether the command is a built-in command or a command bound as a script engine callback function:

What primarily distinguishes builtin commands from "native" commands is that the former have direct access to the tokens being evaluated, rather than having a translated form (though most builtins transform at least part of their arguments to Values).

Evaluating a token can involve any number of things, depending on the token's type, the evaluation context, and the average price of cheese in Wisconsin. In essence, all token evaluation is about converting a token a value, noting that even commands, including builtin ones, evaluate to a value.

We can think about the script as one token, each line of commands a list/tree of tokens, and each argument to each command being a single token of arbitrary complexity. Recall that block constructs are themselves treated as individual "supertokens," and they contain an arbitrary number of other tokens. Certain block constructs impose limitations on their types and numbers of contents. Some such limitations are to simplify evaluation and other are somewhat artificial, e.g. [command ...] does not allow more than one inner command and disallows a semicolon.

Expressions

During the process process of evaluating the top-level list of commands in a script. There are essentially two types of execution context in whcl:

Like TCL, whcl treats each of those distinctly and does allow one to freely intermix logical operations in a list of command arguments.

Evaluation of expressions may involve both logical and math operations, using essentially arbitrary tokens as operands. It is implemented using a conventional stack machine, with one exception: it uses eval recursion, rather than the stack machine, to dive into block-construct tokens. The end result is the same, but it simplified internal structure significantly over having one central eval stack to manage.

At the client-script level, this feature is generally exposed via the expr command, but is also implicit in some other contexts. e.g. the COND part of if {COND} {...}. Similarly, (...) is essentially shorthand for [expr ...]. Because a command nested using [...] is a single token for eval purposes, we can short-circuit commands using expressions:

decl x (($v && [command-one]) || [command-two])

Which almost the same as ($v ? command-one : command-two) in C-like languages, except that command-two would run if the LHS subexpression is falsy.