If you are reading this on github (a read-only mirror): most of the links in this doc, and various formatting, will not work on github because this page is written for the Fossil SCM repository hosted at this project's canonical home: https://fossil.wanderinghorse.net/r/c-pp
These are the docs for the "trunk" version of
c-pp. See the "lite" branch for the lighter-weight fork referenced by the SQLite JS/WASM docs (which continues to be maintained for that purpose).
The C-minus Preprocessor (a.k.a. c-pp or cmpp) is a minimalistic C-preprocessor-like application. Why? Because C preprocessors can process non-C code but generally make quite a mess of it1. The purpose of this application is to provide a minimal preprocessor with only the most basic functionality of a C preprocessor (see below). It was conceived for use with JavaScript code but is generic enough to be used with essentially arbitrary UTF-8 text (including C code).
Like a C preprocessor, this tool reads input from text-based sources
and conditionally filters out parts. Unlike CPP, c-pp does only the
most basic of inline expansion of content, namely (and optionally)
tokens in the form @TOKEN@.
(Diagram: "Input" → "MAGIC!" → "Output")
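As an illustrative sketch (using the symbolic `#` delimiter described later, and noting that `@token@` expansion is off unless enabled via `#@policy` or the `-@` flag), an input such as:

```
#@policy error
#define name "world"
Hello, @name@!
```

would emit `Hello, world!`.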
Features of potential interest:
- Can perform moderately sophisticated filtering of text inputs in a fashion similar, but not identical, to a C preprocessor.
- Well documented, in the form of this file and libcmpp.h.
- Can stream its input from any source via an input stream abstraction. It provides implementations for `FILE` and file-descriptor sources, and creating custom implementations is usually trivial.
- Can send its output anywhere via an output stream abstraction. It includes implementations for `FILE` and file-descriptor destinations, as well as for strings which it dynamically allocates on demand to buffer the output.
- Can process multiple distinct inputs and outputs in a single invocation, allowing it to be automated in interesting ways. The test script demonstrates how that can be useful.
- Supports registering custom stateful directive handlers, either linked in or loaded dynamically from DLLs. This capability is, to the very best of my fallible knowledge, a world's first in a generic preprocessor. (IBM's COBOL preprocessor is reportedly extensible but is limited to COBOL input.)
- Supports savepointing within scripts, to limit the scope of any given `#define` or to temporarily override it, reverting to its old value when the savepoint is rolled back; i.e. it supports "local variables".
- Its #pipe directive allows it to run external programs to filter input. That is: an HTML template could embed markdown- or pikchr-formatted code directly and preprocess it using an external converter. This also allows it to wrap a C preprocessor, should one ever really want to. (Pikchr is also supported by an optional directive.)
- WASM-friendly. Though it is as-yet untested in WASM/WASI builds, its API is designed to be friendly to those. Still TODO is to optionally eliminate all dependencies on C-level I/O APIs in such builds, to improve WASM portability. (Such I/O routines are currently only used for debug output and as a default output channel. They are not a core component or requirement of the library.)
- Strictly single-threaded and synchronous, if only to provide evidence that not everything needs to be made async.
- Distributed as a single source file of portable C99, making it easy and portable to copy around. It builds as both a library and a standalone CLI application. These docs cover the high-level features of the library, and the app is a very thin wrapper around that. The API docs are in libcmpp.h. (A good Perl hacker could probably implement most or all of this library in about 100 lines of Perl. This implementation is in C, so is significantly larger than that.)
See c-pp --help for usage details of the application (as opposed to
the library interface, which is in libcmpp.h), in particular the fact
that it processes its arguments and flags in the order they're
provided, which allows chaining of multiple input and output files in
a single invocation.
Design note: this tool makes use of SQLite. Though not strictly needed in order to implement it, it was specifically created for use with the sqlite3 project's own JavaScript code in order to facilitate creation of different builds, so there's no reason not to make use of sqlite3 to do some of the heavy lifting (it does much of that lifting). c-pp does not require any cutting-edge sqlite3 features and should be usable with any post-2020 version.
- Building it
- Preprocessor Markup
- Token Types
- "Define" Keys (a.k.a. "Macros")
- Directives
- #arg is kinda odd
- #assert the truth
- #attach and #detach databases
- #define symbols
- #delimiter separates us
- #error when one must
- #expressing boolean conditions
- #if, #else, #elif, #/if are decisive
- #include other files
- #join values together
- #module loader
- #@policy for @tokens@
- #pipe data in and out
- #pragma wants to be left alone
- #query the database
- #savepoints manage scope-local defines
- #stderr accepts your input
- #undefine symbols
- #undefined-policy is fickle
- #// (comments) are sometimes helpful
- Add-on Directives
- "Function calls"
- The Library API
- Background: Why?
- Potential TODOs
- Reminders to self...
Formalities
Dependencies: A C99-capable C compiler, SQLite, and the target system's libc. It includes a copy of SQLite in the source tree but can use any relatively recent version.
Project home: https://fossil.wanderinghorse.net/r/c-pp
License: the SQLite Blessing
Author: Stephan Beal https://wanderinghorse.net/home/stephan/
Contributors are welcomed - please get in touch via the link above or post to this project's forum.
Building It
Grab a copy of the source code from /download or by cloning the repository using fossil:
$ fossil clone https://fossil.wanderinghorse.net/r/c-pp
Then, from its top-most directory:
$ ./configure --prefix=$HOME
$ make
# optionally:
$ make test
$ make install
Markup
c-pp is, like CPP, a line-oriented preprocessor. It looks for lines which start with its current delimiter (see below) and processes them. Other lines are normally passed through unmodified, but enabling @token@ parsing will cause non-preprocessor lines to be filtered. Similarly, specific directives may treat the content of their own block differently than other content (e.g. #define heredocs).
The general syntax for a c-pp directive line is:
DELIMITER DIRECTIVE ...args
Where DELIMITER is the symbolic # described
below and DIRECTIVE is one of the operations supported by the
preprocessor.
The delimiter "#" used in these docs is symbolic only. The delimiter
is configurable and defaults to ##2. Define
CMPP_DEFAULT_DELIM to a string when compiling to set the default at
build-time. The delimiter may also be modified via the
--delimiter=... command-line flag. This documentation, for brevity
and clarity, exclusively uses # unless it's specifically
demonstrating changing the delimiter.
See #directives for examples and more syntax details.
Token Types
c-pp directive arguments must each follow one of the following forms:
- `word`: a near-arbitrary token with no spaces. Most of the time, `word` tokens resolve as define keys. Sometimes a directive will instead treat them as literal values, such that the word `foo` is interpreted as `foo` instead of whatever value `foo` is defined to (if any).
- `int`: if it looks like an integer, with an optional +/- sign, it's tagged as such.
- `"string"` or `'string'`: this token starts out with quotation marks around it, but they're not part of its value. c-pp does not support backslash-escaping within a string. That is, all backslashes are retained as-is and there is no way to escape the outer quote character within the string.
- `@"..."` or `@'...'`: a string which gets passed through `@token@` parsing when it's evaluated.
- Group constructs:
  - `(...)`: currently only used in subexpressions.
  - `[...]`: context-dependent, but the convention is to use this for lists of other tokens. See #query for an example. See also: the "call" syntax.
  - `{...}`: for context-specific free-form content or, sometimes, used like a quoted string. See #query for an example.
- Syntactic quirks and limitations:
  - No group may contain an unbalanced closing character.
  - There is no mechanism for escaping a group opening or closing character; i.e. all openers must be balanced by a closer.
  - Their contents do not require backslash-escaped newlines. If newlines are escaped then the backslashes are stripped from them but the newlines are retained. It is not currently possible to double-backslash newlines to force them to remain backslash-escaped after parsing. Potential TODO: transform the escaped newlines to spaces (like we used to). The main problem with that is that it would affect how all directive lines are parsed, not just grouping tokens, and side effects need to be ruled out.
  - Leading and trailing space characters, up to and including the first resp. last newline, are trimmed, but the content is otherwise left as-is because it may contain text intended for external parsing, e.g. via #pipe or #query. Hard tabs are not considered spaces in this specific context so that they may be used in custom content.
Aside from the backslash-escaped newline case mentioned above, c-pp
does not support backslash escaping of anything. That is: it treats
all other backslashes, in all other contexts (unless explicitly noted
otherwise), just as any other character. It does this primarily to
give directives like #pipe flexibility in passing on
arguments. It is, however, admittedly sometimes a problem and it may
eventually need to be solved (i.e. changed to unescape certain
sequences, perhaps opt-in on a case-by-case basis or via the addition
of an as-yet-hypothetical #unescape function).
"Define" Keys (a.k.a. "Macros")
These docs frequently refer to "define keys". That is this project's
term (for lack of a better one; "macro" doesn't really fit here) for
the names managed via #define, #undef, and
the -D.../-U.../-F... CLI flags.
Define key naming rules are:
- Control characters, spaces, and most punctuation are disallowed; alphanumeric characters are allowed, but a key must not start with a number.
- Any of `-./:_` is allowed (but a key may not start with `-`).
- Any characters with a high bit set are assumed to be UTF-8 and are permitted as well.
- A key's length is limited, rather arbitrarily, to 64 bytes.
- Names with the prefix `cmpp` are reserved for use by the library. It does not loudly impose this rule, but it handles its internal defines such that attempts to override them will silently have no effect.
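As a sketch of these rules (the key names here are invented for illustration):

```
#// legal: alphanumerics plus any of -./:_
#define my_key.v2 1
#define build-mode:wasm 1
#// illegal (would be rejected): keys may not start with a digit or '-'
#// #define 1abc 1
#// #define -abc 1
```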
See #undefined-policy for how the library deals with references to undefined values.
Directives
c-pp directives both look and function a good deal like C preprocessor (CPP) directives do. They begin with a delimiter, followed by a directive, followed by any directive-dependent arguments.
A fundamental difference from a CPP is that c-pp's delimiter is
configurable, rather than being hard-coded to #. These docs use #
for brevity, but they always mean "the currently-configured delimiter"
(which can be changed while processing inputs).
Another fundamental difference is that each c-pp instance starts off with no directives installed. When it finds a directive in an input stream it checks its internal list of candidates and registers them on-demand. If it cannot find one, it falls back to the client-registered auto-loader and, if that isn't set or doesn't yield results, it will try to load them dynamically from a DLL. Client applications are free to register any they like in advance, and that's normally simpler than setting up an auto-loader fallback.
Example:
#if a
That #if is a directive, but the one on this line is not because it has non-space content before it.
#/if
Spaces and tabs before and after the delimiter, and between arguments,
are ignored, so the following #if is equivalent to the previous
one:

# if a
...
# /if
A directive may span lines by backslash-escaping each end-of-line character:
#if this is unusually \
long \
"so we'll wrap it"
...
#/if
No spaces may follow such a backslash. As an exception, the bodies of
(...), {...}, and [...] may span lines without requiring
backslash-escaped newlines:
##assert ( 1 ) and defined x and \
( x=3 )
Backslashes are optional within the confines of each group.
So-called "block" directives, like #if, have both an opening and a
closing line. The closing line is always in the form #/DIRECTIVE,
e.g. #/if, #/query, or #/pipe. The closing tags ignore any
arguments so that they can be decorated with informative comments by
document maintainers:
#if defined foo
... 1000 imaginary lines of text ...
#/if defined foo
Non-block directives have one-time effects which take place when they are parsed. Their effects may change the behavior of further parsing.
The following subsections cover each directive in alphabetical order.
#arg
#arg is primarily intended for use as a function. It
expands its argument and emits it. "The plan" is to add flags to this
to perform meta-operations on arguments, e.g. fetching their type
or raw value instead of their expanded value.
Usage:
#arg ?flags? one-argument
Flags:
- `-trim-left|-trim-right|-trim`: trim the given side(s) of spaces and newlines.
- TODO: `-raw`: do not expand the value before emitting it. This would strip the outer quotes from a string, for example, but not process the contents of an at-string.
It's currently difficult to envision a usage for this outside of testing this library.
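A hypothetical sketch of the trimming flags (the define name is invented and the exact trimming behavior is an assumption based on the flag descriptions above):

```
#define padded "   hello   "
#// emits the expanded value with surrounding whitespace removed:
#arg -trim padded
#// emits it with only the leading whitespace removed:
#arg -trim-left padded
```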
#assert
This works like #expr except that (A) it emits no output
and (B) it fails if its expression is false. #assert is
essentially syntactic sugar for:
#if not foo
#error ...
#/if
Which can be shortened to:
#assert foo
#attach or #detach a Database File
This directive is a thin proxy for SQLite's ATTACH command, which "attaches" a database to the current db connection:
#attach "/path/to/my.db" as "foo"
On its own it's not of much use, but it's intended to be paired with #query.
It won't create a new db without a URL-style db name like
file://foo.db?mode=rwc (assuming the linked-in SQLite has that
feature enabled (most builds do)). We don't really want to create or
administer arbitrary dbs from c-pp (there are much, much better ways
to do that). It is, however, useful as a basic templating system,
e.g.:
#attach "my.db" as "foo"
#query {select a, b, c from foo.t order by a}
a=@a@, b=@b@, c=@c@
#/query
#detach "foo"
#define: Set Preprocessor Symbols
This directive "defines" values, in the same sense that a C preprocessor does, the main difference being that defines in c-pp behave more like variables, in that they can be freely overwritten without first having to undefine them.
Usages:
#define foo
#// the same as #define foo 1
#define foo "this is foo"
#define bar foo
#assert bar="this is foo"
Prior to 2025-09-27, the equal sign was "just another identifier
letter", but it is now no longer permitted by this directive. In the
context of expressions, = is a comparison operation.
If a define is given no value, it has an implicit value of 1.
Multiple defines can be set at once with:
#define {
  x -> 2
  y -> 3
}
#assert (x=2) and (y=3)
This form requires a value for each key - there is no default. Each key is interpreted literally and each value is interpreted in the usual ways:
#define {a -> "hi there" b -> a}
Will define both a and b to hi there.
Values in the form (...) are interpreted as integer
expressions.
Design note: after some experimentation, the -> is required (A)
because it's easier [for me] to read that way than {k v k2 v2...}
is and (B) to avoid over-complicating the parsing by optionally
allowing -> or =. My eyes find = to be less legible in that
context.
To define a variable to the contents of a file, use the function call syntax:
#define x [include -raw the-file]
Potential TODOs:
- Flag(s?) to change how define interprets its value.
#define "Heredocs"
#define can also assign a value from a content block using a
heredoc-like syntax:
#define foo <<
content goes here
#/define
Notes and limitations:
- It must end with `#/define` on a line of its own.
- Its content may contain other `#directives`, but they must be completely contained, not interwoven.
- The final newline in the content is included, but that can be suppressed with the `-chomp` flag (see below).
- It is currently parsed for `@tokens@` when it is read (if the current policy is not "off"). The thinking is that it would normally be more useful to delay that, but currently there is no straightforward way to expand it (or to know whether to expand it) after-the-fact.
#define accepts the following flags immediately before the <<:
- `-chomp`: removes one trailing newline from (a.k.a. "chomps") the block before assigning it. `-chomp` can be given any number of times to chomp that many newlines. Chomping has no effect if the content does not end on a newline, but content blocks will always, because of how c-pp's syntax works, have at least one trailing newline unless they are completely empty. Tip: `<<<` is syntactic sugar for `-chomp <<`.
- Potential TODO: a `-@policy` specific to this define. For that to be useful, e.g. delaying `@token@` expansion until the define's value is later read, we first need a good way to tell c-pp to expand later (e.g. by tagging the value as an @-string and recognizing that when fetching the value). Options are being explored for that, but the most obvious ones would affect the lowest-level routines and i'm not sure this feature belongs there. Maybe it does. Who knows?
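A sketch of the heredoc forms described above:

```
#// x gets the value "hi\n" (the trailing newline is retained):
#define x <<
hi
#/define
#// y gets the value "hi" (<<< is sugar for -chomp <<):
#define y <<<
hi
#/define
```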
#delimiter: Change the Directive Delimiter
This directive changes the directive delimiter. The delimiter is managed as a stack, the same way as #@policy. The stack always starts out with the library's compile-time-defined delimiter on top.
Usages:
- `#delimiter DELIM`: changes the current delimiter to `DELIM`.
- `#delimiter push DELIM`: pushes `DELIM` as a new delimiter on the stack, making it the current delimiter.
- `#delimiter pop`: pops the most-recently pushed delimiter. It is illegal to invoke this unless one has invoked a corresponding `push`.
A DELIM argument of default, predictably enough, uses the
default delimiter (set when the library is compiled).
A final argument of << indicates that the new delimiter remains in
place only until a following #/delimiter directive, noting that the
closing directive has to be delimited by the newly-pushed delimiter.
In this form, it is an error if EOF is encountered before the closing
tag is found.
When used in a function call then (A) << is not permitted
and (B) if given no arguments then it will emit the current delimiter.
Example:
##delimiter push @@
@@delimiter push !! <<
!!expr 1
!!/delimiter
@@expr 2
@@delimiter pop
##expr 3
Results in the output 1\n2\n3.
PS: don't do that.
#error Breaks Things
Immediately stops processing with an error.
#error the rest of the line is an error message
As a special case, if the line both starts and ends with the same
character of " or ' then those are stripped from the result.
#expr Evaluates Things
This directive evaluates an expression (described in the next subsection) and emits its result (typically an integer).
#expr expression...
There are no known practical uses for this directive beyond in testing c-pp itself, but see #assert and #if for more practical uses of expressions.
Expression Rules
An expression, in this context, is a series of operators and operands which evaluate to either true or false. Expressions are used by several directives, most significantly #if.
The general syntax is:
[not] [defined] value [COMPARISON-OPERATOR value] [and ...] [or ...] [glob ...]
- `X COMPARISON-OP Y` compares the define of `X` against `Y`. `Y` may be an integer, a quoted string, a define name, or a `(...)` subexpression. The following comparison operators are supported, and spaces between them and their operands are optional: `=`, `!=`, `<`, `>`, `<=`, `>=`. Value comparisons are, for the most part, internally against strings, but expressions evaluate to an integer value.
- `X`, with no comparison operator, performs a boolean check: empty values and those with a value of `0` (zero) are false. All other values are true. (This is a string comparison, so `000` is true!)
- `"..."` or `'...'` are strings. Strings do not currently support any form of backslash-(un)escaping, so a string may not contain its own quote characters. All backslash characters in strings are retained as-is.
- `@"..."` or `@'...'` are "at-strings". They work like strings but, in most contexts, get the same expansion handling as `@tokens@`.
- `(...)` is a subexpression, the contents of which may be any legal expression. These may be nested. They currently always evaluate to an integer. "The hope" is to also support string expressions at some point, but the addition of function calls may make that unnecessary.
The following unary operators are supported:
- `not` negates the result of the expression. `not` may optionally be written as `!`. It may also be used multiple times in a row, each pair of which cancels the other out.
- `defined` changes the expression such that if the argument refers to a defined value, regardless of its value, the expression evaluates to true. The operand must be a `word` token. It does not accept strings, subexpressions, or other operators as its operand. Tip: the string `#if` is technically a word-type token, so it qualifies here, and `defined #DIRECTIVE-NAME` evaluates to true if a given directive exists. (As an exception to this documentation's conventions: it expects a literal single `#`, not the current directive delimiter!) This can be used to test whether a given custom directive has been installed.

Sidebar: `defined` very specifically does not trigger a search for a dynamically-loadable directive. It may trigger an autoloader, and that may trigger a DLL search. Hmm. (Removing the autoloader from that search causes tests to fail and also fails to give me the semantics i'd prefer.) Maybe `defined` needs a flag to specify whether or not to search the various sources for directives (registered ones, autoloadable ones, DLL-loadable ones), noting that an autoloader may do whatever it likes to load a directive.
The unary operators bind tightly to their RHS argument, but without
consideration for whether it is the beginning of a longer
expression. That is (not a=3) will parse as ((not a)=3). The
workaround is to use a subexpression: not (a=3) or, even simpler,
a!=3. (Patches to fix that, even if it means rewriting the beast of
an expression engine, would be very thoughtfully considered!)
The following binary operators are supported:
- The comparison operators listed above.
- `X and Y`
- `X or Y`
- `X glob Y`: evaluates to true if `X` matches glob pattern `Y`, else false. `X` is currently restricted to a quoted string or a define name. `Y` is required to be a quoted string or an at-string. `(X not glob Y)` is syntactic sugar for `(not (X glob Y))`.
All of the binary operators are evaluated strictly left-to-right, with equal precedence for each.
Sidebar: there is currently no short-circuiting of `and` and `or` because the evaluation and parsing are closely tied together, but none of their operands have visible side effects, so no harm is done in not short-circuiting (and thus there's no rush to change this).
- Most glaring is that chains of binary operators may need subexpressions: `(a=b and b=3)` does not parse how it looks like it should, and currently needs to be written as `(a=b and (b=3))`.
- FIXME: call syntax needs to be permitted for operands.
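A few sketches of the rules and caveats above:

```
#define a 3
#assert a = 3
#assert a != 4
#// (not a=4) would parse as ((not a)=4), so use a subexpression:
#assert not (a = 4)
#// chained binary operators need subexpressions after the first one:
#assert a = 3 and (a > 1)
```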
#if, #else, #elif, #/if
#if and friends cause blocks of the input to be emitted or elided
depending on the result of an expression. The expression syntax
differs from that of a C preprocessor but the end result is the same. This
family of directives includes #elif, #else, and #/if.
#if's arguments must make up an expression. #/if ignores
all of its arguments - it's commonly useful to add a note there saying which
block is being ended.
Example:
#if foo=1
...
#elif foo < 2 or foo > 5
...
#elif bar or baz or not defined charlie
...
#else
...
#/if foo=1
(Any text after /if on that last line will be ignored, which is
useful for annotating the line with the purpose of the block it's
closing.)
#include External Files
This directive emits the contents of other files into the output:
#include ?-raw? filename...
The filename arguments may optionally be quoted, and must be if they contain any quote or space characters.
The -raw flag specifies that each file's contents are to be passed
through to the current output channel with no interpretation,
otherwise each file is filtered through the preprocessor as if it were
part of the current file.
The filenames are searched for in the so-called "include path", which
works just like a C/C++ include path. If no path is provided when
invoking c-pp then it defaults to using a path of "." (only the
current directory). If the -Idirname flag is provided then the
default of "." is not applied. -I... can be used any number of
times to specify search directories and they will be searched in the
order provided.
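A brief sketch of the forms described above (the filenames are invented for illustration):

```
#// processed as if it were part of the current file:
#include header.txt
#// emitted with no preprocessing; quoting is required for
#// names containing spaces or quote characters:
#include -raw "a file with spaces.txt" LICENSE
```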
#join Arguments Together
The #join directive concatenates its arguments and emits the result
to the output stream.
It accepts the following flags:
- `-s SEPARATOR`: sets the separator which gets emitted between the following arguments. It may be used multiple times to change separators. The default is a single space.
- `-nonl`: when running in non-call mode, do not emit a newline. The default is to emit a newline. (In call mode, newlines are trimmed automatically by a higher level.)
$ ./c-pp -Db=2 -e '##join 1 b 3'
1 2 3
$ ./c-pp -e '##join 1 2 3 [join -s X 4 5 6]'
1 2 3 [4X5X6X]
TODO: unescaping of the separator to allow newlines and tabs. This needs to be done at a different level of the API, though.
#module Loads Directives from DLLs
The #module directive can load new directives from DLLs. In safe
mode it will neither register nor run.
#module "dllName" "directive-name"
It tries to open the given DLL, find an entry point with the given
name, which it assumes to be of type cmpp_loadable_module*, and it
invokes the module's callback. The intent is that such callbacks
register new directives. The DLL name argument may include the
platform's conventional DLL extension (".so" on most platforms), but
that's optional - the search includes checking the name both
as-provided and with the DLL extension added to it.
This support currently only works on Unix-esque platforms: those with
either dlopen() or ld_dlopen(). Patches to add support for other
platforms would be welcomed.
Registration of modules is handled via macros named CMPP_MODULE_...
in libcmpp.h.
The directive name can be left off if the module in question is specifically built and registered as the sole module in that DLL (in which case it uses a pre-defined entry point name). Whether that's the case depends on how it is built and which module registration macro(s) it uses.
This directive performs no filename transformation beyond the path lookup and automatic DLL extension.
When this directive is invoked, if the module search path is empty and
the CMPP_MODULE_PATH environment variable is set, it is added to the
module path. If set, it is expected to be in the form of a colon- or
semicolon-delimited list of directories (the former on Unix-like
systems, the latter on Windows).
Example:
./c-pp -Ddll=libcmpp.so \
-e '##assert not defined #dyno' \
-e '##module dll dyno' \
-e '##dyno hi there' \
-e '##assert defined #dyno'
cmpp_dx_f_dyno() arg: cmpp_TT_Word hi
cmpp_dx_f_dyno() arg: cmpp_TT_Word there
When the DLL is built with a singleton module registration the entry point name is not required, as the singleton uses a well-defined name:
$ ./c-pp -e '##module "libcmpp.so"' -e '##dyno hi there'
cmpp_dx_f_dyno() arg: cmpp_TT_Word hi
cmpp_dx_f_dyno() arg: cmpp_TT_Word there
Example module: /file/src/d-dyno.c
Directives in Loadable Module DLLs
If built with DLL support and it's not running in safe mode then the
library will, when encountering an unknown directive, search for a
matching DLL. For purposes of this search, the DLL is expected to be
named libcmpp-d-NAME.so. The module search path defaults to the
$CMPP_MODULE_PATH environment variable, but it can also be set with
the -L flag to c-pp or the C API's cmpp_module_dir_add().
If it finds a matching DLL, it opens it, and, if it finds a loadable module in it, that module's registration function is called. If that call registers the being-sought directive, the library continues processing. If not, then it fails with an "unknown directive" error.
The C API also offers an "autoloader" API which clients can install to load their own statically-linked directives on demand or to implement their own DLL search. That's independent of the library's automatic DLL search (which is, in terms of search priority, last on the list).
@policy Controls Expansion of @tokens@
#@policy ?push? POLICY-NAME
#@policy pop
By default c-pp does no expansion of content beyond the filtering of
content blocks using #if. If it is passed the -@ flag, or if this
directive is used in a script, then it will perform a restricted type
of expansion on content blocks: tokens in the form @TOKEN@ are
processed as described below.
#@policy takes a policy name argument, defaulting to error, which
describes how to deal with @tokens@ in the input:
- `off` (the default): no processing of `@tokens@` is performed.
- `error`: fail if an undefined `@X@` is referenced. This is the default if `#@policy` is used without an argument.
- `retain`: emit any unresolved `@X@` tokens as-is to the output stream.
- `elide`: omit unresolved `@X@` from the output, as if their values were empty.
The push option tells it to set the policy and remember the previous
policy. The pop option restores that previous policy and will error
out if there is no level to pop.
Behavior and limitations:
- `@token@` expansion generally happens only in "content" parts, not preprocessor lines. That is, `#if foo=@bar@` won't try to expand `@bar@` (just use `foo=bar` for that). At-strings can be used in some contexts to perform `@token@` expansion on directive arguments.
- It will not cross line boundaries looking for a closing `@`; i.e. the `X` part of `@X@` may not contain newlines. (The expanded value may contain newlines.)
- The `X` part of `@X@` is treated as a define key. If no match is found, then the current `#@policy` specifies how to deal with it. If a match is found then `@X@` gets replaced by the define's value.
The --no-@ CLI flag or #@policy off both disable expansion until
either a subsequent -@ or @policy flag re-enables it.
A demonstration of the "@policy":
$ echo 'a@x@c' | ./c-pp --@policy=off
a@x@c
$ echo 'a@x@c' | ./c-pp --@policy=retain
a@x@c
$ echo 'a@x@c' | ./c-pp --@policy=elide
ac
$ echo 'a@x@c' | ./c-pp --@policy=error
a
./c-pp: @<stdin>:1: Undefined key: @x@
Predefined @tokens@
- `__FILE__` resolves to the current input file's name.
#pipe Filters Content through External Processes
This directive is not currently available on Windows builds (patches to improve that would be thoughtfully considered!).
#pipe runs an external command, optionally feeds it input from the
script, and emits the output from that command:
#pipe -- /usr/bin/sed -e 's/this/that/'
this content
#/pipe
Will pipe this content\n into sed and get that content\n back.
Similarly:
#define cmd "echo"
#pipe -no-input -chomp-output -- cmd this is from echo
Will emit this is from echo and chomp the trailing newline
from the output.
Arguments and flags:
- `-chomp`: each time this flag is used, it causes one newline to be removed from the directive's input block.
- `-chomp-output`: each time this flag is used, it causes one newline to be removed from the external command's output.
- `-no-input`: tells this directive not to consume the following content looking for a `#/pipe` directive. The external command is sent no input from this directive.
- `-exec-direct`: normally the external command and its arguments are passed to the OS as a suffix of `/bin/sh -c`. This flag tells it to run that command directly, without the intermediary shell. This can only work if the command has no arguments, otherwise the arguments will be treated as part of the command name. (We could optionally implicitly set this if the command has no arguments.)
- `-path`: tells it to search the `$PATH` when looking for the command, as documented for `execlp(3)` and `execvp(3)`. More specifically, it uses `execlp(3)` or `execvp(3)`, depending on the form of the command (see below), instead of `execl(3)` or `execv(3)`.
- `-debug`: emit the post-processed command to stderr before running it.
- `--`: must immediately precede the command name. This tells the directive that we are switching from c-pp's token parsing to near-arbitrary input.
The final argument must be the command and its arguments in one of two forms:
- `command-name ...args`: if the command name is not quoted then it is treated as a define key unless it contains any `/`, `\`, `.`, or `-` characters. In those cases it is assumed to be a filename or command switch and is not subject to any further interpretation. At-strings, as well as define names which do not match the aforementioned patterns, will be expanded appropriately.

  Only the first token of the command string is parsed, so that command names may be runtime-configurable via defines. The remaining arguments, because they may be essentially free-form, are not parsed as arguments by c-pp, but are passed on almost as-is to the command. The only interpretation they go through is (A) to determine where this directive's line ends and (B) any backslash-escaped newlines in the arguments get elided entirely, as if they were not there.
- `[command-name args...]`: in this form, each argument in the given list is treated like a normal directive argument. Each may be a string, at-string, number, or word. The one exception to their normal processing is the same one described for the command name in the previous form, but in this form that rule applies to all unquoted word tokens. The `--` flag is optional for this call form because the `[...]` group unambiguously tells us that it's the command.
The external command gets piped, via its stdin, the contents of the
directive's block unless -no-input is used. The command's stdout
output is collected and emitted in its place. The output is not
currently post-processed in any way except as per the -chomp-output
flag, but should the need arise we can easily add optional
at-token parsing to the output via a flag.
Stupid #pipe trick: run a C preprocessor through it:
```
##pipe -path -- 'cpp' -E
#include <stdio.h>
##/pipe
```
That requires using a directive delimiter other than #
to avoid a conflict with cpp's #.
TODOs:
- BUG: it will hang, waiting on I/O, in some constructs, e.g. the one marked BUG in this file.
- A build option and CLI flag to disable both this and `#include`, to make it safer for use with potentially untrusted inputs.
- Flag(s) to control whether or not to @-parse the command arguments.
- A `-define X` flag which sets X to the piped output instead of emitting it.
- Figure out how to report when the underlying `exec()` call fails due to an invalid command name. "The problem" is that the command is run as an argument to `/bin/sh -c`, and `exec()` succeeds in calling that, but then `/bin/sh` fails to find the command. That happens in the child process, so we can't directly report it to the parent. Currently this situation results in empty output (and maybe a cryptic message from `/bin/sh` on stderr) but no error. (Maybe we should close the child's stderr? Or maybe capture it separately and error if stderr produces any output? How do we do that?)
#pragma Is for Debugging
This directive is undocumented. It changes at the whim of the library's developer, primarily to support testing and debugging.
#query Renders Data from a Database
This directive runs SQL queries. c-pp internally uses only one
(private) database, so #query isn't much use on its own except for
in testing c-pp, but #attach can be used to attach
arbitrary databases (and was added to support #query).
This directive has two forms:
First, it can run an SQL query, set scope-local defines for each result column, and filter its block's contents for @tokens@ using the current @token@ policy:
```
Your list of foo:
#query {select name AS name, price AS price from foo order by name}
@price@ @name@
#query:no-rows
This part is optional and is emitted if the query has no results.
#/query
```
That form requires a terminating #/query directive but the
#query:no-rows sub-directive is optional (and may not appear more
than once).
Secondly, it can define one or more symbols from the first row of an SQL query:
#query define {select a, b from c order by a}
This form does not use a terminating #/query directive.
For the first form the body of the query block is
@token@-expanded to the output stream one time for each
result row. Before each is expanded, defines are set matching the
names of the result columns. The defines are set within
the context of a local savepoint so that after the
query is processed the defines are either unset or reverted to their
previous values. If no rows are found, the (optional) #query:no-rows
block is emitted. If that block is not set, no output is emitted for
queries which have no result rows.
The query block may contain other directives, but any directives
need to be completely enclosed inside the #query...#/query body, not
interwoven.
The "define" form sets corresponding defines for the first row of the result set and does not use a savepoint. If no result rows are found it sets each define to an empty value. (Potential TODO: add a flag to error out in that case, or maybe provide default values.)
Sidebar: remember that the only guaranteed reliable way to get a result column's name is to set it oneself using `SELECT x AS x` (with the "AS" being optional).
Formatting of the results, if needed, can be done using SQLite's
format() function. It is exceedingly unlikely that c-pp will ever be
extended to include formatting-related features. (However, function
calls bring that capability within easy reach.)
Potential TODOs:
- Maybe make the `@token@` policy for the content part configurable for this call, rather than using the current policy. It seems that a mode of "error" is the best fit for this use and it's difficult to imagine wanting any other mode here. However, there's an internal reason which enforces that we use the current policy here, and that still needs to be resolved.
Binding Query Parameters
Query parameters can be bound either by name or index, but not
a mix of both, by adding a bind argument:
```
#query {select :a a, $b b} bind {:a -> 1 $b -> 2}
#query {select ?1 a, ?2 b} bind ["one" "two"]
```
Sidebar: SQLite supports a prefix of `@` in addition to `:` and `$`, but it's not supported here because of syntactic confusion with at-strings.
Bind values may be any of:
- A quoted string (the quotes are not part of the bound value).
- A `{...}` block is treated like a quoted string, supported here solely for the outlier case where a value has to contain both single- and double-quotes.
- A define name gets expanded to its value.
- An at-string gets expanded.
- An integer.
- An integer expression enclosed in `(...)`.
#savepoint: Scoped Defines
Savepoints are like nestable transactions. In c-pp they let us define/undefine values in a scoped manner. That is, a symbol defined in a savepoint will become undefined, or revert to its pre-savepoint value, if that savepoint is rolled back. It might be interesting to someday explore how savepoints might be used for content blocks as well, but the internals are not currently set up to do such a thing (we'd need to buffer all output to the db or memory, rather than sending it directly to the output channel).
#savepoint requires a single argument:
- `begin` starts a new savepoint.
- `commit` saves all changes and closes the savepoint.
- `rollback` discards all changes made since the start of the most recent savepoint and closes that savepoint.
If a savepoint is neither committed nor rolled back by the end of its script file, it will automatically be rolled back. It is an error to try to end a savepoint when none is currently open.
```
$ cat foo
#@policy error
#define bar=2
#savepoint begin
#define foo
#if not foo
# error expecting foo
#/if
foo is @foo@
bar is @bar@
rolling back...
#savepoint rollback
#if foo
# error foo should not be set
#else
foo is gone
bar is @bar@
#/if
#if not bar=2
# error expecting bar=2
#/if
begin again...
#savepoint begin
#define foo=again
#if not foo=again
# error expecting foo=again
#/if
foo is @foo@
bar is @bar@
committing...
#savepoint commit
bar is @bar@
#if not foo=again
# error expecting foo=again
#/if
bar is @bar@
foo is @foo@
the end
$ ./c-pp --delimiter '#' foo
foo is 1
bar is 2
rolling back...
foo is gone
bar is 2
begin again...
foo is again
bar is 2
committing...
bar is 2
bar is 2
foo is again
the end
```
Why was #savepoint added? An idle thought of "wouldn't it be
interesting to automatically undefine these vars at the end of the
file which defined them?" led to "oh, savepoints can do that". Then
it was actually really easy to add.
#stderr
Emits remainder of line to stderr.
#stderr This goes to stderr along with file location info.
#undef
Undefines one or more defines:
#undef foo bar baz
#undefined-policy
Specifies how c-pp should react to references made to undefined keys:
```
#undefined-policy ?push? error|null
#undefined-policy pop
```
The policy values are:
- `null` (the default): treat undefined keys as falsy.
- `error`: trigger an error if resolving an expression would require using an undefined key. This should probably be the default. The `defined` expression operator specifically does not trigger such errors.
push and pop work exactly as described for #@policy.
#//: Comments
Infrequently useful, but...
#// This is a c-pp comment.
There must be a space after the // because that // is, despite
appearances, parsed as a directive name.
Multi-line comments are not supported but #if can be used for the same effect:
```
#if defined nope
...
#/if
```
Add-on Directives
This section describes directives which are not part of the core library but which are in this source tree, available for copy/paste reuse. They may require third-party software. They may or may not also be pre-built into the library or CLI app.
The directives are listed in alphabetical order.
#c-code
This proof of concept directive filters input into C code formats.
Source file: d-c-code.c
```
#c-code -mode byte-array \
        -getter get_mah_bytes {
this is content}
```
Emits something like:
unsigned char const * get_mah_bytes_get(unsigned * pLen){
static unsigned char const _a[] = {
10,116,104,105,115,32,105,115,32,99,111,110,116,101,110,116
};
if(pLen) *pLen=sizeof(_a);
return _a;
}
And:
```
#c-code -mode byte-array -hex -name mah_bytes
...content goes here...
#/c-code
```
Emits:
unsigned char const mah_bytes[] = {
0x23,0x69,<big snip>...
0x0a
};
The block content may contain other directives, which is
especially useful here with `#include -raw`.
-mode cstr has it emit the content as a string literal.
#pikchr
This directive reads pikchr input and emits SVG-format image output.
Source file: d-pikchr.c
Usages:
```
#pikchr ...flags
... pikchr markup...
#/pikchr
```
Or:
```
#pikchr ...flags { ...pikchr markup... }
```
Those differ in the following ways:
- The block form may contain other directives, whereas `{...}` may not.
- The block form's output is implicitly @token@-parsed using the current `@token@` policy. The `{...}` form is not @token@-parsed by default, but see the `-@` flag.
It emits an SVG-format image or an error message. In the case of a
pikchr() error, this directive emits the full pikchr result to the
output stream before setting the error state to something less
verbose than pikchr()'s error dump.
Flags:
- `-@` tells the `{...markup...}` form to @token@-parse the `{...}` block using the current policy. This flag is illegal in the block form (which is implicitly @token@-parsed according to the current policy).
- `-dark` tells pikchr to prefer a "dark-mode" color scheme.
- `-css-class STRING` adds the given CSS class(es) to the generated SVG image.
- `-unchomp` forces an additional newline on the output. May be used multiple times.
- `-chomp` removes one trailing newline from the output. May be used multiple times.
Regarding newlines: it's not specified whether pikchr output always includes a trailing newline. If -unchomp and -chomp are used together, results may be unpredictable.
"Function Calls"
As of 2025-11-11 c-pp supports a limited form of "function call" in
the form [D ...args] where D is the name of a directive. This only
works for directives which can function without a closing directive
(even if they do so only conditionally, e.g. #query, in
which case only the closing-directive-less forms are legal here).
Calls work by doing the following:
1. Copy the input string, prepending the current delimiter to it. We have to copy it because $REASONS.
2. Redirect the current output stream to a buffer.
3. Process the buffer from #1 as an input document.
4. Restore the output stream to its previous state.
5. Any output from that document is now in the buffer from #2, which becomes the result of the call. A single trailing newline is trimmed from the result.
It's still being determined where this syntax should be legal, but here are some examples of where it currently is:
- Expression tokens
- `#query` bind values
- `#define` values
- During @token@ parsing, `@[...]@` is experimentally a form of call, and the `...` part may span lines like `[...]` may.
- The `[sum ...args]` directive was created as a demonstration of this feature, simply adding all integer-looking arguments together.
Some of the functions currently available: #arg, #join.
The Library API
This section demonstrates how to use the library API from client C code. It is not an exhaustive guide (that's what the API docs are for) but is enough to get started with the library.
The first step is getting a preprocessor instance:
```c
#include "libcmpp.h"
...
cmpp * pp = 0;
int rc = cmpp_ctor(&pp, 0/*optional flags*/);
if( rc ){
  // error
  if( pp ){
    // cmpp_err_get() will get the error info.
    cmpp_free(pp);
  }
  return;
}
... use pp ...
```
(Initialization will only fail if an allocation fails or if
optional custom initialization code fails. In the former
case, pp will always be NULL. In the latter case, the pp's error
state holds info about the failure.)
Next, we set up an output channel:
```c
cmpp_outputer out = cmpp_outputer_FILE;
out.state = stdout;
cmpp_outputer_set(pp, &out, "<stdout>");
```
Any output destination which can be wrapped in the cmpp_output_f()
interface is suitable. Implementations are provided for FILE*, file
descriptors, and cmpp_buffer (basically a dynamic string buffer),
and adding one's own is normally trivial, e.g. to send output directly
to a UI widget.
Then we feed it some input:
```c
unsigned char const *input = ...a script full of input...;
int rc = cmpp_process_string(pp, "my-input.txt", input, -1);
if( 0==rc ) {
  ... success ...
}
```
In essence it can take input from anywhere, but it requires that the
input be completely available when parsing starts, so the lowest level
of feeding it input is cmpp_process_string(), where each call
equates to a new input source. cmpp_process_file() and
cmpp_process_stream() are both thin proxies around
cmpp_process_string().
On success, all of the output will show up in the provided output channel. On error, the output may have been partially generated and must not be trusted as being complete or usable. Most errors cannot be recovered from without cleaning up all state, and practice shows that in this context there's little or no reason to attempt it.
When we're done we need to clean up:
cmpp_dtor(pp);
For the most part, that's really all there is to it.
The library can be extended with custom directives and several are
demonstrated in d-demo.c and d-pikchr.c. Custom directives
can perform essentially any jobs the builtin directives do, the
notable exception being flow-control changes (like #if does). More
properly, they can implement flow control but must provide the
infrastructure needed for nesting such constructs and ensuring that
they're closed properly. The internal infrastructure for doing so is
probably not well-suited to general-purpose flow control, e.g. adding
a hypothetical #while or #foreach loop. Similarly, the
expression-evaluation API is not yet in the public API, and it's still
being determined whether to make it so (because it's rather
primitive).
Library Build Options
The library, for client-side use, is distributed in two files,
libcmpp.[ch], which can be created with make libcmpp.c.
The following CPP defines influence how libcmpp.c is built. They
have no effect on client code.
- `-DCMPP_CTOR_INSTANCE_INIT=function_name`: if set, the given function must have a signature of:
  `int function_name(cmpp*)`
  It will be called as part of `cmpp_ctor()` so that any custom directives can be added to each new instance. It will be called before the preprocessor installs any of its built-in directives, so custom directives may override builtin ones.
- `-DCMPP_MAIN`: includes the `main()` impl for the `c-pp` binary.
- `-DCMPP_MAIN_INIT=func`: works just like `CMPP_CTOR_INSTANCE_INIT` (see above) but applies only to the instance which `main()` uses. This is used, e.g., for plugging in custom/non-core/demo directives.
- `-DCMPP_MAIN_AUTOLOADER=func`: if defined then `func` must have the signature of `cmpp_d_autoload_f()`. It is installed as the main instance's directive autoloader.
- `-DCMPP_OMIT_...`: the following features are optional because they give scripts ways to access near-arbitrary content and may, in some uses, be security-relevant:
  - `CMPP_OMIT_D_DB`: omit #query, #attach, and #detach.
  - `CMPP_OMIT_D_PIPE`: omit #pipe.
  - `CMPP_OMIT_D_INCLUDE`: omit #include.
  - `CMPP_OMIT_D_MODULE`: omit #module.
  - `CMPP_OMIT_ALL_UNSAFE`: sets all of the above `OMIT` flags and will include any future directives which access the filesystem, invoke external processes, or similar. This flag currently only affects directives, not other library-level APIs, but an eventual goal is to be able to make all filesystem-specific parts optional. ("Unsafe" is too strongly worded here but its heart is in the right place.)
Background: Why Create c-pp?
In mid-2022 the SQLite project started work on its JS/WASM bindings. It was initially written for "vanilla" JS for the simple reason of personal preference of the guy writing the code, but it was clear we would eventually need to support ESM (ES6 modules) because that's what the modern-day JS ecosystem uses. Vanilla JS and ESM are 99.9% identical but each has tiny context-specific syntactic differences. Most differences in JS can be resolved via runtime introspection but syntactic differences make code outright illegal in one or other of the modes.
We had several options for dealing with this:
- Ignore it. It might go away. This was tried, but pressure eventually mounted and my proverbial white flag had to be raised. (Tip: having a support contract with SQLite greatly increases the odds of one's own specific variety of pressure bearing fruit!)
- Switch to ESM only. That wasn't going to happen (A) for the aforementioned reason about the one doing the coding and (B) because, at the time, some browsers could not yet launch ESM modules as Workers. Since the "killer feature" of the project's JS bindings was expected to be its integration with persistent client-side storage via OPFS, and OPFS is only available in JS Workers, point (B) held significant weight.
- Maintain two copies with slight differences. No way. No. way. Nope.
- Construct the sources dynamically. This could easily turn into a huge mess of scripts but... it still sounded like the best of the available options.
A notable restriction: one rule of the SQLite project is that we cannot simply import random code into it, so any tooling was going to have to be hand-rolled by members of the project. Spoiler alert: only one team member needed this tool, so it was up to them to implement it (double-spoiler alert: 🙋♂️).
First we tried a C preprocessor, as that's precisely the type of
thing we needed, but it didn't take more than 15 minutes to determine
that it was unsuitable for the job. Summary: C preprocessors make a
mess of non-C code by injecting it with C-isms like #line markers
or, in the case of GCC, a GNU license header. If gcc's preprocessor
could have been taught to emit only its filtered inputs, without
irrelevant other content, the story would have ended there and much
subsequent effort could have been spared.
The SQLite project has a strong culture of "keep it simple" and "don't be shy about writing your own tools", instilled the hard way over 2.5 decades, and that culture has seeped into me in my time there. My built-in tendency, however, is to over-engineer everything, even otherwise simple shell scripts, a fault at odds with The SQLite Way. Even so... we needed a preprocessor, or something like it.
For logistical reasons, the choices had to come down to Tcl, dependency-free C, or the core Unix tools like sed, awk, and sh. A large handful of Tcl scripts already generate the core of SQLite, some much like a very-specific-purpose preprocessor. At the time, my Tcl-fu was not strong enough for me to confidently pull off my envisioned tool in Tcl. Maintaining JS code using shell scripts was, and remains, simply unappealing. So C became the implementation route of choice.
Writing dependency-free C code can be somewhat tedious, as one invariably ends up re-inventing the same set of utility code, like a memory buffer class and a function to read in a file's whole contents at once (possibly into one of those buffers). In this case we'd also need a hashtable early on and, sigh, it would have to be written3.
It turns out, though, that we could use sqlite3.h and still be
effectively dependency-free because this tool would be embedded in
SQLite's own tree. How convenient! Long story short: being able to use
an in-memory db as a hashtable was a huge time-saver and had further
downstream benefits.
So work began on the preprocessor with the self-imposed restriction that it do only what we need, and not (contrary to my core nature!) be designed as a generic, client-agnostic, tool. That meant, for example, that it would use only global state, read only from a single file handle, and write only to one file handle. (Whereas my natural tendency would be to abstract the I/O channel into a client-extensible interface, taking up more code, more time, and adding a feature we ultimately wouldn't use. Sigh.)
And thus c-pp was born.
c-pp has proven invaluable for its initial role. SQLite has, as of late 2025, some 8 or 10 different JS builds, all from the same core source files, and that would have been nigh impossible for our tiny team to reliably manage without some sort of source-filtering tool.
It turns out that there's a third build mode we didn't know about at the time: "bundler-friendly builds". "Bundlers" are source code analysis tools which look through the multitudes of dependencies used by modern JS dev approaches and "bundle" them into sets which contain only the reachable parts of that code. One of their limitations is that they cannot resolve dynamically-generated string references to external filenames, which means that they cannot resolve dependent file names which are computed in code. They have to be fed such file names as string literals instead. Sigh. Bundler builds differ from ESM only in their requirement for hard-coded string literals for the parts of SQLite which have to load external scripts (like its OPFS VFS or its "worker1" API). We cannot unilaterally use hard-coded strings because (A) that's icky and (B) we don't know the full paths to some files at compile-time. Bundler builds work around (B) by hard-coding a name which will only work in limited contexts.
At some point my natural urge to over-engineer got the best of me and c-pp was refactored from a single-purpose monolithic app into a client-agnostic library, quickly more than tripling in code and docs. It would be difficult to justify adding that sort of complexity and code bloat to the SQLite tree, given that that tree needs exactly none of it, so the original/"lite" version is maintained over in the lite branch, tweaked only insofar as necessary for SQLite-side JS maintenance.
The trunk branch, contrariwise, is where my over-engineering gets to run rampant, without risk to the SQLite JS builds. Some of the remnants of c-pp's original monolithic-app shape are still visible in its interface and code, but the trunk version has become a significantly different thing than its predecessor.
But why? Why do we need an over-engineered, client-extensible preprocessor?
We don't. Spoiler alert: i don't, either! The world has lots of problems and the ones this project ostensibly solves aren't among them. It is done because it interests me to do, and for no other reason.
Potential TODOs
- Add the ability to persist the db? "The problem" with that is that the schema would then be "public", so it couldn't be modified without some hassle. This would allow us to build up a db of values before processing, e.g. via a configure script. What could we really do with it, anyway?
- Maybe `#/*` and `#*/` as comment blocks. `#if 0` works fine, though.
- Maybe a `#db` directive with operations like:
  - `#db open filename as dbName`
  - `#db trace dbName ?expanded? to filename` (would be especially helpful)
  - `#db trace dbName off`
  - `#db query dbName ...` (as for `#query`)
  - `#db close dbName`
Reminders to self...
Should the @ for token replacement be configurable?
Why would it need to be? Configuring it to a pair of single characters would be an easy change, but changing it to a pair of arbitrary-length strings would require more effort (and for what gain?).
- ^ C preprocessors, when running in comment-retention mode, tend to inject `#` characters all over the place and may do silly things like automatically include compiler-specific headers and emit the comments from those. e.g. using `gcc -E -CC` will include a gcc-internal header and emit a GPL license header in the output. To see it, try:
  ```
  $ echo 'extern int x;' > y.c; gcc -E -CC y.c
  ```
- ^ We do not use a default of `#` because some source files this tool was initially designed to handle have lines which start with that (JavaScript class private members). In that particular tree we use a delimiter of `//#`. Even so, the docs use `#` because it's easier on the eyes than the real default is.
- ^ Writing hashtables is one of those things which becomes tedious the fourth or fifth time around.