C-Minus Preprocessor

If you are reading this on github (a read-only mirror): most of the links in this doc, and various formatting, will not work on github because this page is written for the Fossil SCM repository hosted at this project's canonical home: https://fossil.wanderinghorse.net/r/c-pp

These are the docs for the "trunk" version of c-pp. See the "lite" branch for the lighter-weight fork referenced by the SQLite JS/WASM docs (which continues to be maintained for that purpose).

The C-minus Preprocessor (a.k.a. c-pp or cmpp) is a minimalistic C-preprocessor-like application. Why? Because C preprocessors can process non-C code but generally make quite a mess of it1. The purpose of this application is to provide a minimal preprocessor with only the most basic functionality of a C preprocessor (see below). It was conceived for use with JavaScript code but is generic enough to be used with essentially arbitrary UTF-8 text (including C code).

Like a C preprocessor, this tool reads input from text-based sources and conditionally filters out parts. Unlike CPP, c-pp does only the most basic of inline expansion of content, namely (and optionally) tokens in the form @TOKEN@.
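
For example, with @token@ expansion enabled via the -@ flag (described under #@policy below), a define can be substituted into plain content. This is a sketch (the key name is illustrative); it should emit "Hello, World!":

$ echo 'Hello, @name@!' | ./c-pp -Dname=World -@
Hello, World!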

[Diagram: Input → MAGIC! → Output]

Features of potential interest:

See c-pp --help for usage details of the application (as opposed to the library interface, which is in libcmpp.h), in particular the fact that it processes its arguments and flags in the order they're provided, which allows chaining of multiple input and output files in a single invocation.
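
For instance, because flags are processed strictly left-to-right, a -D define is only visible to the -e scripts (or input files) which appear after it. A sketch using flags shown elsewhere in these docs:

$ ./c-pp -Da=1 -e '##expr defined a' -Ua -e '##expr defined a'
1
0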

Design note: this tool makes use of SQLite. Though SQLite is not strictly needed in order to implement it, c-pp was created specifically for use with the sqlite3 project's own JavaScript code in order to facilitate creation of different builds, so there's no reason not to let sqlite3 do some of the heavy lifting (and it does much of that lifting). c-pp does not require any cutting-edge sqlite3 features and should be usable with any post-2020 version.

Formalities

Dependencies: a C99-capable C compiler, SQLite, and the target system's libc. The source tree includes a copy of SQLite, but any relatively recent version can be used.

Project home: https://fossil.wanderinghorse.net/r/c-pp

License: the SQLite Blessing

Author: Stephan Beal https://wanderinghorse.net/home/stephan/

Contributors are welcome - please get in touch via the link above or post to this project's forum.

Building It

Grab a copy of the source code from /download or by cloning the repository using fossil:

$ fossil clone https://fossil.wanderinghorse.net/r/c-pp

Then, from its top-most directory:

$ ./configure --prefix=$HOME
$ make
# optionally:
$ make test
$ make install

Markup

c-pp is, like CPP, a line-oriented preprocessor. It looks for lines which start with its current delimiter (see below) and processes them. Other lines are normally passed through unmodified, but enabling @token@ parsing will cause non-preprocessor lines to be filtered. Similarly, specific directives may treat the content of their own block differently than other content (e.g. #define heredocs).

The general syntax for a c-pp directive line is:

DELIMITER DIRECTIVE ...args

Where DELIMITER is the symbolic # described below and DIRECTIVE is one of the operations supported by the preprocessor.

The delimiter "#" used in these docs is symbolic only. The delimiter is configurable and defaults to ##2. Define CMPP_DEFAULT_DELIM to a string when compiling to set the default at build-time. The delimiter may also be modified via the --delimiter=... command-line flag. This documentation, for brevity and clarity, exclusively uses # unless it is specifically demonstrating a change of delimiter.

See #directives for examples and more syntax details.

Token Types

c-pp directive arguments must each follow one of the following forms:

Aside from the backslash-escaped newline case mentioned above, c-pp does not support backslash escaping of anything. That is: it treats all other backslashes, in all other contexts (unless explicitly noted otherwise), just as any other character. It does this primarily to give directives like #pipe flexibility in passing on arguments. It is, however, admittedly sometimes a problem and it may eventually need to be solved (i.e. changed to unescape certain sequences, perhaps opt-in on a case-by-case basis or via the addition of an as-yet-hypothetical #unescape function).

"Define" Keys (a.k.a. "Macros")

These docs frequently refer to "define keys". That is this project's term, for lack of a better one ("macro" doesn't quite fit here), for the names managed via #define, #undef, and the -D.../-U.../-F... CLI flags.

Define key naming rules are:

See #undefined-policy for how the library deals with references to undefined values.

Directives

c-pp directives both look and function a good deal like C preprocessor (CPP) directives do. They begin with a delimiter, followed by a directive, followed by any directive-dependent arguments.

A fundamental difference from a CPP is that c-pp's delimiter is configurable, rather than being hard-coded to #. These docs use # for brevity, but they always mean "the currently-configured delimiter" (which can be changed while processing inputs).

Another fundamental difference is that each c-pp instance starts off with no directives installed. When it finds a directive in an input stream it checks its internal list of candidates and registers it on demand. If it cannot find one, it falls back to the client-registered auto-loader and, if that isn't set or doesn't yield results, it will try to load the directive dynamically from a DLL. Client applications are free to register any directives they like in advance, and that's normally simpler than setting up an auto-loader fallback.

Example:

#if a
 That #if is a directive, but the one on this line is not
 because it has non-space content before it.
#/if

Spaces and tabs before and after the delimiter, and between arguments, are ignored, so the following #if is equivalent to the previous one:

  #  if  a
...
 #  /if

A directive may span lines by backslash-escaping each end-of-line character:

#if this is unusually \
 long                 \
 "so we'll wrap it"
...
#/if

No spaces may follow such a backslash. As an exception, the bodies of (...), {...}, and [...] may span lines without requiring backslash-escaped newlines:

##assert (
  1
) and defined x and \
(
  x=3
)

Backslashes are optional within the confines of each group.

So-called "block" directives, like #if, have both an opening and a closing line. The closing line is always in the form #/DIRECTIVE, e.g. #/if, #/query, or #/pipe. The closing tags ignore any arguments so that they can be decorated with informative comments by document maintainers:

#if defined foo
... 1000 imaginary lines of text ...
#/if defined foo

Non-block directives have one-time effects which take place when they are parsed. Their effects may change the behavior of further parsing.

The following subsections cover each directive in alphabetical order.

#arg

#arg is primarily intended for use as a function. It expands its argument and emits it. "The plan" is to add flags to this to perform meta-operations on arguments, e.g. fetching their type or raw value instead of their expanded value.

Usage:

#arg ?flags? one-argument

Flags:

It's currently difficult to envision a usage for this outside of testing this library.

#assert

This works like #expr except that (A) it emits no output and (B) it fails if its expression is false. #assert is essentially syntactical sugar for:

#if not foo
#error ...
#/if

Which can be shortened to:

#assert foo

#attach or #detach a Database File

This directive is a thin proxy for SQLite's ATTACH command, which "attaches" a database to the current db connection:

#attach "/path/to/my.db" as "foo"

On its own it's not of much use, but it's intended to be paired with #query.

It won't create a new db without a URL-style db name like file://foo.db?mode=rwc (assuming the linked-in SQLite has that feature enabled (most builds do)). We don't really want to create or administer arbitrary dbs from c-pp (there are much, much better ways to do that). It is, however, useful as a basic templating system, e.g.:

#attach "my.db" as "foo"
#query {select a, b, c from foo.t order by a}
a=@a@, b=@b@, c=@c@
#/query
#detach "foo"

#define: Set Preprocessor Symbols

This directive "defines" values, in the same sense that a C preprocessor does, the main difference being that defines in c-pp behave more like variables, in that they can be freely overwritten without first having to undefine them.

Usages:

#define foo
#// the same as
#define foo 1
#define foo "this is foo"
#define bar foo
#assert bar="this is foo"

Prior to 2025-09-27, the equal sign was "just another identifier letter", but it is now no longer permitted by this directive. In the context of expressions, = is a comparison operation.

If a define is given no value, it has an implicit value of 1.

Multiple defines can be set at once with:

#define {
  x -> 2
  y -> 3
}
#assert (x=2) and (y=3)

This form requires a value for each key - there is no default. Each key is interpreted literally and each value is interpreted in the usual ways:

#define {a -> "hi there" b -> a}

Will define both a and b to hi there.

Values in the form (...) are interpreted as integer expressions.

Design note: after some experimentation, the -> is required (A) because it's easier [for me] to read that way than {k v k2 v2...} is and (B) to avoid over-complicating the parsing by optionally allowing -> or =. My eyes find = to be less legible in that context.

To define a variable to the contents of a file, use the function call syntax:

#define x [include -raw the-file]

Potential TODOs:

#define "Heredocs"

#define can also assign a value from a content block using a heredoc-like syntax:

#define foo <<
content goes here
#/define

Notes and limitations:

#define accepts the following flags immediately before the <<:

#delimiter: Change the Directive Delimiter

This directive changes the directive delimiter. The delimiter is managed as a stack, the same way as #@policy. The stack always starts out with the library's compile-time-defined delimiter on top.

Usages:

A DELIM argument of default, predictably enough, uses the default delimiter (set when the library is compiled).

A final argument of << indicates that the new delimiter remains in place only until a following #/delimiter directive, noting that the closing directive has to be delimited by the newly-pushed delimiter. In this form, it is an error if EOF is encountered before the closing tag is found.

When used in a function call, (A) << is not permitted and (B) if given no arguments it will emit the current delimiter.

Example:

##delimiter push @@
@@delimiter push !! <<
!!expr 1
!!/delimiter
@@expr 2
@@delimiter pop
##expr 3

Results in the output 1\n2\n3.

PS: don't do that.

#error Breaks Things

Immediately stops processing with an error.

#error the rest of the line is an error message

As a special case, if the message both starts and ends with the same quote character (" or ') then those quotes are stripped from the result.

#expr Evaluates Things

This directive evaluates an expression (described in the next subsection) and emits its result (typically an integer).

#expr expression...

There are no known practical uses for this directive beyond in testing c-pp itself, but see #assert and #if for more practical uses of expressions.
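
A trivial demonstration (a sketch, assuming the default ## delimiter; the comparison operators are covered in the next subsection):

$ ./c-pp -Da=3 -e '##expr a=3'
1
$ ./c-pp -Da=3 -e '##expr a>5'
0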

Expression Rules

An expression, in this context, is a series of operators and operands which evaluate to either true or false. Expressions are used by several directives, most significantly #if.

The general syntax is:

[not] [defined] value [COMPARISON-OPERATOR value] [and ...] [or ...] [glob ...]

The following unary operators are supported:

The unary operators bind tightly to their RHS argument, but without consideration for whether it is the beginning of a longer expression. That is (not a=3) will parse as ((not a)=3). The workaround is to use a subexpression: not (a=3) or, even simpler, a!=3. (Patches to fix that, even if it means rewriting the beast of an expression engine, would be very thoughtfully considered!)

The following binary operators are supported:

All of the binary operators are evaluated strictly left-to-right, with equal precedence for each.
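
For example (a sketch): the first form below groups strictly left-to-right; use a subexpression to get the grouping a C programmer would expect:

#// evaluated as ((a=1 or b=1) and c=1):
#if a=1 or b=1 and c=1
...
#/if
#// a subexpression forces the conventional grouping:
#if a=1 or (b=1 and c=1)
...
#/if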

Sidebar: there is currently no short-circuiting of and and or because the evaluation and parsing are closely tied together, but none of their operands have visible side-effects so no harm is done in not short-circuiting (so there's no rush to change this).

Expression limits:

#if, #else, #elif, #/if

#if and friends cause blocks of the input to be emitted or elided depending on the result of an expression. The expression syntax differs from that of a C preprocessor but the end result is the same. This family of directives includes #elif, #else, and #/if.

#if's arguments must make up an expression. #/if ignores all of its arguments - it's commonly useful to add a note there saying which block is being ended.

Example:

#if foo=1
...
#elif foo < 2 or foo > 5
...
#elif bar or baz or not defined charlie
...
#else
...
#/if foo=1

(Any text after /if on that last line will be ignored, which is useful for annotating the line with the purpose of the block it's closing.)

#include External Files

This directive emits the contents of other files into the output:

#include ?-raw? filename...

The filename arguments may optionally be quoted, and must be if they contain any quote or space characters.

The -raw flag specifies that each file's contents are to be passed through to the current output channel with no interpretation, otherwise each file is filtered through the preprocessor as if it were part of the current file.

The filenames are searched for in the so-called "include path", which works just like a C/C++ include path. If no path is provided when invoking c-pp then it defaults to using a path of "." (only the current directory). If the -Idirname flag is provided then the default of "." is not applied. -I... can be used any number of times to specify search directories and they will be searched in the order provided.
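
For example, assuming an invocation like ./c-pp -I. -Isrc ... (directory names illustrative), a script might do:

#include "common-header.txt"
#// pass a data file through verbatim, without preprocessing it:
#include -raw data.json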

#join Arguments Together

The #join directive concatenates its arguments and emits the result to the output stream.

It accepts the following flags:

$ ./c-pp -Db=2 -e '##join 1  b    3'
1 2 3

$ ./c-pp -e '##join 1 2 3 [join -s X 4 5 6]'
1 2 3 [4X5X6X]

TODO: unescaping of the separator to allow newlines and tabs. This needs to be done at a different level of the API, though.

#module Loads Directives from DLLs

The #module directive can load new directives from DLLs. In safe mode it will neither be registered nor run.

#module "dllName" "directive-name"

It tries to open the given DLL, finds an entry point with the given name, which it assumes to be of type cmpp_loadable_module*, and invokes that module's callback. The intent is that such callbacks register new directives. The DLL name argument may include the platform's conventional DLL extension (".so" on most platforms), but that's optional - the search includes checking the name both as-provided and with the DLL extension added to it.

This support currently only works on Unix-esque platforms: those with either dlopen() or lt_dlopen(). Patches to add support for other platforms would be welcomed.

Registration of modules is handled via macros named CMPP_MODULE_... in libcmpp.h.

The directive name can be left off if the module in question is specifically built and registered as the sole module in that DLL (in which case it uses a pre-defined entry point name). Whether that's the case depends on how it is built and which module registration macro(s) it uses.

This directive performs no filename transformation beyond the path lookup and automatic DLL extension.

When this directive is invoked, if the module search path is empty and the CMPP_MODULE_PATH environment variable is set, it is added to the module path. If set, it is expected to be in the form of a colon- or semicolon-delimited list of directories (the former on Unix-like systems, the latter on Windows).

Example:

./c-pp -Ddll=libcmpp.so \
  -e '##assert not defined #dyno' \
  -e '##module dll dyno' \
  -e '##dyno hi there' \
  -e '##assert defined #dyno'
cmpp_dx_f_dyno() arg: cmpp_TT_Word hi
cmpp_dx_f_dyno() arg: cmpp_TT_Word there

When the DLL is built with a singleton module registration the entry point name is not required, as the singleton uses a well-defined name:

$ ./c-pp -e '##module "libcmpp.so"' -e '##dyno hi there'
cmpp_dx_f_dyno() arg: cmpp_TT_Word hi
cmpp_dx_f_dyno() arg: cmpp_TT_Word there

Example module: /file/src/d-dyno.c

Directives in Loadable Module DLLs

If built with DLL support and it's not running in safe mode then the library will, when encountering an unknown directive, search for a matching DLL. For purposes of this search, the DLL is expected to be named libcmpp-d-NAME.so. The module search path defaults to the $CMPP_MODULE_PATH environment variable, but it can also be set with the -L flag to c-pp or the C API's cmpp_module_dir_add().

If it finds a matching DLL, it opens it, and, if it finds a loadable module in it, that module's registration function is called. If that call registers the being-sought directive, the library continues processing. If not, then it fails with an "unknown directive" error.

The C API also offers an "autoloader" API which clients can install to load their own statically-linked directives on demand or to implement their own DLL search. That's independent of the library's automatic DLL search (which is, in terms of search priority, last on the list).
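
For example, assuming a hypothetical directive named foo packaged as libcmpp-d-foo.so and placed in ./mods, either of the following should trigger the on-demand load (a sketch):

$ ./c-pp -L./mods -e '##foo hello'
$ CMPP_MODULE_PATH=./mods ./c-pp -e '##foo hello'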

@policy Controls Expansion of @tokens@

#@policy ?push? POLICY-NAME
#@policy pop

By default c-pp does no expansion of content beyond the filtering of content blocks using #if. If the -@ flag is passed, or this policy is enabled in a script, then it performs a restricted type of expansion on content blocks: tokens in the form @TOKEN@ are processed as described below.

#@policy takes a policy name argument, defaulting to error, which describes how to deal with @tokens@ in the input:

The push option tells it to set the policy and remember the previous policy. The pop option restores that previous policy and will error out if there is no level to pop.
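
For example, to pass @tokens@ through untouched for just part of a script (a sketch using the retain policy shown in the demo below):

#@policy push retain
an undefined @token@ passes through as-is here instead of raising an error
#@policy pop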

Behavior and limitations:

The --no-@ CLI flag or #@policy off both disable expansion until either a subsequent -@ or @policy flag re-enables it.

A demonstration of the "@policy":

$ echo 'a@x@c' | ./c-pp --@policy=off
a@x@c
$ echo 'a@x@c' | ./c-pp --@policy=retain
a@x@c
$ echo 'a@x@c' | ./c-pp --@policy=elide
ac
$ echo 'a@x@c' | ./c-pp --@policy=error
a
./c-pp: @<stdin>:1: Undefined key: @x@

Predefined @tokens@

#pipe Filters Content through External Processes

This directive is not currently available on Windows builds (patches to improve that would be thoughtfully considered!).

#pipe runs an external command, optionally feeds it input from the script, and emits the output from that command:

#pipe -- /usr/bin/sed -e 's/this/that/'
this content
#/pipe

Will pipe this content\n into sed and get that content\n back.

Similarly:

#define cmd "echo"
#pipe -no-input -chomp-output -- cmd this is from echo

Will emit this is from echo and chomp the trailing newline from the output.

Arguments and flags:

The final argument must be the command and its arguments in one of two forms:

The external command gets piped, via its stdin, the contents of the directive's block unless -no-input is used. The command's stdout output is collected and emitted in its place. The output is not currently post-processed in any way except as per the -chomp-output flag, but should the need arise we can easily add optional at-token parsing to the output via a flag.

Stupid #pipe trick: run a C preprocessor through it:

##pipe -path -- 'cpp' -E
#include <stdio.h>
##/pipe

That requires using a directive delimiter other than # to avoid a conflict with cpp's #.

TODOs:

#pragma Is for Debugging

This directive is undocumented. It changes at the whim of the library's developer, primarily to support testing and debugging.

#query Renders Data from a Database

This directive runs SQL queries. c-pp internally uses only one (private) database, so #query isn't much use on its own except for in testing c-pp, but #attach can be used to attach arbitrary databases (and was added to support #query).

This directive has two forms:

First, it can run an SQL query, set scope-local defines for each result column, and filter its block's contents for @tokens@ using the current @token@ policy:

Your list of foo:
#query {select name AS name, price AS price from foo order by name}
@price@ @name@
#query:no-rows
This part is optional and is emitted if the query has no results.
#/query

That form requires a terminating #/query directive but the #query:no-rows sub-directive is optional (and may not appear more than once).

Secondly, it can define one or more symbols from the first row of an SQL query:

#query define {select a, b from c order by a}

This form does not use a terminating #/query directive.

For the first form, the body of the query block is @token@-expanded to the output stream one time for each result row. Before each row is expanded, defines are set matching the names of the result columns. The defines are set within the context of a local savepoint so that after the query is processed the defines are either unset or reverted to their previous values. If no rows are found, the (optional) #query:no-rows block is emitted. If that block is not set, no output is emitted for queries which have no result rows.

The query block may contain other directives, but any directives need to be completely enclosed inside the #query...#/query body, not interwoven.

The "define" form sets corresponding defines for the first row of the result set and does not use a savepoint. If no result rows are found it sets each define to an empty value. (Potential TODO: add a flag to error out in that case, or maybe provide default values.)

Sidebar: remember that the only guaranteed reliable way to get a result column's name is to set it oneself using SELECT x AS x (with the "AS" being optional).

Formatting of the results, if needed, can be done using SQLite's format() function. It is exceedingly unlikely that c-pp will ever be extended to include formatting-related features. (However, function calls bring that capability within easy reach.)

Potential TODOs:

Binding Query Parameters

Query parameters can be bound either by name or index, but not a mix of both, by adding a bind argument:

Sidebar: SQLite supports a prefix of @ in addition to : and $ but it's not supported here because of syntactic confusion with at-strings.

Bind values may be any of:

#savepoint: Scoped Defines

Savepoints are like nestable transactions. In c-pp they let us define/undefine values in a scoped manner. That is, a symbol defined in a savepoint will become undefined, or revert to its pre-savepoint value, if that savepoint is rolled back. It might be interesting to someday explore how savepoints might be used for content blocks as well, but the internals are not currently set up to do such a thing (we'd need to buffer all output to the db or memory, rather than sending it directly to the output channel).

#savepoint requires a single argument:

If a savepoint is neither committed nor rolled back by the end of its script file, it will automatically be rolled back. It is an error to try to end a savepoint when none is currently open.

$ cat foo
#@policy error
#define bar=2
#savepoint begin
#define foo
#if not foo
#  error expecting foo
#/if
foo is @foo@
bar is @bar@
rolling back...
#savepoint rollback
#if foo
#  error foo should not be set
#else
foo is gone
bar is @bar@
#/if
#if not bar=2
#  error expecting bar=2
#/if
begin again...
#savepoint begin
#define foo=again
#if not foo=again
#  error expecting foo=again
#/if
foo is @foo@
bar is @bar@
committing...
#savepoint commit
bar is @bar@
#if not foo=again
#  error expecting foo=again
#/if
bar is @bar@
foo is @foo@
the end

$ ./c-pp --delimiter '#' foo
foo is 1
bar is 2
rolling back...
foo is gone
bar is 2
begin again...
foo is again
bar is 2
committing...
bar is 2
bar is 2
foo is again
the end

Why was #savepoint added? An idle thought of "wouldn't it be interesting to automatically undefine these vars at the end of the file which defined them?" led to "oh, savepoints can do that". Then it was actually really easy to add.

#stderr

Emits the remainder of the line to stderr.

#stderr This goes to stderr along with file location info.

#undef

Undefines one or more defines:

#undef foo bar baz

#undefined-policy

Specifies how c-pp should react to references made to undefined keys:

#undefined-policy ?push? error|null
#undefined-policy pop

The policy values are:

push and pop work exactly as described for #@policy.
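
For example (a sketch; it assumes the null policy lets references to undefined keys resolve to an empty/false value instead of raising an error):

#undefined-policy push null
#// ...content which may reference undefined keys...
#undefined-policy pop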

#//: Comments

Infrequently useful, but...

#// This is a c-pp comment.

There must be a space after the // because that // is, despite appearances, parsed as a directive name.

Multi-line comments are not supported but #if can be used for the same effect:

#if defined nope
...
#/if

Add-on Directives

This section describes directives which are not part of the core library but which are in this source tree, available for copy/paste reuse. They may require third-party software. They may or may not also be pre-built into the library or CLI app.

The directives are listed in alphabetical order.

#c-code

This proof-of-concept directive filters its input into C code forms.

Source file: d-c-code.c

#c-code -mode byte-array \
  -getter get_mah_bytes {
this is content
}

Emits something like:

unsigned char const * get_mah_bytes_get(unsigned * pLen){
  static unsigned char const _a[] = {
    10,116,104,105,115,32,105,115,32,99,111,110,116,101,110,116
  };
  if(pLen) *pLen=sizeof(_a);
  return _a;
}

And:

#c-code -mode byte-array -hex -name mah_bytes
...content goes here...
#/c-code

Emits:

unsigned char const mah_bytes[] = {
    0x23,0x69,<big snip>...
    0x0a
};

The block content may contain other directives, which is especially useful here with #include -raw.

-mode cstr has it emit the content as a string literal.

#pikchr

This directive reads pikchr input and emits SVG-format image output.

Source file: d-pikchr.c

Usages:

#pikchr ...flags
... pikchr markup...
#/pikchr

Or:

#pikchr ...flags {
  ...pikchr markup...
}

Those differ in the following ways:

It emits an SVG-format image or an error message. In the case of a pikchr() error, this directive emits the full pikchr result to the output stream before setting the error state to something less verbose than pikchr()'s error dump.

Flags:

Regarding newlines: it's not specified whether pikchr output always includes a trailing newline. If -unchomp and -chomp are used together, results may be unpredictable.
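
A minimal example using the brace form (the pikchr markup itself is illustrative):

#pikchr {
  box "input" fit
  arrow
  box "c-pp" fit
  arrow
  box "output" fit
}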

"Function Calls"

As of 2025-11-11 c-pp supports a limited form of "function call" in the form [D ...args] where D is the name of a directive. This only works for directives which can function without a closing directive (even if they do so only conditionally, e.g. #query, in which case only the closing-directive-less forms are legal here).

Calls work by doing the following:

  1. Copy the input string, prepending the current delimiter to it. We have to copy it because $REASONS.
  2. Redirect the current output stream to a buffer.
  3. Process the buffer from #1 as an input document.
  4. Restore the output stream to its previous state.
  5. Any output from that document is now in the buffer from #2, which becomes the result of the call. A single trailing newline is trimmed from the result.

It's still being determined where this syntax should be legal, but here are some examples of where it currently is:

Some of the functions currently available: #arg, #join.
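
For example, the result of a call can be assigned to a define and then used like any other (a sketch; the file name is hypothetical):

#define version [include -raw VERSION.txt]
#assert defined version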

The Library API

This section demonstrates how to use the library API from client C code. It is not an exhaustive guide (that's what the API docs are for) but is enough to get started with the library.

The first step is getting a preprocessor instance:

#include "libcmpp.h"
...
cmpp * pp = 0;
int rc = cmpp_ctor(&pp, 0/*optional flags*/);
if( rc ){
  // error
  if( pp ){
    // cmpp_err_get() will get the error info.
    cmpp_free(pp);
  }
  return;
}
... use pp ...

(Initialization will only fail if an allocation fails or if optional custom initialization code fails. In the former case, pp will always be NULL. In the latter case, the pp's error state holds info about the failure.)

Next, we set up an output channel:

cmpp_outputer out = cmpp_outputer_FILE;
out.state = stdout;
cmpp_outputer_set(pp, &out, "<stdout>");

Any output destination which can be wrapped in the cmpp_output_f() interface is suitable. Implementations are provided for FILE*, file descriptors, and cmpp_buffer (basically a dynamic string buffer), and adding one's own, e.g. to send output directly to a UI widget, is normally trivial.

Then we feed it some input:

unsigned char const *input = ...a script full of input...;
int rc = cmpp_process_string(pp, "my-input.txt", input, -1);
if( 0==rc ) { ... success ... }

In essence it can take input from anywhere, but it requires that the input be completely available when parsing starts, so the lowest level of feeding it input is cmpp_process_string(), where each call equates to a new input source. cmpp_process_file() and cmpp_process_stream() are both thin proxies around cmpp_process_string().

On success, all of the output will show up in the provided output channel. On error, the output may have been partially generated and must not be trusted as being complete or usable. Most errors cannot be recovered from without cleaning up all state, and practice shows that in this context there's little or no reason to attempt it.

When we're done we need to clean up:

cmpp_dtor(pp);

For the most part, that's really all there is to it.
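
Putting those pieces together, a minimal complete client assembled from the snippets above might look like the following sketch (error reporting elided; the inline script content is illustrative):

#include <stdio.h>
#include "libcmpp.h"

int main(void){
  cmpp * pp = 0;
  int rc = cmpp_ctor(&pp, 0/*optional flags*/);
  if( rc ){
    if( pp ) cmpp_free(pp); /* cmpp_err_get() would hold the details */
    return 1;
  }
  /* Route all output to stdout: */
  cmpp_outputer out = cmpp_outputer_FILE;
  out.state = stdout;
  cmpp_outputer_set(pp, &out, "<stdout>");
  /* Feed it a small in-memory script (assuming the default ## delimiter): */
  unsigned char const * input = (unsigned char const *)
    "##define who \"world\"\n"
    "##expr defined who\n";
  rc = cmpp_process_string(pp, "<inline>", input, -1);
  cmpp_dtor(pp);
  return rc ? 1 : 0;
}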

The library can be extended with custom directives and several are demonstrated in d-demo.c and d-pikchr.c. Custom directives can perform essentially any jobs the builtin directives do, the notable exception being flow-control changes (like #if does). More properly, they can implement flow control but must provide the infrastructure needed for nesting such constructs and ensuring that they're closed properly. The internal infrastructure for doing so is probably not well-suited to general-purpose flow control, e.g. adding a hypothetical #while or #foreach loop. Similarly, the expression-evaluation API is not yet in the public API, and it's still being determined whether to make it so (because it's rather primitive).

Library Build Options

The library, for client-side use, is distributed in two files, libcmpp.[ch], which can be created with make libcmpp.c.

The following CPP defines influence how libcmpp.c is built. They have no effect on client code.

Background: Why Create c-pp?

In mid-2022 the SQLite project started work on its JS/WASM bindings. It was initially written for "vanilla" JS for the simple reason of personal preference of the guy writing the code, but it was clear we would eventually need to support ESM (ES6 modules) because that's what the modern-day JS ecosystem uses. Vanilla JS and ESM are 99.9% identical, but each has tiny context-specific syntactic differences. Most differences in JS can be resolved via runtime introspection, but syntactic differences make code outright illegal in one or the other of the modes.

We had several options for dealing with this:

A notable restriction: one rule of the SQLite project is that we cannot simply import random code into it, so any tooling was going to have to be hand-rolled by members of the project. Spoiler alert: only one team member needed this tool, so it was up to them to implement it (double-spoiler alert: 🙋‍♂️).

First we tried a C preprocessor, as that's precisely the type of thing we needed, but it didn't take more than 15 minutes to determine that it was unsuitable for the job. Summary: C preprocessors make a mess of non-C code by injecting it with C-isms like #line markers or, in the case of GCC, a GNU license header. If gcc's preprocessor could have been taught to emit only its filtered inputs, without irrelevant other content, the story would have ended there and much subsequent effort could have been spared.

The SQLite project has a strong culture of "keep it simple" and "don't be shy about writing your own tools", instilled the hard way over 2.5 decades, and that culture has seeped into me in my time there. My built-in tendency, however, is to over-engineer everything, even otherwise simple shell scripts, a fault at odds with The SQLite Way. Even so... we needed a preprocessor, or something like it.

For logistical reasons, the choices had to come down to Tcl, dependency-free C, or the core Unix tools like sed, awk, and sh. A large handful of Tcl scripts already generate the core of SQLite, some much like a very-specific-purpose preprocessor. At the time, my Tcl-fu was not strong enough for me to confidently pull off my envisioned tool in Tcl. Maintaining JS code using shell scripts was, and remains, simply unappealing. So C became the implementation route of choice.

Writing dependency-free C code can be somewhat tedious, as one invariably ends up re-inventing the same set of utility code, like a memory buffer class and a function to read in a file's whole contents at once (possibly into one of those buffers). In this case we'd also need a hashtable early on and, sigh, it would have to be written3.

It turns out, though, that we could use sqlite3.h and still be effectively dependency-free because this tool would be embedded in SQLite's own tree. How convenient! Long story short: being able to use an in-memory db as a hashtable was a huge time-saver and had further downstream benefits.

So work began on the preprocessor with the self-imposed restriction that it do only what we need, and not (contrary to my core nature!) be designed as a generic, client-agnostic, tool. That meant, for example, that it would use only global state, read only from a single file handle, and write only to one file handle. (Whereas my natural tendency would be to abstract the I/O channel into a client-extensible interface, taking up more code, more time, and adding a feature we ultimately wouldn't use. Sigh.)

And thus c-pp was born.

c-pp has proven invaluable for its initial role. SQLite has, as of late 2025, some 8 or 10 different JS builds, all from the same core source files, and that would have been nigh impossible for our tiny team to reliably manage without some sort of source-filtering tool.

It turns out that there's a third build mode we didn't know about at the time: "bundler-friendly builds". "Bundlers" are source code analysis tools which look through the multitudes of dependencies used by modern JS dev approaches and "bundle" them into sets which contain only the reachable parts of that code. One of their limitations is that they cannot resolve dynamically-generated string references to external filenames, which means that they cannot resolve dependent file names which are computed in code. They have to be fed such file names as string literals instead. Sigh. Bundler builds differ from ESM only in their requirement for hard-coded string literals for the parts of SQLite which have to load external scripts (like its OPFS VFS or its "worker1" API). We cannot unilaterally use hard-coded strings because (A) that's icky and (B) we don't know the full paths to some files at compile-time. Bundler builds work around (B) by hard-coding a name which will only work in limited contexts.

At some point my natural urge to over-engineer got the best of me and c-pp was refactored from a single-purpose monolithic app into a client-agnostic library, quickly more than tripling in code and docs. It would be difficult to justify adding that sort of complexity and code bloat to the SQLite tree, given that that tree needs exactly none of it, so the original/"lite" version is maintained over in the lite branch, tweaked only insofar as necessary for SQLite-side JS maintenance.

The trunk branch, contrariwise, is where my over-engineering gets to run rampant, without risk to the SQLite JS builds. Some remnants of c-pp's original monolithic-app shape are still visible in its interface and code, but the trunk version has become a significantly different thing than its predecessor.

But why? Why do we need an over-engineered, client-extensible preprocessor?

We don't. Spoiler alert: i don't, either! The world has lots of problems and the ones this project ostensibly solves aren't among them. It is done because it interests me to do, and for no other reason.

Potential TODOs

Reminders to self...

Should the @ for token replacement be configurable?

Why would it need to be? Configuring it to a pair of single characters would be an easy change, but changing it to a pair of arbitrary-length strings would require more effort (and for what gain?).


  1. ^ C preprocessors, when running in comment-retention mode, tend to inject # characters all over the place and may do silly things like automatically include compiler-specific headers and emit the comments from those. e.g. using gcc -E -CC will include a gcc-internal header and emit a GPL license header in the output. e.g. try:
    $ echo 'extern int x;' > y.c; gcc -E -CC y.c
  2. ^ We do not use a default of # because some source files this tool was initially designed to handle have lines which start with that (JavaScript class private members). In that particular tree we use a delimiter of //#. Even so, the docs use # because it's easier on the eyes than the real default is.
  3. ^ Writing hashtables is one of those things which becomes tedious the fourth or fifth time around.