string
See also: String syntaxes
Strings
Strings are probably the single most-used data type in many scripts, both in terms of how many the user types and how many the interpreter implicitly processes (non-keyword identifiers are internally strings). In s2, strings may contain any UTF-8 input and are immutable. (The Buffer class offers what amounts to a mutable string.) Strings have a size limit, as described in the Length Limits section.
See String syntaxes for the various ways to construct strings.
String Methods
While strings are not objects and may not have custom properties, they do have a prototype.
The shared methods may be extended by modifying this type's prototype. For example:
"".prototype.firstChar = proc(asInteger=false){
return this.charAt(0, asInteger);
};
assert "a" === "abc".firstChar();
The prototype includes, by default, the following methods, listed in alphabetical order:
string applyFormat(...)
Treats this string as if it is formatted using the Buffer.appendf() rules and treats any arguments as values to apply those rules to, i.e. it's functionally equivalent to aBuffer.reset().appendf(thisString,...).takeString(). Returns a new, formatted string.
integer byteAt(integer byteOffset)
Similar to charAt() but returns the integer value of the byte at the given byte offset (not character offset), or undefined if out of range. If (str.byteAt(X)!==str.charAt(X,true)) then the returned byte is part of a multibyte character or X is out of range for charAt() (remember that the byte length of a string will never be less, and may be more, than the character length (assuming valid UTF-8)).
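A short sketch of the byte-vs-character distinction. It assumes the script source is saved as UTF-8 and that "é" is the precomposed code point U+00E9 (two bytes in UTF-8: 0xC3 0xA9). The expected values are derived from the descriptions on this page, not from a captured session:

```s2
const s = "aé";
assert 2 === s.length();      // characters
assert 3 === s.lengthBytes(); // bytes
assert 97 === s.byteAt(0);    // 'a'
assert 195 === s.byteAt(1);   // 0xC3, first byte of "é"
assert 169 === s.byteAt(2);   // 0xA9, second byte of "é"
assert undefined === s.byteAt(3); // out of range
// Byte 1 is only part of a character, so it differs from the code point:
assert s.byteAt(1) !== s.charAt(1, true);
```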
mixed charAt(integer charIndex [,asInteger=false])
Returns the character at the given character position (not byte position), or undefined if out of range. If the boolean parameter is false (the default) then the result is a length-1 string, else it is the integer value (the Unicode code point). Note that this is an O(N) operation, not O(1), due to the UTF-8 character-counting required.
Sidebar: As of 20171115, strings support indexed access just like arrays, being functionally equivalent to charAt() but being more efficient because it happens in the operator layer and doesn't require a function call, e.g. ('abc'[2] === 'c') and ('abc'[1][0][0][0] === 'b'). Now that strings support indexed access, it might make sense to change the 2nd parameter's default to true.
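For illustration, a minimal sketch of charAt() and indexed access on a string with a non-ASCII character, again assuming UTF-8 source and the precomposed "é" (U+00E9, decimal 233):

```s2
const s = "héllo";
assert "é" === s.charAt(1);        // length-1 string by default
assert 233 === s.charAt(1, true);  // code point when asInteger is true
assert undefined === s.charAt(5);  // out of range
assert "é" === s[1];               // indexed access counts characters, too
assert s.charAt(4) === s[4];
```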
integer compare(Value [,Value])
Compares this value to another one using "memcmp semantics". If passed two values it compares those instead.
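A sketch of what "memcmp semantics" is assumed to mean here: a negative result if the left-hand value sorts first, 0 if the two are equivalent, and a positive result otherwise. Only the sign is checked, since the exact non-zero values are not documented:

```s2
assert 0 === "abc".compare("abc");
assert "abc".compare("abd") < 0;
assert "abd".compare("abc") > 0;
// With two arguments, those are compared instead of this string:
assert "".compare("b", "a") > 0;
```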
string concat(Value...)
Returns a new string comprised of this string plus the stringified form of all arguments. When concatenating more than one string, this is much more efficient than chaining more than one + operation because it avoids the extra temporary strings which that operator necessarily creates.
mixed evalContents(… various …)
Works identically to Buffer.evalContents().
integer indexOf(string [,offset=0])
Returns the index of the first instance of the given string in this string. If the 2nd parameter is provided it is the starting character (not byte) offset to start the search at. A negative offset means to start searching that many characters from the end of the string, but it does not change the order of the search (it does not search backwards!). Returns an unspecified negative value if no match is found, if the given string is empty, or its length is greater than this string's.
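The offset rules can be sketched as follows. Because the not-found result is an unspecified negative value, the checks test for "less than 0" rather than a specific number:

```s2
const s = "abcabc";
assert 0 === s.indexOf("abc");
assert 3 === s.indexOf("abc", 1);  // search starts at character 1
assert 5 === s.indexOf("c", -1);   // starts 1 character from the end, still searching forwards
assert s.indexOf("x") < 0;         // no match
assert s.indexOf("") < 0;          // empty needle
assert s.indexOf("abcabcabc") < 0; // needle longer than this string
```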
bool isAscii()
Returns true if this string contains only ASCII bytes (those with values in the range 0 to 127, inclusive). While that really makes no difference in scripts, some of the C-native string algorithms can run much more quickly if they know that they don't have to parse non-ASCII UTF-8 characters.
integer length()
Alias for lengthUtf8().
integer lengthBytes()
Returns the length of the string in bytes.
integer lengthUtf8()
Returns the length of the string in UTF-8 characters.
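A quick comparison of the length-related methods and isAscii(), assuming UTF-8 source and the precomposed "ä" (U+00E4, two bytes in UTF-8):

```s2
const ascii = "abc", mixed = "äbc";
assert 3 === ascii.length();
assert 3 === ascii.lengthBytes();
assert true === ascii.isAscii();
assert 3 === mixed.length();
assert mixed.length() === mixed.lengthUtf8(); // length() is an alias
assert 4 === mixed.lengthBytes();             // "ä" takes two bytes
assert false === mixed.isAscii();
```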
string operator+()
Overloads the binary + operator when a string is on the left-hand side of an addition operation. Returns a new string compounded of this string plus the stringified form of its right-hand argument, e.g. "3"+7+"!" === "37!". Note, however, that calling concat() is more efficient than chaining more than one + operator because the + operator (when chained) evaluates to several temporary strings along the way to its result, whereas concat() does not (it operates on a memory buffer, ideally with a single allocation).
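The two approaches side by side, as a small sketch. Both produce the same result; they differ only in how many intermediate strings get created:

```s2
// Chained +: creates the temporary "37" before the final "37!".
assert "37!" === "3" + 7 + "!";
// concat(): stringifies and appends all arguments in one call.
assert "37!" === "3".concat(7, "!");
assert "37!" === "".concat(3, 7, "!");
```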
string replace(needle, replacement [, limit = 0])
Returns a new string, a copy of this string with instances of needle replaced by replacement. If a limit is supplied, only the first limit instances are replaced, else all instances are replaced. If no changes are made, the original string is returned.
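A minimal sketch of replace() with and without a limit, based on the description above:

```s2
const s = "a.b.c";
assert "a-b-c" === s.replace(".", "-");    // no limit: all instances
assert "a-b.c" === s.replace(".", "-", 1); // only the first instance
assert s === s.replace("x", "y");          // no match: same content as the original
```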
array split(string separator [,limit = 0])
Splits this string on the given separator string. Returns an array of entries, containing the whole input string as a single entry if no separators are found. Neighboring separators, or separators at the beginning or ending, act as if they had an empty entry on their side(s), and empty entries are populated with length-0 strings[1]. If the second parameter is specified (and is greater than 0) then it stops splitting after tokenizing that many elements, e.g. splitting "a:b:c" with a delimiter of ":" and limit of 2 will result in ["a","b"]. (Side-note: those semantics changed to match JavaScript's behaviour on 20160213[2].)
Special case: If the separator is an empty string, the input string is split into individual characters (up to the given limit, if any).
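The splitting rules can be sketched like this. The results are checked element by element rather than by comparing whole arrays, and the sketch assumes arrays offer a length() method:

```s2
var a = "a:b:c".split(":");
assert 3 === a.length();
assert "c" === a[2];

a = "a:b:c".split(":", 2); // the limit stops tokenizing after 2 elements
assert 2 === a.length();
assert "b" === a[1];

a = ":a::b".split(":"); // leading and neighboring separators: "", "a", "", "b"
assert 4 === a.length();
assert "" === a[0];
assert "" === a[2];

a = "abc".split(""); // empty separator: individual characters
assert 3 === a.length();
assert "b" === a[1];
```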
string substr([offset=0 [,length=-1]])
Returns a copy of a substring of the current string, starting at the given character (not byte) offset (negative values count from the end of the string). If the length is negative (the default) then the range from the offset to the end is returned. If the offset is larger than the string's length an empty string is returned. If a negative offset has an absolute value greater than the string's length then it is treated as 0.
Achtung: prior to 20171115, a length of 0 meant to copy until the end of the string. That led to some quirky corner cases, so the semantics were changed.
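A few cases illustrating the offset and length rules described above:

```s2
const s = "abcdef";
assert "cdef" === s.substr(2);      // negative (default) length: through the end
assert "cd" === s.substr(2, 2);
assert "ef" === s.substr(-2);       // negative offset counts from the end
assert "" === s.substr(10);         // offset beyond the end
assert "abcdef" === s.substr(-10);  // oversized negative offset is treated as 0
```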
string toJSONString()
Returns a JSON-escaped form of this string, including surrounding double quotes.
string toLower()
Returns the lower-cased form of this string. Supports all of the one-to-one case conversions specified by Unicode, and none of the n-to-one/one-to-n special cases. If there is no valid conversion for a character, it is kept as-is.
string toUpper()
Returns the upper-cased form of the string. See toLower() for Unicode details.
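A brief sketch of the case conversions, assuming UTF-8 source and that "Ä"/"ä" (U+00C4/U+00E4) are among the supported one-to-one conversions:

```s2
assert "abc" === "ABC".toLower();
assert "ABC" === "abc".toUpper();
// "1" has no case conversion, so it is kept as-is:
assert "äb1" === "ÄB1".toLower();
```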
string trim()
Returns a copy of this string trimmed of leading and trailing whitespace. Note that only ASCII whitespace is considered, not exotic UTF-8 spaces.
string trimLeft/trimRight()
Returns a copy of this string trimmed of leading or trailing whitespace, respectively.
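The three trimming variants, sketched on a string with ASCII whitespace on both ends:

```s2
const s = "  abc\t\n";
assert "abc" === s.trim();
assert "abc\t\n" === s.trimLeft(); // only leading whitespace removed
assert "  abc" === s.trimRight();  // only trailing whitespace removed
```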
string unescape()
Returns a copy of this string with conventional C-style backslash-escape sequences unescaped. Also unescapes \uXXXX and \UXXXXXXXX Unicode sequences. Generally only of use with heredocs or strings read from files or user input, since quoted strings in scripts do this automatically (using the same algorithm) when evaluated. Unknown backslash sequences are left intact (as this simplifies(?) escaping data for certain types of script-bound C APIs (might want to rethink that)).
TODO: list all the sequences it supports.
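A small sketch of unescape(). To get literal backslashes into a quoted string, they are doubled in the script source, which stands in for data read from a file or from user input. It assumes \t and \u00e9 are among the supported sequences:

```s2
const raw = "a\\tb\\u00e9"; // 10 characters: a \ t b \ u 0 0 e 9
assert 10 === raw.length();
const cooked = raw.unescape();
assert 4 === cooked.length(); // a, TAB, b, é
assert "a\tbé" === cooked;
```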
Strings as Numbers
The API has limited built-in support for converting decimal-format strings into numbers. Any math operation with a number on the LHS and a string on the RHS will automatically try to convert the string to a number, resulting in a value of 0 if the string is not a valid number. Unary plus and minus can also coerce a string into a number. Examples:
assert "3.00.1" === "3.0"+0.1; // string on the left
assert 3.7 === 0.7 + "3"; // string on the right
assert 3.74 === 0.7 + "3" + "0.04";
assert 3.1 === 3.1 + "abc"; // "abc"==0
assert 5 === +"5";
assert -5 === -"5";
assert -5 === -"+5"; // + sign inside the string is supported
assert -5 === +"-5"; // as is a - sign.
assert 1.2 === -"-1.2"; // but beware of floating-point precision changes on such conversions!
Alternately, the parseInt(), parseDouble(), and parseNumber() members of the numeric prototypes provide more complete numeric conversions (including hex and octal notations) and report invalid conversions by returning undefined:
const pn = 0.parseNumber, pInt = 0.parseInt;
assert 1.0 === pn('1.0');
assert 1 === pInt(1.0);
assert 1.0 === pn(1.0);
assert -1 === pn('-1');
assert 'double' === typeinfo(name pn("1.3")); // parseNumber() keeps the numeric type
assert 'integer' === typeinfo(name pn("1"));
assert 'integer' === typeinfo(name pInt("1.3")); // parseInt() truncates to an integer
assert undefined === pn("not a number");
assert undefined === pn("1-1");
assert undefined === pInt("1.blah");
If you are not concerned about whether you get an integer or double result, use parseNumber(), as it may return either one, whereas parseInt() and parseDouble() always coerce their result (if not undefined) to integer and double, respectively.
Length Limits
The maximum byte length of any given string is 2^(CWAL_SIZE_T_BITS-3), e.g. 29 bits (512MB) in a 32-bit build[3], and function bodies count as strings for this purpose (so no billion-byte Functions, okay?). The remaining bits are used for internal state flags. (Design note: it was either that or add a flag field to strings, which would have increased their size by 4-8 bytes once the compiler padded the structure.) In a 16-bit build the maximum string length is only 8kb, but for most of s2's envisioned uses even that "should be" sufficient. On 64-bit, that limit is functionally unreachable. While Buffers have a limit of just under 2^CWAL_SIZE_T_BITS bytes, such large buffers cannot be converted to strings. The cwal_build_info() C function (and its s2sh counterpart, s2.cwalBuildInfo()) can be used to find out the maximum length of a string at runtime.
Note that the Buffer class does not have this limitation: it may use (essentially) the whole bit range (64kb in a 16-bit build, 4GB in a 32-bit build, and gazillions of bytes in 64-bit).
Footnotes
- ^ These semantics changed on 2019-12-08 to match JavaScript and reduce insanity in client-side code. Prior to that, empty slots used to be filled with the undefined value instead of empty strings.
- ^ 20191110 LOL. i was just expecting the old semantics in some client code and came to look at the docs, expecting to see the old semantics described. Thought i had discovered a bug.
- ^ This is set independently of the architecture's bitness. A 64-bit machine can build a 16-bit libcwal and a 32-bit machine can build a 64-bit-capable libcwal, provided the underlying platform is capable of it.