Technical note: This post uses Unicode characters, especially at the end. If your browser or operating system does not have full Unicode coverage you may see boxes or rectangles in place of the correct glyphs.
ECMAScript, the standardized version of the language JavaScript, defines string values as sequences of UTF-16 code units, not as sequences of characters.
This language misfeature complicates Unicode handling considerably.
For characters in the Basic Multilingual Plane (BMP) a single UTF-16 code unit (one 16-bit word) suffices.
For characters outside this range, two code units are necessary.
As an example, the Latin letter A is both one character and one code unit: "A".length === 1, but the Unicode character U+1D400 MATHEMATICAL BOLD CAPITAL A is one character but two code units: "𝐀".length === 2.
A better language would hide this ugly implementation detail from users, and string attributes such as length would be in terms of characters, not code units.
Unfortunately, for historical reasons, ECMAScript forces programmers who want proper Unicode support to deal with raw UTF-16 directly.
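The difference between code units and characters can be checked directly. The following is a small sketch, assuming an ES2015 or later engine (string iteration by code point and `codePointAt` do not exist in older ones). U+1D400 is written as its two escaped code units so the example does not depend on font coverage.

```javascript
// U+1D400 MATHEMATICAL BOLD CAPITAL A, written as its two UTF-16 code units.
const boldA = "\uD835\uDC00";

// length counts code units, not characters.
const codeUnits = boldA.length;

// Iterating a string (here via Array.from) walks code points instead,
// so a well-formed surrogate pair counts once.
const codePoints = Array.from(boldA).length;

// codePointAt(0) reassembles the pair into the original code point.
const firstCodePoint = boldA.codePointAt(0).toString(16).toUpperCase();

console.log(codeUnits, codePoints, firstCodePoint); // 2 1 1D400
```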
One of the features broken by this kludge is regular expression character classes.
In a character class, you can easily use ranges of characters, so long as those ranges fall within the BMP, e.g. [a-z] will match any lowercase letter of the basic Latin alphabet.
If you want to match character ranges outside the BMP, things are a little more complicated.
To match, say, the range of tetragrams, U+1D306 through U+1D356, we might naïvely use them directly: [𝌆-𝍖].
To understand why this doesn't work, we must consider the UTF-16 representation of the characters "𝌆" and "𝍖".
In UTF-16, characters outside of the BMP are represented by a surrogate pair composed of two code units: a high surrogate in the range 0xD800–0xDBFF followed by a low surrogate in the range 0xDC00–0xDFFF. Each of these 16-bit surrogates contains six bits that identify the code unit as part of a surrogate pair, followed by 10 bits of the represented code point less 0x10000. A more complete introduction can be found in the Wikipedia UTF-16 article.
The character U+1D306, "𝌆", is represented by the surrogate pair D834 DF06. Likewise U+1D356, "𝍖", is represented by D834 DF56.
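The arithmetic behind those pairs can be sketched in a few lines. This is only an illustration of the encoding rule described above; `toSurrogatePair` and `hex4` are hypothetical helper names, not part of any library.

```javascript
// Split a code point outside the BMP into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper name used for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not outside the BMP");
  }
  const offset = codePoint - 0x10000;     // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

// Format a code unit as four uppercase hex digits.
function hex4(unit) {
  return unit.toString(16).toUpperCase().padStart(4, "0");
}

console.log(toSurrogatePair(0x1D306).map(hex4).join(" ")); // D834 DF06
console.log(toSurrogatePair(0x1D356).map(hex4).join(" ")); // D834 DF56
```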
The string "[𝌆-𝍖]", then, contains five characters but is represented in ECMAScript by seven code units:
Character | Code Units |
---|---|
[ | 005B |
𝌆 | D834 DF06 |
- | 002D |
𝍖 | D834 DF56 |
] | 005D |
The interpretation of ECMAScript regular expression character classes is according to code units, not characters.
Despite the fact that "[𝌆-𝍖]" contains 5 characters, since "[𝌆-𝍖]".length === 7, the meaning when used as a character class is surprising.
"[𝌆-𝍖]" is equivalent to [\uD834\uDF06-\uD834\uDF56] and means "match either D834, or something in DF06–D834, or DF56," just as if we had written "[am-qz]" to match an "a", an "m"–"q", or a "z".
Obviously this is not what was intended.
In character classes, then, we cannot use characters outside the BMP. Even a single character outside the BMP, if it appears in a character class, will have an undesired interpretation: the character class will match either of the two surrogate code units (but not both), which is clearly not the intention.
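Both failure modes are easy to observe. The sketch below assumes a pattern compiled without the ES2015 `u` flag, so the class is read code unit by code unit. Note that the middle range runs from DF06 down to D834, which is out of order, so in practice the constructor rejects the ranged class outright rather than matching something unexpected.

```javascript
// U+1D306 and U+1D356 as explicit surrogate pairs.
const first = "\uD834\uDF06";
const last = "\uD834\uDF56";

// The class is parsed as D834, DF06-D834, DF56. The middle range is out
// of order, so the RegExp constructor throws.
let rangeError = null;
try {
  new RegExp("[" + first + "-" + last + "]");
} catch (e) {
  rangeError = e;
}
console.log(rangeError instanceof SyntaxError); // true

// A single astral character in a class compiles, but the class then
// matches one surrogate at a time rather than the whole character.
const single = new RegExp("[" + first + "]");
console.log(single.test("\uD834"));         // true: a lone high surrogate matches
console.log(first.match(single)[0].length); // 1: one code unit, not two
```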
Fortunately there is a way to match what we want, though not by using a character class. In matching the range 𝌆-𝍖, we want to match two consecutive code units. The first will be D834, and the second will be DF06, DF56, or anything between. So, using escape sequences to represent the code units directly, we can write:
\uD834[\uDF06-\uDF56]
The first escape will match the high surrogate, and the second range will match any of the low surrogates which may follow it to complete a character in the desired range.
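As a quick check, the pattern can be tried against the two ends of the range and their immediate neighbours. The `^` and `$` anchors are added here only so that each test must consume the whole string.

```javascript
// The pattern from the text, anchored so it must consume the whole string.
const tetragram = /^\uD834[\uDF06-\uDF56]$/;

console.log(tetragram.test("\uD834\uDF06")); // true:  U+1D306, first in range
console.log(tetragram.test("\uD834\uDF56")); // true:  U+1D356, last in range
console.log(tetragram.test("\uD834\uDF05")); // false: U+1D305, just below
console.log(tetragram.test("\uD834\uDF57")); // false: U+1D357, just above
console.log(tetragram.test("A"));            // false: a BMP character
```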
Now consider a longer range, between U+1D306 "𝌆" and U+1F004 MAHJONG TILE RED DRAGON "🀄". The lowest surrogate pair to match, as before, is D834 DF06. The highest pair is D83C DC04. Additionally, anything "between" those two pairs should match. What does "between" mean here? In contrast to the previous example, not only the low surrogate but also the high surrogate now varies over the range we want to match. Any code point between these two will be represented as either (a) a D834 high surrogate followed by a low surrogate between DF06 and the top of the low surrogate range, DFFF, or (b) any high surrogate between D835 and D83B followed by any low surrogate whatsoever, or (c) a D83C high surrogate followed by a low surrogate between DC00 (the bottom of the low surrogate range) and DC04. This gives the following regular expression:
\uD834[\uDF06-\uDFFF]|[\uD835-\uD83B][\uDC00-\uDFFF]|\uD83C[\uDC00-\uDC04]
Three alternatives are necessary, each of which, if it matches, consumes two consecutive code units.
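Since the pattern is easy to get wrong by one, it is worth checking by brute force. The sketch below, which assumes an engine with `String.fromCodePoint` (ES2015), wraps the three alternatives in an anchored group and compares the result with the intended range for every code point outside the BMP.

```javascript
// The three alternatives from the text, grouped and anchored so that each
// test string must be matched in full.
const longRange =
  /^(?:\uD834[\uDF06-\uDFFF]|[\uD835-\uD83B][\uDC00-\uDFFF]|\uD83C[\uDC00-\uDC04])$/;

// Compare the pattern against the intended range for every code point
// outside the BMP, collecting any disagreements.
const mismatches = [];
for (let cp = 0x10000; cp <= 0x10FFFF; cp++) {
  const expected = cp >= 0x1D306 && cp <= 0x1F004;
  const actual = longRange.test(String.fromCodePoint(cp));
  if (expected !== actual) mismatches.push(cp);
}

console.log(mismatches.length); // 0
```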
Writing regular expressions like those above by hand is tedious and error-prone. Instead we can write a program to generate them.
First we need an efficient representation of sets of code points. For this we provide a set datatype which represents a set of code points (i.e. integers) in the Unicode range (0 - 0x10FFFF). Sets may be constructed and manipulated with the following functions:
universe
nil
These two are actually constants, not constructors.
fromCharRange(from,to)
Where from and to are any Unicode characters.
fromChar(char)
fromString(string)
Returns a cset containing every unique character in string, which may include Unicode characters outside the BMP.
Sets can be constructed from strings, individual characters, or character ranges, among other ways. (There are also Unicode properties and categories, but that's for another post.)
complement(cset)
difference(a,b)
union(a,b)
intersection(a,b)
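To give a feel for how such operations can work, here is a minimal sketch that stores a set as a sorted list of code point ranges. The representation is an assumption made for illustration and is not taken from the CSET source. The function names mirror the list above only for readability, and `fromRange` and `normalize` are hypothetical helpers.

```javascript
// A set is a sorted array of non-overlapping, non-adjacent [from, to]
// ranges of code points. This representation is assumed for the sketch.
const MAX = 0x10FFFF;
const nil = [];
const universe = [[0, MAX]];

function fromRange(from, to) {
  return [[from, to]];
}

// Sort ranges and merge any that overlap or touch.
function normalize(ranges) {
  const sorted = ranges.map(r => r.slice()).sort((a, b) => a[0] - b[0]);
  const out = [];
  for (const [from, to] of sorted) {
    const prev = out[out.length - 1];
    if (prev && from <= prev[1] + 1) {
      prev[1] = Math.max(prev[1], to);
    } else {
      out.push([from, to]);
    }
  }
  return out;
}

function union(a, b) {
  return normalize(a.concat(b));
}

// Everything in the universe that falls in the gaps between ranges.
function complement(set) {
  const out = [];
  let next = 0;
  for (const [from, to] of set) {
    if (from > next) out.push([next, from - 1]);
    next = to + 1;
  }
  if (next <= MAX) out.push([next, MAX]);
  return out;
}

// De Morgan: the intersection is the complement of the union of complements.
function intersection(a, b) {
  return complement(union(complement(a), complement(b)));
}

function difference(a, b) {
  return intersection(a, complement(b));
}

// Lowercase a-z with "a" and "e" removed.
const vowels = union(fromRange(0x61, 0x61), fromRange(0x65, 0x65));
const rest = difference(fromRange(0x61, 0x7A), vowels);
console.log(JSON.stringify(rest)); // [[98,100],[102,122]]
```

Keeping ranges rather than individual code points means a set like "everything outside the BMP" costs one array entry instead of about a million.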
Once a character set is constructed, we can output an ECMAScript regular expression which will match any character from that set using toRegex().
toRegex(cset)
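The core of such a conversion, for a single range that lies entirely outside the BMP, can be sketched as follows. This is not the CSET implementation: `astralRangeToRegex`, `esc`, and `unitClass` are hypothetical names, and a real converter would also have to handle BMP characters and sets made of many ranges.

```javascript
// Escape a code unit as \uXXXX.
function esc(unit) {
  return "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
}

// A class for a run of code units, collapsed when it holds a single value.
function unitClass(from, to) {
  return from === to ? esc(from) : "[" + esc(from) + "-" + esc(to) + "]";
}

// Build a pattern source matching any code point in [from, to], where both
// ends lie outside the BMP. astralRangeToRegex is a hypothetical name.
function astralRangeToRegex(from, to) {
  if (from < 0x10000 || to > 0x10FFFF || from > to) {
    throw new RangeError("expected 0x10000 <= from <= to <= 0x10FFFF");
  }
  const hiFrom = 0xD800 + ((from - 0x10000) >> 10);
  const loFrom = 0xDC00 + ((from - 0x10000) & 0x3FF);
  const hiTo = 0xD800 + ((to - 0x10000) >> 10);
  const loTo = 0xDC00 + ((to - 0x10000) & 0x3FF);

  // Same high surrogate: only the low surrogate varies.
  if (hiFrom === hiTo) {
    return esc(hiFrom) + unitClass(loFrom, loTo);
  }

  const parts = [];
  let firstFull = hiFrom;
  let lastFull = hiTo;
  // Partial block at the start: one high surrogate, upper part of the lows.
  if (loFrom !== 0xDC00) {
    parts.push(esc(hiFrom) + unitClass(loFrom, 0xDFFF));
    firstFull = hiFrom + 1;
  }
  // Partial block at the end: one high surrogate, lower part of the lows.
  let tail = null;
  if (loTo !== 0xDFFF) {
    tail = esc(hiTo) + unitClass(0xDC00, loTo);
    lastFull = hiTo - 1;
  }
  // Whole blocks in between: any low surrogate may follow.
  if (firstFull <= lastFull) {
    parts.push(unitClass(firstFull, lastFull) + "[\\uDC00-\\uDFFF]");
  }
  if (tail !== null) parts.push(tail);
  return parts.join("|");
}

console.log(astralRangeToRegex(0x1D306, 0x1D356));
console.log(astralRangeToRegex(0x1D306, 0x1F004));
```

For these two inputs the sketch reproduces the hand-written expressions shown earlier. When the result contains alternatives and is embedded in a larger pattern, it needs to be wrapped in a group such as `(?:...)`.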
All of these are live, so you can edit the code and watch the corresponding output update in real time right here on the page.
The CSET.import() call simply makes the functions from the CSET module available locally. I've written a separate post about modules.
A regular expression to match the tetragrams:
Here is the longer range that was explained above:
Of course, if the range covers characters in the BMP, there is no need to use the "\u" Unicode escapes, and we can use the characters directly to represent themselves, making the regular expression a little more readable and saving a few bytes.
Here's a regex to match any character that appears in the first sentence of this post:
Finally, here's a regex to match any single character in the Unicode category Ll (lowercase letters in any language). The other Unicode General Categories (except Cn) are also supported: Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No Pc Pd Ps Pe Pi Pf Po Sm Sc Sk So Zs Zl Zp Cc Cf Cs and Co. The Unicode Character Database has an explanation of these.
The CSET code used on this page is in cset_production.js, which is generated from cset_source.js, which contains detailed comments. The code is released under the MIT license.