Counting Codepoints

Mark Davis ☕️ mark at
Tue Oct 13 07:08:28 CDT 2015

On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
richard.wordingham at> wrote:

> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.

That may be the question in theory; in practice no programming language is
going to support APIs like that. So the question is whether your original
question was purely theoretical, or was about some particular environment.

If the latter, then looking at the behavior of related functions in that
environment, like traversing a string, and counting in a way that is most
consistent with their behavior, is the least likely to cause problems.

For example, Java is pretty consistent; each of the following returns 2 as
the count.

    String test = "\uDC00\uD800\uDC20";  // lone low surrogate, then a valid pair
    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
      cp = test.codePointAt(i);
      count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
      count++;
    }
    System.out.println("Java 8 iteration:\t" + count);

    // for the last, could just call: count = (int) test.codePoints().count();

The isolated surrogate code unit is consistently treated as the
corresponding surrogate code point, which is what anyone would reasonably
expect.
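
To make that concrete, here is a rough sketch (mine, not anything from the
JDK) that prints the code points Java sees in the test string and splits the
total into lone surrogates and scalar values. The class name
CodePointBreakdown and the two helper methods are made up for illustration;
only codePointAt, codePointCount, codePoints, and Character.charCount are
standard Java calls.

```java
class CodePointBreakdown {
    /** Counts code points in the surrogate range, i.e. unpaired surrogate code units. */
    public static int countLoneSurrogates(String s) {
        int lone = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt returns the code unit itself when it is not part of a valid pair,
            // so a result in the surrogate range can only be an isolated surrogate.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                lone++;
            }
            i += Character.charCount(cp);
        }
        return lone;
    }

    /** Counts Unicode scalar values: all code points minus the lone surrogates. */
    public static int countScalarValues(String s) {
        return s.codePointCount(0, s.length()) - countLoneSurrogates(s);
    }

    public static void main(String[] args) {
        String test = "\uDC00\uD800\uDC20";
        // List each code point as Java iterates it.
        test.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        System.out.println("lone surrogates:\t" + countLoneSurrogates(test));
        System.out.println("scalar values:\t" + countScalarValues(test));
    }
}
```

For the test string this should list U+DC00 and then U+10020: one lone
surrogate plus one scalar value, which together account for the count of 2.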
