Counting Codepoints

Mark Davis ☕️ mark at macchiato.com
Tue Oct 13 07:08:28 CDT 2015


On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.
>

That may be the question in theory; in practice no programming language is
going to support APIs like that. So the question is whether your original
question was purely theoretical, or was about some particular
language/environment.

If the latter, then the approach least likely to cause problems is to look
at how related functions in that environment behave, such as string
traversal, and to count in the way most consistent with that behavior.

For example, Java is pretty consistent; each of the following returns 2 as
the count.

    // A lone low surrogate followed by a valid surrogate pair (U+10020).
    String test = "\uDC00\uD800\uDC20";
    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    // Manual iteration: advance by 1 or 2 chars per code point.
    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
      cp = test.codePointAt(i);
      count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    // Stream-based iteration over code points.
    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
      count++;
    }
    System.out.println("Java 8 iteration:\t" + count);

    // for the last, could just call: count = (int) test.codePoints().count();

The isolated surrogate code unit is consistently treated as the
corresponding surrogate code point, which is what anyone would reasonably
expect.
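
To make that concrete, here is a minimal sketch that prints the code points
Java reports for the same test string and splits the total into scalar
values and lone surrogates. The class name SurrogateCount and the helper
methods codePointsOf and loneSurrogates are made up for illustration; only
the java.lang calls are real API.

```java
public class SurrogateCount {
    // Code points as Java's own iteration reports them.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    // Number of code points in the surrogate range U+D800..U+DFFF.
    // A well-formed surrogate pair is reported as one supplementary code
    // point (>= U+10000), so only unpaired surrogates are counted here.
    static long loneSurrogates(String s) {
        return s.codePoints()
                .filter(cp -> cp >= Character.MIN_SURROGATE
                           && cp <= Character.MAX_SURROGATE)
                .count();
    }

    public static void main(String[] args) {
        String test = "\uDC00\uD800\uDC20";
        for (int cp : codePointsOf(test)) {
            System.out.printf("U+%04X%n", cp);
        }
        long total = test.codePointCount(0, test.length());
        long lone = loneSurrogates(test);
        System.out.println("scalar values: " + (total - lone)
                + ", lone surrogates: " + lone);
    }
}
```

For the test string this prints U+DC00 and U+10020, then reports one scalar
value and one lone surrogate, which adds up to the count of 2 above.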

Mark