Counting Codepoints
Mark Davis ☕️
mark at macchiato.com
Tue Oct 13 07:08:28 CDT 2015
On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:
> Rather the question must be the unwieldy one of how
> many scalar values and lone surrogates it contains in total.
>
That may be the question in theory; in practice no programming language is
going to support APIs like that. So the question is whether your original
question was purely theoretical, or was about some particular
language/environment.
If the latter, then the approach least likely to cause problems is to look
at how related functions in that environment behave, such as traversing a
string, and to count in the way that is most consistent with them.
For example, Java is pretty consistent; each of the following returns 2 as
the count.
    // A lone low surrogate followed by a valid surrogate pair.
    String test = "\uDC00\uD800\uDC20";

    int count = test.codePointCount(0, test.length());
    System.out.println("codePointCount:\t" + count);

    // Java 7 style: step through the string one code point at a time.
    count = 0;
    int cp;
    for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
        cp = test.codePointAt(i);
        count++;
    }
    System.out.println("Java 7 iteration:\t" + count);

    // Java 8 style: iterate over the code point stream.
    count = 0;
    for (int cp2 : test.codePoints().toArray()) {
        count++;
    }
    System.out.println("Java 8 iteration:\t" + count);
    // for the last, could just call: count = (int) test.codePoints().count();
The isolated surrogate code unit is consistently treated as the
corresponding surrogate code point, which is what anyone would reasonably
expect.
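
As a rough sketch of what that means for the test string above, the
following lists the code points that codePoints() yields and splits the
total into lone surrogates and scalar values. The class and method names
(CodePointBreakdown, describe, countLoneSurrogates, countScalarValues) are
made up for illustration and are not part of any Java API; only the
String and Character calls are standard.

```java
import java.util.stream.Collectors;

public class CodePointBreakdown {

    // Hypothetical helper: renders each code point in U+XXXX notation.
    public static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // Hypothetical helper: counts code points in the surrogate range
    // U+D800..U+DFFF. A well-formed pair is already combined into one
    // supplementary code point, so only unpaired surrogates match here.
    public static long countLoneSurrogates(String s) {
        return s.codePoints()
                .filter(cp -> cp >= Character.MIN_SURROGATE
                           && cp <= Character.MAX_SURROGATE)
                .count();
    }

    // Hypothetical helper: every code point that is not a surrogate
    // is a Unicode scalar value.
    public static long countScalarValues(String s) {
        return s.codePoints().count() - countLoneSurrogates(s);
    }

    public static void main(String[] args) {
        String test = "\uDC00\uD800\uDC20";
        System.out.println("code points: " + describe(test));
        System.out.println("lone surrogates: " + countLoneSurrogates(test));
        System.out.println("scalar values: " + countScalarValues(test));
        // Prints:
        // code points: U+DC00 U+10020
        // lone surrogates: 1
        // scalar values: 1
    }
}
```

The two counts add up to the 2 reported by codePointCount: the unpaired
low surrogate is kept as U+DC00, and the well-formed pair becomes U+10020.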
Mark