Counting Codepoints

Philippe Verdy verdy_p at wanadoo.fr
Tue Oct 13 09:16:47 CDT 2015


This works in Java because Java also treats surrogates as characters, even
though it has additional APIs to measure the length of a string in Unicode
code points. Outside strings, characters are just integers matching their
code point value, and they are not restricted to valid Unicode characters
(strings are not validated as UTF-16 either). Java strings are not UTF-16
strings; they are just sequences of unsigned 16-bit code units, with
arbitrary values in arbitrary order (so strings that are ill-formed for
Unicode are still valid Java strings).
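To illustrate the point, here is a minimal sketch showing that a Java string accepts unpaired surrogates and that an "int" code point is not range-checked. The class name and the dumpCodeUnits helper are hypothetical, introduced only for this example.

```java
public class IllFormedStringDemo {
    // Hypothetical helper: lists the raw 16-bit code units of a string in hex.
    static String dumpCodeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A lone low surrogate followed by a lone high surrogate:
        // ill-formed as UTF-16, but a perfectly valid Java string.
        String s = "\uDC00\uD800";
        System.out.println(s.length());        // 2 code units
        System.out.println(dumpCodeUnits(s));  // DC00 D800

        // codePointAt returns the surrogate itself when it is unpaired.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // dc00

        // Outside strings, a code point is just an int; nothing stops it
        // from lying beyond U+10FFFF until it is explicitly checked.
        int notACodePoint = 0x110000;
        System.out.println(Character.isValidCodePoint(notACodePoint)); // false
    }
}
```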
When UTF-16 validity is required, your loop examples would have to test for
lone surrogates among the returned code points. Such detection is needed to
implement some protocols, e.g. to parse HTML pages and check (or guess)
their encoding; the input stream would then be parsed with another
encoding that counts code points differently.
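Such a test could look like the sketch below, which extends the codePointAt loop with a surrogate range check. The class and method names are hypothetical; the check relies only on the fact that codePointAt returns a value in D800..DFFF when the surrogate at that index is unpaired.

```java
public class LoneSurrogateCheck {
    /**
     * Returns the index of the first unpaired surrogate code unit,
     * or -1 if the string is well-formed UTF-16.
     */
    static int firstLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // A properly paired surrogate is returned as a supplementary
            // code point (>= 0x10000), so a value in the surrogate range
            // here means the code unit at i is isolated.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Lone low surrogate at index 0, then a valid pair.
        System.out.println(firstLoneSurrogate("\uDC00\uD800\uDC20")); // 0
        // Well-formed: 'a', one supplementary character, 'b'.
        System.out.println(firstLoneSurrogate("a\uD800\uDC20b"));     // -1
    }
}
```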
For I/O, the 16-bit "char" type is actually not used: I/O is performed with
signed "byte"s, which are decoded into strings using a specific encoding,
and the decoder may return errors or throw exceptions (the reverse
operation, encoding, can also fail).
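As a sketch of both failure directions, the following uses java.nio.charset with CodingErrorAction.REPORT so that malformed input raises an exception instead of being silently replaced. UTF-8 is chosen here only as an example encoding, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictCodecDemo {
    // Decodes bytes as UTF-8, throwing on any malformed sequence.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Encodes a string as UTF-8, throwing if it contains a lone surrogate.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer bb = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    public static void main(String[] args) {
        try {
            // 0xC3 starts a two-byte UTF-8 sequence that never completes.
            decodeStrict(new byte[] { (byte) 0xC3 });
        } catch (CharacterCodingException e) {
            System.out.println("decode failed: " + e);
        }
        try {
            // A lone high surrogate has no UTF-8 representation.
            encodeStrict("\uD800");
        } catch (CharacterCodingException e) {
            System.out.println("encode failed: " + e);
        }
    }
}
```

Note that convenience calls such as new String(bytes, charset) and String.getBytes(charset) substitute a replacement character or byte rather than failing, so a strict decoder or encoder has to be requested explicitly when errors must be detected.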


2015-10-13 14:08 GMT+02:00 Mark Davis ☕️ <mark at macchiato.com>:

>
> On Tue, Oct 13, 2015 at 8:36 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>
>> Rather the question must be the unwieldy one of how
>> many scalar values and lone surrogates it contains in total.
>>
>
> That may be the question in theory; in practice no programming language
> is going to support APIs like that. So the question is whether your
> original question was purely theoretical, or was about some particular
> language/environment.
>
> If the latter, then looking at the behavior of related functions in that
> environment, like traversing a string, and counting in a way that is most
> consistent with their behavior, is the least likely to cause problems.
>
> For example, Java is pretty consistent; each of the following returns 2 as
> the count.
>
>     String test = "\uDC00\uD800\uDC20";
>     int count = test.codePointCount(0, test.length());
>     System.out.println("codePointCount:\t" + count);
>
>     count = 0;
>     int cp;
>     for (int i = 0; i < test.length(); i += Character.charCount(cp)) {
>       cp = test.codePointAt(i);
>       count++;
>     }
>     System.out.println("Java 7 iteration:\t" + count);
>
>     count = 0;
>     for (int cp2 : test.codePoints().toArray()) {
>       count++;
>     }
>     System.out.println("Java 8 iteration:\t" + count);
>
> // for the last, could just call:
> //   count = (int) test.codePoints().count();
>
> The isolated surrogate code unit is consistently treated as the
> corresponding surrogate code point, which is what anyone would reasonably
> expect.
>
> Mark
>