Unicode in passwords

Tue Oct 6 08:04:44 CDT 2015

Note that Java strings DO allow lone surrogates, as well as non-characters,
because Java strings are unrestricted vectors of 16-bit code units (non-BMP
characters are handled as pairs of surrogates).
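
A minimal sketch of that point, using only java.lang (the class and method
names are made up for illustration): a supplementary character takes two
code units, and a string holding an unpaired surrogate and a non-character
is accepted without complaint.

```java
// Illustrative only: CodeUnitDemo and its members are hypothetical names.
final class CodeUnitDemo {
    // U+1F600 is outside the BMP, so it is stored as a surrogate pair.
    static final String PAIR = new String(Character.toChars(0x1F600));
    // 'a', then a lone high surrogate, then the non-character U+FFFF.
    static final String LONE = "a" + (char) 0xD800 + (char) 0xFFFF;

    // Number of 16-bit code units.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of code points; an unpaired surrogate counts as one.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(PAIR) + " code units, " + codePoints(PAIR) + " code point(s)");
        System.out.println(codeUnits(LONE) + " code units, " + codePoints(LONE) + " code point(s)");
    }
}
```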

Under those conditions, normalizing a Java string will either leave lone
surrogates (and non-characters) as is or throw an exception, depending on
the API used. Java strings do not have any implied encoding: their "char"
members are also unrestricted 16-bit code units. The char-based methods of
the builtin Character class only describe BMP characters; properties of
non-BMP characters need the int code point overloads added in Java 5, or a
library such as ICU4J for properties the JDK does not expose.
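
For what it's worth, the int-taking overloads of Character (present since
Java 5) do accept supplementary code points. A small sketch, with
hypothetical class and method names, of scanning a string for unpaired
surrogates and querying a property of a non-BMP code point:

```java
// Illustrative only: CodePointProps and its methods are hypothetical names.
final class CodePointProps {
    // True if the string contains a surrogate code unit that is not part
    // of a well-formed high+low pair.
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            // codePointAt returns the surrogate itself when it is unpaired.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    // Property lookup by code point rather than by char.
    static boolean isSupplementaryLetter(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint)
                && Character.isLetter(codePoint);
    }
}
```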

This is essentially the same kind of string as C/C++ "wide" strings using
16-bit wchar_t, except that:
- C/C++ wide strings do not allow the inclusion of U+0000, which is a
terminator, unless you use a string class that keeps the actual string
length (and not just the allocated buffer length, which may be larger).
- Java strings, including literals, are immutable, and optionally atomized
(interned) into a global dictionary, which includes all string literals, so
that multiple instances with equal contents share storage, including across
distinct classes from distinct packages.
- String literals are initialized from the compiled bytecode of classes
using a modified version of UTF-8 that preserves all 16-bit code units
(including lone surrogates and non-characters like U+FFFF), but stores
U+0000 as <0xC0,0x80>. This modified UTF-8 encoding is also what you get if
you use the JNI interface version with 8-bit strings (this internally
requires a conversion by JNI, using a temporary buffer); if you use the JNI
interface version with 16-bit strings, you work directly with the internal
16-bit Java strings and there's no conversion: you'll also get the lone
surrogates and all non-characters, and you are not restricted to valid
UTF-16.
- Java strings are commonly used for fast initialization of large immutable
binary arrays, because the conversion from modified UTF-8 to 16-bit strings
does not require running any compiled bytecode. This is not true for other
static arrays, which require large code for array literals and are not
guaranteed to be immutable; the alternative to this large compiled code is
to initialize those large static arrays by I/O from an external stream,
such as a file beside the class in the same package, and possibly packed in
the same JAR.
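
The modified UTF-8 form mentioned in the list can be observed from plain
Java, since DataOutputStream.writeUTF and DataInputStream.readUTF use it
too. The sketch below (class and method names are hypothetical) strips the
2-byte length prefix that writeUTF adds, so the remaining bytes show U+0000
encoded as <0xC0,0x80> and a lone surrogate surviving a round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Illustrative only: ModifiedUtf8Demo and its methods are hypothetical names.
final class ModifiedUtf8Demo {
    // Encodes with writeUTF and drops its 2-byte length prefix.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Re-adds the length prefix and decodes with readUTF.
    static String decode(byte[] body) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        buf.write(body.length >>> 8); // high byte of the length
        buf.write(body.length);       // low byte of the length
        buf.write(body, 0, body.length);
        try (DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```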

Java passwords are "strings", but strings still allow arbitrary 16-bit
code units, even ones that violate UTF-16 restrictions. You will not get
much difference if you use byte arrays, the only change being the size of
the code units. Between those two representations you are free to convert
with ANY pair of encodings, not just UTF-8<>UTF-16.
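
As one example of such a conversion, here is a sketch (hypothetical names,
big-endian byte order chosen arbitrarily) that packs each 16-bit code unit
into two bytes without interpreting it, so even ill-formed UTF-16 survives.
The standard UTF-8 encoder, by contrast, replaces an unpaired surrogate
with '?'.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative only: CodeUnitPacking and its methods are hypothetical names.
final class CodeUnitPacking {
    // Lossless: two bytes per code unit, big-endian (ByteBuffer's default).
    static byte[] toBytes(char[] units) {
        ByteBuffer buf = ByteBuffer.allocate(units.length * 2);
        buf.asCharBuffer().put(units);
        return buf.array();
    }

    // Inverse of toBytes; a trailing odd byte, if any, is ignored.
    static char[] toChars(byte[] bytes) {
        char[] units = new char[bytes.length / 2];
        ByteBuffer.wrap(bytes).asCharBuffer().get(units);
        return units;
    }

    // Lossy for ill-formed input: String.getBytes substitutes the
    // encoder's replacement for an unpaired surrogate.
    static byte[] utf8Lossy(char[] units) {
        return new String(units).getBytes(StandardCharsets.UTF_8);
    }
}
```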

However, for security reasons, it's best to avoid string literals for
passwords, because they can be enumerated from the global dictionary of
atomized strings, or directly by reading the bytecode of the compiled
class, where they are stored in modified UTF-8 but loaded and used as
arbitrary 16-bit strings (the same is true if you use a byte array
literal: you can just parse the initialization bytecode to get the list of
bytes). If passwords or authorization keys are stored somewhere (as strings
or as byte arrays), they should be encrypted in safe storage and not kept
in static string literals or byte array initializers (BOTH will be clear
text in the bytecode of the compiled class).
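
To illustrate the "global dictionary" point: a literal is interned, so any
equal string interned at run time resolves to the very same object. The
class name and the placeholder value below are hypothetical.

```java
// Illustrative only: InternDemo and SECRET are hypothetical; never keep a
// real password in a literal like this.
final class InternDemo {
    // A string literal; it lives in the shared pool of interned strings
    // and appears in clear text in the compiled class file.
    static final String SECRET = "hunter2";

    // True if interning the runtime value yields the literal's own instance.
    static boolean sharesLiteral(String runtimeValue) {
        return runtimeValue.intern() == SECRET;
    }
}
```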

In both cases, NO normalization is applied implicitly or checked/enforced
by the API. The only check that occurs is at class loading time, for the
modified UTF-8 encoding of string literals: if it is wrong, the class will
not load at all and you'll get an invalid class exception. There's no such
check at all for the encoding of byte array initializers; the only checks
are the validity of the Java initializer bytecode and the bounds of the
array indexes used by the initializer code.
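
Since nothing is enforced implicitly, the application has to do it. A
sketch of the steps Martin describes in the quoted message below (receive a
string, normalize, convert to bytes with one fixed encoding), with an
explicit rejection of ill-formed UTF-16 added. The choice of NFC and UTF-8
is my assumption, not something stated in the thread, and the class and
method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Illustrative only: PasswordBytes.prepare is a hypothetical helper.
final class PasswordBytes {
    // Validates, normalizes (NFC, an assumed choice), then encodes as UTF-8.
    static byte[] prepare(String password) {
        // canEncode is false when the input holds an unpaired surrogate.
        if (!StandardCharsets.UTF_8.newEncoder().canEncode(password)) {
            throw new IllegalArgumentException("password is not well-formed UTF-16");
        }
        String normalized = Normalizer.normalize(password, Normalizer.Form.NFC);
        return normalized.getBytes(StandardCharsets.UTF_8);
    }
}
```

With this, "e" followed by U+0301 and the precomposed U+00E9 produce the
same bytes, while a string holding a lone surrogate is refused instead of
being silently altered.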

2015-10-06 5:39 GMT+02:00 Martin J. Dürst <duerst at it.aoyama.ac.jp>:

> On 2015/10/01 13:11, Jonathan Rosenne wrote:
>
>> For languages such as Java, passwords should be handled as byte arrays
>> rather than strings. This may make it difficult to apply normalization.
>>
>
> Well, they should be received from the user interface as strings, then
> normalized, then converted to byte arrays using a well-defined single
> encoding. Somewhat tedious, but hopefully not difficult.
>
> Regards,   Martin.
>