Unicode password mapping for crypto standard

Mon Jan 4 23:30:32 CST 2016

Hi Unicode list, I am looking for feedback on this proposal, 
specifically a standard specification to map between (presumably) 
Unicode text strings and octet strings.

A "password" is defined as an arbitrary octet string in a number of 
protocols and formats. This has worked for basic cases where the 
"password" is just ASCII, but there are interoperability issues when 
characters beyond ASCII get involved. My observation is that a lot of 
security folks get hand-wavy about the Unicode stuff, which is why there 
is little standardization in this area.

Recently in the IETF, application/pkcs8-encrypted has been proposed for the PKCS #8 EncryptedPrivateKeyInfo type. For purposes of our discussion, the format takes as input an opaque octet string (any octets in the range 00h-FFh, of any length) and executes various specified algorithms; the result is a decrypted private key. The most common algorithm is PBKDF2, but any algorithm can be used (including, for example, a raw symmetric encryption algorithm such as AES-256).

PKCS #8 punts on the issue of character encoding. It says that ASCII or UTF-8 could be used, but doesn’t enforce anything in particular. PKCS #12 specifies UTF-16LE with a terminating NULL character (00h 00h).
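To illustrate why the choice of encoding matters, here is a minimal sketch in Python. The password, salt, and iteration count are made-up values, and the UTF-16LE-with-terminator encoding simply follows the description above; none of this is taken from a particular implementation.

```python
import hashlib

# All three values below are hypothetical, chosen only for illustration.
password = "p\u00e4ssw\u00f6rd"   # user input containing two non-ASCII letters
salt = b"example-salt"
iterations = 1000

# Two plausible mappings from the same text to an octet string.
as_utf8 = password.encode("utf-8")
as_utf16le_nul = password.encode("utf-16-le") + b"\x00\x00"

# PBKDF2 only ever sees octets, so each mapping derives its own key.
key_utf8 = hashlib.pbkdf2_hmac("sha256", as_utf8, salt, iterations, dklen=32)
key_utf16 = hashlib.pbkdf2_hmac("sha256", as_utf16le_nul, salt, iterations, dklen=32)

keys_differ = key_utf8 != key_utf16
print(keys_differ)
```

If the sender derived the key one way and the recipient tries the other, decryption fails even though the user typed the right password.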

In the application/pkcs8-encrypted registration, I thought it might be wise to allow senders and receivers to specify how input (whether user input or otherwise) gets mapped to the octet string, since it's not part of the format. My original concern was to reflect IANA character sets, rather than profiles of Unicode.

These days, however, most user agents are Unicode-enabled and will accept user input in Unicode. Therefore, the issue is less about legacy character sets, and more about how to take the Unicode input and get a consistent and reasonable stream of bits out on both ends. For example: should the password be case folded, converted to NFKC, encoded in UTF-8 vs. UTF-16BE, etc.? Constraining or transforming the input would help disparate systems agree on these things.
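As a rough illustration of the problem, the Python sketch below feeds two visually identical inputs through a hypothetical helper, to_octets, which is my own invention and not part of any proposal. The order of operations (normalize, then case fold) is also just an assumption for the example.

```python
import unicodedata

composed = "caf\u00e9"       # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

def to_octets(text, form=None, encoding="utf-8", fold_case=False):
    """Hypothetical helper: optionally normalize and case fold, then encode."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    if fold_case:
        text = text.casefold()
    return text.encode(encoding)

# Without normalization the two inputs yield different octet strings.
raw_match = to_octets(composed) == to_octets(decomposed)

# After NFKC both inputs yield the same octet string.
nfkc_match = to_octets(composed, "NFKC") == to_octets(decomposed, "NFKC")

# The encoding still matters once the text itself has been made consistent.
utf8_octets = to_octets(composed, "NFKC", "utf-8")
utf16be_octets = to_octets(composed, "NFKC", "utf-16-be")
```

Both ends have to agree on every one of these choices, otherwise the octet strings (and therefore the derived keys) will not match.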

Thank you,

Sean

PS I read the "Unicode in passwords" thread. It's relevant. An 
alternative or addition to proposing a mapping to/from Unicode might be 
to have a "keyboard-mapping" or "keyboard-layout" parameter that 
specifies the suggested layout of the keyboard (or input device) used 
for password input, preferably by deferring to some international 
standard on the topic. Such a parameter could influence the initial user 
input method, but it doesn't answer the question of how to turn the key 
presses into specific bits (Unicode-based or otherwise).

**********
The relevant part of the template (most recent proposal, today) is:
***
Optional parameters:

password-mapping:
When the private key encryption algorithm incorporates a "password" that 
is an octet string, a mapping between user input and the octet string is 
desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications 
follow some common text encoding rules"; it then suggests, but does not 
recommend, ASCII and UTF-8. This parameter specifies the charset that a 
recipient SHOULD attempt first when mapping user input to the octet 
string. It has semantics similar to those of the charset parameter from 
text/plain, except that it only applies to the user’s input of the 
password. There is no default value.

The following special values are defined:
*pkcs12  = UTF-16LE with U+0000 NULL terminator (PKCS #12-style)
*precis  = PRECIS password profile, i.e., OpaqueString from Section 4 of 
RFC 7613 (always UTF-8)
*precis-XXX = PRECIS profile as named XXX in the IANA PRECIS Profiles 
Registry <https://www.iana.org/assignments/precis-parameters>
*hex     = hexadecimal input: the input is mapped to 0-9, A-F, and then 
converted directly to octets. If there is an odd number of hex digits, 
a final digit 0 is appended, or an error condition may be raised. 
Compare with Annex M.4 of IEEE 802.11-2012.
*dtmf    = The characters "0"-"9", "A"-"D", "*", and "#", which map to 
their corresponding ASCII codes. (This is to support restricted-input 
devices, i.e., telephones and telephone-like equipment.)

Otherwise, the value of this parameter is a charset, from the Character 
Sets Registry <http://www.iana.org/assignments/character-sets>.
***
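To make the special values concrete, here is a minimal sketch in Python of how a recipient might apply the parameter. The function name map_password is hypothetical. For *hex, stripping whitespace and uppercasing is my reading of "mapped to 0-9, A-F", and I take the "append 0" branch rather than raising an error. The PRECIS values are left unimplemented because the Python standard library has no PRECIS support. Passing the charset straight to Python's codec lookup is also an assumption, since Python codec names only partly overlap with the IANA registry.

```python
import re

def map_password(user_input: str, password_mapping: str) -> bytes:
    """Hypothetical sketch: map user input to the password octet string."""
    if password_mapping == "*pkcs12":
        # UTF-16LE plus a U+0000 terminator, as the template describes it.
        return user_input.encode("utf-16-le") + b"\x00\x00"
    if password_mapping == "*hex":
        # Assumption: ignore whitespace and accept lower-case digits.
        digits = "".join(user_input.split()).upper()
        if not re.fullmatch(r"[0-9A-F]*", digits):
            raise ValueError("input contains non-hexadecimal characters")
        if len(digits) % 2:
            digits += "0"   # the template also allows raising an error here
        return bytes.fromhex(digits)
    if password_mapping == "*dtmf":
        if not re.fullmatch(r"[0-9A-D*#]*", user_input):
            raise ValueError("input contains non-DTMF characters")
        return user_input.encode("ascii")
    if password_mapping.startswith("*precis"):
        raise NotImplementedError("PRECIS profiles need a PRECIS library")
    if password_mapping.startswith("*"):
        raise ValueError("unknown special value: " + password_mapping)
    # Otherwise treat the value as a charset name.
    return user_input.encode(password_mapping)

pkcs12_octets = map_password("ab", "*pkcs12")
hex_octets = map_password("0a ff 1", "*hex")
dtmf_octets = map_password("12#A", "*dtmf")
charset_octets = map_password("\u00e9", "UTF-8")
```

Since the parameter only says what a recipient SHOULD attempt first, a real recipient would presumably fall back to other mappings if the first attempt fails to decrypt the key.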

The relevant part of the original template (proposed 2015-11-04) is:
***
Optional parameters:
charset: When the private key encryption algorithm incorporates a "password" that is an octet string, a mapping between user input and the octet string is desirable. PKCS #5 [RFC2898] Section 3 recommends "that applications follow some common text encoding rules"; it then suggests, but does not recommend, ASCII and UTF-8. This parameter specifies the charset that a recipient SHOULD attempt first when mapping user input to the octet string. It has the same semantics as the charset parameter from text/plain, except that it only applies to the user’s input of the password. There is no default value.

ualg: When the charset is a Unicode-based encoding, this parameter is a space-delimited list of Unicode algorithms that a recipient SHOULD first attempt to apply to the Unicode user input in succession, in order to derive the octet string. The list of algorithm keywords is defined by [UNICODE]. “Tailored operations” are operations that are sensitive to language, which must be provided as an input parameter. If a tailored operation is called for, the exclamation mark followed by the [BCP47] language tag specifies the language. For example, "toNFD toNFKC_Casefold!tr" first applies Normalization Form D, followed by Normalization Form KC with Case Folding in the Turkish language, according to [UNICODE] and [UAX31]. The default value of this parameter is empty, and leaves the matter of whether to normalize, case fold, or apply other transformations unspecified.
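The ualg syntax could be parsed along these lines. This is only a sketch in Python: parse_ualg and apply_ualg are hypothetical names, and only the four plain normalization keywords are applied, because the Python standard library offers no exact toNFKC_Casefold and no language-tailored operations.

```python
import unicodedata

# Keywords this sketch knows how to apply; everything else is only parsed.
NORMALIZATION_FORMS = {
    "toNFC": "NFC",
    "toNFD": "NFD",
    "toNFKC": "NFKC",
    "toNFKD": "NFKD",
}

def parse_ualg(value):
    """Split a ualg value into (keyword, language-tag-or-None) pairs."""
    steps = []
    for token in value.split():
        # "keyword!lang" marks a tailored operation for that language.
        keyword, _, language = token.partition("!")
        steps.append((keyword, language or None))
    return steps

def apply_ualg(text, value):
    """Apply the listed operations to the text in succession."""
    for keyword, language in parse_ualg(value):
        if language is None and keyword in NORMALIZATION_FORMS:
            text = unicodedata.normalize(NORMALIZATION_FORMS[keyword], text)
        else:
            raise NotImplementedError("unsupported operation: " + keyword)
    return text

steps = parse_ualg("toNFD toNFKC_Casefold!tr")
nfd_text = apply_ualg("caf\u00e9", "toNFD")
```

Here steps holds the two operations from the example above in order, with "tr" attached to the second, and an empty parameter value yields an empty list, which leaves the input untouched.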

The latest template is here:

http://mailarchive.ietf.org/arch/msg/precis/Qil9mc5AtqxXp8OXllp0lAwYts4