Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

Philippe Verdy via Unicode unicode at unicode.org
Tue May 16 13:01:57 CDT 2017


On Windows NTFS (and the LFN extension of FAT32 and exFAT) at least, random
sequences of 16-bit code units are not permitted. There is evidently a
validation step that returns an error if you attempt to create files with
invalid sequences (along with other restrictions, such as forbidding U+0000
and some other problematic control characters).
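
The exact rules live mostly in the Win32 layer rather than in NTFS itself,
but the kind of check being described can be approximated as in the
following minimal Python sketch (the restrictions listed are the usual
documented Win32 ones; the function name is hypothetical):

    # A minimal sketch (not the actual NTFS/Win32 code) of the kind of
    # check described above: no NUL or other C0 controls, none of
    # < > : " / \ | ? *, and no ill-formed UTF-16 (lone surrogates).
    WIN32_FORBIDDEN = set('<>:"/\\|?*') | {chr(c) for c in range(0x20)}

    def is_acceptable_win32_name(name: str) -> bool:
        """Return True if `name` would pass this approximation of the check."""
        if not name or name in ('.', '..'):
            return False
        if any(ch in WIN32_FORBIDDEN for ch in name):
            return False
        # Python strings can carry lone surrogates (e.g. via surrogateescape);
        # those correspond to the ill-formed 16-bit sequences rejected here.
        return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in name)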

This occurs because the NTFS and FAT drivers also attempt to normalize
the string in order to create compatibility 8.3 filenames using the
system's native locale (not the current user locale, which is used when
searching files, enumerating directories, or opening files). This could
generate errors when the encodings for distinct locales do not match, but
should not cause errors when filenames are **first** searched in the
UTF-16 encoding specified by applications; applications that still need
to access files via their short name are deprecated. The normalization
used for creating short 8.3 filenames relies on OS-specific conversion
tables built into the filesystem drivers. This generation has a cost,
however, due to the uniqueness constraint: the first part of the 8.3 name
must be abbreviated to add a "~number" suffix before the extension, and
that number is unpredictable if other "*~1.*" files already exist, since
the driver must retry with another number, looping if necessary. It also
has a (very modest) storage cost, but that is less critical than the
enumeration step and the fact that these shortened names cannot be
predicted by applications.
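
For illustration, that uniqueness loop could look roughly like the
following simplified Python sketch; it is not the real Windows algorithm
(which also converts through an OEM codepage and applies stricter
character rules), and the helper name is hypothetical:

    import os

    def make_short_name(long_name: str, directory: str) -> str:
        """Simplified illustration of the "~N" uniqueness loop; the real
        Windows algorithm also converts through an OEM codepage and
        applies stricter character rules."""
        base, _, ext = long_name.rpartition('.')
        if not base:                       # name had no dot
            base, ext = long_name, ''
        clean = lambda s: ''.join(c for c in s.upper()
                                  if c.isascii() and c.isalnum())
        base, ext = clean(base), clean(ext)[:3]
        existing = {n.upper() for n in os.listdir(directory)}
        n = 1
        while True:                        # retry until a "~N" name is unique
            suffix = '~%d' % n
            candidate = base[:8 - len(suffix)] + suffix
            full = candidate + ('.' + ext if ext else '')
            if full not in existing:
                return full
            n += 1                         # depends on what already exists

The point of the sketch is the retry loop: the final number depends
entirely on which "*~N.*" entries already exist in the directory, which is
why applications cannot predict the short name.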

This canonicalization is also required because the filesystem is
case-insensitive (and it is technically not possible to store all the
possible case variants of a filename as assigned aliases/physical links).
In classic filesystems for Unix/Linux the only restrictions are that null
bytes are forbidden and that "/" is reserved for hierarchical paths (so it
is unusable anywhere within a directory entry name), plus the reservation
of the "." and ".." entries in directories. This means that only 8-bit
encodings based on 7-bit ASCII are practical, so Linux/Unix does not treat
filenames as purely binary bags of bytes; however, since this is not
checked, such random names may still occur, and they are difficult to
handle with classic tools and shells. Some other filesystems for
Linux/Unix do enforce restrictions, and some of them even support case
insensitivity, in addition to the emulated FAT12/FAT16/FAT32/exFAT/NTFS
filesystems: this also exists as an option in the NFS driver, in drivers
for legacy filesystems originally coming from mainframes, in filesystem
drivers based on FTP, and even in the filesystem driver that allows
mounting a Windows registry, which is also case-insensitive.
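
For comparison, a minimal Python sketch of the core Unix/Linux
restrictions just mentioned, before any driver-specific policy is applied
(the helper name is hypothetical, and the check operates on raw bytes):

    def is_valid_linux_component(name: bytes) -> bool:
        """Minimal sketch of what a directory-entry name must satisfy at
        the core VFS level: non-empty, no NUL byte, no '/', and not the
        reserved "." or ".." entries. Everything further is driver policy."""
        return (
            bool(name)
            and b'\x00' not in name
            and b'/' not in name
            and name not in (b'.', b'..')
        )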

Technically, in the core kernel of Linux/Unix there is no restriction on
the effective encoding (except for "/" and null); the actual restrictions
are implemented within filesystem drivers and configured only when volumes
are mounted. Each mounted filesystem can therefore have its own internal
encoding, and behavior will differ when using a driver for a macOS
filesystem, for example.

Linux can work perfectly well with NTFS filesystems, except that most of
the time short filenames will be completely ignored and not generated on
the fly.

This generation of short filenames in a legacy (unspecified) 8-bit
codepage is not a requirement of NTFS, and it can also be disabled in
Windows.

But FAT12/FAT16/FAT32 still require these legacy short names to be
generated even when only the LFN would ever be used. The short 8.3 name
could be left completely null in the main directory entry, but legacy FAT
drivers will choke on such null entries unless they are tagged by a custom
attribute bit as "ignorable but not empty", or unless the 8+3 characters
use a specific unique pattern such as "\" followed by 7 pseudo-random
characters in the main part, plus 3 other pseudo-random characters in the
extension. These 10 characters may use any non-null value: they provide
nearly 80 bits, or more exactly 250^10 identifiers, if we exclude the 6
reserved characters "/", "\", ".", ":", NUL and SPACE. Such an identifier
could be generated almost predictably simply by hashing the original
unabbreviated name (taking about 79 bits from a cryptographic hash, or
faster with a simple MD5 hash), with the very rare remaining collisions
handled by retrying.
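
A Python sketch of that hashing scheme, following the proposal above
rather than any existing FAT specification (all names here are
hypothetical):

    import hashlib

    # The 250 byte values usable in such an entry: everything except the
    # 6 reserved ones NUL, SPACE, '/', '\', '.', ':'.
    RESERVED = {0x00, 0x20, 0x2F, 0x5C, 0x2E, 0x3A}
    ALLOWED = [b for b in range(256) if b not in RESERVED]   # 250 values

    def pseudo_short_entry(long_name: str) -> bytes:
        """Hash the unabbreviated LFN and spread the digest over 10 bytes
        drawn from the 250-value alphabet (250**10 is just under 80 bits).
        The '\\' marker plus 7+3 layout follows the proposal above, not
        any existing FAT specification."""
        digest = hashlib.md5(long_name.encode('utf-16-le')).digest()
        value = int.from_bytes(digest, 'big')
        chars = []
        for _ in range(10):
            value, idx = divmod(value, 250)
            chars.append(ALLOWED[idx])
        return b'\\' + bytes(chars[:7]) + bytes(chars[7:])  # 8-byte main + 3-byte ext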

Some FAT repair tools will attempt to repair legacy short filenames that
are not unique or that cannot be derived from the UTF-16 encoded LFN (this
happens when "repairing" a FAT volume initially created on another system
that used a different 8-bit OEM codepage). These "CheckDisk" tools should
have an option to not "repair" them, given that modern applications
normally do not need these filenames when an LFN is present (even the
Windows Explorer will not display these short names, because they are
hidden by default whenever an LFN overrides them).

We must add, however, that on FAT filesystems an LFN will not always be
stored: if the Unicode name already has the "8.3" form and all characters
are from ASCII (which is the base of all supported 8-bit OEM charsets), no
LFN entry is needed, but one will be created if the user edits the
filename to use a preferred capitalization other than the default one.
(The Explorer default is to render fully capitalized short filenames with
a single leading capital letter and all other characters, including the
1-to-3-character file extension, displayed as lowercase, so the "Windows"
LFN would be stored simply as the "WINDOWS" short filename without any LFN
needed in the directory entries.)
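
As a rough Python sketch, the decision of whether a name can be stored as
a bare 8.3 entry with no LFN record might look like this (the real driver
also consults the OEM codepage and the user's chosen capitalization, which
are ignored here):

    import string

    # Characters allowed in a classic 8.3 entry once uppercased; the real
    # set also admits bytes from the OEM codepage, omitted for simplicity.
    SHORT_ALLOWED = set(string.ascii_uppercase + string.digits
                        + "!#$%&'()-@^_`{}~")

    def fits_8_3_without_lfn(name: str) -> bool:
        """Sketch: True if `name`, once uppercased, fits a bare 8.3
        directory entry (at most 8+3 characters, one optional dot), so
        that no LFN record is needed for the default capitalization."""
        base, _, ext = name.upper().partition('.')
        if not base or len(base) > 8 or len(ext) > 3 or '.' in ext:
            return False
        return all(c in SHORT_ALLOWED for c in base + ext)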

To be complete, a few legacy names are also reserved and cannot be used as
Windows (short or LFN) filenames, such as "CON" (case-insensitive): they
are claimed by a legacy non-filesystem driver before any lookup in the
current directory. To use them as filenames, you must prefix them with a
drive letter, with a ".\" prefix (relative to the current directory), or
with a full path name.
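
The post mentions only "CON", but the same interception applies to the
other well-known legacy DOS device names; a small Python sketch of the
check (the reserved-name list comes from general Windows documentation,
not from this thread):

    # Legacy DOS device names, intercepted by the device namespace before
    # any directory lookup, regardless of any extension appended.
    DOS_DEVICES = ({'CON', 'PRN', 'AUX', 'NUL'}
                   | {'COM%d' % i for i in range(1, 10)}
                   | {'LPT%d' % i for i in range(1, 10)})

    def collides_with_dos_device(name: str) -> bool:
        """Sketch: True if `name` (case-insensitively, ignoring any
        extension and trailing spaces) would be treated as a device."""
        stem = name.partition('.')[0].rstrip(' ')
        return stem.upper() in DOS_DEVICES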



2017-05-16 17:44 GMT+02:00 Hans Åberg <haberg-1 at telia.com>:

>
> > On 16 May 2017, at 17:30, Alastair Houghton via Unicode
> > <unicode at unicode.org> wrote:
> >
> > On 16 May 2017, at 14:23, Hans Åberg via Unicode
> > <unicode at unicode.org> wrote:
> >>
> >> You don't. You have a filename, which is an octet sequence of unknown
> >> encoding, and want to deal with it. Therefore, valid Unicode
> >> transformations of the filename may result in it not being reachable.
> >>
> >> It only matters that the correct octet sequence is handed back to the
> >> filesystem. All current filesystems, as far as experts could recall,
> >> use octet sequences at the lowest level; whatever encoding is used is
> >> built in a layer above.
> >
> > HFS(+), NTFS and VFAT long filenames are all encoded in some variation
> > on UCS-2/UTF-16. ...
>
> The filesystem directory uses octet sequences and does not bother
> passing along an encoding, I am told. Someone could recall one that
> used UTF-16 directly, but I think it may not be current.
>
>
>