Unicode equivalence between Word for Windows/MAC

Philippe Verdy verdy_p at wanadoo.fr
Wed Oct 28 11:06:15 CDT 2015


"Unicode 6.1" just indicates the version of the repertoire. On the Mac,
that label means the encoding is based on Unicode (just like on Windows)
and that it is UTF-16.
With the "Little Endian" option, the file is saved with UTF-16 code units
in little-endian byte order (as on PCs and on today's x86-based Macs). On
old versions of Mac OS the byte order was big-endian, as those Macs were
not based on the x86 architecture but on the Motorola 68K in the first
generations and later on PowerPC.
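
To make the byte-order point concrete, here is a small Python sketch (the
sample characters are arbitrary and chosen only for illustration) showing
how the same UTF-16 code units are serialized in each order:

```python
# Illustration only: the same UTF-16 code units in the two byte orders.
text = "A\u00e9\u65e5"  # 'A', 'é', '日' (arbitrary sample characters)

little = text.encode("utf-16-le")  # no BOM, least significant byte first
big = text.encode("utf-16-be")     # no BOM, most significant byte first
with_bom = text.encode("utf-16")   # Python prepends a BOM in native order

print(little.hex(" "))  # 41 00 e9 00 e5 65
print(big.hex(" "))     # 00 41 00 e9 65 e5

# Both byte sequences decode to the same text once the order is known.
assert little.decode("utf-16-le") == big.decode("utf-16-be") == text
# The generic "utf-16" decoder reads the BOM to pick the order.
assert with_bom.decode("utf-16") == text
```

The text is identical in both cases; only the order of the two bytes
inside each 16-bit code unit changes.
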
Word documents actually come in two file formats: the old binary .doc
format, and the later .docx format, which is based on XML (the actual files
are in fact ZIP archives containing multiple XML files). Each of these XML
files declares its own encoding, which can be UTF-8, or UTF-16 starting
with a "byte order mark" indicating whether it is little-endian or
big-endian, or "UTF-16LE", where there is no leading byte order mark but
the order is assumed to be little-endian.
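
As a rough sketch of how one could check this, the Python below opens a
ZIP package and reports the encoding each XML part announces, from its
byte order mark or its XML declaration. The function names are mine, and
the tiny archive built in memory is only a stand-in for a real .docx
("word/other.xml" is a made-up part name), so treat it as an illustration
rather than a full parser:

```python
import io
import re
import zipfile

def sniff_xml_encoding(raw: bytes) -> str:
    """Guess the encoding of one XML part from its BOM or declaration."""
    if raw.startswith(b"\xff\xfe"):
        return "UTF-16 (little-endian BOM)"
    if raw.startswith(b"\xfe\xff"):
        return "UTF-16 (big-endian BOM)"
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]  # skip the optional UTF-8 signature
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    # Without a BOM or a declaration, XML defaults to UTF-8.
    return match.group(1).decode("ascii") if match else "UTF-8 (XML default)"

def part_encodings(package: bytes) -> dict:
    """Map each .xml member of a ZIP package to its announced encoding."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return {name: sniff_xml_encoding(archive.read(name))
                for name in archive.namelist() if name.endswith(".xml")}

# Build a small stand-in package in memory (not a valid Word document).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(
        "word/document.xml",
        '<?xml version="1.0" encoding="UTF-8"?><doc/>'.encode("utf-8"))
    archive.writestr(
        "word/other.xml",
        b"\xff\xfe"
        + '<?xml version="1.0" encoding="UTF-16"?><doc/>'.encode("utf-16-le"))

encodings = part_encodings(buffer.getvalue())
for name, encoding in encodings.items():
    print(name, "->", encoding)
```

To try it on a real file, pass the bytes of the .docx to part_encodings()
instead of the in-memory stand-in.
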
The UTF-8 encoding does not depend on byte order; however, it may be a bit
larger than UTF-16 when working with non-Latin scripts. This difference in
size for XML (significant for languages like Japanese, Korean or Chinese)
is however insignificant in .docx files, as they are internally compressed
in a ZIP archive.
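
A quick way to see the size effect is the Python sketch below. It uses
zlib, which implements the same DEFLATE algorithm that ZIP archives
normally use. The sample text is simply one Japanese sentence repeated, so
it compresses far better than a real document would; take the numbers as
an illustration of the trend, not as a measurement:

```python
import zlib

# Arbitrary sample: one short Japanese sentence repeated many times.
sample = "日本語のテキストです。" * 200

utf8 = sample.encode("utf-8")       # 3 bytes per character here
utf16 = sample.encode("utf-16-le")  # 2 bytes per character here

packed_utf8 = zlib.compress(utf8)
packed_utf16 = zlib.compress(utf16)

print("raw bytes:       ", len(utf8), "vs", len(utf16))
print("compressed bytes:", len(packed_utf8), "vs", len(packed_utf16))
```

Before compression the UTF-8 form is 50% larger for this text; after
compression both forms shrink to a small fraction of their raw size, so
the difference between them hardly matters any more.
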
Additionally, files for Mac not only contain the document but may also have
a legacy "resource fork" containing some metadata about the application
that created the document, tagging its internal format and some other
options specific to the format. But this metadata is generally not
transmitted if you forward the files, for example as email attachments.
UTF-8 and UTF-16 encodings should all be interoperable between Windows and
MacOS anyway, notably for files in .docx format (internally based on XML),
except for a few characters that are specific to MacOS (such as the Apple
logo) and only supported by private encodings or by PUA and with
MacOS-specific fonts when mapped to Unicode.
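
The Apple logo case can be checked with a few lines of Python, relying on
the mapping tables of Python's bundled "mac_roman" codec (this says
nothing about what Word itself does with the character):

```python
import unicodedata

# Byte 0xF0 is the Apple logo in the legacy MacRoman charset.
apple_logo = b"\xf0".decode("mac_roman")

# It maps to a Private Use Area code point (general category "Co").
print(f"U+{ord(apple_logo):04X}", unicodedata.category(apple_logo))  # U+F8FF Co

# UTF-8 and UTF-16 carry the code point without loss...
assert apple_logo.encode("utf-8").decode("utf-8") == apple_logo
assert apple_logo.encode("utf-16-le").decode("utf-16-le") == apple_logo

# ...but a Windows code page has no slot for it.
try:
    apple_logo.encode("cp1252")
    encodable_in_cp1252 = True
except UnicodeEncodeError:
    encodable_in_cp1252 = False
print("encodable in cp1252:", encodable_in_cp1252)  # encodable in cp1252: False
```

So the character survives the transfer at the encoding level; whether it
displays as a logo depends on the recipient having a font that assigns
that glyph to the private-use code point.
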

The main differences will not be really in the character encoding but in
the internal support of custom extensions for Word, such as active scripts,
or embedding options for documents created by other apps, possibly not from
Microsoft itself.

Those Word documents will load between all platforms but some embedded
items may not be found on the target machine (including fonts used on the
source machine but not embedded in the transmitted document: for this
reason, Word/Office on the Mac comes with a basic set of Microsoft fonts
that ship with Windows, installed alongside Word/Office as complements,
including Arial, Verdana, Courier New, Times New Roman, ..., whereas MacOS
natively had Helvetica, Courier, Times, ..., which are very similar in
terms of design but have slightly different metrics.)

The main cause of incompatibility will then be if you create a document on
your Mac with fonts specific to your system but not preinstalled with
Office on both platforms: if a font is missing there are some fallbacks to
other fonts, but the document layout may be slightly altered due to the
different font metrics.

The second kind of incompatibility will occur if you have embedded in your
Word document some "objects" created by Windows-specific or MacOS-specific
apps that the recipient does not have on their system: those embedded
components may just show as blank rectangles on the page.

The third kind of incompatibility comes from scripting: embedded scripts in
your document may be disabled automatically on the recipient's machine
unless the components they need are installed, their use is authenticated
by a digital signature of the document by its creator, and the recipient
has chosen to trust this creator to execute scripts on their system. This
is not specific to Windows or MacOS: those scripts will be disabled on the
recipient's machine even if the recipient has the same OS and the same
version of Word as the initial creator of the document. In this case Office
will display a "yellow warning banner" about those (unsigned or untrusted)
scripts, but the document should still be readable and editable without
them, even if the scripts (most often written in VBA, Visual Basic for
Applications) cannot be run.
Those scripts are normally helper tools used on the creator's PC to help
create/edit the document, but they should not be needed to read the
document "as is". When you create a file with those tools, you should save
the file in a statically rendered version where those tools are purged
from the document, or a version where those embedded scripts are signed by
an (Office) tool installed and trusted on both systems.

But character encodings are not an issue if those encodings are
Unicode-based (UTF-8 and UTF-16 are interoperable between Windows, MacOS,
Linux, Unix and almost all modern OSes; this was not the case with old
8-bit charsets such as Windows codepages or MacRoman and similar, due to
their many national variants, or variants specific to some versions of
these OSes), except if the recipient uses an old version of Office apps on
old OSes without native Unicode support. UTF-8 and UTF-16 are present and
natively supported in Windows since Windows 98 (on Windows 95, you may have
to install an additional Unicode compatibility layer, and old versions of
Windows 3.x or before may not support those encodings correctly and will
typically also not support the newer, XML-based, .docx or .odt formats but
only the legacy .doc format or older .rtf formats with a limited set of
private encodings and a limited character repertoire; those old Office apps
may then display some "tofu" if the encoding is not supported. Note also
that Office comes with a set of "format converters" to help convert
incoming documents to one of the supported formats, but the conversion may
be lossy or approximate in some cases).
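
To illustrate why the old 8-bit charsets were a problem where the UTFs are
not, here is a small Python sketch using the "cp1252" and "mac_roman"
codecs bundled with Python (the sample word is arbitrary):

```python
word = "café"  # arbitrary sample containing one non-ASCII letter

as_windows = word.encode("cp1252")    # Windows Western European code page
as_mac = word.encode("mac_roman")     # classic Mac OS Western charset

# The same letter gets a different byte value in each legacy charset.
print(as_windows, as_mac)

# Reading Windows bytes with the Mac charset silently yields wrong text.
misread = as_windows.decode("mac_roman")
print(repr(misread), "instead of", repr(word))

# With a Unicode encoding there is only one mapping, whatever the platform.
assert word.encode("utf-8").decode("utf-8") == word
assert word.encode("utf-16-le").decode("utf-16-le") == word
```

With the legacy charsets the receiver had to know (or guess) which table
the sender used; with UTF-8 or UTF-16 that question no longer arises.
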

For all versions of Windows with a native "Win32" API (instead of the old
"Win16" and "DOS"/"BIOS"-like APIs), you should never encounter issues with
Unicode encodings; the same applies to all versions of Mac OS since MacOS
X (which has native support of Unicode, instead of legacy Mac codepages).

Native support of the Unicode UTFs is no longer optional on most OSes (at
least on all OSes where you'll use an Office application or a web browser
or any graphical desktop environment): it is preinstalled by default on all
modern OSes including Unix/Linux (old 8-bit encodings and fonts for X11 are
still supported on these systems, and a few default fonts for these legacy
encodings are still installed, but most applications no longer use or need
them, except for the system console/shell used in non-graphic mode for
legacy terminal emulations such as VT-220 and similar, or on embedded
versions of Linux which actually don't need to render or decode any text
and transmit text data in an "as is" format, or that only support basic
ASCII or a single 8-bit codepage for those consoles without support of
internationalization).


2015-10-28 10:59 GMT+01:00 Rafael Sarabia <sarabiarafael at gmail.com>:

> Dear all,
>
> I need to use a document both in Word 2007 for Windows and Word 2011 for
> Mac and I'm finding some incompatibility issues.
>
> The file has been created in Word for Windows and save as "Unicode" (which
> I believe -although I am not certain means "UTF-16").
>
> In Word 2011 for Mac I have several options to save it as Unicode: Unicode
> 6.1, Unicode 6.1 Little-Endian and UTF-8. None of them seem to be
> equivalent to the "Unicode" encoding in Word 2007 for Windows.
>
>
>
> My question is very simple: which is the encoding equivalent in Word 2011
> for Mac  to "Unicode" in Word for Windows (*which allows me to work in
> both operating systems/Word programs interchangeably*? One of the three
> abovementioned possibilities or another one? (I don't have the complete
> list in front of me)
>
>
> Thank you very much in advance.
>
> Kind regards.
>
> Rafael Sarabia
>
>
>
>