Joined "ti" coded as "O" in PDF

Don Osborn dzo at bisharat.net
Sun May 8 07:42:13 CDT 2016


Could it be said that a PDF conversion app generating unusual coding of 
characters, and doing so without advising users, is an instance of 
"Unicode malpractice"? (per David's mention of using the "bully pulpit")

Some earlier posts in this thread made the observation that PDF is for 
presentation not archiving. However, since the format makes it possible 
to search text instead of having just an image of the pages, it seems 
that distinction is at least somewhat blurred. PDFs are archived and 
searched, and people expect to use those functions. So yes, this 
font/coding issue in PDFs is a real-world problem, but of the sort that 
Unicode was created to relegate to the past.

An analogy that comes to mind is continued use of old hacked 8-bit 
fonts, which were created before Unicode was widely adopted, for 
printing and limited sharing ("you need to install this font to view 
correctly"). Documents produced with them, however, are shared as PDFs 
(such as some Chinese-Hausa learning materials up to at least 2010, 
which of course look and print fine, but which run into the same search 
and re-use issues), and even escape into the wild as text (with unhappy 
results like a Bambara translation of a handwashing poster during the 
Ebola crisis).

Producing digital text these days can't be treated as just a matter of 
getting something visually correct.
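
As a rough illustration of the kind of check a PDF producer could run on 
text extracted from a file before sharing it, here is a minimal Python 
sketch. It assumes the extracted text is already available as a string; 
the function name find_suspect_chars and the sample string are 
hypothetical. It only flags Private Use Area characters and U+FFFD, so it 
would not catch a ligature that comes out as an ordinary letter such as 
"Ɵ".

```python
import unicodedata


def find_suspect_chars(text):
    """Return (index, 'U+XXXX') pairs for characters that often signal a lossy extraction."""
    suspects = []
    for i, ch in enumerate(text):
        # General category "Co" is Private Use; U+FFFD is the replacement character.
        if unicodedata.category(ch) == "Co" or ch == "\ufffd":
            suspects.append((i, f"U+{ord(ch):04X}"))
    return suspects


# Hypothetical extraction in which the joined "ti" glyph was given a PUA value.
sample = "interna\ue000onalization"
print(find_suspect_chars(sample))  # [(7, 'U+E000')]
```

A clean result does not prove the text is searchable; it only rules out 
one well-known failure mode.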

By the way, the "Ɵ" in the original title changed to "O" somewhere back 
in the thread. A luta continua.

Don


On 5/8/2016 4:13 AM, Andrew Cunningham wrote:
> The t_i instance will depend on the quality of the font. If it's a 
> standard ligature there should be a glyph to codepoints assignment in 
> the cmap table or the ToUnicode mapping in the PDF file.
>
> As David indicates, it isn't a Unicode issue.
>
> It is an issue with the font used and/or the tools used.
>
> PDFs have always been problematic. That isn't going to change anytime 
> soon. Particularly for archivable or accessible PDFs, the person 
> generating the PDFs should select the best tools for the job and test 
> the PDF. Then fix any problems.
>
> Andrew
>
> On Sunday, 8 May 2016, David Perry <hospes02 at scholarsfonts.net> wrote:
> > I agree that it's a real-world problem -- PDFs really should be
> > searchable -- but I do not see that it's a Unicode issue. Unicode
> > defines the basic building blocks of LATIN SMALL LETTER T and LATIN
> > SMALL LETTER I; that's its job. Unicode is not responsible for font
> > construction or creating PDF software. Furthermore, even if Unicode
> > did want to do something about it, I can't imagine what that could be
> > -- aside perhaps from using its bully pulpit to urge PDF creators and
> > font creators to do their jobs better.
> >
> > The fact that some PDF apps do not search and copy/paste text
> > correctly when unencoded characters are given PUA values has been
> > known for many years. In the case of Calibri, I looked at the font
> > (version installed on my Win7 system) and found that the 'ti' ligature
> > is named t_i, which follows good naming practices, and it does not
> > have a PUA assignment. Given this, any well-constructed PDF app should
> > be able to decode the ligature correctly.
> >
> > David
> >
> > On 5/6/2016 11:49 AM, Steve Swales wrote:
> >>
> >> This discussion seems to have fizzled out, but I’m concerned that
> >> there’s a real world problem here which is at least partially the
> >> concern of the consortium, so let me stir the pot and see if there’s
> >> still any meat left.
> >>
> >> On the current release of MacOS (including the developer beta, for
> >> your reference, Peter), if you use Calibri font, for example, in any
> >> app (e.g. Notes), to write words with “ti” (like
> >> internationalization), then press “Print” and “Open PDF in Preview”,
> >> you get a PDF document with the joined “ti”. Subsequently cutting and
> >> pasting produces mojibake, and searching the document for words
> >> with “ti” doesn’t work, as previously noted.
> >>
> >> I suppose we can look on this as purely a font handling/MacOS bug, but
> >> I’m wondering if we should be providing accommodations or conveniences
> >> in Unicode for it to work as desired.
> >>
> >> -steve
> >>
> >
>
> -- 
> Andrew Cunningham
> lang.support at gmail.com
>
>
>
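
To make the point in the quoted messages about the t_i glyph name and 
the ToUnicode mapping concrete, here is a minimal Python sketch of how a 
tool might turn such a name into text and write the corresponding 
ToUnicode entry. It follows the Adobe Glyph List naming convention as I 
understand it (drop any suffix after ".", split components on "_", 
accept "uniXXXX" and "uXXXX" forms). The lookup for plain names is a 
tiny stand-in for the real glyph list, the function names are 
hypothetical, and the glyph ID 0x01A5 is invented for the example. It is 
not a description of what any particular PDF producer actually does.

```python
import re


def _component_to_text(name):
    """Map one glyph-name component to text; return '' if it is not recognised."""
    m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", name)
    if m:
        digits = m.group(1)
        # "uni" names carry one or more 4-digit UTF-16 code units.
        return "".join(chr(int(digits[i:i + 4], 16)) for i in range(0, len(digits), 4))
    m = re.fullmatch(r"u([0-9A-F]{4,6})", name)
    if m and int(m.group(1), 16) <= 0x10FFFF:
        return chr(int(m.group(1), 16))
    # Stand-in for the Adobe Glyph List: only single ASCII letters are known here.
    if len(name) == 1 and name.isascii() and name.isalpha():
        return name
    return ""


def glyph_name_to_text(glyph_name):
    """Decode a glyph name such as 't_i' into the text it stands for."""
    base = glyph_name.split(".", 1)[0]  # drop suffixes such as ".alt"
    return "".join(_component_to_text(part) for part in base.split("_"))


def bfchar_entry(glyph_id, text):
    """Format one ToUnicode bfchar line: <glyph id> <UTF-16BE text>, both in hex."""
    return f"<{glyph_id:04X}> <{text.encode('utf-16-be').hex().upper()}>"


text = glyph_name_to_text("t_i")
print(text)                        # ti
print(bfchar_entry(0x01A5, text))  # <01A5> <00740069>
```

With an entry like that in the file, copy/paste and search have what 
they need to recover "t" followed by "i" from the single ligature glyph.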
