).

When HTML introduced the `b`/`strong` and `i`/`em` distinctions, it should also have added presentational/semantic pairs 

- `sup`/`exp` (exponent) or `pow` (power) and 
- `sub`/`idx`, `ind` (index) or `base`. 

I don't think the WHATWG or W3C would be interested in adding them now. 

From marius.spix at web.de  Mon Mar 22 12:37:20 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 22 Mar 2021 18:37:20 +0100
Subject: Aw: Re: HTML entities
In-Reply-To: <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From marius.spix at web.de  Mon Mar 22 12:44:10 2021
From: marius.spix at web.de (Marius Spix)
Date: Mon, 22 Mar 2021 18:44:10 +0100
Subject: Fw: Aw: Re: HTML entities
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Message-ID: 

An HTML attachment was scrubbed...
URL: 

From harjitmoe at outlook.com  Mon Mar 22 13:39:35 2021
From: harjitmoe at outlook.com (Harriet Riddle)
Date: Mon, 22 Mar 2021 18:39:35 +0000
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>
 ,

Message-ID: 

Several originally presentational elements have been re-defined in HTML5 as having vague semantics distinct from just a styled span, but also distinct from any similarly styled semantic elements; those which could not be were deprecated. This applies to more than just sup/sub, e.g. <i> is treated as a vaguely differentiated, but not emphasised, voice, such as commentary or a character's thoughts, et cetera.

This has some interesting effects: <small> has been interpreted as a de-emphasis and is still valid, while the accompanying <big> is deprecated since it could not be given a consistent distinctive semantic (e.g., headings should use heading elements).

-- Har.
________________________________
From: Unicode  on behalf of Marius Spix via Unicode 
Sent: Monday, March 22, 2021 5:44:10 PM
To: christoph.paeper at crissov.de 
Cc: unicode at unicode.org 
Subject: Fw: Aw: Re: HTML entities

I did some further research: The WHATWG spec differs from the Mozilla definition. It lists <sub> and <sup> in the text-level semantics section and states:

> These elements must be used only to mark up typographical conventions with specific meanings, not for typographical presentation for presentation's sake.
> The sub element can be used inside a var element, for variables that have subscripts.

See also: https://html.spec.whatwg.org/multipage/text-level-semantics.html#the-sub-and-sup-elements

Regards,

Marius Spix

Sent: Monday, 22 March 2021 at 18:37
From: "Marius Spix" 
To: christoph.paeper at crissov.de
Cc: unicode at unicode.org
Subject: Aw: Re: HTML entities
Dear Christoph,

according to Mozilla [1],

> The <sup> element should only be used for typographical reasons -- that is, to change the position of the text to comply with typographical conventions or standards, rather than solely for presentation or appearance purposes.

[1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup

Regards,

Marius Spix

Sent: Monday, 22 March 2021 at 18:17
From: "Christoph Päper via Unicode" 
To: unicode at unicode.org
Subject: Re: HTML entities
Marius Spix via Unicode :
>
> CSS is also no solution, because <sup> and <sub> are semantic tags (like , ,  and ) and not just stylistic ones (like , ,  or ).

When HTML introduced the `b`/`strong` and `i`/`em` distinctions, it should also have added presentational/semantic pairs

- `sup`/`exp` (exponent) or `pow` (power) and
- `sub`/`idx`, `ind` (index) or `base`.

I don't think the WHATWG or W3C would be interested in adding them now.


From asmusf at ix.netcom.com  Mon Mar 22 14:24:04 2021
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 22 Mar 2021 12:24:04 -0700
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Message-ID: 

An HTML attachment was scrubbed...
URL: 

From mark at macchiato.com  Mon Mar 22 14:27:44 2021
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 22 Mar 2021 12:27:44 -0700
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Message-ID: 

+1

On Mon, Mar 22, 2021, 12:26 Asmus Freytag via Unicode 
wrote:

> On 3/22/2021 10:37 AM, Marius Spix via Unicode wrote:
>
> Dear Christoph,
>
> according to Mozilla [1],
>
> "The <sup> element should only be used for typographical reasons -- that is,
> to change the position of the text to comply with typographical conventions
> or standards, rather than solely for presentation or appearance purposes."
>
> [1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
>
>
> Now, I have a hard time coming up with examples of "presentation or
> appearance" purposes that require small, raised letters or digits and are
> *not* related to some "typographical convention".
>
> The problem with <sup> seems to be more in the fact that there's more than
> one convention that might apply.
>
> A./
>
>
>
> Regards,
>
> Marius Spix
>
>
> *Sent:* Monday, 22 March 2021 at 18:17
> *From:* "Christoph Päper via Unicode" 
> 
> *To:* unicode at unicode.org
> *Subject:* Re: HTML entities
> Marius Spix via Unicode  :
> >
> > CSS is also no solution, because <sup> and <sub> are semantic tags (like
> , ,  and ) and not just stylistic ones (like ,
> ,  or ).
>
> When HTML introduced the `b`/`strong` and `i`/`em` distinctions, it should
> also have added presentational/semantic pairs
>
> - `sup`/`exp` (exponent) or `pow` (power) and
> - `sub`/`idx`, `ind` (index) or `base`.
>
> I don't think the WHATWG or W3C would be interested in adding them now.
>
>
>
>

From richard.wordingham at ntlworld.com  Mon Mar 22 17:16:24 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 22 Mar 2021 22:16:24 +0000
Subject: Keyboard Suddenly Outputting in NFD
Message-ID: <20210322221624.2b4d61ed@JRWUBU2>

I'm asking here because my searches turned up nothing.

I've just noticed that when I use my handrolled keyboard, which is designed
to output NFC, the text that appears on the terminal (Gnome-terminal) or
browser (Firefox into a Wikimedia form) is being stored as NFD UTF-8.
I use an M17n definition with fcitx on Ubuntu 16.04.3 as the input
method. It used to generate NFC; I'm not sure when it suddenly changed
to generating NFD text.

This change causes me grief because I am using grep to search data files
stored in NFC; grep does not respect canonical equivalence, so a typed
in sequence in NFD does not match the NFC data in the file. 

Does anyone know where this change has occurred?  Are there any quick
fixes?

I do have a grep-like search utility that respects canonical
equivalence, but it's a bit slow with a million-line input file.

Richard.
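
The mismatch described in this message can be reproduced in a few lines. What follows is only a minimal Python sketch, assuming the standard unicodedata module; the helper name nfc_grep and the sample strings are made up for illustration and are not from the original post. It normalises both the pattern and each line to NFC before a plain substring test, which is weaker than a search that fully respects canonical equivalence.

```python
import unicodedata


def nfc_grep(pattern, lines):
    """Return the lines that contain pattern, comparing both sides in NFC.

    Hypothetical helper: a plain substring search after normalising the
    pattern and every line, so that NFD input can match NFC data.
    """
    needle = unicodedata.normalize("NFC", pattern)
    return [line for line in lines
            if needle in unicodedata.normalize("NFC", line)]


# "e" followed by U+0301 COMBINING ACUTE ACCENT, as an NFD keyboard would emit.
typed = "cafe\u0301"
# The same word stored in NFC, with precomposed U+00E9.
stored = ["caf\u00e9 au lait", "tea"]

print(typed in stored[0])       # False: the code point sequences differ
print(nfc_grep(typed, stored))  # ['café au lait']
```

Normalising the pattern alone is enough when the data is known to be in NFC already; normalising both sides simply avoids relying on that.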

From duerst at it.aoyama.ac.jp  Mon Mar 22 18:23:29 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 23 Mar 2021 08:23:29 +0900
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Message-ID: 

Hello Asmus, others,

On 2021/03/23 04:24, Asmus Freytag via Unicode wrote:
> On 3/22/2021 10:37 AM, Marius Spix via Unicode wrote:
>> Dear Christoph,
>> according to Mozilla [1],
>> "The <sup> element should only be used for typographical reasons -- that is, to 
>> change the position of the text to comply with typographical conventions or 
>> standards, rather than solely for presentation or appearance purposes."
>> [1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
> 
> Now, I have a hard time coming up with examples of "presentation or appearance"
> purposes that require small, raised letters or digits and are *not* related to
> some "typographical convention".
> 
> The problem with <sup> seems to be more in the fact that there's more than one
> convention that might apply.

I agree that this text from MDN is not very good. I think that what it 
meant is something like "don't use <sup> if you want smaller, raised 
letters just for a change or just for fun". Also, of course, MDN is not 
a specification.

Regards,   Martin.

From duerst at it.aoyama.ac.jp  Mon Mar 22 18:44:11 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 23 Mar 2021 08:44:11 +0900
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: <3163040.UKa7oIsXr7@laptop>

 <001201d719c8$924b9720$b6e2c560$@xencraft.com>

 <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>

Message-ID: <1575b882-826a-151a-26b6-dfc41503df1c@it.aoyama.ac.jp>

Hello Marius, others,

On 2021/03/22 22:23, Marius Spix via Unicode wrote:
> You cannot just map <sup>2</sup> to SUPERSCRIPT TWO, because you may have cases
> with nested <sup> or <sub> like 10<sup>10<sup>100</sup></sup>, which is the
> representation of a number known as Googolplex, or ?<sub>CO<sub>2</sub></sub>,
> which is the percentage of carbon dioxide in an air sample. Such cases are not
> and should not be handled by Unicode, because their interpretation requires a
> stack machine.
> CSS is also no solution, because <sup> and <sub> are semantic tags (like ,
> ,  and ) and not just stylistic ones (like , ,  or ).

What I meant was not to use CSS instead of <sup> or <sub>, but to use it 
in addition to one of these. That should make it possible to address the 
browser's limitation on rendering superscripts and subscripts. Using CSS 
(and Web Fonts) it should be possible to get as close as needed in look 
and style to the builtin ??? superscript characters without actually 
using these characters. That would also make sure that none of these 
characters needs character entity references, and there is no worry 
about using a character that does not have a superscript (or subscript) 
variant in Unicode itself.

That would avoid the slippery slope problem both for character entity 
references and for Unicode superscript/subscript variants. And that's a 
very good thing, because whenever somebody comes up with a request for 
yet another of these, the only thing that is sure is that it won't be 
the last.

See an additional comment below.

> *Sent:* Monday, 22 March 2021 at 09:53
> *From:* "Jukka K. Korpela via Unicode" 
> *To:* "Martin J. Dürst" 
> *Cc:* "via Unicode" 
> *Subject:* Re: HTML entities
> Martin J. Dürst (duerst at it.aoyama.ac.jp ) wrote:
> 
>      Hello Jukka, others,
> 
>      On 2021/03/18 17:20, Jukka K. Korpela via Unicode wrote:
>       > Tex (textexin at xencraft.com ) wrote:
> 
>       >> However, you are quoting a doc that has been withdrawn.
> 
>       > It's a pity that this well-written and useful document was withdrawn, for
>       > reasons I don't understand.
> 
>      Here are the main reasons, as far as I understand them. Unicode gets
>      updated roughly once a year, and Web technology also changes over time.
>      There was not enough manpower to keep the document up to date.
> 
>      In addition, the document was always a kind of tug-of-war between those
>      who pushed for more favorable descriptions of specific Unicode
>      characters (such as ? in this discussion) or more favorable descriptions
>      of markup-based and style-based solutions (such as ).
> 
> Thank you for the description. These opposite views surely reflected different
> needs, such as the need to represent data in plain text in some contexts and the
> need for more structured representation.

Not only. They also were a front line in the discussion about how far 
Unicode should go in encoding characters with typographical/stylistic 
distinctions, or in other words, what should be the limits of plain text.

Regards,   Martin.

>      Well, and then somebody else uses 10^3.5 somewhere. How are you
>      going to express this so that it doesn't turn into 103.5 in plain text?
>      The problem is that there is always a limit somewhere for plain text.
> 
> Well, in the given case, it might help if we had IMPLIED EXPONENTIATION (we
> don't; we have U+2062 INVISIBLE TIMES, but it does not help here); at least it would
> appear in text data to indicate that adjacent digits are not part of the same
> number.
> 
> 
>      There is also always a limit somewhere for markup and styled rendering,
>      but it's in a quite different place.
> 
> Regarding exponents, the limit is currently set by the presence of superscript
> characters for digits, plus, and minus, and (for some reason), =, (, ), and n.
> This covers most of the cases where one might consider using superscripts in
> general texts and in expressing values of quantities.
> 
> But when you have, say, text that contains the simple expression a<sup>x</sup>, with x
> as a superscript denoting exponent, there is no satisfactory way to represent it
> in plain text. Using just ax would mean using a wrong expression, and using aˣ
> (with U+02E3 MODIFIER LETTER SMALL X) would be too tricky. Unicode hasn't got a
> repertoire of superscript Latin letters even though they are often used as
> semantically different from normal letters; it only has some of such letters,
> apparently meant for special uses only (like phonetic symbols).
> 
> 
>      Out of the box rendering of <sup> and <sub> may be rather crude, but I
>      guess it should be possible to do a lot better with some dose of CSS and
>      possibly some Web fonts.
> 
> In a sense, it would be straightforward to map, say, <sup>2</sup> to SUPERSCRIPT
> TWO in the rendering phase, either directly at the character level or via glyph
> selection when an OpenType font is used. In another sense, it would be
> complicated, since we hardly want to have <sup>2</sup> rendered substantially
> different from <sup>x</sup> in style. So the mapping should take place only when
> the entire document contains only such <sup> elements where all characters have
> superscript counterparts in Unicode (or at the glyph level).
> 
> Jukka
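
The all-or-nothing condition in the last quoted paragraph, together with the 10^3.5 case mentioned earlier, can be illustrated with a small Python sketch. This is a hedged illustration only: the table covers just the characters named in the thread (digits, plus, minus, equals, parentheses and n), the helper name to_superscript is hypothetical, and mapping the ASCII hyphen to U+207B SUPERSCRIPT MINUS is a simplification.

```python
# Characters with a superscript counterpart, as listed in the discussion:
# digits, plus, minus, equals, parentheses and the letter n.
SUPERSCRIPTS = {
    "0": "\u2070", "1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
    "4": "\u2074", "5": "\u2075", "6": "\u2076", "7": "\u2077",
    "8": "\u2078", "9": "\u2079", "+": "\u207a", "-": "\u207b",
    "=": "\u207c", "(": "\u207d", ")": "\u207e", "n": "\u207f",
}


def to_superscript(text):
    """Return text in superscript characters, or None when any character
    lacks a counterpart in the table (hypothetical helper)."""
    if not all(ch in SUPERSCRIPTS for ch in text):
        return None
    return "".join(SUPERSCRIPTS[ch] for ch in text)


print(to_superscript("2"))    # ²
print(to_superscript("-10"))  # ⁻¹⁰
print(to_superscript("3.5"))  # None: there is no superscript full stop
print(to_superscript("x"))    # None: "x" is not in this table
```

A renderer following the quoted suggestion would call such a function for every superscript run in a document and fall back to ordinary raised text everywhere if any call returned None.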

From asmusf at ix.netcom.com  Mon Mar 22 19:29:56 2021
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Mon, 22 Mar 2021 17:29:56 -0700
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Message-ID: 

On 3/22/2021 4:23 PM, Martin J. Dürst wrote:
> Hello Asmus, others,
>
> On 2021/03/23 04:24, Asmus Freytag via Unicode wrote:
>> On 3/22/2021 10:37 AM, Marius Spix via Unicode wrote:
>>> Dear Christoph,
>>> according to Mozilla [1],
>>> "The <sup> element should only be used for typographical 
>>> reasons -- that is, to change the position of the text to comply with 
>>> typographical conventions or standards, rather than solely for 
>>> presentation or appearance purposes."
>>> [1] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/sup
>>
>> Now, I have a hard time coming up with examples of "presentation or 
>> appearance"
>> purposes that require small, raised letters or digits and are *not* 
>> related to
>> some "typographical convention".
>>
>> The problem with <sup> seems to be more in the fact that there's more 
>> than one
>> convention that might apply.
>
> I agree that this text from MDN is not very good. I think that what it 
> meant is something like "don't use <sup> if you want smaller, raised 
> letters just for a change or just for fun". Also, of course, MDN is 
> not a specification.

Right, we get that.

In the unusual circumstance that I might want smaller, raised letters 
"just for fun", I may not care about a precise appearance, so I wouldn't 
pay attention to "rules" anyway.

The real issue with <sup> compared to <strong> is that language like 
that makes it masquerade as "semantic", when it isn't.

A./


From duerst at it.aoyama.ac.jp  Mon Mar 22 20:18:54 2021
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 23 Mar 2021 10:18:54 +0900
Subject: Aw: Re: HTML entities
In-Reply-To: 
References: 
 <21390B27-7E3D-4A7B-A7FD-F6EF83BED603@crissov.de>

Message-ID: <3aae7bb0-8a74-ae66-7cb6-d1e4623de9f1@it.aoyama.ac.jp>

Hello Asmus, others,

On 2021/03/23 09:29, Asmus Freytag (c) wrote:
> On 3/22/2021 4:23 PM, Martin J. Dürst wrote:

>> I agree that this text from MDN is not very good. I think that what it 
>> meant is something like "don't use <sup> if you want smaller, raised 
>> letters just for a change or just for fun". Also, of course, MDN is 
>> not a specification.
> 
> Right, we get that.
> 
> In the unusual circumstance that I might want smaller, raised letters 
> "just for fun", I may not care about a precise appearance, so I wouldn't 
> pay attention to "rules" anyway.
> 
> The real issue with <sup> compared to <strong> is that language like 
> that makes it masquerade as "semantic", when it isn't.

In my opinion, in these contexts, 'semantic' has to be seen as something 
with a degree. <strong> may have a higher degree of semantics than 
<sup>. For <sup>, it's essentially any kind of semantics that is usually 
displayed as a superscript, which could be e.g. an exponent, a 
superscript index in some mathematical or physical notation, a 
superscript in some phonetic notation, and so on. For <strong>, at least 
if we follow the meaning of the word 'strong' itself, it's any kind of 
semantics that implies some kind of strengthening, which still could be 
a rather wide range. In both cases, for finer semantics, an HTML class 
attribute might be used.

Regards,    Martin.

> A./
> 

From richard.wordingham at ntlworld.com  Tue Mar 23 03:11:45 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 23 Mar 2021 08:11:45 +0000
Subject: Keyboard Suddenly Outputting in NFD
In-Reply-To: <20210322221624.2b4d61ed@JRWUBU2>
References: <20210322221624.2b4d61ed@JRWUBU2>
Message-ID: <20210323081145.039fa11c@JRWUBU2>

On Mon, 22 Mar 2021 22:16:24 +0000
Richard Wordingham via Unicode  wrote:

> Are there any quick fixes?

There is one off-the-shelf fix.  Instead of typing

grep pattern file

one types

grep $(uconv -x any-nfc <<
References: 
Message-ID: 

Martin J. Dürst via Unicode :
> 
> Interesting idea to use the <rp> (Ruby parenthesis) element. But I'm sure there's a better (semantically more appropriate) way to use markup (+maybe styling) to hide the "^" but let it appear when in plain text.

I've asked just now: 

From ishida at w3.org  Tue Mar 23 06:04:21 2021
From: ishida at w3.org (r12a)
Date: Tue, 23 Mar 2021 11:04:21 +0000
Subject: HTML entities
In-Reply-To: 
References: <3163040.UKa7oIsXr7@laptop>

 <001201d719c8$924b9720$b6e2c560$@xencraft.com>

 <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>

Message-ID: <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>

fwiw, i was curious enough to check it out, and Unicode has the full 
ASCII lower-case alphabet except for q available as superscripted letters.

????????????????q???????????????????

ri

Jukka K. Korpela via Unicode wrote on 22/03/2021 08:53:
> Unicode hasn't got a repertoire of superscript Latin letters even 
> though they are often used as semantically different from normal 
> letters; it only has some of such letters, apparently meant for 
> special uses only (like phonetic symbols).


From kilobyte at angband.pl  Tue Mar 23 07:53:09 2021
From: kilobyte at angband.pl (Adam Borowski)
Date: Tue, 23 Mar 2021 13:53:09 +0100
Subject: HTML entities
In-Reply-To: <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>
References: <3163040.UKa7oIsXr7@laptop>

 <001201d719c8$924b9720$b6e2c560$@xencraft.com>

 <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>

 <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org>
Message-ID: 

On Tue, Mar 23, 2021 at 11:04:21AM +0000, r12a via Unicode wrote:
> Jukka K. Korpela via Unicode wrote on 22/03/2021 08:53:
> > Unicode hasn't got a repertoire of superscript Latin letters even though
> > they are often used as semantically different from normal letters; it only
> > has some of such letters, apparently meant for special uses only (like
> > phonetic symbols).

> fwiw, i was curious enough to check it out, and Unicode has the full ASCII
> lower-case alphabet except for q available as superscripted letters.
> 
> ????????????????q???????????????????

And for uppercase:
ᴬᴮCᴰᴱFᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾQᴿSᵀᵁⱽᵂXYZ
plus look-alikes: ???

The pipeline already includes CFQ.

Thus, what about adding the stragglers, ie, qSXYZ ?

On the other hand, subscript is nowhere close:
ₐbcdₑfgₕᵢⱼₖₗₘₙₒₚqᵣₛₜᵤᵥwₓyz
with no capitals.
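
Lists like these can be regenerated from the character database rather than typed by hand. The following Python sketch is one possible way to do it, under the assumption that "superscript form of a letter" means a character whose compatibility decomposition is exactly `<super>` (or `<sub>`) plus that single ASCII letter; the function name letters_with_variant is made up for this example. The result depends on the Unicode version bundled with the Python build, so no fixed output is shown.

```python
import sys
import unicodedata
from string import ascii_letters

ASCII_LETTERS = set(ascii_letters)


def letters_with_variant(tag):
    """Map each ASCII letter to the characters whose compatibility
    decomposition is "<tag> LETTER", where tag is "super" or "sub"."""
    found = {}
    prefix = "<" + tag + "> "
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp.startswith(prefix):
            continue
        parts = decomp[len(prefix):].split()
        if len(parts) == 1:  # ignore multi-character decompositions
            base = chr(int(parts[0], 16))
            if base in ASCII_LETTERS:
                found.setdefault(base, []).append(chr(cp))
    return found


supers = letters_with_variant("super")
subs = letters_with_variant("sub")
for label, table in (("superscript", supers), ("subscript", subs)):
    missing = "".join(c for c in ascii_letters if c not in table)
    print("no", label, "form:", missing)
```

Some letters come back with more than one candidate (for example the ordinal indicators also decompose to `<super>` plus a letter), which is one reason hand-made lists of this kind tend to differ.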

Meow!
-- 
??????? Latin:   meow 4 characters, 4 columns,  4 bytes
??????? Greek:   ???? 4 characters, 4 columns,  8 bytes
??????? Runes:   ???? 4 characters, 4 columns, 12 bytes
??????? Chinese: ?   1 character,  2 columns,  3 bytes <-- best!

From beckiergb at gmail.com  Tue Mar 23 14:24:08 2021
From: beckiergb at gmail.com (Rebecca Bettencourt)
Date: Tue, 23 Mar 2021 12:24:08 -0700
Subject: HTML entities
In-Reply-To: 
References: <3163040.UKa7oIsXr7@laptop>

 <001201d719c8$924b9720$b6e2c560$@xencraft.com>

 <1866eedd-c8d4-9586-26bd-46125690d920@it.aoyama.ac.jp>

 <7bc7d93a-da39-ea99-0d06-5df13c745425@w3.org> 
Message-ID: 

The pipeline also includes lowercase q.

You're not going to convince the UTC to encode superscript SXYZ unless you
find evidence of them being used as part of a phonetic transcription
system. That's the only use case that has gotten superscripts and
subscripts accepted in recent years; all other proposals have been
summarily dismissed.

-- Rebecca Bettencourt

On Tue, Mar 23, 2021 at 5:59 AM Adam Borowski via Unicode <
unicode at unicode.org> wrote:

> On Tue, Mar 23, 2021 at 11:04:21AM +0000, r12a via Unicode wrote:
> > Jukka K. Korpela via Unicode wrote on 22/03/2021 08:53:
> > > Unicode hasn't got a repertoire of superscript Latin letters even
> though
> > > they are often used as semantically different from normal letters; it
> only
> > > has some of such letters, apparently meant for special uses only (like
> > > phonetic symbols).
>
> > fwiw, i was curious enough to check it out, and Unicode has the full
> ASCII
> > lower-case alphabet except for q available as superscripted letters.
> >
> > ????????????????q???????????????????
>
> And for uppercase:
> ᴬᴮCᴰᴱFᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾQᴿSᵀᵁⱽᵂXYZ
> plus look-alikes: ???
>
> The pipeline already includes CFQ.
>
> Thus, what about adding the stragglers, ie, qSXYZ ?
>
>
> On the other hand, subscript is nowhere close:
> ₐbcdₑfgₕᵢⱼₖₗₘₙₒₚqᵣₛₜᵤᵥwₓyz
> with no capitals.
>
>
> Meow!
> --
> ??????? Latin:   meow 4 characters, 4 columns,  4 bytes
> ??????? Greek:   ???? 4 characters, 4 columns,  8 bytes
> ??????? Runes:   ???? 4 characters, 4 columns, 12 bytes
> ??????? Chinese: ?   1 character,  2 columns,  3 bytes <-- best!
>

From richard.wordingham at ntlworld.com  Tue Mar 23 15:18:43 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 23 Mar 2021 20:18:43 +0000
Subject: Keyboard Suddenly Outputting in NFD
In-Reply-To: 
References: <20210322221624.2b4d61ed@JRWUBU2> <20210323081145.039fa11c@JRWUBU2>

Message-ID: <20210323201843.383d05fa@JRWUBU2>

On Tue, 23 Mar 2021 11:22:44 +0100
Marius Spix  wrote:

> Logstash can be used for NFC normalization.

> Sent: Tuesday, 23 March 2021 at 09:11
> From: "Richard Wordingham via Unicode"

>> one types

>> grep $(uconv -x any-nfc <<
>> This won't work nicely if the pattern contains shell control
>> characters, such as spaces and dollars.

Ah, my solution is the wrong way round.  It should be:

uconv -x any-nfd | grep pattern

I should make the data match the search string!

Richard.

From lyratelle at gmx.de  Wed Mar 24 04:38:16 2021
From: lyratelle at gmx.de (Dominikus Dittes Scherkl)
Date: Wed, 24 Mar 2021 10:38:16 +0100
Subject: HTML entities
In-Reply-To: <7A9EB686-D4A3-4E8E-BD11-64E4D8447746@crissov.de>
References: 
 <7A9EB686-D4A3-4E8E-BD11-64E4D8447746@crissov.de>
Message-ID: <8c84d590-e920-2b6c-d2bb-5a71665a50ed@gmx.de>

On 21.03.21 at 13:18, Christoph Päper via Unicode wrote:
>> Martin J. Dürst via Unicode :
>>
>> Interesting idea to use the <rp> (Ruby parenthesis) element. But I'm sure there's a better (semantically more appropriate) way to use markup (+maybe styling) to hide the "^" but let it appear when in plain text.
>
> I don't think there's one in HTML
>
> Following the precedent set by U+2064 Invisible Plus (e.g. between integer and vulgar fraction) and U+2062 Invisible Times (e.g. between letter constants or variables), Unicode could add X+2065 Invisible Exponentiation (or Invisible Opening Parenthesis and Invisible Closing Parenthesis).
>
Yes, I think adding an "Invisible Exponent" character to Unicode would
really help solving this semantic distinction problem in plain text.

--
                                          Dominikus Dittes Scherkl
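
For readers unfamiliar with the existing invisible operators that the proposal builds on, here is a short Python sketch using only the standard unicodedata module. The sample string is an illustration, not taken from the thread; nothing here models the proposed exponentiation character, which does not exist.

```python
import unicodedata

# The four invisible mathematical operators at U+2061..U+2064.
INVISIBLE_OPERATORS = ["\u2061", "\u2062", "\u2063", "\u2064"]

for ch in INVISIBLE_OPERATORS:
    # Print code point, character name and general category.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# A mixed number with INVISIBLE PLUS between the integer and the fraction.
mixed = "2\u2064\u00bd"
print(len(mixed))  # 3 code points, although only "2½" is visible
```

All four are format characters (general category Cf), so they survive in plain text without taking up any space, which is the behaviour an invisible exponent marker would need as well.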

From corentin.jabot at gmail.com  Fri Mar 26 06:44:11 2021
From: corentin.jabot at gmail.com (Corentin)
Date: Fri, 26 Mar 2021 12:44:11 +0100
Subject: White spaces for the purpose of programming languages
Message-ID: 

Hello

In UAX #44, White_space is described as "Spaces, separator characters and
other control characters which should be treated by programming languages
as "white space" for the purpose of parsing elements."

From what I can tell, ECMAScript/JS uses White_space (or
rather Space_Separator which is slightly different), Rust uses
Pattern_White_Space which is a more restricted set, while most other
languages seem to only support the ASCII spaces.

I wanted to confirm that the intent is that White_Space is recommended in
programming languages.
I assumed that Pattern_White_Space would be more suitable for that purpose,
but it isn't actually clear from a reading of UAX #31,

which first states in its introduction:
> A common task facing an implementer of the Unicode Standard is the
provision of a parsing and/or lexing engine for identifiers, such as
programming language variables or domain names.

But later:

Pattern Syntax : There are many circumstances where software interprets
patterns that are a mixture of literal characters, whitespace, and syntax
characters. Examples include regular expressions, Java collation rules,
Excel or ICU number formats, and many others.

(programming languages are not mentioned there)

Any clarification as to whether White_Space should be considered over
Pattern_White_Space for programming languages would be appreciated :)

I think that clarification might be useful for many users as different
programming languages have made different choices!

Thanks,

Corentin
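
To make the difference between the properties in this question concrete, here is a Python sketch. The two code point lists are transcribed by hand from the Unicode Character Database as an assumption and should be checked against PropList.txt; only the Space_Separator (Zs) set is computed, via the standard unicodedata module. The helper names cps and show are illustrative.

```python
import unicodedata


def cps(*items):
    """Build a set of code points from single values and (lo, hi) ranges."""
    out = set()
    for item in items:
        lo, hi = item if isinstance(item, tuple) else (item, item)
        out.update(range(lo, hi + 1))
    return out


# Assumed contents of the two binary properties (verify against PropList.txt).
WHITE_SPACE = cps((0x09, 0x0D), 0x20, 0x85, 0xA0, 0x1680, (0x2000, 0x200A),
                  0x2028, 0x2029, 0x202F, 0x205F, 0x3000)
PATTERN_WHITE_SPACE = cps((0x09, 0x0D), 0x20, 0x85, 0x200E, 0x200F,
                          0x2028, 0x2029)
# Members of White_Space whose general category is Zs (Space_Separator).
SPACE_SEPARATOR = {cp for cp in WHITE_SPACE
                   if unicodedata.category(chr(cp)) == "Zs"}


def show(label, code_points):
    print(label, " ".join(f"U+{cp:04X}" for cp in sorted(code_points)))


show("White_Space only:        ", WHITE_SPACE - PATTERN_WHITE_SPACE)
show("Pattern_White_Space only:", PATTERN_WHITE_SPACE - WHITE_SPACE)
show("White_Space but not Zs:  ", WHITE_SPACE - SPACE_SEPARATOR)
```

Under these assumptions, Pattern_White_Space is not a subset of White_Space: it adds the two directional marks U+200E and U+200F, which are not white space in the general sense.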

From mandel59 at gmail.com  Fri Mar 26 11:43:30 2021
From: mandel59 at gmail.com (Ryusei)
Date: Sat, 27 Mar 2021 01:43:30 +0900
Subject: What is Urdu paragraph separator?
Message-ID: <2A159DCE-B621-4330-8997-821DFB613E42@gmail.com>

Hello

According to NamesList.txt, U+203B is for several usages:

> 203B	REFERENCE MARK
> 	= Japanese kome
> 	= Urdu paragraph separator
> 	x (tibetan ku ru kha bzhi mig can - 0FBF)
> 	x (cjk unified ideograph-200AD - 200AD)

I know Japanese komejirushi, and a page of Wikipedia shows a good real-life usage in Japan.

But I have never heard of the Urdu paragraph separator. How is it used? And why are the Urdu separator mark and the East Asian reference mark unified? (I think unifying marks of different scripts is likely to cause typographic issues, especially where font fallback is required. I don't expect the Urdu separator mark to be rendered as a fullwidth character.)

Thanks,
Ryusei

From wjgo_10009 at btinternet.com  Fri Mar 26 13:45:44 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 26 Mar 2021 18:45:44 +0000 (GMT)
Subject: A poem using language-independent glyphs
Message-ID: <67cbea6c.1298.1786fdb4925.Webtop.87@btinternet.com>

Here is a link to a forum post that I produced today.

https://forum.affinity.serif.com/index.php?/topic/138654-artwork-for-greetings-cards/

I am hoping that in time these glyphs, and others, will become 
accessible within regular Unicode using a mechanism related to, yet a 
little different from, the mechanism used for QID emoji.

The mechanism being to use a tag exclamation mark rather than the tag Q 
used for QID emoji in the original proposal.

William Overington

Friday 26 March 2021

From richard.wordingham at ntlworld.com  Sat Mar 27 14:00:22 2021
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sat, 27 Mar 2021 19:00:22 +0000
Subject: Keyboard Suddenly Outputting in NFD
In-Reply-To: <20210322221624.2b4d61ed@JRWUBU2>
References: <20210322221624.2b4d61ed@JRWUBU2>
Message-ID: <20210327190022.53c248cf@JRWUBU2>

On Mon, 22 Mar 2021 22:16:24 +0000
Richard Wordingham via Unicode  wrote:

** FALSE ALARM! **

> I've just noticed that when I use my handrolled keyboard designed to
> output NFC, what appears on the terminal (Gnome-terminal) or browser
> (Firefox into a Wikimedia form), my text is being stored as NFD UTF-8.
> I use an M17n definition with fcitx on Ubuntu 16.04.3 as the input
> method. It used to generate NFC; I'm not sure when it suddenly changed
> to generating NFD text.

Sorry, it can't have been working as well as I thought it did.  I seem
to have slightly broken the keyboard in October 2020.

(The keyboard converts XSAMPA input to IPA in NFC.   The immediate idea
was to apply the transform for the string "_s" when there is a
transform for "a_M", without having to define a transform for "a_s".  I
defined a transform for "a_", that left "_" in the pending input, but
that extended transform then fired instead of the one for "a_M".)

Richard.

From wjgo_10009 at btinternet.com  Wed Mar 31 16:18:35 2021
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Wed, 31 Mar 2021 22:18:35 +0100 (BST)
Subject: A poem using language-independent glyphs
In-Reply-To: <67cbea6c.1298.1786fdb4925.Webtop.87@btinternet.com>
References: <67cbea6c.1298.1786fdb4925.Webtop.87@btinternet.com>
Message-ID: <7c2ed17b.153d.1788a270029.Webtop.108@btinternet.com>

The thread now has 28 posts in it and over 800 views.

There are now three poems using language-independent glyphs, two of the 
poems written today.

https://forum.affinity.serif.com/index.php?/topic/138654-artwork-for-greetings-cards

William Overington

Wednesday 31 March 2021

------ Original Message ------
From: "William_J_G Overington via Unicode" 
To: unicode at unicode.org
Sent: Friday, 2021 Mar 26 At 18:45
Subject: A poem using language-independent glyphs
Here is a link to a forum post that I produced today.
https://forum.affinity.serif.com/index.php?/topic/138654-artwork-for-greetings-cards/ 

I am hoping that in time these glyphs, and others, will become 
accessible within regular Unicode using a mechanism related to, yet a 
little different from, the mechanism used for QID emoji.
The mechanism being to use a tag exclamation mark rather than the tag Q 
used for QID emoji in the original proposal.
William Overington
Friday 26 March 2021


From markus.icu at gmail.com  Wed Mar 31 22:10:01 2021
From: markus.icu at gmail.com (Markus Scherer)
Date: Wed, 31 Mar 2021 20:10:01 -0700
Subject: White spaces for the purpose of programming languages
In-Reply-To: 
References: 
Message-ID: 

On Fri, Mar 26, 2021 at 4:50 AM Corentin via Unicode 
wrote:

> In UAX #44, White_space is described as "Spaces, separator characters and
> other control characters which should be treated by programming languages
> as "white space" for the purpose of parsing elements."
>
> From what I can tell, ECMAScript/JS uses White_space (or
> rather Space_Separator which is slightly different), Rust uses
> Pattern_White_Space which is a more restricted set, while most other
> languages seem to only support the ASCII spaces.
>
> I wanted to confirm that the intent is that White_Space is recommended in
> programming languages.
> I assumed that Pattern_White_Space would be more suitable for that purpose,
> but it isn't actually clear from a reading of UAX31
>

We came up with Pattern_White_Space for working with ICU *rule and pattern
strings* (e.g., rules to define sort orders, rules for number spellout,
date/time/number formatting patterns).
This is why we included the RLM and LRM controls -- making it easy to keep
rule strings legible when there are RTL characters.
(If we were defining it now, I assume that we would also include the newer
ALM (U+061C), but the property is immutable so we can't add anything.)

We proposed this as a Unicode property because it seemed useful.
We were not specifically thinking about whole programming languages.
I assume that existing languages are not going to want to make a change
here.

When parsing *user input*, we generally look for all White_Space where
"space" is allowed.

Personally, I think that White_Space is unnecessarily broad for programming
language syntax.
Pattern_White_Space might be a useful starting point.

   - The bidi controls should probably not be programming "white space" on
   their own because they don't have any advance width. They should be allowed
   somewhere, maybe at token boundaries or after indenting spaces.
   - U+0085 NEL is a holdover from OS/390 and the line feed confusion on
   IBM systems. (They didn't much care what LF/NEL mapped to because their
   text systems had a "record" per line and didn't need a line separator
   character like Unix-y systems.)
      - I can't tell if the EBCDIC platforms are "alive". Elsewhere I have
      tried to find out if there is a competent C++11 compiler available.
   - Line & paragraph separators apparently never got much use.
   - Form feed? Vertical tab?
   - East Asian developers might appreciate U+3000 ideographic space
   because their IMEs tend to emit that.

So maybe just TAB, LF, CR, space (0020), and possibly wide space (3000),
plus also LRM/RLM/ALM at certain boundaries?

Best regards,
markus
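
The restricted set suggested in the closing paragraph can be sketched as a toy tokenizer in Python. This is an interpretation, not a specification: "at certain boundaries" is read here as "at the start or end of a whitespace-delimited token", and the name tokenize and the sample inputs are hypothetical.

```python
import re

# Assumed whitespace set from the suggestion above:
# TAB, LF, CR, SPACE and U+3000 IDEOGRAPHIC SPACE.
_SPLIT = re.compile("[\t\n\r \u3000]+")
# LRM, RLM and ALM: dropped only at the edges of a token in this sketch.
_BIDI_MARKS = "\u200e\u200f\u061c"


def tokenize(source):
    """Split source on the restricted whitespace set and drop bidi marks
    that sit at token boundaries; marks inside a token are left alone."""
    tokens = []
    for chunk in _SPLIT.split(source):
        token = chunk.strip(_BIDI_MARKS)
        if token:
            tokens.append(token)
    return tokens


print(tokenize("let\u3000x\u200e = 1\n"))  # ['let', 'x', '=', '1']
```

Characters such as U+00A0 or U+2028 are deliberately not separators here, so a real lexer built this way would have to decide whether to reject them or treat them as part of a token.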