From piotrunio-2004 at wp.pl Thu Dec 4 06:35:47 2025
From: piotrunio-2004 at wp.pl (=?UTF-8?Q?piotrunio-2004=40wp=2Epl?=)
Date: Thu, 04 Dec 2025 13:35:47 +0100
Subject: =?UTF-8?Q?Re=3A_Odp=3A_RE=3A_What_to_do_if_a_legacy_compatibility_character_is_defective=3F?=
In-Reply-To: <<59598cd6e838407f88138b93a707aca2@grupawp.pl>>
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl>
Message-ID: 

I have investigated the situation further and it seems that the defect in the Unicode 13.0–17.0 mapping is even more fundamental than I previously thought. In particular, the proposal L2/25-037 does not acknowledge the proposal L2/00-159, which had already been incorporated into Unicode 3.2. In that proposal, the description of the characters U+23B8 (LEFT VERTICAL BOX LINE) and U+23B9 (RIGHT VERTICAL BOX LINE) exactly matches the proposed characters L2/25-037:1FBFC (BOX DRAWINGS LIGHT LEFT EDGE) and L2/25-037:1FBFD (BOX DRAWINGS LIGHT RIGHT EDGE). In both proposals, those two characters are specified to be aligned to the left or right edge, to span the entire edge (extending to the top and bottom), and to match the thickness of Box Drawings Light lines. The description of the characters U+23BA (HORIZONTAL SCAN LINE-1) and U+23BD (HORIZONTAL SCAN LINE-9) also exactly matches the proposed characters L2/25-037:1FBFA (BOX DRAWINGS LIGHT TOP EDGE) and L2/25-037:1FBFB (BOX DRAWINGS LIGHT BOTTOM EDGE). In both proposals, those two characters are specified to be aligned to the top and bottom edges, to span the entire edge (extending to the left and right), and to match the thickness of Box Drawings Light lines. However, the proposal L2/00-159 had already set a precedent for the usage of [U+23BA, U+23BD, U+23B8, U+23B9] (and not the 1/8 blocks or 1/4 blocks) in the mapping to certain platforms such as The Heath/Zenith 19 Graphics Character Set and The DEC Special Graphics Character Set. This contrasts with the usage of the 1/8 blocks [U+2594, U+2581, U+258F, U+2595] and other related 1/8 or 7/8 block characters in the mapping to PETSCII and Apple II.

Therefore there is a discrepancy between the legacy platforms added in Unicode 3.2 (which use the box drawing lines 23B8, 23B9, 23BA, 23BD) and the legacy platforms added in Unicode 13.0–17.0 (which use the 1/8 blocks 2594, 2581, 258F, 2595).

On 25 October 2025 at 10:27, piotrunio-2004 at wp.pl via Unicode <unicode at corp.unicode.org> wrote:

On 25 October 2025 at 08:29, Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

Again, the identity of the Unicode character is given by encoding the intended mappings. If Unicode decides to map the same character to similar characters on different platforms, that is not a problem, as long as implementers know that the intent is to use a platform-specific rendering (and not assume that there is only one possible rendering per character). If you feel that the guidance available to implementers in the text of the standard or in an annotation of the nameslist is not sufficient, then the remedy would be to ask for the explanation to be updated. We are unfortunately locked in as far as character names are concerned, but we can add a note (best in the text of the standard) that explains that emulators for some systems will need an adjusted design so a sequence or other arrangement of these characters looks correct.

Indeed the character names cannot be changed due to stability policies.
An explanation note has been provided for U+1FB81 that claims "The lines corresponding to 3 and 5 are not actually block elements, but can show any horizontally repeating pattern", but it still implicitly enforces 1/8 blocks for the top and bottom. However, this doesn't address other cases such as the PETSCII C64 variation. And if 1FB70–1FB81, 1FBB5–1FBB8, and 1FBBC were all noted to no longer require exact 1/8 blocks, that would also not remedy the issue, because it would introduce an inconsistency with the existing 1/8 or 7/8 block characters 2581, 2589, 258F, 2594–2595, which already have established compatibility precedents that require the exact fraction, but are also used in the Unicode 13.0 mapping to the PETSCII and Apple II character sets despite those platforms using varying thickness (consistent with light box drawings, except for the 1/8 top and bottom blocks in C64, where the 1/4 top and bottom blocks are made consistent instead).

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmusf at ix.netcom.com Thu Dec 4 14:15:27 2025
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 4 Dec 2025 12:15:27 -0800
Subject: Odp: RE: What to do if a legacy compatibility character is defective?
In-Reply-To: 
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl>
Message-ID: <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com>

On 12/4/2025 4:35 AM, piotrunio-2004 at wp.pl via Unicode wrote:
> I have investigated the situation further and it seems that the defect in > the Unicode 13.0–17.0 mapping is even more fundamental than I > previously thought. In particular, the proposal L2/25-037 does not > acknowledge the proposal L2/00-159, which had already been > incorporated into Unicode 3.2. In that proposal, the description of > characters U+23B8 (LEFT VERTICAL BOX LINE) and U+23B9 (RIGHT VERTICAL > BOX LINE) exactly matches the proposed characters L2/25-037:1FBFC (BOX > DRAWINGS LIGHT LEFT EDGE) and L2/25-037:1FBFD (BOX DRAWINGS LIGHT > RIGHT EDGE). In both proposals, those two characters are specified to > be aligned to the left or right edge, span the entire edge (extending to > the top and bottom), and match the thickness of Box Drawings Light > lines. The description of the characters U+23BA (HORIZONTAL SCAN > LINE-1) and U+23BD (HORIZONTAL SCAN LINE-9) also exactly matches the > proposed characters L2/25-037:1FBFA (BOX DRAWINGS LIGHT TOP EDGE) and > L2/25-037:1FBFB (BOX DRAWINGS LIGHT BOTTOM EDGE). In both proposals, > those two characters are specified to be aligned to the top and bottom > edges, span the entire edge (extending to the left and right), and > match the thickness of Box Drawings Light lines. However, the proposal > L2/00-159 had already set a precedent for usage of [U+23BA, U+23BD, > U+23B8, U+23B9] (and not the 1/8 blocks or 1/4 blocks) in mapping to > certain platforms such as The Heath/Zenith 19 Graphics Character Set > and The DEC Special Graphics Character Set. This contrasts with the > usage of the 1/8 blocks [U+2594, U+2581, U+258F, U+2595] and other related > 1/8 or 7/8 block characters in the mapping to PETSCII and Apple II. > Therefore there is a discrepancy between the legacy platforms added in > Unicode 3.2 (which use the box drawing lines 23B8, 23B9, 23BA, 23BD) > and the legacy platforms added in Unicode 13.0–17.0 (which use the 1/8 > blocks 2594, 2581, 258F, 2595).
> > On 25 October 2025 at 10:27, piotrunio-2004 at wp.pl via Unicode > wrote: > > > On 25 October 2025 at 08:29, Asmus Freytag via Unicode > wrote: > > Again, the identity of the Unicode character is given by > encoding the intended mappings. If Unicode decides to map the > same character to similar characters on different platforms, > that is not a problem, as long as implementers know that the > intent is to use a platform-specific rendering (and not assume > that there is only one possible rendering per character). > > If you feel that the guidance available to implementers in the > text of the standard or in an annotation of the nameslist is > not sufficient, then the remedy would be to ask for the > explanation to be updated. We are unfortunately locked in as > far as character names are concerned, but we can add a note > (best in the text of the standard) that explains that > emulators for some systems will need an adjusted design so a > sequence or other arrangement of these characters looks correct. > > Indeed the character names cannot be changed due to stability > policies. An explanation note has been provided for U+1FB81 that > claims "The lines corresponding to 3 and 5 are not actually block > elements, but can show any horizontally repeating pattern", but > still implicitly enforces 1/8 blocks for the top and bottom. However, > this doesn't address other cases such as the PETSCII C64 > variation. And if 1FB70–1FB81, 1FBB5–1FBB8, 1FBBC were all noted to > no longer require exact 1/8 blocks, that would also not remedy the > issue because it would introduce an inconsistency with the > existing 1/8 or 7/8 block characters 2581 2589 258F 2594–2595, > which already have established compatibility precedents that > require the exact fraction, but are also used in the Unicode 13.0 > mapping to PETSCII and Apple II character sets despite those > platforms using varying thickness (consistent with light box > drawings, except for the 1/8 top and bottom blocks in C64, where > the 1/4 top and bottom blocks are made consistent instead). > > > >

What is missing is an actual proposal. That is, not just analysis or exposition, but actual proposed wording or proposed encoding that would fix the issue. That would need to be provided as a UTC document (aka L2 document) submission, with the analysis appended in a background section.

A./

PS: I am not convinced that platform-specific mappings (glyphs) are an issue, because the scenario where these data are reliably transferred *between* legacy implementations can't have existed then, so it's questionable why it needs to be perfect today. My assumption would be that the use case is lossless round trip from (each) legacy emulator to Unicode and back. Having PETSCII / Apple II specific characters does not improve things, because any data stream containing those could not be displayed on any other emulator. This is different from legacy characters mapped to letters and common text symbols because we have an expectation that we can share text across devices (or emulators).

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From piotrunio-2004 at wp.pl Thu Dec 4 16:37:29 2025 From: piotrunio-2004 at wp.pl (=?UTF-8?Q?piotrunio-2004=40wp=2Epl?=) Date: Thu, 04 Dec 2025 23:37:29 +0100 Subject: =?UTF-8?Q?Re=3A_Odp=3A_RE=3A_What_to_do_if_a_legacy_compatibility_character_is_defective=3F?= In-Reply-To: <<16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com>> References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl> <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com> Message-ID: Dnia 04 grudnia 2025 21:28 Asmus Freytag via Unicode <unicode at corp.unicode.org> napisa?(a): On 12/4/2025 4:35 AM, piotrunio-2004 at wp.pl via Unicode wrote: I have investigated the situation further and it seems that defect in the Unicode 13.0?17.0 mapping is even more fundamental than I previously thought. In particular, the proposal L2/25-037 does not acknowledge the proposal L2/00-159, which had already been incorporated into Unicode 3.2.?In that proposal, the description of characters U+23B8 (LEFT VERTICAL BOX LINE) and U+23B9 (RIGHT VERTICAL BOX LINE) exactly matches the proposed characters L2/25-037:1FBFC (BOX DRAWINGS LIGHT LEFT EDGE) and L2/25-037:1FBFD (BOX DRAWINGS LIGHT RIGHT EDGE). In both proposals, those two characters are specified to be aligned to left or right edge, span the entire edge (extending to the top and bottom), and match the thickness of Box Drawings Light lines. The description of the characters U+23BA (HORIZONTAL SCAN LINE-1) and U+23BD (HORIZONTAL SCAN LINE-9) also exactly matches the proposed characters L2/25-037:1FBFA (BOX DRAWINGS LIGHT TOP EDGE) and L2/25-037:1FBFB (BOX DRAWINGS LIGHT BOTTOM EDGE). In both proposals, those two characters are specified to be aligned to top and bottom edges, span the entire edge (extending to the left and right), and match the thickness of Box Drawings Light lines. However, the proposal L2/00-159 had already set precedent for usage of [U+23BA, U+23BD, U+23B8, U+23B9] (and not the 1?8 blocks or 1?4 blocks) in mapping to certain platforms such as?The Heath/Zenith 19 Graphics Character Set and?The DEC Special Graphics Character Set. This contrasts with the usage of 1?8 blocks [U+2594, U+2581, U+258F, U+2595] and other related 1?8 or 7?8 block characters in the mapping to PETSCII and Apple II.? Therefore there is a discrepancy between the legacy platforms added in Unicode 3.2 (which use the box drawing lines 23B8, 23B9, 23BA, 23BD) and the legacy platforms added in Unicode 13.0?17.0 (which use 1?8 blocks 2594, 2581, 258F, 2595). Dnia 25 pa?dziernika 2025 10:27 piotrunio-2004 at wp.pl via Unicode <unicode at corp.unicode.org> napisa?(a): Dnia 25 pa?dziernika 2025 08:29 Asmus Freytag via Unicode <unicode at corp.unicode.org> napisa?(a): Again, the identity of the Unicode character is giving by encoding the intended mappings. If Unicode decides to map the same character to similar characters on different platforms, that is not a problem, as long as implementers know that the intent is to use a platform-specific rendering (and not assume that there is only one possible rendering per character). If you feel that the guidance available to implementers in the text of the standard or in an annotation of the nameslist is not sufficent, then the remedy would be to ask for the explanation to be updated. 
We are unfortunately locked in as far as character names are concerned, but we can add a note (best in the text of the standard) that explains that emulators for some systems will need an adjusted design so a sequence or other arrangement of these characters looks correct. Indeed the character names cannot be changed due to stability policies. An explanation note has been provided for U+1FB81 that claims "The lines corresponding to 3 and 5 are not actually block elements, but can show any horizontally repeating pattern", but still implicitly enforces 1?8 blocks for top and bottom. However, this doesn't address other cases such as the PETSCII C64 variation. And if?1FB70?1FB81 1FBB5?1FBB8 1FBBC were all noted to no longer require exact 1?8 blocks, that would also not remedy the issue because it would introduce an inconsistency with the existing 1?8 or 7?8 block characters 2581 2589 258F 2594?2595, which already have established compatibility precedents that require the exact fraction, but are also used in the Unicode 13.0 mapping to PETSCII and Apple II character sets despite those platforms using varying thickness (consistent with light box drawings, except for the 1?8 top and bottom blocks in C64, where the 1?4 top and bottom blocks are made consistent instead). What is missing is an actual proposal. That is, not just analysis or exposition, but actual proposed wording or proposed encoding that would fix the issue. That would need to be provided as a UTC document (aka L2 document) submission, with the analysis appended in a background section.? A./ PS: I am not convinced that platform-specific mappings (glyphs) are an issue, because the scenario where these data are reliably transferred *between* legacy implementations can't have existed then, so it's questionably why it needs to be perfect today. My assumption would be that the use case is lossless round trip from (each) legacy emulator to Unicode and back. Having PETSII / Apple II specific characters does not improve things, because any data stream containing those could not be displayed on any other emulator. This is different from legacy characters mapped to letters and common text symbols because we have an expectation that we can share text across devices (or emulators). I have a draft of a follow up of L2/25-037?that analyzes the character sets thoroughly with the additional context provided by?L2/00-159 characters (including the particularly complex relationship between box drawings, 1?8 blocks, and 1?4 blocks in PETSCII), provides additional explanation and screenshot of evidence of HP 264x character in both isolated and in connected usage, and arrives at the conclusion that 23 characters (that is, all in?L2/25-037 except for the 4 that were already added by?L2/00-159) should be added. However, the SEW announced that they will not be discussing these characters any further, so how could any follow up of the proposal possibly get incorporated into Unicode? -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Thu Dec 4 18:11:46 2025 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 4 Dec 2025 16:11:46 -0800 Subject: Odp: RE: What to do if a legacy compatibility character is defective? 
In-Reply-To: 
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl> <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com>
Message-ID: <46cf18b2-21ae-4aaf-9c44-9b641e0d3a27@ix.netcom.com>

On 12/4/2025 2:37 PM, piotrunio-2004 at wp.pl via Unicode wrote:
> However, the SEW announced that they will not be discussing these > characters any further, so how could any follow up of the proposal > possibly get incorporated into Unicode?

Nothing can force the SEW to accept any particular proposal. However, unless there's a document actually submitted, there's nothing that will happen at all, no matter what.

If it were up to me, I would focus on suggesting specific language for the standard or the nameslist rather than proposing new characters. Such feedback may be reviewed by other working groups, not solely SEW.

A./

From piotrunio-2004 at wp.pl Fri Dec 5 00:55:17 2025
From: piotrunio-2004 at wp.pl (=?UTF-8?Q?piotrunio-2004=40wp=2Epl?=)
Date: Fri, 05 Dec 2025 07:55:17 +0100
Subject: =?UTF-8?Q?Re=3A_Odp=3A_RE=3A_What_to_do_if_a_legacy_compatibility_character_is_defective=3F?=
In-Reply-To: <<46cf18b2-21ae-4aaf-9c44-9b641e0d3a27@ix.netcom.com>>
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl> <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com> <46cf18b2-21ae-4aaf-9c44-9b641e0d3a27@ix.netcom.com>
Message-ID: 

On 5 December 2025 at 01:23, Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

On 12/4/2025 2:37 PM, piotrunio-2004 at wp.pl via Unicode wrote:

However, the SEW announced that they will not be discussing these characters any further, so how could any follow up of the proposal possibly get incorporated into Unicode?

Nothing can force the SEW to accept any particular proposal. However, unless there's a document actually submitted, there's nothing that will happen at all, no matter what.

I have already submitted the draft of the follow up. However, I don't intend to force the characters to be accepted; instead, I requested that the SEW be made aware of the information in the follow up, so that I can continue receiving more detailed feedback that I can then use to further clarify the issue or explore potential alternative solutions.

If it were up to me, I would focus on suggesting specific language for the standard or the nameslist rather than proposing new characters. Such feedback may be reviewed by other working groups, not solely SEW.

A./

The issue for the 1/8 blocks (in Apple II, PETSCII, etc.) is that their encoding policy is not consistent with the previously encoded box drawing lines (in Heath/Zenith 19, DEC Special Graphics, etc.), which makes it inappropriate to use 1/8 blocks in the encoding of those characters for those platforms. The issue is even worse for the C64/C128 PETSCII mapping, which is not even consistent within the same platform: the characters that L2/19-025 mapped to the 1/8 blocks 1FB7C–1FBFF (????) are aligned with the top and bottom 1/4 blocks (??) but are misaligned with the top and bottom 1/8 blocks (??), which makes this a misuse of the 1/8 block character identity (in the context of that platform, the top and bottom 1/4 blocks have the same thickness as the box drawings, which better matches the usage of 23BA 23BD ?? instead).
Whereas the issue for the HP 264x character is that the character can be used independently of the other character that Unicode unified it with, and that it forms a distinct connection type. As far as I can tell, those issues are baked into the existing Unicode 13.0–17.0 mappings of those platforms, so I don't see how 'specific language for the standard or the nameslist' could possibly address those issues.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at akphs.com Sun Dec 14 10:33:03 2025
From: lists at akphs.com (Phil Smith III)
Date: Sun, 14 Dec 2025 11:33:03 -0500
Subject: Combining characters
Message-ID: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>

This may be dumb/hopelessly naïve but here goes!

My observations/inferences/suppositions re combining characters:
* They were originally implemented as a way to reduce the total number of characters
* We're well past any likely original vision of the number of characters/scripts anyway, so that "savings" is kinda meaningless
* Combiners are a pain overall (normalization!)
* Barring Earth joining the Galactic Federation and Unicode deciding to include all twelve billion alien languages, the current scheme will suffice forever (yeah, yeah, I know, "never say forever", but this feels like IPv6 addresses, "there are just SO many ...")

Ergo, I posit that there should never be a need for any NEW combiners. Is there any sort of official or unofficial policy to that end? Inquiring minds and all that...

Thanks,
...phsiii

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org Sun Dec 14 11:16:19 2025
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 14 Dec 2025 17:16:19 +0000
Subject: Combining characters
In-Reply-To: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
Message-ID: 

Phil Smith III wrote:

> My observations/inferences/suppositions re combining characters:
> - They were originally implemented as a way to reduce the total number > of characters

Another, possibly more farsighted reason is that, if a newly needed letter-with-diacritic can be represented today with an existing letter and an existing diacritic, instead of waiting possibly years for the precomposed combination to be encoded, that time saving is a big win for the user community.

> - Combiners are a pain overall (normalization!)
> [...]
> Ergo, I posit that there should never be a need for any NEW combiners.

Combining-character mechanisms are already implemented at many levels (processing, counting, fonts, rendering engines, etc.). More combining characters that work essentially the same as existing ones don't really add to the pain.

-- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From lists at akphs.com Sun Dec 14 11:37:35 2025
From: lists at akphs.com (Phil Smith III)
Date: Sun, 14 Dec 2025 12:37:35 -0500
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
Message-ID: <011001dc6d20$56261670$02724350$@akphs.com>

Doug Ewell wrote:
>Another, possibly more farsighted reason is that, if a newly needed >letter-with-diacritic can be represented today with an existing letter >and an existing diacritic, instead of waiting possibly years for the >precomposed combination to be encoded, that time saving is a big win >for the user community.

"newly needed letter-with-diacritic" -- does that happen? Venusian gets added and the ONLY issue is that it needs J+Combining Grave?
I see the point but am not sure it's realistic, and in any case isn't what I'm talking about: I'm asking about NEW combiners. Though "invalid" combinations can be an issue now, with different engines rendering them differently. At least if code comes across J+Combining Grave now, the combining-ness is known. When a Combining Backslash is added for Jovian, well, now that character is new and normalization adventures abound. >More combining characters that work essentially the same as existing >ones don?t really add to the pain. Actually they add a LOT of pain/complexity for certain use cases, because of normalization. Thanks; I don't mean to sound like "Go away", this is exactly the kind of discussion I was hoping for! The fact that there haven't been any new combiners in several versions (I think?) is what made me think that there might be some level of "No more, not now, not ever" policy. From doug at ewellic.org Sun Dec 14 11:57:53 2025 From: doug at ewellic.org (Doug Ewell) Date: Sun, 14 Dec 2025 17:57:53 +0000 Subject: Combining characters In-Reply-To: <011001dc6d20$56261670$02724350$@akphs.com> References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: Phil Smith III wrote: > "newly needed letter-with-diacritic" -- does that happen? Venusian > gets added and the ONLY issue is that it needs J+Combining Grave? I > see the point but am not sure it's realistic, This sort of thing is not uncommon with Native American orthographies (right here on Earth!) that are newly created, or for which Unicode encoding is new. > and in any case isn't what I'm talking about: I'm asking about NEW > combiners. Though "invalid" combinations can be an issue now, with > different engines rendering them differently. At least if code comes > across J+Combining Grave now, the combining-ness is known. When a > Combining Backslash is added for Jovian, well, now that character is > new and normalization adventures abound. Normalization (NFC or NFD, not NFK*) for characters like this comes into play only when the character exists as both a precomposed unitary character and a combining sequence. When there is only one or the other, normalization to NFC or NFD yields the same result, and is thus a no-op, and not particularly adventurous. > Actually they add a LOT of pain/complexity for certain use cases, > because of normalization. Only if a separate NFC (precomposed) or NFD (decomposed) form is added where one already exists, and IIRC there is indeed a policy against that. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From irgendeinbenutzername at gmail.com Sun Dec 14 12:28:16 2025 From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff) Date: Sun, 14 Dec 2025 19:28:16 +0100 Subject: Combining characters In-Reply-To: <011001dc6d20$56261670$02724350$@akphs.com> References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: > The fact that there haven't been any new combiners in several versions I?m actually really curious what gave you that impression. Pretty much every Unicode update adds tons of new combining characters (the only exceptions being those weird inbetween-y versions we occasionally get). Am So., 14. Dez. 
2025 um 18:39 Uhr schrieb Phil Smith III via Unicode < unicode at corp.unicode.org>: > Doug Ewell wrote: > >Another, possibly more farsighted reason is that, if a newly needed > >letter-with-diacritic can be represented today with an existing letter > >and an existing diacritic, instead of waiting possibly years for the > >precomposed combination to be encoded, that time saving is a big win > >for the user community. > > "newly needed letter-with-diacritic" -- does that happen? Venusian gets > added and the ONLY issue is that it needs J+Combining Grave? I see the > point but am not sure it's realistic, and in any case isn't what I'm > talking about: I'm asking about NEW combiners. Though "invalid" > combinations can be an issue now, with different engines rendering them > differently. At least if code comes across J+Combining Grave now, the > combining-ness is known. When a Combining Backslash is added for Jovian, > well, now that character is new and normalization adventures abound. > > >More combining characters that work essentially the same as existing > >ones don?t really add to the pain. > > Actually they add a LOT of pain/complexity for certain use cases, because > of normalization. > > Thanks; I don't mean to sound like "Go away", this is exactly the kind of > discussion I was hoping for! The fact that there haven't been any new > combiners in several versions (I think?) is what made me think that there > might be some level of "No more, not now, not ever" policy. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at akphs.com Sun Dec 14 12:47:13 2025 From: lists at akphs.com (Phil Smith III) Date: Sun, 14 Dec 2025 13:47:13 -0500 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: <012d01dc6d2a$105656a0$310303e0$@akphs.com> Well, I?m sorta ?asking for a friend? ? a coworker who is deep in the weeds of working with something Unicode-related. I?m blaming him for having told me that :) If it?s wrong, it certainly changes things a lot, and makes my question moot(ish?)! From: Charlotte Eiffel Lilith Buff Sent: Sunday, December 14, 2025 1:28 PM To: Phil Smith III Cc: Unicode Subject: Re: Combining characters > The fact that there haven't been any new combiners in several versions I?m actually really curious what gave you that impression. Pretty much every Unicode update adds tons of new combining characters (the only exceptions being those weird inbetween-y versions we occasionally get). Am So., 14. Dez. 2025 um 18:39 Uhr schrieb Phil Smith III via Unicode >: Doug Ewell wrote: >Another, possibly more farsighted reason is that, if a newly needed >letter-with-diacritic can be represented today with an existing letter >and an existing diacritic, instead of waiting possibly years for the >precomposed combination to be encoded, that time saving is a big win >for the user community. "newly needed letter-with-diacritic" -- does that happen? Venusian gets added and the ONLY issue is that it needs J+Combining Grave? I see the point but am not sure it's realistic, and in any case isn't what I'm talking about: I'm asking about NEW combiners. Though "invalid" combinations can be an issue now, with different engines rendering them differently. At least if code comes across J+Combining Grave now, the combining-ness is known. When a Combining Backslash is added for Jovian, well, now that character is new and normalization adventures abound. 
>More combining characters that work essentially the same as existing >ones don't really add to the pain.

Actually they add a LOT of pain/complexity for certain use cases, because of normalization.

Thanks; I don't mean to sound like "Go away", this is exactly the kind of discussion I was hoping for! The fact that there haven't been any new combiners in several versions (I think?) is what made me think that there might be some level of "No more, not now, not ever" policy.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jukkakk at gmail.com Sun Dec 14 13:50:38 2025
From: jukkakk at gmail.com (Jukka K. Korpela)
Date: Sun, 14 Dec 2025 21:50:38 +0200
Subject: Combining characters
In-Reply-To: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
Message-ID: 

The question is neither dumb nor naïve. However, it has been addressed and answered several times in various discussions. There might be different views on this in the Unicode community, so I'll label the following as mine only.

Unicode has included precomposed characters such as ??? and ??? for compatibility, since they existed in earlier character codes. If Unicode were designed today and with no regard to earlier codes, it would have just base characters and combining marks. But due to the existence and widespread use of codes with precomposed characters, they were included into Unicode and defined as distinct, but "compatibility equivalent" to combinations of base characters and combining marks. This means that you are allowed, but not required or even encouraged, to make a distinction between, say, the single character 'é' and the sequence of 'e' followed by a combining acute accent.

Combining characters provide a general mechanism for adding combining marks to any character. You can take this to extremes and absurdity, creating a character with dozens or zillions of marks above and below it, but this does not prevent meaningful use of combining characters.

Yucca, https://jkorpela.fi

On Sun, 14 Dec 2025 at 18:36, Phil Smith III via Unicode (unicode at corp.unicode.org) wrote:
> This may be dumb/hopelessly naïve but here goes! > > > > My observations/inferences/suppositions re combining characters: > > - They were originally implemented as a way to reduce the total number > of characters > - We're well past any likely original vision of the number of > characters/scripts anyway, so that "savings" is kinda meaningless > - Combiners are a pain overall (normalization!) > - Barring Earth joining the Galactic Federation and Unicode deciding > to include all twelve billion alien languages, the current scheme will > suffice forever (yeah, yeah, I know, "never say forever", but this feels > like IPv6 addresses, "there are just SO many ...") > > > > Ergo, I posit that there should never be a need for any NEW combiners. > > > Is there any sort of official or unofficial policy to that end? Inquiring > minds and all that... > > > > Thanks, > > ...phsiii >

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From don.hosek at gmail.com Sun Dec 14 14:25:22 2025
From: don.hosek at gmail.com (Don Hosek)
Date: Sun, 14 Dec 2025 14:25:22 -0600
Subject: Combining Characters
Message-ID: 

> > When a Combining Backslash is added for Jovian, well, now that character > is new and normalization adventures abound. >

Just one additional note on this: Everything around combining characters, normalization and grapheme segmentation is data-driven.
Other than when new rules for Indic scripts were introduced with Unicode 15.1.0, the only thing I?ve needed to update for my Unicode grapheme library has been to import the newest Unicode data tables. I?ve not written normalization code (yet), but from everything that I?ve seen on that front, it looks like a similar thing where again, everything is data-driven. The only case I can see where things could get weird would be if there suddenly became some weird case where, e.g., the Jovians insisted that the combining backslash must appear before the letter and not after it (and it?s been a few years since I had to really look at the rules and this might be possible with the existing combining character classes anyway). -dh -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sun Dec 14 14:32:06 2025 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 14 Dec 2025 12:32:06 -0800 Subject: Combining characters In-Reply-To: <012d01dc6d2a$105656a0$310303e0$@akphs.com> References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com> Message-ID: On Sun, Dec 14, 2025 at 9:40?AM Phil Smith III via Unicode < unicode at corp.unicode.org> wrote: > The fact that there haven't been any new combiners in several versions (I > think?) We publish data files, so that you need not guess. Are we talking about combining marks per se, which include Indic-script vowel signs, or are we talking about characters with non-zero Canonical_Combining_Class? Unicode 15, 16, and 17 added 135 characters with General_Category=M. Unicode 15, 16, and 17 added 56 characters with ccc!=0. (These are a subset of the above.) On Sun, Dec 14, 2025 at 10:50?AM Phil Smith III via Unicode < unicode at corp.unicode.org> wrote: > Well, I?m sorta ?asking for a friend? ? a coworker who is deep in the > weeds of working with something Unicode-related. > Maybe you could ask your coworker to chime in and say what he is working on, and maybe we can give some tips? markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From list+unicode at jdlh.com Sun Dec 14 14:40:25 2025 From: list+unicode at jdlh.com (list+unicode at jdlh.com) Date: Sun, 14 Dec 2025 12:40:25 -0800 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: <7bf3ed16-4fc5-4fc9-929b-5eb34b312075@jdlh.com> Following up on Charlotte's comment: for Unicode 17.0, released in September 2025, see: 1. *Unicode? 17.0 Versioned Charts Index *, which lists all 4803 characters newly encoded in Unicode 17.0, including 27 in "Combining Diacritical Marks Extended". 2. *Combining Diacritical Marks Extended* which has pictures of new combining characters U-1ACF..U-1ADD and U-1AE0..U-1AEB. So, there have been new combiners in the most recent version of The Unicode Standard. ???? ?Jim DeLaHunt On 2025-12-14 10:28, Charlotte Eiffel Lilith Buff via Unicode wrote: > > The fact that there haven't been any new combiners in several versions > > I?m actually really curious what gave you that impression. Pretty much > every Unicode update adds tons of new combining characters (the only > exceptions being those weird inbetween-y versions we occasionally get). > > Am So., 14. Dez. 
2025 um 18:39?Uhr schrieb Phil Smith III via Unicode > : > > Doug Ewell wrote: > >Another, possibly more farsighted reason is that, if a newly needed > >letter-with-diacritic can be represented today with an existing > letter > >and an existing diacritic, instead of waiting possibly years for the > >precomposed combination to be encoded, that time saving is a big win > >for the user community. > > "newly needed letter-with-diacritic" -- does that happen? Venusian > gets added and the ONLY issue is that it needs J+Combining Grave? > I see the point but am not sure it's realistic, and in any case > isn't what I'm talking about: I'm asking about NEW combiners. > Though "invalid" combinations can be an issue now, with different > engines rendering them differently. At least if code comes across > J+Combining Grave now, the combining-ness is known. When a > Combining Backslash is added for Jovian, well, now that character > is new and normalization adventures abound. > > >More combining characters that work essentially the same as existing > >ones don?t really add to the pain. > > Actually they add a LOT of pain/complexity for certain use cases, > because of normalization. > > Thanks; I don't mean to sound like "Go away", this is exactly the > kind of discussion I was hoping for! The fact that there haven't > been any new combiners in several versions (I think?) is what made > me think that there might be some level of "No more, not now, not > ever" policy. > > -- . --Jim DeLaHunt,jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/) multilingual websites consultant, Vancouver, B.C., Canada -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sun Dec 14 14:40:54 2025 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 14 Dec 2025 12:40:54 -0800 Subject: Combining Characters In-Reply-To: References: Message-ID: On Sun, Dec 14, 2025 at 12:27?PM Don Hosek via Unicode < unicode at corp.unicode.org> wrote: > When a Combining Backslash is added for Jovian, well, now that character >> is new and normalization adventures abound. >> > > [...] > > The only case I can see where things could get weird would be if there > suddenly became some weird case where, e.g., the Jovians insisted that the > combining backslash must appear before the letter and not after it (and > it?s been a few years since I had to really look at the rules and this > might be possible with the existing combining character classes anyway). > Some of the Indic-script vowel marks *appear graphically* before their consonant. We also have scripts like Thai with characters that have the Logical_Order_Exception property and are encoded in memory before their consontants. However, when the Jovians arrive with their billion-character encoding, then Unicode will become a legacy encoding. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Dec 14 16:02:41 2025 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 14 Dec 2025 14:02:41 -0800 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: On 12/14/2025 9:57 AM, Doug Ewell via Unicode wrote: > Normalization (NFC or NFD, not NFK*) for characters like this comes into play only when the character exists as both a precomposed unitary character and a combining sequence. 
> When there is only one or the other, normalization to NFC or NFD yields the same result, and is thus a no-op, and not particularly adventurous.

This is actually incorrect. (And Doug actually knows better :) ).

It would be correct for a sequence of a base character with a *single* combining mark, but as soon as you have two or more combining marks, their order is defined by NFC. The idea is that if two combining marks don't interact (such as by stacking), different orders could result in the same display, and normalization enforces a preferred ordering.

To make matters more complex, some combining marks are defined to not reorder. Those can be in any order defined by the author and could lead to duplicate encoding for the same display. The reasons behind supporting that are a bit complex, but generally it's done for scripts other than Latin.

But in general, *canonical reordering* is a thing and is part of normalization.

A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmusf at ix.netcom.com Sun Dec 14 16:44:49 2025
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 14 Dec 2025 14:44:49 -0800
Subject: Combining characters
In-Reply-To: <012d01dc6d2a$105656a0$310303e0$@akphs.com>
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com>
Message-ID: 

On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote:
> > Well, I'm sorta "asking for a friend" -- a coworker who is deep in the > weeds of working with something Unicode-related. I'm blaming him for > having told me that :) > >

This actually deserves a deeper answer, or a more "bird's-eye" one, if you want. Read to the end.

The way you asked the question seems to hint that in your minds you and your friend conflate the concept of "combining" mark and "diacritic". That would not be surprising if you are mainly familiar with European scripts and languages, because in that case, this equivalence kind of applies. And you may also be thinking mainly of languages and their orthographies, and not of notations, phonetic or otherwise, that give rise to unusual combinations.

Most European languages do have a reasonably small, fixed set of letters with diacritics in their orthographies, even though there are many languages where, if you ask the native users to list all the combinations, they will fall short. An example is the use of an accent with the letter 'e' in some of the Scandinavian languages to distinguish two identically spelled small words that have very different functions in the syntax. You will see that accent used in books and formal writing, but I doubt people bother when writing a text message.

The focus on code space is a red herring to a degree. The real difficulty would be in cataloging all of the rare combinations and getting all fonts to be aware of them. It is much easier to encode the diacritic as a combining character and have general rules for layout. With modern fonts, you can, in principle, get acceptable display even for unexpected combinations without the effort of first cataloging, then publishing, and then having all font vendors explicitly add an implementation for that combination before it can be used.

Other languages and scripts have combinatorics as part of their DNA, so to speak. Their structural unit is not the letter (with or without decorations) but the syllable, which is naturally combined from components that graphically attach to each other or even fuse into a combined shape.
Because that process is not random, it's easier to encode these structural elements (some of which are combining characters) than to try to enumerate the possible combinations. It doesn't hurt that the components nicely map onto discrete keys on the respective keyboards. Notations, such as scientific notation, also often assigns a discrete identity to the combining mark. A dot above can be the first derivative with respect to time, which can be applied to any letter designating a variable, which can be, at the minimum any letter from the Latin or Greek alphabets, but why stop there. There's nothing in the notation itself that would enjoin a scientist from combining that dot with any character they find suitable. The only sensible solution is encoding a combining mark, even though some letters exist that have a dot above as part of an orthography and are also encoded in precomposed form. In contrast, Chinese ideographs, while visually composed of identifiable elements, are treated by their users as units and well before Unicode came along there was an established approach how to manage things like keyboard entry while encoding these as precomposed entities and not as their building blocks. A big part of the encoding decision is always to do what makes sense for the writing system or notation (and the script it is based on). For a universal encoding, such as Unicode, there simply isn't a "one-size-fits-all" solution that would work. But if you look at this universal encoding only from a very narrow perspective of the orthographies that you are most familiar with, then, understandably, you might feel that anything that isn't directly required (from your point of view) is an unnecessary complication. However, once you adopt a more universal perspective, it's much easier to not rat-hole on some seeming inconsistencies, because you can always discover how certain decisions relate to the specific requirements for one or more writing systems. Importantly, this often includes requirements based on de-facto implementations for these systems before the advent of Unicode. Being universal, Unicode needed to be designed to allow easy conversion from all existing data sets. And for European scripts, the business community and the librarians had competing systems, one with limited sets of precomposed characters and one with combining marks for diacritics. The ultimate source of the duality stems from there, but the two communities had different goals. One wanted to efficiently handle the common case (primarily mapping all the modern national typewriters into character encoding) while the other was interested in a full representation of anything that could be present in printed book titles (for cataloging), including unusual or historic combinations. In conclusion, the question isn't a bad one, but the real answer is that complexity is very much part of human writing, and when you design (and extend) a universal character encoding, you will need to be able to represent that full degree of complexity. Therefore, what seem like obvious simplifications really aren't feasible, unless you give up on attempting to be universal. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
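A concrete way to see the composition behavior discussed in the messages above is to run a couple of normalization calls. The following is only a minimal sketch using Python's standard unicodedata module (an illustration of my own, not something referred to in the thread; any Unicode-aware library would do). The base letters 'e' and 'q' are chosen simply because a precomposed "e with dot above" (U+0117) happens to exist while no precomposed "q with dot above" does.

    import unicodedata

    # "e" + COMBINING DOT ABOVE has a precomposed counterpart (U+0117),
    # so NFC composes the pair into the single precomposed character,
    # and NFD decomposes it again.
    assert unicodedata.normalize("NFC", "e\u0307") == "\u0117"
    assert unicodedata.normalize("NFD", "\u0117") == "e\u0307"

    # "q" + COMBINING DOT ABOVE has no precomposed counterpart, so NFC
    # and NFD both leave the combining sequence exactly as it was typed.
    for form in ("NFC", "NFD"):
        assert unicodedata.normalize(form, "q\u0307") == "q\u0307"

Whether a precomposed form happens to exist only changes the normalized spelling, not what an author is allowed to write, which is the point made above about a combining dot above being applicable to any letter used as a variable.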
URL: From ashpilkin at gmail.com Sun Dec 14 17:07:41 2025 From: ashpilkin at gmail.com (Alex Shpilkin) Date: Mon, 15 Dec 2025 01:07:41 +0200 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: On Sun, Dec 14 2025 at 14:02:41 -08:00:00, Asmus Freytag via Unicode wrote: > To make matters more complex, some combining marks are defined to not > reorder. Those can be in any order defined by the author and could > lead to duplicate encoding for the same display. The reasons behind > supporting that are a bit complex, but generally it's done for > scripts other than Latin. Amusingly, study of literal Latin, the language, uses two combining marks of the same CCC together as a matter of course: dictionaries mark a vowel with (what in NFD would be) the sequence COMBINING MACRON, COMBINING BREVE to tell the reader that a syllable?s length either varies or cannot be determined. -- Cheers, Alex From mark at kli.org Sun Dec 14 17:22:06 2025 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 14 Dec 2025 18:22:06 -0500 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com> Message-ID: On 12/14/25 5:44 PM, Asmus Freytag via Unicode wrote: > On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote: >> >> Well, I?m sorta ?asking for a friend? ? a coworker who is deep in the >> weeds of working with something Unicode-related. I?m blaming him for >> having told me that :) >> >> > This actually deserves a deeper answer, or a more "bird's-eye" one, if > you want. Read to the end. > > The way you asked the question seems to hint that in your minds you > and your friend conflate the concept of "combining" mark and > "diacritic". That would not be surprising if you are mainly familiar > with European scripts and languages, because in that case, this > equivalence kind of applies. > Yes.? This is crucial.? You (Phil) are writing like "sheez, so there's e and there's e-with-an-acute, we might as well just treat them like separate letters."? And that maybe makes sense for languages where "combining characters" are maybe two or three diacritics that can live on five or six letters.? Maybe it does make sense to consider those combinations as distinct letters (indeed, some of the languages in question do just that.)? But some combining characters are more rightly perceived as things separate from the letters which are written in the same space (and have historically always been considered so).? The most obvious examples would be Hebrew and Arabic vowel-points.? Does it really make sense to consider ?? and ?? and ??? and all the other combinatorics as separate distinct things, when they clearly contain separate units, each of which has its own consistent character?? Throw in the Hebrew "accents" (cantillation marks) and you're talking an enormous combinatorial explosion at the *cost* of simplicity and consistency, not improving it.? Ditto Indic vowel-marks and a jillion other abjads and abugidas.? If anything, there's a better case to be made that the precomposed letters were maybe a wrong move. (TL;DR: what Asmus said.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Sun Dec 14 17:54:28 2025 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=) Date: Mon, 15 Dec 2025 08:54:28 +0900 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: <191bffe1-705e-4865-9f60-80f990c19cff@it.aoyama.ac.jp> Hello everybody, On 2025-12-15 02:57, Doug Ewell via Unicode wrote: > Phil Smith III wrote: > Only if a separate NFC (precomposed) or NFD (decomposed) form is added where one already exists, and IIRC there is indeed a policy against that. Yes. It's at https://www.unicode.org/policies/stability_policy.html, under the heading "Normalization Stability". It's written in a result-oriented way, but it essentially means that if a text can already be expressed in decomposed form, no new precompositions will be added. That doesn't mean that it's not possible to add new precompositions and decompositions at the same time, e.g. for a new script (see my next mail for an example). Regards, Martin. From duerst at it.aoyama.ac.jp Sun Dec 14 17:54:33 2025 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=) Date: Mon, 15 Dec 2025 08:54:33 +0900 Subject: Combining Characters In-Reply-To: References: Message-ID: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp> Hello everybody, On 2025-12-15 05:25, Don Hosek via Unicode wrote: > Just one additional note on this: Everything around combining characters, > normalization and grapheme segmentation is data-driven. Other than when new > rules for Indic scripts were introduced with Unicode 15.1.0, the only thing > I?ve needed to update for my Unicode grapheme library has been to import > the newest Unicode data tables. I?ve not written normalization code (yet), > but from everything that I?ve seen on that front, it looks like a similar > thing where again, everything is data-driven. That's essentially true, based on my experience with Unicode-related code for the programming language Ruby. > The only case I can see where things could get weird would be if there > suddenly became some weird case where, e.g., the Jovians insisted that the > combining backslash must appear before the letter and not after it (and > it?s been a few years since I had to really look at the rules and this > might be possible with the existing combining character classes anyway). Because of the way we have optimized normalization in Ruby (caching normalization results for runs of a base character followed by modifiers), that wasn't exactly true when we upgraded to Unicode 16.0.0. See the "Normalization Behavior" entry at https://www.unicode.org/versions/Unicode16.0.0/#Migration. New scripts introduced in 16.0.0 (Kirat Rai, Tulu-Tigalari, and Gurung Khema) contained combining marks that had combining class 0 and were also base characters combining with other combining marks (or even with themselves). That was something we hadn't taken account of in our implementation previously (because it was not needed). You can see an example at https://github.com/ruby/ruby/blob/master/test/test_unicode_normalize.rb#L219: assert_equal "\u{16121 16121 16121 16121 16121 1611E}", "\u{1611E 16121 16121 16121 16121 16121}".unicode_normalize U+1611E is GURUNG KHEMA VOWEL SIGN AA, a single bar on top of a character. It combines with itsel to form U+16121, GURUNG KHEMA VOWEL SIGN U, which is a double bar above. 
Although not required for actually writing Gurung Khema (or so I assume), the correct form to represent a number of bars above (11 in the test code above) is to first group them into pairs with U+16121, and only in the case of an odd number add a single U+1611E to the end. Regards, Martin. From asmusf at ix.netcom.com Sun Dec 14 17:55:21 2025 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 14 Dec 2025 15:55:21 -0800 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com> Message-ID: On 12/14/2025 3:22 PM, Mark E. Shoulson via Unicode wrote: > > On 12/14/25 5:44 PM, Asmus Freytag via Unicode wrote: > >> On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote: >>> >>> Well, I?m sorta ?asking for a friend? ? a coworker who is deep in >>> the weeds of working with something Unicode-related. I?m blaming him >>> for having told me that :) >>> >>> >> This actually deserves a deeper answer, or a more "bird's-eye" one, >> if you want. Read to the end. >> >> The way you asked the question seems to hint that in your minds you >> and your friend conflate the concept of "combining" mark and >> "diacritic". That would not be surprising if you are mainly familiar >> with European scripts and languages, because in that case, this >> equivalence kind of applies. >> > Yes.? This is crucial.? You (Phil) are writing like "sheez, so there's > e and there's e-with-an-acute, we might as well just treat them like > separate letters."? And that maybe makes sense for languages where > "combining characters" are maybe two or three diacritics that can live > on five or six letters.? Maybe it does make sense to consider those > combinations as distinct letters (indeed, some of the languages in > question do just that.)? But some combining characters are more > rightly perceived as things separate from the letters which are > written in the same space (and have historically always been > considered so). The most obvious examples would be Hebrew and Arabic > vowel-points.? Does it really make sense to consider ?? and ?? and ??? > and all the other combinatorics as separate distinct things, when they > clearly contain separate units, each of which has its own consistent > character?? Throw in the Hebrew "accents" (cantillation marks) and > you're talking an enormous combinatorial explosion at the *cost* of > simplicity and consistency, not improving it.? Ditto Indic vowel-marks > and a jillion other abjads and abugidas. > Nice examples to back up what I wrote. > > ?If anything, there's a better case to be made that the precomposed > letters were maybe a wrong move. > > That "might" have been the case, had Unicode been created in a vacuum. Instead, Unicode needed to offer the easiest migration path from the installed base of pre-existing character encodings, or risk failing to gain ground at all. All the early systems mainly started out with legacy applications and legacy data that needed to be supported as transparently as possible. Given the pervasive amount of indexing into strings and length calculations that are deeply embedded into these legacy applications, trying to support these with a different encoding model (not just with a different encoding) would have been a non-starter. 
As we've seen since, the final key in that puzzle was the IETF creating an ASCII-compatible, variable-length encoding form that violated one of Unicode's early design goals (to have a fixed number of code units per character). However, allowing direct parsing of data streams for ASCII-based syntax characters was more of a compatibility requirement than had appeared at first. The reason this was not built directly into the earliest Unicode versions was that it is something that (transport) protocol designers are up against more than people worried about representing text in documents.

Looking at Unicode from the perspective of "what if I could design something from scratch?" can be intellectually interesting but is of little practical value. Any design that would have prevented people from different legacy environments from coalescing around it would simply have died out.

If it amuses you, you could think of some features of Unicode as being akin to the "vestigial" organs that evolution sometimes leaves behind. They may not strictly be required, the way the organism functions today, but without their use in the historical transition, the current form of the organism would not exist, because the species would be extinct.

A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From duerst at it.aoyama.ac.jp Sun Dec 14 18:31:33 2025
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=)
Date: Mon, 15 Dec 2025 09:31:33 +0900
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com>
Message-ID: 

On 2025-12-15 08:07, Alex Shpilkin via Unicode wrote:
>
> On Sun, Dec 14 2025 at 14:02:41 -08:00:00, Asmus Freytag via Unicode
> wrote:
>> To make matters more complex, some combining marks are defined to not
>> reorder. Those can be in any order defined by the author and could
>> lead to duplicate encoding for the same display. The reasons behind
>> supporting that are a bit complex, but generally it's done for scripts
>> other than Latin.
>
> Amusingly, study of literal Latin, the language, uses two combining
> marks of the same CCC together as a matter of course: dictionaries mark
> a vowel with (what in NFD would be) the sequence COMBINING MACRON,
> COMBINING BREVE to tell the reader that a syllable's length either
> varies or cannot be determined.

These two characters are indeed not reordered, but that's not a problem, because they are stacked. The sequence COMBINING MACRON, COMBINING BREVE will have the macron between the character and the breve, whereas the sequence COMBINING BREVE, COMBINING MACRON will have the macron above the breve. Not an expert, but my assumption is that only the first one is customary for Latin.

Regards, Martin.
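
Since both marks have canonical combining class 230, normalization leaves their relative order alone, and the two orders remain canonically distinct; a quick sketch of that in Ruby (illustrative only):

    p "a\u0304\u0306".unicode_normalize(:nfd) == "a\u0304\u0306"  # => true: macron-then-breve is already in canonical order
    p "a\u0306\u0304".unicode_normalize(:nfd) == "a\u0306\u0304"  # => true: and so is breve-then-macron
    p "a\u0304\u0306".unicode_normalize(:nfd) == "a\u0306\u0304".unicode_normalize(:nfd)  # => false: the two sequences are not canonically equivalent
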
From doug at ewellic.org Sun Dec 14 23:03:14 2025
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 15 Dec 2025 05:03:14 +0000
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com>
Message-ID: 

Asmus Freytag wrote:

> It would be correct for a sequence of a base character with _single_
> combining mark, but as soon as you have two or more combining marks,
> their order is defined by NFC.

I had mistakenly assumed that Phil's use case considered only sequences with a single combining mark, and consciously chose to limit my response to that scenario.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Sun Dec 14 23:26:24 2025
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 15 Dec 2025 05:26:24 +0000
Subject: Combining Characters
In-Reply-To: 
References: 
Message-ID: 

Markus Scherer wrote:

> However, when the Jovians arrive with their billion-character
> encoding, then Unicode will become a legacy encoding.

Yeah, I'm really not a fan of this whole "Jovian" and "Venusian" and "Galactic Federation" line of argument. As several (including Markus) have observed here, the use of combining characters in non-Latin scripts and in transcription can differ markedly from their use in Latin-script, natural-language scenarios. The original questions strongly implied that only Latin-script, natural-language scenarios were relevant, to the extent that anything else must be from outer space. That's unfair to languages written in scripts other than Latin, as well as to the efforts made in Unicode for 35 years to provide good support for these scripts and for contexts other than natural language.

Apologies if humor was intended throughout and I didn't get it.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From asmusf at ix.netcom.com Sun Dec 14 23:42:13 2025
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 14 Dec 2025 21:42:13 -0800
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com>
Message-ID: 

On 12/14/2025 9:03 PM, Doug Ewell wrote:
> Asmus Freytag wrote:
>
>> It would be correct for a sequence of a base character with _single_
>> combining mark, but as soon as you have two or more combining marks,
>> their order is defined by NFC.
> I had mistakenly assumed that Phil's use case considered only sequences
> with a single combining mark, and consciously chose to limit my response
> to that scenario.
>
I know that you were aware of the general case. What I was trying to communicate (and expounded upon in the other reply) is the degree to which human writing in the general case is highly complex, usually even more complex than most native speakers (other than typesetters) are ever aware of, even for their own language.

And it is acknowledging this complexity -- and how it is necessarily reflected in anything that aims to be a universal system of character encoding -- that drives the understanding that such a system must be full of complexities of its own, which cannot be reduced to anything minimally simple.

For those of us, unlike the questioner, who have been around this effort for any length of time, this complexity can seem to be a given. But many people who have not worked in this space are genuinely surprised and challenged by it. And that includes people who have impressive credentials in technical work. Without realizing it, they apply their own native understanding of writing systems as if it were exhaustive or even typical.

When they try to come up with solutions, such as protocols, that need to be robust in the face of the full variety of global text (even only the living subset), they may reach conclusions that fall fatally short of what is needed, or they try to "simplify" away complexities that to them feel ill motivated. Commonly, I also observe that solutions are proposed that "micro-manage" some well-understood or familiar subset of characters, but leave a protocol without meaningful solutions or safeguards for the vast majority, which contains all the other scripts and writing systems.
There's no quick fix, but it is my firm conviction that we always need to start by correctly scoping the issues as belonging to a "universal" system of character encoding, as opposed to one that is optimized for some subset.

A./

From unicode.org at sl.neatnit.net Thu Dec 18 10:08:57 2025
From: unicode.org at sl.neatnit.net (Nitai Sasson)
Date: Thu, 18 Dec 2025 16:08:57 +0000
Subject: RFC: controlling bidirectional mirroring of characters
Message-ID: <176607414306.7.7927205120432481899.1073729710@sl.neatnit.net>

Hello all!

I've been sitting on this for a while, kind of afraid to finish it up and send it. I've finally decided to just do so, even though it's not perfect.

Following the email discussion from [April 2025](https://corp.unicode.org/pipermail/unicode/2025-April/thread.html), I want to propose a combining formatting character to affect the mirroring behavior of arrow characters (and potentially other characters) in bidirectional text. The initial idea for this was originally brought up by Mark E. Shoulson while brainstorming in his first reply.

This is a request for comment and a draft for that proposal. Please see it at: https://codeberg.org/NeatNit/unicode-bidi-arrows-proposal/src/branch/main/email.md

Thank you,
Nitai Sasson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From moody at posixcafe.org Fri Dec 19 10:32:57 2025
From: moody at posixcafe.org (Jacob Moody)
Date: Fri, 19 Dec 2025 10:32:57 -0600
Subject: Combining Characters
In-Reply-To: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
References: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
Message-ID: 

On 12/14/25 17:54, Martin J. Dürst via Unicode wrote:
>
>> The only case I can see where things could get weird would be if there
>> suddenly became some weird case where, e.g., the Jovians insisted that the
>> combining backslash must appear before the letter and not after it (and
>> it's been a few years since I had to really look at the rules and this
>> might be possible with the existing combining character classes anyway).
>
> Because of the way we have optimized normalization in Ruby (caching
> normalization results for runs of a base character followed by
> modifiers), that wasn't exactly true when we upgraded to Unicode 16.0.0.
> See the "Normalization Behavior" entry at
> https://www.unicode.org/versions/Unicode16.0.0/#Migration.

I also ran into some issues with exactly this in my implementation for 9front[0]. It took me a bit to figure out what was going on; unfortunately, I had first written my implementation for v15, so at the time I wasn't sure whether I had somehow overfit my code to 15 or something had changed.

> New scripts introduced in 16.0.0 (Kirat Rai, Tulu-Tigalari, and Gurung
> Khema) contained combining marks that had combining class 0 and were
> also base characters combining with other combining marks (or even with
> themselves). That was something we hadn't taken account of in our
> implementation previously (because it was not needed).

I do wish the documents on migration[1] had explicitly explained that these new characters are ccc=0 conjoiners; it may be implied in the discussion, and maybe I'm still a bit green on the details to put 2 and 2 together, but it would have saved me some time.

On that topic, I did find the suggested resolution of using the quickcheck value a bit strange; as far as I know, use of quickcheck was not strictly required for normalization prior to this update.
Or well, my v15 implementation did not use it and passed all the normalization tests. I guess as an upside I found that with these changes and the inclusion of quickcheck, Hangul no longer needed to be special-cased.

Thanks,
Jacob Moody

[0] https://github.com/9front/9front/blob/front/sys/src/libc/ucd/runenorm.c
[1] https://www.unicode.org/reports/tr15/tr15-56.html#Contexts_Care

From ashpilkin at gmail.com Fri Dec 19 15:02:55 2025
From: ashpilkin at gmail.com (Alex Shpilkin)
Date: Fri, 19 Dec 2025 23:02:55 +0200
Subject: Combining Characters
In-Reply-To: 
References: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
Message-ID: 

On Fri, Dec 19 2025 at 10:32:57 -06:00:00, Jacob Moody via Unicode wrote:
> I do wish the documents on migration[1] had explicitly explained that
> these new characters are ccc=0 conjoiners; it may be implied in the
> discussion, and maybe I'm still a bit green on the details to put 2 and 2
> together, but it would have saved me some time.

No objection here despite the foregoing.

> On that topic, I did find the suggested resolution of using the
> quickcheck value a bit strange; as far as I know, use of quickcheck
> was not strictly required for normalization prior to this update. Or
> well, my v15 implementation did not use it and passed all the
> normalization tests.

I haven't gotten to implementing canonical composition yet, nor have I looked at any other implementation, including yours, but AFAICT the QC properties aren't required now either: looking at Section 3.11, Normalization Forms, in Unicode 13, predating this change, the recomposition algorithm that suggests itself is:

    starter = 0 # sentinel not part of any compositions
    starter index = uninitialized
    index = 0
    while index < length of string:
        composition = try to compose (starter, string[index])
        if succeeded:
            assert ccc[composition] = 0
            string[starter index] = composition
            delete string[index]
        else:
            if ccc[string[index]] = 0: # NB only this late
                starter = string[index]
                starter index = index
            index = index + 1

If you check conditions in this order, then the handling of starter+starter compositions falls out naturally. (Also note that the composition table only needs to contain pairs of an NFC-form starter and an NFD character, and there are possible optimizations connected to the fact that, if the next character after a successful composition is a nonstarter too, then the first character in the next lookup will be the result of this one.)

Trying to merge de- and recomposition into a single streaming process (e.g. with limits on the length of a composing character sequence to avoid worst-case linear memory consumption) will of course make things much more difficult.

> I guess as an upside I found that with these changes and the
> inclusion of quickcheck, Hangul no longer needed to be special-cased.

I don't believe you ever actually *have* to special-case Hangul after you've generated your tables, it's just that if you are trying to keep your table size down (as I am) then doing so will give you something like 2x savings.

--
HTH,
Alex
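
For context, the Hangul special case being traded against table size here is the arithmetic mapping from Section 3.12 of the standard; a minimal sketch in Ruby (the method name compose_hangul and the example syllable are chosen only for illustration):

    SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    LCOUNT, VCOUNT, TCOUNT = 19, 21, 28   # 19 * 21 * 28 = 11,172 precomposed syllables

    # Compose a leading consonant, a vowel, and an optional trailing consonant
    # arithmetically instead of looking the syllable up in a table.
    def compose_hangul(l, v, t = nil)
      li, vi = l - LBASE, v - VBASE
      return nil unless (0...LCOUNT).cover?(li) && (0...VCOUNT).cover?(vi)
      s = SBASE + (li * VCOUNT + vi) * TCOUNT
      if t
        ti = t - TBASE
        return nil unless (1...TCOUNT).cover?(ti)
        s += ti
      end
      s
    end

    p "%04X" % compose_hangul(0x1112, 0x1161, 0x11AB)   # => "D55C" (U+D55C HANGUL SYLLABLE HAN)

Listing all 11,172 syllables in the composition and decomposition tables instead is where the roughly 2x table-size difference mentioned above comes from.
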
From ashpilkin at gmail.com Fri Dec 19 15:17:25 2025
From: ashpilkin at gmail.com (Alex Shpilkin)
Date: Fri, 19 Dec 2025 23:17:25 +0200
Subject: Combining Characters
In-Reply-To: 
References: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
Message-ID: <1HCJ7T.7BQ2ABPDSB4F3@gmail.com>

On Fri, Dec 19 2025 at 23:02:55 +02:00:00, Alex Shpilkin wrote:
> I haven't gotten to implementing canonical composition yet

And you can tell, because the algorithm I've posted is wrong. Attempted correction (which does introduce a bit of special handling to account for the starter+starter case):

    starter = 0 # sentinel not part of any compositions
    starter index = uninitialized
    index = 0
    while index < length of string:
        composition = try to compose (starter, string[index])
        if succeeded and (ccc[string[index]] != 0 or index == starter index + 1):
            string[starter index] = composition
            delete string[index]
        else:
            if ccc[string[index]] == 0: # NB only this late
                starter = string[index]
                starter index = index
            index = index + 1

--
Sorry for the noise,
Alex
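
For reference, here is a minimal runnable sketch of the canonical composition pass described in Section 3.11 of the standard, written in Ruby against toy stand-ins for the UCD data (CCC and PAIRS below are not the real tables; the Gurung Khema pair is included only to mirror the earlier example). Unlike the pseudocode above, it re-reads the character at the starter position after every composition, and it uses the combining class of the most recent kept character as the blocking check, which also covers the starter+starter case:

    CCC   = { 0x0301 => 230 }                    # toy combining-class table; everything else is treated as 0
    PAIRS = { [0x1611E, 0x1611E] => 0x16121,     # two single bars -> one double bar
              [0x0065, 0x0301]   => 0x00E9 }     # e + combining acute -> e-acute

    # Input is assumed to be canonically decomposed and canonically ordered.
    def canonical_compose(codepoints)
      out = []
      starter = nil                              # index in out of the last starter, if any
      codepoints.each do |c|
        if starter
          # Combining class of the last kept character between the starter and c,
          # or -1 if c directly follows the starter (then nothing can block it).
          blocker = out.length - 1 > starter ? CCC.fetch(out[-1], 0) : -1
          composite = PAIRS[[out[starter], c]]
          if composite && blocker < CCC.fetch(c, 0)   # a primary composite exists and c is not blocked
            out[starter] = composite
            next
          end
        end
        starter = out.length if CCC.fetch(c, 0) == 0
        out << c
      end
      out
    end

    p canonical_compose([0x1611E] * 11).map { |c| "%04X" % c }
    # => ["16121", "16121", "16121", "16121", "16121", "1611E"]
    p canonical_compose([0x0065, 0x0301]).map { |c| "%04X" % c }
    # => ["00E9"]
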