From piotrunio-2004 at wp.pl Thu Dec 4 06:35:47 2025
From: piotrunio-2004 at wp.pl (=?UTF-8?Q?piotrunio-2004=40wp=2Epl?=)
Date: Thu, 04 Dec 2025 13:35:47 +0100
Subject: =?UTF-8?Q?Re=3A_Odp=3A_RE=3A_What_to_do_if_a_legacy_compatibility_character_is_defective=3F?=
In-Reply-To: <<59598cd6e838407f88138b93a707aca2@grupawp.pl>>
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl>
Message-ID: 

I have investigated the situation further and it seems that the defect in the Unicode 13.0–17.0 mapping is even more fundamental than I previously thought. In particular, the proposal L2/25-037 does not acknowledge the proposal L2/00-159, which had already been incorporated into Unicode 3.2. In that proposal, the description of the characters U+23B8 (LEFT VERTICAL BOX LINE) and U+23B9 (RIGHT VERTICAL BOX LINE) exactly matches the proposed characters L2/25-037:1FBFC (BOX DRAWINGS LIGHT LEFT EDGE) and L2/25-037:1FBFD (BOX DRAWINGS LIGHT RIGHT EDGE). In both proposals, those two characters are specified to be aligned to the left or right edge, to span the entire edge (extending to the top and bottom), and to match the thickness of Box Drawings Light lines. The description of the characters U+23BA (HORIZONTAL SCAN LINE-1) and U+23BD (HORIZONTAL SCAN LINE-9) also exactly matches the proposed characters L2/25-037:1FBFA (BOX DRAWINGS LIGHT TOP EDGE) and L2/25-037:1FBFB (BOX DRAWINGS LIGHT BOTTOM EDGE). In both proposals, those two characters are specified to be aligned to the top and bottom edges, to span the entire edge (extending to the left and right), and to match the thickness of Box Drawings Light lines. However, the proposal L2/00-159 had already set a precedent for the usage of [U+23BA, U+23BD, U+23B8, U+23B9] (and not the 1/8 blocks or 1/4 blocks) in the mapping to certain platforms such as The Heath/Zenith 19 Graphics Character Set and The DEC Special Graphics Character Set. This contrasts with the usage of the 1/8 blocks [U+2594, U+2581, U+258F, U+2595] and other related 1/8 or 7/8 block characters in the mapping to PETSCII and Apple II.

Therefore there is a discrepancy between the legacy platforms added in Unicode 3.2 (which use the box drawing lines 23B8, 23B9, 23BA, 23BD) and the legacy platforms added in Unicode 13.0–17.0 (which use the 1/8 blocks 2594, 2581, 258F, 2595).

On 25 October 2025 at 10:27, piotrunio-2004 at wp.pl via Unicode <unicode at corp.unicode.org> wrote:

On 25 October 2025 at 08:29, Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

Again, the identity of the Unicode character is given by encoding the intended mappings. If Unicode decides to map the same character to similar characters on different platforms, that is not a problem, as long as implementers know that the intent is to use a platform-specific rendering (and not assume that there is only one possible rendering per character). If you feel that the guidance available to implementers in the text of the standard or in an annotation of the nameslist is not sufficient, then the remedy would be to ask for the explanation to be updated. We are unfortunately locked in as far as character names are concerned, but we can add a note (best in the text of the standard) that explains that emulators for some systems will need an adjusted design so a sequence or other arrangement of these characters looks correct.

Indeed the character names cannot be changed due to stability policies.
An explanation note has been provided for U+1FB81 that claims "The lines corresponding to 3 and 5 are not actually block elements, but can show any horizontally repeating pattern", but it still implicitly enforces 1/8 blocks for the top and bottom. However, this doesn't address other cases such as the PETSCII C64 variation. And if 1FB70–1FB81, 1FBB5–1FBB8, and 1FBBC were all noted to no longer require exact 1/8 blocks, that would also not remedy the issue, because it would introduce an inconsistency with the existing 1/8 or 7/8 block characters 2581, 2589, 258F, 2594–2595, which already have established compatibility precedents that require the exact fraction, but are also used in the Unicode 13.0 mapping to the PETSCII and Apple II character sets despite those platforms using varying thickness (consistent with light box drawings, except for the 1/8 top and bottom blocks in C64, where the 1/4 top and bottom blocks are made consistent instead).

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmusf at ix.netcom.com Thu Dec 4 14:15:27 2025
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 4 Dec 2025 12:15:27 -0800
Subject: Odp: RE: What to do if a legacy compatibility character is defective?
In-Reply-To: 
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl>
Message-ID: <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com>

On 12/4/2025 4:35 AM, piotrunio-2004 at wp.pl via Unicode wrote:
> I have investigated the situation further and it seems that the defect in > the Unicode 13.0–17.0 mapping is even more fundamental than I > previously thought. In particular, the proposal L2/25-037 does not > acknowledge the proposal L2/00-159, which had already been > incorporated into Unicode 3.2. In that proposal, the description of > characters U+23B8 (LEFT VERTICAL BOX LINE) and U+23B9 (RIGHT VERTICAL > BOX LINE) exactly matches the proposed characters L2/25-037:1FBFC (BOX > DRAWINGS LIGHT LEFT EDGE) and L2/25-037:1FBFD (BOX DRAWINGS LIGHT > RIGHT EDGE). In both proposals, those two characters are specified to > be aligned to the left or right edge, span the entire edge (extending to > the top and bottom), and match the thickness of Box Drawings Light > lines. The description of the characters U+23BA (HORIZONTAL SCAN > LINE-1) and U+23BD (HORIZONTAL SCAN LINE-9) also exactly matches the > proposed characters L2/25-037:1FBFA (BOX DRAWINGS LIGHT TOP EDGE) and > L2/25-037:1FBFB (BOX DRAWINGS LIGHT BOTTOM EDGE). In both proposals, > those two characters are specified to be aligned to the top and bottom > edges, span the entire edge (extending to the left and right), and > match the thickness of Box Drawings Light lines. However, the proposal > L2/00-159 had already set a precedent for usage of [U+23BA, U+23BD, > U+23B8, U+23B9] (and not the 1/8 blocks or 1/4 blocks) in mapping to > certain platforms such as The Heath/Zenith 19 Graphics Character Set > and The DEC Special Graphics Character Set. This contrasts with the > usage of the 1/8 blocks [U+2594, U+2581, U+258F, U+2595] and other related > 1/8 or 7/8 block characters in the mapping to PETSCII and Apple II. > Therefore there is a discrepancy between the legacy platforms added in > Unicode 3.2 (which use the box drawing lines 23B8, 23B9, 23BA, 23BD) > and the legacy platforms added in Unicode 13.0–17.0 (which use the 1/8 > blocks 2594, 2581, 258F, 2595).
> > On 25 October 2025 at 10:27, piotrunio-2004 at wp.pl via Unicode > wrote: > > > On 25 October 2025 at 08:29, Asmus Freytag via Unicode > wrote: > > Again, the identity of the Unicode character is given by > encoding the intended mappings. If Unicode decides to map the > same character to similar characters on different platforms, > that is not a problem, as long as implementers know that the > intent is to use a platform-specific rendering (and not assume > that there is only one possible rendering per character). > > If you feel that the guidance available to implementers in the > text of the standard or in an annotation of the nameslist is > not sufficient, then the remedy would be to ask for the > explanation to be updated. We are unfortunately locked in as > far as character names are concerned, but we can add a note > (best in the text of the standard) that explains that > emulators for some systems will need an adjusted design so a > sequence or other arrangement of these characters looks correct. > > Indeed the character names cannot be changed due to stability > policies. An explanation note has been provided for U+1FB81 that > claims "The lines corresponding to 3 and 5 are not actually block > elements, but can show any horizontally repeating pattern", but > still implicitly enforces 1/8 blocks for the top and bottom. However, > this doesn't address other cases such as the PETSCII C64 > variation. And if 1FB70–1FB81, 1FBB5–1FBB8, 1FBBC were all noted to > no longer require exact 1/8 blocks, that would also not remedy the > issue because it would introduce an inconsistency with the > existing 1/8 or 7/8 block characters 2581 2589 258F 2594–2595, > which already have established compatibility precedents that > require the exact fraction, but are also used in the Unicode 13.0 > mapping to PETSCII and Apple II character sets despite those > platforms using varying thickness (consistent with light box > drawings, except for the 1/8 top and bottom blocks in C64, where > the 1/4 top and bottom blocks are made consistent instead). > > > >

What is missing is an actual proposal. That is, not just analysis or exposition, but actual proposed wording or proposed encoding that would fix the issue. That would need to be provided as a UTC document (aka L2 document) submission, with the analysis appended in a background section.

A./

PS: I am not convinced that platform-specific mappings (glyphs) are an issue, because the scenario where these data are reliably transferred *between* legacy implementations can't have existed then, so it's questionable why it needs to be perfect today. My assumption would be that the use case is lossless round trip from (each) legacy emulator to Unicode and back. Having PETSCII / Apple II specific characters does not improve things, because any data stream containing those could not be displayed on any other emulator. This is different from legacy characters mapped to letters and common text symbols because we have an expectation that we can share text across devices (or emulators).

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From piotrunio-2004 at wp.pl Thu Dec 4 16:37:29 2025 From: piotrunio-2004 at wp.pl (=?UTF-8?Q?piotrunio-2004=40wp=2Epl?=) Date: Thu, 04 Dec 2025 23:37:29 +0100 Subject: =?UTF-8?Q?Re=3A_Odp=3A_RE=3A_What_to_do_if_a_legacy_compatibility_character_is_defective=3F?= In-Reply-To: <<16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com>> References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl> <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com> Message-ID: Dnia 04 grudnia 2025 21:28 Asmus Freytag via Unicode <unicode at corp.unicode.org> napisa?(a): On 12/4/2025 4:35 AM, piotrunio-2004 at wp.pl via Unicode wrote: I have investigated the situation further and it seems that defect in the Unicode 13.0?17.0 mapping is even more fundamental than I previously thought. In particular, the proposal L2/25-037 does not acknowledge the proposal L2/00-159, which had already been incorporated into Unicode 3.2.?In that proposal, the description of characters U+23B8 (LEFT VERTICAL BOX LINE) and U+23B9 (RIGHT VERTICAL BOX LINE) exactly matches the proposed characters L2/25-037:1FBFC (BOX DRAWINGS LIGHT LEFT EDGE) and L2/25-037:1FBFD (BOX DRAWINGS LIGHT RIGHT EDGE). In both proposals, those two characters are specified to be aligned to left or right edge, span the entire edge (extending to the top and bottom), and match the thickness of Box Drawings Light lines. The description of the characters U+23BA (HORIZONTAL SCAN LINE-1) and U+23BD (HORIZONTAL SCAN LINE-9) also exactly matches the proposed characters L2/25-037:1FBFA (BOX DRAWINGS LIGHT TOP EDGE) and L2/25-037:1FBFB (BOX DRAWINGS LIGHT BOTTOM EDGE). In both proposals, those two characters are specified to be aligned to top and bottom edges, span the entire edge (extending to the left and right), and match the thickness of Box Drawings Light lines. However, the proposal L2/00-159 had already set precedent for usage of [U+23BA, U+23BD, U+23B8, U+23B9] (and not the 1?8 blocks or 1?4 blocks) in mapping to certain platforms such as?The Heath/Zenith 19 Graphics Character Set and?The DEC Special Graphics Character Set. This contrasts with the usage of 1?8 blocks [U+2594, U+2581, U+258F, U+2595] and other related 1?8 or 7?8 block characters in the mapping to PETSCII and Apple II.? Therefore there is a discrepancy between the legacy platforms added in Unicode 3.2 (which use the box drawing lines 23B8, 23B9, 23BA, 23BD) and the legacy platforms added in Unicode 13.0?17.0 (which use 1?8 blocks 2594, 2581, 258F, 2595). Dnia 25 pa?dziernika 2025 10:27 piotrunio-2004 at wp.pl via Unicode <unicode at corp.unicode.org> napisa?(a): Dnia 25 pa?dziernika 2025 08:29 Asmus Freytag via Unicode <unicode at corp.unicode.org> napisa?(a): Again, the identity of the Unicode character is giving by encoding the intended mappings. If Unicode decides to map the same character to similar characters on different platforms, that is not a problem, as long as implementers know that the intent is to use a platform-specific rendering (and not assume that there is only one possible rendering per character). If you feel that the guidance available to implementers in the text of the standard or in an annotation of the nameslist is not sufficent, then the remedy would be to ask for the explanation to be updated. 
We are unfortunately locked in as far as character names are concerned, but we can add a note (best in the text of the standard) that explains that emulators for some systems will need an adjusted design so a sequence or other arrangement of these characters looks correct. Indeed the character names cannot be changed due to stability policies. An explanation note has been provided for U+1FB81 that claims "The lines corresponding to 3 and 5 are not actually block elements, but can show any horizontally repeating pattern", but still implicitly enforces 1?8 blocks for top and bottom. However, this doesn't address other cases such as the PETSCII C64 variation. And if?1FB70?1FB81 1FBB5?1FBB8 1FBBC were all noted to no longer require exact 1?8 blocks, that would also not remedy the issue because it would introduce an inconsistency with the existing 1?8 or 7?8 block characters 2581 2589 258F 2594?2595, which already have established compatibility precedents that require the exact fraction, but are also used in the Unicode 13.0 mapping to PETSCII and Apple II character sets despite those platforms using varying thickness (consistent with light box drawings, except for the 1?8 top and bottom blocks in C64, where the 1?4 top and bottom blocks are made consistent instead). What is missing is an actual proposal. That is, not just analysis or exposition, but actual proposed wording or proposed encoding that would fix the issue. That would need to be provided as a UTC document (aka L2 document) submission, with the analysis appended in a background section.? A./ PS: I am not convinced that platform-specific mappings (glyphs) are an issue, because the scenario where these data are reliably transferred *between* legacy implementations can't have existed then, so it's questionably why it needs to be perfect today. My assumption would be that the use case is lossless round trip from (each) legacy emulator to Unicode and back. Having PETSII / Apple II specific characters does not improve things, because any data stream containing those could not be displayed on any other emulator. This is different from legacy characters mapped to letters and common text symbols because we have an expectation that we can share text across devices (or emulators). I have a draft of a follow up of L2/25-037?that analyzes the character sets thoroughly with the additional context provided by?L2/00-159 characters (including the particularly complex relationship between box drawings, 1?8 blocks, and 1?4 blocks in PETSCII), provides additional explanation and screenshot of evidence of HP 264x character in both isolated and in connected usage, and arrives at the conclusion that 23 characters (that is, all in?L2/25-037 except for the 4 that were already added by?L2/00-159) should be added. However, the SEW announced that they will not be discussing these characters any further, so how could any follow up of the proposal possibly get incorporated into Unicode? -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Thu Dec 4 18:11:46 2025 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 4 Dec 2025 16:11:46 -0800 Subject: Odp: RE: What to do if a legacy compatibility character is defective? 
In-Reply-To: 
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl> <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com>
Message-ID: <46cf18b2-21ae-4aaf-9c44-9b641e0d3a27@ix.netcom.com>

On 12/4/2025 2:37 PM, piotrunio-2004 at wp.pl via Unicode wrote:
> However, the SEW announced that they will not be discussing these > characters any further, so how could any follow up of the proposal > possibly get incorporated into Unicode?

Nothing can force the SEW to accept any particular proposal. However, unless there's a document actually submitted, there's nothing that will happen at all, no matter what.

If it were up to me, I would focus on suggesting specific language for the standard or the nameslist rather than proposing new characters. Such feedback may be reviewed by other working groups, not solely SEW.

A./

From piotrunio-2004 at wp.pl Fri Dec 5 00:55:17 2025
From: piotrunio-2004 at wp.pl (=?UTF-8?Q?piotrunio-2004=40wp=2Epl?=)
Date: Fri, 05 Dec 2025 07:55:17 +0100
Subject: =?UTF-8?Q?Re=3A_Odp=3A_RE=3A_What_to_do_if_a_legacy_compatibility_character_is_defective=3F?=
In-Reply-To: <<46cf18b2-21ae-4aaf-9c44-9b641e0d3a27@ix.netcom.com>>
References: <03c761daf930423295d0e5b5f8de424c@grupawp.pl> <105968cf-57f1-418d-8782-c84132d676e8@ix.netcom.com> <14bd764ef49d4e64b11144eeb0ce5a92@grupawp.pl> <59598cd6e838407f88138b93a707aca2@grupawp.pl> <16921e34-e06e-427b-ad69-c0a6bcbb6c69@ix.netcom.com> <46cf18b2-21ae-4aaf-9c44-9b641e0d3a27@ix.netcom.com>
Message-ID: 

On 5 December 2025 at 01:23, Asmus Freytag via Unicode <unicode at corp.unicode.org> wrote:

On 12/4/2025 2:37 PM, piotrunio-2004 at wp.pl via Unicode wrote:

However, the SEW announced that they will not be discussing these characters any further, so how could any follow up of the proposal possibly get incorporated into Unicode?

Nothing can force the SEW to accept any particular proposal. However, unless there's a document actually submitted, there's nothing that will happen at all, no matter what.

I have already submitted the draft of the follow up. However, I don't intend to force the characters to be accepted; instead, I requested that the SEW be made aware of the information in the follow up, so that I can continue receiving more detailed feedback that I can then use to further clarify the issue or explore potential alternative solutions.

If it were up to me, I would focus on suggesting specific language for the standard or the nameslist rather than proposing new characters. Such feedback may be reviewed by other working groups, not solely SEW.

A./

The issue for the 1/8 blocks (in Apple II, PETSCII, etc.) is that their encoding policy is not consistent with the previously encoded box drawing lines (in Heath/Zenith 19, DEC Special Graphics, etc.), which makes it inappropriate to use 1/8 blocks in the encoding of those characters for those platforms. The issue is even worse for the C64/C128 PETSCII mapping, which is not even consistent within the same platform: the characters that L2/19-025 mapped to the 1/8 blocks 1FB7C–1FBFF (????) are aligned with the top and bottom 1/4 blocks (??) but are misaligned with the top and bottom 1/8 blocks (??), which makes this a misuse of the 1/8 block character identity (in the context of that platform, the top and bottom 1/4 blocks have the same thickness as the box drawings, which better matches the usage of 23BA 23BD ?? instead).
Whereas the issue for the HP 264x character is that the character can be used independently of the other character that Unicode unified it with, and that it forms a distinct connection type. As far as I can tell, those issues are baked into the existing Unicode 13.0–17.0 mappings of those platforms, so I don't see how 'specific language for the standard or the nameslist' could possibly address those issues.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lists at akphs.com Sun Dec 14 10:33:03 2025
From: lists at akphs.com (Phil Smith III)
Date: Sun, 14 Dec 2025 11:33:03 -0500
Subject: Combining characters
Message-ID: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>

This may be dumb/hopelessly naïve but here goes!

My observations/inferences/suppositions re combining characters:
* They were originally implemented as a way to reduce the total number of characters
* We're well past any likely original vision of the number of characters/scripts anyway, so that "savings" is kinda meaningless
* Combiners are a pain overall (normalization!)
* Barring Earth joining the Galactic Federation and Unicode deciding to include all twelve billion alien languages, the current scheme will suffice forever (yeah, yeah, I know, "never say forever", but this feels like IPv6 addresses, "there are just SO many ...")

Ergo, I posit that there should never be a need for any NEW combiners. Is there any sort of official or unofficial policy to that end? Inquiring minds and all that...

Thanks,
...phsiii

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From doug at ewellic.org Sun Dec 14 11:16:19 2025
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 14 Dec 2025 17:16:19 +0000
Subject: Combining characters
In-Reply-To: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
Message-ID: 

Phil Smith III wrote:

> My observations/inferences/suppositions re combining characters:
> - They were originally implemented as a way to reduce the total number > of characters

Another, possibly more farsighted reason is that, if a newly needed letter-with-diacritic can be represented today with an existing letter and an existing diacritic, instead of waiting possibly years for the precomposed combination to be encoded, that time saving is a big win for the user community.

> - Combiners are a pain overall (normalization!)
> [...]
> Ergo, I posit that there should never be a need for any NEW combiners.

Combining-character mechanisms are already implemented at many levels (processing, counting, fonts, rendering engines, etc.). More combining characters that work essentially the same as existing ones don't really add to the pain.

-- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From lists at akphs.com Sun Dec 14 11:37:35 2025
From: lists at akphs.com (Phil Smith III)
Date: Sun, 14 Dec 2025 12:37:35 -0500
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
Message-ID: <011001dc6d20$56261670$02724350$@akphs.com>

Doug Ewell wrote:
>Another, possibly more farsighted reason is that, if a newly needed >letter-with-diacritic can be represented today with an existing letter >and an existing diacritic, instead of waiting possibly years for the >precomposed combination to be encoded, that time saving is a big win >for the user community.

"newly needed letter-with-diacritic" -- does that happen? Venusian gets added and the ONLY issue is that it needs J+Combining Grave?
I see the point but am not sure it's realistic, and in any case isn't what I'm talking about: I'm asking about NEW combiners. Though "invalid" combinations can be an issue now, with different engines rendering them differently. At least if code comes across J+Combining Grave now, the combining-ness is known. When a Combining Backslash is added for Jovian, well, now that character is new and normalization adventures abound. >More combining characters that work essentially the same as existing >ones don?t really add to the pain. Actually they add a LOT of pain/complexity for certain use cases, because of normalization. Thanks; I don't mean to sound like "Go away", this is exactly the kind of discussion I was hoping for! The fact that there haven't been any new combiners in several versions (I think?) is what made me think that there might be some level of "No more, not now, not ever" policy. From doug at ewellic.org Sun Dec 14 11:57:53 2025 From: doug at ewellic.org (Doug Ewell) Date: Sun, 14 Dec 2025 17:57:53 +0000 Subject: Combining characters In-Reply-To: <011001dc6d20$56261670$02724350$@akphs.com> References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: Phil Smith III wrote: > "newly needed letter-with-diacritic" -- does that happen? Venusian > gets added and the ONLY issue is that it needs J+Combining Grave? I > see the point but am not sure it's realistic, This sort of thing is not uncommon with Native American orthographies (right here on Earth!) that are newly created, or for which Unicode encoding is new. > and in any case isn't what I'm talking about: I'm asking about NEW > combiners. Though "invalid" combinations can be an issue now, with > different engines rendering them differently. At least if code comes > across J+Combining Grave now, the combining-ness is known. When a > Combining Backslash is added for Jovian, well, now that character is > new and normalization adventures abound. Normalization (NFC or NFD, not NFK*) for characters like this comes into play only when the character exists as both a precomposed unitary character and a combining sequence. When there is only one or the other, normalization to NFC or NFD yields the same result, and is thus a no-op, and not particularly adventurous. > Actually they add a LOT of pain/complexity for certain use cases, > because of normalization. Only if a separate NFC (precomposed) or NFD (decomposed) form is added where one already exists, and IIRC there is indeed a policy against that. -- Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org From irgendeinbenutzername at gmail.com Sun Dec 14 12:28:16 2025 From: irgendeinbenutzername at gmail.com (Charlotte Eiffel Lilith Buff) Date: Sun, 14 Dec 2025 19:28:16 +0100 Subject: Combining characters In-Reply-To: <011001dc6d20$56261670$02724350$@akphs.com> References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: > The fact that there haven't been any new combiners in several versions I?m actually really curious what gave you that impression. Pretty much every Unicode update adds tons of new combining characters (the only exceptions being those weird inbetween-y versions we occasionally get). Am So., 14. Dez. 
2025 um 18:39 Uhr schrieb Phil Smith III via Unicode < unicode at corp.unicode.org>: > Doug Ewell wrote: > >Another, possibly more farsighted reason is that, if a newly needed > >letter-with-diacritic can be represented today with an existing letter > >and an existing diacritic, instead of waiting possibly years for the > >precomposed combination to be encoded, that time saving is a big win > >for the user community. > > "newly needed letter-with-diacritic" -- does that happen? Venusian gets > added and the ONLY issue is that it needs J+Combining Grave? I see the > point but am not sure it's realistic, and in any case isn't what I'm > talking about: I'm asking about NEW combiners. Though "invalid" > combinations can be an issue now, with different engines rendering them > differently. At least if code comes across J+Combining Grave now, the > combining-ness is known. When a Combining Backslash is added for Jovian, > well, now that character is new and normalization adventures abound. > > >More combining characters that work essentially the same as existing > >ones don?t really add to the pain. > > Actually they add a LOT of pain/complexity for certain use cases, because > of normalization. > > Thanks; I don't mean to sound like "Go away", this is exactly the kind of > discussion I was hoping for! The fact that there haven't been any new > combiners in several versions (I think?) is what made me think that there > might be some level of "No more, not now, not ever" policy. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists at akphs.com Sun Dec 14 12:47:13 2025 From: lists at akphs.com (Phil Smith III) Date: Sun, 14 Dec 2025 13:47:13 -0500 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: <012d01dc6d2a$105656a0$310303e0$@akphs.com> Well, I?m sorta ?asking for a friend? ? a coworker who is deep in the weeds of working with something Unicode-related. I?m blaming him for having told me that :) If it?s wrong, it certainly changes things a lot, and makes my question moot(ish?)! From: Charlotte Eiffel Lilith Buff Sent: Sunday, December 14, 2025 1:28 PM To: Phil Smith III Cc: Unicode Subject: Re: Combining characters > The fact that there haven't been any new combiners in several versions I?m actually really curious what gave you that impression. Pretty much every Unicode update adds tons of new combining characters (the only exceptions being those weird inbetween-y versions we occasionally get). Am So., 14. Dez. 2025 um 18:39 Uhr schrieb Phil Smith III via Unicode >: Doug Ewell wrote: >Another, possibly more farsighted reason is that, if a newly needed >letter-with-diacritic can be represented today with an existing letter >and an existing diacritic, instead of waiting possibly years for the >precomposed combination to be encoded, that time saving is a big win >for the user community. "newly needed letter-with-diacritic" -- does that happen? Venusian gets added and the ONLY issue is that it needs J+Combining Grave? I see the point but am not sure it's realistic, and in any case isn't what I'm talking about: I'm asking about NEW combiners. Though "invalid" combinations can be an issue now, with different engines rendering them differently. At least if code comes across J+Combining Grave now, the combining-ness is known. When a Combining Backslash is added for Jovian, well, now that character is new and normalization adventures abound. 
>More combining characters that work essentially the same as existing >ones don't really add to the pain.

Actually they add a LOT of pain/complexity for certain use cases, because of normalization.

Thanks; I don't mean to sound like "Go away", this is exactly the kind of discussion I was hoping for! The fact that there haven't been any new combiners in several versions (I think?) is what made me think that there might be some level of "No more, not now, not ever" policy.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jukkakk at gmail.com Sun Dec 14 13:50:38 2025
From: jukkakk at gmail.com (Jukka K. Korpela)
Date: Sun, 14 Dec 2025 21:50:38 +0200
Subject: Combining characters
In-Reply-To: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com>
Message-ID: 

The question is neither dumb nor naïve. However, it has been addressed and answered several times in various discussions. There might be different views on this in the Unicode community, so I'll label the following as mine only.

Unicode has included precomposed characters such as ??? and ??? for compatibility, since they existed in earlier character codes. If Unicode were designed today and with no regard to earlier codes, it would have just base characters and combining marks. But due to the existence and widespread use of codes with precomposed characters, they were included into Unicode and defined as distinct, but "compatibility equivalent" to combinations of base characters and combining marks. This means that you are allowed, but not required or even encouraged, to make a distinction between, say, the single character 'é' and the sequence of 'e' followed by a combining acute accent.

Combining characters provide a general mechanism for adding combining marks to any character. You can take this to extremes and absurdity, creating a character with dozens or zillions of marks above and below it, but this does not prevent meaningful use of combining characters.

Yucca, https://jkorpela.fi

On Sun, 14 Dec 2025 at 18:36, Phil Smith III via Unicode (unicode at corp.unicode.org) wrote:
> This may be dumb/hopelessly naïve but here goes! > > > > My observations/inferences/suppositions re combining characters: > > - They were originally implemented as a way to reduce the total number > of characters > - We're well past any likely original vision of the number of > characters/scripts anyway, so that "savings" is kinda meaningless > - Combiners are a pain overall (normalization!) > - Barring Earth joining the Galactic Federation and Unicode deciding > to include all twelve billion alien languages, the current scheme will > suffice forever (yeah, yeah, I know, "never say forever", but this feels > like IPv6 addresses, "there are just SO many ...") > > > > Ergo, I posit that there should never be a need for any NEW combiners. > > > Is there any sort of official or unofficial policy to that end? Inquiring > minds and all that... > > > > Thanks, > > ...phsiii >

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From don.hosek at gmail.com Sun Dec 14 14:25:22 2025
From: don.hosek at gmail.com (Don Hosek)
Date: Sun, 14 Dec 2025 14:25:22 -0600
Subject: Combining Characters
Message-ID: 

> > When a Combining Backslash is added for Jovian, well, now that character > is new and normalization adventures abound. >

Just one additional note on this: Everything around combining characters, normalization and grapheme segmentation is data-driven.
Other than when new rules for Indic scripts were introduced with Unicode 15.1.0, the only thing I?ve needed to update for my Unicode grapheme library has been to import the newest Unicode data tables. I?ve not written normalization code (yet), but from everything that I?ve seen on that front, it looks like a similar thing where again, everything is data-driven. The only case I can see where things could get weird would be if there suddenly became some weird case where, e.g., the Jovians insisted that the combining backslash must appear before the letter and not after it (and it?s been a few years since I had to really look at the rules and this might be possible with the existing combining character classes anyway). -dh -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sun Dec 14 14:32:06 2025 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 14 Dec 2025 12:32:06 -0800 Subject: Combining characters In-Reply-To: <012d01dc6d2a$105656a0$310303e0$@akphs.com> References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com> Message-ID: On Sun, Dec 14, 2025 at 9:40?AM Phil Smith III via Unicode < unicode at corp.unicode.org> wrote: > The fact that there haven't been any new combiners in several versions (I > think?) We publish data files, so that you need not guess. Are we talking about combining marks per se, which include Indic-script vowel signs, or are we talking about characters with non-zero Canonical_Combining_Class? Unicode 15, 16, and 17 added 135 characters with General_Category=M. Unicode 15, 16, and 17 added 56 characters with ccc!=0. (These are a subset of the above.) On Sun, Dec 14, 2025 at 10:50?AM Phil Smith III via Unicode < unicode at corp.unicode.org> wrote: > Well, I?m sorta ?asking for a friend? ? a coworker who is deep in the > weeds of working with something Unicode-related. > Maybe you could ask your coworker to chime in and say what he is working on, and maybe we can give some tips? markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From list+unicode at jdlh.com Sun Dec 14 14:40:25 2025 From: list+unicode at jdlh.com (list+unicode at jdlh.com) Date: Sun, 14 Dec 2025 12:40:25 -0800 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: <7bf3ed16-4fc5-4fc9-929b-5eb34b312075@jdlh.com> Following up on Charlotte's comment: for Unicode 17.0, released in September 2025, see: 1. *Unicode? 17.0 Versioned Charts Index *, which lists all 4803 characters newly encoded in Unicode 17.0, including 27 in "Combining Diacritical Marks Extended". 2. *Combining Diacritical Marks Extended* which has pictures of new combining characters U-1ACF..U-1ADD and U-1AE0..U-1AEB. So, there have been new combiners in the most recent version of The Unicode Standard. ???? ?Jim DeLaHunt On 2025-12-14 10:28, Charlotte Eiffel Lilith Buff via Unicode wrote: > > The fact that there haven't been any new combiners in several versions > > I?m actually really curious what gave you that impression. Pretty much > every Unicode update adds tons of new combining characters (the only > exceptions being those weird inbetween-y versions we occasionally get). > > Am So., 14. Dez. 
2025 um 18:39?Uhr schrieb Phil Smith III via Unicode > : > > Doug Ewell wrote: > >Another, possibly more farsighted reason is that, if a newly needed > >letter-with-diacritic can be represented today with an existing > letter > >and an existing diacritic, instead of waiting possibly years for the > >precomposed combination to be encoded, that time saving is a big win > >for the user community. > > "newly needed letter-with-diacritic" -- does that happen? Venusian > gets added and the ONLY issue is that it needs J+Combining Grave? > I see the point but am not sure it's realistic, and in any case > isn't what I'm talking about: I'm asking about NEW combiners. > Though "invalid" combinations can be an issue now, with different > engines rendering them differently. At least if code comes across > J+Combining Grave now, the combining-ness is known. When a > Combining Backslash is added for Jovian, well, now that character > is new and normalization adventures abound. > > >More combining characters that work essentially the same as existing > >ones don?t really add to the pain. > > Actually they add a LOT of pain/complexity for certain use cases, > because of normalization. > > Thanks; I don't mean to sound like "Go away", this is exactly the > kind of discussion I was hoping for! The fact that there haven't > been any new combiners in several versions (I think?) is what made > me think that there might be some level of "No more, not now, not > ever" policy. > > -- . --Jim DeLaHunt,jdlh at jdlh.com http://blog.jdlh.com/ (http://jdlh.com/) multilingual websites consultant, Vancouver, B.C., Canada -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Sun Dec 14 14:40:54 2025 From: markus.icu at gmail.com (Markus Scherer) Date: Sun, 14 Dec 2025 12:40:54 -0800 Subject: Combining Characters In-Reply-To: References: Message-ID: On Sun, Dec 14, 2025 at 12:27?PM Don Hosek via Unicode < unicode at corp.unicode.org> wrote: > When a Combining Backslash is added for Jovian, well, now that character >> is new and normalization adventures abound. >> > > [...] > > The only case I can see where things could get weird would be if there > suddenly became some weird case where, e.g., the Jovians insisted that the > combining backslash must appear before the letter and not after it (and > it?s been a few years since I had to really look at the rules and this > might be possible with the existing combining character classes anyway). > Some of the Indic-script vowel marks *appear graphically* before their consonant. We also have scripts like Thai with characters that have the Logical_Order_Exception property and are encoded in memory before their consontants. However, when the Jovians arrive with their billion-character encoding, then Unicode will become a legacy encoding. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Dec 14 16:02:41 2025 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 14 Dec 2025 14:02:41 -0800 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: On 12/14/2025 9:57 AM, Doug Ewell via Unicode wrote: > Normalization (NFC or NFD, not NFK*) for characters like this comes into play only when the character exists as both a precomposed unitary character and a combining sequence. 
> When there is only one or the other, normalization to NFC or NFD yields the same result, and is thus a no-op, and not particularly adventurous.

This is actually incorrect. (And Doug actually knows better :) ).

It would be correct for a sequence of a base character with a *single* combining mark, but as soon as you have two or more combining marks, their order is defined by NFC. The idea is that if two combining marks don't interact (such as by stacking), different orders could result in the same display, and normalization enforces a preferred ordering.

To make matters more complex, some combining marks are defined to not reorder. Those can be in any order defined by the author and could lead to duplicate encoding for the same display. The reasons behind supporting that are a bit complex, but generally it's done for scripts other than Latin.

But in general, *canonical reordering* is a thing and is part of normalization.

A./

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From asmusf at ix.netcom.com Sun Dec 14 16:44:49 2025
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 14 Dec 2025 14:44:49 -0800
Subject: Combining characters
In-Reply-To: <012d01dc6d2a$105656a0$310303e0$@akphs.com>
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com>
Message-ID: 

On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote:
> > Well, I'm sorta "asking for a friend" -- a coworker who is deep in the > weeds of working with something Unicode-related. I'm blaming him for > having told me that :) > >

This actually deserves a deeper answer, or a more "bird's-eye" one, if you want. Read to the end.

The way you asked the question seems to hint that in your minds you and your friend conflate the concept of "combining" mark and "diacritic". That would not be surprising if you are mainly familiar with European scripts and languages, because in that case, this equivalence kind of applies. And you may also be thinking mainly of languages and their orthographies, and not of notations, phonetic or otherwise, that give rise to unusual combinations.

Most European languages do have a reasonably small, fixed set of letters with diacritics in their orthographies, even though there are many languages where, if you ask the native users to list all the combinations, they will fall short. An example is the use of an accent with the letter 'e' in some of the Scandinavian languages to distinguish two identically spelled small words that have very different functions in the syntax. You will see that accent used in books and formal writing, but I doubt people bother when writing a text message.

The focus on code space is a red herring to a degree. The real difficulty would be in cataloging all of the rare combinations and getting all fonts to be aware of them. It is much easier to encode the diacritic as a combining character and have general rules for layout. With modern fonts, you can, in principle, get acceptable display even for unexpected combinations without the effort of first cataloging, then publishing, and then having all font vendors explicitly add an implementation for that combination before it can be used.

Other languages and scripts have combinatorics as part of their DNA, so to speak. Their structural unit is not the letter (with or without decorations) but the syllable, which is naturally combined from components that graphically attach to each other or even fuse into a combined shape.
Because that process is not random, it's easier to encode these structural elements (some of which are combining characters) than to try to enumerate the possible combinations. It doesn't hurt that the components nicely map onto discrete keys on the respective keyboards. Notations, such as scientific notation, also often assigns a discrete identity to the combining mark. A dot above can be the first derivative with respect to time, which can be applied to any letter designating a variable, which can be, at the minimum any letter from the Latin or Greek alphabets, but why stop there. There's nothing in the notation itself that would enjoin a scientist from combining that dot with any character they find suitable. The only sensible solution is encoding a combining mark, even though some letters exist that have a dot above as part of an orthography and are also encoded in precomposed form. In contrast, Chinese ideographs, while visually composed of identifiable elements, are treated by their users as units and well before Unicode came along there was an established approach how to manage things like keyboard entry while encoding these as precomposed entities and not as their building blocks. A big part of the encoding decision is always to do what makes sense for the writing system or notation (and the script it is based on). For a universal encoding, such as Unicode, there simply isn't a "one-size-fits-all" solution that would work. But if you look at this universal encoding only from a very narrow perspective of the orthographies that you are most familiar with, then, understandably, you might feel that anything that isn't directly required (from your point of view) is an unnecessary complication. However, once you adopt a more universal perspective, it's much easier to not rat-hole on some seeming inconsistencies, because you can always discover how certain decisions relate to the specific requirements for one or more writing systems. Importantly, this often includes requirements based on de-facto implementations for these systems before the advent of Unicode. Being universal, Unicode needed to be designed to allow easy conversion from all existing data sets. And for European scripts, the business community and the librarians had competing systems, one with limited sets of precomposed characters and one with combining marks for diacritics. The ultimate source of the duality stems from there, but the two communities had different goals. One wanted to efficiently handle the common case (primarily mapping all the modern national typewriters into character encoding) while the other was interested in a full representation of anything that could be present in printed book titles (for cataloging), including unusual or historic combinations. In conclusion, the question isn't a bad one, but the real answer is that complexity is very much part of human writing, and when you design (and extend) a universal character encoding, you will need to be able to represent that full degree of complexity. Therefore, what seem like obvious simplifications really aren't feasible, unless you give up on attempting to be universal. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
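A concrete way to see the composition behavior discussed in the messages above is to run a couple of normalization calls. The following is only a minimal sketch using Python's standard unicodedata module (an illustration of my own, not something referred to in the thread; any Unicode-aware library would do). The base letters 'e' and 'q' are chosen simply because a precomposed "e with dot above" (U+0117) happens to exist while no precomposed "q with dot above" does.

    import unicodedata

    # "e" + COMBINING DOT ABOVE has a precomposed counterpart (U+0117),
    # so NFC composes the pair into the single precomposed character,
    # and NFD decomposes it again.
    assert unicodedata.normalize("NFC", "e\u0307") == "\u0117"
    assert unicodedata.normalize("NFD", "\u0117") == "e\u0307"

    # "q" + COMBINING DOT ABOVE has no precomposed counterpart, so NFC
    # and NFD both leave the combining sequence exactly as it was typed.
    for form in ("NFC", "NFD"):
        assert unicodedata.normalize(form, "q\u0307") == "q\u0307"

Whether a precomposed form happens to exist only changes the normalized spelling, not what an author is allowed to write, which is the point made above about a combining dot above being applicable to any letter used as a variable.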
URL: From ashpilkin at gmail.com Sun Dec 14 17:07:41 2025 From: ashpilkin at gmail.com (Alex Shpilkin) Date: Mon, 15 Dec 2025 01:07:41 +0200 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: On Sun, Dec 14 2025 at 14:02:41 -08:00:00, Asmus Freytag via Unicode wrote: > To make matters more complex, some combining marks are defined to not > reorder. Those can be in any order defined by the author and could > lead to duplicate encoding for the same display. The reasons behind > supporting that are a bit complex, but generally it's done for > scripts other than Latin. Amusingly, study of literal Latin, the language, uses two combining marks of the same CCC together as a matter of course: dictionaries mark a vowel with (what in NFD would be) the sequence COMBINING MACRON, COMBINING BREVE to tell the reader that a syllable?s length either varies or cannot be determined. -- Cheers, Alex From mark at kli.org Sun Dec 14 17:22:06 2025 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 14 Dec 2025 18:22:06 -0500 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com> Message-ID: On 12/14/25 5:44 PM, Asmus Freytag via Unicode wrote: > On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote: >> >> Well, I?m sorta ?asking for a friend? ? a coworker who is deep in the >> weeds of working with something Unicode-related. I?m blaming him for >> having told me that :) >> >> > This actually deserves a deeper answer, or a more "bird's-eye" one, if > you want. Read to the end. > > The way you asked the question seems to hint that in your minds you > and your friend conflate the concept of "combining" mark and > "diacritic". That would not be surprising if you are mainly familiar > with European scripts and languages, because in that case, this > equivalence kind of applies. > Yes.? This is crucial.? You (Phil) are writing like "sheez, so there's e and there's e-with-an-acute, we might as well just treat them like separate letters."? And that maybe makes sense for languages where "combining characters" are maybe two or three diacritics that can live on five or six letters.? Maybe it does make sense to consider those combinations as distinct letters (indeed, some of the languages in question do just that.)? But some combining characters are more rightly perceived as things separate from the letters which are written in the same space (and have historically always been considered so).? The most obvious examples would be Hebrew and Arabic vowel-points.? Does it really make sense to consider ?? and ?? and ??? and all the other combinatorics as separate distinct things, when they clearly contain separate units, each of which has its own consistent character?? Throw in the Hebrew "accents" (cantillation marks) and you're talking an enormous combinatorial explosion at the *cost* of simplicity and consistency, not improving it.? Ditto Indic vowel-marks and a jillion other abjads and abugidas.? If anything, there's a better case to be made that the precomposed letters were maybe a wrong move. (TL;DR: what Asmus said.) ~mark -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Sun Dec 14 17:54:28 2025 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=) Date: Mon, 15 Dec 2025 08:54:28 +0900 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> Message-ID: <191bffe1-705e-4865-9f60-80f990c19cff@it.aoyama.ac.jp> Hello everybody, On 2025-12-15 02:57, Doug Ewell via Unicode wrote: > Phil Smith III wrote: > Only if a separate NFC (precomposed) or NFD (decomposed) form is added where one already exists, and IIRC there is indeed a policy against that. Yes. It's at https://www.unicode.org/policies/stability_policy.html, under the heading "Normalization Stability". It's written in a result-oriented way, but it essentially means that if a text can already be expressed in decomposed form, no new precompositions will be added. That doesn't mean that it's not possible to add new precompositions and decompositions at the same time, e.g. for a new script (see my next mail for an example). Regards, Martin. From duerst at it.aoyama.ac.jp Sun Dec 14 17:54:33 2025 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=) Date: Mon, 15 Dec 2025 08:54:33 +0900 Subject: Combining Characters In-Reply-To: References: Message-ID: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp> Hello everybody, On 2025-12-15 05:25, Don Hosek via Unicode wrote: > Just one additional note on this: Everything around combining characters, > normalization and grapheme segmentation is data-driven. Other than when new > rules for Indic scripts were introduced with Unicode 15.1.0, the only thing > I?ve needed to update for my Unicode grapheme library has been to import > the newest Unicode data tables. I?ve not written normalization code (yet), > but from everything that I?ve seen on that front, it looks like a similar > thing where again, everything is data-driven. That's essentially true, based on my experience with Unicode-related code for the programming language Ruby. > The only case I can see where things could get weird would be if there > suddenly became some weird case where, e.g., the Jovians insisted that the > combining backslash must appear before the letter and not after it (and > it?s been a few years since I had to really look at the rules and this > might be possible with the existing combining character classes anyway). Because of the way we have optimized normalization in Ruby (caching normalization results for runs of a base character followed by modifiers), that wasn't exactly true when we upgraded to Unicode 16.0.0. See the "Normalization Behavior" entry at https://www.unicode.org/versions/Unicode16.0.0/#Migration. New scripts introduced in 16.0.0 (Kirat Rai, Tulu-Tigalari, and Gurung Khema) contained combining marks that had combining class 0 and were also base characters combining with other combining marks (or even with themselves). That was something we hadn't taken account of in our implementation previously (because it was not needed). You can see an example at https://github.com/ruby/ruby/blob/master/test/test_unicode_normalize.rb#L219: assert_equal "\u{16121 16121 16121 16121 16121 1611E}", "\u{1611E 16121 16121 16121 16121 16121}".unicode_normalize U+1611E is GURUNG KHEMA VOWEL SIGN AA, a single bar on top of a character. It combines with itsel to form U+16121, GURUNG KHEMA VOWEL SIGN U, which is a double bar above. 
Although not required for actually writing Gurung Khema (or so I assume), the correct form to represent a number of bars above (11 in the test code above) is to first group them into pairs with U+16121, and only in the case of an odd number add a single U+1611E to the end. Regards, Martin. From asmusf at ix.netcom.com Sun Dec 14 17:55:21 2025 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 14 Dec 2025 15:55:21 -0800 Subject: Combining characters In-Reply-To: References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com> <012d01dc6d2a$105656a0$310303e0$@akphs.com> Message-ID: On 12/14/2025 3:22 PM, Mark E. Shoulson via Unicode wrote: > > On 12/14/25 5:44 PM, Asmus Freytag via Unicode wrote: > >> On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote: >>> >>> Well, I?m sorta ?asking for a friend? ? a coworker who is deep in >>> the weeds of working with something Unicode-related. I?m blaming him >>> for having told me that :) >>> >>> >> This actually deserves a deeper answer, or a more "bird's-eye" one, >> if you want. Read to the end. >> >> The way you asked the question seems to hint that in your minds you >> and your friend conflate the concept of "combining" mark and >> "diacritic". That would not be surprising if you are mainly familiar >> with European scripts and languages, because in that case, this >> equivalence kind of applies. >> > Yes.? This is crucial.? You (Phil) are writing like "sheez, so there's > e and there's e-with-an-acute, we might as well just treat them like > separate letters."? And that maybe makes sense for languages where > "combining characters" are maybe two or three diacritics that can live > on five or six letters.? Maybe it does make sense to consider those > combinations as distinct letters (indeed, some of the languages in > question do just that.)? But some combining characters are more > rightly perceived as things separate from the letters which are > written in the same space (and have historically always been > considered so). The most obvious examples would be Hebrew and Arabic > vowel-points.? Does it really make sense to consider ?? and ?? and ??? > and all the other combinatorics as separate distinct things, when they > clearly contain separate units, each of which has its own consistent > character?? Throw in the Hebrew "accents" (cantillation marks) and > you're talking an enormous combinatorial explosion at the *cost* of > simplicity and consistency, not improving it.? Ditto Indic vowel-marks > and a jillion other abjads and abugidas. > Nice examples to back up what I wrote. > > ?If anything, there's a better case to be made that the precomposed > letters were maybe a wrong move. > > That "might" have been the case, had Unicode been created in a vacuum. Instead, Unicode needed to offer the easiest migration path from the installed base of pre-existing character encodings, or risk failing to gain ground at all. All the early systems mainly started out with legacy applications and legacy data that needed to be supported as transparently as possible. Given the pervasive amount of indexing into strings and length calculations that are deeply embedded into these legacy applications, trying to support these with a different encoding model (not just with a different encoding) would have been a non-starter. 
As we've seen since, the final key in that puzzle was the IETF creating an ASCII-compatible, variable-length encoding form that violated one of Unicode's early design goals (to have a fixed number of code units per character). However, allowing direct parsing of data streams for ASCII-based syntax characters was more of a compatibility requirement than had appeared at first. The reason this was not built directly into the earliest Unicode versions was that it is something that (transport) protocol designers are up against more than people worried about representing text in documents.

Looking at Unicode from the perspective of "what if I could design something from scratch?" can be intellectually interesting but is of little practical value. Any design that would have prevented people from different legacy environments from coalescing around it would simply have died out.

If it amuses you, you could think of some features of Unicode as being akin to the "vestigial" organs that evolution sometimes leaves behind. They may not strictly be required, the way the organism functions today, but without their use in the historical transition, the current form of the organism would not exist, because the species would be extinct.

A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From duerst at it.aoyama.ac.jp Sun Dec 14 18:31:33 2025
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J=2E_D=C3=BCrst?=)
Date: Mon, 15 Dec 2025 09:31:33 +0900
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com>
Message-ID: 

On 2025-12-15 08:07, Alex Shpilkin via Unicode wrote:
>
> On Sun, Dec 14 2025 at 14:02:41 -08:00:00, Asmus Freytag via Unicode
> wrote:
>> To make matters more complex, some combining marks are defined to not
>> reorder. Those can be in any order defined by the author and could
>> lead to duplicate encoding for the same display. The reasons behind
>> supporting that are a bit complex, but generally it's done for scripts
>> other than Latin.
>
> Amusingly, study of literal Latin, the language, uses two combining
> marks of the same CCC together as a matter of course: dictionaries mark
> a vowel with (what in NFD would be) the sequence COMBINING MACRON,
> COMBINING BREVE to tell the reader that a syllable's length either
> varies or cannot be determined.

These two characters are indeed not reordered, but that's not a problem, because they are stacked. The sequence COMBINING MACRON, COMBINING BREVE will have the macron between the character and the breve, whereas the sequence COMBINING BREVE, COMBINING MACRON will have the macron above the breve. Not an expert, but my assumption is that only the first one is customary for Latin.

Regards, Martin.
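
Since both marks have canonical combining class 230, normalization leaves their relative order alone, and the two orders remain canonically distinct; a quick sketch of that in Ruby (illustrative only):

    p "a\u0304\u0306".unicode_normalize(:nfd) == "a\u0304\u0306"  # => true: macron-then-breve is already in canonical order
    p "a\u0306\u0304".unicode_normalize(:nfd) == "a\u0306\u0304"  # => true: and so is breve-then-macron
    p "a\u0304\u0306".unicode_normalize(:nfd) == "a\u0306\u0304".unicode_normalize(:nfd)  # => false: the two sequences are not canonically equivalent
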
From doug at ewellic.org Sun Dec 14 23:03:14 2025
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 15 Dec 2025 05:03:14 +0000
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com>
Message-ID: 

Asmus Freytag wrote:

> It would be correct for a sequence of a base character with _single_
> combining mark, but as soon as you have two or more combining marks,
> their order is defined by NFC.

I had mistakenly assumed that Phil's use case considered only sequences with a single combining mark, and consciously chose to limit my response to that scenario.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From doug at ewellic.org Sun Dec 14 23:26:24 2025
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 15 Dec 2025 05:26:24 +0000
Subject: Combining Characters
In-Reply-To: 
References: 
Message-ID: 

Markus Scherer wrote:

> However, when the Jovians arrive with their billion-character
> encoding, then Unicode will become a legacy encoding.

Yeah, I'm really not a fan of this whole "Jovian" and "Venusian" and "Galactic Federation" line of argument. As several (including Markus) have observed here, the use of combining characters in non-Latin scripts and in transcription can differ markedly from their use in Latin-script, natural-language scenarios. The original questions strongly implied that only Latin-script, natural-language scenarios were relevant, to the extent that anything else must be from outer space. That's unfair to languages written in scripts other than Latin, as well as to the efforts made in Unicode for 35 years to provide good support for these scripts and for contexts other than natural language.

Apologies if humor was intended throughout and I didn't get it.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org

From asmusf at ix.netcom.com Sun Dec 14 23:42:13 2025
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 14 Dec 2025 21:42:13 -0800
Subject: Combining characters
In-Reply-To: 
References: <00f301dc6d17$51fb2080$f5f16180$@akphs.com> <011001dc6d20$56261670$02724350$@akphs.com>
Message-ID: 

On 12/14/2025 9:03 PM, Doug Ewell wrote:
> Asmus Freytag wrote:
>
>> It would be correct for a sequence of a base character with _single_
>> combining mark, but as soon as you have two or more combining marks,
>> their order is defined by NFC.
> I had mistakenly assumed that Phil's use case considered only sequences
> with a single combining mark, and consciously chose to limit my response
> to that scenario.
>
I know that you were aware of the general case. What I was trying to communicate (and expounded upon in the other reply) is the degree to which human writing in the general case is highly complex, usually even more complex than most native speakers (other than typesetters) are ever aware of, even for their own language.

And it is acknowledging this complexity -- and how it is necessarily reflected in anything that aims to be a universal system of character encoding -- that drives the understanding that such a system must be full of complexities of its own, which cannot be reduced to anything minimally simple.

For those of us, unlike the questioner, who have been around this effort for any length of time, this complexity can seem to be a given. But many people who have not worked in this space are genuinely surprised and challenged by it. And that includes people who have impressive credentials in technical work. Without realizing it, they apply their own native understanding of writing systems as if it were exhaustive or even typical.

When they try to come up with solutions, such as protocols, that need to be robust in the face of the full variety of global text (even only the living subset), they may reach conclusions that fall fatally short of what is needed, or they try to "simplify" away complexities that to them feel ill motivated. Commonly, I also observe that solutions are proposed that "micro-manage" some well-understood or familiar subset of characters, but leave a protocol without meaningful solutions or safeguards for the vast majority, which contains all the other scripts and writing systems.
There's no quick fix, but it is my firm conviction that we always need to start by correctly scoping the issues as belonging to a "universal" system of character encoding, as opposed to one that is optimized for some subset.

A./

From unicode.org at sl.neatnit.net Thu Dec 18 10:08:57 2025
From: unicode.org at sl.neatnit.net (Nitai Sasson)
Date: Thu, 18 Dec 2025 16:08:57 +0000
Subject: RFC: controlling bidirectional mirroring of characters
Message-ID: <176607414306.7.7927205120432481899.1073729710@sl.neatnit.net>

Hello all!

I've been sitting on this for a while, kind of afraid to finish it up and send it. I've finally decided to just do so, even though it's not perfect.

Following the email discussion from [April 2025](https://corp.unicode.org/pipermail/unicode/2025-April/thread.html), I want to propose a combining formatting character to affect the mirroring behavior of arrow characters (and potentially other characters) in bidirectional text. The initial idea for this was originally brought up by Mark E. Shoulson while brainstorming in his first reply.

This is a request for comment and a draft for that proposal. Please see it at: https://codeberg.org/NeatNit/unicode-bidi-arrows-proposal/src/branch/main/email.md

Thank you,
Nitai Sasson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From moody at posixcafe.org Fri Dec 19 10:32:57 2025
From: moody at posixcafe.org (Jacob Moody)
Date: Fri, 19 Dec 2025 10:32:57 -0600
Subject: Combining Characters
In-Reply-To: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
References: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
Message-ID: 

On 12/14/25 17:54, Martin J. Dürst via Unicode wrote:
>
>> The only case I can see where things could get weird would be if there
>> suddenly became some weird case where, e.g., the Jovians insisted that the
>> combining backslash must appear before the letter and not after it (and
>> it's been a few years since I had to really look at the rules and this
>> might be possible with the existing combining character classes anyway).
>
> Because of the way we have optimized normalization in Ruby (caching
> normalization results for runs of a base character followed by
> modifiers), that wasn't exactly true when we upgraded to Unicode 16.0.0.
> See the "Normalization Behavior" entry at
> https://www.unicode.org/versions/Unicode16.0.0/#Migration.

I also ran into some issues with exactly this in my implementation for 9front[0]. It took me a bit to figure out what was going on; unfortunately, I had first written my implementation for v15, so at the time I wasn't sure whether I had somehow overfit my code to 15 or something had changed.

> New scripts introduced in 16.0.0 (Kirat Rai, Tulu-Tigalari, and Gurung
> Khema) contained combining marks that had combining class 0 and were
> also base characters combining with other combining marks (or even with
> themselves). That was something we hadn't taken account of in our
> implementation previously (because it was not needed).

I do wish the documents on migration[1] had explicitly explained that these new characters are ccc=0 conjoiners; it may be implied in the discussion, and maybe I'm still a bit green on the details to put 2 and 2 together, but it would have saved me some time.

On that topic, I did find the suggested resolution of using the quickcheck value a bit strange; as far as I know, use of quickcheck was not strictly required for normalization prior to this update.
Or well, my v15 implementation did not use it and passed all the normalization tests. I guess as an upside I found that with these changes and the inclusion of quickcheck, Hangul no longer needed to be special-cased.

Thanks,
Jacob Moody

[0] https://github.com/9front/9front/blob/front/sys/src/libc/ucd/runenorm.c
[1] https://www.unicode.org/reports/tr15/tr15-56.html#Contexts_Care

From ashpilkin at gmail.com Fri Dec 19 15:02:55 2025
From: ashpilkin at gmail.com (Alex Shpilkin)
Date: Fri, 19 Dec 2025 23:02:55 +0200
Subject: Combining Characters
In-Reply-To: 
References: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
Message-ID: 

On Fri, Dec 19 2025 at 10:32:57 -06:00:00, Jacob Moody via Unicode wrote:
> I do wish the documents on migration[1] had explicitly explained that
> these new characters are ccc=0 conjoiners; it may be implied in the
> discussion, and maybe I'm still a bit green on the details to put 2 and 2
> together, but it would have saved me some time.

No objection here despite the foregoing.

> On that topic, I did find the suggested resolution of using the
> quickcheck value a bit strange; as far as I know, use of quickcheck
> was not strictly required for normalization prior to this update. Or
> well, my v15 implementation did not use it and passed all the
> normalization tests.

I haven't gotten to implementing canonical composition yet, nor have I looked at any other implementation, including yours, but AFAICT the QC properties aren't required now either: looking at Section 3.11, Normalization Forms, in Unicode 13, predating this change, the recomposition algorithm that suggests itself is:

    starter = 0 # sentinel not part of any compositions
    starter index = uninitialized
    index = 0
    while index < length of string:
        composition = try to compose (starter, string[index])
        if succeeded:
            assert ccc[composition] = 0
            string[starter index] = composition
            delete string[index]
        else:
            if ccc[string[index]] = 0: # NB only this late
                starter = string[index]
                starter index = index
            index = index + 1

If you check conditions in this order, then the handling of starter+starter compositions falls out naturally. (Also note that the composition table only needs to contain pairs of an NFC-form starter and an NFD character, and there are possible optimizations connected to the fact that, if the next character after a successful composition is a nonstarter too, then the first character in the next lookup will be the result of this one.)

Trying to merge de- and recomposition into a single streaming process (e.g. with limits on the length of a composing character sequence to avoid worst-case linear memory consumption) will of course make things much more difficult.

> I guess as an upside I found that with these changes and the
> inclusion of quickcheck, Hangul no longer needed to be special-cased.

I don't believe you ever actually *have* to special-case Hangul after you've generated your tables, it's just that if you are trying to keep your table size down (as I am) then doing so will give you something like 2x savings.

--
HTH,
Alex
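
For context, the Hangul special case being traded against table size here is the arithmetic mapping from Section 3.12 of the standard; a minimal sketch in Ruby (the method name compose_hangul and the example syllable are chosen only for illustration):

    SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    LCOUNT, VCOUNT, TCOUNT = 19, 21, 28   # 19 * 21 * 28 = 11,172 precomposed syllables

    # Compose a leading consonant, a vowel, and an optional trailing consonant
    # arithmetically instead of looking the syllable up in a table.
    def compose_hangul(l, v, t = nil)
      li, vi = l - LBASE, v - VBASE
      return nil unless (0...LCOUNT).cover?(li) && (0...VCOUNT).cover?(vi)
      s = SBASE + (li * VCOUNT + vi) * TCOUNT
      if t
        ti = t - TBASE
        return nil unless (1...TCOUNT).cover?(ti)
        s += ti
      end
      s
    end

    p "%04X" % compose_hangul(0x1112, 0x1161, 0x11AB)   # => "D55C" (U+D55C HANGUL SYLLABLE HAN)

Listing all 11,172 syllables in the composition and decomposition tables instead is where the roughly 2x table-size difference mentioned above comes from.
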
From ashpilkin at gmail.com Fri Dec 19 15:17:25 2025
From: ashpilkin at gmail.com (Alex Shpilkin)
Date: Fri, 19 Dec 2025 23:17:25 +0200
Subject: Combining Characters
In-Reply-To: 
References: <392cff9f-9c2f-4221-b46f-8a49e878ae12@it.aoyama.ac.jp>
Message-ID: <1HCJ7T.7BQ2ABPDSB4F3@gmail.com>

On Fri, Dec 19 2025 at 23:02:55 +02:00:00, Alex Shpilkin wrote:
> I haven't gotten to implementing canonical composition yet

And you can tell, because the algorithm I've posted is wrong. Attempted correction (which does introduce a bit of special handling to account for the starter+starter case):

    starter = 0 # sentinel not part of any compositions
    starter index = uninitialized
    index = 0
    while index < length of string:
        composition = try to compose (starter, string[index])
        if succeeded and (ccc[string[index]] != 0 or index == starter index + 1):
            string[starter index] = composition
            delete string[index]
        else:
            if ccc[string[index]] == 0: # NB only this late
                starter = string[index]
                starter index = index
            index = index + 1

--
Sorry for the noise,
Alex
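
For reference, here is a minimal runnable sketch of the canonical composition pass described in Section 3.11 of the standard, written in Ruby against toy stand-ins for the UCD data (CCC and PAIRS below are not the real tables; the Gurung Khema pair is included only to mirror the earlier example). Unlike the pseudocode above, it re-reads the character at the starter position after every composition, and it uses the combining class of the most recent kept character as the blocking check, which also covers the starter+starter case:

    CCC   = { 0x0301 => 230 }                    # toy combining-class table; everything else is treated as 0
    PAIRS = { [0x1611E, 0x1611E] => 0x16121,     # two single bars -> one double bar
              [0x0065, 0x0301]   => 0x00E9 }     # e + combining acute -> e-acute

    # Input is assumed to be canonically decomposed and canonically ordered.
    def canonical_compose(codepoints)
      out = []
      starter = nil                              # index in out of the last starter, if any
      codepoints.each do |c|
        if starter
          # Combining class of the last kept character between the starter and c,
          # or -1 if c directly follows the starter (then nothing can block it).
          blocker = out.length - 1 > starter ? CCC.fetch(out[-1], 0) : -1
          composite = PAIRS[[out[starter], c]]
          if composite && blocker < CCC.fetch(c, 0)   # a primary composite exists and c is not blocked
            out[starter] = composite
            next
          end
        end
        starter = out.length if CCC.fetch(c, 0) == 0
        out << c
      end
      out
    end

    p canonical_compose([0x1611E] * 11).map { |c| "%04X" % c }
    # => ["16121", "16121", "16121", "16121", "16121", "1611E"]
    p canonical_compose([0x0065, 0x0301]).map { |c| "%04X" % c }
    # => ["00E9"]
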