From kent.b.karlsson at outlook.com Wed Sep 3 08:32:32 2025
From: kent.b.karlsson at outlook.com (Kent Karlsson)
Date: Wed, 3 Sep 2025 13:32:32 +0000
Subject: "Code pages" for SMS/CBS

Hi!

I have finally had the opportunity to review the new 3GPP 23.038 "code pages", mostly for Indic scripts. (I started five years ago, but have been busy with other stuff; sorry for the delay.)

Note that these "code pages" are for SMS/CBS use only. They are suitable ONLY for that realm of use, and inappropriate everywhere else.

Unfortunately, the "code pages" in the current 23.038 are not well constructed, nor does it seem that they have even been independently reviewed. So... I made new ones to replace them (technically with other reference numbers, since changing an existing "code page" while keeping the same reference number would be inappropriate).

I also added "code pages" for several scripts not currently covered by 7-bit code pages (thus having to fall back to using "UCS2" (actually UTF-16BE currently), likely incurring a "size penalty"; the SMS protocol has strict size restrictions, it is not called SHORT message service for nothing).

I have no "new" "code pages" for Spanish, Portuguese or Turkish (which have separate "code pages" in 23.038), since these languages are covered better by the new(!!) "default" (actually not default but Latin script) "code page"; the intent is to deprecate the special code pages for Spanish, Portuguese and Turkish. (Though I call it "new default", it actually has to be set explicitly.)

SMS and CBS are still "a thing" for 5G, 6G and very likely beyond, despite the numerous chat apps and other apps.

You can find (draft!) mapping tables (.TXT) and charts (.docx) in https://github.com/kent-karlsson/3gpp-propositions. Each text file has in its file name the language code of the principal language for which it is intended (except the "default" code page). Each chart has the (SMS/CBS) protocol code page number (in hexadecimal) in its file name and section name.
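To put a number on the "size penalty" mentioned above, here is a minimal sketch of the arithmetic. It assumes the usual 140-octet user-data payload of a single, unconcatenated SMS and ignores any user data header (which would reduce both figures); the function name is only illustrative.

```python
# Payload of a single (unconcatenated) SMS, in octets.
SMS_PAYLOAD_OCTETS = 140


def max_code_units(bits_per_unit: int) -> int:
    """Number of fixed-width code units that fit in one SMS payload."""
    return SMS_PAYLOAD_OCTETS * 8 // bits_per_unit


seven_bit_capacity = max_code_units(7)    # 7-bit "code page" characters
utf16_capacity = max_code_units(16)       # "UCS2"/UTF-16 code units

print(seven_bit_capacity, utf16_capacity)
```

Under these assumptions a message encoded with a 7-bit code page holds 160 characters, while the same message sent as "UCS2" holds 70 code units, which is why a 7-bit code page for a script matters so much for SMS.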
Note that this is work in progress, not yet put forward for standardisation. If you want to comment on these draft proposals, you can do so via GitHub.

/Kent Karlsson

From thin.crew1671 at railgunlabs.com Wed Sep 3 11:41:40 2025
From: thin.crew1671 at railgunlabs.com (thin.crew1671 at railgunlabs.com)
Date: Wed, 03 Sep 2025 11:41:40 -0500
Subject: Possibly incorrect line break tests?

In LineBreakTest.txt, there are test cases that indicate there should *not* be a break after U+0308; however, the LB rule cited does not appear to apply, and it would appear that there *should* be a break. For example:

    × 000A ÷ 0308 × 23E9 ÷ # × [0.3] <LINE FEED (LF)> (LF_NotEastAsian) ÷ [5.03] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [28.0] BLACK RIGHT-POINTING DOUBLE TRIANGLE (AL) ÷ [0.3]

LB28 states "Do not break between alphabetics ("at")" with the following break rule:

    (AL | HL) × (AL | HL)

However, in the aforementioned test case, neither U+000A nor U+0308 has break class AL or HL (they have break class LF and CM). Yet rule 28.0 is cited as the reason for not breaking between U+0308 and U+23E9. It would appear that there _should_ be a break here.

Likewise, for the test:

    × 200B ÷ 0308 × 0024 ÷ # × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷ [8.0] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN (PR_NotEastAsian) ÷ [0.3]

LB24 states "Do not break between alphabetics ("at")" with the following break rule:

    (PR | PO) × (AL | HL)
    (AL | HL) × (PR | PO)

However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they have break class ZW and CM). Yet rule 24.03 is cited as the reason for not breaking between U+0308 and U+0024. It would appear that there _should_ be a break here.

In total, I have collected ~80 test cases from LineBreakTest.txt that exhibit this same pattern.
I'm wondering if these test cases were meant to have a hyphen character, because then they'd respect rule LB20a, which states "Do not break after a word-initial hyphen". This rule has the definition:

    ( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | [\u2010] ) × AL

So, for example, test case:

    × 000A ÷ 0308 × 23E9 ÷ # LF ÷ CM × AL (incorrect?)

would become:

    × 000A ÷ 0308 × 002D × 23E9 ÷ # LF ÷ CM × HY × AL (correct)

From egg.robin.leroy at gmail.com Wed Sep 3 14:23:15 2025
From: egg.robin.leroy at gmail.com (Robin Leroy)
Date: Wed, 3 Sep 2025 21:23:15 +0200
Subject: Possibly incorrect line break tests?

On Wed, 3 Sep 2025 at 20:10, Henry via Unicode wrote:

> × 200B ÷ 0308 × 0024 ÷ # × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷
> [8.0] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN
> (PR_NotEastAsian) ÷ [0.3]
>
> LB24 states "Do not break between alphabetics ("at")" with the following
> break rule:
>
> (PR | PO) × (AL | HL)
> (AL | HL) × (PR | PO)
>
> However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they
> have break class ZW and CM).
>

You missed rule LB10.

LB9: Treat X (CM | ZWJ)* as if it were X, where X is any line break class except BK, CR, LF, NL, SP, or ZW.

LB10: Treat any remaining CM or ZWJ as if it had the properties of U+0041 A LATIN CAPITAL LETTER A, that is, Line_Break=AL, General_Category=Lu, East_Asian_Width=Na, Extended_Pictographic=N.

U+0308 is CM. U+200B is ZW, so LB9 does not apply. Therefore, LB10 applies, and the CM becomes AL for subsequent rules. LB24 therefore applies: (AL | HL) × (PR | PO).

The same goes for the other example you cite: the CM becomes AL.

Best regards,

Robin Leroy
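The LB9/LB10 step described in the reply above can be sketched in a few lines of Python. This is only an illustration, not a line breaker: it works on already-assigned Line_Break class names, the function names are made up for this note, and only the LB24 and LB28 no-break pairs quoted in the thread are checked.

```python
# Classes after which LB9 does NOT absorb a following CM or ZWJ.
LB9_EXCLUDED = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def resolve_combining(classes):
    """Collapse CM/ZWJ per LB9 and turn leftover ones into AL per LB10."""
    units = []
    absorbing = False  # True while the previous unit can absorb CM/ZWJ
    for cls in classes:
        if cls in ("CM", "ZWJ"):
            if absorbing:
                continue            # LB9: part of the preceding X
            units.append("AL")      # LB10: remaining CM/ZWJ acts as AL
            absorbing = True        # later CM/ZWJ attach to this AL
        else:
            units.append(cls)
            absorbing = cls not in LB9_EXCLUDED
    return units


def no_break_rule(left, right):
    """Return 'LB24' or 'LB28' if that rule forbids a break, else None."""
    alpha, prepost = {"AL", "HL"}, {"PR", "PO"}
    if left in alpha and right in alpha:
        return "LB28"
    if (left in prepost and right in alpha) or (left in alpha and right in prepost):
        return "LB24"
    return None


# The two test cases from the thread: LF CM AL and ZW CM PR.
first = resolve_combining(["LF", "CM", "AL"])
second = resolve_combining(["ZW", "CM", "PR"])
print(first, no_break_rule(first[1], first[2]))
print(second, no_break_rule(second[1], second[2]))
```

In both cases the CM follows a class that LB9 excludes (LF or ZW), so it is not absorbed and is treated as AL, which is what makes the cited rules 28.0 and 24.03 applicable.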
From kent.b.karlsson at outlook.com Wed Sep 3 14:55:33 2025
From: kent.b.karlsson at outlook.com (Kent Karlsson)
Date: Wed, 3 Sep 2025 19:55:33 +0000
Subject: Possibly incorrect line break tests?

Speaking of line breaking...

I have not done an analysis of the rules, but something is wrong when it comes to quote marks (QU) and line breaking, at least in some common applications. And it is wrong in a very annoying way, and it happens very often (especially on 'smartphones', where the line length is relatively short).

1) It is common that an automatic line break is inserted between an open quote mark (which varies by language) and the quoted text (no space after the (open) quote mark).
2) It is not uncommon to see an automatic line break between the quoted text and an end quote mark (which also varies by language).
3) This never happened in the "naive old days", when (almost) only spaces guided where automatic line breaks were inserted.

I know that for French it is common to have a space after a begin quote mark, and before an end quote mark. Maybe those should be NARROW NO-BREAK SPACEs (U+202F)... And yes, in some scripts one does not use spaces inside phrases/sentences at all. It is still quite annoying to see inappropriate automatic line breaks between a (begin) quote mark and a letter/symbol, or between a letter/symbol and an (end) quote mark.

I can't at this time point to a specific rule to change/fix... (or say whether it is just some implementations that are at fault).

Kent Karlsson

From: Unicode On Behalf Of Henry via Unicode
Sent: Wednesday, September 3, 2025 6:42 PM
To: unicode at corp.unicode.org
Subject: Possibly incorrect line break tests?

In LineBreakTest.txt, there are test cases that indicate there should *not* be a break after U+0308; however, the LB rule cited does not appear to apply, and it would appear that there *should* be a break. For example:

× 000A ÷ 0308 × 23E9 ÷ # × [0.3] <LINE FEED (LF)> (LF_NotEastAsian) ÷
[5.03] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [28.0] BLACK RIGHT-POINTING DOUBLE TRIANGLE (AL) ÷ [0.3]

LB28 states "Do not break between alphabetics ("at")" with the following break rule:

    (AL | HL) × (AL | HL)

However, in the aforementioned test case, neither U+000A nor U+0308 has break class AL or HL (they have break class LF and CM). Yet rule 28.0 is cited as the reason for not breaking between U+0308 and U+23E9. It would appear that there _should_ be a break here.

Likewise, for the test:

    × 200B ÷ 0308 × 0024 ÷ # × [0.3] ZERO WIDTH SPACE (ZW_NotEastAsian) ÷ [8.0] COMBINING DIAERESIS (CM1_NotEastAsian_CM) × [24.03] DOLLAR SIGN (PR_NotEastAsian) ÷ [0.3]

LB24 states "Do not break between alphabetics ("at")" with the following break rule:

    (PR | PO) × (AL | HL)
    (AL | HL) × (PR | PO)

However, neither U+200B nor U+0308 has break class PR, PO, AL, or HL (they have break class ZW and CM). Yet rule 24.03 is cited as the reason for not breaking between U+0308 and U+0024. It would appear that there _should_ be a break here.

In total, I have collected ~80 test cases from LineBreakTest.txt that exhibit this same pattern.

I'm wondering if these test cases were meant to have a hyphen character, because then they'd respect rule LB20a, which states "Do not break after a word-initial hyphen". This rule has the definition:

    ( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | [\u2010] ) × AL

So, for example, test case:

    × 000A ÷ 0308 × 23E9 ÷ # LF ÷ CM × AL (incorrect?)

would become:

    × 000A ÷ 0308 × 002D × 23E9 ÷ # LF ÷ CM × HY × AL (correct)