<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 16/12/2020 at 14:47, Roger L Costello
via Unicode wrote:<br>
</div>
<blockquote type="cite"
cite="mid:SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 15 (filtered
medium)">
<div class="WordSection1">
<p class="MsoNormal"><br>
<o:p></o:p></p>
<p class="MsoNormal">Unicode makes it possible to write things in
different languages.<o:p></o:p></p>
[...]<o:p></o:p>
<p class="MsoNormal">But, but, but, … how come that universality
doesn’t extend to digits?
<o:p></o:p></p>
<p class="MsoNormal">How come we can only use these digits: 0
(hex 30), 1 (hex 31), …, 9 (hex 39)?<o:p></o:p></p>
<p class="MsoNormal">Why, for example, can’t a Bengali-speaking
person use the Bengali digits: Bengali digit 0 (U+09E6),
Bengali digit 1 (U+09E7), …, Bengali digit 9 (U+09EF)?<o:p></o:p></p>
<p class="MsoNormal">Why, for example, can’t a Bengali-speaking
person create XML such as this:<o:p></o:p></p>
<p class="MsoNormal" style="text-indent:.5in">&lt;<span
style="font-family:'Nirmala UI',sans-serif">সংখ্যা</span>_<span
style="font-family:'Nirmala UI',sans-serif">ছাত্র</span>&gt;<strong><span
style="font-size:13.5pt;line-height:106%;font-family:'Nirmala UI',sans-serif;color:black;background:white">৪</span></strong><strong><span
style="font-family:'Nirmala UI',sans-serif">୨</span></strong>&lt;/<span
style="font-family:'Nirmala UI',sans-serif">সংখ্যা</span>_<span
style="font-family:'Nirmala UI',sans-serif">ছাত্র</span>&gt;<o:p></o:p></p>
<p class="MsoNormal">or write a program assignment statement
like this:<o:p></o:p></p>
<p class="MsoNormal"> <span
style="font-family:'Nirmala UI',sans-serif">
সংখ্যা</span>_<span style="font-family:'Nirmala UI',sans-serif">ছাত্র</span> = <strong>
<span
style="font-size:13.5pt;line-height:106%;font-family:'Nirmala UI',sans-serif;color:black;background:white">৪</span></strong><strong><span
style="font-family:'Nirmala UI',sans-serif">୨</span></strong></p>
</div>
</blockquote>
<p>Is there a specific reason you mix U+09EA BENGALI DIGIT FOUR and
U+0B68 ORIYA DIGIT TWO? Why not use the Bengali ৪২ or the
Oriya ୪୨? As such, to me, the string is not a valid number; it
is merely a way to troll programmers by encoding forty-two in a
way that looks like 89.</p>
<p>To me, parsing <span
style="font-size:13.5pt;line-height:106%;font-family:'Nirmala UI',sans-serif;color:black;background:white">৪</span><span
style="font-family:'Nirmala UI',sans-serif">୨ as 42 is
both a bug and a security problem! </span><span
style="font-family:'Nirmala UI',sans-serif">Parsing
৪২ as 42 is a valid use case, though, and it is addressed in
Unicode.</span></p>
<blockquote type="cite"
cite="mid:SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal">Let me explain why I assert that the
Bengali-speaking person “cannot” do that.
<o:p></o:p></p>
<p class="MsoNormal">Numbers in an XML document or in a program
are just strings and, to perform arithmetic operations on
them, those string numbers must be converted to actual
numbers. I looked at the source code for the C function
(strtol) that converts strings to numbers and here is the key
to how it converts a character digit to a number digit:<o:p></o:p></p>
<p class="MsoNormal"> digit_number =
digit_character - '0’<o:p></o:p></p>
<p class="MsoNormal">Yikes!<o:p></o:p></p>
<p class="MsoNormal">That generates a number digit by treating
the character digit as a number and subtracting the number
corresponding to the character ‘0’. For example, if the
character digit is ‘4’ (hex 34) then when we subtract ‘0’ (hex 30) we
get the number 4. Perfect! But ……… only if we allow European
digits (0, 1, …, 9). Clearly, if we were to subtract ‘0’ (hex
30) from the Bengali digit 4 we do not get the number 4.<o:p></o:p></p>
<p class="MsoNormal">Thus I conclude:<o:p></o:p></p>
<ul style="margin-top:0in" type="disc">
<li class="MsoListParagraphCxSpFirst"
style="margin-left:0in;mso-add-space:auto;mso-list:l0 level1
lfo1">
When expressing numbers, the only digits that can be used
are the European digits<o:p></o:p></li>
<li class="MsoListParagraphCxSpLast"
style="margin-left:0in;mso-add-space:auto;mso-list:l0 level1
lfo1">
Unicode is universal, but that universality does not apply
to digits or numbers<o:p></o:p></li>
</ul>
<p class="MsoNormal">Obviously I am not understanding something
correctly. Please help me to understand.</p>
</div>
</blockquote>
<p>Setting aside the Bengali/Oriya problem I stressed above, your
criticism should be addressed elsewhere, since the Unicode
Standard is specifically organized to make this possible and easy,
down to variants of this “hack”. Section 4.6 of the standard
states:<br>
<br>
</p>
<blockquote>The Numeric_Type = Decimal property value (which is
correlated with the General_Category = Nd property value) is
limited to those numeric characters that are used in decimal-radix
numbers and for which a full set of digits has been encoded in a
contiguous range, with ascending order of Numeric_Value, and with
the digit zero as the first code point in the range.<br>
<br>
</blockquote>
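<p>As a rough illustration of that guarantee, here is a sketch using
Python’s standard <font face="monospace">unicodedata</font> module (my
choice of tool for the example; the helper name <font
face="monospace">zero_of</font> is made up). Because the digits of a
set are contiguous and start at zero, the zero of any decimal digit
can be found by stepping back by the digit’s own value:</p>
<pre>
```python
import unicodedata


def zero_of(ch):
    """Return the digit zero of the contiguous range that ch belongs to.

    Relies on the Section 4.6 property: for a character with
    Numeric_Type = Decimal, the zero of its set sits exactly
    `value` code points below it.
    Raises ValueError if ch is not a decimal digit.
    """
    return chr(ord(ch) - unicodedata.decimal(ch))


bengali_four = "\u09ea"  # BENGALI DIGIT FOUR
oriya_two = "\u0b68"     # ORIYA DIGIT TWO

# The two digits of the mixed string belong to different sets.
same_set = zero_of(bengali_four) == zero_of(oriya_two)
```
</pre>
<p>With the Bengali four and the Oriya two from the example string,
the two zeros differ, which is one way to detect the mix.</p>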
<p>It’s quite easy to write a library which parses <font
face="monospace">UnicodeData.txt</font> (version 13.0 <a
moz-do-not-send="true"
href="https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt">here</a>),
extracts the digit ranges of the various scripts, and converts
digit strings into numbers for the 50 scripts listed in Table
22-3 of the standard plus the Western digits (<a
moz-do-not-send="true"
href="https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf">Unicode
13.0 pdf here</a>). It should be reasonably future-proof, in the
sense that parsing a future Unicode data file should add scripts as
they are encoded. However, do not forget to check the exceptions
in the text around this table and in the relevant script pages: in
Unicode 13.0, this concerns Arabic, which has two sets of digits,
Myanmar (3 sets), and Tai Tham (2 sets).</p>
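<p>A minimal sketch of such a parser, in Python, might look like the
following. It assumes the usual 15-field, semicolon-separated layout of
<font face="monospace">UnicodeData.txt</font>; the function name <font
face="monospace">digit_zeros</font> and the way labels are derived from
character names are my own choices for the example, not part of any
library:</p>
<pre>
```python
def digit_zeros(lines):
    """Map each digit-set label to the code point of its digit zero.

    `lines` is any iterable of UnicodeData.txt records: field 0 is the
    code point in hex, field 1 the character name, field 2 the
    General_Category and field 6 the decimal digit value (empty unless
    Numeric_Type = Decimal).
    """
    zeros = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) != 15:
            continue  # skip blank or malformed lines
        code, name, category, decimal = fields[0], fields[1], fields[2], fields[6]
        if category == "Nd" and decimal == "0":
            # "BENGALI DIGIT ZERO" gives "BENGALI"; plain "DIGIT ZERO" gives ""
            label = name.replace("DIGIT ZERO", "").strip()
            zeros[label] = int(code, 16)
    return zeros


# Three records copied from UnicodeData.txt, plus one non-digit record.
sample = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "09E6;BENGALI DIGIT ZERO;Nd;0;L;;0;0;0;N;;;;;",
    "09EA;BENGALI DIGIT FOUR;Nd;0;L;;4;4;4;N;;;;;",
]
zeros = digit_zeros(sample)
```
</pre>
<p>Only the zero of each set needs to be recorded, since the other nine
digits follow it directly. With a local copy of the file, the same
function can be fed an open file object instead of the sample list.</p>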
<p>Automatically processing this data file gives you a few extra
sets (<font face="monospace">FULLWIDTH</font>, the <font
face="monospace">MATHEMATICAL</font> formatted sets, and <font
face="monospace">SEGMENTED</font>), as shown by these ways of
encoding 42, extracted by a quick script from the file given
above:</p>
<p><font face="monospace">: 42 </font><font
face="monospace"> ARABIC-INDIC: ٤٢ EXTENDED ARABIC-INDIC: ۴۲<br>
NKO: ߄߂ DEVANAGARI:
४२ BENGALI: ৪২<br>
GURMUKHI: ੪੨ GUJARATI:
૪૨ ORIYA: ୪୨<br>
TAMIL: ௪௨ TELUGU:
౪౨ KANNADA: ೪೨<br>
MALAYALAM: ൪൨ SINHALA LITH:
෪෨ THAI: ๔๒<br>
LAO: ໔໒ TIBETAN:
༤༢ MYANMAR: ၄၂<br>
MYANMAR SHAN: ႔႒ KHMER:
៤២ MONGOLIAN: ᠔᠒<br>
LIMBU: ᥊᥈ NEW TAI LUE:
᧔᧒ TAI THAM HORA: ᪄᪂<br>
TAI THAM THAM: ᪔᪒ BALINESE:
᭔᭒ SUNDANESE: ᮴᮲<br>
LEPCHA: ᱄᱂ OL CHIKI:
᱔᱒ VAI: ꘤꘢<br>
SAURASHTRA: ꣔꣒ KAYAH LI:
꤄꤂ JAVANESE: ꧔꧒<br>
MYANMAR TAI LAING: ꧴꧲ CHAM:
꩔꩒ MEETEI MAYEK: ꯴꯲<br>
FULLWIDTH: 42 OSMANYA:
𐒤𐒢 HANIFI ROHINGYA: 𐴴𐴲<br>
BRAHMI: 𑁪𑁨 SORA SOMPENG:
𑃴𑃲 CHAKMA: 𑄺𑄸<br>
SHARADA: 𑇔𑇒 KHUDAWADI:
𑋴𑋲 NEWA: 𑑔𑑒<br>
TIRHUTA: 𑓔𑓒 MODI:
𑙔𑙒 TAKRI: 𑛄𑛂<br>
AHOM: 𑜴𑜲 WARANG CITI:
𑣤𑣢 DIVES AKURU: 𑥔𑥒<br>
BHAIKSUKI: 𑱔𑱒 MASARAM GONDI:
𑵔𑵒 GUNJALA GONDI: 𑶤𑶢<br>
MRO: 𖩤𖩢 PAHAWH HMONG:
𖭔𖭒 MATHEMATICAL BOLD: 𝟒𝟐<br>
MATHEMATICAL DOUBLE-STRUCK: 𝟜𝟚 MATHEMATICAL SANS-SERIF:
𝟦𝟤 MATHEMATICAL SANS-SERIF BOLD: 𝟰𝟮<br>
MATHEMATICAL MONOSPACE: 𝟺𝟸 NYIAKENG PUACHUE HMONG:
𞅄𞅂 WANCHO: 𞋴𞋲<br>
ADLAM: 𞥔𞥒 SEGMENTED:
🯴🯲 <br>
<br>
</font></p>
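<p>For what it is worth, entries like those in the list above can be
produced with very little code. This is only a sketch, not the quick
script itself; the helper name <font face="monospace">to_digits</font>
is hypothetical, and the zero of the target set is assumed to be known
already:</p>
<pre>
```python
def to_digits(number, zero):
    """Write a non-negative int using the digit set whose zero is `zero`."""
    if number != abs(number):
        raise ValueError("only non-negative integers are handled")
    # Distance between the target set and the ASCII digits.
    offset = ord(zero) - ord("0")
    return "".join(chr(ord(d) + offset) for d in str(number))


bengali_42 = to_digits(42, "\u09e6")  # zero is BENGALI DIGIT ZERO
oriya_42 = to_digits(42, "\u0b66")    # zero is ORIYA DIGIT ZERO
```
</pre>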
<p>In all cases, parsing can work by subtracting the value of the
digit zero: <br>
</p>
<p>The following works, but something more subtle/universal is
advisable:<br>
</p>
<p> <font face="monospace"> Bengali_digit_number =
Bengali_digit_character - </font>'০'</p>
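<p>One possible shape for a more universal variant, sketched in Python
with the standard <font face="monospace">unicodedata</font> module (the
function name <font face="monospace">parse_decimal</font> and the
decision to reject mixed digit sets outright are my assumptions for the
example):</p>
<pre>
```python
import unicodedata


def parse_decimal(text):
    """Parse a string of decimal digits drawn from one single digit set.

    Raises ValueError on an empty string, on any character that is not
    a decimal digit, and on digits mixed from different sets.
    """
    if not text:
        raise ValueError("empty string")
    zeros = set()
    value = 0
    for ch in text:
        digit = unicodedata.decimal(ch)  # ValueError if not a decimal digit
        zeros.add(ord(ch) - digit)       # code point of this set's zero
        value = value * 10 + digit
    if len(zeros) != 1:
        raise ValueError("digits from more than one set: %r" % text)
    return value


bengali = parse_decimal("\u09ea\u09e8")  # BENGALI DIGIT FOUR, BENGALI DIGIT TWO
```
</pre>
<p>Written this way, the all-Bengali string parses to forty-two, while
the Bengali four followed by the Oriya two is refused instead of being
silently read as 42.</p>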
<p>However, this approach is not universal: it does not cover
other number systems, including some currently in use, e.g. in
East Asia (四十二, 四二, or 肆拾贰), Ethiopia (፵፪) or even Europe (XLII).<br>
</p>
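<p>The difference shows up in the character properties themselves. As a
small sketch with Python’s <font face="monospace">unicodedata</font>
module (the helper name <font face="monospace">describe</font> is
invented for the example), a character such as ETHIOPIC NUMBER FORTY
carries a numeric value but no decimal digit value, so the subtraction
trick has nothing to work with:</p>
<pre>
```python
import unicodedata


def describe(ch):
    """Return (decimal digit value or None, numeric value or None) for ch."""
    return (unicodedata.decimal(ch, None), unicodedata.numeric(ch, None))


bengali_four = describe("\u09ea")    # BENGALI DIGIT FOUR
ethiopic_forty = describe("\u1375")  # ETHIOPIC NUMBER FORTY
latin_x = describe("X")              # a Roman numeral written with a letter
```
</pre>
<p>Handling such systems needs script-specific rules on top of the
per-character values.</p>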
<p> Frédéric<br>
</p>
</body>
</html>