<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix">On 16/12/2020 at 14:47, Roger L Costello
      via Unicode wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <meta name="Generator" content="Microsoft Word 15 (filtered

        medium)">

      <style><!--

/* Font Definitions */

@font-face

        {font-family:Wingdings;

        panose-1:5 0 0 0 0 0 0 0 0 0;}

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:"Nirmala UI";

        panose-1:2 11 5 2 4 2 4 2 2 3;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin-top:0in;

        margin-right:0in;

        margin-bottom:8.0pt;

        margin-left:0in;

        line-height:106%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph

        {mso-style-priority:34;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:8.0pt;

        margin-left:.5in;

        mso-add-space:auto;

        line-height:106%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

p.MsoListParagraphCxSpFirst, li.MsoListParagraphCxSpFirst, div.MsoListParagraphCxSpFirst

        {mso-style-priority:34;

        mso-style-type:export-only;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:0in;

        margin-left:.5in;

        mso-add-space:auto;

        line-height:106%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

p.MsoListParagraphCxSpMiddle, li.MsoListParagraphCxSpMiddle, div.MsoListParagraphCxSpMiddle

        {mso-style-priority:34;

        mso-style-type:export-only;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:0in;

        margin-left:.5in;

        mso-add-space:auto;

        line-height:106%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

p.MsoListParagraphCxSpLast, li.MsoListParagraphCxSpLast, div.MsoListParagraphCxSpLast

        {mso-style-priority:34;

        mso-style-type:export-only;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:8.0pt;

        margin-left:.5in;

        mso-add-space:auto;

        line-height:106%;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

span.EmailStyle17

        {mso-style-type:personal-compose;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri",sans-serif;}size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}mso-level-number-format:bullet;

        mso-level-text:\F0A7;

        mso-level-tab-stop:none;

        mso-level-number-position:left;

        text-indent:-.25in;

        font-family:Wingdings;}

ol

        {margin-bottom:0in;}

ul

        {margin-bottom:0in;}</style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

      <div class="WordSection1">

        <p class="MsoNormal"><br>

          <o:p></o:p></p>

        <p class="MsoNormal">Unicode makes it possible to write things in

          different languages.<o:p></o:p></p>

        [...]<o:p></o:p>

        <p class="MsoNormal">But, but, but, … how come that universality

          doesn’t extend to digits?

          <o:p></o:p></p>

        <p class="MsoNormal">How come we can only use these digits: 0

          (hex 30), 1 (hex 31), …, 9 (hex 39)?<o:p></o:p></p>

        <p class="MsoNormal">Why, for example, can’t a Bengali-speaking

          person use the Bengali digits: Bengali digit 0 (U+09E6),

          Bengali digit 1 (U+09E7), …, Bengali digit 9 (U+09EF)?<o:p></o:p></p>

        <p class="MsoNormal">Why, for example, can’t a Bengali-speaking

          person create XML such as this:<o:p></o:p></p>

        <p class="MsoNormal" style="text-indent:.5in">&lt;<span
            style="font-family:'Nirmala UI',sans-serif">সংখ্যা</span>_<span
            style="font-family:'Nirmala UI',sans-serif">ছাত্র</span>&gt;<strong><span
              style="font-size:13.5pt;line-height:106%;font-family:'Nirmala UI',sans-serif;color:black;background:white">৪</span></strong><strong><span
              style="font-family:'Nirmala UI',sans-serif">୨</span></strong>&lt;/<span
            style="font-family:'Nirmala UI',sans-serif">সংখ্যা</span>_<span
            style="font-family:'Nirmala UI',sans-serif">ছাত্র</span>&gt;<o:p></o:p></p>

        <p class="MsoNormal">or write a program assignment statement

          like this:<o:p></o:p></p>

        <p class="MsoNormal">              <span
            style="font-family:'Nirmala UI',sans-serif">সংখ্যা</span>_<span
            style="font-family:'Nirmala UI',sans-serif">ছাত্র</span> = <strong><span
              style="font-size:13.5pt;line-height:106%;font-family:'Nirmala UI',sans-serif;color:black;background:white">৪</span></strong><strong><span
              style="font-family:'Nirmala UI',sans-serif">୨</span></strong></p>

      </div>

    </blockquote>

    <p>Is there a specific reason you mix U+09EA : BENGALI DIGIT FOUR and
      U+0B68 : ORIYA DIGIT TWO? Why not use the Bengali ৪২ or the
      Oriya ୪୨? As such, to me, the string is not a valid number, and
      is merely a way to troll programmers, by encoding forty-two in a
      way which looks like 89.</p>

    <p>To me, parsing <span
        style="font-size:13.5pt;line-height:106%;font-family:'Nirmala UI',sans-serif;color:black;background:white">৪</span><span
        style="font-family:'Nirmala UI',sans-serif">୨ as 42 is
        both a bug and a security problem! </span><span
        style="font-family:'Nirmala UI',sans-serif">Parsing
        ৪২ as 42 is a valid use case, though, and it is addressed in
        Unicode.</span></p>
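    <p>As a minimal sketch of the distinction drawn here, assuming Python
      and its standard <font face="monospace">unicodedata</font> module, a
      parser can accept digits from any single decimal set while rejecting
      strings that mix sets. The function name
      <font face="monospace">parse_decimal_strict</font> is hypothetical,
      chosen only for this illustration:</p>

```python
import unicodedata


def parse_decimal_strict(text):
    """Parse a decimal string whose digits all come from one digit set."""
    if not text:
        raise ValueError("empty string")
    zero = None  # code point of the digit zero of the set seen so far
    value = 0
    for ch in text:
        digit = unicodedata.decimal(ch, None)  # None if ch is not a decimal digit
        if digit is None:
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Each decimal set is contiguous and starts at its zero, so stepping
        # back by the digit value identifies the set the character belongs to.
        ch_zero = ord(ch) - digit
        if zero is None:
            zero = ch_zero
        elif ch_zero != zero:
            raise ValueError("digits from different sets are mixed")
        value = value * 10 + digit
    return value


# BENGALI DIGIT FOUR followed by BENGALI DIGIT TWO
print(parse_decimal_strict("\u09ea\u09e8"))
```

    <p>With this sketch, U+09EA followed by U+0B68 (Bengali four, Oriya
      two) raises <font face="monospace">ValueError</font> instead of being
      read as forty-two.</p>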

    <blockquote type="cite"

cite="mid:SA0PR09MB6907DE4090F17A22E382CF4CC8C50@SA0PR09MB6907.namprd09.prod.outlook.com">

      <div class="WordSection1">


        <p class="MsoNormal">Let me explain why I assert that the

          Bengali-speaking person “cannot” do that.

          <o:p></o:p></p>

        <p class="MsoNormal">Numbers in an XML document or in a program

          are just strings and, to perform arithmetic operations on

          them, those string numbers must be converted to actual

          numbers. I looked at the source code for the C function

          (strtol) that converts strings to numbers and here is the key

          to how it converts a character digit to a number digit:<o:p></o:p></p>

        <p class="MsoNormal">              digit_number =

          digit_character - '0'<o:p></o:p></p>

        <p class="MsoNormal">Yikes!<o:p></o:p></p>

        <p class="MsoNormal">That generates a number digit by treating

          the character digit as a number and subtracting the number

          corresponding to the character ‘0’. For example, if the

          character digit is ‘4’ (hex 34) then when we subtract ‘0’ (hex 30) we

          get the number 4. Perfect! But ……… only if we allow European

          digits (0, 1, …, 9). Clearly, if we were to subtract ‘0’ (hex

          30) from the Bengali digit 4 we do not get the number 4.<o:p></o:p></p>

        <p class="MsoNormal">Thus I conclude:<o:p></o:p></p>

        <ul style="margin-top:0in" type="disc">

          <li class="MsoListParagraphCxSpFirst"

            style="margin-left:0in;mso-add-space:auto;mso-list:l0 level1

            lfo1">

            When expressing numbers, the only digits that can be used

            are the European digits<o:p></o:p></li>

          <li class="MsoListParagraphCxSpLast"

            style="margin-left:0in;mso-add-space:auto;mso-list:l0 level1

            lfo1">

            Unicode is universal, but that universality does not apply

            to digits or numbers<o:p></o:p></li>

        </ul>

        <p class="MsoNormal">Obviously I am not understanding something

          correctly. Please help me to understand.</p>

      </div>

    </blockquote>

    <p> <font face="monospace">digit_number = digit_character - '0'</font></p>

    <p>Setting aside the Bengali/Oriya problem I stressed above, your
      criticism should be addressed elsewhere, since the Unicode
      standard is specifically organized to make this possible and easy,
      down to variants of this “hack”. Section 4.6 of
      the standard (in this pdf, accessed from here) states:<br>
      <br>
    </p>

    <blockquote>The Numeric_Type = Decimal property value (which is

      correlated with the General_Category = Nd property value) is

      limited to those numeric characters that are used in decimal-radix

      numbers and for which a full set of digits has been encoded in a

      contiguous range, with ascending order of Numeric_Value, and with

      the digit zero as the first code point in the range.<br>

      <br>

    </blockquote>
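    <p>The guarantee quoted above can be checked mechanically. The
      following is a rough sketch, assuming Python's bundled
      <font face="monospace">unicodedata</font> database (whose Unicode
      version depends on the Python release, so it may not be 13.0). The
      helper name <font face="monospace">zero_code_point</font> is
      hypothetical:</p>

```python
import sys
import unicodedata


def zero_code_point(ch):
    """Return the code point of the zero of ch's digit set (ch must be Nd)."""
    return ord(ch) - unicodedata.decimal(ch)


# For every character with General_Category = Nd, stepping back by its
# decimal value should land on a character whose decimal value is 0.
violations = [
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
    and unicodedata.decimal(chr(zero_code_point(chr(cp))), None) != 0
]
print(len(violations))
```

    <p>An empty <font face="monospace">violations</font> list means that
      the subtraction trick is safe for every decimal digit known to that
      database.</p>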

    <p>It’s quite easy to make a library which parses <font
        face="monospace">UnicodeData.txt</font> (version 13.0 <a
        moz-do-not-send="true"
        href="https://www.unicode.org/Public/13.0.0/ucd/UnicodeData.txt">here</a>),
      extracts the digit ranges of the various scripts, and converts
      digit strings into numbers for the 50 scripts listed in table
      22-3 of the standard plus the western digits (<a
        moz-do-not-send="true"
        href="https://www.unicode.org/versions/Unicode13.0.0/ch22.pdf">Unicode
        13.0 pdf here</a>). It should be reasonably futureproof, in the
      sense that parsing a future Unicode data file should add scripts as
      they are encoded. However, do not forget to check the exceptions
      in the text around this table and in the relevant script pages: in
      Unicode 13.0, this concerns Arabic, which has two sets of digits,
      Myanmar (3 sets), and Tai Tham (2 sets).</p>
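    <p>A sketch of such a library's core, in Python, could look like the
      following. It assumes the semicolon-separated layout of
      <font face="monospace">UnicodeData.txt</font> (code point, name and
      general category in the first three fields, decimal digit value in
      the seventh). The function name
      <font face="monospace">digit_zeros</font> and the short
      <font face="monospace">sample</font> list are illustrative only:</p>

```python
def digit_zeros(lines):
    """Map each decimal digit set's name prefix to the code point of its zero."""
    zeros = {}
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) != 15:  # skip anything that is not a full data line
            continue
        code, name, category, decimal_value = (
            fields[0], fields[1], fields[2], fields[6])
        if category == "Nd" and decimal_value == "0":
            # "BENGALI DIGIT ZERO" becomes "BENGALI"; plain "DIGIT ZERO" becomes ""
            label = name.replace("DIGIT ZERO", "").strip()
            zeros[label] = int(code, 16)
    return zeros


# A few lines in the UnicodeData.txt layout, standing in for the full file.
sample = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0034;DIGIT FOUR;Nd;0;EN;;4;4;4;N;;;;;",
    "09E6;BENGALI DIGIT ZERO;Nd;0;L;;0;0;0;N;;;;;",
    "0B66;ORIYA DIGIT ZERO;Nd;0;L;;0;0;0;N;;;;;",
]
print(digit_zeros(sample))
```

    <p>Passing an open file object for a downloaded copy of the data file
      instead of <font face="monospace">sample</font> should yield every
      decimal set encoded in that version.</p>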

    <p>Automatically processing this data file gives you a few extra
      sets (<font face="monospace">FULLWIDTH, </font><font
        face="monospace">MATHEMATICAL</font> formatted sets, and <font
        face="monospace">SEGMENTED</font>), as witnessed by these ways of
      encoding 42, extracted by a quick script from the lines of the
      data file given above:</p>

    <p><font face="monospace">: 42                             </font><font
        face="monospace">
        ARABIC-INDIC: ٤٢                   EXTENDED ARABIC-INDIC: ۴۲<br>

        NKO: ߄߂                            DEVANAGARI:

        ४२                     BENGALI: ৪২<br>

        GURMUKHI: ੪੨                       GUJARATI:

        ૪૨                       ORIYA: ୪୨<br>

        TAMIL: ௪௨                          TELUGU:

        ౪౨                         KANNADA: ೪೨<br>

        MALAYALAM: ൪൨                      SINHALA LITH:

        ෪෨                   THAI: ๔๒<br>

        LAO: ໔໒                            TIBETAN:

        ༤༢                        MYANMAR: ၄၂<br>

        MYANMAR SHAN: ႔႒                   KHMER:

        ៤២                          MONGOLIAN: ᠔᠒<br>

        LIMBU: ᥊᥈                          NEW TAI LUE:

        ᧔᧒                    TAI THAM HORA: ᪄᪂<br>

        TAI THAM THAM: ᪔᪒                  BALINESE:

        ᭔᭒                       SUNDANESE: ᮴᮲<br>

        LEPCHA: ᱄᱂                         OL CHIKI:

        ᱔᱒                       VAI: ꘤꘢<br>

        SAURASHTRA: ꣔꣒                     KAYAH LI:

        ꤄꤂                       JAVANESE: ꧔꧒<br>

        MYANMAR TAI LAING: ꧴꧲              CHAM:

        ꩔꩒                           MEETEI MAYEK: ꯴꯲<br>

        FULLWIDTH: ４２                      OSMANYA:

        𐒤𐒢                        HANIFI ROHINGYA: 𐴴𐴲<br>

        BRAHMI: 𑁪𑁨                         SORA SOMPENG:

        𑃴𑃲                   CHAKMA: 𑄺𑄸<br>

        SHARADA: 𑇔𑇒                        KHUDAWADI:

        𑋴𑋲                      NEWA: 𑑔𑑒<br>

        TIRHUTA: 𑓔𑓒                        MODI:

        𑙔𑙒                           TAKRI: 𑛄𑛂<br>

        AHOM: 𑜴𑜲                           WARANG CITI:

        𑣤𑣢                    DIVES AKURU: 𑥔𑥒<br>

        BHAIKSUKI: 𑱔𑱒                      MASARAM GONDI:

        𑵔𑵒                  GUNJALA GONDI: 𑶤𑶢<br>

        MRO: 𖩤𖩢                            PAHAWH HMONG:

        𖭔𖭒                   MATHEMATICAL BOLD: 𝟒𝟐<br>

        MATHEMATICAL DOUBLE-STRUCK: 𝟜𝟚     MATHEMATICAL SANS-SERIF:

        𝟦𝟤        MATHEMATICAL SANS-SERIF BOLD: 𝟰𝟮<br>

        MATHEMATICAL MONOSPACE: 𝟺𝟸         NYIAKENG PUACHUE HMONG:

        𞅄𞅂         WANCHO: 𞋴𞋲<br>

        ADLAM: 𞥔𞥒                          SEGMENTED:

        🯴🯲                      <br>

        <br>

      </font></p>
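    <p>The quick script itself is not shown in this message. A comparable
      listing can be produced with the sketch below, which assumes
      Python's bundled <font face="monospace">unicodedata</font> database
      rather than the data file, so the sets it prints depend on the
      Unicode version shipped with the interpreter. The name
      <font face="monospace">encodings_of</font> is hypothetical:</p>

```python
import sys
import unicodedata


def encodings_of(number):
    """Yield (label, text) for a non-negative integer in every decimal digit set."""
    ascii_digits = str(number)
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        # A digit set is identified by its zero: category Nd, decimal value 0.
        if unicodedata.category(ch) == "Nd" and unicodedata.decimal(ch) == 0:
            label = unicodedata.name(ch).replace("DIGIT ZERO", "").strip()
            # Offset each ASCII digit value from the zero of this set.
            yield label, "".join(chr(cp + int(d)) for d in ascii_digits)


for label, text in encodings_of(42):
    print(f"{label}: {text}")
```

    <p>As in the listing above, the ASCII digits come out with an empty
      label because their character names carry no script prefix.</p>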

    <p>In all cases, parsing can work by subtracting the value of digit

      zero: <br>

    </p>

    <p>The following works, but something more subtle/universal is

      advisable:<br>

    </p>

    <p> <font face="monospace">    Bengali_digit_number =
        Bengali_digit_character - '০'</font></p>
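    <p>One possible reading of “something more universal” is to look up
      the zero of whichever set a character belongs to, instead of
      hard-coding the Bengali zero. The sketch below assumes Python and
      builds the table of zeros from the bundled
      <font face="monospace">unicodedata</font> database. The names
      <font face="monospace">ZEROS</font> and
      <font face="monospace">digit_value</font> are hypothetical:</p>

```python
import bisect
import sys
import unicodedata

# Sorted code points of the zero of every decimal digit set.
ZEROS = sorted(
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
    and unicodedata.decimal(chr(cp)) == 0
)


def digit_value(ch):
    """Return the value of a decimal digit as (digit character - set zero)."""
    cp = ord(ch)
    # Index of the last zero that is not above cp, or -1 if there is none.
    i = bisect.bisect_right(ZEROS, cp) - 1
    if i != -1 and cp - ZEROS[i] in range(10):
        return cp - ZEROS[i]
    raise ValueError(f"not a decimal digit: {ch!r}")


print(digit_value("\u09ea"))  # BENGALI DIGIT FOUR
```

    <p>The subtraction is the same as in the C code discussed above; only
      the choice of zero is driven by data.</p>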

    <p>However, this approach is not universal: it doesn’t take into
      account other number systems, including some currently in use, e.g. in
      East Asia (四十二, 四二, or 肆拾贰), Ethiopia (፵፪) or even Europe (XLII).<br>

    </p>
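    <p>To illustrate the limit, here is a small sketch assuming Python's
      <font face="monospace">unicodedata</font> module. The characters
      below carry a numeric value but no decimal digit value, so the
      subtraction approach does not apply to them. The variable name
      <font face="monospace">examples</font> is illustrative:</p>

```python
import unicodedata

examples = [
    "\u56db",  # the Han character for four
    "\u136a",  # ETHIOPIC DIGIT TWO
    "\u1375",  # ETHIOPIC NUMBER FORTY
]

for ch in examples:
    # decimal() covers only the contiguous decimal digit sets;
    # numeric() also covers other characters that have a numeric value.
    print(
        unicodedata.name(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )
```

    <p>Turning such characters into a number like forty-two requires
      rules specific to each system, which this sketch does not attempt.</p>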

    <p>    Frédéric<br>

    </p>

  </body>

</html>