From unicode at unicode.org Sat Feb 1 16:10:44 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sat, 1 Feb 2020 22:10:44 +0000
Subject: Combining Marks and Variation Selectors
Message-ID: <20200201221044.42bd7b0a@JRWUBU2>

Why are variation selectors not allowed for combining marks? I can see a reason for them not being allowed on characters with non-zero canonical combining classes, but not for them being prohibited for combining marks that are starters, i.e. have ccc=0.

Richard.

From unicode at unicode.org Sat Feb 1 19:59:57 2020
From: unicode at unicode.org (Roozbeh Pournader via Unicode)
Date: Sat, 1 Feb 2020 17:59:57 -0800
Subject: Combining Marks and Variation Selectors
In-Reply-To: <20200201221044.42bd7b0a@JRWUBU2>
References: <20200201221044.42bd7b0a@JRWUBU2>
Message-ID:

They are actually allowed on combining marks of ccc=0. We even define one such variation sequence for Myanmar, IIRC.

On Sat, Feb 1, 2020, 2:12 PM Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> Why are variation selectors not allowed for combining marks? I can see
> a reason for them not being allowed on characters with non-zero
> canonical combining classes, but not for them being prohibited for
> combining marks that are starters, i.e. have ccc=0.
>
> Richard.

From unicode at unicode.org Sat Feb 1 21:30:31 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 2 Feb 2020 03:30:31 +0000
Subject: Combining Marks and Variation Selectors
In-Reply-To:
References: <20200201221044.42bd7b0a@JRWUBU2>
Message-ID: <20200202033031.3b33ccab@JRWUBU2>

On Sat, 1 Feb 2020 17:59:57 -0800
Roozbeh Pournader via Unicode wrote:

> They are actually allowed on combining marks of ccc=0. We even define
> one such variation sequence for Myanmar, IIRC.
>
> On Sat, Feb 1, 2020, 2:12 PM Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:
>
> > Why are variation selectors not allowed for combining marks? I can
> > see a reason for them not being allowed on characters with non-zero
> > canonical combining classes, but not for them being prohibited for
> > combining marks that are starters, i.e. have ccc=0.

Ah, I missed that change from Version 5.0, where the restriction was, 'The base character in a variation sequence is never a combining character or a decomposable character'. I now need to rephrase the question. Why are marks other than spacing marks prohibited?

Richard.

From unicode at unicode.org Sun Feb 2 09:51:56 2020
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Sun, 2 Feb 2020 07:51:56 -0800
Subject: Combining Marks and Variation Selectors
In-Reply-To: <20200202033031.3b33ccab@JRWUBU2>
References: <20200201221044.42bd7b0a@JRWUBU2> <20200202033031.3b33ccab@JRWUBU2>
Message-ID: <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net>

Richard,

What it comes down to is avoidance of conundrums involving canonical reordering for normalization. The effect of variation selectors is defined in terms of an immediate adjacency. If you allowed variation selectors to be defined for combining marks of ccc!=0, then normalization of sequences could, in principle, move the two apart. That would make implementation of the intended rendering much more difficult.

That is basically why the UTC, from the start, ruled out using variation selectors to try to make graphic distinctions between different styles of acute accent marks explicit, for example.

--Ken

On 2/1/2020 7:30 PM, Richard Wordingham via Unicode wrote:
> Ah, I missed that change from Version 5.0, where the restriction was,
> 'The base character in a variation sequence is never a combining
> character or a decomposable character'. I now need to rephrase the
> question. Why are marks other than spacing marks prohibited?
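[Editorial illustration, not part of the original message.] The adjacency concern described above can be checked with Python's standard unicodedata module. This is only a minimal sketch: the sample sequences are assumptions picked for the demonstration, <U+0301, VS1> is a hypothetical sequence that is not registered anywhere, and U+0E31 is simply one example of a nonspacing mark with ccc=0.

```python
import unicodedata

ACUTE = "\u0301"         # COMBINING ACUTE ACCENT (Mn, ccc=230)
DOT_BELOW = "\u0323"     # COMBINING DOT BELOW (Mn, ccc=220)
VS1 = "\ufe00"           # VARIATION SELECTOR-1 (Mn, ccc=0)
MAI_HAN_AKAT = "\u0e31"  # THAI CHARACTER MAI HAN-AKAT (Mn, ccc=0)


def ccc(ch):
    """Return the canonical combining class of one character."""
    return unicodedata.combining(ch)


def still_adjacent(text, mark, selector=VS1):
    """True if the selector still immediately follows the mark."""
    return (mark + selector) in text


# Canonical reordering sorts non-starters by ccc, so 220 moves before 230.
reordered = unicodedata.normalize("NFD", "a" + ACUTE + DOT_BELOW)

# Hypothetical, unregistered sequence <U+0301, VS1>. NFC composes
# a + U+0301 into U+00E1, so VS1 now follows the precomposed letter
# rather than the accent it was meant to modify.
composed = unicodedata.normalize("NFC", "a" + ACUTE + VS1)

# A nonspacing mark that is a starter, with no composition of its own,
# is left exactly where it is by both normalization forms.
thai = "\u0e01" + MAI_HAN_AKAT + VS1
thai_forms = {unicodedata.normalize(form, thai) for form in ("NFC", "NFD")}

print([ccc(c) for c in (ACUTE, DOT_BELOW, VS1, MAI_HAN_AKAT)])
print(reordered == "a" + DOT_BELOW + ACUTE)
print(still_adjacent(composed, ACUTE))
print(thai_forms == {thai})
```

The sketch only shows that a selector after a non-starter can lose its neighbour under normalization, while a starter such as U+0E31 keeps it. It says nothing about whether any such sequence should be defined.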
>

From unicode at unicode.org Sun Feb 2 12:05:07 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Sun, 2 Feb 2020 18:05:07 +0000
Subject: Combining Marks and Variation Selectors
In-Reply-To: <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net>
References: <20200201221044.42bd7b0a@JRWUBU2> <20200202033031.3b33ccab@JRWUBU2> <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net>
Message-ID: <20200202180507.539bc3f9@JRWUBU2>

On Sun, 2 Feb 2020 07:51:56 -0800
Ken Whistler via Unicode wrote:

> What it comes down to is avoidance of conundrums involving canonical
> reordering for normalization. The effect of variation selectors is
> defined in terms of an immediate adjacency. If you allowed variation
> selectors to be defined for combining marks of ccc!=0, then
> normalization of sequences could, in principle, move the two apart.
> That would make implementation of the intended rendering much more
> difficult.

I can understand that for non-starters. However, a lot of non-spacing combining marks are starters (i.e. ccc=0), so they would not be a problem. is an unbreakable block in canonical equivalence-preserving changes. Is this restriction therefore just a holdover from when canonical equivalence could be corrected?

Richard.

From unicode at unicode.org Sun Feb 2 13:43:18 2020
From: unicode at unicode.org (Mark Davis ☕️ via Unicode)
Date: Sun, 2 Feb 2020 11:43:18 -0800
Subject: Combining Marks and Variation Selectors
In-Reply-To: <20200202180507.539bc3f9@JRWUBU2>
References: <20200201221044.42bd7b0a@JRWUBU2> <20200202033031.3b33ccab@JRWUBU2> <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net> <20200202180507.539bc3f9@JRWUBU2>
Message-ID:

I don't think there is a technical reason for disallowing variation selectors after any starters (ccc=000); the normalization algorithm doesn't care about the general category of characters.
Mark

On Sun, Feb 2, 2020 at 10:09 AM Richard Wordingham via Unicode <unicode at unicode.org> wrote:

> On Sun, 2 Feb 2020 07:51:56 -0800
> Ken Whistler via Unicode wrote:
>
> > What it comes down to is avoidance of conundrums involving canonical
> > reordering for normalization. The effect of variation selectors is
> > defined in terms of an immediate adjacency. If you allowed variation
> > selectors to be defined for combining marks of ccc!=0, then
> > normalization of sequences could, in principle, move the two apart.
> > That would make implementation of the intended rendering much more
> > difficult.
>
> I can understand that for non-starters. However, a lot of non-spacing
> combining marks are starters (i.e. ccc=0), so they would not be a
> problem. is an unbreakable block in
> canonical equivalence-preserving changes. Is this restriction therefore
> just a holdover from when canonical equivalence could be corrected?
>
> Richard.

From unicode at unicode.org Sun Feb 2 18:20:07 2020
From: unicode at unicode.org (Eric Muller via Unicode)
Date: Sun, 2 Feb 2020 16:20:07 -0800
Subject: Combining Marks and Variation Selectors
In-Reply-To:
References: <20200201221044.42bd7b0a@JRWUBU2> <20200202033031.3b33ccab@JRWUBU2> <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net> <20200202180507.539bc3f9@JRWUBU2>
Message-ID: <08e31147-4bda-16cc-d22b-e89accd965ac@efele.net>

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Sun Feb 2 19:22:50 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Mon, 3 Feb 2020 01:22:50 +0000
Subject: Combining Marks and Variation Selectors
In-Reply-To: <08e31147-4bda-16cc-d22b-e89accd965ac@efele.net>
References: <20200201221044.42bd7b0a@JRWUBU2> <20200202033031.3b33ccab@JRWUBU2> <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net> <20200202180507.539bc3f9@JRWUBU2> <08e31147-4bda-16cc-d22b-e89accd965ac@efele.net>
Message-ID: <20200203012250.5b886d5b@JRWUBU2>

On Sun, 2 Feb 2020 16:20:07 -0800
Eric Muller via Unicode wrote:

> That would imply some coordination among variation sequences on
> different code points, right?
>
> E.g. <0B48> ≡ <0B47, 0B56>, so a variation sequence on 0B56 (Mn,
> ccc=0) would imply the existence of a variation sequence on 0B48 with
> the same variation selector, and the same effect.

That particular case oughtn't to be impossible, as in NFD everything in sight has ccc=0. However TUS 12.0 Section 23.4 does contain an additional prohibition against meaningfully applying a variation selector to a 'canonical decomposable character'. (Scare quotes because 'ly' seems to be missing from the phrase.)

Richard.

> On 2/2/2020 11:43 AM, Mark Davis ☕️ via Unicode wrote:
> I don't think there is a technical reason for disallowing variation
> selectors after any starters (ccc=000); the normalization algorithm
> doesn't care about the general category of characters.
>
> Mark

From unicode at unicode.org Sun Feb 2 19:43:34 2020
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Sun, 2 Feb 2020 17:43:34 -0800
Subject: Combining Marks and Variation Selectors
In-Reply-To: <20200203012250.5b886d5b@JRWUBU2>
References: <20200201221044.42bd7b0a@JRWUBU2> <20200202033031.3b33ccab@JRWUBU2> <12f84cee-4d5c-fe95-5a98-794fef7de97a@sonic.net> <20200202180507.539bc3f9@JRWUBU2> <08e31147-4bda-16cc-d22b-e89accd965ac@efele.net> <20200203012250.5b886d5b@JRWUBU2>
Message-ID:

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Mon Feb 10 04:29:00 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Mon, 10 Feb 2020 10:29:00 +0000 (GMT)
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
Message-ID: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com>

Hi

Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

There is a German song, Lorelei, and I searched to find an English translation. I found the following video.

https://www.youtube.com/watch?v=lJ3JhxOUbw0

The video is an instrumental version, and what is particularly interesting is that there are lyrics displayed in four languages, with two versions of the translation in English. Being a native speaker of English and living in England I first watched the video viewing just the version labelled British:. Later I played the video again and I just viewed the version labelled U.S..

Remembering that I had some time ago heard a version in Esperanto, I searched and found the two following videos.

https://www.youtube.com/watch?v=reUpdGgdBsA

https://www.youtube.com/watch?v=7dHhTXDmP0k

They may be of the same recording. The first has in its notes the text of the lyrics.

The song in Esperanto has the rather expressive Esperanto word belega in it.
This single word, an adjective, is composed from the Esperanto word bela, which means beautiful, augmented with the Esperanto word-building component -eg-, which modifies the word to which it is attached to indicate greatness. So the word belega expresses in one three-syllable Esperanto word the concept that is in English "greatly beautiful".

http://esperanto.davidgsimpson.com/eo-affixes.html

Thinking of the first video to which I linked, it occurred to me that if a plain text message were sent containing two or more versions of the same text (for whatever text, probably a short message in practice), each in a different language from the other or others, with each version preceded by a tag sequence identifying its language, then software at the receiving end could be set to a chosen language and only text in that language would be displayed.

Thinking around this idea I thought that this could be very useful in The Internet of Things for machine to human communication, whereby, if, say, an end user (human) wants to dialogue with a device (thing), then the technique could be used to send the message "Please enter the password" from the thing in a number of languages. The decoding software in the end user's computer could use the first message in the list as the default if the sequence sent by the thing does not have a version for the particular language set by the end user in his or her computer.

The list of languages supported by a particular thing would not be specified by a universal standard, but could perhaps have English, French, German and one or more others depending upon the location and application of the thing. Any language expressible in Unicode could be included in the list.

Support for Unicode characters beyond plane 0 is much more obtainable in software these days.
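[Editorial illustration, not part of the original message.] The scheme described above could be sketched roughly as follows. The sketch assumes the tag-character encoding in which U+E0001 LANGUAGE TAG is followed by tag characters in the range U+E0020..U+E007E, each mirroring the ASCII character 0xE0000 below it. The function names, the choice of language codes, and the French and German wordings are all hypothetical, and U+E0001 remains deprecated, so this is a demonstration of the idea rather than a recommended practice.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG (deprecated)
TAG_BASE = 0xE0000           # tag character = TAG_BASE + ASCII code point


def tag_sequence(lang):
    """Encode an ASCII language code such as 'en' as a tag sequence."""
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(c)) for c in lang)


def build_message(versions):
    """Concatenate (language, text) pairs, each preceded by its tag sequence."""
    return "".join(tag_sequence(lang) + text for lang, text in versions)


def split_versions(message):
    """Recover the (language, text) pairs from a tagged message."""
    versions = []
    for chunk in message.split(LANGUAGE_TAG)[1:]:
        i = 0
        # Consume the tag characters that spell the language code.
        while i < len(chunk) and 0xE0020 <= ord(chunk[i]) <= 0xE007E:
            i += 1
        lang = "".join(chr(ord(c) - TAG_BASE) for c in chunk[:i])
        versions.append((lang, chunk[i:]))
    return versions


def select(message, preferred):
    """Return the preferred-language text, else the first version sent."""
    versions = split_versions(message)
    for lang, text in versions:
        if lang == preferred:
            return text
    return versions[0][1] if versions else message


message = build_message([
    ("en", "Please enter the password"),
    ("fr", "Veuillez saisir le mot de passe"),
    ("de", "Bitte geben Sie das Passwort ein"),
])
print(select(message, "fr"))
print(select(message, "pl"))  # no Polish version, so the first one is used
```

Falling back to the first version follows the default suggested in the message. A receiver that ignores tag characters entirely would display all versions run together, which is part of what the later replies object to.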
I know that people have been urged to use a higher level protocol for indicating language in documents, but please consider the case where one wants to assemble a status report automatically by combining reports from each of a number of mutually independent sensors on the Internet of Things, each of relatively small size, located in a variety of physical locations perhaps miles apart. In such a case the concatenation of such plain text sequences would be straightforward.

Such an undeprecating of U+E0001 LANGUAGE TAG would, in my opinion, contribute to the development of The Internet of Things.

William Overington

Monday 10 February 2020

From unicode at unicode.org Mon Feb 10 10:55:00 2020
From: unicode at unicode.org (Steffen Nurpmeso via Unicode)
Date: Mon, 10 Feb 2020 17:55:00 +0100
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
In-Reply-To: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com>
References: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com>
Message-ID: <20200210165500.9X6W-%steffen@sdaoden.eu>

wjgo_10009 at btinternet.com via Unicode wrote in
<141cecf1.23e.1702ea529c1.Webtop.218 at btinternet.com>:
 |Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good
 |reason why I ask
 |
 |There is a German song, Lorelei, and I searched to find an English
 |translation.

Regarding the Rhine and this thing of yours, there is also the German joke from the middle of the 1950s, I think, with "Tünnes und Schäl".

Tünnes und Schäl stehen auf der Rheinbrücke. Da fällt Tünnes die Brille in den Fluß und er sagt "Da schau, jetzt ist mir die Brille in die Mosel gefallen", worauf Schäl sagt, "Mensch, Tünnes, dat is doch de Ring!", und Tünnes antwortet "Da kannste mal sehen wie schlecht ich ohne Brille sehen kann!"

Tuennes und Schael stand on the Rhine bridge.
Then Tuennes's glasses fall into the river, and he says "Look, now I have lost my glasses to the Moselle", whereupon Schael says "Crumbs, Tuennes, that is the Rhine!", and Tuennes responds "There you can see how badly I see without glasses!"

P.S.: I cannot speak "Kölsch", aka Cologne dialect.
P.P.S.: I think I got you wrong.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

From unicode at unicode.org Mon Feb 10 17:14:03 2020
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Mon, 10 Feb 2020 18:14:03 -0500
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
In-Reply-To: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com>
References: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com>
Message-ID: <000d01d5e067$c9d747c0$5d85d740$@gmail.com>

The examples given don't convince me that "higher-level protocols" would not be sufficient. There are very few messages being sent in the "Internet of Things" that are truly plain-text. Even those that use a text base (as opposed to binary data) are still in some kind of structured computer language, be it HTML, XML, JSON, etc. The intended natural language can be specified using that structure.

Sending multiples of the same message in different languages is really only applicable to broadcast/multicast scenarios, where you have a transmission going out live to multiple recipients who have different language demands. I can't immediately think of any examples where this is done with plain-text only, though I'd be glad to learn about them, if they exist.

For any peer-to-peer or client-server interaction, as in your password example, it makes more sense to have the recipient request a specific language (e.g. using HTTP's "Accept-Language" header) and the sender to send its message in that language automatically.
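[Editorial illustration, not part of the original message.] The request-a-language approach mentioned above might look roughly like this on the sending side. This is a deliberately simplified sketch: it handles only comma-separated language ranges with optional q-values, the function names and message table are hypothetical, and a real server would normally rely on its HTTP framework's content negotiation instead.

```python
def parse_accept_language(header):
    """Return language ranges from an Accept-Language value, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        quality = 1.0  # a range without a q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            prefs.append((lang.strip().lower(), quality))
    # sorted() is stable, so equal q-values keep their header order.
    return [lang for lang, _ in sorted(prefs, key=lambda p: -p[1])]


def choose_language(header, available, default):
    """Pick the best available language, falling back to a default."""
    for lang in parse_accept_language(header):
        if lang in available:
            return lang
        primary = lang.split("-")[0]  # e.g. 'fr-ch' falls back to 'fr'
        if primary in available:
            return primary
    return default


# Hypothetical message table held by the device.
MESSAGES = {
    "en": "Please enter the password",
    "fr": "Veuillez saisir le mot de passe",
    "de": "Bitte geben Sie das Passwort ein",
}

for header in ("pl, de;q=0.8, en;q=0.5", "fr-CH, fr;q=0.9", "ja"):
    print(header, "->", MESSAGES[choose_language(header, MESSAGES, "en")])
```

With this arrangement only one version of the message crosses the wire, which is the contrast with the multi-language plain-text message being drawn in the reply.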
As for "concatenation of such plain text sequences" where each sequence is in a different language, I must again ask: Is there a system that actually does this, that does not have a higher-level protocol that can carry metadata about the natural language of the text sequences?

Basically, I doubt Unicode language tags would be useful here because there simply is no Internet-based system that transmits human-readable text, in multiple natural languages, in such a rudimentary way, with no encapsulating protocol or metadata. And I doubt there will be; it seems like such a strange design choice in this day and age. Though I'd be glad to be corrected if someone has an example.

Sławomir Osipiuk

From unicode at unicode.org Mon Feb 10 19:06:24 2020
From: unicode at unicode.org (Mark E. Shoulson via Unicode)
Date: Mon, 10 Feb 2020 20:06:24 -0500
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
In-Reply-To: <000d01d5e067$c9d747c0$5d85d740$@gmail.com>
References: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com> <000d01d5e067$c9d747c0$5d85d740$@gmail.com>
Message-ID:

On 2/10/20 6:14 PM, Sławomir Osipiuk via Unicode wrote:
> As for "concatenation of such plain text sequences" where each sequence is in a different language, I must again ask: Is there a system that actually does this, that does not have a higher-level protocol that can carry metadata about the natural language of the text sequences?

Indeed, it seems to me that concatenating such sequences *is* in itself a higher-level protocol. After all, it isn't "plain text" anymore when you have to suppress printing out some of it. And we already have other higher-level protocols that can do the job about as efficiently. So at least this particular application would be a solution to a problem that's already been solved.
~mark

From unicode at unicode.org Tue Feb 11 18:00:01 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Wed, 12 Feb 2020 00:00:01 +0000 (GMT)
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
Message-ID: <60f4441b.204f.17036b207d1.Webtop.73@btinternet.com>

Hi

Thank you to everybody who replied to this thread, both online and offline.

Sławomir Osipiuk wrote:

>> As for "concatenation of such plain text sequences" where each
>> sequence is in a different language, ...

Actually I was meaning the concatenation of a number of messages, one from each of a number of "things", where each message includes text in several languages. The result would be a report in several languages, produced just by simple concatenation of the individual reports. That is, if there are seven sensors, the final report has seven uses of the language code for English, seven for French, seven for German, seven for Polish, and so on.

Mark E. Shoulson wrote:

> So at least this particular application would be a solution to a
> problem that's already been solved.

Well, maybe it is now a solution that is out there and maybe some day a problem will arise for which this would be a solution worth considering.

So for now it drifts into the archives.

Best regards,

William Overington

Tuesday 11 February 2020

From unicode at unicode.org Wed Feb 12 10:24:34 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Wed, 12 Feb 2020 16:24:34 +0000 (GMT)
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
In-Reply-To: <000d01d5e067$c9d747c0$5d85d740$@gmail.com>
References: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com> <000d01d5e067$c9d747c0$5d85d740$@gmail.com>
Message-ID: <41235c02.8df.1703a3766cf.Webtop.231@btinternet.com>

Hi

At the time, I thought that my post yesterday concluded the thread.
However, later something occurred to me as a result of something in the post by Sławomir Osipiuk. The gentleman wrote as follows:

> Sending multiples of the same message in different languages is really
> only applicable to broadcast/multicast scenarios, where you have a
> transmission going out live to multiple recipients who have different
> language demands. I can't immediately think of any examples where this
> is done with plain-text only, though I'd be glad to learn about them,
> if they exist.

Whilst I do not know of anywhere that this is presently done, I realized that this would be a practical proposition for some of the things in the Internet of Things.

I am reminded of the teletext system (with brand names such as Ceefax and Oracle) in the United Kingdom, which was a broadcasting technology introduced in the 1970s and which became very much a part of British culture during the 1980s and 1990s. A digital signal using a special purpose 7-bit character set was broadcast in the vertical blanking interval of a 625 line analogue television signal. Basically, the lines were normally used for the colour picture, but some lines were not used during the time allowed for the scan to go back to the top of the picture once it had reached the lower edge of the picture. So this digital information service got a free ride in the picture signal going out to receivers all over the country.

The information was organised into pages and an end user could go to "text" and then wait for a selected page to come round again in the continuous cyclic broadcasting of pages. Pages could be arranged by the broadcaster so that, say, the news headlines page came around maybe four times in each, say, 20 second cycle and some pages only once.
It was very effective, as the special purpose 7-bit character set, while being basically ASCII, had control characters that were stateful. Each was displayed as a space, yet some of them switched the colour of the following text until a new colour control character was received, if indeed one was received, or until the end of the 40-character line of the display. Each line started with white text, though if the first character of the line switched to a colour, the end user would not see any white text. The control code set also included switching to chunky graphics mode.

There was also a facility to use the system for subtitles to the television programme: optional subtitles, so that end users could have them on if desired, yet other users were not thereby forced to have subtitles. It was good, as various participants in a discussion - whether news or drama - could each have a colour for their speaking, such as green, yellow, cyan, white. No return link was needed to send information from the end user to the central broadcasting computer.

A system with the same format of display was a viewdata system (brand name Prestel), but that was very different from teletext and used a two-way telephone line connection. In a viewdata system, the end user selected a page from a menu, then a message requesting that page was sent to the central computer and just that page was sent to the end user. A fee for a page was often charged and the system never really took off.

Teletext thrived because economy of scale brought the cost of teletext-capable electronics down and it was installed, using a set of for-the-purpose integrated circuits, during manufacture of most colour television sets in that era, and once installed it was a free add-on with no ongoing cost apart from the ordinary television licence.
It seems to me that there could be, in the future, a type of thing that sends out a continuous signal over a wire of, say, a temperature reading at its location, all formatted in several languages. So, no passwords, no input from an end user, just a continuous feeding of its output into The Internet of Things, with the numerical value in the messages changed as the temperature changes. This would allow the digits to be expressed in the digits used in the particular script of the particular language used in an individual message.

William Overington

Wednesday 12 February 2020

From unicode at unicode.org Wed Feb 12 11:44:56 2020
From: unicode at unicode.org (Sławomir Osipiuk via Unicode)
Date: Wed, 12 Feb 2020 12:44:56 -0500
Subject: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask
In-Reply-To: <41235c02.8df.1703a3766cf.Webtop.231@btinternet.com>
References: <141cecf1.23e.1702ea529c1.Webtop.218@btinternet.com> <000d01d5e067$c9d747c0$5d85d740$@gmail.com> <41235c02.8df.1703a3766cf.Webtop.231@btinternet.com>
Message-ID: <001201d5e1cc$247a4af0$6d6ee0d0$@gmail.com>

On Wed, Feb 12, 2020 at 11:28 AM wjgo_10009 at btinternet.com via Unicode wrote:
>
> I am reminded of the teletext system (with brand names such as Ceefax and Oracle) in the United Kingdom, which was a broadcasting technology introduced in the 1970s and which became very much a part of British culture during the 1980s and 1990s. A digital signal of a special purpose 7-bit character set was broadcast in the vertical blanking interval of a 625 line analogue television signal. [...]
> It seems to me that there could be, in the future, a type of thing that sends out a continuous signal over a wire of, say, a temperature reading at its location, all formatted in several languages.
> So, no passwords, no input from an end user, just a continuous feeding into The Internet of Things its output, with the numerical value in the messages changed as the temperature changes. This would allow the digits to be expressed in the digits used in the particular script of the particular language used in an individual message.

Teletext had a data rate of 7 kilobits/s (less than 1 kilobyte/s), was cleverly grafted onto a system never designed for it, and the terminals to display it couldn't handle modern markup. Language tags, or something very like them, would make sense for very low-rate transmissions like Teletext (or the similar Line 21 closed captions in NTSC). It's too late for them, though.

The proposal is for "Internet of Things". In 2020, 1 kbps transmissions are laughably slow, unless you're talking to the Voyager space probes. Receiving equipment, even at the lowest end, has more than enough processing power to interpret a proper markup language. If for some reason you really do want to minimize data rate, you're better off with data compression rather than saving bytes by using Unicode language tags instead of XML. The receiving equipment can handle a decompression step at basically no cost (that wasn't true in the 1970s), and markup languages compress very well.

The particular circumstances that would encourage Unicode tag characters don't exist today: razor-thin data rate and minuscule receiver processing power. With the resources we have now, anything done by tag characters can be done BETTER with proper encapsulating protocols and markup.

With all that said, there is no Unicode Police that will come banging on your door if you make a system that uses the tag characters. If you, or anyone, thinks it's the best solution for a particular project, then do it. Deprecation just means, "There are better ways of doing this. Seriously, please look around." And I think that message is still valid.
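[Editorial illustration, not part of the original message.] The remark that markup compresses well can be tried with Python's standard zlib module. This is only a rough sketch under stated assumptions: the element and attribute layout, the sample readings and the seven-sensor report are invented for the demonstration, and the sizes obtained say nothing about any real device or protocol.

```python
import zlib

# Hypothetical temperature reading, worded in three languages.
READINGS = {
    "en": "Temperature: 21.5 degrees Celsius",
    "fr": "Température : 21,5 degrés Celsius",
    "de": "Temperatur: 21,5 Grad Celsius",
}


def build_report(sensors=7):
    """Build an XML report with one reading per sensor and language."""
    parts = []
    for sensor in range(1, sensors + 1):
        for lang, text in READINGS.items():
            parts.append(
                f'<reading sensor="{sensor}" xml:lang="{lang}">{text}</reading>'
            )
    return "<report>" + "".join(parts) + "</report>"


raw = build_report().encode("utf-8")
packed = zlib.compress(raw, 9)

# Per-version overhead of the two ways of naming the language, in UTF-8.
# Each tag character lies outside plane 0 and so takes four bytes.
tag_prefix = "\U000E0001\U000E0065\U000E006E"  # LANGUAGE TAG + tag 'e' + tag 'n'
xml_attribute = ' xml:lang="en"'

print("markup bytes:", len(raw))
print("compressed bytes:", len(packed))
print("tag prefix bytes:", len(tag_prefix.encode("utf-8")))
print("attribute bytes:", len(xml_attribute.encode("utf-8")))
```

Because the readings repeat almost verbatim, the compressed report is much smaller than the raw markup, and the per-version byte cost of a tag sequence in UTF-8 turns out to be close to that of the attribute it would replace.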
(This reply may read overly critical, but I'm very much enjoying this discussion.)

Sławomir Osipiuk

From unicode at unicode.org Wed Feb 12 12:12:14 2020
From: unicode at unicode.org (Frédéric Grosshans via Unicode)
Date: Wed, 12 Feb 2020 19:12:14 +0100
Subject: Egyptian Hieroglyph Man with a Laptop
Message-ID: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>

Dear Unicode list members (CC Michel Suignard),

The Unicode proposal L2/20-068, "Revised draft for the encoding of an extended Egyptian Hieroglyphs repertoire, Groups A to N" ( https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ), by Michel Suignard contains a very interesting hieroglyph at position *U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man with a laptop, as is obvious in the attached image.

I am curious about the source of this hieroglyph: in the table accompanying the document, its sources are said to be "Hieroglyphica extension (various sources)" with number A58C and "Hornung & Schenkel (2007, last modified in 2015)", but with no number (A;), which seems unique in the table. This leads me to think this glyph only exists in some modern font, either as a joke or for some computer-related modern use. Can anyone confirm or refute this intuition?

   Frédéric

-------------- next part --------------
A non-text attachment was scrubbed...
Name: HieroglyphManWithALaptop.png
Type: image/png
Size: 13078 bytes
Desc: not available
URL:

From unicode at unicode.org Wed Feb 12 13:38:23 2020
From: unicode at unicode.org (Marius Spix via Unicode)
Date: Wed, 12 Feb 2020 20:38:23 +0100
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>
Message-ID: <20200212203810.00f2e65d@spixxi>

That is a pretty interesting finding.
This glyph was not part of
http://www.unicode.org/L2/L2018/18165-n4944-hieroglyphs.pdf
but was first seen in
http://www.unicode.org/L2/L2019/19220-n5063-hieroglyphs.pdf

The only "evidence" for this glyph I could find is a stock photo, which was clearly made in the 21st century.
https://www.alamy.com/stock-photo-egyptian-hieroglyphics-with-notebook-digital-illustration-57472465.html

I know that some font creators include so-called trap characters, similar to the trap streets which are often found in maps to catch copyright violations. But it is also possible that someone wanted to smuggle an easter egg into Unicode or just test whether the quality assurance works.

In my opinion, this is an invalid character, which should not be included in Unicode.

On Thu, 12 Feb 2020 19:12:14 +0100
Frédéric Grosshans via Unicode wrote:

> Dear Unicode list members (CC Michel Suignard),
>
> The Unicode proposal L2/20-068,
> "Revised draft for the encoding of an extended Egyptian Hieroglyphs
> repertoire, Groups A to N" (
> https://www.unicode.org/L2/L2020/20068-n5128-ext-hieroglyph.pdf ) by
> Michel Suignard contains a very interesting hieroglyph at position
> *U+13579 EGYPTIAN HIEROGLYPH A-12-054, which seems to represent a man
> with a laptop, as is obvious in the attached image.
>
> I am curious about the source of this hieroglyph: in the table
> accompanying the document, its sources are said to be "Hieroglyphica
> extension (various sources)" with number A58C and "Hornung & Schenkel
> (2007, last modified in 2015)", but with no number (A;), which seems
> unique in the table. This leads me to think this glyph only exists in
> some modern font, either as a joke or for some computer-related
> modern use. Can anyone confirm or refute this intuition?
>
>    Frédéric
>
>

From unicode at unicode.org Wed Feb 12 14:04:01 2020
From: unicode at unicode.org (Frédéric Grosshans via Unicode)
Date: Wed, 12 Feb 2020 21:04:01 +0100
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <20200212203810.00f2e65d@spixxi>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi>
Message-ID:

On 12/02/2020 at 20:38, Marius Spix wrote:
> That is a pretty interesting finding. This glyph was not part of
> http://www.unicode.org/L2/L2018/18165-n4944-hieroglyphs.pdf

It is, as *U+1355A EGYPTIAN HIEROGLYPH A-12-051

> but was first seen in
> http://www.unicode.org/L2/L2019/19220-n5063-hieroglyphs.pdf
>
> The only "evidence" for this glyph I could find is a stock photo,
> which was clearly made in the 21st century.
> https://www.alamy.com/stock-photo-egyptian-hieroglyphics-with-notebook-digital-illustration-57472465.html

I don't even think it could qualify, since I think the woman in this picture would correspond to another hieroglyph, from the B series (B-04), not an A-12.

>
> I know that some font creators include so-called trap characters,
> similar to the trap streets which are often found in maps to catch copyright
> violations. But it is also possible that someone wanted to smuggle
> an easter egg into Unicode or just test whether the quality assurance works.

The question is then: was this well known to the people reading hieroglyphs who checked this proposal? If not, it is very difficult to trust the other hieroglyphs, especially if the first explanation is the right one: some trap characters could actually look like real ones. Except, of course, if we accept some hieroglyphs for compatibility purposes, but this is not mentioned as a valid reason in any proposal yet.

> In my opinion, this is an invalid character, which should not be
> included in Unicode.

I agree.
Frédéric

From unicode at unicode.org Wed Feb 12 14:17:45 2020
From: unicode at unicode.org (Joe Becker via Unicode)
Date: Wed, 12 Feb 2020 12:17:45 -0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com>
Message-ID: <5E445D69.7070800@unicode.org>

I assume this glyph was created to honor Cleo Huggins, the designer of Sonata at Adobe, who decades ago created a similar hieroglyph of a *woman* in front of her computer.
Joe

From unicode at unicode.org Wed Feb 12 15:01:50 2020
From: unicode at unicode.org (Markus Scherer via Unicode)
Date: Wed, 12 Feb 2020 13:01:50 -0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <20200212203810.00f2e65d@spixxi>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi>
Message-ID:

On Wed, Feb 12, 2020 at 11:37 AM Marius Spix via Unicode <unicode at unicode.org> wrote:

> In my opinion, this is an invalid character, which should not be
> included in Unicode.
>

Please remember that feedback that you want the committee to look at needs to go through http://www.unicode.org/reporting.html

Best regards,
markus

From unicode at unicode.org Wed Feb 12 16:30:17 2020
From: unicode at unicode.org (Michel Suignard via Unicode)
Date: Wed, 12 Feb 2020 22:30:17 +0000
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi>
Message-ID:

Interesting that a single character is creating so much feedback, but it is not the first time.

It is true that the glyph in question was not in the base Hieroglyphica glyph set (that is why I referenced it as an 'extension'). Its presence, though, raises an interesting point concerning abstraction of Egyptian hieroglyphs in general. All Egyptian hieroglyph proposals imply some abstraction from the original evidence found on stone, wood, papyrus. At some point you have to decide on some level where you feel confident that you have created enough glyphs to allow meaningful interaction among Egyptologists. Because the set represents an extinct system you probably have to be a bit liberal in allowing some visual variants (because we can never be completely sure two similar-looking signs are 100% equivalent in all their possible functions in the writing system and are never used in contrast).
These abstract collections started to appear in the first part of the nineteenth century (Champollion starting in 1822). Interestingly, these collections have started to be useful on their own, even if in some cases the main use of parts is self-referencing, either because the glyph is a known mistake, or a ghost (a character for which attestation is now firmly disputed). For example, it would be very difficult to create a new set not including the full Gardiner set, even if some of the characters are not necessarily justified. To a large degree, Hieroglyphica (and its related collection JSesh) has obtained that status as well. The IFAO (Institut Français d'Archéologie Orientale) set is another one, although there is no modern font representing all of it (and many of the IFAO glyphs should not be encoded separately).

There is obviously no doubt that the character in question is a modern invention and not based on historical evidence. But interestingly enough it has started to be used as a pictogram with some content value, describing in fact an Egyptologist. It may not belong to that block, but it actually describes a use case and has been used as a symbol in some technical publication.

Concerning:

> The question is then: was this well known about people reading hieroglyphs who checked this proposal? If not, it is very difficult to trust other hieroglyphs, especially if the first explanation is the good one: some trap characters could actually look like real ones. Except of course if we accept some hieroglyphs for compatibility purpose, but this is not mentioned as a valid reason in any proposal yet.
>
> > In my opinion, this is an invalid character, which should not be
> > included in Unicode.
>
> I agree.

You are allowed to have your own opinion, but I can tell you I have spent a lot of time checking attestation from many sources for the proposed repertoire.
It won't be perfect, but perfection (or a closer reach) would probably cost decades in study while preventing current research from having a communication platform. I don't have a strong opinion about that character, but I would be very disappointed if people stop the review for what is a minor issue in the overall scheme.

Best regards
Michel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 8939 bytes
Desc: image003.png
URL:

From unicode at unicode.org Wed Feb 12 17:06:51 2020
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Thu, 13 Feb 2020 00:06:51 +0100
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi>
Message-ID: <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com>

> On 12 Feb 2020, at 23:30, Michel Suignard via Unicode wrote:
>
> These abstract collections started to appear in the first part of the nineteenth century (Champollion starting in 1822).
> Interestingly, these collections have started to be useful on their own, even if in some cases the main use of parts is self-referencing, either because the glyph is a known mistake, or a ghost (a character for which attestation is now firmly disputed). For example, it would be very difficult to create a new set not including the full Gardiner set, even if some of the characters are not necessarily justified. To a large degree, Hieroglyphica (and its related collection JSesh) has obtained that status as well. The IFAO (Institut Français d'Archéologie Orientale) set is another one, although there is no modern font representing all of it (and many of the IFAO glyphs should not be encoded separately).
>
> There is obviously no doubt that the character in question is a modern invention and not based on historical evidence. But interestingly enough it has started to be used as a pictogram with some content value, describing in fact an Egyptologist. It may not belong to that block, but it actually describes a use case and has been used as a symbol in some technical publication.

From the point of view of Unicode, it is simpler: If the character is in use or has had use, it should be included somehow.

From unicode at unicode.org Wed Feb 12 17:26:39 2020
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Wed, 12 Feb 2020 23:26:39 +0000
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com>
Message-ID:

> From the point of view of Unicode, it is simpler: If the character is in use or has had use, it should be included somehow.

That bar, to me, seems too low. Many things are only used briefly or in a private context that doesn't really require encoding.
The hieroglyphs discussion is interesting because it presents them as living (in at least some sense) even though they're a historical script. Apparently modern Egyptologists are co-opting them for their own needs. There are lots of emoji for professional fields. In this case, since hieroglyphs are pictorial, it seems they've blurred the lines between the script and emoji. Given their field, I'd probably do the same thing.

I'm not opposed to the character if Egyptologists use it amongst themselves, though it does make me wonder if it belongs in this set. Are there other "modern" hieroglyphs? (Not the errors, etc. mentioned earlier, but rather glyphs that have been invented for modern use.)

-Shawn

From unicode at unicode.org Thu Feb 13 03:18:40 2020
From: unicode at unicode.org (=?utf-8?Q?Hans_=C3=85berg?= via Unicode)
Date: Thu, 13 Feb 2020 10:18:40 +0100
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com>
Message-ID: <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com>

> On 13 Feb 2020, at 00:26, Shawn Steele wrote:
>
>> From the point of view of Unicode, it is simpler: If the character is in use or has had use, it should be included somehow.
>
> That bar, to me, seems too low. Many things are only used briefly or in a private context that doesn't really require encoding.

That is a private use area for more special use.

From unicode at unicode.org Thu Feb 13 00:58:27 2020
From: unicode at unicode.org (=?ISO-2022-JP?B?GyRCJCYkXyRbJD8kaxsoQg==?= via Unicode)
Date: Thu, 13 Feb 2020 15:58:27 +0900
Subject: Egyptian Hieroglyph Man with a Laptop
Message-ID: <20200213155649.FF1A.A3B63ED@gmail.com>

The early versions of the font Aegyptus (http://users.teilar.gr/~g1951d/) have the glyph as one of the "Dingbats", distinguished from general characters. The attached image is from the PDF file for Aegyptus.ttf version 3.17 (2012).
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001.png
Type: image/png
Size: 202302 bytes
Desc: not available
URL:

From unicode at unicode.org Thu Feb 13 09:33:43 2020
From: unicode at unicode.org (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?= via Unicode)
Date: Thu, 13 Feb 2020 16:33:43 +0100
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi>
Message-ID: <7c5d8de7-cb0a-e79e-f9f1-7e981783bc5e@gmail.com>

On 12/02/2020 at 23:30, Michel Suignard wrote:
>
> Interesting that a single character is creating so much feedback, but
> it is not the first time.
>

Extrapolating from my own case, I guess it's because hieroglyphs have a strong cultural significance (especially to people following Unicode encoding), but very few are qualified enough to pass judgement, except maybe for this character.

> It is true that the glyph in question was not in the base
> Hieroglyphica glyph set (that is why I referenced it as an
> 'extension'). Its presence though raises an interesting point
> concerning abstraction of Egyptian hieroglyphs in general. All
> Egyptian hieroglyphs proposals imply some abstraction from the
> original evidences found on stone, wood, papyrus. At some point you
> have to decide some level where you feel confident that you created
> enough glyphs to allow meaningful interaction among Egyptologists.
> Because the set represents an extinct system you probably have to be a
> bit liberal in allowing some visual variants (because we can never be
> completely sure two similar looking signs are 100% equivalent in all
> their possible functions in the writing system and are never used in
> contrast).
>

This is clearly a difficult problem to tackle, with both extinct and logographic scripts, and hieroglyphics is both.
It is obvious to me (and probably to anyone following Unicode encoding) that the work you have been doing over the last few years is a very difficult one. By the way, you explain this approach very well on page 6, when you take the "disunification" of *U+14828 N-19-016 and the already encoded U+1321A N037A (which would be N-19-017).

>
> These abstract collections have started to appear in the first part of
> the nineteenth century (Champollion starting in 1822). Interestingly
> these collections have started to be useful on their own even if in
> some case the main use of parts is self-referencing, either because
> the glyph is a known mistake, or a ghost (character for which
> attestation is now firmly disputed). For example, it would be very
> difficult to create a new set not including the full Gardiner set,
> even if some of the characters are not necessarily justified. To a
> large degree, Hieroglyphica (and its related collection JSesh) has
> obtained that status as well. The IFAO (Institut Français
> d'Archéologie Orientale) set is another one, although there is no
> modern font representing all of it (although many of the IFAO glyphs
> should not be encoded separately).
>

I see this as a variant of the "round-trip compatibility" principle of Unicode adapted to ancient scripts, where the role of "legacy standards" is often taken by old scholarly literature.

> There is obviously no doubt that the character in question is a modern
> invention and not based on historical evidence. But interestingly
> enough it has started to be used as a pictogram with some content
> value, describing in fact an Egyptologist. It may not belong to that
> block, but it actually describes a use case and has been used as a
> symbol in some technical publication.
>

I think the main problem I see with this character is that it seems to have been sneaked into the main proposal.
The text of the proposal seems to imply that the characters proposed were either in use in ancient Egypt or correspond to abstractions used by modern (= Champollion and later) Egyptologists intended to reflect them. This character does not fit in this picture, but that does not mean it does not belong to the hieroglyphic block: I think modern use of hieroglyphs (e.g. the ones described in Hieroglyphs For Your Eyes Only: Samuel K. Lothrop and His Use of Ancient Egyptian as Cipher, by Pierre Meyrat, http://www.mesoweb.com/articles/meyrat/Meyrat2014.pdf, 2014) should use the standard Unicode encoding. There is a precedent for encoding modern characters in an extinct script, with the encoding of the Tolkienian characters U+16F1 to U+16F3 in the Runic block. But I feel the encoding of such a character needs at the very least to be explicitly discussed in the text of the proposal, e.g. by giving evidence of its modern use.

>
> Concerning:
>
> The question is then: was this well known about people reading
> hieroglyphs who checked this proposal? If not, it is very difficult to
> trust other hieroglyphs, especially if the first explanation is the good
> one: some trap characters could actually look like real ones. Except
> of course if we accept some hieroglyphs for compatibility purpose, but
> this is not mentioned as a valid reason in any proposal yet.
>
> > In my opinion, this is an invalid character, which should not be
> > included in Unicode.
>
> I agree.
>
> You are allowed to have your own opinion, but I can tell you I have
> spent a lot of time checking attestation from many sources for the
> proposed repertoire. It won't be perfect, but perfection (or a closer
> reach) would probably cost decades in study while preventing current
> research to have a communication platform. I don't have a strong
> opinion about that character, but I would be very disappointed if
> people stop the review for what is a minor issue in the overall scheme.
>

I feel the question is not this character itself, but what it means about the process. There are several possibilities: either

1. The persons working on the encoding did not notice it was anything special at all, and intended to encode it like all the others in the proposal
2. It was known to be a specific modern case (maybe along with other less obvious hieroglyphs)

Given your quick answer, together with the specifics of its entry in the database, I do not think it was case 1. If it were, it would have meant that there are probably many other problems in the proposed character set, which are less obvious but would pose problems to Egyptologists after encoding. If it is case 2, and the character was recognized by you and the other participants in the encoding to be a modern one, the problem is different and easier to solve, either by removing it or by making a special case in your proposal for its encoding (together with similar modern characters, if they are present in your proposal).

Best regards,

Frédéric

From unicode at unicode.org Thu Feb 13 09:41:41 2020
From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode)
Date: Thu, 13 Feb 2020 15:41:41 +0000 (GMT)
Subject: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop)
Message-ID: <1d84bd2d.1658.1703f3682b9.Webtop.49@btinternet.com>

Hans Åberg
>>> From the point of view of Unicode, it is simpler: If the character is in use or has had use, it should be included somehow.

Shawn Steele
>> That bar, to me, seems too low. Many things are only used briefly or in a private context that doesn't really require encoding.

Hans Åberg
> That is a private use area for more special use.

I have used the Private Use Area quite a lot over many years.

I have a licence for a fontmaking program, FontCreator.

A good feature of the Windows operating system is that all installed fonts can be used in most installed programs.
Private Use Area code points are official Unicode code points.

These three factors together allow me to design and produce TrueType fonts for new symbols each encoded at a Private Use Area code point (a different code point for each such novel symbol), install the fonts, and use them in various programs, including a desktop publishing program and thereby make PDF (Portable Document Format) documents that include both ordinary text and the novel symbols. These PDF documents are then suitable for placing on the web and for Legal Deposit with The British Library.

Yet a Private Use Area encoding at a particular code point is not unique. Thus, except with care amongst people who are aware of the particular encoding, there is no interoperability, such as with regular Unicode encoded characters.

However faced with a need for interoperability for my research project, I have found a solution making use of the Glyph Substitution capability of an OpenType font.

The solution is to invent my own encoding space. This sits on top of Unicode, could be (perhaps?) called markup, but it works!

I am hoping that at some future time the results of my research will become encoded as an International Standard, and that my encoding space will then after that become integrated into Unicode, thus achieving fully standardized unique interoperable encoding as part of Unicode.

Quite a dream, but the way to achieve such a fully standardized unique interoperable encoding as part of Unicode is from a technological point of view, quite straightforward.

There are details of this in the Accumulated Feedback on Public Review Issue #408.

https://www.unicode.org/review/pri408/

Yet having my encoding space in this manner is just something that I have done on my own initiative. Anybody can have his or her own encoding space if he or she so chooses. With a little care and consideration for others these encodings need not clash one with another and all could even coexist in one document.
Having my own encoding space has enabled me to make progress with my research project.

William Overington

Thursday 13 February 2020

From unicode at unicode.org Thu Feb 13 13:33:32 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 13 Feb 2020 19:33:32 +0000
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com>
Message-ID: <20200213193332.207d2a38@JRWUBU2>

On Thu, 13 Feb 2020 10:18:40 +0100
Hans Åberg via Unicode wrote:

> > On 13 Feb 2020, at 00:26, Shawn Steele
> > wrote:
> >> From the point of view of Unicode, it is simpler: If the character
> >> is in use or has had use, it should be included somehow.
> >
> > That bar, to me, seems too low. Many things are only used briefly
> > or in a private context that doesn't really require encoding.
>
> That is a private use area for more special use.

Writing the plural ('Egyptologists') by writing the plural strokes below the glyph could be difficult if the renderer won't include them in the same script run.

Richard.

From unicode at unicode.org Thu Feb 13 13:47:56 2020
From: unicode at unicode.org (Phake Nick via Unicode)
Date: Fri, 14 Feb 2020 03:47:56 +0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <20200213193332.207d2a38@JRWUBU2>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2>
Message-ID:

Those characters could also be put into another block for the same script, similar to how dubious characters in CJK are included by placing them into "CJK Compatibility Ideographs" for round-trip compatibility with the source encoding.

From unicode at unicode.org Thu Feb 13 14:08:23 2020
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Thu, 13 Feb 2020 12:08:23 -0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2>
Message-ID: <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>

You want "dubious"?!

You should see the hundreds of strange characters already encoded in the CJK *Unified* Ideographs blocks, as recently documented in great detail by Ken Lunde:

https://www.unicode.org/L2/L2020/20059-unihan-kstrange-update.pdf

Compared to many of those, a hieroglyph of a man (or woman) holding a laptop is positively orthodox!

--Ken

On 2/13/2020 11:47 AM, Phake Nick via Unicode wrote:
> Those characters could also be put into another block for the same
> script similar to how dubious characters in CJK are included by
> placing them into "CJK Compatibility Ideographs" for round trip
> compatibility with source encoding.

From unicode at unicode.org Thu Feb 13 14:15:07 2020
From: unicode at unicode.org (Shawn Steele via Unicode)
Date: Thu, 13 Feb 2020 20:15:07 +0000
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2> <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
Message-ID:

I'm not opposed to a sub-block for "Modern Hieroglyphs".

I confess that even though I know nothing about hieroglyphs, I find it fascinating that such a thoroughly dead script might still be living in some way, even if it's only a little bit.

-Shawn

From unicode at unicode.org Thu Feb 13 13:12:05 2020
From: unicode at unicode.org (Asmus Freytag via Unicode)
Date: Thu, 13 Feb 2020 11:12:05 -0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com>
Message-ID: <994b78c9-68c4-3bf5-0833-546f06c5ba12@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From unicode at unicode.org Thu Feb 13 15:15:18 2020
From: unicode at unicode.org (Richard Wordingham via Unicode)
Date: Thu, 13 Feb 2020 21:15:18 +0000
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2> <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
Message-ID: <20200213211518.5a372141@JRWUBU2>

On Thu, 13 Feb 2020 20:15:07 +0000
Shawn Steele via Unicode wrote:

> I confess that even though I know nothing about Hieroglyphs, that I
> find it fascinating that such a thoroughly dead script might still be
> living in some way, even if it's only a little bit.

Plenty of people have learnt how to write their name in hieroglyphs. However, it is rare enough that my initials suffice to label my milk at work.

What's more striking is the implication that people are still exchanging messages in Middle Egyptian.

Richard.

From unicode at unicode.org Thu Feb 13 17:47:12 2020
From: unicode at unicode.org (via Unicode)
Date: Fri, 14 Feb 2020 07:47:12 +0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To: <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2> <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
Message-ID:

Dear Ken

An interesting comparison. If strange means dubious, then the name kstrange should be changed or some of the content removed, because many of the characters in the set are not dubious in the least.

Regards
John

From unicode at unicode.org Thu Feb 13 18:13:00 2020
From: unicode at unicode.org (Ken Whistler via Unicode)
Date: Thu, 13 Feb 2020 16:13:00 -0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2> <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
Message-ID:

Well, no, in this case "strange" means strange, as Ken Lunde notes. I'm just pointing to his list, because it pulls together quite a few Han characters that *also* have dubious cases for encoding.

Or you could turn the argument around, I suppose, and note that just because the hieroglyph for "Egyptologist" is strange, that doesn't necessarily mean that the case for encoding it is dubious. ;-)

--Ken

On 2/13/2020 3:47 PM, jk at koremail.com wrote:
> An interesting comparison, if strange means dubious, then the name
> kstrange should be changed or some of the content removed because many
> of the characters in the set are not dubious in the least.
>

From unicode at unicode.org Thu Feb 13 19:03:20 2020
From: unicode at unicode.org (via Unicode)
Date: Fri, 14 Feb 2020 09:03:20 +0800
Subject: Egyptian Hieroglyph Man with a Laptop
In-Reply-To:
References: <882d3d3d-c164-3681-2e1b-76289bc0500e@gmail.com> <20200212203810.00f2e65d@spixxi> <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2> <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net>
Message-ID: <13dfedc9b6a16e92263857b981eb70b4@koremail.com>

"Strange" has several meanings, not all positive. Perhaps the term "outlier" is less ambiguous. One definition of strange is "unfamiliar"; some outliers become widespread in use over time and become familiar, so we no longer consider them strange, but as they are still different they are still outliers.
CJK is a living script, so new characters come and go, and not all become widespread in their use. "Egyptologist" is certainly an outlier, and certainly strange to me. One question is what Egyptologists think of it. John On 2020-02-14 08:13, Ken Whistler via Unicode wrote: > Well, no, in this case "strange" means strange, as Ken Lunde notes. > I'm just pointing to his list, because it pulls together quite a few > Han characters that *also* have dubious cases for encoding. > > Or you could turn the argument around, I suppose, and note that just > because the hieroglyph for "Egyptologist" is strange, that doesn't > necessarily mean that the case for encoding it is dubious. ;-) > > --Ken > > On 2/13/2020 3:47 PM, jk at koremail.com wrote: >> An interesting comparison, if strange means dubious, then the name >> kstrange should be changed or some of the content removed because many >> of the characters in the set are not dubious in the least. >> From unicode at unicode.org Fri Feb 14 07:25:58 2020 From: unicode at unicode.org (Marius Spix via Unicode) Date: Fri, 14 Feb 2020 14:25:58 +0100 Subject: Aw: RE: Egyptian Hieroglyph Man with a Laptop In-Reply-To: <20200213155649.FF1A.A3B63ED@gmail.com> References: <20200213155649.FF1A.A3B63ED@gmail.com> Message-ID: That glyph is coded at position U+1F5B3 OLD PERSONAL COMPUTER; see http://users.teilar.gr/~g1951d/Aegyptus.pdf Sent: Thursday, 13 February 2020 at 07:58 From: "????? via Unicode" To: unicode at unicode.org Subject: RE: Egyptian Hieroglyph Man with a Laptop The early versions of the font Aegyptus (http://users.teilar.gr/~g1951d/) have the glyph as one of the "Dingbats", distinguished from general characters. The attached image is from the PDF file for Aegyptus.ttf version 3.17 (2012).
From unicode at unicode.org Fri Feb 14 07:31:30 2020 From: unicode at unicode.org (Adam Borowski via Unicode) Date: Fri, 14 Feb 2020 14:31:30 +0100 Subject: Egyptian Hieroglyph Man with a Laptop In-Reply-To: <20200213211518.5a372141@JRWUBU2> References: <55965AA6-23AB-4913-A7EE-AA8F88C8C495@telia.com> <98A485B2-A90E-44FB-BD8D-33091DDCBBCD@telia.com> <20200213193332.207d2a38@JRWUBU2> <95585e0e-4436-76df-12fa-cd0df7d9d00d@sonic.net> <20200213211518.5a372141@JRWUBU2> Message-ID: <20200214133130.GB23095@angband.pl> On Thu, Feb 13, 2020 at 09:15:18PM +0000, Richard Wordingham via Unicode wrote: > On Thu, 13 Feb 2020 20:15:07 +0000 > Shawn Steele via Unicode wrote: > > > I confess that even though I know nothing about Hieroglyphs, that I > > find it fascinating that such a thoroughly dead script might still be > > living in some way, even if it's only a little bit. > > Plenty of people have learnt how to write their name in hieroglyphs. > However, it is rare enough that my initials suffice to label my milk at > work. > > What's more striking is the implication that people are still > exchanging messages in Middle Egyptian. I don't think non-Egyptologist recipients are even aware of what language that is, or even that it's an actual meaningful message rather than a hieroglyph-looking doodle. It's like maker's marks done by/for illiterate people (such as most artisans in the past) -- as long as it's a distinct symbol, it does its job. For example, I end my work emails with "????" and everyone so far has assumed it's either my initials or at most some greeting. ?! -- ??????? Latin: meow 4 characters, 4 columns, 4 bytes ??????? Greek: ???? 4 characters, 4 columns, 8 bytes ?????? Runes: ???? 4 characters, 4 columns, 12 bytes ??????? Chinese: ? 1 character, 2 columns, 3 bytes <-- best!
From unicode at unicode.org Fri Feb 14 07:37:05 2020 From: unicode at unicode.org (Hans Åberg via Unicode) Date: Fri, 14 Feb 2020 14:37:05 +0100 Subject: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop) In-Reply-To: <1d84bd2d.1658.1703f3682b9.Webtop.49@btinternet.com> References: <1d84bd2d.1658.1703f3682b9.Webtop.49@btinternet.com> Message-ID: > On 13 Feb 2020, at 16:41, wjgo_10009 at btinternet.com via Unicode wrote: > > Yet a Private Use Area encoding at a particular code point is not unique. Thus, except with care amongst people who are aware of the particular encoding, there is no interoperability, such as with regular Unicode encoded characters. > > However faced with a need for interoperability for my research project, I have found a solution making use of the Glyph Substitution capability of an OpenType font. > > The solution is to invent my own encoding space. This sits on top of Unicode, could be (perhaps?) called markup, but it works! It may be perilous, because some software may enforce the strict official code point limits. From unicode at unicode.org Fri Feb 14 16:52:25 2020 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Fri, 14 Feb 2020 22:52:25 +0000 (GMT) Subject: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop) In-Reply-To: References: <1d84bd2d.1658.1703f3682b9.Webtop.49@btinternet.com> Message-ID: <1f0406d4.b22.17045e7354c.Webtop.218@btinternet.com> >> The solution is to invent my own encoding space. This sits on top of >> Unicode, could be (perhaps?) called markup, but it works! > It may be perilous, because some software may enforce the strict > official code point limits. I have now realized that what I wrote before is ambiguous.
When I wrote "sits on top of Unicode" I was not meaning at some code points above U+10FFFF in the Unicode map, though I accept that it could quite reasonably be read as meaning that. My encoding space sits on top of Unicode in the sense that it uses a sequence of regular Unicode characters for each code point in my encoding space. For example ???? or !781 or a character sequence of a base character, followed by a tag exclamation mark followed by three tag digits and a cancel tag. All three examples above have the same meaning. ???? is useful as more unlikely otherwise than !123, though !123 is easier to use and could be used in a GS1-128 barcode. The tag sequence has the potential to become incorporated into Unicode for universal standardization of unambiguous interoperability everywhere. That is a long term goal for me. The example above uses a three-digit code number. My encoding space allows for various numbers of digits, with a minimum of three digits and a much larger theoretical maximum. The most digits in use at present in my research project in any one code number is six. William Overington Friday 14 February 2020 From unicode at unicode.org Sat Feb 15 04:11:48 2020 From: unicode at unicode.org (via Unicode) Date: Sat, 15 Feb 2020 03:11:48 -0700 Subject: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop) In-Reply-To: <1f0406d4.b22.17045e7354c.Webtop.218@btinternet.com> References: <1d84bd2d.1658.1703f3682b9.Webtop.49@btinternet.com> <1f0406d4.b22.17045e7354c.Webtop.218@btinternet.com> Message-ID: Hi William, I don't fully understand your proposed encoding scheme (e.g., Is there a namespace each encoding scheme is bound to? How do namespaces get encoded? How are syntax strictures encoded?), but even then, presuming it's sound, you've said in the message before that this encoding space will enhance interoperability. What mechanism is in place to make my encoding space interoperable with yours? 
Perhaps, independent of each other, you bind !123 to a character semantically identical to one I've bound to !234. What rules are in place to allow interchangeability? What about one-to-many or many-to-many or vague or ambiguous mappings across encoding schemes, or mappings that we might reasonably contest? Or maybe you're not so much concerned about interoperability as are you are with extending the PUA block beyond its current limits? Something like SGML/XML entities? Couldn't you simply capitalize on the rules that already exist for entities? Best wishes, jk -- Joel Kalvesmaki Director, Text Alignment Network http://textalign.net On 2020-02-14 15:52, wjgo_10009 at btinternet.com via Unicode wrote: >>> The solution is to invent my own encoding space. This sits on top of >>> Unicode, could be (perhaps?) called markup, but it works! > >> It may be perilous, because some software may enforce the strict >> official code point limits. > > I have now realized that what I wrote before is ambiguous. > > When I wrote "sits on top of Unicode" I was not meaning at some code > points above U+10FFFF in the Unicode map, though I accept that it > could quite reasonably be read as meaning that. > > My encoding space sits on top of Unicode in the sense that it uses a > sequence of regular Unicode characters for each code point in my > encoding space. > > For example > > ???? > > or > > !781 > > or > > a character sequence of a base character, followed by a tag > exclamation mark followed by three tag digits and a cancel tag. > > All three examples above have the same meaning. > > ???? is useful as more unlikely otherwise than !123, though !123 is > easier to use and could be used in a GS1-128 barcode. > > The tag sequence has the potential to become incorporated into Unicode > for universal standardization of unambiguous interoperability > everywhere. That is a long term goal for me. > > The example above uses a three-digit code number. 
My encoding space > allows for various numbers of digits, with a minimum of three digits > and a much larger theoretical maximum. The most digits in use at > present in my research project in any one code number is six. > > William Overington > > Friday 14 February 2020 From unicode at unicode.org Sat Feb 15 14:46:54 2020 From: unicode at unicode.org (wjgo_10009@btinternet.com via Unicode) Date: Sat, 15 Feb 2020 20:46:54 +0000 (GMT) Subject: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop) In-Reply-To: References: <1d84bd2d.1658.1703f3682b9.Webtop.49@btinternet.com> <1f0406d4.b22.17045e7354c.Webtop.218@btinternet.com> Message-ID: <1bc9f021.1353.1704a9aa8a1.Webtop.52@btinternet.com> Joel Kalvesmaki asks nine questions, six in the first block and three in the second block. Numbering them from 1 through to 9 in the order that they are asked, I do not, at present, understand many of them, and I can, at present, only answer question 7 definitively. Some questions may need an answer in two parts, one part about my specific project, and the other about what happens if one or more other people also decide to have their own encoding space in a similar manner. I realize that not even understanding the questions at this time may not sound very good to some of the people who do understand them, but I am not someone who knowingly purports to know what he is talking about when he does not. I am a researcher, and now that I am aware of these questions, I need to find out more so that in the future I can answer such questions with sound background knowledge of the topic. It might be that I know of some of these matters but am not aware of the parlance used to describe them in the post to which I am replying. So now to my thoughts on some of the questions. 1 to 4. I do not at present understand the question. 5.
Perhaps, independent of each other, you bind !123 to a character semantically identical to one I've bound to !234. What rules are in place to allow interchangeability? I am not sure this is the best possible answer, but with care the problem should not happen in the first place. I am thinking that people could perhaps avoid it by using an informal discussion method similar to that used when proposing a new alt. group in the Usenet system that was in widespread use before the web was invented. 6. I do not at present understand the question. 7. Or maybe you're not so much concerned about interoperability as you are with extending the PUA block beyond its current limits? No, absolutely not. I have used the Private Use Areas on a number of occasions and found them extremely useful to have available. Yet any assignment is not unique and, except in very limited, prearranged circumstances, interoperability is not possible. My research project is very much about interoperability with provenance. Interoperability with provenance is central to what I am trying to achieve. 8. Something like SGML/XML entities? Until they were mentioned in the post to which I am replying, I had never known of them. 9. Couldn't you simply capitalize on the rules that already exist for entities? From what I have read about them today, well, I suppose that I could, but that is not my approach and I am not going to use them. My items are not emoji, but emoji are expressed either by an atomic character or by a sequence of atomic characters, such sequences being decoded upon reception to produce a glyph. My proposed system uses sequences of atomic characters such that the sequences could be decoded upon reception to produce localized output. A similar yet different process. I simply do not want, as a design choice, all that angle bracket stuff; it is just not what I am trying to do.
---- If anyone on this mailing list who understands some or all of what I do not, your comments in this thread would be very welcome please. The first three links on my webspace are relevant to my research project. http://www.users.globalnet.co.uk/~ngo/ The website is safe to use. It is hosted on a server run these days by Plusnet PLC, a United Kingdom internet service provider. It is not hosted on my computer. William Overington Saturday 15 February 2020 ------ Original Message ------ From: "via Unicode" To: wjgo_10009 at btinternet.com Cc: unicode at unicode.org Sent: Saturday, 2020 Feb 15 At 10:11 Subject: Re: What should or should not be encoded in Unicode? (from Re: Egyptian Hieroglyph Man with a Laptop) Hi William, I don't fully understand your proposed encoding scheme (e.g., Is there a namespace each encoding scheme is bound to? How do namespaces get encoded? How are syntax strictures encoded?), but even then, presuming it's sound, you've said in the message before that this encoding space will enhance interoperability. What mechanism is in place to make my encoding space interoperable with yours? Perhaps, independent of each other, you bind !123 to a character semantically identical to one I've bound to !234. What rules are in place to allow interchangeability? What about one-to-many or many-to-many or vague or ambiguous mappings across encoding schemes, or mappings that we might reasonably contest? Or maybe you're not so much concerned about interoperability as are you are with extending the PUA block beyond its current limits? Something like SGML/XML entities? Couldn't you simply capitalize on the rules that already exist for entities? Best wishes, jk -- Joel Kalvesmaki Director, Text Alignment Network http://textalign.net On 2020-02-14 15:52, wjgo_10009 at btinternet.com via Unicode wrote: The solution is to invent my own encoding space. This sits on top of Unicode, could be (perhaps?) called markup, but it works! 
It may be perilous, because some software may enforce the strict official code point limits. I have now realized that what I wrote before is ambiguous. When I wrote "sits on top of Unicode" I was not meaning at some code points above U+10FFFF in the Unicode map, though I accept that it could quite reasonably be read as meaning that. My encoding space sits on top of Unicode in the sense that it uses a sequence of regular Unicode characters for each code point in my encoding space. For example ???? or !781 or a character sequence of a base character, followed by a tag exclamation mark followed by three tag digits and a cancel tag. All three examples above have the same meaning. ???? is useful as more unlikely otherwise than !123, though !123 is easier to use and could be used in a GS1-128 barcode. The tag sequence has the potential to become incorporated into Unicode for universal standardization of unambiguous interoperability everywhere. That is a long term goal for me. The example above uses a three-digit code number. My encoding space allows for various numbers of digits, with a minimum of three digits and a much larger theoretical maximum. The most digits in use at present in my research project in any one code number is six. William Overington Friday 14 February 2020 -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 21 06:21:34 2020 From: unicode at unicode.org (Costello, Roger L. via Unicode) Date: Fri, 21 Feb 2020 12:21:34 +0000 Subject: Why do binary files contain text but text files don't contain binary? Message-ID: Hi Folks, There are binary files and there are text files. Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ. To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters. 
(Of course, text files may contain a text-encoding of binary, such as base64-encoded text.) Why the asymmetry? /Roger From unicode at unicode.org Fri Feb 21 06:42:17 2020 From: unicode at unicode.org (via Unicode) Date: Fri, 21 Feb 2020 20:42:17 +0800 Subject: Why do binary files contain text but text files don't contain binary? In-Reply-To: References: Message-ID: Dear Roger, because when Unicode is used in real life (UTF-8 etc.), text ? binary John Knightley On 2020-02-21 20:21, Costello, Roger L. via Unicode wrote: > Hi Folks, > > There are binary files and there are text files. > > Binary files often contain portions that are text. For example, the > start of Windows executable files is the text MZ. > > To the best of my knowledge, text files never contain binary, i.e., > bytes that cannot be interpreted as characters. (Of course, text files > may contain a text-encoding of binary, such as base64-encoded text.) > > Why the asymmetry? > > /Roger From unicode at unicode.org Fri Feb 21 09:53:52 2020 From: unicode at unicode.org (Costello, Roger L. via Unicode) Date: Fri, 21 Feb 2020 15:53:52 +0000 Subject: Why do binary files contain text but text files don't contain binary? In-Reply-To: References: Message-ID: Based on private correspondence, I now realize that this statement: > Text files do not contain binary is not correct. Text files may indeed contain binary (i.e., bytes that are not interpretable as characters). Namely, text files may contain newlines, tabs, and some other invisible things. Question: "characters" are defined as only the visible things, right? I conclude: Binary files may contain arbitrary text. Text files may contain binary, but only a restricted set of binary. Do you agree? /Roger From: Costello, Roger L. Sent: Friday, February 21, 2020 7:22 AM To: unicode at unicode.org Subject: Why do binary files contain text but text files don't contain binary?
Hi Folks, There are binary files and there are text files. Binary files often contain portions that are text. For example, the start of Windows executable files is the text MZ. To the best of my knowledge, text files never contain binary, i.e., bytes that cannot be interpreted as characters. (Of course, text files may contain a text-encoding of binary, such as base64-encoded text.) Why the asymmetry? /Roger -------------- next part -------------- An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 21 10:17:09 2020 From: unicode at unicode.org (Richard Wordingham via Unicode) Date: Fri, 21 Feb 2020 16:17:09 +0000 Subject: Why do binary files contain text but text files don't contain binary? In-Reply-To: References: Message-ID: <20200221161709.356c6ec8@JRWUBU2> On Fri, 21 Feb 2020 15:53:52 +0000 "Costello, Roger L. via Unicode" wrote: > Based on a private correspondence, I now realize that this statement: > > > > > Text files do not contain binary > > > > is not correct. > > > > Text files may indeed contain binary (i.e., bytes that are not > interpretable as characters). Namely, text files may contain > newlines, tabs, and some other invisible things. > > > > Question: "characters" are defined as only the visible things, right? No, white space (e.g. spaces, tabs and newlines) is normally considered to be composed of characters. And then there are much harder to discern things, such as zero-width spaces, line-break suppressors such as U+2060 WORD JOINER, and soft hyphens (interpreted as line-break opportunities). Richard. From unicode at unicode.org Fri Feb 21 10:28:27 2020 From: unicode at unicode.org (Ken Whistler via Unicode) Date: Fri, 21 Feb 2020 08:28:27 -0800 Subject: Why do binary files contain text but text files don't contain binary? In-Reply-To: References: Message-ID: <270d3154-8902-64ec-15d8-1f476fa00ba6@sonic.net> On 2/21/2020 7:53 AM, Costello, Roger L. 
via Unicode wrote: > > Text files may indeed contain binary (i.e., bytes that are not > interpretable as characters). Namely, text files may contain newlines, > tabs, and some other invisible things. > > Question: "characters" are defined as only the visible things, right? > No. You've gone astray right there. Please read Chapter 2 of the Unicode Standard, and in particular, Section 2.4, Code Points and Characters: https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf#G25564 All of those types of characters can occur in Unicode plain text. (With the exception of surrogate code points.) > I conclude: > > Binary files may contain arbitrary text. > Binary files can contain *whatever*, including text. > > Text files may contain binary, but only a restricted set of binary. > The distinction is definitional. A text file contains *only* characters, interpretable by a specific character encoding (usually Unicode, these days). But a text file need not be "plain text". An HTML file is an example of a text file (it contains only a sequence of characters, whose identity and interpretation is all clearly specified by looking them up in the Unicode Standard), but it is not *plain* text. It is *rich* text, consisting of markup tags interspersed with runs of plain text. Another distinction that may be leading you astray is the distinction between binary file transfer and text file transfer. If you are using ftp, for example, you can specify use of binary file transfer, *even if* the file you are transferring is actually a text file. That simply means that the file transfer will agree to treat the entire file as a binary blob and transfer it byte-for-byte intact. A text file transfer, on the other hand, may look for "lines" in a text file and may adjust line endings to suit the receiving platform conventions. > Do you agree? > No. --Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From unicode at unicode.org Fri Feb 21 10:59:18 2020 From: unicode at unicode.org (Doug Ewell via Unicode) Date: Fri, 21 Feb 2020 09:59:18 -0700 Subject: Why do binary files contain text but text files don't contain binary? Message-ID: <20200221095918.665a7a7059d7ee80bb4d670165c8327d.7d1a4d24ca.wbe@email15.godaddy.com> An HTML attachment was scrubbed... URL: From unicode at unicode.org Fri Feb 21 11:05:21 2020 From: unicode at unicode.org (Hans Åberg via Unicode) Date: Fri, 21 Feb 2020 18:05:21 +0100 Subject: Why do binary files contain text but text files don't contain binary? In-Reply-To: References: Message-ID: <2682882F-1E2C-4A5E-8BB0-D84B5F2DA763@telia.com> > On 21 Feb 2020, at 13:21, Costello, Roger L. via Unicode wrote: > > There are binary files and there are text files. In C, when opening a file as binary with the function fopen, the newlines are untranslated [1]. If not using this option, the file is informally text, which means that internally in the program, one can assume that the newline [2] is the character U+000A LINE FEED (LF). 1. https://en.cppreference.com/w/cpp/io/c/fopen 2. https://en.wikipedia.org/wiki/Newline From unicode at unicode.org Fri Feb 21 13:02:18 2020 From: unicode at unicode.org ("Jörg Knappen" via Unicode) Date: Fri, 21 Feb 2020 20:02:18 +0100 Subject: Aw: Why do binary files contain text but text files don't contain binary? In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL:
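Two points from this thread can be sketched in a few lines of Python: Ken Whistler's definition of a text file as one whose bytes are all interpretable under a specific character encoding, and the newline translation that Hans Åberg describes for files opened as text. This is only an illustration under stated assumptions: the helper names is_utf8_text and write_text_read_binary are hypothetical, UTF-8 is assumed as the encoding, and Python's open() stands in for C's fopen, so the sketch shows the idea rather than the C behaviour itself.

```python
import os
import tempfile


def is_utf8_text(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8 (the assumed encoding)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def write_text_read_binary(text: str, newline: str) -> bytes:
    """Write text through a text-mode handle, then read the raw bytes back in binary mode."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Text mode: each "\n" written is translated to the `newline` string.
        with open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        # Binary mode: the bytes come back exactly as stored, untranslated.
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Under these assumptions, is_utf8_text accepts a string containing tabs and newlines (white space is still made of characters) and rejects a byte string such as b"MZ\x90\x00", because the byte 0x90 cannot start a UTF-8 sequence. Calling write_text_read_binary("a\nb\n", "\r\n") shows the line feeds stored as CR LF pairs on disk, while passing "\n" stores them unchanged.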