Talk:UTF-8


Citation needed on "truncate in middle of a character"

In the section "Comparison with single-byte encodings", a bullet point mentions an obvious fact:

"It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character", but this is flagged with [citation needed].

This is certainly true in languages such as C where strings are stored as an array of bytes. So I will add that proviso and remove the citation needed. — Preceding unsigned comment added by 138.207.235.63 (talkcontribs) 02:45, 4 August 2019 (UTC)

No, if the string is not an array of bytes, it is *NOT* UTF-8 (unless you are talking about some weird scheme where it is split into tiny 1-4 byte UTF-8 strings at the code points????). Therefore you have just added a lot of pointless bloat to the article, which I am going to revert. Spitzak (talk) 18:49, 4 August 2019 (UTC)

138.207. and @Spitzak: watch the WP: Signatures policy, both of you.
By the way, I removed from the article two instances of older applications that can… (or cannot…) gibberish. Please, write articles based on facts, not advocacy. Incnis Mrsi (talk) 07:19, 5 August 2019 (UTC)

Table should not only use color to encode information (but formatting like bold and underline)

As in a previous comment https://en.wikipedia.org/wiki/Talk:UTF-8/Archive_1#Colour_in_example_table? this has been done before, and is *better*, so that everyone can clearly see the different parts of the code. Relying on color alone is not good, due to color vision deficiencies and varying color rendition on devices.

UTF-8, the lead section, and my hidden agenda

Hi Ita140188, I see you've moved important information out of the lead, including the graph.

At a minimum, I find it important to have the graph above the fold, in the lead. Per MOS:LEAD: "It gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows."

94.5% shows a clear majority, while also showing that about 5% use something else (which could be 430 million people). Without more detail, people might ask "what are those 5% using?" There is space to go into more detail and show that the rest is very divided. If you did not know that, you could be excused for thinking that 1 in 20 use some ONE other encoding.

The missing 5% is roughly equal to the populations of Russia, Japan, Egypt and Thailand combined, all with their good reasons to avoid Unicode, and for all we know all those countries (and no other) could be avoiding UTF-8 for their old legacy encodings not covering what Unicode/UTF-8 can cover (but all of those countries have high UTF-8 use, and no country has UTF-8 use much lower than 90% on the web).

I checked on mobile, and there you have to press to expand other sections. On my wide-screen desktop monitor, there is empty space that could well be filled with the graph.

Most people are not going to scroll past the lead. And I would argue, the MOST important information about an encoding is that it is used, or not used, and what are the alternatives.

You may not be aware of the UTF8 Everywhere Manifesto: "In particular, we believe that the very popular UTF-16 encoding (often mistakenly referred to as ‘widechar’ or simply ‘Unicode’ in the Windows world) has no place [except]". comp.arch (talk) 19:01, 6 May 2020 (UTC)

I think that is way too much detail for the lead, but ok. In any case, there must be a section detailing the use, since I think it is the most likely information people are looking for when opening the article. If they are like me, they would not read the lead but just jump to the relevant section from the table of contents. Also, the lead cannot have independent content: it should only be a summary of the article. This means any information in the lead should also be present elsewhere in the article. --Ita140188 (talk) 01:56, 7 May 2020 (UTC)
I made the section because some other editor duplicated the entire paragraph and inserted it in the middle of the article as a new section. I thought this might make them happy and avoid the duplicate. It does seem that this could be put back into the intro. Spitzak (talk) 02:18, 7 May 2020 (UTC)
I originally made the section, and my point is that it should stay. We can discuss what to add to the lead from that section. --Ita140188 (talk) 02:52, 7 May 2020 (UTC)

Runes

The Google-developed programming language Go defines a datatype called rune. A rune is "an int32 containing a Unicode character of 1,2,3, or 4 bytes". It is not clear from the documentation whether a rune contains the Unicode character number (code point) or the usual UTF-8 encoding used in Go. Testing reveals that a rune appears to be the Unicode character number.

I found a good reference to confirm this at https://blog.golang.org/strings, so this information should be added prominently to this article and similar articles that are missing it. It can be quite frustrating to read about runes in Go and not have this information. David Spector (talk) 00:42, 4 September 2020 (UTC)
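
The test described above can be reproduced with a short sketch like the following, using only the standard unicode/utf8 package. The helper name describe is made up for this illustration. It compares the bytes stored in a Go string with the value held in the rune decoded from it.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe returns the first character of s as a rune, together with the
// number of bytes its UTF-8 encoding occupies in s. (Hypothetical helper.)
func describe(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "\u20ac" // EURO SIGN, U+20AC

	r, size := describe(s)
	fmt.Printf("bytes in string: % x\n", s)  // e2 82 ac
	fmt.Printf("rune value:      %#x\n", r)  // 0x20ac
	fmt.Printf("encoded size:    %d\n", size) // 3
}
```

The string holds the three UTF-8 bytes E2 82 AC, while the rune holds 0x20AC, i.e. the code point rather than the encoded bytes.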

It sounds like that belongs in the page for Go, not here. This is about UTF-8, not datatypes specific to one language. Tarl N. (discuss) 01:26, 4 September 2020 (UTC)
Furthermore, this isn't a software reference manual. It shouldn't be added here at all, let alone "prominently", precisely because it is an obscure implementation feature of a relatively new programming language. Chris Cunningham (user:thumperward) (talk) 17:10, 4 September 2020 (UTC)
I believe the Plan 9 documentation also called Unicode code points "runes", so it might be relevant here, though it really does not sound very important. Spitzak (talk) 18:12, 4 September 2020 (UTC)
Given the shared heritage of all three systems it's unsurprising that they share idiosyncrasies in nomenclature, but this is probably something more pertinent to the biographies of the creators than to the individual systems. Chris Cunningham (user:thumperward) (talk) 18:32, 4 September 2020 (UTC)

Byte order mark trivia

This article has seen significant work recently to try to elevate the important aspects of the subject and reduce the amount of coverage on trivia. One such change has been reverted on the grounds that "BOM killing usage of UTF-8 is very well documented". Of course the material in question has been unreferenced ever since it was added. I don't dispute that people using e.g. Windows Notepad in the middle of the decade were very annoyed by this, but it truly isn't an important enough aspect of the subject today to warrant its own subheading. All that we need to do is note that historically some software insisted on adding BOMs to UTF-8 files and that this caused interoperability issues, with a good reference. We currently lack the latter entirely, but we should at least restore the reduced version of the content such that we aren't inflating what is basically a historic bug that has no impact on the vast majority of uses of the spec. Chris Cunningham (user:thumperward) (talk) 17:08, 4 September 2020 (UTC)

Probably not very important nowadays, and continued watering-down by people trying to whitewash bad behavior by certain companies is making it unreadable. The problem was not actually programs adding the BOM; it was software that refused to recognize UTF-8 without the BOM. That *forced* other software to write it, which destroyed the ASCII compatibility, effectively introduced magic bytes to a file format that is intended to be the most generic text with no structure at all, and complicated even the most trivial operations such as concatenating files. I agree that an awful lot of software has been patched to ignore a leading BOM and the only real bad result is the programming time wasted making these modifications. It actually appears that there is now an inverse problem: some Microsoft compilers work better with UTF-8 if the BOM is *missing*. The reason is that they leave the bytes in quoted string constants alone, while if the BOM is there they perform a translation to UTF-16 and back again, which introduces a lot of annoyances such as mangling any invalid byte sequences.
My main concern with the article here though was to move the description of the BOM out of the "description" section, since it is strongly discouraged by the Unicode consortium and a thing that should not exist has no right to be in the introductory description. It could be reduced a lot further. I also don't think there is much software that will show legacy letters any more. Spitzak (talk) 18:20, 4 September 2020 (UTC)
If you're accusing me of somehow having some pro-Microsoft agenda then I'd encourage you to go and have a walk or pet a dog or something. My only concern here is making the article as accessible as possible, which means minimising the amount of material in it which exists primarily to air editors' grudges against historic implementation bugs.
This material is still unsourced, and warrants a paragraph at best (and no subheader). It should be obvious to any reader that there is no actual need for a marker indicating byte order in an encoding whose code units are single bytes, and without any (referenced!) context showing that this is a true and notable problem (as opposed to a historic quibble), the reader is left wondering why the hell such a big deal is being made of it. Chris Cunningham (user:thumperward) (talk) 18:30, 4 September 2020 (UTC)

How is utf8mb3 exactly the same as CESU-8?

Spitzak, you've repeatedly asserted in the edit comments that MySQL utf8mb3 and CESU-8 are exactly the same. I believe you, but I can't follow you, because the source materials seem to say otherwise, and the citations seem insufficient.

In Unicode Technical Report #26, CESU-8 is explicitly defined to support supplemental characters: "In CESU-8, supplementary characters are represented as six-byte sequences". Whereas the MySQL 8.0 Reference Manual explicitly states that supplemental characters are not supported: "Supports BMP characters only (no support for supplementary characters)". And the MySQL 3.23, 4.0, 4.1 Reference Manual (when utf8mb3 first appears, as "utf8") says the same: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP."

How do you reconcile these conflicting definitions of CESU-8 and utf8mb3? Is one of them wrong, or do they require further interpretation? If so, is that cited somewhere? I checked the citations, but I'm not seeing how they back up what you're saying -- they only seem to note that utf8mb3 doesn't support supplemental characters. If what you're saying is in fact true, I think further explication is needed beyond saying it is so, because the MySQL docs and UTR#26 seem to suggest that utf8mb3 and CESU-8 are definitionally different, at least when perused by a non-expert like myself trying to learn about the subject.

While I think the introductory paragraph is trying to shed some light, "many programs" is vague and not cited, and nor is it cited that MySQL is definitively one of those many programs, and nor is it cited that MySQL "transforms UCS-2 codes to three bytes or fewer" for utf8mb3. Does it? How do we know?

If what you're trying to say is that when UTF-16 supplemental characters are converted to UTF-8 as though they are UCS-2 (and not UTF-16), the result is what came to be called CESU-8, then I think you also need to say that while utf8mb3 is not intended to support supplemental characters at all, it functionally operates as CESU-8 if they are present. And ideally that should be backed up with a citation, or an example sufficient to demonstrate that this article is not the only place where one will find this assertion.
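
To make the byte-level claim concrete, here is a minimal Go sketch of the transformation described above: take the UTF-16 surrogate pair for a supplementary character and encode each 16-bit half with the three-byte UTF-8 bit layout, which is the six-byte form UTR #26 describes for CESU-8. The function names encode3 and cesu8 are made up for the example, and it demonstrates nothing about what MySQL actually stores in a utf8mb3 column.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// encode3 writes a 16-bit code unit using the three-byte UTF-8 bit layout
// (1110xxxx 10xxxxxx 10xxxxxx) without rejecting surrogates. It is only
// meaningful for values of 0x0800 and above. (Hypothetical helper.)
func encode3(u uint16) []byte {
	return []byte{
		0xE0 | byte(u>>12),
		0x80 | (byte(u>>6) & 0x3F),
		0x80 | (byte(u) & 0x3F),
	}
}

// cesu8 encodes a supplementary code point (U+10000..U+10FFFF) as the
// UTF-16 surrogate pair with each half in three bytes. (Hypothetical helper;
// other inputs are not handled.)
func cesu8(r rune) []byte {
	hi, lo := utf16.EncodeRune(r)
	return append(encode3(uint16(hi)), encode3(uint16(lo))...)
}

func main() {
	r := rune(0x1F600) // a supplementary character, surrogates D83D DE00

	fmt.Printf("UTF-8:  % x\n", string(r)) // f0 9f 98 80
	fmt.Printf("CESU-8: % x\n", cesu8(r))  // ed a0 bd ed b8 80
}
```

Standard UTF-8 uses four bytes for the same character, so the two forms differ for every supplementary character and agree for everything in the BMP (other than surrogates, which valid UTF-8 excludes).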

And, even if you're right that utf8mb3 and CESU-8 (and Oracle UTF8) are technically identical, it's still not correct to say that "MySQL calls [UTF-16 supplemental characters converted to UTF-8 as though they were UCS-2 characters] utf8mb3", because MySQL quite clearly defines utf8mb3 as being BMP-only; so MySQL is not "calling" anything involving supplemental characters utf8mb3.

Having now been trying to understand this for hours, I think this Oracle document explains it pretty well: "The UTF8 character set encodes characters in one, two, or three bytes...If supplementary characters are inserted into a UTF8 database...the supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes." If what you're saying is correct (and I don't know that it is, because I don't have anything authoritative saying so), then it sounds like this could be equally applicable to utf8mb3. The article could make that clear, if properly cited or demonstrated.

TL;DR: It's not accurate to describe utf8mb3 as having any representation of supplemental characters, even if it technically can do so as described by CESU-8, because it is defined otherwise. Further, claiming utf8mb3 is technically identical to CESU-8 warrants citation or demonstration, and the claim would benefit from greater clarity. Ivanxqz (talk) 00:45, 15 September 2020 (UTC)

Both of them translate a UTF-16 supplemental pair into exactly the same 6 bytes, and unpaired surrogate halves into exactly the same 3 bytes, therefore they are identical. Spitzak (talk) 21:20, 15 September 2020 (UTC)
Can you cite this anywhere? No original research, etc. The only source for your information is you. (And you haven't responded to anything that I wrote above, not even the TLDR -- even if technically identical, which you have only asserted and not cited, MySQL does not "call" CESU-8 "utf8mb3" as you state -- utf8mb3 explicitly does not support supplemental characters, and therefore any handling of them in the style of CESU-8 is an accident, not a design.) Ivanxqz (talk) 04:55, 16 September 2020 (UTC)

I decided to rewrite the CESU-8 section for what I think is greater clarity and accuracy. I included that CESU-8 in utf8mb3 is possible (though unsupported), on the basis of Spitzak's claim that it's the case. I noted that it needs a citation. I think it's not actually true, though, on the basis of Bernt's counter-demonstration at Talk:CESU-8#Comments, which I also just verified myself, and also the original references regarding utf8mb3 in the previous version, but I'll leave it for now. (Spitzak? Can you show somewhere why your claim that utf8mb3 can support supplemental characters via CESU-8 is accurate?)

I also gave utf8mb3 its own section again, since it is definitionally not CESU-8, even if technically it's the same thing (which, again, I don't think it is). It's like saying that Mountain Standard Time and Pacific Daylight Time are the same thing; they represent the exact same time of day in California and Arizona in the winter, but they're not the same thing, because they have different definitions. Ivanxqz (talk) 10:53, 16 September 2020 (UTC)