
SDB Talk:Plain Text versus Locale

tagline: From openSUSE

I would not say '[the] term "plain text" per se is meaningless.' Plain text is characterized by not having a well-known format such as XML, RTF, or HTML.

With regard to encoding: XML, for example, supports a header like <?xml version="1.0" encoding="UTF-8"?>, whereas plain text is characterized by the absence of any such markup or metadata. Anything written inside a plain-text file is its content. There is no way to transport metadata (e.g. encoding information) along with it.
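To illustrate why the missing metadata matters, here is a minimal Python sketch (my own illustration, not taken from the article). It assumes a file that contains the single byte 0xA4 and nothing else, and uses only Python's built-in codecs. The reader has to guess the encoding, and each guess gives a different result:

```python
# The same single byte, as it might sit in a plain-text file with no metadata.
raw = b"\xa4"

as_latin1 = raw.decode("iso-8859-1")    # generic currency sign, U+00A4
as_latin9 = raw.decode("iso-8859-15")   # Euro sign, U+20AC

print(repr(as_latin1), repr(as_latin9))

# Under UTF-8 the lone byte is not even valid input.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print("valid UTF-8:", utf8_ok)
```

Nothing in the file itself says which of these readings is the intended one.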

The article explains the difference between single-byte encodings (ISO-8859-X) and the multi-byte encoding UTF-8, but fails to mention the Unicode concept behind UTF-8. When you google for e.g. 'euro sign character' you may find the hexadecimal number 0x20AC, also written as U+20AC. This is the Unicode codepoint of the Euro character. The range of codepoints goes well beyond 8 bits and makes it possible to represent virtually any language with one single numbering system. Software may represent a text that mixes e.g. Latin, Asian and other scripts internally as Unicode. Whenever Unicode text is written to a file, it gets encoded, since files consist of 8-bit bytes only. UTF-8 is an encoding that can represent the entire range of Unicode codepoints by using multiple bytes per character only where needed.
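A short Python sketch of the codepoint/encoding distinction, assuming Python 3, where a str is a sequence of Unicode codepoints and encode() turns it into bytes. The variable names are only for illustration:

```python
euro = "\u20ac"                      # the Euro sign as a Unicode string

codepoint = ord(euro)                # the abstract number, independent of any file
utf8_bytes = euro.encode("utf-8")    # what would actually be written to a file

print(f"U+{codepoint:04X}", utf8_bytes.hex(" "))

# UTF-8 uses more bytes only where needed: ASCII 'a', a-umlaut, Euro sign.
mixed = "a\u00e4\u20ac"
bytes_per_char = [len(ch.encode("utf-8")) for ch in mixed]
print(bytes_per_char)
```

The codepoint is one number (0x20AC), while the UTF-8 form of the same character is a sequence of three bytes.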

I'll give an example that illustrates how easy it is to confuse encodings and Unicode codepoints:

In ISO-8859-1 and ISO-8859-15 the hexadecimal byte value 0xE4 (decimal 228) means the character ä (Latin small letter a with diaeresis, e.g. the German a-umlaut). The Unicode codepoint of the same character is also 0xE4 (U+00E4). When written to a file as UTF-8, however, the German a-umlaut is represented as two bytes: 0xC3 followed by 0xA4. The codepoint of the Euro sign is hexadecimal 0x20AC (decimal 8364), which is outside the range of byte values (decimal 0..255) available to any ISO-8859-X encoding.
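The byte values for the a-umlaut can be checked with Python's built-in codecs. This is just a sketch for verification; the last step additionally shows what a reader sees when UTF-8 bytes are mistaken for ISO-8859-1:

```python
a_umlaut = "\u00e4"                       # codepoint U+00E4

latin1 = a_umlaut.encode("iso-8859-1")    # one byte
latin9 = a_umlaut.encode("iso-8859-15")   # one byte, same value
utf8 = a_umlaut.encode("utf-8")           # two bytes

print(latin1.hex(), latin9.hex(), utf8.hex(" "))

# Reading the UTF-8 bytes as if they were ISO-8859-1 yields two wrong characters.
misread = utf8.decode("iso-8859-1")
print(repr(misread))
```

In the single-byte encodings the byte value equals the codepoint (0xE4), whereas the UTF-8 form is 0xC3 0xA4, which an ISO-8859-1 reader displays as two unrelated characters.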

Writing Unicode codepoints 'as is' is always an error. Codepoints above decimal 255 do not fit in a byte and thus get corrupted. Nevertheless, this error often goes unnoticed in typical ISO-8859-X environments, because the codepoints that fit into 8 bits often happen to be identical to the byte values of these encodings.
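A sketch of how this kind of bug can hide. The function naive_bytes below is hypothetical and deliberately wrong: it assumes a writer that simply keeps the low 8 bits of each codepoint. Real software may fail differently (e.g. raise an error or write a '?'), so treat the truncation as one assumed failure mode:

```python
def naive_bytes(text):
    """Hypothetical buggy writer: keeps only the low 8 bits of each codepoint."""
    return bytes(ord(ch) & 0xFF for ch in text)

# For text within the Latin-1 range, the bug happens to match a real encoding.
unnoticed = naive_bytes("B\u00e4r")
same_as_latin1 = unnoticed == "B\u00e4r".encode("iso-8859-1")
print(unnoticed, same_as_latin1)

# With a codepoint above 255 (the Euro sign, 0x20AC) the output is corrupted.
corrupted = naive_bytes("5 \u20ac")
shown_as = corrupted.decode("iso-8859-1")
print(corrupted, repr(shown_as))
```

The first case looks correct only because the codepoints up to 255 coincide with the ISO-8859-1 byte values. The second case silently turns the Euro sign into byte 0xAC, which ISO-8859-1 displays as the 'not' sign.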

--Jnweiger 09:10, 11 March 2011 (MST)

What I (User:Jsmeix) would like to say is that the bare wording "plain text" is meaningless because "plain text files" have no well-known format. In other words: the words "plain text" alone carry almost no meaningful information. Perhaps the wording "per se" is not correct?

I think this article should be only about the "Plain Text versus Locale" topic and should not expand into an explanation of the "encoding hell" and the concepts behind it. I think that would be better provided in separate articles, so that this article does not become too long.

-- Jsmeix Tue Mar 15 12:51:33 CET 2011