SDB:Plain Text versus Locale




This article explains why one must consider the "locale" (the settings for language and cultural rules) when working with so-called "plain text files".

The commands which are shown in this article are not meant as some kind of authoritative instructions. They are only meant as examples to point out that "plain text" and "locale" are bound to each other and that there is a "plain text versus locale" conflict when both do not match.

This article is not meant as an explanation of the world of encoding. Therefore it neither provides background information nor mentions issues related to the underlying concepts (like "UTF-8 versus Unicode", see SDB_Talk:Plain_Text_versus_Locale). Such topics could be described in separate articles. This article is already long enough.

Special characters are used in this article. The following list names each of them and shows the character itself enclosed in parentheses:

  • Latin small letter a with diaeresis: ( ä )
  • Latin capital letter A with tilde: ( Ã )
  • Currency sign: ( ¤ )
  • Euro sign: ( € )
  • Greek small letter delta: ( δ )
  • Greek capital letter XI: ( Ξ )
  • Greek tonos accent: ( ΄ )
  • Latin capital letter A with circumflex: ( Â )
  • Latin small letter a with circumflex: ( â )
  • Not sign: ( ¬ )
  • Grave accent: ( ` )
  • Acute accent: ( ´ )

If your browser does not show all those special characters, i.e. if you see a pair of empty parentheses above, you may have to adjust settings in your browser so that it can display this article correctly.

Situation

You have so-called "plain text files" which you process "as usual since ever" with various "traditional" Unix/Linux tools, but for some time you have noticed weird results or unexpected side effects every now and then which did not happen in the past and/or which are not mentioned in the man pages of the traditional tools.

There is no such thing as "plain text"

There is ASCII text and there is text in various other encodings (e.g. ISO-8859-1 and UTF-8), but for a particular text the term "plain text" per se is meaningless.

Reasoning

So-called "plain text files" store no information about the encoding of their content.

"Encoding" means that when characters of a text are stored (e.g. in a file), the characters are encoded as byte values, see the section "Bytes versus characters" below.

The content of a "plain text file" is a plain sequence of bytes without any additional information about which encoding is meant by this sequence of bytes.

Therefore it is not possible to autodetect which encoding is meant for the bytes in a "plain text file". There are some tools to guess which encoding could be meant but their result is only a guess.
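One such guess can be sketched with "iconv", assuming an implementation (such as the one from glibc) that exits with a non-zero status on invalid input. The function name is_valid_utf8 is made up for this illustration. A positive result is still only a guess: pure ASCII text passes as well, and so can text in other encodings whose bytes happen to form valid UTF-8 sequences.

```shell
# Return success if the bytes on standard input form valid UTF-8.
# Assumption: iconv exits non-zero on invalid input (as glibc iconv does).
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "binär" as UTF-8 bytes: the check passes.
printf 'bin\303\244r' | is_valid_utf8 && echo "could be UTF-8"

# "binär" as ISO-8859-1 bytes: 0xE4 followed by "r" is impossible in UTF-8.
printf 'bin\344r' | is_valid_utf8 || echo "is not UTF-8"
```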

For a particular "plain text file" the term "plain text" per se is meaningless without additional information which encoding is meant by the sequence of bytes in the particular "plain text file".

If at all, the term "plain text" could be used to vaguely describe "any sequence of bytes which is meant as text in whatever encoding", for example to distinguish it from arbitrary binary data or from text which is stored in a higher-level format like HTML or PDF or as some kind of office document.

In contrast for a particular file of text the word "plain" should be replaced by its actual encoding like "ASCII text", "ISO-8859-1 text", or "UTF-8 text". Strictly speaking the word "text" is superfluous because sequences of bytes which are "ASCII", "ISO-8859-1", or "UTF-8" are text in the particular encoding.

Consequences

When programs process "plain text files", the user who runs the program must set up the locale environment to match the encoding of the "plain text file" before running the program.

A "locale" is a set of language and cultural rules, e.g. character sets, lexicographic conventions, etc. which are specified via various environment variables, in particular LC_ALL and LANG (see "man 7 locale"). The command "locale" shows the current values of the locale environment variables. The command "locale --all-locales" outputs all available locales which can be set as values for the locale environment variables.
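Which encoding a locale implies can be queried with "locale charmap". The following is only a sketch; the function name locale_charmap is made up for this illustration, and the exact name that is printed for ASCII in the POSIX locale differs between systems.

```shell
# Print the character encoding (codeset) of the given locale,
# or of the current locale environment if no argument is given.
locale_charmap() {
    if [ -n "${1:-}" ]; then
        LC_ALL="$1" locale charmap
    else
        locale charmap
    fi
}

locale_charmap          # encoding of the current locale environment
locale_charmap POSIX    # the system's name for the ASCII encoding
```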

To set a "traditional" Unix/Linux locale environment, use

export LC_ALL=POSIX ; export LANG=POSIX

If you want to process your "plain text files" as you did "since ever" with various "traditional" Unix/Linux tools, you must use the POSIX locale, otherwise you will get weird results and unexpected side effects.

To set a UTF-8 locale environment, use something like one of the following:

export LC_ALL=en_US.utf8 ; export LANG=en_US.utf8

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

export LC_ALL=de_DE.utf8 ; export LANG=de_DE.utf8
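Instead of exporting the variables for the whole shell session, the locale environment can also be set for a single command by prefixing the command with the assignment. The following sketch (the function name count_chars_posix is made up for this illustration) counts characters in the POSIX locale, where every byte is treated as one character.

```shell
# Count characters on standard input in the POSIX locale.
# The assignment in front of the command affects only this one command.
count_chars_posix() {
    LC_ALL=POSIX wc -m
}

# The UTF-8 encoded word "binär" has 6 bytes,
# so 6 "characters" are counted in the POSIX locale.
printf 'bin\303\244r' | count_chars_posix
```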

It is the user who runs the program who must somehow know the right encoding of their "plain text files".

A special case is what the program does when it gets bytes which are illegal according to the encoding which the user has specified by the locale environment.

Examples are a non-ASCII byte when the locale environment is "POSIX", or a sequence of bytes which is impossible in UTF-8 when the locale environment is "...utf8".

In this case the result is undefined because it is an implementation detail what the program does in this case. The program may abort or skip the illegal bytes or do whatever else.

Bytes versus characters

Depending on the locale, the same byte values can mean different characters and the same characters can be encoded as different byte values.

In ISO-8859-1 and ISO-8859-15 the hexadecimal byte value 0xE4 means the character ä (Latin small letter a with diaeresis, e.g. the German a-umlaut). In UTF-8 this character is encoded with a sequence of two bytes with hexadecimal byte values 0xC3 and 0xA4. But in ISO-8859-1 those two bytes mean the characters à (Latin capital letter A with tilde) and ¤ (currency sign) and in ISO-8859-15 those two bytes mean the characters à (Latin capital letter A with tilde) and € (Euro sign).

In ISO-8859-7 the hexadecimal byte value 0xE4 means the character δ (Greek small letter delta). In UTF-8 this character is encoded with a sequence of two bytes with hexadecimal byte values 0xCE and 0xB4. But in ISO-8859-7 those two bytes mean the character Ξ (Greek capital letter XI) and the accent ΄ (Greek tonos).

In ISO-8859-1 the hexadecimal byte value 0xA4 means the character ¤ (currency sign). In UTF-8 this character is encoded with a sequence of two bytes with hexadecimal byte values 0xC2 and 0xA4. But in ISO-8859-1 those two bytes mean the characters  (Latin capital letter A with circumflex) and ¤ (currency sign).

In ISO-8859-15 the hexadecimal byte value 0xA4 means the character € (Euro sign). In UTF-8 this character is encoded with a sequence of three bytes with hexadecimal byte values 0xE2, 0x82, and 0xAC. But in ISO-8859-15 those three bytes mean the characters â (Latin small letter a with circumflex), 0x82 is an unprintable "BPH" (break permitted here) ISO-8859 control character, and finally ¬ (not sign).
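The byte values mentioned above can be reproduced with "iconv" and "od". This is a minimal sketch which assumes that "iconv" knows the encoding names used here (as glibc iconv does); the function name recode_hex is made up for this illustration.

```shell
# Convert standard input from encoding $1 to encoding $2
# and print the resulting bytes as one hexadecimal string.
recode_hex() {
    iconv -f "$1" -t "$2" | od -An -t x1 | tr -d ' \n'
    echo
}

printf '\344' | recode_hex ISO-8859-1  UTF-8   # c3a4   (a with diaeresis)
printf '\344' | recode_hex ISO-8859-7  UTF-8   # ceb4   (Greek delta)
printf '\244' | recode_hex ISO-8859-1  UTF-8   # c2a4   (currency sign)
printf '\244' | recode_hex ISO-8859-15 UTF-8   # e282ac (Euro sign)
```

The same byte value on input yields different UTF-8 byte sequences because the source encoding assigns a different character to it.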

Only for ASCII text (the 7-bit hexadecimal byte values 0x00 up to 0x7F, see "man ascii") there is the same one to one mapping between byte values and characters for the "usual" encodings (in particular for ISO-8859 encodings and for UTF-8) so that only for ASCII text the locale environment does not make a difference.

The ASCII character set consists of 33 non-printable control characters (with hexadecimal byte values from 0x00 up to 0x1F and the 0x7F) and the following 95 printable characters beginning with the space character (with hexadecimal byte values from 0x20 up to 0x7E):

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ `
a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
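Whether a file consists only of ASCII bytes, so that the locale environment does not matter for it, can be checked by counting the bytes outside the range 0x00 up to 0x7F. The following is only a sketch; the function name count_non_ascii_bytes is made up for this illustration.

```shell
# Count the bytes on standard input which are outside the ASCII range.
# LC_ALL=C makes "tr" work on single bytes regardless of the caller's locale.
count_non_ascii_bytes() {
    echo $(( $(LC_ALL=C tr -d '\000-\177' | wc -c) ))
}

printf 'binary'       | count_non_ascii_bytes   # 0
printf 'bin\344r'     | count_non_ascii_bytes   # 1 (ISO-8859-1 encoding)
printf 'bin\303\244r' | count_non_ascii_bytes   # 2 (UTF-8 encoding)
```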

Single byte versus multibyte characters

ASCII and ISO-8859 encodings use one single byte to encode a character.

A single byte can store the decimal values 0 up to 255 so that a single byte encoding can support only up to 256 different characters.

Therefore with ASCII and ISO-8859 encodings certain characters cannot appear together in the same text. For example one cannot have ä (Latin small letter a with diaeresis) and δ (Greek small letter delta) in the same text because the same text cannot be both in ISO-8859-1 and in ISO-8859-7 encoding.
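This limitation can be observed with "iconv", which refuses to convert a character that does not exist in the target encoding. The sketch assumes glibc iconv behaviour (non-zero exit status when a character cannot be converted); the function name fits_in is made up for this illustration.

```shell
# Return success if the UTF-8 text on standard input
# can be represented in the encoding given as $1.
fits_in() {
    iconv -f UTF-8 -t "$1" >/dev/null 2>&1
}

# UTF-8 bytes: a with diaeresis = C3 A4, Greek delta = CE B4
printf '\303\244'         | fits_in ISO-8859-1 && echo "a-umlaut fits in ISO-8859-1"
printf '\316\264'         | fits_in ISO-8859-7 && echo "delta fits in ISO-8859-7"
printf '\303\244\316\264' | fits_in ISO-8859-1 || echo "both together do not fit in ISO-8859-1"
printf '\303\244\316\264' | fits_in ISO-8859-7 || echo "both together do not fit in ISO-8859-7"
```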

But with UTF-8 it is possible to combine arbitrary characters in one same text, but there is a price to pay: UTF-8 is a multibyte encoding.

Only for ASCII characters (see "man ascii") UTF-8 uses the same single byte encoding as ASCII so that UTF-8 is compatible with ASCII.

But for all non-ASCII characters UTF-8 uses two or more bytes so that UTF-8 is incompatible with all ISO-8859 encodings.

UTF-8 versus ISO-8859 incompatibility causes unexpected results

The German word for binary is "binär" with an ä (German a-umlaut) as last but one character.

In ISO-8859-1 and ISO-8859-15 the German a-umlaut ä is encoded as the hexadecimal byte value 0xE4 which is the octal byte value \0344.

In UTF-8 the German a-umlaut ä is encoded as the hexadecimal byte values 0xC3 0xA4 which are the octal byte values \0303 \0244.

Depending on the locale environment, the "wc" tool, which can count characters, reports different numbers of characters.

The non-ASCII character a-umlaut ä is input as a backslash escape sequence "\0nnn" of octal byte values so that it is possible to enter it with any keyboard regardless of whether there is an ä key or not. Furthermore "\0nnn" is safe against unexpected keyboard input results, see the section "Keyboard input depends on the locale environment" below.

user@host$ export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

user@host$ echo -en "bin\0344r" | wc --chars

 5

user@host$ echo -en "bin\0303\0244r" | wc --chars

 6

user@host$ export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

user@host$ echo -en "bin\0344r" | wc --chars

 4

user@host$ echo -en "bin\0303\0244r" | wc --chars

 5

The "wc" tool can only calculate the right number of characters when the locale environment in which the "wc" tool runs matches the encoding of the input (here the word "binär").
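If the input is known to be valid UTF-8, its characters can also be counted without relying on the locale environment, because every UTF-8 character contains exactly one byte that is not a continuation byte (continuation bytes have the values 0x80 up to 0xBF, octal 200 up to 277). The following is only a sketch under that assumption; the function name count_utf8_chars is made up for this illustration.

```shell
# Count characters in UTF-8 input by dropping all continuation bytes
# and counting the remaining bytes. LC_ALL=C makes "tr" work on single bytes.
count_utf8_chars() {
    echo $(( $(LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

printf 'bin\303\244r' | count_utf8_chars   # 5
```

For input in another encoding the result is wrong again. For example the single ISO-8859-15 byte 0xA4 (Euro sign) looks like a continuation byte and is not counted at all.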

Processing UTF-8 is slower than ISO-8859 and ASCII

The following sequence of commands creates two files, each with 100000 lines of the German word for binary "binär", one in ISO-8859-15 encoding and one in UTF-8 encoding.

Then the "wc" tool is run in the appropriate locale environment to count the right number of characters in each file (the newline control character which marks the end of each line is also counted).


user@host$ for i in $( seq 100000 ) ; do echo -e "bin\0344r" >>/tmp/text.iso885915 ; done

user@host$ for i in $( seq 100000 ) ; do echo -e "bin\0303\0244r" >>/tmp/text.utf8 ; done

user@host$ ls -l /tmp/text.*

 ... 600000 ... /tmp/text.iso885915
 ... 700000 ... /tmp/text.utf8

user@host$ export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

user@host$ time cat /tmp/text.iso885915 | wc --chars

 600000

 real    0m0.005s
 ...

user@host$ export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

user@host$ time cat /tmp/text.utf8 | wc --chars 

 600000

 real    0m0.050s
 ...

The "useful use of cat" makes sure that "wc" gets its input from a pipe so that "wc" has only the plain input to count the characters and cannot "cheat" by using another source of information (e.g. the file size in case of single byte encoding).

In this particular example counting the right number of characters in the UTF-8 file needs 10 times more computing time compared to counting the right number of characters in the ISO-8859-15 file.

The actual numbers depend very much on the particular system, the particular version of the particular program, and the particular version of the system libraries which are used by the particular program.

A minor reason is that UTF-8 files are bigger than ISO-8859 files because UTF-8 is a multibyte encoding.

In the above example the UTF-8 file is about 17 percent bigger than the ISO-8859 file so that this is not the main reason why processing the UTF-8 file is 10 times slower.

The main reason is that the UTF-8 multibyte encoding is more complicated than the ISO-8859 and ASCII single byte encodings.

Therefore processing UTF-8 is more complicated than processing single byte encodings so that processing UTF-8 needs more computing time than processing ISO-8859 and ASCII.

What is shown on the screen additionally depends on the font

Which characters are shown on the screen depends on the locale environment and additionally on the font which is used by the tool which shows the characters.

A particular font contains a particular glyph for a character. A glyph is a graphical representation of a character. For example one same character 'A' can be shown as various glyphs like

  • A (normal)
  • A (bold)
  • A (italic/slanted)
  • A (bold and italic/slanted)

The command "xlsfonts" lists all fonts which are available for the currently running X server (usually several thousand fonts). For example

xlsfonts -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso8859-15"

lists all currently available ISO-8859-15 fonts with style "fixed".

In the following commands the ASCII character ' (apostrophe) with hexadecimal byte value 0x27 is needed as single quotation mark. Do not confuse it with similar looking but different characters like the ASCII character ` (grave accent) or the ISO-8859-1 character ´ (acute accent) or various other accent characters like the ISO-8859-7 character ΄ (Greek tonos). Even when characters (more precisely when glyphs for characters) look exactly the same on the screen (this depends on the font which is used) the characters are different when their byte values differ.
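The difference between such look-alike characters becomes visible in their byte values. The following is only a sketch; the function name hex_dump is made up for this illustration.

```shell
# Print the bytes on standard input as one hexadecimal string.
hex_dump() {
    od -An -t x1 | tr -d ' \n'
    echo
}

printf "'"        | hex_dump   # 27   ASCII apostrophe
printf '`'        | hex_dump   # 60   ASCII grave accent
printf '\264'     | hex_dump   # b4   acute accent in ISO-8859-1, tonos in ISO-8859-7
printf '\302\264' | hex_dump   # c2b4 acute accent in UTF-8
```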

The commands

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso8859-15" -e "echo -e 'bin\0344r' ; sleep 9"

launch an xterm window in an ISO-8859-15 environment using an ISO-8859-15 font which shows the ISO-8859-15 encoded word "binär" as expected:

binär

In contrast the commands

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso8859-15" -e "echo -e 'bin\0303\0244r' ; sleep 9"

also launch an xterm window in an ISO-8859-15 environment using an ISO-8859-15 font which shows the UTF-8 encoded word "binär" as

binÃ€r

If a font is used which does not match the locale environment like in

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso8859-1" -e "echo -e 'bin\0303\0244r' ; sleep 9"

which launch an xterm window in an ISO-8859-15 environment but with an ISO-8859-1 font, the UTF-8 encoded word "binär" is shown as

binÃ¤r

To show the UTF-8 encoded word correctly, xterm must run in the locale environment which matches the encoding of the word, and xterm must use a font which matches the locale environment in which xterm runs. UTF-8 is defined for example in ISO-10646-1 so that an ISO-10646-1 font matches a UTF-8 locale environment:

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1" -e "echo -e 'bin\0303\0244r' ; sleep 9"

which shows

binär

In contrast the commands

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1" -e "echo -e 'bin\0344r' ; sleep 9"

may show something like

bin?

when the ISO-8859-1 or ISO-8859-15 encoded word "binär" should be displayed in a UTF-8 locale environment using a UTF-8/ISO-10646-1 font.

It is undefined what the program does when there is a sequence of bytes which is impossible in UTF-8. In such a case the program may show something like a ? character for the illegal UTF-8 byte sequence "\0344r". One could test this using

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1" -e "echo -e 'bin\0344rXX' ; sleep 9"

which may show something like

bin?XX

It does not matter if you do not get the exact same results on your particular system when your locale environment does not match the encoding of your text and/or when your font does not match your locale environment because it is an implementation detail what a particular version of a particular program does in such cases.

Only when your locale environment matches the encoding of your text and when your font matches your locale environment, you can get results which match your expectations.

Keyboard input depends on the locale environment

Assume you want to enter the German word for binary "binär".

When there is a key on the keyboard which is labeled with the a-umlaut ä, the current locale environment determines which byte values result when the ä key is pressed, i.e. which encoding of the a-umlaut ä character is applied.

To enter the word "binär" with the a-umlaut ä character in ISO-8859-15 encoding, launch an xterm window in an ISO-8859-15 environment using an ISO-8859-15 font:

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso8859-15" &

In this xterm window type (in particular do not copy and paste the word "binär" from somewhere, otherwise you may get unexpected results):

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

echo -n "binär" >/tmp/somefile

od -t x1 /tmp/somefile

The "export LC_ALL=... ; export LANG=..." in the xterm window is needed to make sure that the locale environment for the programs which are started from within the xterm (here "echo" and "od") matches the locale environment in which the xterm itself runs.

The "od" tool dumps the byte values in a file as hexadecimal numbers, which yields this output

62 69 6e e4 72

which represents the characters of the word "binär" in ISO-8859-15 encoding.

To enter the word "binär" with the a-umlaut ä character in UTF-8 encoding, launch an xterm window in a UTF-8 environment using a UTF-8/ISO-10646-1 font:

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1" &

In this xterm window enter (in particular type "binär" manually):

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

echo -n "binär" >/tmp/somefile

od -t x1 /tmp/somefile

Now "od" yields this output

62 69 6e c3 a4 72

which represents the characters of the word "binär" in UTF-8 encoding.

Non-ASCII characters in file names

If you really want to get into trouble: Use non-ASCII characters in file names.

Because keyboard input depends on the locale environment and what is shown on the screen additionally depends on the font, an easy way to go mad is to use non-ASCII characters in file names.

Launch an xterm window using

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso8859-15" &

In this xterm window enter (in particular type "binär" manually):

export LC_ALL=en_GB.iso885915 ; export LANG=en_GB.iso885915

touch /tmp/binär

ls /tmp/binär

As expected "ls" yields the output

/tmp/binär

Launch a second xterm window using a different locale:

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

xterm -fn "-*-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1" &

In the second xterm window enter (in particular type "binär" manually):

export LC_ALL=en_GB.utf8 ; export LANG=en_GB.utf8

ls /tmp/binär

Now "ls" yields this output

ls: cannot access /tmp/binär: No such file or directory

The reason is that the filename /tmp/binär on the disk is stored in ISO-8859-15 encoding as these byte values (hexadecimal numbers)

2f 74 6d 70 2f 62 69 6e e4 72

but in the second xterm window the filename /tmp/binär for the "ls" command is input in UTF-8 encoding as these byte values (hexadecimal numbers)

2f 74 6d 70 2f 62 69 6e c3 a4 72

and both do not match. There is no file with name "2f 74 6d 70 2f 62 69 6e c3 a4 72" on the disk. The file on the disk has name "2f 74 6d 70 2f 62 69 6e e4 72".

Because there is no file with name "2f 74 6d 70 2f 62 69 6e c3 a4 72" on the disk, it can be created in the second xterm window (type "binär" manually to get the file name in UTF-8 encoding)

touch /tmp/binär

so that now there are two separate files with the same filename characters

/  t  m  p  /  b  i  n  ä  r

that are stored on the disk under different filename byte values.

For the operating system a filename is a plain sequence of bytes without any additional information what characters are meant by this sequence of bytes, see above: There is no such thing as "plain text".
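The situation can be reproduced without any keyboard input by creating both file names from explicit byte values and listing the names as hexadecimal numbers. This is only a sketch which assumes a file system that stores names as raw bytes (as is usual on Linux); the function name make_twins and the temporary directory are made up for this illustration.

```shell
# Create two files whose names both display as "binär"
# but are stored under different byte values.
make_twins() {
    touch "$1/$(printf 'bin\344r')"       # ISO-8859-15 name: 62 69 6e e4 72
    touch "$1/$(printf 'bin\303\244r')"   # UTF-8 name:       62 69 6e c3 a4 72
}

dir=$(mktemp -d)
make_twins "$dir"

# Dump each file name as hexadecimal byte values.
# The C locale makes the shell treat the names as plain bytes.
(
    LC_ALL=C; export LC_ALL
    for f in "$dir"/*; do
        printf '%s' "${f##*/}" | od -An -t x1
    done
)
rm -r "$dir"
```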

Some side notes for the fun of weirdness

Use whitespace characters in file names to fool others

touch '/tmp/lostinspaces '

touch '/tmp/lostinspaces  '

ls /tmp
...
lostinspaces 
lostinspaces  
...

Explore fantastic opportunities with various non-ASCII (preferably UTF-8/Unicode) whitespace characters! ;)
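To make such trailing whitespace visible, each name can be printed between delimiters. The following is only a sketch; the function name show_names and the temporary directory are made up for this illustration.

```shell
# Print every name in directory $1 enclosed in square brackets
# so that leading and trailing whitespace becomes visible.
show_names() {
    for f in "$1"/*; do
        printf '[%s]\n' "${f##*/}"
    done
}

dir=$(mktemp -d)
touch "$dir/lostinspaces " "$dir/lostinspaces  "
show_names "$dir"
rm -r "$dir"
```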

Use non-ASCII characters in usernames and passwords to lock yourself out

In addition to your keyboard input depending on your locale environment the final outcome whether or not your input will be recognized as your valid username and password also depends on the locale environments of whatever subsequent tools and services that deal with your input.

Be excited about the thrill whether or not login works or fails here and there! ;s

See also https://openprinting.github.io/cups/faq.html which reads (excerpt):

Unicode passwords are poorly supported by web browsers, the Hypertext Transfer Protocol (HTTP), and by UNIX in general. Many browsers simply truncate password characters at 8 bits (!) and there is no way to know which character set is being used by the Pluggable Authentication Module (PAM) provided by the operating system. Thus, there is no way to reliably support Unicode passwords today.

The same would also apply for Unicode usernames but various documents explain what characters can or should be used for usernames. For example POSIX requires characters in usernames to be only from the so called "portable filename character set" but "man useradd" is even more strict:

Usernames must start with a lower case letter or an underscore, followed by lower case letters, digits, underscores, or dashes. They can end with a dollar sign. In regular expression terms: [a-z_][a-z0-9_-]*[$]?

Accordingly non-ASCII characters in usernames are invalid.
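The regular expression quoted above can be tried out with "grep". This is only a sketch; the function name is_valid_username is made up for this illustration. The check runs in the C locale so that the range a-z means only the 26 ASCII lower case letters, regardless of the caller's locale environment.

```shell
# Return success if $1 matches the username pattern quoted from "man useradd".
# -x requires the whole line to match, -q suppresses output.
is_valid_username() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z_][a-z0-9_-]*[$]?'
}

is_valid_username 'alice' && echo "alice: valid"
is_valid_username 'Alice' || echo "Alice: invalid"
# "binär" is rejected in both encodings
is_valid_username "$(printf 'bin\344r')"     || echo "ISO-8859-1 name: invalid"
is_valid_username "$(printf 'bin\303\244r')" || echo "UTF-8 name: invalid"
```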

Use non-ASCII characters in technical values (like URLs) to help attackers

While homographs in ASCII are relatively easily visible, homographs in Unicode are beyond recognition to the naked eye. By using UTF-8/Unicode in technical values (like URLs) your users must use such values and then attackers can play with homographs (cf. https://en.wikipedia.org/wiki/IDN_homograph_attack).

Feel happy that such attacks are not your problem but your users' problem! :s

Bottom Line

Because the number of bytes per character differs between ASCII and ISO-8859 encodings (one byte per character) and UTF-8 (one or more bytes per character), and because a UTF-8 locale environment is set by default nowadays, you can get all kinds of weird-looking results when you process a non-UTF-8 text file with any program, unless you set up your locale environment to match the encoding of your "plain text file" before you run the program.

But setting the matching locale environment alone could be insufficient to get the expected correct results. For example in case of output on the screen the font must also match the locale environment.

See also