SDB:Converting Files or File Names to UTF-8 Encoding



Version: 9.1

Symptom

Special characters in files or file names are not properly displayed.

Cause

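As of SUSE LINUX version 9.1, the default system locale uses UTF-8 encoding. The locale name has the form "lang_LANG.UTF-8", for example "de_DE.UTF-8". File names and file contents written under an older, non-UTF-8 locale are therefore no longer interpreted correctly. In addition, files copied from Windows filesystems might have names in a different encoding.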
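To confirm that the system is actually running with a UTF-8 locale, you can query the active locale settings. This is only a quick check; the output depends on your environment.

```shell
# Print the character set of the active locale, for example "UTF-8"
# or "ISO-8859-1".
locale charmap

# Show the locale variables (LANG, LC_CTYPE, ...) that determine it.
locale
```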

Solution

There are several approaches for the conversion to UTF-8 encoding:

If file names are displayed incorrectly, use the script "convmv" to convert them to UTF-8. First install the script:

sudo zypper in convmv

Then convert the names. For example, to convert all file names below the current directory from Latin-1:

convmv --notest -r -f latin1 -t utf-8 .

If the files are from a Windows filesystem, you can preview the conversion of the file names below the current directory with:

convmv -r -f cp1252 -t utf-8 .

Without the option --notest, convmv only shows what it would do. Check this output to make sure you have chosen the correct input encoding, then add --notest to the command to actually rename the files. Another encoding that can occur on English-language systems is cp437.
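Before converting, it can help to see which file names are affected at all. The following is a minimal sketch, not part of convmv: it lists every path whose name is not valid UTF-8 by passing each name through iconv. The function names is_valid_utf8 and list_non_utf8_names are hypothetical helpers chosen for this example.

```shell
# Succeeds if the bytes on standard input form valid UTF-8.
# (Hypothetical helper; iconv exits non-zero on an invalid sequence.)
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# List every path below the given directory (default: current directory)
# whose name is not valid UTF-8.
# Limitation: paths containing newlines are not handled by this loop.
list_non_utf8_names() {
    find "${1:-.}" | while IFS= read -r path; do
        printf '%s' "$path" | is_valid_utf8 || printf '%s\n' "$path"
    done
}
```

Calling list_non_utf8_names without an argument checks the current directory. If it prints nothing, all names are already valid UTF-8 and no conversion is needed.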

If you have problems concerning the incorrect representation of file contents, use the command "iconv" to convert them to UTF-8. For example:

iconv -f latin1 -t utf-8 document.txt > document_new.txt
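iconv writes the converted text to a separate file. If you would rather convert a file in place, a small wrapper along the following lines could be used. This is a sketch under assumptions: convert_to_utf8 is a hypothetical helper name, and the check for "already UTF-8" relies on the fact that text in other encodings rarely happens to be valid UTF-8, unless it is pure ASCII, which needs no conversion anyway.

```shell
# Convert one file to UTF-8 in place.
# Usage: convert_to_utf8 SOURCE_ENCODING FILE   (hypothetical helper)
convert_to_utf8() {
    from=$1
    file=$2

    # Skip files that already decode as UTF-8 to avoid converting twice.
    if iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
        echo "already UTF-8: $file"
        return 0
    fi

    # Convert into a temporary file first, so that a failed conversion
    # never damages the original.
    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t UTF-8 "$file" >"$tmp"; then
        # Overwrite the contents only; permissions and owner are kept.
        cat "$tmp" >"$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "conversion failed: $file" >&2
        return 1
    fi
}
```

For example, convert_to_utf8 latin1 document.txt converts document.txt from Latin-1. Keep a backup of important files until you have checked the result.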

To switch the system back to the ISO encoding instead, open the "Language selection" module in the "System" section of the "YaST Control Center". The language currently in use is preselected when the module starts. Click "Details" and disable the use of UTF-8 encoding in the dialog that appears. Accept the modified settings and exit YaST.


The release notes for SUSE LINUX version 9.1 include a section on this subject:

UTF-8 Encoding Is Default

See http://www.suse.de/~mfabian/suse-cjk/locales.html

Non-UTF-8 File Names

Files from file systems created with SUSE LINUX versions up to 9.0 do not use UTF-8 encoding for the file names (unless otherwise specified). If these file names include non-ASCII characters, they are not properly displayed with SUSE LINUX 9.1 or newer versions. To avoid this, use the script convmv to convert the file names to UTF-8.