éncöding hèll (part 3)

The (currently) last part of my encoding hell series. To finish up I’ll show some samples.

force_encoding and encode

Ruby is smart enough to not encode a string if it is already in the target encoding. This might not be what you want if your data was tagged with the wrong encoding in the first place. You can use force_encoding in such cases:

data = "\xF6\xE4\xFC"
p data.encoding
# => #<Encoding:UTF-8>

p data.encode('utf-8')
# => "\xF6\xE4\xFC"

p data.force_encoding('iso-8859-1').encode('utf-8')
# => "öäü"

transcoding

Data read from a file is assumed to be in UTF-8 by default. You can change that using the encoding option, but then you end up with Strings that are encoded in something other than UTF-8. Ruby offers an easy way to transcode while reading, so you only have to deal with UTF-8 Strings:


data = File.read('file.txt')
puts data.encoding
# => UTF-8

data = File.read('file.txt', encoding: 'iso-8859-1')
puts data.encoding
# => ISO-8859-1

data = File.read('file.txt', encoding: 'iso-8859-1:utf-8')
puts data.encoding
# => UTF-8

String concatenation

Concatenation works as long as both Strings are in the same or in a compatible encoding. Incompatible encodings can bite you in places where you don't expect it, for example when writing a CSV file or just printing out some log information.


utf = "öäü"
iso_1 = utf.encode('iso-8859-1')
iso_2 = "oau".encode('iso-8859-1')
ascii = "oau".encode('ascii')

puts utf + utf
# => öäüöäü

puts utf + iso_1
# => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ISO-8859-1

puts utf + iso_2
# => öäüoau

puts utf + ascii
# => öäüoau

puts and p

Ruby calls #inspect on an object when you pass it to p. This leads to some interesting behaviour when printing Strings of different encodings.

p "öäü".encode('iso-8859-1')
# => "\xF6\xE4\xFC"

puts "öäü".encode('iso-8859-1')
# => ���
# Note: if you run this from Sublime, then you might see the following message:
# [Decode error - output not utf-8]

puts "öäü".encode('iso-8859-1').inspect
# => "\xF6\xE4\xFC"

éncöding hèll (part 2)

How does Ruby deal with encoding? Here are some important parts.

There are multiple encoding settings and multiple ways they are initialized. And of course this differs depending on the Ruby version used. Most of this has been found out by trial and error, as I could not find concise documentation of all these values. Corrections are welcome.

__ENCODING__

This is the encoding used for string literals. It is determined by the encoding of the source file (defaulted to US-ASCII in 1.9, now defaults to UTF-8), which can be changed with an encoding comment on the first line (e.g. # encoding: CP1252).
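
A minimal sketch (the magic comment only applies to the file it is in, so save this as its own file):

# encoding: ISO-8859-1
# With the magic comment above, literals in this file are tagged ISO-8859-1.
puts __ENCODING__       # => ISO-8859-1
puts "abc".encoding     # => ISO-8859-1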

Encoding.default_internal

The default internal encoding. Strings read from files, CSV, ARGV and a few other sources are transcoded to this encoding if it is not nil. According to the docs the value can be changed using the -E option. This did not work for me though, neither with 1.9.3 nor with 2.2.0.
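
The value can also be set from code. A minimal sketch, assuming a file latin1.txt saved in ISO-8859-1:

Encoding.default_internal = Encoding::UTF_8

# The string is transcoded from the external to the internal encoding on read.
data = File.read('latin1.txt', encoding: 'iso-8859-1')
puts data.encoding
# => UTF-8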

Encoding.default_external

The default external encoding. Data written to disk will be transcoded to this encoding by default. It is initialized by:

  1. the -E option
  2. the LANG environment variable

If neither is set, it seems to default to ASCII-8BIT.
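
A quick way to inspect these defaults (the script name defaults.rb is made up; the actual values depend on your Ruby version, LANG and any -E option):

# defaults.rb - print the encoding defaults of the current process
puts __ENCODING__               # source encoding of this file
p Encoding.default_internal     # usually nil
p Encoding.default_external     # e.g. #<Encoding:UTF-8>

# Compare e.g.:
#   LANG=C ruby defaults.rb
#   ruby -E iso-8859-1 defaults.rb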

String#encoding

Returns the current encoding of the String. For string literals this is usually __ENCODING__.
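
For example, in a UTF-8 source file:

p "öäü".encoding                   # => #<Encoding:UTF-8>
p "öäü".encoding == __ENCODING__   # => true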

String#force_encoding

Changes the encoding. Does not re-encode the string but changes what #encoding will return.
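
A quick check that only the label changes, not the bytes:

data = "\xF6\xE4\xFC"              # ISO-8859-1 bytes, tagged as UTF-8
p data.valid_encoding?             # => false
data.force_encoding('iso-8859-1')
p data.valid_encoding?             # => true
p data.bytes                       # => [246, 228, 252] - unchanged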

String#encode

Re-encodes the string into the target encoding. Does not change the string if it is already in that encoding. You can pass transcode options.
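
For example, with a character that has no equivalent in the target encoding:

"Straße".encode('ascii')
# => Encoding::UndefinedConversionError: U+00DF from UTF-8 to US-ASCII

"Straße".encode('ascii', undef: :replace, replace: '?')
# => "Stra?e"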

I18n.transliterate

Replaces non-US-ASCII characters with a US-ASCII approximation (provided by the i18n gem).
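
A minimal sketch, assuming the i18n gem is available (standalone you may have to declare the available locales first):

require 'i18n'
I18n.available_locales = [:en]

I18n.transliterate("öäü")      # => "oau"
I18n.transliterate("Dvořák")   # => "Dvorak"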

In the next part I’ll show some example code. It should be online soon.

éncöding hèll (part 1)

If you ended up reading this, then you know what I am talking about. It's this garbled up text. That umlaut which got lost. Those diacritical marks that don't show up. And then you start blaming the accent-grave French, the umlaut Germans, the diacritic Czechs and so on (not even mentioning Chinese/Japanese/… languages).

Why can't we all live with ASCII? Surely 8 bits ought to be enough for everyone ;-)

So this is the first post on my encoding hell. I plan to follow up with more posts on this topic.

The situation

At Simplificator we recently worked on an ETL application. This application loads data from various sources (databases, files, services, e-mails), processes it (merge, filter, extract, enrich) and stores it in various destinations (databases and files). This application is a central tool for data exchange between multiple companies. The data exchanged ranges from lists of employees to the warranty coverage of refrigerators. Not all sources are under our control and neither are the targets.

And this is where the problems started. Some sources deliver UTF-8, some use CP-1252, some are in ISO-8859-1 (a.k.a. Latin-1). Some destinations expect ISO-8859-1 and some expect UTF-8.

ISO 8859-1 (ISO/IEC 8859-1) actually only specifies the printable characters; ISO-8859-1 as defined by IANA (note the dash) adds some control codes.

The problem

While in most programming languages it is easy to change the encoding of a string, this sometimes causes more trouble than is visible at first sight. These encodings range from 256 to more than 1'000'000 code points. In other words: UTF-8 can represent every character of CP-1252 and ISO 8859-1, but not the other way around. Going from those 8 bit encodings to a variable-length encoding of up to 4 bytes per character is always possible, while the reverse depends on the content. If a UTF-8 String contains characters which cannot be encoded in 8 bits, then you have a problem. Say hi to your new best friend, Encoding::UndefinedConversionError (or whatever your programming language of choice throws at you in such cases).
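
In Ruby this looks roughly like this:

# Going "up" to UTF-8 always works...
"öäü".encode('iso-8859-1').encode('utf-8')
# => "öäü"

# ...going "down" only works if every character exists in the target encoding.
"Petr Čech".encode('iso-8859-1')
# => Encoding::UndefinedConversionError: U+010C from UTF-8 to ISO-8859-1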

The solutions

There are two solutions for this. Both are relatively easy. And both might cause trouble for the consumers of your output. As mentioned, the problem only shows when the content (or parts of it) is not covered by the destination encoding. CP-1252, ISO 8859-1 and UTF-8 share the characters between 0020 and 007E (printable ASCII). As long as your content stays within that range, the bytes don't change at all when you change the encoding. But if your content lies outside this range, then you either have to:

  1. Use UTF-8 for your output everywhere: Going from UTF-8/CP1252/ISO-8859-1 to UTF-8 is easy. As long as you stay in the current encoding or move from a "small" encoding to a "big" encoding you are safe. If possible, this is the preferable solution.
  2. Use transliteration: This means mapping content from one encoding to the other by replacing characters that are unknown in the target encoding with something similar or with a special mark. So "Petr Čech" could become "Petr Cech" or "Petr ?ech" (see the sketch after this list). Depending on your use case, one or the other might be more appropriate.
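
A minimal Ruby sketch of both flavours, reusing the example above (I18n.transliterate comes from the i18n gem, String#encode is built in):

require 'i18n'
I18n.available_locales = [:en]

name = "Petr Čech"

I18n.transliterate(name)
# => "Petr Cech"

name.encode('iso-8859-1', undef: :replace, replace: '?')
# => "Petr ?ech" (now tagged as ISO-8859-1)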

The new problems

I told you… both solutions might cause trouble.

If the consumer of your output cannot deal with UTF-8, then this is not an option. It would just move the problem out of your sight (which might be good enough ;-)).

If you have changed the name "Petr Čech" to "Petr Cech" and this data is later imported into another system, it might or might not match up, e.g. if the other system is looking for a user "Cech" but only knows about a user "Čech".

Also, transliteration (in our case at least) is irreversible. There is no way to get back to the original form.

Summary

If possible, stay within one encoding from input to processing to output. In my experience you'll have fewer problems if you choose UTF-8, as it can cover a wide range of languages, unlike ASCII, CP1252 or ISO-8859-1.

According to this graph (source), UTF-8 is used more and more on the web. Hopefully one day we won't have to think about ISO-8859-1 anymore.

Growth of UTF-8