Internationalization and Localisation

This lecture is about Internationalization and Localisation, which are usually abbreviated I18N and L10N in the profession (count the letters between the first and last).

The motivation behind i18n and l10n is that not everyone in the world is an English-speaking American. While this fact adds interest to life for the traveler and writer, it is a nuisance for programmers. Up until now we have been writing our programs for an American audience. The sections below cover some of the things we have to worry about to generalize our programs for an international audience.

Languages and Locales

To localize an application (instantiate a particular set of formats, characters, and messages) we need to know the language/locale pair we are targeting. Locale specifies place whereas language specifies, well, language. There are ISO standards (ISO 639 for languages and ISO 3166 for countries) which represent languages and locales as two-character codes. For example, American English is en_US and British English is en_GB (the ISO 3166 code for the United Kingdom is GB, not UK).
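
As a small illustration of these codes, the sketch below builds two locales and prints their language and country parts. The class name LocaleCodes and the helper code() are made up for this example; Locale.forLanguageTag takes the hyphenated tag form ("en-US") rather than the underscore form.

```java
import java.util.Locale;

public class LocaleCodes {

    /** Joins the ISO 639 language code and ISO 3166 country code of a locale. */
    public static String code(Locale loc) {
        return loc.getLanguage() + "_" + loc.getCountry();
    }

    public static void main(String[] args) {
        Locale american = Locale.forLanguageTag("en-US");
        Locale british = Locale.forLanguageTag("en-GB");

        System.out.println(code(american)); // en_US
        System.out.println(code(british));  // en_GB

        // Java's predefined constant is named UK, but its country code is GB.
        System.out.println(british.equals(Locale.UK)); // true
    }
}
```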

Character Sets and Encodings

Character sets and encodings are related but different concepts. A character set is a list of characters required for a language. An encoding is a mapping of characters into binary values. There are, unfortunately, a variety of character sets and encodings in use, some overlapping.

One of the concerns around character encodings is determining the format a text file (or XML document) is in, so it can be translated into the native encoding of a programming language (UTF-16 in the case of Java, originally UCS-2). There are three ways to find out. The format can be guessed from the default format of the system (US Linux and Windows systems have traditionally defaulted to Latin-1 or a close variant of it). Alternatively, the file can be tagged externally (for example in the HTTP header information for a file to be downloaded). Finally, the file can be self-describing, with the first bytes of the file devoted to the encoding type (for example a Unicode byte order mark, or the encoding attribute of an XML declaration).
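
To make the difference concrete, here is a minimal sketch showing one character taking different byte forms in two encodings, what happens when the wrong encoding is guessed, and a simplified check for a leading byte order mark. The class and method names are made up for this example, Latin-1 and UTF-8 are just convenient encodings to compare, and the byte order mark check ignores UTF-32.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {

    /** Number of bytes the given text occupies in the given encoding. */
    public static int encodedLength(String text, Charset encoding) {
        return text.getBytes(encoding).length;
    }

    /** Decodes bytes with the stated encoding, which may or may not be the right one. */
    public static String decode(byte[] data, Charset encoding) {
        return new String(data, encoding);
    }

    /**
     * Guesses the encoding of self-describing data from a leading byte order mark.
     * Returns null when none of the three marks checked here is present.
     */
    public static Charset sniffByteOrderMark(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE

        System.out.println(encodedLength(eAcute, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(eAcute, StandardCharsets.UTF_8));      // 2

        byte[] utf8 = eAcute.getBytes(StandardCharsets.UTF_8);
        // A wrong guess turns one character into two unrelated ones.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).length()); // 2
        // The right guess recovers the original text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(eAcute)); // true
    }
}
```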

Support in Java

Java has a number of facilities for l10n. As we saw in the stream IO lecture, the Java Reader and Writer classes attempt to deal with the character encoding issue and save the programmer much hassle on IO (at the price of more hassle for processing ASCII).
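
For instance, a Reader can be told which encoding its underlying bytes are in, and it then hands back ordinary Java characters. The sketch below reads from an in-memory byte array rather than a file so that it is self-contained; the class name ReaderSketch and the method readAll are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderSketch {

    /** Reads all characters from the bytes, decoding them with the given encoding. */
    public static String readAll(byte[] data, Charset encoding) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encoding)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 bytes for "caf" followed by e-acute.
        byte[] data = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};
        String text = readAll(data, StandardCharsets.UTF_8);
        System.out.println(text.length()); // 4 characters from 5 bytes
    }
}
```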

Java has a Locale class which contains a number of locale-specific utilities. To get a Locale for a given language/country pair, use the constructor (deprecated since Java 19 in favor of Locale.of("da", "DK")).

Locale loc = new Locale("da", "DK"); // get the locale for Danish/Denmark

This Locale object can be passed to factory methods on the Java NumberFormat and DateFormat classes to return NumberFormat and DateFormat objects targeted to the given locale for numbers, currencies, dates, etc.

// get localized currency formatter
NumberFormat form = NumberFormat.getCurrencyInstance(loc);

The Format classes have parse() and format() methods which will parse and print currency amounts in the locale's format.
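
A minimal sketch of formatting and parsing with a locale-specific currency formatter follows. The class name CurrencySketch and its two helpers are made up for this example, and it uses Locale.forLanguageTag to build the Danish locale. The exact Danish output depends on the locale data shipped with the JDK, so no expected value is shown for it.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class CurrencySketch {

    /** Formats an amount as currency using the conventions of the given locale. */
    public static String format(double amount, Locale loc) {
        return NumberFormat.getCurrencyInstance(loc).format(amount);
    }

    /** Parses a currency string written in the conventions of the given locale. */
    public static double parse(String text, Locale loc) {
        try {
            return NumberFormat.getCurrencyInstance(loc).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a currency amount for " + loc + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        Locale danish = Locale.forLanguageTag("da-DK");

        System.out.println(format(1234.5, Locale.US)); // $1,234.50
        System.out.println(format(1234.5, danish));    // Danish grouping, decimal comma, and krone symbol

        // Parsing reverses formatting when the same locale is used for both.
        System.out.println(parse("$1,234.50", Locale.US)); // 1234.5
    }
}
```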

String localization

The support for String localization is a little more primitive. For each String in the UI you must assign a key or index, then produce all of the language variants for that string. Now we need a way to organize all of these strings and automatically load the correct set.

To do this, one must put all the strings for each language into a class that inherits from ResourceBundle. Each class must be able to return the desired string given its key (i.e. a lookup method; in a ResourceBundle subclass this is handleGetObject(), which callers reach through getString()). These classes must be named according to the scheme MyResourceClass_language_country, for example MyResourceClass_da_DK. Now that we have all of the strings (and images and whatever other resources need to be localized) for each language in a class, we just need to load the right one for the locale. Java has a support method to do just this.
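
As a rough sketch of what such classes can look like, the example below uses ListResourceBundle, a standard ResourceBundle subclass that supplies the lookup from a table of key/value pairs. The keys and strings are invented for illustration. The two bundles are nested inside one class and instantiated directly only to keep the example in a single file; in a real application each would be its own public top-level class so that it can be found by name.

```java
import java.util.ListResourceBundle;
import java.util.ResourceBundle;

public class BundleSketch {

    /** Hypothetical base bundle: the strings used when nothing more specific exists. */
    public static class MyResourceClass extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"greeting", "Hello"},
                {"farewell", "Goodbye"},
            };
        }
    }

    /** Hypothetical Danish bundle: same keys, Danish strings. */
    public static class MyResourceClass_da extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"greeting", "Hej"},
                {"farewell", "Farvel"},
            };
        }
    }

    /** UI code asks for a string by key and never hard-codes the text itself. */
    public static String greeting(ResourceBundle bundle) {
        return bundle.getString("greeting");
    }

    public static void main(String[] args) {
        System.out.println(greeting(new MyResourceClass()));    // Hello
        System.out.println(greeting(new MyResourceClass_da())); // Hej
    }
}
```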

ResourceBundle bundle = ResourceBundle.getBundle("MyResourceClass",loc);
where loc is our Locale variable. This method will look for classes of the form MyResourceClass_language_country, then MyResourceClass_language, then MyResourceClass. In other words, it tries to load the most specific resource bundle class it can find for that locale. (If no locale-specific class is found, it also tries the same sequence for the JVM's default locale before falling back to MyResourceClass.) Once the class is loaded, lookup methods such as getString() can be used to retrieve Strings based on their keys.
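
The search order for a requested locale can be inspected without defining any bundle classes, using the ResourceBundle.Control helper that drives the default lookup. The sketch below only lists the candidate class names for a locale and does not load anything; the class name BundleSearchOrder and the method candidateNames are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

public class BundleSearchOrder {

    /** Lists the bundle names tried for a locale, most specific first. */
    public static List<String> candidateNames(String baseName, Locale loc) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, loc)) {
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        Locale danish = Locale.forLanguageTag("da-DK");
        System.out.println(candidateNames("MyResourceClass", danish));
        // [MyResourceClass_da_DK, MyResourceClass_da, MyResourceClass]
    }
}
```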