Internationalization and Localization: The Challenges
This article summarizes the challenges a development team encounters when implementing software with internationalization and localization support.
In a globalized world, software companies serve customers that do business in multiple geographical regions. Hence, we developers need to ensure that the software we build is usable in different languages and cultural contexts. In other words, our software must be designed with internationalization (i18n) in mind. This includes using Unicode character sets, flexible layouts (allowing bi-directional text), and externalizing strings that vary across languages. Localization (l10n) is the process of using i18n to adapt the software to the locale, language, and cultural requirements of a given market. Very good definitions of internationalization and localization can be found in this W3C article.
Moreover, UX, Engineering, Product Management, and other relevant stakeholders must ensure that the level and depth of i18n and l10n is consistent across the product suite. For example, it can be very user-unfriendly if one part of the UI is in Spanish while another is in English. It can even be misleading if one product uses the comma as a thousands separator while another uses it as a decimal separator. Not to mention the American date format mm/dd/yyyy vs. the European dd/mm/yyyy. This is especially problematic in embedded use cases, where the user has no way of telling where one UI ends and the other begins.
This article outlines multiple localization challenges, discusses their importance, and describes techniques that can be used when addressing them.
Scope
The scope of the document is limited to user interfaces, reports, exports (PDF or similar), and other user-facing features of our suite. Machine-readable endpoints generally should not be localized; there, the focus should be on standardized formats and dictionaries.
Libraries that Can Be Used
Many aspects described in the following sections can be handled with the support of open-source libraries. I strongly suggest that you always base your internationalization effort on battle-tested libraries and frameworks. The bulk of the work in web and enterprise development is nowadays performed by the front end. But as you will see, there are also multiple challenges to be addressed on the backend.
The Basics
Unicode
First of all, the application must use the Unicode character set. Unicode contains characters used in modern and historical languages as well as characters used in science, more than 140,000 characters in total. Using Unicode is the basic stepping stone to having an application translated into many languages, including those written in non-Latin scripts. For example:
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.
The quick brown fox jumps over the lazy dog.
我简直不能相信
დილა მშვიდობისა
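As a minimal sketch of what "using Unicode" means in practice, the snippet below round-trips the sample sentences through UTF-8 and compares strings after normalization. The helper names are made up for this illustration, and the choice of UTF-8 and NFC is an assumption rather than something the text above prescribes.

```python
import unicodedata

samples = [
    "Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.",
    "The quick brown fox jumps over the lazy dog.",
    "我简直不能相信",
    "დილა მშვიდობისა",
]


def roundtrip_utf8(text: str) -> str:
    """Encode to UTF-8 bytes and decode back, as a file or network hop would."""
    return text.encode("utf-8").decode("utf-8")


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization, so that a precomposed
    character and its base letter plus combining mark count as equal."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for sample in samples:
    # Character count and UTF-8 byte count differ for non-ASCII text.
    print(len(sample), len(sample.encode("utf-8")))
```

Note that the number of characters and the number of bytes differ for non-ASCII text, which matters for column widths and storage limits.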
Externalization of Strings
In order to translate the product into another language, all the strings used must be externalized. Externalization means that the language-specific strings should not be hardcoded in the application, and the application should internally only use dictionary keys. The dictionary for a given language is then provided as an additional file (or set of files).
en_GB.properties
greetingPageHeading=Welcome Page
greeting=Hello
cs_CZ.properties
greetingPageHeading=Uvítací stránka
greeting=Ahoj
page.template
<h1>${localization.greetingPageHeading}</h1>
<p>${localization.greeting}....</p>
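The lookup behind such a template could be sketched roughly as follows. The parser handles only plain key=value lines, and the fallback order (requested locale, then a default locale, then the key itself) is an assumption for illustration; the function names are hypothetical and do not come from any specific framework.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines; blank lines and # comments are skipped."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


# In a real application these strings would be read from the per-locale files.
DICTIONARIES = {
    "en_GB": parse_properties("greetingPageHeading=Welcome Page\ngreeting=Hello"),
    "cs_CZ": parse_properties("greetingPageHeading=Uvítací stránka\ngreeting=Ahoj"),
}


def translate(key: str, locale: str, default_locale: str = "en_GB") -> str:
    """Look up a key for the locale, then the default locale, then return the key."""
    for candidate in (locale, default_locale):
        value = DICTIONARIES.get(candidate, {}).get(key)
        if value is not None:
            return value
    return key


print(translate("greetingPageHeading", "cs_CZ"))
```

Returning the key itself for a missing entry makes untranslated strings easy to spot during testing.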
Common Pitfalls
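- Declension: nouns and adjectives change form depending on their grammatical case.
- Conjugation: verbs change form depending on person, number, or gender.
- Singular/Plural: many languages have more than two plural forms.
- Word order differs across languages, so sentences cannot be assembled by concatenating translated fragments.
As a result, we need to ensure that the library we use supports different forms of words (and different word ordering) depending on additional context.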
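To make the plural and word-order pitfalls concrete, here is a small sketch. The plural rules are a simplified, integer-only approximation for English and Czech, and the message table is hypothetical; a real project would rely on a library that implements the full rules instead of hand-writing them.

```python
def plural_category(count: int, language: str) -> str:
    """Simplified integer-only plural rules (an assumption for illustration)."""
    if language == "cs":
        if count == 1:
            return "one"
        if 2 <= count <= 4:
            return "few"
        return "other"
    # English and the fallback: singular for 1, plural otherwise.
    return "one" if count == 1 else "other"


# Named placeholders let each translation choose its own word order.
MESSAGES = {
    "en": {
        "one": "{count} file in folder {folder}",
        "other": "{count} files in folder {folder}",
    },
    "cs": {
        "one": "Složka {folder}: {count} soubor",
        "few": "Složka {folder}: {count} soubory",
        "other": "Složka {folder}: {count} souborů",
    },
}


def describe(count: int, folder: str, language: str) -> str:
    """Pick the message form matching the count and fill in the placeholders."""
    category = plural_category(count, language)
    return MESSAGES[language][category].format(count=count, folder=folder)


print(describe(3, "Docs", "cs"))
```

The English table needs two forms while the Czech one needs three, which is why a single "singular/plural" switch in the application code is not enough.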
Locale Support
A locale is a set of parameters that defines language and regional preferences. Locales are identified using ISO/IEC 15897 identifiers, which have the following format: [language[_territory][.codeset][@modifier]], for example cs_CZ or en_US.UTF-8. Locales are a well-defined and widely used building block of internationalization, and many libraries are built on them.
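The identifier format can be taken apart with a regular expression, as in the sketch below. The exact character classes allowed for each part are an assumption made for this example, and the Locale type is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# language[_territory][.codeset][@modifier]
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}|\d{3}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale(identifier: str) -> Locale:
    """Split a locale identifier into its parts; missing parts become None."""
    match = _LOCALE_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a valid locale identifier: {identifier!r}")
    return Locale(**match.groupdict())


print(parse_locale("de_DE.UTF-8@euro"))
```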
General Formatting
Numbers
Numbers, both integers and decimals, are formatted differently across locales. For example, the number “3500.1” is formatted as “3,500.1” in the en_US locale, while in the German locale (de_DE) it is “3.500,1”. The Czech locale (cs_CZ), in turn, uses a non-breaking space as the grouping separator.
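The three locales above can be reproduced with a few lines of code. The separator table is hard-coded here purely for illustration and covers only these three locales; in production, a localization library should supply this data.

```python
# (grouping separator, decimal separator) for the locales discussed above.
# Assumption: hard-coded for illustration only.
SEPARATORS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
    "cs_CZ": ("\u00a0", ","),  # non-breaking space
}


def format_number(value: float, locale: str, decimals: int = 1) -> str:
    """Format a number with the locale's grouping and decimal separators."""
    group_sep, decimal_sep = SEPARATORS[locale]
    # Python's "," format option groups thousands in the en_US style;
    # split at the decimal point and swap in the locale's separators.
    integer_part, _, fraction = f"{value:,.{decimals}f}".partition(".")
    formatted = integer_part.replace(",", group_sep)
    return formatted + (decimal_sep + fraction if fraction else "")


for locale in SEPARATORS:
    print(locale, format_number(3500.1, locale))
```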
Date Format
The American short date format uses the month-day-year order (mm/dd/yyyy), which follows the order of the expanded date form (June 27, 1998). In most countries, however, the day-month-year order is used (from the smallest time unit to the largest). The delimiter used in the short format also differs across locales; for example, in the Czech locale, the same date is written as “27. 6. 1998”, whereas British English uses a slash as the delimiter: “27/06/1998”. For obvious reasons (different names of months in different languages), the expanded form of the date differs widely across locales.
In APIs and machine-readable formats, one often encounters the standardized ISO 8601 format yyyy-mm-dd.
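A short sketch of the difference between display formats and the machine-readable format follows. The per-locale patterns are hard-coded assumptions covering only the examples from this section; unknown locales fall back to ISO 8601.

```python
from datetime import date


def short_date(d: date, locale: str) -> str:
    """Render a short date for a few locales (patterns assumed for illustration)."""
    if locale == "en_US":
        return d.strftime("%m/%d/%Y")
    if locale == "en_GB":
        return d.strftime("%d/%m/%Y")
    if locale == "cs_CZ":
        # No zero padding, and a dot plus space after each unit.
        return f"{d.day}. {d.month}. {d.year}"
    # Unknown locale: fall back to the unambiguous ISO 8601 form.
    return d.isoformat()


day = date(1998, 6, 27)
for locale in ("en_US", "en_GB", "cs_CZ"):
    print(locale, short_date(day, locale))
print("API", day.isoformat())
```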
Currency Formatting
If your products handle monetary data, you need a single, unambiguous visual representation of monetary values. The most straightforward way would be to use symbols, such as “$”. The issue with symbols is that they may be ambiguous; for example, the dollar symbol is used for the US dollar, Canadian dollar, Australian dollar, Chilean peso, Colombian peso, Mexican peso, and Brazilian real (though sometimes written as R$). This may be especially misleading in multi-currency situations, in which some transactions were executed in US dollars and others in Canadian dollars. The ECMAScript Intl API (ECMAScript 2022), for example, formats USD as “$123” in the en_US locale but as “123 US$” in the Czech locale.
As alluded to in the previous paragraph, the placement of the symbol may also differ. For example, the US dollar symbol in en_US is placed before the value without a space (“$123”). The Canadian dollar symbol is placed before the value in the English notation (“$123”) but after the value, separated by a space, in the French notation (“123 $”). The Czech koruna symbol is placed after the value, separated by a space (“123 Kč”).
Another option in currency formatting is to use the ISO 4217 codes, which leads to “123 USD” or “123 CZK”. This approach is naturally easier to implement; on the other hand, it may be perceived as unnecessarily verbose by users working in a single-currency context.
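Both options, locale-specific symbols and currency codes, can be sketched side by side. The pattern table below is a hand-written assumption that covers only the examples from this section, and the unambiguous flag is a hypothetical switch for multi-currency screens.

```python
# Symbol placement per (locale, currency); assumed for illustration only.
PATTERNS = {
    ("en_US", "USD"): "${amount}",
    ("en_CA", "CAD"): "${amount}",
    ("fr_CA", "CAD"): "{amount} $",
    ("cs_CZ", "CZK"): "{amount} Kč",
}


def format_money(amount: int, currency: str, locale: str, unambiguous: bool = False) -> str:
    """Use the locale's symbol pattern, or the currency code when asked to
    be unambiguous or when no pattern is known."""
    pattern = PATTERNS.get((locale, currency))
    if unambiguous or pattern is None:
        return f"{amount} {currency}"
    return pattern.format(amount=amount)


print(format_money(123, "USD", "en_US"))
print(format_money(123, "CAD", "en_CA"))
print(format_money(123, "CAD", "en_CA", unambiguous=True))
```

The first two calls print the same text for two different currencies, which is exactly the ambiguity the currency code avoids.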
Collation and Sorting
Note that alphabetical order is not the same across languages (even among those using the Latin script). For example, in Czech, “ch” is a digraph, and words starting with “ch” come after those starting with “h”. One implication is that if the backend returns only part of a sorted result set (for example, a single page), the backend must apply the locale's collation rules as part of the data retrieval. This is one of the places where backend support is required.
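The effect of the “ch” digraph can be demonstrated with a custom sort key. This is a deliberately simplified sketch: it ignores diacritics and case, which real Czech collation also has to handle, so a proper collation library should be used in production.

```python
# Simplified Czech alphabet: "ch" is a single letter sorted after "h".
# Assumption: diacritics are ignored to keep the sketch short.
CZECH_LETTERS = [
    "a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k", "l", "m",
    "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: index for index, letter in enumerate(CZECH_LETTERS)}


def czech_sort_key(word: str) -> list[int]:
    """Translate a word into a list of letter ranks, treating "ch" as one letter."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Unknown characters sort after all known letters.
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key


words = ["hrad", "chata", "cihla", "ibis"]
print(sorted(words))                      # plain code point order
print(sorted(words, key=czech_sort_key))  # "chata" moves after "hrad"
```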
Time Zones
Time zones are closely related to localization. Unless a time zone is implied by the context, always state it explicitly. Be aware that a company may have users across multiple time zones (corporations usually do). A single user may also move between time zones on business trips, and the local UTC offset shifts with daylight saving time. Finally, the server may be located in a different time zone from its users.
If you need a standard format in your APIs to accommodate the timezone-aware date/time format, I suggest looking at ISO 8601.
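One common way to stay explicit is to exchange timestamps as ISO 8601 strings with a UTC offset and to convert them for display only. The sketch below uses fixed offsets to stay self-contained, which is an assumption; real user time zones should come from a time zone database (for example, Python's zoneinfo module) so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(local_iso: str) -> str:
    """Parse an ISO 8601 timestamp with an explicit offset and return it in UTC."""
    parsed = datetime.fromisoformat(local_iso)
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no UTC offset; refusing to guess the time zone")
    return parsed.astimezone(timezone.utc).isoformat()


def for_user(utc_iso: str, offset_hours: int) -> str:
    """Convert a UTC timestamp to a user's fixed offset (illustration only)."""
    user_zone = timezone(timedelta(hours=offset_hours))
    return datetime.fromisoformat(utc_iso).astimezone(user_zone).isoformat()


stored = to_utc_iso("1998-06-27T14:30:00+02:00")
print(stored)
print(for_user(stored, -4))
```

Rejecting timestamps without an offset is a design choice of this sketch: it surfaces ambiguous input early instead of silently assuming the server's time zone.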
Precision and Scale of Numbers
Precision and scale are not an integral part of any generic localization package. Still, if your software operates with many different currencies and business verticals, the precision and scale of the numbers in your products may differ widely (selling gold by the metric ton in Indonesian rupiahs vs. selling needles by the piece in Swiss francs). It is thus vital that products support configurable precision and scale of numbers (at least on the UI level). The products may also support the shortening of numbers (kilo, mega, giga).
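A configurable scale and number shortening could look roughly like this. The function names, the rounding mode, and the suffix set are assumptions chosen for this sketch.

```python
from decimal import Decimal, ROUND_HALF_UP


def with_scale(value: Decimal, scale: int) -> str:
    """Round a decimal value to a configurable number of fraction digits."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
    return str(value.quantize(quantum, rounding=ROUND_HALF_UP))


def shorten(value: float, decimals: int = 1) -> str:
    """Shorten large numbers with kilo, mega, and giga suffixes."""
    for factor, suffix in ((1e9, "G"), (1e6, "M"), (1e3, "k")):
        if abs(value) >= factor:
            return f"{value / factor:.{decimals}f} {suffix}"
    return f"{value:.{decimals}f}"


print(with_scale(Decimal("0.03456"), 4))      # a price per needle
print(with_scale(Decimal("1234567.891"), 0))  # a price per metric ton
print(shorten(1_500_000))
```

Decimal is used for the rounding step because binary floating point cannot represent most decimal fractions exactly.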
International System of Units and Other Unit Systems
The International System of Units (SI; Système international (d’unités)) is a modern implementation of the metric system. SI is used in nearly every country of the world except for the United States, Liberia, and Myanmar. The imperial system is still partially used in countries of the British Commonwealth (but is gradually being replaced by SI).
Since both the business verticals of our customers and the regions in which they operate vary, we need to be able to support any unit (even a completely custom one). For example, pharma companies sell their products in boxes of various sizes, and wholesale businesses may use pallets or containers as a unit. Another company may sell milk in both the EU and the USA (liters vs. gallons).
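A unit registry that also accepts custom units can be as simple as a table of conversion factors to a base unit. In the sketch below, the unit keys and the "milk_crate" unit are hypothetical; the only external fact used is that the US gallon is defined as exactly 3.785411784 liters.

```python
from decimal import Decimal

# Liters per one unit. "milk_crate" is a hypothetical customer-defined unit
# (a crate of twelve one-liter cartons).
LITERS_PER_UNIT = {
    "l": Decimal("1"),
    "gal_us": Decimal("3.785411784"),
    "milk_crate": Decimal("12"),
}


def convert(quantity: Decimal, source: str, target: str) -> Decimal:
    """Convert between any two registered units via the base unit (liters)."""
    return quantity * LITERS_PER_UNIT[source] / LITERS_PER_UNIT[target]


print(convert(Decimal("2"), "gal_us", "l"))
print(convert(Decimal("24"), "l", "milk_crate"))
```

Adding a new customer-specific unit then only requires a new entry in the table, not a code change.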
Advanced Features
While the previous section focused on more or less basic software features for a globalized world, the complexity does not end there. As some of the following features may be expensive to implement (or take a lot of time to research), you should always carefully evaluate whether they pay off in your context.
Multi-Lingual Master Data
At our company, Vendavo, multiple products support multi-lingual master data. This means that customers may send us data in different languages — such as “pipe”@en and “potrubí”@cs. Depending on the user locale choice, the UI then renders all the data with labels valid in the current language and supports the collation of the data based on this choice.
This is one of the features that tend to be expensive and complex to implement. In addition, it significantly complicates the database layer and creates challenges in the UI when multiple products are combined to support richer functionality. This is especially true when a widget originating in one product is embedded in another.
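On the data side, the core of the feature is a per-record set of labels plus a fallback rule. The record shape and the fallback chain (full locale, then language, then a default language) in this sketch are assumptions for illustration and do not describe how any particular product stores its data.

```python
from typing import Optional


def pick_label(labels: dict[str, str], locale: str, fallback: str = "en") -> Optional[str]:
    """Choose a label for the locale: full locale, then language, then fallback."""
    language = locale.split("_")[0]
    for candidate in (locale, language, fallback):
        if candidate in labels:
            return labels[candidate]
    return None


# A hypothetical master data record with labels in two languages.
product = {"id": "P-100", "labels": {"en": "pipe", "cs": "potrubí"}}

print(pick_label(product["labels"], "cs_CZ"))
print(pick_label(product["labels"], "de_DE"))
```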
Localized Icons and Pictograms
Icons and pictograms are used to simplify user orientation within a product. However, the biggest pitfall is that not all pictograms are universally recognized and may have some unintentional meaning in different cultures and languages. For example, in many cultures, the rhombus symbol is associated with the “diamond” meaning, but in Czech/Slovak/Hungarian languages, it is used in …a rather different context. So be careful when designing symbols in your UI, and in case you know of a prominent market your product will be localized for, check if your icons and pictograms will be understood correctly.
Right-to-Left
When it comes to the translation of a product into other languages, it is important to realize that not all languages are written left-to-right. Languages such as Arabic, Azeri, Hebrew, Persian, or Urdu are written right-to-left. For developers, this does not only mean rendering all labels right-to-left; it also means mirroring the whole layout and certain graphical elements (such as timelines or navigation arrows).
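Whether a given piece of text should be laid out right-to-left can be estimated from its first strongly directional character. The sketch below uses the bidirectional classes from Python's unicodedata module; the function name and the left-to-right default for text without strong characters are assumptions.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):  # Hebrew-like and Arabic-like letters
            return "rtl"
        if bidi_class == "L":  # left-to-right letters
            return "ltr"
    # No strong character found (digits or punctuation only): assume left-to-right.
    return "ltr"


for label in ("Hello", "שלום", "مرحبا", "123"):
    print(label, base_direction(label))
```

The result can drive, for example, the text direction of a container, while mirroring icons and layout remains a separate design task.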
Calendars
While the majority of countries (168) use the Gregorian calendar, there are five countries that use a different civil calendar — Afghanistan, Ethiopia, Eritrea, Iran, and Nepal. Four other countries use variants of the Gregorian calendar — Japan, North Korea, Thailand, and Taiwan.
There are also multiple groups (religious, regional, ethnic…) that use non-Gregorian calendars, for example, the Julian, Hebrew, Hindu, and Chinese calendars.
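For some of the Gregorian variants, only the year numbering differs. As a rough sketch, the Thai Buddhist Era year is the Gregorian year plus 543, and the Minguo year used in Taiwan is the Gregorian year minus 1911. The table below is an assumption limited to these two cases; era-based systems such as the Japanese one and the non-Gregorian calendars need a dedicated calendar library.

```python
from datetime import date

# Year offsets relative to the Gregorian calendar (illustration only).
YEAR_OFFSETS = {
    "TH": 543,    # Thai Buddhist Era
    "TW": -1911,  # Minguo calendar
}


def civil_year(d: date, country: str) -> int:
    """Return the year number as displayed in the given country."""
    return d.year + YEAR_OFFSETS.get(country, 0)


day = date(2024, 1, 15)
for country in ("TH", "TW", "US"):
    print(country, civil_year(day, country))
```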
Documentation in Multiple Languages
Last but not least, the software itself is not the only deliverable of a software company. There are also other artifacts and one prominent example is documentation. Depending on your users’ language skills, you may need to translate your documentation into other languages, which may create some significant costs, especially if the translated documentation is expected to follow your continuous delivery of the software closely.
The need for translation of your documentation may reappear in case you decide to use fragments of the documentation as inline help within your application.
Summary
In this article, we have discussed multiple internationalization and localization challenges. Some do not require big investments at all, as they can be addressed within day-to-day development, provided one knows about the pitfalls (being explicit about the time zone, using Unicode, externalizing strings). Some issues can be fixed relatively easily by using dedicated libraries for your programming language (number and currency formatting, plurals, declensions…). Yet others will require additional implementation work (right-to-left, multi-lingual master data, documentation). Having at least the basics in place will save you a lot of time when the actual need occurs (a client in a different region, a client with employees from a different region), as backporting basic support to an existing app is a laborious and time-consuming process. The more advanced or less frequent cases should always be evaluated carefully, as the cost of implementation may outweigh the actual benefits for the business.
Published at DZone with permission of Pavel Micka. See the original article here.