MySQL's UTF-8 Isn't Real
In this blog, we will figure out why you should avoid using MySQL's "utf8" character set and what to choose instead.
If you are familiar with MySQL, you probably already know that it comes with multiple character encodings. One of the main encodings in the MySQL world, and on the web in general, is UTF-8: it is used in almost every web and mobile application, and is widely considered to be the "default" option as far as MySQL is concerned. In MySQL, UTF-8 is implemented by more than one character set, and it has a few quirks of its own. In this blog, we are going to go through them, and one in particular: the fact that MySQL's "utf8" is not the "real" UTF-8. Confused? Read on!
What Is UTF-8?
To start with, UTF-8 is one of the most common character encodings. In UTF-8, each character is represented by one to four bytes. MySQL offers a couple of character sets for it:
- utf8 which, in the past, was considered the "de-facto" standard for MySQL. Essentially, utf8 is an alias for utf8mb3.
- utf8mb3 which uses one to three bytes per character.
- utf8mb4 which uses one to four bytes per character.
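To make the byte counts above concrete, here is a minimal Python sketch that measures how many bytes a few characters take in standard UTF-8. The sample characters and the `utf8_length` helper are our own picks for illustration and are not part of MySQL.

```python
# Sample characters, written as escapes so the snippet does not depend
# on the encoding of the file it is pasted into.
samples = {
    "latin letter A": "A",            # U+0041
    "e with acute": "\u00e9",         # U+00E9
    "euro sign": "\u20ac",            # U+20AC
    "grinning face": "\U0001F600",    # U+1F600
}


def utf8_length(ch: str) -> int:
    """Return how many bytes the character occupies in standard UTF-8."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")
```

The first three samples need at most three bytes, so they fit in utf8mb3. The last one needs four, which only utf8mb4 can hold.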
utf8 was the go-to character set when MySQL was starting out, and everything was great. However, talk to MySQL database administrators these days, and you will quickly realize that this is no longer the case. Simply put, utf8 is not the character set of choice anymore, and it is not the default either: as of MySQL 8.0, utf8mb4 is.
utf8 vs. utf8mb4
The core reason for the separation of utf8 and utf8mb4 is that MySQL's "utf8" is different from proper UTF-8 encoding. That's because "utf8" doesn't offer full Unicode support, which can lead to data loss or even security issues. This failure to fully support Unicode is the real kicker: the UTF-8 encoding needs up to four bytes per character, while the "utf8" character set offered by MySQL only supports three. See the issue on that front? In other words, if we want to store a smiley (an emoji, most of which take four bytes in UTF-8), we cannot do it.
It's not that MySQL will store it as "???" or similar: it won't store it at all, and will respond with an error message like the following:
Incorrect string value: '\x77\xD0' for column 'demo_column' at row 1
With this error message, MySQL is saying "well, I don't recognize the characters that this smiley is made out of. Sorry, nothing I can do here." At this point, you might be wondering what is being done to overcome such a problem. Is MySQL even aware of its existence? It would be a lie to say that the MySQL developers are not aware of this issue. They are, but they never got around to fixing utf8 itself. Instead, they released a workaround more than a decade ago, along with MySQL 5.5.3.
That workaround is called "utf8mb4". utf8mb4 is pretty much the same as its older counterpart, utf8, except that it uses one to four bytes per character, which means that it can store the full range of Unicode characters, emoji included.
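One way to check ahead of time whether a piece of text would trip over the three-byte limit is to look for code points above U+FFFF, since those are exactly the ones that take four bytes in UTF-8. The sketch below is a hypothetical application-side helper (`needs_utf8mb4` is our own name), not something MySQL provides.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text takes four bytes in UTF-8.

    Code points above U+FFFF lie outside the Basic Multilingual Plane
    and cannot be stored in a three-byte-per-character set like utf8mb3.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


plain = "caf\u00e9 for 5 \u20ac"       # accented letter and euro sign
with_emoji = "hello \U0001F600"         # contains a four-byte character

print(needs_utf8mb4(plain))
print(needs_utf8mb4(with_emoji))
```

A check like this can be handy when auditing existing data before a migration, but it is no substitute for switching the column itself to utf8mb4.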
Use MySQL 8.0, work with data a little, and you will quickly notice that utf8mb4 is indeed the default character set. Moreover, the MySQL documentation says that utf8 is expected to become a reference to utf8mb4 at some point.
Flavors of utf8mb4
As utf8 is being outpaced by utf8mb4 on almost all fronts, it's natural that utf8mb4 comes with a couple of collations to choose from. Essentially, a collation is a set of comparison and sorting rules designed to fit specific data sets. Two of the most common ones are:
- utf8mb4_general_ci is geared towards a more "general" use of MySQL and utf8mb4. This collation is widely regarded to take "shortcuts" in comparison and sorting to improve speed, which may result in sorting errors in some cases.
- utf8mb4_unicode_ci is geared towards "advanced" users: it is based on the Unicode rules for sorting and comparison, so we can rest assured that our data will be dealt with properly if this collation is in use.
In this case, do note the "_ci" ending towards the collation: that stands for "case insensitive." Case insensitivity is related to sorting and comparison.
These two "flavors" of utf8mb4 are used more and more. As newer versions of MySQL are released, the utf8mb4_unicode_ci collation is becoming the collation of choice for most people working with MySQL today. One fact is for certain: not everyone using MySQL this way knows the upsides of utf8mb4 as opposed to its counterpart utf8, but they will certainly see a difference when they import data containing unusual characters! Did we convince you to head over to the realm of utf8mb4 yet?
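To get a feel for what the "_ci" (case insensitive) suffix means for sorting and comparison, here is a rough analogy in Python. It uses `str.casefold()` as a stand-in for a case-insensitive collation. This is an assumption made purely for illustration: MySQL's collations follow their own rules, and the sample words are ours.

```python
# Illustration only: str.casefold() stands in for a "_ci" collation.
# MySQL's real collations apply their own, more detailed rules.
names = ["banana", "Cherry", "apple"]

# Case-sensitive ordering compares raw code points, so the uppercase
# "C" sorts before every lowercase letter.
case_sensitive = sorted(names)

# Case-insensitive ordering compares the case-folded forms instead.
case_insensitive = sorted(names, key=str.casefold)


def equal_ci(a: str, b: str) -> bool:
    """Compare two strings while ignoring case."""
    return a.casefold() == b.casefold()


print(case_sensitive)
print(case_insensitive)
print(equal_ci("MySQL", "mysql"))
```

With a case-insensitive collation, a lookup for "mysql" would likewise match a stored "MySQL", which is usually what people expect from text search in an application.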
Properly Working with utf8mb4-based Data
Here's how some developers go about creating databases and tables based on utf8mb4:
Except that a query pairing a utf8mb4-based collation with the latin1 character set gives us an error, which is a frequent point of confusion for beginners and advanced developers alike. MySQL is essentially saying that when we use a collation based on utf8mb4, we should also use a compatible character set, and latin1 isn't one. So whatever you do, keep these points in mind:
- utf8 is not the real UTF-8 in MySQL and its flavors (MariaDB and Percona Server): utf8 only supports up to 3 bytes per character, while utf8mb4 supports up to 4, which is what utf8 should have done in the first place. If utf8 is in use, some characters cannot be stored properly.
- When we elect to use a utf8mb4 collation instead of a utf8 one in MySQL, we should also make sure that we use the matching character set (utf8mb4). Note the success message underneath the query:
Now we are good to go - we can store all kinds of characters inside of our database and not have MySQL error out with an "Incorrect string value" error! Woohoo!
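The pairing rule above can be sketched as a small Python helper that builds a `CREATE DATABASE` statement and refuses mismatched pairs. It relies on MySQL's naming convention that a collation name starts with the name of its character set. The helper name `build_create_database` and the database name `demo_db` are hypothetical, and the code only produces the SQL text: it does not connect to a server.

```python
def build_create_database(name: str,
                          charset: str = "utf8mb4",
                          collation: str = "utf8mb4_unicode_ci") -> str:
    """Build a CREATE DATABASE statement with a matching charset and collation.

    Raises ValueError when the collation does not belong to the charset,
    mirroring the mismatch MySQL itself rejects.
    """
    # MySQL collation names begin with their character set name.
    if not collation.startswith(charset + "_"):
        raise ValueError(
            f"collation {collation!r} does not belong to charset {charset!r}"
        )
    # Double any backticks so the identifier stays safely quoted.
    quoted = "`" + name.replace("`", "``") + "`"
    return f"CREATE DATABASE {quoted} CHARACTER SET {charset} COLLATE {collation};"


print(build_create_database("demo_db"))

try:
    build_create_database("demo_db", charset="latin1")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Keeping the character set and the collation in one place like this makes it harder to end up with the latin1 and utf8mb4 mix described above.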
Summary
The "utf8" character set in MySQL is broken: it is not able to support four bytes per character as UTF-8 is supposed to. "utf8mb4" can be used to solve this problem, and it's generally pretty easy to work with. Simply choose a specific collation: either the general one if you're using MySQL for a personal project or a small website, or the Unicode one if you're using it for something more technical or if you want to push MySQL to its limits.
Published at DZone with permission of Lukas Vileikis. See the original article here.