BMOW title
Floppy Emu banner

WordPress Latin1 and UTF-8, Part 2

Yesterday I wrote about some BMOW blog troubles displaying special characters and international characters, which was apparently triggered by a recent update to MySQL 8 at my web host. Old pages containing special characters like curly quotes, accented letters, or non-Latin characters were suddenly rendering as garbled combinations of random-looking symbols, whereas they were previously OK. If you read the follow-up comments, you saw that I was eventually able to resolve the problem (mostly) by adding these lines to my wp-config.php file:

define(‘DB_CHARSET’, ‘latin1’);
define(‘DB_COLLATE’, ”);

But I didn’t fully understand what those lines changed, or exactly why this problem appeared in the first place. After some digging in the MySQL database, I think I have a slightly better understanding now.

 
Back to Kristian Möller

I returned to the example of Kristian Möller, whose name contains the letter o with umlaut. After the MySQL update, the name was appearing incorrectly as Möller. This is what you’d expect to see if the UTF-8 bytes 0xC3 0xB6 for ö were incorrectly interpreted as two separate Latin1 bytes, 0xC3 for à and 0xB6 for ¶.

Using phpmyadmin, I was able to connect to the live WordPress DB, and examine the wp_comments table where this name is stored. The result for comment_id 233746 is shown above, displaying the author’s name as both text and as raw hex bytes. You can see the hex bytes contain the sequence C3B6, which is the correct UTF-8 byte sequence for ö. That’s great news. It means the contents of my database text are correct and uncorrupted UTF-8 bytes. But all is not well – the metadata associated with the table is wrong. It thinks the text is Latin1, and displays it as such in the myphpadmin UI. I was able to confirm this by executing the SQL command:

show create table wp_comments

This echoes back the SQL command that was originally used to create this table, way back in 2007. And lo and behold, part of that original SQL command specified CHARSET=latin1. Ever since then, WordPress has been storing and retrieving UTF-8 text into a Latin1 table in the database. This is bad practice, but it worked fine for 14 years until the MySQL update earlier this month.

 
Why Does DB_CHARSET latin1 Help?

Defining WordPress’ DB_CHARSET variable to be latin1 sounds like it’s telling WordPress what type of character set is used by the database. But if you think it through, that doesn’t fit the evidence here. If I tell WordPress that my DB data is in Latin1 format, even though as we’ve seen it’s really UTF-8, then I would expect WordPress to convert the data bytes from Latin1 to UTF-8 as it loads them during a page render. That would do exactly the wrong thing; it would cause the very problem that I’m trying to prevent.

I searched for a detailed explanation of precisely what the DB_CHARSET setting does, but couldn’t find one that made sense to me. Most references just say to change the value, without fully explaining what it does.

While I don’t have any strong evidence to support this, my guess is that a MySQL client has a choice of connecting to the MySQL database in Latin1 mode or UTF-8 mode, and this is what DB_CHARSET controls for WordPress. If the client connects as a UTF-8 client but the table is marked as being Latin1, my guess is MySQL automatically translates the data. Normally that would be a good thing, but if UTF-8 data were stored in a table improperly marked as being Latin1, it would cause unwanted and unnecessary character conversions, causing the types of problems I saw on the blog.

 
Why Did This Break Now?

So what changed during the recent MySQL update to suddenly break this? Why did the problem appear now? Initially I suspected the underlying data bytes had become corrupted during the update, but the hex display from phpmyadmin showed the data bytes are OK.

I can’t say for certain whether the problem was caused by exporting and reimporting my database, or whether it’s due to new behavior in MySQL 8. Now that I think about it, I’m not even certain whether the result I saw from that show create table wp_comments was actually the original SQL command from 2007, or the SQL command from eight days ago when the database was migrated to MySQL 8.

If these database tables were always explicitly marked Latin1, going all the way back to 2007, then I think this character set conversion problem would always have happened too. Or at least it would have happened as soon as I updated to my current version of WordPress, instead of when I updated to a new version of MySQL.

One possibility is that with the old database and old version of MySQL, the character set for the database tables wasn’t explicitly defined. It relied on some database-wide default which just happened to be UTF-8, so everything worked when WordPress connected to the DB as a UTF-8 client. Then during the MySQL 8 update, somehow the tables were explicitly set to Latin1 and the problem appeared.

Another possibility is that the tables were already explicitly Latin1, but WordPress was previously connecting to the database as a Latin1 client, so it worked OK. Since my version of WordPress hasn’t changed recently, this would mean the default database connection type for WordPress must somehow come from the database itself, or the database server, and that’s what changed during the MySQL 8 update.

Whatever the explanation, changing DB_CHARSET now seems like only a temporary solution. I still have UTF-8 data stored in tables that say they’re Latin1, which seems likely to cause more problems down the road. If nothing else, it makes the output display incorrectly in the phpmyadmin UI. A full solution will probably require some more significant database maintenance, but I hope to postpone that for a while.

Read 3 comments and join the conversation 

3 Comments so far

  1. Emil - October 16th, 2021 12:56 pm

    I think you’ve more or less understood it correctly. MySQL has the concept of a connection charset, and it’ll convert from the table charset to the connection charset when querying. This will of course break if the data in the fields aren’t in the specified encoding. It’s been a few years since I worked with MySQL on a recurring basis, but I typically set up my code to run “SET NAMES utf8” after establishing the connection, which sets the connection charset for utf8.

  2. Carl - October 16th, 2021 7:16 pm

    The relevant concept is that MySQL uses the character set setting when *searching* (eg to return an entry for café when searching for cafe) but for storage, there are no invalid Latin-1 bytes, so UTF-8 input gets stored as is, unchanged. It’s valid Latin-1, so there’s no way for the database to know to change it to something other than what comes in on the wire.

  3. Carl - October 16th, 2021 7:20 pm

    Incidentally, back in 2007, MySQL had terrible Unicode support, so it was common for many years after to use the Latin-1 setting while sending UTF-8 to work around the problems in MySQL. See https://mathiasbynens.be/notes/mysql-utf8mb4 for example.

Leave a reply. For customer support issues, please use the Customer Support link instead of writing comments.