I recently stumbled upon an encoding issue in WordPress: on a user’s site, the post_content column of the wp_posts table still used the utf8 charset instead of utf8mb4. As a result, comparing the saved content against the post content in $_POST failed whenever the post content contained an emoji.
Here’s how to check which charset a given column uses in WordPress:
global $wpdb;
// Let's assume you want to check `post_content` column of `wp_posts`
// $charset is typically `utf8` or `utf8mb4`; it can also be false or a WP_Error
$charset = $wpdb->get_col_charset( $wpdb->posts, 'post_content' );
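Since get_col_charset() can return false (the column has no character set) or a WP_Error instead of a string, it may be worth wrapping the check. Below is a minimal sketch, assuming it runs inside WordPress. The function name my_column_supports_4byte() is hypothetical and not part of core, and treating an unknown charset as "not supported" is a design assumption made here, not something WordPress prescribes.

```php
<?php
/**
 * Hypothetical helper (not a WordPress core function): reports whether a
 * column is known to store 4-byte UTF-8 characters such as emoji.
 */
function my_column_supports_4byte( $table, $column ) {
    global $wpdb;

    $charset = $wpdb->get_col_charset( $table, $column );

    // get_col_charset() may return false or a WP_Error instead of a string.
    // Assumption: when the charset is unknown, play it safe and answer "no".
    if ( false === $charset || is_wp_error( $charset ) ) {
        return false;
    }

    return 'utf8mb4' === $charset;
}
```

Calling my_column_supports_4byte( $wpdb->posts, 'post_content' ) before comparing saved content against $_POST tells you whether an emoji in the submitted content could have been altered on save.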
As for why the charset matters, here’s a quote comparing the utf8 and utf8mb4 charsets:
“The difference between utf8 and utf8mb4 is that the former can only store 3 byte characters, while the latter can store 4 byte characters. In Unicode terms, utf8 can only store characters in the Basic Multilingual Plane, while utf8mb4 can store any Unicode character. This greatly expands the language usability of WordPress, especially in countries that use Han character sets. Unicode isn’t without its problems, but it’s the best option available.”
Source: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/
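To make the 3-byte versus 4-byte distinction concrete, here is a small standalone sketch in plain PHP (no WordPress required, PHP 7 or later for the \u{...} escapes). The helper name my_has_4byte_chars() is hypothetical; it simply looks for code points at U+10000 and above, which is where UTF-8 starts needing 4 bytes.

```php
<?php
/**
 * Hypothetical helper: true if $text contains a character outside the
 * Basic Multilingual Plane (U+10000 and above), which takes 4 bytes in UTF-8.
 */
function my_has_4byte_chars( $text ) {
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    return 1 === preg_match( '/[\x{10000}-\x{10FFFF}]/u', $text );
}

// strlen() counts bytes, not characters.
echo strlen( "\u{00E9}" ), "\n";  // U+00E9 (e with acute accent): 2 bytes
echo strlen( "\u{6F22}" ), "\n";  // U+6F22 (a Han character): 3 bytes, fits in utf8
echo strlen( "\u{1F600}" ), "\n"; // U+1F600 (grinning face emoji): 4 bytes, needs utf8mb4

var_dump( my_has_4byte_chars( "Hello \u{6F22}" ) );  // bool(false)
var_dump( my_has_4byte_chars( "Hello \u{1F600}" ) ); // bool(true)
```

A check like this can flag submitted content that a utf8 column cannot store as-is, before any comparison against the saved value is attempted.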
There’s also a larger story behind why this charset update matters: The Trojan Emoji.