MySQL and character encoding

Character encoding when using Bitweaver with MySQL

Created by: Combat Wombat, Last modification: 08 Dec 2008 (22:08 UTC) by laetzer

Character encoding issues

If you are using MySQL, you might run into character encoding problems on your site or in your database: non-English characters might appear as question marks or look strange (ö, ß, ü, ä, or Ãœ, instead of characters like ä, Ü, ß, ç, or é).

Database connection setup

The reason for crunched special characters is most commonly seen as: encoding of the data is not the same all the way between website and database. The shortest and best solution, then, is to have an UTF-8 system end-to-end.

However, it is perfectly normal to have a database storing its data in Latin1, and displaying that data with correct UTF-8 characters on a webpage in a browser. If a database stores a character Latin1 encoded, but an application requests that character expecting UTF-8 encoding, the database sends that character UTF-8 encoded. The actual problems are bugs and wrong settings between database applications, database abstraction applications and web applications, or the lack of access to configuration files to actually apply the correct settings.

Unless told otherwise, Bitweaver is using your server's defaults. If your server is generally set up for UTF-8, especially your database and database tables, but the database connection is not, try to edit the script that AdoDB (a 3rd party application for database abstraction) uses to connect to MySQL databases:

yourbiteaver/util/adodb/drivers/adodb-mysql.inc.php


<?php
function _connect($argHostname$argUsername$argPassword$argDatabasename)
    {

    
// ... some stuff ...

        
if ($this->_connectionID === false) return false;

    
// edit adodb-mysql.inc.php around line 366 or 373
    // (depends on the version of that file)
    // function "_connect"
    //----------------------------------------------
    // THE FOLLOWING LINE IS THE ONE TO ADD:
        
if (mysql_query("SET NAMES 'UTF8'") === false) return false;
    
//-----------------------------------------------
        
if ($argDatabasename) return $this->SelectDB($argDatabasename);
        return 
true;
    }
?>


Database admin tools

MySQL usually determines character encoding via the environment (the web server's software). Only if "SET NAMES utf8" is set, the PHP application uses UTF-8 encoding to connect to the database. However, if UTF-8 is not set, but used regardlessly (and if there are no bugs interfering, depending on versions and setup), there still shouldn't be problems, as long as PhpMyAdmin (or similiar) is not used … because all valid UTF-8 byte sequences are also valid Latin1 characters.

When you install Bitweaver with a standard MySQL server, the database connections will all be Latin1. Bitweaver writes UTF-8 characters to that connection. MySQL will store them to the tables correctly. Now, when requesting them, they are reported to be Latin1 characters, but your browser, displaying the pages served by Bitweaver, interprets them as UTF-8 just fine. The problem is, PhpMyAdmin sets the connection to UTF-8, which makes MySQL try to convert the UTF-8 characters, thinking they are Latin1, to UTF-8, which results in weird characters. That means, you can never use any admin tool in UTF-8 mode to write to your Bitweaver database tables, but Bitweaver works fine if you just ignore it. It's just not the proper way to go about, setting the database connection to Latin1, but writing UTF-8 characters, and confusing when using PhpMyAdmin (or similiar).

That means, if you input data through a Bitweaver form like wiki page, and special characters seem to be written to, as well as read from the database correctly, everything is fine. Eventhough, when looking at the database with a tool like PhpMyAdmin, special characters seem wrongly decoded. In that case, PhpMyAdmin (or similiar tools) should not be used to transfer databases between systems, or input data directly through the input forms of these tools. If that is done anyway, the special characters will get messed up in the process.

Database and server setup

Databases can be set to UTF-8 upon creation, as well as database tables. They can also be changed afterwards, even with tools like PhpMyAdmin. If the tables contain already user data, though, there will be yet again correct ASCII characters but wrong non-ASCII characters.

If your whole server OS, server software, or database software is not set up for UTF-8 yet, it's a good idea to do that. Consider some of the following steps. Bear in mind, that this is web server configuration info. There are many extensive tutorials about settings like these online. This can only be a starting point.

my.cnf (MySQL config file)


[mysqld]
# --------------------------------------------
# add the lines below to enforce utf8 encoding
# then restart the MySQL engine
# --------------------------------------------
collation_server=utf8_general_ci
character_set_client=utf8
character_set_server=utf8
skip-character-set-client-handshake


.htaccess (bitweaver root)


AddDefaultCharset UTF-8


php.ini


default_charset = "UTF-8"


Also: Your database may not be UTF-8 in it's collation settings. Consider dumping the SQL to a file, and use an UTF-8-compliant text editor to chang the CHARSET=Latin1 to CHARSET=utf8). Create a new database, with Collation set to utf8_general_ci, and SET NAMES 'UTF8'. Using the text editor, find/replace every bad character with the correct character. This can take hours. There are also scripts available to do that for you. Search for convert + charset etc.

Then upload the cleaned database to the new, empty database. Adjust Bitweaver's /kernel/config_inc.php file to reflect any change in database name.

To see how your database is set up, run SHOW VARIABLES; at the MySQL command prompt on your server (or find the link in PhpMyAdmin).

More information

work for you