Character Encodings

From Dreamwidth Notes

Revision as of 17:22, 26 May 2010

Dreamwidth was forked from LiveJournal, which initially had no UTF-8 support and even now must still support entries that are not necessarily UTF-8. As a result, the way the code handles character encodings is less than optimal.

The Problem

Ideally, Dreamwidth should only work with entries in UTF-8. However, as of this writing, it is still possible to use an old client to create entries in the DW database that have the 'unknown8bit' entryprop set and that are not UTF-8. This is possible if:

  • The client uses the flat protocol, AND
  • The client sends its version as 0.

If both of these conditions are true, Dreamwidth currently treats the entry as a stream of bytes, which is inserted verbatim into the database, and sets the 'unknown8bit' entryprop on the entry to signal that its encoding is unknown. Normally, entries marked 'unknown8bit' would be displayed according to the "old character encoding" setting, but this setting was removed from Dreamwidth. As a result, bytes above 0x7F in these entries are displayed as "?", and the entry is uneditable.
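
The display symptom described above can be illustrated with a short sketch. This is not Dreamwidth's rendering code (the codebase is Perl); Python is used purely for illustration, the function name `display_unknown8bit` is hypothetical, and the sketch assumes that each byte above 0x7F produces exactly one "?".

```python
def display_unknown8bit(raw: bytes) -> str:
    """Render bytes of unknown encoding, showing every byte above 0x7F as '?'.

    Hypothetical helper: assumes one '?' per non-ASCII byte.
    """
    return "".join(chr(b) if b <= 0x7F else "?" for b in raw)


# "café" in Windows-1252: the é is the single byte 0xE9.
print(display_unknown8bit(b"caf\xe9"))               # caf?

# "café" in UTF-8: the é is two bytes (0xC3 0xA9), so two '?' appear.
print(display_unknown8bit("café".encode("utf-8")))   # caf??
```

Either way the original character is lost to the reader, even though the bytes in the database are intact.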

(It may also be possible to insert non-UTF8 characters via the XML-RPC interface, but I haven't tested this; all XML parsers and constructors are required to understand UTF-8 in any case. The 'unknown8bit' entryprop will probably still be set if ver=0, though, so the symptoms above will still apply.)

The Proposed Solution

As such, a plan is needed to convert DW to using UTF-8 entirely. We have an advantage over LiveJournal in this regard, as most clients nowadays send entries in UTF-8 and with ver=1 in any case. In addition, our database is smaller, making it easier to do the necessary conversions. (We need to convert the database because we cannot remove all of the legacy non-UTF8-handling code before this is done.)

There are two projects that I ([info]sophie) propose to handle this.

  1. The complete conversion of the entries tables (log2 and logtext2) to a known UTF-8 format, and the removal of legacy non-UTF8-handling code. This is already partially underway (see bug 443), but there are additional concerns, too. This project must be completed before the second project can be considered.
  2. The conversion of the codebase to utilise the innate knowledge in Perl v5.8 and later of character encodings (including UTF8-aware scalars), and a change to the schema of the MySQL tables to allow MySQL to be aware that UTF-8 is being used. This is a *big* project, but will greatly simplify the encoding aspects of the codebase and will make things much easier.

Both projects are detailed below.

Project 1: The complete conversion to UTF-8

This project will completely convert the Dreamwidth database to UTF-8, and remove the legacy code to handle non-UTF8 entries, including the removal of the 'unknown8bit' entryprop.

The proposed plan for this project is as follows:

Step 1: Disallow entries from clients that send ver=0

Currently, it is still possible for clients to send entries using version 0 of the protocol, which does not enforce UTF-8. (Note that clients which do not send a version number are also assumed to be ver=0.) Support for ver=0 is therefore the first thing that needs to be removed: once it is gone, no further entries can be created with the 'unknown8bit' entryprop set.
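
A minimal sketch of such a check follows, in Python for illustration only. The request is modelled as a plain mapping of protocol fields; `require_utf8_protocol` and `UnsupportedProtocolVersion` are hypothetical names, and treating a non-numeric version as 0 is an assumption made here rather than documented behaviour.

```python
from typing import Mapping


class UnsupportedProtocolVersion(Exception):
    """Raised when a client does not declare a UTF-8-enforcing protocol version."""


def require_utf8_protocol(request: Mapping[str, str]) -> int:
    """Return the declared protocol version, rejecting ver=0 and missing versions."""
    # A missing 'ver' field is treated as version 0, as described above.
    raw = request.get("ver", "0")
    try:
        ver = int(raw)
    except ValueError:
        ver = 0  # assumption: an unparseable version counts as ver=0
    if ver < 1:
        raise UnsupportedProtocolVersion("clients must send ver=1 or later")
    return ver
```

Rejecting the request outright, rather than silently upgrading it, means an old client gets a visible error instead of a corrupted entry.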

This step will need approval from Mark and Denise, as it will stop some old clients from being able to post, and as such is a removal of functionality. However, as most clients these days are UTF8-aware, the impact is probably going to be minimal.

Step 2: Convert any 'unknown8bit' entries to known UTF-8

This step should be implemented in the database update scripts, as it also needs to take into account sites that are upgrading their version of DW, or even switching from LJ to DW. It should, for each cluster:

  1. Find the entries on the cluster which have the 'unknown8bit' entryprop set, and group them by user.
  2. Then, for each user identified by the above step:
    1. Find the value of the 'old character encoding' setting of that user. (For details on what to do if this value is missing, see below.)
    2. Then, for each entry by that user identified in the first step as having the 'unknown8bit' entryprop set:
      1. Convert the entry from the character set identified in the above step to UTF-8. (Any errors in conversion in this stage should result in the permanent replacement of the offending character(s) with question marks - or, alternatively (and perhaps preferably), the UTF-8 replacement character, U+FFFD - see below for details.)
      2. Unset the 'unknown8bit' entryprop.
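
The procedure above can be sketched as follows. This is an in-memory illustration in Python, not the actual database update script: the `Entry` class, `convert_cluster`, and the `old_encodings` mapping are all hypothetical stand-ins for the real tables and user settings, and users with no known encoding are simply returned to the caller because that case is a separate decision.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Entry:
    user: str
    body: bytes                                  # raw bytes as stored
    props: set = field(default_factory=set)      # e.g. {"unknown8bit"}


def convert_cluster(entries, old_encodings):
    """Convert 'unknown8bit' entries to UTF-8; return users with no known encoding.

    `old_encodings` maps user -> codec name (hypothetical stand-in for the
    'old character encoding' setting).
    """
    by_user = defaultdict(list)
    for entry in entries:                        # step 1: find and group by user
        if "unknown8bit" in entry.props:
            by_user[entry.user].append(entry)

    undecided = []
    for user, user_entries in by_user.items():   # step 2: per user
        encoding = old_encodings.get(user)       # step 2.1: look up the setting
        if encoding is None:
            undecided.append(user)               # left untouched for now
            continue
        for entry in user_entries:               # step 2.2: per entry
            # Undecodable bytes become U+FFFD instead of aborting the run.
            text = entry.body.decode(encoding, errors="replace")
            entry.body = text.encode("utf-8")
            entry.props.discard("unknown8bit")
    return undecided
```

Grouping by user first means the encoding setting is looked up once per user rather than once per entry.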

If the 'old character encoding' setting is missing (quite likely on DW-based sites, as it is no longer possible to set it), there are a few options that could be used, which we should decide on:

  • We could just assume a character set like Windows-1252 or UTF-8. This has the disadvantage that entries which are actually in a different encoding will not be converted correctly.
  • We could do a limited form of character set detection; for example, we could test to see if the post is parseable as UTF-8, and if not, assume Windows-1252. This is probably the best option, as some clients are known to be able to send UTF-8 even though they also send ver=0 or no version (for example, jlj, http://umlautllama.com/projects/perl/#jlj), and this will ensure that these posts are still readable. (It is theoretically possible that a valid non-UTF8 post *might* be wrongly interpreted as UTF-8, but this is highly unlikely due to the way UTF-8 sequences are constructed.)
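
The second option can be sketched in a few lines. This is an illustration in Python rather than code from the Dreamwidth codebase; `decode_legacy` is a hypothetical name, and `cp1252` is Python's codec name for Windows-1252.

```python
def decode_legacy(raw: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, otherwise as Windows-1252."""
    try:
        return raw.decode("utf-8")   # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        # Bytes that Windows-1252 leaves undefined become U+FFFD.
        return raw.decode("cp1252", errors="replace")


# Valid UTF-8 is taken at face value.
print(decode_legacy("naïve".encode("utf-8")))   # naïve

# 0xEF would start a three-byte UTF-8 sequence, but 'v' is not a
# continuation byte, so the UTF-8 attempt fails and Windows-1252 is used.
print(decode_legacy(b"na\xefve"))               # naïve

# The theoretical false positive: a Windows-1252 post containing "Ã©"
# is byte-for-byte valid UTF-8 and is read as "é".
print(decode_legacy(b"\xc3\xa9"))               # é
```

The false positive requires a high byte to be followed by exactly the right continuation bytes, which is why it is rare in real text.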

Note that Windows-1252 was chosen as the fallback encoding above because it is effectively a superset of ISO-8859-1: every printable ISO-8859-1 character has the same byte value and the same meaning in Windows-1252, so an ISO-8859-1 entry converts correctly when treated as Windows-1252. In addition, Windows-1252 uses the range 0x80-0x9F, which ISO-8859-1 reserves for control codes, for characters such as the Euro sign and smart quotes, and it is the default legacy code page on Western-language versions of Windows. (Later versions of Windows use Unicode internally, in the form of UTF-16 rather than UTF-8.)
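
The relationship between the two encodings can be checked directly. The sketch below uses Python's `latin-1` and `cp1252` codecs as stand-ins for ISO-8859-1 and Windows-1252; the variable and function names are illustrative only.

```python
# Printable ISO-8859-1 characters (0xA0-0xFF) decode identically as Windows-1252.
high = bytes(range(0xA0, 0x100))
same_high_range = high.decode("latin-1") == high.decode("cp1252")

# 0x80-0x9F: control codes in ISO-8859-1, mostly printable in Windows-1252.
euro = b"\x80".decode("cp1252")             # Euro sign, U+20AC
quoted = b"\x93hi\x94".decode("cp1252")     # "hi" in curly double quotes


def undefined_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            bad.append(value)
    return bad
```

In Python's codec, five byte values (0x81, 0x8D, 0x8F, 0x90 and 0x9D) are left undefined, so even a Windows-1252 fallback can hit bytes it cannot convert.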

In the case of an encoding error, we should not attempt to keep the original bytes as this will result in invalid UTF-8, which the user would not be able to edit. Instead, we should replace the offending characters with either question marks or U+FFFD replacement characters. (Preferably the latter, as this is their purpose, and it makes it easy for users to identify and correct encoding errors; more information can be found at Wikipedia.)
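
The difference between keeping the original bytes and replacing them can be shown with a small sketch. It is written in Python for illustration; `convert_with_marker` and `is_valid_utf8` are hypothetical helpers, not existing Dreamwidth functions.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_with_marker(raw: bytes, encoding: str, marker: str = "\ufffd") -> bytes:
    """Convert to UTF-8, substituting `marker` for anything undecodable."""
    text = raw.decode(encoding, errors="replace")   # bad bytes become U+FFFD
    return text.replace("\ufffd", marker).encode("utf-8")


raw = b"5\x81"   # 0x81 has no meaning in Windows-1252

kept_verbatim_ok = is_valid_utf8(raw)                    # False: not editable
with_fffd = convert_with_marker(raw, "cp1252")           # b"5\xef\xbf\xbd"
with_question = convert_with_marker(raw, "cp1252", "?")  # b"5?"
```

Both replacements yield valid UTF-8, but only U+FFFD is distinguishable from a question mark the user actually typed.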

Step 3: Remove the 'unknown8bit' entryprop and the 'old character encoding' setting

At this point, the database should have no remaining entries that have the 'unknown8bit' entryprop set, and it would no longer be possible to create them. As such, it is now possible to remove this entryprop completely, along with the 'old character encoding' setting.

This step would have two parts:

  • The removal of the settings themselves from the database. Like step 2, this would be performed by the database upgrade script and would apply to old DW versions upgrading to the newest version, and to sites that convert from LJ to DW.
  • The removal of the code behind these settings, which would now be cruft.

There should be no impact from this step at all, since by now there would be no entries that it *could* impact.

Conclusion

This should complete the conversion to UTF-8 and pave the way for the second project, should it be desired.

Project 2: Conversion of codebase to UTF8-aware scalars

More info coming soon...