Thursday 7 April 2011

Unicode revisited

Unicode, UTF8, UTF16, UTF32, UCS2, locales and code pages are areas that I looked at 15 years ago.  It has been worth a revisit.
As a reminder, old computers used the 7-bit ASCII character set, which was sufficient to store the Roman alphabet (both upper and lower case), numbers, some control characters and a bit of punctuation.  This left every other language out in the cold.  Most countries then used the 8th bit in the byte to add their locale-specific characters, e.g. the £ symbol in the UK.  However, these extra characters meant different things in different countries, so the operating system had to distinguish between the variants (as "code pages") to display them properly.  To avoid this problem Unicode was invented: a single character set containing all the characters and diacritics that people wanted, understood the same way in all countries.
Unicode comes in several encoding formats: UCS2, UTF8, UTF16 and UTF32.
Back in the mid 90's UCS2 was the standard to go by.  Being a fixed 16-bit character set it was considered sufficient to carry all the characters the world needed, and it is what 32-bit Windows NT was designed to use in the early 90's, a legacy still present in the latest Windows 7.  UCS2 is easy and fast, and its memory use is directly proportional to the number of characters, which makes coding for it simple.  However, it is now becoming clear that this is not sufficient.  UCS2 can only hold up to 65,536 different characters, and more are being required all the time, so either more memory per character or an overflow mechanism is needed.  This is where the UTF implementations are now showing their strength.
The UTF standards use a variable-width principle.  In UTF8 a character is made of one or more bytes: the high bits of the first byte say how many bytes follow, and every continuation byte carries a marker so that it cannot be mistaken for the start of a character.  UTF16 works similarly, but at the level of 16-bit units rather than bytes: characters beyond the first 65,536 are stored as a pair of 16-bit "surrogate" values instead of one.  UTF32 is fixed width, with every character held in a single 32-bit value, so it needs no overflow mechanism at all.
My nose wrinkled when I first looked at this, as I did not want the extra complexity that UTF8 brings; I considered UCS2 the way to go for my future development.  UCS2 is easy: you know exactly how much memory you need for a given character count, so you do not need to pre-calculate the length of a buffer before you use it.  However, unless you are converting between 16-bit and multi-byte strings or working with fixed-length data formats, e.g. flat files or ODBC, this worry is rarely a consideration.  Secondly, UTF8 in practice is more efficient with space.  Even in Chinese text, the spaces, numbers and occasional western words common in contemporary writing use only one byte each, so on average it uses less space than the equivalent 16-bit Unicode formats.
So what is important about moving from UCS2 to UTF8?  Microsoft does not seem to care: internally its operating systems use UCS2 (an architecture decision made in the 90's, since extended to UTF16 with surrogate support), so why would it?  The problem for Microsoft is that its systems cannot easily be changed to some other encoding standard; the change would be traumatic and costly, so it is choosing not to make a big deal of it.  The issue, though, is that the web extensively uses UTF8, and with the arrival of new operating systems, e.g. Android, that use it directly, Microsoft will increasingly look out of step.  This is especially the case in China, where only a subset of the Chinese character set fits in UCS2.
If you are writing portable internationalised applications, then UCS2 is probably not the way to go forward.
Links:
http://site.icu-project.org/
http://utfcpp.sourceforge.net/
