
  The Winapi (aka Win32) is the only commonly used API that 
  regularly requires something other than UTF-8
Sadly, that's not really correct. Core Foundation and Cocoa on Mac OS X use UTF-16. Qt uses UTF-16. Java uses UTF-16. JavaScript uses UTF-16. Many of these APIs have easy methods for converting to and from other encodings, and offer a certain amount of abstraction over the underlying encoding, but it still shows through in that string length and character indexing work via UTF-16 code units, not Unicode code points.

  the Mac class NSString or the C# String class use UTF-16 
  internally, you don't normally need to care what they do 
  internally since any time you access the internal 
  characters, you specify the desired encoding.
While it's true that they do abstract over the encoding somewhat, and they offer conversions, the abstraction is fairly leaky, as string length and indexing all happen in terms of UTF-16 code units.
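To make the leak concrete, here is a minimal Python sketch. Python's own `str` counts code points, so the UTF-16 count is computed by encoding; the helper name `utf16_code_units` is made up for illustration, and it only approximates what a UTF-16-based length API would report.

```python
def utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" emits no BOM.
    return len(s.encode("utf-16-le")) // 2

text = "a\U0001F600"   # 'a' plus U+1F600, a code point outside the BMP

code_points = len(text)                  # Python str counts code points
code_units = utf16_code_units(text)      # what UTF-16-based length APIs count
utf8_bytes = len(text.encode("utf-8"))   # size of the same text in UTF-8

print(code_points, code_units, utf8_bytes)  # 2 3 5
```

The same two-character string has three different "lengths" depending on which unit you count, which is exactly where the abstraction shows through.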

I'm a huge fan of UTF-8, and I agree that new APIs should generally favor it, but there is so much legacy code with UTF-16 assumptions baked in that you can't really say that only the Winapi uses UTF-16.



There's a reason a ton of legacy code is UCS-2^/UTF-16 and not UTF-8.

The code is older than UTF-8.

Windows NT was first released in '93; UTF-8 didn't exist (much less have wide adoption) for most of its development. Likewise, NeXTSTEP had an initial release in '89. Java and JavaScript (both '95) could have adopted UTF-8, but they're almost certainly running on an OS that expects UTF-16 soooo... yeah.

^UCS-2 was superseded by UTF-16, so you'll find them used interchangeably a lot.


It doesn't matter when the projects started; it matters when they added Unicode support. Windows NT didn't support Unicode until 4.0, released in 1996. NeXTSTEP may have been released in '89, but Unicode itself wasn't finished until '91, so it couldn't possibly have supported Unicode upon release. The original releases of NeXTSTEP just used C strings; it wasn't until OpenStep in 1994 that they introduced NSString based on UCS-2.

UTF-8 was publicly released in January 1993. So by the time these projects became Unicode enabled, UTF-8 had already existed for at least a year.

Java and JavaScript had no underlying platform constraints to choose UCS-2/UTF-16, since the underlying platforms didn't support Unicode during their development.

Qt 2.0 was the first release of Qt to introduce Unicode support in QString, and it was released in 1999.

No, the real problem was just the fundamental design mistake that the Unicode Consortium made when first developing Unicode. They thought that 16 bits would be enough to fit all of the world's actively used writing systems, and that the simplest way to support an extended character set would be to just switch the underlying character type from 8-bit integers to 16-bit integers. This was a mistake in many ways. 16 bits is not sufficient, especially when CJK is taken into account, so they had to do a lot of unification that wasn't really appropriate and that led to a lot of resistance to Unicode from CJK users. Changing to 16-bit integers for the fundamental character type meant that every API had to be duplicated to provide a wide character version. Some APIs already had wide character support for legacy wide character sets, but differences in existing wide character support between NT (which used 16-bit wide characters) and many Unices (which used 32-bit wide characters) meant that writing portable code was quite difficult. Using 16-bit integers for an internal representation means that there's a native endianness, but once you need to interchange data, endianness becomes a big issue. And so on.
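Two of those problems, endianness and the 16-bit limit, can be sketched in a few lines of Python using only the standard codecs. The sample characters are arbitrary picks for illustration.

```python
euro = "\u20ac"  # EURO SIGN, U+20AC, fits in one 16-bit unit

le = euro.encode("utf-16-le")   # b'\xac\x20' (low byte first)
be = euro.encode("utf-16-be")   # b'\x20\xac' (high byte first)

# Reading little-endian bytes as big-endian silently yields a different
# character (U+AC20) rather than an error.
misread = le.decode("utf-16-be")

# A code point above U+FFFF does not fit in 16 bits, so UTF-16 has to
# spend two code units (a surrogate pair) on it.
cjk_ext_b = "\U00020000"        # first code point of CJK Extension B
pair = cjk_ext_b.encode("utf-16-be")   # b'\xd8\x40\xdc\x00'

print(le, be, hex(ord(misread)), pair.hex())
```

UTF-8 has a single byte order, so the first half of this example has no equivalent there.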

UTF-8 was the solution to many of these problems, and it was introduced before Unicode support had become widespread, but the idea that 16 bit types should be used for Unicode had already permeated people's consciousness and likely early development efforts. It's too bad that more people didn't learn from Plan 9's experience switching to UTF-8, which happened all the way back in 1992 (they switched Plan 9 to UTF-8 before publicly announcing it, which acted as a very good proof of concept).
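As a rough illustration of why UTF-8 avoided the duplicated-API problem, this Python sketch checks two properties: ASCII text is byte-for-byte unchanged, and non-NUL characters never produce zero bytes, so byte-oriented, NUL-terminated string code keeps working. The sample strings are assumptions chosen for the demo.

```python
ascii_text = "hello"
mixed_text = "na\u00efve \u65e5\u672c\u8a9e"  # Latin-1 and CJK characters mixed in

# ASCII text encodes to the same bytes under ASCII and UTF-8.
ascii_unchanged = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# UTF-8 never emits a zero byte for a non-NUL character...
utf8_has_nul = 0 in mixed_text.encode("utf-8")

# ...whereas UTF-16 puts a zero byte in every ASCII character.
utf16_has_nul = 0 in ascii_text.encode("utf-16-le")

print(ascii_unchanged, utf8_has_nul, utf16_has_nul)  # True False True
```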



