The Winapi (aka Win32) is the only commonly used API that regularly requires som...

kmontrose · on Jan 2, 2014

There's a reason a ton of legacy code is UCS-2^/UTF-16 and not UTF-8.

The code is older than UTF-8.

Windows NT was first released in '93; UTF-8 didn't exist (much less have wide adoption) for most of its development. Likewise, NeXTSTEP had an initial release in '89. Java and JavaScript (both '95) could have adopted UTF-8, but they're almost certainly running on an OS that expects UTF-16 soooo... yeah.

^UCS-2 was superseded by UTF-16, so you'll find the two names used interchangeably a lot.

lambda · on Jan 2, 2014

It doesn't matter when the projects started; it matters when they added Unicode support. Windows NT didn't support Unicode until 4.0, released in 1996. NeXTSTEP may have been released in '89, but Unicode itself wasn't finished until '91, so it couldn't possibly have supported Unicode upon release. The original releases of NeXTSTEP just used C strings; it wasn't until OpenStep in 1994 that they introduced NSString based on UCS-2.

UTF-8 was publicly released in January 1993. So by the time these projects became Unicode-enabled, UTF-8 had already existed for at least a year.

Java and JavaScript had no underlying platform constraints to choose UCS-2/UTF-16, since the underlying platforms didn't support Unicode during their development.

Qt 2.0 was the first release of Qt to introduce Unicode support in QString, and it was released in 1999.

No, the real problem was just the fundamental design mistake that the Unicode Consortium made when first developing Unicode. They thought that 16 bits would be enough to fit all of the world's actively used writing systems, and that the simplest way to support an extended character set would be to just switch the underlying character type from 8-bit integers to 16-bit integers. This was a mistake in many ways. 16 bits is not sufficient, especially when CJK is taken into account, and so they had to do a lot of unification that wasn't really appropriate and led to a lot of resistance to using Unicode from CJK users. Changing to 16-bit integers for the fundamental character type meant that every API had to be duplicated to provide a wide character version. Some APIs already had wide character support for legacy wide character sets, but differences in existing wide character support between NT (which used 16-bit wide characters) and many Unices (which used 32-bit wide characters) meant that writing portable code was quite difficult. Using 16-bit integers for an internal representation means that there's a native endianness, but once you need to interchange data, endianness becomes a big issue. And so on.
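To make the "16 bits is not sufficient" and endianness points concrete, here is a small Python sketch. The helper name `utf16_units` is made up for illustration, and U+20000 (a CJK Extension B ideograph) is just one example of a character outside the original 16-bit range. It shows that such a character needs two 16-bit code units (a surrogate pair) in UTF-16, and that the same code units serialize to different bytes depending on byte order.

```python
import struct

def utf16_units(text, byteorder):
    """Hypothetical helper: encode text as UTF-16 in the given byte order
    and return (raw bytes, list of 16-bit code units)."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    data = text.encode(codec)
    prefix = "<" if byteorder == "little" else ">"
    units = list(struct.unpack(prefix + "H" * (len(data) // 2), data))
    return data, units

cjk = "\U00020000"  # U+20000, outside the 16-bit range

le_bytes, le_units = utf16_units(cjk, "little")
be_bytes, be_units = utf16_units(cjk, "big")

# One character, but two 16-bit code units (a surrogate pair).
print([hex(u) for u in le_units])  # ['0xd840', '0xdc00']

# Same code units, different bytes on the wire.
print(le_bytes.hex())  # 40d800dc
print(be_bytes.hex())  # d840dc00
```

Code that assumes one 16-bit unit per character (the UCS-2 assumption) would count this as two characters, and a reader that guesses the wrong byte order would decode different code units entirely.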

UTF-8 was the solution to many of these problems, and it was introduced before Unicode support had become widespread, but the idea that 16-bit types should be used for Unicode had already permeated people's consciousness and likely early development efforts. It's too bad that more people didn't learn from Plan 9's experience switching to UTF-8, which happened all the way back in 1992 (they switched Plan 9 to UTF-8 before publicly announcing it, which acted as a very good proof of concept).
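As a rough illustration of why UTF-8 sidesteps these problems, the Python sketch below (the variable names are mine, and U+20000 is again just an example character) checks three properties: ASCII text is byte-for-byte unchanged, there is only one byte order, and non-NUL text never contains a zero byte, so existing 8-bit C string APIs keep working without a duplicated wide-character version.

```python
ascii_text = "A"
cjk = "\U00020000"  # U+20000, an example character outside the 16-bit range

# ASCII stays exactly as it was in an 8-bit world.
utf8_ascii = ascii_text.encode("utf-8")
print(utf8_ascii)  # b'A'

# In UTF-16 the same letter picks up a zero byte, which would terminate
# a C string early if passed through a char* API.
utf16_ascii = ascii_text.encode("utf-16-le")
print(utf16_ascii)  # b'A\x00'

# A character beyond 16 bits is just a longer byte sequence in UTF-8.
# The sequence is defined byte by byte, so there is no byte-order variant.
utf8_cjk = cjk.encode("utf-8")
print(utf8_cjk.hex())  # f0a08080

# No zero bytes appear unless the text itself contains U+0000.
has_nul = 0 in (ascii_text + cjk).encode("utf-8")
print(has_nul)  # False
```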