
The guarantee exists to speed up UTF-8 processing: code can safely assume it is working with whole code points/sequences (without an extra out-of-bounds check for every byte). It also ensures you can always losslessly round-trip every string to and from other Unicode encodings without introducing any special notion of a broken character. There's a security angle too: text-processing algorithms may use different strategies for recovering from broken UTF-8, and that mismatch could be exploited to fool parsers (e.g. if only the first 3 bytes of a 4-byte UTF-8 sequence are valid, do you advance by 3 or 4 bytes?).
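To make the "3 or 4 bytes" question concrete, here is a rough sketch in Rust (my assumption for the language; the comment doesn't name one). `declared_len`, `recover_skip_declared` and `recover_skip_invalid` are hypothetical helpers written for illustration, not anyone's actual parser. Only `std::str::from_utf8` and `String::from_utf8_lossy` are real standard-library calls.

```rust
/// Hypothetical helper: the sequence length a lead byte claims to start.
fn declared_len(lead: u8) -> usize {
    match lead {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => 1, // stray continuation byte or invalid lead byte
    }
}

/// Strategy A (hypothetical): trust the lead byte and, on error,
/// skip the full length it declared.
fn recover_skip_declared(bytes: &[u8]) -> String {
    let mut out = String::new();
    let mut i = 0;
    while i < bytes.len() {
        let end = (i + declared_len(bytes[i])).min(bytes.len());
        match std::str::from_utf8(&bytes[i..end]) {
            Ok(s) => out.push_str(s),
            Err(_) => out.push('\u{FFFD}'), // replace the whole chunk
        }
        i = end;
    }
    out
}

/// Strategy B: skip only the bytes that are actually invalid.
/// This is what the standard library's lossy conversion does.
fn recover_skip_invalid(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Feed both the bytes `[0xF0, 0x9F, 0x98, b'"', b'x']` (a 4-byte lead, two continuation bytes, then a quote). Strategy A swallows the quote along with the broken sequence, while strategy B keeps it. If a sanitizer uses one strategy and the consumer uses the other, they disagree about where a quoted string ends, which is the kind of mismatch the comment is warning about.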

Making the "valid UTF-8" state part of the type system means it needs to be checked only once, when the instance is created (which can be at compile time for constants), and doesn't have to be re-checked later, even if the string is mutated. Unlike with a generic bag of bytes, the public interface of the string type won't let you make it invalid UTF-8.
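A small sketch of the "check once at the boundary" idea, again assuming Rust's `String` is the type being discussed. `parse_name` and `shout` are made-up names for illustration; `String::from_utf8`, `insert_str` and `push` are real standard-library methods.

```rust
/// Validated by the compiler: a string literal that is not valid UTF-8
/// does not compile.
const GREETING: &str = "héllo";

/// The single place where raw bytes are checked. After this returns `Ok`,
/// the type itself carries the "valid UTF-8" guarantee.
fn parse_name(raw: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(raw)
}

/// Takes an already-validated string and mutates it. No re-validation is
/// needed, because the safe mutation methods only accept `char` or `&str`,
/// which are themselves guaranteed to be valid.
fn shout(mut name: String) -> String {
    name.insert_str(0, "¡");
    name.push('!');
    name
}
```

Calling `parse_name(vec![0xF0, 0x9F, 0x98])` returns an `Err`, so a truncated sequence never makes it into a `String`. Anything that does get through can be passed to `shout` and similar functions without another check. The escape hatches that could break the invariant (such as `String::from_utf8_unchecked`) are marked `unsafe`.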


