
Okay, so some googling told me that the "xn--" prefix means the rest of the hostname is Unicode, but why does é become -fsa in www.xn--googl-fsa.com?

Google failed on the second part.



Because that's the Punycode representation:

https://en.wikipedia.org/wiki/Punycode

https://www.punycoder.com/


I wasn't aware of this, I'd seen those URLs before but only in the context of Chinese ones and thought it was Chinese-specific.

It's interesting because I just went down an apparent rabbit hole implementing byte-level encoding for using language models with Unicode. There, each byte of a UTF-8 encoded character is mapped to a printable character, and the bytes that aren't printable on their own go to code points somewhere in 255 < ord(x) < 511 (I don't remember the highest, but the point is that each byte is mapped to another printable Unicode character).

See https://github.com/openai/gpt-2/blob/9b63575ef42771a015060c9...

And the actual list of characters:

https://github.com/rbitr/llm.f90/blob/dev/phi2/phi2/pretoken...
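
A minimal sketch of that byte-to-character table, modeled on the bytes_to_unicode function in the GPT-2 repo linked above. The byte ranges that are kept as-is are how I remember that function, so treat them as an assumption and check the source before relying on them:

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own code point
    # (ranges assumed from GPT-2's encoder: '!'..'~', 0xA1..0xAC, 0xAE..0xFF).
    keep = (
        list(range(0x21, 0x7E + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in keep}
    # Every remaining byte is shifted to 256, 257, ... in byte order.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


table = bytes_to_unicode()
# Encode " é" as UTF-8 bytes, then render each byte as its printable stand-in.
encoded = "".join(table[b] for b in " \u00e9".encode("utf-8"))
highest = max(ord(c) for c in table.values())
```

With these ranges 188 bytes keep their own code point and the other 68 land on 256 to 323, so the highest code point used is well below 511.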


To expand on the sibling comments: This encoding (called Punycode) works by combining the character to encode (é) and the position where it should be inserted into a single number. é is code point 233, but Punycode counts non-ASCII code points from 128, so it contributes 233 - 128 = 105. The ASCII part "googl" has 5 characters, so there are 6 possible insertion positions, and é goes in position 5 (0-indexed), so that single number is 105 * 6 + 5 = 635. This is then encoded via a fairly complex variable-length encoding scheme into the letters "fsa".

See https://en.wikipedia.org/wiki/Punycode#Encoding_the_non-ASCI...
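
A rough sketch to check that arithmetic, assuming the RFC 3492 parameters (base 36, tmin 1, tmax 26, initial bias 72, initial n 128). The helper names first_delta and encode_first_delta are made up for illustration, and the sketch only handles a label with a single non-ASCII character, so the bias adaptation that follows each encoded delta is left out:

```python
BASE, TMIN, TMAX, INITIAL_BIAS, INITIAL_N = 36, 1, 26, 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def first_delta(label):
    """Combine code point and insertion position into one number."""
    basic = [c for c in label if ord(c) < INITIAL_N]
    extra = [c for c in label if ord(c) >= INITIAL_N]
    assert len(extra) == 1, "sketch handles a single non-ASCII character"
    offset = ord(extra[0]) - INITIAL_N   # code points are counted from 128
    positions = len(basic) + 1           # possible insertion points
    index = label.index(extra[0])        # 0-indexed position of the character
    return offset * positions + index


def encode_first_delta(delta, bias=INITIAL_BIAS):
    """Variable-length base-36 encoding of the first delta only."""
    out = []
    k = BASE
    while True:
        # Threshold for this digit position, clamped to [TMIN, TMAX].
        t = min(max(k - bias, TMIN), TMAX)
        if delta < t:
            break
        out.append(DIGITS[t + (delta - t) % (BASE - t)])
        delta = (delta - t) // (BASE - t)
        k += BASE
    out.append(DIGITS[delta])
    return "".join(out)


delta = first_delta("googl\u00e9")      # 105 * 6 + 5
suffix = encode_first_delta(delta)
builtin = "googl\u00e9".encode("punycode")
```

Here delta comes out as 635 and suffix as "fsa", which agrees with Python's built-in codec returning b'googl-fsa'.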


This system seems pretty weird to me.

I was wondering, can that clash with a "normal" domain registered as "xn--....."? Apparently there is another specific rule in RFC 5891 saying "The Unicode string MUST NOT contain "--" (two consecutive hyphens) in the third and fourth character positions" [0]

Also, if I was forced to represent Unicode as ASCII, punycode encoding is not the obvious one - it's pretty confusing. But, I don't know much about how and why it was chosen, so I assume there's good reason.

[0] https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3....


IDNs and Punycode were basically bolted-on extensions to DNS that were added after DNS was already widely deployed. Because there was no "proper" extension mechanism available, it was a design requirement that they could be implemented "on top" of the standard DNS without having to change any of the underlying components. So I think most of the DNS infrastructure can be (and is) still completely unaware that IDNs and Punycode exist.

Actually, I wonder what happens if you take a "normal" (i.e. non-IDN, ascii-only) domain and encode it as Punycode. Should the encoded and non-encoded domains be considered identical or separate? (for purposes of DNS resolutions, origin separation, etc)

Identical would be more intuitive and would match the behavior of domain names with non-ascii characters - on the other hand, this would require reworking of ALL non-punycode-aware DNS software, which I'm doubtful is possible.

So this seems like a tricky thing to get right.


Python has an "idna" codec built in these days, so I figured I'd do a quick check to see what happens:

    >>> "foo".encode("idna")
    b'foo'
    >>> "fooé".encode("idna")
    b'xn--foo-dma'
So indeed a punycode'd ASCII domain would remain unchanged by the looks of it.

There's also the "punycode" encoding available, but that does something subtly different that's not quite how domains get encoded:

    >>> "foo".encode("punycode")
    b'foo-'
    >>> "fooé".encode("punycode")
    b'foo-dma'
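
To make the difference concrete, here is a rough sketch of how the two codecs relate: per label, an all-ASCII label passes through untouched and anything else becomes "xn--" plus its Punycode form. The helper names to_ascii_label and to_ascii are made up for this example, and it assumes the input is already lowercase and normalised, which the real "idna" codec takes care of itself:

```python
def to_ascii_label(label):
    """Encode one dot-separated label the way a domain name expects."""
    if label.isascii():
        # Plain ASCII labels are left alone: no prefix, no trailing "-".
        return label.encode("ascii")
    return b"xn--" + label.encode("punycode")


def to_ascii(domain):
    """Apply the per-label rule to a whole dotted name."""
    return b".".join(to_ascii_label(part) for part in domain.split("."))


sketch = to_ascii("www.googl\u00e9.com")
builtin = "www.googl\u00e9.com".encode("idna")
```

Both give b'www.xn--googl-fsa.com' for this input, which is why raw "punycode" looks so close to the domain form but isn't the same thing.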


According to the current Python documentation the 'idna' encoding in Python only does IDNA 2003, not IDNA 2008:

https://docs.python.org/3.12/library/codecs.html#module-enco...

They recommend the third-party 'idna' module for this:

https://pypi.org/project/idna/

IDNA 2003 is a particular annoyance of mine: The IDNA 2003 algorithm didn't encode the German 'ß' character, or rather encoded it 'wrongly', turning it into 'ss' through overeager use of Unicode normalisation in the nameprep part. Then the browser makers stood still for a long time and didn't upgrade to IDNA 2008, which fixed that bug among other things. The WhatWG, in its self-appointed role as stenographer of the browser cartel, didn't change its weird URL spec. But that seems to have changed in recent years. Of course the original sin of IDNA was making it client-side. :/
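
The ß behaviour is easy to reproduce with the built-in codec. A small sketch using only the standard library; the second value is just the raw Punycode of the unmapped label with "xn--" in front, shown for comparison rather than as the output of a full IDNA 2008 implementation:

```python
# Built-in codec (IDNA 2003): nameprep folds ß to "ss" before encoding,
# so the result is a plain ASCII name with no "xn--" label at all.
legacy = "stra\u00dfe.de".encode("idna")

# What the label looks like if ß is kept and Punycode-encoded directly.
unmapped = b"xn--" + "stra\u00dfe".encode("punycode")
```

legacy is b'strasse.de', a different name from the one typed, while unmapped is b'xn--strae-oqa'.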


I mean, yeah, but the odds of someone using "xn--" at the start of a domain are pretty small. The double dash is pretty uncommon.


It’s called an IDN. This is an encoding format called Punycode that transforms international domains into ASCII.


FWIW I find this is the perfect question for ChatGPT/Gemini. Whenever the knowledge is somewhere on the web but hard to Google, I use these LLMs.

In this case, Gemini correctly points to Punycode.


Regarding the quality of Google search results - I copied this comment verbatim into GPT 3.5, Claude 1, and Mistral small (the lowest quality LLMs from each provider available through Kagi) and each one explained Punycode encoding.


In case anyone else is confused as to why the domain in the example provided needs to be unicode (compared to the filename which is obvious): it's because the hyphen is the shorter '‑' char, which is extended ASCII 226 not the standard '-' (which would be ASCII 45).


The first character you pasted is U+2011 (8209 in decimal); it does not appear in the document and cannot be ASCII, as it lies beyond code point 127 (0x7F). Also, U+2011 is meant to be a non-breaking hyphen.
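
For anyone who wants to check a pasted character themselves, a quick sketch with the standard library. My guess, and it is only a guess, is that the 226 quoted above is the first byte of the character's UTF-8 encoding rather than a code point:

```python
import unicodedata

hyphen = "\u2011"                      # the pasted character
code_point = ord(hyphen)               # decimal value of U+2011
name = unicodedata.name(hyphen)        # official Unicode character name
utf8 = hyphen.encode("utf-8")          # three bytes: 0xE2 0x80 0x91
first_byte = utf8[0]                   # 0xE2 is 226 in decimal
plain = ord("-")                       # the ordinary ASCII hyphen-minus
```

code_point is 8209, name is 'NON-BREAKING HYPHEN', first_byte is 226, and plain is 45.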



