I would bet like a dollar that the supposed em-dash usage (which I'm not convinced is an accurate take in the first place) came from an enterprising dev somewhere being like "Well, we probably don't need multiple tokens for hyphens" and coercing every dash-type thing to just one hyphen-like token.
But I'm also showing off my ignorance with how these machines turn text into tokens in practice.
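
For what it's worth, here is roughly what that kind of coercion might look like as a text-cleaning step. This is only a sketch of the guess above, not how any real tokenizer is known to work; `DASH_TABLE` and `collapse_dashes` are made-up names, and the list of dash characters is just a handful picked for illustration.

```python
# Hypothetical preprocessing step: fold several Unicode dash characters
# into the plain ASCII hyphen-minus before the text is tokenized.
DASH_CHARS = [
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2012",  # figure dash
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u2015",  # horizontal bar
    "\u2212",  # minus sign
]

# str.translate expects a mapping from code point to replacement string.
DASH_TABLE = {ord(ch): "-" for ch in DASH_CHARS}


def collapse_dashes(text: str) -> str:
    """Replace every dash-like character listed above with '-'."""
    return text.translate(DASH_TABLE)


collapsed = collapse_dashes("pre\u2013war \u2014 post-war")
print(collapsed)
```

Note that a step like this only merges the dash variants; by itself it would not make any one of them show up more often in a model's output.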
I think all the em-dashes came from scraping WordPress blogs. The WordPress editor does "typography", the em-dashes it introduces survive the HTML-to-Markdown process used to scrape them, and they end up in datasets.
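
A quick way to see the "survives conversion" part, assuming the blog's HTML carries the dash as a numeric entity like `&#8212;`. The `html_to_text` helper is a crude stand-in I made up for a real HTML-to-Markdown converter, and the sample string is invented.

```python
import html
import re


def html_to_text(fragment: str) -> str:
    """Crude stand-in for an HTML-to-Markdown step: drop tags, decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", fragment)
    return html.unescape(without_tags)


# Invented sample of typographically "improved" blog markup.
sample = "<p>Tokens &#8212; not characters &#8212; are what the model sees.</p>"
converted = html_to_text(sample)

# The entity decodes to U+2014 (em dash), so the dash is still in the output.
print(converted.count("\u2014"))
```

Unless the pipeline normalizes dashes on purpose, nothing in a step like this would remove them.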