Mixture-of-Experts models (as the GPTs reportedly are) can produce different results for an input sequence if that sequence is retried together with a different set of sequences in its inference batch, because the expert routing depends on the whole batch, not on the single sequence: https://152334h.github.io/blog/non-determinism-in-gpt-4/
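A toy sketch of why batch composition matters (NumPy, with a made-up router and capacity policy, not any production system): under capacity-limited top-1 routing, whether a token reaches its preferred expert depends on which other tokens in the batch are competing for it, so the same token can come out of a different expert.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 4, 8
W_router = rng.standard_normal((d, n_experts))   # toy router weights
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def route(batch, capacity=1):
    """Top-1 routing with a per-expert capacity limit; overflow tokens fall
    back to their next-preferred expert (one of several real-world policies)."""
    scores = batch @ W_router                    # (tokens, experts)
    prefs = np.argsort(-scores, axis=1)          # expert preference order per token
    load = np.zeros(n_experts, dtype=int)
    out = np.empty_like(batch)
    for i, order in enumerate(prefs):
        for e in order:                          # take the first expert with spare capacity
            if load[e] < capacity:
                load[e] += 1
                out[i] = batch[i] @ experts[e]
                break
    return out

token = rng.standard_normal(d)
alone = route(token[None, :])[0]                            # token in a batch of one
crowded = route(np.vstack([rng.standard_normal((3, d)), token]))[-1]
print(np.allclose(alone, crowded))  # may print False: a batchmate claimed the expert first
```

Even one changed activation like this can flip which token has the highest logit at the next step.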
And in general, binary floating-point arithmetic is not associative, i.e. `(a + b) + c` might not equal `a + (b + c)`. In rare cases that can lead to the model picking a different token (with the auto-regressive consequence that the entire remainder of the generated sequence may then differ): https://www.ingonyama.com/blog/solving-reproducibility-chall...
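A quick Python illustration of the non-associativity itself:

```python
# Float addition is not associative: the small value survives or vanishes
# depending on the order the rounding happens in.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0 -- the 1.0 is absorbed by -1e16 before a cancels it
```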
Edit: Of course, my answer assumes you are asking about the case where the model lets you set its token-generation temperature (stochasticity) to exactly zero. With default sampling settings, every LLM I know of picks randomly among the most likely tokens.
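For illustration, a hedged sketch of the two modes (names and shapes here are made up, not any particular library's API): temperature 0 is usually implemented as a greedy argmax over the logits, while any temperature above zero samples from the softmax distribution.

```python
import numpy as np

def pick_token(logits, temperature, rng):
    if temperature == 0.0:
        return int(np.argmax(logits))             # deterministic -- given identical logits
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))       # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))  # stochastic: can differ run to run

rng = np.random.default_rng()
logits = np.array([2.0, 1.9, -1.0])
print(pick_token(logits, 0.0, rng))  # always 0
print(pick_token(logits, 1.0, rng))  # usually 0 or 1, occasionally 2
```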