IMO the best design would be to keep the flag with the data. Give each register an extra bit indicating whether it’s sensitive. Any data-dependent-timing operation can’t possibly leak the data until the data is available to it, and that’s exactly when the ALU would find out that the data is sensitive anyway. No pipeline stalls.
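To make that concrete, here is a rough toy model in Python, not a description of any real CPU. The `Reg` type, the `div` helper, and the cycle counts are all made up for illustration; the only point is that the divider picks its timing behaviour when the operands arrive, and that the sensitive bit propagates to the result.

```python
from dataclasses import dataclass

# Hypothetical cycle counts, chosen only for illustration.
FAST_DIV_BASE = 2
WORST_CASE_DIV_CYCLES = 40


@dataclass(frozen=True)
class Reg:
    """A register value that carries its own 'sensitive' bit."""
    value: int
    sensitive: bool = False


def div_latency(a: Reg, b: Reg) -> int:
    """Cycles taken by a toy divider with an early-exit fast path."""
    if a.sensitive or b.sensitive:
        # The ALU sees the tag together with the operand, so it can
        # select the fixed worst-case path without any earlier stall.
        return WORST_CASE_DIV_CYCLES
    # Non-sensitive operands: latency depends on the dividend's size.
    return min(WORST_CASE_DIV_CYCLES, FAST_DIV_BASE + a.value.bit_length())


def div(a: Reg, b: Reg) -> tuple[Reg, int]:
    """Return (result register, cycles); the tag propagates to the result.

    Assumes non-negative operands and a non-zero divisor.
    """
    result = Reg(a.value // b.value, a.sensitive or b.sensitive)
    return result, div_latency(a, b)
```

With this model, `div(Reg(100), Reg(7))` takes a data-dependent number of cycles, while `div(Reg(100, True), Reg(7))` and `div(Reg(3, True), Reg(7))` both take `WORST_CASE_DIV_CYCLES`, and their results stay tagged as sensitive.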
Sorry for necro-bumping, but there is a paper that does exactly that, among various other things to eliminate timing channels. It also claims to prevent attacks based on speculative execution: "BLACKOUT: Data-Oblivious Computation with Blinded Capabilities", https://arxiv.org/abs/2504.14654. They basically use another bit in CHERI to mark a "blinded capability", along with methods to mitigate the potential problems you identified.