> The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation.
To be clear, this covers only the neuron's activation strength on text inputs. We aren't doing any mechanistic modeling of whether our explanation of what the neuron does predicts the role it plays within the internals of the network, even though most neurons likely have a role that can only be succinctly summarized in relation to the rest of the network.
It seems very easy to end up with explanations that correlate well with a neuron's activations but do not actually meaningfully explain what the neuron is doing.
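To make the scoring idea concrete, here is a minimal sketch of correlation-based scoring: simulate per-token activations from the explanation alone (e.g. by prompting a language model with it), then correlate them with the neuron's actual activations. The function name, the toy data, and the use of Pearson correlation are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def correlation_score(actual: np.ndarray, simulated: np.ndarray) -> float:
    """Score an explanation by how well activations simulated from it
    track the neuron's actual activations over a set of text tokens.

    `actual` and `simulated` are per-token activation arrays of equal length.
    A score near 1 means the explanation "recovers" the neuron's behavior
    on these inputs; it says nothing about the neuron's causal role.
    """
    # Correlation is undefined for constant arrays, so treat that as a zero score.
    if actual.std() == 0 or simulated.std() == 0:
        return 0.0
    return float(np.corrcoef(actual, simulated)[0, 1])

# Toy usage (hypothetical numbers): a neuron's real activations on five tokens,
# and activations guessed from the explanation alone.
actual = np.array([0.0, 0.9, 0.1, 0.8, 0.0])
simulated = np.array([0.1, 0.7, 0.0, 0.9, 0.2])
print(correlation_score(actual, simulated))  # high score: the explanation "works" by this metric
```

Note that a high score here only certifies agreement on activation strength for the sampled text, which is exactly why correlation alone can overstate how well a neuron is understood.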