Since we don't know for certain what is happening inside a neural network, we can say we don't believe they are thinking, though we would still need to define the word "thinking." Once LLMs can self-modify, the word "thinking" will be more accurate than it is today.
And when Hinton says at MIT, "I find it very hard to believe that they don't have semantics when they solve problems like, you know, how do I get all the rooms in my house to be painted white in two years' time," I believe he's commenting on the ability of LLMs to think on some level.
1. Show GPT-4 a text produced by GPT-2, with the activation level of a specific GPT-2 neuron highlighted at each point in the text where it fired. Then ask GPT-4 for an explanation of what the neuron is doing.
GPT-4 produces: "words and phrases related to performing actions correctly or properly".
2. Based on that explanation, have GPT-4 guess how strongly the neuron activates on a new text.
"Assuming that the neuron activates on words and phrases related to performing actions correctly or properly. GPT-4 guesses how strongly the neuron responds at each token: '...Boot. When done _correctly_, "Secure...'"
3. Compare those guessed activations to the neuron's actual activations on the text to generate a score (a rough sketch of this loop is shown below).
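For concreteness, here is a minimal Python sketch of that explain-simulate-score loop. The prompts, the `call_gpt4` wrapper, and the function names are hypothetical stand-ins rather than OpenAI's actual code, and the scoring shown is a plain correlation between guessed and real activations, which is the kind of comparison the paper describes.

```python
# Minimal sketch of the explain / simulate / score loop (Python 3.10+).
# call_gpt4, the prompt texts, and the function names are hypothetical;
# a real implementation would call an actual GPT-4 API client here.

from statistics import correlation


def call_gpt4(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 completion call."""
    raise NotImplementedError("wire up a real API client here")


def explain_neuron(tokens: list[str], activations: list[float]) -> str:
    """Step 1: show GPT-4 the text with per-token activations of one
    GPT-2 neuron and ask for a short natural-language explanation."""
    highlighted = ", ".join(f"{t}:{a:.2f}" for t, a in zip(tokens, activations))
    return call_gpt4(f"Explain what this neuron fires on.\n{highlighted}")


def simulate_activations(explanation: str, new_tokens: list[str]) -> list[float]:
    """Step 2: given only the explanation, have GPT-4 guess how strongly
    the neuron would activate on each token of an unseen text."""
    reply = call_gpt4(
        f"Assuming the neuron activates on {explanation}, "
        f"rate each of these tokens from 0 to 10:\n" + " ".join(new_tokens)
    )
    return [float(x) for x in reply.split()]


def score_explanation(simulated: list[float], actual: list[float]) -> float:
    """Step 3: compare guessed to actual activations; a higher correlation
    means the explanation predicts the neuron's behavior better."""
    return correlation(simulated, actual)
```

The per-explanation scores quoted below (e.g. the 0.8 threshold) come out of a comparison step like `score_explanation`.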
So there is no introspection going on: GPT-4 is not looking inside its own network, only pattern-matching on activation data it is shown.
They report: "We applied our method to all MLP neurons in GPT-2 XL [the 1.5B figure usually cited for GPT-2 XL is its parameter count; the number of MLP neurons is far smaller]. We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior." But they also caution: "However, we found that both GPT-4-based and human contractor explanations still score poorly in absolute terms. When looking at neurons, we also found the typical neuron appeared quite polysemantic."