So, in essence, both the input and the output are read by an LLM that's fine-tuned to censor. If it flags the content, it instructs the core model to refuse. Similar to most AI-based moderation systems. It's slightly more involved in that there's one LLM for inputs and another for outputs, but it's not exactly a groundbreaking idea. Roughly, the flow is something like the sketch below.
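A minimal sketch of that arrangement, just to make the flow concrete (the function names here are made-up stubs, not anyone's real API):

```python
from dataclasses import dataclass

REFUSAL = "I can't help with that."

@dataclass
class Verdict:
    flagged: bool
    reason: str = ""

def classify_input(prompt: str) -> Verdict:
    # Stand-in for the input-side classifier: a separate LLM
    # fine-tuned to flag disallowed prompts. Stubbed out here.
    return Verdict(flagged=False)

def classify_output(prompt: str, completion: str) -> Verdict:
    # Stand-in for the output-side classifier, which screens the
    # completion before it's returned. Also stubbed out.
    return Verdict(flagged=False)

def generate(prompt: str) -> str:
    # Stand-in for the core chat model, which answers at full
    # strength because the filtering lives outside it.
    return f"(model answer to: {prompt})"

def moderated_chat(prompt: str) -> str:
    if classify_input(prompt).flagged:
        return REFUSAL
    draft = generate(prompt)
    if classify_output(prompt, draft).flagged:
        return REFUSAL
    return draft
```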
You're right that it's not entirely novel, but it is useful, at least for Claude users: there's a fair amount of research showing that training models to self-censor makes them dumber, so moving the censorship into a separate model (and letting Claude use its full intelligence on the "safe" queries) is a worthwhile change, assuming it works well enough to prevent further lobotomization of the chat model.
(Of course, open-source models are even more useful...)