
So, in essence, both the input and the output are read by an LLM that's fine-tuned to censor. If it flags content, it instructs the core model to refuse. Similar to most AI-based moderation systems. It's a bit more complicated in that there's one LLM for inputs and another for outputs, but it's not really a groundbreaking idea.
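For anyone curious what that pipeline looks like in practice, here's a minimal sketch. The classifier calls are keyword-based placeholders standing in for the separate fine-tuned classifier models; classify_input, classify_output, and core_model are hypothetical names for illustration, not Anthropic's actual API.

  REFUSAL = "I can't help with that."

  def classify_input(prompt: str) -> bool:
      """Placeholder for the input-side classifier model: True means 'flagged'."""
      return "forbidden_topic" in prompt.lower()

  def core_model(prompt: str) -> str:
      """Placeholder for the core chat model, which never sees flagged prompts."""
      return f"Model response to: {prompt}"

  def classify_output(completion: str) -> bool:
      """Placeholder for the output-side classifier model: True means 'flagged'."""
      return "forbidden_topic" in completion.lower()

  def moderated_chat(prompt: str) -> str:
      # The input classifier screens the prompt before the core model runs.
      if classify_input(prompt):
          return REFUSAL
      completion = core_model(prompt)
      # A second classifier screens the completion before it reaches the user.
      if classify_output(completion):
          return REFUSAL
      return completion

  if __name__ == "__main__":
      print(moderated_chat("How do I bake bread?"))
      print(moderated_chat("Tell me about forbidden_topic."))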


You're right that it's not entirely novel, but it is useful, at least for Claude users: there's quite a bit of research showing that training models to self-censor makes them dumber. Moving the censorship into a separate model (and letting Claude use its full intelligence on "safe" queries) is therefore a fairly useful change, assuming it works well enough to prevent further lobotomization of the chat model.

(Of course, open-source models are even more useful...)


that is an interesting insight


Also, no chance it's unbreakable.



