
So, in essence, both the input and the output are read by an LLM that's fine-tuned to censor. If it flags content, it instructs the core model to refuse. Similar to most AI-based moderation systems. It's a bit more complicated in that there's one LLM for inputs and another for outputs, but it's not really a groundbreaking idea.
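For anyone curious what that pipeline looks like in practice, here's a minimal sketch. The classifier calls are keyword-based placeholders standing in for the separate fine-tuned classifier models; classify_input, classify_output, and core_model are hypothetical names for illustration, not Anthropic's actual API.

  REFUSAL = "I can't help with that."

  def classify_input(prompt: str) -> bool:
      """Placeholder for the input-side classifier model: True means 'flagged'."""
      return "forbidden_topic" in prompt.lower()

  def core_model(prompt: str) -> str:
      """Placeholder for the core chat model, which never sees flagged prompts."""
      return f"Model response to: {prompt}"

  def classify_output(completion: str) -> bool:
      """Placeholder for the output-side classifier model: True means 'flagged'."""
      return "forbidden_topic" in completion.lower()

  def moderated_chat(prompt: str) -> str:
      # The input classifier screens the prompt before the core model runs.
      if classify_input(prompt):
          return REFUSAL
      completion = core_model(prompt)
      # A second classifier screens the completion before it reaches the user.
      if classify_output(completion):
          return REFUSAL
      return completion

  if __name__ == "__main__":
      print(moderated_chat("How do I bake bread?"))
      print(moderated_chat("Tell me about forbidden_topic."))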


You're right that it's not entirely novel, but it is useful, at least for Claude users: there's quite a bit of research showing that training models to self-censor makes them dumber. Moving the censorship into a separate model (and letting Claude use its full intelligence on "safe" queries) is therefore a fairly useful change, assuming it works well enough to prevent further lobotomization of the chat model.

(Of course, open-source models are even more useful...)


that is an interesting insight


Also, no chance it's unbreakable.



