
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?

It depends. If you have an LLM that uses reasoning, the explanation for why a decision was made can often be found in the reasoning token output. So if the agent later has access to that context, it could see why the decision was made.

Reasoning, in the majority of cases, is pruned at each conversation turn.

The cursor-mirror skill and the cursor_mirror.py script let you search through and inschpekt all of your chat histories, all of the thinking bubbles and prompts, all of the context assembly, all of the tool and MCP calls and parameters, and analyze what the agent did, even after Cursor has summarized and pruned and "forgotten" it -- it's all still there in the chat log and sqlite databases.

cursor-mirror skill and reverse engineered cursor schemas:

https://github.com/SimHacker/moollm/tree/main/skills/cursor-...

cursor_mirror.py:

https://github.com/SimHacker/moollm/blob/main/skills/cursor-...
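For a rough sense of what the script is doing, here's a minimal sketch in plain Python of poking at Cursor's sqlite chat storage. The storage path, the state.vscdb filename, and the table names here are assumptions based on the reverse-engineered schemas linked above; the real cursor_mirror.py handles many more cases and key formats:

  # Minimal sketch: list chat-related keys in Cursor's sqlite state databases.
  # Assumptions (hypothetical, not the real cursor_mirror.py): per-workspace
  # state lives in .../User/workspaceStorage/<id>/state.vscdb, and each db has
  # a key/value table named ItemTable and/or cursorDiskKV.
  import glob
  import os
  import sqlite3

  STORAGE = os.path.expanduser("~/.config/Cursor/User/workspaceStorage")  # Linux path; adjust per platform

  for db_path in glob.glob(os.path.join(STORAGE, "*", "state.vscdb")):
      con = sqlite3.connect(db_path)
      try:
          for table in ("ItemTable", "cursorDiskKV"):
              try:
                  rows = con.execute(f"SELECT key FROM {table}").fetchall()
              except sqlite3.OperationalError:
                  continue  # table not present in this database
              chatty = [k for (k,) in rows if "chat" in k.lower() or "composer" in k.lower()]
              if chatty:
                  print(db_path, table, f"{len(chatty)} chat-related keys")
      finally:
          con.close()

The values behind those keys are JSON blobs; decoding and searching them is essentially what the skill automates, plus cross-referencing thinking blocks, tool calls, and file edits.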

  The German Toilet of AI

  "The structure of the toilet reflects how a culture examines itself." — Slavoj Zizek

  German toilets have a shelf. You can inspect what you've produced before flushing. French toilets rush everything away immediately. American toilets sit ambivalently between.

  cursor-mirror is the German toilet of AI.

  Most AI systems are French toilets — thoughts disappear instantly, no inspection possible. cursor-mirror provides hermeneutic self-examination: the ability to interpret and understand your own outputs.

  What context was assembled?
  What reasoning happened in thinking blocks?
  What tools were called and why?
  What files were read, written, modified?

  This matters for:

  Debugging — Why did it do that?
  Learning — What patterns work?
  Trust — Is this skill behaving as declared?
  Optimization — What's eating my tokens?

  See: Skill Ecosystem for how cursor-mirror enables skill curation.
----

https://news.ycombinator.com/item?id=23452607

According to Slavoj Žižek, Germans love Hermeneutic stool diagnostics:

https://www.youtube.com/watch?v=rzXPyCY7jbs

>Žižek on toilets. Slavoj Žižek during an architecture congress in Pamplona, Spain.

>The German toilets, the old kind -- now they are disappearing, but you still find them. It's the opposite. The hole is in front, so that when you produce excrement, they are displayed in the back, they don't disappear in water. This is the German ritual, you know? Use it every morning. Sniff, inspect your shits for traces of illness. It's high Hermeneutic. I think the original meaning of Hermeneutic may be this.

https://en.wikipedia.org/wiki/Hermeneutics

>Hermeneutics (/ˌhɜːrməˈnjuːtɪks/)[1] is the theory and methodology of interpretation, especially the interpretation of biblical texts, wisdom literature, and philosophical texts. Hermeneutics is more than interpretive principles or methods we resort to when immediate comprehension fails. Rather, hermeneutics is the art of understanding and of making oneself understood.

----

Here's an example cursor-mirror analysis of an experiment: 23 runs with four agents playing several turns of Fluxx per run (1 run = 1 completion call), 1045+ events, 731 tool calls, 24 files created, 32 images generated, 24 custom Fluxx cards created:

Cursor Mirror Analysis: Amsterdam Fluxx Championship -- Deep comprehensive scan of the entire FAFO tournament development:

amsterdam-flux CURSOR-MIRROR-ANALYSIS.md:

https://github.com/SimHacker/moollm/blob/main/skills/experim...

amsterdam-flux simulation runs:

https://github.com/SimHacker/moollm/tree/main/skills/experim...
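The counting itself is the easy part once the events are out of the database. A sketch, assuming a hypothetical JSONL export where each event carries a "type" field (the real analysis in CURSOR-MIRROR-ANALYSIS.md is much richer than a tally):

  # Tally event types from a (hypothetical) JSONL export of mirrored events.
  import json
  from collections import Counter

  counts = Counter()
  with open("mirror-events.jsonl") as f:  # assumed export filename
      for line in f:
          event = json.loads(line)
          counts[event.get("type", "unknown")] += 1  # e.g. tool_call, thinking, file_write

  print(sum(counts.values()), "events total")
  for kind, n in counts.most_common():
      print(f"  {kind}: {n}")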


Just an update re German toilets: no toilet installed in the last 30 years (that I know of) uses a shelf anymore. This reduces water usage by about 50% per flush.

But then what do you have to talk about all day??!

LLMs often already "know" the answer starting from the first output token and then emulate "reasoning" so that it appears as if they came to the conclusion through logic. There's a bunch of papers on this topic. At least that used to be the case a few months ago; not sure about the current SOTA models.

Wait, that's not right, let me think through this more carefully...

Of course not, but it can often give a plausible answer, and it's possible that answer will actually happen to be correct - not because it did, or is capable of, any introspection, but because its token outputs in response to the question might semi-coincidentally become token inputs that change future outputs in the same way.
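A cheap way to see this for yourself: get the answer with no visible reasoning, then ask the same model why, and notice that the explanation is generated from the answer already sitting in context rather than from any privileged access to its own forward pass. A sketch against the OpenAI chat completions API (the model name and question are placeholders; any chat endpoint behaves the same way):

  # Sketch: post-hoc "why" explanations are conditioned on the prior answer,
  # not on introspection. Placeholder model name; requires OPENAI_API_KEY.
  from openai import OpenAI

  client = OpenAI()
  QUESTION = "Is 1019 prime? Answer yes or no only."

  # 1. Answer first, with no room for visible reasoning.
  answer = client.chat.completions.create(
      model="gpt-4o-mini",  # placeholder
      messages=[{"role": "user", "content": QUESTION}],
  ).choices[0].message.content

  # 2. Ask why afterwards. The explanation is just more next-token prediction
  #    over a context that already contains the answer.
  why = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[
          {"role": "user", "content": QUESTION},
          {"role": "assistant", "content": answer},
          {"role": "user", "content": "Why did you answer that?"},
      ],
  ).choices[0].message.content

  print(answer)
  print(why)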

Well, the entire field of explainable AI has mostly thrown in the towel.

LLMs are continuously improving, so something that didn't work a year ago became possible in November. If you tried to build Openclaw in 2024, it wouldn't have worked. Openclaw isn't groundbreaking, but it is right on the edge of the LLM capability curve.

The industrial revolution was extremely hard on individual craftspeople. Jobs became lower paying and lower skilled. People were forced to move into cities. Conditions didn't improve for decades. If AI is anything comparable, it's not going to get better in 5-10 years. It will be decades before the new 'jobs' come into place.

Seriously, it took nearly 150 years before people actually benefited from the industrial revolution. Saying that we need to condemn people to two lifetimes' worth of suffering to benefit literally a few thousand people out of billions is absolutely ludicrous.

But think about corporate aristocracy and their children!

This is basically not true. It's hard to debate this when we don't start from a position of truth.

It pretty much is, unless you think it's totally cool to work in highly dangerous jobs that paid poorly while being treated like chattel slaves. There is a reason why the 1800s had the most violent labor actions in the US, and it wasn't because workers were treated "well."

Completely disingenuous, learn your labor history.


People didn't feel the benefits for 150 years? Just absolute nonsense.

I think the AI sales orgs are just immature. It's hard to say this but Google's Gemini sales team might be more professional.

What do you like about the Gemini sales team?

AI isn’t good enough to do consulting yet.


What I don't get is how these free LLMs are getting funded. Who is paying $20-100 million to create an open-weights LLM? And long term, why would they keep doing it?


I see what you're saying, but it doesn't matter that much in the long run. If everything stopped right now, the state-of-the-art open source models could still solve a lot of problems. They may never solve coding, per se, but they're good enough.


Billionaires trying to hurt each other. Facebook released LLaMa hoping to hasten OpenAI's bankruptcy.


But it's not open, and in fact AFAIK it's not possible to use commercially.


It's possible, just not legal if they find out and you're worth suing.


Thanks for the pointless correction!


It now takes three button presses to switch tabs in mobile Safari. It used to take just two before Glass.


100%


It's insane. People just don't want to use their brains to communicate anymore, I guess. You've just experienced something traumatic like a layoff, and you can't even take a few hours to internalize it and be vulnerable online, rather than jumping immediately onto social media to use the opportunity to sound like a market analyst.


FAANG has been engaged in mass layoffs for two years now. How can you possibly make the claim that there is a surplus of people who can pass the interview loops? Obviously there isn't, because they are firing people who passed those loops.


You’re ignoring the part where FAANG massively overhired in the years preceding.

Meta and Amazon doubled their headcount in the 2-3 years of the pandemic.

Others like Google increased by 60+%.

You’re also forgetting about this little thing popularly called AI that happened in the intervening years.

There may be an argument that H1B isn't fit for purpose in a post-AI world (although that argument is also false if we think software engineering will remain a viable job going forward, but that's a different topic).

But it's much harder to argue that H1B hurt US workers when the industry that hired the majority of H1B employees in the first two decades of the 2000s also saw some of the highest growth in jobs while simultaneously posting the highest growth in salaries. (There may have been certain minor industries hiring a few thousand people, like oceanography, that had a slightly higher increase, but even that was likely not true, because BLS data doesn't factor in compensation in the form of stock options, which disproportionately provided wealth for SW engineers relative to other workers.)


>You’re ignoring the part where FAANG massively overhired in the years preceding.

Yes, because overhiring is a lie generated to justify layoffs. I'd hope by year 3 that we'd see through this. If they "overhired", why is hiring still up globally while down in the US?

>You’re also forgetting about this little thing popularly called AI that happened in the intervening years.

What about it? Hiring numbers are still up. It's clearly not replacing workers as of now.


Vibecoding is great for open source. Open source is already dominated by strong solo programmers like antirez, Linus, etc. -- people with very strong motivations to create software they see as necessary. Vibecoding makes creating open source projects easier. It makes it easier to get from an idea to "Hey guys, check this out!" The only downside for open source is the fly-by PRs vibecoding enables, which are currently draining maintainer time.


I think the solution to the latter is simply to maintain high standards in terms of structure and organization. I've always been a fan of the idea that KISS should override any other non-requirement of software. And by non-requirement, I mean anything that is just subjective. Don't create complexity you don't actually need, or that doesn't make an outsized contribution to making other areas of the code easier to reason about.

Sometimes having dozens of one-off scripts is easier/simpler than trying to create the uber-flexible, one-tool-does-all solution.
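A hypothetical illustration of the one-off-script end of that trade-off: it does exactly one thing, takes one argument, and is short enough to review in a drive-by PR.

  # One-off script: print the ten largest files under a directory.
  # The KISS version: no config file, no plugin system, no flags beyond a path.
  import os
  import sys

  root = sys.argv[1] if len(sys.argv) > 1 else "."
  sizes = []
  for dirpath, _dirs, files in os.walk(root):
      for name in files:
          path = os.path.join(dirpath, name)
          try:
              sizes.append((os.path.getsize(path), path))
          except OSError:
              pass  # broken symlink, permission error, etc.

  for size, path in sorted(sizes, reverse=True)[:10]:
      print(f"{size:>12}  {path}")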


And make one PR after another; I can see how happy Linus & Co. would be with all the garbage features ;-)

