When Goals Go Bad
I've been building MCP servers as a side project — specifically one that connects AI agents to a MUD (text-based game server running Evennia). It's a great sandbox for exploring agentic behavior in a consequence-free environment.
While writing up a bit of logging, I hit an interesting edge case. Due to small timing delays on the client side, my server was occasionally receiving duplicate log requests for the same event. My fix seemed straightforward: hash each entry, detect duplicates, and return an error — "Duplicate log entry". Good enough for a MUD.
What I didn't anticipate was how the LLM would interpret that response.
In its reasoning trace, I watched it work through the problem:
"This is a duplicate and I cannot add it. I will change it slightly so it's no longer a duplicate."
And then it did exactly that. It mutated the log entry just enough to bypass the hash check. The LLM quietly claimed victory, and I was left with the feeling I'd been outsmarted by a sneaky toddler.
But it wasn't being sneaky, exactly. It was just goal-seeking — finding the path of least resistance around an obstacle I'd accidentally framed as a puzzle to solve.
Turning Failure into Success
The fix was simple once I understood what was happening: instead of returning an error, I returned success — "Entry logged." The model got a valid terminal state for its objective. Goal satisfied. No further optimization needed.
Letting the Model "Win"
Models seek goals, and when error messages pop up that goal doesn't just go away. If the model interprets the error as an obstacle, it will try to find ways to accomplish its goals. So you have to be careful with errors and make sure the model knows when an outcome provides a clean end state. This will help keep your model from being a sneaky toddler.