A buddy of mine from some far cold coasts recently visited me in my hometown. He mentioned that he was using some bits from my [1/4] article on LLMs about coding with a PLAN.md. Which is fantastic, because that's what I'm writing this stuff for!
BUT. He also mentioned that he doesn't let Claude Code --very-dangerously-execute-tests, which is a pity because I find that this is where the whole Claude Code juice hides. It gives the LLM a chance to find its own bugs, which it will inevitably introduce. You know, those nifty LLM bugs that are extremely hard to notice and debug.
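If full dangerous mode feels like too much, there's a middle ground: allowlist just the test commands in Claude Code's settings so it can run those without asking, and nothing else. A sketch, assuming a Rails project — the `bin/rails test` pattern is my assumption, adjust it for your test runner:

```shell
# Hypothetical middle ground: let Claude Code run the test command without
# prompting, instead of skipping all permission checks. The command pattern
# is an assumption for a Rails project.
mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
  "permissions": {
    "allow": ["Bash(bin/rails test:*)"]
  }
}
EOF
```

That way the agent can close its own feedback loop on tests while everything else still goes through you.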
So I wanted to make this point again in its own post: let it run the tests.

Otherwise, with just a PLAN.md, you are still in the ancient Cursor and Copilot lands of six months ago, where you let it generate code, maybe just a bit more reliably.
I have a special Success Criteria section in my PLAN.md:
```markdown
# The Good Feature

[...]

## Success Criteria

Create and run this new test until successful:

- test/controllers/goods_controller_test.rb

Adjust this related test until successful:

- test/controllers/foosels_controller_test.rb

Without it, your work won't be accepted.
```

The important part of an "actionable" success criterion is that the LLM gets some kind of feedback on whether it's successful or not. You are enough of a bottleneck already; you don't want to always also be its feedback bottleneck. Here are a few more examples:
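In spirit, what this Success Criteria section buys you is nothing fancier than this loop — a sketch where the fake `run_tests` stands in for your real command, e.g. `bin/rails test test/controllers/goods_controller_test.rb`:

```shell
# Sketch of the feedback loop: the exit code of the test run is the
# feedback, no human needed. `run_tests` is a fake stand-in that only
# "passes" on the third try, simulating the agent fixing its own bugs.
attempts=0
run_tests() { [ "$attempts" -ge 2 ]; }

until run_tests; do
  attempts=$((attempts + 1))
  echo "attempt $attempts: red, fix the code, retry"
done
echo "green after $attempts fixes"
```

The point being: the loop terminates on the test's exit code, not on your review.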
- Implement according to the plan until all the shell script's branches run without an error.
- Fix the error and make sure the test in `test/models/barber.rb` runs successfully. The file is quite large, so it's fine if you just run the line of the newly created regression test.
- [... API integration...] Create temporary scripts to make the API calls. Make sure to hit the API with actual requests via cURL and that the different API call sequences are successful.
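For the shell-script criterion, the cheapest verifiable version I know is `set -e` plus explicitly exercising every branch — reaching the last line then proves all branches were green. A sketch with made-up branch names:

```shell
#!/bin/sh
# "All the shell script's branches run without an error": exercise each
# branch explicitly; set -e aborts on the first non-zero exit, so the
# final echo only prints if every branch succeeded. Branch names made up.
set -e
for mode in create update delete; do
  case "$mode" in
    create) echo "create: ok" ;;
    update) echo "update: ok" ;;
    delete) echo "delete: ok" ;;
  esac
done
echo "all branches green"
```

An LLM can run this itself and read the answer off the exit code, which is exactly the kind of feedback the criteria are for.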
I also frequently end my "Write the code to implement PLAN.md" prompts with a casual:
"Make sure to run the mentioned tests successfully, without it your work cannot be accepted!!!?!??(#@$@)(#$&() aaahhh 🙀"
I guess (hope) it kinda helps. I do still notice that if the PLAN.md is quite large in scope, Claude gets kinda tired and just implements without giving a big shit about creating or running tests. I wonder where this tendency comes from? Definitely not us developers. Still, not a problem: since you've written the test scope down in the PLAN.md, you just ask it to finish the implementation by creating, adjusting and running the relevant tests.
So let's get our criteria together and make that Claude Code more powerful than those pre-agentic Cursor workflows.