- Consistency testing - run prompts N times
- Less strict JSON validation rules (is/has)
- Prompt versioning in test suites
- Better error diagnostics

Because "it worked on my machine" hits different when you're debugging LLM prompts.
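
To make the "run prompts N times" idea concrete, here is a rough sketch of what a consistency check could look like. It is not this tool's actual API: `run_consistency_check`, `generate`, `check`, and `min_pass_rate` are hypothetical names made up for illustration, and the model call is replaced by a canned stand-in.

```python
import itertools
from collections import Counter
from typing import Callable


def run_consistency_check(
    generate: Callable[[str], str],   # hypothetical: wraps whatever calls your model
    prompt: str,
    check: Callable[[str], bool],     # hypothetical: returns True if one output is acceptable
    n: int = 5,
    min_pass_rate: float = 1.0,
) -> dict:
    """Call `generate` n times on the same prompt and summarize how often `check` passes."""
    if n <= 0:
        raise ValueError("n must be positive")
    outputs = [generate(prompt) for _ in range(n)]
    passes = sum(1 for out in outputs if check(out))
    pass_rate = passes / n
    return {
        "runs": n,
        "passes": passes,
        "pass_rate": pass_rate,
        # How many different strings came back across the n runs
        "distinct_outputs": len(Counter(outputs)),
        "ok": pass_rate >= min_pass_rate,
    }


if __name__ == "__main__":
    # Stand-in for a real model call: cycles through canned replies,
    # one of which is not JSON-shaped.
    replies = itertools.cycle(['{"answer": 4}', '{"answer": 4}', "four"])

    def fake_model(prompt: str) -> str:
        return next(replies)

    report = run_consistency_check(
        fake_model,
        "What is 2 + 2? Reply as JSON.",
        check=lambda out: out.startswith("{"),
        n=6,
        min_pass_rate=0.8,
    )
    print(report)
```

The pass-rate threshold is one possible design choice: requiring 100% is the strictest setting, while a lower value tolerates the occasional flaky run without hiding a prompt that fails most of the time.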