Single-question tests don't reflect how users actually interact with AI assistants. Real conversations involve follow-ups, context references, and multi-step tasks. That's why we built multi-turn conversation testing in Probefish.
The Problem with Single-Turn Tests
Traditional prompt testing evaluates one question at a time:
Input: "What's the capital of France?"
Expected: Contains "Paris"
This works for simple Q&A, but fails to test:
- Memory and context retention
- Multi-step reasoning
- Conversation flow handling
- Session state management
What We Built
Probefish now supports conversation test cases - sequences of messages that maintain context, just like real chat interactions.
For LLM Prompts
Each turn builds on the previous conversation history:
Turn 1: User → "I'm planning a trip to Japan"
Assistant → [LLM responds with travel suggestions]
Turn 2: User → "What about in spring?"
Assistant → [LLM responds knowing the context is Japan travel]
Turn 3: User → "How much should I budget?"
Assistant → [LLM responds with Japan spring trip budget]
The full message history is sent to the LLM at each turn, maintaining context.
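The history-passing loop can be pictured roughly like this (a minimal sketch, not Probefish's internals; `call_llm` is a stubbed stand-in for a real chat-completion call):

```python
# Minimal sketch of multi-turn context accumulation.
# `call_llm` is a stand-in for a real chat API call; here it is
# stubbed so the loop can run offline.

def call_llm(messages):
    # Stub: a real implementation would send `messages` to the model.
    return f"(reply to {messages[-1]['content']!r}, with {len(messages) - 1} prior messages)"

def run_conversation(user_turns):
    messages = []  # full history, grows every turn
    replies = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = call_llm(messages)  # the entire history is sent each time
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

replies = run_conversation([
    "I'm planning a trip to Japan",
    "What about in spring?",
    "How much should I budget?",
])
```

Because the whole list is resent on every turn, the model answering turn 3 sees both the Japan context and the spring follow-up.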
For API Endpoints
Session state is preserved between requests via:
- Cookie persistence - Automatically captures and sends session cookies
- Token extraction - Extracts auth tokens from responses and injects them into subsequent requests
- Variable extraction - Pulls values from responses (like order IDs) for use in later turns
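The extraction idea can be sketched as follows (illustrative only; the `extract` helper, rule format, and `{{order_id}}` placeholder syntax are assumptions for the example, not Probefish's actual configuration):

```python
import json

# Illustrative: pull named values out of a JSON response body so
# later turns can reference them via placeholders like {{order_id}}.

def extract(response_body, rules):
    data = json.loads(response_body)
    return {name: data[field] for name, field in rules.items()}

def substitute(template, variables):
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template

# Turn 1 response: an order is created; capture its id.
turn1_body = '{"order_id": "ord_123", "status": "created"}'
variables = extract(turn1_body, {"order_id": "order_id"})

# Turn 2 request: reuse the captured id in the next request path.
turn2_path = substitute("/orders/{{order_id}}/checkout", variables)
```

The same pattern covers auth tokens and cart IDs: extract once, then inject into any later turn.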
Real-World Use Cases
1. Customer Support Bot Testing
Test that your support bot handles multi-step troubleshooting:
Turn 1: "My internet isn't working"
→ Validate: Asks diagnostic questions
Turn 2: "The router lights are all green"
→ Validate: Suggests next steps based on previous info
Turn 3: "I already tried restarting it"
→ Validate: Doesn't repeat the restart suggestion
Why it matters: Support bots that forget context frustrate users. Test that your bot remembers what was already tried.
2. E-commerce Checkout Flow
Test complete purchase journeys:
Turn 1: "I want to buy the blue sneakers in size 10"
→ Extract: product_id, validates product found
Turn 2: "Add to cart"
→ Extract: cart_id, validates item added
Turn 3: "Checkout with express shipping"
→ Validate: Correct total, shipping option applied
Turn 4: "Use my saved payment method"
→ Validate: Order confirmation, references correct items
Why it matters: Each step depends on previous state. Testing individual endpoints misses integration issues.
3. Onboarding Flow Testing
Validate that your AI assistant guides users through setup:
Turn 1: "I'm new here, help me get started"
→ Validate: Welcomes user, asks about goals
Turn 2: "I want to track my fitness"
→ Validate: Acknowledges goal, asks follow-up
Turn 3: "I run 3 times a week"
→ Validate: Creates appropriate plan, references running
Why it matters: Onboarding sets the tone. Test that your AI builds a coherent user profile across turns.
4. Context Window Stress Testing
Test how your prompt handles long conversations:
Turns 1-5: [Simulated] Set up complex context
Turn 6: "What did I say in my first message?"
→ Validate: Correctly recalls early context
Turn 7: "Summarize our entire conversation"
→ Validate: Accurate summary of all topics
Why it matters: Context window limits can cause AI to "forget" early messages. Find out where it breaks.
5. Authentication Flow Testing
Test login → action → session expiry flows:
Turn 1: POST /login with credentials
→ Extract: auth_token from response
Turn 2: GET /profile (with token)
→ Validate: Returns user data
Turn 3: POST /settings/update (with token)
→ Validate: Settings updated successfully
Session management handles token injection automatically between turns.
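The flow above amounts to this (a sketch against a fake in-memory API; `fake_api`, its routes, and the token value are invented for illustration):

```python
# Sketch of a login -> authorized-request flow against a fake
# transport; `fake_api` stands in for a real HTTP client + server.

def fake_api(method, path, headers=None, body=None):
    headers = headers or {}
    if method == "POST" and path == "/login":
        return {"status": 200, "body": {"auth_token": "tok_abc"}}
    if headers.get("Authorization") == "Bearer tok_abc":
        return {"status": 200, "body": {"user": "alice"}}
    return {"status": 401, "body": {"error": "unauthorized"}}

# Turn 1: log in and extract the token from the response.
login = fake_api("POST", "/login", body={"user": "alice", "pw": "..."})
token = login["body"]["auth_token"]

# Turns 2+: the extracted token is injected into subsequent requests.
auth_headers = {"Authorization": f"Bearer {token}"}
profile = fake_api("GET", "/profile", headers=auth_headers)
```

A request without the injected header would come back 401, which is exactly the failure mode this test type is meant to catch.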
6. Error Recovery Testing
Test that your AI handles mistakes gracefully:
Turn 1: "Book a flight to Paris"
→ Validate: Asks for dates
Turn 2: "Actually, I meant London"
→ Validate: Corrects destination, doesn't lose other context
Turn 3: "December 15th to 22nd"
→ Validate: Confirms London (not Paris) for those dates
Why it matters: Users change their minds. Your AI should adapt without starting over.
7. Multi-Language Context Switching
Test that your AI maintains context across language changes:
Turn 1: "Quiero reservar una mesa para dos" (Spanish)
→ Validate: Responds in Spanish
Turn 2: "Actually, can we switch to English?"
→ Validate: Responds in English, remembers reservation context
Turn 3: "Make it for 7 PM"
→ Validate: Confirms reservation details in English
Why it matters: Multilingual users switch languages. Context shouldn't be lost.
8. API Rate Limit & Retry Testing
Test endpoint behavior under session constraints:
Turn 1: POST /api/generate (1st request)
→ Extract: request_id
Turn 2: POST /api/generate (2nd request)
→ Extract: remaining_quota
Turn 3: POST /api/generate (hits limit)
→ Validate: Returns 429, includes retry-after
Why it matters: Test rate limiting behavior across a realistic request sequence.
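A self-contained way to model this scenario (the quota size, header names, and `Retry-After` value below are invented for the sketch):

```python
# Sketch: a fake endpoint with a per-session quota that returns
# 429 plus a Retry-After header once the limit is hit.

def make_limited_endpoint(quota):
    state = {"used": 0}
    def endpoint():
        if state["used"] >= quota:
            return {"status": 429, "headers": {"Retry-After": "30"}}
        state["used"] += 1
        return {"status": 200,
                "headers": {"X-Remaining-Quota": str(quota - state["used"])}}
    return endpoint

generate = make_limited_endpoint(quota=2)
responses = [generate() for _ in range(3)]  # third call exceeds the quota
```

The point is that the 429 only appears in sequence: a single isolated request would never exercise the limit.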
Validation Options
Per-Turn Validation
Validate each response as it happens. In the UI, expand each turn to add validation rules:
Turn 1: "What's 2+2?"
- Validation: Contains "4"
Turn 2: "Multiply that by 3"
- Validation: Contains "12"
Catch issues immediately when they occur.
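In spirit, per-turn validation looks like this (a sketch; `fake_llm` is a canned stand-in for the model, and the `must_contain` rule format is invented for the example):

```python
# Sketch of per-turn validation: each user turn pairs with a
# "contains" check on the reply it produces.

def fake_llm(history):
    # Canned stand-in: answers the two arithmetic turns.
    user_count = len([m for m in history if m["role"] == "user"])
    return {1: "2 + 2 = 4", 2: "4 * 3 = 12"}[user_count]

turns = [
    {"user": "What's 2+2?", "must_contain": "4"},
    {"user": "Multiply that by 3", "must_contain": "12"},
]

history, results = [], []
for turn in turns:
    history.append({"role": "user", "content": turn["user"]})
    reply = fake_llm(history)
    history.append({"role": "assistant", "content": reply})
    results.append(turn["must_contain"] in reply)  # checked immediately
```

If turn 1 fails, you know before turn 2 runs, which makes debugging much faster than inspecting only the final answer.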
Final-Only Validation
Only validate the last response - useful when intermediate responses don't need strict validation:
Turns 1-4: Build up context (no validation)
Turn 5: "Now give me the final recommendation"
- LLM Judge: "Recommendation is coherent and considers all previous context"
Simulated Responses: Test Specific Scenarios
Sometimes you need to set up a specific conversation state without making real LLM calls. Add an Assistant turn with pre-defined content:
Turn 1 (Assistant - simulated):
"I found 3 flights to Paris. The cheapest is $450 on Air France."
Turn 2 (User - real LLM call):
"Book the cheapest one"
- Validation: Contains "Air France", Contains "$450"
This lets you:
- Test specific edge cases
- Reduce API costs during development
- Create deterministic test conditions
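Conceptually, a simulated turn is just a pre-seeded entry in the message history (sketch only; `fake_llm` here is a stand-in for the one real model call):

```python
# Sketch: seed the history with a pre-written assistant turn, then
# make one "real" call (stubbed here) and validate only that reply.

def fake_llm(history):
    # Stand-in model that acts on the seeded context.
    return "Booking the cheapest option: Air France at $450."

history = [
    # Turn 1: simulated assistant response; no model call is made.
    {"role": "assistant",
     "content": "I found 3 flights to Paris. The cheapest is $450 on Air France."},
    # Turn 2: the real user turn under test.
    {"role": "user", "content": "Book the cheapest one"},
]
reply = fake_llm(history)
checks = ["Air France" in reply, "$450" in reply]
```

Because turn 1 is fixed text, the test is deterministic: any failure points at turn 2's handling, not at variability in an earlier response.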
Getting Started
- Create a test case and select "Conversation" mode
- Add turns - User messages that will be sent to the AI
- Configure validation - Per-turn or final-only
- For endpoints: Enable session config if needed (cookies, tokens)
- Run and see turn-by-turn results
Conclusion
Multi-turn testing catches bugs that single-turn tests miss. Context drift, session issues, and conversation flow problems only surface when you test the way real users actually interact.
Start testing your AI conversations today with Probefish.