Multi-Turn Conversation Testing: Test Your AI Like Real Users Do

Single-question tests don't reflect how users actually interact with AI assistants. Real conversations involve follow-ups, context references, and multi-step tasks. That's why we built multi-turn conversation testing in Probefish.

The Problem with Single-Turn Tests

Traditional prompt testing evaluates one question at a time:

Input: "What's the capital of France?"
Expected: Contains "Paris"

This works for simple Q&A, but fails to test:

  • Memory and context retention
  • Multi-step reasoning
  • Conversation flow handling
  • Session state management

What We Built

Probefish now supports conversation test cases: sequences of messages that maintain context, just like real chat interactions.

For LLM Prompts

Each turn builds on the previous conversation history:

Turn 1: User → "I'm planning a trip to Japan"
        Assistant → [LLM responds with travel suggestions]

Turn 2: User → "What about in spring?"
        Assistant → [LLM responds knowing the context is Japan travel]

Turn 3: User → "How much should I budget?"
        Assistant → [LLM responds with Japan spring trip budget]

The full message history is sent to the LLM at each turn, maintaining context.
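The history accumulation can be sketched roughly like this (a minimal Python sketch, not a Probefish API; `call_llm` is a deterministic stand-in for whatever model client you use):

```python
def call_llm(messages):
    # Stand-in for a real model call; returns a canned reply
    # so the sketch stays deterministic and runnable.
    return f"(reply to: {messages[-1]['content']})"

def run_conversation(user_turns):
    """Send each user turn with the full prior history attached."""
    history = []
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        reply = call_llm(history)  # the model sees every earlier message
        history.append({"role": "assistant", "content": reply})
    return history

history = run_conversation([
    "I'm planning a trip to Japan",
    "What about in spring?",
    "How much should I budget?",
])
print(len(history))  # 6 messages: three user/assistant pairs
```

By turn 3, the model receives all six prior messages, which is how "budget" gets interpreted as "budget for a spring trip to Japan."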

For API Endpoints

Session state is preserved between requests via:

  • Cookie persistence - Automatically captures and sends session cookies
  • Token extraction - Extracts auth tokens from responses and injects them into subsequent requests
  • Variable extraction - Pulls values from responses (like order IDs) for use in later turns
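One way to picture the extract-then-inject cycle (the response shapes and field names here are hypothetical, for illustration only):

```python
def extract(response_body: dict, response_headers: dict) -> dict:
    """Pull session values out of one response for use in later turns."""
    session = {}
    # Token extraction: lift an auth token from the JSON body.
    if "access_token" in response_body:
        session["auth_token"] = response_body["access_token"]
    # Cookie persistence: carry Set-Cookie forward to the next request.
    if "Set-Cookie" in response_headers:
        session["cookie"] = response_headers["Set-Cookie"]
    return session

def inject(session: dict) -> dict:
    """Build headers for the next request from captured session state."""
    headers = {}
    if "auth_token" in session:
        headers["Authorization"] = f"Bearer {session['auth_token']}"
    if "cookie" in session:
        headers["Cookie"] = session["cookie"]
    return headers

session = extract({"access_token": "abc123"}, {"Set-Cookie": "sid=42"})
print(inject(session))  # Authorization and Cookie headers for turn 2
```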

Real-World Use Cases

1. Customer Support Bot Testing

Test that your support bot handles multi-step troubleshooting:

Turn 1: "My internet isn't working"
        → Validate: Asks diagnostic questions

Turn 2: "The router lights are all green"
        → Validate: Suggests next steps based on previous info

Turn 3: "I already tried restarting it"
        → Validate: Doesn't repeat the restart suggestion

Why it matters: Support bots that forget context frustrate users. Test that your bot remembers what was already tried.


2. E-commerce Checkout Flow

Test complete purchase journeys:

Turn 1: "I want to buy the blue sneakers in size 10"
        → Extract: product_id, validates product found

Turn 2: "Add to cart"
        → Extract: cart_id, validates item added

Turn 3: "Checkout with express shipping"
        → Validate: Correct total, shipping option applied

Turn 4: "Use my saved payment method"
        → Validate: Order confirmation, references correct items

Why it matters: Each step depends on previous state. Testing individual endpoints misses integration issues.
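The state handoff between turns boils down to variable extraction plus template substitution. A rough sketch, with hypothetical response shapes and a `{{name}}` placeholder syntax chosen for illustration:

```python
import re

def extract_vars(response: dict, mapping: dict) -> dict:
    """Pull named values out of a JSON response, e.g. {"product_id": "id"}."""
    return {name: response[key] for name, key in mapping.items()}

def render(template: str, variables: dict) -> str:
    """Substitute {{name}} placeholders with previously extracted values."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template)

variables = {}
# Turn 1: the product search response yields a product_id.
variables.update(extract_vars({"id": "sku-789", "name": "Blue Sneakers"},
                              {"product_id": "id"}))
# Turn 2: the "add to cart" request body references it.
body = render('{"action": "add", "product": "{{product_id}}"}', variables)
print(body)  # {"action": "add", "product": "sku-789"}
```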


3. Onboarding Flow Testing

Validate that your AI assistant guides users through setup:

Turn 1: "I'm new here, help me get started"
        → Validate: Welcomes user, asks about goals

Turn 2: "I want to track my fitness"
        → Validate: Acknowledges goal, asks follow-up

Turn 3: "I run 3 times a week"
        → Validate: Creates appropriate plan, references running

Why it matters: Onboarding sets the tone. Test that your AI builds a coherent user profile across turns.


4. Context Window Stress Testing

Test how your prompt handles long conversations:

Turns 1-5: [Simulated] Set up complex context
Turn 6: "What did I say in my first message?"
        → Validate: Correctly recalls early context

Turn 7: "Summarize our entire conversation"
        → Validate: Accurate summary of all topics

Why it matters: Context window limits can cause AI to "forget" early messages. Find out where it breaks.


5. Authentication Flow Testing

Test login → action → session expiry flows:

Turn 1: POST /login with credentials
        → Extract: auth_token from response

Turn 2: GET /profile (with token)
        → Validate: Returns user data

Turn 3: POST /settings/update (with token)
        → Validate: Settings updated successfully

Session management handles token injection automatically between turns.
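End to end, the flow looks roughly like this (the endpoint paths match the example above, but the API here is an in-memory fake so the sketch is self-contained):

```python
def fake_api(method, path, headers=None, body=None):
    """In-memory stand-in for the API under test (illustration only)."""
    headers = headers or {}
    if (method, path) == ("POST", "/login"):
        return {"status": 200, "body": {"auth_token": "tok-1"}}
    if headers.get("Authorization") != "Bearer tok-1":
        return {"status": 401, "body": {}}
    if (method, path) == ("GET", "/profile"):
        return {"status": 200, "body": {"user": "alice"}}
    if (method, path) == ("POST", "/settings/update"):
        return {"status": 200, "body": {"updated": True}}
    return {"status": 404, "body": {}}

# Turn 1: log in and extract the token from the response.
token = fake_api("POST", "/login", body={"user": "alice"})["body"]["auth_token"]
auth = {"Authorization": f"Bearer {token}"}

# Turns 2-3: the extracted token is injected into every subsequent request.
assert fake_api("GET", "/profile", headers=auth)["body"]["user"] == "alice"
assert fake_api("POST", "/settings/update", headers=auth)["body"]["updated"]
```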


6. Error Recovery Testing

Test that your AI handles mistakes gracefully:

Turn 1: "Book a flight to Paris"
        → Validate: Asks for dates

Turn 2: "Actually, I meant London"
        → Validate: Corrects destination, doesn't lose other context

Turn 3: "December 15th to 22nd"
        → Validate: Confirms London (not Paris) for those dates

Why it matters: Users change their minds. Your AI should adapt without starting over.


7. Multi-Language Context Switching

Test that your AI maintains context across language changes:

Turn 1: "Quiero reservar una mesa para dos" (Spanish)
        → Validate: Responds in Spanish

Turn 2: "Actually, can we switch to English?"
        → Validate: Responds in English, remembers reservation context

Turn 3: "Make it for 7 PM"
        → Validate: Confirms reservation details in English

Why it matters: Multilingual users switch languages. Context shouldn't be lost.


8. API Rate Limit & Retry Testing

Test endpoint behavior under session constraints:

Turn 1: POST /api/generate (1st request)
        → Extract: request_id

Turn 2: POST /api/generate (2nd request)
        → Extract: remaining_quota

Turn 3: POST /api/generate (hits limit)
        → Validate: Returns 429, includes retry-after

Why it matters: Rate limits only trigger after repeated requests, so single-request tests never exercise them. A realistic request sequence does.
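The shape of such a test can be sketched against a fake endpoint (the two-request limit and header names are assumptions for the example):

```python
class RateLimitedEndpoint:
    """Fake endpoint that allows two requests per session (illustration only)."""
    def __init__(self, limit=2):
        self.limit = limit
        self.count = 0

    def post(self):
        self.count += 1
        if self.count > self.limit:
            # Over quota: 429 with a Retry-After hint, as the test expects.
            return {"status": 429, "headers": {"Retry-After": "30"}}
        return {"status": 200,
                "body": {"remaining_quota": self.limit - self.count}}

api = RateLimitedEndpoint(limit=2)
assert api.post()["status"] == 200                 # Turn 1: succeeds
assert api.post()["body"]["remaining_quota"] == 0  # Turn 2: quota exhausted
third = api.post()                                 # Turn 3: hits the limit
assert third["status"] == 429 and "Retry-After" in third["headers"]
```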


Validation Options

Per-Turn Validation

Validate each response as it happens. In the UI, expand each turn to add validation rules:

Turn 1: "What's 2+2?"

  • Validation: Contains "4"

Turn 2: "Multiply that by 3"

  • Validation: Contains "12"

Catch issues immediately when they occur.
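The per-turn loop above can be sketched as message-plus-validator pairs (a minimal sketch; `scripted_llm` is a deterministic stand-in for the model, and the validator signature is an assumption, not Probefish's API):

```python
def scripted_llm(history):
    """Deterministic stand-in for the model so the sketch is runnable."""
    replies = {"What's 2+2?": "2+2 equals 4.",
               "Multiply that by 3": "That gives 12."}
    return replies[history[-1]["content"]]

def run_with_validation(turns):
    """turns: list of (user_message, validator) pairs, checked per turn."""
    history, results = [], []
    for message, validator in turns:
        history.append({"role": "user", "content": message})
        reply = scripted_llm(history)
        history.append({"role": "assistant", "content": reply})
        results.append(validator(reply))  # validate as each turn completes
    return results

results = run_with_validation([
    ("What's 2+2?", lambda r: "4" in r),
    ("Multiply that by 3", lambda r: "12" in r),
])
print(results)  # [True, True]
```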

Final-Only Validation

Validate only the last response. This is useful when intermediate responses don't need strict validation:

Turns 1-4: Build up context (no validation)

Turn 5: "Now give me the final recommendation"

  • LLM Judge: "Recommendation is coherent and considers all previous context"

Simulated Responses: Test Specific Scenarios

Sometimes you need to set up a specific conversation state without making real LLM calls. Add an Assistant turn with pre-defined content:

Turn 1 (Assistant - simulated):

"I found 3 flights to Paris. The cheapest is $450 on Air France."

Turn 2 (User - real LLM call):

"Book the cheapest one"

  • Validation: Contains "Air France", Contains "$450"

This lets you:

  • Test specific edge cases
  • Reduce API costs during development
  • Create deterministic test conditions
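Mechanically, a simulated turn is just a pre-seeded assistant message in the history. A sketch of the idea, with a stub model standing in for the real LLM call:

```python
def stub_llm(history):
    """Stand-in model that answers from the seeded conversation history."""
    for msg in reversed(history[:-1]):
        if msg["role"] == "assistant" and "Air France" in msg["content"]:
            return "Booking the $450 Air France flight to Paris."
    return "I don't have flight options yet."

# Seed the history with a simulated assistant turn - no real LLM call made.
history = [
    {"role": "assistant",
     "content": "I found 3 flights to Paris. The cheapest is $450 on Air France."},
    {"role": "user", "content": "Book the cheapest one"},
]
reply = stub_llm(history)
assert "Air France" in reply and "$450" in reply
```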

Getting Started

  1. Create a test case and select "Conversation" mode
  2. Add turns - User messages that will be sent to the AI
  3. Configure validation - Per-turn or final-only
  4. For endpoints: Enable session config if needed (cookies, tokens)
  5. Run and see turn-by-turn results

Conclusion

Multi-turn testing catches bugs that single-turn tests miss. Context drift, session-state issues, and conversation flow problems only surface when you test the way real users actually interact.

Start testing your AI conversations today with Probefish.