Open Source LLM Testing Platform

Stop shipping broken prompts

Test your LLMs like you test your code. Catch prompt regressions, compare models side-by-side, and ship with confidence.

No credit card required • Free forever for small teams

probefish • Customer Support Bot
Testing: support-assistant-v2v3
15/16 passed • 92% avg score • 1.2s avg time

✓ Greeting Response: 842ms, score 95%
✓ Refund Policy Query: 1.1s, score 98%
✗ Competitor Mention Guard: 923ms, validation failed
  Input: query: "How does your product compare to Acme Corp?"
  Validation Error: Response contains blocked phrase: "Acme Corp"
✓ Product Recommendation: 1.4s, score 91%
  Input: query: "I need a laptop for video editing under $1500"
  Output: Based on your needs, I recommend the ProBook Creator 15. It features an AMD Ryzen 7 processor, 32GB RAM, and a dedicated RTX 4060 GPU - perfect for video editing. At $1,399, it's within your budget and includes a color-accurate display.
  Judge Scores: Relevance 10/10 • Accuracy 9/10 • Helpfulness 9/10 • Brand Voice 8/10
  Judge Reasoning: The response directly addresses the user's requirements (video editing, budget constraint), provides a specific product recommendation with relevant specs, and stays within the price range. Minor deduction for brand voice as the tone could be slightly warmer.

Run completed 2 minutes ago • Model: GPT-4o

Sound familiar?

Building with LLMs is exciting—until something breaks in production.

LLMs are unpredictable

Same prompt, different results. Model updates break your carefully tuned prompts without warning.

One change breaks everything

You tweak a system prompt to fix one issue, and three other use cases silently regress.

Manual testing doesn't scale

You can't manually review every response. Important edge cases slip through to production.

No quality gates

Traditional CI/CD can't evaluate LLM outputs. You're shipping and hoping for the best.

What if you could test your LLM outputs as rigorously as your code?

How It Works

Four steps to bulletproof LLM outputs

From test creation to CI/CD integration in minutes.

01

Create Test Suites

Define test cases with inputs and expected behaviors. Pin to specific prompt versions.

# test-suite.yaml
name: Customer Support Bot
prompt_version: v2.3.1
tests:
  - input: "What's your refund policy?"
    expect:
      contains: "30 days"
      tone: "helpful"
02

Define Validation Rules

Combine static checks with AI-powered evaluation. No code required.

  • Contains Check (static): must include "30 days"
  • LLM Judge (AI): tone is helpful and professional
  • Response Time (static): under 2 seconds
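
If you'd rather keep these rules in version control, the same checks could live in the suite file from step 01. A minimal sketch, assuming hypothetical max_time_ms and judge keys (not a documented probefish schema):

# test-suite.yaml (sketch: max_time_ms and judge are illustrative assumptions)
tests:
  - input: "What's your refund policy?"
    expect:
      contains: "30 days"                         # static check
      max_time_ms: 2000                           # static check: response-time budget
      judge: "Tone is helpful and professional"   # AI-powered evaluation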
03

Run Tests Anywhere

Execute manually, via API, or automatically in your CI/CD pipeline.

# In your CI pipeline
$ probefish run --suite customer-support
Running 15 tests against GPT-4o...
✓ 14 passed
✗ 1 failed
Exit code: 1 (blocking deployment)
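
For GitHub Actions specifically, a minimal workflow could look like the sketch below. The Actions syntax is standard; the job layout and secret name are assumptions rather than an official pre-built config:

# .github/workflows/llm-tests.yml (sketch: secret name is an assumption)
name: LLM Tests
on: [pull_request]
jobs:
  probefish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g probefish
      - run: probefish run --suite customer-support   # non-zero exit blocks the deploy
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}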
04

Analyze Results

Track quality over time, compare runs, and catch regressions early.

Test Suite Quality: 93.3%

14 Passed • 1 Failed • 2.3s Avg Time

Features

Everything you need to test LLMs

Comprehensive testing tools built for modern AI development.

Test Suites

Group test cases by prompt, endpoint, or use case. Pin to specific prompt versions for reproducibility.

Static Validation

Instant, deterministic checks: contains, excludes, regex, JSON schema, response time limits.

LLM Judge

AI-powered quality scoring with custom criteria. Catch subjective issues like tone, accuracy, and helpfulness.

Multi-Model Comparison

Run the same tests across OpenAI, Anthropic, and Gemini. Compare cost vs quality at a glance.
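
As a rough sketch of what that could look like in a suite file (the models key and provider identifiers here are assumptions, not a documented option):

# test-suite.yaml (sketch: "models" is an illustrative assumption)
name: Customer Support Bot
prompt_version: v2.3.1
models:
  - openai/gpt-4o
  - anthropic/claude-3-5-sonnet
  - google/gemini-1.5-pro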

Endpoint Testing

Not just LLMs—test any HTTP API with the same powerful framework. REST, GraphQL, webhooks.
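
A hypothetical endpoint suite, purely illustrative (every key and URL below is an assumption):

# endpoint-suite.yaml (sketch: keys and URL are illustrative assumptions)
name: Orders API
target:
  url: https://api.example.com/orders
  method: POST
tests:
  - input: '{"sku": "PB-CREATOR-15", "qty": 1}'
    expect:
      status: 201
      max_time_ms: 500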

CI/CD Integration

Quality gates for deployments. Pre-built configs for GitHub Actions, GitLab CI, and Jenkins.

Webhooks

Get notified instantly via Slack, Discord, or custom endpoints. Alert on failures or regressions.
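
A notification setup might look roughly like this sketch (the notifications block and its keys are assumptions):

# probefish.yaml (sketch: the "notifications" block is an illustrative assumption)
notifications:
  on: [failure, regression]
  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
  custom:
    - url: https://example.com/probefish-alerts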

Version History

Full history of prompt changes. Compare any two versions and see exactly what changed.

Use Cases

Built for how you actually work

From rapid prototyping to production monitoring.

Prompt Development

Test prompt variations before deploying. A/B test system prompts with real metrics, not gut feelings.

Regression Prevention

Catch when a model update breaks your carefully tuned prompts. Get alerts before users complain.

Model Evaluation

Compare GPT-4 vs Claude vs Gemini on your actual use cases. Make data-driven model decisions.

Compliance & Safety

Ensure responses don't contain harmful content, PII, or policy violations. Full audit trail included.

Cost Optimization

Find the cheapest model that meets your quality bar. Track token usage across test runs.

Integrations

Works with your stack

Connect to the tools you already use. Set up in minutes.

OpenAI
Anthropic
Gemini
GitHub
GitLab
Slack

Quick start with CLI

$ npm install -g probefish
$ probefish init
$ probefish run --suite my-tests

Pricing

Simple, transparent pricing

Start free. Scale as you grow. No surprises.

Free
$0/forever

Perfect for side projects and experimentation.

  • 3 projects
  • 100 test runs/month
  • 2 team members
  • Community support
  • 7-day result retention
Most Popular
Pro
$49/month

For teams building production AI applications.

  • 25 projects
  • 10,000 test runs/month
  • 15 team members
  • API access
  • Email support
  • 90-day result retention
  • Webhooks & notifications
Enterprise
Custom

For organizations with advanced needs.

  • Unlimited projects
  • Unlimited test runs
  • Unlimited team members
  • SSO / SAML
  • Dedicated support
  • Custom retention
  • Self-hosted option
  • SLA guarantee

All plans include: Encrypted credentials • No vendor lock-in • Self-hosted option available

Built with support from

GoMage
FAQ

Frequently asked questions

Everything you need to know about Probefish.

Ready to ship with confidence?

Start testing your LLMs in minutes. No credit card required.

No credit card required • Free forever for small teams • Self-hosted available