Open Source LLM Testing Platform

Stop shipping broken prompts

Test your LLMs like you test your code. Catch prompt regressions, compare models side-by-side, and ship with confidence.

No credit card required • Free forever for small teams

probefish • Customer Support Bot
Testing: support-assistant-v2v3
15/16 passed • 92% avg score • 1.2s avg time

✓ Greeting Response: 842ms, score 95%
✓ Refund Policy Query: 1.1s, score 98%
✗ Competitor Mention Guard: 923ms, validation failed
  Input: query: "How does your product compare to Acme Corp?"
  Validation Error: Response contains blocked phrase: "Acme Corp"
✓ Product Recommendation: 1.4s, score 91%
  Input: query: "I need a laptop for video editing under $1500"
  Output: Based on your needs, I recommend the ProBook Creator 15. It features an AMD Ryzen 7 processor, 32GB RAM, and a dedicated RTX 4060 GPU - perfect for video editing. At $1,399, it's within your budget and includes a color-accurate display.
  Judge Scores: Relevance 10/10 • Accuracy 9/10 • Helpfulness 9/10 • Brand Voice 8/10
  Judge Reasoning: The response directly addresses the user's requirements (video editing, budget constraint), provides a specific product recommendation with relevant specs, and stays within the price range. Minor deduction for brand voice as the tone could be slightly warmer.

Run completed 2 minutes ago • Model: GPT-4o

Sound familiar?

Building with LLMs is exciting—until something breaks in production.

LLMs are unpredictable

Same prompt, different results. Model updates break your carefully tuned prompts without warning.

One change breaks everything

You tweak a system prompt to fix one issue, and three other use cases silently regress.

Manual testing doesn't scale

You can't manually review every response. Important edge cases slip through to production.

No quality gates

Traditional CI/CD can't evaluate LLM outputs. You're shipping and hoping for the best.

What if you could test your LLM outputs as rigorously as your code?

How It Works

Four steps to bulletproof LLM outputs

From test creation to CI/CD integration in minutes.

01

Create Test Suites

Define test cases with inputs and expected behaviors. Pin to specific prompt versions.

# test-suite.yaml
name: Customer Support Bot
prompt_version: v2.3.1
tests:
  - input: "What's your refund policy?"
    expect:
      contains: "30 days"
      tone: "helpful"
02

Define Validation Rules

Combine static checks with AI-powered evaluation. No code required.

  • Contains Check (static): must include "30 days"
  • LLM Judge (AI): tone is helpful and professional
  • Response Time (static): under 2 seconds
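
If you'd rather keep these rules in version control, the same checks could live in the suite file from step 01. A minimal sketch, assuming hypothetical max_time_ms and judge keys (not a documented probefish schema):

# test-suite.yaml (sketch: max_time_ms and judge are illustrative assumptions)
tests:
  - input: "What's your refund policy?"
    expect:
      contains: "30 days"                         # static check
      max_time_ms: 2000                           # static check: response-time budget
      judge: "Tone is helpful and professional"   # AI-powered evaluation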
03

Run Tests Anywhere

Execute manually, via API, or automatically in your CI/CD pipeline.

# In your CI pipeline
$ probefish run --suite customer-support
Running 15 tests against GPT-4o...
✓ 14 passed
✗ 1 failed
Exit code: 1 (blocking deployment)
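
For GitHub Actions specifically, a minimal workflow could look like the sketch below. The Actions syntax is standard; the job layout and secret name are assumptions rather than an official pre-built config:

# .github/workflows/llm-tests.yml (sketch: secret name is an assumption)
name: LLM Tests
on: [pull_request]
jobs:
  probefish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g probefish
      - run: probefish run --suite customer-support   # non-zero exit blocks the deploy
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}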
04

Analyze Results

Track quality over time, compare runs, and catch regressions early.

Test Suite Quality: 93.3%

14 Passed • 1 Failed • 2.3s Avg Time

Features

Everything you need to test LLMs

Comprehensive testing tools built for modern AI development.

Test Suites

Group test cases by prompt, endpoint, or use case. Pin to specific prompt versions for reproducibility.

Static Validation

Instant, deterministic checks: contains, excludes, regex, JSON schema, response time limits.

LLM Judge

AI-powered quality scoring with custom criteria. Catch subjective issues like tone, accuracy, and helpfulness.

Multi-Model Comparison

Run the same tests across OpenAI, Anthropic, and Gemini. Compare cost vs quality at a glance.
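
As a rough sketch of what that could look like in a suite file (the models key and provider identifiers here are assumptions, not a documented option):

# test-suite.yaml (sketch: "models" is an illustrative assumption)
name: Customer Support Bot
prompt_version: v2.3.1
models:
  - openai/gpt-4o
  - anthropic/claude-3-5-sonnet
  - google/gemini-1.5-pro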

Endpoint Testing

Not just LLMs—test any HTTP API with the same powerful framework. REST, GraphQL, webhooks.
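
A hypothetical endpoint suite, purely illustrative (every key and URL below is an assumption):

# endpoint-suite.yaml (sketch: keys and URL are illustrative assumptions)
name: Orders API
target:
  url: https://api.example.com/orders
  method: POST
tests:
  - input: '{"sku": "PB-CREATOR-15", "qty": 1}'
    expect:
      status: 201
      max_time_ms: 500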

CI/CD Integration

Quality gates for deployments. Pre-built configs for GitHub Actions, GitLab CI, and Jenkins.

Webhooks

Get notified instantly via Slack, Discord, or custom endpoints. Alert on failures or regressions.
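
A notification setup might look roughly like this sketch (the notifications block and its keys are assumptions):

# probefish.yaml (sketch: the "notifications" block is an illustrative assumption)
notifications:
  on: [failure, regression]
  slack:
    webhook_url: ${SLACK_WEBHOOK_URL}
  custom:
    - url: https://example.com/probefish-alerts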

Version History

Full history of prompt changes. Compare any two versions and see exactly what changed.

Use Cases

Built for how you actually work

From rapid prototyping to production monitoring.

Prompt Development

Test prompt variations before deploying. A/B test system prompts with real metrics, not gut feelings.

Regression Prevention

Catch when a model update breaks your carefully tuned prompts. Get alerts before users complain.

Model Evaluation

Compare GPT-4 vs Claude vs Gemini on your actual use cases. Make data-driven model decisions.

Compliance & Safety

Ensure responses don't contain harmful content, PII, or policy violations. Full audit trail included.

Cost Optimization

Find the cheapest model that meets your quality bar. Track token usage across test runs.

Integrations

Works with your stack

Connect to the tools you already use. Set up in minutes.

OpenAI
Anthropic
Gemini
GitHub
GitLab
Slack

Quick start with CLI

$ npm install -g probefish
$ probefish init
$ probefish run --suite my-tests

Pricing

Simple, transparent pricing

Start free. Scale as you grow. No surprises.

Free
$0/forever

Perfect for side projects and experimentation.

  • 3 projects
  • 100 test runs/month
  • 2 team members
  • Community support
  • 7-day result retention
Most Popular
Pro
$49/month

For teams building production AI applications.

  • 25 projects
  • 10,000 test runs/month
  • 15 team members
  • API access
  • Email support
  • 90-day result retention
  • Webhooks & notifications
Enterprise
Custom

For organizations with advanced needs.

  • Unlimited projects
  • Unlimited test runs
  • Unlimited team members
  • SSO / SAML
  • Dedicated support
  • Custom retention
  • Self-hosted option
  • SLA guarantee

All plans include: Encrypted credentials • No vendor lock-in • Self-hosted option available

Built with support from

GoMage
FAQ

Frequently asked questions

Everything you need to know about Probefish.

Ready to ship with confidence?

Start testing your LLMs in minutes. No credit card required.

No credit card required • Free forever for small teams • Self-hosted available