LLMs are probabilistic: the same prompt can produce different results on each run. Running a test once tells you it can work, not that it reliably works. Evals run the same test many times and measure accuracy statistically.
```typescript
await test.run(agent, {
  iterations: 30,   // How many times to run
  concurrency: 5,   // Parallel runs (careful with rate limits)
  retries: 2,       // Retry failures
  timeoutMs: 30000, // Per-test timeout
  mcpjam: {
    // Auto-save is enabled when MCPJAM_API_KEY is available
    suiteName: "SDK eval smoke",
    strict: false, // warn by default; true to fail CI on upload errors
  },
  onProgress: (done, total) => {
    console.log(`${done}/${total}`);
  },
});
```
Both EvalTest and EvalSuite can save results to MCPJam automatically when a run completes: set MCPJAM_API_KEY in your environment and no further configuration is needed.
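A minimal sketch of an auto-saving run, reusing the options from the example above (`test` and `agent` are assumed to come from your existing setup):

```typescript
// Requires MCPJAM_API_KEY in the environment; when the key is present,
// results are uploaded to MCPJam as soon as the run completes.
await test.run(agent, {
  iterations: 30,
  mcpjam: {
    suiteName: "SDK eval smoke", // how the run is labeled in MCPJam
  },
});
```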
By default, when results are built from traces (auto-save and reporter helpers), a failed tool execution (an MCP isError response, an errored tool span, or an error tool-result in messages) sets passed: false for that iteration, even if your test function returned true. This keeps the CI pass rate aligned with real server behavior. To disable this for a run (for example, when you only assert which tools were called, not that every call succeeded), set failOnToolError: false on the mcpjam block.
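Again a sketch under the same assumptions as above:

```typescript
await test.run(agent, {
  iterations: 30,
  mcpjam: {
    suiteName: "SDK eval smoke",
    // Don't mark iterations failed on tool errors; only the test
    // function's return value decides passed/failed.
    failOnToolError: false,
  },
});
```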
You can also generate eval code from the MCPJam Inspector. Click ⋮ → Copy markdown for server evals on any server card, then paste it into an LLM; see the Quickstart for details. If you have an MCPJAM_API_KEY, the generated code will automatically save results to the Evals tab in the Inspector. Go to Settings > Workspace API Key to get your key.