We're validating the Agent Workbench with QA & engineering leaders

A QA interface for coding agents.

The Agent Workbench lets coding agents delegate validation tasks to webmate and get back structured evidence from the running product, instead of raw browser noise.

The bigger goal: From AI-generated code to trusted product evolution

What it is

What we mean by Agent Workbench.

It is not a new prompt UI for human testers. It is the agent-facing side of webmate: a structured way for coding agents and agent harnesses to delegate QA tasks and receive evidence-backed feedback from the running product.

Who uses it?

Coding agents and agent harnesses.

What gets tested?

The running web or mobile product.

What does webmate provide?

QA tasks, validation, findings, evidence, and feedback.

What it is not

Not a prompt UI for humans, not a generic agent platform, not a replacement for specs or CI.

The coding agent is the user. The running product is what gets tested.

The gap

Agents can write code, but do they know if it works?

Code, prompts, and tests live in symbols. Quality lives in reality. Bridging the two takes a chain of translations, and each one quietly adds assumptions.

Symbolic world

Where agents work

code · prompts · specifications · repository files · unit tests
the gap

Running reality

Where quality is decided

browsers · devices · user flows · runtime states · edge cases

The concept

Coding agents need a QA interface, not just browser tools.

Browser control isn't enough. An agent can click buttons, read the DOM, inspect logs, or drive DevTools. But it still needs to know what to check, what evidence matters, and whether the result can be trusted. The Agent Workbench gives agents a structured way to delegate QA tasks to webmate and get useful feedback from the running product.

Instead of low-level commands

  • click button
  • read DOM
  • take screenshot
  • parse a11y tree
  • diff network logs

Agents ask QA questions

  • validate this change
  • investigate this issue
  • compare behavior
  • check whether specs still hold
  • identify missing coverage

From commands to QA questions.
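
As a minimal sketch of what a "QA question" could look like as data, the five questions above can be modeled as task types. Every name here (`QaTask`, `describeTask`, the `kind` values and fields) is a hypothetical illustration, not webmate's actual interface.

```typescript
// Hypothetical task shapes: one variant per QA question listed above.
type QaTask =
  | { kind: "validate-change"; change: string }
  | { kind: "investigate-issue"; issue: string }
  | { kind: "compare-behavior"; baseline: string; candidate: string }
  | { kind: "check-specs"; specs: string[] }
  | { kind: "identify-missing-coverage"; change: string };

// Turn a task into the one-line question an agent would delegate.
function describeTask(task: QaTask): string {
  switch (task.kind) {
    case "validate-change":
      return `Validate change ${task.change}`;
    case "investigate-issue":
      return `Investigate issue: ${task.issue}`;
    case "compare-behavior":
      return `Compare ${task.candidate} against ${task.baseline}`;
    case "check-specs":
      return `Check whether specs still hold: ${task.specs.join(", ")}`;
    case "identify-missing-coverage":
      return `Identify missing coverage for ${task.change}`;
  }
}
```

The point of the sketch is the level of abstraction: the agent states what it wants to know, and leaves the clicks, DOM reads, and screenshots to whatever executes the task.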

The analogy

Why we call it a Workbench.

Humans need an environment to inspect, test, and understand software. Coding agents need something similar, but programmatic, task-based, and evidence-oriented. In webmate, humans use the Workbench interactively. Agents use the Agent Workbench through APIs, skills, MCP, CLI, or other harness integrations.

Human QA interface

webmate Workbench

A person drives the session, inspects, and judges.

  • Live session: Chrome on Pixel 8 (running)
  • Inspect element: checkout button (selected)
  • Capture finding: attach screenshot (saved)
  • Reviewer sign-off: manual

Agent QA interface · programmatic

Agent Workbench

Called from a coding agent via API, skill, MCP, or CLI.

request
workbench.validate({
  target: "signup",
  spec:   "password-rules",
  on:     ["iOS Safari", "Pixel 8"]
})
response
  • verdict: fail
  • findings: 1
  • evidence: video · DOM · steps
  • note: special-char rule not enforced on iOS Safari

One webmate platform: 1000+ real devices already run validation for human testers. Agents plug into the same infrastructure.
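
To show how a calling agent might consume such a result, here is a sketch that writes the request and response mock above as typed data. The field names come from the mock. The types, the "pass" verdict, and the `nextStep` helper are assumptions for illustration and do not describe a published webmate API.

```typescript
// Field names mirror the mock above; the types themselves are assumptions.
interface ValidateRequest {
  target: string; // e.g. "signup"
  spec: string;   // e.g. "password-rules"
  on: string[];   // environments to run on
}

interface ValidateResponse {
  verdict: "pass" | "fail"; // assumed: the mock only shows "fail"
  findings: number;
  evidence: string[];
  note?: string;
}

// Hypothetical helper: decide what the agent does next from the verdict.
function nextStep(req: ValidateRequest, res: ValidateResponse): string {
  if (res.verdict === "pass") return `proceed: ${req.target} meets ${req.spec}`;
  return `fix ${req.target}: ${res.note ?? "see evidence"}`;
}

const request: ValidateRequest = {
  target: "signup",
  spec: "password-rules",
  on: ["iOS Safari", "Pixel 8"],
};

// The response from the mock, written out as data.
const example: ValidateResponse = {
  verdict: "fail",
  findings: 1,
  evidence: ["video", "DOM", "steps"],
  note: "special-char rule not enforced on iOS Safari",
};

console.log(nextStep(request, example));
// prints "fix signup: special-char rule not enforced on iOS Safari"
```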

The bigger picture

Building with agents is a chain of assumptions. The Agent Workbench adds control points.

Coding agents don't turn intent into truth in one step. Every transition from intent to spec to code to running product to evidence adds assumptions, and uncertainty grows unless something pushes back. Spec-driven development walks the whole chain. Vibe coding skips the spec and hopes reality cooperates. webmate adds a control at every transition either path takes.

Now

Runtime validation: does the running product behave as intended?

Next

Control uncertainty across the full chain, from product intent to runtime evidence.

Accumulating uncertainty
assumptions · drift · missing cases · unverified scope

01

Human Intent / Prompt

What the user or organization wants

"Fix the login flow" · "Password requirements must be met"

Interpretation: Did we understand the intent correctly?

02

Interpreted Intent / Specification

The agent's hypothesis about what was meant

The spec is an interpretation of intent, not the intent itself.

Implementation: Does the code really implement the spec?

03

Code / Implementation

A model of how the spec could become software

Code is a proposal for how intent could become behavior.

Execution: Does the code behave correctly in the real environment?

04

Running Product

The product as users experience it

browsers · devices · timing · a11y · network · edge cases

Observation: Is the evidence strong enough?

05

Evidence & Findings

What was observed, proven, or left uncertain

screenshots · logs · traces · videos · reports

Agent Workbench control mechanisms: the right control at each uncertainty point

Under Interpretation

  • Assumption extraction
  • Ambiguity detection
  • Clarification questions
  • Spec review

Under Implementation

  • Traceability
  • Spec-to-code checks
  • Generated tests
  • Reviewable claims

Under Execution

  • Runtime validation
  • Browser / device execution
  • Journey checks
  • Accessibility checks

Under Observation

  • Findings
  • Artifacts
  • Reproducibility
  • Coverage metadata
  • Residual uncertainty

Feedback / Correction: Evidence updates the next step

Residual uncertainty

What remains unknown after validation. The Agent Workbench makes the gaps visible.

Safari not checked · Autofill not checked · Screen reader behavior unknown · Spec assumption unconfirmed · Evidence covers one flow
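
One way to picture how such gaps could be made visible: compare what a validation was asked to cover with what it actually produced evidence for. The sketch below is an illustration under that assumption. `CoverageReport` and `residualUncertainty` are hypothetical names, not part of webmate.

```typescript
// Hypothetical report: what was requested versus what has evidence.
interface CoverageReport {
  requested: string[];
  checked: string[];
}

// Everything requested but not checked is reported as residual uncertainty.
function residualUncertainty(report: CoverageReport): string[] {
  const done = new Set(report.checked);
  return report.requested
    .filter((item) => !done.has(item))
    .map((item) => `${item} not checked`);
}

const gaps = residualUncertainty({
  requested: ["Chrome", "Safari", "Autofill"],
  checked: ["Chrome"],
});
console.log(gaps); // prints [ 'Safari not checked', 'Autofill not checked' ]
```

A plain set difference like this only catches gaps that were named up front. Items such as "Spec assumption unconfirmed" would need input from the interpretation step as well.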

Without control mechanisms

Prompt → Code → Hope

With Agent Workbench

Intent → Spec → Code → Reality → Evidence → Feedback

Stable product semantics

The test script is no longer the source of truth. Product semantics are.

When a checkout breaks, agents need to answer whether the product still meets its specs. A green test isn't enough. webmate anchors that question to a stable Test Object that survives changes in tests, code, and UI.

The Test Object stays stable while tests, implementations, and interfaces evolve.

Anchor: Test Object

Attached:
  • Specs
  • Spec versions
  • Runtime States
  • User Flows & Interactions
  • Findings & Evidence
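
Read as a data structure, the Test Object could look roughly like the sketch below: a stable identifier with specs, spec versions, runtime states, user flows, and findings attached to it. All type and field names are assumptions made for illustration. The source only names the concepts.

```typescript
// Hypothetical shapes for the concepts listed above.
interface SpecVersion {
  version: number;
  text: string;
}

interface TestObject {
  id: string;                           // stable anchor, e.g. "checkout"
  specs: Record<string, SpecVersion[]>; // spec name -> its versions
  runtimeStates: string[];
  userFlows: string[];
  findings: string[];
}

// Look up the newest version of a named spec, if any exists.
function latestSpec(obj: TestObject, name: string): SpecVersion | undefined {
  let latest: SpecVersion | undefined;
  for (const v of obj.specs[name] ?? []) {
    if (!latest || v.version > latest.version) latest = v;
  }
  return latest;
}

const checkout: TestObject = {
  id: "checkout",
  specs: {
    "payment-methods": [
      { version: 1, text: "Card payments are accepted" },
      { version: 2, text: "Card and invoice payments are accepted" },
    ],
  },
  runtimeStates: ["cart filled", "payment pending"],
  userFlows: ["guest checkout"],
  findings: [],
};

console.log(latestSpec(checkout, "payment-methods")?.version); // prints 2
```

In this sketch, rewriting a test script or redesigning the UI leaves `checkout.id` and its attached specs untouched, which is the sense in which the anchor stays stable.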

Use cases

What agents can delegate to webmate

Validate a change

“Check whether this PR broke checkout.”

Investigate an issue

“Find out why login sometimes fails after redirect.”

Compare behavior

“Compare this release candidate with the previous baseline.”

Surface what wasn't tested

“Show me which user flows aren’t covered by the tests on this change.”

Let's talk

Tell us where you'd push back.

We're validating the Agent Workbench concept with QA and engineering leaders across DACH. If this problem sounds familiar, we'd like to talk. Especially if you think we've framed it wrong.

  • We're validating this concept with QA, engineering, and AI adoption leaders.
  • Exploring how teams will validate agent-assisted software.
  • Already using coding agents, or planning to? We want to hear how you think about validation, trust, evidence, and quality.

Join a 20-minute concept discussion

Exploratory, not sales. We're gathering perspectives from practitioners.

We respect your inbox. No newsletter, no sales sequence, just a conversation.