
32 Tests, Zero Dollars: Visual E2E Testing with a VLM Running on My Laptop

I replaced a broken Playwright setup with a vision-language model that looks at screenshots and tells me if the UI is right. 32 tests, 10 minutes, running entirely on Apple Silicon. No cloud. No flaky CI. No bill.


I've been building TAC for a while now. Chrome extension plus Rails SPA shipping together as one product. And for most of that time, the testing story for the most critical surface has been: hope it works.

TAC has two UI surfaces that ship together but live in completely different worlds. The Rails SPA is where users manage recordings, view their account, see their plan. The Chrome extension injects a floating recording widget via shadow DOM into every page the user visits. Teachers use it on Canvas, Google Classroom, Google Docs, Outlook. Testing either surface in isolation is fine. Playwright handles the SPA. The extension has unit tests. But testing them together — verifying the extension floater actually appears on the SPA, that it's clickable, that it's in the right position — has been a persistent gap.

The core problem: launchPersistentContext, Playwright's mechanism for loading unpacked extensions, crashes reliably on headless Ubuntu CI runners. We've tried the workarounds. They don't hold up.

Beyond the CI problem, there's a more fundamental issue. The floater lives inside a shadow DOM injected by the extension's content script. Playwright can technically pierce shadow DOM, but you need exact selectors and they're fragile. What we actually want to know is simpler: does the UI look right? Is the floater visible? Is the paid-plan badge showing correctly? Is the record button where it should be?

These are visual questions. So I started wondering whether a vision-language model could just answer them directly from screenshots.

It can. And running it locally turned out to be better than the cloud.

The Stack

Three tools make this work.

Holo3 is H Company's open-weights VLM, purpose-built for computer use and UI grounding. The 35B-A3B variant (35 billion total parameters, 3 billion active via sparse mixture-of-experts, based on Qwen3.5) hits 78.85% on OSWorld-Verified, which is current state-of-the-art for desktop computer-use tasks. Apache 2.0 licensed. Fits in ~21GB of unified memory on Apple Silicon via LM Studio. H Company also offers a free inference API tier (10 RPM, $0.25/1M input tokens, $1.80/1M output tokens) if you want to try it without local setup.

dev-browser by SawyerHood connects to a running Chrome instance over CDP and executes Playwright-compatible scripts in a QuickJS sandbox. The key move: instead of launching a new browser context where extensions don't load, we connect to a real Chrome that already has our extension installed. The floater is there because the extension is there.

A Python orchestrator (holo3-tac-eval.py) ties it together: it uses dev-browser to navigate and screenshot, sends the screenshots to Holo3 with structured prompts, and validates the responses against ground-truth assertions.

Building the Eval Harness

The harness defines tasks. Each task has a dev-browser setup script, a natural-language instruction for the VLM, a JSON schema for structured output, and optional ground truth assertions.

The extension floater task looks like this:

{
    "id": "extension-floater",
    "setup": """
        const page = await browser.getPage("tac-eval");
        await page.goto("{base_url}/app", { waitUntil: "networkidle", timeout: 15000 });
        await page.waitForTimeout(3000);
        const buf = await page.screenshot({ type: "png" });
        const path = await saveScreenshot(buf, "app-with-ext.png");
        console.log("screenshot:" + path);
    """,
    "instruction": "Look for a floating widget or overlay element injected "
                   "by a browser extension. It may be a small circular button "
                   "or recording widget near the edge of the viewport.",
    "schema": {
        "type": "object",
        "properties": {
            "floater_found": {"type": "boolean"},
            "description": {"type": "string"},
            "position": {"type": "string"},
            "x": {"type": "integer"},
            "y": {"type": "integer"},
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
        },
        "required": ["floater_found", "confidence"],
    },
    "ground_truth": {
        "floater_found": True,
        "x_range": [900, 1400],
        "y_range": [250, 550],
    },
}

The setup script navigates to the TAC app, waits for the extension's content script to inject the floater, and screenshots. The orchestrator base64-encodes the image and sends it to Holo3. For structured output, the remote API uses H Company's structured_outputs parameter. Local inference falls back to embedding the schema in the prompt and parsing the response JSON, which turns out to be perfectly reliable with Holo3.
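The local fallback is simpler than it sounds. Here is a minimal sketch of the schema-in-prompt approach; the function names are mine, not lifted from the harness:

```python
import json
import re

def build_prompt(instruction: str, schema: dict) -> str:
    """Embed the JSON schema in the prompt for models without native structured output."""
    return (
        f"{instruction}\n\n"
        "Respond with a single JSON object matching this schema, and nothing else:\n"
        f"{json.dumps(schema, indent=2)}"
    )

def parse_json_response(text: str) -> dict:
    """Extract the first JSON object from a model response, tolerating code fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        raise ValueError(f"no JSON object in response: {text[:200]!r}")
    return json.loads(match.group(0))
```

The lenient regex matters: even at temperature 0, some responses arrive wrapped in markdown fences, and stripping them in the parser is cheaper than fighting it in the prompt.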

The Bugs (The Best Part)

Every one of these was a 10-to-30-minute detour. Documenting them because they're the kind of thing that eats your afternoon if you don't know what to look for.

Double-Brace Escaping

The first version used Python's .format() for template substitution in JavaScript setup strings, with {{ }} to escape literal braces. At some point the code switched to .replace("{base_url}", base_url) and nobody updated the brace escaping. So {{ }} survived as literal double braces in the JavaScript: syntax error. Trivial once identified, invisible until then.
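The failure mode is easy to reproduce in isolation:

```python
template = 'await page.goto("{base_url}/app", {{ waitUntil: "networkidle" }});'

# Original approach: .format() collapses {{ }} to single braces, so the JS is valid.
formatted = template.format(base_url="http://localhost:3000")
assert "{ waitUntil" in formatted

# Later approach: .replace() only touches {base_url}; the {{ }} escapes leak through
# as literal double braces, which is a JavaScript syntax error.
replaced = template.replace("{base_url}", "http://localhost:3000")
assert "{{ waitUntil" in replaced
```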

Screenshot Path Handling

dev-browser's saveScreenshot() returns absolute paths on macOS. The script assumed relative filenames and tried joining with a base directory. Result: FileNotFoundError on a path like /Users/zak/.dev-browser/tmp//Users/zak/.dev-browser/tmp/app-main.png. Fixed by checking for a leading /:

from pathlib import Path

DEV_BROWSER_TMP = Path.home() / ".dev-browser" / "tmp"

for line in output.splitlines():
    if line.startswith("screenshot:"):
        raw = line.split(":", 1)[1].strip()
        # saveScreenshot() may return an absolute path; only prepend the tmp dir if relative
        src = Path(raw) if raw.startswith("/") else DEV_BROWSER_TMP / raw

Logo vs Record Button Confusion

First eval run looked perfect until we checked coordinates. Holo3 identified the "record button" roughly centered in the header. That's the TAC logo, which happens to be a microphone icon. The actual record button is the extension's floating widget near the right edge of the viewport.

The fix was prompt engineering: explicit negative instructions. "This is NOT the app logo in the header. It is a clickable mic/record button, likely a floating widget near the edge of the viewport injected by a browser extension." After that, correct identification every time.

Coordinate Space Mismatch

The most educational one. Screenshots on a Retina display are 2x CSS pixel dimensions: a 1352px-wide viewport produces a 2704px-wide PNG. Holo3 returns coordinates in CSS space because it's trained on UI interaction, thinking in terms of where you'd click.

Our ground truth ranges were calibrated to screenshot pixel dimensions. Everything failed until we measured the actual viewport via CDP:

console.log(window.innerWidth);   // 1352
console.log(devicePixelRatio);    // 2

All coordinate ranges had to be recalibrated. The lesson: always know which coordinate space your VLM is working in, and never assume it matches your screenshot resolution.
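When you do need to move between the two spaces explicitly, the conversion is a single multiplication. A tiny helper, illustrative rather than lifted from the harness:

```python
def screenshot_px_to_css(x: int, y: int, device_pixel_ratio: float = 2.0) -> tuple[int, int]:
    """Convert physical screenshot pixels (e.g. Retina 2x) to CSS pixels,
    the space Holo3 reports coordinates in."""
    return round(x / device_pixel_ratio), round(y / device_pixel_ratio)

def css_to_screenshot_px(x: int, y: int, device_pixel_ratio: float = 2.0) -> tuple[int, int]:
    """Inverse conversion, for drawing click markers on the raw PNG."""
    return round(x * device_pixel_ratio), round(y * device_pixel_ratio)
```

On our setup, the right edge of a 2704px-wide PNG maps back to CSS x=1352, which is exactly window.innerWidth.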

Structured Output Token Explosion

During one remote API call, the response took 327 seconds and consumed 61,767 output tokens. The structured output was valid JSON opening followed by approximately 60,000 whitespace characters. Cost: $0.11 for garbage. The model entered a degenerate state where it filled the output buffer with whitespace while technically maintaining valid JSON structure. Couldn't reproduce it reliably, which makes it worse: the kind of thing that silently blows up a CI budget.
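Since we couldn't reproduce it, the pragmatic defense is a cheap sanity check on the raw response before trusting it, plus a max_tokens cap on the request. A sketch, with the 0.5 threshold being an arbitrary choice of ours:

```python
def looks_degenerate(raw: str, max_whitespace_ratio: float = 0.5) -> bool:
    """Flag responses that are mostly whitespace padding around (possibly valid) JSON."""
    if not raw:
        return True
    ws = sum(1 for ch in raw if ch.isspace())
    return ws / len(raw) > max_whitespace_ratio
```

A normal structured response sits well under the threshold; the 60,000-space pathology trips it immediately, and the orchestrator can retry or fail fast instead of parsing garbage.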

Remote vs Local: Local Wins

We ran the full 6-task eval suite through both the remote H Company API and local LM Studio. The numbers:

| Metric | Remote (H Company API) | Local (LM Studio, Apple Silicon) |
|---|---|---|
| Model | holo3-35b-a3b | holo3-35b-a3b (21GB VRAM) |
| Avg latency per task | 2.6s (when working) | 4.6s |
| Rate limit | 10 RPM (7s sleep between tasks) | None |
| Wall clock (6 tasks) | ~77s best case | ~28s |
| Cost per run | ~$0.009 | $0.00 |
| Structured output | Native parameter | Schema-in-prompt + JSON parsing |
| Stability | Token explosion risk | Deterministic, stable |

Local won on every axis except raw per-task latency, and even there the difference vanishes once you factor in rate-limit sleeps.

LM Studio configuration on M-series Mac: GPU Offload 40 layers, Number of Experts 8 (correct top-k for Qwen3.5 MoE), Flash Attention ON, Unified KV Cache ON, Context Length 16384 minimum.

The argument for local inference goes beyond economics. The remote API has rate limits that slow your iteration cycle, a structured output bug that can burn tokens, and a dependency on someone else's uptime. Local inference is free, deterministic, and available at 3am when you're debugging a deploy. For an automated pipeline, those properties matter more than a 2-second latency advantage.

Ground Truth: Making It a Real Test

Ground truth validation is what separates a test harness from a party trick. Each task defines expected coordinate ranges, exact value matches, and substring checks:

def check_ground_truth(task, result):
    gt = task.get("ground_truth")
    if not gt:
        return []  # task defines no assertions; treat as pass
    r = result.get("result", {})
    failures = []
    # Coordinate assertions: reported value must fall inside [lo, hi]
    for coord in ("x", "y"):
        range_key = f"{coord}_range"
        if range_key in gt and coord in r:
            lo, hi = gt[range_key]
            if not (lo <= r[coord] <= hi):
                failures.append(f"{coord}={r[coord]} outside [{lo}, {hi}]")
    # Exact-match assertions
    for key in ("plan_name", "floater_found"):
        if key in gt and key in r:
            if r[key] != gt[key]:
                failures.append(f"{key}: expected {gt[key]!r}, got {r[key]!r}")
    # Substring assertions, e.g. plan_name_contains: "Teacher"
    for key, needle in gt.items():
        if key.endswith("_contains"):
            field = key[: -len("_contains")]
            if field in r and needle not in str(r[field]):
                failures.append(f"{field}: {r[field]!r} does not contain {needle!r}")
    return failures

Coordinate ranges need to be generous. VLMs have spatial bias — roughly a 300px pull toward center for elements near viewport edges. The floater at CSS x=1326 is consistently reported around x=985-1050. We widened ranges to [900, 1400], which accommodates the bias while catching gross errors like the model pointing at the logo at x=200.

For non-spatial assertions, exact matching works well. The model reliably identifies the expected paid-plan badge, correctly reports floater_found: true, and identifies navigation elements by description.

Beyond Screenshots: Computer-Use Agent Loops

The eval script proved the VLM could look at screenshots and answer questions. But Holo3 was trained to drive computers, not just describe them. The natural next step: close the loop. Screenshot, ask the VLM where to click, click there, screenshot again, verify.

const result = await executeAgentFlow(vlm, {
  name: 'play-recording',
  startUrl: `${BASE_URL}/app`,
  steps: [
    {
      findInstruction: 'Find the play button for an audio recording.',
      waitAfterClick: 2000,
      verifyInstruction: 'Is an audio recording currently playing?',
      verifySchema: { /* ... */ },
    },
  ],
});

No selectors, no data-testid attributes, no shadow DOM piercing. The VLM finds everything by looking at the screen, exactly like a human would. We shipped four agent flows: sidebar navigation (library to settings and back), audio playback, pricing page access, and copy-link. All four pass.

Per-step overhead is roughly 4-6 seconds of VLM inference plus 1-2 seconds of browser interaction. A four-step flow takes 25-30 seconds. Not fast enough for a unit test, but fast enough for post-deploy verification.

The navigation flow is the most interesting. It's a multi-step round trip: click from library to settings, verify the settings page loaded, click back to library, verify the library reappeared. The VLM has to correctly identify UI elements across page transitions, on pages it's never seen before in this session. It handles this without hesitation. The model doesn't memorize element positions from the first screenshot. It re-evaluates the screen from scratch each time, which means page transitions, loading states, and layout shifts don't throw it off the way cached selectors would.

Testing the Extension on Real Sites

This is the capability that didn't exist before. TAC's extension injects its floater on every page. Teachers use it on Canvas, Google Classroom, Google Docs. We've never been able to systematically test whether the floater renders correctly across these sites, because each has different layouts, z-index stacks, and DOM structures.

With the VLM harness, this becomes parameterized:

const SITES = [
  { name: 'Google Classroom', url: 'https://classroom.google.com' },
  { name: 'Canvas LMS', url: 'https://canvas.instructure.com' },
  { name: 'Google Docs', url: 'https://docs.google.com' },
  { name: 'Outlook', url: 'https://outlook.office.com' },
];

All four pass. The VLM identifies the floater as an extension-injected widget distinct from the host page, reports position, and flags whether it overlaps important content. If Google Classroom ships a layout change that pushes the floater off-screen, this test catches it.

One implementation detail: external sites with auth gates (Classroom redirects to Google login) can hang on Playwright's networkidle wait. We switched to domcontentloaded with a try-catch fallback. Screenshot whatever loaded, even a login page. The floater injects regardless of page content, so the test still validates what we care about. The VLM correctly ignores the login form and focuses on the floating widget, which is exactly what a human tester would do.

TAC Time Machine: Session Replay

When a VLM-driven flow fails, you need to see what the model saw. Staring at a test failure that says "expected is_settings_page to be true" tells you nothing. So we built a session replay viewer.

Each agent test writes a session.json (structured data with base64-embedded screenshots) and a session.html (self-contained viewer) to test-results/vlm-sessions/. The viewer has a horizontal filmstrip with thumbnails, a large main screenshot with a pulsing red circle at click coordinates, a VLM observation panel with confidence badges, and keyboard navigation to step through the flow.

The replay turned debugging from "re-run and stare at logs" to "scrub through the filmstrip and spot the problem in 10 seconds." In one case it immediately showed the model clicked "Account & Plan" instead of "Settings" because the instruction was ambiguous. Fix the prompt, move on.

Scaling Up

Once the core harness worked, we identified five extensions and built them in parallel using a swarm of agents. Each agent got a specific brief, read the existing fixtures, and shipped its piece independently. The results landed without conflicts.

The complete product loop E2E is the most important one. TAC's reason for existing is: record a voice note, share a link, the recipient hears it. If that flow is broken, the product is dead. Until now, no automated test covered the full circle. The product loop test drives Chrome through the entire flow: navigate to the app, VLM finds the extension's record button, clicks it, waits 3 seconds while a fake media stream records, VLM finds the stop button, clicks it, waits for upload, verifies the success state and share link, then opens the recording's dedicated page and verifies the player renders. Two tests, 1.4 minutes, full product loop covered.

The other four extensions: a viewport matrix (Chromebook 1024x768, laptop 1366x768, desktop 1920x1080 — nine tests catching responsive regressions), floater recording on real sites (not just visibility but actually triggering the recording UI on Google Docs), synthetic monitoring (the smoke gate on a scheduled cadence with team alerting on failure), and visual PR review (before/after screenshot pairs sent to the VLM for structured change descriptions, posted as PR comments).

The agent swarm is worth pausing on. Five independent coding agents, each given a specific brief: "read the existing fixtures, build this one extension, ship it." They ran in parallel, reading the same source files, writing to non-overlapping output paths. Total wall-clock time for all five was roughly the same as building one of them sequentially would have been.

This worked because the fixture architecture was modular from the start. vlm-eval.ts handles screenshot capture and VLM communication. vlm-agent.ts handles the click-verify loop. vlm-session-viewer.ts handles replay generation. Each agent composed with these primitives without needing to understand or modify them. No merge conflicts, no coordination overhead, no "wait, which agent is editing which file" problems.

The pattern generalizes. When people ask whether coding agents can do real work, the answer depends entirely on the codebase they're working in. Monoliths with tangled state don't parallelize. You'd spend more time on conflict resolution than on the actual work. But modular systems with clean interfaces and well-defined boundaries? Those parallelize beautifully. The agents don't need to be brilliant. They need the system to be structured so that bounded, independent contributions compose into something coherent. That's a design constraint worth optimizing for regardless of whether you're working with human engineers or AI agents.

The visual PR review extension deserves a specific mention. It screenshots key TAC surfaces, compares against a stored baseline from main, and sends before/after pairs to the VLM in a single API call. The model returns a structured diff: change summary, risk level, specific details. With --post-comment --pr 123, it posts the report directly to a GitHub PR. Every PR becomes a visually-reviewed PR, automatically. For surfaces with no changes, the VLM reports "No visual changes detected, risk: none." For actual changes, it describes exactly what moved, what's new, what's missing.

Final Numbers

| Suite | Tests | Time | Cost |
|---|---|---|---|
| Core visual eval | 6 | ~54s | $0.00 |
| Sites + state matrix + player | 8 | ~1.9m | $0.00 |
| Agent flows (computer-use) | 4 | ~1.5m | $0.00 |
| Product loop (record + share) | 2 | ~1.4m | $0.00 |
| Viewport matrix (3 sizes) | 9 | ~3m | $0.00 |
| Real-site recording (Docs) | 3 | ~1.5m | $0.00 |
| Post-deploy smoke | 6 | ~40s | $0.00 |
| Total | 32 | ~10m | $0.00 |

pnpm --filter @tac/test-runner test:vlm:all           # Everything
./scripts/post-deploy-vlm-smoke.sh                    # CI smoke gate
./scripts/vlm-monitor.sh --install                    # 15-min synthetic monitoring
./scripts/vlm-visual-pr-review.py --local             # Visual PR diff
./scripts/vlm-replay.sh latest                        # Open session replay

The DSL: Tests as YAML

After the agent swarm, we had 32 tests split across 6 TypeScript files, plus inline Python task dicts. Adding a new test meant editing code in two places. So we built a YAML DSL.

version: 1
name: identify-plan-badge
tags: [core, eval]
base_url: ${VLM_BASE_URL:-http://localhost:3000}

steps:
  - navigate: ${base_url}/app
    wait: 2000

  - screenshot: plan-badge.png
    ask: "What subscription plan is the user on?"
    schema:
      plan_name: string!
      confidence: enum(high, medium, low)!
    ground_truth:
      plan_name_contains: "Teacher"

One file per test. The DSL supports variables (${VAR:-default}), schema shorthand (string!, enum(a,b,c), array(string)), matrix parameterization for batch tests, and VLM-guided clicking (click.find: "Find the Settings link"). Both the TypeScript test runner and the Python eval harness load the same YAML files. Adding a new visual check is now: write a YAML file, run pnpm test:vlm:dsl.
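The schema shorthand expands mechanically into JSON Schema. A simplified sketch of that expansion; the real parser also handles array(...) and nesting, which this version ignores:

```python
import re

def expand_schema(shorthand: dict) -> dict:
    """Expand DSL shorthand (string!, bool!, enum(a, b, c)) into a JSON schema.
    Simplified sketch: a trailing ! marks a field as required."""
    props, required = {}, []
    for name, spec in shorthand.items():
        spec = spec.strip()
        if spec.endswith("!"):
            required.append(name)
            spec = spec[:-1]
        m = re.fullmatch(r"enum\((.*)\)", spec)
        if m:
            props[name] = {
                "type": "string",
                "enum": [v.strip() for v in m.group(1).split(",")],
            }
        else:
            # Map DSL type names onto JSON Schema type names
            props[name] = {"type": {"bool": "boolean", "int": "integer"}.get(spec, spec)}
    return {"type": "object", "properties": props, "required": required}
```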

The matrix feature is particularly useful. Testing the floater across four external sites is one file:

matrix:
  site:
    - { name: Google Classroom, url: "https://classroom.google.com", slug: classroom }
    - { name: Canvas LMS, url: "https://canvas.instructure.com", slug: canvas }
    - { name: Google Docs, url: "https://docs.google.com", slug: gdocs }
    - { name: Outlook, url: "https://outlook.office.com", slug: outlook }

steps:
  - navigate: ${site.url}
    wait: 3000
  - screenshot: floater-${site.slug}.png
    ask: "Is there a floating recording widget at the right edge of the viewport?"
    schema:
      floater_found: bool!
      confidence: enum(high, medium, low)!

Four tests from one YAML file. Seven total DSL tasks, all passing.
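Under the hood, matrix expansion is plain substitution: one concrete task per combination of axis values, with ${axis.key} placeholders filled in. A sketch of how such an expander might look, not the harness's actual code:

```python
import itertools

def expand_matrix(matrix: dict, steps: list) -> list:
    """Expand a matrix of axis values into one concrete task per combination,
    substituting ${axis.key} placeholders in every string field of the steps."""
    axes = list(matrix.items())
    tasks = []
    for combo in itertools.product(*(vals for _, vals in axes)):
        bindings = dict(zip((name for name, _ in axes), combo))

        def subst(value):
            if isinstance(value, str):
                for axis, item in bindings.items():
                    for key, v in item.items():
                        value = value.replace(f"${{{axis}.{key}}}", str(v))
                return value
            if isinstance(value, dict):
                return {k: subst(v) for k, v in value.items()}
            if isinstance(value, list):
                return [subst(v) for v in value]
            return value

        tasks.append(subst(steps))
    return tasks
```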

Making It Portable

The harness started TAC-specific but nothing about the core is product-specific. So we extracted it into a Claude Code plugin called visual-eyes. Any project can install it and get: /look slash command for quick screenshots, a verify-app agent for full visual verification, PostToolUse hooks that auto-screenshot after UI edits, a smoke gate script, session dashboard, and replay viewer. The project just needs to create a visual-eyes.json config pointing to its own URLs and a few YAML task files for its surfaces.

The most interesting hook is PostToolUse auto-screenshot. Every time you (or Claude) edit a CSS, JSX, or ERB file, the hook fires, maps the edited file to the affected page URL, takes a screenshot via dev-browser, and prints the path. Claude reads the screenshot and sees the change immediately. No asking, no remembering to check. The verification loop becomes automatic.
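The file-to-URL mapping is the only project-specific part of that hook. A hypothetical sketch: the paths and routes below are illustrative, and a real config would live in visual-eyes.json:

```python
from pathlib import Path

# Illustrative mapping, not TAC's actual config
FILE_URL_MAP = {
    "app/views/recordings": "/app",
    "app/views/settings": "/app/settings",
    "extension/src/floater": "/app",  # the floater renders on the app page too
}

def edited_file_to_url(path: str, base_url: str = "http://localhost:3000"):
    """Map an edited CSS/JSX/ERB file to the page most likely affected by the change."""
    p = Path(path)
    if p.suffix not in {".css", ".scss", ".jsx", ".tsx", ".erb"}:
        return None  # not a UI file; the hook does nothing
    for prefix, route in FILE_URL_MAP.items():
        if str(p).startswith(prefix):
            return base_url + route
    return base_url  # unknown UI file: screenshot the landing page as a default
```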

Try It Yourself

Prerequisites: Chrome (or Canary) with --remote-debugging-port=9222, LM Studio with holo3-35b-a3b loaded (21GB, context length 16384+), and dev-browser.

# Install dev-browser
npm install -g dev-browser && dev-browser install

# Launch Chrome with debug port
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

# Take your first screenshot
cat <<'SCRIPT' | dev-browser --connect
const page = await browser.getPage("test");
await page.goto("http://localhost:3000", { waitUntil: "domcontentloaded", timeout: 15000 });
await page.waitForTimeout(2000);
const buf = await page.screenshot({ type: "png" });
const p = await saveScreenshot(buf, "my-app.png");
console.log("path:" + p);
SCRIPT

# Ask the VLM about it
python3 -c "
import base64, json
from openai import OpenAI

b64 = base64.b64encode(open('$HOME/.dev-browser/tmp/my-app.png', 'rb').read()).decode()
client = OpenAI(base_url='http://localhost:1234/v1/', api_key='lm-studio')
resp = client.chat.completions.create(
    model='holo3-35b-a3b',
    messages=[{'role': 'user', 'content': [
        {'type': 'text', 'text': 'Describe this web page. What are the main sections? Any visual issues?'},
        {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
    ]}],
    temperature=0.0,
)
print(resp.choices[0].message.content)
"

That's it. You now have local, free, VLM-powered visual inspection of your web app. From here you can write YAML task files with ground truth assertions, wire up session replay, and automate the whole thing. The infrastructure is open source: dev-browser is MIT, Holo3 weights are Apache 2.0, LM Studio is free for local use.

What This Means

Thirty-two tests. Ten minutes. Zero dollars. Running entirely on the same laptop I develop on.

The pieces that make this possible are all recent: an open-weights VLM that fits on a laptop and genuinely understands UI layout, a CDP bridge that connects to real Chrome with real extensions, and structured output schemas that make VLM responses machine-parseable. None of these existed in usable form a year ago.

Prompt engineering turned out to be real work. The difference between "find the record button" (which finds the logo) and a carefully disambiguated instruction (which finds the floater) is the difference between a passing and failing test. Every ambiguity in your instruction is a potential false positive. But once you get the prompts right, they're stable. The same prompt produces the same result across runs because the model is deterministic at temperature 0 on local inference.

The spatial bias toward viewport center is a real limitation. Elements near edges get reported ~300px closer to center. Ranges need to accommodate this. It's not a dealbreaker — it's a calibration detail, like any other measurement instrument.

The thing I keep coming back to is what this represents as a milestone. I've used LLMs to write code, review PRs, draft content, triage support tickets. But this is the first time I've handed off an entire production workflow — visual QA across every surface of a real product — to a model running locally on my machine. The model looks at every screen, checks every element, reports every discrepancy. It runs on a schedule. It costs nothing. It doesn't get tired.

Holo3 is a 35B sparse model with 3B active parameters, and it's already state-of-the-art for computer use. The trajectory of model compression and hardware improvement suggests frontier reasoning models will fit in the same 21GB envelope within 18 months. By November 2026, the model running your visual QA suite might also be the one reviewing your architecture decisions and catching logical bugs in your business logic. Same laptop. Same $0.00 bill.

I keep a list of workflows I've fully delegated to local models. Code generation was first. Then PR review. Then support triage. Visual QA is the latest addition and in some ways the most significant, because it requires genuine multimodal understanding. The model isn't pattern-matching on text or filling in code templates. It's looking at a rendered UI, understanding spatial relationships, identifying interactive elements, and making judgment calls about whether what it sees matches what should be there. That's a qualitatively different capability than anything I was running locally a year ago.

That's not a prediction about AGI or whatever. It's an engineering observation: the gap between "cloud-only capability" and "runs on your dev machine" is closing faster than most testing infrastructure is evolving. If you're building a product with visual surfaces, the tools to test them without a QA team and without a cloud bill already exist. They're just new enough that most people haven't found them yet.


Built with Holo3 (H Company), dev-browser (SawyerHood), and LM Studio.

About the Author

Zak El Fassi
Builder · Founder · Systems engineer