tldr: Baseline testing captures a reference state (a performance number, an output hash, a visual rendering) and compares future runs against it. Common in performance testing, visual regression, and data validation. The trick is keeping the baseline current as the system legitimately changes.
How baselines work
A baseline test runs once on a known-good build and stores the result: a screenshot, a performance number, a hash of expected output. Future runs compare against the stored baseline. A difference is flagged.
This shifts the test design problem from "what is correct?" to "what changed?" Useful when correctness is hard to specify but consistency matters.
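The capture-then-compare loop can be sketched in a few lines. This is a minimal illustration, not any particular tool's API; the function name and the in-memory store are assumptions (a real suite would persist baselines to a file or a service).

```python
# In-memory baseline store for illustration only; real tools persist
# baselines to the repo or a backing service.
_baselines: dict = {}

def check_against_baseline(name, value):
    """First run on a known-good build captures the value; later runs compare."""
    if name not in _baselines:
        _baselines[name] = value  # known-good run: store the reference result
        return "captured"
    return "match" if _baselines[name] == value else "changed"
```

Note that the first run can only ever "capture", which is why baselines must be seeded from a build you trust.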
Where baselines are useful
Performance baselines
Today's homepage loads in 850ms. Save that. Tomorrow's run loads in 1.2s. The baseline catches the regression even though no specific assertion failed.
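A performance baseline usually allows some headroom rather than demanding an exact match. A minimal sketch, with the function name and the 20% tolerance as assumptions:

```python
def within_performance_budget(measured_ms, baseline_ms, tolerance=0.20):
    """Pass while the measurement stays within `tolerance` of the baseline."""
    return measured_ms <= baseline_ms * (1 + tolerance)
```

With an 850ms baseline and 20% headroom, 1.2s fails even though no hand-written assertion mentions 1.2s anywhere.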
Visual baselines
The login page screenshot from last week is the baseline. This week's screenshot is compared pixel by pixel. Differences flag visual changes.
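The pixel-by-pixel comparison reduces to counting positions where the two images disagree. A sketch that treats each screenshot as a flat list of RGB tuples (in practice you would decode real image files with an image library):

```python
def pixel_diff_ratio(pixels_a, pixels_b):
    """Fraction of pixel positions whose RGB values differ between two frames."""
    assert len(pixels_a) == len(pixels_b), "screenshots must be the same size"
    differing = sum(1 for a, b in zip(pixels_a, pixels_b) if a != b)
    return differing / len(pixels_a)
```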
Data baselines
A nightly job runs and produces a 10,000-row report. Hash the report. Tomorrow's hash should match unless the underlying data legitimately changed.
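Hashing a report gives a single comparable fingerprint for thousands of rows. A minimal sketch using SHA-256 (the delimiter scheme is an assumption; the important property is that the same rows in the same order always hash the same):

```python
import hashlib

def report_fingerprint(rows):
    """Deterministic SHA-256 over a report's rows; row order matters."""
    h = hashlib.sha256()
    for row in rows:
        h.update("|".join(map(str, row)).encode("utf-8"))
        h.update(b"\n")  # row separator so (1, "ab") and (1, "a", "b") differ
    return h.hexdigest()
```

Comparing two 64-character hashes is cheap; the expensive diff only happens when they disagree.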
Configuration baselines
The config of a test environment is captured. Drift from the baseline is flagged. Useful for detecting unauthorized changes.
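Drift detection is a three-way set comparison between the captured config and the current one. A sketch over plain dicts (real environments would serialize config from files or an API first):

```python
def config_drift(baseline, current):
    """Report keys that changed value, appeared, or disappeared since capture."""
    changed = {k for k in baseline.keys() & current.keys() if baseline[k] != current[k]}
    return {
        "changed": changed,
        "added": current.keys() - baseline.keys(),
        "removed": baseline.keys() - current.keys(),
    }
```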
The maintenance problem
Baselines decay. The system legitimately changes, the baseline becomes stale, and every test run flags a difference.
Two strategies.
Review and approve differences. A reviewer looks at the diff and clicks "approve" if the change was intentional; the baseline updates. Tools like Percy and Chromatic work this way.
Bake intentional changes into the test plan. When a feature changes the homepage layout, the test plan includes "update the visual baseline" as a step.
Without one of these, baseline tests rot fast.
When baselines fail
Pixel-perfect comparisons over-flag: sub-pixel rendering differences, font rendering variations, and anti-aliasing changes all trigger false positives. Use perceptual diff tools or threshold settings.
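A thresholded comparison ignores tiny per-channel wiggles and only fails when enough pixels move by a meaningful amount. A sketch with both thresholds as assumed defaults (tuning them is the real work):

```python
def within_threshold(pixels_a, pixels_b, channel_tolerance=8, max_diff_ratio=0.001):
    """Pass unless more than `max_diff_ratio` of pixels differ by more than
    `channel_tolerance` in any RGB channel (anti-aliasing-scale noise is ignored)."""
    differing = sum(
        1 for a, b in zip(pixels_a, pixels_b)
        if any(abs(ca - cb) > channel_tolerance for ca, cb in zip(a, b))
    )
    return differing / len(pixels_a) <= max_diff_ratio
```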
Performance baselines on noisy infrastructure. A page that takes 800ms on average can take 1200ms on a busy CI runner. Use percentiles and trends, not exact comparisons.
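Comparing a percentile of many samples against the baseline absorbs runner noise that a single measurement cannot. A sketch using nearest-rank P95, with the 10% tolerance as an assumption:

```python
import math

def p95(samples):
    """95th percentile by nearest rank; far more stable than any single run."""
    ordered = sorted(samples)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def regressed(samples, baseline_p95_ms, tolerance=0.10):
    """Flag only when the batch's P95 exceeds the baseline P95 plus headroom."""
    return p95(samples) > baseline_p95_ms * (1 + tolerance)
```

One 1200ms outlier in a batch of otherwise-normal runs no longer fails the build; a sustained shift does.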
Data baselines without versioning. When the schema legitimately changes, the baseline does too. Version the baseline alongside the schema.
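Keying the baseline by schema version means a legitimate schema change starts a fresh baseline instead of producing a permanent mismatch. A sketch with the key format and in-memory store as assumptions:

```python
# In-memory store for illustration; real baselines live alongside the schema
# in version control or a service.
_versioned_baselines: dict = {}

def check_report(report_name, schema_version, fingerprint):
    """Compare a report fingerprint against the baseline for this schema version."""
    key = f"{report_name}@schema-v{schema_version}"  # hypothetical key format
    if key not in _versioned_baselines:
        _versioned_baselines[key] = fingerprint  # new schema: capture fresh
        return "captured"
    return "match" if _versioned_baselines[key] == fingerprint else "changed"
```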
How AI testing handles baselines
Bug0 does visual regression by default: every test run captures screenshots, and future runs compare them against the baseline. The diff is reviewable, and approval updates the baseline. Maintenance stays manageable.
FAQs
Is baseline testing the same as visual regression?
Visual regression is a form of baseline testing focused on screenshots. Baselines apply to performance, output, and configuration too.
How often should baselines update?
When the underlying system intentionally changes. Otherwise, never automatically. Auto-updating baselines hides regressions.
What is a good performance baseline?
A 7-day rolling P95 of the metric on production-comparable infrastructure. Single-run baselines are too noisy.
How does Bug0 use baselines?
Bug0 captures visual baselines automatically. Differences between runs are surfaced for review, with approval updating the baseline.
