tldr: Most teams set up GitHub Actions, add unit tests, and call it "automated testing." Their CI is green. Their signup flow is broken on mobile. Here's how to build a pipeline that actually catches bugs, and what to do when maintaining it yourself stops making sense.


Your CI is green. Congratulations.

But what's actually running in that pipeline? I've put this question to engineering leads at dozens of SaaS companies. The answer is almost always the same: unit tests. Maybe a linter. Maybe type-checking.

No browser tests. No end-to-end coverage. Nothing that simulates a real user logging in, clicking through the dashboard, and completing the workflow your customers pay for.

The GitLab Global DevSecOps Report 2025 found that 82% of teams now deploy weekly. They're also losing an average of 7 hours per week to verification bottlenecks. GitLab calls this the "AI Paradox." Code ships faster. Testing hasn't caught up.

GitHub Actions runs whatever you give it. Give it echo "hello" and it reports success. Give it a test suite that only covers isolated functions, and it reports "all checks passed" while your checkout flow throws a 500 error. That green checkmark means your pipeline executed without errors. Your product might still be broken.
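To make that concrete, here's a hypothetical workflow that earns a green checkmark while testing nothing at all:

```yaml
# Hypothetical example: this check always passes, and proves nothing.
name: Looks like CI
on: [push]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - run: echo "hello"  # exits 0, so GitHub reports success
```

Nobody writes exactly this, but a unit-only suite that skips every user-facing flow is the same thing with extra steps.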

I believe most teams with "automated testing" don't actually have automated testing. They have automated unit testing. The distinction matters.


GitHub Actions is an orchestrator, not a testing tool

Quick primer for engineers setting this up for the first time.

GitHub Actions runs jobs on triggers. You define a workflow in YAML, tell it when to fire (push, pull request, cron schedule), and tell it what to execute. Here's the simplest version:

name: Tests
on: [pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm test

Twelve lines. Ten minutes to set up. This is where every tutorial stops. And this is where the interesting problems start, because npm test is doing the heavy lifting and nobody asks what it's actually testing.
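What npm test actually runs is defined in package.json. A typical scripts block looks something like this (script names are illustrative, not prescribed):

```json
{
  "scripts": {
    "test": "jest",
    "test:integration": "jest --config jest.integration.config.js",
    "test:e2e": "playwright test"
  }
}
```

If "test" only points at a unit runner, your pipeline only ever exercises units, no matter how polished the workflow YAML around it is.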


Unit tests pass. Users still hit bugs. Why?

Unit tests check isolated functions. calculateTotal(100, 0.2) returns 80. Good.

test('calculateTotal applies discount correctly', () => {
  const result = calculateTotal(100, 0.2);
  expect(result).toBe(80);
});
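For reference, a minimal implementation that satisfies this test might look like the following (hypothetical; real pricing logic is presumably more involved):

```javascript
// Hypothetical implementation behind the unit test above.
// Applies a fractional discount (0.2 = 20% off) to a price.
function calculateTotal(price, discountRate) {
  return price * (1 - discountRate);
}

console.log(calculateTotal(100, 0.2)); // 80
```

The point isn't that this function is suspect. It can be perfectly correct while the page that calls it is broken.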

That test tells you the math works. It tells you nothing about whether the checkout page renders, whether the discount input field accepts the value, or whether the success confirmation appears after payment. And the maintenance burden compounds: the Stack Overflow Developer Survey 2025 reports that 45% of developers find debugging AI-generated code more time-consuming than debugging human-written code. Add brittle test infrastructure on top of that and you're spending engineering cycles on maintenance instead of product.

The bugs users report live in the space between components. The button that doesn't trigger the API call. The form that validates on desktop but breaks at 375px. The redirect loop that only happens when you're logged out and hit a deep link. Unit tests can't see any of this. They were never designed to.

End-to-end testing fills that gap. Real browser. Real clicks. Real user flows. And it's the layer that most teams either never add to their GitHub Actions pipeline or add and then quietly disable within three months. For a full breakdown of how PR-level testing fits into a broader QA strategy, see our guide to pull request testing.


Setting up Playwright in GitHub Actions (the part that works)

Integration tests and E2E browser tests are where a GitHub Actions pipeline starts earning its keep. Here's what the production-ready workflows look like.

Integration tests with real services

name: Integration tests
on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test_db
          POSTGRES_USER: test
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm run test:integration
        env:
          DATABASE_URL: postgres://test:test@localhost:5432/test_db

The health check on Postgres is the detail that matters. Without it, your tests start before the database is ready. You get failures that look like flaky tests but are just infrastructure timing. Teams spend hours debugging ghosts.

End-to-end tests with Playwright

name: E2E tests
on:
  pull_request:
    branches: [main]

jobs:
  e2e:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx playwright install --with-deps chromium

      - name: Run Playwright tests
        run: npx playwright test
        env:
          BASE_URL: ${{ secrets.STAGING_URL }}

      - name: Upload report on failure
        uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: playwright-report
          path: playwright-report/
          retention-days: 7

Three things most tutorials don't mention:

--with-deps is critical. Without it, the browser binary installs but system-level dependencies like libgbm and libatk are missing. Your tests fail with cryptic shared library errors. You'll spend an hour on Stack Overflow before you find this flag.

timeout-minutes: 15 saves money. A hung browser process will burn your Actions quota for 60 minutes if you don't cap it. Set it tight.

Install only chromium, not all three browsers. Saves 2-3 minutes per run. Unless you specifically need cross-browser coverage on every PR, one browser is enough for smoke checks.
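In the Playwright config, restricting PR runs to one browser is a single projects entry. A sketch, assuming the standard defineConfig setup from @playwright/test:

```typescript
// playwright.config.ts -- one browser for PR smoke runs.
// Sketch; assumes @playwright/test is installed.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  timeout: 30_000,
  use: { baseURL: process.env.BASE_URL },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
  ],
});
```

Add Firefox and WebKit projects to the nightly config instead, where the extra minutes don't block anyone's merge.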

Sharding for speed

A 100-test Playwright suite runs sequentially in 15-20 minutes. Developers won't wait that long. They'll merge without looking at results.

jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1/4, 2/4, 3/4, 4/4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npx playwright test --shard=${{ matrix.shard }}
        env:
          BASE_URL: ${{ secrets.STAGING_URL }}

Four shards. Same total compute, 4x faster wall-clock time. Under 5 minutes. That's the threshold where developers actually wait.


Run the right tests at the right time

I see teams run their full E2E regression suite on every single PR. Slow, expensive, and most of those tests have nothing to do with the change being made.

PR smoke checks: 10-20 critical path tests. Login, signup, the one workflow that generates revenue. Under 5 minutes. These gate the merge. Tag them by putting @smoke in the test title so --grep can select them.

on:
  pull_request:
    branches: [main]
- run: npx playwright test --grep @smoke

Nightly regression: everything. Every test, every viewport, on a schedule (the cron below fires at 02:00 UTC daily). This catches the slow-burn regressions that accumulate across multiple PRs throughout the day.

on:
  schedule:
    - cron: '0 2 * * *'
- run: npx playwright test

Pre-release: full suite plus anything you'd be nervous about. Performance, edge cases, the checkout flow on a 4G connection. Your final gate.

The pattern: fast feedback on PRs, deep coverage on schedule. Match the depth of testing to the trigger that fired it.


The decay timeline nobody talks about

Here's what actually happens after you set all of this up. I've watched this play out repeatedly.

Week 1. Tests are green. The team celebrates. "We finally have real E2E coverage." Someone posts the green CI screenshot in Slack.

Month 2. The suite takes 18 minutes even with sharding. A developer opens a PR, sees tests running, context-switches. Results come back 20 minutes later. They've already moved on. Some start merging before tests finish…

Month 3. The design team moves the "Submit" button from the form footer to a sticky header. Three tests break. An engineer adds a comment: // TODO: fix after redesign settles. You know how this ends…

Month 5. CI is green. But 40% of E2E tests are disabled. The signup flow hasn't been tested in six weeks. A regression ships to production. A customer emails support.

The root causes:

Selectors rot. You write await page.click('[data-testid="submit-btn"]'). A component refactor renames that testid. Five tests break. Now multiply that by every sprint, every UI change, every feature flag toggle.

CI runners are slower than your laptop. A test that passes locally in 200ms times out in GitHub Actions, because hosted runners have a fraction of your workstation's CPU and memory. You add waitForTimeout(2000) as a patch. Then another. Then another. The suite balloons.

Environment drift. Tests pass against localhost with seed data. They fail against staging with production-like data, different feature flags, different CDN latency. Parity between environments is a full-time job nobody is staffed for.

The maintenance spiral. The Sonar State of Code Survey found that 38% of developers say reviewing AI-generated code requires more effort than reviewing human-written code. Stack that on top of maintaining a brittle test suite and engineers start asking the hard question: "Are these tests catching bugs, or are we just maintaining them?"

If the answer takes more than two seconds, the tests get deprioritized. For a deeper look at this maintenance tax, see our breakdown of why your engineering budget is $600K higher than you think.


Bug0 Studio: you describe the test, we run it

You've seen the YAML. Setting up the workflow takes an afternoon. Maintaining the Playwright scripts inside it takes 30-50% of engineering time, every sprint, indefinitely…

Bug0 Studio removes that layer.

You describe a test:

"Log in with valid credentials, navigate to the dashboard, create a new project, verify the project appears in the list."

Bug0 generates the test, runs it on our infrastructure, and posts the result back to your GitHub PR as a status check. You never write a Playwright script.

Bug0 Studio test creation screen showing a plain English test description being typed into the input field.

Bug0 Studio screen showing the steps generated from the input prompt.

You can also upload a video of yourself walking through a flow, or record your screen directly in the app. Whatever's faster for you. The output is the same: an executable test running on Bug0's cloud that reports back to your CI.

Self-healing is what changes the math

The "disabled temporarily" problem exists because Playwright scripts break when the UI changes and somebody has to manually fix the selector. That's the maintenance spiral.

Bug0 tests heal themselves. 90% of UI changes are handled without intervention. When the submit button moves from a form footer to a sticky header, the test adapts because it understands the intent of the flow, not just the CSS path to an element.

You get notified when the 10% that needs human attention comes up. The other 90% just works. That's why the decay timeline from the previous section doesn't apply.

Bug0 Studio test run results dashboard showing passed/failed test steps with self-healed selectors highlighted or flagged

How it plugs into your existing pipeline

Bug0 posts results as a GitHub PR check. It appears alongside your unit tests, your linter, your type-checker. You configure which suites gate which branches. Green means pass. Red blocks the merge.

[image here: GitHub pull request checks tab showing Bug0 E2E test results as a status check alongside other CI jobs like unit tests and linting]

Your existing GitHub Actions workflows don't change. Unit tests still run on GitHub's runners. Integration tests still hit your services. E2E coverage runs on Bug0's infrastructure and reports back. No browser install steps. No artifact storage. No Actions minutes burned on browser testing.

500+ tests in parallel. Results in under 5 minutes. Starting at $250/month.

Steven Tey at Dub (open-source link management platform) put it simply: "Since we started using Bug0, it helped us catch multiple bugs before they made their way to prod."


Bug0 Managed: when you don't want to think about testing at all

Studio handles test creation and maintenance. Managed goes further.

Some engineering leads don't want to manage test plans, triage failures, or decide what to cover next. They want to merge PRs and know the product works. They want someone else to own QA entirely.

That's what Bug0 Managed is.

A Forward-Deployed Engineer pod embeds into your team. They crawl your staging environment and map every critical flow: signup, onboarding, billing, the features your customers actually use. They build the test plan. They generate and maintain every test using the same AI engine as Studio. They join your standups. They sit in your Slack channel. When a test fails at 2 AM, they triage it and determine if it's a real bug, a flake, or an expected change before your engineers wake up.

They own release sign-offs. Before every deploy, the pod confirms: critical flows pass, regressions are flagged, the release is safe to ship.

Why this involves humans and not just AI

Trust in AI accuracy dropped to 29% in 2026, per Stack Overflow. Engineers don't want an autonomous system making the call on what's broken. Bug0 Managed runs AI for speed and self-healing, then has QA engineers review every run. They filter false positives. They escalate confirmed bugs with video, repro steps, and severity triage. 99% human-verified accuracy.

Jacob Lauritzen, Head of Engineering at Legora (legal AI tech company), said: "Bug0 gives us the speed of AI-native automation with the accuracy and self-healing of human QA."

The cost comparison

A fully-loaded QA engineer runs $130-150K/year. That's base salary plus benefits, taxes, overhead at 30-40%, another $5-15K for tooling, and $3-10K for cloud infrastructure.

Bug0 Managed is $2,500/month. $30K/year. You're not getting one engineer for less money. You're getting a pod with AI infrastructure, parallel execution, 24x5 availability, and the kind of coverage that would take months to build internally. 100% critical flow coverage in 7 days. 80% total coverage within 4 weeks. SOC 2 and ISO 27001 compliant.


FAQs

I already have unit tests in GitHub Actions. Is that enough?

Depends on what you're shipping. If your product is a CLI tool or a pure API, unit and integration tests might cover you. If users interact with your product through a browser, no. Unit tests structurally cannot catch UI regressions, broken navigation, or cross-page flow bugs. The bugs your customers report almost always live in the browser layer.

How do I actually speed up a slow Playwright suite in CI?

Two things work. First, shard with matrix strategy. --shard=1/4 through --shard=4/4 across four runners cuts wall-clock time by 75%. Second, tag tests as @smoke and only run critical paths on PRs. Save the full regression for nightly cron runs. If you're still over 5 minutes after both, you either have too many tests running per-PR or your tests need refactoring.

How much are GitHub Actions minutes actually costing me for E2E?

A Playwright suite of 50 tests on ubuntu-latest uses 20-40 minutes per run. GitHub charges $0.008/minute for Linux runners. At 20 PRs per day, that's $65-130/month just in E2E compute. With Bug0, E2E runs on Bug0's infrastructure. Zero Actions minutes consumed for browser testing.
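The arithmetic behind that estimate, assuming roughly 21 working days per month (rate and usage numbers are the ones from the paragraph above):

```javascript
// Back-of-envelope GitHub Actions cost for E2E runs.
// Assumptions: Linux runner rate $0.008/min, ~21 workdays/month.
function monthlyActionsCost(minutesPerRun, prsPerDay, workdays = 21, ratePerMin = 0.008) {
  return minutesPerRun * prsPerDay * workdays * ratePerMin;
}

console.log(monthlyActionsCost(20, 20).toFixed(2)); // "67.20"
console.log(monthlyActionsCost(40, 20).toFixed(2)); // "134.40"
```

Plug in your own PR volume; teams merging 50+ PRs a day land in a very different bracket.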

Why do my E2E tests keep breaking after UI changes?

Because Playwright scripts are bound to selectors, and selectors change every time the frontend team touches a component. A renamed data-testid, a restructured form, a moved button. Each one breaks tests that were working yesterday. Self-healing tests fix this by understanding the flow intent rather than the DOM path. Bug0's self-healing handles 90% of these changes automatically.

Can Bug0 work alongside my existing GitHub Actions setup?

Yes. Bug0 doesn't replace your pipeline. It adds a PR status check alongside your existing jobs. Your unit tests, linter, and build steps stay on GitHub's runners. Bug0 handles E2E on its own infrastructure and posts results back to the PR.

What's the difference between Studio and Managed?

Studio: you describe tests in plain English or upload videos. You decide what gets tested. Bug0 handles execution, self-healing, and reporting. $250/month.

Managed: a Forward-Deployed Engineer pod owns everything. They build the test plan, create tests, triage failures, join your standups, sit in your Slack, and sign off on releases. You don't manage QA at all. $2,500/month, which is 80% less than hiring one QA engineer.

Should I build my own E2E testing on Playwright or use Bug0?

If you have 2+ engineers who can own testing infrastructure long-term (not just build it, maintain it, respond to failures at 2 AM), and you have compliance requirements that prevent SaaS tools, build it yourself. For everyone else, the math is straightforward. DIY Playwright in CI costs $180K-$300K in year one engineering time. Bug0 starts at $3K/year. The question is where your engineers should spend their time.


Get started

Try Bug0 Studio if you want to own test creation without writing Playwright scripts. Describe tests in plain English, upload videos, or record your screen. Results post back to your GitHub PRs. Sign up free.

Book Bug0 Managed if you want a QA pod that owns everything. Forward-Deployed Engineers, 100% critical flow coverage in 7 days, release sign-offs handled. Request a demo.

View pricing details for both.