Root cause analysis (RCA) in testing

tldr: Root cause analysis (RCA) in testing finds why a bug happened, not just where. The classic methods (Five Whys, fishbone diagrams, and fault tree analysis), applied to defect investigations, prevent the same bug from recurring.


Why RCA matters more than fixing

Most teams fix the symptom. The login button does not work. They find the broken selector, fix it, ship it, close the ticket. Done.

The bug recurs three months later in a different form. Same root cause. Different surface.

RCA breaks that cycle. Instead of "fix the bug," the question becomes "fix the system that produced the bug."


When to run RCA

Not every bug. RCA takes time. Apply it where it pays off.

Three triggers:

Production incidents. Anything that affected real users. Even a minor incident is worth understanding.

Repeated defects. A bug class showing up across multiple releases. The recurrence itself is the signal.

High-severity escapes. Bugs that reached production despite the testing process. These reveal where coverage was missing.

For low-severity, one-off bugs, fix and move on. RCA on every defect overwhelms the team.


Method 1: Five Whys

The simplest method. Ask "why" five times.

Example. The order page crashed.

  1. Why? The frontend tried to render a null product.
  2. Why? The product API returned null instead of an error.
  3. Why? The product service was down and the gateway returned null on timeout.
  4. Why? The gateway has no error handling for downstream timeouts.
  5. Why? The team did not consider downstream timeouts during the gateway design.

The fix at level 1 is "handle null in the renderer." The fix at level 5 is "establish a downstream error-handling pattern across all gateway endpoints."

The deeper fix prevents many future bugs, not just this one.
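The chain above can be captured as a simple ordered list, where the last answer is the working root cause. A minimal sketch (the list-based structure is an illustration, not a standard format):

```python
# Each entry answers "why?" for the entry before it; the first entry
# answers "why?" for the observable failure.
whys = [
    "The frontend tried to render a null product.",
    "The product API returned null instead of an error.",
    "The product service was down and the gateway returned null on timeout.",
    "The gateway has no error handling for downstream timeouts.",
    "The team did not consider downstream timeouts during the gateway design.",
]

def root_cause(chain):
    """The deepest 'why' in the chain is the working root cause."""
    return chain[-1]

for depth, answer in enumerate(whys, start=1):
    print(f"{depth}. Why? {answer}")
print("Root cause:", root_cause(whys))
```

Keeping the chain as data, rather than prose in a ticket comment, makes it easy to tag later defects with the same root cause.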


Method 2: Fishbone (Ishikawa) diagram

Useful when the cause might be multi-factor.

Draw the bug at the head of a fish. Major branches: People, Process, Tools, Environment, Materials. List potential contributing factors on each branch.

The diagram forces consideration of every category. A bug attributed to "the engineer made a mistake" often turns out to be "the tooling allowed the mistake, the process did not catch it, the environment masked it, and the engineer was not trained on the edge case."

Most production incidents are multi-factor. Fishbone surfaces that.
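A fishbone diagram is, structurally, a map from branch to contributing factors. A minimal sketch, using the multi-factor example above (the factor wording is illustrative):

```python
# Branches follow the classic categories; empty branches are allowed and
# are themselves informative ("we ruled this category out").
fishbone = {
    "People": ["engineer not trained on the edge case"],
    "Process": ["review did not catch the mistake"],
    "Tools": ["tooling allowed the mistake"],
    "Environment": ["staging environment masked the failure"],
    "Materials": [],
}

def contributing_factors(diagram):
    """Flatten the diagram into (branch, factor) pairs, skipping empty branches."""
    return [(branch, f) for branch, factors in diagram.items() for f in factors]

for branch, factor in contributing_factors(fishbone):
    print(f"{branch}: {factor}")
```

Four branches with factors is typical for a real incident; a fishbone with only one populated branch is usually a sign the investigation stopped early.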


Method 3: Fault tree analysis

Used in safety-critical and regulated domains. Start with the failure event. Work backward through the chain of conditions that had to be true for it to happen. Mark each as AND or OR.

The output is a logic tree showing exactly which combinations of conditions produced the failure. Strong defenses break enough branches that the failure cannot recur.

Heavyweight. Worth it for incidents with significant impact.
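The AND/OR logic is mechanical enough to evaluate in code. A minimal sketch, encoding the order-page crash from the Five Whys example as an AND gate (the condition names are illustrative):

```python
# A node is either ("leaf", condition_name) or (gate, label, children)
# where gate is "AND" or "OR". Conditions map names to True/False.
def evaluate(node, conditions):
    kind = node[0]
    if kind == "leaf":
        return conditions.get(node[1], False)
    children = [evaluate(child, conditions) for child in node[2]]
    return all(children) if kind == "AND" else any(children)

crash = ("AND", "order page crash", [
    ("leaf", "product service timeout"),
    ("leaf", "gateway returns null on timeout"),
    ("leaf", "frontend renders null without a guard"),
])

all_true = {
    "product service timeout": True,
    "gateway returns null on timeout": True,
    "frontend renders null without a guard": True,
}
print(evaluate(crash, all_true))  # every condition holds: the failure occurs
# Fixing the gateway breaks one branch of the AND gate:
print(evaluate(crash, {**all_true, "gateway returns null on timeout": False}))
```

The payoff is the "strong defenses" check: for an AND gate, breaking any one branch prevents the failure; for an OR gate, every branch must be broken.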


What good RCA looks like

A useful RCA report has five sections.

  1. What happened. The observable failure, with timeline.
  2. Why it happened. Causal chain back to a root cause.
  3. Why it was not caught. What testing or process should have prevented it but did not.
  4. Action items. Specific, owned, time-boxed.
  5. What we learned. General principles that apply beyond this incident.

Most RCAs miss section 3. Without it, the team fixes the bug but not the missing safety net.
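The five sections can be enforced as a template rather than remembered. A sketch of the report as a data structure, with a completeness check that flags the commonly missed section (the field names mirror the list above and are an assumption, not an established schema):

```python
from dataclasses import dataclass, field

@dataclass
class RCAReport:
    what_happened: str              # observable failure, with timeline
    why_it_happened: str            # causal chain back to a root cause
    why_not_caught: str             # the section most RCAs miss
    action_items: list = field(default_factory=list)
    lessons: list = field(default_factory=list)

    def is_complete(self):
        """A report without section 3 or without owned action items is not done."""
        return bool(self.why_not_caught.strip()) and bool(self.action_items)
```

Wiring `is_complete` into the team's incident tooling turns "don't forget section 3" from a convention into a gate.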


RCA as a continuous process

A team that runs RCA on every meaningful defect builds a learning loop.

Patterns over time:

  • Bug classes start grouping.
  • Process improvements get implemented based on real evidence.
  • Test coverage gaps become visible and fillable.
  • Repeat bugs decline.

Without RCA, the same bugs keep appearing in new clothing. With it, each fix is permanent.
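The grouping step is easy to automate once each closed RCA tags the defect with a root-cause class. A minimal sketch (the tag names are made up for illustration):

```python
from collections import Counter

# One tag per closed RCA; repeated tags are the signal.
rca_tags = [
    "missing-timeout-handling",
    "selector-drift",
    "missing-timeout-handling",
    "race-condition",
    "missing-timeout-handling",
]

def recurring_classes(tags, threshold=2):
    """Bug classes seen at least `threshold` times are systemic, not one-off."""
    return [cls for cls, count in Counter(tags).items() if count >= threshold]

print(recurring_classes(rca_tags))  # -> ['missing-timeout-handling']
```

A class that crosses the threshold is exactly the "repeated defects" trigger from earlier: it earns a deeper, process-level fix rather than another point fix.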


How AI testing supports RCA

The hardest part of RCA is reproduction. Without a reliable reproduction, the team is debugging a story, not a bug.

AI testing platforms make this easier. Bug0 captures full traces, screenshots, network logs, and DOM snapshots on every failure. The artifact is the reproduction. RCA starts with concrete evidence instead of "the user said it was broken."


Common RCA mistakes

Stopping at the first plausible cause. The first answer to "why" is usually a symptom of a deeper cause. Keep going.

Blaming individuals. "The engineer wrote bad code" is rarely the root cause. Why did the system permit bad code to ship? That is the real question.

No follow-through. RCA produces action items. Action items without owners and deadlines are aspirational. Track them like any other work.

Skipping section 3. What testing should have caught this? If the answer is "we did not have a test for it," that is a coverage gap to fix.


FAQs

How is RCA different from defect analysis?

RCA looks at one specific defect deeply. Defect analysis looks at trends across many defects. They are complementary.

How long should RCA take?

Depends on severity. A minor production incident: 1 to 2 hours. A major outage: a multi-day investigation involving multiple teams.

Who runs RCA?

The team closest to the affected system. Engineering leads, QA leads, and SRE participate. Senior leadership reviews the output.

Should RCA reports be public?

Internally, yes. Sharing the learning across teams prevents the same issue elsewhere. Sometimes externally too: post-mortems published publicly build trust with customers.

How does Bug0 help with RCA?

Bug0 captures the full reproduction context for every failure: screenshots, network calls, DOM state, console logs. RCA starts with concrete evidence rather than guesswork.

Ship every deploy with confidence.

Bug0 gives you a dedicated AI QA engineer that tests every critical flow, on every PR, with zero test code to maintain. 200+ engineering teams already made the switch.

From $2,500/mo. Full coverage in 7 days.

Go on vacation.
Bug0 never sleeps.

Your AI QA engineer runs 24/7 — on every commit, every deploy, every schedule. Full coverage while you're off the grid.