Ryan Nelson's Blog

Five Whys + AI is a Surprisingly Good Debugging Combination

By Ryan Nelson

Toyota developed the "five whys" in the 1950s to build their production system. State the problem, ask why it happened, then ask why that happened, five times. The discipline is that you don't stop at the first answer, which is almost always just a description of what broke.

I've been using it with AI lately. Here's a real example.

My site-to-site VPN tunnel went down. Left to my own devices, I'd have checked the peer IP, fixed it, and been three steps into "let me just set the MTU to 1490" before finishing my coffee. Instead, I asked for five whys:

Problem: The site-to-site VPN tunnel between Kingston and Seattle went down and stayed down for an unknown period before we noticed.

Why 1: The ER605's IPsec config had the wrong peer IP — it was pointing to an old Seattle WAN address.

Why 2: The Seattle WAN IP changed. It's a PPPoE residential connection, and the ISP assigned a different address when the session was re-established.

Why 3: There was no mechanism to automatically update the ER605 config when the IP changed. The tunnel config stored a static IP for a fundamentally dynamic endpoint.

Why 4: When the tunnel was originally set up, nobody encoded the assumption "this IP will change" into the design. A DDNS hostname existed and was being maintained, but the ER605 config wasn't using it — the static IP was entered manually and left.

Why 5: There was no feedback loop between the failure and the operator. The tunnel going down produced no alert, so the drift between the configured IP and the real IP went undetected until something across the tunnel stopped working.

Root cause: The system had no monitoring to detect tunnel failures, so a configuration that required periodic manual maintenance could drift silently into a broken state.

What actually needs fixing: Not the IP entry, but the observability gap. An alert that fires when the tunnel stays in a connecting state for more than a few minutes would have surfaced this in minutes instead of hours. A cron job that updates the peer IP from the DDNS hostname is also good, but monitoring catches failure modes the automation doesn't cover.
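
To make the "more than a few minutes" rule concrete, here is a minimal sketch of the alert logic in Python. It is not the setup from this incident: the class name, the state strings, and the 300-second grace period are all assumptions for illustration, and how you read the tunnel state from the router is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TunnelWatch:
    """Hypothetical helper: alert once when the tunnel stays non-up too long."""

    grace_seconds: float = 300.0          # assumed value for "a few minutes"
    down_since: Optional[float] = None    # time of the first non-up poll
    alerted: bool = False                 # True once this outage was reported

    def observe(self, state: str, now: float) -> bool:
        """Record one poll result; return True only when an alert should fire."""
        if state == "up":
            # Tunnel is healthy again: reset so the next outage alerts afresh.
            self.down_since = None
            self.alerted = False
            return False
        if self.down_since is None:
            # First poll of this outage: start the clock, don't alert yet.
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.grace_seconds:
            self.alerted = True
            return True
        return False
```

Feed it one poll result per minute from whatever status source you have, and send the notification whenever observe() returns True. The grace period keeps a brief renegotiation from paging anyone, and the alerted flag keeps a long outage from paging you every minute.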

Fix the IP and move on, and three months later the Seattle WAN address changes again and you're back in the same spot with no idea how long it's been broken.
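
The drift from Why 3 and Why 4 is also cheap to detect on its own. A rough sketch, assuming the DDNS hostname is resolvable from wherever the check runs: the function name, the hostname, and the addresses below are made up for illustration, and the resolver is injectable so the comparison can be tried without touching real DNS.

```python
import socket
from typing import Callable


def peer_ip_drifted(
    configured_ip: str,
    ddns_hostname: str,
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> bool:
    """Return True when the DDNS name no longer resolves to the configured peer IP."""
    current_ip = resolve(ddns_hostname)  # IPv4 lookup by default
    return current_ip != configured_ip


if __name__ == "__main__":
    # Hypothetical values: a documentation-range IP and a placeholder hostname.
    # A stand-in resolver is used here so the example runs without DNS.
    drifted = peer_ip_drifted(
        "203.0.113.10",
        "seattle.example.net",
        resolve=lambda host: "203.0.113.25",
    )
    print("peer IP drifted" if drifted else "peer IP matches")
```

Run from cron, a check like this flags the mismatch before the tunnel is needed, though it only covers this one failure mode. The tunnel-state alert is still what catches everything else.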

Two things make AI a good partner for this specifically. It doesn't quit after the second answer — at 11pm when the fix is obvious, your brain wants to shortcut, and the AI just keeps asking the next question. And it doesn't get defensive. If you built the system, "why did you configure it this way" can trigger justification instead of diagnosis. The AI has no stake in the original decision.

The technique has been around for 70 years. The only new thing is having a thinking partner who will do all five.