Integration Runtime
Designing safe production workflows for integrations.
Type
Conceptual
Role
Product Designer
Duration
2 Weeks
Focus
Platform UX
Product design ⦿ Platform UX ⦿ Reliability systems ⦿ Failure recovery ⦿ SAAS
What's this about
Integrations don't fail loudly. They degrade quietly.
Auditing tools like Zapier, Workato, and Make revealed the same pattern — failures buried in logs or collapsed into a generic "something went wrong" state. Admins had data but no way to reason about it under pressure.
The real risk: when admins can't diagnose quickly, their first instinct is to edit configuration. That's the most common way a single failure becomes three.
What I tried first
Notification model
Rejected
The problem
Surface failures as notifications with severity levels. Familiar, low-friction. The problem: notifications are optimised for awareness, not action. Admins still had to navigate to config to understand the situation. Under pressure, that extra step is where mistakes happen.
The Shift
Before
After
Failures buried in logs or raw traces
Failures as structured, diagnosable objects
Generic error states with no context
Execution state always visible and current
Recovery and config on the same surface
Recovery scoped to the active issue only
Admins guessing what's safe to touch
Safe actions determined by system state

Design Decisions
State as the entry point
Surface execution state first — Running, Degraded, Paused, Rate-limited. State governs which actions appear. Unsafe options are never shown, not just disabled.
Cost: Removed flexibility some admins want — like force-resuming a paused integration. Confidence matters more than speed in a failure scenario.

Failures as structured objects
Each failure surfaces: what failed, why, what's affected, and whether data loss occurred. No log access required to understand the incident.
Cost: Structuring failures meant making assumptions about failure types. Novel or compound failures may not fit cleanly — a known gap that needs a fallback in production.

Constrained recovery paths
Recovery flows are issue-specific. Reconnecting auth doesn't expose trigger config. Retrying events doesn't allow editing actions. Each path addresses exactly one failure.
Cost: Power users wanted more control. The constraint held — most errors during incidents come from over-intervention, not under-action.

Configuration read-only by default
Production config defaults to inspection. An explicit action is required to enter edit mode, visually separated from all recovery flows.
Cost: Adds one extra step for legitimate config changes. A small cost that creates a clear break between fixing a failure and changing how the system works.

outcomes
No guesswork
Execution state and failure cause visible before any action is taken
Scoped recovery
Each fix flow addresses one issue — config stays untouched
Auto contained
Repeated failures trigger system pause before data is at risk
Full audit
Every action logged immutably — system and human

The hard part
The most impactful decisions weren't about what to add — they were about what to remove. Every option that felt helpful in a calm moment became a liability the moment pressure was high.
Learnings
Remove options, don't just disable them
Hidden unsafe actions still create cognitive load. If it's not safe, it shouldn't exist on the surface at all.
State is a better primitive than events
Events tell you something happened. State tells you what's true right now. Design for state.
Constrained paths build confidence
Admins made fewer mistakes when choices were scoped. Less freedom in a crisis is a feature, not a limitation.
Friction is sometimes the right answer
The config edit step adds one click. That one click prevents the most common incident escalation pattern.
WHAT I’D IMPROVE
This works for known failures. It breaks with multiple ones. When everything goes wrong at once, there’s no single path forward. I’d add a fallback: pause everything and make the system safe by default
See what else I built
Working on a complex product? I can help bring clarity to it.
I take on projects, part-time work, and full-time roles. send the details — we’ll figure it out.
vizuraja@gmail.com
CHAOS
→
CLARITY

V
I
S
W
A
You could have been anywhere on the internet, yet you’re here. thanks for visiting

