Eng + Support VP brief · turning a 9,000-case Free backlog into a self-resolving system
Problem: 9,000 of the 14,000 backlog cases are Self-Serve Free. These customers still buy add-ons, pay usage-based billing, and feed the upgrade funnel, so we can't drop support. But scaling headcount linearly with case volume is not economically viable.
Shift: Treat Free support as a demand-reduction + automated-resolution engineering problem, not a staffing problem. Push every contact down a cost gradient: prevent → self-heal → AI resolve → human exception. Each resolved case should make the next one cheaper (knowledge + signals compound).
flowchart TD
A["Incoming Free demand"] --> T0["TIER 0 — PREVENT<br/>case never happens"]
T0 -->|"residual"| T1["TIER 1 — SELF-HEAL<br/>customer fixes in dashboard"]
T1 -->|"unresolved"| T2["TIER 2 — AGENT LEE<br/>AI diagnoses + acts on Stripe"]
T2 -->|"exceptions only"| T3["TIER 3 — HUMANS<br/>refunds over policy, Stripe bugs, disputes, legal"]
T1 -.->|"always reachable"| T3
classDef c0 fill:#e6f4ea,stroke:#34a853
classDef c1 fill:#e8f0fe,stroke:#1a73e8
classDef c2 fill:#fff4e5,stroke:#f9ab00
classDef c3 fill:#fce8e6,stroke:#ea4335
class T0 c0
class T1 c1
class T2 c2
class T3 c3
Golden rule: the cheapest case is the one never raised. Cost per contact falls ~80× (roughly $8.00 to $0.10) moving from Tier 3 to Tier 1, and to near zero at Tier 0 (Gartner basis).
flowchart TD
C[Customer / Dashboard] --> R{"Entry Router<br/>intent + auth + plan/risk"}
R -->|prevent class| T0["Tier 0 — Prevent<br/>Stripe webhooks, dunning,<br/>proactive nudges, link auto-regen"]
R -->|self-serve eligible| T1["Tier 1 — Self-Heal UI<br/>Billing Health Meter<br/>1-click fixes"]
R -->|needs resolution| T2["Tier 2 — Agent Lee<br/>LLM tool-calling + RAG<br/>scoped Stripe actions"]
T1 -->|unresolved / low confidence| T2
T2 -->|"risk gate fails / low confidence"| T3["Tier 3 — Humans<br/>refunds over policy, Stripe corruption,<br/>disputes, fraud, legal"]
T1 -.->|always reachable| T3
T1 --> DIAG[Billing Diagnostics API]
T0 -.-> DIAG
T2 --> DIAG
T2 --> POL[Policy / Guardrail Engine]
T2 --> AUD[(Audit log)]
T2 --> KCS[KCS knowledge store]
DIAG --> STRIPE[(Stripe)]
subgraph Shared services
DIAG
POL
KCS
AUD
OBS[Observability: resolution rate, CES, drift]
end
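The routing in the diagram above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the production router: the `Contact` fields and the intent names in `PREVENT_CLASS` and `SELF_SERVE` are hypothetical placeholders, and real intents would come from the Pareto ranking in Phase 1.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PREVENT = 0    # Tier 0: case never happens
    SELF_HEAL = 1  # Tier 1: customer fixes it in the dashboard
    AGENT_LEE = 2  # Tier 2: AI diagnoses and acts
    HUMAN = 3      # Tier 3: human exception handling


# Hypothetical intent sets, for illustration only.
PREVENT_CLASS = {"expired_payment_link", "upcoming_renewal"}
SELF_SERVE = {"card_declined", "update_payment_method"}


@dataclass(frozen=True)
class Contact:
    intent: str
    authenticated: bool


def route(contact: Contact) -> Tier:
    """Entry router: pick the cheapest tier that can take the contact."""
    if contact.intent in PREVENT_CLASS:
        return Tier.PREVENT
    # Assumption: 1-click fixes need an authenticated dashboard session.
    if contact.intent in SELF_SERVE and contact.authenticated:
        return Tier.SELF_HEAL
    return Tier.AGENT_LEE


def escalate(tier: Tier) -> Tier:
    """Next tier up when the current one cannot resolve the case."""
    if tier is Tier.HUMAN:
        return Tier.HUMAN  # already at the top of the gradient
    return Tier(tier.value + 1)
```

Keeping `route` and `escalate` separate mirrors the diagram: the router only chooses an entry point, while every tier keeps an upward path that ends at a human.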
Billing Diagnostics API: one internal service that both the Self-Heal UI and Agent Lee call. It normalizes three sources (Stripe, subscription/entitlements, and OPE) into a single plain-language diagnosis, closing the #1 self-service failure mode (Gartner: 45% "company didn't understand my intent", 43% "no relevant content").
flowchart LR
UI["Billing Health Meter<br/>(Self-Heal UI)"] --> API
LEE["Agent Lee<br/>(AI resolver)"] --> API
API["Billing Diagnostics API<br/>normalizes sources into<br/>one plain-language diagnosis"]
API -->|"payment / invoice state"| STRIPE[("Stripe<br/>payments · invoices · dunning")]
API -->|"is the subscription there?"| SUB[("Subscription / Entitlements<br/>source-of-truth — TBD")]
API -->|"how much usage / what cost?"| OPE[("OPE — ClickHouse<br/>billable usage + observability")]
API --> OUT["Diagnosis output<br/>status · root_cause · explanation<br/>recommended_action · self_serve_eligible"]
classDef verified fill:#e6f4ea,stroke:#34a853,color:#000;
classDef tbd fill:#fff4e5,stroke:#f9ab00,color:#000;
class STRIPE,OPE verified; class SUB tbd;
OPE (Ordered Parallel Execution) is a ClickHouse usage/metering layer — it answers "how much usage / what cost", not "does a subscription exist". The subscription source-of-truth is still TBD. (User-provided definition; wiki was unreachable to verify.)
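The diagnosis output named in the diagram could take a shape like the sketch below. Only the five field names come from the brief. The `diagnose` function, its three already-normalized inputs, the status values other than `red`, and the explanation strings are assumptions made for illustration. Because the subscription source-of-truth is still TBD, an unknown subscription state is modeled as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Diagnosis:
    status: str                        # "red" appears in the brief; others assumed
    root_cause: Optional[str]
    explanation: str                   # plain-language text shown to the customer
    recommended_action: Optional[str]
    self_serve_eligible: bool


def diagnose(payment_error_code: Optional[str],
             subscription_active: Optional[bool],
             usage_cost: float) -> Diagnosis:
    """Fold three source readings into one diagnosis (hypothetical rules)."""
    if payment_error_code is not None:
        # Payment state (Stripe) takes priority: it is directly fixable.
        return Diagnosis("red", payment_error_code,
                         "Your last payment failed.",
                         "update_card_or_retry", True)
    if subscription_active is None:
        # Subscription source is TBD, so report uncertainty, not a guess.
        return Diagnosis("amber", "subscription_unknown",
                         "We could not confirm your subscription.",
                         "escalate", False)
    if not subscription_active:
        return Diagnosis("red", "no_active_subscription",
                         "No active subscription was found.",
                         "escalate", False)
    # Usage reading (OPE) only informs the healthy-state message here.
    return Diagnosis("green", None,
                     f"Billing is healthy. Usage so far: ${usage_cost:.2f}.",
                     None, False)
```

For example, `diagnose("insufficient_funds", True, 12.5)` yields a `red` diagnosis with `root_cause="insufficient_funds"` and `self_serve_eligible=True`, which matches the `{red, insufficient_funds, action}` reply in the self-heal sequence.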
sequenceDiagram
participant Stripe
participant T0 as Tier 0 (webhook)
participant Cust as Customer
participant UI as Billing Health Meter
participant Diag as Diagnostics API
Stripe->>T0: invoice.payment_failed
T0->>T0: classify failure, regenerate expired link
T0->>Cust: proactive nudge + fix CTA
Cust->>UI: opens dashboard
UI->>Diag: get_billing_diagnosis(customer_id)
Diag->>Stripe: read PaymentIntent.last_payment_error
Diag-->>UI: {red, insufficient_funds, action}
UI-->>Cust: "Card declined — update card / retry"
Cust->>UI: 1-click retry
UI->>Stripe: confirm payment (idempotent)
Stripe-->>UI: success
UI-->>Cust: resolved — zero human touch
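The Tier 0 step in the sequence above (classify the failure, then nudge) might look like the following sketch. It deliberately makes no Stripe API calls: the event is a plain dict shaped like a Stripe `invoice.payment_failed` event, and the `last_payment_error` dict is assumed to have been read from the PaymentIntent beforehand. The function names, the action labels, and the two code sets are hypothetical and cover only a small subset of decline reasons. Link regeneration is left out.

```python
from typing import Optional, Set

# Illustrative subsets only; a real mapping would cover many more codes.
RETRYABLE = {"insufficient_funds", "try_again_later"}
CARD_FIX = {"expired_card", "incorrect_cvc"}


def classify_failure(last_payment_error: Optional[dict]) -> str:
    """Map a payment error to a customer-facing action label."""
    if not last_payment_error:
        return "contact_support"
    # Prefer the more specific decline_code, fall back to the error code.
    code = (last_payment_error.get("decline_code")
            or last_payment_error.get("code"))
    if code in RETRYABLE:
        return "retry"
    if code in CARD_FIX:
        return "update_card"
    return "contact_support"


def handle_event(event: dict,
                 last_payment_error: Optional[dict],
                 seen_event_ids: Set[str]) -> Optional[dict]:
    """Return a nudge payload, or None if the event should be ignored."""
    if event.get("type") != "invoice.payment_failed":
        return None
    # Webhooks can be delivered more than once, so dedupe on event id.
    if event["id"] in seen_event_ids:
        return None
    seen_event_ids.add(event["id"])
    invoice = event["data"]["object"]
    return {
        "customer_id": invoice["customer"],
        "action": classify_failure(last_payment_error),
    }
```

Deduplicating on the event id keeps a redelivered webhook from sending the customer a second nudge, which is the same idea as the idempotent retry in the diagram.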
sequenceDiagram
participant Cust as Customer
participant Lee as Agent Lee
participant Pol as Policy Engine
participant Stripe
participant Aud as Audit Log
participant Human
Cust->>Lee: "I was charged, I want a refund"
Lee->>Pol: issue_refund(charge_id, amount)
alt within policy (<= cap & eligible)
Pol-->>Lee: approved
Lee->>Stripe: refund (idempotency key)
Stripe-->>Lee: refunded
Lee->>Aud: log actor=agent-lee, decision, outcome
Lee-->>Cust: refund confirmed
else exceeds policy
Pol-->>Lee: denied (needs approval)
Lee->>Human: escalate with full context
Human-->>Cust: resolves (no re-auth)
end
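The policy gate in the refund sequence can be illustrated with a short sketch. The brief only says "<= cap & eligible", so the $20 cap, the 30-day window, the field names, and the key format below are all hypothetical stand-ins for whatever the real policy defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundRequest:
    charge_id: str
    amount_cents: int
    charge_amount_cents: int
    days_since_charge: int


@dataclass(frozen=True)
class Decision:
    approved: bool
    reason: str
    idempotency_key: str


def evaluate_refund(req: RefundRequest,
                    cap_cents: int = 2000,      # hypothetical cap
                    window_days: int = 30) -> Decision:  # hypothetical window
    """Approve only refunds that are valid, recent, and under the cap."""
    # Deterministic key: retrying the same refund reuses the same key.
    key = f"refund:{req.charge_id}:{req.amount_cents}"
    if req.amount_cents <= 0 or req.amount_cents > req.charge_amount_cents:
        return Decision(False, "invalid_amount", key)
    if req.days_since_charge > window_days:
        return Decision(False, "outside_window", key)
    if req.amount_cents > cap_cents:
        return Decision(False, "over_cap", key)
    return Decision(True, "within_policy", key)


def audit_record(req: RefundRequest, decision: Decision) -> dict:
    """Build the audit entry; a denial means escalation, not a dead end."""
    return {
        "actor": "agent-lee",
        "charge_id": req.charge_id,
        "amount_cents": req.amount_cents,
        "decision": "approved" if decision.approved else "escalated",
        "reason": decision.reason,
    }
```

Deriving the idempotency key from the charge and amount means a retried call after a timeout cannot issue a second refund for the same request. Every denial is recorded as `escalated` so the human who picks it up sees why the agent stopped.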
gantt
title Delivery roadmap (phased)
dateFormat YYYY-MM-DD
axisFormat %b
section Phase 1 — Bridge (0–30d)
Free Pod drains 9k backlog :p1, 2026-07-01, 30d
Pareto-rank intents (Salesforce) :2026-07-01, 21d
section Phase 2 — Prevent (30–90d)
Billing Diagnostics API :p2, after p1, 60d
Tier 0 webhooks + link auto-regen :after p1, 45d
section Phase 3 — Resolve (90–180d)
Billing Health Meter UI :p3, after p2, 90d
Agent Lee on top-3 intents :after p2, 90d
section Phase 4 — Scale (180d+)
Expand scope + upgrade prompts :after p3, 90d
Retire temporary Free Pod :after p3, 45d
| State | Share of volume | Cases | Cost per case | Monthly cost |
|---|---|---|---|---|
| Today (all human) | 100% human | 9,000 | $8.00 | $72,000 |
| Tier 0 prevent | 30% | 2,700 | ~$0 | $0 |
| Tier 1 self-heal | 20% | 1,800 | $0.10 | $180 |
| Tier 2 AI | 35% | 3,150 | $0.20 | $630 |
| Tier 3 human | 15% | 1,350 | $8.00 | $10,800 |
| Future total | 100% | 9,000 | blended | $11,610 |
Illustrative only: plug in real monthly inflow and dunning data. Separately, there is a recovered-revenue lever: fixing silent failed payments protects UBB and add-on revenue (e.g. ~$45k/mo, which corresponds to 9,000 × 25% recoverable × $20 avg).
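The table arithmetic can be reproduced with a few lines, which also makes it easy to swap in real inflow numbers. The shares and unit costs are copied from the table. The 9,000 base for the recovered-revenue figure is inferred from the arithmetic ($45k ÷ 25% ÷ $20), not stated in the brief. Amounts are kept in integer cents to avoid floating-point rounding.

```python
# (label, share of volume in %, unit cost in cents), copied from the table
FUTURE_MIX = [
    ("Tier 0 prevent", 30, 0),
    ("Tier 1 self-heal", 20, 10),
    ("Tier 2 AI", 35, 20),
    ("Tier 3 human", 15, 800),
]


def monthly_cost_cents(volume: int, mix) -> int:
    """Total monthly cost in cents for a given case volume and tier mix."""
    if sum(share for _, share, _ in mix) != 100:
        raise ValueError("tier shares must sum to 100%")
    return sum(volume * share // 100 * unit for _, share, unit in mix)


VOLUME = 9000
today = VOLUME * 800                              # all-human baseline
future = monthly_cost_cents(VOLUME, FUTURE_MIX)
savings = today - future

print(f"today:   ${today / 100:,.0f}")            # today:   $72,000
print(f"future:  ${future / 100:,.0f}")           # future:  $11,610
print(f"savings: ${savings / 100:,.0f} ({savings / today:.1%})")
# savings: $60,390 (83.9%)

# Recovered revenue: 25% of an assumed 9,000 base at $20 average.
recovered = VOLUME * 25 // 100 * 2000
print(f"recovered revenue: ${recovered / 100:,.0f}/mo")
# recovered revenue: $45,000/mo
```

Tier 3 still accounts for $10,800 of the $11,610 future total, so the human share of volume is the most sensitive input in the model.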
Proven at scale: Klarna's action-taking AI did the work of ~700 agents, matched human CSAT, cut repeat contacts by 25%, and cut resolution time from 11 to 2 minutes. Intercom Fin averages ~76% resolution across 12,000+ customers.
The one rule we can't break: never a dead end. Any low-confidence, policy-fail, or error path routes to a human with full context (Gartner: easy escalation → 74% still self-serve next time). This avoids the billing@ → auto-reply → dashboard loop that triggered a recent public complaint.
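One way to enforce the no-dead-end rule is to wrap every automated attempt so that each failure mode produces a handoff rather than silence. The sketch below is an assumption about how that could be structured. `Outcome`, `Handoff`, `resolve_or_escalate`, and the 0.7 confidence threshold are hypothetical names and values, not part of the brief.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Outcome:
    resolved: bool
    confidence: float      # 0..1, as reported by the automated tier
    policy_ok: bool = True


@dataclass(frozen=True)
class Handoff:
    case_id: str
    reason: str
    context: dict          # full context, so the customer never repeats it


def resolve_or_escalate(case_id: str,
                        context: dict,
                        attempt: Callable[[dict], Outcome],
                        min_confidence: float = 0.7) -> Optional[Handoff]:
    """Run an automated attempt; return None only if it truly resolved."""
    try:
        outcome = attempt(context)
    except Exception as exc:
        # Any error in the automated path becomes a human handoff.
        return Handoff(case_id, f"error: {exc}", context)
    if not outcome.policy_ok:
        return Handoff(case_id, "policy_fail", context)
    if outcome.confidence < min_confidence:
        return Handoff(case_id, "low_confidence", context)
    if not outcome.resolved:
        return Handoff(case_id, "unresolved", context)
    return None
```

The function returns `None` only when the case is resolved with sufficient confidence and within policy. Every other path returns a `Handoff` that carries the original context forward.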