Self-Serve Free Queue — Strategy Overview

Eng + Support VP brief · turning a 9,000-case Free backlog into a self-resolving system

The problem & the shift

Problem: 9,000 of the 14,000 backlog cases are Self-Serve Free. These customers still buy add-ons and usage-based billing (UBB) and feed the upgrade funnel, so we can't drop support, but scaling headcount linearly with case volume is economically broken.

Shift: Treat Free support as a demand-reduction + automated-resolution engineering problem, not a staffing problem. Push every contact down a cost gradient: prevent → self-heal → AI resolve → human exception. Each resolved case should make the next one cheaper (knowledge + signals compound).

The Demand Pyramid — push work to the cheapest layer

TIER 0 · PREVENT: Fix root causes so the case never happens. ~$0.10/case.
TIER 1 · SELF-HEAL: Billing Health Meter, 1-click fixes. ~$0.10/case.
TIER 2 · AI RESOLVE: Agent Lee reads Stripe & acts. ~$0.20/case.
TIER 3 · HUMANS: Refunds, Stripe bugs, disputes, legal. ~$8.00/case.
flowchart TD
    A["Incoming Free demand"] --> T0["TIER 0 — PREVENT<br/>case never happens"]
    T0 -->|"residual"| T1["TIER 1 — SELF-HEAL<br/>customer fixes in dashboard"]
    T1 -->|"unresolved"| T2["TIER 2 — AGENT LEE<br/>AI diagnoses + acts on Stripe"]
    T2 -->|"exceptions only"| T3["TIER 3 — HUMANS<br/>refunds over policy, Stripe bugs, disputes, legal"]
    T1 -.->|"always reachable"| T3
    classDef c0 fill:#e6f4ea,stroke:#34a853
    classDef c1 fill:#e8f0fe,stroke:#1a73e8
    classDef c2 fill:#fff4e5,stroke:#f9ab00
    classDef c3 fill:#fce8e6,stroke:#ea4335
    class T0 c0
    class T1 c1
    class T2 c2
    class T3 c3

Golden rule: the cheapest case is the one never raised. Cost per contact falls ~80× moving from Tier 3 to Tier 0/1 (Gartner basis).

Target architecture (control-plane view)

flowchart TD
    C[Customer / Dashboard] --> R{Entry Router<br/>intent + auth + plan/risk}
    R -->|prevent class| T0[Tier 0 — Prevent<br/>Stripe webhooks, dunning,<br/>proactive nudges, link auto-regen]
    R -->|self-serve eligible| T1[Tier 1 — Self-Heal UI<br/>Billing Health Meter<br/>1-click fixes]
    R -->|needs resolution| T2[Tier 2 — Agent Lee<br/>LLM tool-calling + RAG<br/>scoped Stripe actions]
    T1 -->|unresolved / low confidence| T2
    T2 -->|"risk gate fails / low confidence"| T3["Tier 3 — Humans<br/>refunds over policy, Stripe corruption,<br/>disputes, fraud, legal"]
    T1 -.->|always reachable| T3
    T1 --> DIAG[Billing Diagnostics API]
    T0 -.-> DIAG
    T2 --> DIAG
    T2 --> POL[Policy / Guardrail Engine]
    T2 --> AUD[(Audit log)]
    T2 --> KCS[KCS knowledge store]
    DIAG --> STRIPE[(Stripe)]
    subgraph Shared services
        DIAG
        POL
        KCS
        AUD
        OBS[Observability: resolution rate, CES, drift]
    end

Diagnostics API — the keystone

One internal service both the Self-Heal UI and Agent Lee call. It normalizes three sources into a single plain-language diagnosis — closing the #1 self-service failure mode (Gartner: 45% "company didn't understand my intent", 43% "no relevant content").

flowchart LR
    UI["Billing Health Meter<br/>(Self-Heal UI)"] --> API
    LEE["Agent Lee<br/>(AI resolver)"] --> API
    API["Billing Diagnostics API<br/>normalizes sources into<br/>one plain-language diagnosis"]
    API -->|"payment / invoice state"| STRIPE[("Stripe<br/>payments · invoices · dunning")]
    API -->|"is the subscription there?"| SUB[("Subscription / Entitlements<br/>source-of-truth — TBD")]
    API -->|"how much usage / what cost?"| OPE[("OPE — ClickHouse<br/>billable usage + observability")]
    API --> OUT["Diagnosis output<br/>status · root_cause · explanation<br/>recommended_action · self_serve_eligible"]
    classDef verified fill:#e6f4ea,stroke:#34a853,color:#000;
    classDef tbd fill:#fff4e5,stroke:#f9ab00,color:#000;
    class STRIPE,OPE verified;
    class SUB tbd;
Open item: OPE (Ordered Parallel Execution) is a ClickHouse usage/metering layer — it answers "how much usage / what cost", not "does a subscription exist". The subscription source-of-truth is still TBD. (User-provided definition; wiki was unreachable to verify.)
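
To make the output contract concrete, here is a minimal Python sketch of the diagnosis record. The five field names come from the diagram above; the types, the validation rule, and the example values are assumptions for illustration, not the service's actual interface.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Diagnosis:
    """One plain-language diagnosis, shared by the Self-Heal UI and Agent Lee."""
    status: str                        # e.g. "red"; the full value set is assumed
    root_cause: Optional[str]          # machine-readable reason, None when healthy
    explanation: str                   # customer-facing sentence
    recommended_action: Optional[str]  # hypothetical action identifier
    self_serve_eligible: bool

    def __post_init__(self) -> None:
        # Assumed rule: a self-serve diagnosis with no action would strand the customer.
        if self.self_serve_eligible and not self.recommended_action:
            raise ValueError("self-serve diagnosis needs a recommended_action")


example = Diagnosis(
    status="red",
    root_cause="insufficient_funds",
    explanation="Card declined: insufficient funds.",
    recommended_action="update_card_or_retry",
    self_serve_eligible=True,
)
payload = asdict(example)  # plain dict, ready to serialize for either caller
```

Keeping the record immutable and validated at construction means both consumers can trust that an eligible diagnosis always carries something the customer can act on.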

How it feels in practice

Flow 1 — Failed-payment self-heal (no case raised)

sequenceDiagram
    participant Stripe
    participant T0 as Tier 0 (webhook)
    participant Cust as Customer
    participant UI as Billing Health Meter
    participant Diag as Diagnostics API
    Stripe->>T0: invoice.payment_failed
    T0->>T0: classify failure, regenerate expired link
    T0->>Cust: proactive nudge + fix CTA
    Cust->>UI: opens dashboard
    UI->>Diag: get_billing_diagnosis(customer_id)
    Diag->>Stripe: read PaymentIntent.last_payment_error
    Diag-->>UI: {red, insufficient_funds, action}
    UI-->>Cust: "Card declined — update card / retry"
    Cust->>UI: 1-click retry
    UI->>Stripe: confirm payment (idempotent)
    Stripe-->>UI: success
    UI-->>Cust: resolved — zero human touch
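
The classification and retry steps in Flow 1 can be sketched as two pure functions. This is an illustration under assumptions: `classify_failure`, `FIXES`, `retry_idempotency_key`, and the action names are hypothetical, and the input is treated as a plain dict shaped like a payment error carrying `code` and `decline_code` keys.

```python
from typing import Optional

# Assumed mapping from failure reason to a customer-facing fix.
# Reasons missing from this table are not treated as self-serve.
FIXES = {
    "insufficient_funds": "retry_or_update_card",
    "expired_card": "update_card",
}


def classify_failure(last_payment_error: Optional[dict]) -> dict:
    """Turn a payment error payload into a nudge plan (no I/O)."""
    if not last_payment_error:
        return {"status": "green", "root_cause": None, "action": None}
    # Prefer the specific decline reason, then fall back to the general code.
    reason = (
        last_payment_error.get("decline_code")
        or last_payment_error.get("code")
        or "unknown"
    )
    action = FIXES.get(reason, "escalate_to_human")
    return {"status": "red", "root_cause": reason, "action": action}


def retry_idempotency_key(customer_id: str, invoice_id: str) -> str:
    """Same customer and invoice always yield the same key, so a double click
    on the 1-click retry maps to one payment attempt on the provider side."""
    return f"retry:{customer_id}:{invoice_id}"
```

Unknown reasons fall through to a human rather than to a generic error message, which keeps this path consistent with the "never a dead end" rule.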
    

Flow 2 — Agent-issued refund with policy gate

sequenceDiagram
    participant Cust as Customer
    participant Lee as Agent Lee
    participant Pol as Policy Engine
    participant Stripe
    participant Aud as Audit Log
    participant Human
    Cust->>Lee: "I was charged, I want a refund"
    Lee->>Pol: issue_refund(charge_id, amount)
    alt within policy (<= cap & eligible)
        Pol-->>Lee: approved
        Lee->>Stripe: refund (idempotency key)
        Stripe-->>Lee: refunded
        Lee->>Aud: log actor=agent-lee, decision, outcome
        Lee-->>Cust: refund confirmed
    else exceeds policy
        Pol-->>Lee: denied (needs approval)
        Lee->>Human: escalate with full context
        Human-->>Cust: resolves (no re-auth)
    end
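
A minimal sketch of the policy gate in Flow 2, assuming a single per-refund cap and a precomputed eligibility flag. `RefundPolicy`, `check_refund`, `handle_refund`, and the reason strings are hypothetical names. Writing an audit entry for denials as well as approvals is an assumption that goes slightly beyond the diagram, which only shows the approved path being logged.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RefundPolicy:
    cap_cents: int  # hypothetical per-refund cap


@dataclass(frozen=True)
class Decision:
    approved: bool
    reason: str


def check_refund(policy: RefundPolicy, amount_cents: int, eligible: bool) -> Decision:
    """Approve only refunds that are valid, eligible, and within the cap."""
    if amount_cents <= 0:
        return Decision(False, "invalid_amount")
    if not eligible:
        return Decision(False, "not_eligible")
    if amount_cents > policy.cap_cents:
        return Decision(False, "over_cap")
    return Decision(True, "within_policy")


def handle_refund(policy: RefundPolicy, charge_id: str, amount_cents: int,
                  eligible: bool, audit_log: List[dict]) -> str:
    """Gate the request, record the decision, and return the next step."""
    decision = check_refund(policy, amount_cents, eligible)
    outcome = "refund_requested" if decision.approved else "escalated_to_human"
    audit_log.append({
        "actor": "agent-lee",
        "charge_id": charge_id,
        "amount_cents": amount_cents,
        "decision": decision.reason,
        "outcome": outcome,
        # Deterministic key so a retried request cannot refund twice.
        "idempotency_key": f"refund:{charge_id}:{amount_cents}",
    })
    return outcome
```

The agent never decides on its own authority here: it asks the gate, and every answer other than "within_policy" hands the case to a human.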
    

The plan — phased roadmap

gantt
    title Delivery roadmap (phased)
    dateFormat YYYY-MM-DD
    axisFormat %b
    section Phase 1 — Bridge (0–30d)
    Free Pod drains 9k backlog        :p1, 2026-07-01, 30d
    Pareto-rank intents (Salesforce)  :2026-07-01, 21d
    section Phase 2 — Prevent (30–90d)
    Billing Diagnostics API           :p2, after p1, 60d
    Tier 0 webhooks + link auto-regen :after p1, 45d
    section Phase 3 — Resolve (90–180d)
    Billing Health Meter UI           :p3, after p2, 90d
    Agent Lee on top-3 intents        :after p2, 90d
    section Phase 4 — Scale (180d+)
    Expand scope + upgrade prompts     :after p3, 90d
    Retire temporary Free Pod          :after p3, 45d
    
  1. Phase 1 — Bridge (0–30d): stand up a time-boxed "Free Pod" with AI-assisted drafting to drain today's 9k backlog; Pareto-rank intents (expect ~5–8 intents to account for ~80% of volume).
  2. Phase 2 — Prevent (30–90d): ship the Billing Diagnostics API + Tier 0 webhooks (failure-reason explainer, auto-link-regeneration, dunning nudges).
  3. Phase 3 — Resolve (90–180d): launch the Billing Health Meter; put Agent Lee on the top-3 intents with read + scoped-write tools behind the Policy Engine.
  4. Phase 4 — Scale (180d+): expand agent scope, add in-context upgrade prompts, retire the temporary pod as structural demand drops.

The business case

~84% · cost reduction vs all-human
~$725k/yr · projected savings (illustrative)
$0.10 vs $8.00 · cost per contact, self-serve vs human

State               Mix          Volume   Unit cost   Monthly
Today (all human)   100% human   9,000    $8.00       $72,000
Tier 0 prevent      30%          2,700    ~$0         $0
Tier 1 self-heal    20%          1,800    $0.10       $180
Tier 2 AI           35%          3,150    $0.20       $630
Tier 3 human        15%          1,350    $8.00       $10,800
Future total        100%         9,000    blended     $11,610

Illustrative: plug in real monthly inflow and dunning data. Separately, there is a recovered-revenue lever: fixing silent failed payments protects UBB/add-on revenue (e.g. ~$45k/mo, consistent with 9,000 accounts × 25% recoverable × $20 avg).
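
The headline figures follow directly from the table, and the arithmetic can be checked with a short Python sketch. The mix and unit costs are the illustrative ones above; working in integer cents and the variable names are choices made here, not part of the source model.

```python
# (tier, mix in percent, unit cost in cents), taken from the illustrative table
TIERS = [
    ("Tier 0 prevent", 30, 0),
    ("Tier 1 self-heal", 20, 10),
    ("Tier 2 AI", 35, 20),
    ("Tier 3 human", 15, 800),
]
MONTHLY_VOLUME = 9_000
HUMAN_COST_CENTS = 800

# Integer cents avoid floating-point drift in the totals.
today_cents = MONTHLY_VOLUME * HUMAN_COST_CENTS
future_cents = sum(MONTHLY_VOLUME * pct // 100 * cents for _, pct, cents in TIERS)

monthly_savings_cents = today_cents - future_cents
annual_savings_cents = monthly_savings_cents * 12
reduction = monthly_savings_cents / today_cents

print(f"future monthly: ${future_cents / 100:,.0f}")          # $11,610
print(f"annual savings: ${annual_savings_cents / 100:,.0f}")  # $724,680
print(f"cost reduction: {reduction:.1%}")                     # 83.9%
```

Swapping in real inflow and mix numbers only requires editing `TIERS` and `MONTHLY_VOLUME`; the mix percentages should still sum to 100.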

Proof & guardrails

Proven at scale: Klarna's action-taking AI did the work of ~700 agents, matched human CSAT, cut repeat contacts by 25%, and cut resolution time from 11 to 2 minutes. Intercom Fin averages ~76% resolution across 12,000+ customers.

The one rule we can't break — never a dead end: any low-confidence/policy-fail/error path routes to a human with full context (Gartner: easy escalation → 74% still self-serve next time). This avoids the billing@ → auto-reply → dashboard loop that triggered a recent public complaint.
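
One way to enforce the rule structurally is to wrap every automated resolver so that each failure mode returns a human handoff carrying the case context. The sketch below is an assumption-laden illustration: `resolve_or_escalate`, `Outcome`, the resolver's return shape, and the 0.8 confidence floor are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Tuple

CONFIDENCE_FLOOR = 0.8  # assumed threshold, to be tuned against real data


@dataclass
class Outcome:
    resolved_by: str  # "agent" or "human"
    context: dict = field(default_factory=dict)


def resolve_or_escalate(
    case: dict,
    resolver: Callable[[dict], Tuple[bool, float]],
) -> Outcome:
    """Run an automated resolver; every non-success path ends at a human.

    The resolver returns (policy_ok, confidence). The original case travels
    with the escalation so the customer does not have to repeat anything.
    """
    try:
        policy_ok, confidence = resolver(case)
    except Exception as exc:  # an error must never strand the customer
        return Outcome("human", {**case, "escalation_reason": f"error: {exc}"})
    if not policy_ok:
        return Outcome("human", {**case, "escalation_reason": "policy_fail"})
    if confidence < CONFIDENCE_FLOOR:
        return Outcome("human", {**case, "escalation_reason": "low_confidence"})
    return Outcome("agent", dict(case))
```

Because the wrapper has no branch that returns nothing, an auto-reply loop of the kind described above cannot be expressed through it.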