Debugging Adaptive Journeys

When something goes wrong in production — a journey that won't trigger, sends that don't arrive, branches that look wrong — Apex gives you four observability surfaces that work together:

Live executions panel on the journey detail page (instant, last 50 executions).
Audit timeline for the journey (every authoring + runtime event).
CloudWatch dashboard + alarms for the runtime as a whole (operator view).
CloudWatch Logs Insights for tracing one user through one journey end-to-end.

This guide walks through each one and gives you copy-pasteable queries.

1. Live executions panel

Open the journey at /dashboard/communications/journeys/[id] and scroll to Live executions.

Every running, completed, failed, and timed-out execution shows up here within ~1s of state change. Each row links to the Step Functions console for the full execution history, and shows the current step's correlationId (formatted <executionTail>-<stepId>). The correlation ID is the key to everything else; it's logged on every dispatcher invocation and audit emit.

If the panel renders but shows no rows, check that the workspace has at least one published journey and that an inbound event matches a trigger contract. The trigger evaluator is deterministic: the same event, workspace, and trigger contract produce exactly one execution start.
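One common way to get an exactly-once start is to derive the execution name from the inputs, because Step Functions will not create a second standard-workflow execution under a name that is already in use. The sketch below illustrates that idea only; it is not taken from the Apex trigger evaluator, and the function name, the three field names, and the `jrny-` prefix are all hypothetical.

```python
import hashlib


def execution_name(workspace_id: str, trigger_contract_id: str, event_id: str) -> str:
    """Derive a stable execution name from the three inputs (hypothetical fields)."""
    # Join with a unit separator so ("ab", "c") and ("a", "bc") cannot collide.
    key = "\x1f".join([workspace_id, trigger_contract_id, event_id])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Step Functions execution names are limited to 80 characters.
    return f"jrny-{digest[:48]}"
```

Replaying the same event yields the same name, so a duplicate delivery cannot start a second execution; changing any one of the three inputs yields a different name.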

2. Audit timeline

Hit GET /api/journeys/[id]/audit (or scroll the audit drawer in the canvas). Every entry has:

timestamp — ISO-8601, also doubles as the SK fragment so the records sort naturally
actor — who or what produced the event (system for runtime, user sub for authoring + admin actions)
action — see the canonical list below
metadata — free-form context (always JSON-serializable)
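A minimal sketch of reading the timeline in a script, assuming the endpoint returns a bare JSON array of records with exactly the four fields above. The response envelope is an assumption, and `AuditEntry` and `parse_audit` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    actor: str
    action: str
    metadata: dict[str, Any]


def parse_audit(body: str) -> list[AuditEntry]:
    """Parse a JSON array of audit records into typed entries, oldest first."""
    entries = [
        AuditEntry(
            # Accept a trailing "Z" on Python versions older than 3.11.
            timestamp=datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00")),
            actor=rec["actor"],
            action=rec["action"],
            metadata=rec.get("metadata", {}),
        )
        for rec in json.loads(body)
    ]
    return sorted(entries, key=lambda e: e.timestamp)


sample = json.dumps([
    {"timestamp": "2025-03-01T10:00:05Z", "actor": "system",
     "action": "send.skipped", "metadata": {"reason": "opt_out"}},
    {"timestamp": "2025-03-01T10:00:00Z", "actor": "system",
     "action": "execution.started", "metadata": {}},
])
timeline = parse_audit(sample)
```

Sorting by the parsed timestamp reproduces the natural order of the stored records, so `timeline[0]` here is the `execution.started` entry.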

Audit actions you'll see

Group	Actions
Authoring	`created`, `edited`, `published`, `paused`, `archived`
Authoring (special)	`simulate-with-real-user`, `dry-run-override-applied`
Runtime sends	`send.executed`, `send.skipped`, `send.failed`
Runtime webhooks	`webhook.executed`, `webhook.rejected`, `webhook.failed`
Execution lifecycle	`execution.started`, `execution.exited`, `execution.failed`
Webhook destination registry	`webhook.destination_registered`, `webhook.destination_updated`, `webhook.destination_key_rotated`, `webhook.destination_deleted`, `webhook.destination_pinged`

When a send is skipped, the metadata always includes a reason — one of frequency_cap_exceeded, daily_cap_exceeded, opt_out, suppression_match, holdout_membership, or eligibility_failed. That's where to start any "why didn't this send?" investigation.
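As a starting point for that investigation, the sketch below tallies skip reasons from audit entries that have already been loaded as dictionaries. It assumes each entry carries `action` and `metadata.reason` as described above; the helper name and the `missing` and `unknown:` labels are hypothetical.

```python
from collections import Counter

# The six reasons listed in this guide.
SKIP_REASONS = {
    "frequency_cap_exceeded",
    "daily_cap_exceeded",
    "opt_out",
    "suppression_match",
    "holdout_membership",
    "eligibility_failed",
}


def skip_reason_counts(entries: list[dict]) -> Counter:
    """Count send.skipped audit entries by metadata.reason."""
    counts: Counter = Counter()
    for entry in entries:
        if entry.get("action") != "send.skipped":
            continue
        reason = entry.get("metadata", {}).get("reason", "missing")
        # Flag anything outside the documented list instead of hiding it.
        label = reason if reason in SKIP_REASONS or reason == "missing" else f"unknown:{reason}"
        counts[label] += 1
    return counts
```

Calling `skip_reason_counts(entries).most_common()` puts the dominant reason first, which is usually the one to investigate.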

3. CloudWatch dashboard

The CDK stack ships a per-stage dashboard at:

Staging: apex-staging-adaptive-journeys
Prod: apex-prod-adaptive-journeys

It surfaces:

Executions started / succeeded / failed / timed-out in 5-min buckets — the at-a-glance view of system health.
Dispatcher invocations & errors + Dispatcher latency (p50 / p95 / max) — Lambda-level detail.
Event correlator (wait-for-event resolutions) — non-zero traffic here means subjects are passing through wait-until-event steps.
Timed-out / failed / total executions, last 24h — single-value cards for daily trend.

Alarms that fire to `apex-<stage>-journey-alerts`

Alarm	Threshold	What it means
`journey-dispatcher-errors`	>0 errors in 5 min	Dispatcher Lambda is throwing — check its log group
`journey-dispatcher-slow`	p95 > 10s for 3× 5 min	Step Functions executions piling up; tail latency in dispatcher → app
`journey-dispatcher-throttled`	>0 throttles	Concurrency limit reached; raise reserved concurrency or back off triggers
`journey-event-correlator-errors`	>0 errors in 5 min	Wait-for-event signals not landing; pending journeys may stall
`journey-executions-failed`	>0 failed in 5 min	One or more journeys hit a `fail` action — see the dispatcher error
`journey-executions-timed-out`	>0 timed-out in 5 min	A journey hit the 60-day max-execution timeout — usually a wait/loop bug

The SNS topic forwards to whatever emails are configured in infra/lib/config.ts under alertEmails.

Info

There is no "stuck executions" alarm. A long-running execution is not a stuck one — a journey that waits three days for an open is doing its job. The old alarm measured execution age and paged on every healthy multi-day run, so it was removed in July 2026. To find a run that genuinely isn't moving, use the live executions panel on the journey detail page, or the Logs Insights queries below; journey-executions-failed and journey-executions-timed-out cover the cases that actually end a run.

4. CloudWatch Logs Insights

Three queries cover most journey debugging.

Trace one journey for one user end-to-end

fields @timestamp, @message
| filter @message like /<endUserHash>/ and @message like /<journeyId>/
| sort @timestamp asc
| limit 200

Substitute <endUserHash> (the SHA-256 of the email used in dispatcher logs — never the raw email) and <journeyId>. The chronological log gives you the dispatcher's view of every step the subject traversed.
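A small sketch for producing the substitution values. It assumes the hash is the hex-encoded SHA-256 of the trimmed, lowercased email; that normalization and encoding are assumptions, so confirm them against a known log line before relying on the result. Both function names are hypothetical.

```python
import hashlib


def end_user_hash(email: str) -> str:
    """Hex SHA-256 of the email (assumed: trimmed and lowercased first)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def trace_query(email: str, journey_id: str, limit: int = 200) -> str:
    """Fill in the trace query for one user and one journey."""
    # journey_id is interpolated into a regex; this assumes it contains
    # no regex metacharacters.
    return "\n".join([
        "fields @timestamp, @message",
        f"| filter @message like /{end_user_hash(email)}/ and @message like /{journey_id}/",
        "| sort @timestamp asc",
        f"| limit {limit}",
    ])
```

Generating the query this way keeps the raw email out of the Logs Insights query history as well as out of the logs.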

Find slow dispatch calls

filter @duration > 5000
| stats count() as slowCalls by step.id
| sort slowCalls desc

Rolls slow dispatcher calls up by step ID so you can see if one step (typically a Send waiting on SES, or a webhook with a slow remote endpoint) is dominating tail latency.

Find blocked sends

filter @message like /send.skipped/
| stats count() by reason

Or for one specific journey:

filter @message like /send.skipped/ and @message like /<journeyId>/
| stats count() by reason

Counts every blocked send by reason. If you're seeing high holdout_membership counts and you didn't expect any, double-check /dashboard/settings/journeys — someone may have raised the workspace holdout %.
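The queries in this section can also be run from a script. The sketch below uses the boto3 CloudWatch Logs client (`start_query` and `get_query_results`). The log group name is deliberately a parameter because this guide does not state it, and `run_query` and `rows_to_dicts` are hypothetical helper names.

```python
import time


def rows_to_dicts(results: list) -> list:
    """Flatten Logs Insights rows ([{field, value}, ...]) into plain dicts."""
    return [{cell["field"]: cell["value"] for cell in row} for row in results]


def run_query(log_group: str, query: str, minutes: int = 60, poll_seconds: float = 1.0) -> list:
    """Run one Logs Insights query over the last `minutes` and return its rows."""
    import boto3  # imported lazily so rows_to_dicts works without AWS dependencies

    logs = boto3.client("logs")
    end = int(time.time())
    started = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query,
    )
    # Poll until the query reaches a terminal status.
    while True:
        response = logs.get_query_results(queryId=started["queryId"])
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

Passing the blocked-sends query as `query` returns one dictionary per reason, which is convenient for diffing counts between two time windows.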

Common failure modes & fixes

"My journey isn't triggering"

Check the audit log for execution.started — if it's there, the journey IS running; the issue is downstream.
If no execution.started, run a dry-run on the journey. The simulator uses the same planners + eligibility as production, so an audience that fails in dry-run will also fail in production.
If dry-run shows the trigger should fire but real events don't start executions, check the trigger contract is published, the audience contains the user, and the trigger event name matches your SDK send exactly.

"My send was skipped — why?"

The audit metadata's reason field tells you exactly. Most common causes in order:

frequency_cap_exceeded — per-user-per-channel cap (default 5/day, 20/week) is enforced inside JourneyCapStore. Raise the cap on /dashboard/settings/journeys if intentional.
holdout_membership — the user is in the global holdout. Deterministic FNV-1a hash; the same user is suppressed across all journeys. Inspect HOLDSUB# records for confirmation.
opt_out — EndUserCommPreferences set the channel to false. Customer self-served via /preferences/<token> or admin set it.
suppression_match — SES bounce / complaint suppression list match. Won't be sent regardless of the journey.
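To see why holdout membership is stable across journeys, here is a sketch of a deterministic FNV-1a bucket check. The 32-bit variant, the `workspace_id:end_user_id` key format, and the basis-point bucketing are assumptions for illustration; only the use of FNV-1a comes from this guide.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def in_holdout(workspace_id: str, end_user_id: str, holdout_percent: float) -> bool:
    """True if the user falls in the holdout bucket (hypothetical key format)."""
    # Map the hash onto 10,000 buckets so percentages resolve to 0.01%.
    bucket = fnv1a_32(f"{workspace_id}:{end_user_id}".encode("utf-8")) % 10_000
    return bucket < holdout_percent * 100
```

Because the journey ID is not part of the key, the answer for a given user is the same in every journey, and it changes only when the holdout percentage does.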

"My webhook step is failing every time"

Check audit webhook.rejected — likely SSRF check failed (HTTPS-only, private-CIDR block, DNS rebinding defense).
Check audit webhook.failed — fetch threw or the receiver returned non-2xx.
Use the Send test ping action on /dashboard/settings/webhook-destinations to isolate the receiver: does it verify the HMAC correctly, and does it return 2xx?
If ping is fine but production webhooks fail, the payload template is probably the culprit. Check the dispatcher logs for the rendered body.
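To reason about a `webhook.rejected` entry, it can help to reproduce the checks locally. The sketch below approximates the three checks named above; the rejection strings and the function name are hypothetical, and the real validator may differ. Resolved addresses are passed in so the caller can connect to the same addresses it validated, which is the usual defense against DNS rebinding.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def check_destination(url: str, resolved_ips: list) -> Optional[str]:
    """Return a rejection reason, or None if the destination passes."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return "scheme_not_https"
    if not parts.hostname:
        return "missing_host"
    if not resolved_ips:
        return "dns_no_answer"
    for raw in resolved_ips:
        # is_global is False for private, loopback, link-local, and reserved ranges.
        if not ipaddress.ip_address(raw).is_global:
            return f"non_public_address:{raw}"
    return None
```

Resolve the hostname from the same network as the dispatcher before calling this; a name that resolves publicly from a laptop may resolve to a private address inside the VPC.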

"An execution has been running for days with no progress"

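There is no stuck-executions alarm or dashboard panel, because a long wait is usually a journey doing its job. To find a run that genuinely isn't moving, start from the live executions panel on the journey detail page, then trace the subject with the Logs Insights queries.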
Most common cause: a wait-for-event step whose target event never fires. Check JWAIT# records — they have TTL = deadline + 24h.
Second most common: an infinite loop in a conditional branch. The 60-day execution timeout will eventually catch it.
Forcibly stop via aws stepfunctions stop-execution --execution-arn ... or use the "Stop" button on the live executions panel.
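When inspecting a JWAIT# record, the TTL rule above lets you recover the wait deadline from the stored value. This sketch assumes the TTL is stored as epoch seconds (the DynamoDB convention); the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TTL_GRACE = timedelta(hours=24)


def wait_record_ttl(deadline: datetime) -> int:
    """Epoch-seconds TTL for a wait record: deadline plus 24 hours."""
    return int((deadline + TTL_GRACE).timestamp())


def deadline_from_ttl(ttl_epoch_seconds: int) -> datetime:
    """Recover the wait deadline (UTC) from a stored TTL value."""
    return datetime.fromtimestamp(ttl_epoch_seconds, tz=timezone.utc) - TTL_GRACE


def wait_is_overdue(ttl_epoch_seconds: int, now: datetime) -> bool:
    """True if the deadline has passed while the record still exists."""
    return now > deadline_from_ttl(ttl_epoch_seconds)
```

A record whose deadline has passed while the execution is still running on that step is a strong hint that the timeout branch of the wait is not being taken.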

Adaptive Journeys — the system at a glance
Journey Audiences — predicate language reference
Calibrated Impact — how lift is measured
Global Holdout — why some sends are intentionally skipped