Debugging Adaptive Journeys
When something goes wrong in production — a journey that won't trigger, sends that don't arrive, branches that look wrong — Apex gives you four observability surfaces that work together:
- Live executions panel on the journey detail page (instant, last 50 executions).
- Audit timeline for the journey (every authoring + runtime event).
- CloudWatch dashboard + alarms for the runtime as a whole (operator view).
- CloudWatch Logs Insights for tracing one user through one journey end-to-end.
This guide walks through each one and gives you copy-pasteable queries.
1. Live executions panel
Open the journey at /dashboard/communications/journeys/[id] and scroll to Live executions.
Every running, completed, failed, and timed-out execution shows up here within ~1s of state change. Each row links to the Step Functions console for the full execution history, and shows the current step's correlationId (formatted <executionTail>-<stepId>). The correlation ID is the key to everything else; it's logged on every dispatcher invocation and audit emit.
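Because the correlation ID ties the panel to the logs and audit entries, it can help to split it back into its two halves when scripting. The sketch below is a hypothetical helper, not part of Apex: it assumes the `<executionTail>` half contains no hyphen, so the first hyphen is the separator and the step ID may contain further hyphens. Verify that assumption against real IDs before relying on it.

```typescript
// Hypothetical helper: splits a correlation ID of the form <executionTail>-<stepId>.
// Assumption (not confirmed by this guide): executionTail contains no hyphen,
// so the first hyphen is the separator; stepId may contain further hyphens.
interface CorrelationParts {
  executionTail: string;
  stepId: string;
}

function parseCorrelationId(correlationId: string): CorrelationParts | null {
  const separator = correlationId.indexOf("-");
  // Reject IDs with no separator or with an empty half on either side.
  if (separator <= 0 || separator === correlationId.length - 1) {
    return null;
  }
  return {
    executionTail: correlationId.slice(0, separator),
    stepId: correlationId.slice(separator + 1),
  };
}
```

Returning `null` instead of throwing keeps the helper safe to use when filtering mixed log lines.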
If the panel renders but shows no rows, check that the workspace has at least one published journey and that an inbound event matches a trigger contract. The trigger evaluator is deterministic: the same event, workspace, and trigger contract always produce exactly one execution start.
2. Audit timeline
Hit GET /api/journeys/[id]/audit (or scroll the audit drawer in the canvas). Every entry has:
- timestamp: ISO-8601; also doubles as the SK fragment so the records sort naturally
- actor: who or what produced the event (system for runtime, the user's sub for authoring + admin actions)
- action: see the canonical list below
- metadata: free-form context (always JSON-serializable)
Audit actions you'll see
| Group | Actions |
|---|---|
| Authoring | created, edited, published, paused, archived |
| Authoring (special) | simulate-with-real-user, dry-run-override-applied |
| Runtime sends | send.executed, send.skipped, send.failed |
| Runtime webhooks | webhook.executed, webhook.rejected, webhook.failed |
| Execution lifecycle | execution.started, execution.exited, execution.failed |
| Webhook destination registry | webhook.destination_registered, webhook.destination_updated, webhook.destination_key_rotated, webhook.destination_deleted, webhook.destination_pinged |
When a send is skipped, the metadata always includes a reason — one of frequency_cap_exceeded, daily_cap_exceeded, opt_out, suppression_match, holdout_membership, or eligibility_failed. That's where to start any "why didn't this send?" investigation.
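A quick way to start that investigation is to tally skip reasons across the audit entries. The following is a minimal sketch: the `AuditEntry` shape mirrors the four fields listed earlier, but the assumptions that the audit endpoint returns a plain array of such entries and that the reason lives at `metadata.reason` are illustrative only, and `tallySkipReasons` is a hypothetical helper.

```typescript
// Skip reasons as listed in this guide.
type SkipReason =
  | "frequency_cap_exceeded"
  | "daily_cap_exceeded"
  | "opt_out"
  | "suppression_match"
  | "holdout_membership"
  | "eligibility_failed";

const SKIP_REASONS: readonly string[] = [
  "frequency_cap_exceeded",
  "daily_cap_exceeded",
  "opt_out",
  "suppression_match",
  "holdout_membership",
  "eligibility_failed",
];

// Assumed entry shape, based on the fields described for the audit timeline.
interface AuditEntry {
  timestamp: string;
  actor: string;
  action: string;
  metadata: Record<string, unknown>;
}

function isSkipReason(value: unknown): value is SkipReason {
  return typeof value === "string" && SKIP_REASONS.indexOf(value) !== -1;
}

// Counts send.skipped entries per reason; other actions and unknown reasons are ignored.
function tallySkipReasons(entries: AuditEntry[]): Partial<Record<SkipReason, number>> {
  const counts: Partial<Record<SkipReason, number>> = {};
  for (const entry of entries) {
    if (entry.action !== "send.skipped") continue;
    const reason = entry.metadata["reason"];
    if (!isSkipReason(reason)) continue;
    counts[reason] = (counts[reason] ?? 0) + 1;
  }
  return counts;
}
```

Feeding it the parsed response body gives a per-reason count you can compare against the Logs Insights numbers.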
3. CloudWatch dashboard
The CDK stack ships a per-stage dashboard at:
- Staging: apex-staging-adaptive-journeys
- Prod: apex-prod-adaptive-journeys
It surfaces:
- Executions started / succeeded / failed / timed-out in 5-min buckets — the at-a-glance view of system health.
- Dispatcher invocations & errors + Dispatcher latency (p50 / p95 / max) — Lambda-level detail.
- Event correlator (wait-for-event resolutions) — non-zero traffic here means subjects are passing through wait-until-event steps.
- Stuck executions (>24h) — anything other than zero is investigable.
- Failed / total executions, last 24h — single-value cards for daily trend.
Alarms that fire to apex-<stage>-journey-alerts
| Alarm | Threshold | What it means |
|---|---|---|
| journey-dispatcher-errors | >0 errors in 5 min | Dispatcher Lambda is throwing; check its log group |
| journey-dispatcher-slow | p95 > 10s for 3× 5 min | Step Functions executions piling up; tail latency in dispatcher → app |
| journey-dispatcher-throttled | >0 throttles | Concurrency limit hit; raise reserved concurrency or back off triggers |
| journey-event-correlator-errors | >0 errors in 5 min | Wait-for-event signals not landing; pending journeys may stall |
| journey-executions-failed | >0 failed in 5 min | One or more journeys hit a fail action; see the dispatcher error |
| journey-executions-timed-out | >0 timed-out in 5 min | A journey hit the 60-day max-execution timeout; usually a wait/loop bug |
| journey-stuck-executions | >0 in 15 min | Dispatcher logged stuck-execution (>24h running); inspect the live executions panel |
The SNS topic forwards to whatever emails are configured in infra/lib/config.ts under alertEmails.
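To make the "p95 > 10s for 3× 5 min" threshold concrete, the sketch below checks whether the most recent datapoints all breach a limit. It is a simplified illustration under stated assumptions, not how CloudWatch is implemented: real alarm evaluation also handles missing data and "M out of N" settings, and the function name and input shape here are hypothetical.

```typescript
// Simplified model of a "breaching for N consecutive periods" alarm rule.
// p95Ms holds one p95 latency value per 5-minute period, oldest first.
function breachesForConsecutivePeriods(
  p95Ms: number[],
  thresholdMs: number,
  periods: number,
): boolean {
  // Not enough datapoints yet means the alarm cannot fire.
  if (periods <= 0 || p95Ms.length < periods) {
    return false;
  }
  // Every one of the last `periods` datapoints must be strictly above the threshold.
  return p95Ms.slice(-periods).every((value) => value > thresholdMs);
}

// Example: three consecutive periods above 10s would put journey-dispatcher-slow in alarm.
const wouldAlarm = breachesForConsecutivePeriods([4000, 11000, 12000, 15000], 10_000, 3);
```

A single slow period does not trip this rule, which is why a brief latency spike can show on the dashboard without an alert email.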
4. CloudWatch Logs Insights
Three queries handle 95% of journey debugging.
Trace one journey for one user end-to-end
fields @timestamp, @message
| filter @message like /<endUserHash>/ and @message like /<journeyId>/
| sort @timestamp asc
| limit 200
Substitute <endUserHash> (the SHA-256 of the email used in dispatcher logs — never the raw email) and <journeyId>. The chronological log gives you the dispatcher's view of every step the subject traversed.
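If you run this trace often, a small script can compute the hash and fill in the query for you. This is a sketch under assumptions: the guide only says the hash is the SHA-256 of the email, so the trim-and-lowercase normalization and the hex encoding below are guesses to check against a real log line, and both function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Assumption: the dispatcher hashes the trimmed, lowercased email and logs the hex digest.
function hashEndUserEmail(email: string): string {
  return createHash("sha256").update(email.trim().toLowerCase(), "utf8").digest("hex");
}

// Escapes regex metacharacters (and "/") so the values are matched literally
// inside the /.../ patterns of the query.
function escapeForLogsInsights(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
}

// Builds the same trace query shown above with both placeholders substituted.
function buildTraceQuery(endUserHash: string, journeyId: string): string {
  return [
    "fields @timestamp, @message",
    `| filter @message like /${escapeForLogsInsights(endUserHash)}/ and @message like /${escapeForLogsInsights(journeyId)}/`,
    "| sort @timestamp asc",
    "| limit 200",
  ].join("\n");
}
```

Paste the returned string into the Logs Insights query editor with the dispatcher log group selected.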
Find slow dispatch calls
filter @duration > 5000
| stats count() as slowCalls by step.id
| sort slowCalls desc
Rolls slow dispatcher calls up by step ID so you can see if one step (typically a Send waiting on SES, or a webhook with a slow remote endpoint) is dominating tail latency.
Find blocked sends
filter @message like /send.skipped/
| stats count() by reason
Or for one specific journey:
filter @message like /send.skipped/ and @message like /<journeyId>/
| stats count() by reason
Counts every blocked send by reason. If you're seeing high holdout_membership counts and you didn't expect any, double-check /dashboard/settings/journeys — someone may have raised the workspace holdout %.
Common failure modes & fixes
"My journey isn't triggering"
- Check the audit log for execution.started. If it's there, the journey IS running; the issue is downstream.
- If there is no execution.started, run a dry-run on the journey. The simulator uses the same planners + eligibility as production, so an audience that fails in dry-run will also fail in production.
- If dry-run shows the trigger should fire but real events don't start executions, check that the trigger contract is published, the audience contains the user, and the trigger event name matches your SDK send exactly.
"My send was skipped — why?"
The audit metadata's reason field tells you exactly. Most common causes in order:
- frequency_cap_exceeded: the per-user-per-channel cap (default 5/day, 20/week) is enforced inside JourneyCapStore. Raise the cap on /dashboard/settings/journeys if intentional.
- holdout_membership: the user is in the global holdout. Deterministic FNV-1a hash; the same user is suppressed across all journeys. Inspect HOLDSUB# records for confirmation.
- opt_out: EndUserCommPreferences set the channel to false. Either the customer self-served via /preferences/<token> or an admin set it.
- suppression_match: SES bounce / complaint suppression list match. Won't be sent regardless of the journey.
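The deterministic holdout mentioned above can be pictured with the sketch below. The 32-bit FNV-1a function is the standard algorithm, but everything around it is an assumption made for illustration: the guide does not say which key is hashed, which FNV width is used, or how the hash maps to a percentage, so `isInHoldout` and its modulo-100 bucketing are hypothetical.

```typescript
// Standard 32-bit FNV-1a over the UTF-8 bytes of the input.
function fnv1a32(input: string): number {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping at 32 bits
  }
  return hash >>> 0; // normalize to an unsigned 32-bit integer
}

// Hypothetical bucketing: a user is held out when their hash bucket (0..99)
// falls below the workspace holdout percentage.
function isInHoldout(userKey: string, holdoutPercent: number): boolean {
  return fnv1a32(userKey) % 100 < holdoutPercent;
}
```

Because the result depends only on the key and the percentage, the same user lands in the same bucket for every journey, which matches the behavior described for holdout_membership.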
"My webhook step is failing every time"
- Check audit webhook.rejected: likely the SSRF check failed (HTTPS-only, private-CIDR block, DNS rebinding defense).
- Check audit webhook.failed: the fetch threw or the receiver returned non-2xx.
- Use the Send test ping action on /dashboard/settings/webhook-destinations to isolate: does the receiver verify HMAC correctly? Does it return 2xx?
- If ping is fine but production webhooks fail, the payload template is probably the culprit. Check the dispatcher logs for the rendered body.
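When the test ping fails on the receiver side, the HMAC check is a common suspect. The sketch below shows one conventional way a receiver might verify a signature using Node's built-in crypto module. The scheme itself is an assumption: this guide does not specify the hash algorithm, the encoding, or the header that carries the signature, so HMAC-SHA256 over the raw body with a hex digest is used purely for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: hex-encoded HMAC-SHA256 of the raw request body.
function signBody(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Compares the received signature to the expected one in constant time.
function verifySignature(rawBody: string, secret: string, receivedHex: string): boolean {
  const expected = Buffer.from(signBody(rawBody, secret), "hex");
  const received = Buffer.from(receivedHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== received.length) {
    return false;
  }
  return timingSafeEqual(expected, received);
}
```

Two frequent receiver bugs this helps rule out are signing the parsed JSON instead of the raw body, and comparing against a key that was rotated (see webhook.destination_key_rotated in the audit log).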
"An execution has been running for days with no progress"
- The dashboard's Stuck executions panel surfaces these via the CloudWatch metric filter. The alarm fires at the first detection.
- Most common cause: a wait-for-event step whose target event never fires. Check JWAIT# records; they have TTL = deadline + 24h.
- Second most common: an infinite loop in a conditional branch. The 60-day execution timeout will eventually catch it.
- Forcibly stop via aws stepfunctions stop-execution --execution-arn ... or use the "Stop" button on the live executions panel.
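When inspecting JWAIT# records, the "TTL = deadline + 24h" rule can be checked with a few lines of arithmetic. This sketch assumes the deadline is available as an ISO-8601 string and that the TTL is stored as epoch seconds; neither detail is stated in this guide, and the function names are hypothetical.

```typescript
const TTL_GRACE_SECONDS = 24 * 60 * 60; // the 24h grace period after the deadline

// Computes the expected TTL (epoch seconds) for a wait record with the given deadline.
function expectedWaitTtl(deadlineIso: string): number {
  const deadlineMs = Date.parse(deadlineIso);
  if (isNaN(deadlineMs)) {
    throw new Error(`Unparseable deadline: ${deadlineIso}`);
  }
  return Math.floor(deadlineMs / 1000) + TTL_GRACE_SECONDS;
}

// True when the wait deadline has passed, meaning the awaited event can no longer arrive in time.
function isPastDeadline(deadlineIso: string, now: Date): boolean {
  return now.getTime() > Date.parse(deadlineIso);
}
```

A record whose deadline has passed but whose execution is still running is a good candidate for the stop-execution step above.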
Related
- Adaptive Journeys — the system at a glance
- Journey Audiences — predicate language reference
- Calibrated Impact — how lift is measured
- Global Holdout — why some sends are intentionally skipped