
Debugging Adaptive Journeys

When something goes wrong in production — a journey that won't trigger, sends that don't arrive, branches that look wrong — Apex gives you four observability surfaces that work together:

  1. Live executions panel on the journey detail page (instant, last 50 executions).
  2. Audit timeline for the journey (every authoring + runtime event).
  3. CloudWatch dashboard + alarms for the runtime as a whole (operator view).
  4. CloudWatch Logs Insights for tracing one user through one journey end-to-end.

This guide walks through each one and provides copy-pasteable queries.

1. Live executions panel

Open the journey at /dashboard/communications/journeys/[id] and scroll to Live executions.

Every running, completed, failed, and timed-out execution shows up here within ~1s of state change. Each row links to the Step Functions console for the full execution history, and shows the current step's correlationId (formatted <executionTail>-<stepId>). The correlation ID is the key to everything else; it's logged on every dispatcher invocation and audit emit.

If the panel renders but shows no rows, check that the workspace has at least one published journey and that an inbound event matches a trigger contract. The trigger evaluator is deterministic: the same event, workspace, and trigger contract produce exactly one execution start.
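
One common way to get "exactly one execution start" is to derive the execution name from the inputs, so a replayed event maps to a name that has already been used. The sketch below illustrates that idea only; the function names, the `jrn-` prefix, and the use of an event ID are assumptions, not Apex's actual scheme.

```python
import hashlib


def execution_name(workspace_id: str, trigger_contract_id: str, event_id: str) -> str:
    """Derive a stable execution name from the three inputs (hypothetical scheme)."""
    # Length-prefix each part so ("ab", "c", ...) and ("a", "bc", ...) cannot collide.
    parts = (workspace_id, trigger_contract_id, event_id)
    material = "".join(f"{len(p)}:{p}" for p in parts)
    digest = hashlib.sha256(material.encode("utf-8")).hexdigest()
    # Step Functions execution names are limited to 80 characters.
    return f"jrn-{digest[:40]}"


# In-memory stand-in for "this name has already been started".
_started: set[str] = set()


def should_start(workspace_id: str, trigger_contract_id: str, event_id: str) -> bool:
    """Return True only the first time a given input triple is seen."""
    name = execution_name(workspace_id, trigger_contract_id, event_id)
    if name in _started:
        return False
    _started.add(name)
    return True
```

If two deliveries of the same event both reach `should_start`, only the first returns True, which is the behavior to expect when you see a single execution row for a retried event.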

2. Audit timeline

Hit GET /api/journeys/[id]/audit (or scroll the audit drawer in the canvas). Every entry has:

  • timestamp — ISO-8601, also doubles as the SK fragment so the records sort naturally
  • actor — who or what produced the event (system for runtime, user sub for authoring + admin actions)
  • action — see the canonical list below
  • metadata — free-form context (always JSON-serializable)
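
As a rough illustration of the entry shape, the sketch below models the four fields and sorts by the timestamp string. The class name and sample values are hypothetical, and the string sort is chronological only under the assumption that every timestamp uses the same UTC, fixed-width ISO-8601 format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class AuditEntry:
    timestamp: str  # ISO-8601, assumed UTC with a fixed-width format
    actor: str      # "system" for runtime events, a user sub otherwise
    action: str     # e.g. "execution.started"
    metadata: dict[str, Any] = field(default_factory=dict)


def timeline(entries: list[AuditEntry]) -> list[AuditEntry]:
    # Plain string comparison matches chronological order only because
    # all timestamps share one format and one time zone (an assumption).
    return sorted(entries, key=lambda e: e.timestamp)


# Illustrative entries, not real output.
entries = [
    AuditEntry("2024-05-01T10:00:02.000Z", "system", "send.executed"),
    AuditEntry("2024-05-01T09:59:58.000Z", "user-sub-123", "published"),
    AuditEntry("2024-05-01T10:00:00.000Z", "system", "execution.started"),
]
ordered = [e.action for e in timeline(entries)]
```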

Audit actions you'll see

| Group | Actions |
|---|---|
| Authoring | created, edited, published, paused, archived |
| Authoring (special) | simulate-with-real-user, dry-run-override-applied |
| Runtime sends | send.executed, send.skipped, send.failed |
| Runtime webhooks | webhook.executed, webhook.rejected, webhook.failed |
| Execution lifecycle | execution.started, execution.exited, execution.failed |
| Webhook destination registry | webhook.destination_registered, webhook.destination_updated, webhook.destination_key_rotated, webhook.destination_deleted, webhook.destination_pinged |

When a send is skipped, the metadata always includes a reason — one of frequency_cap_exceeded, daily_cap_exceeded, opt_out, suppression_match, holdout_membership, or eligibility_failed. That's where to start any "why didn't this send?" investigation.
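
If you have pulled audit entries as parsed JSON, a quick tally by reason shows where to look first. This is a minimal sketch that assumes each entry is a dict with `action` and `metadata` keys; the actual response shape of the audit endpoint may differ.

```python
from collections import Counter

# The six reasons listed above.
KNOWN_REASONS = {
    "frequency_cap_exceeded",
    "daily_cap_exceeded",
    "opt_out",
    "suppression_match",
    "holdout_membership",
    "eligibility_failed",
}


def skip_reasons(entries: list[dict]) -> Counter:
    """Count send.skipped entries by metadata.reason."""
    counts: Counter = Counter()
    for entry in entries:
        if entry.get("action") != "send.skipped":
            continue
        reason = entry.get("metadata", {}).get("reason", "missing")
        # Surface unexpected values instead of silently folding them in.
        key = reason if reason in KNOWN_REASONS else f"unknown:{reason}"
        counts[key] += 1
    return counts
```

Calling `skip_reasons(entries).most_common()` lists the dominant reason first.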

3. CloudWatch dashboard

The CDK stack ships a per-stage dashboard at:

  • Staging: apex-staging-adaptive-journeys
  • Prod: apex-prod-adaptive-journeys

It surfaces:

  • Executions started / succeeded / failed / timed-out in 5-min buckets — the at-a-glance view of system health.
  • Dispatcher invocations & errors + Dispatcher latency (p50 / p95 / max) — Lambda-level detail.
  • Event correlator (wait-for-event resolutions) — non-zero traffic here means subjects are passing through wait-until-event steps.
  • Stuck executions (>24h) — anything other than zero is investigable.
  • Failed / total executions, last 24h — single-value cards for daily trend.

Alarms that fire to apex-<stage>-journey-alerts

| Alarm | Threshold | What it means |
|---|---|---|
| journey-dispatcher-errors | >0 errors in 5 min | Dispatcher Lambda is throwing; check its log group |
| journey-dispatcher-slow | p95 > 10s for 3 × 5 min | Step Functions executions piling up; tail latency in dispatcher → app |
| journey-dispatcher-throttled | >0 throttles | Concurrency floor breached; raise reserved concurrency or back off triggers |
| journey-event-correlator-errors | >0 errors in 5 min | Wait-for-event signals not landing; pending journeys may stall |
| journey-executions-failed | >0 failed in 5 min | One or more journeys hit a fail action; see the dispatcher error |
| journey-executions-timed-out | >0 timed-out in 5 min | A journey hit the 60-day max-execution timeout; usually a wait/loop bug |
| journey-stuck-executions | >0 in 15 min | Dispatcher logged stuck-execution (>24h running); inspect the live executions panel |

The SNS topic forwards to whatever emails are configured in infra/lib/config.ts under alertEmails.
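
To reason about when journey-dispatcher-slow would fire, it can help to replay p95 datapoints by hand. The sketch below assumes the threshold means three consecutive 5-minute periods above 10 seconds; the function name and defaults are illustrative, not taken from the CDK stack.

```python
def slow_alarm_breached(
    p95_ms: list[float],
    threshold_ms: float = 10_000,
    periods: int = 3,
) -> bool:
    """True if the most recent `periods` datapoints all exceed the threshold."""
    if len(p95_ms) < periods:
        # Not enough data yet to evaluate the alarm.
        return False
    return all(value > threshold_ms for value in p95_ms[-periods:])
```

A single slow bucket followed by a fast one does not trip the check, which matches the intent of alerting on sustained tail latency rather than one spike.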

4. CloudWatch Logs Insights

Three queries handle 95% of journey debugging.

Trace one journey for one user end-to-end

fields @timestamp, @message
| filter @message like /<endUserHash>/ and @message like /<journeyId>/
| sort @timestamp asc
| limit 200

Substitute <journeyId> and <endUserHash>, the SHA-256 hash of the user's email. Dispatcher logs record the hash, never the raw email. The chronological output gives you the dispatcher's view of every step the subject traversed.
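
A small helper can compute the hash and fill in the query for you. This is a sketch under assumptions: it trims and lowercases the email and uses the hex digest, which may not match how the dispatcher normalizes addresses, and it expects a journey ID without regex metacharacters.

```python
import hashlib


def end_user_hash(email: str) -> str:
    # Assumption: the dispatcher hashes the trimmed, lowercased email
    # and logs the hex digest. Adjust if your logs show otherwise.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def trace_query(email: str, journey_id: str, limit: int = 200) -> str:
    """Build the Logs Insights trace query without embedding the raw email."""
    user_hash = end_user_hash(email)
    return (
        "fields @timestamp, @message\n"
        f"| filter @message like /{user_hash}/ and @message like /{journey_id}/\n"
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )
```

If the query returns nothing for a user you know was processed, compare the computed hash against one log line first; a normalization mismatch is the likeliest cause.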

Find slow dispatch calls

filter @duration > 5000
| stats count() as slowCalls by step.id
| sort slowCalls desc

Rolls slow dispatcher calls up by step ID so you can see if one step (typically a Send waiting on SES, or a webhook with a slow remote endpoint) is dominating tail latency.

Find blocked sends

filter @message like /send.skipped/
| stats count() by reason

Or for one specific journey:

filter @message like /send.skipped/ and @message like /<journeyId>/
| stats count() by reason

Counts every blocked send by reason. If you're seeing high holdout_membership counts and you didn't expect any, double-check /dashboard/settings/journeys — someone may have raised the workspace holdout %.

Common failure modes & fixes

"My journey isn't triggering"

  1. Check the audit log for execution.started — if it's there, the journey IS running; the issue is downstream.
  2. If no execution.started, run a dry-run on the journey. The simulator uses the same planners + eligibility as production, so an audience that fails in dry-run will also fail in production.
  3. If dry-run shows the trigger should fire but real events don't start executions, check that the trigger contract is published, that the audience contains the user, and that the trigger event name exactly matches the one your SDK sends.

"My send was skipped — why?"

The audit metadata's reason field tells you exactly. Most common causes in order:

  1. frequency_cap_exceeded: the per-user, per-channel cap (default 5/day, 20/week), enforced inside JourneyCapStore, was hit. Raise the cap on /dashboard/settings/journeys if the volume is intentional.
  2. holdout_membership: the user is in the global holdout. Membership comes from a deterministic FNV-1a hash, so the same user is suppressed across all journeys. Inspect HOLDSUB# records to confirm.
  3. opt_out: EndUserCommPreferences has the channel set to false, either because the customer self-served via /preferences/<token> or because an admin set it.
  4. suppression_match: the address matches the SES bounce/complaint suppression list. It won't be sent regardless of the journey.
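
To see why holdout membership is stable across journeys, here is a minimal sketch of FNV-1a bucketing. The 32-bit variant, the modulo-100 bucketing, and hashing the user key alone are assumptions for illustration; check the actual implementation before relying on specific bucket values.

```python
FNV_OFFSET_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h


def in_holdout(user_key: str, holdout_percent: int) -> bool:
    # The journey ID is deliberately not part of the hash input, so a user
    # lands in the same bucket for every journey (assumed bucketing scheme).
    bucket = fnv1a_32(user_key.encode("utf-8")) % 100
    return bucket < holdout_percent
```

Because the bucket depends only on the user key, raising the holdout percentage adds users to the holdout without moving anyone who was already in it.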

"My webhook step is failing every time"

  1. Check audit webhook.rejected — likely SSRF check failed (HTTPS-only, private-CIDR block, DNS rebinding defense).
  2. Check audit webhook.failed — fetch threw or the receiver returned non-2xx.
  3. Use the Send test ping action on /dashboard/settings/webhook-destinations to isolate the receiver: does it verify the HMAC correctly, and does it return 2xx?
  4. If ping is fine but production webhooks fail, the payload template is probably the culprit. Check the dispatcher logs for the rendered body.
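
When a destination keeps producing webhook.rejected, it can help to reproduce the kind of checks named in step 1 locally. The sketch below is not Apex's validator: the function name and reason strings are hypothetical. It takes already-resolved IP addresses because validating the addresses you will actually connect to, rather than re-resolving the hostname later, is the usual defense against DNS rebinding.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit


def check_destination(url: str, resolved_ips: list[str]) -> Optional[str]:
    """Return a rejection reason, or None if the destination passes."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return "scheme_not_https"
    if not parts.hostname:
        return "missing_host"
    if not resolved_ips:
        return "no_resolved_address"
    for raw in resolved_ips:
        ip = ipaddress.ip_address(raw)
        # is_global is False for private, loopback, link-local,
        # and other non-routable ranges.
        if not ip.is_global:
            return f"non_public_address:{ip}"
    return None
```

Every resolved address must pass: one private address in the list is enough to reject, since the connection could land on it.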

"An execution has been running for days with no progress"

  1. The dashboard's Stuck executions panel surfaces these via the CloudWatch metric filter. The alarm fires at the first detection.
  2. Most common cause: a wait-for-event step whose target event never fires. Check JWAIT# records — they have TTL = deadline + 24h.
  3. Second most common: an infinite loop in a conditional branch. The 60-day execution timeout will eventually catch it.
  4. Forcibly stop via aws stepfunctions stop-execution --execution-arn ... or use the "Stop" button on the live executions panel.
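
When inspecting a JWAIT# record, two questions usually matter: has the wait passed its deadline, and when will the record disappear? The sketch below works through the "TTL = deadline + 24h" arithmetic. It assumes the TTL is stored as epoch seconds, which the guide does not state; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TTL_GRACE = timedelta(hours=24)


def jwait_ttl(deadline: datetime) -> int:
    """Epoch-seconds TTL for a wait record: deadline plus 24 hours."""
    if deadline.tzinfo is None:
        raise ValueError("deadline must be timezone-aware")
    return int((deadline + TTL_GRACE).timestamp())


def wait_is_overdue(deadline: datetime, now: datetime) -> bool:
    # An overdue wait whose execution is still running is a likely
    # candidate for the "stuck execution" investigation above.
    return now >= deadline


deadline = datetime(2024, 1, 1, tzinfo=timezone.utc)
ttl = jwait_ttl(deadline)
```

If the record is already gone, the wait deadline passed more than 24 hours ago, so the execution is no longer waiting on that event.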