# Orchestrator Observability Architecture
This document defines the observability architecture for the orchestration pipeline from job creation through module execution and lifecycle persistence.
## Goals
- Make orchestration failures actionable in production.
- Preserve request/job context across async boundaries.
- Keep logging structured and queryable.
- Minimize hidden failure modes in state transitions.
## Scope

Applies to:

- `CreateAnalysisJobUseCase`
- `ExecuteAnalysisJobUseCase`
- `JobWorkflow`
- Worker execution path in `job_queue`
## Design Principles
- **Context at boundaries**
  - Every public use-case/workflow entrypoint must emit structured context at start and finish.
  - Required identifiers: `job_id`, `project_id`, `module`, and the transition target when applicable.
- **Error locality**
  - Failures should be logged as close as possible to their source, with operation-specific metadata.
  - Callers should receive typed errors; logs carry the operational detail.
- **State-machine visibility**
  - Every job transition (`Pending -> Queued -> Running -> Completed/Failed/Cancelled`) must be visible in logs with reason and timing.
- **Async fan-out accountability**
  - Parallel module execution logs should expose:
    - module spawn
    - module completion/failure
    - aggregate completion counts
- **No panic paths in orchestration**
  - Runtime errors should not panic in production paths.
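To make the state-machine visibility principle concrete, here is a minimal std-only sketch. `JobStatus`, `is_valid_transition`, and `log_transition` are hypothetical names, and `println!` stands in for the structured `tracing` event the real pipeline would emit; only the transitions named above are treated as legal.

```rust
use std::time::Instant;

// Hypothetical status type; the real job model defines its own.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobStatus {
    Pending,
    Queued,
    Running,
    Completed,
    Failed,
    Cancelled,
}

/// Only the transitions named in the state machine above are legal.
fn is_valid_transition(from: JobStatus, to: JobStatus) -> bool {
    use JobStatus::*;
    matches!(
        (from, to),
        (Pending, Queued)
            | (Queued, Running)
            | (Running, Completed)
            | (Running, Failed)
            | (Running, Cancelled)
    )
}

/// Emit one structured line per transition, carrying the reason and the
/// time spent in the previous state (`println!` stands in for `tracing::info!`).
fn log_transition(job_id: &str, from: JobStatus, to: JobStatus, reason: &str, entered_at: Instant) {
    println!(
        "job.lifecycle.transition job_id={job_id} status_from={from:?} status_to={to:?} \
         reason={reason} duration_ms={}",
        entered_at.elapsed().as_millis()
    );
}
```

Guarding transitions through one function like this gives a single choke point where every change of state is both validated and logged.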
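The fan-out accountability and no-panic principles can be sketched together. `run_modules` is a hypothetical helper: `thread::spawn` stands in for the worker's async task spawning, and the `"broken"` module name is a contrived trigger for a simulated panic; catching the panic via `join()` turns it into a logged `job.module.panic` event instead of a crash.

```rust
use std::thread;

/// Run each module concurrently, log spawn/completion/panic per module,
/// and report aggregate counts so fan-out is fully accounted for.
fn run_modules(job_id: &str, modules: &[&str]) -> (usize, usize) {
    let handles: Vec<_> = modules
        .iter()
        .map(|&module| {
            println!("job.module.spawn job_id={job_id} module={module}");
            let m = module.to_string();
            let handle = thread::spawn(move || {
                if m == "broken" {
                    panic!("module exploded"); // simulated unexpected panic
                }
            });
            (module, handle)
        })
        .collect();

    let (mut ok, mut failed) = (0, 0);
    for (module, handle) in handles {
        // `join()` catches the thread's panic, so an unexpected module
        // failure becomes a logged event rather than a process crash.
        match handle.join() {
            Ok(()) => {
                println!("job.module.complete job_id={job_id} module={module}");
                ok += 1;
            }
            Err(_) => {
                println!("job.module.panic job_id={job_id} module={module}");
                failed += 1;
            }
        }
    }
    println!("job.modules.done job_id={job_id} ok={ok} failed={failed}");
    (ok, failed)
}
```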
## Event Model
### Job Lifecycle Events

- `job.lifecycle.enqueue`
- `job.lifecycle.start`
- `job.lifecycle.complete`
- `job.lifecycle.fail`
- `job.lifecycle.cancel`

Suggested fields: `job_id`, `project_id`, `status_from`, `status_to`, `reason`, `duration_ms` (where applicable).
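One way to keep these suggested fields consistent across call sites is a small payload type. `LifecycleEvent` and `to_log_line` are hypothetical names for illustration; `duration_ms` and `reason` are optional because, for example, an enqueue event has no elapsed phase yet.

```rust
/// Suggested lifecycle-event payload, field names matching the list above.
#[derive(Debug)]
struct LifecycleEvent {
    name: &'static str, // e.g. "job.lifecycle.complete"
    job_id: String,
    project_id: String,
    status_from: &'static str,
    status_to: &'static str,
    reason: Option<String>,
    duration_ms: Option<u64>,
}

impl LifecycleEvent {
    /// Render as a key=value log line; a real pipeline would pass these
    /// as structured `tracing` fields instead of formatting by hand.
    fn to_log_line(&self) -> String {
        let mut line = format!(
            "{} job_id={} project_id={} status_from={} status_to={}",
            self.name, self.job_id, self.project_id, self.status_from, self.status_to
        );
        if let Some(r) = &self.reason {
            line.push_str(&format!(" reason={r}"));
        }
        if let Some(d) = self.duration_ms {
            line.push_str(&format!(" duration_ms={d}"));
        }
        line
    }
}
```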
### Module Execution Events

- `job.module.spawn`
- `job.module.complete`
- `job.module.error`
- `job.module.panic` (unexpected task panic)

Suggested fields: `job_id`, `module`, `duration_ms`, `error`.
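Keeping the `complete`/`error`/`panic` distinction honest is easiest in one mapping function. This is a sketch: `module_event_name` is a hypothetical helper, and `Result<(), String>` plus the `panicked` flag stand in for the worker's real error and join-outcome types.

```rust
/// Map a module outcome to the matching event name from the list above.
/// A panic always wins over the result, since a panicked task's
/// `Result` is meaningless.
fn module_event_name(outcome: &Result<(), String>, panicked: bool) -> &'static str {
    match (panicked, outcome) {
        (true, _) => "job.module.panic",
        (false, Ok(())) => "job.module.complete",
        (false, Err(_)) => "job.module.error",
    }
}
```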
## Instrumentation Strategy

- Use `#[instrument]` on public async orchestration methods.
- Use `info` for lifecycle milestones.
- Use `warn` for degraded-but-continued paths.
- Use `error` for failed operations and panics.
- Include elapsed timing at operation boundaries for coarse latency tracking.
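The elapsed-timing point can be sketched as a small boundary wrapper. `timed` is a hypothetical helper; in the real code, `#[instrument]` plus a `tracing` event at the end of the method would replace the `println!` here.

```rust
use std::time::Instant;

/// Wrap an operation so elapsed time is logged at its boundary,
/// regardless of what the operation computes.
fn timed<T>(op_name: &str, f: impl FnOnce() -> T) -> T {
    let started = Instant::now();
    let out = f();
    println!(
        "op.finish name={op_name} duration_ms={}",
        started.elapsed().as_millis()
    );
    out
}
```

The wrapper returns the operation's value unchanged, so it can be dropped around existing call sites without altering behavior.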
## Operational Outcomes
With this architecture, operators can answer:
- Which phase is failing most often?
- Which module frequently exceeds expected runtime?
- Which jobs failed due to transition/persistence errors vs module logic?
- How long do jobs spend in each phase?
## Future Extensions
- Export spans to OpenTelemetry collector.
- Add metrics counters/histograms aligned with event model.
- Correlate webhook delivery outcomes with lifecycle events.