Evaluation

DynaSchedBench evaluation is trajectory-based. Generate an instance, run an agent, then evaluate the resulting trajectory.

Trajectory Files

Agent runs can write two trajectory formats:

trajectory.json: a full serialized trajectory. This is convenient for small examples and debugging.
trajectory_light.jsonl: a streaming summary trajectory. This is the better default for larger runs because it avoids keeping every full snapshot in memory.

Most evaluation and visualization commands accept either format.
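The memory difference between the two formats comes from how they are read: a JSONL file can be consumed one record at a time. The following is a minimal sketch of that access pattern, not the benchmark's own loader. The record fields `t` and `wip` are hypothetical and do not reflect the actual trajectory_light.jsonl schema.

```python
import io
import json
from typing import IO, Iterator


def iter_records(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records: the field names "t" and "wip" are assumptions,
# not the real trajectory_light.jsonl schema.
sample = io.StringIO(
    '{"t": 0.0, "wip": 2}\n'
    '{"t": 5.0, "wip": 4}\n'
    '\n'
    '{"t": 9.0, "wip": 1}\n'
)

# A running maximum is computed without holding every record in memory.
max_wip = max(rec["wip"] for rec in iter_records(sample))
print(max_wip)  # prints 4
```

With a real file, the same generator can wrap an open file handle, for example `with open(path) as f: ...`.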

Recompute Metrics

dsbx-eval from-trajectory \
  -t runs/minimal/spt/trajectory_light.jsonl \
  -o runs/minimal/spt/metrics_recomputed.json

Options:

--show-violations / --hide-violations: print or suppress the list of hard-constraint violations.
--fail-on-violation: exit with a non-zero code if any violations exist.
--warm: ignore an initial warm-up window for time-series metrics.
--warmup-ratio: warm-up window fraction, default 0.3.

Metric Families

Static scalar metrics include:

flow and completion: makespan, total_flow_time, mean_flow_time, throughput;
due-date performance: total_tardiness, mean_tardiness, num_tardy_jobs, total_weighted_tardiness;
WIP and queues: max_wip, mean_wip, max_queue_length_total, mean_queue_length_total;
completion ratios: num_jobs_total, num_jobs_completed, job_completion_ratio, job_cancellation_ratio;
utilization: avg_utilization_global, max_utilization_global, min_utilization_global, utilization_cv_global;
job structure: mean_num_ops_per_job, mean_work_content_per_job.
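As a rough illustration of the flow and due-date families, the sketch below computes a few of these scalars from per-job records using the textbook scheduling definitions (flow time = completion minus release, tardiness = max(0, completion minus due date)). The `JobRecord` type and its fields are hypothetical, and the benchmark's exact definitions may differ, for example in how cancelled jobs are treated.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class JobRecord:
    """Hypothetical per-job summary; not the benchmark's own data model."""
    release: float
    due: float
    completion: float
    weight: float = 1.0


def static_metrics(jobs):
    """Textbook flow and tardiness metrics for a non-empty list of completed jobs."""
    flow = [j.completion - j.release for j in jobs]
    tard = [max(0.0, j.completion - j.due) for j in jobs]
    return {
        # Assumes the time axis starts at 0, so makespan is the last completion.
        "makespan": max(j.completion for j in jobs),
        "total_flow_time": sum(flow),
        "mean_flow_time": mean(flow),
        "total_tardiness": sum(tard),
        "mean_tardiness": mean(tard),
        "num_tardy_jobs": sum(1 for t in tard if t > 0),
        "total_weighted_tardiness": sum(j.weight * t for j, t in zip(jobs, tard)),
    }


def utilization_cv(utilizations):
    """Coefficient of variation (population std / mean) of per-machine utilization."""
    m = mean(utilizations)
    return pstdev(utilizations) / m if m > 0 else 0.0


jobs = [
    JobRecord(release=0, due=10, completion=8),
    JobRecord(release=2, due=9, completion=12, weight=2.0),
    JobRecord(release=4, due=20, completion=15),
]
metrics = static_metrics(jobs)
print(metrics["total_flow_time"], metrics["num_tardy_jobs"])  # prints 29 1
```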

Dynamic metrics include:

avg_changed_ops_ratio
avg_start_time_shift
max_start_time_shift
reschedule_steps
reschedule_frequency
schedule_edit_intensity
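The names above suggest measures of how much a schedule changes between rescheduling steps. One plausible reading, shown here as a sketch under stated assumptions rather than the benchmark's definition, compares the planned start times of operations that appear in two consecutive schedules. The function name, the dictionary layout, and the choice to average shifts over all shared operations are assumptions.

```python
def schedule_diff(prev, curr, tol=1e-9):
    """Compare planned start times for operations present in both schedules.

    prev, curr: dicts mapping an operation id to its planned start time
    (a hypothetical layout chosen for this example).
    """
    shared = prev.keys() & curr.keys()
    if not shared:
        return {
            "changed_ops_ratio": 0.0,
            "avg_start_time_shift": 0.0,
            "max_start_time_shift": 0.0,
        }
    shifts = [abs(curr[op] - prev[op]) for op in shared]
    changed = sum(1 for s in shifts if s > tol)
    return {
        "changed_ops_ratio": changed / len(shifts),
        # Assumption: the average is taken over all shared operations,
        # including those whose start time did not move.
        "avg_start_time_shift": sum(shifts) / len(shifts),
        "max_start_time_shift": max(shifts),
    }


prev = {"J1-O1": 0.0, "J1-O2": 5.0, "J2-O1": 3.0, "J2-O2": 8.0}
curr = {"J1-O1": 0.0, "J1-O2": 7.0, "J2-O1": 3.0, "J2-O2": 12.0, "J3-O1": 1.0}

diff = schedule_diff(prev, curr)
print(diff["changed_ops_ratio"], diff["avg_start_time_shift"], diff["max_start_time_shift"])
# prints 0.5 1.5 4.0
```

Averaging such per-step values over all rescheduling steps would give episode-level figures comparable in spirit to avg_changed_ops_ratio and avg_start_time_shift.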

Use dsbx-vis metrics-list to print the metric names supported by curve plots and scalar summaries.

Validate Generated Events

dsbx-eval check-events \
  -c runs/minimal/input_model.json \
  -e runs/minimal/events.jsonl \
  --strict-events \
  --fail-on-error

Recommended usage:

Run without --strict-events for quick feasibility checks.
Add --strict-events before launching large experiments.
Add --fail-on-error in CI or benchmark generation scripts.
Use --strict-max-messages to limit long error reports.

Check Schedule Feasibility

dsbx-eval check-schedule \
  -t runs/minimal/spt/trajectory_light.jsonl \
  --show-violations \
  --fail-on-violation

This checks hard constraints on the produced schedule. It is useful after implementing a new agent or changing environment logic.
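To make "hard constraint" concrete, the sketch below checks one typical constraint, that no machine processes two operations at the same time. It is an illustration only: the set of constraints that check-schedule actually enforces is not listed here, and the tuple layout `(op_id, machine, start, end)` is hypothetical.

```python
from collections import defaultdict


def find_machine_overlaps(ops):
    """Return (machine, earlier_op, later_op) for each operation that starts
    before a previously started operation on the same machine has finished.

    ops: iterable of (op_id, machine, start, end) tuples (hypothetical layout).
    Each offending operation is reported once, against the operation with the
    latest end time seen so far, so not every overlapping pair is listed.
    """
    by_machine = defaultdict(list)
    for op_id, machine, start, end in ops:
        by_machine[machine].append((start, end, op_id))

    violations = []
    for machine, items in by_machine.items():
        items.sort()  # order by start time
        latest_end, latest_id = items[0][1], items[0][2]
        for start, end, op_id in items[1:]:
            if start < latest_end:
                violations.append((machine, latest_id, op_id))
            if end > latest_end:
                latest_end, latest_id = end, op_id
    return violations


ops = [
    ("a", "M1", 0, 5),
    ("b", "M1", 4, 6),  # starts before "a" finishes
    ("c", "M1", 6, 9),  # starts exactly when "b" ends: allowed
    ("d", "M2", 0, 3),
]
violations = find_machine_overlaps(ops)
print(violations)  # prints [('M1', 'a', 'b')]

# Mirrors the idea behind --fail-on-violation: non-zero when anything is wrong.
exit_code = 1 if violations else 0
```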

Warm-Window Evaluation

Warm-window evaluation ignores early transient behavior:

dsbx-eval from-trajectory \
  -t runs/minimal/spt/trajectory_light.jsonl \
  --warm \
  --warmup-ratio 0.3

With --warmup-ratio 0.3, time-series aggregates begin at 30 percent of the final trajectory time.
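The effect of the cutoff can be seen on a small piecewise-constant signal such as WIP over time. The sketch below assumes a time-weighted mean; whether dsbx-eval aggregates by time weighting or by sample is not stated here, and the function name and sample layout are hypothetical.

```python
def time_weighted_mean(samples, t_end, t_start=0.0):
    """Time-weighted mean of a piecewise-constant signal on [t_start, t_end].

    samples: list of (time, value) sorted by time; each value holds until the
    next sample, and the last value holds until t_end.
    """
    total = 0.0
    for i, (t, v) in enumerate(samples):
        t_next = samples[i + 1][0] if i + 1 < len(samples) else t_end
        # Clip each segment to the evaluation window.
        lo, hi = max(t, t_start), min(t_next, t_end)
        if hi > lo:
            total += v * (hi - lo)
    span = t_end - t_start
    return total / span if span > 0 else 0.0


# Hypothetical WIP samples: 0 jobs until t=2, 5 jobs until t=6, then 3 jobs.
wip = [(0, 0), (2, 5), (6, 3)]
t_final = 10

full = time_weighted_mean(wip, t_final)
cutoff = 0.3 * t_final  # --warmup-ratio 0.3 puts the cutoff at t = 3
warm = time_weighted_mean(wip, t_final, t_start=cutoff)

print(full)            # 3.2 over the whole horizon
print(round(warm, 3))  # about 3.857 once the empty start-up phase is excluded
```

In this example the warm-window mean is higher because the initial period with an empty shop no longer pulls the average down.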

Debug LLM Episodes

LLM debug commands consolidate trajectory, event, and log information into compact files for per-decision inspection.

dsbx-eval debug-llmscheduler \
  -t runs/llm_scheduler/trajectory_light.jsonl \
  -e runs/minimal/events.jsonl \
  -l runs/llm_scheduler/main.log \
  -o runs/llm_scheduler/debug.jsonl

dsbx-eval debug-llmcoder \
  -t runs/llm_coder/trajectory_light.jsonl \
  -e runs/minimal/events.jsonl \
  --coder-trajectory runs/llm_coder/coder_trajectory.jsonl \
  -l runs/llm_coder/main.log \
  -o runs/llm_coder/debug.json

Use these files to inspect prompts, selected actions, candidate rules, and event context around difficult decisions.