Evaluation

DynaSchedBench evaluation is trajectory-based. Generate an instance, run an agent, then evaluate the resulting trajectory.

Trajectory Files

Agent runs can write two trajectory formats:

  • trajectory.json: a full serialized trajectory. This is convenient for small examples and debugging.

  • trajectory_light.jsonl: a streaming summary trajectory. This is the better default for larger runs because it avoids keeping every full snapshot in memory.

Most evaluation and visualization commands accept either format.
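The memory trade-off between the two formats can be sketched in a few lines. This is a minimal illustration, not DynaSchedBench's own loader: the function names are hypothetical, and the record schema is not specified here, so each record is treated as an opaque JSON object.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_light_records(path: str) -> Iterator[dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSONL file.

    Only one line is held in memory at a time, which is why a
    line-delimited format suits long runs.
    """
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def load_full_trajectory(path: str) -> Any:
    """Parse a whole JSON document at once (fine for small examples)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A caller can fold over iter_light_records(...) to build running aggregates without materializing the whole trajectory.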

Recompute Metrics

dsbx-eval from-trajectory \
  -t runs/minimal/spt/trajectory_light.jsonl \
  -o runs/minimal/spt/metrics_recomputed.json

Options:

  • --show-violations / --hide-violations: print or suppress hard-constraint violations.

  • --fail-on-violation: exit with a non-zero code if any violations exist.

  • --warm: ignore an initial warm-up window for time-series metrics.

  • --warmup-ratio: fraction of the final trajectory time treated as the warm-up window (default 0.3).

Metric Families

Static scalar metrics include:

  • flow and completion: makespan, total_flow_time, mean_flow_time, throughput;

  • due-date performance: total_tardiness, mean_tardiness, num_tardy_jobs, total_weighted_tardiness;

  • WIP and queues: max_wip, mean_wip, max_queue_length_total, mean_queue_length_total;

  • completion counts and ratios: num_jobs_total, num_jobs_completed, job_completion_ratio, job_cancellation_ratio;

  • utilization: avg_utilization_global, max_utilization_global, min_utilization_global, utilization_cv_global;

  • job structure: mean_num_ops_per_job, mean_work_content_per_job.
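For orientation, the flow and due-date metrics above can be sketched using their textbook scheduling definitions. This is an assumption: the benchmark's exact formulas (for example, how cancelled or unfinished jobs are treated, or whether makespan is measured from time zero) are not stated here. JobRecord and summarize are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job summary; not the benchmark's own schema."""
    release: float      # time the job became available
    due: float          # due date
    completion: float   # time the last operation finished
    weight: float = 1.0


def summarize(jobs: list[JobRecord]) -> dict[str, float]:
    """Compute textbook flow and tardiness aggregates over completed jobs."""
    if not jobs:
        raise ValueError("at least one completed job is required")
    # Flow time: how long a job spent in the system.
    flow = [j.completion - j.release for j in jobs]
    # Tardiness: lateness clipped at zero, so early jobs contribute nothing.
    tard = [max(0.0, j.completion - j.due) for j in jobs]
    n = len(jobs)
    return {
        "makespan": max(j.completion for j in jobs),
        "total_flow_time": sum(flow),
        "mean_flow_time": sum(flow) / n,
        "total_tardiness": sum(tard),
        "mean_tardiness": sum(tard) / n,
        "num_tardy_jobs": sum(1 for t in tard if t > 0),
        "total_weighted_tardiness": sum(j.weight * t for j, t in zip(jobs, tard)),
    }
```

For two jobs, one finishing at 8 against a due date of 10 and one with weight 2 finishing at 12 against a due date of 9, this yields one tardy job, a total tardiness of 3, and a total weighted tardiness of 6.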

Dynamic metrics include:

  • avg_changed_ops_ratio

  • avg_start_time_shift

  • max_start_time_shift

  • reschedule_steps

  • reschedule_frequency

  • schedule_edit_intensity
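The dynamic metrics measure how much a schedule changes between decision points. The sketch below shows one plausible way to compare two consecutive schedules. It is an assumption rather than the benchmark's definition: schedules are modeled as a mapping from a hypothetical operation id to planned start time, and only operations present in both schedules are compared.

```python
def compare_schedules(
    before: dict[str, float],
    after: dict[str, float],
    tol: float = 1e-9,
) -> dict[str, float]:
    """Summarize start-time changes for operations present in both schedules."""
    common = before.keys() & after.keys()
    if not common:
        # Nothing to compare, so report no change.
        return {
            "changed_ops_ratio": 0.0,
            "mean_start_time_shift": 0.0,
            "max_start_time_shift": 0.0,
        }
    # Absolute displacement of each shared operation's planned start.
    shifts = [abs(after[op] - before[op]) for op in common]
    changed = sum(1 for s in shifts if s > tol)
    return {
        "changed_ops_ratio": changed / len(shifts),
        "mean_start_time_shift": sum(shifts) / len(shifts),
        "max_start_time_shift": max(shifts),
    }
```

Averaging such per-step values over all rescheduling steps would give quantities in the spirit of avg_changed_ops_ratio and avg_start_time_shift, though the exact averaging used by the benchmark may differ.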

Use dsbx-vis metrics-list to print the metric names supported by curve plots and scalar summaries.

Validate Generated Events

dsbx-eval check-events \
  -c runs/minimal/input_model.json \
  -e runs/minimal/events.jsonl \
  --strict-events \
  --fail-on-error

Recommended usage:

  • Run without --strict-events for quick feasibility checks.

  • Add --strict-events before launching large experiments.

  • Add --fail-on-error in CI or benchmark generation scripts.

  • Use --strict-max-messages to limit long error reports.

Check Schedule Feasibility

dsbx-eval check-schedule \
  -t runs/minimal/spt/trajectory_light.jsonl \
  --show-violations \
  --fail-on-violation

This checks hard constraints on the produced schedule. It is useful after implementing a new agent or changing environment logic.
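When the check runs inside a test suite or experiment script, the exit code can gate later steps. The following is a minimal sketch that assumes dsbx-eval is on PATH and uses only the flags shown above. The helper names are hypothetical.

```python
import subprocess


def build_check_schedule_cmd(trajectory: str) -> list[str]:
    """Assemble the check-schedule invocation shown above as an argv list."""
    return [
        "dsbx-eval", "check-schedule",
        "-t", trajectory,
        "--show-violations",
        "--fail-on-violation",
    ]


def schedule_is_feasible(trajectory: str) -> bool:
    """Return True when the command exits with code 0.

    A non-zero code may also indicate an unrelated failure (for example a
    missing file), so the printed output should be inspected before
    concluding that a violation occurred.
    """
    result = subprocess.run(build_check_schedule_cmd(trajectory), check=False)
    return result.returncode == 0
```

Passing the command as a list avoids shell quoting issues with paths that contain spaces.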

Warm-Window Evaluation

Warm-window evaluation ignores early transient behavior:

dsbx-eval from-trajectory \
  -t runs/minimal/spt/trajectory_light.jsonl \
  --warm \
  --warmup-ratio 0.3

With --warmup-ratio 0.3, time-series aggregates begin at 30 percent of the final trajectory time.
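The cutoff rule can be illustrated with a short sketch. It assumes a time series of (time, value) samples with non-negative times and uses a plain unweighted mean. The benchmark's actual aggregates may be time-weighted or handle the boundary differently, and warm_window_mean is a hypothetical name.

```python
def warm_window_mean(
    samples: list[tuple[float, float]],
    warmup_ratio: float = 0.3,
) -> float:
    """Average the values whose timestamp is at or after the warm-up cutoff."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 <= warmup_ratio < 1.0:
        raise ValueError("warmup_ratio must be in [0, 1)")
    final_time = max(t for t, _ in samples)
    # The cutoff is a fraction of the final trajectory time.
    cutoff = warmup_ratio * final_time
    kept = [v for t, v in samples if t >= cutoff]
    # The sample at final_time always survives, so kept is non-empty.
    return sum(kept) / len(kept)
```

With a final time of 10 and a ratio of 0.3, samples taken before time 3 are dropped and only the remainder contributes to the mean.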

Debug LLM Episodes

LLM debug commands consolidate trajectory, event, and log information into compact files for per-decision inspection.

dsbx-eval debug-llmscheduler \
  -t runs/llm_scheduler/trajectory_light.jsonl \
  -e runs/minimal/events.jsonl \
  -l runs/llm_scheduler/main.log \
  -o runs/llm_scheduler/debug.jsonl

dsbx-eval debug-llmcoder \
  -t runs/llm_coder/trajectory_light.jsonl \
  -e runs/minimal/events.jsonl \
  --coder-trajectory runs/llm_coder/coder_trajectory.jsonl \
  -l runs/llm_coder/main.log \
  -o runs/llm_coder/debug.json

Use these files to inspect prompts, selected actions, candidate rules, and event context around difficult decisions.