Evaluation
DynaSchedBench evaluation is trajectory-based. Generate an instance, run an agent, then evaluate the resulting trajectory.
Trajectory Files
Agent runs can write two trajectory formats:
trajectory.json: a full serialized trajectory. This is convenient for small examples and debugging.
trajectory_light.jsonl: a streaming summary trajectory. This is the better default for larger runs because it avoids keeping every full snapshot in memory.
Most evaluation and visualization commands accept both formats.
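The memory advantage of the streaming format can be illustrated with a minimal Python sketch. It assumes only that trajectory_light.jsonl follows the usual JSON Lines convention of one JSON object per line. The record schema is not described here, so the helper names iter_jsonl and count_records are hypothetical and the code does not look at any particular field.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line.

    Only the current line is held in memory, which is the property that
    makes a .jsonl trajectory cheaper to process than one large JSON file.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records without building a list of them."""
    return sum(1 for _ in iter_jsonl(path))
```

A call such as count_records("runs/minimal/spt/trajectory_light.jsonl") walks the file once. Reading trajectory.json with json.load would instead materialize the whole trajectory before any processing starts.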
Recompute Metrics
dsbx-eval from-trajectory \
-t runs/minimal/spt/trajectory_light.jsonl \
-o runs/minimal/spt/metrics_recomputed.json
Options:
--show-violations / --hide-violations: print or hide hard-constraint violations.
--fail-on-violation: exit with a non-zero code if violations exist.
--warm: ignore an initial warm-up window for time-series metrics.
--warmup-ratio: warm-up window fraction, default 0.3.
Metric Families
Static scalar metrics include:
flow and completion: makespan, total_flow_time, mean_flow_time, throughput;
due-date performance: total_tardiness, mean_tardiness, num_tardy_jobs, total_weighted_tardiness;
WIP and queues: max_wip, mean_wip, max_queue_length_total, mean_queue_length_total;
completion ratios: num_jobs_total, num_jobs_completed, job_completion_ratio, job_cancellation_ratio;
utilization: avg_utilization_global, max_utilization_global, min_utilization_global, utilization_cv_global;
job structure: mean_num_ops_per_job, mean_work_content_per_job.
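As a worked illustration of the flow and due-date families, the sketch below uses the textbook scheduling definitions: flow time is completion minus release, tardiness is max(0, completion minus due date), and makespan is the latest completion time. These definitions are an assumption, and the benchmark may differ in details such as how cancelled jobs are treated. The JobRecord type and static_metrics function are hypothetical names, not part of DynaSchedBench.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRecord:
    """Hypothetical per-job summary; not the benchmark's own schema."""
    release: float
    completion: float
    due: float
    weight: float = 1.0


def static_metrics(jobs: List[JobRecord]) -> Dict[str, float]:
    """Compute flow and due-date metrics under textbook definitions."""
    if not jobs:
        raise ValueError("at least one completed job is required")
    flow = [j.completion - j.release for j in jobs]
    tardiness = [max(0.0, j.completion - j.due) for j in jobs]
    return {
        # Assumes the time axis starts at 0, so makespan is the last completion.
        "makespan": max(j.completion for j in jobs),
        "total_flow_time": sum(flow),
        "mean_flow_time": sum(flow) / len(jobs),
        "total_tardiness": sum(tardiness),
        "mean_tardiness": sum(tardiness) / len(jobs),
        "num_tardy_jobs": sum(1 for t in tardiness if t > 0),
        "total_weighted_tardiness": sum(
            j.weight * t for j, t in zip(jobs, tardiness)
        ),
    }
```

For three jobs with (release, completion, due, weight) equal to (0, 10, 12, 1), (2, 15, 12, 2), and (5, 20, 20, 1), only the second job is late. It is late by 3 time units, so the total tardiness is 3 and the total weighted tardiness is 6.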
Dynamic metrics include:
avg_changed_ops_ratio
avg_start_time_shift
max_start_time_shift
reschedule_steps
reschedule_frequency
schedule_edit_intensity
Use dsbx-vis metrics-list to print the metric names supported by curve plots
and scalar summaries.
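The names avg_changed_ops_ratio, avg_start_time_shift, and max_start_time_shift suggest a comparison between consecutive schedules. The sketch below shows one plausible reading for a single reschedule step. It assumes that an operation counts as changed when its planned start time moves, and that only operations present in both schedules are compared. The benchmark's actual definitions may differ, for example by also counting machine reassignments. The function name compare_schedules is hypothetical.

```python
from typing import Dict


def compare_schedules(
    old: Dict[str, float], new: Dict[str, float]
) -> Dict[str, float]:
    """Compare planned start times (op id -> start) before and after a reschedule.

    Assumption: an operation is 'changed' if its start time differs, and
    operations missing from either schedule are ignored.
    """
    common = old.keys() & new.keys()
    if not common:
        return {
            "changed_ops_ratio": 0.0,
            "mean_start_time_shift": 0.0,
            "max_start_time_shift": 0.0,
        }
    shifts = [abs(new[op] - old[op]) for op in common]
    changed = sum(1 for s in shifts if s > 0)
    return {
        "changed_ops_ratio": changed / len(shifts),
        "mean_start_time_shift": sum(shifts) / len(shifts),
        "max_start_time_shift": max(shifts),
    }
```

Averaging changed_ops_ratio and mean_start_time_shift over all reschedule steps would give episode-level values in the spirit of the avg_ metrics, again under the assumptions stated above.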
Validate Generated Events
dsbx-eval check-events \
-c runs/minimal/input_model.json \
-e runs/minimal/events.jsonl \
--strict-events \
--fail-on-error
Recommended usage:
Run without --strict-events for quick feasibility checks.
Add --strict-events before launching large experiments.
Add --fail-on-error in CI or benchmark generation scripts.
Use --strict-max-messages to limit long error reports.
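For generation scripts written in Python, the recommendations above could be wrapped as follows. This is a sketch, not part of DynaSchedBench: the helper names build_check_events_cmd and run_check are hypothetical, only flags shown in this section are used, and it assumes that --fail-on-error makes the command exit with a non-zero code when problems are found.

```python
import subprocess
from typing import List


def build_check_events_cmd(
    config: str, events: str, strict: bool = False, fail_on_error: bool = False
) -> List[str]:
    """Assemble the dsbx-eval check-events argument list."""
    cmd = ["dsbx-eval", "check-events", "-c", config, "-e", events]
    if strict:
        cmd.append("--strict-events")
    if fail_on_error:
        cmd.append("--fail-on-error")
    return cmd


def run_check(cmd: List[str]) -> int:
    """Run the command and return its exit code (0 is assumed to mean success)."""
    return subprocess.run(cmd).returncode
```

A CI script could call run_check(build_check_events_cmd("runs/minimal/input_model.json", "runs/minimal/events.jsonl", strict=True, fail_on_error=True)) and stop the pipeline when the result is non-zero.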
Check Schedule Feasibility
dsbx-eval check-schedule \
-t runs/minimal/spt/trajectory_light.jsonl \
--show-violations \
--fail-on-violation
This checks hard constraints on the produced schedule. It is useful after implementing a new agent or changing environment logic.
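The exact set of hard constraints is not listed here. To make the idea concrete, the sketch below checks two classic job-shop constraints on a toy schedule: a machine processes one operation at a time, and each operation of a job starts only after the previous one ends. The ScheduledOp type and find_violations function are hypothetical and independent of the benchmark's own checker.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScheduledOp:
    """Hypothetical scheduled operation; not the benchmark's schema."""
    job: str
    index: int      # position of the operation in its job's route
    machine: str
    start: float
    end: float


def find_violations(ops: List[ScheduledOp]) -> List[str]:
    """Return human-readable violations; an empty list means feasible."""
    violations: List[str] = []

    # Machine capacity: after sorting by start time, any overlap on a
    # machine shows up in at least one adjacent pair.
    by_machine: Dict[str, List[ScheduledOp]] = {}
    for op in ops:
        by_machine.setdefault(op.machine, []).append(op)
    for machine, group in by_machine.items():
        ordered = sorted(group, key=lambda o: o.start)
        for prev, cur in zip(ordered, ordered[1:]):
            if cur.start < prev.end:
                violations.append(
                    f"overlap on {machine}: {prev.job}[{prev.index}] "
                    f"and {cur.job}[{cur.index}]"
                )

    # Precedence: consecutive operations of a job must not overlap in time.
    by_job: Dict[str, List[ScheduledOp]] = {}
    for op in ops:
        by_job.setdefault(op.job, []).append(op)
    for job, group in by_job.items():
        ordered = sorted(group, key=lambda o: o.index)
        for prev, cur in zip(ordered, ordered[1:]):
            if cur.start < prev.end:
                violations.append(
                    f"precedence in {job}: op {cur.index} starts before "
                    f"op {prev.index} ends"
                )
    return violations
```

A new agent that emits overlapping operations on one machine would produce a non-empty list here, which is the kind of defect check-schedule is meant to surface.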
Warm-Window Evaluation
Warm-window evaluation ignores early transient behavior:
dsbx-eval from-trajectory \
-t runs/minimal/spt/trajectory_light.jsonl \
--warm \
--warmup-ratio 0.3
With --warmup-ratio 0.3, time-series aggregates begin at 30 percent of the
final trajectory time.
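The effect of the warm-up window can be sketched for a piecewise-constant series such as WIP. The code assumes that the trajectory clock starts at 0, that the cutoff equals warmup_ratio times the final time, and that aggregates are time-weighted means. The helper names time_weighted_mean and warm_mean are hypothetical and do not mirror the tool's internals.

```python
from typing import Sequence


def time_weighted_mean(
    times: Sequence[float],
    values: Sequence[float],
    end_time: float,
    start_time: float = 0.0,
) -> float:
    """Mean of a step signal over [start_time, end_time].

    values[i] holds from times[i] until times[i + 1] (or end_time for the
    last sample). times must be sorted in increasing order.
    """
    total = 0.0
    for i, (t, v) in enumerate(zip(times, values)):
        seg_end = times[i + 1] if i + 1 < len(times) else end_time
        lo = max(t, start_time)       # clip the segment to the window
        hi = min(seg_end, end_time)
        if hi > lo:
            total += v * (hi - lo)
    duration = end_time - start_time
    return total / duration if duration > 0 else 0.0


def warm_mean(
    times: Sequence[float],
    values: Sequence[float],
    end_time: float,
    warmup_ratio: float = 0.3,
) -> float:
    """Time-weighted mean that skips the first warmup_ratio of the run."""
    cutoff = warmup_ratio * end_time
    return time_weighted_mean(times, values, end_time, start_time=cutoff)
```

With WIP values 1, 4, 2 starting at times 0, 20, 60 and a final time of 100, the full-run mean is 260 / 100 = 2.6. With a 0.3 ratio the window starts at 30, so the mean becomes (4 * 30 + 2 * 40) / 70, roughly 2.86, because the low-WIP start-up phase is excluded.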
Debug LLM Episodes
LLM debug commands consolidate trajectory, event, and log information into compact files for per-decision inspection.
dsbx-eval debug-llmscheduler \
-t runs/llm_scheduler/trajectory_light.jsonl \
-e runs/minimal/events.jsonl \
-l runs/llm_scheduler/main.log \
-o runs/llm_scheduler/debug.jsonl
dsbx-eval debug-llmcoder \
-t runs/llm_coder/trajectory_light.jsonl \
-e runs/minimal/events.jsonl \
--coder-trajectory runs/llm_coder/coder_trajectory.jsonl \
-l runs/llm_coder/main.log \
-o runs/llm_coder/debug.json
Use these files to inspect prompts, selected actions, candidate rules, and event context around difficult decisions.