# Evaluation

DynaSchedBench evaluation is trajectory-based. Generate an instance, run an agent, then evaluate the resulting trajectory.

## Trajectory Files

Agent runs can write two trajectory formats:

- `trajectory.json`: a full serialized trajectory. This is convenient for small examples and debugging.
- `trajectory_light.jsonl`: a streaming summary trajectory. This is the better default for larger runs because it avoids keeping every full snapshot in memory.

Most evaluation and visualization commands accept both formats.

## Recompute Metrics

```bash
dsbx-eval from-trajectory \
  -t runs/minimal/spt/trajectory_light.jsonl \
  -o runs/minimal/spt/metrics_recomputed.json
```

Options:

- `--show-violations / --hide-violations`: print hard-constraint violations.
- `--fail-on-violation`: use a non-zero exit code if violations exist.
- `--warm`: ignore an initial warm-up window for time-series metrics.
- `--warmup-ratio`: warm-up window fraction, default `0.3`.

## Metric Families

Static scalar metrics include:

- flow and completion: `makespan`, `total_flow_time`, `mean_flow_time`, `throughput`;
- due-date performance: `total_tardiness`, `mean_tardiness`, `num_tardy_jobs`, `total_weighted_tardiness`;
- WIP and queues: `max_wip`, `mean_wip`, `max_queue_length_total`, `mean_queue_length_total`;
- completion ratios: `num_jobs_total`, `num_jobs_completed`, `job_completion_ratio`, `job_cancellation_ratio`;
- utilization: `avg_utilization_global`, `max_utilization_global`, `min_utilization_global`, `utilization_cv_global`;
- job structure: `mean_num_ops_per_job`, `mean_work_content_per_job`.

Dynamic metrics include:

- `avg_changed_ops_ratio`
- `avg_start_time_shift`
- `max_start_time_shift`
- `reschedule_steps`
- `reschedule_frequency`
- `schedule_edit_intensity`

Use `dsbx-vis metrics-list` to print the metric names supported by curve plots and scalar summaries.

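For readers less familiar with the flow and due-date metric names above, the sketch below computes a few of them using their usual textbook scheduling definitions. It is an illustration only: the `JobRecord` class, its field names, and the `flow_and_due_date_metrics` function are hypothetical and are not part of DynaSchedBench, whose exact definitions (for example, the reference time used for `makespan`) may differ.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical per-job summary; NOT the DynaSchedBench trajectory schema."""
    release: float      # time the job became available
    completion: float   # time the job's last operation finished
    due: float          # due date
    weight: float = 1.0


def flow_and_due_date_metrics(jobs):
    """Textbook definitions; assumed, not taken from the benchmark's source."""
    # Flow time: how long each job spent in the system.
    flow_times = [j.completion - j.release for j in jobs]
    # Tardiness: lateness clipped at zero, so early jobs contribute nothing.
    tardiness = [max(0.0, j.completion - j.due) for j in jobs]
    return {
        "makespan": max(j.completion for j in jobs),
        "total_flow_time": sum(flow_times),
        "mean_flow_time": sum(flow_times) / len(jobs),
        "total_tardiness": sum(tardiness),
        "num_tardy_jobs": sum(1 for t in tardiness if t > 0),
        "total_weighted_tardiness": sum(
            j.weight * t for j, t in zip(jobs, tardiness)
        ),
    }


# Two made-up jobs: the first finishes early, the second is 4 time units late.
jobs = [
    JobRecord(release=0.0, completion=10.0, due=12.0),
    JobRecord(release=2.0, completion=15.0, due=11.0, weight=2.0),
]
metrics = flow_and_due_date_metrics(jobs)
```

In this toy example only the second job is tardy, so the weighted tardiness is that job's tardiness multiplied by its weight.
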
## Validate Generated Events

```bash
dsbx-eval check-events \
  -c runs/minimal/input_model.json \
  -e runs/minimal/events.jsonl \
  --strict-events \
  --fail-on-error
```

Recommended usage:

- Run without `--strict-events` for quick feasibility checks.
- Add `--strict-events` before launching large experiments.
- Add `--fail-on-error` in CI or benchmark generation scripts.
- Use `--strict-max-messages` to limit long error reports.

## Check Schedule Feasibility

```bash
dsbx-eval check-schedule \
  -t runs/minimal/spt/trajectory_light.jsonl \
  --show-violations \
  --fail-on-violation
```

This checks hard constraints on the produced schedule. It is useful after implementing a new agent or changing environment logic.

## Warm-Window Evaluation

Warm-window evaluation ignores early transient behavior:

```bash
dsbx-eval from-trajectory \
  -t runs/minimal/spt/trajectory_light.jsonl \
  --warm \
  --warmup-ratio 0.3
```

With `--warmup-ratio 0.3`, time-series aggregates begin at 30 percent of the final trajectory time.

## Debug LLM Episodes

LLM debug commands consolidate trajectory, event, and log information into compact files for per-decision inspection.

```bash
dsbx-eval debug-llmscheduler \
  -t runs/llm_scheduler/trajectory_light.jsonl \
  -e runs/minimal/events.jsonl \
  -l runs/llm_scheduler/main.log \
  -o runs/llm_scheduler/debug.jsonl
```

```bash
dsbx-eval debug-llmcoder \
  -t runs/llm_coder/trajectory_light.jsonl \
  -e runs/minimal/events.jsonl \
  --coder-trajectory runs/llm_coder/coder_trajectory.jsonl \
  -l runs/llm_coder/main.log \
  -o runs/llm_coder/debug.json
```

Use these files to inspect prompts, selected actions, candidate rules, and event context around difficult decisions.
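A first step when inspecting a `.jsonl` debug file can be to check which fields its records actually contain. The sketch below is schema-agnostic: it assumes only that the file holds one JSON object per line (the usual JSON Lines convention) and makes no claim about the field names DynaSchedBench writes. The `summarize_jsonl_keys` helper and the sample records are hypothetical.

```python
import json
from collections import Counter


def summarize_jsonl_keys(lines):
    """Count how often each top-level key appears across JSONL records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if isinstance(record, dict):
            counts.update(record.keys())
    return counts


# Made-up records for illustration; real debug files will have other fields.
sample = [
    '{"step": 0, "note": "a"}',
    "",
    '{"step": 1}',
]
key_counts = summarize_jsonl_keys(sample)
print(key_counts)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, for example `with open("runs/llm_scheduler/debug.jsonl") as f: summarize_jsonl_keys(f)`.
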