SensitivityDataset: read episode_results.jsonl directly#811
Conversation
… FactorSchema - Add SensitivityDataset.from_episode_results: discover the factor schema from the recorder's per-episode variation draws (number -> continuous, numeric vector -> one scalar factor per component, string -> categorical), building theta/x directly. - Make generate_report single-mode (--episode_results); remove the factors.yaml / episode_summary path (from_files, FactorSchema.from_yaml, arena_env_args helpers). - Remove the FactorSchema dataclass: SensitivityDataset now holds factors: list[FactorSpec] and owns factor_columns; consumers use dataset.factors. - Replace the two from_files tests with episode_results equivalents covering the continuous-first layout, integer-coded categoricals, and inferred ranges. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- Remove the vestigial FactorSpec.dim field: vector variations are split into scalar factors in from_episode_results, so a factor always spans one theta column. - Simplify factor_columns to one-column-per-factor (continuous-first) and the constructor's width check to len(factors); drop the total_factor_dim arithmetic. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- Vector variation components are named with an opaque positional suffix (key[0], ...); note the follow-up to emit semantic names (e.g. camera x_right/y_down/z_forward) at the recorder. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
…ests - Drop factors that took a single value across all episodes (constant continuous or one-choice categorical) instead of warn-and-keep; a constant categorical otherwise crashed MNPE's mixed-density transform during fit. Assert if nothing varied. - Update concept_sensitivity_analysis.rst to the single --episode_results flow with discovered schema; remove the stale factors.yaml / episode_summary docs and CLI. - Add tests: constant-factor dropping, all-constant raises, bool->categorical, inconsistent factor set across rows, non-numeric vector component. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
Clean, well-scoped refactor: the sensitivity analyzer now discovers its factor schema directly from the recorder's episode_results.jsonl variations block, removing the hand-authored factors.yaml indirection, the FactorSchema wrapper, and the vestigial FactorSpec.dim. The schema-from-data inference (number → continuous, numeric vector → one factor per component, string/bool → categorical) is sensibly chosen and the constant-factor drop is a real correctness guard (a constant continuous factor would give a zero-width range and a constant categorical crashes MNPE). Tests cover the new paths well, including the negative cases. This is the kind of removal-and-simplify change the codebase benefits from.
The CLI/generate_report/from_files signature change is a breaking change to this entry point, but it's the stated purpose of the PR, the docs are updated in lockstep, and the recorder that emits this format isn't shipped yet — so downstream blast radius is minimal.
Findings
🔵 Improvement: dataset.py:315 — outcome values get no type validation or row context, unlike the careful asserts on the variations side. A malformed recorder field (e.g. a string outcome) would raise a bare ValueError/TypeError with no pointer to the offending row. Worth a small guarded conversion mirroring _flatten_variation_value so a bad outcome fails as legibly as a bad variation does.
Test Coverage
Good. The two from_files tests are replaced with from_episode_results equivalents covering vector-splitting, mixed continuous/categorical with inferred ranges/choices, constant-factor dropping, all-constant raising, bool→categorical, inconsistent factor sets, and non-numeric vector components. These are pure-CPU analysis tests (no sim), so the inner/outer run_simulation_app_function pattern correctly doesn't apply.
Verdict
Ship it
Greptile SummaryThis PR replaces the two-file input schema (
Confidence Score: 5/5Safe to merge — the refactoring is self-contained, all call sites are updated consistently, and the test suite comprehensively covers the new reader paths. The schema-discovery logic is correct end-to-end: vector expansion, constant-factor dropping, type-consistency checks across rows, and the null/non-dict variations guard all behave as designed. The range representation change (list-of-pairs → flat tuple) is applied uniformly across dataset.py, analyzer.py, plotting.py, and the synthetic test fixture. Issues flagged in previous review threads — null variations crashing the pre-check, double min/max traversal, and the interleaved syntax error — are all resolved in the current code. No regressions or new defects identified. No files require special attention. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant User
participant generate_report
participant episode_results_reader
participant SensitivityDataset
participant SensitivityAnalyzer
participant plot_marginals
User->>generate_report: episode_results.jsonl + outcome_names + factor_names
generate_report->>episode_results_reader: dataset_from_episode_results()
episode_results_reader->>episode_results_reader: _read_rows() — parse JSONL
episode_results_reader->>episode_results_reader: _discover_factor_values() — flatten vectors, validate rows
episode_results_reader->>episode_results_reader: _build_factor_columns() — drop constants, infer ranges
episode_results_reader->>episode_results_reader: _build_outcome_columns()
episode_results_reader->>SensitivityDataset: SensitivityDataset(factors, theta, x)
SensitivityDataset-->>generate_report: dataset
generate_report->>SensitivityAnalyzer: SensitivityAnalyzer(dataset)
generate_report->>SensitivityAnalyzer: fit()
generate_report->>SensitivityAnalyzer: sample_posterior(observation)
SensitivityAnalyzer-->>generate_report: posterior samples
generate_report->>plot_marginals: plot_marginals(samples, dataset, observation)
plot_marginals-->>User: figure saved to output_path
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant User
participant generate_report
participant episode_results_reader
participant SensitivityDataset
participant SensitivityAnalyzer
participant plot_marginals
User->>generate_report: episode_results.jsonl + outcome_names + factor_names
generate_report->>episode_results_reader: dataset_from_episode_results()
episode_results_reader->>episode_results_reader: _read_rows() — parse JSONL
episode_results_reader->>episode_results_reader: _discover_factor_values() — flatten vectors, validate rows
episode_results_reader->>episode_results_reader: _build_factor_columns() — drop constants, infer ranges
episode_results_reader->>episode_results_reader: _build_outcome_columns()
episode_results_reader->>SensitivityDataset: SensitivityDataset(factors, theta, x)
SensitivityDataset-->>generate_report: dataset
generate_report->>SensitivityAnalyzer: SensitivityAnalyzer(dataset)
generate_report->>SensitivityAnalyzer: fit()
generate_report->>SensitivityAnalyzer: sample_posterior(observation)
SensitivityAnalyzer-->>generate_report: posterior samples
generate_report->>plot_marginals: plot_marginals(samples, dataset, observation)
plot_marginals-->>User: figure saved to output_path
Reviews (8): Last reviewed commit: "Sensitivity: fix the --factors pre-check..." | Re-trigger Greptile |
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- from_episode_results / generate_report take an optional factor_names allowlist (--factors); a vector variation is selected by its base name and keeps all components. Unset analyzes every recorded variation (unchanged default). Unknown names raise. - Add tests for subset selection and unknown-name rejection; document --factors. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- Drop reStructuredText double-ticks and semicolons from the docstrings and comments we added; remove restated default values and estimator-internal asides. - Keep each docstring to what the function does rather than the surrounding flow. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This drops the hand-authored factors.yaml / episode_summary.jsonl indirection in favor of discovering the factor schema directly from the recorder's per-episode variations block, and collapses FactorSchema + FactorSpec.dim into a flat factors: list[FactorSpec] where every factor is exactly one theta column. The simplification is well-motivated and the new discovery/validation path is thorough — inconsistent factor sets, non-numeric vector components, empty/constant inputs, and unknown --factors names all fail loudly with clear messages, and the tests cover each of those branches plus the vector-split and bool→categorical cases. Nice, well-scoped cleanup.
Findings
🔵 Improvement: dataset.py:192 — type inference from JSON value type means an integer-coded categorical would be silently treated as continuous. See inline note — likely fine given the recorder emits strings for categoricals, but worth confirming/documenting the assumption.
Test Coverage
These are CPU-only analysis tests (no Isaac Sim), so the inner/outer run_simulation_app_function pattern correctly does not apply. Coverage is good: the new from_episode_results path is exercised across vector-split, mixed continuous/categorical, constant-drop, all-constant-raises, bool→categorical, inconsistent-factor-set, non-numeric-component, subset selection, and unknown-name rejection. The two removed from_files tests are replaced 1:1 with episode-results equivalents.
Verdict
Ship it
…r assumption - The posterior is taken relative to how the sweep drew the factors, so an unevenly sampled categorical leaks its sampling frequency into the result (a no-effect factor stops reading flat). Warn when the most-sampled choice exceeds the least by 1.5x or more. - Document in the concept page that factors should be drawn from the prior (uniform per continuous range, equal episodes per categorical choice), and that balancing the draws is the clean fix. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This refactor has SensitivityDataset read the recorder's episode_results.jsonl directly and discover the factor schema from each row's variations block, retiring the hand-authored factors.yaml, the FactorSchema wrapper, from_files, and FactorSpec.dim. It's a genuine net simplification — fewer types, one input path, one column per factor — and the discovery rules (number→continuous, numeric vector→one factor per component, string/bool→categorical), constant-factor dropping, and the uneven-categorical warning are all backed by focused tests. Discovery, validation, type-consistency, and the public generate_report signature all check out, and no other module references the removed symbols, so nothing downstream breaks.
Findings
🔵 Improvement: dataset.py:212 — _build_dataset_from_episode_rows carries a lot at once (subset validation → per-row discovery/validation → constant-drop → theta/x assembly). Optional polish: see inline.
Test Coverage
Strong. The new cases cover vector-split, mixed continuous/categorical, constant-factor dropping, the all-constant raise, bool→categorical, inconsistent factor sets across rows, non-numeric vector components, subset selection, and unknown-name rejection. All are pure-Python (no sim), so they correctly land in Phase 1 and don't need the inner/outer run_simulation_app_function pattern. The two retired from_files tests are replaced with from_episode_results equivalents.
Verdict
Ship it
alexmillane
left a comment
There was a problem hiding this comment.
Looks great.
Only real comment is the request for a e2e smoke test.
…view) - Define theta (the per-episode factor matrix, one column per factor) in the FactorSpec and SensitivityDataset docstrings, so "column of theta" reads on its own. - FactorSpec.range is now a single (low, high) tuple, not a list of pairs — the list was vestigial from the removed per-dim handling. Updates analyzer, plotting, reader, and tests. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
- Move from_episode_results and its helpers to a new episode_results_reader.py module, exposed as dataset_from_episode_results(); dataset.py is now a pure in-memory container that no longer knows the recorder's on-disk format. - Break the reader into small composable helpers (_read_rows, _discover_factor_values, _build_factor_columns, _warn_if_unevenly_sampled, _build_outcome_columns); the entry point is a short composition that returns a SensitivityDataset directly. - Update generate_report, the analyzer error message, and the tests to the new entry point. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This replaces the hand-authored factors.yaml + episode_summary.jsonl pair with direct discovery from a single episode_results.jsonl, and moves the on-disk reader into its own episode_results_reader.py so dataset.py stays a pure in-memory container. The data model is genuinely simpler (one factor = one theta column, range inferred from data), the layering is clean (analysis stays sim-agnostic, the reader is the only thing that knows the recorder format), and the test suite covers the new paths well — vector splitting, mixed continuous/categorical, bool-as-categorical, constant dropping, inconsistent factor sets, subset selection, and unknown-name rejection. Only a couple of minor nits below.
Findings
🔵 Improvement: episode_results_reader.py:233 — the 1.5 imbalance threshold is a behavior-controlling magic number; pulling it into a named module constant would make it discoverable and tunable.
🔵 Improvement: episode_results_reader.py:223 — _warn_if_unevenly_sampled is only ever called with categorical string values, so the parameter type could be the more precise list[str].
Test Coverage
Thorough. The new dataset_from_episode_results paths are well exercised (including the negative cases: inconsistent factor set, non-numeric vector component, all-constant, unknown factor name). These are CPU-only analysis tests with no sim dependency, so the inner/outer run_simulation_app_function pattern correctly does not apply. The synthetic-data and from_files→from_episode_results migrations keep the existing recovery tests intact.
Verdict
Ship it
- Guard that a row's variations block is a dict, so a null/non-object value fails with a clear message instead of an AttributeError. - Assert each outcome value is numeric or boolean with row-and-path context, matching the variation-side error legibility instead of a bare cast error. - Note in the docstring that categoricals must be recorded as strings or bools, not integer codes (an integer is read as a continuous factor). - Lift the imbalance-warning threshold to a named _IMBALANCE_WARN_RATIO constant. - Add guard tests for the non-dict variations and non-numeric outcome cases. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This is a welcome simplification: the sensitivity toolbox now discovers its factor schema directly from the recorder's episode_results.jsonl (dataset_from_episode_results), dropping the hand-authored factors.yaml, the FactorSchema wrapper, and the vestigial FactorSpec.dim. dataset.py becomes a pure in-memory container and the on-disk format lives in one well-decomposed reader module. The net is ~200 fewer lines, the layering reads cleanly (pure-Python, sim-agnostic analysis), and the new behavior is well covered by tests. Two minor robustness/consistency nits below; neither blocks.
Findings
See inline comments.
Test Coverage
Good. The new reader is exercised across vector-splitting, mixed continuous/categorical discovery, constant-factor dropping, all-constant rejection, bool→categorical, inconsistent factor sets, non-numeric vector components, subset selection, unknown-name rejection, non-dict variations, and non-numeric outcomes. These are pure-CPU analysis tests (no SimulationApp), so they correctly stay in Phase 1 and don't need the inner/outer simulation pattern. The replaced from_files tests were ported to episode_results equivalents.
Verdict
Ship it
Co-authored-by: arena-review-bot[bot] <290456231+arena-review-bot[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This is a clean simplification: the sensitivity toolbox now discovers its factor schema straight from the recorder's episode_results.jsonl, dropping the hand-authored factors.yaml / FactorSchema / FactorSpec.dim machinery. The reader is well-placed in its own module so dataset.py stays a pure in-memory container, and the new behavior (constant-factor dropping, imbalance warning, subset selection) is well covered by tests.
There is one hard blocker: a duplicated line introduced by the last (suggestion) commit is a SyntaxError, so the module currently can't import. Once that's removed the PR is in good shape.
Findings
🔴 Critical: episode_results_reader.py:124-125 — see inline comment. A duplicated if selected is not None: is a SyntaxError; nothing in the module imports, so every new test and generate_report is broken. One-line fix.
Test Coverage
Strong. The two old from_files tests are replaced with episode-results equivalents and the new paths are well exercised — vector splitting, mixed continuous/categorical, constant-factor dropping, all-constant raise, bool→categorical, inconsistent factor set, non-numeric vector component / outcome, non-dict variations, subset selection, and unknown-name rejection. These all live in Phase 1 (no sim), which is correct for a CPU-only analysis layer. Note that the syntax error means none of them can currently run — they'll pass once the duplicate line is dropped.
Verdict
Minor fixes needed — sound refactor blocked only by the one-line syntax error.
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This PR drops the hand-authored factors.yaml + episode_summary.jsonl pair in favour of discovering factors directly from a single episode_results.jsonl, and removes FactorSchema / FactorSpec.dim to leave a leaner FactorSpec + SensitivityDataset. The direction is good and the data-model simplification is welcome — but the central new module does not parse, so it can't merge as-is.
Findings
🔴 Critical: episode_results_reader.py:123-132 — the factor_names validation block is malformed (duplicated if selected is not None: headers, an unclosed assert, the missing check never raised). python -m py_compile fails with IndentationError: expected an indented block after 'if' statement on line 124. The module can't be imported, which takes down generate_report.py and every test in test_sensitivity_analysis.py (all import dataset_from_episode_results). This looks like a botched merge/edit. Reconstructed fix in the inline suggestion — it also restores the actual unknown-factor assertion that test_from_episode_results_rejects_unknown_factor_name depends on.
🔵 Improvement: episode_results_reader.py:134 — factor_kinds carries the bare strings "continuous"/"categorical" while the FactorType enum (same module) already models exactly this. Could it use dict[str, FactorType] and FactorType.CATEGORICAL/FactorType.CONTINUOUS, in line with the typed-discriminator convention?
Test Coverage
Good coverage of the new reader: vector splitting, mixed continuous/categorical, constant-factor drop, all-constant error, bool-as-categorical, inconsistent factor set, non-numeric vector/outcome, and the subset path. These are pure-Python (no sim), so they belong in Phase 1 — no inner/outer harness needed. Note that none of these currently run: the IndentationError fails collection for the whole file. Once the syntax is fixed they should pass.
Verdict
Needs rework — the new module doesn't parse; fix the syntax error (and confirm the suite runs green) before this can land.
…uggestions - Two applied GitHub suggestions interleaved three `if selected is not None:` blocks with an unclosed assert, so the module no longer parsed (IndentationError) — breaking every import. - Restore a single pre-check that guards row 0's variations is a dict (clear message instead of a TypeError on null), and re-adds the dropped `assert not missing` for unknown factor names. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
There was a problem hiding this comment.
🤖 Isaac Lab-Arena Review Bot
Summary
This collapses the sensitivity data model from a hand-authored factors.yaml + nested episode_summary.jsonl pair down to a single episode_results.jsonl whose factor schema is discovered from each row's variations block. The new episode_results_reader.py cleanly isolates all on-disk-format knowledge so dataset.py stays a pure in-memory container, and the net effect is a large simplification (≈238 deletions). The discovery, vector-splitting, constant-dropping, and error paths are all covered by focused tests. Reads well and the layering is sound — nothing blocking.
One thing worth noting (not a code finding): this is a deliberate breaking change to the sensitivity CLI/API — --factors_yaml/--episode_summary become --episode_results, and SensitivityDataset.from_files/FactorSchema/FactorSpec.dim are gone. Since the producing recorder hasn't shipped yet (per the .. todo::) and every in-repo caller is updated in this PR, there are no external consumers to surprise — flagging only for the record.
Findings
No issues found. A few things I checked and was happy with: the continuous-first column layout stays consistent across _build_factor_columns, dataset.factor_columns, and the analyzer's _normalize/_normalized_prior slicing; constant factors are dropped before the MNPE fit (the documented crash) and an all-constant input asserts loudly; bool is correctly intercepted before the int/float branch and kept categorical; and sorted(set(...)) choices make the integer coding deterministic/reproducible.
Test Coverage
Thorough. The new tests exercise vector-splitting, mixed continuous/categorical discovery, constant-drop, all-constant raise, bool-as-categorical, inconsistent factor sets, non-numeric vector components, the --factors subset path, unknown-factor and non-dict-variations errors, and non-numeric outcomes — including asserting the MNPE vs NPE selection. These are pure-CPU analysis tests with no sim dependency, so the inner/outer run_simulation_app_function pattern correctly does not apply, and with no camera/subprocess use they land in Phase 1 as intended.
Verdict
Ship it
Summary
Sensitivity analysis reads recorder output directly
Detailed description
episode_results.jsonlcarrying the realized variation draws, so the old hand-writtenfactors.yamlschema plus bespoke nestedepisode_summary.jsonlpair is redundant.SensitivityDataset.from_episode_results, which discovers the factors from each row'svariationsblock — a number becomes continuous, a numeric vector splits into one factor per component, a string becomes categorical (choices learned from the data).generate_reporttakes a single--episode_resultsinput, with--factorsto analyze a subset.factors.yamlpath (from_files), theFactorSchemawrapper, andFactorSpec.dim. It is now justFactorSpec(one factor = one theta column) andSensitivityDataset(factors + theta + x). Continuous-only sweeps fit NPE, mixed sweeps fit MNPE automatically.