Merged
Changes from 60 commits
Commits
77 commits
bb79bb4 Add sensitivity-analysis pipeline + per-episode recording (cvolk, Jun 3, 2026)
48fba5d Add sensitivity-analysis sweep configs (cvolk, Jun 3, 2026)
5a4e50d Fix episode_writer for MetricsCfg configclass (cvolk, Jun 4, 2026)
ee5b406 Discard sbi training logs during report fits (cvolk, Jun 4, 2026)
95d1d84 Expand shiny/matte sweep to 1000 jobs (cvolk, Jun 4, 2026)
6224293 Replace HTML report with a single-PDF sensitivity report (cvolk, Jun 4, 2026)
730a1db Fix PDF suptitle clipping on narrow single-factor figures (cvolk, Jun 5, 2026)
47fdcb0 Park multi-factor and multi-object sweep configs off the MVP branch (cvolk, Jun 5, 2026)
7ac550a Merge branch 'main' into cvolk/feature/sensitivity_analysis_mvp1 (cvolk, Jun 5, 2026)
5f08550 Densify verbose inline comments in the sensitivity module (cvolk, Jun 5, 2026)
3b51bae Move log_uniform sampling out of the MVP (cvolk, Jun 5, 2026)
8372846 Slim the sensitivity module: drop module docstrings and the sbi null-… (cvolk, Jun 8, 2026)
3e67c1c Add a shared EmpiricalAnalyzer base for the KDE and frequency-table a… (cvolk, Jun 8, 2026)
6699cf7 Split analyzer module into base/posterior/empirical + factory (cvolk, Jun 8, 2026)
73e792e Move sensitivity synthetic-data generators into tests/utils (cvolk, Jun 8, 2026)
383aa79 Move sensitivity CLI scripts into the analysis package (cvolk, Jun 8, 2026)
5447928 Reduce MVP to KDE + MNPE; park NPE and FrequencyTable analyzers (cvolk, Jun 8, 2026)
f4a3fad Make sensitivity factory asserts and docstrings self-contained (cvolk, Jun 8, 2026)
a9d05b8 Move episode_writer from the analysis package into evaluation (cvolk, Jun 8, 2026)
d737e15 Drop the single-PNG analyze CLI; PDF report is the sole renderer (cvolk, Jun 8, 2026)
9eea5fc Drop the synthetic task_duration outcome from the continuous generator (cvolk, Jun 8, 2026)
dcb48a4 Use current-year-only copyright headers in the sensitivity workstream (cvolk, Jun 8, 2026)
bad0760 Hoist episode_writer import to eval_runner module top (cvolk, Jun 8, 2026)
c2c5638 Clarify SliceSpec as plain dataset provenance (cvolk, Jun 8, 2026)
4d9509c Delete the unused synthetic continuous data generator (cvolk, Jun 8, 2026)
102c1ce Hoist shared plot styling into named constants (cvolk, Jun 8, 2026)
3ec444c Add MetricsManager.compute_per_episode (cvolk, Jun 8, 2026)
36b993c Reuse MetricsManager in episode_writer; drop task_duration (cvolk, Jun 8, 2026)
4c53619 Rewrite compute_per_episode with explicit loops (cvolk, Jun 8, 2026)
c2142d5 Rename generate_report CLI module to generate_pdf_report (cvolk, Jun 8, 2026)
52d615b Rework sensitivity docstrings to be concise and self-contained (cvolk, Jun 8, 2026)
ffa1473 Empty the sensitivity package __init__ (cvolk, Jun 8, 2026)
be64ae3 Label binary-outcome rug as = 0 / = 1, not >= 0.5 threshold (cvolk, Jun 8, 2026)
43ac62a Single-source SUCCESS_THRESHOLD and clarify continuous-factor error (cvolk, Jun 8, 2026)
a405da0 Match episode_idx loop variable to the emitted JSONL key (cvolk, Jun 8, 2026)
b6c2d4b Add empirical success-rate plots for binary outcomes (cvolk, Jun 10, 2026)
89cdc1b Split continuous success-rate curves by a categorical factor (cvolk, Jun 10, 2026)
0d41041 Snapshot sensitivity MVP before robolab-matching refactor (cvolk, Jun 11, 2026)
3af9edf Refactor sensitivity analysis to mirror robolab (MNPE + NPE) (cvolk, Jun 11, 2026)
7d02195 Render continuous marginals as density curves; add rich multi-factor … (cvolk, Jun 11, 2026)
432a8ea Clean up sensitivity docstrings and hoist stdlib imports (cvolk, Jun 11, 2026)
5aeba86 Address PR review comments (cvolk, Jun 11, 2026)
ea2355f Sensitivity cleanup: top-level imports, merge report CLI, eval/ outputs (cvolk, Jun 11, 2026)
e695b37 Commit to binary outcomes in the conditioning default (cvolk, Jun 11, 2026)
a12aec8 Add sensitivity analysis documentation page (cvolk, Jun 11, 2026)
4a0a47b Address docs review comments on the sensitivity page (cvolk, Jun 11, 2026)
7871aa1 Relocate synthetic generator into the package and dedupe its builders (cvolk, Jun 11, 2026)
096ba3e Decouple plot_marginals from the analyzer (cvolk, Jun 11, 2026)
b9a9cf5 Model synthetic factors as frozen dataclasses (cvolk, Jun 11, 2026)
0e688ee Move default_observation from the analyzer to the dataset (cvolk, Jun 11, 2026)
1dac0e3 Remove sensitivity sweep eval configs from the MVP PR (cvolk, Jun 11, 2026)
e41e375 Drop the slice block from factors.yaml; TODO a data filter (cvolk, Jun 11, 2026)
57e7f24 Outcomes are an analysis-time query, not part of factors.yaml (cvolk, Jun 11, 2026)
2ee03e4 Revert .gitignore change (cvolk, Jun 11, 2026)
6075a55 Move per-episode recording to a follow-up PR (cvolk, Jun 11, 2026)
fe4cad6 Merge branch 'main' into cvolk/feature/sensitivity_analysis_mvp1 (cvolk, Jun 11, 2026)
33b27e5 Move sensitivity deps to runtime (sbi, scipy, matplotlib) (cvolk, Jun 11, 2026)
adf430d Require at least one --observation value (nargs=+) (cvolk, Jun 11, 2026)
88ff6b8 Docs: collapse the synthetic demo to one example (cvolk, Jun 11, 2026)
2b43c8d Merge the two MNPE synthetic datasets into make_mixed_dataset (cvolk, Jun 11, 2026)
f0afbaf Merge branch 'main' into cvolk/feature/sensitivity_analysis_mvp1 (cvolk, Jun 12, 2026)
48b58bf Assert continuous factors carry a range before normalizing (cvolk, Jun 12, 2026)
ff00343 Seed the sensitivity report RNG for reproducibility (cvolk, Jun 12, 2026)
1cec287 Drop the unused SensitivityDataset.prior property (cvolk, Jun 12, 2026)
9599d51 Test factors.yaml / episode_summary.jsonl parsing (cvolk, Jun 12, 2026)
804759f Model FactorSpec.type as a FactorType enum (cvolk, Jun 12, 2026)
308580d Type FactorSpec.range as list[tuple[float, float]] (cvolk, Jun 12, 2026)
6f4a007 Move synthetic dataset generator into the tests package (cvolk, Jun 12, 2026)
68cca97 Clarify FactorSchema docstring (cvolk, Jun 12, 2026)
b1a25b7 Define theta and x in the SensitivityAnalyzer docstring (cvolk, Jun 12, 2026)
8a24daa Tighten sensitivity docs wording (cvolk, Jun 12, 2026)
0778af2 Define the posterior in the sensitivity docs (cvolk, Jun 12, 2026)
f317570 Mark per-episode recording as a follow-up in the sensitivity docs (cvolk, Jun 12, 2026)
934999a Gloss theta and x in the sensitivity docs (cvolk, Jun 12, 2026)
3c39843 Clarify the report-reading interpretation in the sensitivity docs (cvolk, Jun 12, 2026)
bc917dd Note the planned vector-factor scalarization in the sensitivity docs (cvolk, Jun 12, 2026)
9006912 Signpost the continuous-first theta layout in the analyzer (cvolk, Jun 12, 2026)
155 changes: 155 additions & 0 deletions docs/pages/concepts/policy/concept_sensitivity_analysis.rst
@@ -0,0 +1,155 @@
Sensitivity Analysis
====================

The sensitivity-analysis toolbox answers a single question about a policy:
*which environment conditions drive success?* Given the per-episode results of an
evaluation sweep — where factors such as lighting, object mass, or table material were
varied — it fits a posterior over those factors conditioned on the outcome and renders
one figure summarising which factor values are associated with success.

Why a joint posterior, not a success rate per factor?
-----------------------------------------------------

The simplest analysis would chart a success rate for each factor independently. That hides
the two things that matter most in a multi-factor sweep:

- **Factors interact.** How much light a policy needs can depend on the object: a matte
  object may succeed at low light while a shiny one needs far more. A per-factor
  "success vs light" curve averages over objects and reports one blurry gate that is wrong
  for both. The joint posterior keeps the interaction, so you can condition on a specific
  object and see its gate.
- **Factors confound each other.** If bright-light episodes also happened to use an easy
  object, a per-factor light chart cannot tell which one drove success. Modelling all
  factors together attributes the effect to the factor that actually carries it.

The per-factor rate is a projection of the joint posterior — derivable from it, but not the
other way around. The toolbox therefore always fits the joint — via simulation-based
inference (MNPE or NPE) — and reads the per-factor marginals from it.
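
The interaction argument above can be checked on a toy sweep. The sketch below is
illustrative only: the two objects, their light thresholds, and the helper
``success_rate_by_light`` are assumptions made for this example, and it computes plain
empirical success rates rather than the fitted posterior.

```python
from collections import defaultdict

# Hypothetical ground truth: each object has its own light threshold ("gate").
GATES = {"matte": 1000.0, "shiny": 4000.0}
LIGHT_LEVELS = [500.0 * step for step in range(11)]  # 0, 500, ..., 5000

# One deterministic episode per (object, light) pair.
episodes = [
    {"object": obj, "light": light, "success": int(light >= gate)}
    for obj, gate in GATES.items()
    for light in LIGHT_LEVELS
]


def success_rate_by_light(rows):
    """Mean success per light level over the given rows."""
    totals = defaultdict(lambda: [0, 0])  # light -> [successes, episodes]
    for row in rows:
        totals[row["light"]][0] += row["success"]
        totals[row["light"]][1] += 1
    return {light: wins / count for light, (wins, count) in sorted(totals.items())}


# Per-factor view: pooled over objects, so the two gates blur together.
pooled = success_rate_by_light(episodes)

# Conditioned view: one curve per object, each with a sharp gate.
per_object = {
    obj: success_rate_by_light([row for row in episodes if row["object"] == obj])
    for obj in GATES
}
```

Between the two thresholds the pooled curve sits at 0.5, which describes neither object,
while the per-object curves are already at 0 or 1.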

How it works
------------

The toolbox is a thin analysis layer over `sbi <https://sbi.readthedocs.io>`_'s
neural posterior estimators. The flow is:

1. **Per-episode recording.** During evaluation, ``episode_writer`` appends one row per
   episode to an ``episode_summary.jsonl`` file.
2. **Schema.** A ``factors.yaml`` declares the *factors*: which ``arena_env_args`` columns
   were varied, whether each is continuous or categorical, and the continuous ranges
   that were swept (so the analyzer's prior matches the simulation). It does **not** list
   outcomes; *which* outcome to condition on is chosen at analysis time, not saved here.
3. **Inference.** ``SensitivityAnalyzer`` loads the pair, trains an estimator on the full
   set of ``(theta, x)`` pairs (``theta`` is the factors, ``x`` the outcomes), and samples
   the joint posterior conditioned on a chosen observation (by default, success).
4. **Report.** The report draws a smooth density curve for each continuous factor and a
   probability bar chart for each categorical factor.

Inputs
------

**factors.yaml** declares only the factors that were varied (and the continuous ranges that
were swept). Outcomes are not declared here — they're selected at analysis time (see below):

.. code-block:: yaml

   factors:
     light_intensity:
       type: continuous
       range: [[0.0, 5000.0]]  # the swept range; inferred from the data's min/max if omitted
     table_material:
       type: categorical
       choices: [oak, walnut, bamboo]

**episode_summary.jsonl** is produced by the eval runner — one JSON object per episode. It
carries every measured outcome; the analysis picks which one(s) to condition on:

.. code-block:: json

   {"job_name": "pi0_sweep", "episode_idx": 0,
    "arena_env_args": {"light_intensity": 3200.0, "table_material": "oak"},
    "outcomes": {"success": 1}}

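To make the row layout concrete, here is a minimal sketch of turning such rows into
``theta`` (factors) and ``x`` (outcomes). It is not the toolbox's loader: the function
``rows_to_theta_x``, the second sample row, and coding each category by its index in
``choices`` are assumptions for illustration. The continuous-first column order follows
the analyzer's docstring.

```python
import json

# Hypothetical rows in the documented episode_summary.jsonl layout.
JSONL_TEXT = """\
{"job_name": "pi0_sweep", "episode_idx": 0, "arena_env_args": {"light_intensity": 3200.0, "table_material": "oak"}, "outcomes": {"success": 1}}
{"job_name": "pi0_sweep", "episode_idx": 1, "arena_env_args": {"light_intensity": 400.0, "table_material": "bamboo"}, "outcomes": {"success": 0}}
"""

# Schema mirroring the factors.yaml example: continuous names, then categorical choices.
CONTINUOUS = ["light_intensity"]
CATEGORICAL = {"table_material": ["oak", "walnut", "bamboo"]}


def rows_to_theta_x(text, outcome="success"):
    """Build theta (continuous columns first, then integer codes) and x per episode."""
    theta, x = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        args = row["arena_env_args"]
        factors = [float(args[name]) for name in CONTINUOUS]
        # Assumption: a category is coded by its position in the declared choices.
        factors += [float(choices.index(args[name])) for name, choices in CATEGORICAL.items()]
        theta.append(factors)
        x.append([float(row["outcomes"][outcome])])
    return theta, x


theta, x = rows_to_theta_x(JSONL_TEXT)
```

Each row contributes one ``theta`` vector and one ``x`` vector, so the two lists stay
aligned by episode.
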
Choice of estimator
-------------------

``SensitivityAnalyzer`` picks the estimator from the schema automatically:

.. list-table::
   :header-rows: 1
   :widths: 25 25 50

   * - Schema
     - Estimator
     - Notes
   * - Any categorical factor
     - MNPE
     - Mixed density estimator; handles continuous + categorical factors together.
   * - All continuous factors
     - NPE
     - With a single factor the estimator is limited to a Gaussian, so a meaningful
       continuous-only analysis needs at least two continuous factors.
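
The selection rule in the table can be sketched as a small function. This is a
dependency-free illustration that returns estimator names as strings; both function
names are hypothetical, and the real analyzer returns the ``sbi`` classes instead.

```python
def select_estimator(factor_types):
    """Return the estimator name for a list of 'continuous' / 'categorical' types."""
    if not factor_types:
        raise ValueError("at least one factor is required")
    if any(factor_type == "categorical" for factor_type in factor_types):
        return "MNPE"  # mixed continuous + categorical theta
    return "NPE"  # every factor is continuous


def is_single_factor_npe(factor_types):
    """True for the continuous-only, single-factor case flagged in the table."""
    return select_estimator(factor_types) == "NPE" and len(factor_types) < 2
```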

Continuous factors are normalised to ``[0, 1]`` before fitting and de-normalised when
sampling, so factors on very different scales (e.g. light in the thousands, an offset in
the hundredths) train on equal footing. Outcomes are binary (0/1); the default query
conditions on success (1).
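
The normalisation described above is plain min-max scaling over the swept range. The
sketch below shows the round trip in pure Python; the factor ``grasp_offset`` and its
range are hypothetical, and the analyzer itself applies the same arithmetic to tensors.

```python
def normalize(value, low, high):
    """Map a value from [low, high] to [0, 1]."""
    span = max(high - low, 1e-12)  # guard against a zero-width range
    return (value - low) / span


def denormalize(unit_value, low, high):
    """Inverse of normalize: map [0, 1] back to [low, high]."""
    return unit_value * (high - low) + low


# Two factors on very different scales (the second one is hypothetical).
RANGES = {"light_intensity": (0.0, 5000.0), "grasp_offset": (-0.02, 0.02)}
raw = {"light_intensity": 3200.0, "grasp_offset": 0.01}

unit = {name: normalize(raw[name], *RANGES[name]) for name in raw}
restored = {name: denormalize(unit[name], *RANGES[name]) for name in raw}
```

After scaling, both factors live in ``[0, 1]``, so neither dominates training simply
because of its units.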

Running a report
----------------

Point the report generator at a ``(factors.yaml, episode_summary.jsonl)`` pair. The output
format follows the file extension (``.png``, ``.pdf``, …); reports are written under
``eval/`` by default.

.. code-block:: bash

   python -m isaaclab_arena.analysis.sensitivity.generate_report \
       --factors_yaml factors.yaml \
       --episode_summary episode_summary.jsonl \
       --outcome success \
       --output eval/sensitivity_report.png

``--outcome`` selects which per-episode outcome(s) to condition on (keys in the rows'
``outcomes`` block); it defaults to ``success``. Pass ``--observation`` to set the value
per outcome — since outcomes are binary, use ``1`` for success or ``0`` for failure; it
defaults to ``1`` (success).

Trying it on synthetic data
---------------------------

A synthetic simulator with a *known* ground truth lets you run the whole pipeline on CPU,
without Isaac Sim — useful for seeing the output shape and for validating the toolbox
(the recovered posterior should reflect the planted relationship):

.. code-block:: bash

   # mixed: three continuous + two categorical factors (MNPE)
   python -m isaaclab_arena.analysis.sensitivity.synthetic --kind mixed --output eval/demo.png

``--kind`` also accepts ``continuous`` (continuous-only factors, which exercises the NPE path).

Reading the output
------------------

.. todo::

   Add a sample report figure here and walk through reading it.

Each panel is the posterior over one factor *conditioned on success*: "given that the
policy succeeded, which values of this factor was the episode most likely run under?"
Because the prior over the swept values is uniform, the shape of each panel tracks how
the success rate changes with that factor, averaged over the others. For a continuous
factor, mass concentrated at one end of its range means success favoured that end (e.g. a
curve rising toward bright light suggests the policy is light-gated). For a categorical
factor, the tallest bar is the value most associated with success.
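
As a rough illustration of where the bar heights come from, the sketch below turns
integer-coded posterior samples for one categorical factor into per-value probabilities.
The sample values and the helper ``category_probabilities`` are made up for this
example; the report's own plotting code may differ.

```python
from collections import Counter

CHOICES = ["oak", "walnut", "bamboo"]

# Hypothetical integer-coded posterior samples for one categorical factor.
samples = [0, 0, 2, 0, 1, 0, 2, 0, 0, 1]


def category_probabilities(codes, choices):
    """Fraction of posterior samples that landed on each categorical value."""
    counts = Counter(int(round(code)) for code in codes)
    total = len(codes)
    return {choice: counts.get(index, 0) / total for index, choice in enumerate(choices)}


bars = category_probabilities(samples, CHOICES)
most_associated = max(bars, key=bars.get)  # value with the tallest bar
```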

Current scope
-------------

- Outcomes are treated as **binary** (0/1). Conditioning defaults to success; a continuous
  outcome is rejected with a clear error rather than silently averaged.
- Continuous **vector** factors (``dim > 1``) are reserved for a future extension.
- The estimators run on CPU and do not require Isaac Sim, so a report can be generated
  anywhere the evaluation JSONL is available.
- The analysis assumes the ``episode_summary.jsonl`` is a single coherent slice: one
  policy, task, and embodiment. **TODO:** add a filter (in the spirit of robolab's
  ``--filter-policy`` / ``--filter-task``) to select that slice from a larger JSONL,
  rather than relying on the caller to pre-filter it.
1 change: 1 addition & 0 deletions docs/pages/concepts/policy/index.rst
@@ -91,3 +91,4 @@ More details
   :maxdepth: 1

   concept_evaluation_types
   concept_sensitivity_analysis
4 changes: 4 additions & 0 deletions isaaclab_arena/analysis/__init__.py
@@ -0,0 +1,4 @@
# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0
4 changes: 4 additions & 0 deletions isaaclab_arena/analysis/sensitivity/__init__.py
@@ -0,0 +1,4 @@
# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0
98 changes: 98 additions & 0 deletions isaaclab_arena/analysis/sensitivity/analyzer.py
@@ -0,0 +1,98 @@
# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).
# All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import torch

from sbi.inference import MNPE, NPE
from sbi.utils import BoxUniform

from isaaclab_arena.analysis.sensitivity.dataset import SensitivityDataset


class SensitivityAnalyzer:
    """Fits a neural posterior over all factors, conditioned on all outcomes.

    Picks the sbi estimator from the schema:

    - MNPE when any factor is categorical (it handles mixed continuous + categorical theta).
    - NPE when every factor is continuous.

    It then trains on the full (theta, x) and samples the joint posterior at a chosen
    observation. The single observation conditions on *all* outcome columns at once, so a
    query like "which factors produced success?" is answered for every factor jointly.

    Continuous factors are normalized to [0, 1] before fitting and denormalized when
    sampling, so factors on very different scales (e.g. light in thousands, an offset in
    hundredths) train on equal footing. Categorical columns keep their integer codes.
    """

    def __init__(self, dataset: SensitivityDataset):
        self.dataset = dataset
        self.posterior = None
        continuous_factors = [factor for factor in dataset.schema.factors if factor.type == "continuous"]
        self._num_continuous = len(continuous_factors)
        self._continuous_low = torch.tensor([factor.range[0][0] for factor in continuous_factors])
        self._continuous_high = torch.tensor([factor.range[0][1] for factor in continuous_factors])

    def _select_inference_class(self):
        """Choose the sbi inference class for this schema.

        Returns MNPE when any factor is categorical (its mixed density estimator handles
        continuous + categorical theta together), and NPE when every factor is continuous.
        """
        return MNPE if self.dataset.has_categorical_factors else NPE

    def _normalized_prior(self):
        """Uniform prior matching the normalized theta: continuous dims [0, 1], categoricals [0, k-1]."""
        low_bounds = [0.0] * self._num_continuous
        high_bounds = [1.0] * self._num_continuous
        for factor in self.dataset.schema.factors:
            if factor.type == "categorical":
                low_bounds.append(0.0)
                high_bounds.append(float(len(factor.choices) - 1))
        return BoxUniform(low=torch.tensor(low_bounds), high=torch.tensor(high_bounds))

    def _normalize(self, theta: torch.Tensor) -> torch.Tensor:
        """Scale the continuous (leading) theta columns to [0, 1]; leave categoricals untouched."""
        normalized = theta.clone()
        span = (self._continuous_high - self._continuous_low).clamp_min(1e-12)
        normalized[:, : self._num_continuous] = (theta[:, : self._num_continuous] - self._continuous_low) / span
        return normalized

    def _denormalize(self, theta: torch.Tensor) -> torch.Tensor:
        """Inverse of _normalize: map the continuous columns back to their original ranges."""
        denormalized = theta.clone()
        span = self._continuous_high - self._continuous_low
        denormalized[:, : self._num_continuous] = theta[:, : self._num_continuous] * span + self._continuous_low
        return denormalized

    def fit(self, training_batch_size: int = 50):
        """Train the estimator on the full (theta, x); store and return the fitted posterior."""
        print(
            f"[INFO] SensitivityAnalyzer: fitting {self._select_inference_class().__name__} on"
            f" {self.dataset.num_episodes} episodes"
            f" (theta dim={self.dataset.theta.shape[1]}, x dim={self.dataset.x.shape[1]})."
        )
        inference = self._select_inference_class()(prior=self._normalized_prior())
        inference.append_simulations(self._normalize(self.dataset.theta), self.dataset.x)
        density_estimator = inference.train(training_batch_size=training_batch_size)
        self.posterior = inference.build_posterior(density_estimator)
        return self.posterior

    def sample_posterior(self, observation: torch.Tensor | None = None, num_samples: int = 5000) -> torch.Tensor:
        """Sample the joint posterior over all factors at observation.

        Defaults to the dataset's default observation (condition on success). Returns a
        (num_samples, total_factor_dim) tensor laid out like theta: continuous columns first
        (in original, denormalized units), then integer-coded categorical columns.
        """
        assert self.posterior is not None, "Call fit() before sampling the posterior"
        if observation is None:
            observation = self.dataset.default_observation()
        with torch.no_grad():
            normalized_samples = self.posterior.sample((num_samples,), x=observation)
        return self._denormalize(normalized_samples)