isaac-sim · cvolkcvolk · Jun 12, 2026 · Jun 3, 2026 · Jun 3, 2026 · Jun 4, 2026
@@ -0,0 +1,155 @@
+Sensitivity Analysis
+====================
+
+The sensitivity-analysis toolbox answers a single question about a policy:
+*which environment conditions drive success?* Given the per-episode results of an
+evaluation sweep — where factors such as lighting, object mass, or table material were
+varied — it fits a posterior over those factors conditioned on the outcome and renders
+one figure summarising which factor values are associated with success.
+
+Why a joint posterior, not a success rate per factor?
+-----------------------------------------------------
+
+The simplest analysis would chart a success rate for each factor independently. That hides
+the two things that matter most in a multi-factor sweep:
+
+- **Factors interact.** How much light a policy needs can depend on the object — a matte
+  object may succeed at low light while a shiny one needs far more. A per-factor
+  "success vs light" curve averages over objects and reports one blurry gate that is wrong
+  for both. The joint posterior keeps the interaction, so you can condition on a specific
+  object and see its gate.
+- **Factors confound each other.** If bright-light episodes also happened to use an easy
+  object, a per-factor light chart cannot tell which one drove success. Modelling all
+  factors together lets the effect be attributed to the factor that actually carries it,
+  provided the sweep also contains episodes that vary the two independently.
+
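A toy sketch of both effects, using hypothetical hand-built sweeps rather than toolbox code (``success_rate`` and the episode dicts are illustrative only). Pooling over the object blurs the matte and shiny gates into one rate. Stratifying by the object removes the spurious light effect.

```python
def success_rate(episodes, **conditions):
    """Fraction of successes among episodes matching every key=value condition."""
    matching = [ep for ep in episodes if all(ep[key] == value for key, value in conditions.items())]
    return sum(ep["success"] for ep in matching) / len(matching)


# Interaction: a matte object succeeds from light 1000 upward, a shiny one only from 4000.
GATE = {"matte": 1000.0, "shiny": 4000.0}
interaction_sweep = [
    {"object": obj, "light": light, "success": int(light >= gate)}
    for obj, gate in GATE.items()
    for light in (500.0, 2000.0, 3000.0, 4500.0)
]


def repeat(obj, light, success, count):
    """`count` identical episodes, used to build a partially confounded sweep."""
    return [{"object": obj, "light": light, "success": success} for _ in range(count)]


# Confounding: bright light mostly co-occurs with the easy object; only the object matters.
confounded_sweep = (
    repeat("easy", "bright", 1, 4)
    + repeat("easy", "dim", 1, 1)
    + repeat("hard", "bright", 0, 1)
    + repeat("hard", "dim", 0, 4)
)

if __name__ == "__main__":
    # Per-factor view: light 2000 looks like a coin flip (0.5) ...
    print(success_rate(interaction_sweep, light=2000.0))
    # ... but per object it is a clean gate: 1.0 for matte, 0.0 for shiny.
    print(success_rate(interaction_sweep, object="matte", light=2000.0),
          success_rate(interaction_sweep, object="shiny", light=2000.0))
    # Per-factor view: bright 0.8 vs dim 0.2 ...
    print(success_rate(confounded_sweep, light="bright"), success_rate(confounded_sweep, light="dim"))
    # ... yet the easy object always succeeds and the hard one never does, whatever the light.
    print(success_rate(confounded_sweep, object="easy", light="dim"),
          success_rate(confounded_sweep, object="hard", light="bright"))
```

The joint view sees the stratified numbers. A per-factor chart only ever sees the pooled ones.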
+The per-factor rate is a projection of the joint posterior. It can be derived from the joint
+(via Bayes' rule, using the prior and the overall success rate), but the joint cannot be
+rebuilt from the per-factor rates. The toolbox therefore always fits the joint, via
+simulation-based inference (MNPE or NPE), and reads the per-factor marginals from it.
+
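The following sanity check of that projection uses hypothetical binned counts in place of a fitted posterior. It recovers each bin's success rate from the empirical posterior marginal p(bin | success), the prior p(bin), and the overall rate p(success), using p(success | bin) = p(bin | success) p(success) / p(bin). The function names are illustrative.

```python
from collections import Counter
import math


def rate_per_bin(episodes):
    """Per-factor view: success rate within each bin of one factor."""
    totals, successes = Counter(), Counter()
    for bin_label, success in episodes:
        totals[bin_label] += 1
        successes[bin_label] += success
    return {b: successes[b] / totals[b] for b in totals}


def rate_from_posterior(episodes):
    """Same rates recovered from the posterior marginal via Bayes' rule."""
    n = len(episodes)
    n_success = sum(success for _, success in episodes)
    prior = Counter(b for b, _ in episodes)                        # counts behind p(bin)
    posterior = Counter(b for b, success in episodes if success)  # counts behind p(bin | success)
    p_success = n_success / n
    return {b: (posterior[b] / n_success) * p_success / (prior[b] / n) for b in prior}


# Hypothetical light sweep with four episodes per bin (an evenly swept prior).
episodes = (
    [("dim", 0)] * 3 + [("dim", 1)]
    + [("mid", 0)] * 2 + [("mid", 1)] * 2
    + [("bright", 1)] * 4
)

if __name__ == "__main__":
    direct, recovered = rate_per_bin(episodes), rate_from_posterior(episodes)
    for b in direct:
        print(b, direct[b], recovered[b], math.isclose(direct[b], recovered[b]))
```

The converse does not hold: the per-bin rates of each factor taken separately do not determine how the factors combine.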
+How it works
+------------
+
+The toolbox is a thin analysis layer over `sbi <https://sbi.readthedocs.io>`_'s
+neural posterior estimators. The flow is:
+
+1. **Per-episode recording.** During evaluation, ``episode_writer`` appends one row per
+   episode to an ``episode_summary.jsonl`` file.
+2. **Schema.** A ``factors.yaml`` declares the *factors* — which ``arena_env_args`` columns
+   were varied and whether each is continuous or categorical, plus the continuous ranges
+   that were swept (so the analyzer's prior matches the simulation). It does **not** list
+   outcomes — *which* outcome to condition on is chosen at analysis time, not saved here.
+3. **Inference.** ``SensitivityAnalyzer`` takes the loaded pair (a ``SensitivityDataset``),
+   trains an estimator jointly on every episode's ``(theta, x)``, where ``theta`` holds the
+   factor values and ``x`` the outcomes, and samples the joint posterior conditioned on a
+   chosen observation (by default, success).
+4. **Report.** The report renders a smooth density curve for each continuous factor and a
+   probability bar chart for each categorical factor.
+
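Step 1 only needs an append-only file with one line per episode. Below is a minimal sketch of a writer that produces rows in the format shown under *Inputs*. ``append_episode_row`` and ``read_episode_rows`` are hypothetical helpers, not the actual ``episode_writer`` interface.

```python
import json
from pathlib import Path


def append_episode_row(path, job_name, episode_idx, arena_env_args, outcomes):
    """Append one episode as a single JSON line (hypothetical stand-in for episode_writer)."""
    row = {
        "job_name": job_name,
        "episode_idx": episode_idx,
        "arena_env_args": arena_env_args,  # the factor values used for this episode
        "outcomes": outcomes,              # every measured outcome, e.g. {"success": 1}
    }
    # Append mode creates the file on first use and never rewrites earlier rows.
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")


def read_episode_rows(path):
    """Parse the file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because every episode is a self-contained line, an interrupted run still leaves all completed episodes readable.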
+Inputs
+------
+
+**factors.yaml** declares only the factors that were varied (and the continuous ranges that
+were swept). Outcomes are not declared here — they're selected at analysis time (see below):
+
+.. code-block:: yaml
+
+   factors:
+     light_intensity:
+       type: continuous
+       range: [[0.0, 5000.0]]   # swept [low, high] per dimension; inferred from the data's min/max if omitted
+     table_material:
+       type: categorical
+       choices: [oak, walnut, bamboo]
+
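The following sketch shows the fallback for an omitted ``range``. It assumes the YAML has been loaded into plain dicts (e.g. with PyYAML), and ``resolve_range`` is a hypothetical helper. Declaring the range is the safer choice: the observed min/max can be narrower than what was actually swept, and then the analyzer's prior no longer matches the simulation.

```python
def resolve_range(factor_spec, observed_values):
    """Return the [[low, high]] range of a continuous factor (hypothetical helper).

    Uses the declared ``range`` when present and otherwise falls back to the
    observed min/max, as described in the factors.yaml comment.
    """
    if factor_spec.get("range") is not None:
        return factor_spec["range"]
    if not observed_values:
        raise ValueError("cannot infer a range without any observed values")
    return [[min(observed_values), max(observed_values)]]
```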
+**episode_summary.jsonl** is produced by the eval runner and holds one JSON object per line,
+one line per episode (the example below is wrapped for readability). Each row carries every
+measured outcome; the analysis picks which one(s) to condition on:
+
+.. code-block:: json
+
+   {"job_name": "pi0_sweep", "episode_idx": 0,
+    "arena_env_args": {"light_intensity": 3200.0, "table_material": "oak"},
+    "outcomes": {"success": 1}}
+
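The following sketch shows how the two inputs could combine into training pairs. It assumes the layout described in ``SensitivityAnalyzer``'s docstrings: continuous factors first, then categorical factors as integer codes (their index in ``choices``), with ``x`` holding the selected outcomes. ``rows_to_theta_x`` is hypothetical, and the real ``SensitivityDataset`` may build these differently.

```python
def rows_to_theta_x(factors, rows, outcomes=("success",)):
    """Return (theta, x) as lists of rows: continuous values, then categorical codes; outcomes."""
    continuous = [name for name, spec in factors.items() if spec["type"] == "continuous"]
    categorical = [name for name, spec in factors.items() if spec["type"] == "categorical"]
    theta, x = [], []
    for row in rows:
        args = row["arena_env_args"]
        values = [float(args[name]) for name in continuous]
        # list.index raises ValueError for a value that is not declared in `choices`.
        values += [float(factors[name]["choices"].index(args[name])) for name in categorical]
        theta.append(values)
        x.append([float(row["outcomes"][name]) for name in outcomes])
    return theta, x


if __name__ == "__main__":
    factors = {
        "light_intensity": {"type": "continuous", "range": [[0.0, 5000.0]]},
        "table_material": {"type": "categorical", "choices": ["oak", "walnut", "bamboo"]},
    }
    rows = [{"job_name": "pi0_sweep", "episode_idx": 0,
             "arena_env_args": {"light_intensity": 3200.0, "table_material": "oak"},
             "outcomes": {"success": 1}}]
    print(rows_to_theta_x(factors, rows))
```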
+Choice of estimator
+-------------------
+
+``SensitivityAnalyzer`` picks the estimator from the schema automatically:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 25 50
+
+   * - Schema
+     - Estimator
+     - Notes
+   * - Any categorical factor
+     - MNPE
+     - Mixed density estimator; handles continuous + categorical factors together.
+   * - All continuous factors
+     - NPE
+     - With a single continuous factor, sbi's default flow is limited to a Gaussian
+       posterior, so a meaningful continuous-only analysis needs at least two continuous factors.
+
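The selection rule fits in a few lines. The sketch below mirrors ``SensitivityAnalyzer._select_inference_class`` on a list of factor types. It adds a hypothetical warning for the single-continuous-factor case from the table, and it returns names rather than sbi classes so it runs without sbi installed.

```python
import warnings


def select_estimator(factor_types):
    """Return "MNPE" if any factor is categorical, else "NPE" (mirrors the table above)."""
    if not factor_types:
        raise ValueError("at least one factor is required")
    if "categorical" in factor_types:
        return "MNPE"
    if len(factor_types) == 1:
        # Hypothetical guard: in one dimension sbi's default flow can only fit a Gaussian.
        warnings.warn("NPE on a single continuous factor is limited to a Gaussian posterior")
    return "NPE"
```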
+Continuous factors are normalised to ``[0, 1]`` using their declared ``range`` before fitting
+and de-normalised when sampling, so factors on very different scales (e.g. light in the
+thousands, an offset in the hundredths) train on equal footing. Categorical factors keep
+their integer codes. Outcomes are binary (0/1); the default query conditions on success (1).
+
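The scaling mirrors ``SensitivityAnalyzer._normalize`` and ``_denormalize``. It is restated here on plain lists so it runs without torch, and the function names are illustrative. Only the leading continuous entries are scaled. As in the analyzer, the span is clamped away from zero when normalising, so a degenerate range does not divide by zero.

```python
def normalize(theta_row, lows, highs):
    """Map the leading continuous entries to [0, 1]; categorical codes pass through."""
    n = len(lows)
    scaled = [(v - lo) / max(hi - lo, 1e-12) for v, lo, hi in zip(theta_row[:n], lows, highs)]
    return scaled + list(theta_row[n:])


def denormalize(theta_row, lows, highs):
    """Inverse of normalize: map the continuous entries back to their original ranges."""
    n = len(lows)
    restored = [v * (hi - lo) + lo for v, lo, hi in zip(theta_row[:n], lows, highs)]
    return restored + list(theta_row[n:])


if __name__ == "__main__":
    lows, highs = [0.0, 0.0], [5000.0, 0.05]  # light in the thousands, offset in the hundredths
    row = [3200.0, 0.02, 2.0]                 # two continuous values, then a categorical code
    print(normalize(row, lows, highs))
```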
+Running a report
+----------------
+
+Point the report generator at a ``(factors.yaml, episode_summary.jsonl)`` pair. The output
+format follows the file extension (``.png``, ``.pdf``, …); reports are written under
+``eval/`` by default.
+
+.. code-block:: bash
+
+   python -m isaaclab_arena.analysis.sensitivity.generate_report \
+     --factors_yaml factors.yaml \
+     --episode_summary episode_summary.jsonl \
+     --outcome success \
+     --output eval/sensitivity_report.png
+
+``--outcome`` selects which per-episode outcome(s) to condition on (keys in each row's
+``outcomes`` block) and defaults to ``success``. ``--observation`` sets the value each
+selected outcome is conditioned on. Because outcomes are binary, ``1`` conditions on the
+outcome occurring (e.g. success) and ``0`` on it not occurring (e.g. failure); the default
+is ``1``.
+
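The following sketch shows how the two flags could combine into a single conditioning vector. ``build_observation`` is hypothetical, and the way several values are passed on the command line is not specified here. The sketch only shows the defaults and the binary check.

```python
def build_observation(outcome_names=("success",), values=None):
    """Hypothetical mapping from --outcome / --observation to one conditioning vector."""
    if values is None:
        values = [1] * len(outcome_names)  # default: condition every selected outcome on 1
    if len(values) != len(outcome_names):
        raise ValueError("need exactly one observation value per outcome")
    for name, value in zip(outcome_names, values):
        if value not in (0, 1):
            raise ValueError(f"outcome {name!r} is binary; got observation {value!r}")
    return [float(v) for v in values]
```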
+Trying it on synthetic data
+---------------------------
+
+A synthetic simulator with a *known* ground truth lets you run the whole pipeline on CPU,
+without Isaac Sim — useful for seeing the output shape and for validating the toolbox
+(the recovered posterior should reflect the planted relationship):
+
+.. code-block:: bash
+
+   # mixed: three continuous + two categorical factors (MNPE)
+   python -m isaaclab_arena.analysis.sensitivity.synthetic --kind mixed --output eval/demo.png
+
+``--kind`` also accepts ``continuous`` (continuous-only factors, which exercises the NPE path).
+
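The following hypothetical generator illustrates a planted relationship. It uses the same row format but is not the toolbox's ``synthetic`` module. Success requires a light intensity of at least 2500 and any table material except bamboo. If the analysis works, the posterior conditioned on success should put essentially no mass below 2500 or on bamboo.

```python
import random


def synthetic_rows(num_episodes=200, seed=0):
    """Hypothetical ground truth: success iff light >= 2500 and material != 'bamboo'."""
    rng = random.Random(seed)  # fixed seed, so the same rows come back every call
    rows = []
    for idx in range(num_episodes):
        light = rng.uniform(0.0, 5000.0)
        material = rng.choice(["oak", "walnut", "bamboo"])
        success = int(light >= 2500.0 and material != "bamboo")
        rows.append({
            "job_name": "synthetic",
            "episode_idx": idx,
            "arena_env_args": {"light_intensity": light, "table_material": material},
            "outcomes": {"success": success},
        })
    return rows
```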
+Reading the output
+------------------
+
+.. todo::
+
+   Add a sample report figure here and walk through reading it.
+
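+Each panel is the posterior over one factor *conditioned on success*: among successful
+episodes, which values of this factor are most likely? Read each panel against the prior,
+which is uniform over the swept range. For a continuous factor, a flat curve means the factor
+made no detectable difference, while mass concentrated at one end of its range means success
+favoured that end (e.g. a curve rising toward bright light means the policy is light-gated).
+For a categorical factor, a bar above ``1/k`` (for ``k`` evenly swept choices) marks a value
+associated with success, and the tallest bar is the most strongly associated.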
+
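One way to make "read against the prior" concrete is to bin the posterior samples and divide each bin's share by the uniform prior's share, ``1/num_bins``. This assumes the samples for a continuous factor have been converted to plain floats (e.g. ``samples[:, 0].tolist()`` on the tensor returned by ``sample_posterior``), and ``density_ratio`` is a hypothetical helper. Ratios above 1 mark favoured values and ratios below 1 disfavoured ones. By Bayes' rule, each ratio is proportional to the success rate in that bin.

```python
def density_ratio(samples, low, high, num_bins=5):
    """Per-bin posterior share divided by the uniform prior's share (1/num_bins)."""
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for s in samples:
        # Clamp so that s == high lands in the last bin and stray samples stay in range.
        idx = min(max(int((s - low) / width), 0), num_bins - 1)
        counts[idx] += 1
    return [(c / len(samples)) * num_bins for c in counts]
```

A ratio near 1 in every bin is the "flat curve" case: the factor showed no detectable effect.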
+Current scope
+-------------
+
+- Outcomes are treated as **binary** (0/1). Conditioning defaults to success; a continuous
+  outcome is rejected with a clear error rather than silently averaged.
+- Continuous **vector** factors (``dim > 1``) are reserved for a future extension.
+- The estimators run on CPU and do not require Isaac Sim, so a report can be generated
+  anywhere the evaluation JSONL is available.
+- The analysis assumes the ``episode_summary.jsonl`` is a single coherent slice — one
+  policy, task, and embodiment. **TODO:** add a filter (in the spirit of robolab's
+  ``--filter-policy`` / ``--filter-task``) to select that slice from a larger JSONL,
+  rather than relying on the caller to pre-filter it.
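Until that filter exists, a caller-side pre-filter could look like the sketch below. ``select_slice`` is hypothetical. Only ``job_name`` appears in the documented row format, so matching on policy, task, or embodiment depends on what fields the eval runner actually writes. The sketch also applies the binary-outcome check from the first bullet before any fitting.

```python
def select_slice(rows, outcome="success", **match):
    """Keep rows whose top-level fields equal every key=value in `match`; check the outcome is binary."""
    selected = [row for row in rows if all(row.get(key) == value for key, value in match.items())]
    for row in selected:
        value = row["outcomes"][outcome]
        if value not in (0, 1):
            raise ValueError(
                f"episode {row.get('episode_idx')}: outcome {outcome!r} is not binary ({value!r})"
            )
    return selected
```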
@@ -91,3 +91,4 @@ More details
    :maxdepth: 1
 
    concept_evaluation_types
+   concept_sensitivity_analysis
@@ -0,0 +1,4 @@
+# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
@@ -0,0 +1,4 @@
+# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
@@ -0,0 +1,98 @@
+# Copyright (c) 2026, The Isaac Lab Arena Project Developers (https://github.com/isaac-sim/IsaacLab-Arena/blob/main/CONTRIBUTORS.md).
+# All rights reserved.
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import torch
+
+from sbi.inference import MNPE, NPE
+from sbi.utils import BoxUniform
+
+from isaaclab_arena.analysis.sensitivity.dataset import SensitivityDataset
+
+
+class SensitivityAnalyzer:
+    """Fits a neural posterior over all factors, conditioned on all outcomes.
+
+    Picks the sbi estimator from the schema:
+
+    - MNPE when any factor is categorical (it handles mixed continuous + categorical theta).
+    - NPE when every factor is continuous.
+
+    It then trains on the full (theta, x) and samples the joint posterior at a chosen
+    observation. The single observation conditions on *all* outcome columns at once, so a
+    query like "which factors produced success?" is answered for every factor jointly.
+
+    Continuous factors are normalized to [0, 1] before fitting and denormalized when
+    sampling, so factors on very different scales (e.g. light in thousands, an offset in
+    hundredths) train on equal footing. Categorical columns keep their integer codes.
+    """
+
+    def __init__(self, dataset: SensitivityDataset):
+        self.dataset = dataset
+        self.posterior = None
+        continuous_factors = [factor for factor in dataset.schema.factors if factor.type == "continuous"]
+        self._num_continuous = len(continuous_factors)
+        self._continuous_low = torch.tensor([factor.range[0][0] for factor in continuous_factors])
+        self._continuous_high = torch.tensor([factor.range[0][1] for factor in continuous_factors])
+
+    def _select_inference_class(self):
+        """Choose the sbi inference class for this schema.
+
+        Returns MNPE when any factor is categorical (its mixed density estimator handles
+        continuous + categorical theta together), and NPE when every factor is continuous.
+        """
+        return MNPE if self.dataset.has_categorical_factors else NPE
+
+    def _normalized_prior(self):
+        """Uniform prior matching the normalized theta: continuous dims [0, 1], categoricals [0, k-1]."""
+        low_bounds = [0.0] * self._num_continuous
+        high_bounds = [1.0] * self._num_continuous
+        for factor in self.dataset.schema.factors:
+            if factor.type == "categorical":
+                low_bounds.append(0.0)
+                high_bounds.append(float(len(factor.choices) - 1))
+        return BoxUniform(low=torch.tensor(low_bounds), high=torch.tensor(high_bounds))
+
+    def _normalize(self, theta: torch.Tensor) -> torch.Tensor:
+        """Scale the continuous (leading) theta columns to [0, 1]; leave categoricals untouched."""
+        normalized = theta.clone()
+        span = (self._continuous_high - self._continuous_low).clamp_min(1e-12)
+        normalized[:, : self._num_continuous] = (theta[:, : self._num_continuous] - self._continuous_low) / span
+        return normalized
+
+    def _denormalize(self, theta: torch.Tensor) -> torch.Tensor:
+        """Inverse of _normalize: map the continuous columns back to their original ranges."""
+        denormalized = theta.clone()
+        span = self._continuous_high - self._continuous_low
+        denormalized[:, : self._num_continuous] = theta[:, : self._num_continuous] * span + self._continuous_low
+        return denormalized
+
+    def fit(self, training_batch_size: int = 50):
+        """Train the estimator on the full (theta, x); store and return the fitted posterior."""
+        print(
+            f"[INFO] SensitivityAnalyzer: fitting {self._select_inference_class().__name__} on"
+            f" {self.dataset.num_episodes} episodes"
+            f" (theta dim={self.dataset.theta.shape[1]}, x dim={self.dataset.x.shape[1]})."
+        )
+        inference = self._select_inference_class()(prior=self._normalized_prior())
+        inference.append_simulations(self._normalize(self.dataset.theta), self.dataset.x)
+        density_estimator = inference.train(training_batch_size=training_batch_size)
+        self.posterior = inference.build_posterior(density_estimator)
+        return self.posterior
+
+    def sample_posterior(self, observation: torch.Tensor | None = None, num_samples: int = 5000) -> torch.Tensor:
+        """Sample the joint posterior over all factors at observation.
+
+        Defaults to the dataset's default observation (condition on success). Returns a
+        (num_samples, total_factor_dim) tensor laid out like theta — continuous columns first
+        (in original, denormalized units), then integer-coded categorical columns.
+        """
+        assert self.posterior is not None, "Call fit() before sampling the posterior"
+        if observation is None:
+            observation = self.dataset.default_observation()
+        with torch.no_grad():
+            normalized_samples = self.posterior.sample((num_samples,), x=observation)
+        return self._denormalize(normalized_samples)