Add per-label integration for sub mode #294

Open
nictru wants to merge 16 commits into dev from per-group-integration

nictru (Collaborator) commented Jun 7, 2026

Summary

  • Add --integrate_per_label for base-adata-only sub mode to split by base_label_col and reuse the existing INTEGRATE subworkflow per label.
  • Preserve per-label metadata through integration and clustering so outputs can be merged back into a single finalized AnnData.
  • Add focused nf-test coverage for the new subworkflow and an end-to-end pipeline case using nft-anndata.
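
The split-and-merge idea in the summary can be sketched in plain Python. This is only an illustration under stated assumptions: the pipeline itself works on AnnData objects inside Nextflow, and the helper names `split_by_label` and `merge_subsets`, the record layout, and the `meta` keys used here are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for AnnData.obs: one record per cell.
# "coarse_annotation" plays the role of base_label_col.
cells = [
    {"cell_id": "c1", "coarse_annotation": "Erythrocyte"},
    {"cell_id": "c2", "coarse_annotation": "T cell"},
    {"cell_id": "c3", "coarse_annotation": "Erythrocyte"},
]


def split_by_label(records, label_col):
    """Group records by label and attach a meta map to each subset."""
    groups = defaultdict(list)
    for record in records:
        groups[record[label_col]].append(record)
    # meta["subset"] carries the label through later steps (assumed layout).
    return [({"id": label, "subset": label}, members)
            for label, members in groups.items()]


def merge_subsets(subsets):
    """Recombine per-label subsets, tagging each cell with its subset."""
    merged = []
    for meta, members in subsets:
        for record in members:
            merged.append({**record, "subset": meta["subset"]})
    # Sort so the merged order does not depend on group iteration order.
    return sorted(merged, key=lambda record: record["cell_id"])


subsets = split_by_label(cells, "coarse_annotation")
merged = merge_subsets(subsets)
```

Because each subset keeps its label in `meta`, every cell can be traced back to the group it was integrated with once the outputs are recombined.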

Test plan

  • nftu subworkflows/local/sub_integrate/tests/main.nf.test
  • nft subworkflows/local/sub_integrate/tests/main.nf.test
  • nft tests/main_pipeline_sub_integrate_per_label.nf.test

Reuse the existing integration workflow after splitting base AnnData labels so sub-mode runs can create per-label embeddings in the combined output.

github-actions (bot) commented Jun 7, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit f073f21

✅ 294 tests passed
❔ 1 tests had warnings
❗ 15 tests had warnings
Details

❗ Test warnings:

  • files_exist - File not found: conf/igenomes.config
  • files_exist - File not found: conf/igenomes_ignored.config
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in CONTRIBUTING.md: Add any pipeline specific contribution guidelines here, such as coding styles, procedures, checklists etc.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in nextflow.config: Specify any additional parameters here
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests fixed:

✅ Tests passed:

Run details

  • nf-core/tools version 4.0.2
  • Run at 2026-06-22 07:27:18

Update INTEGRATE subworkflow snapshots for the integration meta field and add pipeline test snapshots including versions.yml for nf-core lint.
@nictru nictru marked this pull request as ready for review June 16, 2026 13:42
nictru added 11 commits June 17, 2026 20:09
Keep meta.id as the sample subset while meta.integration carries the method, so publish prefixes and scib filtering stay correct for per-label runs.
Avoid coupling published filenames to meta.id when only the subset label should distinguish per-label runs.
…elpers.

Extract subset expansion into a dedicated subworkflow so CLUSTER can run graph, UMAP, Leiden, and entropy as a linear pipeline without UMAP id workarounds or duplicated plan matching in the parent workflow.
Fall back to meta.id in publish prefixes so isolated module nf-tests keep stable output names after the per-label integration meta changes.
Use meta.id as fallback in ADATA_MERGEEMBEDDINGS publishDir so module nf-tests do not write under a null integration key.
ADATA_MERGEEMBEDDINGS looked up base obsm keys from meta.id, which is still "merged" in extension mode after the integration meta refactor, causing KeyError X_merged.

nictru (Collaborator, Author) commented Jun 21, 2026

Version capture: INTEGRATE / SUB_INTEGRATE still mix .out.versions

While testing per-label sub integration (integrate_per_label=true, e.g. splitting on coarse_annotation), the pipeline fails before SUB_INTEGRATE completes:

ERROR ~ No such variable: Exception evaluating property 'versions' for nextflow.script.ChannelOut,
Reason: groovy.lang.MissingPropertyException: No such property: versions for class: groovyx.gpars.dataflow.DataflowBroadcast

 -- Check script 'subworkflows/local/integrate/main.nf' at line: 238

Root cause

Module outputs now publish versions via topic: versions (collected in workflows/scdownstream.nf with channel.topic("versions")), but INTEGRATE still uses the old pattern:

  • ch_versions = channel.empty(), followed by ch_versions.mix(<module>.out.versions) for each integration method
  • ch_versions.mix(SCIMILARITY.out.versions) at line 238: the SCIMILARITY subworkflow does not emit versions, so this fails when scimilarity is in integration_methods

Same pattern likely applies anywhere subworkflows mix .out.versions from child workflows that no longer expose that emit.
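
The failure mode can be modelled with a toy Python analogy. This is not Nextflow code: the `Topic` and `Child` classes are hypothetical stand-ins, used only to contrast a parent that reads a `versions` output from every child with tools that publish to a shared topic.

```python
class Topic:
    """Toy stand-in for a shared channel that any process can publish to."""

    def __init__(self):
        self.items = []

    def publish(self, item):
        self.items.append(item)


class Child:
    """Toy stand-in for the named outputs of a subworkflow."""

    def __init__(self, **outputs):
        for name, value in outputs.items():
            setattr(self, name, value)


# Old pattern: the parent reads a `versions` output from every child.
scvi = Child(embedding="X_scvi", versions=["scvi-tools"])
scimilarity = Child(embedding="X_scimilarity")  # no `versions` output


def collect_old(children):
    collected = []
    for child in children:
        # Raises AttributeError when a child has no `versions` output.
        collected.extend(child.versions)
    return collected


try:
    collect_old([scvi, scimilarity])
    old_pattern_failed = False
except AttributeError:
    old_pattern_failed = True

# Topic pattern: each tool publishes on its own; the parent forwards nothing.
versions_topic = Topic()
versions_topic.publish("scvi-tools")
versions_topic.publish("scimilarity")
```

In this model the old pattern breaks as soon as one child stops exposing `versions`, while the topic pattern does not depend on what the children emit.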

Suggested fix

Align INTEGRATE (and SUB_INTEGRATE) with other subworkflows (CLUSTER, COMBINE, PER_GROUP, etc.):

  1. Remove all ch_versions mixing in subworkflows/local/integrate/main.nf
  2. Drop the versions emit from INTEGRATE
  3. Drop versions = INTEGRATE.out.versions from subworkflows/local/sub_integrate/main.nf
  4. Update subworkflows/local/integrate/tests/main.nf.test snapshots — remove workflow.out.versions from assertions (versions are covered at pipeline level via topic)

Repro (publication repo)

04_scdownstream/04_sub/nextflow.config
  integrate_per_label = true
  integration_methods = 'scvi,scimilarity'
  base_label_col = 'coarse_annotation'

Run sub mode with a base merged.h5ad that has the label column in obs.


TODO: address on this branch; the integrate nf-test snapshots will need an nftu run after the fix.

nictru (Collaborator, Author) commented Jun 22, 2026

SCIMILARITY_EMBED: obsm pickle filename mismatch in per-label integration

When running integrate_per_label=true with scimilarity in integration_methods, SCIMILARITY_EMBED can fail with:

Missing output file(s) `X_scimilarity-Erythrocyte.pkl` expected by process
`NFCORE_SCDOWNSTREAM:SCDOWNSTREAM:SUB_INTEGRATE:INTEGRATE:SCIMILARITY:SCIMILARITY_EMBED (Erythrocyte)`

Root cause

conf/modules.config sets a subset-aware prefix for scimilarity modules:

ext.prefix = { (meta.integration ?: meta.id) + (meta.subset ? "-${meta.subset}" : '') }

So for the Erythrocyte subset, prefix is scimilarity-Erythrocyte.

  • modules/local/scimilarity/embed/main.nf declares output: X_${prefix}.pkl → X_scimilarity-Erythrocyte.pkl
  • modules/local/scimilarity/embed/templates/embed.py was writing: X_${meta.id}.pkl → X_Erythrocyte.pkl

meta.id is only the subset name from SUB_INTEGRATE; it does not include the integration method. This bug is latent on global runs (where prefix often equals meta.id) but breaks per-label sub-integration.
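
The mismatch can be reproduced with a small Python rendering of the `ext.prefix` closure quoted above. It is a sketch: the function names are hypothetical, and the shape of the global-run `meta` map is an assumption chosen to show the case where the prefix equals `meta.id`.

```python
def build_prefix(meta):
    """Python rendering of the ext.prefix closure from conf/modules.config."""
    base = meta.get("integration") or meta["id"]
    suffix = f"-{meta['subset']}" if meta.get("subset") else ""
    return base + suffix


def declared_output(meta):
    # What main.nf expects: X_${prefix}.pkl
    return f"X_{build_prefix(meta)}.pkl"


def written_before_fix(meta):
    # What embed.py wrote before the fix: X_${meta.id}.pkl
    return f"X_{meta['id']}.pkl"


# Per-label run: meta.id is the subset, meta.integration is the method.
per_label = {
    "id": "Erythrocyte",
    "integration": "scimilarity",
    "subset": "Erythrocyte",
}

# Assumed shape of a global-run meta map (no integration or subset keys).
global_run = {"id": "scimilarity"}

per_label_matches = declared_output(per_label) == written_before_fix(per_label)
global_matches = declared_output(global_run) == written_before_fix(global_run)
```

Here `per_label_matches` is `False` and `global_matches` is `True`, which is why the bug stayed hidden until per-label sub-integration was exercised.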

Other integration modules (scVI, PCA, symphony, etc.) already use ${prefix} for obsm pickles.

Fix

In modules/local/scimilarity/embed/templates/embed.py:

-df.to_pickle("X_${meta.id}.pkl")
+df.to_pickle("X_${prefix}.pkl")

Repro

04_scdownstream/04_sub run config: integrate_per_label=true, integration_methods=scvi,scimilarity, split on coarse_annotation.

nictru added 3 commits June 22, 2026 09:23
Per-label integration sets ext.prefix to integration-subset, so the
embedding pickle must match X_${prefix}.pkl rather than meta.id alone.
Version capture now uses topic: versions at pipeline level; remove the
legacy ch_versions channel and drop versions from subworkflow outputs.
Drop the forwarded versions emit and add a per-label scimilarity stub
regression test for integrate_per_label runs.