fix(convert): preserve forward-affecting config metadata in GGUF->APR import — fixes .apr GPU F2 divergence on Blackwell (PMAT class) by noahgift · Pull Request #2244 · paiml/aprender

noahgift · 2026-06-25T19:05:19Z

P4 correctness sweep — `.apr`-vs-`.gguf` GPU F2 divergence on Blackwell

Investigated the reported .apr-vs-.gguf GPU F2 per-position divergence for qwen2.5-coder-1.5b-instruct-q4_k_m, reproduced on real GB10 (sm_121). The oracle throughout is the .gguf path (GGUFConfig::from_gguf); all falsifiers are oracle-based and mutation-verified (per feedback_contracts_ratchet_not_radar).

What was proven (CPU-side, no GPU)

The GGUFConfig built by from_apr (the .apr loader) is already byte-identical to from_gguf (the .gguf oracle) for this model — architecture, hidden_dim=1536, num_layers=28, num_heads=12, num_kv_heads=2, head_dim=128, intermediate=8960, rope_theta=1e6, rope_type=2 (NEOX), eps=1e-6, context_length=32768, attn_scale, BOS/EOS all match. The raw .apr metadata stamps rms_norm_eps≈1e-6, rope_type=2, rope_theta=1e6 correctly. The "missing config field causes pos-11 divergence" hypothesis is FALSIFIED at the config level — there is no format/config divergence for this model.

GPU re-verify (gx10 GB10, `--features cuda`)

.gguf and .apr behave identically: the same load-time PARITY-GATE cosine (0.981714), the same F2 result (BOS probe: GPU token 4740 != CPU token 16), and the same coherent output ("4" for "2+2=") on the 647-kernel CUDA graph under SKIP_PARITY_GATE=1. The F2 BOS-probe rejection is the known stale-gate behavior (apr-cpu-vs-gpu-output-parity-v1 v1.10.0 / PMAT-885); it is format-independent and is NOT an .apr-specific bug.
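For readers unfamiliar with the metric, here is a minimal sketch of a cosine-similarity check like the one the PARITY-GATE figure suggests. Which vectors the real gate compares (logits, hidden states) and what threshold it applies are not stated in this PR, so the function name, signature, and f64 accumulation below are assumptions for illustration only.

```rust
/// Cosine similarity between two equally sized vectors, accumulated in f64.
/// Hypothetical helper: not the actual PARITY-GATE implementation.
/// Returns None on length mismatch, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}
```

Because the same value (0.981714) is observed for both formats, the number by itself does not distinguish .apr from .gguf, which is the point of the paragraph above.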

The real defect this audit surfaced (the fix)

GgufToAprQ4KConverter::convert resolved rms_norm_eps with a hard-coded unwrap_or(1e-5) (LLaMA's epsilon) for every architecture, while GGUFConfig::from_gguf falls back to the arch-specific ArchConstraints::default_eps (1e-6 for Qwen2/Qwen3). For any 1e-6-eps model whose GGUF omits the epsilon key (e.g. a weights-only Qwen2 export), the old code would stamp 1e-5 into the .apr → a real per-layer RMSNorm divergence vs the same model run as .gguf (pos-0 clean, compounds position-by-position — the F2 signature). Fix: route eps through a new resolve_rms_eps() helper mirroring from_gguf's arch-aware default. Raw-byte Q4K passthrough preserved (no requant).
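A minimal sketch of the arch-aware fallback described above, together with a plain RMSNorm to show why the epsilon matters. The signatures are assumptions: the real `resolve_rms_eps()` and `ArchConstraints::default_eps` may take different arguments, and `default_eps` and `rms_norm` here are illustrative stand-ins. Treating LLaMA's 1e-5 as the catch-all for every other architecture is also an assumption.

```rust
/// Hypothetical stand-in for `ArchConstraints::default_eps`: the RMSNorm
/// epsilon to use when the GGUF metadata carries no explicit value.
fn default_eps(architecture: &str) -> f32 {
    match architecture {
        // Qwen2/Qwen3 default to 1e-6, per the PR description.
        "qwen2" | "qwen3" => 1e-6,
        // LLaMA's epsilon; using it as the catch-all is an assumption.
        _ => 1e-5,
    }
}

/// Sketch of the resolver: an explicit GGUF value is used verbatim,
/// otherwise fall back to the architecture default rather than a fixed 1e-5.
fn resolve_rms_eps(architecture: &str, gguf_eps: Option<f32>) -> f32 {
    gguf_eps.unwrap_or_else(|| default_eps(architecture))
}

/// Standard RMSNorm: x * weight / sqrt(mean(x^2) + eps).
/// Assumes `x` is non-empty and `weight` has the same length as `x`.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With the old `unwrap_or(1e-5)`, `resolve_rms_eps("qwen2", None)` would effectively have returned 1e-5. Epsilon is added under the square root in every RMSNorm call, so a wrong value shifts each normalized activation slightly, and the shift is largest when the mean square of the input is small.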

Falsifiers (oracle-based, mutation-verified)

- resolve_rms_eps unit tests FALSIFY-APR-IMPORT-EPS-001..004: qwen2/qwen3 missing-key → 1e-6, llama → 1e-5, explicit GGUF eps used verbatim. Mutation-verified: reverting to unwrap_or(1e-5) turns the qwen2/qwen3 tests RED.
- apr_import_config_fidelity integration test: from_apr config == from_gguf config field-for-field (host-gated; auto-skips without the fixture).
- Contract contracts/apr-import-config-fidelity-v1.yaml (OBLIG-APR-IMPORT-CONFIG-FIDELITY); pv validate + pv lint contracts/ PASS.
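The field-for-field comparison behind the integration test can be sketched as follows. `ForwardConfig`, `push_if_differs`, and `config_mismatches` are hypothetical names invented for this illustration. The real `GGUFConfig` has more fields (attn_scale and the BOS/EOS ids are omitted here), and the real test may compare them differently.

```rust
use std::fmt::Debug;

/// Hypothetical subset of the forward-affecting config fields listed in this PR.
#[derive(Debug, Clone, PartialEq)]
struct ForwardConfig {
    architecture: String,
    hidden_dim: usize,
    num_layers: usize,
    num_heads: usize,
    num_kv_heads: usize,
    head_dim: usize,
    intermediate_dim: usize,
    rope_theta: f32,
    rope_type: u32,
    eps: f32,
    context_length: usize,
}

/// Record a readable line when one field differs between oracle and candidate.
fn push_if_differs<T: PartialEq + Debug>(
    diffs: &mut Vec<String>,
    name: &str,
    oracle: &T,
    candidate: &T,
) {
    if oracle != candidate {
        diffs.push(format!("{name}: oracle={oracle:?} candidate={candidate:?}"));
    }
}

/// Compare the candidate (.apr-derived) config against the oracle
/// (.gguf-derived) config and return every mismatching field by name.
fn config_mismatches(oracle: &ForwardConfig, candidate: &ForwardConfig) -> Vec<String> {
    let mut diffs = Vec::new();
    push_if_differs(&mut diffs, "architecture", &oracle.architecture, &candidate.architecture);
    push_if_differs(&mut diffs, "hidden_dim", &oracle.hidden_dim, &candidate.hidden_dim);
    push_if_differs(&mut diffs, "num_layers", &oracle.num_layers, &candidate.num_layers);
    push_if_differs(&mut diffs, "num_heads", &oracle.num_heads, &candidate.num_heads);
    push_if_differs(&mut diffs, "num_kv_heads", &oracle.num_kv_heads, &candidate.num_kv_heads);
    push_if_differs(&mut diffs, "head_dim", &oracle.head_dim, &candidate.head_dim);
    push_if_differs(&mut diffs, "intermediate_dim", &oracle.intermediate_dim, &candidate.intermediate_dim);
    push_if_differs(&mut diffs, "rope_theta", &oracle.rope_theta, &candidate.rope_theta);
    push_if_differs(&mut diffs, "rope_type", &oracle.rope_type, &candidate.rope_type);
    push_if_differs(&mut diffs, "eps", &oracle.eps, &candidate.eps);
    push_if_differs(&mut diffs, "context_length", &oracle.context_length, &candidate.context_length);
    diffs
}
```

Returning the list of mismatching field names, rather than a single boolean, makes a failing run point directly at the stamped value that drifted (for example `eps` after a 1e-5 fallback).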

Tests

- aprender-serve convert lib: 444 pass; integration: 2 pass; clippy --lib clean.
- Pre-existing ffn_coverage/convert_coverage standalone test bins fail to compile on base (stale struct literals); this is unrelated to this change.

Honest status

This PR ships a proven non-divergence (the format/config is faithful) plus a ratchet that fixes a latent arch-eps stamping gap. It does NOT claim to fix the F2 BOS-probe GPU fallback — that is a separate, format-independent CUDA-gate issue already tracked by apr-cpu-vs-gpu-output-parity-v1.

🤖 Generated with Claude Code

… import — fixes .apr GPU F2 divergence on Blackwell (PMAT class)

P4 correctness sweep on the reported .apr-vs-.gguf GPU F2 per-position divergence (qwen2.5-coder-1.5b on GB10 sm_121). ORACLE = the .gguf path (GGUFConfig::from_gguf); falsifiers are oracle-based and mutation-verified per feedback_contracts_ratchet_not_radar.

WHAT WAS PROVEN (CPU-side, no GPU): GGUFConfig built by from_apr (the .apr loader) is ALREADY byte-identical to from_gguf (the .gguf oracle) for this model: architecture, hidden_dim=1536, num_layers=28, num_heads=12, num_kv_heads=2, head_dim=128, intermediate=8960, rope_theta=1e6, rope_type=2 (NEOX), eps=1e-6, context_length=32768, attn_scale, BOS/EOS all match. The raw .apr metadata stamps rms_norm_eps ~1e-6, rope_type=2, rope_theta=1e6 correctly. So the "missing config field" hypothesis is FALSIFIED at the config level; there is NO format/config divergence for this model.

GPU RE-VERIFY (gx10 GB10, --features cuda build): .gguf and .apr behave IDENTICALLY: same load-time PARITY-GATE cosine 0.981714, same F2 result (GPU token 4740 != CPU token 16 BOS probe), same coherent output "4" for "2+2=" under SKIP_PARITY_GATE=1 on the 647-kernel CUDA graph. The F2 BOS-probe rejection is the known stale-gate behavior (apr-cpu-vs-gpu-output-parity-v1 v1.10.0 PMAT-885), format-independent, NOT an .apr-specific bug.

LATENT GAP FIXED (the real correctness defect the audit surfaced): GgufToAprQ4KConverter::convert resolved rms_norm_eps with a hard-coded `unwrap_or(1e-5)` (LLaMA's epsilon) for EVERY architecture, while GGUFConfig::from_gguf falls back to the arch-specific ArchConstraints::default_eps (1e-6 for Qwen2/Qwen3). For any 1e-6-eps model whose GGUF OMITS the epsilon key (e.g. a weights-only Qwen2 export) the old code would stamp 1e-5 into the .apr -> a real per-layer RMSNorm divergence vs the same model run as .gguf (pos-0 clean, compounds position-by-position).

Fix: route eps through a new resolve_rms_eps() helper mirroring from_gguf's arch-aware default. Raw-byte Q4K passthrough preserved (no requant).

FALSIFIERS (oracle-based, mutation-verified):
- resolve_rms_eps unit tests (FALSIFY-APR-IMPORT-EPS-001..004): qwen2/qwen3 missing-key -> 1e-6, llama -> 1e-5, explicit GGUF eps used verbatim. Mutation: reverting to unwrap_or(1e-5) makes the qwen2/qwen3 tests RED.
- apr_import_config_fidelity integration test: from_apr config == from_gguf config field-for-field (host-gated; auto-skips without the fixture).

Contract contracts/apr-import-config-fidelity-v1.yaml (OBLIG-APR-IMPORT-CONFIG-FIDELITY); pv validate + pv lint contracts/ PASS.

Tests: aprender-serve convert lib 444 pass; integration 2 pass; clippy --lib clean. (Pre-existing ffn_coverage/convert_coverage standalone test bins fail to compile on base, due to stale struct literals; unrelated to this change.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

noahgift enabled auto-merge June 25, 2026 19:05

noahgift added this pull request to the merge queue Jun 25, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026

noahgift wants to merge 1 commit into main from beat/apr-import-config-fidelity
