
feat(apr-code): CCPA teacher.stream.ndjson -> apr-code tool_call SFT export + LoRA flip spike #2243

Open
noahgift wants to merge 4 commits into main from beat/apr-code-sft-export

Conversation

@noahgift

Contributor

Summary

Builds the missing SFT export pipeline for apr-code distillation (no SFT/distill pipeline existed in CCPA) and runs the falsifiable 0→1 tool_call flip spike.

The grounded gap (from the distill feasibility spike): the student emits 0 tool_calls — a Markdown/prose prior — instead of the apr-code <tool_call> envelope. This PR harvests real Claude Code teacher trajectories from claude-code-parity-apr and remaps them into apr-code SFT data.

Part 1 — Export converter (tools/ccpa-sft-export/)

Standalone Rust converter (serde_json only — not a workspace member). Walks every teacher.stream.ndjson under CCPA evidence/ and:

  • Remaps Anthropic-native → apr-code: Read→file_read, Write→file_write, Edit→file_edit (old_string→old, new_string→new), Bash→shell, Grep→grep, Glob→glob.
  • Flattens each assistant tool_use turn into an entrenar InstructSample {instruction, response, system}:
    • system = apr-code CODE_SYSTEM_PROMPT
    • instruction = running observation transcript (prior tool_calls + tool_results)
    • response = literal <tool_call>{"name":..,"input":..}</tool_call> (InstructSample has no tool_calls field — the response string is the envelope)
  • Modes: --curated (first action/trajectory), --balanced (stratified per tool), --full.
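The remap and flattening steps above can be sketched in a few lines of Python. This is an illustration only, not the Rust converter shipped in tools/ccpa-sft-export/: the function names `remap_tool_use` and `to_sample` are hypothetical, and passing non-Edit fields through unchanged is an assumption, since this PR states only the Edit field renames explicitly.

```python
import json

# Tool-name mapping as listed in the PR description.
TOOL_MAP = {
    "Read": "file_read",
    "Write": "file_write",
    "Edit": "file_edit",
    "Bash": "shell",
    "Grep": "grep",
    "Glob": "glob",
}

# Only the Edit renames are stated in the PR. Every other field is passed
# through unchanged here, which is an assumption of this sketch.
FIELD_MAP = {"Edit": {"old_string": "old", "new_string": "new"}}


def remap_tool_use(name, tool_input):
    """Return (apr_name, apr_input), or None for an unmappable tool."""
    apr_name = TOOL_MAP.get(name)
    if apr_name is None:
        return None  # e.g. Task*/ToolSearch/Agent turns are dropped
    renames = FIELD_MAP.get(name, {})
    apr_input = {renames.get(key, key): value for key, value in tool_input.items()}
    return apr_name, apr_input


def to_sample(system, transcript, name, tool_input):
    """Flatten one tool_use turn into an {instruction, response, system} dict."""
    remapped = remap_tool_use(name, tool_input)
    if remapped is None:
        return None
    apr_name, apr_input = remapped
    payload = json.dumps({"name": apr_name, "input": apr_input}, separators=(",", ":"))
    return {
        "instruction": transcript,
        "response": f"<tool_call>{payload}</tool_call>",
        "system": system,
    }
```

Returning `None` for unknown tools mirrors the "unmappable turns are dropped" behaviour reported in the corpus stats; the real converter may handle them differently.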

Part 2 — Datasets (datasets/)

| File | Samples | Notes |
| --- | --- | --- |
| apr_code_sft_balanced.jsonl | 200 | 40 each of file_read/file_edit/file_write/shell/grep; 184 with context; recommended spike set |
| apr_code_sft_curated.jsonl | 124 | one first-action tool_call per trajectory, deduped |
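A balanced set of this shape (40 samples for each of five tools, 200 in total) can be drawn with a simple per-tool stratified sample. The sketch below is a guess at the idea behind --balanced, not the converter's actual logic: `balanced_subset`, its arguments, and the fixed seed are all hypothetical.

```python
import random


def balanced_subset(samples_by_tool, per_tool, seed=0):
    """Pick up to `per_tool` samples from each tool's pool.

    `samples_by_tool` maps a tool name to its list of SFT samples.
    A seeded RNG keeps the selection reproducible across runs.
    """
    rng = random.Random(seed)
    picked = []
    for tool in sorted(samples_by_tool):  # sorted for a stable output order
        pool = samples_by_tool[tool]
        count = min(per_tool, len(pool))  # small pools contribute what they have
        picked.extend(rng.sample(pool, count))
    return picked
```

With five tools and `per_tool=40` this yields 200 samples whenever every pool holds at least 40 entries.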

Measured corpus stats

138 streams · 40 fixtures · 6,569 raw tool_use → 6,442 remappable (127 unmappable Task*/ToolSearch/Agent/AskUserQuestion dropped) → 5,773 full SFT samples. 100% of emitted responses are valid parseable apr-code tool_calls.

Flip test

The falsifiable claim = base model emits 0 tool_calls; the LoRA'd student emits ≥1 parseable <tool_call> on a held-out fixture prompt. Result reported in PR comments (faithful — measured numbers or the exact blocker, never a fabricated flip).
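One way to make "parseable <tool_call>" concrete is a small checker that counts well-formed envelopes in a model response. This is a minimal sketch, assuming the envelope is exactly `<tool_call>{"name":..,"input":..}</tool_call>` as described in Part 1. `parse_tool_calls` is a hypothetical helper, not the parser apr code actually uses.

```python
import json
import re

# Non-greedy match so several envelopes in one response are found separately.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Return every envelope in `text` whose body is valid tool_call JSON."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed JSON does not count as a tool_call
        if (
            isinstance(obj, dict)
            and isinstance(obj.get("name"), str)
            and isinstance(obj.get("input"), dict)
        ):
            calls.append(obj)
    return calls
```

Under this definition the flip criterion reads as `len(parse_tool_calls(base_output)) == 0` for the base model and `len(parse_tool_calls(lora_output)) >= 1` for the student.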

🤖 Generated with Claude Code

noahgift and others added 2 commits June 25, 2026 20:47
…export + LoRA flip spike

The missing piece for apr-code distillation: an SFT export pipeline that
harvests real Claude Code teacher trajectories (claude-code-parity-apr) and
remaps the Anthropic-native tool schema onto the apr-code <tool_call> schema,
flattening each turn into an entrenar InstructSample {instruction, response, system}.

tools/ccpa-sft-export/ — standalone Rust converter (serde_json only):
  - Read->file_read, Write->file_write, Edit->file_edit, Bash->shell,
    Grep->grep, Glob->glob (field remaps per CODE_SYSTEM_PROMPT)
  - response = literal <tool_call>{"name":..,"input":..}</tool_call>
  - system  = apr-code CODE_SYSTEM_PROMPT
  - instruction = running observation transcript (prior tool_calls + results)
  - modes: --curated (first action/trajectory), --balanced (stratified per tool),
    --full (every remappable turn)

datasets/:
  - apr_code_sft_balanced.jsonl  200 samples (40 each: file_read/file_edit/
    file_write/shell/grep), 184 with context — recommended spike set
  - apr_code_sft_curated.jsonl   124 first-action samples

Measured corpus: 138 streams, 40 fixtures, 6,569 raw tool_use -> 6,442
remappable (127 unmappable Task*/ToolSearch/Agent dropped) -> 5,773 full
SFT samples. 100% of emitted responses are valid parseable apr-code tool_calls.

Spike goal: SFT a small student on this corpus to produce the 0->1 tool_call
flip vs the base model (which emits 0 tool_calls / Markdown prose prior).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
remap_adapter.py bridges the entrenar InstructTrainer checkpoint
(lora.{L}.{q,v}_proj.lora_{a,b}, no rank/alpha metadata) to the tensor
names + header metadata that `apr finetune --merge` expects
(blk.{L}.attn_{q,v}.weight.lora_{a,b} + lora_rank/lora_alpha). Required
because apr code/run have no inference-time adapter flag — the trained
delta must be merged into the base first.

README documents the full 0->1 tool_call flip runbook and the three
measured pipeline blockers in the existing apr finetune/merge/run plumbing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@noahgift

Contributor Author

Flip-test progress — base baseline + pipeline mechanism verified

Base baseline (measured): apr code --model qwen2.5-coder-1.5b-instruct-q4k.apr --max-turns 1 on a held-out "fix the wrong-operator bug" task emits 0 tool_calls — pure prose: "Sure, please provide the code for the src/lib.rs file." (trace assistant_turn has a single text block, no tool_use). This is exactly the documented Markdown/prose prior.

Pipeline mechanism verified end-to-end (independent of trained weights):

  • --features training,wgpu does not compile (entrenar WgpuInstructPipeline/autograd::wgpu_training cfg'd out). Built --features cuda,training instead (4090).
  • Neither apr code nor apr run accepts an inference-time LoRA flag → the trained delta must be merged into the base.
  • Trainer writes checkpoints/best/model.safetensors named lora.{L}.{q,v}_proj.lora_{a,b} (no rank/alpha metadata); apr finetune --merge expects blk.{L}.attn_{q,v}.weight.lora_{a,b} + header lora_rank/lora_alpha. remap_adapter.py bridges this.
  • Proof: a synthetic adapter with the remapped names merged cleanly — Layers merged: 56 / 339 (28 layers × q+v = 56, exactly the targeted projections) → valid 6.62 GiB merged.apr that loads and runs in apr code.
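The tensor-name bridge described above comes down to a regular-expression rename plus the 28 × 2 = 56 arithmetic. The sketch below covers only the naming step and is not remap_adapter.py itself: it leaves out reading and writing safetensors files and the lora_rank/lora_alpha header metadata, and the helper names `remap_name` and `expected_merged_layers` are hypothetical.

```python
import re

# Trainer-side name pattern: lora.{L}.{q,v}_proj.lora_{a,b}
SRC_RE = re.compile(r"^lora\.(\d+)\.(q|v)_proj\.lora_(a|b)$")


def remap_name(name):
    """Map a trainer tensor name to the merge-side name, or None if it does not match."""
    match = SRC_RE.match(name)
    if match is None:
        return None
    layer, proj, ab = match.groups()
    # Merge-side pattern: blk.{L}.attn_{q,v}.weight.lora_{a,b}
    return f"blk.{layer}.attn_{proj}.weight.lora_{ab}"


def expected_merged_layers(num_layers, projections=("q", "v")):
    """Number of projections a q+v LoRA should touch across all layers."""
    return num_layers * len(projections)
```

For the 28-layer base used here, `expected_merged_layers(28)` gives 56, matching the "Layers merged: 56 / 339" line from the synthetic merge.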

So every leg of safetensors → remap → apr finetune --merge → merged.apr → apr code → trace is mechanically confirmed. The only remaining dependency is the real CUDA LoRA training producing the checkpoint (in flight on the 4090; the 1.5B F32 InstructPipeline build is the long pole). The measured 0→1 flip result (or the exact blocker) will be posted when training completes.

…ed on InstructPipeline build

Faithful spike result: base model emits 0 tool_calls (measured); merge->run
leg mechanically verified (synthetic adapter merged 56/339 layers -> runnable
merged.apr that loads in apr code); LoRA training did NOT reach a training step
in 15+ min (CUDA InstructPipeline::from_apr stuck in F32 dequant + PTX JIT
pre-warm on this host). Dataset + runbook ready to run once build cost addressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@noahgift

Contributor Author

Final flip-test result (faithful — measured, not fabricated)

| Leg | Status | Evidence |
| --- | --- | --- |
| Export converter (Part 1) | DONE | tools/ccpa-sft-export/: 138 streams, 6,569 raw tool_use → 6,442 remappable → 5,773 full SFT samples; balanced spike set = 200 (40/tool), 100% valid <tool_call> |
| Base baseline | MEASURED: 0 tool_calls | apr code on held-out fix task → prose "Sure, please provide the code…"; trace assistant_turn = 1 text block, no tool_use |
| remap → merge → run | MECHANICALLY VERIFIED | synthetic remapped adapter → apr finetune --merge Layers merged: 56/339 (28 layers × q+v) → 6.62 GiB merged.apr loads + generates in apr code |
| LoRA training | BLOCKED (real) | apr finetune --features cuda,training on 1.5B: CUDA InstructPipeline::from_apr stuck 15+ min in F32-dequant + PTX-JIT pre-warm, 0 training steps reached, GPU never sustained load. CPU --features training path = same InstructPipeline-construction wall. |

Measured flip = BASE 0 → LoRA not measured (training-build blocker), NOT a completed 0→1 flip. I am not claiming a flip that did not happen.

Ready-to-run (once the InstructPipeline build cost is addressed — pre-compiled cubins / smaller base / longer GPU budget)

cargo build --release --bin apr --features cuda,training
apr finetune qwen2.5-coder-1.5b-instruct-q4k.apr --method lora --rank 16 \
  --data datasets/apr_code_sft_balanced.jsonl --output adapter.apr --epochs 2 \
  --gpu-backend cuda --checkpoint-format safetensors
uv run tools/ccpa-sft-export/remap_adapter.py checkpoints/best/model.safetensors \
  adapter_remapped.safetensors --rank 16 --alpha 32
apr finetune qwen2.5-coder-1.5b-instruct-q4k.apr --merge --adapter adapter_remapped.safetensors -o merged.apr
apr code --model merged.apr --max-turns 1 --emit-trace lora_trace.jsonl \
  -p "The subtract fn in src/lib.rs adds instead of subtracts. Read the file and fix it."
# SUCCESS = tool_use_count(lora_trace) >= 1  vs  base == 0

This PR ships the converter + datasets + remap bridge + runbook (the critical missing piece) regardless. The flip itself is gated on the training-build blocker, reported honestly.

…ion (PMAT-FINETUNE-CONSTRUCT) (#2246)

The apr-code LoRA flip test was reported BLOCKED on a "15+ min wall in
InstructPipeline::from_apr (F32 dequant + PTX-JIT pre-warm) without reaching a
single training step". Profiling (APR_FROM_APR_TIMING=1) falsifies that diagnosis:

  [from_apr-timing] open+header:                       2.02s (339 tensors)
  [from_apr-timing] parallel dequant->F32:             1.66s
  [from_apr-timing] validate (struct+shape+nan/inf):   4.61s
  [from_apr-timing] from_params(build tensors):       160us | TOTAL 8.29s

Construction of the 1.5B Qwen2.5-Coder q4k.apr is ~8s, NOT 15 min. There is no
PTX-JIT pre-warm in the `--method lora` (non-NF4) path at all — `init_cuda` only
runs when `quantize_nf4` is true, and the LoRA train step uses CPU autograd
(`cuda_blocks` is None). The real bottleneck the prior spike misattributed to
construction is per-step CPU autograd forward+backward of a 1.5B model.

Changes:
- Add APR_FROM_APR_TIMING phase timers to Transformer::from_apr so the
  construction cost is observable instead of a black box.
- Parallelize the per-tensor F32 dequant with rayon par_iter (read_tensor_as_f32
  is a pure read into the owned byte buffer — data-parallel-safe). Serial dequant
  was the prime suspect; parallelizing removes it as a scaling term on multi-core.
- Add rayon as an explicit aprender-train dependency.

Falsifier FALSIFY-FINETUNE-CONSTRUCT-001 (transformer::model::tests):
builds a complete tiny Qwen2-shaped APR, loads it through from_apr (exercising
the parallel collect + validation), asserts every weight loads and forward
produces finite logits. Mutation-verified: injecting a `.filter()` that drops
model.embed_tokens.weight from the parallel collect turns the test RED;
all-tensors-present is GREEN.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
noahgift added this pull request to the merge queue Jun 25, 2026
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026