feat(apr-code): CCPA teacher.stream.ndjson -> apr-code tool_call SFT export + LoRA flip spike#2243
noahgift wants to merge 4 commits into
…export + LoRA flip spike
The missing piece for apr-code distillation: an SFT export pipeline that
harvests real Claude Code teacher trajectories (claude-code-parity-apr) and
remaps the Anthropic-native tool schema onto the apr-code <tool_call> schema,
flattening each turn into an entrenar InstructSample {instruction, response, system}.
tools/ccpa-sft-export/ — standalone Rust converter (serde_json only):
- Read->file_read, Write->file_write, Edit->file_edit, Bash->shell,
Grep->grep, Glob->glob (field remaps per CODE_SYSTEM_PROMPT)
- response = literal <tool_call>{"name":..,"input":..}</tool_call>
- system = apr-code CODE_SYSTEM_PROMPT
- instruction = running observation transcript (prior tool_calls + results)
- modes: --curated (first action/trajectory), --balanced (stratified per tool),
--full (every remappable turn)
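The remap and flattening described in this list can be sketched as follows. This is a minimal Python illustration, not the Rust converter itself: the helper names `remap_tool_use` and `to_instruct_sample` are hypothetical, the input is assumed to be an Anthropic-style `tool_use` block with `name` and `input` keys, and only the `Edit` field renames (`old_string`→`old`, `new_string`→`new`) are taken from this PR. Any other field remaps the real converter applies per `CODE_SYSTEM_PROMPT` are not reproduced here.

```python
import json

# Tool-name mapping as listed in the commit message.
TOOL_NAME_MAP = {
    "Read": "file_read",
    "Write": "file_write",
    "Edit": "file_edit",
    "Bash": "shell",
    "Grep": "grep",
    "Glob": "glob",
}

# Field renames per tool. Only the Edit renames are stated in this PR;
# other tools are passed through unchanged here (an assumption).
FIELD_RENAMES = {"Edit": {"old_string": "old", "new_string": "new"}}


def remap_tool_use(block):
    """Return the <tool_call> envelope string, or None if the tool is unmappable."""
    name = block.get("name")
    if name not in TOOL_NAME_MAP:
        return None  # e.g. Task*/ToolSearch/Agent are dropped
    renames = FIELD_RENAMES.get(name, {})
    mapped = {renames.get(k, k): v for k, v in block.get("input", {}).items()}
    payload = {"name": TOOL_NAME_MAP[name], "input": mapped}
    return "<tool_call>" + json.dumps(payload, separators=(",", ":")) + "</tool_call>"


def to_instruct_sample(block, transcript, system_prompt):
    """Flatten one tool_use turn into {instruction, response, system}."""
    response = remap_tool_use(block)
    if response is None:
        return None
    return {"instruction": transcript, "response": response, "system": system_prompt}
```

For example, an `Edit` block with `old_string`/`new_string` inputs comes out as a `file_edit` call with `old`/`new` inputs, and a `Task` block returns `None`.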
datasets/:
- apr_code_sft_balanced.jsonl 200 samples (40 each: file_read/file_edit/
file_write/shell/grep), 184 with context — recommended spike set
- apr_code_sft_curated.jsonl 124 first-action samples
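One way to read "stratified per tool" for the balanced set is a fixed quota per tool name. The sketch below is an assumption about how such sampling could work, not the converter's actual `--balanced` logic: `balanced_sample`, `tool_of`, `per_tool`, and `seed` are hypothetical names, and the seeded random draw is only one possible selection rule.

```python
import random
from collections import defaultdict


def balanced_sample(samples, tool_of, per_tool=40, seed=0):
    """Pick up to `per_tool` samples for each tool (stratified by tool name)."""
    by_tool = defaultdict(list)
    for sample in samples:
        by_tool[tool_of(sample)].append(sample)

    rng = random.Random(seed)  # fixed seed keeps the export reproducible
    picked = []
    for tool in sorted(by_tool):  # sorted for a deterministic output order
        group = by_tool[tool]
        picked.extend(rng.sample(group, min(per_tool, len(group))))
    return picked
```

With five tools and a quota of 40, this yields the 5 × 40 = 200 samples reported for `apr_code_sft_balanced.jsonl`, provided every tool has at least 40 candidates.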
Measured corpus: 138 streams, 40 fixtures, 6,569 raw tool_use -> 6,442
remappable (127 unmappable Task*/ToolSearch/Agent dropped) -> 5,773 full
SFT samples. 100% of emitted responses are valid parseable apr-code tool_calls.
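A check of the "valid parseable" claim could look like the sketch below. It assumes a valid response is exactly one `<tool_call>…</tool_call>` envelope whose body is a JSON object with a known tool `name` and an object-valued `input`. The function names are hypothetical, and the converter's own validation may be stricter or looser than this.

```python
import json
import re

KNOWN_TOOLS = {"file_read", "file_write", "file_edit", "shell", "grep", "glob"}
ENVELOPE = re.compile(r"\A<tool_call>(.*)</tool_call>\Z", re.DOTALL)


def is_valid_tool_call(response):
    """True if the whole response is one parseable apr-code tool_call envelope."""
    match = ENVELOPE.match(response.strip())
    if match is None:
        return False
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("name") in KNOWN_TOOLS
        and isinstance(payload.get("input"), dict)
    )


def valid_fraction(responses):
    """Fraction of responses that pass is_valid_tool_call (0.0 for an empty list)."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_valid_tool_call(r) for r in responses) / len(responses)
```

Running `valid_fraction` over the `response` field of every exported sample should return 1.0 if the 100% figure holds under these assumptions.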
Spike goal: SFT a small student on this corpus to produce the 0->1 tool_call
flip vs the base model (which emits 0 tool_calls / Markdown prose prior).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
remap_adapter.py bridges the entrenar InstructTrainer checkpoint
(lora.{L}.{q,v}_proj.lora_{a,b}, no rank/alpha metadata) to the tensor
names + header metadata that `apr finetune --merge` expects
(blk.{L}.attn_{q,v}.weight.lora_{a,b} + lora_rank/lora_alpha). Required
because apr code/run have no inference-time adapter flag — the trained
delta must be merged into the base first.
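The tensor-name half of that bridge is a pure string rewrite, sketched below. This is an illustration of the naming scheme quoted above, not the contents of `remap_adapter.py`: the function names are hypothetical, names that do not match the LoRA pattern are passed through unchanged (an assumption), and storing rank/alpha as string-valued header entries is also an assumption about how the metadata is encoded.

```python
import re

# Matches entrenar checkpoint names such as "lora.12.q_proj.lora_a".
SRC_NAME = re.compile(r"^lora\.(\d+)\.(q|v)_proj\.lora_(a|b)$")


def remap_tensor_name(name):
    """Map lora.{L}.{q,v}_proj.lora_{a,b} to blk.{L}.attn_{q,v}.weight.lora_{a,b}."""
    match = SRC_NAME.match(name)
    if match is None:
        return name  # assumption: leave non-LoRA tensors untouched
    layer, proj, ab = match.groups()
    return f"blk.{layer}.attn_{proj}.weight.lora_{ab}"


def remap_names(names):
    """Return {old_name: new_name} for every tensor name in the checkpoint."""
    return {name: remap_tensor_name(name) for name in names}


def header_metadata(rank, alpha):
    """Header entries the merge step expects; string values are an assumption."""
    return {"lora_rank": str(rank), "lora_alpha": str(alpha)}
```

With `--rank 16 --alpha 32`, `header_metadata(16, 32)` gives `{"lora_rank": "16", "lora_alpha": "32"}` in this sketch.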
README documents the full 0->1 tool_call flip runbook and the three
measured pipeline blockers in the existing apr finetune/merge/run plumbing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flip-test progress: base baseline + pipeline mechanism verified
Base baseline (measured):
Pipeline mechanism verified end-to-end (independent of trained weights):
So every leg of |
…ed on InstructPipeline build
Faithful spike result:
- base model emits 0 tool_calls (measured);
- merge->run leg mechanically verified (synthetic adapter merged 56/339 layers -> runnable merged.apr that loads in apr code);
- LoRA training did NOT reach a training step in 15+ min (CUDA InstructPipeline::from_apr stuck in F32 dequant + PTX JIT pre-warm on this host).
Dataset + runbook ready to run once build cost addressed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Final flip-test result (faithful — measured, not fabricated)
Measured flip = BASE 0 → LoRA not measured (training-build blocker), NOT a completed 0→1 flip. I am not claiming a flip that did not happen.
Ready-to-run (once the InstructPipeline build cost is addressed: pre-compiled cubins / smaller base / longer GPU budget):
```bash
cargo build --release --bin apr --features cuda,training
apr finetune qwen2.5-coder-1.5b-instruct-q4k.apr --method lora --rank 16 \
  --data datasets/apr_code_sft_balanced.jsonl --output adapter.apr --epochs 2 \
  --gpu-backend cuda --checkpoint-format safetensors
uv run tools/ccpa-sft-export/remap_adapter.py checkpoints/best/model.safetensors \
  adapter_remapped.safetensors --rank 16 --alpha 32
apr finetune qwen2.5-coder-1.5b-instruct-q4k.apr --merge --adapter adapter_remapped.safetensors -o merged.apr
apr code --model merged.apr --max-turns 1 --emit-trace lora_trace.jsonl \
  -p "The subtract fn in src/lib.rs adds instead of subtracts. Read the file and fix it."
# SUCCESS = tool_use_count(lora_trace) >= 1 vs base == 0
```
This PR ships the converter + datasets + remap bridge + runbook (the critical missing piece) regardless. The flip itself is gated on the training-build blocker, reported honestly.
…ion (PMAT-FINETUNE-CONSTRUCT) (#2246)
The apr-code LoRA flip test was reported BLOCKED on a "15+ min wall in InstructPipeline::from_apr (F32 dequant + PTX-JIT pre-warm) without reaching a single training step". Profiling (APR_FROM_APR_TIMING=1) falsifies that diagnosis:
  [from_apr-timing] open+header: 2.02s (339 tensors)
  [from_apr-timing] parallel dequant->F32: 1.66s
  [from_apr-timing] validate (struct+shape+nan/inf): 4.61s
  [from_apr-timing] from_params(build tensors): 160us | TOTAL 8.29s
Construction of the 1.5B Qwen2.5-Coder q4k.apr is ~8s, NOT 15 min. There is no PTX-JIT pre-warm in the `--method lora` (non-NF4) path at all: `init_cuda` only runs when `quantize_nf4` is true, and the LoRA train step uses CPU autograd (`cuda_blocks` is None). The real bottleneck the prior spike misattributed to construction is per-step CPU autograd forward+backward of a 1.5B model.
Changes:
- Add APR_FROM_APR_TIMING phase timers to Transformer::from_apr so the construction cost is observable instead of a black box.
- Parallelize the per-tensor F32 dequant with rayon par_iter (read_tensor_as_f32 is a pure read into the owned byte buffer, so it is data-parallel-safe). Serial dequant was the prime suspect; parallelizing removes it as a scaling term on multi-core.
- Add rayon as an explicit aprender-train dependency.
Falsifier FALSIFY-FINETUNE-CONSTRUCT-001 (transformer::model::tests): builds a complete tiny Qwen2-shaped APR, loads it through from_apr (exercising the parallel collect + validation), asserts every weight loads and forward produces finite logits. Mutation-verified: injecting a `.filter()` that drops model.embed_tokens.weight from the parallel collect turns the test RED; all-tensors-present is GREEN.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
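The phase timings quoted in this commit message can be cross-checked against the reported total with a few lines of arithmetic. The sketch below parses the four `[from_apr-timing]` lines exactly as quoted; the regex and helper name are hypothetical and only assume the `name: value+unit` layout visible in those lines.

```python
import re

# The four timing lines as quoted in the commit message.
LOG = """\
[from_apr-timing] open+header: 2.02s (339 tensors)
[from_apr-timing] parallel dequant->F32: 1.66s
[from_apr-timing] validate (struct+shape+nan/inf): 4.61s
[from_apr-timing] from_params(build tensors): 160us | TOTAL 8.29s
"""

UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}
PHASE = re.compile(r"\[from_apr-timing\] ([^:]+): ([0-9.]+)(us|ms|s)\b")


def phase_seconds(log):
    """Return {phase_name: seconds} for every [from_apr-timing] phase line."""
    return {
        m.group(1): float(m.group(2)) * UNIT_SECONDS[m.group(3)]
        for m in PHASE.finditer(log)
    }


phases = phase_seconds(LOG)
total = sum(phases.values())  # 2.02 + 1.66 + 4.61 + 0.00016 seconds
slowest = max(phases, key=phases.get)
```

The four phases sum to about 8.29 s, matching the logged TOTAL, and validation (4.61 s) is the largest single phase.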
Summary
Builds the missing SFT export pipeline for apr-code distillation (no SFT/distill pipeline existed in CCPA) and runs the falsifiable 0→1 tool_call flip spike.
The grounded gap (from the distill feasibility spike): the student emits 0 tool_calls (a Markdown/prose prior) instead of the apr-code `<tool_call>` envelope. This PR harvests real Claude Code teacher trajectories from `claude-code-parity-apr` and remaps them into apr-code SFT data.
Part 1: Export converter (`tools/ccpa-sft-export/`)
Standalone Rust converter (serde_json only; not a workspace member). Walks every `teacher.stream.ndjson` under CCPA `evidence/` and:
- Remaps the tool schema: `Read`→`file_read`, `Write`→`file_write`, `Edit`→`file_edit` (`old_string`→`old`, `new_string`→`new`), `Bash`→`shell`, `Grep`→`grep`, `Glob`→`glob`.
- Flattens each `tool_use` turn into an entrenar `InstructSample {instruction, response, system}`:
  - `system` = apr-code `CODE_SYSTEM_PROMPT`
  - `instruction` = running observation transcript (prior tool_calls + tool_results)
  - `response` = literal `<tool_call>{"name":..,"input":..}</tool_call>` (InstructSample has no tool_calls field; the response string is the envelope)
- Supports three modes: `--curated` (first action/trajectory), `--balanced` (stratified per tool), `--full`.
Part 2: Datasets (`datasets/`)
- `apr_code_sft_balanced.jsonl`: 200 samples (40 each: file_read/file_edit/file_write/shell/grep), 184 with context; recommended spike set
- `apr_code_sft_curated.jsonl`: 124 first-action samples
Measured corpus stats
138 streams · 40 fixtures · 6,569 raw tool_use → 6,442 remappable (127 unmappable Task*/ToolSearch/Agent/AskUserQuestion dropped) → 5,773 full SFT samples. 100% of emitted responses are valid parseable apr-code tool_calls.
Flip test
The falsifiable claim = base model emits 0 tool_calls; the LoRA'd student emits ≥1 parseable `<tool_call>` on a held-out fixture prompt. Result reported in PR comments (faithful: measured numbers or the exact blocker, never a fabricated flip).
🤖 Generated with Claude Code