LLM inference callback support#1689
Conversation
815bbd0 to
7bc95c0
Compare
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
| /// request/response. | ||
| /// </summary> | ||
| public LlmRequestHandler? Handler { get; set; } | ||
| } |
There was a problem hiding this comment.
Why do we have this wrapper type and not just put LlmRequestHandler directly onto the create/resume config?
| </PropertyGroup> | ||
|
|
||
| <ItemGroup> | ||
| <InternalsVisibleTo Include="GitHub.Copilot.SDK.Test" /> |
There was a problem hiding this comment.
Is this IVT really needed?
| /// model-layer request. | ||
| /// </summary> | ||
| [Experimental(Diagnostics.Experimental)] | ||
| public enum LlmInferenceTransport |
There was a problem hiding this comment.
Why is this public when everything else in here is internal? Maybe it could move to the file with all the public API.
| // Return from httpRequestStart immediately (after registering state) so | ||
| // the runtime's RPC reply is not gated on the consumer's I/O. The actual | ||
| // provider work runs asynchronously. | ||
| _ = RunProviderAsync(llmRequest, state, sink); |
There was a problem hiding this comment.
Why are you doing that? What's wrong with holding the RPC reply until the consumer's I/O completes? Isn't tat necessary to achieve natural backpressure and error propagation?
Altogether I'm surprised by how much complexity there is in this file and wonder if you're trying to make it too clever. Why isn't there a significantly simpler way of adapting the RPC protocol to the public base class, for example by eliminating any separate queueing/buffering/etc?
| } | ||
|
|
||
| return result; | ||
| } |
There was a problem hiding this comment.
Seems really odd to have an adapter like this when the source data is already in the right format. What stops us from simplifying all this?
Adds an opt-in llmInference config to CopilotClientOptions that lets SDK consumers register a callback the runtime invokes whenever it would otherwise issue an outbound non-streaming LLM HTTP request itself. v1 scope is TS-only/non-streaming, mirroring the runtime support added in github/copilot-agent-runtime. Streaming SSE and WebSocket transports are out of scope for v1 and continue to bypass the callback. - New `LlmInferenceProvider` interface with a single `onLlmRequest` method. - `createLlmInferenceAdapter` converts the provider into the wire-shape `LlmInferenceHandler` consumed by the RPC dispatcher. - Client wiring: `llmInference.setProvider` is sent on connect; per-session adapter is attached alongside the existing sessionFs hook. - New `llm_inference.e2e.test.ts` exercises the full RPC round-trip against the runtime. Resolves github/copilot-sdk-internal#88 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Matches the runtime move of `llmInference.httpRequest` out of the session-scoped client API and onto a new `clientGlobal` schema root. - Codegen emits a new `registerClientGlobalApiHandlers` alongside the existing `registerClientSessionApiHandlers`. Handlers passed to it are dispatched directly (no per-session `getHandlers` callback) and carry no implicit sessionId — sessionId, when present, is just a payload field on the call. - `CopilotClient` now constructs the LLM inference adapter once and registers it process-wide via `registerClientGlobalApiHandlers` during connection setup. The per-session `setupLlmInference` path and the `SessionConfigBase.createLlmInferenceProvider` override are removed — there is no longer any per-session notion of which provider to use. - `LlmInferenceConfig.createLlmInferenceProvider` is now `() => LlmInferenceProvider` (was `(session) => ...`). - `LlmInferenceRequest` exposes the new optional `sessionId` field so consumers can correlate requests with a runtime session when one is in scope. E2E test updated to verify the global registration works and that sessionId is populated on in-session traffic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
With the Rust runtime intercept chokepoint in place, every model-layer HTTP request - including /models and /models/session - is now dispatched through the SDK callback. Update the e2e test to: - Stub realistic responses for non-streaming model catalog and session endpoints (so the runtime can proceed past model resolution). - Hard-assert the catalog request is intercepted (no more 'either-or' fallback for the pre-rust-intercept state). Streaming inference requests still pass through to the recorded CAPI proxy; a fully-mocked end-to-end inference test will land alongside the streaming-intercept commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extends LlmInferenceProvider with an optional onLlmStreamRequest method that returns a response head synchronously and pushes body chunks via the provided sink. The adapter implements the generated httpStreamStart RPC method and forwards chunks back to the runtime via the typed server-RPC client (llmInference.streamChunk / streamEnd). Adds a fully-mocked e2e test (test/e2e/llm_inference_stream.e2e.test.ts) that drives a complete user->assistant turn through the callback alone: the runtime hits the callback for /models, /models/session, and the chat completion itself, the assistant text returned to the SDK consumer is the synthetic text supplied by the stub. - nodejs/src/llmInferenceProvider.ts: LlmInferenceStreamSink, onLlmStreamRequest, httpStreamStart adapter - nodejs/src/client.ts: pass a lazy server-RPC accessor into the adapter - nodejs/src/index.ts: re-export new types - nodejs/test/e2e/llm_inference_stream.e2e.test.ts: full-mock e2e - nodejs/src/generated/*, python/*, go/*, rust/*: codegen for new RPC methods - dotnet/src/Generated/*: codegen for new RPC methods Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds test/e2e/llm_inference_errors.e2e.test.ts that wires a callback whose inference handler throws a synthetic transport error and verifies the failure surfaces to the SDK consumer (the call does not hang and any error caught is non-empty). Confirms the runtime's existing retry / error reporting path handles callback-side failures the same way it handles real transport failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the runtime-side cleanup: the callback wire no longer carries providerType / endpointKind / wireApi / transport / modelId. Adapter stops forwarding the field, e2e tests filter by URL instead of metadata, and the missing LlmInferenceStreamSink / LlmInferenceStreamStartResponse re-exports in types.ts are added so index.ts type-checks cleanly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[Phase 3] Realign the Node SDK with the runtime's new four-method chunk
protocol. One unified provider callback:
interface LlmInferenceProvider {
onLlmRequest(req: LlmInferenceRequest): Promise<void>;
}
LlmInferenceRequest exposes:
* url / method / headers / sessionId
* requestBody: AsyncIterable<Uint8Array> // body delivered as chunks
* responseBody: LlmInferenceResponseSink // start/write/end/error
The sink enforces start -> 0..N writes -> exactly one of end/error and
maps each call to the corresponding httpResponseStart / httpResponseChunk
RPC. createLlmInferenceAdapter maintains a per-requestId state map; the
generated httpRequestStart handler registers state synchronously and
fires onLlmRequest in the background, so the runtime's RPC reply isn't
gated on consumer I/O.
The body queue iterator now latches a 'done' flag so a consumer that
calls .next() again after end:true gets done back instead of blocking
forever waiting for chunks the runtime will never send.
Removes the previous onLlmRequest + onLlmStreamRequest split and the
LlmInferenceResponse / LlmInferenceStreamSink /
LlmInferenceStreamStartResponse public types. All three e2e tests
rewritten against the unified callback (one of them URL-dispatches
/responses -> SSE and /chat/completions -> buffered JSON; the consumer
can also branch on whether the request body has stream:true).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 4.1: expose an AbortSignal on the request envelope, abort it on a
cancel chunk from the runtime, and map consumer-side aborts to a 499 +
error{code:cancelled} response. Adds the cancellation e2e test.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an e2e test asserting that when the SDK consumer signals a terminal
error via responseBody.error({ code: 'cancelled' }) the runtime surfaces
it faithfully as a request failure rather than hanging. Completes the
consumer->runtime direction of Phase 4.1.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Surface the new `transport` discriminator on `LlmInferenceRequest` so consumers can tell an `"http"` request (plain HTTP / SSE) from a `"websocket"` one (full-duplex: each request-body chunk is one inbound WS message, each response-body write one outbound message). The adapter threads `params.transport` through, defaulting to `"http"`. Regenerate rpc.ts against the runtime schema for the new field and add an e2e test exercising the full-duplex path: the fake model advertises `ws:/responses`, the runtime's WebSocket flag is enabled via env var, and the consumer pumps `/responses` events back per inbound message. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Friendly product-code starting point for SDK consumers who want to observe or mutate LLM inference requests/responses by overriding virtual methods on a base class. Implements LlmInferenceProvider, so an instance can be returned directly from createLlmInferenceProvider. Default behaviour is a transparent pass-through: each request is forwarded to its original URL via the WHATWG fetch global (HTTP) or WebSocket global (WebSocket), and the upstream response is streamed back unchanged. The same subclass handles both transports - onLlmRequest dispatches on req.transport. Virtual hooks: - HTTP: transformRequest, forward, transformResponse - WebSocket: forwardWebSocket, transformRequestMessage, transformResponseMessage E2e test (llm_inference_handler.e2e.test.ts) demonstrates a single TestHandler subclass servicing both an HTTP turn (single-shot title generation) and a WebSocket turn (main agent turn) against a per-test in-process http+ws upstream that speaks the real CAPI shapes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review fixes for github/copilot-sdk-internal#88 (Node SDK side). - Honor the runtime's accepted=false ack: the response sink now aborts the provider's signal and stops emitting once the runtime drops the request (I1). - Add a staging backstop in the adapter so a body chunk that arrives before its start frame is buffered and replayed rather than silently dropped (B1). - Run the WebSocket request/response pumps concurrently and race their terminal states, so an upstream-closes-first (or runtime-cancels-first) case tears the other side down instead of hanging on a parked iterator (B2). - Buffer inbound WS frames in wrapGlobalWebSocket until onMessage is registered so the first frames of a fast upstream aren't dropped. - Collapse the dead send branch, hoist TextEncoder/TextDecoder singletons, and correct the LlmWebSocketUpstream.onClose contract doc. - Update CopilotClientOptions.llmInference docs: streaming SSE and WebSocket are intercepted, not bypassed (I6). - Add unit tests: chunk-before-start staging, accepted=false abort, WS upstream-close-first finalisation, and WS upstream-error propagation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drives a CAPI session and a BYOK (openai/responses) session entirely through the LLM inference callback — the consumer fabricates every model-layer response, so the CAPI record/replay proxy is never the inference endpoint. Asserts each in-session inference request carries req.sessionId === session.sessionId and that the two session ids differ. The mock branches /responses on the request stream flag: BYOK turns whose config-derived model does not advertise streaming issue a buffered (non-streaming) /responses request expecting a single JSON response object, whereas the CAPI turn streams via SSE. This mirrors real upstream behaviour and confirms the callback transport faithfully delivers both shapes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirrors the TypeScript LLM inference callback feature in the .NET SDK so consumers can observe/mutate the model-layer HTTP/WebSocket requests the runtime issues (CAPI and BYOK), with the runtime session id threaded into each callback. - scripts/codegen/csharp.ts: emit the clientGlobal handler interface + registration so Rpc.cs gains the llmInference handler surface. - LlmInferenceProvider.cs: low-level ILlmInferenceProvider API + adapter (request staging, response sink state machine) behind an internal ILlmInferenceResponseChannel seam for unit testing. - LlmRequestHandler.cs: idiomatic pass-through base class mapping to HttpRequestMessage/HttpResponseMessage and ClientWebSocket, with virtual transform/forward hooks for both transports. - Types.cs/Client.cs: wire LlmInferenceConfig into the client and register the provider on start. - Tests: factored unit-test infra (recording channel/sink, inline provider, frame builders) with adapter + handler tests, plus CAPI+BYOK e2e tests asserting the session id reaches the callback. e2e provider emits raw JSON (reflection-free STJ) and serves all model-layer traffic off-network. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Hide the redundant low-level provider interface and adapter from the public surface in both SDKs; the sole public extension point is now the LlmRequestHandler base class. Replace the LlmInferenceConfig provider factory with a direct handler instance (the provider is client-global, constructed once with no args). .NET: ILlmInferenceProvider + the LlmInferenceRequest/ResponseInit/ResponseSink DTOs become internal; LlmRequestHandler implements the interface explicitly so OnLlmRequestAsync leaves its public surface. LlmInferenceConfig.Handler replaces the Func<LlmRequestHandler> factory. TS: stop exporting LlmInferenceProvider and createLlmInferenceAdapter from index.ts; LlmInferenceConfig.handler replaces createLlmInferenceProvider. The request/sink DTOs stay exported as onLlmRequest's contract (TS lacks explicit interface implementation). E2E providers become LlmRequestHandler subclasses overriding onLlmRequest. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapse the HTTP callback seam to SendRequest/sendRequest, replace websocket hooks with per-connection handlers, and update tests to use the forwarding handler model. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the LLM inference callback feature to the Python SDK, mirroring the existing Node.js and .NET implementations. Consumers subclass `LlmRequestHandler` and override `send_request` (idiomatic httpx) for HTTP or `open_web_socket` (websockets) for the WebSocket transport; both default to transparent pass-through. Wired through `LlmInferenceConfig` on the client, registered on the `clientGlobal.llmInference` scope. Adds the low-level provider/adapter, the httpx-based handler base class, client wiring, public exports, and httpx as a core dependency. Extends the Python codegen to emit clientGlobal handler registration and regenerates the generated RPC bindings. Includes 8 e2e test files (10 tests) mirroring the Node.js suite — round trip, session-id threading (CAPI + BYOK), streaming SSE, error mapping, runtime cancel, consumer cancel, WebSocket transport, and the idiomatic handler against a real local HTTP+WebSocket upstream. All pass off-network. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the existing Node/.NET/Python LLM inference callback support in the Go SDK. Consumers register an LlmInferenceProvider (or the idiomatic LlmRequestHandler over net/http + coder/websocket) via ClientOptions.LlmInference; the runtime routes every model-layer HTTP and WebSocket request through it for both CAPI and BYOK sessions. - Codegen (scripts/codegen/go.ts) now emits the clientGlobal handler registration, regenerating go/rpc/zrpc.go. - New low-level provider types + adapter (llm_inference_provider.go) and the idiomatic forwarding handler (llm_request_handler.go). - Wire LlmInferenceConfig into ClientOptions and the connect/start paths. - 8 off-network e2e scenarios mirroring the other SDKs (basic, session id, stream, errors, cancel, consumer cancel, websocket, handler). Also fixes a pre-existing Go e2e compile break (AttachmentBlob.Data became *string in the Rust contract regen baseline) that blocked the e2e package. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Generated by SDK Consistency Review Agent for issue #1689 · sonnet46 3.4M
| /// override <see cref="SendRequestAsync"/> or <see cref="OpenWebSocketAsync"/>. | ||
| /// </summary> | ||
| [Experimental(Diagnostics.Experimental)] | ||
| public class CopilotRequestHandler |
There was a problem hiding this comment.
Cross-SDK naming note: This class is named CopilotRequestHandler (along with CopilotRequestContext and ForwardingCopilotWebSocketHandler), following the Node.js convention. However, the three other SDKs added in this same PR (Go, Java) and the two pre-existing ones (Python, Rust) all use LlmRequestHandler / LlmRequestContext / ForwardingWebSocketHandler.
After this PR the naming is 4-vs-2: LlmRequestHandler (Python, Rust, Go, Java) vs CopilotRequestHandler (Node.js, .NET). If this split is intentional (grouping .NET with Node.js), that's fine, but it would be worth noting somewhere—otherwise it may be worth aligning .NET with the majority Llm* naming.
| /// outbound call. | ||
| /// </summary> | ||
| [Experimental(Diagnostics.Experimental)] | ||
| public CopilotRequestHandler? RequestHandler { get; set; } |
There was a problem hiding this comment.
Cross-SDK config API note: The handler is set here as a flat RequestHandler CopilotRequestHandler? property (same pattern as Node.js's requestHandler). The other four SDKs (Python, Rust, Go, Java) all wrap it in an LlmInferenceConfig config object, e.g.:
// Java
options.setLlmInference(new LlmInferenceConfig().setHandler(new MyHandler()));// Go
options.LlmInference = &LlmInferenceConfig{Handler: &MyHandler{}}# Python
options.llm_inference = LlmInferenceConfig(handler=MyHandler())After this PR the pattern is 4-vs-2: config wrapper (Python, Rust, Go, Java) vs direct property (Node.js, .NET). Developers targeting multiple SDKs will encounter a different configuration shape for the same feature. Worth a deliberate decision here.
| * @throws Exception | ||
| * if the request could not be completed | ||
| */ | ||
| protected HttpResponse<InputStream> sendHttp(HttpRequest request, LlmRequestContext ctx) throws Exception { |
There was a problem hiding this comment.
Minor cross-SDK naming note: This method is named sendHttp (like Rust's send_http), but Node.js and Python — both also in the LlmRequestHandler family — name the equivalent hook sendRequest / send_request. All three ultimately receive an "HTTP request" object, so the naming is semantically equivalent, but within the LlmRequestHandler group it differs between languages:
| SDK | HTTP hook name |
|---|---|
| Python | send_request(httpx.Request, ctx) |
| Rust | send_http(LlmHttpRequest, ctx) |
| Java | sendHttp(HttpRequest, ctx) |
Suggest clarifying whether sendHttp or sendRequest should be the canonical hook name for the LlmRequestHandler family (it won't matter for users of a single SDK, but affects multi-language teams reading docs side-by-side).
This comment has been minimized.
This comment has been minimized.
Mirror the .NET/Node simplification + terminology rename in the Go SDK: consolidate the provider/handler two-layer design into a single copilot_request_handler.go, drop the accepted:false ack plumbing and the staged backstop, emit the WebSocket 101 upgrade head eagerly (a lazy bridge deadlocks the runtime connect), and rename the public Llm* types to Copilot* (types carry the prefix; fields/methods stay succinct). The client option becomes ClientOptions.RequestHandler *CopilotRequestHandler. Generated wire types are untouched. Consolidate the e2e suite to three files (copilot_request_handler covering HTTP + WebSocket + streaming, copilot_request_session_id, and copilot_request_cancel_error with the error and runtime-cancel cases) plus a shared helpers file, replacing the nine llm_inference_* test files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the .NET/Node simplification + terminology rename in the Java SDK: fold the low-level provider/adapter indirection and the response sink/init/ body DTOs into CopilotRequestHandler plus an internal exchange, drop the accepted:false ack plumbing and the staged backstop, emit the WebSocket 101 upgrade head eagerly (a lazy bridge deadlocks the runtime connect), and rename the public Llm* types to Copilot* (types carry the prefix; methods stay succinct). The client option becomes CopilotClientOptions.requestHandler of type CopilotRequestHandler; LlmInferenceConfig is removed. Generated wire types are untouched. Consolidate the e2e suite to three tests (CopilotRequestHandlerE2ETest covering HTTP + WebSocket + streaming, CopilotRequestSessionIdE2ETest, and CopilotRequestCancelErrorE2ETest with the error and runtime-cancel cases) plus CopilotRequestTestSupport, replacing the eight LlmInference*E2ETest files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the .NET/Node simplification + terminology rename in the Python SDK: consolidate the provider/handler two-layer design into a single copilot/copilot_request_handler.py, drop the accepted:false ack plumbing and the staged backstop, emit the WebSocket 101 upgrade head eagerly (a lazy bridge deadlocks the runtime connect), and rename the public Llm* types to Copilot* (types carry the prefix; attributes/methods stay succinct). The client option becomes request_handler: CopilotRequestHandler. Generated wire types are untouched. Stream in-memory httpx responses (built with content=) by forwarding their buffered body, since their raw stream is already consumed and cannot be re-iterated; real streamed responses still flow through aiter_raw. Consolidate the e2e suite to three files (test_copilot_request_handler covering HTTP + WebSocket + streaming, test_copilot_request_session_id, and test_copilot_request_cancel_error with the error and runtime-cancel cases) plus a shared helpers module, replacing the eight test_llm_inference_* files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Mirror the .NET/Node simplification + terminology rename in the Rust SDK: consolidate the inference/dispatch/handler modules into a single copilot_request_handler.rs with one CopilotRequestHandler trait (default methods) plus an internal exchange, drop the accepted:false ack plumbing and the staged backstop (cancellation flows via CancellationToken), emit the WebSocket 101 upgrade head eagerly (a lazy bridge deadlocks the runtime connect), and rename the public Llm* types to Copilot* (types carry the prefix; fields/methods stay succinct). The client option becomes ClientOptions.request_handler. Generated wire types are untouched. Consolidate the e2e coverage into copilot_request_handler.rs with four tests mirroring the Node scenarios (HTTP + WebSocket + streaming via one handler, session-id threading, handler errors, and runtime-driven cancel), replacing the llm_inference.rs e2e module. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| !inference.is_empty(), | ||
| "expected at least one intercepted inference request" | ||
| ); | ||
| for (_, session_id) in &inference { |
| inference.len() > before, | ||
| "expected at least one intercepted BYOK inference request" | ||
| ); | ||
| for (_, session_id) in &inference[before..] { |
| handler.inference_attempts.load(Ordering::SeqCst) > 0, | ||
| "expected the inference callback to be reached and raise" | ||
| ); | ||
| if let Err(err) = send_result { |
| if self._upstream is not None: | ||
| try: | ||
| await self._upstream.close() | ||
| except Exception: |
| task.cancel() | ||
| try: | ||
| await task | ||
| except (asyncio.CancelledError, Exception): |
| finally: | ||
| try: | ||
| await client.stop() | ||
| except Exception: |
| is_inference_url, | ||
| isolated_client_fixture, | ||
| ) | ||
| from .testharness import E2ETestContext # noqa: F401 (ctx fixture dependency) |
| # The callback throws on inference; the turn surfaces an error (or | ||
| # completes without an assistant message) rather than hanging. | ||
| await session.send_and_wait("Say OK.") | ||
| except BaseException: # noqa: BLE001 |
| finally: | ||
| try: | ||
| await client.stop() | ||
| except Exception: |
| is_inference_url, | ||
| isolated_client_fixture, | ||
| ) | ||
| from .testharness import E2ETestContext # noqa: F401 (ctx fixture dependency) |
This comment has been minimized.
This comment has been minimized.
The branch had dropped the net472 `<Compile Include="..\src\Polyfills\*.cs">` from the test project, on the assumption that the SDK's internal polyfills were visible via InternalsVisibleTo. No InternalsVisibleTo to the test assembly actually exists, so the net472 test build lost its IsExternalInit definition and failed with CS0518 across every file using init-only setters and records (the .NET SDK Tests (windows-latest) job). net8.0 builds were unaffected because IsExternalInit ships in the BCL there. Restore the direct polyfill compile for net472 to match main; net472 now builds clean (0 warnings, 0 errors). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces SDK-wide support for intercepting the runtime’s outbound model-layer LLM inference traffic (HTTP/SSE + WebSocket) by registering a single client-global handler over JSON-RPC (llmInference.*). It extends the shared codegen pipeline to generate client-global reverse-RPC handlers and wires language-specific handler abstractions plus E2E coverage across Node.js, .NET, Python, Go, Rust, and Java.
Changes:
- Extend RPC schema/codegen to support a new
clientGlobalAPI surface and generate registration helpers forllmInference.*. - Add per-language “request handler” APIs/adapters to forward/replace runtime model HTTP + WebSocket traffic, including cancellation and session-id threading.
- Add cross-language E2E tests validating session-id propagation, cancellation/error behavior, and forwarding behavior for both HTTP and WebSocket paths.
Show a summary per file
| File | Description |
|---|---|
| scripts/codegen/utils.ts | Adds clientGlobal support to the shared API schema shape. |
| scripts/codegen/typescript.ts | Generates TS client-global handler interfaces + registration function. |
| scripts/codegen/python.ts | Generates Python client-global handler Protocols/dataclass + registration function. |
| scripts/codegen/go.ts | Generates Go client-global handler interfaces + registration function. |
| scripts/codegen/csharp.ts | Generates C# client-global handler interfaces + registration plumbing. |
| rust/tests/e2e/support.rs | Adds no-snapshot E2E context and helper to start an LLM handler-wired client. |
| rust/tests/e2e.rs | Registers new Rust E2E module for request-handler coverage. |
| rust/src/types.rs | Re-exports request-handler types for the Rust public surface. |
| rust/src/router.rs | Routes llmInference.* requests via a client-global dispatcher rather than per-session routing. |
| rust/src/lib.rs | Adds request-handler option wiring, early router start for startup model calls, and internal handle reconstruction. |
| rust/Cargo.toml | Adds HTTP/WebSocket dependencies needed for the Rust forwarding/transport implementation. |
| python/pyproject.toml | Promotes httpx to a runtime dependency and adds websockets to dev deps for tests. |
| python/e2e/test_copilot_request_session_id_e2e.py | Verifies session-id threading into handler for CAPI and BYOK sessions. |
| python/e2e/test_copilot_request_handler_e2e.py | Validates mutate-and-forward behavior across HTTP + WebSocket via a real upstream. |
| python/e2e/test_copilot_request_cancel_error_e2e.py | Covers thrown-error mapping and runtime cancellation behavior via cancel_event. |
| python/e2e/_copilot_request_helpers.py | Shared synthetic model endpoint builders and isolated-client fixtures for request-handler tests. |
| python/copilot/generated/rpc.py | Adds generated client-global handler types and registration for llmInference.*. |
| python/copilot/client.py | Wires request-handler registration and llmInference.setProvider into client startup. |
| python/copilot/init.py | Re-exports request-handler types/helpers as part of the public Python API surface. |
| nodejs/test/e2e/copilot_request_session_id.e2e.test.ts | Node E2E for session-id threading through the request handler. |
| nodejs/test/e2e/copilot_request_cancel_error.e2e.test.ts | Node E2E for handler error propagation and runtime-driven cancellation. |
| nodejs/src/types.ts | Adds requestHandler option + exports request-handler types from the Node public surface. |
| nodejs/src/index.ts | Re-exports request-handler APIs from the Node entrypoint. |
| nodejs/src/generated/rpc.ts | Adds generated client-global handler interfaces + registerClientGlobalApiHandlers. |
| nodejs/src/client.ts | Registers client-global handlers on the connection and calls llmInference.setProvider when configured. |
| nodejs/package.json | Adds ws and @types/ws for Node E2E WebSocket upstream tests. |
| nodejs/package-lock.json | Locks new ws / @types/ws additions. |
| java/src/test/java/com/github/copilot/FakeUpstreamServer.java | Adds an in-process raw HTTP/WebSocket upstream for Java forwarding E2E. |
| java/src/test/java/com/github/copilot/CopilotRequestSessionIdE2ETest.java | Java E2E verifying session-id threading for CAPI + BYOK. |
| java/src/test/java/com/github/copilot/CopilotRequestHandlerE2ETest.java | Java E2E validating synthetic HTTP servicing and forwarding over HTTP + WebSocket. |
| java/src/test/java/com/github/copilot/CopilotRequestCancelErrorE2ETest.java | Java E2E for error propagation and runtime cancellation behavior. |
| java/src/main/java/module-info.java | Makes java.net.http a required module dependency (runtime usage). |
| java/src/main/java/com/github/copilot/rpc/CopilotClientOptions.java | Adds requestHandler option to configure the connection-level interception handler. |
| java/src/main/java/com/github/copilot/LlmWebSocketResponseBridge.java | Ensures WebSocket upgrade head is emitted before any body/terminal frames. |
| java/src/main/java/com/github/copilot/LlmInferenceExchange.java | Implements per-request body ingestion + strict response emission state machine. |
| java/src/main/java/com/github/copilot/LlmInferenceAdapter.java | Bridges llmInference.* reverse-RPC frames to CopilotRequestHandler execution. |
| java/src/main/java/com/github/copilot/ForwardingCopilotWebSocketHandler.java | Default Java WS forwarding implementation using java.net.http.WebSocket. |
| java/src/main/java/com/github/copilot/CopilotWebSocketMessage.java | Represents WS text/binary messages exchanged via the handler seam. |
| java/src/main/java/com/github/copilot/CopilotWebSocketHandler.java | Base per-connection WS handler contract for runtime↔upstream message forwarding. |
| java/src/main/java/com/github/copilot/CopilotWebSocketCloseStatus.java | Carries terminal close/error state for callback-owned WS connections. |
| java/src/main/java/com/github/copilot/CopilotRequestTransport.java | Exposes the runtime’s intended transport selection (HTTP vs WebSocket). |
| java/src/main/java/com/github/copilot/CopilotRequestHandler.java | Adds Java handler base class with default HTTP forwarding + WS forwarding hooks. |
| java/src/main/java/com/github/copilot/CopilotRequestContext.java | Provides request-id/session-id/headers/url/transport/cancellation context to handlers. |
| java/src/main/java/com/github/copilot/CopilotClient.java | Registers the adapter + calls llmInference.setProvider when handler is configured. |
| go/types.go | Adds RequestHandler to Go client options for connection-level interception. |
| go/rpc/zrpc.go | Adds generated client-global handler types and registration for llmInference.*. |
| go/internal/e2e/copilot_request_session_id_e2e_test.go | Go E2E for session-id threading into intercepted inference calls. |
| go/internal/e2e/copilot_request_helpers_test.go | Shared synthetic upstream builders for Go request-handler E2Es. |
| go/internal/e2e/copilot_request_handler_e2e_test.go | Go E2E validating mutate-and-forward over HTTP + WebSocket. |
| go/internal/e2e/copilot_request_cancel_error_e2e_test.go | Go E2E covering error propagation and runtime cancellation behavior. |
| go/go.sum | Adds checksum entries for new Go WebSocket dependency. |
| go/go.mod | Adds github.com/coder/websocket for Go WebSocket upstream tests. |
| go/client.go | Wires client-global handler registration and LlmInference.SetProvider when configured. |
| dotnet/test/E2E/CopilotRequestSessionIdE2ETests.cs | .NET E2E verifying session-id threading for CAPI + BYOK sessions. |
| dotnet/test/E2E/CopilotRequestE2EProvider.cs | Shared .NET test handler that fabricates model endpoint responses for off-network turns. |
| dotnet/test/E2E/CopilotRequestCancelErrorE2ETests.cs | .NET E2E for handler error mapping and runtime cancellation propagation. |
| dotnet/src/Types.cs | Adds experimental RequestHandler option to CopilotClientOptions. |
| dotnet/src/Generated/Rpc.cs | Adds generated types and client-global registration for llmInference.*. |
| dotnet/src/Client.cs | Registers client-global handlers and calls LlmInference.SetProviderAsync when configured. |
Copilot's findings
Files not reviewed (2)
- go/rpc/zrpc.go: Generated file
- nodejs/package-lock.json: Generated file
- Files reviewed: 62/69 changed files
- Comments generated: 1
|
|
||
| httpSrv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { | ||
| path := strings.ToLower(strings.SplitN(r.URL.Path, "?", 2)[0]) | ||
| _ = r.Body.Close |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Generated by SDK Consistency Review Agent for issue #1689 · sonnet46 3.1M
| return nil | ||
| } | ||
| return f.conn.Write(ctx, websocket.MessageText, out) | ||
| } |
There was a problem hiding this comment.
Cross-SDK consistency concern: binary WebSocket frame type is lost
The Go adapter decodes binary frames correctly in routeChunk (line ~603: base64-decodes when params.Binary == true) but drops the binary/text flag before pushing into the Body channel. As a result, SendRequestMessage here always forwards as websocket.MessageText, regardless of the original wire frame type.
All other SDKs carry the binary flag through the request message path:
- .NET:
CopilotWebSocketMessage.IsBinary - Java:
CopilotWebSocketMessage.binary() - Python:
str | bytes(type encodes the distinction) - Node.js:
string | Uint8Array(same) - Rust:
CopilotWebSocketMessage.binary: bool
In practice, LLM inference likely only uses text frames, so this may not cause immediate issues. But if binary frames ever appear this will corrupt the upstream message silently. Consider threading the binary flag through the Body channel (e.g. as a struct { data []byte; binary bool }) so SendRequestMessage can use the correct message type.
| type CopilotWebSocketCloseStatus struct { | ||
| Description string | ||
| Code string | ||
| Err error |
There was a problem hiding this comment.
Minor cross-SDK naming inconsistency: Code vs ErrorCode
All other SDKs name this field ErrorCode/errorCode/error_code:
- .NET:
public string? ErrorCode { get; init; } - Java:
public String errorCode() - Python:
error_code: str | None - Node.js:
readonly errorCode?: string
Go uses the shorter Code, which is not wrong, but this creates a small inconsistency for developers navigating across SDKs. Consider renaming to ErrorCode to match the cross-SDK convention (the Err field name is fine — Err vs Error is established Go idiom).
| &self, | ||
| ctx: &CopilotRequestContext, | ||
| response: CopilotWebSocketResponse, | ||
| ) -> Result<Box<dyn CopilotWebSocketHandler>, CopilotRequestError> { |
There was a problem hiding this comment.
API surface difference from other SDKs: extra response parameter
Every other SDK's open_websocket / openWebSocket / OpenWebSocketAsync takes only the context:
// Node.js
protected openWebSocket(ctx: CopilotRequestContext): Promise<CopilotWebSocketHandler>
// .NET
protected virtual Task<CopilotWebSocketHandler> OpenWebSocketAsync(CopilotRequestContext ctx)
// Java
protected CopilotWebSocketHandler openWebSocket(CopilotRequestContext ctx)
// Python
async def open_web_socket(self, ctx: CopilotRequestContext) -> CopilotWebSocketHandler
// Go
OpenWebSocket func(ctx *CopilotRequestContext) (CopilotWebSocketHandler, error)
Rust adds response: CopilotWebSocketResponse as a second argument, which consumers must store and use to write upstream→runtime messages.
This is architecturally necessary for Rust traits (no base-class state for a sendResponseMessage helper), so it isn't a bug. But it's worth documenting clearly in the CopilotRequestHandler trait doc comment — specifically that response must be stored in the returned handler struct and used to call response.send_message(...) in place of the sendResponseMessage base-class helper the other SDKs provide.
|
|
||
| private static int GetFreePort() | ||
| { | ||
| var probe = new TcpListener(IPAddress.Loopback, 0); |
| catch | ||
| { | ||
| // Accept loop unwinds on listener shutdown. | ||
| } |
| catch | ||
| { | ||
| // Already stopped. | ||
| } |
| catch | ||
| { | ||
| // Already torn down. | ||
| } |
| catch | ||
| { | ||
| // Best-effort: the runtime tears connections down as turns complete. | ||
| } |
| catch | ||
| { | ||
| // Best-effort; the socket may already be closed. | ||
| } |
| catch (Exception ex) | ||
| { | ||
| await CloseAsync(new CopilotWebSocketCloseStatus | ||
| { | ||
| Description = ex.Message, | ||
| Error = ex, | ||
| }).ConfigureAwait(false); | ||
| } |
| catch | ||
| { | ||
| // Some headers are managed by the handshake; ignore rejections. | ||
| } |
| private readonly string _url; | ||
| private readonly IReadOnlyDictionary<string, IReadOnlyList<string>> _headers; | ||
| private WebSocket? _upstream; | ||
| private CancellationTokenSource? _pumpCts; |
| { | ||
| private readonly string _url; | ||
| private readonly IReadOnlyDictionary<string, IReadOnlyList<string>> _headers; | ||
| private WebSocket? _upstream; |
The codegen-check job runs `cargo +nightly fmt --all` with `rust/.rustfmt.nightly.toml` (imports_granularity = "Module") over the whole crate and fails if the tree changes. The hand-written copilot_request_handler.rs was only stable-formatted, leaving two adjacent `tokio::sync` imports unconsolidated. Merge them into the canonical nightly-fmt form. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
staticcheck (SA1019) flags http.ProtocolError as deprecated. The throwing transport only needs to return some transport-level error to exercise the error path, so use errors.New instead. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rustdoc's redundant_explicit_links lint (deny under -D warnings) flagged the [`CopilotRequestHandler`](crate::copilot_request_handler::CopilotRequestHandler) link, since the label already resolves to the same destination. Drop the explicit target and rely on intra-doc resolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The custom JsonRpc reader dispatches each incoming reverse-RPC method fire-and-forget on the thread pool, so llmInference.httpRequestStart and httpRequestChunk for the same request run concurrently. The start handler registered the exchange in _pending; the chunk handler looked it up and silently dropped the frame on a miss. When the dropped frame was the body End, the request-body drain blocked forever, the model HTTP request never completed, and the turn hung until sendAndWait's 60s idle timeout fired. Make the adapter ordering-independent: both httpRequestStart and httpRequestChunk get-or-create the exchange via _pending.GetOrAdd. A body chunk that races ahead of its start now buffers into the same exchange's channel instead of being dropped; start adopts that exchange, fills in Method/Context, and launches the handler exactly once. Locally reproduced the hang ~1-in-3 before; 20/20 green after. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jsonrpc2 spawns a goroutine per incoming request, so httpRequestStart and httpRequestChunk for the same request run concurrently. The chunk handler dropped any frame whose requestId was not yet registered by start; when the dropped frame was the body End, the request-body drain blocked forever and the turn hung until the 60s idle timeout fired. Add getOrCreateExchange so both entry points get-or-create the exchange: a body chunk that races ahead of start buffers into the shared frame queue instead of being dropped, and start adopts that exchange's context/queue. 10/10 green after. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both handleRequestStart and handleRequestChunk now get-or-create the exchange via getOrCreateExchange, so a body chunk that races ahead of its start frame buffers into the same exchange instead of being dropped (which would hang the request-body drain). LlmInferenceExchange is constructed bare (requestId, rpcSupplier) with method set when the start frame arrives. Mirrors the .NET/Go root-cause fix for cross-SDK consistency; 5/5 e2e green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make the LLM-inference adapter ordering-independent, matching the .NET/Go root-cause fix. Both httpRequestStart and httpRequestChunk now get-or-create the exchange, so a body chunk (including the terminal end frame) that races ahead of its start frame buffers into the same exchange instead of being dropped — which would hang the request-body drain. CopilotRequestExchange is now constructed bare (requestId, getServerRpc) with its request context filled in via setContext when the start frame arrives. Node's vscode-jsonrpc reader currently dispatches in order so this race is not reachable today, but the get-or-create shape keeps the adapter correct regardless of dispatch order and consistent across all six SDKs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…apter Both http_request_start and http_request_chunk now get-or-create the exchange via _get_or_create, so a body chunk that races ahead of its start frame buffers into the same exchange instead of being dropped (which would hang the request-body drain). _CopilotRequestExchange is constructed bare (request_id, get_server_rpc); the request context is filled in via set_context when the start frame arrives. Mirrors the .NET/Go root-cause fix for cross-SDK consistency. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both handle_start and handle_chunk now get-or-create the exchange via get_or_create_exchange, so a body chunk that races ahead of its start frame buffers into the same exchange's body channel instead of being dropped (which would hang the request-body drain). The request metadata moves behind a OnceLock filled by set_context when the start frame arrives, letting the exchange be created bare from either entry point. Mirrors the .NET/Go root-cause fix for cross-SDK consistency; fmt (stable+nightly) and clippy clean, 4/4 e2e green. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
b8013ef to
32f4164
Compare
Cross-SDK Consistency ReviewOverall, this is an impressively consistent implementation across all six SDK languages — class names, option field names, and WebSocket API shapes are well-aligned. One naming inconsistency in the HTTP hook is worth flagging before merge.
|
| SDK | HTTP override method |
|---|---|
| Node.js | sendRequest(request: Request, ctx) |
| .NET | SendRequestAsync(HttpRequestMessage, ctx) |
| Python | send_request(self, request, ctx) |
| Rust | send_http(request: CopilotHttpRequest, ctx) |
| Java | sendHttp(HttpRequest request, ctx) |
| Go | n/a — struct field Transport http.RoundTripper (idiomatic) |
Rust and Java use send_http/sendHttp while the other three SDKs spell it sendRequest/SendRequestAsync/send_request. A developer porting a handler from Node.js to Java would look for sendRequest, not find it, and have to hunt for sendHttp. Aligning all five override-based SDKs on sendRequest/SendRequestAsync/send_request (or alternatively all on sendHttp/send_http) would improve discoverability across the SDK family.
i️ PR description has stale Java examples
The Java usage example in the PR description references types and method names that differ from the actual code:
| PR description | Actual code |
|---|---|
extends LlmRequestHandler |
extends CopilotRequestHandler |
LlmRequestContext |
CopilotRequestContext |
new ForwardingWebSocketHandler(ctx.url(), ctx.headers()) |
new ForwardingCopilotWebSocketHandler(context) |
protected byte[] onSendRequestMessage(byte[] data, boolean binary) |
public void sendRequestMessage(CopilotWebSocketMessage message) |
This doesn't affect the implementation itself but will confuse readers. Worth updating before merge so the description matches the actual API.
✅ Everything else looks consistent
- Class names:
CopilotRequestHandler,CopilotRequestContext,CopilotWebSocketHandler,CopilotWebSocketCloseStatus,ForwardingCopilotWebSocketHandler— consistent across all SDKs ✅ - Client option field:
requestHandler/RequestHandler/request_handler— consistent across all SDKs ✅ - WebSocket override:
openWebSocket/OpenWebSocketAsync/open_web_socket/open_websocket— consistent (language-case only) ✅ - WebSocket message type
CopilotWebSocketMessageand itssendRequestMessage/sendResponseMessagehooks — consistent ✅ - E2E test coverage appears symmetric across all six SDKs ✅
Generated by SDK Consistency Review Agent for issue #1689 · sonnet46 2.2M · ◷
There was a problem hiding this comment.
Generated by SDK Consistency Review Agent for issue #1689 · sonnet46 2.2M
| * @throws Exception | ||
| * if the request could not be completed | ||
| */ | ||
| protected HttpResponse<InputStream> sendHttp(HttpRequest request, CopilotRequestContext ctx) throws Exception { |
There was a problem hiding this comment.
Naming inconsistency with Node.js / .NET / Python: The HTTP interception hook is called sendHttp here (and send_http in Rust), but the equivalent in the other three SDKs is sendRequest / SendRequestAsync / send_request. A consumer who has read the Node.js docs will look for sendRequest and not find it.
Suggest aligning on one name across all five override-based SDKs — either rename this to sendRequest (matching the majority) or update the other three to sendHttp/send_http.
| /// Service one intercepted HTTP request. Default: forward to the real | ||
| /// upstream via [`forward_http`]. Override to mutate the request before | ||
| /// forwarding, mutate the response after, or replace the call entirely. | ||
| async fn send_http( |
There was a problem hiding this comment.
Naming inconsistency with Node.js / .NET / Python: The HTTP interception hook is called send_http here (and sendHttp in Java), but the equivalent in the other three SDKs is sendRequest / SendRequestAsync / send_request. Aligning on one name across all five override-based SDKs would improve cross-language discoverability.
Summary
This PR adds SDK support for intercepting LLM inference requests and handling them in user code across all six SDK languages: Node.js, .NET, Python, Go, Rust, and Java.
Consumers register one client-global
LlmRequestHandler(constructed once, no args). The runtime invokes it over JSON-RPC (llmInference.*) whenever it would otherwise issue a model-layer HTTP or WebSocket request — for both BYOK and CAPI — fully replacing the outbound call. A handler that overrides nothing is a transparent pass-through.It includes the full feature work on this branch:
LlmRequestHandlermodel with forwarding helpersWhat changed
Shared protocol and plumbing
llmInference.setProvider)Per-language ports
Each language exposes the same
LlmRequestHandlermodel, mapped onto the most canonical HTTP representation available in that ecosystem:Request/Response(Fetch)HttpRequestMessage/HttpResponseMessageClientWebSockethttpxrequest/response*http.Request/*http.Responsehttp::Request/http::Responsejava.net.httpHttpRequest/HttpResponsejava.net.http.WebSocketAll ports thread cancellation and session id through the request context, and provide a
ForwardingWebSocketHandler(or equivalent) for the common mutate-and-forward case.API shape
SendRequestAsync/sendRequest/sendHttp/ language equivalents)OpenWebSocketAsync/openWebSocket, returning a per-connection handler objectUsage examples
C#
Node.js
Java
Register the handler when constructing the client, e.g. in Java:
Tests
Each language adds e2e coverage (mirroring a shared reference suite) for:
Plus per-language unit coverage where applicable (e.g. .NET handler/adapter behavior, Node.js unit flows).
Resolves github/copilot-sdk-internal#88