7 changes: 6 additions & 1 deletion .github/workflows/docker-image.yml
@@ -32,6 +32,9 @@ jobs:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Derive image name
id: image
run: echo "name=ghcr.io/${GITHUB_REPOSITORY,,}" >> "$GITHUB_OUTPUT"
- name: Build and push
uses: docker/build-push-action@v7
with:
@@ -40,6 +43,8 @@ jobs:
platforms: linux/amd64
# platforms: linux/amd64,linux/arm64
push: ${{ github.event_name != 'pull_request' }}
- tags: ghcr.io/${{ github.REPOSITORY }}:latest
+ tags: |
+ ${{ steps.image.outputs.name }}:latest
+ ${{ steps.image.outputs.name }}:${{ github.sha }}
# cache-from: type=gha
# cache-to: type=gha,mode=max
363 changes: 363 additions & 0 deletions .github/workflows/rolling-update.yml
@@ -0,0 +1,363 @@
name: Rolling update

# Manually-triggered production rollout. Joins the Tailnet, SSHes over
# MagicDNS into each node, and invokes scripts/rolling-update.sh.
# See docs/design/2026_04_24_proposed_deploy_via_tailscale.md.

on:
workflow_dispatch:
inputs:
ref:
description: Image tag/ref to deploy. Start this workflow from the repository default branch.
required: true
type: string
image_tag:
description: Override the image tag (default = ref). Used for rollbacks.
required: false
type: string
default: ""
nodes:
description: Comma-separated raft IDs to roll (e.g. "n1,n2"). Empty = all nodes in NODES_RAFT_MAP.
required: false
type: string
default: ""
dry_run:
description: Render the plan and run a reachability check only; do NOT touch containers.
required: true
type: boolean
default: true

permissions:
contents: read
id-token: write # required by tailscale/github-action OIDC flow
Comment on lines +30 to +32

P1: Add packages:read for GHCR manifest check

This workflow narrows GITHUB_TOKEN to contents and id-token, which implicitly removes package scope, but the Verify image exists on ghcr.io step authenticates to GHCR and inspects a manifest with that token. In environments where the image is private (or package auth is otherwise required), docker login/docker manifest inspect will fail with authorization errors before rollout begins, so deploys are blocked even when the image exists.

packages: read # required by `docker manifest inspect` on ghcr.io private images

concurrency:
group: rolling-update
cancel-in-progress: false

jobs:
deploy:
runs-on: ubuntu-latest
# Approval gate — see GitHub environment settings for required reviewers.
# Dry-runs also use this environment so the secret wiring is identical;
# the environment's approval rule should be configured to auto-approve
# dry-runs if that distinction is desired (GitHub UI: "Deployment
# protection rules").
environment: production
timeout-minutes: 60

steps:
# The deploy script is executed after the tailnet join and SSH key load.
# Always take that script from the review-gated default branch; the
# workflow input only selects the image tag/ref to deploy.
- name: Resolve trusted checkout ref
id: trusted-ref
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REF: ${{ inputs.ref }}
RUN_REF_NAME: ${{ github.ref_name }}
RUN_REF_TYPE: ${{ github.ref_type }}
run: |
set -euo pipefail
default_branch=$(gh api "repos/${{ github.repository }}" --jq '.default_branch')
if [[ "$RUN_REF_TYPE" != "branch" || "$RUN_REF_NAME" != "$default_branch" ]]; then
echo "::error::rolling-update must be dispatched from the trusted default branch '$default_branch' (got ${RUN_REF_TYPE}:${RUN_REF_NAME})"
echo "::error::configure the production environment to allow deployments only from the default branch"
exit 1
fi
echo "checkout_ref=$default_branch" >> "$GITHUB_OUTPUT"

P2: Pin checkout to the dispatch SHA

When a production run waits for environment approval, the default branch can advance before these steps execute; because this output is the branch name, actions/checkout later fetches whatever is at main at that moment rather than the workflow run's github.sha. That can make an approved/audited rollout execute a different scripts/rolling-update.sh than the one associated with the dispatched workflow, so resolve and checkout the immutable dispatch SHA instead of the moving branch ref.
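
A minimal sketch of that change, assuming the guard step keeps its current inputs. `resolve_checkout_ref` is a hypothetical helper name, not something in the PR; it keeps the default-branch guard but prints the run's commit SHA rather than the branch name, so the later checkout cannot drift if the branch advances while the run waits for approval.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: validate the dispatch ref, then print the immutable
# commit SHA to check out instead of the (moving) branch name.
resolve_checkout_ref() {
  local ref_type="$1" ref_name="$2" default_branch="$3" sha="$4"
  if [[ "$ref_type" != "branch" || "$ref_name" != "$default_branch" ]]; then
    echo "::error::must be dispatched from '$default_branch' (got ${ref_type}:${ref_name})" >&2
    return 1
  fi
  # GitHub commit SHAs are 40 hex characters; refuse anything else.
  if [[ ! "$sha" =~ ^[0-9a-f]{40}$ ]]; then
    echo "::error::unexpected dispatch SHA '$sha'" >&2
    return 1
  fi
  printf '%s' "$sha"
}
```

Inside the step this could be wired up as `echo "checkout_ref=$(resolve_checkout_ref "$RUN_REF_TYPE" "$RUN_REF_NAME" "$default_branch" "$GITHUB_SHA")" >> "$GITHUB_OUTPUT"`, where `GITHUB_SHA` is the default environment variable holding the commit that triggered the run.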

echo "deploy ref/image tag: $REF"

- name: Checkout trusted deploy script
uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6
with:
ref: ${{ steps.trusted-ref.outputs.checkout_ref }}
persist-credentials: false
Comment on lines +90 to +94

P1: Require dispatches to run from the trusted branch

When a manual workflow_dispatch run is started from a non-default branch (via the UI branch selector or gh workflow run --ref), the workflow YAML from that branch is already what GitHub executes; this checkout only changes the workspace contents afterward. Under the documented production environment setup, an unreviewed branch could modify later steps to use DEPLOY_SSH_PRIVATE_KEY/Tailscale secrets after approval, so checking out the default branch here does not actually keep the deploy workflow trusted. Add an early guard that fails unless the dispatch ref is the default branch, or require the production environment to allow deployments only from that branch.


- name: Verify image exists on ghcr.io
env:
IMAGE_BASE: ${{ vars.IMAGE_BASE }}
IMAGE_TAG: ${{ inputs.image_tag || inputs.ref }}

P1: Publish immutable image tags before deploying refs

When operators follow the documented default and enter a commit SHA/tag in ref, this resolves the deployment image tag from that value, but I checked the only image publishing workflow and it pushes only ghcr.io/${{ github.REPOSITORY }}:latest (.github/workflows/docker-image.yml:43). As a result docker manifest inspect fails for SHA/tag deploys before any rollout, and the documented rollback path using a previous SHA cannot work unless someone manually creates those tags; publish immutable tags in the build workflow or stop defaulting the deploy tag to ref.
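
As a rough illustration of the first option, the tag list a build could publish might be derived like this. `image_tags` is a hypothetical helper and the repository and SHA values are placeholders; the lowercasing uses the bash 4+ `${var,,}` expansion because registry image names must be lowercase.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print one image tag per line for a repository and commit.
# "${1,,}" lowercases the owner/repo part (bash 4+).
image_tags() {
  local repo="${1,,}" sha="$2"
  local name="ghcr.io/${repo}"
  # A moving tag for convenience plus an immutable, SHA-addressed tag
  # that deploys and rollbacks can reference.
  printf '%s\n' "${name}:latest" "${name}:${sha}"
}

# Example with placeholder values (not taken from the PR).
image_tags "Example-Org/Example-Repo" "0123abc"
```

With the SHA tag published on every build, entering a previous commit SHA as `ref` resolves to an image that actually exists.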

GHCR_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ACTOR: ${{ github.actor }}
run: |
set -euo pipefail
if [[ -z "$IMAGE_BASE" ]]; then
echo "::error::IMAGE_BASE repository variable is not set"
exit 1
fi
echo "Checking $IMAGE_BASE:$IMAGE_TAG"
echo "$GHCR_TOKEN" | docker login ghcr.io -u "$ACTOR" --password-stdin >/dev/null
if ! docker manifest inspect "$IMAGE_BASE:$IMAGE_TAG" >/dev/null; then
echo "::error::image $IMAGE_BASE:$IMAGE_TAG not found on ghcr.io"
exit 1
fi

- name: Join Tailnet (ephemeral)
uses: tailscale/github-action@6cae46e2d796f265265cfcf628b72a32b4d7cade # v3
with:
oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
tags: tag:ci-deploy

- name: Configure SSH
env:
SSH_KEY: ${{ secrets.DEPLOY_SSH_PRIVATE_KEY }}
KNOWN_HOSTS: ${{ secrets.DEPLOY_KNOWN_HOSTS }}
run: |
set -euo pipefail
mkdir -p ~/.ssh
chmod 700 ~/.ssh
printf '%s\n' "$SSH_KEY" > ~/.ssh/id_ed25519
chmod 600 ~/.ssh/id_ed25519
printf '%s\n' "$KNOWN_HOSTS" > ~/.ssh/known_hosts
chmod 644 ~/.ssh/known_hosts
# Sanity: the key file is non-empty and ssh-keygen can parse it.
test -s ~/.ssh/id_ed25519 || { echo "::error::DEPLOY_SSH_PRIVATE_KEY is empty"; exit 1; }
ssh-keygen -lf ~/.ssh/id_ed25519 >/dev/null

- name: Render NODES and SSH_TARGETS
id: render
env:
NODES_RAFT_MAP: ${{ vars.NODES_RAFT_MAP }}
SSH_TARGETS_MAP: ${{ vars.SSH_TARGETS_MAP }}
NODES_FILTER: ${{ inputs.nodes }}
run: |
set -euo pipefail
if [[ -z "$NODES_RAFT_MAP" ]]; then
echo "::error::NODES_RAFT_MAP is not set in the production environment variables"
exit 1
fi

normalize_csv_map() {
local all="$1"
local out=""
local e key value
if [[ -z "$all" ]]; then
printf '%s' ""
return 0
fi
IFS=',' read -r -a entries <<< "$all"
for e in "${entries[@]}"; do
e="${e//[[:space:]]/}"
[[ -n "$e" ]] || continue
if [[ "$e" != *=* ]]; then
echo "::error::invalid map entry '$e' (expected raftId=value)"
exit 1
fi
key="${e%%=*}"
value="${e#*=}"
if [[ -z "$key" || -z "$value" ]]; then
echo "::error::invalid map entry '$e' (empty raft ID or value)"
exit 1
fi
out+="${out:+,}${key}=${value}"
done
printf '%s' "$out"
}

lookup_map() {
local key="$1"
local all="$2"
local e entry_key entry_value
[[ -n "$all" ]] || return 1
IFS=',' read -r -a entries <<< "$all"
for e in "${entries[@]}"; do
e="${e//[[:space:]]/}"
[[ -n "$e" ]] || continue
entry_key="${e%%=*}"
entry_value="${e#*=}"
if [[ "$entry_key" == "$key" ]]; then
printf '%s' "$entry_value"
return 0
fi
done
return 1
}

filter_csv() {
local all="$1"
local filter="$2"
local out=""
local w value
if [[ -z "$all" ]]; then
printf '%s' ""
return 0
fi
IFS=',' read -r -a list_wanted <<< "$filter"
for w in "${list_wanted[@]}"; do
w="${w//[[:space:]]/}"
[[ -n "$w" ]] || continue
value="$(lookup_map "$w" "$all" || true)"
if [[ -n "$value" ]]; then
out+="${out:+,}${w}=${value}"
fi
done
printf '%s' "$out"
}

known_ids_csv() {
local all="$1"
local out=""
local e key
IFS=',' read -r -a entries <<< "$all"
for e in "${entries[@]}"; do
e="${e//[[:space:]]/}"
[[ -n "$e" ]] || continue
key="${e%%=*}"
out+="${out:+,}$key"
done
printf '%s' "$out"
}

materialize_ssh_targets() {
local nodes="$1"
local ssh_targets="$2"
local out=""
local e key host target
if [[ -z "$nodes" ]]; then
printf '%s' ""
return 0
fi
IFS=',' read -r -a entries <<< "$nodes"
for e in "${entries[@]}"; do
e="${e//[[:space:]]/}"
[[ -n "$e" ]] || continue
key="${e%%=*}"
host="${e#*=}"
target="$(lookup_map "$key" "$ssh_targets" || true)"
if [[ -z "$target" ]]; then
target="$host"
fi
out+="${out:+,}${key}=${target}"
done
printf '%s' "$out"
}

NODES_RAFT_MAP="$(normalize_csv_map "$NODES_RAFT_MAP")"
SSH_TARGETS_MAP="$(normalize_csv_map "$SSH_TARGETS_MAP")"
if [[ -z "$NODES_RAFT_MAP" ]]; then
echo "::error::NODES_RAFT_MAP did not contain any nodes"
exit 1
fi
NODES_FILTER="${NODES_FILTER//[[:space:]]/}"

ROLLING_ORDER="$(known_ids_csv "$NODES_RAFT_MAP")"
if [[ -n "$NODES_FILTER" ]]; then
# Keep NODES_RAFT_MAP as the full cluster map. rolling-update.sh
# derives RAFT_TO_REDIS_MAP / RAFT_TO_S3_MAP and transfer
# candidates from NODES, so filtering it for a staged rollout would
# start the target node with an incomplete view of the cluster.
# The requested subset is passed separately as ROLLING_ORDER.
# Reject any filter ID that does not appear in the map: silently
# dropping unknown IDs would let a typo like "n1,n9" proceed as
# a one-node rollout of n1 alone, which is a staged-deploy
# footgun.
unknown=""
IFS=',' read -r -a wanted <<< "$NODES_FILTER"
for w in "${wanted[@]}"; do
[[ -n "$w" ]] || continue
if ! lookup_map "$w" "$NODES_RAFT_MAP" >/dev/null; then
unknown+="${unknown:+, }$w"
fi
done
if [[ -n "$unknown" ]]; then
echo "::error::nodes filter '$NODES_FILTER' references unknown raft IDs: $unknown. Known IDs: $(known_ids_csv "$NODES_RAFT_MAP")"
exit 1
fi
ROLLING_ORDER="$(known_ids_csv "$(filter_csv "$NODES_RAFT_MAP" "$NODES_FILTER")")"
if [[ -z "$ROLLING_ORDER" ]]; then
echo "::error::nodes filter '$NODES_FILTER' matches nothing in NODES_RAFT_MAP"
exit 1
fi
fi
SSH_TARGETS_MAP="$(materialize_ssh_targets "$NODES_RAFT_MAP" "$SSH_TARGETS_MAP")"
ROLLING_SSH_TARGETS="$(filter_csv "$SSH_TARGETS_MAP" "$ROLLING_ORDER")"
{
echo "NODES=$NODES_RAFT_MAP"
echo "SSH_TARGETS=$SSH_TARGETS_MAP"
echo "ROLLING_ORDER=$ROLLING_ORDER"
echo "ROLLING_SSH_TARGETS=$ROLLING_SSH_TARGETS"
} >> "$GITHUB_OUTPUT"
echo "::group::Deploy plan"
echo "NODES=$NODES_RAFT_MAP"
echo "SSH_TARGETS=$SSH_TARGETS_MAP"
echo "ROLLING_ORDER=$ROLLING_ORDER"
echo "ROLLING_SSH_TARGETS=$ROLLING_SSH_TARGETS"
echo "::endgroup::"

- name: SSH reachability check
env:
SSH_TARGETS: ${{ steps.render.outputs.ROLLING_SSH_TARGETS }}
SSH_USER: ${{ vars.SSH_USER }}
run: |
set -euo pipefail
IFS=',' read -r -a entries <<< "$SSH_TARGETS"
failed=0
for e in "${entries[@]}"; do
Comment on lines +314 to +316

P2: Check reachability for all rollout nodes, not only SSH map entries

The reachability step iterates only over SSH_TARGETS, but rolling-update.sh resolves missing SSH mappings by falling back to each node host from NODES (ssh_target_by_id). If SSH_TARGETS_MAP is incomplete, dry-run can report success without probing some actual rollout targets, and the job can then fail mid-roll when it reaches an unvalidated node. Preflight should derive targets from NODES + SSH_TARGETS using the same fallback semantics or enforce one-to-one mapping coverage first.

target="${e##*=}"
if [[ "$target" != *@* ]]; then
target="${SSH_USER:-$(id -un)}@$target"
fi
ok=0
for attempt in 1 2 3 4 5 6; do
if ssh -o BatchMode=yes -o ConnectTimeout=10 -o StrictHostKeyChecking=yes "$target" true; then
echo " ok $target"
ok=1
break
fi
if [[ "$attempt" -lt 6 ]]; then
echo " wait $target (attempt $attempt failed; retrying)"
sleep 10
fi
done
if [[ "$ok" -ne 1 ]]; then
echo "::error::$target not reachable by SSH over tailnet"
failed=1
fi
done
if [[ "$failed" -ne 0 ]]; then
exit 1
fi

- name: Dry-run summary
if: ${{ inputs.dry_run }}
env:
NODES: ${{ steps.render.outputs.NODES }}
SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }}
IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }}
SSH_USER: ${{ vars.SSH_USER }}
ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }}
S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }}
DRY_RUN: "true"
REF: ${{ inputs.ref }}
run: |
set -euo pipefail
if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then
echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment"
exit 1
fi
./scripts/rolling-update.sh --dry-run
echo "ref: $REF"
echo "Re-run with dry_run=false to apply."

- name: Roll cluster
if: ${{ !inputs.dry_run }}
env:
NODES: ${{ steps.render.outputs.NODES }}
SSH_TARGETS: ${{ steps.render.outputs.SSH_TARGETS }}
ROLLING_ORDER: ${{ steps.render.outputs.ROLLING_ORDER }}
SSH_USER: ${{ vars.SSH_USER }}
Comment on lines +412 to +416

P1: Forward S3 security settings into the rollout

For deployments that currently set ENABLE_S3=false or S3_CREDENTIALS_FILE in the manual rollout environment, this workflow does not forward those script settings into rolling-update.sh. The script defaults ENABLE_S3=true, and when S3_CREDENTIALS_FILE is empty it omits --s3CredentialsFile, causing a restarted node to expose the S3 adapter without the configured SigV4 credential file; forward these vars from the production environment (or fail closed) before making this workflow the canonical deploy path.

Comment on lines +412 to +416

P2: Forward non-default rollout settings

In deployments that rely on any non-default rolling-update.sh settings (for example RAFT_PORT, REDIS_PORT, DYNAMO_PORT, DATA_DIR, RAFT_ENGINE, S3_REGION, or EXTRA_ENV), this env block drops those overrides and the script falls back to its built-in defaults when recreating the container. That can make a node advertise the wrong ports or use a different data directory/engine than the current manual rollout configuration, so the workflow should either forward the existing script contract from environment vars or fail closed when unsupported overrides are required.
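
One possible shape for the fail-closed variant, as a sketch only. `require_vars` is a hypothetical helper, and which settings count as mandatory is an assumption each deployment would have to decide; the variable names below are the ones listed in this comment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: fail closed when any named variable is unset or empty,
# so the rollout never silently falls back to a built-in default.
require_vars() {
  local name missing=""
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion with an empty default,
    # which stays safe under `set -u`.
    if [[ -z "${!name:-}" ]]; then
      missing+="${missing:+, }$name"
    fi
  done
  if [[ -n "$missing" ]]; then
    echo "::error::missing required rollout settings: $missing" >&2
    return 1
  fi
}
```

A step would call something like `require_vars RAFT_PORT REDIS_PORT DATA_DIR RAFT_ENGINE` after mapping the corresponding environment variables into its `env:` block, and before invoking the script.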

IMAGE: ${{ vars.IMAGE_BASE }}:${{ inputs.image_tag || inputs.ref }}

P2: Provide registry auth to remote pulls

The workflow authenticates only the Actions runner to GHCR for manifest inspection, then the live rollout passes just IMAGE to rolling-update.sh; the actual docker pull runs later on each remote host. For private GHCR packages, which this workflow explicitly accounts for with packages: read, nodes without preexisting Docker credentials will fail on the first target even though verification passed, blocking approved deploys; either document/enforce node-side registry login or perform a remote login with a deploy-scoped token before invoking the script.
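
A sketch of the remote-login option, assuming a deploy-scoped token with package read access is available to the job (that token is an assumption, not something the PR defines). `remote_ghcr_login` is a hypothetical helper; it sends the token over ssh's stdin so it never appears in a command line or process listing.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: log a remote host in to ghcr.io.
# The token is piped through ssh's stdin to `docker login --password-stdin`.
remote_ghcr_login() {
  local target="$1" user="$2" token="$3"
  printf '%s\n' "$token" |
    ssh -o BatchMode=yes "$target" docker login ghcr.io -u "$user" --password-stdin
}
```

Note that `docker login` stores the credential on the node, so a short-lived token such as the job's `GITHUB_TOKEN` would only cover pulls made while that job is still running.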

ENABLE_S3: ${{ vars.ENABLE_S3 || 'false' }}
S3_CREDENTIALS_FILE: ${{ vars.S3_CREDENTIALS_FILE }}
SSH_STRICT_HOST_KEY_CHECKING: "yes"
run: |
set -euo pipefail
if [[ "$ENABLE_S3" == "true" && -z "$S3_CREDENTIALS_FILE" ]]; then
echo "::error::ENABLE_S3=true requires S3_CREDENTIALS_FILE in the production environment"
exit 1
Comment on lines +421 to +423

P2: Check remote S3 credential files before rollout

When ENABLE_S3=true but the configured S3_CREDENTIALS_FILE is absent or unreadable on one target, this only checks that the path string is non-empty. The script's actual readability check happens inside run_container after stop_container, so a live rollout can remove a healthy container and then abort before starting its replacement; use the SSH preflight to run test -r on each rollout target (or move the script check before stopping).

Comment on lines +421 to +423

P1: Preflight remote S3 credential files

When ENABLE_S3=true and S3_CREDENTIALS_FILE is set but the file is missing or unreadable on a node, this check passes and the workflow proceeds to scripts/rolling-update.sh. That script validates the file inside run_container only after it has already removed the existing container, so a stale path in GitHub vars can leave the first targeted node down before the rollout aborts. Since this workflow already renders ROLLING_SSH_TARGETS, add a preflight SSH test -r "$S3_CREDENTIALS_FILE" for each target before invoking the script.
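
The preflight could look roughly like the following. `preflight_remote_file` is a hypothetical helper, and the sketch assumes each target is already in a form `ssh` accepts (the existing reachability step prepends `SSH_USER` when a target has no `user@` part; that is left out here for brevity).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: check that a file is readable on every rollout target.
# targets_csv uses the "raftId=target" form rendered as ROLLING_SSH_TARGETS.
preflight_remote_file() {
  local targets_csv="$1" path="$2" e target failed=0
  local -a entries=()
  [[ -n "$targets_csv" ]] || return 0
  IFS=',' read -r -a entries <<< "$targets_csv"
  for e in "${entries[@]}"; do
    target="${e#*=}"
    # printf %q quotes the path so the remote shell sees it as one word.
    if ! ssh -o BatchMode=yes -o ConnectTimeout=10 "$target" \
        "test -r $(printf '%q' "$path")" </dev/null; then
      echo "::error::$path is missing or unreadable on $target" >&2
      failed=1
    fi
  done
  # Report every failing target before returning non-zero.
  return "$failed"
}
```

Running `preflight_remote_file "$ROLLING_SSH_TARGETS" "$S3_CREDENTIALS_FILE"` when `ENABLE_S3` is `true`, before the script is invoked, would surface a stale path while every container is still running.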

fi
./scripts/rolling-update.sh

P2: Set up Go before invoking the live rollout

When dry_run=false, this step invokes rolling-update.sh; I checked the script's live path and it builds ./cmd/raftadmin with go build before rolling nodes, while the dry-run path exits before that build. This workflow never runs actions/setup-go for the go.mod toolchain (go1.26.4), unlike the repo's test workflows, so an approved live deploy can fail on a fresh ubuntu-latest runner because the required Go toolchain is missing or must be fetched uncached even though the dry-run passed; install Go before this step or provide a prebuilt RAFTADMIN_BIN.
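
If installing Go in the workflow is not wanted, a cheap guard before the live step could at least fail early with a clear message. This is a sketch under assumptions: `go_mod_version` and `require_go` are hypothetical helpers, and the awk parsing only handles the simple single-line `go` and `toolchain` directives.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the Go version a go.mod asks for, preferring the
# "toolchain" directive (e.g. "toolchain go1.22.3") over the "go" directive.
go_mod_version() {
  awk '
    $1 == "toolchain" { tc = $2 }
    $1 == "go"        { gv = "go" $2 }
    END {
      if (tc != "") print tc
      else if (gv != "") print gv
      else exit 1
    }
  ' "$1"
}

# Hypothetical helper: fail before the rollout starts if no Go toolchain is present.
require_go() {
  local want
  want="$(go_mod_version "${1:-go.mod}")"
  if ! command -v go >/dev/null 2>&1; then
    echo "::error::go is not installed (go.mod wants $want)" >&2
    return 1
  fi
  echo "go.mod wants $want; runner has $(go env GOVERSION)"
}
```

Calling `require_go` in the dry-run path as well would make a dry-run pass a better predictor of the live run.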
