CORS-4507: aws: support edge machine pool management with ClusterAPI#10625

Open
tthvo wants to merge 5 commits into openshift:main from tthvo:CORS-4507

Conversation

@tthvo

@tthvo tthvo commented Jun 15, 2026

Member

Description

This PR adds support for generating CAPI MachineSets for the edge machine pool on AWS. This work depends on node taint support in CAPI v1.12+ (see OCPCLOUD-2899).

For example, the following install-config snippet installs a cluster in us-east-1 with a ClusterAPI-managed edge pool:

compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
  management: ClusterAPI
- architecture: amd64
  hyperthreading: Enabled
  management: ClusterAPI
  name: edge
  platform:
    aws:
      zones:
      - us-east-1-atl-2a
      - us-east-1-bos-1a
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
featureSet: DevPreviewNoUpgrade

Note: The installer already vendors CAPI v1.12.8 so we can generate the manifests without waiting on CCAPIO.

Summary by CodeRabbit

  • New Features

    • Cluster API can now manage edge compute machine pools, applying per-availability-zone labels/taints and using zone-specific instance type preferences when available.
    • Edge compute machine pools now receive the appropriate default management setting when it isn’t explicitly set.
  • Bug Fixes

    • Install configuration validation no longer rejects edge compute pools managed by Cluster API.
    • Feature-gate checks now correctly evaluate all compute entries.
  • Tests

    • Added/expanded unit tests for edge compute defaulting and comprehensive AWS Cluster API MachineSet/Template scenarios, including edge-specific behavior.
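
For reviewers, the "evaluate all compute entries" fix can be pictured with a minimal sketch. The `MachinePool` and `ManagementType` types below are simplified stand-ins, and `anyPoolUsesClusterAPI` is a hypothetical helper name; none of this is the installer's actual code.

```go
package main

import "fmt"

// ManagementType and MachinePool are simplified stand-ins for the installer's
// types; only the fields needed for this sketch are modeled.
type ManagementType string

const ClusterAPI ManagementType = "ClusterAPI"

type MachinePool struct {
	Name       string
	Management ManagementType
}

// anyPoolUsesClusterAPI (hypothetical helper) reports whether any compute
// pool, not only the first one, requests ClusterAPI management.
func anyPoolUsesClusterAPI(compute []MachinePool) bool {
	for _, pool := range compute {
		if pool.Management == ClusterAPI {
			return true
		}
	}
	return false
}

func main() {
	// The worker pool is first and has no management set; only the edge pool
	// asks for ClusterAPI. A check limited to compute[0] would miss it.
	compute := []MachinePool{
		{Name: "worker"},
		{Name: "edge", Management: ClusterAPI},
	}
	fmt.Println(anyPoolUsesClusterAPI(compute))
}
```

With a check limited to the first entry, an install-config that lists `worker` before `edge` would skip the gate whenever only the edge pool sets `management: ClusterAPI`.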

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 15, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 15, 2026

Contributor

@tthvo: This pull request references CORS-4507 which is a valid jira issue.


@coderabbitai

coderabbitai Bot commented Jun 15, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a2f814da-82f5-46ea-b07c-9738e3bfd118

📥 Commits

Reviewing files that changed from the base of the PR and between 247cefd and 8c75dbf.

📒 Files selected for processing (8)
  • pkg/asset/machines/aws/clusterapi_machinesets.go
  • pkg/asset/machines/aws/clusterapi_machinesets_test.go
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go
  • pkg/types/validation/featuregate_test.go
  • pkg/types/validation/featuregates.go
  • pkg/types/validation/installconfig.go
  • pkg/types/validation/installconfig_test.go
💤 Files with no reviewable changes (2)
  • pkg/types/validation/installconfig_test.go
  • pkg/types/validation/installconfig.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go
  • pkg/asset/machines/aws/clusterapi_machinesets.go
  • pkg/asset/machines/aws/clusterapi_machinesets_test.go

📝 Walkthrough

Enables ClusterAPI management for edge compute machine pools. SetMachinePoolDefaults now assigns ClusterAPI to the edge role when the feature gate is active. The validation blocking edge+ClusterAPI is removed. ClusterAPIMachineSets adds edge-specific per-AZ instance type selection, zone labels, and a NoSchedule taint wired into MachineSpec.Taints. A comprehensive test suite validates the end-to-end behavior across multiple scenarios.
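
The defaulting rule described in the walkthrough can be sketched roughly as follows. This is an illustration under stated assumptions, not the body of `SetMachinePoolDefaults`: the types are simplified stand-ins, `setEdgeManagementDefault` is a hypothetical name, the feature gate is reduced to a boolean, and `"MachineAPI"` is used only as a placeholder for an explicitly set value.

```go
package main

import "fmt"

// Simplified stand-ins for the installer's types (assumptions for this sketch).
type ManagementType string

const ClusterAPI ManagementType = "ClusterAPI"

const edgeRoleName = "edge"

type MachinePool struct {
	Name       string
	Management ManagementType
}

// setEdgeManagementDefault (hypothetical) assigns ClusterAPI to an edge pool
// only when the feature gate is enabled and Management was left unset.
func setEdgeManagementDefault(pool *MachinePool, gateEnabled bool) {
	if pool == nil || pool.Name != edgeRoleName {
		return
	}
	if gateEnabled && pool.Management == "" {
		pool.Management = ClusterAPI
	}
}

func main() {
	unset := &MachinePool{Name: edgeRoleName}
	setEdgeManagementDefault(unset, true)
	fmt.Printf("gate on, unset: %q\n", unset.Management)

	// "MachineAPI" is a placeholder for any value the user set explicitly.
	preset := &MachinePool{Name: edgeRoleName, Management: "MachineAPI"}
	setEdgeManagementDefault(preset, true)
	fmt.Printf("gate on, preset: %q\n", preset.Management)

	gateOff := &MachinePool{Name: edgeRoleName}
	setEdgeManagementDefault(gateOff, false)
	fmt.Printf("gate off, unset: %q\n", gateOff.Management)
}
```

The three cases mirror the ones the summary lists for the tests: assign under `DevPreviewNoUpgrade`, preserve a pre-set value, and leave the field empty under `Default`.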

Changes

Edge Compute ClusterAPI Support

| Layer | File(s) | Summary |
| --- | --- | --- |
| Feature gate infrastructure for edge ClusterAPI | pkg/types/validation/featuregates.go, pkg/types/validation/featuregate_test.go | The feature gate condition now scans all compute entries for Management == types.ClusterAPI rather than only the first entry, enabling gating across all compute pools. Test coverage added for edge compute role asserting that DevPreviewNoUpgrade allows edge+ClusterAPI while Default forbids it. |
| Default ClusterAPI assignment for edge machine pools | pkg/types/defaults/machinepools.go, pkg/types/defaults/machinepools_test.go | SetMachinePoolDefaults sets Management = ClusterAPI for the edge compute role when the feature gate is enabled and Management is unset. Test coverage validates behavior under both DevPreviewNoUpgrade (assigns ClusterAPI, preserves pre-set values) and Default (leaves empty). |
| Validation removal for edge pool ClusterAPI assignment | pkg/types/validation/installconfig.go, pkg/types/validation/installconfig_test.go | The validateCompute check that rejected edge compute pools with management set to ClusterAPI is removed. The corresponding test case "edge compute with cluster api" is deleted, unblocking valid edge+ClusterAPI configurations. |
| ClusterAPI MachineSet generation for edge pools | pkg/asset/machines/aws/clusterapi_machinesets.go | Imports pkg/types for edge role detection. Initializes a per-AZ nodeTaints slice, adds an edge pool branch that selects zone-specific PreferredInstanceType, applies edge/zone/parent-zone/type labels, appends a NoSchedule taint with always propagation, and wires taints into MachineSpec.Taints. |
| Comprehensive test suite for ClusterAPI MachineSets | pkg/asset/machines/aws/clusterapi_machinesets_test.go | New 533-line test file validates error cases, replica distribution across zones, BYO VPC and subnet handling, public subnet filtering, IMDS defaults, IAM instance profiles, user tag propagation, security group management, and edge-pool-specific instance type selection, labels, and tainting behavior. |
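
As a rough picture of the edge pool branch summarized above, the sketch below computes the per-zone instance type, node labels, and taint. It is not the code in `clusterapi_machinesets.go`: `Zone` and `MachineTaint` are simplified stand-ins, `edgeSettings` is a hypothetical helper, the `Type`, `GroupName`, and `ParentZoneName` fields and the zone values in `main` are illustrative assumptions, and the string `"Always"` stands in for `capi.MachineTaintPropagationAlways`. The label and taint keys are the ones named in the review comments on this PR.

```go
package main

import "fmt"

// Zone and MachineTaint are simplified, hypothetical stand-ins; the real
// installer and CAPI types carry more fields.
type Zone struct {
	Name                  string
	Type                  string
	GroupName             string
	ParentZoneName        string
	PreferredInstanceType string
}

type MachineTaint struct {
	Key         string
	Effect      string
	Propagation string
}

const edgeRoleKey = "node-role.kubernetes.io/edge"

// edgeSettings (hypothetical) returns the instance type, node labels, and node
// taints for one edge zone. It falls back to baseType when the zone has no
// preferred instance type.
func edgeSettings(zone Zone, baseType string) (string, map[string]string, []MachineTaint) {
	instanceType := baseType
	if zone.PreferredInstanceType != "" {
		instanceType = zone.PreferredInstanceType
	}
	labels := map[string]string{
		edgeRoleKey:                             "",
		"machine.openshift.io/zone-type":        zone.Type,
		"machine.openshift.io/zone-group":       zone.GroupName,
		"machine.openshift.io/parent-zone-name": zone.ParentZoneName,
	}
	// "Always" stands in for capi.MachineTaintPropagationAlways.
	taints := []MachineTaint{{Key: edgeRoleKey, Effect: "NoSchedule", Propagation: "Always"}}
	return instanceType, labels, taints
}

func main() {
	// Zone values are illustrative only.
	zone := Zone{
		Name:                  "us-east-1-atl-2a",
		Type:                  "local-zone",
		GroupName:             "us-east-1-atl-2",
		ParentZoneName:        "us-east-1a",
		PreferredInstanceType: "m6i.xlarge",
	}
	instanceType, labels, taints := edgeSettings(zone, "m5.xlarge")
	fmt.Println(instanceType, len(labels), taints[0].Effect, taints[0].Propagation)
}
```

The `NoSchedule` taint keeps ordinary workloads off edge nodes unless they tolerate it, which is why the branch depends on the taint propagation support mentioned in the PR description.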

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ❓ Inconclusive | The custom check specifies "Ginkgo test code" review, but this PR contains only standard Go unit tests using testing.T, not Ginkgo BDD tests. No Ginkgo/Gomega imports exist in the codebase. | Clarify whether the check applies to standard Go tests. If it does, the tests pass quality criteria: single responsibility per test case, meaningful assertion messages with context (e.g., zone index, expected/actual values), no cleanup n... |
✅ Passed checks (13 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely describes the main change: adding ClusterAPI support for edge machine pool management in AWS. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test names in the modified/added test files are stable and deterministic. No dynamic values, generated identifiers, timestamps, UUIDs, or variable interpolation found in any test title. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. All test changes use standard Go unit testing (func Test* with testing.T), not Ginkgo patterns (It/Describe/Context/When), so the check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests added in this PR. All new/modified tests are standard Go unit tests (testing.T) in the installer codebase, not e2e tests requiring SNO compatibility checks. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR exclusively modifies installer infrastructure provisioning code (MachineSet generation, validation, defaults) without introducing any pod scheduling constraints, pod affinity, topology spread, o... |
| Ote Binary Stdout Contract | ✅ Passed | PR contains no OTE binaries or suite-level code. All files are standard Go packages or unit tests with no process-level stdout writes (fmt.Print*, log.Print*, klog, or suite setup that could emit t... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests (It(), Describe(), etc.) are added. The new test file uses standard Go testing package with unit tests only. No IPv4 assumptions or external connectivity detected. |
| No-Weak-Crypto | ✅ Passed | No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in modified code. |
| Container-Privileges | ✅ Passed | PR only modifies Go source code; no Kubernetes manifests or container security configurations were added or modified, making this check inapplicable. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No logging statements expose passwords, tokens, API keys, credentials, PII, session IDs, hostnames, or customer data. Code only logs non-sensitive infrastructure identifiers. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from mtulio and pawanpinjarkar June 15, 2026 19:22
@tthvo

tthvo commented Jun 15, 2026

Member Author

/label platform/aws
/hold
/cc @patrickdillon

@openshift-ci openshift-ci Bot requested a review from patrickdillon June 15, 2026 19:25
@openshift-ci openshift-ci Bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. platform/aws labels Jun 15, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/asset/machines/aws/clusterapi_machinesets.go`:
- Around line 85-90: The code accesses the map in.Zones using the key az without
checking if the key exists. In Go, accessing a non-existent map key silently
returns a zero-value, causing the code to proceed with an empty zone object and
skipping the preferred instance type selection. Fix this by using the two-value
form of map access (zone, ok := in.Zones[az]) in the edge pool handling section
where in.Pool.Name equals types.MachinePoolEdgeRoleName, and add a check to fail
fast when the zone key is missing (when ok is false) rather than continuing with
zero-value defaults.
- Around line 85-105: Add unit tests for the edge pool logic in
ClusterAPIMachineSets. Create a new test file clusterapi_machinesets_test.go
with test cases that exercise the code path when in.Pool.Name equals
types.MachinePoolEdgeRoleName. The tests must verify that when
zone.PreferredInstanceType is set, it is correctly assigned to the instanceType
variable, that all four edge node labels (node-role.kubernetes.io/edge,
machine.openshift.io/zone-type, machine.openshift.io/zone-group,
machine.openshift.io/parent-zone-name) are properly set on nodeLabels, and that
a MachineTaint with Key node-role.kubernetes.io/edge and Effect NoSchedule is
appended to nodeTaints with Propagation set to
capi.MachineTaintPropagationAlways. Use fixture data that includes zone objects
with PreferredInstanceType populated to ensure the complete edge pool branch is
exercised.

In `@pkg/types/defaults/machinepools_test.go`:
- Around line 209-215: The test case at line 211 is named "edge compute with
management already set" but uses types.MachinePoolComputeRoleName when creating
the MachinePool, which duplicates the previous compute case and fails to verify
that explicit edge Management is preserved. Change the Name field in the pool
creation from types.MachinePoolComputeRoleName to types.MachinePoolEdgeRoleName
so the test properly validates the edge role case with management preservation.
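
The first inline comment above asks for the two-value ("comma ok") form of map access so that a missing zone fails fast. A minimal sketch of that idiom follows, assuming a simplified `Zone` type and a hypothetical `lookupZone` helper; the real `in.Zones` map may hold a different value type.

```go
package main

import "fmt"

// Zone is a simplified stand-in with only the field this sketch needs.
type Zone struct {
	PreferredInstanceType string
}

// lookupZone (hypothetical) returns an error for an unknown availability zone
// instead of silently handing back a zero-value Zone.
func lookupZone(zones map[string]Zone, az string) (Zone, error) {
	zone, ok := zones[az]
	if !ok {
		return Zone{}, fmt.Errorf("zone %q not found in zone metadata", az)
	}
	return zone, nil
}

func main() {
	zones := map[string]Zone{
		"us-east-1-atl-2a": {PreferredInstanceType: "m6i.xlarge"},
	}

	// A plain index expression yields the zero value for a missing key, so
	// the preferred instance type would be silently skipped.
	fmt.Printf("plain index: %q\n", zones["us-east-1-bos-1a"].PreferredInstanceType)

	// The two-value form surfaces the problem as an error.
	if _, err := lookupZone(zones, "us-east-1-bos-1a"); err != nil {
		fmt.Println("error:", err)
	}
}
```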

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 935d0e1b-5fcd-4768-b659-0d13b36cb9ff

📥 Commits

Reviewing files that changed from the base of the PR and between e114840 and 1cfc304.

📒 Files selected for processing (5)
  • pkg/asset/machines/aws/clusterapi_machinesets.go
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go
  • pkg/types/validation/installconfig.go
  • pkg/types/validation/installconfig_test.go
💤 Files with no reviewable changes (2)
  • pkg/types/validation/installconfig.go
  • pkg/types/validation/installconfig_test.go

Comment thread pkg/asset/machines/aws/clusterapi_machinesets.go
Comment thread pkg/asset/machines/aws/clusterapi_machinesets.go
Comment thread pkg/types/defaults/machinepools_test.go

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/asset/machines/aws/clusterapi_machinesets_test.go`:
- Around line 262-275: The test case for edge pool instance type override sets
the zone's PreferredInstanceType to "m5.xlarge", which is the same as the base
AWS instance type, so the test cannot actually validate that the override logic
is working. Change the PreferredInstanceType value in the zone to a different
instance type (e.g., "m6i.xlarge") to ensure the assertion proves the override
behavior is correct. Apply this same fix wherever else this pattern appears in
the test file (as noted in the "Also applies to" comment).
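
To see why the fixture needs a preferred type that differs from the base type, consider the small sketch below. `ignoresPreferred` is a deliberately broken, hypothetical override function that always returns the base type; the instance type names are the ones mentioned in the comment above.

```go
package main

import "fmt"

// pickInstanceType is the intended behavior: prefer the zone's type when set.
func pickInstanceType(base, preferred string) string {
	if preferred != "" {
		return preferred
	}
	return base
}

// ignoresPreferred is a deliberately broken variant that never applies the
// zone's preferred type. A useful test must be able to tell the two apart.
func ignoresPreferred(base, preferred string) string {
	return base
}

// detectsBrokenOverride reports whether asserting "result == preferred" would
// fail for the broken variant with the given fixture values.
func detectsBrokenOverride(base, preferred string) bool {
	return ignoresPreferred(base, preferred) != preferred
}

func main() {
	const base = "m5.xlarge"
	for _, preferred := range []string{"m5.xlarge", "m6i.xlarge"} {
		fmt.Printf("preferred=%s correct=%s detects broken override: %v\n",
			preferred, pickInstanceType(base, preferred), detectsBrokenOverride(base, preferred))
	}
}
```

With identical base and preferred values the broken variant passes the assertion, so only the distinct value exercises the override.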

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0b7e1456-8173-4123-b1f7-bdd3a29c77da

📥 Commits

Reviewing files that changed from the base of the PR and between 1cfc304 and 247cefd.

📒 Files selected for processing (6)
  • pkg/asset/machines/aws/clusterapi_machinesets.go
  • pkg/asset/machines/aws/clusterapi_machinesets_test.go
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go
  • pkg/types/validation/installconfig.go
  • pkg/types/validation/installconfig_test.go
💤 Files with no reviewable changes (2)
  • pkg/types/validation/installconfig.go
  • pkg/types/validation/installconfig_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • pkg/types/defaults/machinepools.go
  • pkg/types/defaults/machinepools_test.go
  • pkg/asset/machines/aws/clusterapi_machinesets.go

Comment thread pkg/asset/machines/aws/clusterapi_machinesets_test.go
@tthvo

tthvo commented Jun 15, 2026

Member Author

/test e2e-aws-ovn-devpreview

tthvo added 5 commits June 16, 2026 08:36
Note: we use the same feature gate ClusterAPIComputeInstall as worker
compute pool.
Defaults the edge machine pool management to CAPI when the appropriate
feature gate is enabled.
Edge compute pools require MachineTaintPropagation, which is now
available after we bump CAPI to v1.12.
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tthvo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tthvo

tthvo commented Jun 16, 2026

Member Author

/test e2e-aws-ovn-devpreview

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Contributor

@tthvo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-heterogeneous | 8c75dbf | link | false | /test e2e-aws-ovn-heterogeneous |
| ci/prow/e2e-aws-ovn-fips | 8c75dbf | link | false | /test e2e-aws-ovn-fips |
| ci/prow/e2e-aws-ovn-imdsv2 | 8c75dbf | link | false | /test e2e-aws-ovn-imdsv2 |
| ci/prow/e2e-aws-byo-subnet-role-security-groups | 8c75dbf | link | false | /test e2e-aws-byo-subnet-role-security-groups |
| ci/prow/e2e-aws-ovn-devpreview | 8c75dbf | link | false | /test e2e-aws-ovn-devpreview |
| ci/prow/e2e-aws-ovn-shared-vpc-custom-security-groups | 8c75dbf | link | false | /test e2e-aws-ovn-shared-vpc-custom-security-groups |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@tthvo

tthvo commented Jun 25, 2026

Member Author

/hold cancel

CAPI 1.12 is already running in-cluster.

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 25, 2026
@tthvo

tthvo commented Jun 25, 2026

Member Author

/test e2e-aws-ovn-shared-vpc-edge-zones

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. platform/aws
