fix: CI test run stability improvements#1832

Open
TravisStark wants to merge 41 commits into release/v0.48.x from fix/vpc-tf-version

@TravisStark TravisStark commented May 13, 2026

Contributor

Title:

fix: CI stability — prevent instance leaks, reduce
timeouts, add subnet capacity

Details

  • Trap signals in executeTerraformTest.sh so cancelled runs
    always destroy resources
  • Bail on failed destroy instead of retrying on broken
    state (root cause of 668-instance leak)
  • Run terraform init before destroy in cleanup script to
    fix "Module not installed" failures
  • Reduce apply timeout 45m→30m, retries 2→1 for faster
    failure
  • Bump EKS kubernetes_service create timeout 20m→30m to
    accommodate Classic ELB tail latency
  • Parallelize EC2 wait-patch calls (15m worst case, was
    30m)
  • Wire in 4th public subnet (us-west-2a-1, 251 IPs) for
    additional capacity
  • Append cache misses to file for downstream visibility
  • Pin VPC module version

@TravisStark TravisStark requested a review from a team as a code owner May 13, 2026 13:29
@TravisStark TravisStark force-pushed the fix/vpc-tf-version branch from 6b9492d to b8eda22 Compare May 13, 2026 13:51
TravisStark and others added 4 commits May 13, 2026 11:48
- Switch kubernetes/kubectl/helm providers from static token to exec-based
  auth (aws eks get-token). Static tokens expire after 15m, causing
  Unauthorized errors on long-running applies.
- Switch LoadBalancer services to NLB (1-2 min provision vs 15-20 min
  for Classic ELBs under concurrency pressure).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace docker-compose v1.27.4 with built-in docker compose v2
- Add --quiet pull to suppress layer-by-layer noise in logs
- Add ::group:: annotations for collapsible log sections per testcase
- Add timestamped phase markers (init, apply, destroy)
- Write step summary table with pass/fail and duration per testcase

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The AL2 Docker package from amazon-linux-extras doesn't include the
compose plugin. Install it as a CLI plugin from GitHub releases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TravisStark and others added 22 commits May 13, 2026 13:45
- Batch generator now groups tests by platform (EC2/0, EKS/1, etc.)
  so batches never mix platforms. Easier to identify platform-specific
  failures in the sidebar.
- RetryHelper now logs the exception message on each retry attempt.
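
As a rough illustration of the grouping described above, the sketch below partitions tests by platform before cutting them into batches, so that a batch name such as EC2/0 never contains an EKS test. The function name, the input shape (pairs of platform and test name), and the `batch_size` parameter are assumptions; the real batch generator is not shown in this PR.

```python
from collections import defaultdict


def make_batches(tests, batch_size):
    """Group (platform, test_name) pairs into batches named '<platform>/<index>'."""
    by_platform = defaultdict(list)
    for platform, name in tests:
        by_platform[platform].append(name)

    batches = {}
    for platform, names in by_platform.items():
        # Slice each platform's tests independently so batches never mix platforms.
        for start in range(0, len(names), batch_size):
            index = start // batch_size
            batches[f"{platform}/{index}"] = names[start:start + batch_size]
    return batches
```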

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ment (all)

ALBs and EFS mount targets reject multiple subnets in the same AZ.
Keep original 3-subnet list for those consumers, use expanded 4-subnet
list only for random instance placement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The validator runs in a Docker container without the AWS CLI, so
exec-based auth in the kubeconfig doesn't work. Generate a fresh
token at apply time via data.external and embed it. The validator
only runs for ~2 min so the 15-min token TTL is sufficient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Under high concurrency (110 batch jobs), X-Ray ingestion can take
longer than the 100s window (5 retries × 20s). Doubling to 10
retries gives 200s which should cover Kafka pipeline latency.
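
The window arithmetic here is retries multiplied by the polling interval. The sketch below shows a fixed-interval poll under the assumption that the validator sleeps after every failed attempt, which is what makes 5 retries at 20s add up to 100s. `poll` and its parameters are hypothetical names, and the `sleep` argument is injectable only so the loop can be exercised without waiting.

```python
import time


def retry_window_seconds(retries, interval_s=20):
    """Total time spent waiting if every attempt fails."""
    return retries * interval_s


def poll(check, retries, interval_s=20, sleep=time.sleep):
    """Call check() up to `retries` times, sleeping after each failure.

    Returns the 1-based attempt number that succeeded, or None.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        sleep(interval_s)
    return None
```

With these assumptions, a trace that only becomes visible on the seventh check (after 120s of waiting) is missed by a 5-retry window and found by a 10-retry window.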

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sidebar now shows EC2/amazonlinux2/0, EC2/windows2022/0,
EKS/us-west-2|cluster-name/0, ECS/EC2/0, etc. Makes it trivial
to identify AMI-specific or cluster-specific failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove yum update -y (causes RPMTransaction fatal errors under load)
- Add 3-retry loop on amazon-linux-extras install docker
- Use systemctl start docker (more reliable than service docker start)
- Reduce sleep 30 to sleep 10 (docker starts in <5s)
- Bump X-Ray trace validation retries from 10 to 15 (300s window)
  for Kafka pipeline tests under high concurrency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keeps system packages current while preventing RPM transaction
failures from aborting the provisioner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spark sample app emits totalApiBytesSent reliably but
apiBytesSent only appears after a full EMF aggregation window.
Under high concurrency this window often doesn't complete before
validation retries exhaust. Replaced with totalApiBytesSent which
is always present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etries

- Remove /var/run/docker.sock and /var/lib/docker mounts from the
  container insights daemonset. These don't exist on AL2023/containerd
  2.1 (EKS 1.33+) and cause pod scheduling warnings.
- Bump default MAX_RETRIES from 10 to 15 (300s window) to handle
  ECS task startup latency under high concurrency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MAX_RETRIES 15 -> 20 (400s default window for mocked server/CW validation)
- X-Ray trace inner retry 15 -> 20 (400s for Kafka pipeline ingestion)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rely on 20-retry bump (400s) to find it instead of removing
validation coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase haproxy/nginx helm release timeout from 300s to 600s to fix
  containerinsight_eks_prometheus failures on busy clusters
- Remove protocol_version from kafka test configs; MSK clusters report
  "3.9.x" which is unparseable by the collector's kafka client — let
  the client auto-negotiate instead
- Strip region and collector-ci- prefix from batch job display names
  for readability (EKS/amd64-1-33 instead of EKS/us-west-2|collector-ci-amd64-1-33)
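
The display-name change in the last bullet amounts to a small string transformation. The sketch below is one way to write it, assuming names have the shape `<platform>/<region>|<cluster>`; the function name is hypothetical and the real implementation may differ.

```python
def display_name(batch_name, prefix="collector-ci-"):
    """Strip the '<region>|' part and the cluster prefix from a batch name."""
    platform, sep, target = batch_name.partition("/")
    if not sep:
        return batch_name  # no platform segment, leave untouched

    # Drop a leading "<region>|" if one is present.
    _, bar, cluster = target.partition("|")
    if bar:
        target = cluster

    if target.startswith(prefix):
        target = target[len(prefix):]
    return f"{platform}/{target}"
```

Names without a region or prefix, such as EC2/amazonlinux2/0, pass through unchanged.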
Docker 25.0.14-1.amzn2.0.5 breaks port forwarding on Amazon Linux 2,
causing the validator to be unable to reach the sample app on port 8080
despite SSH (port 22) working to the same host. Pin to the last known
working version.
Replace all 0.0.0.0/0 ingress rules in the shared security group with
10.0.0.0/16 (VPC-internal only). Add an ephemeral per-run security
group in the EC2 module that discovers the GH Actions runner's public
IP via checkip.amazonaws.com and opens only the required ports (22,
80, 8080, 5985) from that single IP for the duration of the test.

This addresses Sev-2 ticket D452320721 where an external scanner
exploited the open WinRM port.
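
To make the per-run rule set concrete, here is a minimal sketch of the idea: look up the runner's public IP and open each required port to that single address only. The actual change is implemented in the Terraform EC2 module; the function names and the rule dictionary shape below are illustrative assumptions, not the module's schema.

```python
import ipaddress
import urllib.request

REQUIRED_PORTS = (22, 80, 8080, 5985)


def discover_public_ip(url="https://checkip.amazonaws.com", timeout=5):
    """Fetch the caller's public IP as plain text (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def runner_ingress_rules(ip_text, ports=REQUIRED_PORTS):
    """Build one single-host ingress rule per port for the given IP."""
    ip = ipaddress.ip_address(ip_text.strip())  # raises ValueError if malformed
    suffix = 32 if ip.version == 4 else 128
    cidr = f"{ip}/{suffix}"
    return [
        {"protocol": "tcp", "from_port": port, "to_port": port, "cidr": cidr}
        for port in ports
    ]
```

Validating the response with `ipaddress` before building the CIDR guards against an unexpected body ending up in a security group rule.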

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ECS tests use public-facing ALBs that share this SG. HTTP ports are
non-sensitive (no auth, ephemeral test data). Only SSH/WinRM/RDP
remain restricted to VPC CIDR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
checkCacheHits was running sequentially, taking ~10 min. With 20-way
parallelism it should complete in under 30s.
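
A minimal sketch of the 20-way fan-out, assuming each cache lookup is an independent I/O-bound call. `check_cache_hits` and `is_cached` are hypothetical stand-ins; the real checkCacheHits implementation is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def check_cache_hits(keys, is_cached, max_workers=20):
    """Return the keys that miss the cache, checking up to max_workers at once."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        hits = list(pool.map(is_cached, keys))
    return [key for key, hit in zip(keys, hits) if not hit]
```

Threads suit this case because the work is dominated by waiting on remote calls rather than CPU.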

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sends shell/PowerShell commands to instances via SSM RunCommand,
waits for completion, and streams output. Auto-detects platform.
Building block for VPC migration (replacing remote-exec provisioners).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All file provisioners now upload to S3 and pull via SSM.
All remote-exec provisioners now use aotutil ssm run-command.
No more direct SSH/WinRM connections from the runner to instances.

Converted resources:
- setup_mocked_server_cert_for_windows (file+remote-exec -> S3+SSM)
- setup_mocked_server_cert_for_linux (file+remote-exec -> S3+SSM)
- download_collector_from_local (file -> S3+SSM)
- download_collector_from_s3 (remote-exec -> SSM)
- collector_file_configuration (file -> S3+SSM)
- start_collector (remote-exec -> SSM)
- install_collector_from_ssm (remote-exec wait -> SSM)
- setup_sample_app_and_mock_server (file+remote-exec -> S3+SSM)
- install_cwagent (file+remote-exec -> S3+SSM)
- ssm_validation (remote-exec -> SSM)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Since all provisioning now uses SSM, the per-run SG only needs HTTP
ports (80, 8080) for the validator to reach the sidecar. Removed
login_user, connection_type locals and get_password_data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The per-run SG was hitting the account's SG quota (3000, mostly
consumed by 2383 orphaned k8s-elb-* SGs). Since provisioning now
uses SSM (no SSH/WinRM), and HTTP ports 80/8080 are already open
on the shared SG for ALBs, the per-run SG is unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… region fixes)

The SSM approach hit two issues in CI:
1. aotutil binary not found (relative path doesn't exist on runner)
2. S3 bucket aws-otel-collector-test is in us-east-1, not us-west-2

Reverting to per-run SG + remote-exec which now works after cleaning
2000+ orphaned k8s-elb security groups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sioners

The aws-otel-collector-test bucket is in us-east-1 but the default
provider is us-west-2. Added provider alias 'aws.s3' pointing to
us-east-1 for all aws_s3_object resources.

Also re-enables the SSM provisioner migration (reverts the revert).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TravisStark and others added 4 commits May 14, 2026 14:36
- Remove public IPs from both AOC and sidecar instances
- Remove WinRM config from Windows user_data (no longer needed)
- Move validator to run on the sidecar via SSM (localhost endpoints)
- Upload validator source to S3, build on sidecar, run via docker compose
- Switch sample_app_listen_address_host to private_ip

Zero public ingress surface. All instances in private subnets with
NAT gateway for outbound only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes for SSM provisioner failures:

1. PATH: Prepend /usr/local/bin to PATH for Linux SSM commands
   (fixes 'aws: command not found' on Ubuntu/RedHat)

2. --stdin flag: Read commands from stdin (one per line) to avoid
   shell quoting issues when commands contain $(), &, or nested quotes

3. start_collector + ssm_validation: Use heredoc + --stdin to pass
   commands without local shell interpretation (fixes presigned URL
   & escaping and Windows PowerShell syntax errors)
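
The point of the --stdin approach is that each command travels as data, one per line, and is never re-parsed by a local shell. The sketch below illustrates that by turning a stream of lines into the `commands` parameter that the AWS-RunShellScript document accepts. The function name, the `linux` flag, and the example URL are assumptions for illustration and are not aotutil's actual code.

```python
import io


def build_run_command_parameters(stream, linux=True, path_prefix="/usr/local/bin"):
    """Read one command per line and return SSM RunCommand parameters."""
    commands = [line.rstrip("\r\n") for line in stream if line.strip()]
    if linux and path_prefix:
        # Make tools installed under the prefix (such as the AWS CLI) resolvable.
        commands.insert(0, f'export PATH="{path_prefix}:$PATH"')
    return {"commands": commands}


# Characters such as & and $() arrive untouched because no shell parses them here.
demo = io.StringIO(
    'curl -o app.zip "https://example.com/app.zip?X-Amz-Expires=900&X-Amz-Signature=abc"\n'
    "echo $(hostname)\n"
)
params = build_run_command_parameters(demo)
```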

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove public IPs from both AOC and sidecar instances
- Remove WinRM config from Windows user_data
- Build validator image on runner (fast), upload to S3 as tarball
- Sidecar pulls and loads pre-built image (docker load, no build needed)
- Validator runs with network_mode:host using localhost endpoints

Zero public ingress surface. Runner build (~60s) replaces sidecar
build (~10min).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Public subnets + no public IP = no internet. Instead, place instances
in private subnets which route through the existing NAT gateway.

- basic_components: random_subnet uses private subnets now
- EC2: associate_public_ip_address = false
- ECS: associate_public_ip_address = false on the EFS EC2 instance and the launch template
- ALBs still use public subnets (correct for internet-facing LBs)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TravisStark TravisStark force-pushed the fix/vpc-tf-version branch from 729b344 to a43fa84 Compare May 14, 2026 23:52
jefchien and others added 10 commits May 15, 2026 06:41
Re-enable public IPs and SSH from anywhere to match Jeffrey's working
run on test/fix-ami. The private subnet + SSM-only approach needs
more work (runners keep dying). For now, use the pattern that works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stock AL2 AMI had Docker daemon instability during validator builds.
CWAgent AMIs are pre-configured with SSM agent and stable Docker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. SUSE 15: wait for RPM lock release before installing collector
2. ssm-install-aoc.sh: exponential backoff with jitter on API calls
3. NodeDiskIO.json: allow additional properties from containerd 2.x
4. Windows: remove -Verbose from Import-Certificate (eliminates 40K
   lines of cert PEM dump in logs, saves 3-5 min per test)
5. Windows: use Start-Process msiexec -Wait with exit code check
   (msiexec returns 0 even on failure in certain SSM contexts)
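
For item 2, the commit does not say which jitter strategy ssm-install-aoc.sh uses. The sketch below shows one common variant, "full jitter", where each delay is drawn uniformly between zero and an exponentially growing, capped ceiling. The base, cap, and function name are assumed values for illustration.

```python
import random


def backoff_delays(attempts, base_s=1.0, cap_s=30.0, rng=random.random):
    """Full-jitter delays: uniform in [0, min(cap_s, base_s * 2**n)) for attempt n."""
    return [rng() * min(cap_s, base_s * 2 ** n) for n in range(attempts)]
```

The randomization spreads retries from many concurrent jobs over time, so throttled API calls are less likely to collide again on the next attempt.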
…r-compose, quiet+readable logs

1. SUSE 15: wait for RPM lock release before installing collector
2. ssm-install-aoc.sh: exponential backoff with jitter on API calls
3. NodeDiskIO.json: allow additional properties from containerd 2.x
4. Windows: remove -Verbose from Import-Certificate
5. Windows: use Start-Process msiexec -Wait with exit code check
6. Docker-compose: download from S3 instead of GitHub
7. Mark cert output as sensitive (suppresses plan noise)
8. Quiet terraform: init silenced, destroy silenced, -compact-warnings
9. Clear test progress banners with test number and pass/fail status
1. Try S3 first, fall back to GitHub if not in S3 yet
2. Replace 'aws ecr get-login' (v1) with 'aws ecr get-login-password' (v2)
docker load puts the image in with the tag, docker compose should
reference that tag not try to build from a nonexistent path.
3 participants