fix: CI test run stability improvements#1832
Open
TravisStark wants to merge 41 commits into
6b9492d to b8eda22
- Switch kubernetes/kubectl/helm providers from static token to exec-based auth (aws eks get-token). Static tokens expire after 15m, causing Unauthorized errors on long-running applies.
- Switch LoadBalancer services to NLB (1-2 min provision vs 15-20 min for Classic ELBs under concurrency pressure).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace docker-compose v1.27.4 with built-in docker compose v2
- Add --quiet pull to suppress layer-by-layer noise in logs
- Add ::group:: annotations for collapsible log sections per testcase
- Add timestamped phase markers (init, apply, destroy)
- Write step summary table with pass/fail and duration per testcase
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
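A minimal sketch of the per-testcase summary table, assuming results are available as (testcase, passed, seconds) tuples. The function name and the testcase name in the usage note are hypothetical and not taken from this PR.

```python
def render_summary(results):
    """Render a Markdown table from (testcase, passed, seconds) tuples."""
    lines = ["| Testcase | Result | Duration |", "| --- | --- | --- |"]
    for name, passed, seconds in results:
        status = "pass" if passed else "fail"
        minutes, secs = divmod(int(seconds), 60)
        lines.append(f"| {name} | {status} | {minutes}m{secs:02d}s |")
    return "\n".join(lines)
```

In a GitHub Actions step, the returned text could be appended to the file named by the GITHUB_STEP_SUMMARY environment variable, for example after calling render_summary([("otlp_metric", True, 125)]).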
The AL2 Docker package from amazon-linux-extras doesn't include the compose plugin. Install it as a CLI plugin from GitHub releases. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alexperez52 approved these changes on May 13, 2026
- Batch generator now groups tests by platform (EC2/0, EKS/1, etc.) so batches never mix platforms. Easier to identify platform-specific failures in the sidebar.
- RetryHelper now logs the exception message on each retry attempt.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
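A rough sketch of platform-grouped batching, assuming tests arrive as (platform, testcase) pairs and a fixed batch size. The actual batch generator is not shown on this page, so make_batches and its parameters are illustrative only.

```python
from collections import defaultdict


def make_batches(tests, batch_size):
    """Group (platform, testcase) pairs into batches that never mix platforms.

    Returns a dict keyed by "<platform>/<index>", e.g. "EC2/0".
    """
    by_platform = defaultdict(list)
    for platform, testcase in tests:
        by_platform[platform].append(testcase)

    batches = {}
    for platform in sorted(by_platform):
        cases = by_platform[platform]
        # Slice each platform's tests independently so a batch holds one platform only.
        for start in range(0, len(cases), batch_size):
            batches[f"{platform}/{start // batch_size}"] = cases[start:start + batch_size]
    return batches
```

Because each platform is sliced on its own, a failing batch name identifies the platform directly.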
…ment (all) ALBs and EFS mount targets reject multiple subnets in the same AZ. Keep original 3-subnet list for those consumers, use expanded 4-subnet list only for random instance placement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
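To illustrate the constraint, here is a small sketch that reduces a subnet list to at most one subnet per AZ. This is an alternative illustration, not the PR's approach: the commit keeps two separate lists rather than filtering. The subnet IDs in the usage note are made up.

```python
def one_subnet_per_az(subnets):
    """Keep the first subnet seen in each AZ from (subnet_id, az) pairs."""
    seen_azs = set()
    selected = []
    for subnet_id, az in subnets:
        if az not in seen_azs:
            seen_azs.add(az)
            selected.append(subnet_id)
    return selected
```

With a hypothetical 4-subnet list in which the fourth subnet shares an AZ with the first, the function returns the first three, which would be safe to hand to an ALB or to EFS mount targets.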
The validator runs in a Docker container without the AWS CLI, so exec-based auth in the kubeconfig doesn't work. Generate a fresh token at apply time via data.external and embed it. The validator only runs for ~2 min so the 15-min token TTL is sufficient. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Under high concurrency (110 batch jobs), X-Ray ingestion can take longer than the 100s window (5 retries × 20s). Doubling to 10 retries gives 200s which should cover Kafka pipeline latency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
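The window arithmetic can be sketched as follows, assuming the fixed 20s interval implied by "5 retries × 20s". The retry helper here is a hypothetical stand-in, not the repository's RetryHelper; it sleeps only between attempts, so the real wait is slightly under the nominal window.

```python
import time

RETRY_INTERVAL_SECONDS = 20  # assumed from "5 retries x 20s" in the commit message


def retry_window(max_retries, interval=RETRY_INTERVAL_SECONDS):
    """Nominal time budget in seconds for a fixed-interval retry loop."""
    return max_retries * interval


def retry(action, max_retries, interval=RETRY_INTERVAL_SECONDS,
          sleep=time.sleep, log=print):
    """Call action until it succeeds or max_retries attempts are used up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            log(f"attempt {attempt}/{max_retries} failed: {exc}")
            if attempt < max_retries:
                sleep(interval)
    raise last_error
```

Here retry_window(5) gives 100 and retry_window(10) gives 200, matching the figures in the commit message.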
Sidebar now shows EC2/amazonlinux2/0, EC2/windows2022/0, EKS/us-west-2|cluster-name/0, ECS/EC2/0, etc. Makes it trivial to identify AMI-specific or cluster-specific failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove yum update -y (causes RPMTransaction fatal errors under load)
- Add 3-retry loop on amazon-linux-extras install docker
- Use systemctl start docker (more reliable than service docker start)
- Reduce sleep 30 to sleep 10 (docker starts in <5s)
- Bump X-Ray trace validation retries from 10 to 15 (300s window) for Kafka pipeline tests under high concurrency
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Keeps system packages current while preventing RPM transaction failures from aborting the provisioner. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spark sample app emits totalApiBytesSent reliably but apiBytesSent only appears after a full EMF aggregation window. Under high concurrency this window often doesn't complete before validation retries exhaust. Replaced with totalApiBytesSent which is always present. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etries
- Remove /var/run/docker.sock and /var/lib/docker mounts from the container insights daemonset. These don't exist on AL2023/containerd 2.1 (EKS 1.33+) and cause pod scheduling warnings.
- Bump default MAX_RETRIES from 10 to 15 (300s window) to handle ECS task startup latency under high concurrency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MAX_RETRIES 15 -> 20 (400s default window for mocked server/CW validation)
- X-Ray trace inner retry 15 -> 20 (400s for Kafka pipeline ingestion)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rely on 20-retry bump (400s) to find it instead of removing validation coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase haproxy/nginx helm release timeout from 300s to 600s to fix containerinsight_eks_prometheus failures on busy clusters
- Remove protocol_version from kafka test configs; MSK clusters report "3.9.x", which is unparseable by the collector's kafka client, so let the client auto-negotiate instead
- Strip region and collector-ci- prefix from batch job display names for readability (EKS/amd64-1-33 instead of EKS/us-west-2|collector-ci-amd64-1-33)
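The display-name cleanup might look like the sketch below. It assumes names have the form "<platform>/<region>|<cluster>", as in the example above; short_job_name is a hypothetical helper, not code from this PR.

```python
def short_job_name(name, prefix="collector-ci-"):
    """Drop the "<region>|" part and the cluster prefix from a batch job name."""
    platform, slash, target = name.partition("/")
    if not slash:
        return name  # no platform segment, nothing to shorten
    _region, bar, cluster = target.partition("|")
    if bar:
        target = cluster
    if target.startswith(prefix):
        target = target[len(prefix):]
    return f"{platform}/{target}"
```

Names without a region or prefix, such as EC2/amazonlinux2/0, pass through unchanged.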
Docker 25.0.14-1.amzn2.0.5 breaks port forwarding on Amazon Linux 2, causing the validator to be unable to reach the sample app on port 8080 despite SSH (port 22) working to the same host. Pin to the last known working version.
Replace all 0.0.0.0/0 ingress rules in the shared security group with 10.0.0.0/16 (VPC-internal only). Add an ephemeral per-run security group in the EC2 module that discovers the GH Actions runner's public IP via checkip.amazonaws.com and opens only the required ports (22, 80, 8080, 5985) from that single IP for the duration of the test. This addresses Sev-2 ticket D452320721 where an external scanner exploited the open WinRM port. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ECS tests use public-facing ALBs that share this SG. HTTP ports are non-sensitive (no auth, ephemeral test data). Only SSH/WinRM/RDP remain restricted to VPC CIDR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
checkCacheHits was running sequentially taking ~10min. With 20-way parallelism it should complete in under 30s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
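A sketch of the 20-way fan-out using a thread pool, assuming each cache lookup is an independent, I/O-bound call. The real checkCacheHits implementation is not shown here; check_cache_hits and the is_cached callback are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def check_cache_hits(keys, is_cached, workers=20):
    """Run is_cached(key) for every key with up to `workers` calls in flight."""
    keys = list(keys)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        results = list(pool.map(is_cached, keys))
    return dict(zip(keys, results))
```

Threads suit this case because the work is dominated by waiting on network calls rather than CPU.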
Sends shell/PowerShell commands to instances via SSM RunCommand, waits for completion, and streams output. Auto-detects platform. Building block for VPC migration (replacing remote-exec provisioners). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All file provisioners now upload to S3 and pull via SSM. All remote-exec provisioners now use aotutil ssm run-command. No more direct SSH/WinRM connections from the runner to instances.
Converted resources:
- setup_mocked_server_cert_for_windows (file+remote-exec -> S3+SSM)
- setup_mocked_server_cert_for_linux (file+remote-exec -> S3+SSM)
- download_collector_from_local (file -> S3+SSM)
- download_collector_from_s3 (remote-exec -> SSM)
- collector_file_configuration (file -> S3+SSM)
- start_collector (remote-exec -> SSM)
- install_collector_from_ssm (remote-exec wait -> SSM)
- setup_sample_app_and_mock_server (file+remote-exec -> S3+SSM)
- install_cwagent (file+remote-exec -> S3+SSM)
- ssm_validation (remote-exec -> SSM)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Since all provisioning now uses SSM, the per-run SG only needs HTTP ports (80, 8080) for the validator to reach the sidecar. Removed login_user, connection_type locals and get_password_data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The per-run SG was hitting the account's SG quota (3000, mostly consumed by 2383 orphaned k8s-elb-* SGs). Since provisioning now uses SSM (no SSH/WinRM), and HTTP ports 80/8080 are already open on the shared SG for ALBs, the per-run SG is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… region fixes)
The SSM approach hit two issues in CI:
1. aotutil binary not found (relative path doesn't exist on runner)
2. S3 bucket aws-otel-collector-test is in us-east-1, not us-west-2
Reverting to per-run SG + remote-exec which now works after cleaning 2000+ orphaned k8s-elb security groups.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sioners The aws-otel-collector-test bucket is in us-east-1 but the default provider is us-west-2. Added provider alias 'aws.s3' pointing to us-east-1 for all aws_s3_object resources. Also re-enables the SSM provisioner migration (reverts the revert). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove public IPs from both AOC and sidecar instances
- Remove WinRM config from Windows user_data (no longer needed)
- Move validator to run on the sidecar via SSM (localhost endpoints)
- Upload validator source to S3, build on sidecar, run via docker compose
- Switch sample_app_listen_address_host to private_ip
Zero public ingress surface. All instances in private subnets with NAT gateway for outbound only.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes for SSM provisioner failures:
1. PATH: Prepend /usr/local/bin to PATH for Linux SSM commands (fixes 'aws: command not found' on Ubuntu/RedHat)
2. --stdin flag: Read commands from stdin (one per line) to avoid shell quoting issues when commands contain $(), &, or nested quotes
3. start_collector + ssm_validation: Use heredoc + --stdin to pass commands without local shell interpretation (fixes presigned URL & escaping and Windows PowerShell syntax errors)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove public IPs from both AOC and sidecar instances
- Remove WinRM config from Windows user_data
- Build validator image on runner (fast), upload to S3 as tarball
- Sidecar pulls and loads pre-built image (docker load, no build needed)
- Validator runs with network_mode:host using localhost endpoints
Zero public ingress surface. Runner build (~60s) replaces sidecar build (~10min).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Public subnets + no public IP = no internet. Instead, place instances in private subnets which route through the existing NAT gateway.
- basic_components: random_subnet uses private subnets now
- EC2: associate_public_ip_address = false
- ECS: EFS EC2 + launch template = false
- ALBs still use public subnets (correct for internet-facing LBs)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
729b344 to a43fa84
Re-enable public IPs and SSH from anywhere to match Jeffrey's working run on test/fix-ami. The private subnet + SSM-only approach needs more work (runners keep dying). For now, use the pattern that works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stock AL2 AMI had Docker daemon instability during validator builds. CWAgent AMIs are pre-configured with SSM agent and stable Docker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. SUSE 15: wait for RPM lock release before installing collector
2. ssm-install-aoc.sh: exponential backoff with jitter on API calls
3. NodeDiskIO.json: allow additional properties from containerd 2.x
4. Windows: remove -Verbose from Import-Certificate (eliminates 40K lines of cert PEM dump in logs, saves 3-5 min per test)
5. Windows: use Start-Process msiexec -Wait with exit code check (msiexec returns 0 even on failure in certain SSM contexts)
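For item 2, exponential backoff with jitter can be sketched as below. The commit does not say which jitter variant ssm-install-aoc.sh uses; this sketch assumes the "full jitter" form, and the base and cap values are placeholders.

```python
import random


def backoff_delays(attempts, base=1.0, cap=30.0, rng=random.random):
    """Full-jitter delays: attempt n waits a random time in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

Randomizing the wait spreads out retries from many concurrent jobs, so they do not all hit the API again at the same instant.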
…r-compose, quiet+readable logs
1. SUSE 15: wait for RPM lock release before installing collector
2. ssm-install-aoc.sh: exponential backoff with jitter on API calls
3. NodeDiskIO.json: allow additional properties from containerd 2.x
4. Windows: remove -Verbose from Import-Certificate
5. Windows: use Start-Process msiexec -Wait with exit code check
6. Docker-compose: download from S3 instead of GitHub
7. Mark cert output as sensitive (suppresses plan noise)
8. Quiet terraform: init silenced, destroy silenced, -compact-warnings
9. Clear test progress banners with test number and pass/fail status
1. Try S3 first, fall back to GitHub if not in S3 yet
2. Replace 'aws ecr get-login' (v1) with 'aws ecr get-login-password' (v2)
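The S3-first, GitHub-fallback order in item 1 could be expressed as a generic ordered fallback. This is a sketch only: the fetch callables are placeholders for whatever actually downloads the file, and fetch_with_fallback is a hypothetical name.

```python
def fetch_with_fallback(sources):
    """Try (label, fetch) pairs in order; return (label, result) from the first success."""
    errors = []
    for label, fetch in sources:
        try:
            return label, fetch()
        except Exception as exc:
            # Remember why this source failed, then move on to the next one.
            errors.append(f"{label}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))
```

Returning the label along with the result makes it easy to log which source actually served the download.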
docker load restores the image together with its tag, so docker compose should reference that tag rather than try to build from a nonexistent path.
Title: fix: CI stability - prevent instance leaks, reduce timeouts, add subnet capacity
Details
always destroy resources
state (root cause of 668-instance leak)
fix "Module not installed" failures
failure
accommodate Classic ELB tail latency
30m)
additional capacity