@smodak-rh
Contributor

This change fixes a fundamental accuracy issue in memory percentile computation
(p95 / p90 / median) caused by per-batch aggregation in shell.

Previously, percentiles were calculated independently for each batch of pods
and then merged by taking the maximum across batches. This approach is
mathematically incorrect and could significantly overestimate memory pressure,
especially when batch sizes or pod lifetimes varied.

Key changes:

  • Move percentile computation fully into PromQL using quantile_over_time()
  • Restrict batching to transport concerns only (pod regex size), not statistics
  • Preserve exact max memory calculation using max_over_time()
  • Switch percentile metric to container_memory_working_set_bytes for more
    representative memory pressure
  • Floor PromQL floating-point samples in jq to ensure bash-safe arithmetic
  • Add a debug guard when no pods are found for a task/cluster
  • Make the script compatible with older bash versions (no mapfile usage)

Results:

  • Statistically correct percentiles across all pods and time windows
  • Stable and reproducible results independent of batch size
  • No bash arithmetic errors from floating-point Prometheus samples
  • Cleaner separation between data collection and aggregation logic
  • Results before patching:
$ ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","max_mem_mb","pod","namespace","component","application","p95_mb","p90_mb","median_mb"
"stone-prd-rh01","buildah","step-build","8192","maestro-on-pull-request-xj7dc-build-container-pod","maestro-rhtap-tenant","N/A","N/A","32","32","32"
"stone-prd-rh01","buildah","step-push","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","5","5","5"
"stone-prd-rh01","buildah","step-sbom-syft-generate","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","6","6","6"
"stone-prd-rh01","buildah","step-prepare-sboms","189","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","5","5","5"
"stone-prd-rh01","buildah","step-upload-sbom","46","crc-caddy-plugin-on-pull-request-ss22g-build-container-pod","hcm-eng-prod-tenant","N/A","N/A","5","5","5"
  • Results after patching:
$ ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","max_mem_mb","pod","namespace","component","application","p95_mb","p90_mb","median_mb"
"stone-prd-rh01","buildah","step-build","8192","maestro-on-pull-request-xj7dc-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8191","8190","8183"
"stone-prd-rh01","buildah","step-push","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","471","445","59"
"stone-prd-rh01","buildah","step-sbom-syft-generate","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","1264","726","49"
"stone-prd-rh01","buildah","step-prepare-sboms","189","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","154","154","33"
"stone-prd-rh01","buildah","step-upload-sbom","46","crc-caddy-plugin-on-pull-request-ss22g-build-container-pod","hcm-eng-prod-tenant","N/A","N/A","30","23","7"

The max memory values match in both cases, but the data above shows that the accuracy of the p95, p90, and median values has improved substantially.
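Why merging per-batch percentiles with max() is wrong can be shown with a small, self-contained example (illustrative only; the real data comes from Prometheus, and the batch contents here are made up):

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

# Two hypothetical batches of per-pod memory samples (MB).
batch_a = list(range(1, 101))   # pods with varied usage, 1..100 MB
batch_b = [1] * 100             # many short-lived low-usage pods

# Old approach: p95 per batch, then max across batches.
per_batch_max = max(p95(batch_a), p95(batch_b))   # 95

# Correct approach: one p95 over the full sample set.
global_p95 = p95(batch_a + batch_b)               # 90
```

Here the batched approach reports 95 MB where the true p95 over all pods is 90 MB; with more skewed batch compositions the overestimate grows, which is why the fix computes the quantile once, in PromQL, over all pods.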

This commit adds comprehensive CPU usage metrics collection to complement
the existing memory metrics functionality. The implementation includes CPU
max, P95, P90, and median calculations with proper pod attribution.

Changes:
--------

1. CPU Metrics Collection:
   - Added CPU max, P95, P90, and median calculations using
     container_cpu_usage_seconds_total with rate() aggregation
   - CPU values are converted to millicores (m) format for readability
   - Separate pod attribution for max CPU usage (pod_max_cpu, pod_namespace_cpu)

2. Query Optimizations:
   - Implemented intelligent batching (50 pods per batch) to handle unlimited
     pod counts without hitting Prometheus URL/query limits
   - Added adaptive step sizing in query_prometheus_range.py:
     * ≤1 day: 30s step (fine-grained)
     * ≤7 days: 5m step (optimized for Prometheus limits)
     * ≤30 days: 15m step
     * >30 days: 1h step
   - This enables reliable querying for 7+ day time ranges

3. Task-Scoped Query Validation:
   - Added pod validation to ensure all metrics come only from pods
     belonging to the specified task (filtered by label_tekton_dev_task)
   - Prevents accidental inclusion of pods from other tasks with same step names
   - Uses bash 3.2-compatible pod membership checking

4. Output Format Updates:
   - Updated CSV header to include CPU columns:
     pod_max_cpu, pod_namespace_cpu, cpu_max, cpu_p95, cpu_p90, cpu_median
   - CPU values displayed in millicores format (e.g., "3569m")
   - Maintains backward compatibility with existing memory metrics

5. Code Improvements:
   - Fixed bash 3.2 compatibility issues (removed associative arrays)
   - Improved error handling for empty query results
   - Added debug logging for query execution and batch processing
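The adaptive step thresholds above can be sketched as a small helper (the function name is illustrative; the actual code in query_prometheus_range.py may differ):

```python
def adaptive_step(days: float) -> str:
    """Pick a Prometheus range-query step for a window of `days` days.

    Prometheus rejects range queries that would return more than 11,000
    points per series; 7 days at a 30s step is ~20,160 points, while 5m
    keeps it at ~2,016, so the step widens with the window.
    """
    if days <= 1:
        return "30s"   # fine-grained
    if days <= 7:
        return "5m"    # stays under the per-series point limit
    if days <= 30:
        return "15m"
    return "1h"
```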

Technical Details:
-----------------

- CPU queries use: rate(container_cpu_usage_seconds_total[5m]) with subqueries
- Percentiles calculated using: max(quantile_over_time(...)) across all pods
- Memory max uses: container_memory_max_usage_bytes (peak usage)
- Memory percentiles use: container_memory_working_set_bytes (working set)
- All queries filtered by task label to ensure accuracy
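Based on the shapes described above, the percentile queries can be sketched as string builders (builder names and the exact label sets are illustrative; the real script also filters by task label and namespace):

```python
def mem_percentile_query(pod_regex: str, q: float, window: str = "1d") -> str:
    # Per-pod q-quantile of the working set over the window, then the
    # highest such per-pod value across all matching pods.
    return (
        f"max(quantile_over_time({q}, "
        f'container_memory_working_set_bytes{{pod=~"{pod_regex}"}}[{window}]))'
    )

def cpu_percentile_query(pod_regex: str, q: float,
                         window: str = "1d", step: str = "30s") -> str:
    # rate() yields a derived series, so a subquery ([window:step]) is
    # needed before quantile_over_time can aggregate it over time.
    return (
        f"max(quantile_over_time({q}, "
        f'rate(container_cpu_usage_seconds_total{{pod=~"{pod_regex}"}}[5m])'
        f"[{window}:{step}]))"
    )
```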

Testing:
-------

- Tested with 1 day range: ✓ Works correctly
- Tested with 7 day range: ✓ Works correctly with adaptive step sizing
- Tested with 1068+ pods: ✓ Batching handles large pod counts efficiently
- Verified task-scoped queries: ✓ Only includes pods from specified task

Example Output:
--------------

"stone-prd-rh01","buildah","step-build",
"maestro-on-pull-request-wtpkk-build-container-pod","maestro-rhtap-tenant",
"N/A","N/A","8192","8191","8190","8183",
"operator-on-pull-request-45m69-build-container-pod","vp-operator-release-tenant",
"3569m","3569m","3569m","3569m"

Files Modified:
--------------
- wrapper_for_promql.sh: Added CPU metrics collection and batching logic
- wrapper_for_promql_for_all_clusters.sh: Updated CSV header format
- query_prometheus_range.py: Added adaptive step sizing for long ranges
- README.md: Updated documentation with new features

Related Issue:
--------------
Addresses requirements for comprehensive resource usage monitoring in
Konflux clusters, enabling both memory and CPU analysis for task/step
combinations.
@smodak-rh
Contributor Author

@jhutar, after adding the CPU metrics collection, I ran 1-day and 7-day queries against a single cluster to confirm that data is being received:

$ time ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"

"stone-prd-rh01","buildah","step-build","maestro-on-pull-request-wtpkk-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8192","8191","8190","8183","operator-on-pull-request-fwzfh-build-container-pod","vp-operator-release-tenant","3212m","3212m","3569m","3212m"

"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","112","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","405m","337m","273m","208m"

"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1302","726","32","mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod","konflux-mintmaker-tenant","971m","692m","449m","174m"

"stone-prd-rh01","buildah","step-prepare-sboms","notifications-connector-pag925b30637da6a0f9c98c8fa740948df5-pod","hcc-integrations-tenant","N/A","N/A","153","90","75","7","notifications-aggregator-on617c8cafd306f1717a77f9c4a44b83d1-pod","hcc-integrations-tenant","10m","9m","16m","8m"

"stone-prd-rh01","buildah","step-upload-sbom","notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod","hcc-integrations-tenant","N/A","N/A","45","30","25","6","mintmaker-osv-database-on-push-jk7mb-build-container-pod","konflux-mintmaker-tenant","5m","5m","5m","5m"
 
real	6m39.147s
user	2m48.881s
sys	0m47.104s

and

$ time ./wrapper_for_promql_for_all_clusters.sh 7 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"

"stone-prd-rh01","buildah","step-build","maestro-on-pull-request-cs8f8-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8192","8191","8190","8183","trustification-service-on-p9460116ade3444143420a76f9cb8a182-pod","trusted-content-tenant","8955m","8611m","7720m","5233m"

"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","84","mintmaker-renovate-image-on54ff13b99bd61c85fed3a7b94a230a56-pod","konflux-mintmaker-tenant","429m","374m","303m","208m"

"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1302","1049","25","mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod","konflux-mintmaker-tenant","971m","692m","449m","213m"

"stone-prd-rh01","buildah","step-prepare-sboms","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","189","154","154","33","rhsm-api-proxy-on-pull-request-2t5g4-build-container-pod","teamnado-konflux-tenant","228m","233m","221m","233m"

"stone-prd-rh01","buildah","step-upload-sbom","notifications-aggregator-on8ffcaf8341fb68132b38fcb7df9c7ed8-pod","hcc-integrations-tenant","N/A","N/A","47","30","24","11","mintmaker-osv-database-on-push-jk7mb-build-container-pod","konflux-mintmaker-tenant","5m","4m","5m","5m"

real	8m14.323s
user	2m43.201s
sys	0m58.631s

Fixed linting issues (shellcheck warnings, black formatting, flake8 config) and added retry logic for transient interruptions. All linters pass.

Generated-by: Cursor-AI
@smodak-rh
Contributor Author

After the shellcheck, black, and flake8 fixes, I re-ran everything and it all works. I also added a --trailer to acknowledge Cursor AI's help.

$ time ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"
"stone-prd-rh01","buildah","step-build","aus-cli-main-on-pull-request-j9bbl-build-container-pod","app-sre-tenant","N/A","N/A","8192","8191","8190","8180","ocm-cli-on-pull-request-g4lkv-build-container-pod","ocm-cli-clients-tenant","4195m","4195m","4195m","4195m"
"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","138","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","396m","309m","260m","208m"
"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1118","731","71","mintmaker-renovate-image-on5ff950605f0d2fbad539308fe3fd651c-pod","konflux-mintmaker-tenant","953m","674m","585m","281m"
"stone-prd-rh01","buildah","step-prepare-sboms","aws-generated-data-main-on-05a4f3c09e0be4ac79fe40ef209fe3bd-pod","app-sre-tenant","N/A","N/A","146","90","75","21","notifications-aggregator-on47f441ff809ae644aeaa6c0823552be0-pod","hcc-integrations-tenant","7m","8m","8m","8m"
"stone-prd-rh01","buildah","step-upload-sbom","notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod","hcc-integrations-tenant","N/A","N/A","45","28","24","17","integration-tests-main-on-p81efcae1838217408e8f024650a908e7-pod","app-sre-tenant","5m","5m","4m","4m"

real	27m14.445s
user	3m18.803s
sys	1m4.333s

Cc: @jhutar

Adds component/application label extraction for both memory and CPU max pods.
Uses range queries to find deleted pods and correctly matches Prometheus
label keys (label_appstudio_openshift_io_component/application).

- Added component/application lookup for max CPU pod
- Fixed label key matching (openshift.io not redhat.com)
- Switched to range queries for deleted pods
- Fixed JSON parsing by separating debug output from JSON
- Updated CSV format with component_max_mem, application_max_mem, component_max_cpu, application_max_cpu

Tested: ✓ Extracts actual values (e.g., 'aus-cli-main', 'ocm-cli')
Generated-by: Cursor-AI
@smodak-rh
Contributor Author

Latest test runs for all clusters, now that we can successfully resolve the application and component names consuming the most memory and CPU:

$ time ./wrapper_for_promql_for_all_clusters.sh 7 --csv
"cluster", "task", "step", "pod_max_mem", "namespace_max_mem", "component_max_mem", "application_max_mem", "mem_max_mb", "mem_p95_mb", "mem_p90_mb", "mem_median_mb", "pod_max_cpu", "namespace_max_cpu", "component_max_cpu", "application_max_cpu", "cpu_max", "cpu_p95", "cpu_p90", "cpu_median"
"kflux-prd-rh02", "buildah", "step-build", "crc-binary-on-pull-request-xtbth-build-container-pod", "crc-tenant", "crc-binary", "crc", "8192", "3344", "2844", "1044", "crc-binary-on-push-srnrq-build-container-pod", "crc-tenant", "crc-binary", "crc", "4672m", "4672m", "4672m", "4672m"
"kflux-prd-rh02", "buildah", "step-push", "crc-binary-on-push-b4fkx-build-container-pod", "crc-tenant", "crc-binary", "crc", "538", "79", "74", "23", "crc-binary-on-pull-request-hxsvc-build-container-pod", "crc-tenant", "crc-binary", "crc", "41m", "38m", "36m", "29m"
"kflux-prd-rh02", "buildah", "step-sbom-syft-generate", "git-init-on-push-dxkcn-build-container-pod", "tekton-ecosystem-tenant", "git-init", "tektoncd-git-clone", "1218", "306", "181", "4", "git-init-on-pull-request-stpsx-build-container-pod", "tekton-ecosystem-tenant", "git-init", "tektoncd-git-clone", "141m", "89m", "85m", "70m"
"kflux-prd-rh02", "buildah", "step-prepare-sboms", "rhobs-synthetics-api-main-oe669bc2d2a9f6185b716e15c45c73d30-pod", "rhobs-synthetics-tenant", "rhobs-synthetics-api-main", "rhobs-synthetics-api-main", "90", "61", "61", "4", "rhobs-synthetics-api-main-odcfe23c9221535dd781c5523269840e3-pod", "rhobs-synthetics-tenant", "rhobs-synthetics-api-main", "rhobs-synthetics-api-main", "172m", "83m", "83m", "172m"
"kflux-prd-rh02", "buildah", "step-upload-sbom", "rhobs-token-ref814d8f6c3d504c9abaecd3ba6d27c8fb6137ede8415c-pod", "rhobs-mco-tenant", "rhobs-token-refresher-main", "rhobs-token-refresher-main", "31", "25", "23", "4", "rhobs-token-refd2f9c53285e9206f0a2c003ddfc560e7112b8ff12ee4-pod", "rhobs-mco-tenant", "rhobs-token-refresher-main", "rhobs-token-refresher-main", "3m", "3m", "3m", "0m"
"kflux-prd-rh03", "buildah", "step-build", "rosa-log-router-processor-go-on-push-7lbbj-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-processor-go", "rosa-log-router", "2263", "486", "434", "256", "rosa-log-routerb7de1582684acbf99107709874b8fb3373a259b8f316-pod", "rosa-log-router-tenant", "rosa-log-router-processor-go", "rosa-log-router", "119m", "119m", "141m", "119m"
"kflux-prd-rh03", "buildah", "step-push", "rosa-log-router-api-on-push-7m4q6-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "291", "111", "105", "59", "rosa-log-router-api-on-push-7m4q6-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "29m", "29m", "29m", "29m"
"kflux-prd-rh03", "buildah", "step-sbom-syft-generate", "rosa-log-router-api-on-push-ljp7m-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "768", "250", "207", "12", "rosa-log-router-api-on-push-ljp7m-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "39m", "39m", "39m", "0m"
"kflux-prd-rh03", "buildah", "step-prepare-sboms", "rosa-clusters-service-main-on-push-889lb-build-container-pod", "ocm-tenant", "rosa-clusters-service-main", "rosa-clusters-service-main", "10", "5", "5", "5", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-prd-rh03", "buildah", "step-upload-sbom", "app-on-push-mrv5z-build-image-2-pod", "rosa-log-router-tenant", "app", "konflux-test", "11", "5", "5", "5", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-build", "rpm-repo-mappin7c36db7ab6cab31e64ae665990a9f1b96bf4f8f6758a-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "909", "347", "347", "134", "", "N/A", "N/A", "N/A", "0m", "0m", "38m", "38m"
"kflux-rhel-p01", "buildah", "step-push", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "32", "25", "21", "4", "rpm-repo-mappin7c36db7ab6cab31e64ae665990a9f1b96bf4f8f6758a-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "13m", "0m", "0m", "13m"
"kflux-rhel-p01", "buildah", "step-sbom-syft-generate", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "10", "4", "60", "4", "rpm-repo-mappina2b8f502f2584fa3f294c756793add4cd5635a0aef42-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-compare-with-plm-container", "tooling", "27m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-prepare-sboms", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "12", "4", "4", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-upload-sbom", "rpm-repo-mappinc5b4e29486b9f825221d0577ddb40671ff7999d7add9-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-compare-with-plm-container", "tooling", "10", "4", "4", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-prd-rh01", "buildah", "step-build", "aus-cli-main-on-pull-request-j9bbl-build-container-pod", "app-sre-tenant", "aus-cli-main", "aus-cli-main", "8192", "8191", "8190", "8180", "ocm-cli-on-pull-request-g4lkv-build-container-pod", "ocm-cli-clients-tenant", "ocm-cli", "ocm-cli", "4195m", "4195m", "4195m", "4195m"
"stone-prd-rh01", "buildah", "step-push", "mintmaker-renovate-image-onb8d55b89a25a8b55af0bd2d8c05e9224-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "4096", "495", "494", "89", "mintmaker-renovate-image-onb8d55b89a25a8b55af0bd2d8c05e9224-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "440m", "350m", "276m", "208m"
"stone-prd-rh01", "buildah", "step-sbom-syft-generate", "mintmaker-renovate-image-on-push-w8f7h-build-container-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "4096", "1269", "716", "43", "mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "971m", "705m", "585m", "281m"
"stone-prd-rh01", "buildah", "step-prepare-sboms", "rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod", "teamnado-konflux-tenant", "rhsm-api-proxy", "rhsm-api-proxy", "189", "154", "154", "33", "rhsm-api-proxy-on-pull-request-2t5g4-build-container-pod", "teamnado-konflux-tenant", "rhsm-api-proxy", "rhsm-api-proxy", "228m", "221m", "233m", "221m"
"stone-prd-rh01", "buildah", "step-upload-sbom", "notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod", "hcc-integrations-tenant", "notifications-connector-splunk", "notifications", "45", "30", "26", "17", "cluster-observa84ddd62f3ee5a93dbb032cc519c930a852f4967f3e97-pod", "cluster-observabilit-tenant", "cluster-observability-operator-bundle-1-3", "cluster-observability-operator-1-3", "5m", "5m", "5m", "5m"
"stone-prod-p02", "buildah", "step-build", "uhc-clusters-sea97dd9b79f64f9d791de081214fa616a3df0a5d0c5fb-pod", "ocm-tenant", "uhc-clusters-service-master", "uhc-clusters-service-master", "8193", "5252", "4942", "3431", "uhc-clusters-sed0fbccabe1f1f96904b2605ada059625dd25648eb671-pod", "ocm-tenant", "uhc-clusters-service-master", "uhc-clusters-service-master", "4159m", "4236m", "4236m", "4236m"
"stone-prod-p02", "buildah", "step-push", "ocmci-on-push-pcnx5-build-container-pod", "ocmci-tenant", "ocmci", "ocm-backend-tests", "2377", "130", "128", "48", "ocmci-on-pull-request-4ktsp-build-container-pod", "ocmci-tenant", "ocmci", "ocm-backend-tests", "147m", "138m", "130m", "107m"
"stone-prod-p02", "buildah", "step-sbom-syft-generate", "lifecycle-api-on-pull-request-5m9px-build-container-pod", "plcm-tenant", "lifecycle-api", "plcm", "2834", "741", "416", "174", "lifecycle-api-on-pull-request-5m9px-build-container-pod", "plcm-tenant", "lifecycle-api", "plcm", "116m", "116m", "151m", "151m"
"stone-prod-p02", "buildah", "step-prepare-sboms", "product-experience-apps-on-push-s88nz-build-container-pod", "cpla-tenant", "product-experience-apps", "product-experience-apps", "99", "65", "36", "7", "product-experience-apps-on-push-s88nz-build-container-pod", "cpla-tenant", "product-experience-apps", "product-experience-apps", "5m", "5m", "5m", "3m"
"stone-prod-p02", "buildah", "step-upload-sbom", "fbc-4-19-on-push-75dwm-build-container-pod", "rhdh-tenant", "fbc-4-19", "fbc-4-19", "42", "29", "28", "20", "fbc-4-20-on-push-jth4l-build-container-pod", "rhdh-tenant", "fbc-4-20", "fbc-4-20", "3m", "5m", "3m", "5m"
"stone-stg-rh01", "buildah", "step-build", "konflux-tests-on-push-shd2v-build-container-pod", "rh-ee-athorp-tenant", "konflux-tests", "test-first-application", "6", "0", "0", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-push", "", "N/A", "N/A", "N/A", "0", "0", "0", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-sbom-syft-generate", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-prepare-sboms", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-upload-sbom", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"

real	18m6.323s
user	5m46.391s
sys	1m58.815s

Cc: @jhutar

Summary:
Add analyze_resource_limits.py script to analyze resource consumption data
and provide recommendations for Kubernetes resource limits based on P95
percentile analysis with configurable safety margins.

Description:
This commit introduces a new tool for analyzing resource consumption data
collected from Prometheus and generating recommendations for Kubernetes
resource limits. The tool addresses the need to optimize resource allocation
in Tekton tasks based on actual usage patterns.

Key Features:
- Analyzes CSV data from wrapper_for_promql_for_all_clusters.sh
- Calculates recommendations using P95 percentile + configurable margin (default 10%)
- Automatically rounds memory to standard Kubernetes values:
  * Values < 1Gi: Rounds to nearest power of 2 (32Mi, 64Mi, 128Mi, 256Mi, 512Mi)
  * Values >= 1Gi: Rounds to whole Gi values (1Gi, 2Gi, 3Gi, etc.)
- Always formats CPU values in millicores for consistency
- Can parse Tekton Task YAML files (local or from GitHub URLs)
- Automatically extracts task name and step names from YAML
- Optionally runs data collection automatically
- Can update YAML files with recommended resource limits
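The rounding rule can be sketched as follows (function names are illustrative; the actual implementation in analyze_resource_limits.py may differ). Rounding is always upward so a limit never drops below the P95-plus-margin recommendation:

```python
import math

def round_memory_mi(mib: float) -> str:
    """Round a memory recommendation (MiB) up to a standard k8s value:
    powers of two below 1Gi, whole Gi at or above."""
    if mib <= 32:
        return "32Mi"                         # floor at the smallest standard value
    p = 2 ** math.ceil(math.log2(mib))        # next power of two
    if p < 1024:
        return f"{p}Mi"
    return f"{math.ceil(mib / 1024)}Gi"       # whole Gi at or above 1Gi

def recommend(p95_mib: float, margin: float = 0.10) -> str:
    """P95 usage plus a configurable safety margin, rounded to a standard value."""
    return round_memory_mi(p95_mib * (1 + margin))
```

For example, a P95 of 400 MiB with the default 10% margin becomes 440 MiB, which rounds up to 512Mi.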

Usage Examples:
  # From piped input:
  ./wrapper_for_promql_for_all_clusters.sh 7 --csv | ./analyze_resource_limits.py

  # From YAML file (auto-runs data collection):
  ./analyze_resource_limits.py --file /path/to/buildah.yaml
  ./analyze_resource_limits.py --file https://github.com/.../buildah.yaml

  # Update YAML file with recommendations:
  ./analyze_resource_limits.py --file /path/to/buildah.yaml --update

The tool uses intelligent rounding so that resource limits land on standard
Kubernetes values without dropping below the recommendations calculated from
actual usage data.

Files Changed:
- analyze_resource_limits.py: New script for resource limit analysis
- README.md: Updated with documentation for the new tool
Generated-by: Cursor-AI
@jhutar jhutar merged commit 8f287f0 into redhat-appstudio:main Dec 17, 2025