@smodak-rh
Contributor

This change fixes a fundamental accuracy issue in memory percentile computation
(p95 / p90 / median) caused by per-batch aggregation in shell.

Previously, percentiles were calculated independently for each batch of pods
and then merged by taking the maximum across batches. This approach is
mathematically incorrect and could significantly overestimate memory pressure,
especially when batch sizes or pod lifetimes varied.

Key changes:

  • Move percentile computation fully into PromQL using quantile_over_time()
  • Restrict batching to transport concerns only (pod regex size), not statistics
  • Preserve exact max memory calculation using max_over_time()
  • Switch percentile metric to container_memory_working_set_bytes for more
    representative memory pressure
  • Floor PromQL floating-point samples in jq to ensure bash-safe arithmetic
  • Add a debug guard when no pods are found for a task/cluster
  • Make the script compatible with older bash versions (no mapfile usage)

Results:

  • Statistically correct percentiles across all pods and time windows
  • Stable and reproducible results independent of batch size
  • No bash arithmetic errors from floating-point Prometheus samples
  • Cleaner separation between data collection and aggregation logic
  • Results before patching:
$ ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","max_mem_mb","pod","namespace","component","application","p95_mb","p90_mb","median_mb"
"stone-prd-rh01","buildah","step-build","8192","maestro-on-pull-request-xj7dc-build-container-pod","maestro-rhtap-tenant","N/A","N/A","32","32","32"
"stone-prd-rh01","buildah","step-push","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","5","5","5"
"stone-prd-rh01","buildah","step-sbom-syft-generate","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","6","6","6"
"stone-prd-rh01","buildah","step-prepare-sboms","189","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","5","5","5"
"stone-prd-rh01","buildah","step-upload-sbom","46","crc-caddy-plugin-on-pull-request-ss22g-build-container-pod","hcm-eng-prod-tenant","N/A","N/A","5","5","5"
  • Results after patching:
$ ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","max_mem_mb","pod","namespace","component","application","p95_mb","p90_mb","median_mb"
"stone-prd-rh01","buildah","step-build","8192","maestro-on-pull-request-xj7dc-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8191","8190","8183"
"stone-prd-rh01","buildah","step-push","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","471","445","59"
"stone-prd-rh01","buildah","step-sbom-syft-generate","4096","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","1264","726","49"
"stone-prd-rh01","buildah","step-prepare-sboms","189","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","154","154","33"
"stone-prd-rh01","buildah","step-upload-sbom","46","crc-caddy-plugin-on-pull-request-ss22g-build-container-pod","hcm-eng-prod-tenant","N/A","N/A","30","23","7"

The max memory values match in both cases, but the data above shows that the accuracy of the p95, p90, and median values has improved substantially.
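Why merging per-batch percentiles with max() is wrong can be shown with a small, self-contained example (illustrative only; the real data comes from Prometheus, and the batch contents here are made up):

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    s = sorted(samples)
    return s[math.ceil(0.95 * len(s)) - 1]

# Two hypothetical batches of per-pod memory samples (MB).
batch_a = list(range(1, 101))   # pods with varied usage, 1..100 MB
batch_b = [1] * 100             # many short-lived low-usage pods

# Old approach: p95 per batch, then max across batches.
per_batch_max = max(p95(batch_a), p95(batch_b))   # 95

# Correct approach: one p95 over the full sample set.
global_p95 = p95(batch_a + batch_b)               # 90
```

Here the batched approach reports 95 MB where the true p95 over all pods is 90 MB; with more skewed batch compositions the overestimate grows, which is why the fix computes the quantile once, in PromQL, over all pods.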

This commit adds comprehensive CPU usage metrics collection to complement
the existing memory metrics functionality. The implementation includes CPU
max, P95, P90, and median calculations with proper pod attribution.

Changes:
--------

1. CPU Metrics Collection:
   - Added CPU max, P95, P90, and median calculations using
     container_cpu_usage_seconds_total with rate() aggregation
   - CPU values are converted to millicores (m) format for readability
   - Separate pod attribution for max CPU usage (pod_max_cpu, pod_namespace_cpu)

2. Query Optimizations:
   - Implemented intelligent batching (50 pods per batch) to handle unlimited
     pod counts without hitting Prometheus URL/query limits
   - Added adaptive step sizing in query_prometheus_range.py:
     * ≤1 day: 30s step (fine-grained)
     * ≤7 days: 5m step (optimized for Prometheus limits)
     * ≤30 days: 15m step
     * >30 days: 1h step
   - This enables reliable querying for 7+ day time ranges

3. Task-Scoped Query Validation:
   - Added pod validation to ensure all metrics come only from pods
     belonging to the specified task (filtered by label_tekton_dev_task)
   - Prevents accidental inclusion of pods from other tasks with same step names
   - Uses bash 3.2-compatible pod membership checking

4. Output Format Updates:
   - Updated CSV header to include CPU columns:
     pod_max_cpu, pod_namespace_cpu, cpu_max, cpu_p95, cpu_p90, cpu_median
   - CPU values displayed in millicores format (e.g., "3569m")
   - Maintains backward compatibility with existing memory metrics

5. Code Improvements:
   - Fixed bash 3.2 compatibility issues (removed associative arrays)
   - Improved error handling for empty query results
   - Added debug logging for query execution and batch processing
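The adaptive step thresholds above can be sketched as a small helper (the function name is illustrative; the actual code in query_prometheus_range.py may differ):

```python
def adaptive_step(days: float) -> str:
    """Pick a Prometheus range-query step for a window of `days` days.

    Prometheus rejects range queries that would return more than 11,000
    points per series; 7 days at a 30s step is ~20,160 points, while 5m
    keeps it at ~2,016, so the step widens with the window.
    """
    if days <= 1:
        return "30s"   # fine-grained
    if days <= 7:
        return "5m"    # stays under the per-series point limit
    if days <= 30:
        return "15m"
    return "1h"
```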

Technical Details:
-----------------

- CPU queries use: rate(container_cpu_usage_seconds_total[5m]) with subqueries
- Percentiles calculated using: max(quantile_over_time(...)) across all pods
- Memory max uses: container_memory_max_usage_bytes (peak usage)
- Memory percentiles use: container_memory_working_set_bytes (working set)
- All queries filtered by task label to ensure accuracy
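Based on the shapes described above, the percentile queries can be sketched as string builders (builder names and the exact label sets are illustrative; the real script also filters by task label and namespace):

```python
def mem_percentile_query(pod_regex: str, q: float, window: str = "1d") -> str:
    # Per-pod q-quantile of the working set over the window, then the
    # highest such per-pod value across all matching pods.
    return (
        f"max(quantile_over_time({q}, "
        f'container_memory_working_set_bytes{{pod=~"{pod_regex}"}}[{window}]))'
    )

def cpu_percentile_query(pod_regex: str, q: float,
                         window: str = "1d", step: str = "30s") -> str:
    # rate() yields a derived series, so a subquery ([window:step]) is
    # needed before quantile_over_time can aggregate it over time.
    return (
        f"max(quantile_over_time({q}, "
        f'rate(container_cpu_usage_seconds_total{{pod=~"{pod_regex}"}}[5m])'
        f"[{window}:{step}]))"
    )
```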

Testing:
-------

- Tested with 1 day range: ✓ Works correctly
- Tested with 7 day range: ✓ Works correctly with adaptive step sizing
- Tested with 1068+ pods: ✓ Batching handles large pod counts efficiently
- Verified task-scoped queries: ✓ Only includes pods from specified task

Example Output:
--------------

"stone-prd-rh01","buildah","step-build",
"maestro-on-pull-request-wtpkk-build-container-pod","maestro-rhtap-tenant",
"N/A","N/A","8192","8191","8190","8183",
"operator-on-pull-request-45m69-build-container-pod","vp-operator-release-tenant",
"3569m","3569m","3569m","3569m"

Files Modified:
--------------
- wrapper_for_promql.sh: Added CPU metrics collection and batching logic
- wrapper_for_promql_for_all_clusters.sh: Updated CSV header format
- query_prometheus_range.py: Added adaptive step sizing for long ranges
- README.md: Updated documentation with new features

Related Issue:
--------------
Addresses requirements for comprehensive resource usage monitoring in
Konflux clusters, enabling both memory and CPU analysis for task/step
combinations.
@smodak-rh
Contributor Author

@jhutar, after adding the CPU metrics collection, I ran 1-day and 7-day queries against a single cluster to confirm that data is being received:

$ time ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"

"stone-prd-rh01","buildah","step-build","maestro-on-pull-request-wtpkk-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8192","8191","8190","8183","operator-on-pull-request-fwzfh-build-container-pod","vp-operator-release-tenant","3212m","3212m","3569m","3212m"

"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","112","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","405m","337m","273m","208m"

"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1302","726","32","mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod","konflux-mintmaker-tenant","971m","692m","449m","174m"

"stone-prd-rh01","buildah","step-prepare-sboms","notifications-connector-pag925b30637da6a0f9c98c8fa740948df5-pod","hcc-integrations-tenant","N/A","N/A","153","90","75","7","notifications-aggregator-on617c8cafd306f1717a77f9c4a44b83d1-pod","hcc-integrations-tenant","10m","9m","16m","8m"

"stone-prd-rh01","buildah","step-upload-sbom","notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod","hcc-integrations-tenant","N/A","N/A","45","30","25","6","mintmaker-osv-database-on-push-jk7mb-build-container-pod","konflux-mintmaker-tenant","5m","5m","5m","5m"
 
real	6m39.147s
user	2m48.881s
sys	0m47.104s

and

$ time ./wrapper_for_promql_for_all_clusters.sh 7 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"

"stone-prd-rh01","buildah","step-build","maestro-on-pull-request-cs8f8-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8192","8191","8190","8183","trustification-service-on-p9460116ade3444143420a76f9cb8a182-pod","trusted-content-tenant","8955m","8611m","7720m","5233m"

"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","84","mintmaker-renovate-image-on54ff13b99bd61c85fed3a7b94a230a56-pod","konflux-mintmaker-tenant","429m","374m","303m","208m"

"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1302","1049","25","mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod","konflux-mintmaker-tenant","971m","692m","449m","213m"

"stone-prd-rh01","buildah","step-prepare-sboms","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","189","154","154","33","rhsm-api-proxy-on-pull-request-2t5g4-build-container-pod","teamnado-konflux-tenant","228m","233m","221m","233m"

"stone-prd-rh01","buildah","step-upload-sbom","notifications-aggregator-on8ffcaf8341fb68132b38fcb7df9c7ed8-pod","hcc-integrations-tenant","N/A","N/A","47","30","24","11","mintmaker-osv-database-on-push-jk7mb-build-container-pod","konflux-mintmaker-tenant","5m","4m","5m","5m"

real	8m14.323s
user	2m43.201s
sys	0m58.631s

Fixed linting issues (shellcheck warnings, black formatting, flake8 config) and added retry logic for transient interruptions. All linters pass.

Generated-by: Cursor-AI
@smodak-rh
Contributor Author

After the shellcheck, black, and flake8 fixes, I re-ran everything and it all works. I also added a --trailer to acknowledge Cursor AI's help.

$ time ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"
"stone-prd-rh01","buildah","step-build","aus-cli-main-on-pull-request-j9bbl-build-container-pod","app-sre-tenant","N/A","N/A","8192","8191","8190","8180","ocm-cli-on-pull-request-g4lkv-build-container-pod","ocm-cli-clients-tenant","4195m","4195m","4195m","4195m"
"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","138","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","396m","309m","260m","208m"
"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1118","731","71","mintmaker-renovate-image-on5ff950605f0d2fbad539308fe3fd651c-pod","konflux-mintmaker-tenant","953m","674m","585m","281m"
"stone-prd-rh01","buildah","step-prepare-sboms","aws-generated-data-main-on-05a4f3c09e0be4ac79fe40ef209fe3bd-pod","app-sre-tenant","N/A","N/A","146","90","75","21","notifications-aggregator-on47f441ff809ae644aeaa6c0823552be0-pod","hcc-integrations-tenant","7m","8m","8m","8m"
"stone-prd-rh01","buildah","step-upload-sbom","notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod","hcc-integrations-tenant","N/A","N/A","45","28","24","17","integration-tests-main-on-p81efcae1838217408e8f024650a908e7-pod","app-sre-tenant","5m","5m","4m","4m"

real	27m14.445s
user	3m18.803s
sys	1m4.333s

Cc: @jhutar

Adds component/application label extraction for both memory and CPU max pods.
Uses range queries to find deleted pods and correctly matches Prometheus
label keys (label_appstudio_openshift_io_component/application).

- Added component/application lookup for max CPU pod
- Fixed label key matching (openshift.io not redhat.com)
- Switched to range queries for deleted pods
- Fixed JSON parsing by separating debug output from JSON
- Updated CSV format with component_max_mem, application_max_mem, component_max_cpu, application_max_cpu

Tested: ✓ Extracts actual values (e.g., 'aus-cli-main', 'ocm-cli')
Generated-by: Cursor-AI
@smodak-rh
Contributor Author

Latest test runs for all clusters, now that we can successfully resolve the application and component names consuming the most memory and CPU:

$ time ./wrapper_for_promql_for_all_clusters.sh 7 --csv
"cluster", "task", "step", "pod_max_mem", "namespace_max_mem", "component_max_mem", "application_max_mem", "mem_max_mb", "mem_p95_mb", "mem_p90_mb", "mem_median_mb", "pod_max_cpu", "namespace_max_cpu", "component_max_cpu", "application_max_cpu", "cpu_max", "cpu_p95", "cpu_p90", "cpu_median"
"kflux-prd-rh02", "buildah", "step-build", "crc-binary-on-pull-request-xtbth-build-container-pod", "crc-tenant", "crc-binary", "crc", "8192", "3344", "2844", "1044", "crc-binary-on-push-srnrq-build-container-pod", "crc-tenant", "crc-binary", "crc", "4672m", "4672m", "4672m", "4672m"
"kflux-prd-rh02", "buildah", "step-push", "crc-binary-on-push-b4fkx-build-container-pod", "crc-tenant", "crc-binary", "crc", "538", "79", "74", "23", "crc-binary-on-pull-request-hxsvc-build-container-pod", "crc-tenant", "crc-binary", "crc", "41m", "38m", "36m", "29m"
"kflux-prd-rh02", "buildah", "step-sbom-syft-generate", "git-init-on-push-dxkcn-build-container-pod", "tekton-ecosystem-tenant", "git-init", "tektoncd-git-clone", "1218", "306", "181", "4", "git-init-on-pull-request-stpsx-build-container-pod", "tekton-ecosystem-tenant", "git-init", "tektoncd-git-clone", "141m", "89m", "85m", "70m"
"kflux-prd-rh02", "buildah", "step-prepare-sboms", "rhobs-synthetics-api-main-oe669bc2d2a9f6185b716e15c45c73d30-pod", "rhobs-synthetics-tenant", "rhobs-synthetics-api-main", "rhobs-synthetics-api-main", "90", "61", "61", "4", "rhobs-synthetics-api-main-odcfe23c9221535dd781c5523269840e3-pod", "rhobs-synthetics-tenant", "rhobs-synthetics-api-main", "rhobs-synthetics-api-main", "172m", "83m", "83m", "172m"
"kflux-prd-rh02", "buildah", "step-upload-sbom", "rhobs-token-ref814d8f6c3d504c9abaecd3ba6d27c8fb6137ede8415c-pod", "rhobs-mco-tenant", "rhobs-token-refresher-main", "rhobs-token-refresher-main", "31", "25", "23", "4", "rhobs-token-refd2f9c53285e9206f0a2c003ddfc560e7112b8ff12ee4-pod", "rhobs-mco-tenant", "rhobs-token-refresher-main", "rhobs-token-refresher-main", "3m", "3m", "3m", "0m"
"kflux-prd-rh03", "buildah", "step-build", "rosa-log-router-processor-go-on-push-7lbbj-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-processor-go", "rosa-log-router", "2263", "486", "434", "256", "rosa-log-routerb7de1582684acbf99107709874b8fb3373a259b8f316-pod", "rosa-log-router-tenant", "rosa-log-router-processor-go", "rosa-log-router", "119m", "119m", "141m", "119m"
"kflux-prd-rh03", "buildah", "step-push", "rosa-log-router-api-on-push-7m4q6-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "291", "111", "105", "59", "rosa-log-router-api-on-push-7m4q6-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "29m", "29m", "29m", "29m"
"kflux-prd-rh03", "buildah", "step-sbom-syft-generate", "rosa-log-router-api-on-push-ljp7m-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "768", "250", "207", "12", "rosa-log-router-api-on-push-ljp7m-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "39m", "39m", "39m", "0m"
"kflux-prd-rh03", "buildah", "step-prepare-sboms", "rosa-clusters-service-main-on-push-889lb-build-container-pod", "ocm-tenant", "rosa-clusters-service-main", "rosa-clusters-service-main", "10", "5", "5", "5", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-prd-rh03", "buildah", "step-upload-sbom", "app-on-push-mrv5z-build-image-2-pod", "rosa-log-router-tenant", "app", "konflux-test", "11", "5", "5", "5", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-build", "rpm-repo-mappin7c36db7ab6cab31e64ae665990a9f1b96bf4f8f6758a-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "909", "347", "347", "134", "", "N/A", "N/A", "N/A", "0m", "0m", "38m", "38m"
"kflux-rhel-p01", "buildah", "step-push", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "32", "25", "21", "4", "rpm-repo-mappin7c36db7ab6cab31e64ae665990a9f1b96bf4f8f6758a-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "13m", "0m", "0m", "13m"
"kflux-rhel-p01", "buildah", "step-sbom-syft-generate", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "10", "4", "60", "4", "rpm-repo-mappina2b8f502f2584fa3f294c756793add4cd5635a0aef42-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-compare-with-plm-container", "tooling", "27m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-prepare-sboms", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "12", "4", "4", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-upload-sbom", "rpm-repo-mappinc5b4e29486b9f825221d0577ddb40671ff7999d7add9-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-compare-with-plm-container", "tooling", "10", "4", "4", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-prd-rh01", "buildah", "step-build", "aus-cli-main-on-pull-request-j9bbl-build-container-pod", "app-sre-tenant", "aus-cli-main", "aus-cli-main", "8192", "8191", "8190", "8180", "ocm-cli-on-pull-request-g4lkv-build-container-pod", "ocm-cli-clients-tenant", "ocm-cli", "ocm-cli", "4195m", "4195m", "4195m", "4195m"
"stone-prd-rh01", "buildah", "step-push", "mintmaker-renovate-image-onb8d55b89a25a8b55af0bd2d8c05e9224-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "4096", "495", "494", "89", "mintmaker-renovate-image-onb8d55b89a25a8b55af0bd2d8c05e9224-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "440m", "350m", "276m", "208m"
"stone-prd-rh01", "buildah", "step-sbom-syft-generate", "mintmaker-renovate-image-on-push-w8f7h-build-container-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "4096", "1269", "716", "43", "mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "971m", "705m", "585m", "281m"
"stone-prd-rh01", "buildah", "step-prepare-sboms", "rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod", "teamnado-konflux-tenant", "rhsm-api-proxy", "rhsm-api-proxy", "189", "154", "154", "33", "rhsm-api-proxy-on-pull-request-2t5g4-build-container-pod", "teamnado-konflux-tenant", "rhsm-api-proxy", "rhsm-api-proxy", "228m", "221m", "233m", "221m"
"stone-prd-rh01", "buildah", "step-upload-sbom", "notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod", "hcc-integrations-tenant", "notifications-connector-splunk", "notifications", "45", "30", "26", "17", "cluster-observa84ddd62f3ee5a93dbb032cc519c930a852f4967f3e97-pod", "cluster-observabilit-tenant", "cluster-observability-operator-bundle-1-3", "cluster-observability-operator-1-3", "5m", "5m", "5m", "5m"
"stone-prod-p02", "buildah", "step-build", "uhc-clusters-sea97dd9b79f64f9d791de081214fa616a3df0a5d0c5fb-pod", "ocm-tenant", "uhc-clusters-service-master", "uhc-clusters-service-master", "8193", "5252", "4942", "3431", "uhc-clusters-sed0fbccabe1f1f96904b2605ada059625dd25648eb671-pod", "ocm-tenant", "uhc-clusters-service-master", "uhc-clusters-service-master", "4159m", "4236m", "4236m", "4236m"
"stone-prod-p02", "buildah", "step-push", "ocmci-on-push-pcnx5-build-container-pod", "ocmci-tenant", "ocmci", "ocm-backend-tests", "2377", "130", "128", "48", "ocmci-on-pull-request-4ktsp-build-container-pod", "ocmci-tenant", "ocmci", "ocm-backend-tests", "147m", "138m", "130m", "107m"
"stone-prod-p02", "buildah", "step-sbom-syft-generate", "lifecycle-api-on-pull-request-5m9px-build-container-pod", "plcm-tenant", "lifecycle-api", "plcm", "2834", "741", "416", "174", "lifecycle-api-on-pull-request-5m9px-build-container-pod", "plcm-tenant", "lifecycle-api", "plcm", "116m", "116m", "151m", "151m"
"stone-prod-p02", "buildah", "step-prepare-sboms", "product-experience-apps-on-push-s88nz-build-container-pod", "cpla-tenant", "product-experience-apps", "product-experience-apps", "99", "65", "36", "7", "product-experience-apps-on-push-s88nz-build-container-pod", "cpla-tenant", "product-experience-apps", "product-experience-apps", "5m", "5m", "5m", "3m"
"stone-prod-p02", "buildah", "step-upload-sbom", "fbc-4-19-on-push-75dwm-build-container-pod", "rhdh-tenant", "fbc-4-19", "fbc-4-19", "42", "29", "28", "20", "fbc-4-20-on-push-jth4l-build-container-pod", "rhdh-tenant", "fbc-4-20", "fbc-4-20", "3m", "5m", "3m", "5m"
"stone-stg-rh01", "buildah", "step-build", "konflux-tests-on-push-shd2v-build-container-pod", "rh-ee-athorp-tenant", "konflux-tests", "test-first-application", "6", "0", "0", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-push", "", "N/A", "N/A", "N/A", "0", "0", "0", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-sbom-syft-generate", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-prepare-sboms", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-upload-sbom", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"

real	18m6.323s
user	5m46.391s
sys	1m58.815s

Cc: @jhutar

Summary:
Add analyze_resource_limits.py script to analyze resource consumption data
and provide recommendations for Kubernetes resource limits based on P95
percentile analysis with configurable safety margins.

Description:
This commit introduces a new tool for analyzing resource consumption data
collected from Prometheus and generating recommendations for Kubernetes
resource limits. The tool addresses the need to optimize resource allocation
in Tekton tasks based on actual usage patterns.

Key Features:
- Analyzes CSV data from wrapper_for_promql_for_all_clusters.sh
- Calculates recommendations using P95 percentile + configurable margin (default 10%)
- Automatically rounds memory to standard Kubernetes values:
  * Values < 1Gi: Rounds to nearest power of 2 (32Mi, 64Mi, 128Mi, 256Mi, 512Mi)
  * Values >= 1Gi: Rounds to whole Gi values (1Gi, 2Gi, 3Gi, etc.)
- Always formats CPU values in millicores for consistency
- Can parse Tekton Task YAML files (local or from GitHub URLs)
- Automatically extracts task name and step names from YAML
- Optionally runs data collection automatically
- Can update YAML files with recommended resource limits
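The rounding rule can be sketched as follows (function names are illustrative; the actual implementation in analyze_resource_limits.py may differ). Rounding is always upward so a limit never drops below the P95-plus-margin recommendation:

```python
import math

def round_memory_mi(mib: float) -> str:
    """Round a memory recommendation (MiB) up to a standard k8s value:
    powers of two below 1Gi, whole Gi at or above."""
    if mib <= 32:
        return "32Mi"                         # floor at the smallest standard value
    p = 2 ** math.ceil(math.log2(mib))        # next power of two
    if p < 1024:
        return f"{p}Mi"
    return f"{math.ceil(mib / 1024)}Gi"       # whole Gi at or above 1Gi

def recommend(p95_mib: float, margin: float = 0.10) -> str:
    """P95 usage plus a configurable safety margin, rounded to a standard value."""
    return round_memory_mi(p95_mib * (1 + margin))
```

For example, a P95 of 400 MiB with the default 10% margin becomes 440 MiB, which rounds up to 512Mi.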

Usage Examples:
  # From piped input:
  ./wrapper_for_promql_for_all_clusters.sh 7 --csv | ./analyze_resource_limits.py

  # From YAML file (auto-runs data collection):
  ./analyze_resource_limits.py --file /path/to/buildah.yaml
  ./analyze_resource_limits.py --file https://github.com/.../buildah.yaml

  # Update YAML file with recommendations:
  ./analyze_resource_limits.py --file /path/to/buildah.yaml --update

The tool uses intelligent rounding so that resource limits land on standard
Kubernetes values without dropping below the recommendations calculated from
actual usage data.

Files Changed:
- analyze_resource_limits.py: New script for resource limit analysis
- README.md: Updated with documentation for the new tool
Generated-by: Cursor-AI
@jhutar jhutar merged commit 8f287f0 into redhat-appstudio:main Dec 17, 2025