-
Notifications
You must be signed in to change notification settings - Fork 4
Fix percentile accuracy by moving memory stats computation into PromQL and harden batching #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…L and harden batching
This commit adds comprehensive CPU usage metrics collection to complement
the existing memory metrics functionality. The implementation includes CPU
max, P95, P90, and median calculations with proper pod attribution.
Changes:
--------
1. CPU Metrics Collection:
- Added CPU max, P95, P90, and median calculations using
container_cpu_usage_seconds_total with rate() aggregation
- CPU values are converted to millicores (m) format for readability
- Separate pod attribution for max CPU usage (pod_max_cpu, pod_namespace_cpu)
2. Query Optimizations:
- Implemented intelligent batching (50 pods per batch) to handle unlimited
pod counts without hitting Prometheus URL/query limits
- Added adaptive step sizing in query_prometheus_range.py:
* ≤1 day: 30s step (fine-grained)
* ≤7 days: 5m step (optimized for Prometheus limits)
* ≤30 days: 15m step
* >30 days: 1h step
- This enables reliable querying for 7+ day time ranges
3. Task-Scoped Query Validation:
- Added pod validation to ensure all metrics come only from pods
belonging to the specified task (filtered by label_tekton_dev_task)
- Prevents accidental inclusion of pods from other tasks with same step names
- Uses bash 3.2-compatible pod membership checking
4. Output Format Updates:
- Updated CSV header to include CPU columns:
pod_max_cpu, pod_namespace_cpu, cpu_max, cpu_p95, cpu_p90, cpu_median
- CPU values displayed in millicores format (e.g., "3569m")
- Maintains backward compatibility with existing memory metrics
5. Code Improvements:
- Fixed bash 3.2 compatibility issues (removed associative arrays)
- Improved error handling for empty query results
- Added debug logging for query execution and batch processing
Technical Details:
-----------------
- CPU queries use: rate(container_cpu_usage_seconds_total[5m]) with subqueries
- Percentiles calculated using: max(quantile_over_time(...)) across all pods
- Memory max uses: container_memory_max_usage_bytes (peak usage)
- Memory percentiles use: container_memory_working_set_bytes (working set)
- All queries filtered by task label to ensure accuracy
Testing:
-------
- Tested with 1 day range: ✓ Works correctly
- Tested with 7 day range: ✓ Works correctly with adaptive step sizing
- Tested with 1068+ pods: ✓ Batching handles large pod counts efficiently
- Verified task-scoped queries: ✓ Only includes pods from specified task
Example Output:
--------------
"stone-prd-rh01","buildah","step-build",
"maestro-on-pull-request-wtpkk-build-container-pod","maestro-rhtap-tenant",
"N/A","N/A","8192","8191","8190","8183",
"operator-on-pull-request-45m69-build-container-pod","vp-operator-release-tenant",
"3569m","3569m","3569m","3569m"
Files Modified:
--------------
- wrapper_for_promql.sh: Added CPU metrics collection and batching logic
- wrapper_for_promql_for_all_clusters.sh: Updated CSV header format
- query_prometheus_range.py: Added adaptive step sizing for long ranges
- README.md: Updated documentation with new features
Related Issue:
--------------
Addresses requirements for comprehensive resource usage monitoring in
Konflux clusters, enabling both memory and CPU analysis for task/step
combinations.
- Add CPU max, P95, P90, and median calculations with pod attribution - Implement intelligent batching (50 pods/batch) for unlimited pod counts - Add adaptive step sizing for 7+ day time ranges (5m for 7 days) - Add task-scoped query validation to ensure accuracy - Update CSV format to include CPU columns - Fix bash 3.2 compatibility issues
|
@jhutar, after the CPU data collection metrics was added, ran for 1 day and 7 days data for a single cluster to ascertain that data is being received: $ time ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"
"stone-prd-rh01","buildah","step-build","maestro-on-pull-request-wtpkk-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8192","8191","8190","8183","operator-on-pull-request-fwzfh-build-container-pod","vp-operator-release-tenant","3212m","3212m","3569m","3212m"
"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","112","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","405m","337m","273m","208m"
"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1302","726","32","mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod","konflux-mintmaker-tenant","971m","692m","449m","174m"
"stone-prd-rh01","buildah","step-prepare-sboms","notifications-connector-pag925b30637da6a0f9c98c8fa740948df5-pod","hcc-integrations-tenant","N/A","N/A","153","90","75","7","notifications-aggregator-on617c8cafd306f1717a77f9c4a44b83d1-pod","hcc-integrations-tenant","10m","9m","16m","8m"
"stone-prd-rh01","buildah","step-upload-sbom","notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod","hcc-integrations-tenant","N/A","N/A","45","30","25","6","mintmaker-osv-database-on-push-jk7mb-build-container-pod","konflux-mintmaker-tenant","5m","5m","5m","5m"
real 6m39.147s
user 2m48.881s
sys 0m47.104sand $ time ./wrapper_for_promql_for_all_clusters.sh 7 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"
"stone-prd-rh01","buildah","step-build","maestro-on-pull-request-cs8f8-build-container-pod","maestro-rhtap-tenant","N/A","N/A","8192","8191","8190","8183","trustification-service-on-p9460116ade3444143420a76f9cb8a182-pod","trusted-content-tenant","8955m","8611m","7720m","5233m"
"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","84","mintmaker-renovate-image-on54ff13b99bd61c85fed3a7b94a230a56-pod","konflux-mintmaker-tenant","429m","374m","303m","208m"
"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on-push-w8f7h-build-container-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1302","1049","25","mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod","konflux-mintmaker-tenant","971m","692m","449m","213m"
"stone-prd-rh01","buildah","step-prepare-sboms","rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod","teamnado-konflux-tenant","N/A","N/A","189","154","154","33","rhsm-api-proxy-on-pull-request-2t5g4-build-container-pod","teamnado-konflux-tenant","228m","233m","221m","233m"
"stone-prd-rh01","buildah","step-upload-sbom","notifications-aggregator-on8ffcaf8341fb68132b38fcb7df9c7ed8-pod","hcc-integrations-tenant","N/A","N/A","47","30","24","11","mintmaker-osv-database-on-push-jk7mb-build-container-pod","konflux-mintmaker-tenant","5m","4m","5m","5m"
real 8m14.323s
user 2m43.201s
sys 0m58.631s |
Fixes linting issues: shellcheck warnings, black formatting, flake8 config. Added retry logic for transient interruptions. All linters passing. Generated-by: Curson
|
After all the $ time ./wrapper_for_promql_for_all_clusters.sh 1 --csv
"cluster","task","step","pod_max_memory","pod_namespace_mem","component","application","mem_max_mb","mem_p95_mb","mem_p90_mb","mem_median_mb","pod_max_cpu","pod_namespace_cpu","cpu_max","cpu_p95","cpu_p90","cpu_median"
"stone-prd-rh01","buildah","step-build","aus-cli-main-on-pull-request-j9bbl-build-container-pod","app-sre-tenant","N/A","N/A","8192","8191","8190","8180","ocm-cli-on-pull-request-g4lkv-build-container-pod","ocm-cli-clients-tenant","4195m","4195m","4195m","4195m"
"stone-prd-rh01","buildah","step-push","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","N/A","N/A","4096","495","494","138","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","396m","309m","260m","208m"
"stone-prd-rh01","buildah","step-sbom-syft-generate","mintmaker-renovate-image-on8b34766315da7fe7901fcec0bf4012fc-pod","konflux-mintmaker-tenant","N/A","N/A","4096","1118","731","71","mintmaker-renovate-image-on5ff950605f0d2fbad539308fe3fd651c-pod","konflux-mintmaker-tenant","953m","674m","585m","281m"
"stone-prd-rh01","buildah","step-prepare-sboms","aws-generated-data-main-on-05a4f3c09e0be4ac79fe40ef209fe3bd-pod","app-sre-tenant","N/A","N/A","146","90","75","21","notifications-aggregator-on47f441ff809ae644aeaa6c0823552be0-pod","hcc-integrations-tenant","7m","8m","8m","8m"
"stone-prd-rh01","buildah","step-upload-sbom","notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod","hcc-integrations-tenant","N/A","N/A","45","28","24","17","integration-tests-main-on-p81efcae1838217408e8f024650a908e7-pod","app-sre-tenant","5m","5m","4m","4m"
real 27m14.445s
user 3m18.803s
sys 1m4.333sCc: @jhutar |
Adds component/application label extraction for both memory and CPU max pods. Uses range queries to find deleted pods and correctly matches Prometheus label keys (label_appstudio_openshift_io_component/application). - Added component/application lookup for max CPU pod - Fixed label key matching (openshift.io not redhat.com) - Switched to range queries for deleted pods - Fixed JSON parsing by separating debug output from JSON - Updated CSV format with component_max_mem, application_max_mem, component_max_cpu, application_max_cpu Tested: ✓ Extracts actual values (e.g., 'aus-cli-main', 'ocm-cli') Generated-by: Cursor-AI
|
Latest test runs for all clusters after we are successfully able to generate $ time ./wrapper_for_promql_for_all_clusters.sh 7 --csv
"cluster", "task", "step", "pod_max_mem", "namespace_max_mem", "component_max_mem", "application_max_mem", "mem_max_mb", "mem_p95_mb", "mem_p90_mb", "mem_median_mb", "pod_max_cpu", "namespace_max_cpu", "component_max_cpu", "application_max_cpu", "cpu_max", "cpu_p95", "cpu_p90", "cpu_median"
"kflux-prd-rh02", "buildah", "step-build", "crc-binary-on-pull-request-xtbth-build-container-pod", "crc-tenant", "crc-binary", "crc", "8192", "3344", "2844", "1044", "crc-binary-on-push-srnrq-build-container-pod", "crc-tenant", "crc-binary", "crc", "4672m", "4672m", "4672m", "4672m"
"kflux-prd-rh02", "buildah", "step-push", "crc-binary-on-push-b4fkx-build-container-pod", "crc-tenant", "crc-binary", "crc", "538", "79", "74", "23", "crc-binary-on-pull-request-hxsvc-build-container-pod", "crc-tenant", "crc-binary", "crc", "41m", "38m", "36m", "29m"
"kflux-prd-rh02", "buildah", "step-sbom-syft-generate", "git-init-on-push-dxkcn-build-container-pod", "tekton-ecosystem-tenant", "git-init", "tektoncd-git-clone", "1218", "306", "181", "4", "git-init-on-pull-request-stpsx-build-container-pod", "tekton-ecosystem-tenant", "git-init", "tektoncd-git-clone", "141m", "89m", "85m", "70m"
"kflux-prd-rh02", "buildah", "step-prepare-sboms", "rhobs-synthetics-api-main-oe669bc2d2a9f6185b716e15c45c73d30-pod", "rhobs-synthetics-tenant", "rhobs-synthetics-api-main", "rhobs-synthetics-api-main", "90", "61", "61", "4", "rhobs-synthetics-api-main-odcfe23c9221535dd781c5523269840e3-pod", "rhobs-synthetics-tenant", "rhobs-synthetics-api-main", "rhobs-synthetics-api-main", "172m", "83m", "83m", "172m"
"kflux-prd-rh02", "buildah", "step-upload-sbom", "rhobs-token-ref814d8f6c3d504c9abaecd3ba6d27c8fb6137ede8415c-pod", "rhobs-mco-tenant", "rhobs-token-refresher-main", "rhobs-token-refresher-main", "31", "25", "23", "4", "rhobs-token-refd2f9c53285e9206f0a2c003ddfc560e7112b8ff12ee4-pod", "rhobs-mco-tenant", "rhobs-token-refresher-main", "rhobs-token-refresher-main", "3m", "3m", "3m", "0m"
"kflux-prd-rh03", "buildah", "step-build", "rosa-log-router-processor-go-on-push-7lbbj-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-processor-go", "rosa-log-router", "2263", "486", "434", "256", "rosa-log-routerb7de1582684acbf99107709874b8fb3373a259b8f316-pod", "rosa-log-router-tenant", "rosa-log-router-processor-go", "rosa-log-router", "119m", "119m", "141m", "119m"
"kflux-prd-rh03", "buildah", "step-push", "rosa-log-router-api-on-push-7m4q6-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "291", "111", "105", "59", "rosa-log-router-api-on-push-7m4q6-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "29m", "29m", "29m", "29m"
"kflux-prd-rh03", "buildah", "step-sbom-syft-generate", "rosa-log-router-api-on-push-ljp7m-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "768", "250", "207", "12", "rosa-log-router-api-on-push-ljp7m-build-container-pod", "rosa-log-router-tenant", "rosa-log-router-api", "rosa-log-router", "39m", "39m", "39m", "0m"
"kflux-prd-rh03", "buildah", "step-prepare-sboms", "rosa-clusters-service-main-on-push-889lb-build-container-pod", "ocm-tenant", "rosa-clusters-service-main", "rosa-clusters-service-main", "10", "5", "5", "5", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-prd-rh03", "buildah", "step-upload-sbom", "app-on-push-mrv5z-build-image-2-pod", "rosa-log-router-tenant", "app", "konflux-test", "11", "5", "5", "5", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-build", "rpm-repo-mappin7c36db7ab6cab31e64ae665990a9f1b96bf4f8f6758a-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "909", "347", "347", "134", "", "N/A", "N/A", "N/A", "0m", "0m", "38m", "38m"
"kflux-rhel-p01", "buildah", "step-push", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "32", "25", "21", "4", "rpm-repo-mappin7c36db7ab6cab31e64ae665990a9f1b96bf4f8f6758a-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "13m", "0m", "0m", "13m"
"kflux-rhel-p01", "buildah", "step-sbom-syft-generate", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "10", "4", "60", "4", "rpm-repo-mappina2b8f502f2584fa3f294c756793add4cd5635a0aef42-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-compare-with-plm-container", "tooling", "27m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-prepare-sboms", "rpm-repo-mappin621a7e0622d34579df9c44fdbe4e3e18c4a941ed50a0-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-utils-container", "tooling", "12", "4", "4", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"kflux-rhel-p01", "buildah", "step-upload-sbom", "rpm-repo-mappinc5b4e29486b9f825221d0577ddb40671ff7999d7add9-pod", "rhel-on-konflux-tenant", "rpm-repo-mapping-compare-with-plm-container", "tooling", "10", "4", "4", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-prd-rh01", "buildah", "step-build", "aus-cli-main-on-pull-request-j9bbl-build-container-pod", "app-sre-tenant", "aus-cli-main", "aus-cli-main", "8192", "8191", "8190", "8180", "ocm-cli-on-pull-request-g4lkv-build-container-pod", "ocm-cli-clients-tenant", "ocm-cli", "ocm-cli", "4195m", "4195m", "4195m", "4195m"
"stone-prd-rh01", "buildah", "step-push", "mintmaker-renovate-image-onb8d55b89a25a8b55af0bd2d8c05e9224-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "4096", "495", "494", "89", "mintmaker-renovate-image-onb8d55b89a25a8b55af0bd2d8c05e9224-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "440m", "350m", "276m", "208m"
"stone-prd-rh01", "buildah", "step-sbom-syft-generate", "mintmaker-renovate-image-on-push-w8f7h-build-container-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "4096", "1269", "716", "43", "mintmaker-renovate-image-onc26469ef381800b70bc95f9761b2eb32-pod", "konflux-mintmaker-tenant", "mintmaker-renovate-image", "mintmaker-renovate-image", "971m", "705m", "585m", "281m"
"stone-prd-rh01", "buildah", "step-prepare-sboms", "rhsm-api-proxy-on-pull-request-4tbbp-build-container-pod", "teamnado-konflux-tenant", "rhsm-api-proxy", "rhsm-api-proxy", "189", "154", "154", "33", "rhsm-api-proxy-on-pull-request-2t5g4-build-container-pod", "teamnado-konflux-tenant", "rhsm-api-proxy", "rhsm-api-proxy", "228m", "221m", "233m", "221m"
"stone-prd-rh01", "buildah", "step-upload-sbom", "notifications-c8680fe203b4ad4f623fb7c28374f6a8b5a1d368512e2-pod", "hcc-integrations-tenant", "notifications-connector-splunk", "notifications", "45", "30", "26", "17", "cluster-observa84ddd62f3ee5a93dbb032cc519c930a852f4967f3e97-pod", "cluster-observabilit-tenant", "cluster-observability-operator-bundle-1-3", "cluster-observability-operator-1-3", "5m", "5m", "5m", "5m"
"stone-prod-p02", "buildah", "step-build", "uhc-clusters-sea97dd9b79f64f9d791de081214fa616a3df0a5d0c5fb-pod", "ocm-tenant", "uhc-clusters-service-master", "uhc-clusters-service-master", "8193", "5252", "4942", "3431", "uhc-clusters-sed0fbccabe1f1f96904b2605ada059625dd25648eb671-pod", "ocm-tenant", "uhc-clusters-service-master", "uhc-clusters-service-master", "4159m", "4236m", "4236m", "4236m"
"stone-prod-p02", "buildah", "step-push", "ocmci-on-push-pcnx5-build-container-pod", "ocmci-tenant", "ocmci", "ocm-backend-tests", "2377", "130", "128", "48", "ocmci-on-pull-request-4ktsp-build-container-pod", "ocmci-tenant", "ocmci", "ocm-backend-tests", "147m", "138m", "130m", "107m"
"stone-prod-p02", "buildah", "step-sbom-syft-generate", "lifecycle-api-on-pull-request-5m9px-build-container-pod", "plcm-tenant", "lifecycle-api", "plcm", "2834", "741", "416", "174", "lifecycle-api-on-pull-request-5m9px-build-container-pod", "plcm-tenant", "lifecycle-api", "plcm", "116m", "116m", "151m", "151m"
"stone-prod-p02", "buildah", "step-prepare-sboms", "product-experience-apps-on-push-s88nz-build-container-pod", "cpla-tenant", "product-experience-apps", "product-experience-apps", "99", "65", "36", "7", "product-experience-apps-on-push-s88nz-build-container-pod", "cpla-tenant", "product-experience-apps", "product-experience-apps", "5m", "5m", "5m", "3m"
"stone-prod-p02", "buildah", "step-upload-sbom", "fbc-4-19-on-push-75dwm-build-container-pod", "rhdh-tenant", "fbc-4-19", "fbc-4-19", "42", "29", "28", "20", "fbc-4-20-on-push-jth4l-build-container-pod", "rhdh-tenant", "fbc-4-20", "fbc-4-20", "3m", "5m", "3m", "5m"
"stone-stg-rh01", "buildah", "step-build", "konflux-tests-on-push-shd2v-build-container-pod", "rh-ee-athorp-tenant", "konflux-tests", "test-first-application", "6", "0", "0", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-push", "", "N/A", "N/A", "N/A", "0", "0", "0", "4", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-sbom-syft-generate", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-prepare-sboms", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
"stone-stg-rh01", "buildah", "step-upload-sbom", "", "N/A", "N/A", "N/A", "0", "0", "0", "0", "", "N/A", "N/A", "N/A", "0m", "0m", "0m", "0m"
real 18m6.323s
user 5m46.391s
sys 1m58.815sCc: @jhutar |
Summary: Add analyze_resource_limits.py script to analyze resource consumption data and provide recommendations for Kubernetes resource limits based on P95 percentile analysis with configurable safety margins. Description: This commit introduces a new tool for analyzing resource consumption data collected from Prometheus and generating recommendations for Kubernetes resource limits. The tool addresses the need to optimize resource allocation in Tekton tasks based on actual usage patterns. Key Features: - Analyzes CSV data from wrapper_for_promql_for_all_clusters.sh - Calculates recommendations using P95 percentile + configurable margin (default 10%) - Automatically rounds memory to standard Kubernetes values: * Values < 1Gi: Rounds to nearest power of 2 (32Mi, 64Mi, 128Mi, 256Mi, 512Mi) * Values >= 1Gi: Rounds to whole Gi values (1Gi, 2Gi, 3Gi, etc.) - Always formats CPU values in millicores for consistency - Can parse Tekton Task YAML files (local or from GitHub URLs) - Automatically extracts task name and step names from YAML - Optionally runs data collection automatically - Can update YAML files with recommended resource limits Usage Examples: # From piped input: ./wrapper_for_promql_for_all_clusters.sh 7 --csv | ./analyze_resource_limits.py # From YAML file (auto-runs data collection): ./analyze_resource_limits.py --file /path/to/buildah.yaml ./analyze_resource_limits.py --file https://github.com/.../buildah.yaml # Update YAML file with recommendations: ./analyze_resource_limits.py --file /path/to/buildah.yaml --update The tool uses intelligent rounding to ensure resource limits are set to standard Kubernetes values while ensuring we don't go below the recommended values calculated from actual usage data. Files Changed: - analyze_resource_limits.py: New script for resource limit analysis - README.md: Updated with documentation for the new tool Generated-by: Cursor-AI
This change fixes a fundamental accuracy issue in memory percentile computation
(p95 / p90 / median) caused by per-batch aggregation in shell.
Previously, percentiles were calculated independently for each batch of pods
and then merged by taking the maximum across batches. This approach is
mathematically incorrect and could significantly overestimate memory pressure,
especially when batch sizes or pod lifetimes varied.
Key changes:
representative memory pressure
Results:
While the MAX Mem usage data is correct in both the cases, but from the above data we can see that the accuracy of 95 percentile, 90th percentile & median data has improved a lot.