ingester: cache list of files, speedup on large queue #1850
nuclearcat merged 1 commit into kernelci:main from
Conversation
Pull request overview
Optimizes the ingester’s spool monitoring loop to avoid repeatedly scanning extremely large flat directories, improving throughput when the queue contains millions of files.
Changes:
- Add a per-process cache of discovered .json spool files and process them in chunks per cycle.
- Introduce INGEST_CYCLE_BATCH_SIZE (env-configurable) to control how many cached files are processed each loop iteration.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| backend/kernelCI_app/management/commands/monitor_submissions.py | Cache scandir results and process cached entries in batches to reduce repeated directory enumeration. |
| backend/kernelCI_app/constants/ingester.py | Add INGEST_CYCLE_BATCH_SIZE constant sourced from environment with basic parsing fallback. |
Force-pushed from 936f5b6 to 1adc362
Pull request overview
Optimizes the submissions ingester loop to avoid repeatedly scanning extremely large spool directories by caching scan results and processing files in configurable-sized chunks per cycle.
Changes:
- Add INGEST_CYCLE_BATCH_SIZE to control how many cached files are processed per monitoring cycle.
- Refactor monitor_submissions to scan the spool directory only when the cache is depleted, and process cached paths in batches.
- Switch ingestion/performance-test plumbing from os.DirEntry objects to plain path strings.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| backend/kernelCI_app/tests/performanceTests/test_ingest_perf.py | Updates perf tests to pass file path strings instead of DirEntry objects. |
| backend/kernelCI_app/management/commands/monitor_submissions.py | Adds cached scanning + per-cycle batch processing; refactors Prometheus setup and scan handling. |
| backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py | Updates ingestion API to accept list[str] and builds metadata using basename/getsize. |
| backend/kernelCI_app/constants/ingester.py | Introduces configurable INGEST_CYCLE_BATCH_SIZE env var with default of 50,000. |
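The switch from os.DirEntry to plain path strings, with metadata rebuilt via basename/getsize, could be sketched as follows (the function name `build_metadata` and the dict keys are assumptions for illustration, not the actual kcidbng_ingester.py API):

```python
import os

def build_metadata(paths: list[str]) -> list[dict]:
    """Build per-file metadata from plain path strings instead of
    os.DirEntry objects, using basename/getsize as the review describes."""
    metadata = []
    for path in paths:
        metadata.append({
            "name": os.path.basename(path),  # file name without directory
            "size": os.path.getsize(path),   # size in bytes via stat()
        })
    return metadata
```

Plain strings are trivially picklable and constructible in tests, whereas os.DirEntry objects can only come from a live os.scandir() call.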
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Before: Every 5-second cycle calls os.scandir() on the 2M-entry directory. Each scan enumerates all entries via readdir(), which is extremely slow on a flat directory that large. Might take 20-30 seconds.

After:
1. scandir() runs once, caching all .json entries
2. Each cycle pops a chunk of up to INGEST_CYCLE_BATCH_SIZE (default 50,000) files from the cache and processes them
3. Re-scan only happens when the cache is fully drained
4. On scandir error, cache is cleared → forces re-scan next cycle

With 2M files: one scan instead of ~40 scans (2M / 50K chunks = 40 cycles of scan-free processing). If each scandir of 2M entries takes ~30 seconds, that saves ~20 minutes of pure directory enumeration overhead.

The INGEST_CYCLE_BATCH_SIZE (50k default) is configurable via env var if you want to tune the chunk size.

Also note this fixed a latent bug in the old code where json_files would retain stale data from the previous iteration if scandir threw an exception (the except: pass didn't reset it).

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
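The cached-scan loop described in this comment can be sketched as below. This is a minimal illustration, not the actual monitor_submissions code: the names `_scan_spool`, `run_cycle`, `_cached_files`, and the `process_batch` callback are all hypothetical.

```python
import os

INGEST_CYCLE_BATCH_SIZE = 50_000   # default chunk size per cycle

_cached_files: list[str] = []      # per-process cache of pending .json paths

def _scan_spool(spool_dir: str) -> list[str]:
    """Enumerate the spool directory once, keeping only .json entries."""
    try:
        with os.scandir(spool_dir) as entries:
            return [e.path for e in entries
                    if e.is_file() and e.name.endswith(".json")]
    except OSError:
        # Returning a fresh empty list forces a re-scan next cycle and
        # avoids the stale-data bug the old `except: pass` had.
        return []

def run_cycle(spool_dir: str, process_batch,
              batch_size: int = INGEST_CYCLE_BATCH_SIZE) -> None:
    """One monitoring cycle: re-scan only when the cache is drained,
    then hand at most one batch of paths to the ingester."""
    global _cached_files
    if not _cached_files:
        _cached_files = _scan_spool(spool_dir)
    batch = _cached_files[:batch_size]
    del _cached_files[:batch_size]
    if batch:
        process_batch(batch)
```

With a 2M-entry cache and a 50K batch size, this gives the ~40 scan-free cycles per directory enumeration that the comment describes.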
Force-pushed from 1adc362 to 90e3c0e
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.