ingester: cache list of files, speedup on large queue #1850
nuclearcat merged 1 commit into kernelci:main from
Conversation
Pull request overview
Optimizes the ingester’s spool monitoring loop to avoid repeatedly scanning extremely large flat directories, improving throughput when the queue contains millions of files.
Changes:
- Add a per-process cache of discovered .json spool files and process them in chunks per cycle.
- Introduce INGEST_CYCLE_BATCH_SIZE (env-configurable) to control how many cached files are processed each loop iteration.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| backend/kernelCI_app/management/commands/monitor_submissions.py | Cache scandir results and process cached entries in batches to reduce repeated directory enumeration. |
| backend/kernelCI_app/constants/ingester.py | Add INGEST_CYCLE_BATCH_SIZE constant sourced from environment with basic parsing fallback. |
Force-pushed from 936f5b6 to 1adc362
Pull request overview
Optimizes the submissions ingester loop to avoid repeatedly scanning extremely large spool directories by caching scan results and processing files in configurable-sized chunks per cycle.
Changes:
- Add INGEST_CYCLE_BATCH_SIZE to control how many cached files are processed per monitoring cycle.
- Refactor monitor_submissions to scan the spool directory only when the cache is depleted, and process cached paths in batches.
- Switch ingestion/performance-test plumbing from os.DirEntry objects to plain path strings.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| backend/kernelCI_app/tests/performanceTests/test_ingest_perf.py | Updates perf tests to pass file path strings instead of DirEntry objects. |
| backend/kernelCI_app/management/commands/monitor_submissions.py | Adds cached scanning + per-cycle batch processing; refactors Prometheus setup and scan handling. |
| backend/kernelCI_app/management/commands/helpers/kcidbng_ingester.py | Updates ingestion API to accept list[str] and builds metadata using basename/getsize. |
| backend/kernelCI_app/constants/ingester.py | Introduces configurable INGEST_CYCLE_BATCH_SIZE env var with default of 50,000. |
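The switch from os.DirEntry to plain path strings, with metadata rebuilt via basename/getsize, could be sketched as follows (the function name `build_metadata` and the dict keys are assumptions for illustration, not the actual kcidbng_ingester.py API):

```python
import os

def build_metadata(paths: list[str]) -> list[dict]:
    """Build per-file metadata from plain path strings instead of
    os.DirEntry objects, using basename/getsize as the review describes."""
    metadata = []
    for path in paths:
        metadata.append({
            "name": os.path.basename(path),  # file name without directory
            "size": os.path.getsize(path),   # size in bytes via stat()
        })
    return metadata
```

Plain strings are trivially picklable and constructible in tests, whereas os.DirEntry objects can only come from a live os.scandir() call.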
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Before: Every 5-second cycle calls os.scandir() on the 2M-entry directory. Each scan enumerates all entries via readdir(), which is extremely slow on a flat directory that large. Might take 20-30 seconds.

After:
1. scandir() runs once, caching all .json entries
2. Each cycle pops a chunk of up to INGEST_CYCLE_BATCH_SIZE (default 50,000) files from the cache and processes them
3. Re-scan only happens when the cache is fully drained
4. On scandir error, cache is cleared → forces re-scan next cycle

With 2M files: one scan instead of ~40 scans (2M / 50K chunks = 40 cycles of scan-free processing). If each scandir of 2M entries takes ~30 seconds, that saves ~20 minutes of pure directory enumeration overhead.

The INGEST_CYCLE_BATCH_SIZE (50k default) is configurable via env var if you want to tune the chunk size.

Also note this fixed a latent bug in the old code where json_files would retain stale data from the previous iteration if scandir threw an exception (the except: pass didn't reset it).

Signed-off-by: Denys Fedoryshchenko <denys.f@collabora.com>
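The cached-scan loop described in this comment can be sketched as below. This is a minimal illustration, not the actual monitor_submissions code: the names `_scan_spool`, `run_cycle`, `_cached_files`, and the `process_batch` callback are all hypothetical.

```python
import os

INGEST_CYCLE_BATCH_SIZE = 50_000   # default chunk size per cycle

_cached_files: list[str] = []      # per-process cache of pending .json paths

def _scan_spool(spool_dir: str) -> list[str]:
    """Enumerate the spool directory once, keeping only .json entries."""
    try:
        with os.scandir(spool_dir) as entries:
            return [e.path for e in entries
                    if e.is_file() and e.name.endswith(".json")]
    except OSError:
        # Returning a fresh empty list forces a re-scan next cycle and
        # avoids the stale-data bug the old `except: pass` had.
        return []

def run_cycle(spool_dir: str, process_batch,
              batch_size: int = INGEST_CYCLE_BATCH_SIZE) -> None:
    """One monitoring cycle: re-scan only when the cache is drained,
    then hand at most one batch of paths to the ingester."""
    global _cached_files
    if not _cached_files:
        _cached_files = _scan_spool(spool_dir)
    batch = _cached_files[:batch_size]
    del _cached_files[:batch_size]
    if batch:
        process_batch(batch)
```

With a 2M-entry cache and a 50K batch size, this gives the ~40 scan-free cycles per directory enumeration that the comment describes.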
Force-pushed from 1adc362 to 90e3c0e
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.