Github Crawler optimization #410

Open

Alexandre-DSL wants to merge 3 commits into main from crawler-optimization

Conversation

@Alexandre-DSL
Contributor

I observed in the logs that, because of the way the crawler is scheduled, the API rate limit is sometimes still reached, and instead of retrying after a few minutes, we wait a full hour.

Since this is not optimal when there are many repositories to crawl, I changed the cron schedule to run every 5 minutes between minutes 00 and 49 of the hour, leaving minutes 50 to 59 free for the data service to proceed without any risk of collision.
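For reference, here is a minimal sketch of that schedule, assuming a node-cron-style expression (the callback name `crawlRoundIfBudgetAllows` is hypothetical, standing in for the actual scheduler entry point):

```ts
import cron from "node-cron";

// Fire at minutes 0, 5, 10, ..., 45 of every hour. Minutes 50-59
// are deliberately left free so the data service can run without
// any risk of colliding with a crawling round.
cron.schedule("0-49/5 * * * *", async () => {
  await crawlRoundIfBudgetAllows(); // hypothetical entry point
});
```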

I implemented a service method that actively queries the API rate limit and checks how many API calls we have left. In addition, I added a safety margin of 10 calls to reduce how often we actually break the API rate limit. Due to the nature of the GitHub Crawler service, we might still reach the limit because we don't check at every possible occasion (a refactor would have been too time-intensive).
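As a rough illustration, such a check could look like the following with Octokit (the client setup and the helper name `remainingApiCalls` are assumptions, not the actual service code):

```ts
import { Octokit } from "@octokit/rest";

const SAFETY_MARGIN = 10; // keep 10 calls in reserve, as described above

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Ask the API itself how many core calls remain in the current
// rate-limit window, minus the safety margin.
async function remainingApiCalls(): Promise<number> {
  const { data } = await octokit.rest.rateLimit.get();
  return data.resources.core.remaining - SAFETY_MARGIN;
}
```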

I separated the cron job out of the prepareInstitutions function to make it clearer what the actual scheduler is and which conditions trigger a crawling round (of up to 5000 - 10 API calls).
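Conceptually, the separated scheduler could look like the sketch below (`scheduledTick` is a hypothetical name, and it reuses the `remainingApiCalls` helper sketched above):

```ts
// The cron callback is now only a scheduler: it checks the API
// budget and decides whether a crawling round should start at all.
async function scheduledTick(): Promise<void> {
  const budget = await remainingApiCalls();
  if (budget <= 0) {
    console.log("No API budget left, skipping this tick.");
    return; // the next tick retries in 5 minutes instead of waiting an hour
  }
  await prepareInstitutions(); // the actual crawling logic lives here
}
```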

Finally, I fixed some typos in the logs and added a log line mentioning which institution is being crawled at a given moment.

