Github Crawler optimization #410

Open

Alexandre-DSL wants to merge 3 commits into main from crawler-optimization

Conversation

@Alexandre-DSL
Contributor

I observed in the logs that, because of the way the crawler is scheduled, the API rate limit is sometimes still reached, and instead of retrying after a few minutes, we wait a full hour.

Since this is not optimal when there are many repositories to crawl, I changed the cron schedule to run every 5 minutes between minutes 00 and 49 of the hour, leaving minutes 50 to 59 free for the data service to proceed without any risk of collision.
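For reference, here is a minimal sketch of that schedule, assuming a node-cron-style expression (the callback name `crawlRoundIfBudgetAllows` is hypothetical, standing in for the actual scheduler entry point):

```ts
import cron from "node-cron";

// Fire at minutes 0, 5, 10, ..., 45 of every hour. Minutes 50-59
// are deliberately left free so the data service can run without
// any risk of colliding with a crawling round.
cron.schedule("0-49/5 * * * *", async () => {
  await crawlRoundIfBudgetAllows(); // hypothetical entry point
});
```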

I implemented a service method that actively queries the API rate limit and checks how many API calls we have left. In addition, I added a safety margin of 10 calls to reduce how often we actually break the API rate limit. Due to the nature of the GitHub Crawler service, we might still reach the limit because we don't check at every possible occasion (a refactor would have been too time-intensive).
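As a rough illustration, such a check could look like the following with Octokit (the client setup and the helper name `remainingApiCalls` are assumptions, not the actual service code):

```ts
import { Octokit } from "@octokit/rest";

const SAFETY_MARGIN = 10; // keep 10 calls in reserve, as described above

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Ask the API itself how many core calls remain in the current
// rate-limit window, minus the safety margin.
async function remainingApiCalls(): Promise<number> {
  const { data } = await octokit.rest.rateLimit.get();
  return data.resources.core.remaining - SAFETY_MARGIN;
}
```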

I separated the cron job out of the prepareInstitutions function to make it clearer what the actual scheduler is and which conditions trigger a crawling round (of up to 5000 - 10 API calls).
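Conceptually, the separated scheduler could look like the sketch below (`scheduledTick` is a hypothetical name, and it reuses the `remainingApiCalls` helper sketched above):

```ts
// The cron callback is now only a scheduler: it checks the API
// budget and decides whether a crawling round should start at all.
async function scheduledTick(): Promise<void> {
  const budget = await remainingApiCalls();
  if (budget <= 0) {
    console.log("No API budget left, skipping this tick.");
    return; // the next tick retries in 5 minutes instead of waiting an hour
  }
  await prepareInstitutions(); // the actual crawling logic lives here
}
```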

Finally, I fixed some typos in the logs and added a log line mentioning which institution is being crawled at a given moment.

