I observed in the logs that, because of the way the crawler is scheduled, the API rate limit is sometimes still reached, and instead of retrying after a few minutes, we wait a full hour.
Since this is not optimal when there are a lot of repositories to crawl, I changed the cron schedule to run every 5 minutes between minutes 00 and 49 of the hour, leaving minutes 50 to 59 free for the data service to proceed without any risk of collision.
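For illustration, here is a minimal sketch of such a schedule, assuming a node-cron style scheduler (the actual project setup may differ); the callback body is only a placeholder:

```ts
import cron from "node-cron";

// Minutes 0-49, every 5 minutes: runs at 00, 05, 10, ..., 45.
// Minutes 50-59 are intentionally left free for the data service.
cron.schedule("0-49/5 * * * *", () => {
  console.log(`Starting a crawling round at ${new Date().toISOString()}`);
});
```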
I implemented a service method that actively queries the API rate limit and checks how many API calls we have left. In addition, I added a safety margin of 10 calls to reduce the time during which we actually break the API rate limit. Due to the nature of the GitHub Crawler service, we might still reach the limit because we don't check at every possible occasion (a refactor would have been too time-intensive).
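A rough sketch of what such a rate-limit check could look like, assuming Octokit is used for API access; `remainingApiCalls` and `canStartCrawlingRound` are illustrative names, not the actual service method:

```ts
import { Octokit } from "@octokit/rest";

const SAFETY_MARGIN = 10; // keep 10 calls in reserve to avoid breaching the limit

// Ask the GitHub API how many calls are left in the current window,
// minus the safety margin (never negative).
async function remainingApiCalls(octokit: Octokit): Promise<number> {
  const { data } = await octokit.rest.rateLimit.get();
  return Math.max(data.resources.core.remaining - SAFETY_MARGIN, 0);
}

// A crawling round should only start if there is budget left.
async function canStartCrawlingRound(octokit: Octokit): Promise<boolean> {
  return (await remainingApiCalls(octokit)) > 0;
}
```

Conveniently, querying the rate-limit endpoint does not itself count against the core rate limit, so this check is essentially free.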
I separated the cron job out of the prepareInstitutions function to make it clearer what the actual scheduler is and which conditions trigger a crawling round (of 5000 − 10 API calls, i.e. the hourly limit minus the safety margin). Finally, I fixed some typos in the logs and added a log line that mentions which institution is being crawled at a given moment.
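To make the split concrete, here is a sketch of the separation between scheduler and crawling logic, reusing the hypothetical helpers above; `loadInstitutions` and `crawlInstitution` are placeholders for the existing crawler internals:

```ts
interface Institution {
  name: string;
}

// Hypothetical stand-ins for the real crawler internals.
declare const octokit: Octokit;
declare function loadInstitutions(): Promise<Institution[]>;
declare function crawlInstitution(institution: Institution): Promise<void>;

// Scheduler: decides only *when* a crawling round may start.
cron.schedule("0-49/5 * * * *", async () => {
  if (!(await canStartCrawlingRound(octokit))) {
    console.log("Rate-limit budget (minus safety margin) exhausted, skipping this round.");
    return;
  }
  await prepareInstitutions();
});

// Crawling logic: no scheduling concerns, and it logs which institution is being crawled.
async function prepareInstitutions(): Promise<void> {
  for (const institution of await loadInstitutions()) {
    console.log(`Crawling institution: ${institution.name}`);
    await crawlInstitution(institution);
  }
}
```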