Interleaving Requests with Scheduler
The ScraperUtils::Scheduler provides a lightweight utility that:
- Works on other authorities while in the delay period for an authority's next request
- Optimizes the total scraper run time
- Allows you to increase the random delay for authorities without undue effect on total run time
- Uses Ruby fibers rather than threads, since fibers are a simpler system and thus easier to get right, understand and debug
- Cycles around the authorities when compliant_mode, max_load and random_delay are disabled
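The interleaving idea in the list above can be sketched with plain Ruby fibers. This is a minimal illustration only, not the ScraperUtils::Scheduler API: the method name `run_interleaved`, its parameters and the `monotonic_now` helper are all hypothetical. Each authority gets a fiber that, instead of sleeping through its delay, yields the time at which it may make its next request, so the loop can work on another authority in the meantime.

```ruby
# Hypothetical sketch of fiber interleaving; NOT the ScraperUtils::Scheduler API.

# Monotonic clock in seconds, so ordering is not affected by wall-clock changes.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Runs a fixed number of pretend "requests" per authority and returns the
# order in which they were made.
def run_interleaved(authorities, requests_per_authority: 2, delay: 0.01)
  log = []
  entries = authorities.to_h do |authority|
    fiber = Fiber.new do
      requests_per_authority.times do |i|
        log << "#{authority}: request #{i + 1}"
        # Hand control back along with the earliest time this fiber may resume.
        Fiber.yield(monotonic_now + delay)
      end
      nil
    end
    [authority, { fiber: fiber, resume_at: monotonic_now }]
  end

  until entries.empty?
    # Pick the authority whose delay expires first.
    authority, entry = entries.min_by { |_, e| e[:resume_at] }
    wait = entry[:resume_at] - monotonic_now
    sleep(wait) if wait.positive?

    next_resume_at = entry[:fiber].resume
    if entry[:fiber].alive?
      entry[:resume_at] = next_resume_at
    else
      entries.delete(authority)
    end
  end
  log
end

puts run_interleaved(%w[authority_a authority_b authority_c])
```

In this sketch the total run time is roughly one authority's delays rather than the sum for all three, because each delay period is spent making requests for the other authorities.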
Implementation
To enable fiber scheduling, change your scrape method as shown in the example scrape with fibers.
Logging with Scheduler
Use ScraperUtils::LogUtils.log instead of puts when logging within the authority processing code.
This prefixes each output line with the authority name, which is needed because the system interleaves the work and thus the output.
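To show why the prefix matters, here is a small sketch of interleaved output with a per-authority prefix. The helper `prefixed_lines` and the prefix format are assumptions made for illustration; they are not the implementation or output format of ScraperUtils::LogUtils.log.

```ruby
# Hypothetical helper, shown only to illustrate the prefixing idea;
# it is not the implementation of ScraperUtils::LogUtils.log.
def prefixed_lines(authority, message)
  message.to_s.lines(chomp: true).map { |line| "[#{authority}] #{line}" }
end

# Simulated output from two authorities whose work is interleaved.
interleaved = [
  prefixed_lines(:authority_a, "Fetching page 1"),
  prefixed_lines(:authority_b, "Fetching page 1"),
  prefixed_lines(:authority_a, "Found 12 records\nFetching page 2")
].flatten

puts interleaved
```

Without the prefix, the two "Fetching page 1" lines would be indistinguishable once the authorities' output is mixed together.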
Testing Considerations
The Scheduler uses ScraperUtils::RandomizeUtils to determine the order of operations. Remember to add the following lines to spec/spec_helper.rb:
ScraperUtils::RandomizeUtils.random = false
ScraperUtils::Scheduler.max_workers = 1
For full details, see the Scheduler.