ScraperUtils (Ruby)
Utilities to help make planningalerts scrapers, especially multi-authority ones, easier to develop, run and debug.
For Server Administrators
The ScraperUtils library is designed to be a respectful citizen of the web. If you're a server administrator and notice our scraper accessing your systems, here's what you should know:
We play nice with your servers
Our goal is to access public planning information with minimal impact on your services. The following features are on by default:
Limit server load:
- We limit the maximum load we present to your server to less than half of one of your CPU cores
- The more loaded your server is, the longer we wait between requests!
- We respect Crawl-delay from robots.txt (see the section below), so you can tell us an acceptable request rate
- Scraper developers can (see the sketch below):
  - reduce the max_load we present to your server even further
  - add random extra delays to give your server a chance to catch up with background tasks
Identify themselves: Our user agent clearly indicates who we are and provides a link to the project repository:

```
Mozilla/5.0 (compatible; ScraperUtils/0.2.0 2025-02-22; +https://github.com/ianheggie-oaf/scraper_utils)
```
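For example, a scraper developer might tune these politeness settings when creating an agent. This is an illustrative sketch only: the `mechanize_agent` factory and the `max_load` and `random_delay` parameter names are assumptions based on the features listed above, not a confirmed API.

```ruby
require "scraper_utils"

# Illustrative sketch - method and parameter names are assumptions,
# not a confirmed API; see the developer documentation below.
agent = ScraperUtils::MechanizeUtils.mechanize_agent(
  max_load: 20,    # present at most ~20% of one CPU core to the server
  random_delay: 5  # add up to ~5 seconds of random extra delay per request
)
```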
How to Control Our Behavior
Our scraper utilities respect the standard robots.txt control mechanisms by default. To control our access:
- Add a section for our user agent: `User-agent: ScraperUtils`
- Set a crawl delay, e.g. `Crawl-delay: 20`
- If needed, specify disallowed paths: `Disallow: /private/`
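Put together, a robots.txt section that slows us to one request every 20 seconds and keeps us out of /private/ looks like:

```
User-agent: ScraperUtils
Crawl-delay: 20
Disallow: /private/
```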
For Scraper Developers
In addition to the well-behaved defaults described above, we provide utilities to make developing, running and debugging your scraper easier.
Installation & Configuration
Add to your scraper's Gemfile:
gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
gem 'scraper_utils'
For detailed setup and configuration options, see the Getting Started guide.
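After `bundle install`, a scraper typically just requires both gems at the top of `scraper.rb` (a minimal sketch; see the Getting Started guide for the full setup):

```ruby
require "scraperwiki"
require "scraper_utils"
```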
Key Features
Well-Behaved Web Client
- Configure Mechanize agents with sensible defaults (see the sketch below)
- Automatic rate limiting based on server response times
- Supports robots.txt and crawl-delay directives
- Supports extra actions required to get to results page
- Learn more about Mechanize utilities
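As a sketch of the intended flow (assuming a `mechanize_agent` factory as in the earlier example; check the Mechanize utilities guide for the exact API), the configured agent applies the politeness behaviour for you:

```ruby
agent = ScraperUtils::MechanizeUtils.mechanize_agent

# Rate limiting, robots.txt checks and the identifying user agent are
# applied by the configured agent, not by your scraping code.
page = agent.get("https://example.com/planning/applications") # hypothetical URL
```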
Optimize Server Load
- Intelligent date range selection (reducing server load by up to 60%)
- Cycle utilities for rotating search parameters
- Learn more about reducing server load (a sketch of the idea follows below)
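A plain-Ruby sketch of the date range idea (this illustrates the concept only, not the gem's API): always re-check the most recent days, and re-check older ranges less often:

```ruby
require "date"

# Always re-check the last few days; re-check the older range only
# every third day - roughly the idea behind the "up to 60%" saving.
ranges = [[Date.today - 4, Date.today]]
ranges << [Date.today - 30, Date.today - 5] if Date.today.yday % 3 == 0

ranges.each do |from, to|
  puts "search applications lodged between #{from} and #{to}"
end
```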
Improve Scraper Efficiency
- Interleaves requests to optimize run time
- Use `ScraperUtils::Scheduler.execute_request` so Mechanize network requests are performed by threads in parallel (see the example below)
- Learn more about parallel requests in the Usage section
- Randomize processing order for more natural request patterns
- Learn more about randomizing requests in the Usage section
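For example (the call shape shown is our best reading of the API; see the Usage section for authoritative instructions), a per-authority scrape loop hands each network call to the scheduler so other authorities' requests can proceed while this one waits:

```ruby
# Assumed call shape: (client, method_name, array_of_args).
# Check the Usage section for the authoritative signature.
page = ScraperUtils::Scheduler.execute_request(agent, :get, [url])
```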
Error Handling & Quality Monitoring
- Record-level error handling with appropriate thresholds (see the sketch below)
- Data quality monitoring during scraping
- Detailed logging and reporting
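The record-level pattern is roughly as follows (a plain-Ruby sketch of the idea, with hypothetical `records` and `save_record` placeholders, not the gem's specific API):

```ruby
errors = 0
records.each do |record|  # `records` is a hypothetical collection
  begin
    save_record(record)   # hypothetical per-record save helper
  rescue StandardError => e
    errors += 1
    warn "Skipping record: #{e.message}"
    # A few bad records are tolerable; many suggest the site changed.
    raise if errors > 5
  end
end
```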
Developer Tools
- Enhanced debugging utilities (see the sketch below)
- Simple logging with authority context
- Learn more about debugging
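A minimal sketch of the kind of ENV-driven debug output this enables (the variable and ENV names here are illustrative assumptions, not the gem's API):

```ruby
DEBUG = !ENV["DEBUG"].to_s.empty? # e.g. run with DEBUG=1

authority = :example_authority    # hypothetical authority name
puts "[#{authority}] fetching index page..." if DEBUG
```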
API Documentation
Complete API documentation for scraper_utils is available at RubyDoc.info.
Ruby Versions
This gem is designed to be compatible with Ruby versions supported by morph.io:
- Ruby 3.2.2 - requires the `platform` file in the scraper to contain `heroku_18`
- Ruby 2.5.8 - `heroku_16` (the default)
Development
After checking out the repo, run `bin/setup` to install dependencies. Then run `rake test` to run the tests. To install this gem onto your local machine, run `bundle exec rake install`.
Contributing
Bug reports and pull requests with working tests are welcome on GitHub at ianheggie-oaf/scraper_utils.
License
The gem is available as open source under the terms of the MIT License.