Testing Custom Scrapers

Custom scrapers often lack proper specs but still need quality validation. This gem provides a simple executable, validate_scraper_data, that checks the quality of your scraped data.

Setup

Add to your Gemfile:

gem "scraper_utils", "~> 0.8.2"
gem "rake", "~> 13.0"

Run bundle install, then copy the example files to your scraper directory:

  • docs/example_custom_Rakefile → Rakefile
  • docs/example_dot_scraper_validation.yml → .scraper_validation.yml
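
For example, assuming the scraper_utils source is checked out beside your scraper (the source paths below are illustrative), the copy can be scripted in Ruby:

require "fileutils"

# Copy the example Rakefile and validation config into the scraper directory.
# Adjust "../scraper_utils" to wherever the gem's docs live on your machine.
FileUtils.cp "../scraper_utils/docs/example_custom_Rakefile", "Rakefile"
FileUtils.cp "../scraper_utils/docs/example_dot_scraper_validation.yml",
             ".scraper_validation.yml"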

Usage

Command Line

Once you have run the scraper (bundle exec ruby scraper.rb), you can run the validation command manually:

# Run with defaults
bundle exec validate_scraper_data

# Custom thresholds
bundle exec validate_scraper_data --geocodable-percentage 80 --description-percentage 60

# For scrapers with bot protection
bundle exec validate_scraper_data --bot-check-expected

Rake Tasks

bundle exec rake           # Run scraper then validate (default)
bundle exec rake test      # Same as above
bundle exec rake scrape    # Just run scraper
bundle exec rake validate  # Just validate existing data
bundle exec rake clean     # Remove data.sqlite
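
The example Rakefile wires these tasks together roughly as follows (a minimal sketch; the shipped docs/example_custom_Rakefile is the authoritative version):

task default: :test

desc "Run the scraper, then validate the resulting data"
task test: %i[scrape validate]

desc "Run the scraper"
task :scrape do
  sh "bundle exec ruby scraper.rb"
end

desc "Validate data.sqlite against the .scraper_validation.yml thresholds"
task :validate do
  sh "bundle exec validate_scraper_data"
end

desc "Remove the scraped database"
task :clean do
  rm_f "data.sqlite"
end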

Configuration

Edit .scraper_validation.yml to set validation thresholds:

geocodable_percentage: 70    # Min % of geocodable addresses
description_percentage: 55   # Min % of reasonable descriptions
info_url_percentage: 75      # Min % of info URLs passing detail checks
bot_check_expected: false    # Set true when detail pages have reCAPTCHA/Cloudflare bot checks
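
Illustratively, a validator might merge these thresholds over built-in defaults, so omitted keys keep their default values. The sketch below shows the idea and is not necessarily how scraper_utils reads the file (YAML.safe_load_file requires Ruby 3.0+):

require "yaml"

DEFAULTS = {
  "geocodable_percentage" => 70,
  "description_percentage" => 55,
  "info_url_percentage" => 75,
  "bot_check_expected" => false
}.freeze

# User-supplied thresholds override the defaults; missing keys keep theirs.
config = DEFAULTS.merge(YAML.safe_load_file(".scraper_validation.yml") || {})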

What Gets Validated

  • Addresses: Checks for proper Australian address format (street type, state, postcode); a simplified sketch follows this list
  • Descriptions: Ensures descriptions aren't placeholders and contain at least three words
  • Info URLs: Validates that URLs return HTTP 200 and that the page contains the expected details
  • Global URLs: Auto-detects when all records share the same info URL and validates that it is accessible
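
As a rough illustration of the address check, a record might count as geocodable when it matches patterns like these (a simplified sketch, not the gem's actual heuristics):

# Simplified patterns for an Australian street address; the real checks
# cover more street types and formats than this.
STREET_TYPE = /\b(Street|St|Road|Rd|Avenue|Ave|Drive|Dr|Lane|Ln|Court|Ct|Place|Pl)\b/i
STATE       = /\b(NSW|VIC|QLD|SA|WA|TAS|NT|ACT)\b/
POSTCODE    = /\b\d{4}\b/

def geocodable?(address)
  [STREET_TYPE, STATE, POSTCODE].all? { |pattern| address.match?(pattern) }
end

geocodable?("12 Example Street, Sydney NSW 2000") # => true
geocodable?("Lot 7")                              # => false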

Integration with CI

Add to your CI pipeline:

bundle install
bundle exec rake test

The validation fails with clear error messages whenever data quality falls below the configured thresholds, helping you catch scraping issues early.