Getting Started with ScraperUtils

This guide will help you get started with ScraperUtils for your PlanningAlerts scraper.

Installation

Add this line to your scraper's Gemfile, below the existing scraperwiki gem:

# Below this existing line:
gem "scraperwiki", git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"

# Add:
gem "scraper_utils"

And then execute:

bundle install

Environment Variables

MORPH_AUSTRALIAN_PROXY

On morph.io, set the environment variable MORPH_AUSTRALIAN_PROXY to http://morph:[email protected]:8888, replacing password with the real password. Alternatively, supply your own Australian proxy details when testing locally.

MORPH_EXPECT_BAD

To stop morph.io complaining about sites that are known to be bad but that you still want tested, list them in MORPH_EXPECT_BAD, for example (the authority names here are illustrative):

export MORPH_EXPECT_BAD=broken_authority,flakey_authority

MORPH_AUTHORITIES

For multi-authority scrapers, optionally filter which authorities run via this environment variable, set on morph.io under scraper settings or in your dev environment:

export MORPH_AUTHORITIES=noosa,wagga
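A scraper can apply this filter with plain Ruby before looping over its authorities. This is a minimal sketch: the AUTHORITIES hash, its URLs, and the selected_authorities helper are illustrative names for this example, not part of the scraper_utils API.

```ruby
# Illustrative registry of the authorities this scraper knows about.
AUTHORITIES = {
  noosa: { url: "https://noosa.example.gov.au" },
  wagga: { url: "https://wagga.example.gov.au" },
  brisbane: { url: "https://brisbane.example.gov.au" }
}.freeze

# Return the authority keys selected by MORPH_AUTHORITIES,
# or all of them when the variable is unset or blank.
def selected_authorities
  filter = ENV["MORPH_AUTHORITIES"]
  return AUTHORITIES.keys if filter.nil? || filter.strip.empty?

  wanted = filter.split(",").map { |name| name.strip.to_sym }
  unknown = wanted - AUTHORITIES.keys
  raise "Unknown authorities: #{unknown.join(', ')}" unless unknown.empty?

  wanted
end
```

With MORPH_AUTHORITIES=noosa,wagga set, selected_authorities returns only [:noosa, :wagga]; raising on unknown names catches typos early rather than silently scraping nothing.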

DEBUG

Optionally enable verbose debugging messages when developing:

export DEBUG=1 # 1 for basic, 2 for verbose, 3 for tracing nearly everything

Example Scraper Implementation

Update your scraper.rb as per the example scraper.
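The overall shape is a loop over the selected authorities that rescues and collects exceptions per authority, so one broken site does not abort the whole run. This sketch shows that pattern only; the comments mark where the ScraperUtils::LogUtils calls described below would slot in (their exact arguments are shown in the example scraper, not here), and the per-authority scraping logic is supplied by the caller as a block.

```ruby
# Run the supplied block once per authority, collecting any exceptions
# keyed by authority instead of letting them abort the run.
def scrape(authorities)
  exceptions = {}
  authorities.each do |authority|
    puts "Collecting feed data for #{authority}..."
    begin
      yield authority # your per-authority scraping logic
      # ScraperUtils::LogUtils.log_scraping_run would record this
      # authority's results here (see the example scraper for arguments).
    rescue StandardError => e
      warn "#{authority}: ERROR: #{e}"
      exceptions[authority] = e
    end
  end
  # ScraperUtils::LogUtils.save_summary_record would summarise the
  # whole run here (see the example scraper for arguments).
  exceptions
end
```

A caller would then do something like `exceptions = scrape(selected_authorities) { |authority| ... }` and inspect `exceptions` to decide whether to raise at the end of the run.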

For more advanced implementations, see the Interleaving Requests documentation.

Logging Tables

The following logging tables are created to help you monitor failure patterns and debug issues. Records are automatically cleared after 30 days.

The ScraperUtils::LogUtils.log_scraping_run call also logs the information to the scrape_log table.

The ScraperUtils::LogUtils.save_summary_record call also logs the information to the scrape_summary table.

Next Steps