Module: ScraperUtils::SpecSupport

Defined in:
lib/scraper_utils/spec_support.rb

Overview

Methods to support specs

Constant Summary

AUSTRALIAN_STATES =
%w[ACT NSW NT QLD SA TAS VIC WA].freeze
STREET_TYPE_PATTERNS =
[
  /\bAv(e(nue)?)?\b/i,
  /\bB(oulevard|lvd)\b/i,
  /\b(Circuit|Cct)\b/i,
  /\bCl(ose)?\b/i,
  /\bC(our|r)?t\b/i,
  /\bCircle\b/i,
  /\bChase\b/i,
  /\bCr(es(cent)?)?\b/i,
  /\bDr((ive)?|v)\b/i,
  /\bEnt(rance)?\b/i,
  /\bGr(ove)?\b/i,
  /\bH(ighwa|w)y\b/i,
  /\bLane\b/i,
  /\bLoop\b/i,
  /\bParkway\b/i,
  /\bPl(ace)?\b/i,
  /\bPriv(ate)?\b/i,
  /\bParade\b/i,
  /\bR(oa)?d\b/i,
  /\bRise\b/i,
  /\bSt(reet)?\b/i,
  /\bSquare\b/i,
  /\bTerrace\b/i,
  /\bWay\b/i
].freeze
AUSTRALIAN_POSTCODES =
/\b\d{4}\b/.freeze
PLACEHOLDERS =
[
  /no description/i,
  /not available/i,
  /to be confirmed/i,
  /\btbc\b/i,
  %r{\bn/a\b}i
].freeze
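
As a rough illustration of how these patterns behave, the sketch below copies three of them into local variables (the variable names are only for this example) and tries them against a few made-up strings. It also shows that the postcode pattern accepts any standalone four-digit number, so a four-digit street number satisfies it too.

```ruby
# Patterns copied verbatim from the constants above; the local names are illustrative.
court    = /\bC(our|r)?t\b/i
drive    = /\bDr((ive)?|v)\b/i
postcode = /\b\d{4}\b/

# Each pattern covers the full word and its common abbreviations.
court_hits = %w[Court Crt Ct].map { |word| "1 Smith #{word}".match?(court) }
drive_hits = %w[Dr Drive Drv].map { |word| "1 Smith #{word}".match?(drive) }

# The \b word boundaries stop "St" or "Street" matching inside a longer word.
inside_word = "Streetwise Books".match?(/\bSt(reet)?\b/i)

# Any standalone four-digit number counts as a postcode for this pattern.
street_number_hit = "1234 Smith St".match?(postcode)

puts "court: #{court_hits.inspect}, drive: #{drive_hits.inspect}"
puts "inside word: #{inside_word}, street number as postcode: #{street_number_hit}"
```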

Class Method Summary

Class Method Details

.geocodable?(address, ignore_case: false) ⇒ Boolean

Check if an address is likely to be geocodable by analyzing its format. The check is a bit stricter than geocoding actually needs, so specs typically assert that at least 75% of addresses pass rather than all of them.

Parameters:

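  • address (String)

    The address to check

  • ignore_case (Boolean) (defaults to: false)

    If true, the address is upcased before the state, street type and suburb checks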

Returns:

  • (Boolean)

    True if the address appears to be geocodable.
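
One way to apply the "at least 75%" guidance is to compute the share of records that pass and compare it with a threshold. The sketch below is only an assumption about how such a check might look: `percent_passing` is a hypothetical helper, not part of ScraperUtils, and `stand_in_check` is a simplified stand-in so the example runs without the gem. In a real spec the block would call ScraperUtils::SpecSupport.geocodable?(address) instead.

```ruby
# Hypothetical helper (not part of ScraperUtils): percentage of items
# for which the given block returns a truthy value.
def percent_passing(items)
  return 0.0 if items.empty?

  100.0 * items.count { |item| yield(item) } / items.size
end

# Simplified stand-in for the real check, used only to keep this sketch self-contained.
stand_in_check = lambda do |address|
  address.match?(/\bSt(reet)?\b/i) && address.match?(/\b\d{4}\b/) && address.match?(/\bNSW\b/)
end

# Made-up sample data: three well-formed addresses and one placeholder.
addresses = [
  "1 Smith St, Sydney NSW 2000",
  "22 Jones Street, Newtown NSW 2042",
  "Lot 5 Brown Street, Parramatta NSW 2150",
  "Address to be confirmed"
]

percent = percent_passing(addresses) { |address| stand_in_check.call(address) }
puts format("%.0f%% of addresses pass", percent)

# Fail loudly when fewer than 75% of the addresses pass.
raise "too few geocodable addresses" unless percent >= 75
```

Asserting on a percentage rather than on every record lets a few unusual but valid addresses through without failing the whole spec.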



# File 'lib/scraper_utils/spec_support.rb', line 42

def self.geocodable?(address, ignore_case: false)
  return false if address.nil? || address.empty?
  check_address = ignore_case ? address.upcase : address

  # Basic structure check - must have a street name, suburb, state and postcode
  has_state = AUSTRALIAN_STATES.any? { |state| check_address.end_with?(" #{state}") || check_address.include?(" #{state} ") }
  has_postcode = address.match?(AUSTRALIAN_POSTCODES)

  # Using the pre-compiled patterns
  has_street_type = STREET_TYPE_PATTERNS.any? { |pattern| check_address.match?(pattern) }

  has_unit_or_lot = address.match?(/\b(Unit|Lot:?)\s+\d+/i)

  has_suburb_stats = check_address.match?(/(\b[A-Z]{2,}(\s+[A-Z]+)*,?|,\s+[A-Z][A-Za-z ]+)(\s+\d{4})?\s+(#{AUSTRALIAN_STATES.join('|')})\b/)

  if ENV["DEBUG"]
    missing = []
    unless has_street_type || has_unit_or_lot
      missing << "street type / unit / lot"
    end
    missing << "state" unless has_state
    missing << "postcode" unless has_postcode
    missing << "suburb state" unless has_suburb_stats
    puts "  address: #{address} is not geocodable, missing #{missing.join(', ')}" if missing.any?
  end

  (has_street_type || has_unit_or_lot) && has_state && has_postcode && has_suburb_stats
end
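
The suburb and state pattern in the method is case-sensitive, which is where ignore_case matters. The sketch below copies that pattern into a local variable (the name `suburb_state` and the sample addresses are made up for this example) and compares a mixed-case address, a lower-case address, and the lower-case address after upcasing, which is what ignore_case: true does before matching.

```ruby
# State list and pattern copied from the source above; local names are illustrative.
states = %w[ACT NSW NT QLD SA TAS VIC WA]
suburb_state = /(\b[A-Z]{2,}(\s+[A-Z]+)*,?|,\s+[A-Z][A-Za-z ]+)(\s+\d{4})?\s+(#{states.join('|')})\b/

mixed = "1 Smith St, Sydney NSW 2000"
lower = "1 smith st, sydney nsw 2000"

# Comma, capitalised suburb, then an upper-case state code.
mixed_ok = mixed.match?(suburb_state)

# No upper-case suburb or state code, so the pattern finds nothing.
lower_ok = lower.match?(suburb_state)

# Upcasing first (as ignore_case: true does) lets the same address through.
upcased_ok = lower.upcase.match?(suburb_state)

puts "mixed: #{mixed_ok}, lower: #{lower_ok}, upcased: #{upcased_ok}"
```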

.placeholder?(text) ⇒ Boolean

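Check if the text matches any of the PLACEHOLDERS patterns, such as "not available" or "TBC". The text is converted with to_s first, so nil is treated as an empty string.

Returns: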

  • (Boolean)


# File 'lib/scraper_utils/spec_support.rb', line 79

def self.placeholder?(text)
  PLACEHOLDERS.any? { |placeholder| text.to_s.match?(placeholder) }
end

.reasonable_description?(text) ⇒ Boolean

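Check if this looks like a “reasonable” description: it is not a placeholder and has at least three words. The check is a bit stricter than needed, so specs typically assert that at least 75% of descriptions pass rather than all of them.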

Returns:

  • (Boolean)


# File 'lib/scraper_utils/spec_support.rb', line 85

def self.reasonable_description?(text)
  !placeholder?(text) && text.to_s.split.size >= 3
end
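
To see how the placeholder and word-count rules combine, the sketch below wraps local copies of PLACEHOLDERS and the two methods in a hypothetical module, `DescriptionChecks`, so it runs without the gem installed. The sample descriptions are made up for illustration.

```ruby
# Local copies of the constant and methods shown above. The module name
# DescriptionChecks is hypothetical and exists only for this sketch.
module DescriptionChecks
  PLACEHOLDERS = [
    /no description/i,
    /not available/i,
    /to be confirmed/i,
    /\btbc\b/i,
    %r{\bn/a\b}i
  ].freeze

  def self.placeholder?(text)
    PLACEHOLDERS.any? { |placeholder| text.to_s.match?(placeholder) }
  end

  def self.reasonable_description?(text)
    !placeholder?(text) && text.to_s.split.size >= 3
  end
end

samples = [
  "Construction of a two storey dwelling", # six words, no placeholder
  "New dwelling",                          # only two words
  "Details to be confirmed by applicant",  # contains a placeholder phrase
  "N/A",                                   # placeholder
  nil                                      # to_s turns nil into an empty string
]

results = samples.map { |text| DescriptionChecks.reasonable_description?(text) }
samples.zip(results).each { |text, ok| puts "#{text.inspect}: #{ok}" }
```

Only the first sample passes: the others are too short, contain a placeholder phrase, or are empty.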