Class: ScraperUtils::DataQualityMonitor

Inherits:
Object
Defined in:
lib/scraper_utils/data_quality_monitor.rb

Overview

Monitors data quality during scraping by tracking successfully saved records against unprocessable ones. Automatically raises an exception when the number of unprocessable records for an authority exceeds a threshold.

Class Attribute Details

.stats ⇒ Object (readonly)

Returns the value of attribute stats.



# File 'lib/scraper_utils/data_quality_monitor.rb', line 10

def stats
  @stats
end

Class Method Details

.extract_authority(record) ⇒ Object

Extracts authority label and ensures stats are setup for record



# File 'lib/scraper_utils/data_quality_monitor.rb', line 22

def self.extract_authority(record)
  authority_label = (record&.key?("authority_label") ? record["authority_label"] : "").to_sym
  @stats ||= {}
  @stats[authority_label] ||= { saved: 0, unprocessed: 0 }
  authority_label
end
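
The label derivation above can be tried in isolation. The sketch below is an illustration only: `label_for` is a hypothetical helper that copies the expression from `extract_authority`, and the sample records are invented.

```ruby
# Hypothetical helper mirroring the label expression in extract_authority.
def label_for(record)
  (record&.key?("authority_label") ? record["authority_label"] : "").to_sym
end

with_label    = label_for({ "authority_label" => "example_council" })
without_label = label_for({ "address" => "1 Example St" })
nil_record    = label_for(nil)

p with_label    # :example_council
p without_label # :"" (the empty symbol)
p nil_record    # :"" (safe navigation turns nil into the fallback)
```

Note that the lookup uses the string key "authority_label", so a record keyed with the symbol :authority_label would also fall back to the empty label.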

.log_saved_record(record) ⇒ void

This method returns an undefined value.

Logs a successfully saved record

Parameters:

  • record (Hash)

    The record that was saved



# File 'lib/scraper_utils/data_quality_monitor.rb', line 57

def self.log_saved_record(record)
  authority_label = extract_authority(record)
  @stats[authority_label][:saved] += 1
  ScraperUtils::LogUtils.log "Saving record #{authority_label} - #{record['address']}"
end

.log_unprocessable_record(exception, record) ⇒ void

This method returns an undefined value.

Logs an unprocessable record and raises an exception if the error threshold is exceeded. The threshold is 5 + 10% of saved records (computed in code as 5.01 + 0.1 × saved).

Parameters:

  • exception (Exception)

    The exception that caused the record to be unprocessable

  • record (Hash, nil)

    The record that couldn’t be processed

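Raises:

  • (ScraperUtils::UnprocessableSite)

    When the unprocessed count for the authority exceeds the threshold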



# File 'lib/scraper_utils/data_quality_monitor.rb', line 40

def self.log_unprocessable_record(exception, record)
  authority_label = extract_authority(record)
  @stats[authority_label][:unprocessed] += 1
  ScraperUtils::LogUtils.log "Erroneous record #{authority_label} - #{record&.fetch(
    'address', nil
  ) || record.inspect}: #{exception}"
  return unless @stats[authority_label][:unprocessed] > threshold(authority_label)

  raise ScraperUtils::UnprocessableSite,
        "Too many unprocessable_records for #{authority_label}: " \
        "#{@stats[authority_label].inspect} - aborting processing of site!"
end
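
A minimal sketch of how a scraper loop might call the two logging methods, assuming records that fail validation are rescued one at a time. So that it runs without the gem, it uses a hypothetical stand-in `MiniMonitor` and a local `UnprocessableSite` class that mirror the counters and threshold shown above. The record data and the `ArgumentError` validation are invented for illustration.

```ruby
# Hypothetical stand-in mirroring the documented counters and threshold.
class UnprocessableSite < StandardError; end

module MiniMonitor
  @stats = {}

  def self.saved(label)
    (@stats[label] ||= { saved: 0, unprocessed: 0 })[:saved] += 1
  end

  def self.unprocessable(label)
    stats = (@stats[label] ||= { saved: 0, unprocessed: 0 })
    stats[:unprocessed] += 1
    return unless stats[:unprocessed] > 5.01 + (stats[:saved] * 0.1)

    raise UnprocessableSite, "Too many unprocessable records for #{label}: #{stats.inspect}"
  end
end

# Ten invented records, every one of which fails validation.
records = Array.new(10) { |i| { "id" => i, "address" => nil } }
processed = 0
aborted = false

begin
  records.each do |record|
    processed += 1
    begin
      raise ArgumentError, "missing address" if record["address"].nil?
      MiniMonitor.saved(:example_council)
    rescue ArgumentError
      # Count the failure; this raises once the threshold is exceeded.
      MiniMonitor.unprocessable(:example_council)
    end
  end
rescue UnprocessableSite
  aborted = true
end

puts "processed=#{processed} aborted=#{aborted}"
```

With nothing saved, the threshold is 5.01, so the sixth failure aborts the loop and the remaining four records are never touched.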

.start_authority(authority_label) ⇒ Object

Notes the start of processing an authority and clears any previous stats

Parameters:

  • authority_label (Symbol)

    The authority we are processing



# File 'lib/scraper_utils/data_quality_monitor.rb', line 16

def self.start_authority(authority_label)
  @stats ||= {}
  @stats[authority_label] = { saved: 0, unprocessed: 0 }
end
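
The difference between this reset and the lazy setup in `extract_authority` comes down to `=` versus `||=`. A small sketch on a plain hash (the counts and the `:example_council` label are invented):

```ruby
stats = { example_council: { saved: 12, unprocessed: 3 } }

# extract_authority style: create counters only if they are missing.
stats[:example_council] ||= { saved: 0, unprocessed: 0 }
kept = stats[:example_council].dup

# start_authority style: always overwrite, discarding earlier counts.
stats[:example_council] = { saved: 0, unprocessed: 0 }
reset = stats[:example_council]

p kept  # counters survive the ||= assignment
p reset # counters are back to zero after the plain assignment
```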

.threshold(authority_label) ⇒ Object

Returns the unprocessed-record limit for the authority: 5.01 plus 10% of its saved records, or nil when no stats exist for it.



# File 'lib/scraper_utils/data_quality_monitor.rb', line 29

def self.threshold(authority_label)
  5.01 + (@stats[authority_label][:saved] * 0.1) if @stats&.fetch(authority_label, nil)
end
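
To make the arithmetic concrete, the sketch below evaluates the same formula for a few saved counts. `threshold_for` and `failures_to_abort` are hypothetical standalone helpers, not part of the library. `failures_to_abort` relies on the threshold never being a whole number for an integer saved count.

```ruby
# Hypothetical standalone version of the threshold formula.
def threshold_for(saved)
  5.01 + (saved * 0.1)
end

# Smallest unprocessed count that exceeds the threshold and triggers the raise.
def failures_to_abort(saved)
  threshold_for(saved).floor + 1
end

[0, 10, 50, 100].each do |saved|
  puts format("saved=%-3d threshold=%.2f abort at failure #%d",
              saved, threshold_for(saved), failures_to_abort(saved))
end
```

So a site with no saved records aborts on the 6th failure, while one with 100 saved records tolerates 15 failures and aborts on the 16th. The extra 0.01 appears to keep the comparison away from exact integer boundaries; that reading is an inference from the code, not something the documentation states.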