Class: PEROBS::FuzzyStringMatcher

Inherits:
Object
  • Object
show all
Defined in:
lib/perobs/FuzzyStringMatcher.rb

Overview

The fuzzy string matcher can be used to perform a fuzzy string search against a known set of strings. The dictionary of known strings does not store the actual strings but references to arbitrary objects. These could be the string, but can be something else related to the learned strings. To use this class a list of strings with their references must be learned. Once the dictionary has been established, fuzzy matches can be done.

Instance Method Summary collapse

Constructor Details

#initialize(store, name, case_sensitive = false, n = 4) ⇒ FuzzyStringMatcher

Create a new FuzzyStringMatcher.

Parameters:

  • store (PEROBS::Store)

    place to store the dictionary

  • name (String)

    Unique name of the string matcher

  • case_sensitive (Boolean) (defaults to: false)

    True if case matters for matching

  • n (Integer) (defaults to: 4)

    Determines what kind of n-gramm is used to store the references in the dictionary. It also determines the minimum word length that can be used for fuzzy matches.



48
49
50
51
52
53
54
55
56
57
58
# File 'lib/perobs/FuzzyStringMatcher.rb', line 48

def initialize(store, name, case_sensitive = false, n = 4)
  @store = store
  @dict_name = "FuzzyStringMatcher::#{name}"
  if n < 2 || n > 10
    raise ArgumentError, 'n must be between 2 and 10'
  end
  @case_sensitive = case_sensitive
  @n = n

  clear unless (@dict = @store[@dict_name])
end

Instance Method Details

#best_matches(string, min_score = 0.5, max_count = 100) ⇒ Array

Find the references who’s string best matches the given string.

Parameters:

  • string (String)

    string to search for

  • min_score (Float) (defaults to: 0.5)

    Value 0.01 and 1.0 that specifies how strict the matching should be done. The larger the value the more closer the given string needs to be.

  • max_count (Integer) (defaults to: 100)

    The maximum number of matches that should be returned.

Returns:

  • (Array)

    The result is an Array of Arrays. The nested Arrays only have 2 entries. The reference and a Float value between 0 and 1.0 that describes how good the match is. The matches are sorted in descending order by the match score.



103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
# File 'lib/perobs/FuzzyStringMatcher.rb', line 103

def best_matches(string, min_score = 0.5, max_count = 100)
  unless @case_sensitive
    string = string.downcase
  end
  # Enclose string in 'start of text' and 'end of text' ASCII values.
  string = "\002" + string + "\003"

  matches = {}

  # This will be the best possible score for a perfect match.
  best_possible_score = 0
  each_n_gramm(string) do |n_gramm|
    best_possible_score += 1
    if (ng_list = @dict[n_gramm])
      ng_list.each do |reference, count|
        if matches.include?(reference)
          matches[reference] += 1
        else
          # We use internally a 10 times larger list so that we don't
          # throw away good matches too early. If the max_count value is
          # chosen too small there is a risk of not finding the best
          # matches!
          if matches.size > 10 * max_count
            matches = discard_worst_match(matches)
          end
          matches[reference] = 1
        end
      end
    end
  end

  return [] if matches.empty?

  # Sort in the order of occurance count downwards.
  match_list = matches.to_a.sort do |a, b|
    b[1] <=> a[1]
  end

  # Set occurance counters to scores relative to the best possible score.
  match_list.map! { |a, b| [ a, b.to_f / best_possible_score ] }

  # Delete all matches that occured less than half as often than the
  # top match.
  match_list.delete_if { |a| a[1] < min_score }

  match_list[0..max_count]
end

#clearObject

Wipe the dictionary.



61
62
63
# File 'lib/perobs/FuzzyStringMatcher.rb', line 61

def clear
  @store[@dict_name] = @dict = @store.new(BigHash)
end

#learn(string, reference = string) ⇒ Object

Add a string with its reference to the dictionary.

Parameters:

  • string (String)

    The string to store

  • reference (Object) (defaults to: string)

    Any object that is associated with the string



68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
# File 'lib/perobs/FuzzyStringMatcher.rb', line 68

def learn(string, reference = string)
  reference = string if reference.nil?

  unless @case_sensitive
    string = string.downcase
  end
  # Enclose string in 'start of text' and 'end of text' ASCII values.
  string = "\002" + string + "\003"

  each_n_gramm(string) do |n_gramm|
    unless (ng_list = @dict[n_gramm])
      @dict[n_gramm] = ng_list = @store.new(Hash)
    end

    if ng_list.include?(reference)
      ng_list[reference] += 1
    else
      ng_list[reference] = 0
    end
  end

  nil
end

#statsObject

Returns some internal stats about the dictionary.



152
153
154
155
156
157
158
159
160
161
162
163
164
165
# File 'lib/perobs/FuzzyStringMatcher.rb', line 152

def stats
  s = {}
  s['dictionary_size'] = @dict.size
  max = total = 0
  @dict.each do |n_gramm, ng_list|
    size = ng_list.length
    max = size if size > max
    total += size
  end
  s['max_list_size'] = max
  s['avg_list_size'] = total > 0 ? total.to_f / s['dictionary_size'] : 0

  s
end