Class: PEROBS::FuzzyStringMatcher
- Inherits:
-
Object
- Object
- PEROBS::FuzzyStringMatcher
- Defined in:
- lib/perobs/FuzzyStringMatcher.rb
Overview
The fuzzy string matcher can be used to perform a fuzzy string search against a known set of strings. The dictionary of known strings does not store the actual strings but references to arbitrary objects. These could be the string, but can be something else related to the learned strings. To use this class a list of strings with their references must be learned. Once the dictionary has been established, fuzzy matches can be done.
Instance Method Summary collapse
-
#best_matches(string, min_score = 0.5, max_count = 100) ⇒ Array
Find the references who’s string best matches the given string.
-
#clear ⇒ Object
Wipe the dictionary.
-
#initialize(store, name, case_sensitive = false, n = 4) ⇒ FuzzyStringMatcher
constructor
Create a new FuzzyStringMatcher.
-
#learn(string, reference = string) ⇒ Object
Add a string with its reference to the dictionary.
-
#stats ⇒ Object
Returns some internal stats about the dictionary.
Constructor Details
#initialize(store, name, case_sensitive = false, n = 4) ⇒ FuzzyStringMatcher
Create a new FuzzyStringMatcher.
48 49 50 51 52 53 54 55 56 57 58 |
# File 'lib/perobs/FuzzyStringMatcher.rb', line 48 def initialize(store, name, case_sensitive = false, n = 4) @store = store @dict_name = "FuzzyStringMatcher::#{name}" if n < 2 || n > 10 raise ArgumentError, 'n must be between 2 and 10' end @case_sensitive = case_sensitive @n = n clear unless (@dict = @store[@dict_name]) end |
Instance Method Details
#best_matches(string, min_score = 0.5, max_count = 100) ⇒ Array
Find the references who’s string best matches the given string.
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 |
# File 'lib/perobs/FuzzyStringMatcher.rb', line 103 def best_matches(string, min_score = 0.5, max_count = 100) unless @case_sensitive string = string.downcase end # Enclose string in 'start of text' and 'end of text' ASCII values. string = "\002" + string + "\003" matches = {} # This will be the best possible score for a perfect match. best_possible_score = 0 each_n_gramm(string) do |n_gramm| best_possible_score += 1 if (ng_list = @dict[n_gramm]) ng_list.each do |reference, count| if matches.include?(reference) matches[reference] += 1 else # We use internally a 10 times larger list so that we don't # throw away good matches too early. If the max_count value is # chosen too small there is a risk of not finding the best # matches! if matches.size > 10 * max_count matches = discard_worst_match(matches) end matches[reference] = 1 end end end end return [] if matches.empty? # Sort in the order of occurance count downwards. match_list = matches.to_a.sort do |a, b| b[1] <=> a[1] end # Set occurance counters to scores relative to the best possible score. match_list.map! { |a, b| [ a, b.to_f / best_possible_score ] } # Delete all matches that occured less than half as often than the # top match. match_list.delete_if { |a| a[1] < min_score } match_list[0..max_count] end |
#clear ⇒ Object
Wipe the dictionary.
61 62 63 |
# File 'lib/perobs/FuzzyStringMatcher.rb', line 61 def clear @store[@dict_name] = @dict = @store.new(BigHash) end |
#learn(string, reference = string) ⇒ Object
Add a string with its reference to the dictionary.
68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 |
# File 'lib/perobs/FuzzyStringMatcher.rb', line 68 def learn(string, reference = string) reference = string if reference.nil? unless @case_sensitive string = string.downcase end # Enclose string in 'start of text' and 'end of text' ASCII values. string = "\002" + string + "\003" each_n_gramm(string) do |n_gramm| unless (ng_list = @dict[n_gramm]) @dict[n_gramm] = ng_list = @store.new(Hash) end if ng_list.include?(reference) ng_list[reference] += 1 else ng_list[reference] = 0 end end nil end |
#stats ⇒ Object
Returns some internal stats about the dictionary.
152 153 154 155 156 157 158 159 160 161 162 163 164 165 |
# File 'lib/perobs/FuzzyStringMatcher.rb', line 152 def stats s = {} s['dictionary_size'] = @dict.size max = total = 0 @dict.each do |n_gramm, ng_list| size = ng_list.length max = size if size > max total += size end s['max_list_size'] = max s['avg_list_size'] = total > 0 ? total.to_f / s['dictionary_size'] : 0 s end |