Module: Ariel::LabelUtils
- Defined in: lib/ariel/label_utils.rb
Overview
A set of methods for working with strings from labeled documents.
Constant Summary
- S_LABEL = "<"
- E_LABEL = ">"
Class Method Summary
- .any_label_regex ⇒ Object
  Helper function that returns a regex matching any opening or closing label tag.
- .clean_string(string) ⇒ Object
  Removes all labels such as <l:title> from the given string and returns the result.
- .extract_labeled_region(structure, parent_extracted_node) ⇒ Object
  Extracts the labeled region representing the given structure node from the parent_extracted_node.
- .label_regex(tag_contents = '\w+', namespace = 'l') ⇒ Object
  Returns an array containing a pair of regular expressions to match a start label tag and an end label tag.
Class Method Details
.any_label_regex ⇒ Object
Helper function that returns a regex matching any opening or closing label tag.
    # File 'lib/ariel/label_utils.rb', line 20
    def self.any_label_regex()
      Regexp.union(*self.label_regex)
    end
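The following is a minimal sketch of how the combined regex behaves. So that it runs without the ariel gem, it copies the constants and the two method bodies shown on this page into a stand-in module; the module name `AnyLabelDemo` and the sample text are assumptions made for illustration.

```ruby
# Stand-in module (hypothetical name) that copies the definitions from this page.
module AnyLabelDemo
  S_LABEL = "<"
  E_LABEL = ">"

  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end

  # Regexp.union joins the open and close patterns with alternation.
  def self.any_label_regex
    Regexp.union(*label_regex)
  end
end

text = "<l:title>Ruby in Practice</l:title> by <l:author>Someone</l:author>"

# String#scan collects every substring matched by either pattern, in order.
tags = text.scan(AnyLabelDemo.any_label_regex)
puts tags.inspect
# prints ["<l:title>", "</l:title>", "<l:author>", "</l:author>"]
```

Because each sub-pattern keeps its own case-insensitive flag inside the union, the combined regex stays case insensitive as well.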
.clean_string(string) ⇒ Object
Removes all labels such as <l:title> from the given string and returns the result.
    # File 'lib/ariel/label_utils.rb', line 26
    def self.clean_string(string)
      string.gsub self.any_label_regex, ''
    end
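As an illustration, the sketch below applies the same `gsub` call to a short labeled string. It assumes a stand-in module (`CleanStringDemo`, a hypothetical name) that copies the method bodies from this page, so it does not need the ariel gem; the sample string is also made up.

```ruby
# Stand-in module (hypothetical name) that copies the definitions from this page.
module CleanStringDemo
  S_LABEL = "<"
  E_LABEL = ">"

  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end

  def self.any_label_regex
    Regexp.union(*label_regex)
  end

  # Replace every open or close label tag with the empty string.
  def self.clean_string(string)
    string.gsub(any_label_regex, '')
  end
end

labeled = "<l:title><b>Ruby</b> in Practice</l:title> (<L:YEAR>2009</L:YEAR>)"
cleaned = CleanStringDemo.clean_string(labeled)
puts cleaned
# prints <b>Ruby</b> in Practice (2009)
```

Only tags in the `l:` namespace are removed; ordinary markup such as `<b>` lacks the namespace prefix and is left in place.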
.extract_labeled_region(structure, parent_extracted_node) ⇒ Object
Extracts the labeled region representing the given structure node from the parent_extracted_node. A new ExtractedNode is returned to be added as a child to the parent_extracted_node. Used when loading labeled documents.
    # File 'lib/ariel/label_utils.rb', line 33
    def self.extract_labeled_region(structure, parent_extracted_node)
      tokenstream=parent_extracted_node.tokenstream
      start_idx=self.skip_to_label_tag(tokenstream, structure..name, :open)
      end_idx=self.skip_to_label_tag(tokenstream.reverse, structure..name, :closed)
      end_idx=tokenstream.reverse_pos end_idx
      newstream=tokenstream.slice_by_token_index(start_idx, end_idx)
      child_node=ExtractedNode.new(structure..name, newstream, structure)
      parent_extracted_node.add_child child_node
      return child_node
    end
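The method above works on Ariel token streams and builds an ExtractedNode, which cannot be reproduced here without the library. The sketch below is only a string-level analogue of the same idea: find the first opening label for a name, find the last closing label for that name, and take what lies between. The module `RegionDemo`, the helper `extract_region`, and the choice to exclude the tags themselves from the result are all assumptions for illustration, not Ariel's behaviour.

```ruby
# Hypothetical string-level analogue; not Ariel's token-based implementation.
module RegionDemo
  S_LABEL = "<"
  E_LABEL = ">"

  # Copied from the label_regex definition on this page.
  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end

  # Returns the text between the first opening label and the last closing
  # label called +name+, or nil when either label is missing.
  def self.extract_region(text, name)
    open_re, close_re = label_regex(Regexp.escape(name))
    open_match = open_re.match(text)
    return nil unless open_match

    # rindex searches from the end, mirroring the reversed search above.
    close_idx = text.rindex(close_re)
    return nil unless close_idx && close_idx >= open_match.end(0)

    text[open_match.end(0)...close_idx]
  end
end

doc = "<l:book><l:title>A</l:title></l:book>"
puts RegionDemo.extract_region(doc, "title")  # prints A
puts RegionDemo.extract_region(doc, "book")   # prints <l:title>A</l:title>
```

Searching for the closing label from the end means that nested labels with other names stay inside the extracted region, which is what lets a child region be processed again in a later pass.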
.label_regex(tag_contents = '\w+', namespace = 'l') ⇒ Object
Returns an array containing a pair of regular expressions to match a start label tag and an end label tag. If tag_contents is left at its default, the regular expressions match any properly formatted label tag. The namespace to search for can also be changed. The returned regular expressions are case insensitive.
    # File 'lib/ariel/label_utils.rb', line 13
    def self.label_regex(tag_contents='\w+', namespace='l')
      [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
       /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
    end
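To show the effect of the two parameters, here is a minimal sketch that copies the method body above into a stand-in module (`LabelRegexDemo`, a hypothetical name) so it runs without the ariel gem. The tag names and the `x` namespace are made up for the example.

```ruby
# Stand-in module (hypothetical name) that copies the definition from this page.
module LabelRegexDemo
  S_LABEL = "<"
  E_LABEL = ">"

  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end
end

# Defaults: any label name in the "l" namespace.
open_any, close_any = LabelRegexDemo.label_regex
# A fixed label name.
open_title, close_title = LabelRegexDemo.label_regex('title')
# A different namespace.
open_x, _close_x = LabelRegexDemo.label_regex('\w+', 'x')

puts open_any.match?("<l:author>")     # prints true
puts close_any.match?("</l:author>")   # prints true
puts open_any.match?("</l:author>")    # prints false (a close tag is not an open tag)
puts open_title.match?("<l:author>")   # prints false (wrong label name)
puts open_title.match?("<L:Title>")    # prints true (case insensitive)
puts close_title.match?("</l:title>")  # prints true
puts open_x.match?("<x:price>")        # prints true
puts open_any.match?("<x:price>")      # prints false (wrong namespace)
```

Note that tag_contents is interpolated into the pattern as regex source, so a literal name containing regex metacharacters would need escaping (for example with `Regexp.escape`) before being passed in.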