Module: Ariel::LabelUtils

Defined in:
lib/ariel/label_utils.rb

Overview

A set of helper methods for working with strings from labeled documents.

Constant Summary

S_LABEL = "<"
E_LABEL = ">"

Class Method Details

.any_label_regex ⇒ Object

Helper method that returns a single regex matching any opening or closing label tag.



# File 'lib/ariel/label_utils.rb', line 20

def self.any_label_regex()
  Regexp.union(*self.label_regex)
end
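
As a minimal sketch of how the combined regex behaves, the snippet below copies the two methods shown on this page into a stand-in module (the name LabelUtilsSketch and the sample string are assumptions made for illustration, so the code runs without the ariel gem) and scans a labeled string for every tag.

```ruby
# Stand-in copy of the methods documented on this page.
# LabelUtilsSketch is a hypothetical name, not part of Ariel.
module LabelUtilsSketch
  S_LABEL = "<"
  E_LABEL = ">"

  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end

  # Regexp.union joins the opening and closing patterns into one alternation.
  def self.any_label_regex
    Regexp.union(*label_regex)
  end
end

sample = "<l:title>Ruby Notes</l:title> by <l:author>A. Writer</l:author>"

# The union has no capture groups, so scan returns the matched tags as strings.
tags = sample.scan(LabelUtilsSketch.any_label_regex)
puts tags.inspect
# => ["<l:title>", "</l:title>", "<l:author>", "</l:author>"]
```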

.clean_string(string) ⇒ Object

Removes all labels such as <l:title> from the given string and returns the result.



# File 'lib/ariel/label_utils.rb', line 26

def self.clean_string(string)
  string.gsub self.any_label_regex, ''
end
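
The following sketch illustrates the effect of clean_string on a small input. It re-declares the relevant methods in a stand-in module (LabelUtilsSketch and the sample text are assumptions for illustration only) so that it can run on its own.

```ruby
# Stand-in copy of the methods documented on this page.
# LabelUtilsSketch is a hypothetical name, not part of Ariel.
module LabelUtilsSketch
  S_LABEL = "<"
  E_LABEL = ">"

  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end

  def self.any_label_regex
    Regexp.union(*label_regex)
  end

  # Delete every label tag; the surrounding text is left untouched.
  def self.clean_string(string)
    string.gsub(any_label_regex, '')
  end
end

labeled = "<l:title>Ruby Notes</l:title> by <L:AUTHOR>A. Writer</L:AUTHOR> in <b>bold</b>"
cleaned = LabelUtilsSketch.clean_string(labeled)
puts cleaned
# => Ruby Notes by A. Writer in <b>bold</b>
```

Tags outside the l: namespace, such as <b>, are not removed, and matching ignores case because both patterns carry the i flag.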

.extract_labeled_region(structure, parent_extracted_node) ⇒ Object

Extracts the labeled region representing the given structure node from the parent_extracted_node. A new ExtractedNode is created for that region, added as a child of the parent_extracted_node, and returned. Used when loading labeled documents.



# File 'lib/ariel/label_utils.rb', line 33

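def self.extract_labeled_region(structure, parent_extracted_node)
  tokenstream = parent_extracted_node.tokenstream
  # Search forward for the opening label tag of this structure node.
  start_idx = self.skip_to_label_tag(tokenstream, structure.meta.name, :open)
  # Search the reversed stream for the closing tag, then map that position
  # back to an index in the original stream.
  end_idx = self.skip_to_label_tag(tokenstream.reverse, structure.meta.name, :closed)
  end_idx = tokenstream.reverse_pos(end_idx)
  # Slice out the labeled region and wrap it in a new child node.
  newstream = tokenstream.slice_by_token_index(start_idx, end_idx)
  child_node = ExtractedNode.new(structure.meta.name, newstream, structure)
  parent_extracted_node.add_child(child_node)
  child_node
end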
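
The method above works on Ariel's token streams, whose helpers (skip_to_label_tag, reverse_pos, slice_by_token_index) are not shown on this page. As a rough, string-based analogue only, the sketch below finds the first opening tag and the last closing tag for a name and returns the text between them. The labeled_region helper is hypothetical, and whether Ariel's indices include the tags themselves is an assumption here: this sketch excludes them.

```ruby
# String-based analogue of the forward/backward search, NOT Ariel's
# token-based implementation. label_regex is copied from this page.
def label_regex(tag_contents = '\w+', namespace = 'l')
  [/<#{namespace}:#{tag_contents}>/i,
   /<\/#{namespace}:#{tag_contents}>/i]
end

# Hypothetical helper: returns the text between the first opening tag and
# the last closing tag for +name+, or nil if either tag is missing.
def labeled_region(text, name)
  open_re, close_re = label_regex(name)
  open_match = open_re.match(text)      # forward search for the opening tag
  close_idx = text.rindex(close_re)     # backward search for the closing tag
  return nil unless open_match && close_idx
  start_idx = open_match.end(0)         # first character after the opening tag
  return nil if close_idx < start_idx
  text[start_idx...close_idx]
end

doc = "<l:book><l:title>Dune</l:title></l:book>"
title = labeled_region(doc, "title")
book = labeled_region(doc, "book")
puts title
puts book
```

Searching for the closing tag from the end means a region extends to the last matching closing tag, which mirrors the use of the reversed stream in the method above.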

.label_regex(tag_contents = '\w+', namespace = 'l') ⇒ Object

Returns an array containing a pair of regular expressions: the first matches a start label tag and the second matches an end label tag. If tag_contents is left at its default, the regular expressions match any properly formatted label tag. The namespace to search for can also be changed. The returned regular expressions are case-insensitive.



# File 'lib/ariel/label_utils.rb', line 13

def self.label_regex(tag_contents='\w+', namespace='l')
  [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
  /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
end
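
A short sketch of the two parameters in use follows. The method is copied into a stand-in module (LabelUtilsSketch, the tag names, and the x namespace are assumptions chosen for illustration) so the example is self-contained.

```ruby
# Stand-in copy of label_regex as documented on this page.
# LabelUtilsSketch is a hypothetical name, not part of Ariel.
module LabelUtilsSketch
  S_LABEL = "<"
  E_LABEL = ">"

  def self.label_regex(tag_contents = '\w+', namespace = 'l')
    [/#{S_LABEL}#{namespace}:#{tag_contents}#{E_LABEL}/i,
     /#{S_LABEL}\/#{namespace}:#{tag_contents}#{E_LABEL}/i]
  end
end

# Restrict the pair to a single tag name.
open_re, close_re = LabelUtilsSketch.label_regex('title')
title_open  = open_re.match?("<l:title>")     # true
title_close = close_re.match?("</L:Title>")   # true, matching ignores case
author_open = open_re.match?("<l:author>")    # false, the tag name differs

# Keep the default tag pattern but search a different namespace.
x_open, _x_close = LabelUtilsSketch.label_regex('\w+', 'x')
custom_ns  = x_open.match?("<x:price>")       # true
default_ns = x_open.match?("<l:price>")       # false, wrong namespace

puts [title_open, title_close, author_open, custom_ns, default_ns].inspect
```

Because tag_contents and namespace are interpolated into the pattern without escaping, they are treated as regex fragments rather than literal strings.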