Class: Ariel::TokenStream
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/ariel/token_stream.rb
Overview
A TokenStream instance stores a stream of Tokens once it has used its tokenization rules to extract them from a string. A TokenStream knows its current position (TokenStream#cur_pos), which is incremented when any of the Enumerable methods are used (due to the redefinition of TokenStream#each); as you advance through the stream, the current token is always returned and then consumed. A TokenStream also provides methods for finding patterns in a given stream, much like StringScanner but for an array of tokens. For rule generation, a given token can be marked as the start point of a label. Finally, a TokenStream records whether it is in a reversed or unreversed state, so that rules are always applied from the front or the end of the stream as required.
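A minimal usage sketch (the require line and the sample string are illustrative, assuming the API documented below):

  require 'ariel'

  stream = Ariel::TokenStream.new
  stream.tokenize("<b>Price: 42</b>")  # extract tokens using TOKEN_REGEXEN
  stream.size                          # => 5: "<b>", "Price", ":", "42", "</b>"
  stream.advance.text                  # => "<b>", returned and then consumed
  stream.rewind                        # back to the start; returns self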
Constant Summary
- TOKEN_REGEXEN =
[
  Wildcards.list[:html_tag], # Match html tags that don't have attributes
  /\d+/,                     # Match any numbers, probably good to make a split
  /\b\w+\b/,                 # Pick up words, will split at punctuation
  /\S/                       # Grab any characters left over that aren't whitespace
]
- LABEL_TAG_REGEXEN =
[LabelUtils.any_label_regex]
Instance Attribute Summary
- #cur_pos ⇒ Object
Returns the value of attribute cur_pos.
- #label_index ⇒ Object
Returns the value of attribute label_index.
- #original_text ⇒ Object
Returns the value of attribute original_text.
- #tokens ⇒ Object
Returns the value of attribute tokens.
Instance Method Summary
- #advance ⇒ Object
Returns the current Token and consumes it.
- #cache_hash ⇒ Object
Note that stream.cache_hash != stream.reverse.reverse.cache_hash.
- #contains_label_tags? ⇒ Boolean
- #current_token ⇒ Object
Returns the current Token.
- #deep_clone ⇒ Object
Used to ensure operations such as @tokens.reverse! in one instance won't inadvertently affect another.
- #each ⇒ Object
Iterates over and consumes every Token from the cur_pos.
- #initialize ⇒ TokenStream (constructor)
A new instance of TokenStream.
- #raw_text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance's stored tokens.
- #remove_label_tags ⇒ Object
Goes through all stored Token instances, removing them if Token#is_label_tag? is true. Called after a labeled document has been extracted to a tree, ready for the rule learning process.
- #reverse ⇒ Object
Returns a copy of the current instance with a reversed set of tokens.
- #reverse! ⇒ Object
Same as TokenStream#reverse, but changes are made in place.
- #reverse_pos(pos) ⇒ Object
Converts the given position so it points to the same token once the stream is reversed.
- #reversed? ⇒ Boolean
Returns true or false depending on whether the TokenStream is in a reversed state.
- #rewind ⇒ Object
Return to the beginning of the TokenStream.
- #set_label_at(pos) ⇒ Object
Set a label at a given offset in the original text.
- #size ⇒ Object
Returns the number of tokens in the TokenStream.
- #skip_to(*features) ⇒ Object
Takes a list of Strings and Symbols as its arguments representing text to be matched in individual tokens and Wildcards.
- #slice_by_string_pos(left, right) ⇒ Object
Returns the slice of the current instance containing all the tokens between the token where start_loc == left and the token where end_loc == right.
- #slice_by_token_index(l_index, r_index) ⇒ Object
Slices tokens between l_index and r_index inclusive.
- #text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance's stored tokens, stripping any label tags if the stream was declared to contain them when it was initialized (this would only happen during the process of loading labeled examples).
- #tokenize(input, contains_label_tags = false) ⇒ Object
The tokenizer operates on a string by splitting it at every point it finds a match to a regular expression.
Constructor Details
#initialize ⇒ TokenStream
Returns a new instance of TokenStream.
# File 'lib/ariel/token_stream.rb', line 28

def initialize()
  @tokens = []
  @cur_pos = 0
  @original_text = ""
  @reversed = false
  @contains_label_tags = false
end
Instance Attribute Details
#cur_pos ⇒ Object
Returns the value of attribute cur_pos.
# File 'lib/ariel/token_stream.rb', line 18

def cur_pos
  @cur_pos
end
#label_index ⇒ Object
Returns the value of attribute label_index.
# File 'lib/ariel/token_stream.rb', line 18

def label_index
  @label_index
end
#original_text ⇒ Object
Returns the value of attribute original_text.
# File 'lib/ariel/token_stream.rb', line 18

def original_text
  @original_text
end
#tokens ⇒ Object
Returns the value of attribute tokens.
# File 'lib/ariel/token_stream.rb', line 18

def tokens
  @tokens
end
Instance Method Details
#advance ⇒ Object
Returns the current Token and consumes it.
# File 'lib/ariel/token_stream.rb', line 145

def advance
  return nil if @cur_pos > @tokens.size
  while true
    @cur_pos += 1
    current_token = @tokens[@cur_pos - 1]
    return nil if current_token.nil?
    return current_token
  end
end
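A sketch of how advance and current_token interact (token texts are illustrative):

  stream = Ariel::TokenStream.new
  stream.tokenize("one two")
  stream.current_token.text  # => "one", not yet consumed
  stream.advance.text        # => "one", returned and consumed
  stream.current_token.text  # => "two"
  stream.advance             # consumes "two"
  stream.advance             # => nil, the stream is exhausted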
#cache_hash ⇒ Object
Note that stream.cache_hash != stream.reverse.reverse.cache_hash, since reverse returns a deep clone whose Token objects are distinct from the originals.
# File 'lib/ariel/token_stream.rb', line 61

def cache_hash
  [@tokens, @reversed].hash
end
#contains_label_tags? ⇒ Boolean
# File 'lib/ariel/token_stream.rb', line 65

def contains_label_tags?
  @contains_label_tags
end
#current_token ⇒ Object
Returns the current Token.
# File 'lib/ariel/token_stream.rb', line 219

def current_token
  @tokens[@cur_pos]
end
#deep_clone ⇒ Object
Used to ensure operations such as @tokens.reverse! in one instance won't inadvertently affect another.
# File 'lib/ariel/token_stream.rb', line 100

def deep_clone
  Marshal::load(Marshal.dump(self))
end
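For instance, destructive operations on a deep clone leave the original stream untouched (a sketch):

  copy = stream.deep_clone
  copy.reverse!
  copy.reversed?    # => true
  stream.reversed?  # => false, the original is unaffected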
#each ⇒ Object
Iterates over and consumes every Token from the cur_pos.
# File 'lib/ariel/token_stream.rb', line 212

def each
  while (token = self.advance)
    yield token
  end
end
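Because each starts from cur_pos and consumes what it yields, any Enumerable method advances the stream (a sketch):

  stream.rewind
  stream.each {|token| puts token.text}  # consumes every remaining token
  stream.advance                         # => nil, nothing left to consume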
#raw_text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance’s stored tokens. It will not strip label tags even if the stream is marked to contain them. However, you should not expect to get the raw_text once any label_tags have been filtered (TokenStream#remove_label_tags).
# File 'lib/ariel/token_stream.rb', line 136

def raw_text(l_index=0, r_index=-1)
  return "" if @tokens.size == 0
  if reversed?
    l_index, r_index = r_index, l_index
  end
  @original_text[@tokens[l_index].start_loc...@tokens[r_index].end_loc]
end
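A sketch on a stream without label tags, assuming the default TOKEN_REGEXEN split "Price: 42" into the tokens "Price", ":" and "42":

  stream = Ariel::TokenStream.new
  stream.tokenize("Price: 42")
  stream.raw_text        # => "Price: 42"
  stream.raw_text(2, 2)  # => "42"
  stream.raw_text(0, 1)  # => "Price:" (from the start of token 0 to the end of token 1)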
#remove_label_tags ⇒ Object
Goes through all stored Token instances, removing them if Token#is_label_tag? is true. Called after a labeled document has been extracted to a tree, ready for the rule learning process.
# File 'lib/ariel/token_stream.rb', line 72

def remove_label_tags
  @tokens.delete_if {|token| token.is_label_tag?}
end
#reverse ⇒ Object
Returns a copy of the current instance with a reversed set of tokens. If it is set, the label_index is adjusted accordingly to point to the correct token.
# File 'lib/ariel/token_stream.rb', line 164

def reverse
  self.deep_clone.reverse!
end
#reverse! ⇒ Object
Same as TokenStream#reverse, but changes are made in place.
# File 'lib/ariel/token_stream.rb', line 175

def reverse!
  @tokens.reverse!
  if label_index
    @label_index = reverse_pos(@label_index)
  end
  @reversed = !@reversed
  return self
end
#reverse_pos(pos) ⇒ Object
Converts the given position so it points to the same token once the stream is reversed. The result is invalid when @tokens.size == 0.
# File 'lib/ariel/token_stream.rb', line 170

def reverse_pos(pos)
  @tokens.size - (pos + 1)
end
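The index arithmetic in action for a 5-token stream (a sketch):

  stream.size            # => 5
  stream.reverse_pos(1)  # => 3, i.e. 5 - (1 + 1)
  reversed = stream.reverse
  reversed.reversed?     # => true; the original stream is unchanged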
#reversed? ⇒ Boolean
Returns true or false depending on whether the TokenStream is in a reversed state.
# File 'lib/ariel/token_stream.rb', line 186

def reversed?
  @reversed
end
#rewind ⇒ Object
Return to the beginning of the TokenStream. Returns self.
# File 'lib/ariel/token_stream.rb', line 156

def rewind
  @cur_pos = 0
  self
end
#set_label_at(pos) ⇒ Object
Set a label at a given offset in the original text. Searches for a token with a start_loc equal to the position passed as an argument, and raises an error if one is not found.
# File 'lib/ariel/token_stream.rb', line 107

def set_label_at(pos)
  token_pos = nil
  @tokens.each_index {|i| token_pos = i if @tokens[i].start_loc == pos}
  if token_pos.nil?
    raise ArgumentError, "Given string position does not match the start of any token"
  else
    @label_index = token_pos
    Log.debug "Token ##{label_index} - \"#{@tokens[label_index].text}\" labeled."
    return @label_index
  end
end
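A sketch, assuming a stream tokenized from the illustrative "Price: 42" where the token "42" starts at string offset 7:

  stream.set_label_at(7)  # => 2, the index of the token "42"
  stream.label_index      # => 2
  stream.set_label_at(8)  # raises ArgumentError, no token starts at offset 8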
#size ⇒ Object
Returns the number of tokens in the TokenStream.
# File 'lib/ariel/token_stream.rb', line 191

def size
  @tokens.size
end
#skip_to(*features) ⇒ Object
Takes a list of Strings and Symbols as its arguments representing text to be matched in individual tokens and Wildcards. For a match to be a success, all wildcards and strings must match a consecutive sequence of Tokens in the TokenStream. All matched Tokens are consumed, and the TokenStream’s current position is returned on success. On failure, the TokenStream is returned to its original state and returns nil.
# File 'lib/ariel/token_stream.rb', line 201

def skip_to(*features)
  original_pos = @cur_pos
  self.each_cons(features.size) do |tokens|
    i = 0
    return @cur_pos if tokens.all? {|token| i += 1; token.matches?(features[i - 1])}
  end
  @cur_pos = original_pos # No match, return TokenStream to original state
  return nil
end
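A sketch using the illustrative "<b>Price: 42</b>" stream, whose tokens would be "<b>", "Price", ":", "42" and "</b>" under the default TOKEN_REGEXEN:

  stream.rewind
  stream.skip_to("Price", ":")  # => new current position; "<b>", "Price" and ":" are consumed
  stream.current_token.text     # => "42"
  stream.skip_to("missing")     # => nil, cur_pos is restored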
#slice_by_string_pos(left, right) ⇒ Object
Returns the slice of the current instance containing all the tokens between the token where the start_loc == the left parameter and the token where the end_loc == the right parameter.
# File 'lib/ariel/token_stream.rb', line 79

def slice_by_string_pos(left, right)
  l_index = nil
  r_index = nil
  @tokens.each_index {|i| l_index = i if @tokens[i].start_loc == left}
  @tokens.each_index {|i| r_index = i if @tokens[i].end_loc == right}
  if l_index.nil? or r_index.nil?
    raise ArgumentError, "Cannot slice between those locations"
  else
    return slice_by_token_index(l_index, r_index)
  end
end
#slice_by_token_index(l_index, r_index) ⇒ Object
Slices tokens between the l_index and the r_index inclusive.
# File 'lib/ariel/token_stream.rb', line 92

def slice_by_token_index(l_index, r_index)
  sliced = self.dup
  sliced.tokens = @tokens[l_index..r_index]
  return sliced
end
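Both slicing methods return a new TokenStream; a sketch on the illustrative "Price: 42" stream, where "Price" spans offsets 0..5 and "42" spans 7..9:

  slice = stream.slice_by_token_index(1, 2)  # tokens ":" and "42"
  slice.size                                 # => 2
  stream.slice_by_string_pos(0, 9)           # the whole stream: "Price" starts at 0, "42" ends at 9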
#text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance's stored tokens, stripping any label tags if the stream was declared to contain them when it was initialized (this would only happen during the process of loading labeled examples). See also TokenStream#raw_text.
# File 'lib/ariel/token_stream.rb', line 123

def text(l_index=0, r_index=-1)
  out = raw_text(l_index, r_index)
  if contains_label_tags?
    LabelUtils.clean_string(out)
  else
    out
  end
end
#tokenize(input, contains_label_tags = false) ⇒ Object
The tokenizer operates on a string by splitting it at every point it finds a match to a regular expression. Each match is added as a token, and the strings between matches are stored along with their original offsets. The same is then done with the next regular expression on each of these split strings, and new tokens are created with the correct offset in the original text. Any characters left unmatched by any of the regular expressions in TokenStream::TOKEN_REGEXEN are discarded. This approach allows a hierarchy of regular expressions to work simply and easily: a simple regular expression to match html tags might operate first, and later expressions that pick up runs of word characters can operate on what's left. If contains_label_tags is set to true when calling tokenize, the tokenizer will first remove and discard any occurrences of label tags (as defined by the regular expressions set in LabelUtils) before matching and adding tokens. Any label_tag tokens will be marked as such upon creation.
# File 'lib/ariel/token_stream.rb', line 50

def tokenize(input, contains_label_tags=false)
  string_array = [[input, 0]]
  @original_text = input
  @contains_label_tags = contains_label_tags
  LABEL_TAG_REGEXEN.each {|regex| split_string_array_by_regex(string_array, regex, false)} if contains_label_tags
  TOKEN_REGEXEN.each {|regex| split_string_array_by_regex(string_array, regex)}
  @tokens.sort!
  @tokens.size
end
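A sketch of the hierarchy at work: the html_tag wildcard claims the tags first, so the later number, word and catch-all expressions only see the text in between (expected output under the default TOKEN_REGEXEN):

  stream = Ariel::TokenStream.new
  stream.tokenize("<b>Total: 42</b>")  # => 5, the number of tokens created
  stream.tokens.collect {|t| t.text}   # => ["<b>", "Total", ":", "42", "</b>"]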