Class: Ariel::TokenStream
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/ariel/token_stream.rb
Overview
A TokenStream instance stores a stream of Tokens once it has used its tokenization rules to extract them from a string. A TokenStream knows its current position (TokenStream#cur_pos), which is incremented when any of the Enumerable methods are used (due to the redefinition of TokenStream#each); as you advance through the stream, the current token is always returned and then consumed. A TokenStream also provides methods for finding patterns in a given stream, much like StringScanner but for an array of tokens. For rule generation, a given token can be marked as the start point of a label. Finally, a TokenStream records whether it is in a reversed or unreversed state, so that rules are always applied from the front or the end of the stream as required.
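A minimal usage sketch (the require line and the sample string are illustrative, assuming the API documented below):

  require 'ariel'

  stream = Ariel::TokenStream.new
  stream.tokenize("<b>Price: 42</b>")  # extract tokens using TOKEN_REGEXEN
  stream.size                          # => 5: "<b>", "Price", ":", "42", "</b>"
  stream.advance.text                  # => "<b>", returned and then consumed
  stream.rewind                        # back to the start; returns self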
Constant Summary
- TOKEN_REGEXEN =
[
  Wildcards.list[:html_tag], # Match html tags that don't have attributes
  /\d+/,                     # Match any numbers, probably good to make a split
  /\b\w+\b/,                 # Pick up words, will split at punctuation
  /\S/                       # Grab any characters left over that aren't whitespace
]
- LABEL_TAG_REGEXEN =
[LabelUtils.any_label_regex]
Instance Attribute Summary
- #cur_pos ⇒ Object
Returns the value of attribute cur_pos.
- #label_index ⇒ Object
Returns the value of attribute label_index.
- #original_text ⇒ Object
Returns the value of attribute original_text.
- #tokens ⇒ Object
Returns the value of attribute tokens.
Instance Method Summary
- #advance ⇒ Object
Returns the current Token and consumes it.
- #cache_hash ⇒ Object
Note that stream.cache_hash != stream.reverse.reverse.cache_hash.
- #contains_label_tags? ⇒ Boolean
- #current_token ⇒ Object
Returns the current Token.
- #deep_clone ⇒ Object
Used to ensure operations such as @tokens.reverse! in one instance won't inadvertently affect another.
- #each ⇒ Object
Iterates over and consumes every Token from the cur_pos.
- #initialize ⇒ TokenStream (constructor)
A new instance of TokenStream.
- #raw_text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance's stored tokens.
- #remove_label_tags ⇒ Object
Goes through all stored Token instances, removing them if Token#is_label_tag? is true. Called after a labeled document has been extracted to a tree, ready for the rule learning process.
- #reverse ⇒ Object
Returns a copy of the current instance with a reversed set of tokens.
- #reverse! ⇒ Object
Same as TokenStream#reverse, but changes are made in place.
- #reverse_pos(pos) ⇒ Object
Converts the given position so it points to the same token once the stream is reversed.
- #reversed? ⇒ Boolean
Returns true or false depending on whether the TokenStream is in a reversed state.
- #rewind ⇒ Object
Return to the beginning of the TokenStream.
- #set_label_at(pos) ⇒ Object
Set a label at a given offset in the original text.
- #size ⇒ Object
Returns the number of tokens in the TokenStream.
- #skip_to(*features) ⇒ Object
Takes a list of Strings and Symbols as its arguments representing text to be matched in individual tokens and Wildcards.
- #slice_by_string_pos(left, right) ⇒ Object
Returns the slice of the current instance containing all the tokens between the token where start_loc == left and the token where end_loc == right.
- #slice_by_token_index(l_index, r_index) ⇒ Object
Slices tokens between l_index and r_index inclusive.
- #text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance's stored tokens, stripping any label tags if the stream was declared to contain them when it was initialized (this would only happen during the process of loading labeled examples).
- #tokenize(input, contains_label_tags = false) ⇒ Object
The tokenizer operates on a string by splitting it at every point it finds a match to a regular expression.
Constructor Details
#initialize ⇒ TokenStream
Returns a new instance of TokenStream.
# File 'lib/ariel/token_stream.rb', line 28

def initialize()
  @tokens = []
  @cur_pos = 0
  @original_text = ""
  @reversed = false
  @contains_label_tags = false
end
Instance Attribute Details
#cur_pos ⇒ Object
Returns the value of attribute cur_pos.
# File 'lib/ariel/token_stream.rb', line 18

def cur_pos
  @cur_pos
end
#label_index ⇒ Object
Returns the value of attribute label_index.
# File 'lib/ariel/token_stream.rb', line 18

def label_index
  @label_index
end
#original_text ⇒ Object
Returns the value of attribute original_text.
# File 'lib/ariel/token_stream.rb', line 18

def original_text
  @original_text
end
#tokens ⇒ Object
Returns the value of attribute tokens.
# File 'lib/ariel/token_stream.rb', line 18

def tokens
  @tokens
end
Instance Method Details
#advance ⇒ Object
Returns the current Token and consumes it.
# File 'lib/ariel/token_stream.rb', line 145

def advance
  return nil if @cur_pos > @tokens.size
  while true
    @cur_pos += 1
    current_token = @tokens[@cur_pos - 1]
    return nil if current_token.nil?
    return current_token
  end
end
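A sketch of how advance and current_token interact (token texts are illustrative):

  stream = Ariel::TokenStream.new
  stream.tokenize("one two")
  stream.current_token.text  # => "one", not yet consumed
  stream.advance.text        # => "one", returned and consumed
  stream.current_token.text  # => "two"
  stream.advance             # consumes "two"
  stream.advance             # => nil, the stream is exhausted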
#cache_hash ⇒ Object
Note that stream.cache_hash != stream.reverse.reverse.cache_hash, since reverse returns a deep clone whose Token objects are distinct from the originals.
# File 'lib/ariel/token_stream.rb', line 61

def cache_hash
  [@tokens, @reversed].hash
end
#contains_label_tags? ⇒ Boolean
# File 'lib/ariel/token_stream.rb', line 65

def contains_label_tags?
  @contains_label_tags
end
#current_token ⇒ Object
Returns the current Token.
# File 'lib/ariel/token_stream.rb', line 219

def current_token
  @tokens[@cur_pos]
end
#deep_clone ⇒ Object
Used to ensure operations such as @tokens.reverse! in one instance won't inadvertently affect another.
# File 'lib/ariel/token_stream.rb', line 100

def deep_clone
  Marshal::load(Marshal.dump(self))
end
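For instance, destructive operations on a deep clone leave the original stream untouched (a sketch):

  copy = stream.deep_clone
  copy.reverse!
  copy.reversed?    # => true
  stream.reversed?  # => false, the original is unaffected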
#each ⇒ Object
Iterates over and consumes every Token from the cur_pos.
# File 'lib/ariel/token_stream.rb', line 212

def each
  while (token = self.advance)
    yield token
  end
end
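Because each starts from cur_pos and consumes what it yields, any Enumerable method advances the stream (a sketch):

  stream.rewind
  stream.each {|token| puts token.text}  # consumes every remaining token
  stream.advance                         # => nil, nothing left to consume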
#raw_text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance’s stored tokens. It will not strip label tags even if the stream is marked to contain them. However, you should not expect to get the raw_text once any label_tags have been filtered (TokenStream#remove_label_tags).
# File 'lib/ariel/token_stream.rb', line 136

def raw_text(l_index=0, r_index=-1)
  return "" if @tokens.size == 0
  if reversed?
    l_index, r_index = r_index, l_index
  end
  @original_text[@tokens[l_index].start_loc...@tokens[r_index].end_loc]
end
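A sketch on a stream without label tags, assuming the default TOKEN_REGEXEN split "Price: 42" into the tokens "Price", ":" and "42":

  stream = Ariel::TokenStream.new
  stream.tokenize("Price: 42")
  stream.raw_text        # => "Price: 42"
  stream.raw_text(2, 2)  # => "42"
  stream.raw_text(0, 1)  # => "Price:" (from the start of token 0 to the end of token 1)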
#remove_label_tags ⇒ Object
Goes through all stored Token instances, removing them if Token#is_label_tag? is true. Called after a labeled document has been extracted to a tree, ready for the rule learning process.
# File 'lib/ariel/token_stream.rb', line 72

def remove_label_tags
  @tokens.delete_if {|token| token.is_label_tag?}
end
#reverse ⇒ Object
Returns a copy of the current instance with a reversed set of tokens. If it is set, the label_index is adjusted accordingly to point to the correct token.
# File 'lib/ariel/token_stream.rb', line 164

def reverse
  self.deep_clone.reverse!
end
#reverse! ⇒ Object
Same as TokenStream#reverse, but changes are made in place.
# File 'lib/ariel/token_stream.rb', line 175

def reverse!
  @tokens.reverse!
  if label_index
    @label_index = reverse_pos(@label_index)
  end
  @reversed = !@reversed
  return self
end
#reverse_pos(pos) ⇒ Object
Converts the given position so it points to the same token once the stream is reversed. The result is invalid when @tokens.size == 0.
# File 'lib/ariel/token_stream.rb', line 170

def reverse_pos(pos)
  @tokens.size - (pos + 1)
end
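The index arithmetic in action for a 5-token stream (a sketch):

  stream.size            # => 5
  stream.reverse_pos(1)  # => 3, i.e. 5 - (1 + 1)
  reversed = stream.reverse
  reversed.reversed?     # => true; the original stream is unchanged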
#reversed? ⇒ Boolean
Returns true or false depending on whether the TokenStream is in a reversed state.
# File 'lib/ariel/token_stream.rb', line 186

def reversed?
  @reversed
end
#rewind ⇒ Object
Return to the beginning of the TokenStream. Returns self.
# File 'lib/ariel/token_stream.rb', line 156

def rewind
  @cur_pos = 0
  self
end
#set_label_at(pos) ⇒ Object
Set a label at a given offset in the original text. Searches for a token with a start_loc equal to the position passed as an argument, and raises an error if one is not found.
# File 'lib/ariel/token_stream.rb', line 107

def set_label_at(pos)
  token_pos = nil
  @tokens.each_index {|i| token_pos = i if @tokens[i].start_loc == pos}
  if token_pos.nil?
    raise ArgumentError, "Given string position does not match the start of any token"
  else
    @label_index = token_pos
    Log.debug "Token ##{label_index} - \"#{@tokens[label_index].text}\" labeled."
    return @label_index
  end
end
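A sketch, assuming a stream tokenized from the illustrative "Price: 42" where the token "42" starts at string offset 7:

  stream.set_label_at(7)  # => 2, the index of the token "42"
  stream.label_index      # => 2
  stream.set_label_at(8)  # raises ArgumentError, no token starts at offset 8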
#size ⇒ Object
Returns the number of tokens in the TokenStream.
# File 'lib/ariel/token_stream.rb', line 191

def size
  @tokens.size
end
#skip_to(*features) ⇒ Object
Takes a list of Strings and Symbols as its arguments representing text to be matched in individual tokens and Wildcards. For a match to be a success, all wildcards and strings must match a consecutive sequence of Tokens in the TokenStream. All matched Tokens are consumed, and the TokenStream’s current position is returned on success. On failure, the TokenStream is returned to its original state and returns nil.
# File 'lib/ariel/token_stream.rb', line 201

def skip_to(*features)
  original_pos = @cur_pos
  self.each_cons(features.size) do |tokens|
    i = 0
    return @cur_pos if tokens.all? {|token| i += 1; token.matches?(features[i - 1])}
  end
  @cur_pos = original_pos # No match, return TokenStream to original state
  return nil
end
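A sketch using the illustrative "<b>Price: 42</b>" stream, whose tokens would be "<b>", "Price", ":", "42" and "</b>" under the default TOKEN_REGEXEN:

  stream.rewind
  stream.skip_to("Price", ":")  # => new current position; "<b>", "Price" and ":" are consumed
  stream.current_token.text     # => "42"
  stream.skip_to("missing")     # => nil, cur_pos is restored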
#slice_by_string_pos(left, right) ⇒ Object
Returns the slice of the current instance containing all the tokens between the token where the start_loc == the left parameter and the token where the end_loc == the right parameter.
# File 'lib/ariel/token_stream.rb', line 79

def slice_by_string_pos(left, right)
  l_index = nil
  r_index = nil
  @tokens.each_index {|i| l_index = i if @tokens[i].start_loc == left}
  @tokens.each_index {|i| r_index = i if @tokens[i].end_loc == right}
  if l_index.nil? or r_index.nil?
    raise ArgumentError, "Cannot slice between those locations"
  else
    return slice_by_token_index(l_index, r_index)
  end
end
#slice_by_token_index(l_index, r_index) ⇒ Object
Slices tokens between the l_index and the r_index inclusive.
# File 'lib/ariel/token_stream.rb', line 92

def slice_by_token_index(l_index, r_index)
  sliced = self.dup
  sliced.tokens = @tokens[l_index..r_index]
  return sliced
end
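Both slicing methods return a new TokenStream; a sketch on the illustrative "Price: 42" stream, where "Price" spans offsets 0..5 and "42" spans 7..9:

  slice = stream.slice_by_token_index(1, 2)  # tokens ":" and "42"
  slice.size                                 # => 2
  stream.slice_by_string_pos(0, 9)           # the whole stream: "Price" starts at 0, "42" ends at 9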
#text(l_index = 0, r_index = -1) ⇒ Object
Returns all text represented by the instance's stored tokens, stripping any label tags if the stream was declared to contain them when it was initialized (this would only happen during the process of loading labeled examples). See also TokenStream#raw_text.
# File 'lib/ariel/token_stream.rb', line 123

def text(l_index=0, r_index=-1)
  out = raw_text(l_index, r_index)
  if contains_label_tags?
    LabelUtils.clean_string(out)
  else
    out
  end
end
#tokenize(input, contains_label_tags = false) ⇒ Object
The tokenizer operates on a string by splitting it at every point it finds a match to a regular expression. Each match is added as a token, and the strings between matches are stored along with their original offsets. The same is then done with the next regular expression on each of these split strings, and new tokens are created with the correct offset in the original text. Any characters left unmatched by any of the regular expressions in TokenStream::TOKEN_REGEXEN are discarded. This approach allows a hierarchy of regular expressions to work simply and easily: a simple regular expression to match html tags might operate first, and later expressions that pick up runs of word characters can operate on what's left. If contains_label_tags is set to true when calling tokenize, the tokenizer will first remove and discard any occurrences of label tags (as defined by the regular expressions set in LabelUtils) before matching and adding tokens. Any label_tag tokens will be marked as such upon creation.
# File 'lib/ariel/token_stream.rb', line 50

def tokenize(input, contains_label_tags=false)
  string_array = [[input, 0]]
  @original_text = input
  @contains_label_tags = contains_label_tags
  LABEL_TAG_REGEXEN.each {|regex| split_string_array_by_regex(string_array, regex, false)} if contains_label_tags
  TOKEN_REGEXEN.each {|regex| split_string_array_by_regex(string_array, regex)}
  @tokens.sort!
  @tokens.size
end
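A sketch of the hierarchy at work: the html_tag wildcard claims the tags first, so the later number, word and catch-all expressions only see the text in between (expected output under the default TOKEN_REGEXEN):

  stream = Ariel::TokenStream.new
  stream.tokenize("<b>Total: 42</b>")  # => 5, the number of tokens created
  stream.tokens.collect {|t| t.text}   # => ["<b>", "Total", ":", "42", "</b>"]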