Class: Spacy::Doc
Overview
See also the spaCy Python API documentation for [`Doc`](spacy.io/api/doc).
Instance Attribute Summary
- #py_doc ⇒ Object (readonly)
  A Python `Doc` instance accessible via `PyCall`.
- #py_nlp ⇒ Object (readonly)
  A Python `Language` instance accessible via `PyCall`.
- #text ⇒ String (readonly)
  A text string of the document.
Instance Method Summary
- #[](range) ⇒ Object
  Returns a span if given a range object, or a token if given an integer position in the doc.
- #displacy(style: "dep", compact: false) ⇒ String
  Visualizes the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
- #each ⇒ Object
  Iterates over the elements in the doc, yielding a token instance each time.
- #ents ⇒ Array<Span>
  Returns an array of spans, each representing a named entity.
- #initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc (constructor)
  It is recommended to use the Language#read method to create a doc.
- #method_missing(name, *args) ⇒ Object
  Methods defined in Python but not wrapped in ruby-spacy can be called through this dynamic method handling mechanism.
- #noun_chunks ⇒ Array<Span>
  Returns an array of spans representing noun chunks.
- #openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-4o-mini") ⇒ Object
- #openai_embeddings(access_token: nil, model: "text-embedding-ada-002") ⇒ Object
- #openai_query(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-4o-mini", messages: [], prompt: nil) ⇒ Object
- #respond_to_missing?(sym) ⇒ Boolean
- #retokenize(start_index, end_index, attributes = {}) ⇒ Object
  Retokenizes the text, merging a span into a single token.
- #retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
  Retokenizes the text, splitting the specified token.
- #sents ⇒ Array<Span>
  Returns an array of spans, each representing a sentence.
- #similarity(other) ⇒ Float
  Returns a semantic similarity estimate.
- #span(range_or_start, optional_size = nil) ⇒ Span
  Returns a span of the specified range within the doc.
- #to_s ⇒ String
  String representation of the document.
- #tokens ⇒ Array<Token>
  Returns an array of tokens contained in the doc.
Constructor Details
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc
It is recommended to use the Language#read method to create a doc. If you need to create one using #initialize, there are two call forms: `Spacy::Doc.new(nlp_id, py_doc: Object)` and `Spacy::Doc.new(nlp_id, text: String)`.
    # File 'lib/ruby-spacy.rb', line 87
    def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0)
      @py_nlp = nlp
      @py_doc = py_doc || @py_doc = nlp.call(text)
      @text = @py_doc.text
    rescue StandardError
      retrial += 1
      raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

      sleep 0.5
      initialize(nlp, py_doc: py_doc, text: text, max_retrial: max_retrial, retrial: retrial)
    end
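The retry behaviour of the constructor can be tried out without spaCy installed. The following is a minimal sketch, not the library's code: `FlakyPipeline` and `RetryingDoc` are hypothetical stand-ins, the `MAX_RETRIAL` value of 5 is an assumption made only for illustration, the pause is shortened, and Ruby's `retry` keyword is used in place of the recursive call shown in the listing.

```ruby
# Hypothetical stand-in for a pipeline whose first few calls fail.
class FlakyPipeline
  attr_reader :calls

  def initialize(failures)
    @failures = failures
    @calls = 0
  end

  # Raises until it has been called more than `failures` times.
  def call(text)
    @calls += 1
    raise IOError, "pipeline not ready" if @calls <= @failures

    text.upcase
  end
end

# Hypothetical wrapper that retries construction a bounded number of times.
class RetryingDoc
  MAX_RETRIAL = 5 # assumed value, for illustration only

  attr_reader :result

  def initialize(pipeline, text, max_retrial: MAX_RETRIAL)
    retrial = 0
    begin
      @result = pipeline.call(text)
    rescue StandardError
      retrial += 1
      raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

      sleep 0.01 # short pause before the next attempt
      retry
    end
  end
end

pipeline = FlakyPipeline.new(2)
doc = RetryingDoc.new(pipeline, "hello")
puts doc.result     # => HELLO
puts pipeline.calls # => 3
```

Bounding the number of attempts means a persistent failure surfaces as an exception instead of looping forever.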
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
    # File 'lib/ruby-spacy.rb', line 350
    def method_missing(name, *args)
      @py_doc.send(name, *args)
    end
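The delegation pattern can be illustrated in plain Ruby. In this sketch, `FakePyDoc` and `Wrapper` are hypothetical names, and an ordinary Ruby object stands in for the Python `Doc` that ruby-spacy reaches through PyCall.

```ruby
# Hypothetical stand-in for the wrapped Python object.
class FakePyDoc
  def lang_
    "en"
  end

  def char_span(start_char, end_char)
    "chars #{start_char}...#{end_char}"
  end
end

# Hypothetical wrapper that forwards any method it does not define itself.
class Wrapper
  def initialize(inner)
    @inner = inner
  end

  def method_missing(name, *args)
    @inner.send(name, *args)
  end
end

doc = Wrapper.new(FakePyDoc.new)
puts doc.lang_           # => en
puts doc.char_span(0, 5) # => chars 0...5
```

A call to a name that the wrapped object does not define either still ends in a `NoMethodError`, raised by the wrapped object rather than the wrapper.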
Instance Attribute Details
#py_doc ⇒ Object (readonly)
Returns a Python `Doc` instance accessible via `PyCall`.
    # File 'lib/ruby-spacy.rb', line 70
    def py_doc
      @py_doc
    end
#py_nlp ⇒ Object (readonly)
Returns a Python `Language` instance accessible via `PyCall`.
    # File 'lib/ruby-spacy.rb', line 67
    def py_nlp
      @py_nlp
    end
#text ⇒ String (readonly)
Returns a text string of the document.
    # File 'lib/ruby-spacy.rb', line 73
    def text
      @text
    end
Instance Method Details
#[](range) ⇒ Object
Returns a span if given a range object, or a token if given an integer position in the doc.
    # File 'lib/ruby-spacy.rb', line 203
    def [](range)
      if range.is_a?(Range)
        py_span = @py_doc[range]
        Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
      else
        Token.new(@py_doc[range])
      end
    end
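The dispatch on argument type, and the conversion from an exclusive end index to an inclusive one, can be sketched with a plain array. `WordList` is a hypothetical stand-in for the doc, it returns a hash instead of a `Span`, and it assumes ranges with a non-negative start.

```ruby
# Hypothetical stand-in: a list of words playing the role of the doc.
class WordList
  def initialize(words)
    @words = words
  end

  # Returns span-like data for a Range, or a single word for an Integer.
  def [](index)
    if index.is_a?(Range)
      slice = @words[index]
      start_index = index.begin
      # Python-style spans report an exclusive end; keep an inclusive one instead.
      exclusive_end = start_index + slice.size
      { start_index: start_index, end_index: exclusive_end - 1, words: slice }
    else
      @words[index]
    end
  end
end

doc = WordList.new(%w[I like New York])
puts doc[2]              # => New
span = doc[2..3]
puts span[:start_index]  # => 2
puts span[:end_index]    # => 3
puts span[:words].join(" ") # => New York
```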
#displacy(style: "dep", compact: false) ⇒ String
Visualizes the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
    # File 'lib/ruby-spacy.rb', line 223
    def displacy(style: "dep", compact: false)
      PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
    end
#each ⇒ Object
Iterates over the elements in the doc, yielding a token instance each time.
    # File 'lib/ruby-spacy.rb', line 139
    def each
      PyCall::List.call(@py_doc).each do |py_token|
        yield Token.new(py_token)
      end
    end
#ents ⇒ Array<Span>
Returns an array of spans, each representing a named entity.
    # File 'lib/ruby-spacy.rb', line 189
    def ents
      # so that ents can be "each"-ed in Ruby
      ent_array = []
      PyCall::List.call(@py_doc.ents).each do |ent|
        ent.define_singleton_method :label do
          label_
        end
        ent_array << ent
      end
      ent_array
    end
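The listing adds a `label` reader to each entity that returns the value of `label_`. The same technique can be sketched in plain Ruby; `FakeEnt` is a hypothetical stand-in for a Python span, and the entity texts and labels are made up for illustration.

```ruby
# Hypothetical stand-in for a Python span, which exposes its label as `label_`.
FakeEnt = Struct.new(:text, :label_)

ents = [FakeEnt.new("Apple", "ORG"), FakeEnt.new("U.K.", "GPE")]

# Give each object its own `label` method that reads `label_`.
ents.each do |ent|
  ent.define_singleton_method(:label) { label_ }
end

ents.each { |ent| puts "#{ent.text}: #{ent.label}" }
# Apple: ORG
# U.K.: GPE
```

Because the method is defined per object, only the entities returned by this call gain the `label` reader.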
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
    # File 'lib/ruby-spacy.rb', line 167
    def noun_chunks
      chunk_array = []
      py_chunks = PyCall::List.call(@py_doc.noun_chunks)
      py_chunks.each do |py_chunk|
        chunk_array << Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
      end
      chunk_array
    end
#openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-4o-mini") ⇒ Object
    # File 'lib/ruby-spacy.rb', line 305
    def openai_completion(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-4o-mini")
      messages = [
        { role: "system", content: "Complete the text input by the user." },
        { role: "user", content: @text }
      ]
      access_token ||= ENV["OPENAI_API_KEY"]
      raise "Error: OPENAI_API_KEY is not set" unless access_token

      begin
        response = Spacy.openai_client(access_token: access_token).chat(
          parameters: {
            model: model,
            messages: messages,
            max_tokens: max_tokens,
            temperature: temperature
          }
        )
        response.dig("choices", 0, "message", "content")
      rescue StandardError => e
        puts "Error: OpenAI API call failed."
        pp e.message
        pp e.backtrace
      end
    end
#openai_embeddings(access_token: nil, model: "text-embedding-ada-002") ⇒ Object
    # File 'lib/ruby-spacy.rb', line 330
    def openai_embeddings(access_token: nil, model: "text-embedding-ada-002")
      access_token ||= ENV["OPENAI_API_KEY"]
      raise "Error: OPENAI_API_KEY is not set" unless access_token

      begin
        response = Spacy.openai_client(access_token: access_token).embeddings(
          parameters: {
            model: model,
            input: @text
          }
        )
        response.dig("data", 0, "embedding")
      rescue StandardError => e
        puts "Error: OpenAI API call failed."
        pp e.message
        pp e.backtrace
      end
    end
#openai_query(access_token: nil, max_tokens: 1000, temperature: 0.7, model: "gpt-4o-mini", messages: [], prompt: nil) ⇒ Object
    # File 'lib/ruby-spacy.rb', line 227
    def openai_query(access_token: nil,
                     max_tokens: 1000,
                     temperature: 0.7,
                     model: "gpt-4o-mini",
                     messages: [],
                     prompt: nil)
      if messages.empty?
        messages = [
          { role: "system", content: prompt },
          { role: "user", content: @text }
        ]
      end

      access_token ||= ENV["OPENAI_API_KEY"]
      raise "Error: OPENAI_API_KEY is not set" unless access_token

      begin
        response = Spacy.openai_client(access_token: access_token).chat(
          parameters: {
            model: model,
            messages: messages,
            max_tokens: max_tokens,
            temperature: temperature,
            function_call: "auto",
            stream: false,
            functions: [
              {
                name: "get_tokens",
                description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
                "parameters": {
                  "type": "object",
                  "properties": {
                    "text": {
                      "type": "string",
                      "description": "text to be tokenized"
                    }
                  },
                  "required": ["text"]
                }
              }
            ]
          }
        )

        message = response.dig("choices", 0, "message")

        if message["role"] == "assistant" && message["function_call"]
          messages << message
          function_name = message.dig("function_call", "name")
          _args = JSON.parse(message.dig("function_call", "arguments"))
          case function_name
          when "get_tokens"
            res = tokens.map do |t|
              {
                "surface": t.text,
                "lemma": t.lemma,
                "pos": t.pos,
                "tag": t.tag,
                "dep": t.dep,
                "ent_type": t.ent_type,
                "morphology": t.morphology
              }
            end.to_json
          end
          messages << { role: "system", content: res }
          openai_query(access_token: access_token, max_tokens: max_tokens,
                       temperature: temperature, model: model,
                       messages: messages, prompt: prompt)
        else
          message["content"]
        end
      rescue StandardError => e
        puts "Error: OpenAI API call failed."
        pp e.message
        pp e.backtrace
      end
    end
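The control flow of `openai_query` (ask the model, answer a function call with token data, then ask again) can be followed offline with a stub. This is only a sketch under stated assumptions: `FakeChatClient` and `query` are hypothetical, the stub returns canned responses shaped like the hashes the listing reads with `dig`, and no network call is made.

```ruby
require "json"

# Hypothetical client: requests the function on the first call, answers on the second.
class FakeChatClient
  def initialize
    @calls = 0
  end

  def chat(parameters:)
    @calls += 1
    message =
      if @calls == 1
        { "role" => "assistant",
          "function_call" => { "name" => "get_tokens", "arguments" => '{"text":"hi"}' } }
      else
        { "role" => "assistant", "content" => "saw #{parameters[:messages].size} messages" }
      end
    { "choices" => [{ "message" => message }] }
  end
end

# Hypothetical helper mirroring the loop: recurse until the reply has no function call.
def query(client, messages, tokens)
  response = client.chat(parameters: { messages: messages })
  message = response.dig("choices", 0, "message")
  return message["content"] unless message["function_call"]

  messages << message                                        # keep the model's request
  _args = JSON.parse(message.dig("function_call", "arguments"))
  messages << { role: "system", content: tokens.to_json }    # supply the token data
  query(client, messages, tokens)
end

messages = [{ role: "user", content: "Describe the tokens." }]
answer = query(FakeChatClient.new, messages, [{ surface: "hi", pos: "INTJ" }])
puts answer # => saw 3 messages
```

Each round trip appends two entries to `messages`, so the conversation history grows with every function call the model requests.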
#respond_to_missing?(sym) ⇒ Boolean
    # File 'lib/ruby-spacy.rb', line 354
    def respond_to_missing?(sym)
      sym ? true : super
    end
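A `respond_to_missing?` that returns true for every symbol makes `respond_to?` report any method name as available. The sketch below contrasts that with a variant that asks the wrapped object. Both classes are hypothetical, a Ruby string stands in for the wrapped Python object, and the two-argument signature is the one documented for `Object#respond_to_missing?` rather than the one in the listing.

```ruby
# Hypothetical wrapper that claims to respond to every name.
class PermissiveWrapper
  def initialize(inner)
    @inner = inner
  end

  def method_missing(name, *args)
    @inner.send(name, *args)
  end

  def respond_to_missing?(sym, include_private = false)
    sym ? true : super
  end
end

# Hypothetical wrapper that only claims names the wrapped object responds to.
class StrictWrapper
  def initialize(inner)
    @inner = inner
  end

  def method_missing(name, *args)
    @inner.send(name, *args)
  end

  def respond_to_missing?(sym, include_private = false)
    @inner.respond_to?(sym, include_private) || super
  end
end

puts PermissiveWrapper.new("abc").respond_to?(:no_such_method) # => true
puts StrictWrapper.new("abc").respond_to?(:upcase)             # => true
puts StrictWrapper.new("abc").respond_to?(:no_such_method)     # => false
```

Whether the stricter check is practical for PyCall-backed objects depends on how they report their attributes, which this sketch does not cover.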
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text, merging a span into a single token.
    # File 'lib/ruby-spacy.rb', line 104
    def retokenize(start_index, end_index, attributes = {})
      PyCall.with(@py_doc.retokenize) do |retokenizer|
        retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
      end
    end
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text, splitting the specified token.
    # File 'lib/ruby-spacy.rb', line 115
    def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
      PyCall.with(@py_doc.retokenize) do |retokenizer|
        heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
        retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
      end
    end
#sents ⇒ Array<Span>
Returns an array of spans, each representing a sentence.
    # File 'lib/ruby-spacy.rb', line 178
    def sents
      sentence_array = []
      py_sentences = PyCall::List.call(@py_doc.sents)
      py_sentences.each do |py_sent|
        sentence_array << Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
      end
      sentence_array
    end
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
    # File 'lib/ruby-spacy.rb', line 215
    def similarity(other)
      py_doc.similarity(other.py_doc)
    end
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc. The method can be called in either of two ways: `Doc#span(range)` or `Doc#span(start, size_of_span)`.
    # File 'lib/ruby-spacy.rb', line 150
    def span(range_or_start, optional_size = nil)
      if optional_size
        start_index = range_or_start
        temp = tokens[start_index...start_index + optional_size]
      else
        start_index = range_or_start.first
        range = range_or_start
        temp = tokens[range]
      end

      end_index = start_index + temp.size - 1

      Span.new(self, start_index: start_index, end_index: end_index)
    end
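The index arithmetic behind the two call forms can be checked on a plain array. `span_bounds` is a hypothetical helper that mirrors only the computation of the start and end indices and returns them as a pair instead of building a `Span`; the token list is made up for illustration.

```ruby
# Hypothetical helper: returns [start_index, end_index] for either call form.
def span_bounds(tokens, range_or_start, optional_size = nil)
  if optional_size
    # Form 2: a start position plus a number of tokens.
    start_index = range_or_start
    selected = tokens[start_index...start_index + optional_size]
  else
    # Form 1: a Range of token positions.
    start_index = range_or_start.first
    selected = tokens[range_or_start]
  end
  [start_index, start_index + selected.size - 1]
end

tokens = %w[Give it back ! He pleaded .]
p span_bounds(tokens, 1..3) # => [1, 3]
p span_bounds(tokens, 4, 2) # => [4, 5]
```

In both forms the end index is inclusive, which matches the `end_index` values passed to `Span.new` elsewhere on this page.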
#to_s ⇒ String
String representation of the document.
    # File 'lib/ruby-spacy.rb', line 124
    def to_s
      @text
    end
#tokens ⇒ Array<Token>
Returns an array of tokens contained in the doc.
    # File 'lib/ruby-spacy.rb', line 130
    def tokens
      results = []
      PyCall::List.call(@py_doc).each do |py_token|
        results << Token.new(py_token)
      end
      results
    end