Class: Spacy::Doc

Inherits: Object
Includes: Enumerable
Defined in: lib/ruby-spacy.rb

Overview

See also the spaCy Python API documentation for [`Doc`](spacy.io/api/doc).

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc

It is recommended to use the Language#read method to create a doc. If you need to create one with #initialize, there are two method signatures: `Spacy::Doc.new(nlp_id, py_doc: Object)` and `Spacy::Doc.new(nlp_id, text: String)`.

Parameters:

  • nlp (Language)

    an instance of Language class

  • py_doc (Object) (defaults to: nil)

    an instance of the Python `Doc` class

  • text (String) (defaults to: nil)

    the text string to be analyzed



# File 'lib/ruby-spacy.rb', line 79

def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL,
               retrial: 0)
  @py_nlp = nlp
  @py_doc = py_doc || nlp.call(text)
  @text = @py_doc.text
rescue StandardError
  retrial += 1
  raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

  sleep 0.5
  initialize(nlp, py_doc: py_doc, text: text, max_retrial: max_retrial, retrial: retrial)
end
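The rescue clause above follows a common back-off-and-retry pattern: catch the error, sleep briefly, and re-attempt up to a limit. A minimal pure-Ruby sketch of the same pattern, with no spaCy required (`with_retries` is a hypothetical helper, and the block stands in for a flaky `nlp.call`):

```ruby
# Generic retry helper mirroring the constructor's rescue/retry flow
# (the real class re-invokes #initialize instead of using `retry`).
def with_retries(max_retrial: 3, delay: 0)
  attempts = 0
  begin
    yield
  rescue StandardError
    attempts += 1
    raise "Error: Failed after #{max_retrial} retries" if attempts > max_retrial

    sleep delay
    retry
  end
end

# A stand-in for a flaky call that fails twice, then succeeds.
calls = 0
result = with_retries(max_retrial: 3) do
  calls += 1
  raise "transient failure" if calls < 3

  "ok"
end
result # => "ok" after two retries
```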

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *args) ⇒ Object

Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.



# File 'lib/ruby-spacy.rb', line 404

def method_missing(name, *args)
  @py_doc.send(name, *args)
end
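The delegation pattern itself can be demonstrated in plain Ruby with a stub object standing in for the PyCall-wrapped Python `Doc` (both `PyDocStub` and `DocLike` below are illustrative names, not part of ruby-spacy):

```ruby
# Stub standing in for the PyCall-wrapped Python Doc object.
class PyDocStub
  def lang_
    "en"
  end
end

# Minimal wrapper showing the same delegation Spacy::Doc uses.
class DocLike
  def initialize(py_doc)
    @py_doc = py_doc
  end

  # Unknown methods are forwarded to the wrapped object.
  def method_missing(name, *args)
    @py_doc.send(name, *args)
  end

  def respond_to_missing?(sym, *args)
    @py_doc.respond_to?(sym) || super
  end
end

doc = DocLike.new(PyDocStub.new)
doc.lang_ # => "en", served by the stub via method_missing
```

Any attribute defined on the Python side (`lang_`, `vocab`, `has_vector`, and so on) becomes reachable from Ruby this way without an explicit wrapper method.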

Instance Attribute Details

#py_docObject (readonly)

Returns a Python `Doc` instance accessible via `PyCall`.

Returns:

  • (Object)

    a Python `Doc` instance accessible via `PyCall`



# File 'lib/ruby-spacy.rb', line 62

def py_doc
  @py_doc
end

#py_nlpObject (readonly)

Returns a Python `Language` instance accessible via `PyCall`.

Returns:

  • (Object)

    a Python `Language` instance accessible via `PyCall`



# File 'lib/ruby-spacy.rb', line 59

def py_nlp
  @py_nlp
end

#textString (readonly)

Returns a text string of the document.

Returns:

  • (String)

    a text string of the document



# File 'lib/ruby-spacy.rb', line 65

def text
  @text
end

Class Method Details

.from_bytes(nlp, byte_string) ⇒ Doc

Restores a doc from binary data created by #to_bytes. This is useful for caching processed documents to avoid re-processing.

Examples:

Load doc from file

bytes = File.binread("doc.bin")
doc = Spacy::Doc.from_bytes(nlp, bytes)

Parameters:

  • byte_string (String)

    binary data from #to_bytes

Returns:

  • (Doc)

    the restored doc



# File 'lib/ruby-spacy.rb', line 229

def self.from_bytes(nlp, byte_string)
  py_bytes = PyCall.eval("bytes(#{byte_string.bytes})")
  py_doc = nlp.py_nlp.call("").from_bytes(py_bytes)
  new(nlp.py_nlp, py_doc: py_doc)
end

Instance Method Details

#[](range) ⇒ Object

Returns a span if given a range object, or a token if given an integer representing a position in the doc.

Parameters:

  • range (Range)

    an ordinary Ruby range object such as `0..3`, `1...4`, or `3..-1`



# File 'lib/ruby-spacy.rb', line 195

def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    Token.new(@py_doc[range])
  end
end
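The Range/Integer dispatch mirrors Ruby's own `Array#[]`. Note that spaCy spans use an exclusive end index, which is why the wrapper stores `py_span.end - 1` as `end_index`. A plain-Ruby illustration of the indexing semantics, using an array of strings in place of a doc:

```ruby
tokens = %w[Apple is looking at buying a startup]

# Range -> a sub-sequence (Doc#[] wraps this in a Span)
tokens[1..3]  # => ["is", "looking", "at"]
tokens[3..-1] # open-ended ranges are accepted too

# Integer -> a single element (Doc#[] wraps this in a Token)
tokens[2]     # => "looking"
```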

#displacy(style: "dep", compact: false) ⇒ String

Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).

Parameters:

  • style (String) (defaults to: "dep")

    either `dep` or `ent`

  • compact (Boolean) (defaults to: false)

    only relevant to the `dep` style

Returns:

  • (String)

    in the case of `dep`, the output is an SVG; in the `ent` style, it is HTML.



# File 'lib/ruby-spacy.rb', line 239

def displacy(style: "dep", compact: false)
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
end

#eachObject

Iterates over the elements in the doc yielding a token instance each time.



# File 'lib/ruby-spacy.rb', line 131

def each
  PyCall::List.call(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end

#entsArray<Span>

Returns an array of spans each representing a named entity.

Returns:



# File 'lib/ruby-spacy.rb', line 181

def ents
  # so that ents can be "each"-ed in Ruby
  ent_array = []
  PyCall::List.call(@py_doc.ents).each do |ent|
    ent.define_singleton_method :label do
      label_
    end
    ent_array << ent
  end
  ent_array
end
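The `define_singleton_method` call above attaches a Ruby-friendly `label` alias to each individual entity object, delegating to spaCy's Python-style `label_` attribute. The trick works on any object; here is a pure-Ruby demonstration using an `OpenStruct` stub in place of the PyCall entity:

```ruby
require "ostruct"

# Stub entity exposing spaCy's Python-style label_ attribute.
ent = OpenStruct.new(label_: "ORG", text: "Apple")

# Attach a `label` alias to this one object, exactly as
# Doc#ents does for each entity it collects.
ent.define_singleton_method(:label) { label_ }

ent.label # => "ORG"
```

Inside the block, `self` is the entity itself, so the bare `label_` call resolves to that object's own attribute.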

#noun_chunksArray<Span>

Returns an array of spans representing noun chunks.

Returns:



# File 'lib/ruby-spacy.rb', line 159

def noun_chunks
  chunk_array = []
  py_chunks = PyCall::List.call(@py_doc.noun_chunks)
  py_chunks.each do |py_chunk|
    chunk_array << Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
  chunk_array
end

#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?

Sends a text completion request to OpenAI’s chat API.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • max_completion_tokens (Integer) (defaults to: nil)

    Maximum tokens in the response

  • max_tokens (Integer) (defaults to: nil)

    Alias for max_completion_tokens (deprecated, for backward compatibility)

  • temperature (Float) (defaults to: 0.7)

    Sampling temperature (ignored for GPT-5 models)

  • model (String) (defaults to: "gpt-5-mini")

    The model to use (default: gpt-5-mini)

Returns:

  • (String, nil)

    The completed text



# File 'lib/ruby-spacy.rb', line 361

def openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini")
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000

  messages = [
    { role: "system", content: "Complete the text input by the user." },
    { role: "user", content: @text }
  ]

  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  client = OpenAIClient.new(access_token: access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature
  )
  response.dig("choices", 0, "message", "content")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#openai_embeddings(access_token: nil, model: "text-embedding-3-small") ⇒ Array<Float>?

Generates text embeddings using OpenAI’s embeddings API.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • model (String) (defaults to: "text-embedding-3-small")

    The embeddings model (default: text-embedding-3-small)

Returns:

  • (Array<Float>, nil)

    The embedding vector



# File 'lib/ruby-spacy.rb', line 391

def openai_embeddings(access_token: nil, model: "text-embedding-3-small")
  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  client = OpenAIClient.new(access_token: access_token)
  response = client.embeddings(model: model, input: @text)
  response.dig("data", 0, "embedding")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil) ⇒ String?

Sends a query to OpenAI’s chat completion API with optional tool support. The get_tokens tool allows the model to request token-level linguistic analysis.

Parameters:

  • access_token (String, nil) (defaults to: nil)

    OpenAI API key (defaults to OPENAI_API_KEY env var)

  • max_completion_tokens (Integer) (defaults to: nil)

    Maximum tokens in the response

  • max_tokens (Integer) (defaults to: nil)

    Alias for max_completion_tokens (deprecated, for backward compatibility)

  • temperature (Float) (defaults to: 0.7)

    Sampling temperature (ignored for GPT-5 models)

  • model (String) (defaults to: "gpt-5-mini")

    The model to use (default: gpt-5-mini)

  • messages (Array<Hash>) (defaults to: [])

    Conversation history (for recursive tool calls)

  • prompt (String, nil) (defaults to: nil)

    System prompt for the query

Returns:

  • (String, nil)

    The model’s response content



# File 'lib/ruby-spacy.rb', line 254

def openai_query(access_token: nil,
                 max_completion_tokens: nil,
                 max_tokens: nil,
                 temperature: 0.7,
                 model: "gpt-5-mini",
                 messages: [],
                 prompt: nil)
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000
  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: @text }
    ]
  end

  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  # Tool definition for token analysis (GPT-5 tools API format)
  tools = [
    {
      type: "function",
      function: {
        name: "get_tokens",
        description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
        parameters: {
          type: "object",
          properties: {
            text: {
              type: "string",
              description: "text to be tokenized"
            }
          },
          required: ["text"]
        }
      }
    }
  ]

  client = OpenAIClient.new(access_token: access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature,
    tools: tools,
    tool_choice: "auto"
  )

  message = response.dig("choices", 0, "message")

  # Handle tool calls (GPT-5 format)
  if message["tool_calls"] && !message["tool_calls"].empty?
    messages << message

    message["tool_calls"].each do |tool_call|
      function_name = tool_call.dig("function", "name")
      tool_call_id = tool_call["id"]

      case function_name
      when "get_tokens"
        result = tokens.map do |t|
          {
            surface: t.text,
            lemma: t.lemma,
            pos: t.pos,
            tag: t.tag,
            dep: t.dep,
            ent_type: t.ent_type,
            morphology: t.morphology
          }
        end.to_json

        messages << {
          role: "tool",
          tool_call_id: tool_call_id,
          content: result
        }
      end
    end

    # Recursive call to get final response after tool execution
    openai_query(
      access_token: access_token,
      max_completion_tokens: max_completion_tokens,
      temperature: temperature,
      model: model,
      messages: messages,
      prompt: prompt
    )
  else
    message["content"]
  end
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end

#respond_to_missing?(sym, *args) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/ruby-spacy.rb', line 408

def respond_to_missing?(sym, *args)
  sym ? true : super
end

#retokenize(start_index, end_index, attributes = {}) ⇒ Object

Retokenizes the text merging a span into a single token.

Parameters:

  • start_index (Integer)

    the start position of the span to be retokenized in the document

  • end_index (Integer)

    the end position of the span to be retokenized in the document

  • attributes (Hash) (defaults to: {})

    attributes to set on the merged token



# File 'lib/ruby-spacy.rb', line 96

def retokenize(start_index, end_index, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
  end
end

#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object

Retokenizes the text splitting the specified token.

Parameters:

  • pos_in_doc (Integer)

    the position of the span to be retokenized in the document

  • split_array (Array<String>)

    text strings of the split results

  • head_pos_in_split (Integer)

    the position of the head element within the split results

  • ancestor_pos (Integer)

    the position of the immediate ancestor element of the split elements in the document

  • attributes (Hash) (defaults to: {})

    the attributes of the split elements



# File 'lib/ruby-spacy.rb', line 107

def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
    retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
  end
end

#sentsArray<Span>

Returns an array of spans each representing a sentence.

Returns:



# File 'lib/ruby-spacy.rb', line 170

def sents
  sentence_array = []
  py_sentences = PyCall::List.call(@py_doc.sents)
  py_sentences.each do |py_sent|
    sentence_array << Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
  sentence_array
end

#similarity(other) ⇒ Float

Returns a semantic similarity estimate.

Parameters:

  • other (Doc)

    the other doc to which a similarity estimation is made

Returns:

  • (Float)


# File 'lib/ruby-spacy.rb', line 207

def similarity(other)
  py_doc.similarity(other.py_doc)
end

#span(range_or_start, optional_size = nil) ⇒ Span

Returns a span of the specified range within the doc. The method can be called in either of two ways: `Doc#span(range)` or `Doc#span(start_pos, size_of_span)`.

Parameters:

  • range_or_start (Range, Integer)

    a range object, or, alternatively, an integer that represents the start position of the span

  • optional_size (Integer) (defaults to: nil)

    an integer representing the size of the span

Returns:



# File 'lib/ruby-spacy.rb', line 142

def span(range_or_start, optional_size = nil)
  if optional_size
    start_index = range_or_start
    temp = tokens[start_index...start_index + optional_size]
  else
    start_index = range_or_start.first
    range = range_or_start
    temp = tokens[range]
  end

  end_index = start_index + temp.size - 1

  Span.new(self, start_index: start_index, end_index: end_index)
end
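Both signatures reduce to the same start/end arithmetic shown in the listing: compute the start index, slice the tokens, then derive the inclusive end index as `start + size - 1`. A pure-Ruby sketch using an array of strings in place of the token list:

```ruby
tokens = %w[Apple is looking at buying a U.K. startup]

# Signature 1: Doc#span(range)
range       = 1..3
start_index = range.first
end_index   = start_index + tokens[range].size - 1 # => 3

# Signature 2: Doc#span(start_pos, size_of_span)
start_pos = 2
size      = 3
selected  = tokens[start_pos...start_pos + size]   # => ["looking", "at", "buying"]
end_pos   = start_pos + selected.size - 1          # => 4
```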

#to_bytesString

Serializes the doc to a binary string. The binary data includes all annotations (tokens, entities, etc.) and can be used to restore the doc later without re-processing.

Examples:

Save doc to file

doc = nlp.read("Hello world")
File.binwrite("doc.bin", doc.to_bytes)

Returns:

  • (String)

    binary representation of the doc



# File 'lib/ruby-spacy.rb', line 218

def to_bytes
  @py_doc.to_bytes.force_encoding(Encoding::BINARY)
end
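The `force_encoding(Encoding::BINARY)` call matters because the serialized data is arbitrary bytes, not valid UTF-8 text; tagging the string as binary prevents encoding errors when writing it with `File.binwrite` or comparing it byte-for-byte. The mechanics in plain Ruby:

```ruby
# force_encoding retags a string's bytes without changing them.
data   = "héllo".dup # UTF-8 string; "é" occupies two bytes
binary = data.force_encoding(Encoding::BINARY)

binary.encoding # => #<Encoding:ASCII-8BIT>
binary.bytesize # => 6

# Retagging back recovers the original text, because the bytes
# were never modified (dup avoids mutating `binary` itself).
binary.dup.force_encoding(Encoding::UTF_8) == "héllo" # => true
```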

#to_sString

String representation of the document.

Returns:

  • (String)


# File 'lib/ruby-spacy.rb', line 116

def to_s
  @text
end

#tokensArray<Token>

Returns an array of tokens contained in the doc.

Returns:



# File 'lib/ruby-spacy.rb', line 122

def tokens
  results = []
  PyCall::List.call(@py_doc).each do |py_token|
    results << Token.new(py_token)
  end
  results
end