Class: Spacy::Doc
Overview
See also the spaCy Python API documentation for [`Doc`](spacy.io/api/doc).
Instance Attribute Summary collapse
-
#py_doc ⇒ Object
readonly
A Python `Doc` instance accessible via `PyCall`.
-
#py_nlp ⇒ Object
readonly
A Python `Language` instance accessible via `PyCall`.
-
#text ⇒ String
readonly
A text string of the document.
Class Method Summary collapse
-
.from_bytes(nlp, byte_string) ⇒ Doc
Restores a doc from binary data created by #to_bytes.
Instance Method Summary collapse
-
#[](range) ⇒ Object
Returns a span if given a range object, or a token if given an integer representing a position in the doc.
-
#displacy(style: "dep", compact: false) ⇒ String
Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
-
#each ⇒ Object
Iterates over the elements in the doc yielding a token instance each time.
-
#ents ⇒ Array<Span>
Returns an array of spans each representing a named entity.
-
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc
constructor
It is recommended to use the Language#read method to create a doc.
-
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
-
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
-
#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?
Sends a text completion request to OpenAI’s chat API.
-
#openai_embeddings(access_token: nil, model: "text-embedding-3-small") ⇒ Array<Float>?
Generates text embeddings using OpenAI’s embeddings API.
-
#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil) ⇒ String?
Sends a query to OpenAI’s chat completion API with optional tool support.
-
#respond_to_missing?(sym, *args) ⇒ Boolean
-
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text merging a span into a single token.
-
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text splitting the specified token.
-
#sents ⇒ Array<Span>
Returns an array of spans each representing a sentence.
-
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
-
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc.
-
#to_bytes ⇒ String
Serializes the doc to a binary string.
-
#to_s ⇒ String
String representation of the document.
-
#tokens ⇒ Array<Token>
Returns an array of tokens contained in the doc.
Constructor Details
#initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0) ⇒ Doc
It is recommended to use the Language#read method to create a doc. If you need to create one using #initialize, there are two method signatures: `Spacy::Doc.new(nlp_id, py_doc: Object)` and `Spacy::Doc.new(nlp_id, text: String)`.
# File 'lib/ruby-spacy.rb', line 79

def initialize(nlp, py_doc: nil, text: nil, max_retrial: MAX_RETRIAL, retrial: 0)
  @py_nlp = nlp
  @py_doc = py_doc || @py_doc = nlp.call(text)
  @text = @py_doc.text
rescue StandardError
  retrial += 1
  raise "Error: Failed to construct a Doc object" unless retrial <= max_retrial

  sleep 0.5
  initialize(nlp, py_doc: py_doc, text: text, max_retrial: max_retrial, retrial: retrial)
end
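A minimal construction sketch (the model name "en_core_web_sm" and the `Spacy::Language.new` call are assumptions; any installed spaCy model works). The examples further down reuse this `nlp` pipeline.

```ruby
require "ruby-spacy"

# Assumption: the en_core_web_sm model is installed on the Python side
nlp = Spacy::Language.new("en_core_web_sm")

# Preferred: let Language#read build the Doc
doc = nlp.read("Tokyo is the capital of Japan.")

# Equivalent low-level form using the text: keyword described above
doc2 = Spacy::Doc.new(nlp.py_nlp, text: "Tokyo is the capital of Japan.")

puts doc.text
```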
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(name, *args) ⇒ Object
Methods defined in Python but not wrapped in ruby-spacy can be called by this dynamic method handling mechanism.
# File 'lib/ruby-spacy.rb', line 404

def method_missing(name, *args)
  @py_doc.send(name, *args)
end
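For example, a spaCy Doc attribute with no Ruby wrapper still resolves through this fallback (a sketch; `lang_` and `has_annotation` belong to spaCy's Python Doc, not to ruby-spacy itself):

```ruby
doc = nlp.read("I love coffee.")

# These calls are forwarded to the underlying Python Doc via method_missing
puts doc.lang_                   # language code of the pipeline, e.g. "en"
puts doc.has_annotation("DEP")   # whether dependency annotations are present
```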
Instance Attribute Details
#py_doc ⇒ Object (readonly)
Returns a Python `Doc` instance accessible via `PyCall`.
# File 'lib/ruby-spacy.rb', line 62

def py_doc
  @py_doc
end
#py_nlp ⇒ Object (readonly)
Returns a Python `Language` instance accessible via `PyCall`.
# File 'lib/ruby-spacy.rb', line 59

def py_nlp
  @py_nlp
end
#text ⇒ String (readonly)
Returns a text string of the document.
# File 'lib/ruby-spacy.rb', line 65

def text
  @text
end
Class Method Details
.from_bytes(nlp, byte_string) ⇒ Doc
Restores a doc from binary data created by #to_bytes. This is useful for caching processed documents to avoid re-processing.
# File 'lib/ruby-spacy.rb', line 229

def self.from_bytes(nlp, byte_string)
  py_bytes = PyCall.eval("bytes(#{byte_string.bytes})")
  py_doc = nlp.py_nlp.call("").from_bytes(py_bytes)
  new(nlp.py_nlp, py_doc: py_doc)
end
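A caching round trip pairing #to_bytes with .from_bytes might look like this (the file name is arbitrary):

```ruby
doc = nlp.read("Apple is looking at buying U.K. startup for $1 billion.")

# Serialize once and cache to disk
File.binwrite("doc_cache.bin", doc.to_bytes)

# Later: restore the annotations without re-running the pipeline
restored = Spacy::Doc.from_bytes(nlp, File.binread("doc_cache.bin"))
puts restored.text
```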
Instance Method Details
#[](range) ⇒ Object
Returns a span if given a range object, or a token if given an integer representing a position in the doc.
# File 'lib/ruby-spacy.rb', line 195

def [](range)
  if range.is_a?(Range)
    py_span = @py_doc[range]
    Span.new(self, start_index: py_span.start, end_index: py_span.end - 1)
  else
    Token.new(@py_doc[range])
  end
end
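A quick sketch of both index forms (the `text` readers on Token and Span are assumed here):

```ruby
doc = nlp.read("Autonomous cars shift insurance liability toward manufacturers")

token = doc[0]      # Integer index => Spacy::Token
span  = doc[0..1]   # Range         => Spacy::Span

puts token.text     # "Autonomous"
puts span.text      # "Autonomous cars"
```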
#displacy(style: "dep", compact: false) ⇒ String
Visualize the document in one of two styles: “dep” (dependencies) or “ent” (named entities).
# File 'lib/ruby-spacy.rb', line 239

def displacy(style: "dep", compact: false)
  PyDisplacy.render(py_doc, style: style, options: { compact: compact }, jupyter: false)
end
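Since the method returns markup as a string, it can be written straight to a file (a sketch; the file names are arbitrary):

```ruby
doc = nlp.read("She saw the boy with a telescope.")

# Dependency view, compact layout
File.write("dependency.html", doc.displacy(style: "dep", compact: true))

# Named-entity view
File.write("entities.html", doc.displacy(style: "ent"))
```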
#each ⇒ Object
Iterates over the elements in the doc yielding a token instance each time.
# File 'lib/ruby-spacy.rb', line 131

def each
  PyCall::List.call(@py_doc).each do |py_token|
    yield Token.new(py_token)
  end
end
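Because #each yields Token objects, iterating is straightforward (a sketch; the token attributes shown are the ones used elsewhere in this class):

```ruby
doc = nlp.read("I love coffee.")

doc.each do |token|
  puts "#{token.text}\t#{token.lemma}\t#{token.pos}"
end
```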
#ents ⇒ Array<Span>
Returns an array of spans each representing a named entity.
# File 'lib/ruby-spacy.rb', line 181

def ents
  # so that ents can be "each"-ed in Ruby
  ent_array = []
  PyCall::List.call(@py_doc.ents).each do |ent|
    ent.define_singleton_method :label do
      label_
    end
    ent_array << ent
  end
  ent_array
end
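For instance (a sketch; `text` on the yielded entity comes from the underlying Python span, and `label` is the singleton method attached above):

```ruby
doc = nlp.read("Apple was founded by Steve Jobs in California.")

doc.ents.each do |ent|
  puts "#{ent.text}\t#{ent.label}"
end
```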
#noun_chunks ⇒ Array<Span>
Returns an array of spans representing noun chunks.
# File 'lib/ruby-spacy.rb', line 159

def noun_chunks
  chunk_array = []
  py_chunks = PyCall::List.call(@py_doc.noun_chunks)
  py_chunks.each do |py_chunk|
    chunk_array << Span.new(self, start_index: py_chunk.start, end_index: py_chunk.end - 1)
  end
  chunk_array
end
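A short sketch:

```ruby
doc = nlp.read("Autonomous cars shift insurance liability toward manufacturers.")

doc.noun_chunks.each do |chunk|
  puts chunk.text   # e.g. "Autonomous cars", "insurance liability", "manufacturers"
end
```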
#openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini") ⇒ String?
Sends a text completion request to OpenAI’s chat API.
# File 'lib/ruby-spacy.rb', line 361

def openai_completion(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini")
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000

  messages = [
    { role: "system", content: "Complete the text input by the user." },
    { role: "user", content: @text }
  ]

  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  client = OpenAIClient.new(access_token: access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature
  )
  response.dig("choices", 0, "message", "content")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
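A usage sketch (assumes OPENAI_API_KEY is set in the environment; the token limit and temperature values are arbitrary):

```ruby
doc = nlp.read("When in Rome, ")

# Returns the completion string, or nil if the API call fails
result = doc.openai_completion(max_completion_tokens: 50, temperature: 0.5)
puts result if result
```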
#openai_embeddings(access_token: nil, model: "text-embedding-3-small") ⇒ Array<Float>?
Generates text embeddings using OpenAI’s embeddings API.
# File 'lib/ruby-spacy.rb', line 391

def openai_embeddings(access_token: nil, model: "text-embedding-3-small")
  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  client = OpenAIClient.new(access_token: access_token)
  response = client.embeddings(model: model, input: @text)
  response.dig("data", 0, "embedding")
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
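A usage sketch (again assuming OPENAI_API_KEY is set):

```ruby
doc = nlp.read("Vector representations are useful for semantic search.")

embedding = doc.openai_embeddings   # defaults to "text-embedding-3-small"
puts embedding.size if embedding    # dimensionality of the returned vector
```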
#openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil) ⇒ String?
Sends a query to OpenAI’s chat completion API with optional tool support. The get_tokens tool allows the model to request token-level linguistic analysis.
# File 'lib/ruby-spacy.rb', line 254

def openai_query(access_token: nil, max_completion_tokens: nil, max_tokens: nil, temperature: 0.7, model: "gpt-5-mini", messages: [], prompt: nil)
  # Support both max_completion_tokens and max_tokens for backward compatibility
  max_completion_tokens ||= max_tokens || 1000

  if messages.empty?
    messages = [
      { role: "system", content: prompt },
      { role: "user", content: @text }
    ]
  end

  access_token ||= ENV["OPENAI_API_KEY"]
  raise "Error: OPENAI_API_KEY is not set" unless access_token

  # Tool definition for token analysis (GPT-5 tools API format)
  tools = [
    {
      type: "function",
      function: {
        name: "get_tokens",
        description: "Tokenize given text and return a list of tokens with their attributes: surface, lemma, tag, pos (part-of-speech), dep (dependency), ent_type (entity type), and morphology",
        parameters: {
          type: "object",
          properties: {
            text: {
              type: "string",
              description: "text to be tokenized"
            }
          },
          required: ["text"]
        }
      }
    }
  ]

  client = OpenAIClient.new(access_token: access_token)
  response = client.chat(
    model: model,
    messages: messages,
    max_completion_tokens: max_completion_tokens,
    temperature: temperature,
    tools: tools,
    tool_choice: "auto"
  )

  message = response.dig("choices", 0, "message")

  # Handle tool calls (GPT-5 format)
  if message["tool_calls"] && !message["tool_calls"].empty?
    messages << message

    message["tool_calls"].each do |tool_call|
      function_name = tool_call.dig("function", "name")
      tool_call_id = tool_call["id"]

      case function_name
      when "get_tokens"
        result = tokens.map do |t|
          {
            surface: t.text,
            lemma: t.lemma,
            pos: t.pos,
            tag: t.tag,
            dep: t.dep,
            ent_type: t.ent_type,
            morphology: t.morphology
          }
        end.to_json

        messages << {
          role: "tool",
          tool_call_id: tool_call_id,
          content: result
        }
      end
    end

    # Recursive call to get final response after tool execution
    openai_query(
      access_token: access_token,
      max_completion_tokens: max_completion_tokens,
      temperature: temperature,
      model: model,
      messages: messages,
      prompt: prompt
    )
  else
    message["content"]
  end
rescue OpenAIClient::APIError => e
  puts "Error: OpenAI API call failed - #{e.message}"
  nil
end
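A usage sketch (the prompt text is arbitrary; the model may or may not decide to call the get_tokens tool):

```ruby
doc = nlp.read("The cat sat on the mat.")

answer = doc.openai_query(prompt: "List the nouns in the text together with their lemmas.")
puts answer if answer
```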
#respond_to_missing?(sym, *args) ⇒ Boolean
# File 'lib/ruby-spacy.rb', line 408

def respond_to_missing?(sym, *args)
  sym ? true : super
end
#retokenize(start_index, end_index, attributes = {}) ⇒ Object
Retokenizes the text merging a span into a single token.
# File 'lib/ruby-spacy.rb', line 96

def retokenize(start_index, end_index, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    retokenizer.merge(@py_doc[start_index..end_index], attrs: attributes)
  end
end
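For example, merging a multi-token name into a single token (the indices refer to the tokenization produced by the pipeline assumed earlier):

```ruby
doc = nlp.read("I live in New York City.")
puts doc.tokens.size   # 7 tokens with the assumed tokenization

# Merge tokens 3..5 ("New", "York", "City") into one token
doc.retokenize(3, 5)
puts doc.tokens.size   # 5 tokens after the merge
```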
#retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {}) ⇒ Object
Retokenizes the text splitting the specified token.
# File 'lib/ruby-spacy.rb', line 107

def retokenize_split(pos_in_doc, split_array, head_pos_in_split, ancestor_pos, attributes = {})
  PyCall.with(@py_doc.retokenize) do |retokenizer|
    heads = [[@py_doc[pos_in_doc], head_pos_in_split], @py_doc[ancestor_pos]]
    retokenizer.split(@py_doc[pos_in_doc], split_array, heads: heads, attrs: attributes)
  end
end
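A sketch mirroring the split example in the spaCy documentation: token 3 ("NewYork") is split into "New" and "York", "New" is headed by position 1 within the split ("York"), and "York" is headed by doc token 2 ("in"):

```ruby
doc = nlp.read("I live in NewYork")

# Split token 3 into two pieces and attach their heads as described above
doc.retokenize_split(3, ["New", "York"], 1, 2)
doc.each { |token| puts token.text }
```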
#sents ⇒ Array<Span>
Returns an array of spans each representing a sentence.
# File 'lib/ruby-spacy.rb', line 170

def sents
  sentence_array = []
  py_sentences = PyCall::List.call(@py_doc.sents)
  py_sentences.each do |py_sent|
    sentence_array << Span.new(self, start_index: py_sent.start, end_index: py_sent.end - 1)
  end
  sentence_array
end
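For example:

```ruby
doc = nlp.read("This is a sentence. This is another sentence.")

doc.sents.each do |sent|
  puts sent.text
end
```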
#similarity(other) ⇒ Float
Returns a semantic similarity estimate.
# File 'lib/ruby-spacy.rb', line 207

def similarity(other)
  py_doc.similarity(other.py_doc)
end
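A sketch (meaningful scores require a model with word vectors, e.g. en_core_web_md or en_core_web_lg rather than the small model assumed earlier; the model name is an assumption):

```ruby
nlp_md = Spacy::Language.new("en_core_web_md")   # assumed model with vectors

doc1 = nlp_md.read("I like salty fries and hamburgers.")
doc2 = nlp_md.read("Fast food tastes very good.")

puts doc1.similarity(doc2)   # Float, roughly between 0.0 and 1.0
```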
#span(range_or_start, optional_size = nil) ⇒ Span
Returns a span of the specified range within the doc. The method can be used in either of two ways: `Doc#span(range)` or `Doc#span(start_index, size_of_span)`.
# File 'lib/ruby-spacy.rb', line 142

def span(range_or_start, optional_size = nil)
  if optional_size
    start_index = range_or_start
    temp = tokens[start_index...start_index + optional_size]
  else
    start_index = range_or_start.first
    range = range_or_start
    temp = tokens[range]
  end
  end_index = start_index + temp.size - 1
  Span.new(self, start_index: start_index, end_index: end_index)
end
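Both calling conventions in one sketch:

```ruby
doc = nlp.read("Autonomous cars shift insurance liability toward manufacturers")

span_a = doc.span(0..1)   # range form
span_b = doc.span(0, 2)   # start index plus span size

puts span_a.text          # "Autonomous cars"
puts span_b.text          # same span
```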
#to_bytes ⇒ String
Serializes the doc to a binary string. The binary data includes all annotations (tokens, entities, etc.) and can be used to restore the doc later without re-processing.
# File 'lib/ruby-spacy.rb', line 218

def to_bytes
  @py_doc.to_bytes.force_encoding(Encoding::BINARY)
end
#to_s ⇒ String
String representation of the document.
# File 'lib/ruby-spacy.rb', line 116

def to_s
  @text
end