Class: Mindee::Input::Source::LocalInputSource
- Inherits: Object
  - Object
  - Mindee::Input::Source::LocalInputSource
- Defined in: lib/mindee/input/sources.rb
Overview
Base class for loading documents.
Direct Known Subclasses
Base64InputSource, BytesInputSource, FileInputSource, PathInputSource
Instance Attribute Summary
- #file_mimetype ⇒ String readonly
- #filename ⇒ String readonly
- #io_stream ⇒ StringIO readonly
Instance Method Summary
- #count_pdf_pages ⇒ Object
- #initialize(io_stream, filename, fix_pdf: false) ⇒ LocalInputSource (constructor)
  A new instance of LocalInputSource.
- #pdf? ⇒ Boolean
  Shorthand for pdf mimetype validation.
- #process_pdf(options) ⇒ Object
  Parses a PDF file according to provided options.
- #read_document(close: true) ⇒ Array<String, [String, aBinaryString], [Hash, nil]>
  Reads a document.
- #rescue_broken_pdf(stream) ⇒ Object
  Attempts to fix pdf files if mimetype is rejected.
Constructor Details
#initialize(io_stream, filename, fix_pdf: false) ⇒ LocalInputSource
Returns a new instance of LocalInputSource.
    # File 'lib/mindee/input/sources.rb', line 57

    def initialize(io_stream, filename, fix_pdf: false)
      @io_stream = io_stream
      @filename = filename
      @file_mimetype = if fix_pdf
                         Marcel::MimeType.for @io_stream
                       else
                         Marcel::MimeType.for @io_stream, name: @filename
                       end
      return if ALLOWED_MIME_TYPES.include? @file_mimetype

      if filename.end_with?('.pdf') && fix_pdf
        rescue_broken_pdf(@io_stream)
        @file_mimetype = Marcel::MimeType.for @io_stream

        return if ALLOWED_MIME_TYPES.include? @file_mimetype
      end

      raise InvalidMimeTypeError, @file_mimetype.to_s
    end
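The constructor's validation flow can be sketched without the gem's dependencies. The version below is a minimal illustration, not the library's code: `sniff_mimetype` is a hypothetical stand-in for `Marcel::MimeType.for` that only recognises the `%PDF-` magic bytes and a `.pdf` extension, and `ALLOWED` is an assumed one-entry allow-list. It illustrates one plausible reading of why `fix_pdf: true` drops the filename hint: detection then relies on content alone, so a PDF with junk before its header is reported as an unknown type instead of being accepted because of its extension.

```ruby
require 'stringio'

# Assumed allow-list for this sketch; the real ALLOWED_MIME_TYPES is longer.
ALLOWED = ['application/pdf'].freeze

# Hypothetical stand-in for Marcel::MimeType.for: checks the magic bytes
# first, then falls back to the file extension when a name is given.
def sniff_mimetype(io, name: nil)
  io.rewind
  head = io.read(5)
  io.rewind
  return 'application/pdf' if head == '%PDF-'
  return 'application/pdf' if name && name.downcase.end_with?('.pdf')

  'application/octet-stream'
end

# Mirrors the constructor's decision: content-only detection when fix_pdf
# is set, content plus filename hint otherwise.
def detect(io, filename, fix_pdf: false)
  mimetype = fix_pdf ? sniff_mimetype(io) : sniff_mimetype(io, name: filename)
  [mimetype, ALLOWED.include?(mimetype)]
end

# A PDF whose real header is preceded by injected junk.
broken = StringIO.new("junk header\n%PDF-1.7\n")
with_hint = detect(broken, 'scan.pdf')
content_only = detect(broken, 'scan.pdf', fix_pdf: true)
```

In this sketch the broken file passes on its extension when the filename hint is used, and is rejected on content when `fix_pdf: true` is set. That rejection is the case in which the real constructor goes on to call `rescue_broken_pdf` and re-detect the mimetype.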
Instance Attribute Details
#file_mimetype ⇒ String (readonly)
    # File 'lib/mindee/input/sources.rb', line 50

    def file_mimetype
      @file_mimetype
    end
#filename ⇒ String (readonly)
    # File 'lib/mindee/input/sources.rb', line 48

    def filename
      @filename
    end
#io_stream ⇒ StringIO (readonly)
    # File 'lib/mindee/input/sources.rb', line 52

    def io_stream
      @io_stream
    end
Instance Method Details
#count_pdf_pages ⇒ Object
    # File 'lib/mindee/input/sources.rb', line 122

    def count_pdf_pages
      return 1 unless pdf?

      @io_stream.seek(0)
      pdf_processor = Mindee::PDF::PdfProcessor.open_pdf(@io_stream)
      pdf_processor.pages.size
    end
#pdf? ⇒ Boolean
Shorthand for pdf mimetype validation.
    # File 'lib/mindee/input/sources.rb', line 94

    def pdf?
      @file_mimetype.to_s == 'application/pdf'
    end
#process_pdf(options) ⇒ Object
Parses a PDF file according to provided options.
    # File 'lib/mindee/input/sources.rb', line 106

    def process_pdf(options)
      @io_stream.seek(0)
      @io_stream = PdfProcessor.parse(@io_stream, options)
    end
#read_document(close: true) ⇒ Array<String, [String, aBinaryString], [Hash, nil]>
Reads a document.
    # File 'lib/mindee/input/sources.rb', line 114

    def read_document(close: true)
      @io_stream.seek(0)
      # Avoids needlessly re-packing some files
      data = @io_stream.read
      @io_stream.close if close
      ['document', data, { filename: Mindee::Input::Source.convert_to_unicode_escape(@filename) }]
    end
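The read pattern used here (rewind, read everything, optionally close, return a three-element array) can be tried in isolation. The sketch below is only an illustration: `read_document_sketch` is a hypothetical free function, the filename is made up, and the call to `Mindee::Input::Source.convert_to_unicode_escape` is left out so that the example needs nothing beyond the standard library.

```ruby
require 'stringio'

# Stand-alone sketch of the read_document pattern: rewind, read everything,
# optionally close, and return a [field name, payload, metadata] triple.
# The real method also passes the filename through
# Mindee::Input::Source.convert_to_unicode_escape, which is omitted here.
def read_document_sketch(io, filename, close: true)
  io.seek(0)
  data = io.read
  io.close if close
  ['document', data, { filename: filename }]
end

io = StringIO.new('%PDF-1.7 example bytes')
io.read(4) # simulate an earlier partial read

first = read_document_sketch(io, 'invoice.pdf', close: false)
second = read_document_sketch(io, 'invoice.pdf') # closes the stream
```

Because the stream is rewound before each read, both calls return the full payload even after a partial read. Passing `close: false` keeps the stream usable for a later call, while the default closes it.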
#rescue_broken_pdf(stream) ⇒ Object
Attempts to fix PDF files if the mimetype is rejected. "Broken PDFs" are often the result of a third party injecting invalid headers. This method attempts to remove them so that the file can still be sent.
    # File 'lib/mindee/input/sources.rb', line 81

    def rescue_broken_pdf(stream)
      stream.gets('%PDF-')
      raise UnfixablePDFError if stream.eof? || stream.pos > 500

      stream.pos = stream.pos - 5
      data = stream.read
      @io_stream.close

      @io_stream = StringIO.new
      @io_stream << data
    end
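The header-stripping idea can be reproduced on a plain `StringIO`. The following is a minimal sketch under stated assumptions, not the gem's implementation: `strip_junk_before_pdf_header`, its `max_offset` keyword, and `UnfixablePdfSketchError` are hypothetical names introduced for illustration, and the sample bytes are invented.

```ruby
require 'stringio'

# Hypothetical error class standing in for the gem's UnfixablePDFError.
class UnfixablePdfSketchError < StandardError; end

# Skips any bytes before the first "%PDF-" marker and returns a fresh
# StringIO, positioned at 0, that starts at the marker.
def strip_junk_before_pdf_header(stream, max_offset: 500)
  stream.rewind
  stream.gets('%PDF-') # consumes everything up to and including the marker
  # No marker found (read ran to the end), or the marker sits too deep.
  raise UnfixablePdfSketchError if stream.eof? || stream.pos > max_offset

  stream.pos -= 5 # step back over "%PDF-" so the marker is kept
  StringIO.new(stream.read)
end

fixed = strip_junk_before_pdf_header(
  StringIO.new("HTTP/1.1 200 OK\r\n\r\n%PDF-1.7\nbody")
)
```

Unlike the method above, which appends the data to an empty `StringIO` with `<<` and therefore leaves the position at the end of the data, this sketch returns a stream positioned at 0.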