Class: Unisec::Blocks
- Inherits:
- Object
- Defined in:
- lib/unisec/blocks.rb
Overview
Operations on Unicode blocks
Constant Summary
- UCD_BLOCKS =
UCD Blocks file location
File.join(__dir__, '../../data/Blocks.txt')
- INVALID_RANGES =
List of invalid, private, and reserved ranges. Unassigned (unallocated) ranges are calculated dynamically in list_unassigned.
[
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze
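As an illustration of how a table shaped like INVALID_RANGES can be queried, here is a minimal sketch. It works on a local copy of the data so that it runs without the gem; `SAMPLE_INVALID_RANGES` and `invalid_range_name` are hypothetical names, not part of Unisec.

```ruby
# Local copy of the ranges documented above (hypothetical constant name).
SAMPLE_INVALID_RANGES = [
  { range: 0xd800..0xdfff, name: 'Surrogates (invalid outside UTF-16)' },
  { range: 0xe000..0xf8ff, name: 'Private Use Area (located in BMP)' },
  { range: 0xf0000..0xfffff, name: 'Supplementary Private Use Area-A' },
  { range: 0x100000..0x10ffff, name: 'Supplementary Private Use Area-B' }
].freeze

# Hypothetical helper (not a Unisec method): returns the name of the
# range containing the code point, or nil when no range matches.
def invalid_range_name(code_point)
  entry = SAMPLE_INVALID_RANGES.find { |r| r[:range].include?(code_point) }
  entry && entry[:name]
end

p invalid_range_name(0xd83d) # a high surrogate
p invalid_range_name(0x41)   # 'A' is in none of these ranges
```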
Class Method Summary
-
.block(block_arg, with_count: false) ⇒ Hash|nil
Find the block including the target character or code point, or matching the provided name.
-
.block_display(block_arg, with_count: false) ⇒ Object
Display a CLI-friendly output detailing the searched block.
-
.count_char_in_block(range) ⇒ Integer
Count the number of characters allocated in a block.
-
.list(with_count: false) ⇒ Array<Hash>
List Unicode block names. ⚠️ The char count may be wrong for CJK UNIFIED IDEOGRAPH blocks because they are poorly described in DerivedName.txt.
-
.list_display(with_count: false) ⇒ Object
Display a CLI-friendly output listing all blocks.
-
.list_invalid_display ⇒ Object
Display a CLI-friendly output listing all invalid and unassigned ranges.
-
.list_unassigned ⇒ Array<Range>
List unassigned (unallocated) ranges.
-
.ucd_blocks_version ⇒ String
Returns the version of Unicode used in the local UCD file (data/Blocks.txt).
Class Method Details
.block(block_arg, with_count: false) ⇒ Hash|nil
Find the block including the target character or code point, or matching the provided name.
# File 'lib/unisec/blocks.rb', line 99
def self.block(block_arg, with_count: false) # rubocop:disable Metrics/AbcSize,Metrics/CyclomaticComplexity,Metrics/MethodLength,Metrics/PerceivedComplexity
  file = File.new(UCD_BLOCKS)
  found = false
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    case block_arg
    when Integer # block_arg is an integer code point
      found = true if blk_range.include?(block_arg)
    when String # can be a char or block name or a string code point
      if block_arg.size == 1 # is a char (1 code point, not one grapheme)
        found = true if blk_range.include?(Utils::String.convert_to_integer(block_arg))
      elsif block_arg.start_with?('U+') # string code point
        found = true if blk_range.include?(Utils::String.convert(block_arg, :integer))
      elsif blk_name.downcase == block_arg.downcase # block name
        found = true
      end
    end
    if found
      return { range: blk_range, name: blk_name, range_size: with_count ? blk_range.size : nil,
               char_count: with_count ? count_char_in_block(blk_range) : nil }
    end
  end
  nil # not found
end
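The lookup above can be sketched without the gem or its data files. The following is a simplified stand-in, not the library's implementation: it parses a small inline excerpt in the Blocks.txt format and only handles Integer code points and block names. `SAMPLE_BLOCKS`, `hex_to_range`, and `find_block` are hypothetical names; `hex_to_range` is assumed to behave roughly like `Unisec::Utils::String.to_range`.

```ruby
# Small excerpt in the Blocks.txt format (sample data, not the full file).
SAMPLE_BLOCKS = <<~TXT
  # Blocks-15.1.0.txt

  0000..007F; Basic Latin
  0080..00FF; Latin-1 Supplement
  0370..03FF; Greek and Coptic
TXT

# Hypothetical stand-in for Unisec::Utils::String.to_range:
# turns "0000..007F" into the Integer range 0..127.
def hex_to_range(str)
  first, last = str.split('..').map { |h| h.to_i(16) }
  first..last
end

# Hypothetical helper: find a block by Integer code point or by
# case-insensitive block name. Returns a Hash, or nil when nothing matches.
def find_block(text, arg)
  text.each_line(chomp: true) do |line|
    # Skip empty lines and comments
    next if line.empty? || line.start_with?('#')

    raw_range, name = line.split(';')
    range = hex_to_range(raw_range)
    name = name.strip
    match = case arg
            when Integer then range.include?(arg)
            when String then name.casecmp?(arg)
            end
    return { range: range, name: name } if match
  end
  nil # not found
end

p find_block(SAMPLE_BLOCKS, 0x3b1)         # U+03B1 GREEK SMALL LETTER ALPHA
p find_block(SAMPLE_BLOCKS, 'basic latin') # lookup by name
```

With the real method, the equivalent calls would presumably be `Unisec::Blocks.block(0x3b1)` and `Unisec::Blocks.block('basic latin')`, which additionally accept a single character or a `U+` string.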
.block_display(block_arg, with_count: false) ⇒ Object
Display a CLI-friendly output detailing the searched block.
# File 'lib/unisec/blocks.rb', line 176
def self.block_display(block_arg, with_count: false)
  blk = block(block_arg, with_count: with_count)
  if blk.nil?
    puts "no block found with #{block_arg}"
  else
    display = ->(key, value) { puts Paint[key, :red, :bold] + " #{value}" }
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]))
    display.call('Name:', blk[:name])
    if with_count
      display.call('Range size:', blk[:range_size])
      display.call('Char count:', blk[:char_count])
    end
  end
  nil
end
.count_char_in_block(range) ⇒ Integer
Count the number of characters allocated in a block. ⚠️ The char count may be wrong for CJK UNIFIED IDEOGRAPH blocks because they are poorly described in DerivedName.txt.
# File 'lib/unisec/blocks.rb', line 65
def self.count_char_in_block(range) # rubocop:disable Metrics/AbcSize
  counter = 0
  file = File.new(Rugrep::UCD_DERIVEDNAME)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point as integer and the name
    cp_int, _name = line.split(';')
    if cp_int.include?('..') # handle ranges in DerivedName.txt
      ucd_range = Utils::String.to_range(cp_int)
      next unless range.include_range?(ucd_range)

      counter += ucd_range.size
      next
    end
    cp_int = cp_int.chomp.to_i(16)
    next unless range.include?(cp_int)

    counter += 1
    break if cp_int == range.end
  end
  counter
end
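The counting logic can be illustrated on a few DerivedName.txt-style lines. This is a sketch under assumptions, not the library code: it reads from an inline sample instead of the real file, uses core `Range#cover?` (Ruby 2.6+) where the source calls `include_range?`, and omits the early `break`. `SAMPLE_DERIVED_NAMES` and `count_in_range` are hypothetical names.

```ruby
# A few lines in the DerivedName.txt format (sample excerpt only).
SAMPLE_DERIVED_NAMES = <<~TXT
  # DerivedName-style sample
  0041          ; LATIN CAPITAL LETTER A
  0042          ; LATIN CAPITAL LETTER B
  0391          ; GREEK CAPITAL LETTER ALPHA
  3400..4DBF    ; CJK UNIFIED IDEOGRAPH-*
TXT

# Hypothetical helper: count the code points listed in `text` that fall
# inside `range`. Range entries count only when fully covered by `range`.
def count_in_range(text, range)
  counter = 0
  text.each_line(chomp: true) do |line|
    next if line.empty? || line.start_with?('#')

    field = line.split(';').first.strip
    if field.include?('..') # entry is itself a range of code points
      first, last = field.split('..').map { |h| h.to_i(16) }
      entry = first..last
      counter += entry.size if range.cover?(entry)
    elsif range.include?(field.to_i(16))
      counter += 1
    end
  end
  counter
end

p count_in_range(SAMPLE_DERIVED_NAMES, 0x0000..0x007f) # => 2
p count_in_range(SAMPLE_DERIVED_NAMES, 0x3400..0x4dbf) # => 6592
```

The second call shows why range entries matter: a single line in the file can account for thousands of characters.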
.list(with_count: false) ⇒ Array<Hash>
List Unicode block names. ⚠️ The char count may be wrong for CJK UNIFIED IDEOGRAPH blocks because they are poorly described in DerivedName.txt. ⚠️ Populating char_count is slow and can take a few seconds.
# File 'lib/unisec/blocks.rb', line 38
def self.list(with_count: false)
  out = []
  file = File.new(UCD_BLOCKS)
  file.each_line(chomp: true) do |line|
    # Skip if the line is empty or a comment
    next if line.empty? || line[0] == '#'

    # parse the line to extract code point range and the name
    blk_range, blk_name = line.split(';')
    blk_range = Unisec::Utils::String.to_range(blk_range)
    blk_name.lstrip!
    out << { range: blk_range, name: blk_name, range_size: with_count ? blk_range.size : nil,
             char_count: with_count ? count_char_in_block(blk_range) : nil }
  end
  out
end
.list_display(with_count: false) ⇒ Object
Display a CLI-friendly output listing all blocks.
# File 'lib/unisec/blocks.rb', line 158
def self.list_display(with_count: false) # rubocop:disable Metrics/AbcSize
  blocks = list(with_count: with_count)
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  blocks.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    if with_count
      display.call('Range size:', blk[:range_size], 8)
      display.call('Char count:', blk[:char_count], 0)
    end
    puts
  end
  nil
end
.list_invalid_display ⇒ Object
Display a CLI-friendly output listing all invalid and unassigned ranges.
# File 'lib/unisec/blocks.rb', line 193
def self.list_invalid_display # rubocop:disable Metrics/AbcSize
  display = ->(key, value, just) { print Paint[key, :red, :bold] + " #{value}".ljust(just) }
  puts '(Assigned) invalid, private, reserved ranges:'
  INVALID_RANGES.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk[:range]), 22)
    display.call('Name:', blk[:name], 50)
    puts
  end
  puts "\nUnasigned, unallocated ranges:"
  list_unassigned.each do |blk|
    display.call('Range:', Utils::Range.range2codepoint_range(blk), 22)
    puts
  end
  nil
end
.list_unassigned ⇒ Array<Range>
List unassigned (unallocated) ranges.
# File 'lib/unisec/blocks.rb', line 138
def self.list_unassigned # rubocop:disable Metrics/AbcSize
  base = (0x0000..0x10ffff)
  assigned = Unisec::Blocks.list.map { |b| b[:range] }
  unassigned = []
  cursor = base.begin
  assigned.each do |r|
    unassigned << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end

  unassigned << (cursor..base.end) if cursor <= base.end
  unassigned
end
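The gap-finding sweep used here is easier to follow on small numbers. The sketch below isolates the same cursor logic in a hypothetical `gaps` helper that takes the base range and the assigned ranges as arguments. It assumes, as the method above appears to, that the assigned ranges are sorted by start and do not overlap.

```ruby
# Hypothetical helper isolating the cursor sweep: returns the sub-ranges of
# `base` not covered by `assigned` (assumed sorted and non-overlapping).
def gaps(base, assigned)
  result = []
  cursor = base.begin
  assigned.each do |r|
    # Anything between the cursor and the start of this range is a gap
    result << (cursor..(r.begin - 1)) if cursor < r.begin
    cursor = r.end + 1
    break if cursor > base.end
  end
  # Trailing gap after the last assigned range
  result << (cursor..base.end) if cursor <= base.end
  result
end

p gaps(0..20, [0..4, 5..9, 15..17]) # => [10..14, 18..20]
p gaps(0..9, [0..9])                # => []
```

Adjacent ranges such as `0..4` and `5..9` produce no gap because the cursor lands exactly on the start of the next range.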
.ucd_blocks_version ⇒ String
Returns the version of Unicode used in the local UCD file (data/Blocks.txt).
# File 'lib/unisec/blocks.rb', line 25
def self.ucd_blocks_version
  first_line = File.open(UCD_BLOCKS, &:readline)
  first_line.match(/-(\d+\.\d+\.\d+)\.txt/).captures.first
end
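The version extraction relies on the first line of Blocks.txt carrying a file name such as `Blocks-15.1.0.txt`. A minimal sketch of the same regular expression applied to a string follows; `blocks_version` is a hypothetical helper, and the header line is a sample value. Unlike the method above, which would raise on a non-matching line, this sketch returns nil.

```ruby
# Hypothetical helper: extract "X.Y.Z" from a header line such as
# "# Blocks-15.1.0.txt". Returns nil when the pattern does not match.
def blocks_version(first_line)
  match = first_line.match(/-(\d+\.\d+\.\d+)\.txt/)
  match && match[1]
end

p blocks_version('# Blocks-15.1.0.txt') # => "15.1.0"
p blocks_version('# no version here')   # => nil
```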