Class: Datasets::WikipediaKyotoJapaneseEnglish
- Includes: TarGzReadable
- Defined in: lib/datasets/wikipedia-kyoto-japanese-english.rb
Defined Under Namespace
Classes: Article, ArticleListener, Entry, Paragraph, Section, Sentence, Title
Instance Attribute Summary
Attributes inherited from Dataset
Instance Method Summary
- #each(&block) ⇒ Object
- #initialize(type: :article) ⇒ WikipediaKyotoJapaneseEnglish (constructor): A new instance of WikipediaKyotoJapaneseEnglish.
Methods included from TarGzReadable
Methods inherited from Dataset
Constructor Details
#initialize(type: :article) ⇒ WikipediaKyotoJapaneseEnglish
Returns a new instance of WikipediaKyotoJapaneseEnglish. The type keyword must be :article (the default) or :lexicon; any other value raises ArgumentError.
# File 'lib/datasets/wikipedia-kyoto-japanese-english.rb', line 55
def initialize(type: :article)
  unless [:article, :lexicon].include?(type)
    raise ArgumentError, "Please set type :article or :lexicon: #{type.inspect}"
  end

  super()
  @type = type
  @metadata.id = "wikipedia-kyoto-japanese-english"
  @metadata.name =
    "The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles"
  @metadata.url = "https://alaginrc.nict.go.jp/WikiCorpus/index_E.html"
  @metadata.licenses = ["CC-BY-SA-3.0"]
  @metadata.description = <<-DESCRIPTION
"The Japanese-English Bilingual Corpus of Wikipedia's Kyoto Articles"
aims mainly at supporting research and development relevant to
high-performance multilingual machine translation, information
extraction, and other language processing technologies. The National
Institute of Information and Communications Technology (NICT) has
created this corpus by manually translating Japanese Wikipedia
articles (related to Kyoto) into English.
  DESCRIPTION
end
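The type check at the top of the constructor can be tried in isolation. The following is a minimal sketch, not the library's code: TypeCheckedDataset is a hypothetical stand-in class that reproduces only the guard clause and error message shown above, so it runs without the gem and without downloading the corpus.

```ruby
# Hypothetical stand-in: reproduces only the constructor's guard clause.
# It does not inherit from Dataset and sets no metadata.
class TypeCheckedDataset
  VALID_TYPES = [:article, :lexicon].freeze

  attr_reader :type

  def initialize(type: :article)
    unless VALID_TYPES.include?(type)
      raise ArgumentError,
            "Please set type :article or :lexicon: #{type.inspect}"
    end
    @type = type
  end
end

# The default and the two accepted values construct normally.
default_dataset = TypeCheckedDataset.new
lexicon_dataset = TypeCheckedDataset.new(type: :lexicon)

# Any other value is rejected before any other setup runs.
rejected_message = nil
begin
  TypeCheckedDataset.new(type: :sentence)
rescue ArgumentError => error
  rejected_message = error.message
end
```

Because the check runs before super(), an invalid type fails immediately, before any dataset setup happens.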
Instance Method Details
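#each(&block) ⇒ Object
Downloads the corpus archive and iterates over its file entries. With type :article, each .xml file is stream-parsed by an ArticleListener that receives the block; with type :lexicon, each row of kyoto_lexicon.csv after the header is yielded as an Entry. Without a block, an Enumerator is returned.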
# File 'lib/datasets/wikipedia-kyoto-japanese-english.rb', line 78
def each(&block)
  return to_enum(__method__) unless block_given?

  data_path = download_tar_gz

  open_tar_gz(data_path) do |tar|
    tar.each do |entry|
      next unless entry.file?
      base_name = File.basename(entry.full_name)
      case @type
      when :article
        next unless base_name.end_with?(".xml")
        listener = ArticleListener.new(block)
        parser = REXML::Parsers::StreamParser.new(entry.read, listener)
        parser.parse
      when :lexicon
        next unless base_name == "kyoto_lexicon.csv"
        is_header = true
        CSV.parse(entry.read.force_encoding("UTF-8")) do |row|
          if is_header
            is_header = false
            next
          end
          yield(Entry.new(row[0], row[1]))
        end
      end
    end
  end
end
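Two idioms in the :lexicon branch above, the Enumerator fallback and the header-row skip, can be exercised without downloading the archive. The sketch below is an illustration under stated assumptions: LexiconSketch and LexiconEntry are hypothetical names, the struct's field names are assumed rather than taken from the library, and the CSV text is made-up sample data, not content from kyoto_lexicon.csv.

```ruby
require "csv"

# Hypothetical stand-in for Entry; the field names are an assumption.
LexiconEntry = Struct.new(:japanese, :english)

# Hypothetical reader that mirrors the :lexicon branch on an in-memory string
# instead of a tar.gz entry.
class LexiconSketch
  def initialize(csv_text)
    @csv_text = csv_text
  end

  def each
    # Without a block, return an Enumerator that re-invokes this method.
    return to_enum(__method__) unless block_given?

    is_header = true
    CSV.parse(@csv_text) do |row|
      # Skip the first row, which holds the column names.
      if is_header
        is_header = false
        next
      end
      yield(LexiconEntry.new(row[0], row[1]))
    end
  end
end

# Made-up sample rows for illustration only.
sample = "japanese,english\nja-term-1,en-term-1\nja-term-2,en-term-2\n"
reader = LexiconSketch.new(sample)

# Block-less call: the Enumerator can be chained or materialized.
entries = reader.each.to_a
first_english = reader.each.first.english
```

With the gem installed, the equivalent call against the real data would presumably be Datasets::WikipediaKyotoJapaneseEnglish.new(type: :lexicon).each, which downloads the archive before reading from it.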