Class: Bio::FastaFormat
Overview
Treats a FASTA formatted entry, such as:
>id and/or some comments <== comment line
ATGCATGCATGCATGCATGCATGCATGCATGCATGC <== sequence lines
ATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGC
The leading ‘>’ can be omitted, and a trailing ‘>’ is removed automatically.
Examples
f_str = ">sce:YBR160W CDC28, SRM5; cyclin-dependent protein kinase catalytic subunit [EC:2.7.1.-] [SP:CC28_YEAST]\nMSGELANYKRLEKVGEGTYGVVYKALDLRPGQGQRVVALKKIRLESEDEG\nVPSTAIREISLLKELKDDNIVRLYDIVHSDAHKLYLVFEFLDLDLKRYME\nGIPKDQPLGADIVKKFMMQLCKGIAYCHSHRILHRDLKPQNLLINKDGNL\nKLGDFGLARAFGVPLRAYTHEIVTLWYRAPEVLLGGKQYSTGVDTWSIGC\nIFAEMCNRKPIFSGDSEIDQIFKIFRVLGTPNEAIWPDIVYLPDFKPSFP\nQWRRKDLSQVVPSLDPRGIDLLDKLLAYDPINRISARRAAIHPYFQES\n>sce:YBR274W CHK1; probable serine/threonine-protein kinase [EC:2.7.1.-] [SP:KB9S_YEAST]\nMSLSQVSPLPHIKDVVLGDTVGQGAFACVKNAHLQMDPSIILAVKFIHVP\nTCKKMGLSDKDITKEVVLQSKCSKHPNVLRLIDCNVSKEYMWIILEMADG\nGDLFDKIEPDVGVDSDVAQFYFQQLVSAINYLHVECGVAHRDIKPENILL\nDKNGNLKLADFGLASQFRRKDGTLRVSMDQRGSPPYMAPEVLYSEEGYYA\nDRTDIWSIGILLFVLLTGQTPWELPSLENEDFVFFIENDGNLNWGPWSKI\nEFTHLNLLRKILQPDPNKRVTLKALKLHPWVLRRASFSGDDGLCNDPELL\nAKKLFSHLKVSLSNENYLKFTQDTNSNNRYISTQPIGNELAELEHDSMHF\nQTVSNTQRAFTSYDSNTNYNSGTGMTQEAKWTQFISYDIAALQFHSDEND\nCNELVKRHLQFNPNKLTKFYTLQPMDVLLPILEKALNLSQIRVKPDLFAN\nFERLCELLGYDNVFPLIINIKTKSNGGYQLCGSISIIKIEEELKSVGFER\nKTGDPLEWRRLFKKISTICRDIILIPN\n"
f = Bio::FastaFormat.new(f_str)
puts "### FastaFormat"
puts "# entry"
puts f.entry
puts "# entry_id"
p f.entry_id
puts "# definition"
p f.definition
puts "# data"
p f.data
puts "# seq"
p f.seq
puts "# seq.class"
p f.seq.class
puts "# length"
p f.length
puts "# aaseq"
p f.aaseq
puts "# aaseq.class"
p f.aaseq.class
puts "# aaseq.composition"
p f.aaseq.composition
puts "# aalen"
p f.aalen
References
-
FASTA format (Wikipedia) en.wikipedia.org/wiki/FASTA_format
Direct Known Subclasses
Constant Summary
- DELIMITER = RS = "\n>"
Entry delimiter in flatfile text.
- DELIMITER_OVERRUN = 1
(Integer) excess read size included in DELIMITER.
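The two constants describe how a flat-file reader can cut a stream into entries: reading up to "\n>" consumes one character (the ‘>’) that belongs to the next entry. The following is a minimal standalone sketch of that idea, not the Bio::FlatFile implementation; the method name each_fasta_chunk and the local constant copies are hypothetical and only mirror the values listed above.

```ruby
require 'stringio'

# Local copies of the documented values, so the sketch runs without the bio gem.
DELIMITER = "\n>"
DELIMITER_OVERRUN = 1

# Hypothetical helper: yields one FASTA entry per chunk read from io.
def each_fasta_chunk(io)
  carry = ''
  while (chunk = io.gets(DELIMITER))
    chunk = carry + chunk
    if chunk.end_with?(DELIMITER)
      # The last DELIMITER_OVERRUN characters belong to the next entry.
      carry = chunk[-DELIMITER_OVERRUN, DELIMITER_OVERRUN]
      chunk = chunk[0...-DELIMITER_OVERRUN]
    else
      carry = ''
    end
    yield chunk
  end
end

chunks = []
each_fasta_chunk(StringIO.new(">a\nAT\nGC\n>b\nTT\n")) { |c| chunks << c }
p chunks  # => [">a\nAT\nGC\n", ">b\nTT\n"]
```

Each chunk keeps its own leading ‘>’, so it can be handed to a parser for a single entry.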
Instance Attribute Summary
-
#data ⇒ Object
The sequence lines in text.
-
#definition ⇒ Object
The comment line of the FASTA formatted data.
-
#entry_overrun ⇒ Object
readonly
Returns the value of attribute entry_overrun.
Instance Method Summary
-
#aalen ⇒ Object
Returns the length of the Bio::Sequence::AA object.
-
#aaseq ⇒ Object
Returns the sequence as a Bio::Sequence::AA object.
-
#acc_version ⇒ Object
Returns accession number with version.
-
#accession ⇒ Object
Returns an accession number.
-
#accessions ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the accession numbers.
-
#comment ⇒ Object
Returns comments.
-
#entry ⇒ Object
(also: #to_s)
Returns the stored entry as a FASTA-formatted string.
-
#entry_id ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier.
-
#gi ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the GI, locus, accession, or accession with version number.
-
#identifiers ⇒ Object
Parses the FASTA defline and extracts the IDs.
-
#initialize(str) ⇒ FastaFormat
constructor
Stores the comment and sequence information from one entry of the FASTA format string.
-
#length ⇒ Object
Returns sequence length.
-
#locus ⇒ Object
Returns locus.
-
#nalen ⇒ Object
Returns the length of the Bio::Sequence::NA object.
-
#naseq ⇒ Object
Returns the sequence as a Bio::Sequence::NA object.
-
#query(factory) ⇒ Object
(also: #fasta, #blast)
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
-
#seq ⇒ Object
Returns the sequence lines joined into a single String.
-
#to_biosequence ⇒ Object
(also: #to_seq)
Returns sequence as a Bio::Sequence object.
Methods inherited from DB
#exists?, #fetch, #get, open, #tags
Constructor Details
#initialize(str) ⇒ FastaFormat
Stores the comment and sequence information from one entry of the FASTA format string. If the argument contains more than one entry, only the first entry is used.
# File 'lib/bio/db/fasta.rb', line 119
def initialize(str)
  @definition = str[/.*/].sub(/^>/, '').strip  # 1st line
  @data = str.sub(/.*/, '')                    # rests
  @data.sub!(/^>.*/m, '')                      # remove trailing entries for sure
  @entry_overrun = $&
end
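To make the constructor's three steps concrete, here is a minimal standalone sketch that applies the same regular expressions to a plain string, so it runs without the bio gem. The method name split_fasta_entry is hypothetical; only the regular expressions are taken from the source listed above.

```ruby
# Hypothetical helper mirroring the constructor: returns
# [definition, data, entry_overrun] for one FASTA-formatted string.
def split_fasta_entry(str)
  definition = str[/.*/].sub(/^>/, '').strip  # first line, leading '>' dropped
  data = str.sub(/.*/, '')                    # everything after the first line
  data.sub!(/^>.*/m, '')                      # cut off any following entries
  overrun = $&                                # the text that was cut off, or nil
  [definition, data, overrun]
end

two_entries = ">id1 desc\nATGC\nATGC\n>id2\nGGCC\n"
first = split_fasta_entry(two_entries)
p first  # => ["id1 desc", "\nATGC\nATGC\n", ">id2\nGGCC\n"]

# The leading '>' is optional, and a single entry leaves no overrun.
bare = split_fasta_entry("id1\nATGC\n")
p bare   # => ["id1", "\nATGC\n", nil]
```

Only the first entry ends up in the definition and data; the remainder is kept separately, which matches the statement that extra entries in the argument are ignored.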
Instance Attribute Details
#data ⇒ Object
The sequence lines in text.
# File 'lib/bio/db/fasta.rb', line 112
def data
  @data
end
#definition ⇒ Object
The comment line of the FASTA formatted data.
# File 'lib/bio/db/fasta.rb', line 109
def definition
  @definition
end
#entry_overrun ⇒ Object (readonly)
Returns the value of attribute entry_overrun.
# File 'lib/bio/db/fasta.rb', line 114
def entry_overrun
  @entry_overrun
end
Instance Method Details
#aalen ⇒ Object
Returns the length of the Bio::Sequence::AA object.
# File 'lib/bio/db/fasta.rb', line 209
def aalen
  self.aaseq.length
end
#aaseq ⇒ Object
Returns the sequence as a Bio::Sequence::AA object.
# File 'lib/bio/db/fasta.rb', line 204
def aaseq
  Sequence::AA.new(seq)
end
#acc_version ⇒ Object
Returns accession number with version.
# File 'lib/bio/db/fasta.rb', line 265
def acc_version
  identifiers.acc_version
end
#accession ⇒ Object
Returns an accession number.
# File 'lib/bio/db/fasta.rb', line 253
def accession
  identifiers.accession
end
#accessions ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the accession numbers as an array of strings.
# File 'lib/bio/db/fasta.rb', line 260
def accessions
  identifiers.accessions
end
#comment ⇒ Object
Returns comments.
# File 'lib/bio/db/fasta.rb', line 183
def comment
  seq
  @comment
end
#entry ⇒ Object Also known as: to_s
Returns the stored entry as a FASTA-formatted string (same as to_s).
# File 'lib/bio/db/fasta.rb', line 127
def entry
  @entry = ">#{@definition}\n#{@data.strip}\n"
end
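As a small illustration of what that string interpolation produces, the sketch below rebuilds an entry from a definition and raw sequence text using plain strings. The method name rebuild_entry is hypothetical; the format string is the one from the source above.

```ruby
# Hypothetical helper: same interpolation as #entry, on plain arguments.
def rebuild_entry(definition, data)
  ">#{definition}\n#{data.strip}\n"
end

rebuilt = rebuild_entry("id1 desc", "\nATGC\nATGC\n")
p rebuilt  # => ">id1 desc\nATGC\nATGC\n"
```

Whitespace around the sequence text is stripped, while the line breaks inside it are kept, so the original line wrapping survives.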
#entry_id ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns a possibly unique identifier as a string.
# File 'lib/bio/db/fasta.rb', line 239
def entry_id
  identifiers.entry_id
end
#gi ⇒ Object
Parses the FASTA defline (using the #identifiers method) and returns the GI, locus, accession, or accession with version number. If an entry has several such IDs, only the first one is returned. It returns a string or nil.
# File 'lib/bio/db/fasta.rb', line 248
def gi
  identifiers.gi
end
#identifiers ⇒ Object
Parses the FASTA defline and extracts the IDs. IDs are NSIDs (NCBI standard FASTA sequence identifiers) or “:”-separated IDs. It returns a Bio::FastaDefline instance.
# File 'lib/bio/db/fasta.rb', line 229
def identifiers
  unless defined?(@ids) then
    @ids = FastaDefline.new(@definition)
  end
  @ids
end
#length ⇒ Object
Returns sequence length.
# File 'lib/bio/db/fasta.rb', line 189
def length
  seq.length
end
#locus ⇒ Object
Returns locus.
# File 'lib/bio/db/fasta.rb', line 270
def locus
  identifiers.locus
end
#nalen ⇒ Object
Returns the length of the Bio::Sequence::NA object.
# File 'lib/bio/db/fasta.rb', line 199
def nalen
  self.naseq.length
end
#naseq ⇒ Object
Returns the sequence as a Bio::Sequence::NA object.
# File 'lib/bio/db/fasta.rb', line 194
def naseq
  Sequence::NA.new(seq)
end
#query(factory) ⇒ Object Also known as: fasta, blast
Executes FASTA/BLAST search by using a Bio::Fasta or a Bio::Blast factory object.
#!/usr/bin/env ruby
require 'bio'

factory = Bio::Fasta.local('fasta34', 'db/swissprot.f')
flatfile = Bio::FlatFile.open(Bio::FastaFormat, 'queries.f')
flatfile.each do |entry|
  p entry.definition
  result = entry.fasta(factory)
  result.each do |hit|
    print "#{hit.query_id} : #{hit.evalue}\t#{hit.target_id} at "
    p hit.lap_at
  end
end
# File 'lib/bio/db/fasta.rb', line 150
def query(factory)
  factory.query(@entry)
end
#seq ⇒ Object
Returns the sequence lines joined into a single String.
# File 'lib/bio/db/fasta.rb', line 157
def seq
  unless defined?(@seq)
    unless /\A\s*^\#/ =~ @data then
      @seq = Sequence::Generic.new(@data.tr(" \t\r\n0-9", '')) # lazy clean up
    else
      a = @data.split(/(^\#.*$)/)
      i = 0
      cmnt = {}
      s = []
      a.each do |x|
        if /^# ?(.*)$/ =~ x then
          cmnt[i] ? cmnt[i] << "\n" << $1 : cmnt[i] = $1
        else
          x.tr!(" \t\r\n0-9", '') # lazy clean up
          i += x.length
          s << x
        end
      end
      @comment = cmnt
      @seq = Bio::Sequence::Generic.new(s.join(''))
    end
  end
  @seq
end
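The source above does two things: it strips whitespace and digits from the sequence lines, and, when the data starts with a ‘#’ line, it collects those comment lines keyed by the sequence position at which they occur. The following standalone sketch reproduces that logic on plain strings so it can be tried without the bio gem. The method name clean_sequence is hypothetical, and it returns a plain String and Hash rather than a Bio::Sequence::Generic.

```ruby
# Hypothetical helper mirroring #seq: returns [sequence, comments], where
# comments maps a sequence offset to the '#' comment text found there.
def clean_sequence(data)
  # Fast path: data does not start with a '#' line, so no comments are collected.
  return [data.tr(" \t\r\n0-9", ''), {}] unless /\A\s*^\#/ =~ data

  comments = {}
  offset = 0
  parts = []
  data.split(/(^\#.*$)/).each do |piece|
    if /^# ?(.*)$/ =~ piece
      # Several comment lines at the same offset are joined with newlines.
      comments[offset] ? comments[offset] << "\n" << $1 : comments[offset] = $1
    else
      cleaned = piece.tr(" \t\r\n0-9", '')  # drop whitespace and digits
      offset += cleaned.length
      parts << cleaned
    end
  end
  [parts.join, comments]
end

plain_seq, plain_comments = clean_sequence("\nATGC\n12 GGCC\n")
noted_seq, noted_comments = clean_sequence("\n# first note\nATGC 10\n# second\nGGCC\n")
puts plain_seq  # prints ATGCGGCC
puts noted_seq  # prints ATGCGGCC
```

In the second call, the comments end up at offsets 0 and 4, the positions in the cleaned sequence where each ‘#’ line appeared.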
#to_biosequence ⇒ Object Also known as: to_seq
Returns sequence as a Bio::Sequence object.
Note: For efficiency, modifying the returned Bio::Sequence object might (but will not always) also change the sequence or definition in this FastaFormat object.
# File 'lib/bio/db/fasta.rb', line 220
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::FastaFormat)
end