ulla
Description
‘ulla’ is a program for calculating environment-specific substitution tables from user providing environmental class definitions and sequence alignments with the annotations of the environment classes.
Features
-
Environment-specific substitution table generation based on user providing environmental class definition
-
Entropy-based smoothing procedures to cope with sparse data problem
-
BLOSUM-like weighting procedures using PID threshold
-
Heat Map generation for substitution tables
Requirements
-
ruby 1.8.7 or above (1.9.0 or above recommended, www.ruby-lang.org)
-
rubygems 1.2.0 or above (rubyforge.org/projects/rubygems)
Following RubyGems will be automatically installed if you have rubygems installed on your machine
-
narray (narray.rubyforge.org)
-
bio (bioruby.open-bio.org)
-
Active Support (as.rubyonrails.org)
-
RMagick (rmagick.rubyforge.org)
Installation
~user $ sudo gem install ulla
Basic Usage
It’s pretty much the same as Kenji’s subst (www-cryst.bioc.cam.ac.uk/~kenji/subst/), so in most cases, you can swap ‘subst’ with ‘ulla’.
~user $ ulla -l TEMLIST-file -c classdef.dat
or
~user $ ulla -f TEM-file -c classdef.dat
Options
--tem-file (-f) FILE: a tem file
--tem-list (-l) FILE: a list for tem files
--classdef (-c) FILE: a file for the defintion of environmental class
if no definition file provided, --cys (-y) 2 and --nosmooth automatcially applied
--outfile (-o) FILE: output filename (default 'allmat.dat')
--weight (-w) INTEGER: clustering level (PID) for the BLOSUM-like weighting (default: 60)
--noweight: calculate substitution counts with no weights
--environment (-e) INTEGER:
0 for considering only substituted amino acids' environments (default)
1 for considering both substituted and substituting amino acids' environments
--smooth (-s) INTEGER:
0 for partial smoothing (default)
1 for full smoothing
--p1smooth: perform smoothing for p1 probability calculation when partial smoothing
--nosmooth: perform no smoothing operation
--cys (-y) INTEGER:
0 for using C and J only for structure (default)
1 for both structure and sequence
2 for using only C for both (must be set when you have no 'disulphide' or 'disulfide' annotation in templates)
--output INTEGER:
0 for raw counts (no smoothing performed)
1 for probabilities
2 for log-odds (default)
--noroundoff: do not round off log odds ratio
--scale INTEGER: log-odds matrices in 1/n bit units (default 3)
--sigma DOUBLE: change the sigma value for smoothing (default 5.0)
--autosigma: automatically adjust the sigma value for smoothing
--add DOUBLE: add this value to raw counts when deriving log-odds without smoothing (default 0)
--pidmin DOUBLE: count substitutions only for pairs with PID equal to or greater than this value (default none)
--pidmax DOUBLE: count substitutions only for pairs with PID smaller than this value (default none)
--heatmap INTEGER:
0 create a heat map file for each substitution table
1 create one big file containing all heat maps from substitution tables
2 do both 0 and 1
--heatmap-format INTEGER:
0 for Portable Network Graphics (PNG) Format (default)
1 for Graphics Interchange Format (GIF)
2 for Joint Photographic Experts Group (JPEG) Format
3 for Microsoft Windows bitmap (BMP) Format
4 for Portable Document Format (PDF)
--heatmap-columns INTEGER: number of tables to print in a row when --heatmap 1 or 2 set (default: sqrt(no. of tables))
--heatmap-stem STRING: stem for a file name when --heatmap 1 or 2 set (default: 'heatmap')
--heatmap-values: print values in the cells when heat maps
--verbose (-v) INTEGER
0 for ERROR level
1 for WARN or above level (default)
2 for INFO or above level
3 for DEBUG or above level
--version: print version
--help (-h): show help
Usage
-
Prepare an environmental class definition file. For more details, please check this notes (www-cryst.bioc.cam.ac.uk/~kenji/subst/NOTES). You can download a sample environmental class definition file from www-cryst.bioc.cam.ac.uk/~kenji/subst/classdef.dat
~user $ cat classdef.dat # # name of feature (string); values adopted in .tem file (string); class labels assigned for each value (string); # constrained or not (T or F); silent (used as masks)? (T or F) # secondary structure and phi angle;HEPC;HEPC;T;F solvent accessibility;TF;Aa;F;F
-
Prepare structural alignments and their annotations of above environmental classes in PIR format. You can download sample alignments from www-cryst.bioc.cam.ac.uk/~kenji/subst/alltem-allmask.tar.gz
~user $ cat sample1.tem >P1;1mnma sequence QKERRKIEIKFIENKTRRHVTFSKRKHGIMKKAFELSVLTGTQVLLLVVSETGLVYTFSTPKFEPIVTQQEGRNL IQACLNAPDD* >P1;1egwa sequence --GRKKIQITRIMDERNRQVTFTKRKFGLMKKAYELSVLCDCEIALIIFNSSNKLFQYASTDMDKVLLKYTEY-- ----------* >P1;1mnma secondary structure and phi angle CPCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHPCCCEEEEECCCPCEEEEECCCCCHHHHCHHHHHH HHHHHCCCCP* >P1;1egwa secondary structure and phi angle --CCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHCPCCCEEEEECCCPCEEEEECCCHHHHHHHHHHC-- ----------* >P1;1mnma solvent accessibility TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTFTTTTTTTTTTTTTTTT TTTTTTTTTT* >P1;1egwa solvent accessibility --TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTFTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT-- ----------* ...
-
When you have two or more alignment files, you should make a separate file containing all the paths for the alignment files.
~user $ ls -1 *.tem > TEMLIST ~user $ cat TEMLIST sample1.tem sample2.tem ...
-
To produce substitution count matrices,
~user $ ulla -l TEMLIST --output 0 -o substcount.mat
-
To produce substitution probability matrices,
~user $ ulla -l TEMLIST --output 1 -o substprob.mat
-
To produce log odds ratio matrices,
~user $ ulla -l TEMLIST --output 2 -o substlogo.mat
-
To produce substitution probability matrices from the sequence pairs within a certain PID range (if you don’t provide any name for output, ‘allmat.dat’ will be used.),
~user $ ulla -l TEMLIST --pidmin 60 --pidmax 80 --output 1
-
To change the clustering level (default 60) to PID 80,
~user $ ulla -l TEMLIST --weight 80 --output 1
-
In case positions are masked with the character ‘X’ in any environmental features, all mutations from/to the position will be excluded from substitution counts.
-
Then, it will produce a file containing all the matrices, which will look like the one below. For more details, please check this notes (www-cryst.bioc.cam.ac.uk/~kenji/subst/NOTES).
# Environment-specific amino acid substitution matrices # Creator: ulla version 0.0.5 # Creation Date: 05/02/2009 17:29 # # Definitions for structural environments: # 2 features used # # secondary structure and phi angle;HEPC;HEPC;F;F # solvent accessibility;TF;Aa;F;F # (read in from classdef.dat) # # Number of alignments: 1187 # (list of .tem files read in from TEMLIST) # # Total number of environments: 8 # # There are 21 amino acids considered. # ACDEFGHIKLMNPQRSTVWYJ # # C: Cystine (the disulfide-bonded form) # J: Cysteine (the free thiol form) # # Weighting scheme: clustering at PID 60 level # ... # >HA 0 # A C D E F G H I K L M N P Q R S T V W Y J A 3 -5 0 0 -1 2 0 0 1 0 0 0 1 1 0 1 1 1 -1 0 2 C -16 19 -16 -18 -11 -14 -13 -13 -14 -14 -14 -11 -17 -16 -13 -16 -14 -12 -12 -10 -4 D 1 -7 6 3 -3 1 0 -3 1 -3 -2 2 1 2 0 1 0 -2 -3 -2 -2 E 3 -7 5 7 -1 2 2 0 3 0 0 3 2 4 3 3 2 1 -1 0 -1 F -4 -4 -6 -6 7 -5 -1 0 -4 1 0 -5 -5 -4 -4 -4 -3 -1 3 3 0 G -2 -6 -3 -4 -5 5 -4 -5 -4 -5 -4 -2 -3 -4 -4 -2 -3 -5 -6 -4 -3 H 0 -6 0 0 1 0 8 -1 0 0 0 1 -2 1 1 0 0 0 1 3 0 I -3 -7 -6 -5 0 -5 -3 4 -4 1 1 -5 -4 -4 -3 -5 -2 2 -2 -1 0 K 2 -6 2 2 -1 1 2 0 5 1 1 2 0 3 4 2 2 0 -2 0 -1 L -2 -6 -5 -4 1 -4 -2 2 -3 4 2 -3 -4 -3 -2 -4 -2 1 0 0 1 M -2 -7 -4 -3 1 -2 -1 2 -2 2 6 -3 -4 -2 -1 -2 -1 1 0 0 1 N 0 -5 1 0 -3 1 1 -3 0 -2 -2 6 -2 0 0 1 1 -2 -3 -1 -1 P -1 -7 -1 -2 -4 -1 -3 -3 -2 -3 -4 -2 9 -2 -3 0 -1 -2 -4 -4 -4 Q 2 -7 2 2 -1 1 2 -1 2 0 0 2 0 5 2 1 1 0 -2 -1 0 R 1 -6 1 1 -1 0 2 0 3 0 1 1 -1 2 6 1 1 0 -1 0 0 S 0 -6 -1 -1 -3 0 -2 -3 -1 -3 -3 0 0 -1 -1 3 1 -2 -4 -3 0 T -1 -7 -2 -2 -3 -2 -2 -2 -2 -2 -2 -1 -2 -2 -2 0 3 -1 -3 -3 0 V -3 -6 -6 -5 -1 -4 -3 1 -4 0 0 -5 -3 -4 -4 -4 -2 2 -2 -2 0 W -4 -6 -6 -5 2 -6 -2 -2 -5 -1 -2 -5 -5 -4 -4 -5 -4 -2 12 2 -3 Y -3 -5 -5 -5 3 -4 1 -1 -3 -1 -1 -3 -5 -3 -3 -4 -3 -2 3 7 -1 J -2 0 -4 -5 0 -2 -1 0 -3 0 0 -3 -6 -2 -2 -1 -1 0 -1 0 9 U -5 16 -7 -8 -3 -5 -4 -3 -6 -3 -3 -5 -9 -6 -5 -4 -4 -3 -4 -3 6 ...
-
To generate a heat map for each table with values in it,
~user $ ulla -l TEMLIST --heatmap 0 --heatmap-values
which will look like this,
-
To generate one big figure, ‘myheatmaps.gif’ containing all the heat maps (4 maps in a row),
~user $ ulla -l TEMLIST --heatmap 1 --heatmap-stem myheatmaps --heatmap-format 1 --heatmap-columns 4
which will look like this,
Repository
You can download a pre-built RubyGems package from
-
rubyforge: rubyforge.org/projects/ulla
or, You can fetch the source from
Reference
Contact
Comments are welcome, please send an email to me (seminlee at gmail dot com).
License
This work is licensed under a Creative Commons Attribution-Noncommercial 2.0 UK: England & Wales License.