Wed Apr 14 14:16:46 2004
Gerald Hertz
gzhertz AT alum.mit.edu
================================================================
CONTENTS OF THE Consensus DIRECTORY (files are described below)
================================================================
CONSENSUS_2004-04-14.TAR.gz: NEW!! Contains everything in this directory
README.20f: Text version of this file
EXAMPLES.9
consensus-v6d.2.tar.gz
consensus.multiwidths-v1b
wconsensus-v5d.2.tar.gz
con-filter-v2c.c
seq-modifier-v1a.3.tar.gz
genbank-consensus-v1a.1.tar.gz
fasta-consensus-v2.c
rand-seqs-v1.2.tar.gz
patser-v3e.2.tar.gz
gmat-inf-gc-v2c.3.tar.gz
p-value-v3a.4.tar.gz
make-matrix-v2.4.tar.gz
alphabet.aa
theoryB.pdf
theoryB.ps
theoryA.pdf
theoryA.ps
================================================================
FUTURE PLANS:
1) "patser" and "consensus.multiwidths" will be upgraded.
2) I will be simplifying the procedure for compiling the programs.
================================================================
The CONSENSUS programs are a collection of programs for determining and
analyzing DNA and protein patterns describing functional elements.
Patterns are described by alignment matrices, which summarize alignments
of multiple sequences by listing the occurrence of each letter at each
position of the alignment. If you send e-mail notifying me that you are
using my programs, I will send you occasional information when I update
the CONSENSUS site [http://gzhertz.home.comcast.net].
The programs were developed under the UNIX operating system. If you
are using a UNIX system and my programs do not work, please notify me.
I have placed a copyright notice in the programs stating that my files
"May be copied for noncommercial purposes." My intention is not to
limit the use of my programs. However, if my programs have commercial
value, I would like to know about it.
IMPORTANT MESSAGE FOR USERS OF EARLIER VERSIONS OF MY PROGRAMS
(e.g., version 5 of CONSENSUS and earlier; version 4 of WCONSENSUS and
earlier): In the current versions, information content is calculated
using natural logarithms (i.e. base e). If you used earlier versions
that used logarithms base 2, multiply by ln(2) = 0.693 to convert
those information contents to natural logarithms.
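As a quick check of the conversion described above, here is a minimal Python sketch. The function name "bits_to_nats" is hypothetical and is not part of the CONSENSUS programs; it simply multiplies a base-2 information content by ln(2).

```python
import math

def bits_to_nats(info_bits):
    """Convert an information content from base-2 logarithms (bits)
    to natural logarithms by multiplying by ln(2)."""
    return info_bits * math.log(2)

# Example: a value of 10.0 reported by an older, base-2 version
print(round(bits_to_nats(10.0), 3))  # prints 6.931
```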
I will first list the programs contained in this directory. I will
then describe how to obtain and compile the programs. Further below,
I will give a slightly more detailed description of each program. At
the end of this file, I will give the detailed directions that can be
obtained by giving a program the -h command line option.
=====================================================================
I HAVE THE FOLLOWING PROGRAMS THAT CAN BE DOWNLOADED:
=====================================================================
PROGRAMS FOR DETERMINING ALIGNMENTS
1) consensus (version 6d): determines alignments having a fixed width
2) consensus.multiwidths (version 1b): automatically runs the "consensus"
program multiple times
3) wconsensus (version 5d): determines alignments having variable widths
UTILITIES FOR HELPING THE ALIGNMENT PROGRAMS
4) con-filter (version 2c): extracts summary from consensus/wconsensus output
5) seq-modifier (version 1a): restricts the sequence data being aligned
6) genbank-consensus (version 1a): converts sequences from GenBank/EMBL format
7) fasta-consensus (version 2): converts sequences from FASTA format
8) rand-seqs (version 1): creates a set of randomized sequences
PROGRAMS FOR ANALYZING ALIGNMENTS
9) patser (version 3e): scores sequences against an alignment matrix
10) gmat-inf-gc (version 2c): determines information content of a matrix
11) p-value (version 3a): determines the p-value of an information content
12) make-matrix (version 2): creates an alignment matrix from aligned sequences
This directory also contains two useful files.
1) EXAMPLES.9: examples of how I typically run the alignment programs and other selected programs.
2) alphabet.aa: an alphabet file containing the amino acid alphabet.
In addition, this directory contains 2 manuscripts describing
the theory behind the programs.
1) theoryB.ps [or theoryB.pdf]: describes the statistics and methods behind
the "consensus" and "wconsensus" programs. This paper is the current
definitive reference for my alignment programs.
G.Z. Hertz and G.D. Stormo.
Identifying DNA and Protein Patterns with Statistically Significant
Alignments of Multiple Sequences.
Bioinformatics, 1999, volume 15, pages 563--577
2) theoryA.ps [or theoryA.pdf]: describes a statistical basis for
accounting for gaps in multiple alignments. It also includes
information on the methods behind the "consensus" and "wconsensus"
programs. However, the information on gaps is not necessary for the
programs currently available for downloading, and "theoryB.ps" is more
up to date as a reference for "consensus" and "wconsensus".
G.Z. Hertz and G.D. Stormo.
Identification of Consensus Patterns in Unaligned DNA and Protein
Sequences: A Large-Deviation Statistical Basis for Penalizing Gaps.
In: Proceedings of the Third International Conference on Bioinformatics
and Genome Research (H.A. Lim, and C.R. Cantor, editors).
World Scientific Publishing Co., Ltd. Singapore, 1995.
pages 201--216.
=====================================================================
HOW TO OBTAIN AND COMPILE THE PROGRAMS
=====================================================================
The source code for my programs can be downloaded from
http://gzhertz.home.comcast.net. The tar files should be unbundled
in separate directories to avoid name clashes.
The programs are written in C and were developed under the UNIX
operating system using SUN workstations and DECstations. Some of the
programs have been further used with Silicon Graphics Indigo
workstations, DEC AlphaStations, and Red Hat Linux. If you discover
aspects of my code that are not compatible with your system, please
let me know.
Each tar file contains a UNIX "makefile" that describes how to compile
the corresponding program and a copy of the corresponding directions
also shown below. Simply type "make" to compile the
program if your computer has the "gcc" compiler. If your computer
only has the "cc" compiler, you will need to edit the "makefile"
before typing "make" (unless you are using an SGI machine):
Change the two lines
#CC = cc
CC = gcc
to
CC = cc
#CC = gcc
If you are using an SGI machine, I have written special directions for
it in the "makefile". These compiling directions use the native "cc"
compiler which is substantially faster than "gcc" on these machines.
Instead of just typing "make", you will need to type the following commands:
make consensus-v6d.sgi
make wconsensus-v5d.sgi
make seq-modifier-v1a.sgi
make genbank-consensus-v1a.sgi
make rand-seqs-v1.sgi
make patser-v3e.sgi
make gmat-inf-gc-v2c.sgi
make p-value-v3a.sgi
make make-matrix-v2.sgi
The "fasta-consensus-v2" and "con-filter-v2c" programs are each
contained in a single file and are to be compiled directly from the
command line. To create these 2 programs, execute the following
command lines:
gcc -O2 -o fasta-consensus-v2 fasta-consensus-v2.c
gcc -O2 -o con-filter-v2c con-filter-v2c.c
If you do not have the "gcc" compiler and the above command line
fails, then simply use the "cc" compiler (i.e. use "cc" instead of
"gcc" as the initial word on the command line).
The "consensus.multiwidths-v1b" program is written in PERL and does
not need to be compiled. You will need to make this file executable
with the chmod command. At the command line execute:
chmod +x consensus.multiwidths-v1b
The path to the PERL program in the first line of this file
(/usr/bin/perl) may need to be changed for your particular system
depending on where PERL is located. If you get the error that
/usr/bin/perl cannot be found, execute the "which perl" command to
find the location of PERL. For example, if you get the response that
PERL is located in /usr/local/bin/perl, the first line would be
changed to: #!/usr/local/bin/perl -w
=====================================================================
DESCRIPTION OF EACH PROGRAM
=====================================================================
PROGRAMS FOR DETERMINING ALIGNMENTS
1) The "consensus" program is the current version of the program
originally described in Stormo and Hartzell (1989, PNAS, 86:1183-1187)
and Hertz et al. (1990, CABIOS, 6:81-92). However, this program has
many more options than the originally published version. Some of the
major changes are (i) each sequence may optionally contribute more
than one word to the pattern being generated, (ii) the user designates
the maximum number of matrices to save after each cycle (the "-q"
option), and (iii) the program calculates the statistical significance
of the alignments.
2) "consensus.multiwidths" automates the process of running the
"consensus" program with different widths. It also automates the
search for additional alignments that are completely independent
because they contain a completely different set of sequence words. My
expectation is that most users will run "consensus" under the control
of the "consensus.multiwidths" program. I also expect most users to
use "consensus" rather than "wconsensus". "consensus.multiwidths"
requires the following programs to be available: "consensus",
"fasta-consensus", "con-filter", and "seq-modifier".
3) "wconsensus" differs from the "consensus" program in that the user
does not directly supply the width of the pattern being sought.
However, the user still needs to vary a bias that directly influences
the width of the pattern.
UTILITIES FOR HELPING THE ALIGNMENT PROGRAMS
4) The "con-filter" program extracts key information about the top
alignments identified by the "consensus" or "wconsensus" program. This
information allows a quick comparison of a program's output after
multiple runs with different settings of the width (in "consensus") or
the standard-deviation bias (in "wconsensus").
5) The "seq-modifier" program restricts the portion of the sequence
data that is available for alignment. The program first identifies
sequences that match a pre-existing pattern either because they are
contained in an output alignment of "consensus" or "wconsensus",
because they score high with the "patser" program, or because the
sequences are partially pre-aligned. The sequence data is then
modified to either exclude these sequences from a future alignment,
limit a future alignment to a region around these sequences, or force
the alignments to initiate only from these sequences. This program
takes advantage of the sequence modifiers that are described in
section 1 of the detailed directions of the "consensus" and
"wconsensus" programs.
6) The "genbank-consensus" program converts sequences from the GenBank
or EMBL format to the CONSENSUS format used by the "consensus",
"wconsensus", and "patser" programs. The CONSENSUS-style sequences
are sent to the standard output. Each CONSENSUS-style sequence can be
derived either from a whole GenBank/EMBL entry or from a designated
subset.
7) The "fasta-consensus" program converts a file from the FASTA
sequence format to the CONSENSUS format used by the "consensus",
"wconsensus", and "patser" programs. The FASTA-style sequences are
read from the standard input and the CONSENSUS-style sequences are
sent to the standard output.
8) The "rand-seqs" program creates a set of randomized sequences in
the CONSENSUS sequence style. Alignments of randomized sequences can
be a valuable tool for determining the statistical significance of a
sequence alignment, although the newest versions of "consensus" and
"wconsensus" do a good job of calculating statistical significance.
However, both the calculated statistical significance and the
randomized sequences assume that each position of a sequence is
independent and identically distributed. This is only true if you have
created the sequences by chemical synthesis. Therefore, I recommend
the additional control of aligning sequences that have been randomly
extracted from the organism(s) of interest. While the nucleotides of
the E. coli genome can be modeled fairly effectively as being
independent, this is not a safe assumption for other prokaryotes and
eukaryotes.
PROGRAMS FOR ANALYZING ALIGNMENTS
9) The "patser" program allows one to score the words of a sequence
against an alignment matrix obtained from the "consensus" or
"wconsensus" program.
10) The "gmat-inf-gc" program calculates the information content of an
alignment matrix determined by the "consensus" or "wconsensus"
program. This program can also do a crude graphing of the information
content at each position of an alignment matrix.
11) The "p-value" program determines an information content's p-value,
which is the probability of observing that particular information
content or greater in an arbitrary alignment of random sequences.
This calculation is also done within the current versions of the
"consensus", "wconsensus", and "gmat-inf-gc" programs. The input of
this program is (i) the width of the alignment, (ii) the number of
sequences in the alignment, and (iii) the negative log-likelihood
ratio, which equals the information content multiplied by the number
of sequences in the alignment.
12) The "make-matrix" program creates an alignment matrix from a set of
aligned sequences. This can be useful if you wish to use the other
programs in this section (i.e. "patser", "gmat-inf-gc", or "p-value")
with an alignment that was not determined with my programs
(i.e. "consensus", or "wconsensus").
Below the dashed line are the detailed directions for the programs.
These are the directions you would get if you give any of the programs
the "-h" option. A summary of the command line options is printed
whenever an unrecognized command line option is used. If you have any
comments or questions, or if anything does not seem quite right,
please send e-mail to gzhertz AT alum.mit.edu.
=====================================================================
DETAILED DIRECTIONS FOR EACH PROGRAM
=====================================================================
Copyright 1990--2002 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
CONSENSUS (version 6d)
REQUIRED PARAMETER
-L <Width of pattern (required)>
BASIC OPTIONS
-h <Print directions>
-f <Name of sequence file>
-a <Name of ascii alphabet file (default: alphabet)>
-A <Ascii alphabet information on the command line>
-d <Use designated prior frequencies (default: use observed frequencies)>
-c0 <Ignore the complementary strand (the default)>
-c1 <Include both strands as separate sequences>
-c2 <Include both strands as a single sequence (i.e., orientation unknown)>
-c3 <Assume that the pattern is symmetrical>
-l <Seed with first sequence and proceed linearly through list>
-n <Maximum number of cycles (0 or more sites per sequence)>
-N <Maximum number of cycles (1 or more sites per sequence)>
ADVANCED OPTIONS
-q <Number of matrices to save (default: 1000)>
-m <Minimum distance between words (default: integer indicated by -L option)>
-t <Terminate indicated number of cycles after most significant alignment>
-pt <Number of top matrices to print (default: 4 when NOT using -l option)>
-pf <Number of final matrices to print (default: 4 when NOT using -n or -N)>
-u0 <Unrecognized characters are errors>
-u1 <Unrecognized characters are discontinuities, but print warning (default)>
-u2 <Unrecognized characters are discontinuities, and print NO warning>
OBSCURE OPTIONS
-i <Name of integer alphabet file>
-CS <Ascii alphabet is case sensitive (default: case insensitive)>
-w <Only count letters included within the sequence fragments being aligned>
-pr1 <Save top progeny matrices regardless of parentage>
-pr2 <Save the top progeny for each parental matrix (the default)>
This program determines consensus patterns in unaligned sequences. The
algorithm is based on a matrix representation of a consensus pattern. Each
row corresponds to one of the letters of the relevant alphabet---e.g., 4
rows in the case of DNA. Each column corresponds to one of the positions
within the pattern. The elements of the matrix are determined by the
number of times that the indicated letter occurs at the indicated position.
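The matrix representation described above can be illustrated with a short Python sketch. This is not the code used by "consensus"; the function name "alignment_matrix" and the example words are made up for illustration. It counts how often each letter occurs at each position of a set of aligned words of equal width.

```python
from collections import Counter

def alignment_matrix(words, alphabet="ACGT"):
    """Return {letter: [count at position 0, count at position 1, ...]}
    for a list of aligned words that all have the same width."""
    width = len(words[0])
    if any(len(w) != width for w in words):
        raise ValueError("all aligned words must have the same width")
    matrix = {letter: [0] * width for letter in alphabet}
    for pos in range(width):
        # Count the letters observed in this column of the alignment
        counts = Counter(w[pos].upper() for w in words)
        for letter in alphabet:
            matrix[letter][pos] = counts[letter]
    return matrix

words = ["TTGACA", "TTGACT", "TTTACA", "TAGACA"]  # made-up example words
for letter, row in alignment_matrix(words).items():
    print(letter, row)
```

Each row of the printed result corresponds to a letter and each column to a position, matching the layout described above.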
Matrices are constructed by sequentially adding additional L-mers
(subsequences of length L, where L is the width of the pattern being
sought) to previously saved matrices. During each cycle, only the
most significant matrices are saved. The maximum number of matrices to
save is determined by the "-q" option (see section 1 below). In
practice, fewer matrices are ultimately saved because many of the
matrices initially saved are identical to each other.
The program can use 3 different criteria for deciding to stop adding
additional words to the saved matrices:
1) Each sequence has contributed exactly one word to the saved
matrices (the default).
2) The saved matrices contain a maximum allowable number of words (set
with the "-n" and "-N" options).
3) The program has completed a designated number of cycles since finding
the current most significant alignment (set with the "-t" option).
This last criterion is used in addition to criteria 1 and 2
to terminate the program sooner.
The significance of a matrix is initially measured by its information
content. A higher information content indicates a rarer pattern and a
more desirable matrix. The program also estimates for each matrix a
p-value, which is the probability of observing the particular
information content or higher in an arbitrary alignment of random
L-mers. The ultimate statistical significance of a matrix is
determined by multiplying the p-value by the approximate number of
possible alignments, containing the designated number of sequences and
having the observed width. This product is the expected frequency of
observing the particular information content or higher in an arbitrary
alignment of random sequences, given the alignment width and the total
amount of sequence data. This expectation is called the e-value. The
e-value allows the comparison of matrices summarizing differing
numbers of sequences and having differing widths.
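The quantities in the preceding paragraph can be sketched in Python. This is a simplified illustration, assuming the standard relative-entropy form of information content (the sum over positions and letters of f * ln(f / p), in natural logarithms); the programs themselves may apply additional adjustments. The function names are hypothetical.

```python
import math

def information_content(matrix, priors):
    """Information content (natural logs) of a count matrix.

    matrix: {letter: [counts per position]}
    priors: {letter: prior probability}
    Cells with a zero count contribute nothing (0 * ln 0 is taken as 0).
    """
    width = len(next(iter(matrix.values())))
    total = 0.0
    for pos in range(width):
        n = sum(row[pos] for row in matrix.values())  # words in this column
        for letter, row in matrix.items():
            if row[pos] > 0:
                f = row[pos] / n
                total += f * math.log(f / priors[letter])
    return total

def expected_frequency(p_value, num_alignments):
    """e-value as described above: the p-value multiplied by the
    approximate number of possible alignments."""
    return p_value * num_alignments

# Two perfectly conserved positions with uniform priors: 2 * ln(4)
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
matrix = {"A": [4, 0], "C": [0, 4], "G": [0, 0], "T": [0, 0]}
print(round(information_content(matrix, priors), 3))  # prints 2.773
```

The p-value itself requires the large-deviation calculation described in the theory manuscripts and is not reproduced here.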
The program can print two different lists of matrices. The first list
contains the matrices having the highest information content from each
cycle, ordered by decreasing statistical significance (i.e.,
increasing e-value). In general, this first list will contain the
most interesting alignment. The second list contains the matrices
saved after the final cycle of the program, also ordered by decreasing
statistical significance. In general, this latter list will be useful
when the user wishes each sequence to contribute exactly one word to
the final alignment (i.e., when the "-n" and "-N" options are not used).
In the program's output, the words contained in each matrix are listed
in the order of their occurrence in the input sequences. The order is
indicated by "integer|integer". The first integer is simply a
sequential count of the words, and the second integer indicates during
which cycle the word was added to the matrix. The location of a word
is indicated by "integer/integer". The first integer indicates which
sequence contains the word, and the second integer indicates where in
that sequence the word is located. If the first integer is preceded
by a minus sign, then the complementary word is the one included in
the matrix.
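For post-processing the output, the two notations described above could be parsed roughly as follows. This Python sketch handles only the "integer|integer" and "integer/integer" tokens themselves; the surrounding layout of a real output line is not assumed, and the function names are hypothetical.

```python
import re

def parse_order(token):
    """Parse "integer|integer" into (sequential count, cycle added)."""
    m = re.fullmatch(r"\s*(\d+)\s*\|\s*(\d+)\s*", token)
    if m is None:
        raise ValueError(f"not an order token: {token!r}")
    return int(m.group(1)), int(m.group(2))

def parse_location(token):
    """Parse "integer/integer" into (sequence number, position, is_complement).

    A leading minus sign on the first integer means that the
    complementary word is the one included in the matrix."""
    m = re.fullmatch(r"\s*(-?)(\d+)\s*/\s*(\d+)\s*", token)
    if m is None:
        raise ValueError(f"not a location token: {token!r}")
    return int(m.group(2)), int(m.group(3)), m.group(1) == "-"

print(parse_order("3|1"))
print(parse_location("-2/57"))
```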
The output of the program is sent to the standard output. The input
files---those containing the actual sequences and those indicated by
the "-f", "-a", and "-i" options---can contain comments according to
the following convention. The portion of a line following a ';', '%',
or '#' is considered a comment and is ignored. Comments can begin
anywhere in a line and always end at the end of the line. The one
minor exception is that, to avoid ambiguity, comments in the list of
sequences (see the "-f" option below) must be preceded by a blank
space when not occurring at the beginning of a line.
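The comment convention above can be sketched in Python as follows. This is an illustration of the rule as stated, not the parser used by the programs; the function name and the "require_leading_space" parameter are hypothetical.

```python
COMMENT_CHARS = ";%#"

def strip_comment(line, require_leading_space=False):
    """Remove a trailing comment from one line.

    With require_leading_space=True (modeling the rule for the list of
    sequences), a comment character only starts a comment at the
    beginning of the line or when preceded by whitespace."""
    for i, ch in enumerate(line):
        if ch in COMMENT_CHARS:
            if not require_leading_space or i == 0 or line[i - 1].isspace():
                return line[:i]
    return line

print(repr(strip_comment("a:t 3 ; prior for a and t")))
print(repr(strip_comment("seq#1.txt # a comment", require_leading_space=True)))
```

In the second call, the '#' inside the file name is kept because it is not preceded by a blank space.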
COMMAND LINE OPTIONS:
0) -h: print these directions.
1) General information
-f filename: this file (default: read from the standard input) contains
the names of the sequences. The names of the sequences must be
less than 512 characters. The corresponding sequence may follow
its name if the sequence is enclosed between backslashes (\).
Otherwise, the sequence is assumed to be in a separate file having
the indicated name. The format of the actual sequences is described
at the end of these directions.
ADVANCED FEATURES: The following four modifiers can appear in front
of the name of the relevant sequence:
-c: the sequence is circular.
-s integer-integer integer-integer: the positions in the sequence
indicated by the integer pairs, inclusive, are seed sequences.
If the "-s" modifier is used anywhere in the input file, then the
initial set of matrices will only be constructed (i.e., seeded)
from the sequences within the marked regions. If this modifier
is not used anywhere in the input file, then all the sequences
will be used to seed matrices. One or more integer pairs can be
indicated for a single sequence. However, if no integer pairs
are given, the whole sequence will be used for seeding matrices.
-i integer-integer integer-integer: the positions in the sequence
indicated by the integer pairs, inclusive, are the only positions
to be analyzed.
-e integer-integer integer-integer: the positions in the sequence
indicated by the integer pairs, inclusive, are to be excluded
from the analysis.
When both the "-i" and "-e" modifiers are used, the intersection
of permissible positions is analyzed. When a sequence name is
not marked by either the "-i" or "-e" modifier, then the whole
sequence is included in the analysis.
-L integer: width of the pattern being sought (REQUIRED).
-q integer: the maximum number of matrices to save between cycles of the
program---i.e., the queue size (default: save 1000 matrices).
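The way the "-i" and "-e" sequence modifiers combine (see the "-f" option above) can be sketched in Python. This is only an illustration: the function name is hypothetical, and the sketch assumes that positions are numbered starting from 1.

```python
def permissible_positions(length, include=None, exclude=None):
    """Return the sorted positions that remain after applying
    "-i" (include) and "-e" (exclude) integer pairs, both inclusive.

    Assumption: positions are 1-based. With no "-i" pairs, the whole
    sequence is the starting set."""
    if include:
        allowed = set()
        for start, end in include:
            allowed.update(range(start, end + 1))
    else:
        allowed = set(range(1, length + 1))
    for start, end in exclude or []:
        allowed.difference_update(range(start, end + 1))
    return sorted(p for p in allowed if 1 <= p <= length)

# "-i 2-8" combined with "-e 4-5" on a sequence of length 10
print(permissible_positions(10, include=[(2, 8)], exclude=[(4, 5)]))
```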
2) Alphabet options. The three options in this section are mutually
exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and the proportionalities
for determining a priori probabilities.
Each line contains a letter (a symbol in the alphabet) followed by
an optional proportionality (default: 1.0). The proportionality is
based on the relative prior probabilities of the letters. For nucleic
acids, this might be the genomic frequency of the bases; however,
if the "-d" option is not used, the frequencies observed in your own
sequence data are used. In nucleic acid alphabets, a letter and its
complement appear on the same line, separated by a colon (a letter can
be its own complement, e.g. when using a dimer alphabet).
Complementary letters may use the same proportionality. Only the
standard 26 letters are permissible; however, when the "-CS" option is
used, the alphabet is case sensitive so that a total of 52 different
characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter proportionality
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement proportionality
letter:complement proportionality:complement's_proportionality
-A alphabet_and_proportionality_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
-i filename: (OBSCURE OPTION) same as the "-a" option, except that
the symbols of the alphabet are represented by integers rather
than by letters. Any integer permitted by the machine is a
permissible symbol. Each symbol and its optional complement
and proportionality must be on a single line.
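The line formats listed under the "-a" option can be read with a few lines of Python. This sketch is an interpretation of the formats above, not the parser in the programs: the function name is hypothetical, comments are assumed to have been removed already, and a single proportionality is assumed to apply to both a letter and its complement.

```python
def parse_alphabet_line(line):
    """Parse one alphabet line into a list of (letter, proportionality)."""
    fields = line.split()
    if not fields:
        return []
    letters = fields[0].split(":")        # "a" or "a:t"
    if len(fields) > 1:
        props = [float(x) for x in fields[1].split(":")]
    else:
        props = [1.0]                     # default proportionality
    if len(props) == 1:
        props = props * len(letters)      # complement shares the value
    if len(props) != len(letters):
        raise ValueError(f"malformed alphabet line: {line!r}")
    return list(zip(letters, props))

print(parse_alphabet_line("a:t 3"))
print(parse_alphabet_line("c:g 0.2:0.3"))
```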
3) Alphabet modifiers
-d: use the designated prior probabilities of the letters to override the
observed frequencies. By default, the program uses the frequencies
observed in your own sequence data for the prior probabilities of the
letters. However, if the "-d" option is set, the prior probabilities
designated in the alphabet information (see section 2 above) are used.
If the "-d" option is not set, the "-A", "-a", and "-i" options
described in section 2 are still needed for determining the sequence
alphabet, but any prior probability information is ignored.
-CS: (OBSCURE OPTION) ascii alphabets are case sensitive. This
option is mutually exclusive with the "-i" option
(default: ascii alphabets are case insensitive).
-w: (OBSCURE OPTION) only count letters that are included within
the sequence fragments being aligned. When the "-i" or "-e" sequence
modifiers are used or the sequences contain forward slashes (/), some
sequence data is excluded from the sequences being aligned.
This option indicates that only the sequence data being aligned will
be counted when determining the observed frequency of each letter.
When the "-d" option is not used, this option will influence the
determination of the a priori probabilities and, thus, affect the
outcome of the alignment. When the "-w" option is not used, all the
sequence data will be counted towards determining the observed
frequency of each letter. Earlier versions of CONSENSUS (version 6c
and earlier) and WCONSENSUS (version 5c and earlier) are equivalent to
always having the "-w" option in effect.
4) Options for handling the complement of nucleic acid sequences---
the four options in this section are mutually exclusive
-c0: ignore the complement (the default option)
-c1: include both strands as separate sequences
-c2: include both strands as a single sequence (i.e., orientation unknown)
-c3: assume pattern is symmetrical
5) Algorithm options
the "-pr1" and "-pr2" options are mutually exclusive;
the "-l" and "-n" options are mutually exclusive;
the "-n" and "-N" options are mutually exclusive;
the "-m" option can only be used when the "-n" or "-N" option is used.
-pr1: (OBSCURE OPTION) save the top progeny matrices regardless of
parentage.
-pr2: (OBSCURE OPTION) try to save the top progeny matrices for each
parental matrix (the default). This option prevents a strong pattern
found in only a subset of the sequences from overwhelming the
algorithm and eliminating other potential patterns. This undesirable
situation can occur when a subset of the sequences share an
evolutionary relationship not common to the majority of the sequences.
This option corresponds to the original "consensus" algorithm
(Stormo and Hartzell, 1989, PNAS, 86:1183-1187; Hertz et al., 1990,
CABIOS, 6:81-92).
-l: (lowercase L) seed with the first sequence and proceed linearly
through the list. This option results in a significant speed
up in the program, but the algorithm becomes dependent on the
order of the sequence-file names. This option corresponds to
the original "consensus" algorithm (Stormo and Hartzell, 1989,
PNAS, 86:1183-1187; Hertz et al., 1990, CABIOS, 6:81-92).
-n integer: repeat the matrix building cycle a maximum of "integer"
times and allow each sequence to contribute zero or more words
per matrix.
-N integer: repeat the matrix building cycle a maximum of "integer"
times and allow each sequence to contribute one or more words
per matrix.
-m integer: the minimum distance between the starting points of words
within the same matrix pattern; must be a positive integer; can only
be used when the "-n" or "-N" option is also used. If the integer
is a 1, then there is no restriction on the overlap. If the integer
is the same as the integer indicated by the "-L" option, then no
overlap is allowed (default: integer indicated by the "-L" option).
When the "-c2" option is used (see section 4 above), then the "-m" option also
indicates the minimum distance between the start of a word and the
end of a word on the complementary strand.
-t integer: terminate the program "integer" cycles after the current
most significant alignment is identified (default: terminate only
when the maximum number of matrix building cycles is completed).
6) Output options
-pt integer: the number of matrices to print of the top matrices from
each cycle (default when NOT using the "-l" option: print 4 matrices;
default when using the "-l" option: print no matrices).
An integer of -1 means print all the top matrices.
-pf integer: the number of matrices to print of the matrices saved from
the final cycle (default when NOT using "-n" or "-N" options: print 4
matrices; default when using "-n" or "-N" option: print no matrices).
7) Options indicating how unrecognized symbols are treated (default: -u1).
Symbols are letters when option "-a" or "-A" is used;
symbols are integers when option "-i" is used.
The following three options are mutually exclusive.
-u0: treat unrecognized symbols as errors and exit the program.
-u1: treat unrecognized symbols as discontinuities, but print a warning.
Treating a symbol as a discontinuity means that any sequence word
containing the unrecognized symbol will be ignored.
-u2: treat unrecognized symbols as discontinuities, and print NO warning.
FORMAT OF THE SEQUENCE FILES
Do not explicitly give the complements of nucleic acid sequences. If
needed, the complementary sequence is determined by the program.
Whitespace, periods, dashes (unless part of an integer when the "-i"
option is used), and comments beginning with ';', '%', or '#' are
ignored. When using letter characters (i.e., with the "-a" and "-A"
alphabet options), integers are also ignored so that the sequence file
can contain positional information. When using integer characters
(i.e., with the "-i" alphabet option) the integers must be separated
by whitespace.
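For a letter alphabet, the ignored characters described above can be removed with a small Python sketch. It is an illustration of the stated rules only (the function name is hypothetical); slashes and backslashes, which have special meanings, are left untouched.

```python
import re

def clean_letter_sequence(text):
    """Strip what the directions say is ignored with a letter alphabet:
    comments, whitespace, periods, dashes, and integers (digits)."""
    kept = []
    for line in text.splitlines():
        line = re.split(r"[;%#]", line, maxsplit=1)[0]   # drop any comment
        kept.append(re.sub(r"[\s.\-0-9]", "", line))     # drop ignored chars
    return "".join(kept)

raw = "1 acgt acgt 8 ; first line\n9 tt-a.c\n"
print(clean_letter_sequence(raw))  # prints acgtacgtttac
```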
Sequences surrounded by slashes (/) do not contribute to the
generation of the patterns; thus, a portion of a sequence can be
ignored without disrupting the overall numbering of the sequence.
A double slash (//) would indicate a discontinuity in the sequence.
A '/' at the beginning or the end of a sequence will cause the sequence
to be marked as non-circular even if the sequence's name is marked
with a "-c" (see the "-f" option in section 1). The effect of the
single slashes can also be created with the "-i" and "-e" modifiers in
the file containing the names of the sequences (see the "-f" option in
section 1). When slashes and the "-i" and "-e" modifiers are all
used, the intersection of permissible positions is analyzed.
Sequences that follow their name in the file indicated by the "-f"
option must be enclosed between backslashes (\) (i.e., the actual
sequence must be preceded and followed by a backslash). However, if
the sequence is contained in a separate file, do NOT use a '\'.
---------------------------------------------------------------------------
Copyright 2002 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
CONSENSUS.MULTIWIDTHS (version 1b)
This program allows the user to feed the consensus program a range of
widths. All the normal command line options for the CONSENSUS program
are passed directly to the CONSENSUS program except for the -L, the
-f, and the -h options. The input sequences cannot be read from the
standard input when using this script. The program also adds the -M
option for seeking additional independent alignments, the -F option
for reading input sequences that are in the FASTA format, the -O
option for indicating the prefix of the output files, and the -S
option described below.
The name of the specific version of the CONSENSUS program and all its
other command line options are placed on the command line following
the name of this program. In addition to a current version of the
CONSENSUS program, this program requires the CON-FILTER,
SEQ-MODIFIER, and FASTA-CONSENSUS programs.
EXAMPLE
consensus.multiwidths-v1b consensus-v6d -A a:t c:g -c2 -n 40 -t 4 -L 8 14 -f input_sequence -M 4 > outputfile
NOTE: the "-A a:t c:g -c2 -n 40 -t 4" options are passed directly to
the "consensus-v6d" program and are not explicitly discussed below.
-h: Print these directions
-L OPTION (REQUIRED):
The -L option can have alternative formats. In the following descriptions,
"width", "minwidth", "maxwidth", "delta" are positive integers. The
-L option can have the following formats:
The standard format: -L width
Repeatedly execute the program with all widths between
"minwidth" and "maxwidth", inclusive: -L minwidth maxwidth
Repeatedly execute the program starting with "minwidth" and
incrementing the width by "delta" until the width is greater than
"maxwidth": -L minwidth maxwidth delta
-f OR -F OPTIONS (REQUIRED):
The -F option is a mutually exclusive alternative to the -f option.
-F indicates that the input sequence file is in the FASTA format
rather than the CONSENSUS format that is directly used by the CONSENSUS
program. The sequence file will be run through filters to convert from
the FASTA format (if -F is used) and to delete sequence words when
more than one alignment is being sought (see the -M option below).
-M OPTION:
The -M option tells the program to search for the indicated number of
independent alignments (the program will stop if there is insufficient
sequence remaining to create an independent alignment). The -M option
can have either of the following 2 formats.
The program will also stop searching for new matrices if the e-value
goes above 1: -M number
The program will only stop searching for new matrices after the
indicated number of independent alignments have been found: -M +number
-O OPTION:
By default, a unique prefix is created for the output files by
appending the options listed on the command line to the process ID.
The -O option allows the user to set the prefix for the output files.
The format of this option: -O prefix.
-S OPTION:
Start searching for independent matrices starting from the output of a
previous run of the consensus program generated by "consensus.multiwidths".
The "output_file_of_consensus" refers to the output of the CONSENSUS
program in one of the files listed in the output of "consensus.multiwidths".
These output files have names with the general format of "prefix.L11.MAT3".
The format of this option: -S output_file_of_consensus
The -L and the -f (or -F) options are still required with this option.
The -M option is optional; however, if you wish more than one more
matrix, it needs to be higher than the integer suffix at the end of
the "output_file_of_consensus". The -O option is inactive. The
options specific for the consensus program, including the program's
name, are read from the "output_file_of_consensus" and are not placed
on the command line.
---------------------------------------------------------------------------
Copyright 1991--2002 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
WCONSENSUS (version 5d)
REQUIRED PARAMETER
-s <Number of standard deviations for identifying information peaks>
BASIC OPTIONS
-h <Print directions>
-f <Name of sequence file>
-a <Name of ascii alphabet file>
-A <Ascii alphabet information on the command line>
-d <Use designated prior frequencies (default: use observed frequencies)>
-c0 <Ignore the complementary strand (default)>
-c1 <Include both strands as separate sequences>
-c2 <Include both strands as a single sequence (i.e., orientation unknown)>
-l <Seed with first sequence and proceed linearly through list>
-n <Maximum number of cycles (0 or more sites per sequence)>
-N <Maximum number of cycles (1 or more sites per sequence)>
ADVANCED OPTIONS
-q <Number of matrices to save (default: 200)>
-m <Minimum distance between words (default: 1)>
-t <Terminate indicated number of cycles after most significant alignment>
-pt <Number of top matrices to print (default: 4 when NOT using -l option)>
-pf <Number of final matrices to print (default: 4 when NOT using -n or -N)>
-u0 <Unrecognized characters are errors>
-u1 <Unrecognized characters are discontinuities, but print warning (default)>
-u2 <Unrecognized characters are discontinuities, and print NO warning>
OBSCURE OPTIONS
-i <Name of integer alphabet file>
-CS <Ascii alphabet is case sensitive (default: case insensitive)>
-w <Only count letters included within the sequence fragments being aligned>
-pr1 <Save top progeny matrices regardless of parentage>
-pr2 <Save the top progeny for each parental matrix (the default)>
-pg0 <Do not allow terminal gaps---insertions and deletion (the default)>
-pg1 <Allow penalized terminal gaps---insertions and deletion>
This program determines consensus patterns in unaligned sequences.
The major difference between "wconsensus" and "consensus" is that this
program will determine the width of the pattern being sought. The
algorithm is based on a matrix representation of a consensus pattern.
Each row corresponds to one of the letters of the relevant
alphabet---e.g., 4 rows in the case of DNA. Each column corresponds
to one of the positions within the pattern. The elements of the matrix
are determined by the number of times that the indicated letter occurs
at the indicated position based on the words summarized by the pattern.
Matrices are constructed by sequentially adding additional words to
previously saved matrices. During each cycle, only the most
significant matrices are saved. The maximum number of matrices to
save is determined by the "-q" option (see section 1 below). In
practice, fewer matrices are ultimately saved because many of the
matrices initially saved are identical to each other.
The program can use 3 different criteria for deciding to stop adding
additional words to the saved matrices:
1) Each sequence has contributed exactly one word to the saved
matrices (the default).
2) The saved matrices contain a maximum allowable number of words (set
with the -n and -N options).
3) The program has completed a designated number of cycles since finding
the current most significant alignment (set with the -t option).
This last criterion is used in addition to criteria 1 and 2
to terminate the program sooner.
The significance of a matrix is initially measured by its information
content. A higher information content indicates a rarer pattern and a
more desirable matrix. The information content of alignments having
different widths is compared after subtracting from each position the
average information and a multiple of the standard deviation expected
from an arbitrary alignment of random sequences. The program also
estimates for each matrix a p-value, which is the probability of
observing the particular information content or higher in an arbitrary
alignment of random words having a length equal to the matrix's width.
The ultimate statistical significance of a matrix is determined by
multiplying the p-value by the approximate number of possible
alignments, containing the designated number of sequences and having
the observed width. This product is the expected frequency of
observing the particular information content or higher in an arbitrary
alignment of random sequences, given the alignment width and the total
amount of sequence data. This expectation is called the e-value. The
e-value allows the comparison of matrices summarizing differing
numbers of sequences and having differing widths.
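The relationship between the p-value and the e-value can be sketched
in log space, which matches the ln(p-value) and ln(e-value) quantities
reported by the programs. The Python sketch below is an illustration
only: the function names are hypothetical, and the way the number of
possible alignments is counted (one word per sequence, a single strand,
every start position allowed) is an assumption, not necessarily the
approximation used by the program.

```python
import math

def ln_alignment_count(seq_lengths, width):
    """ln of the number of ways to pick one word of the given width
    from each sequence (an illustrative approximation only)."""
    total = 0.0
    for length in seq_lengths:
        starts = length - width + 1  # possible start positions of a word
        if starts <= 0:
            raise ValueError("sequence shorter than the pattern width")
        total += math.log(starts)
    return total

def ln_e_value(ln_p_value, seq_lengths, width):
    """ln(e-value) = ln(p-value) + ln(number of possible alignments)."""
    return ln_p_value + ln_alignment_count(seq_lengths, width)
```

Working with logarithms avoids underflow, since both the p-value and
the number of alignments can be far outside the range of ordinary
floating-point numbers.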
To identify an overall best alignment, it is necessary to determine
the alignments using various multiples of the standard-deviation
correction to the information content (set with the -s option). As
the standard-deviation correction is increased, fewer positions will
tend to be in the resulting alignments. The overall best alignment is
the one having the smallest e-value. We have found standard-deviation
corrections of 0.5, 1, 1.5, and 2 to be useful starting values.
The program can print two different lists of matrices. The first list
contains the matrices having the highest adjusted information from
each cycle, ordered by decreasing statistical significance (i.e.,
increasing e-value). In general, this first list will contain the
most interesting alignment. The second list contains the matrices
saved after the final cycle of the program, also ordered by decreasing
statistical significance. In general, this latter list will be useful
when the user wishes each sequence to contribute exactly one word to
the final alignment (i.e., when the -n and -N options are not used).
In the program's output, the words contained in each matrix are listed
in the order of their occurrence in the input sequences. The order is
indicated by "integer|integer". The first integer is simply a
sequential count of the words, and the second integer indicates during
which cycle the word was added to the matrix. The location of a word
is indicated by "integer/integer". The first integer indicates which
sequence contains the word, and the second integer indicates where in
that sequence the word is located. If the first integer is preceded
by a minus sign, then the complementary word is the one included in
the matrix.
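As a small illustration of the two labels described above, the
following Python sketch splits an "integer|integer" token and an
"integer/integer" token into their parts. The function name and the
dictionary keys are hypothetical, and the layout of the rest of each
output line is not assumed here; only the two tokens are handled.

```python
def parse_word_labels(order_token, location_token):
    """Split 'count|cycle' and 'sequence/position' tokens into integers."""
    count, cycle = (int(x) for x in order_token.split("|"))
    seq_text, pos_text = location_token.split("/")
    # A leading minus sign on the sequence number marks the complement.
    complement = seq_text.startswith("-")
    return {
        "count": count,
        "cycle": cycle,
        "sequence": abs(int(seq_text)),
        "position": int(pos_text),
        "complement": complement,
    }
```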
The output of the program is sent to the standard output. The input
files---those containing the actual sequences and those indicated by
the "-f", "-a", and "-i" options---can contain comments according to
the following convention. The portion of a line following a ';', '%',
or '#' is considered a comment and is ignored. Comments can begin
anywhere in a line and always end at the end of the line. The one
minor exception is that, to avoid ambiguity, comments in the list of
sequences (see the "-f" option below) must be preceded by a blank
space when not occurring at the beginning of a line.
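The comment convention above can be sketched as follows. This Python
function is only an illustration of the rule as described; its name
and the "in_sequence_list" flag are hypothetical, and "blank space" is
taken here to mean any whitespace character.

```python
COMMENT_CHARS = ";%#"

def strip_comment(line, in_sequence_list=False):
    """Remove a trailing comment introduced by ';', '%', or '#'.

    When in_sequence_list is true, a comment character that is not at
    the start of the line only starts a comment if a blank precedes it.
    """
    for i, ch in enumerate(line):
        if ch not in COMMENT_CHARS:
            continue
        if in_sequence_list and i > 0 and not line[i - 1].isspace():
            continue  # part of a name such as "seq#1", not a comment
        return line[:i]
    return line
```

The extra rule for the list of sequences lets a sequence name contain
one of the comment characters without being truncated.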
COMMAND LINE OPTIONS:
0) -h: print these directions.
1) General information
-f filename: this file (default: read from the standard input) contains
the names of the sequences. The names of the sequences must be
less than 512 characters. The corresponding sequence may follow
its name if the sequence is enclosed between backslashes (\).
Otherwise, the sequence is assumed to be in a separate file having
the indicated name. The format of the actual sequences is described
at the end of these directions.
ADVANCED FEATURES: The following four modifiers can appear in front
of the name of the relevant sequence:
-c: the sequence is circular. WARNING: circular sequences are not
handled completely properly, unless the sequence has a
discontinuity; in other words, circular sequences should be
modified by the -i or -e modifiers described below.
-s integer-integer integer-integer: the positions in the sequence
indicated by the integer pairs, inclusive, are seed sequences.
If the "-s" modifier is used anywhere in the input file, then the
initial set of matrices will only be constructed (i.e., seeded)
from the sequences within the marked regions. If this modifier
is not used anywhere in the input file, then all the sequences
will be used to seed matrices. One or more integer pairs can be
indicated for a single sequence. However, if no integer pairs
are given, the whole sequence will be used for seeding matrices.
-i integer-integer integer-integer: the positions in the sequence
indicated by the integer pairs, inclusive, are the only positions
to be analyzed.
-e integer-integer integer-integer: the positions in the sequence
indicated by the integer pairs, inclusive, are to be excluded
from the analysis.
When both the "-i" and "-e" modifiers are used, the intersection
of permissible positions is analyzed. When a sequence name is
not marked by either the "-i" or "-e" modifier, then the whole
sequence is included in the analysis.
-q integer: the maximum number of matrices to save between cycles of the
program---i.e., the queue size (default: save 200 matrices).
This option can also be changed while the program is running:
see section 5 below.
-s number: the number of standard deviations to lower the information
content at each position before identifying information peaks
(REQUIRED). A range of values should be tried. For example,
try values of 0.5, 1, 1.5, and 2. The overall best alignment is
the one having the smallest e-value.
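The effect of combining the "-i" and "-e" sequence modifiers (described
under the "-f" option above) can be sketched in Python as follows. The
function name and argument layout are hypothetical; the sketch only
shows the intersection rule and ignores circular sequences and slashes
within the sequence data.

```python
def permissible_positions(length, include=None, exclude=None):
    """Return the sorted 1-based positions left for analysis.

    include and exclude are lists of (first, last) pairs, inclusive,
    mirroring the integer pairs of the -i and -e modifiers.
    """
    allowed = set(range(1, length + 1))
    if include:
        kept = set()
        for first, last in include:
            kept.update(range(first, last + 1))
        allowed &= kept
    for first, last in exclude or []:
        allowed -= set(range(first, last + 1))
    return sorted(allowed)
```

For example, a 10-letter sequence marked with "-i 3-8 -e 5-6" would
leave positions 3, 4, 7, and 8 under this reading.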
2) Alphabet options. The three options in this section are mutually
exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and the proportionalities
for determining a priori probabilities.
Each line contains a letter (a symbol in the alphabet) followed by
an optional proportionality (default: 1.0). The proportionality is
based on the relative prior probabilities of the letters. For nucleic
acids, this might be the genomic frequency of the bases; however,
if the "-d" option is not used, the frequencies observed in your own
sequence data are used. In nucleic acid alphabets, a letter and its
complement appear on the same line, separated by a colon (a letter can
be its own complement, e.g. when using a dimer alphabet).
Complementary letters may use the same proportionality. Only the
standard 26 letters are permissible; however, when the "-CS" option is
used, the alphabet is case sensitive so that a total of 52 different
characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter proportionality
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement proportionality
letter:complement proportionality:complement's_proportionality
-A alphabet_and_proportionality_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
-i filename: (OBSCURE OPTION) same as the "-a" option, except that
the symbols of the alphabet are represented by integers rather
than by letters. Any integer permitted by the machine is a
permissible symbol. Each symbol and its optional complement
and proportionality must be on a single line.
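The line formats listed for the "-a" option can be read with a short
parser such as the Python sketch below. The function name is
hypothetical, comments are assumed to have been removed already, and
sharing a single proportionality between a letter and its complement
is an interpretation of the "letter:complement proportionality" format.

```python
def parse_alphabet_line(line):
    """Parse one alphabet line into a list of (letter, proportionality).

    Handles 'letter', 'letter proportionality', 'letter:complement',
    'letter:complement proportionality', and
    'letter:complement proportionality:complement_proportionality'.
    """
    fields = line.split()
    if not 1 <= len(fields) <= 2:
        raise ValueError("expected a symbol and an optional proportionality")
    letters = fields[0].split(":")
    if len(fields) == 2:
        weights = [float(w) for w in fields[1].split(":")]
    else:
        weights = [1.0]  # documented default proportionality
    if len(weights) > len(letters):
        raise ValueError("more proportionalities than letters")
    # A single proportionality is shared by a letter and its complement.
    if len(weights) < len(letters):
        weights = weights * len(letters)
    return list(zip(letters, weights))
```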
3) Alphabet modifiers
-d: use the designated prior probabilities of the letters to override the
observed frequencies. By default, the program uses the frequencies
observed in your own sequence data for the prior probabilities of the
letters. However, if the "-d" option is set, the prior probabilities
designated in the alphabet information (see section 2 above) are used.
If the "-d" option is not set, the "-A", "-a", and "-i" options
described in section 2 are still needed for determining the sequence
alphabet, but any prior probability information is ignored.
-CS: (OBSCURE OPTION) ascii alphabets are case sensitive. This
option is mutually exclusive with the "-i" option
(default: ascii alphabets are case insensitive).
-w: (OBSCURE OPTION) only count letters that are included within
the sequence fragments being aligned. When the "-i" or "-e" sequence
modifiers are used or the sequences contain forward slashes (/), some
sequence data is excluded from the sequences being aligned.
This option indicates that only the sequence data being aligned will
be counted when determining the observed frequency of each letter.
When the "-d" option is not used, this option will influence the
determination of the a priori probabilities and, thus, affect the
outcome of the alignment. When the "-w" option is not used, all the
sequence data will be counted towards determining the observed
frequency of each letter. Earlier versions of CONSENSUS (version 6c
and earlier) and WCONSENSUS (version 5c and earlier) are equivalent to
always having the "-w" option in effect.
4) Options for handling the complement of nucleic acid sequences---
the 3 options in this section are mutually exclusive.
-c0: ignore the complement (the default option)
-c1: include both strands as separate sequences
-c2: include both strands as a single sequence (i.e., orientation unknown)
[ONLY IMPLEMENTED FOR WHEN THE "-n" and "-N" OPTIONS ARE NOT USED!!!!]
5) Algorithm options
the "-pr1" and "-pr2" options are mutually exclusive;
the "-l" and "-n" options are mutually exclusive;
the "-n" and "-N" options are mutually exclusive;
the "-m" option can only be used when the "-n" or "-N" option is used.
-pr1: (OBSCURE OPTION) save the top progeny matrices regardless of
parentage.
-pr2: (OBSCURE OPTION) try to save the top progeny matrices for each
parental matrix (the default). This option prevents a strong pattern
found in only a subset of the sequences from overwhelming the
algorithm and eliminating other potential patterns. This undesirable
situation can occur when a subset of the sequences share an
evolutionary relationship not common to the majority of the sequences.
This option corresponds to the original "consensus" algorithm
(Stormo and Hartzell, 1989, PNAS, 86:1183-1187; Hertz et al., 1990,
CABIOS, 6:81-92).
-l: (lowercase L) seed with the first sequence and proceed linearly
through the list. This option results in a significant speed
up in the program, but the algorithm becomes dependent on the
order of the sequence-file names. This option corresponds to
the original "consensus" algorithm (Stormo and Hartzell, 1989,
PNAS, 86:1183-1187; Hertz et al., 1990, CABIOS, 6:81-92).
-n integer: repeat the matrix building cycle a maximum of "integer"
times and allow each sequence to contribute zero or more words
per matrix.
-N integer: repeat the matrix building cycle a maximum of "integer"
times and allow each sequence to contribute one or more words
per matrix.
-m integer: the minimum distance between the starting points of words
within the same matrix pattern; must be a positive integer; can only
be used when the "-n" or "-N" option is also used. If the integer
is a 1, then there is no restriction on the overlap. (default: 1).
-t integer: terminate the program "integer" cycles after the current
most significant alignment is identified (default: terminate only
when the maximum number of matrix building cycles is completed).
-pg0: (OBSCURE OPTION) do NOT permit terminal gaps (the default).
-pg1: (OBSCURE OPTION) permit penalized terminal gaps---i.e. deletions.
*** "-pg1" cannot currently be used with the "-n" or "-N" options. ***
OBSCURE FEATURE: The "-q", "-n", and "-t" options can be changed
after the program starts by placing the new options in a file called
"changes." suffixed with the process identification number---the PID
number listed at the beginning of the program's output. For example a
file called "changes.10568" might contain "-q10 -n50 -t2". The "-n"
option can change the maximum number of words in the alignments even
if it was not used at the beginning of the program, although it will
not permit a sequence to contribute more than one word to the
alignment unless the "-n" or "-N" option was used on the command line.
If the "-t" option was not used when the program was started, this
option will only keep track of alignments beginning with the cycle
during which it is first initiated.
6) Output options
-pt integer: the number of matrices to print of the top matrices from
each cycle (default when NOT using the "-l" option: print 4 matrices;
default when using the "-l" option: print no matrices).
An integer of -1 means print all the top matrices.
-pf integer: the number of matrices to print of the matrices saved from
the final cycle (default when NOT using "-n" or "-N" option: print 4
matrices; default when using "-n" or "-N" option: print no matrices).
7) Options indicating how unrecognized symbols are treated (default: -u1).
Symbols are letters when option "-a" or "-A" is used;
symbols are integers when option "-i" is used.
The following three options are mutually exclusive.
-u0: treat unrecognized symbols as errors and exit the program.
-u1: treat unrecognized symbols as discontinuities, but print a warning.
Treating a symbol as a discontinuity means that any sequence word
containing the unrecognized symbol will be ignored.
-u2: treat unrecognized symbols as discontinuities, and print NO warning.
FORMAT OF THE SEQUENCES
Do not explicitly give the complements of nucleic acid sequences. If
needed, the complementary sequence is determined by the program.
Whitespace, periods, dashes (unless part of an integer when the "-i"
option is used), and comments beginning with ';', '%', or '#' are
ignored. When using letter characters (i.e., with the "-a" and "-A"
alphabet options), integers are also ignored so that the sequence file
can contain positional information. When using integer characters
(i.e., with the "-i" alphabet option) the integers must be separated
by whitespace.
Sequences surrounded by slashes (/) do not contribute to the
generation of the patterns; thus, a portion of a sequence can be
ignored without disrupting the overall numbering of the sequence.
A double slash (//) would indicate a discontinuity in the sequence.
A '/' at the beginning or the end of a sequence will cause the sequence
to be marked as non-circular even if the sequence's name is marked
with a "-c" (see the "-f" option in section 1). The effect of the
single slashes can also be created with the "-i" and "-e" modifiers in
the file containing the names of the sequences (see the "-f" option in
section 1). When slashes and the "-i" and "-e" modifiers are all
used, the intersection of permissible positions is analyzed.
Sequences that follow their name in the file indicated by the "-f"
option must be enclosed between backslashes (\) (i.e., the actual
sequence must be preceded and followed by a backslash). However, if
the sequence is contained in a separate file, do NOT use a '\'.
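The handling of slashes described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
program's own parser: the function name is hypothetical, only letter
alphabets are considered, comments are assumed to have been removed,
and every non-letter character other than '/' is simply skipped.

```python
def included_segments(seq):
    """Return (start_position, letters) for each contiguous included run.

    Letters between a pair of slashes are skipped but still counted, so
    the numbering of later letters is unchanged. A double slash ends
    the current run without skipping any letters (a discontinuity).
    """
    segments = []
    current = []
    start = None
    pos = 0
    excluded = False
    for ch in seq:
        if ch == "/":
            if current:
                segments.append((start, "".join(current)))
                current = []
            excluded = not excluded
            continue
        if not ch.isalpha():
            continue  # whitespace, periods, dashes, and integers
        pos += 1
        if not excluded:
            if not current:
                start = pos
            current.append(ch)
    if current:
        segments.append((start, "".join(current)))
    return segments
```

Under this reading, "ACGT/NNNN/ACGT" yields two runs starting at
positions 1 and 9, while "ACGT//ACGT" yields two runs starting at
positions 1 and 5.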
---------------------------------------------------------------------------
Copyright 1996, 1998, 2002 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
CON-FILTER (version 2c)
This program filters out the
number of sequences,
the width,
the sample size adjusted information content,
the ln(p-value), and
the ln(e-value)
of the "top MATRIX" and "top MATRIX from the final cycle"
identified by the "consensus", "wconsensus", or "lconsensus" program.
The names of multiple input files can be given on the command line or
read from the standard input. The output goes to the standard output.
OPTIONS
-h: print these directions.
-f int: the maximum width of the name of an input file (default: 45).
ADVANCED OPTIONS
-nh: do not print the output heading.
-oh: print only the output heading. All items following "-oh"
on the command line are ignored.
-p int: (OBSCURE OPTION) the probability number for the lconsensus program.
1: print probability information 1 (correlation) [the default];
2: print probability information 2 (NO correlation).
---------------------------------------------------------------------------
Copyright 1999 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
SEQ-MODIFIER (version 1a)
This program takes sequence data in CONSENSUS format and modifies it
according to the output of the "consensus", "wconsensus",
"lconsensus-gc", "patser", or "gpatser" program. The output of this
program is printed to the standard output.
This program modifies the sequence data to either exclude, include, or
seed with the sequence words contained within a matrix, scored by a
matrix, or located at a specific location within each sequence.
Preexisting sequence modifications can also be included from the
sequence data and the sequence summary in the output of the alignment
programs---see section 3 below.
COMMAND LINE OPTIONS:
0) -h: print these directions.
1) The type of modification.
One of the following mutually exclusive options is required:
-i integer1 integer2: include the designated sequence words plus integer1
letters before and integer2 letters after each sequence word.
-e integer1 integer2: exclude the designated sequence words plus integer1
letters before and integer2 letters after each sequence word.
-s integer1 integer2: seed with the designated sequence words plus
integer1 letters before and integer2 letters after each sequence word.
2) The input for determining the modifications.
One of the following mutually exclusive options is required:
-d filename: the name of the data file. This file should contain
the output of the "consensus", "wconsensus",
"lconsensus-gc", "patser", or "gpatser" program.
-l integer: location of the modification. This option directs each
sequence to include, exclude, or seed around the same
position number of each sequence. The exact number of
positions before and after this location is determined
by the options in section 1.
3) Which preexisting modifications listed in the sequence data and the
sequence summary are to be retained. The default is for the new
modifications to replace preexisting modifications of the same type
(i.e., -i, -e, -s). The -r and -a options are mutually exclusive.
-x: do NOT retain modifications only described in the sequence summary
of the "consensus", "wconsensus", or "lconsensus-gc" outputs.
-r: the new modifications replace all the preexisting modifications
listed in the sequence summary and sequence data.
-a: the new modifications are added to all the modifications listed
in the sequence summary and sequence data.
4) Optional modifiers to the input file named in section 2 (the -d option).
-m integer: exclude or seed with the sequence words contained in
"MATRIX 1" of the indicated matrix list in the output
file of the "consensus", "wconsensus", or
"lconsensus-gc" program. The integer, which should be
either a 1 (the default) or a 2, indicates which
"MATRIX 1" is to be processed since the output of the
alignment programs can contain 2 lists of matrices.
-p number: exclude or seed with sequence words scoring
higher than "number" in the output of the "patser" or
"gpatser" program. If the number is missing, all the
sequences in the output are used to determine the
modification.
5) The source of the sequence data. The source of the sequence data
is determined by one of the following 3 methods
-f filename: read the sequence data from the designated file.
Use a dash (-) to designate the standard input.
If the -f option is not used, read the sequence data from the sequence
file designated in the input file named in section 2 (the -d option).
If the -f option is not used and a sequence file is not designated
in an input file, read the sequence data from the standard input.
---------------------------------------------------------------------------
Copyright 1995, 1999 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
GENBANK-CONSENSUS (version 1a)
This program extracts a set of sequence fragments from a set of
GenBank or EMBL sequences and prints the sequences in CONSENSUS
format. The final line of the non-sequence portion of a sequence
entry is assumed either to begin with "ORIGIN" (GenBank format) or
"SQ" (EMBL format), or to end with a ".." (old GCG format). The
description of which sequence fragments are to be extracted is read
from the standard input and the sequence fragments are written to the
standard output. The following is the format of the input:
Each line describes a single sequence fragment and can contain up to 4
fields. Only the first field is required. Comments following a '#',
'%', or a ';' are ignored until the end of the line.
1) The first field is the name of the sequence fragment and is the only
required field. If given by itself without the "-r"
command line option, then the whole GenBank/EMBL sequence file having this name
is printed.
2) The optional second field is the name of the GenBank/EMBL sequence file. If
the second field is absent, then the GenBank/EMBL sequence file is assumed to
have the same name as the sequence fragment indicated in the first field.
3) The optional third field indicates whether the reverse complement is
desired and describes a reference position. The third field must be
contained within double quotes so it can contain spaces and be
distinguished from a missing second field.
The following is a detailed description of the third field.
"int ... int" or "complement int ... int" or "complement": the
first integer is the reference position when only a partial
region of the sequence file is printed (see the "-r" command
line option below). All the letters corresponding to these
positions are placed in lower case if the "-l" command line
option is used. The word "complement" indicates that the
complementary strand is to be used. If no integers are
indicated, the minus 1 position of the sequence is assumed to be
the reference position.
4) The optional fourth field indicates what region of the sequence file
should be printed. This information is overridden by the -r command
line option. The fourth field must be contained within parentheses and
has the following format.
(int int): print the sequence fragment contained by the 2 positions
indicated by the integers. DEFAULT: print the whole sequence.
Any remaining fields are ignored.
The following 2 command line options are possible:
-l or -l int: make the letters at reference positions lowercase; if
"-l" is followed by an int, offset the lowercase letters by the
indicated number of positions relative to the reference
positions. The remaining letters are all uppercase.
DEFAULT: print all letters in uppercase.
-r int int: print a region of the sequence file relative to the reference
position; the first integer is the number of letters to print prior to
the reference position; the second integer is the number of letters to
print following the reference position.
DEFAULT: print the whole sequence.
---------------------------------------------------------------------------
Copyright 1995, 1996, 1997 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
FASTA-CONSENSUS (version 2)
This program converts a file from the FASTA sequence format to the
CONSENSUS sequence format. The input is from the standard input and
the output is sent to the standard output. In the FASTA format, the
line preceding a sequence begins with a ">" and contains the
sequence's name and a comment.
-l: print the sequence on a single line and do not print comments.
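The conversion can be sketched in Python as follows. This is not the
program's source: the function name is hypothetical, and the output
layout (name, a space, then the sequence between backslashes on one
line, with the FASTA comment dropped) is an assumption based on the
CONSENSUS format described elsewhere in these directions.

```python
def fasta_to_consensus(fasta_text):
    """Convert FASTA text to lines of: name, backslash, sequence, backslash."""
    records = []
    name, parts = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(parts)))
            header = line[1:].split(None, 1)
            if not header:
                raise ValueError("FASTA header without a name")
            name = header[0]  # the rest of the header is the comment
            parts = []
        elif name is not None:
            parts.append(line)
    if name is not None:
        records.append((name, "".join(parts)))
    return "\n".join(n + " \\" + s + "\\" for n, s in records)
```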
---------------------------------------------------------------------------
Copyright 1997 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
RAND-SEQS (version 1)
This program creates a set of randomized sequences from the composition
indicated by the a priori frequencies determined by the alphabet
options (see section 4 below). Because the random letters are chosen
"with replacement," the final composition will generally be somewhat
different from the composition described in the alphabet information.
The output goes to the standard output and is in the CONSENSUS
format---i.e., sequence name, backslash, the randomized sequence, and
another backslash.
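The idea of drawing letters "with replacement" from the a priori
frequencies can be sketched in Python as follows. The function name,
the "rand1", "rand2", ... sequence names, and the use of Python's
random module are assumptions made for illustration; the program's own
random number generator and naming are not described here.

```python
import random

def random_sequences(count, length, alphabet, seed=None):
    """Return `count` lines of: name, backslash, sequence, backslash.

    alphabet maps each letter to its relative a priori frequency.
    """
    rng = random.Random(seed)
    letters = list(alphabet)
    weights = [alphabet[ch] for ch in letters]
    lines = []
    for i in range(1, count + 1):
        # Letters are drawn independently, i.e. "with replacement", so
        # the realized composition only approximates the weights.
        seq = "".join(rng.choices(letters, weights=weights, k=length))
        lines.append("rand" + str(i) + " \\" + seq + "\\")
    return lines
```

A call such as random_sequences(10, 200, {"a": 3, "t": 3, "c": 2,
"g": 2}, seed=42) would correspond roughly to "-n 10 -L 200 -s 42
-A a:t 3 c:g 2".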
COMMAND LINE OPTIONS:
0) -h: print these directions.
1) -s int: the seed for the random number generator
(default: the process identification number).
2) -n int: the number of randomized sequences to generate (required).
3) -L int: the number of letters in each randomized sequence (required).
4) Alphabet options.
The following 2 options are mutually exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and normalization information.
[Use "-af" when using the VMS operating system]
Each line contains a letter (a symbol in the alphabet) followed by
an optional normalization number (default: 1.0). The normalization
is based on the relative a priori frequencies of the letters. For
nucleic acids, this might be the genomic frequency of the bases.
In nucleic acid alphabets, a letter and its complement appear on
the same line, separated by a colon (a letter can be its own
complement, e.g. when using a dimer alphabet). Complementary
letters may use the same normalization number. Only the standard
26 letters are permissible; however, the alphabet is case sensitive
so that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
-A alphabet_and_normalization_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
[Use "-ac" when using the VMS operating system]
---------------------------------------------------------------------------
Copyright 1990, 1994, 1995, 1996, 2000, 2001, 2002 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
PATSER (version 3e)
This program scores the L-mers (subsequences of length L) of the
indicated sequences against the indicated alignment or weight matrix.
The elements of an alignment matrix are simply the number of times
that the indicated letter is observed at the indicated position of a
sequence alignment. Such elements must be processed before the matrix
can be used to score an L-mer (e.g., Hertz and Stormo, 1999,
Bioinformatics, 15:563-577). A weight matrix is a matrix whose
elements are in a form considered appropriate for scoring an L-mer.
Each element of an alignment matrix is converted to an element of a
weight matrix by first adding pseudo-counts in proportion to the a
priori probability of the corresponding letter (see option "-b" in
section 1 below) and dividing by the total number of sequences plus
the total number of pseudo-counts. The resulting frequency is
normalized by the a priori probability for the corresponding letter.
The final quotient is converted to an element of a weight matrix by
taking its natural logarithm. The use of pseudo-counts here differs
from previous versions of this program by being proportional to the a
priori probability.
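The conversion described in the preceding paragraph can be sketched as
follows. This is an illustration in Python, not the program's own code;
the function name alignment_to_weight and the dictionary layout are
assumptions, and the number of sequences is taken to be the column sum
(i.e., every sequence contributes a letter at every position).

```python
import math

def alignment_to_weight(counts, priors, pseudo_total=1.0):
    """Convert an alignment matrix to a weight matrix.

    counts: dict letter -> list of counts, one per position (horizontal layout).
    priors: dict letter -> a priori probability (assumed to sum to 1).
    pseudo_total: total pseudo-counts per position (the "-b" value, default 1).
    With pseudo_total = 0, a zero count has no finite logarithm and raises
    a ValueError.
    """
    width = len(next(iter(counts.values())))
    weights = {letter: [] for letter in counts}
    for pos in range(width):
        # Assumption: the number of sequences equals the column sum.
        n_seqs = sum(counts[letter][pos] for letter in counts)
        for letter in counts:
            pseudo = pseudo_total * priors[letter]
            freq = (counts[letter][pos] + pseudo) / (n_seqs + pseudo_total)
            # Normalize by the a priori probability, then take the natural log.
            weights[letter].append(math.log(freq / priors[letter]))
    return weights
```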
Version 3 of this program differs from previous versions by also
numerically estimating the p-value of the scores. The p-value
calculated here is the probability of observing a particular score or
higher at a particular sequence position and does NOT account for the
total amount of sequence being scored. P-values are estimated by the
method described in Staden, 1989, CABIOS, p. 89--96. The relative
value for each element of the weight matrix is approximated by
integers in a range determined by the "-R" and "-M" options (section 7
below). The p-value is calculated for each possible integer score and
the values are stored. The actual scores for the sequences are
determined from the true weight matrix. The true scores are converted
to their corresponding integer values and their p-values are looked up.
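The table of p-values for integer scores can be built by extending the
score distribution one matrix position at a time. The sketch below
illustrates that general idea in Python; it is not the program's code,
the name integer_score_pvalues is hypothetical, and it assumes the
weights have already been rounded to integers (the scaling controlled
by "-R" and "-M" is not reproduced).

```python
from collections import defaultdict

def integer_score_pvalues(int_matrix, priors):
    """Return dict: integer score -> probability of that score or higher.

    int_matrix: list of dicts, one per position, letter -> integer weight.
    priors: dict letter -> a priori probability of the letter.
    """
    dist = {0: 1.0}  # distribution of the partial score before any position
    for column in int_matrix:
        new_dist = defaultdict(float)
        for score, prob in dist.items():
            for letter, weight in column.items():
                # Each letter extends the partial score independently.
                new_dist[score + weight] += prob * priors[letter]
        dist = dict(new_dist)
    # Accumulate from the highest score down to obtain P(score or higher).
    pvalues = {}
    tail = 0.0
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        pvalues[score] = tail
    return pvalues
```

As in the description above, the resulting p-value refers to a single
sequence position and does not account for the amount of sequence scored.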
Matrices can be either horizontal or vertical. In a horizontal
matrix, the columns correspond to the positions within the pattern,
and the rows correspond to the letters. Each row begins with the
corresponding letter (or integer, if the "-i" option is used). In a
vertical matrix, the rows correspond to the positions within the
pattern, and the columns correspond to the letters. The first row
contains the letters (or integers, if the "-i" option is used)
corresponding to each column. In both types of matrices, spaces,
tabs, and vertical bars (|) are ignored. The output of the "consensus"
and "wconsensus" programs consists of horizontal alignment matrices.
The input files can contain comments according to the following
convention. The portion of a line following a ';', '%', or '#' is
considered a comment and is ignored. Comments can begin anywhere in a
line and always end at the end of the line. The output of this
program is sent to the standard output.
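A minimal sketch of reading a horizontal matrix under the conventions
just described, written in Python for illustration. The function name
parse_horizontal_matrix is hypothetical, and the sketch assumes that
spaces and tabs separate the numbers within a row.

```python
import re

def parse_horizontal_matrix(text):
    """Return dict letter -> list of numbers, one per pattern position."""
    matrix = {}
    for line in text.splitlines():
        # Everything after ';', '%', or '#' is a comment.
        line = re.split(r"[;%#]", line, maxsplit=1)[0]
        # Vertical bars are ignored; whitespace separates the fields.
        fields = line.replace("|", " ").split()
        if not fields:
            continue
        # Each row begins with its letter, followed by the row's elements.
        matrix[fields[0]] = [float(x) for x in fields[1:]]
    return matrix
```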
The following options can be specified on the command line.
0) -h: print these directions.
1) Matrix options.
-m filename: (default name is "matrix") file containing the matrix.
-w: the matrix is a weight matrix (default: alignment matrix)
-b number: a non-negative number indicating the total number of
pseudo-counts added to each alignment position (default: 1).
Before converting an alignment matrix to a weight matrix, the
total pseudo-counts multiplied by the a priori probability
(see section 3 below) of the corresponding letter is added
to each matrix element.
-v: the matrix is a vertical matrix (default: horizontal matrix).
-p: print the weight matrix derived from the alignment matrix.
2) -f filename: this file (default: read from the standard input) contains
the names of the sequences. The corresponding sequence may follow
its name if the sequence is enclosed between backslashes (\).
Otherwise, the sequence is assumed to be in a separate file having
the indicated name.
In the sequences, whitespace, slashes (/), periods, dashes (unless
part of an integer when the "-i" option is used), and comments
beginning with ';', '%', or '#' are ignored. When using letter
characters (i.e., with the "-a" or "-A" alphabet option), integers
are also ignored so that the sequence file can contain positional
information. When using integer characters (i.e., with the "-i"
alphabet option) the integers must be separated by whitespace.
A "-c" preceding the name of a sequence file indicates that the
corresponding sequence is circular.
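The inline form of a sequence entry (a name followed by a sequence
enclosed between backslashes) can be read with a sketch like the one
below. It is a Python illustration under several assumptions: a letter
alphabet is in use, names contain no whitespace or backslashes, and
comments and the "-c" marker are not handled. The function name
parse_inline_sequences is hypothetical.

```python
import re

def parse_inline_sequences(text):
    """Return a list of (name, sequence) pairs for backslash-enclosed entries."""
    entries = []
    # A name, optional whitespace, then a sequence between two backslashes.
    for name, raw in re.findall(r"([^\s\\]+)\s*\\([^\\]*)\\", text):
        # Whitespace, slashes, periods, dashes, and integers are ignored.
        entries.append((name, re.sub(r"[\s/.\-0-9]", "", raw)))
    return entries
```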
3) Alphabet options---the three options in this section are mutually
exclusive (default: "-a alphabet"). The a priori probabilities mentioned
below are used when converting an alignment matrix to a weight matrix.
-a filename: file containing the alphabet and the proportionalities
for determining a priori probabilities.
Each line contains a letter (a symbol in the alphabet) followed by
an optional proportionality (default: 1.0). The proportionality is
based on the relative a priori probabilities of the letters. For
nucleic acids, this might be the genomic frequency of the bases or
the frequencies observed in the data used to generate the alignment.
In nucleic acid alphabets, a letter and its complement appear on the
same line, separated by a colon (a letter can be its own complement,
e.g. when using a dimer alphabet). Complementary letters may use the
same proportionality. Only the standard 26 letters are
permissible; however, when the "-CS" option is used, the alphabet is
case sensitive so that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter proportionality
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement proportionality
letter:complement proportionality:complement's_proportionality
-i filename: same as the "-a" option, except that the symbols of
the alphabet are represented by integers rather than by letters.
Any integer permitted by the machine is a permissible symbol.
-A alphabet_and_proportionality_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
4) Alphabet modifiers indicating whether ascii alphabets are case
sensitive---the two options in this section are mutually exclusive
with each other and with the "-i" option (default: ascii alphabets are
case insensitive).
-CS: ascii alphabets are case sensitive.
-CM: ascii alphabets are case insensitive, but mark the location
of lowercase letters by printing a line containing their locations.
This option is useful when lowercase letters indicate a functional
landmark such as a transcriptional start in a DNA sequence.
5) Options for adjusting or restricting which information
and scores are printed.
The "-ls", "-li", and "-lp" options are mutually exclusive.
-c: also score the complementary sequences. The complements are
determined by the program and are not explicitly stated in the
sequence files.
-ls number: lower threshold for printing scores, inclusive
(formerly the -l option).
-li: assume that the maximum ln(p-value) for printing scores equals
the negative of the sample-size adjusted information content;
indirectly determines the lower threshold for printing scores.
-lp number: the maximum ln(p-value) for printing scores; indirectly
determines the lower threshold for printing scores.
-up number: upper threshold for printing scores, exclusive.
-t: just print the top score for each sequence.
-t number: print the indicated number of top scores for each sequence.
-ds: if the "-t number" option is used, print the top scores for each
sequence in the order of decreasing score (default: order the
scores according to their position within the sequence).
-e number: the small difference for considering 2 scores equal
(default: 0.000001)
-s: print the sequence corresponding to each score that is printed.
6) Options indicating how unrecognized symbols are treated (default: -u1).
Symbols are letters when option "-a" or "-A" is used;
symbols are integers when option "-i" is used.
The following three options are mutually exclusive.
-u0: treat unrecognized symbols as errors and exit the program.
-u1: treat unrecognized symbols as discontinuities, but print a warning.
Treating a symbol as a discontinuity means that any L-mer
containing the unrecognized symbol will be ignored.
-u2: treat unrecognized symbols as discontinuities, and print NO warning.
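Treating an unrecognized symbol as a discontinuity can be sketched as
follows: every L-mer is scored except those that contain a symbol absent
from the matrix. This is a Python illustration only; the name
score_lmers is hypothetical, and the start positions it reports are
0-based, which is an assumption and may differ from the program's output.

```python
def score_lmers(sequence, weights):
    """Score every L-mer of a sequence against a weight matrix.

    weights: dict letter -> list of per-position weights (horizontal layout).
    Returns a list of (start_index, score); L-mers containing a symbol
    that is not in the matrix are skipped.
    """
    width = len(next(iter(weights.values())))
    results = []
    for start in range(len(sequence) - width + 1):
        lmer = sequence[start:start + width]
        if any(ch not in weights for ch in lmer):
            continue  # discontinuity: ignore this L-mer
        score = sum(weights[ch][i] for i, ch in enumerate(lmer))
        results.append((start, score))
    return results
```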
7) Options for adjusting the estimation of p-value.
If the -R option is set to zero, the p-value is not estimated.
-R number: the range for approximating a column of the weight matrix with
integers (default: 10000). This number is the difference
between the largest and smallest integers used to estimate
the scores. Higher values increase precision, but will take
longer to calculate the table of p-values.
-M number: the minimum score for approximating p-values (default: 0).
Higher values will increase precision,
but may miss interesting scores.
---------------------------------------------------------------------------
Copyright 1991, 1992, 1993, 1994, 1998, 2003 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
GMAT-INF-GC (version 2c)
This program determines the information content of an alignment
matrix. It can also optionally determine and graph the information
content for each individual position of the matrix.
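For orientation, the sketch below computes information content in the
relative-entropy form, summing f * ln(f / p) over the letters at each
position, where f is the observed frequency and p the prior probability.
This section does not state the exact formula the program uses, so the
form, the use of natural logarithms, and the name information_content
are all assumptions; gaps and the sample-size adjustments of the "-sa"
and "-st" options are not covered.

```python
import math

def information_content(counts, priors):
    """Return (total, per_position) information content of an alignment matrix.

    counts: dict letter -> list of counts, one per position.
    priors: dict letter -> prior probability of the letter.
    """
    width = len(next(iter(counts.values())))
    per_position = []
    for pos in range(width):
        total = sum(counts[letter][pos] for letter in counts)
        info = 0.0
        for letter in counts:
            freq = counts[letter][pos] / total
            if freq > 0:  # a letter that is never observed contributes zero
                info += freq * math.log(freq / priors[letter])
        per_position.append(info)
    return sum(per_position), per_position
```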
Matrices can be either horizontal or vertical. In a horizontal
matrix, the columns correspond to the positions within the alignment,
and the rows correspond to the letters. Each row begins with the
corresponding letter (or integer, if the "-i" option is used). An
optional row corresponding to the number of gaps begins with a dash (-).
If the matrix contains a gap row, then it can also contain the 4
optional correlation rows: LL corresponding to the number of letters
preceded by a letter, -L corresponding to the number of letters
preceded by a gap, L- corresponding to the number of gaps preceded by
a letter, and -- corresponding to the number of gaps preceded by a gap.
In a vertical matrix, the rows correspond to the positions within the
alignment, and the columns correspond to the letters. The first row
contains the letters (or integers, if the "-i" option is used)
corresponding to each column. The first row can also contain the gap
and correlation labels as described in the previous paragraph
(i.e. -, LL, -L, L-, --) to indicate the presence of the corresponding
optional columns. In both types of matrices, spaces, tabs, and
vertical bars (|) are ignored.
The input files can contain comments according to the following
convention. The portion of a line following a ';', '%', or '#' is
considered a comment and is ignored. Comments can begin anywhere in a
line and always end at the end of the line. The output of this
program is sent to the standard output.
The following options can be specified on the command line.
0) -h: print these directions.
1) Information options.
-n integer: The number of sequences in the alignment. This option
is necessary only if no position of the alignment contains a
representative from every sequence. This situation can only
occur in alignments that ignore terminal gaps. (default: the
maximum number of sequences at each position of the alignment)
-sa: Adjust the information for sample size by subtracting the
average background expected from a random alignment.
-st number: Adjust the information by subtracting the indicated
number of standard deviations expected from a random alignment
from each position of the alignment. This option can only be
used when the "-sa" option is also used.
2) Graphing options (default: do NOT graph the information content)
-g1: determine and graph the information content for each individual
position of the matrix and print the matrix.
-g2: determine and graph the information content for each individual
position of the matrix, but do NOT print the matrix.
3) Matrix options.
-m filename: file containing the matrix (default is the standard input).
-v: the matrix is a vertical matrix (default: horizontal matrix).
4) Alphabet options---the three options in this section are
mutually exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and normalization information.
[Use "-af" when using the VMS operating system]
Each line contains a letter (a symbol in the alphabet) followed by
an optional normalization number (default: 1.0). The normalization
is based on the letter's relative prior probability when generating
the alignment. For nucleic acids, this would typically be the
genomic frequency of the bases or the frequency observed in the
dataset used to generate the alignment. In nucleic acid alphabets,
a letter and its complement appear on the same line, separated by a
colon (a letter can be its own complement, e.g. when using a dimer
alphabet). Complementary letters may use the same normalization
number. Only the standard 26 letters are permissible; however,
when the "-CS" option is used, the alphabet is case sensitive so
that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
-i filename: same as the "-a" option, except that the symbols of
the alphabet are represented by integers rather than by letters.
Any integer permitted by the machine is a permissible symbol.
[Use "-if" when using the VMS operating system]
-A alphabet_and_normalization_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
[Use "-ac" when using the VMS operating system]
5) Alphabet modifier indicating whether ascii alphabets are case
sensitive---the following option is mutually exclusive with
the "-i" option (default: ascii alphabets are case insensitive).
-CS: ascii alphabets are case sensitive.
[Use "-as" when using the VMS operating system]
---------------------------------------------------------------------------
Copyright 1998 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
P-VALUE (version 3a)
This program estimates the p-value of negative log-likelihood scores
for an alignment matrix (i.e. the probability of observing a
particular negative log-likelihood score or greater in an arbitrary
alignment of random sequences). A negative log-likelihood score is
the negative of the standard log-likelihood ratio statistic and equals
the information content multiplied by the number of sequences in the
alignment. The probability calculation assumes that the probability
of a letter at each position of a sequence is independent and
identically distributed.
P-values are estimated using a technique from large-deviation
statistics. The estimate is least accurate for scores very close to the
maximum possible score, where it may be off by a factor of 2.
Negative log-likelihood scores are read from the standard input,
unless the -b or the -s command line option is used (see section 2).
The output is printed to the standard output.
COMMAND LINE OPTIONS:
0) -h: print these directions.
1) Description of the alignment whose p-value is being determined.
-L integer: the width of the alignment being analyzed (required).
-n integer: the number of sequences in the alignment (required).
-c3: the alignment is constrained to be a symmetrical nucleotide alignment;
requires that a symmetrical alphabet be defined in section 3.
2) The score---i.e. the negative log-likelihood ratio---whose p-value
is being determined. If neither option is used, the scores are read
from the standard input.
-s score: determine the p-value of the indicated score.
-b integer: divide the range of possible scores into the indicated
number of equal-sized bins. Determine and print the p-value
for the minimum score in each bin and for the maximum possible
score. (The number of scores printed will be one more than the
indicated integer.)
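The set of scores reported with the "-b" option can be pictured with the
short Python sketch below. The function name bin_scores is hypothetical,
and the minimum and maximum possible scores are taken as inputs because
this section does not say how the program derives them.

```python
def bin_scores(min_score, max_score, bins):
    """Return the minimum score of each equal-sized bin plus the maximum score."""
    step = (max_score - min_score) / bins
    # One score per bin, followed by the maximum possible score.
    return [min_score + i * step for i in range(bins)] + [max_score]
```

With 4 bins, the list contains 5 scores, one more than the indicated
integer.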
3) Alphabet options
The following 3 options are mutually exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and normalization information.
Each line contains a letter (a symbol in the alphabet) followed by an
optional normalization number (default: 1.0). The normalization is
based on the relative prior probabilities of the letters. The prior
probability of a letter might be its overall frequency in all the
sequences of a particular organism or of a particular subset of
sequences. In nucleic acid alphabets, a letter and its complement
appear on the same line, separated by a colon (a letter can be its own
complement). Complementary letters may use the same normalization
number. Only the standard 26 letters are permissible; however, when
the "-CS" option is used, the alphabet is case sensitive so that a
total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
-i filename: same as the "-a" option, except that the symbols of
the alphabet are represented by integers rather than by letters.
Any integer permitted by the machine is a permissible symbol.
-A alphabet_and_normalization_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
4) Alphabet modifier indicating whether ascii alphabets are case
sensitive---the following option is mutually exclusive with
the "-i" option (default: ascii alphabets are case insensitive).
-CS: ascii alphabets are case sensitive.
---------------------------------------------------------------------------
Copyright 1994, 1995, 1996, 2003 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
MAKE-MATRIX (version 2)
This program determines a matrix from a list of aligned sequences.
The input is from the standard input and the output goes to the
standard output. Each sequence is on a single line, must be preceded
by the name of the sequence, and can contain the following characters:
letters indicated in the alphabet information (see below), a dash
corresponding to an internal gap, a period corresponding to a terminal
gap, and '$' and '&' corresponding to dummy alignment characters (when
the -d option is used). Each sequence must have the same number of
characters. Spaces, integers, slashes, backslashes, and apostrophes
are ignored. In addition, comment lines beginning with a ';', '#', or
'%' are ignored.
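The counting step can be sketched in Python as follows. This is an
illustration rather than the program's code: the name make_matrix and
the default alphabet "acgt" are assumptions, letters are treated as case
insensitive, names are assumed to contain no whitespace, and gap rows,
correlation rows, and dummy alignment characters are not produced.

```python
def make_matrix(lines, alphabet="acgt"):
    """Return dict letter -> list of counts, one per alignment position."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped[0] in ";#%":
            continue  # blank line or comment line
        parts = stripped.split(None, 1)  # the name precedes the sequence
        rest = parts[1] if len(parts) > 1 else ""
        # Spaces, integers, slashes, backslashes, and apostrophes are ignored.
        seq = [ch for ch in rest.lower()
               if ch not in " \t/\\'" and not ch.isdigit()]
        rows.append(seq)
    if not rows:
        raise ValueError("no sequences found")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("sequences must have the same number of characters")
    matrix = {letter: [0] * width for letter in alphabet}
    for seq in rows:
        for pos, ch in enumerate(seq):
            if ch in matrix:  # gaps ('-' and '.') are not counted here
                matrix[ch][pos] += 1
    return matrix
```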
OPTIONS
0) -h: print these directions.
1) General information
-v: print the output as a vertical matrix in which the columns
correspond to the letters and the rows correspond to the
positions within the alignment. The default is to print the
output as a horizontal matrix in which the rows correspond
to the letters and the columns correspond to the positions
within the alignment.
-d: sequences may contain the dummy alignment characters "$" and "&".
-i: do not print columns containing dummy alignment characters.
2) Options determining whether the alignment matrix should contain gap
and correlation rows.
DEFAULT: if the alignment does not contain gaps, do not print gap or
correlation rows; if the alignment contains gaps, print both
gap and correlation rows.
-cn: do not print correlation rows; if no gaps, do not print the gap row.
-g: do not print the correlation rows; print the gap row,
even if no gaps are in the alignment.
-cg: print the gap and correlation rows,
even if no gaps are in the alignment.
3) Alphabet options
The next two options are mutually exclusive (default: "-a alphabet").
-a filename: file containing the alphabet and normalization information.
Each line contains a letter (a symbol in the alphabet) followed by
an optional normalization number (default: 1.0). THE NORMALIZATION
IS NOT USED IN THIS PROGRAM, but this format is retained to
maintain compatibility with my other programs. The normalization is
based on the relative prior probabilities of the letters. For
nucleic acids, this might be the genomic frequency of the bases.
In nucleic acid alphabets, a letter and its complement appear on
the same line, separated by a colon (a letter can be its own
complement, e.g. when using a dimer alphabet). Complementary
letters may use the same normalization number. Only the
standard 26 letters are permissible; however, when the "-CS" option is
used, the alphabet is case sensitive so that a total of 52 different
characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
-A alphabet_and_normalization_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
4) Alphabet modifier indicating whether ascii alphabets are case
sensitive (default: ascii alphabets are case insensitive).
-CS: ascii alphabets are case sensitive.