T. N. T.
Tree Analysis Using New Technology
Version 1.1 © P. Goloboff, J. S. Farris, and K. Nixon

Topics:
Introduction - Getting Started
What TNT doesn't do...
Printing & Metafiles
Continuous characters
Merging Data Files
Basic Character Settings
Scopes, sets, and blocks
Character and block names
Step-Matrices
Ancestral states
Nexus files
Saving suboptimal trees
Memory Management
Tree Collapsing
Batch Menus
Measures of support
Linux versions
Consensus estimation
Search algorithms
Setting search parameters
Implementation notes...
Scripting (advanced users)

Commands, by category: General; Optimality criteria; Search options; Tree diagnosis; Tree comparisons; Group supports, randomization; Scripting.
TNT is a program for phylogenetic analysis under parsimony. It provides fast tree-searching algorithms,
as well as extensive capabilities for tree diagnosis, consensus, and
manipulation. This documentation
explains some important things that are not immediately apparent (or cannot be
done easily) from the menus themselves. For details on how to use commands,
users must refer to the on-line help of TNT (accessed by typing help
<enter> at the command prompt, or in the command dialog opened with File/CommandLine). This file, tnt.htm, provides a
general description of the program; if the file is copied to your Windows
directory, then when selecting the Help menu option, the Explorer is
called to read this file. Throughout
this help file, commands that can be typed at the command line (or included in a file) are marked in bold italics; menu options are underlined.
The basic thing you need to do to run TNT is to prepare a data input file. You can either create this from the program
itself, with the File/New and Data/Edit menu options (Windows
only), or in a text editor. This file
should contain the data matrix (format specified in the next section, Data Input); it may also contain specifications of character
names and settings, like weights and additivities (these can be specified later
within the program, but it is advisable to have them in the file, so you
do not forget to change something you need to change; see under Basic Character
Settings, Step-Matrices, Ancestral states, and Character and block names). Check the
files zilla.tnt, example.tnt, and contin.tnt,
three example data files provided with the program. Note that all the input files must be text-only files.
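The input format is described in detail below (see Data Input); as a minimal sketch (the taxa and scores are made up), a data file could contain just:

```text
xread 'minimal example' 5 4
TaxA 00000
TaxB 01100
TaxC 01110
TaxD 01111
;
proc / ;
```

The closing proc /; simply returns control to the user when the file is read.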
Once you have read the data in, you are ready to start doing searches,
etc. Like most parsimony programs (e.g. Hennig86, PAUP*, or Pee-Wee/NONA), TNT
has an internal (memory) buffer for trees, and many of the commands and options
(optimization, consensing, etc.) are available only when there are trees in
memory, as these commands are performed on pre-existing trees. Some search algorithms create trees de
novo (others work on preexisting trees, modifying them to try to find
better trees; see under Tree-searching algorithms). The trees in the memory buffer can be saved to
disk, so they are not lost when you exit the program; you can subsequently read
the trees into the program. The text
output produced by the program (e.g. results and timings of runs,
tree-diagrams, etc.) is written onto an internal text buffer, which can be
later saved to disk.
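For example, a common idiom for saving the trees in memory to disk is the following (the file name is made up; see the on-line help of the tsave and save commands for details and options):

```text
tsave * mytrees.tre ;
save ;
tsave / ;
```

This opens the tree file, writes the trees in memory to it, and closes it.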
If you have a Windows version, make sure you install the font provided
with the program, tred.ttf, in your
system. This font is required for TNT
to draw trees properly. If the tred font is not installed, TNT will
attempt to use the font terminal
(which also draws trees, if not as nice as tred's).
There are many things the program does not do. First, the program is conceived as a tool for standard parsimony
analysis. It does not do maximum
likelihood, it does not use specific models of evolution, and it does not do
Bayesian analyses. Although we
anticipate incorporating some likelihood algorithms, they will be quite
rudimentary.
Second,
the program does not do sequence alignment of any kind. Sequence comparison is (computationally) quite
a different problem from standard parsimony analysis. If you need sequence
alignment, get a program that does it properly.
Third,
the program does not incorporate methods which are potentially useful but are
somewhat experimental. The
"support weighting" method of Farris (implemented in Farris' program zac)
is not included in TNT. That method
(which is based on mapping character changes on a tree in order to evaluate it)
requires operations quite different from the ones used in TNT; we do not anticipate
including that method in the program.
Fourth, the program uses exact
algorithms for Sankoff characters.
These work fine for low-medium numbers of states, but they are not very
efficient for very large numbers of states.
Thus, for step matrix characters with many states, TNT is not very
efficient. Note that this applies only
to Sankoff ("step-matrix") characters; additive and nonadditive
characters with many states are
treated efficiently. Fifth, the exact solution algorithms in
TNT ("implicit enumeration") are not especially efficient; their speed
is about the same as that of PAUP* or the 15-year-old, 16-bit Hennig86. Since the program is conceived mostly as a
tool for analysis of large data sets, there seemed to be no point in spending
energy on improving the implicit enumeration algorithms. Sixth, TNT does not implement any kind
of multiple-cut-swapping (commands mswap, swap, tswap,
and altswap
in Nona/Pee-Wee). Those
swappers were less effective than multiple addition sequences. Likewise, the swap command of
Nona/Pee-Wee (i.e. reporting fit differences when moving a specified clade around
the tree) is not directly implemented in TNT (it would probably be easy to
implement that using the sprit or tbrit options of the
scripting language). These are probably
the only options present in Nona/Pee-Wee but absent in TNT. The Windows version of TNT has rudimentary printing
capabilities. You can print the display
buffer as a text file, using File/Output/PrintDisplayBuffer (optionally,
you can save the text in the display buffer to a file, and print it later with
your favorite word processor; choose File/Output/OpenOutputFile, and all
newly produced output will be written to the file; optionally, with File/Output/SaveDisplayBuffer,
you can save previously produced output).
To print tree diagrams, you have to set the pre-view for trees as ON
(under Format/PreviewTrees). If
you want to map (=optimize) characters in color, choose Format/MapCharactersInColor
(set the colors you like with Format/ColorsForMapping). Then, draw the trees and (when in pre-view
screen) press "P". When in
pre-view screen, pressing "M" saves the tree diagram to a metafile
(which can be later processed/printed with Corel Draw, or exported to Microsoft
Word or PowerPoint). TNT reads matrices in Hennig86 format, with some refinements. Normally, the data will be included in a file. The basic format for data input is:

xread
'optional title, starting and ending with quotes (ASCII 39)'
nchar ntax
Taxon0  0000000000
Taxon2  1011110000
Taxon3  1111111000
.....
TaxonN  1111111000
;

The semicolon at the end is normally not required, but is useful to detect errors (e.g. is the number of taxa set too high?). Polymorphisms can be read by enclosing the state(s) in square
brackets. By default, up to 16 states
are allowed by xread, using symbols 0-9 to indicate states 0-9, and A-F to
indicate states 10-15. This can be
changed with the nstates command, which defines the number of states (and data
type) to be subsequently read; in Windows versions, this can be changed with Format/DataType.
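For example (a sketch, with made-up taxa), DNA data could be read by setting the type first:

```text
nstates dna ;
xread 'DNA sketch' 8 2
TaxA ACGTRYA-
TaxB ACGTTACA
;
```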
If the data are defined as DNA (with Format/DataType
under menus, or using the command nstates dna;) the IUPAC
nucleotide codes (including polymorphisms) are recognized and used
throughout. The states are internally
stored as A=0, G=1, C=2, T=3, and gap=4.
If the data are instead defined as proteins (with the menus, or using nstates
prot;),
the IUPAC codes are recognized and used throughout. For 32 states, non-protein data, the symbols 0-9 and A-V indicate
states 0-31. If the data are defined as having smaller numbers of states, they occupy less
memory. This may be necessary for large
data sets. The data can also be read as interleaved. For this, all that is required is that each block of data is
preceded by either the & symbol or the ASCII character number 4 (the latter
is what PAUP* uses). The taxa in each block must have the same exact names
(including capitals). The taxa in each
block can be arranged in a different order; the taxon numbers correspond to the
first occurrence of a name. A taxon
which is not included in a given block will have only missing entries for the
characters in that block; the same number of characters must be specified for
all the taxa within a given block. Each
row of characters must end with a carriage return. The end of the data is indicated with a semicolon; if some
characters were never specified, they are left as missing entries (and a
warning is issued). If some taxon
names were never defined, a warning is issued.
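An interleaved matrix might look like this (a sketch; taxa and scores are made up; the first block has 10 characters and the second 5):

```text
xread 'interleaved sketch' 15 3
TaxA 0101001010
TaxB 0111001010
TaxC 0101101110
&
TaxA 01021
TaxB 01121
TaxC 11021
;
```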
The interleaved format allows mixing of different data types. For this, specify between square brackets, immediately following the '&' that indicates a new block, the format of the data (numeric, dna, proteins, continuous). To specify how
to read a state in subsequent commands that refer to states (cost,
change, or usminmax), use the '/' option of the nstates
command. While other programs require that continuous characters be discretized,
TNT can deal with continuous characters as such. This avoids the use of the rather ad hoc methods that have
been proposed to discretize continuous distributions in phylogenetic analysis
(gap-coding, Thiele's method, etc.; see Farris, 1990 for discussion of some of
these methods). If there is significant variability in one of the terminals, it
will probably be best represented by a range of one (or two) standard
deviations around its mean. For normal
distributions, this automatically takes care of non-significant differences. Continuous characters in TNT can have states 0-65, with 3 decimals
(thus, when not doing implied weighting, tree scores are reported always with 3
decimals). They are always optimized as additive, i.e. using Farris optimization;
they cannot be turned to nonadditive or Sankoff. The continuous characters must be the first
ones in the matrix (i.e. the first block or blocks, followed by blocks with
other data formats). Each value must be
separated by a blank or whitespace; ranges are indicated by two values
separated by a dash (-). Missing
entries are indicated only with question marks (?). Otherwise, the treatment of data sets with continuous characters is
transparent. The
commands/options that show reconstructions (recons command, or Optimize/Characters/Reconstructions)
or count number of specific transformations (change command, Optimize/CountSpecificChanges)
cannot be applied to continuous characters.
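A sketch of a continuous-character block (all values are made up; as noted above, continuous characters must come first in the matrix, and the block-format tag follows the interleaved convention described earlier):

```text
xread 'continuous sketch' 3 3
&[continuous]
TaxA 1.250 0.500-0.750 ?
TaxB 1.300 0.600 2.100
TaxC 0.900 0.550-0.800 1.950
;
```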
Continuous characters can be edited with the xread=
command, and with the menu option Data/Edit/Taxon, but not with Data/Edit/Character. Continuous characters can be named with the cnames
command, but their states cannot. Standardization of continuous characters (i.e. rescaling so that the
values are on a common scale) is left to the user. Under pre-defined weights, standardization will normally be an
issue. Under implied weighting,
only the default formula can be applied (i.e. user-weighting functions are not
allowed for continuous characters).
Since continuous characters tend to have a larger s-m difference
(s: steps, m: minimum
possible), implied weighting should normally penalize them rather heavily.
Thus, standardization of continuous characters is probably not so important
under implied weighting. The dmerge command merges data files, and saves them as blocks; the
character names and settings are also merged (in Windows version, File/MergeDataFiles,
using Ctrl+Left Mouse Button to select multiple files). The dmerge command can also merge (by using
* as an argument) the characters with identical character names (in the Windows
version, with Data/MergeHomonymousCharacters). Some basic rules of correspondence between states and state names
must be satisfied. In the Windows version, character settings can be determined with Data/CharacterSettings.
Current settings can be displayed with Data/ShowStatus. Alternatively,
settings can be included in the data file, with the ccode command.
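For example (character numbers are arbitrary), a line such as the following, placed in the data file after the matrix, makes characters 0 through 3 additive, gives weight 2 to character 9, and deactivates character 12 (the specifiers are explained next):

```text
ccode + 0.3 /2 9 ] 12 ;
```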
The ccode command is similar to Hennig86's. It uses specifiers to determine actions on subsequently selected characters:

+   make additive
-   make nonadditive
[   activate
]   deactivate
/N  give weight N (if no number immediately follows the slash, set weight to 1)
(   apply step-matrix costs
)   make additive or nonadditive again

The ccode command must always end with a semicolon. It does not recognize character names or character groups (it does recognize the @ option to select from a particular block). By default, TNT considers all characters nonadditive (this is the opposite of NONA and Hennig86),
active, and of weight 1. The costs of character-state transformations can be defined with the cost,
smatrix,
or cstree
commands (see below); note that the costs implied by character-state trees are stored
as step-matrix costs, so these are a type of "step-matrix" character. Step-matrix characters for which the
transformation costs fit a character-state tree (i.e. additive relationships
between states) are (for implied weighting purposes) internally recoded as
binary. For many commands, it is necessary to specify series of numbers
(representing sets of trees, taxa, or characters). Numbering for everything starts from 0 (as in
Hennig86 and Pee-Wee/NONA). The
characters (if they have been named, see below) and taxa can be accessed using
either their numbers or their names.
For any command that takes ranges, two numbers separated by a period
(X.Y) indicate a range (from X to Y, inclusive). A blank space before the period indicates that the range starts
from the first possible value, and a blank after indicates that the range ends
in the last possible value. Note that
names cannot be separated by a period to indicate ranges. Except for the ccode
command (where they have a special meaning), the + (or -) symbols indicate
inclusion in (or exclusion from) the list.
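As a sketch of the range syntax, using the ccode command (character numbers are arbitrary):

```text
ccode - 0.4 ;
ccode + 5. ;
ccode /3 .10 ;
```

The first line makes characters 0 through 4 nonadditive; the second makes characters from 5 to the last additive; the third gives weight 3 to characters from the first up to 10.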
Also, up to 32 groups of trees, characters, and taxa can be
defined. After having defined a group,
enclosing in curly braces ({}) the group number (or name, if named) is
equivalent to specifying each of the members of the group. For taxon specifications, using @X Y refers
to all the terminals included in node Y of tree X; under Windows versions, the
equivalent is accomplished by selecting nodes under tree-viewing mode (Trees/View;
see below). For character
specifications, using @X Y Z refers to the Yth and Zth characters in block X
(note that Y and Z can themselves be ranges; if the data actually contain
blocks, the first one is numbered as 1, and block 0 refers to the entire set of
characters). The blocks can be referred
to by their names (if defined, see below; "all" is always used as
name for the entire matrix). Within execution of a given command, the @X for
characters remains in effect until deactivated explicitly with @0 or
@all. If the data are read as
interleaved, the blocks of data are automatically preserved (although they can
be changed with the blocks command). Character names can be defined with Data/EditData/CharacterNames,
or alternatively, with the cnames command. The syntax for cnames
is as follows:

cnames
[0 First_block ;
[1 Second_block ;
{0 zero state_zero state_one state_two ;
{1 one state_zero state_one state_two ;
{2 two state_zero state_one state_two ;
{+ three state_zero state_one state_two ;
{+ four state_zero state_one state_two ;
;

An additional semicolon is needed to end the cnames statement. The names must all be defined together;
character name definitions can be changed with cnames +N charname statename(s);
(this changes character/state names, only one at a time). The names preceded by
opening square brackets, [, are names for blocks, and the names preceded by
curly braces, {, are character and state names. Not all characters (or blocks) must be named in a cnames
statement; name what you need. The +
symbol means "the character (or block) following the last one named"
(if no character or block had been named before, this is the first character or
block). The characters within a block can be named according to their in-block
numbering sequence, by using the @ option:

cnames
[+ First_block ;
[+ Second_block ;
@First_block
{0 zero_of_first .... ;
{1 one_of_first .... ;
{2 two_of_first .... ;
{3 three_of_first .... ;
{4 four_of_first .... ;
@Second_block
{0 zero_of_second ... ;
{1 one_of_second ... ;
{+ two_of_second ... ;
{+ three_of_second ... ;
;

Changing the reading frame with @blockname resets to 0 the value of the
+ symbol for character names. The
step-matrices can be defined from the menus (with Data/CharacterSettings),
or via the cost or smatrix commands. The first defines transformation costs for
sets of characters in particular; the second defines and stores them, for later
application to sets of characters. The cost command has the following
syntax: cost
N = ...costs... ; where "costs" (see below)
are the step-matrix costs of character(s) N; note that the cost command does not make
characters step-matrix, only defines their costs (the ccode command is
necessary for this, with "(" making the character step-matrix, and
")" non-step-matrix). The
smatrix command has the following synthax: smatrix
=S (xxxx) ...costs... ; where "costs" are saved
in internal step-matrix number S (named xxxx). Once defined, the step-matrix
can be applied to character(s) N, using smatrix
+S N ; or smatrix
+xxxx N ; The syntax for the costs themselves
is similar for both the cost and smatrix commands:

X > Y Z   U / V W

defines as Z the cost of transformation from (any state in) X to (any state in) Y, and as W the cost of transformation between (any states in) U or V.
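For instance (the step-matrix name "mycosts" and all numbers are made up), the following defines a cost of 5 for transforming 0 into 1 and a symmetric cost of 2 between 2 and 3, stores this as step-matrix 0, applies it to characters 3 through 5, and makes those characters step-matrix:

```text
smatrix =0 (mycosts) 0>1 5 2/3 2 ;
smatrix +mycosts 3.5 ;
ccode ( 3.5 ;
```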
X, Y, U or V can be several states, enclosed in square brackets (as in
the xread
command); if the data have been defined as DNA, the IUPAC codes for
polymorphisms apply; using ? indicates all possible states. If Z or W is replaced by the letter i, the
cost is set to 1000 (presumably, "impossible," but this really
depends on the other costs). When the
> symbol is replaced by a slash, /, this indicates symmetry in transformation
costs. The
minimum possible number of steps for step-matrix characters is calculated with approximate
algorithms (in PAUP* the minimum cannot be calculated at all for
step-matrix characters, and therefore implied weighting cannot be
applied). Beware that for very complex
transformation costs (e.g. random numbers between 1 and 30 for each cost), the
algorithms used by TNT can sometimes overestimate the minimum possible length
in a few steps. These algorithms are
more prone to error when there are polymorphic taxa. If in doubt whether the program is accurately calculating the
minimum possible steps for a Sankoff character, use the usminmax command
to determine a user-defined minimum. If
the user-minimum is set to 1, all sankoff characters will be considered
"informative." Then, you can
either optimize all characters, or do a search for each character, to
determine what the real minimum for each character is. The
step calculations during searches are exact for any type of
transformation. The algorithms used
consider as possible assignments for optimizations all the states that occur
between 0 and the largest state (states beyond the largest one occurring in the
matrix are not considered; states between 0 and the largest state are
considered even if they are not present in any terminal taxon). This is more or less equivalent to the
"allstates" options of PAUP*, with the difference that PAUP* tries
(as I understand it) all the states defined in the step-matrix (instead of the
largest state observed in the terminals).
It is possible (although unlikely) that in the case of complex
transformation costs, if some state larger than the ones present in the
terminals is intermediate between others, it is required to obtain the minimum
length. For example, suppose you have some DNA characters where the only
observed states are A, C, and G, all transformations cost 10, except to or
from T, which cost 6; whether or not T is included among the possible states
determines whether the minimum cost is 20 or 18 (perhaps even influencing
whether the character is considered informative or not). Then, you can tell TNT that the maximum
state to consider for all Sankoff characters is some given state (in the
case of DNA data, the largest state is T = 3 and a gap is a fifth state = 4; in
the case of protein data the largest state is U = 20, and in the case of 32
alphanumeric states the maximum state is V = 31). This must be done prior to the definition of
costs. This obviously slows down calculations, so TNT pays attention to this
only in those cases when it does make a difference (i.e. only when length could
change by including the larger state(s)). Note
that the branch-swapping algorithms only look at trees where the currently
designated outgroup is outside everything else. If you have asymmetric costs for some characters, and you want
the program to choose the rooting that minimizes the overall cost, you have to
add an all-missing ancestor. This is different from the way PAUP* works: when
step-matrix asymmetric characters are present, PAUP* also tries rerootings of
the tree (i.e. it does moves which put the outgroup "inside"). If you do not include an all-missing
ancestor in TNT, or do not force monophyly of all but the outgroup taxon in
PAUP*, the two programs may produce different results. For characters with symmetric transformation costs, the root states will
always include the states in the outgroup taxon among the possible states. This may not be the case for step-matrix
asymmetric characters; for example, suppose that going from 0 to 1 is very
expensive and going from 1 to 0 is very cheap; if the outgroup has state 0, it
may be better to assign state 1 to the root, and then postulate several
independent changes from 1 to 0 than a single change from 0 to 1. That may or may not be reasonable, depending
on what the outgroup states represent.
If the outgroup represents a series of successive sister groups with the
same state (example: a wingless ancestor for hexapods, with wing gain being more
expensive than wing loss), assigning to the root a state different from the one
in the outgroup taxon would imply a number of uncounted events. In those cases, it is possible to consider
the states in the outgroup as ancestral; that is, to choose possible state
assignments for the root only from among the states in the outgroup taxon. This
can be done by choosing Data/CharacterSettings, then clicking on the
"ancestral" option, and selecting the characters you want to set.
Alternatively, you can use the ancstates command, and include
this in the data file itself. Note that
the designation of a character as "ancestral" has no effect if the
character is not an asymmetric step-matrix character. TNT implements the implied weighting method of Goloboff (1993), using
floating-point (=exact) fit calculations.
The fit for characters is (by
default) calculated as f = k / ( e + k ) (where e = extra steps,
and k = constant of concavity, changed with piwe=k). During the searches, the program reports the
score as an increasing function of the homoplasy (or "distortion" to
be minimized), which is 1 – f = e / ( e + k ). If you prefer the score reported as fit, it can be calculated
once the search is finished. Since floating-point calculations are used, it is very important that you evaluate group support when doing implied
weighting. Since exact ties are
very unlikely, this criterion may produce overresolved trees if poorly
supported groups are not collapsed. The fit for discrete additive characters (including those with
character-state-trees) is calculated by decomposing the character into binary
variables. Note that this may produce
fits different from those of Pee-Wee or PAUP* (which do not do this
conversion); the fits will always be measured as higher in TNT; it has been
suggested (De Laet, 1997) that the recoding produces a more meaningful
evaluation of the relative weights. This affects only data sets with discrete
additive characters. If the same characters are defined as continuous, the fit
is calculated without decomposing into binary variables (thus producing the same
fit values that Pee-Wee or PAUP* produce for the additive coding); treating the
additive characters as continuous obviously slows calculations. In the case of extra internal steps defined by the user, the same number
of extra steps will be added for each of the variables into which the character
is decomposed. If you want some of the
variables to have different numbers of steps, you have to recode the character
as binary yourself. The method is applied to Sankoff characters as well, in which case the
fit for a character with e extra steps and a minimum cost among states
of c is calculated as k / ( (
e/c ) + k ) (where k is the constant of concavity). When the costs are defined as all unity, the
formula produces results identical to those of the standard formula used by
Goloboff (1993). If the user has
defined some characters as having extra internal steps, these are added to e. It is possible to define a user
weighting function, by specifying the values of weights for different numbers
of extra steps, relative to no extra steps.
For this, the data must be read with implied weighting ON. The only requirement for user defined
weighting functions is that no (relative) weight is negative. Weighting functions where the reliability
increases with the first extra steps and then decreases are (from the
numerical point of view) perfectly valid.
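The default fit formula f = k / (e + k), the reported score 1 - f, and the Sankoff variant can be sketched numerically as follows (an illustration only, not TNT code; k = 3 is used here merely as an example concavity):

```python
def fit(extra_steps, k=3.0):
    """Default implied-weighting fit, f = k / (e + k) (Goloboff, 1993)."""
    return k / (extra_steps + k)

def distortion(extra_steps, k=3.0):
    """Score reported during searches: 1 - f = e / (e + k)."""
    return extra_steps / (extra_steps + k)

def sankoff_fit(extra_steps, min_cost, k=3.0):
    """Fit for a Sankoff character with minimum cost among states c:
    k / ((e / c) + k); reduces to the default formula when c == 1."""
    return k / ((extra_steps / min_cost) + k)

# With no homoplasy the fit is maximal (1.0), and it decreases with e:
print(fit(0))                       # 1.0
print(fit(3))                       # 0.5 (e == k gives fit 0.5)
print(sankoff_fit(3, 1) == fit(3))  # True for unit costs
```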
To define a weighting function, type (or include in a file to be parsed)
"piwe[" followed by as many numbers as extra steps you can
have in your data set. The numbers are
read as the relative costs of transforming from 0 to 1, 1 to 2, etc. If the relative weights defined are less
than the maximum possible extra steps, the undefined transformations are given
a weight equal to that of the last transformation defined. If more transformations than the maximum
possible extra steps are defined, they are simply ignored. The user defined weighting functions are
applicable only to discrete, non-sankoff characters. When user-defined weights
are in effect, the fit cannot be calculated (i.e. only the score as an
increasing function of the homoplasy is reported). Note that user-defined
weighting functions can be easily used for clique analysis (i.e. defining the
function as piwe[1 0 ;). The tree-view mode is toggled with Trees/View. Under
tree-viewing, it is possible to select nodes by clicking at them (with the left
mouse button). The node selection
toggles between red/green/unselected.
All the terminals included in nodes selected with red will be considered
as "selected," unless they belong to a node selected in green which
is a member of the red selected node (combinations of red/green selections can
be used to easily define paraphyletic groups).
Note that a green selection serves no purpose unless the green node
itself is within a red node. When
the tree is locked against manual editing (and regardless of nodes selected with the
left button), positioning the cursor at a node and clicking the right mouse
button produces a list of synapomorphies of the node (using character/state
names, if defined and in effect). When the tree is unlocked, it can be edited by clicking on the right
mouse button. If a single node (red or
green) is selected, and the right button is clicked while the cursor is on an
unselected node, the selected node (=source) is clipped and moved to be sister
of the unselected (=target) node. If
the target node is the ancestor of source node, the branch leading to the
source node is collapsed. If the right
mouse button is clicked on a red selected node (regardless of whether other
nodes are selected), all the terminals belonging to the node are pruned from
the tree (unless they in turn belong to green selected nodes, in which case
they are left in the tree). If the right mouse button is clicked on a green
selected node, and the tree is incomplete, the user can select a terminal to
add to the tree (as sister to the green selected node). When
the tree is edited, the changes are irreversible. If you are not sure whether you want to keep the changes, you
should save the tree to a file before editing it. To avoid mistakes, the trees are by default locked against manual editing;
they can be unlocked with Settings/Lock trees. The program can save, during
branch-swapping, suboptimal trees. The
values of suboptimal are set with Analyze/Suboptimal, or with the suboptimal
command. Two values can be set: the
absolute and the relative fit difference to accept a rearrangement; the
relative fit difference (Goloboff & Farris, 2002) is calculated between the
tree being swapped and the rearrangement.
These two values take effect also when collapsing trees for consensus
calculation with temporary collapsing under SPR or TBR. The relative fit difference between the tree
being swapped and the rearrangement is calculated exactly (see Goloboff &
Farris, 2002, for details on how this can be done quickly), except for TBR with
asymmetric step-matrix characters, where the relative fit difference can be
either over- or underestimated by the algorithms used. Note that the new technology algorithms (sectorial searches, tree-fusing),
since they are not designed primarily to save multiple trees, do not have a
well defined behaviour in terms of the suboptimal trees saved; some trees will
be saved, others won't. Thus, if you want suboptimal trees, do branch-swapping
after having concluded a search with new technology algorithms. Except for what Windows itself uses, memory management in TNT
uses special memory allocators written by Steve Farris. This memory manager ensures that TNT has no
memory leaks. For example, reading a
new data set is guaranteed to first set the whole system back to its original
state, and then reallocate memory. The
same applies to macros, which use a different memory segment. The only thing that can cause memory
fragmentation when using TNT is repeating some of the following operations many
times: (1) re-setting the maximum amount of "general RAM" the program can use,
(2) re-setting the size of the display buffer, (3) re-setting the total amount
of RAM to be used by the macro language, or (4) re-setting the number of nested
input files. These things use normal
memory allocators. Other than this, TNT
can be used for any period of time, or for any number of operations, without
causing any memory fragmentation. TNT implements several rules for tree collapsing (see Coddington and
Scharff, 1994, for discussion). For
rule 1, which may produce collapsed trees longer than the original resolution,
the length-calculating commands (for example, bremer support or the tree filter
when suboptimal trees are to be discarded) calculate the length before
collapsing; note that the final collapsed trees may be longer. The collapsing for consensuses (including
the bremer support) is calculated only temporarily. Note that rule 4 cannot be used during consensus calculation
(because it may multiply the number of trees), but the results for the (strict)
consensus under rule 1 will by necessity be identical to those for rule 4. The implementation of most rules requires no special comments. The exception is rule 4, implemented in TNT
as follows:
1) take the tree to be collapsed;
2) collapse all nodes for which ancestor and descendant state sets are identical;
3) mark all nodes for which ancestor and descendant state sets share state(s);
4) for each combination of possible collapsings for the nodes marked in 3, if the length of the new tree equals that for the original tree, and no node in the new tree remains unsupported under rule 1, add the new tree to memory;
5) if any of the trees added to memory at point 4 is a duplicate of a previously collapsed tree, discard it.
The trees produced will thus not be
further collapsed applying rule 1 to them; when the trees are only condensed,
existing duplicates are not eliminated automatically (they are when the trees
are "filtered"). This
implementation of rule 4 is almost always more time consuming than the other
rules (except SPR or TBR, naturally), in some cases, much more time consuming.
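The combinatorial step 4 is what makes rule 4 expensive: with k nodes marked in step 3, up to 2^k - 1 candidate collapsings must be checked. A toy Python sketch of that enumeration (an illustration only, not TNT's internal code; the length and support checks are omitted):

```python
from itertools import combinations

def candidate_collapsings(marked_nodes):
    """Yield every nonempty combination of the nodes marked in step 3;
    each combination is one candidate collapsing that would then be
    checked for length and rule-1 support (steps 4-5 above)."""
    for r in range(1, len(marked_nodes) + 1):
        for combo in combinations(marked_nodes, r):
            yield frozenset(combo)

# with 3 marked nodes there are 2**3 - 1 = 7 candidate collapsings
candidates = list(candidate_collapsings(["a", "b", "c"]))
```

The exponential growth in the number of candidates is why this implementation of rule 4 is usually slower than the other collapsing rules.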
For some input trees, this implementation of rule 4 may produce a larger number
of output trees. Coddington and Scharff
(1994) did not prove that all rule 4 trees can be found by collapsing all rule
1 trees, but it is easily proven (De Laet, pers. comm.) that, if the trees are
a set of all the trees distinct under rule 1, just deleting the trees that
become longer when collapsing produces a set identical to the rule 4 set. However, if the trees to be collapsed are a
subset of the rule 1 trees, discarding the trees that become longer may produce
spurious results (and that is why rule 4 is implemented as it is). If you are
certain that your trees are all the distinct rule 1 trees, you can get the set
of rule 4 trees by simply (a) collapsing the trees under rule 1, and then (b)
getting rid of the trees that are longer (filtering without collapsing).

The
branch-swappers use only binary trees as input. If some non-binary trees exist when a search is invoked, they are
automatically resolved before the search proceeds. This may cause some trees that were distinct to become identical
when resolved. Additionally, collapsing
under rule 1 (i.e. collapse all ambiguously supported branches) may make the
trees longer than the initial binary trees.
When either of these two things happens, a search may deliver N optimal
trees of minimum length, but if a subsequent search is done, only a reduced
number will be used as input for subsequent searches (this happens very often
in PAUP*, which always collapses the trees after a search). To prevent this, TNT by default retains the
trees as binary after the search (note that the trees are anyway compared to
make sure that they are distinct if they
are collapsed under the criterion in effect). Using these trees for consensus, bremer support calculation,
etc., requires that they be collapsed temporarily. The temporary collapsing is set with collapse+; (or under Settings/ConsenseOptions). When
trees are temporarily collapsed, eliminating taxa requires special
considerations. The characters of the
taxa still influence the optimizations; only the taxon positions are
ignored. For SPR and TBR collapsing,
the results produced are those that would be obtained by (1) saving all the
binary trees that could be found by swapping on the tree to be collapsed (if a
better tree is found, swapping continues nonetheless from the original tree),
(2) pruning all the taxa to be eliminated, and (3) consensing the pruned trees. For optimization-based collapsing,
evaluation of support for a branch considers whether merging two branches in a
single branch when clipping a taxon produces a supported branch (note this is not the same as first collapsing the
trees under rule 1, and then clipping taxa).
Eliminating taxa when resampling, estimating consensuses, or searching
until the consensus becomes stable, is done in the same way. If you want to remove the taxon from the analysis
(as opposed to removing it from the consensus), deactivate the taxon (with the taxcode
command, or the Data/Taxa menu option). Beware
that, although tree-collapsing involves no special theoretical considerations,
it is of practical importance. Misusing
the collapsing possibilities provided by the program may produce absurd
results.

With File/CommandLine, it is possible to enter commands
for TNT. <Tab> switches focus
between main window and command line; <enter> always takes you to the
command line. When focus is on command
line, pressing - or + moves the screen one page up or down (respectively). The keys / and \ move through the buffer
that remembers the last commands executed.
A list of the commands (with their syntax) is obtained entering
"help" at the command line. You can also give TNT (essentially) any commands you want when invoking
it from the DOS prompt (in the Windows version, commands must be preceded by a
semicolon; strings before a semicolon are interpreted as names of files to
open). The cost command
takes the redirect output symbol, >,
as a specifier; in the command line the > symbol can be replaced by a
closing curly brace, }.

Batch Menus

TNT allows instructions to be stored (in memory or in a file), for later
execution, under Settings/Batch. Note that some purely interactive functions (like searching for
text in the text buffer, or selecting tree nodes under tree viewing mode)
cannot be stored as batch instructions.
To store instructions, choose Set menus to batch, and then simply
click on the options you want (you can later edit them). When the instructions are subsequently
executed, they are run in order. Make
sure the warnings are disconnected before running a set of instructions (as
otherwise the program may halt when it finds a warning).

Measures of support

TNT
implements two types of character support measures: Bremer supports and
resampling. The Bremer supports are
calculated from the trees in memory: it is up to the user to find suboptimal
trees. The resampling can be of
different types, some of which are preferable to others. Standard bootstrapping is influenced by uninformative
characters (and by characters irrelevant to monophyly of a given group). Bootstrapping (both standard and Poisson)
and jackknifing (except for a resampling probability of 50%) are affected by
character weights and transformation costs (e.g. additive characters). Symmetric resampling is affected by
none of these. Outputting results with the
frequency may produce alterations in the apparent support for groups with very
low support (and it cannot quantify the support for such groups).
Both GC and frequency slopes
solve this problem (for slopes, supported groups have negative slopes; the
closer to 0, the better supported the group).
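As an illustration of the GC idea just described, here is a minimal Python sketch of the computation (the frequency of the group minus that of its most frequent contradictor); this pictures the measure, it is not TNT code:

```python
def gc_value(group_freq, contradictor_freqs):
    """GC: frequency of recovery of a group minus the frequency of its
    most frequent contradictor.  Ranges from -1 (maximally contradicted)
    through 0 (indifference) to 1 (maximal support)."""
    strongest = max(contradictor_freqs, default=0.0)
    return group_freq - strongest

# a group recovered in 40% of the pseudoreplicates, contradicted at
# most 15% of the time, still gets a positive (if modest) GC value
weak_but_supported = gc_value(0.40, [0.15, 0.05])
```

A group with very low raw frequency can thus still show positive support, as long as every group contradicting it is even less frequent.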
The supports can be measured on the groups present in any given tree (normally,
a most parsimonious tree, but using other trees provides a way to measure how
strongly contradicted some groups are).
The slopes must be measured by reference to a pre-existing tree. If
some taxon has its position in the tree very poorly defined, it may strongly
decrease the support for many groups in the tree. If you wish to find out whether the rest of the tree is well
supported, you can eliminate the taxon from the consensus when resampling. To find out which taxon is floating around
you may have to first do a more superficial estimation, saving the trees, and
trying to prune different taxa (note that in such case you have to take the
precaution of not collapsing the trees with SPR or TBR, as otherwise you may
never find the floating taxon; once you have found it, turn collapsing back to
SPR or TBR (the most effective) and do a more careful estimation of the
resampling frequencies eliminating the evil terminals).

Linux versions

Linux and Mac OS X versions of TNT
have a couple of differences from the standard versions. First, those versions can take a comma ( , ) instead of a
semicolon ( ; ). This is because the
shells there use semicolons to separate commands, and giving arguments to TNT
from the command line becomes difficult. The second difference is that those
versions can run in the background. If
you want to start TNT in the background, use bground as first
argument, then give the normal arguments you would give the program (do not
activate reports and do not change the silent status, but you can set
report+ to report every certain number of trees, replications, or seconds, so
that you can monitor progress by inspecting the output file; make sure you do
have output files, otherwise the program would run and exit without doing
anything!), and end with the usual '&'.
TNT will not write to the console, or expect console input (at least if
no errors occur). It will write its results to a file and exit smoothly (exit
code is 0; if the program runs out of commands while in the background it exits
with code 1). Should some error occur during the search, the program writes a
message to the output file, and exits with code 255. Try

   ./tnt bground p zilla.tnt, echo=, log peep, rep+1, mu10=ho3, le, ne, quit, &

and do "tail -f peep" to
monitor search progress.

If you install the PVM system in
your machine, you can run TNT in parallel.
If you have difficult data sets, this may be advisable even if you have
a single machine with a single processor, as the parallelization interface of TNT
allows for a lot of control in defining search routines (e.g. you can
re-analyze sectors of the tree using any search routine you want). Unlike other phylogeny programs which can
search in parallel (e.g. POY; Wheeler et al., 2003), TNT does not have
pre-established routines for working in parallel. Rather, TNT allows the user to parallelize routines, interacting with the
scripting language; although harder to use, this is much more flexible and can
be tailored to specific needs. The
command that handles most of the options for parallelization is ptnt. With the command ptnt begin TNT can launch
jobs in parallel; each job can
consist of one or several tasks. Each task in the job receives a set of
instructions; the instructions are determined by the user, and can be
(essentially) any set of TNT commands (possibly including instructions to
launch other jobs in parallel). After launching a job, the master TNT process
continues operation normally (i.e. asynchronously). The results from the slaves are brought back to the master with
the command ptnt get; the master does not receive results from the slaves
until the master itself executes a ptnt get command; slaves that
finished running remain dormant (i.e. not using processor time) until the
master requests results from them (or until the master sends the slaves a new
set of instructions). The master may launch several jobs at the same time; up
to 32 jobs can be running at the same time, and the master may handle exchanges
of trees/information between them. When a job is launched, it automatically
receives from its parent the data (and, optionally, starting trees), as well as
its identification number (for t tasks, numbering goes from 0 to t-1). Since a slave task knows its own number,
this can be used to have some tasks perform differently (alternatively, the
same thing can be accomplished by launching other job(s)). A TNT process that
launched some jobs can monitor (with ptnt wait or ptnt gwait) the progress
of the jobs, giving the user access to the numbers of trees, replications, scores,
etc., in the slave tasks; based on this information, the parent process can
then recall jobs back (ptnt get), tell the slaves to skip
some instructions and move to a specific part of the instructions (ptnt
goto and ptnt skipto), or send new sets of instructions to
the slaves (ptnt again). The parent
process can get trees from its slave tasks, at any point (with ptnt
spy). A
very simple example shows how to parallelize resampling, in a job that receives
the name psample ("parallel sampling"):

   ptnt begin psample 10            /* run 10 slave tasks */
   +. -myself =                     /* run in all hosts but master node */
      /* ...these are the slave instructions... */
      collapse tbr ;
      resample rep 10 savetrees ;   /* resample */
      tchoose 1. ;                  /* choose individual trees */
      return ;
   ptnt wait ( tasksleft[psample] == 0 ) ;   /* wait for all tasks to end */
   keep 0 ;
   ptnt get psample ;               /* get results from psample */
   collapse notemp ;                /* trees are already collapsed by the slaves */
   freqdif ;                        /* calculate freqdifs */

If the number of tasks (10 in this
case) is not specified, as many tasks as hosts minus one are used (and no task
is launched on master node); if no host selection is specified, the default
selection is used (i.e. the same one indicated in the example).

Consensus estimation

TNT implements the consensus
estimation method of Goloboff and Farris (2001). This is based on making a certain number of independent searches,
and checking whether the end results of all of them share groups. A group which is present in all (or most) of
the searches, even if the search failed to produce optimal trees, is likely to
be present in the set of optimal trees (and thus the consensus). This can be
used to have a quick idea of the approximate results for your matrix, or to
create constraints to speed up searches.
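The underlying idea can be pictured in a few lines of Python (a conceptual sketch of the agreement criterion only, not the qnelsen implementation): groups recovered by all, or most, of the independent searches are taken as probable members of the consensus.

```python
from collections import Counter

def estimate_consensus_groups(search_results, cutoff=1.0):
    """search_results: one set of recovered groups per independent search.
    Returns the groups present in at least `cutoff` of the searches."""
    counts = Counter(g for groups in search_results for g in groups)
    n = len(search_results)
    return {g for g, c in counts.items() if c / n >= cutoff}

runs = [{"AB", "ABC"}, {"AB", "ABD"}, {"AB", "ABC"}]
stable = estimate_consensus_groups(runs)            # only "AB" is in all runs
frequent = estimate_consensus_groups(runs, 0.66)    # "AB" and "ABC"
```

Lowering the cutoff trades reliability for resolution, which is why groups shared by most (rather than strictly all) searches can also be used as a rough estimate or as constraints.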
The consensus estimation is invoked with the qnelsen command,
or under Analyze/ConsensusEstimation.

Search algorithms

The program implements several types
of search algorithms; in Windows versions, all of them are accessed with the Analyze
menu option. For really small data sets
(i.e. below 20-30 taxa), exact solutions can be obtained via implicit
enumeration (or "branch-and-bound"). For larger data sets, heuristic
("trial-and-error") methods are required. The basic algorithms
implemented in the program are wagner trees (with sequential addition of the
taxa, in the best available position, to form a tree as short as possible), and
different types of swappers and algorithms that make rearrangements on
preexisting trees. Branch-swapping can be of type SPR or TBR. For relatively messy, but not very big data sets, the best algorithm
consists of multiple random addition sequences plus TBR (RAS+TBR); this is invoked
with Analyze/TraditionalSearch, using wagner trees as "starting
trees," or with the mult command. Whether or not you are
likely to have found the best trees, or all the tree islands, depends on the
ease with which the algorithm converges to the minimum length found. For this
reason, the program reports the number of replications that reached the best
score found. The program also reports
whether some replications had partial overflows of the tree file, in which
case, some most parsimonious trees may not have been found during the search;
further branch-swapping starting from the trees in memory ("RAM") can
be used to find the additional trees. For
more difficult data sets, the special and combined algorithms will be required.
Note that instead of wasting search effort, it is often preferable to make one
very aggressive search to make sure what the best score for your data set is,
and then produce additional (independent) hits to that score until the strict
consensus has become stable (as discussed in Goloboff, 1999).

There are four
basic types of special algorithms implemented in TNT: ratchet, drifting,
sectorial searches, and fusing. The
ratchet in TNT is slightly different from the ratchet as originally described
by Nixon (1999), and as implemented in Nona/Pee-Wee. The ratchet consists of two phases, perturbation and search,
which are sequentially repeated. In
either phase, the ratchet in TNT does not look for multiple trees (in
Pee-Wee/Nona it was necessary to set the number of trees to save in each phase,
but the best number is always just one); multiple trees can be found only as
the search phase finishes. The ratchet
always uses TBR branch swapping, and it alternates the search phase (S) with
three types of perturbation phase: original weights (O), upweighting (U), and
deleting (D). Thus, the cycles will be:
O,S,U,S,D,S,O,S,U,S,D,S,O… etc. Any of the O, U, or D phases can be skipped
(from the Ratchet settings dialog, or with the ratchet command). During the perturbation phase, rearrangements
of score equal or better than the tree being swapped are always accepted,
until a certain number of rearrangements have been made (or until a certain
percentage of the total swapping on the tree has been completed). This seems to provide better results than
the original ratchet. The "autoconstrained" option calculates the
(strict) consensus of the previous tree and the tree resulting from the
rearrangement phase, and during the subsequent search phase, only
rearrangements that do not violate monophyly of the shared groups are
done. The rationale for this is that
improvements are likely to be found in areas that have changed during the
perturbation phase. Every certain number
of constrained cycles, an unconstrained search phase is done (to ensure global
optimality). If this is used in
combination with user-defined constraints, both are combined.

Tree-drifting is quite similar to the ratchet, but the perturbation
phase, instead of reweighting the characters, accepts the moves with a
probability that depends on both the relative fit difference (Goloboff and
Farris, 2001; see Goloboff, 1999) and the absolute fit difference between the
tree being swapped and the new tree. It
has the advantage that memory management is easier, and for that reason is the
only perturbator that sectorial-searches (see below) can use. The perturbation
phase also alternates between acceptance of optimal trees only (O), and
suboptimal trees (U): O,S,U,S,O,S,U… etc. Any of the O or U phases can be
skipped (the O phase, with drift:noequal; or with the skip
optimal-only drifting option in the tree-drift settings dialog box; the U
becomes effectively an O by setting the score difference to a very low number).

Sectorial searches are of two basic types: constraint-based and random;
a combination of both gives the mixed sectorial searches. Sectorial searches take a portion of the
tree, create a reduced data set, and produce a mini-analysis of that data
set. If a better solution for that
portion of the tree is found, it is replaced back into the original tree. Every
certain number of (user-determined) rearrangements, a round of global TBR is
done (to ensure global optimality). If
the sectors are below a certain size, the mini-analysis consists of three
RAS+TBR (random addition sequence wagner trees plus TBR); if the three
sequences end up in trees of the same score, the mini-analysis is interrupted
(i.e. the reduced data set is "clean" and not worth further analysis);
otherwise, three additional RAS+TBR are done.
If the sectors are above that size, a certain number of iterations of
tree-drifting is done on the reduced sector.
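The size-dependent decision just described can be sketched in Python (a conceptual illustration only; the helper functions `ras_tbr` and `tree_drift` are hypothetical stand-ins for the real searches, and the size cutoff and drift-iteration values are assumptions, not TNT's defaults):

```python
def analyze_sector(sector_size, ras_tbr, tree_drift,
                   small_cutoff=75, drift_iters=10):
    """Mini-analysis of one reduced data set (sector).
    Small sectors: three RAS+TBR runs, stopping early if all three
    agree (the sector is 'clean'), otherwise three more runs.
    Larger sectors: a number of tree-drifting iterations instead."""
    if sector_size <= small_cutoff:
        scores = [ras_tbr() for _ in range(3)]
        if len(set(scores)) == 1:
            return scores[0]            # clean: not worth further analysis
        scores += [ras_tbr() for _ in range(3)]
        return min(scores)
    return tree_drift(drift_iters)
```

The early-exit on three identical scores is the point of the design: effort is spent only on sectors whose reduced data set still shows signs of conflict.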
The three types of sectorial-search differ in how the sectors are
selected. Constraint-based searches
choose the sectors by reference to constraints: the nodes that are resolved as
polytomies in the constraint tree, connecting to no less than N branches (where
N is some number defined by the user; default is 10). A reduced data set
corresponding to each of the polytomies in the constraint tree is analyzed, in
turn; this is repeated a certain number of times (default=3), because changing
some sector may imply that other sectors should change as well. Random sectors are chosen at random, to be
no smaller and no bigger than a certain (user defined) size. The mixed
sectorial searches use both constraint-based and random sectors; they can only
be used with multiple addition sequences.
For the mixed sectorial search, a temporary constraint tree is created
by consensing the tree found at some stage of the present addition sequence
(user defined: this can be the wagner tree, which is the default, or optionally
the tree produced by SPR or TBR), and this constraint tree is used to do a
constraint-based sectorial search. Once
this is finished, a random sectorial search is done (unconstrained, or using
only the user-defined constraints if they are in effect).

Tree-fusing mixes trees, producing better trees. If the trees come from different
searches, the score is often considerably improved in very short time. Tree-fusing produces (so to speak) a
synergistic effect for all those previous searches; this also means that the
rate of score improvement is not necessarily linear (i.e. all the searches
before fusing were not producing a score improvement beyond some bound, but
were nonetheless accumulating "momentum" for the subsequent fusing).

Setting search parameters

For many data sets, the time to
find best scores during searches may depend dramatically on the aggressiveness
and parameters used during the search.
For example, using very superficial searches for difficult data sets
will keep grinding for a very long time before the best score is actually
found; using more aggressive parameters will find best score sooner. The parameters to change are size and number
of sectors to run under random sectorial searches, number of initial
replications to submit to fusing, and rounds of tree-drifting or ratchet. A
careful consideration of how the data set behaves to changes in parameters is
the best way to proceed, but if you have no idea of how a data set behaves, or
don't know how to set the parameters for a search, you may entrust the choice
of parameters to the program. This is done
with the set initial level option of the New Technology dialog
box (or the level option of the xmult command). If entered as a command, the level must be a
number between 0 and 10 (0 is very superficial; 10 is extremely exhaustive); if
entered in the dialog box for search level, it must be a number between 0 and
100 (again, 100 is the heaviest search).
If using a driven search (i.e. more than a single hit of the xmult
command), you may have the program check whether best score is being
found easily or not (and then decrease or increase the search level accordingly). This is set with the chklevel option
of the xmult command.
Again, letting the program try to determine the best parameters during
the search is a poor substitute for finding the best parameters and leaving
them fixed during the search, and is intended for cases where you have no
time/experience to determine the best parameters by yourself.

Implementation notes

This section describes some details of how the searches are implemented. It is intended to give the user clues on how
to best use the program, or at least understand some of the ways in which the
program behaves under different circumstances.

When all the characters have equal weights, the swapper proceeds the fastest. If some characters are weighted differently,
the search algorithms proceed somewhat slower (i.e. it takes longer to complete
a wagner tree or swap completely through one tree; note that finding a shortest
tree might still be easier if the weighting increases the structure of the data
set). If the characters are weighted
using very different weights (e.g. every character is assigned a weight from 1
to 50, at random), the search algorithms reach a floor (depending on the data
sets, about 50% slower than equal weights).
Searching under implied weights, since it uses floating-point
calculations, is the slowest, being 2 or 3 times slower than equal weights
(note that in PAUP*, using implied weighting slows down the searches by orders
of magnitude). The largest speed
differences with PAUP* are in running large data sets (between 20 and 50 times,
sometimes more than 100 times), and implied weighting (over 500 times, for
large data sets).

The numbers of states also affect running times. The more states the characters can have, the
slower the program runs. The program internally packs characters in groups of
characters with 2 states, 3 to 4 states, 5 to 8 states, 9 to 16 states, and 17
to 32 states. Sankoff characters are always packed separately. This implies that the time differences will
come when the numbers of characters in each of those categories changes
significantly. For sankoff-style
characters, the time increases rapidly with the square of the number of states
(in some portions of the code, the time increases actually with the cube). Thus, the algorithms used react badly to
increasing numbers of states; for Sankoff characters with 16 states or more,
TNT is about as slow as PAUP* (or even slower).

The implementation of searches under implicit enumeration is similar to
that in other programs, and there is nothing special about it. As usual, exact searches take the longest
for the most poorly structured data, and the times increase dramatically with
the numbers of taxa. Since TNT is
conceived to work mostly with large data sets, there seemed to be not much point
in spending energy optimizing the implicit enumeration algorithms.

The tree swappers can complete swapping on a tree faster if no
additional equally parsimonious trees are being found. This is because in such a case length
calculations for a rearrangement can be abandoned as soon as the program
discovers that the tree is as long as the best one found so far; otherwise,
additional characters have to be checked to make sure the tree is no longer
than the best one found so far (and, to make things worse, the tree has to be
optimized for collapsing and compared to all the pre-existing trees, to make
sure it is not identical as collapsed).
This is used to speed up searches under multiple random addition
sequences, where a reduced number of trees is retained for each replication
(such that most of the time the search is being done with a full memory
buffer).

Except in the case of constraints, the addition sequence for wagner trees
is totally randomized, except for the outgroup taxon (which is always the first
taxon added). The insertion sequence of
new taxa to the growing tree is (by default) not totally random: the possible
locations for the new taxon are tried from the root of the tree to the tips, or
from the tips to the root (which option is chosen is determined at random, for
each taxon to add; both are equiprobable).
In some extreme cases, this may be undesirable. If the data are entirely uninformative (e.g.
a matrix with no characters) and no tree collapsing is in effect, then
pectinate trees result from such an insertion sequence. It is possible to
randomize the insertion sequence for new taxa, so that all locations are tried
in a random order. This is set with rseed[;
the alternative is rseed]; (default, up/down sequence). For large data sets, building a wagner tree
with a randomized insertion sequence may actually take less time than with the
up/down sequence (that is probably because the best placement is often found
somewhat in the middle of the tree, not in the tips or root; checking the tips
or root first provides worse bounds to give up length calculations, for a
greater portion of the time). For data
with no informative characters (and with collapsing turned off), wagner trees
with a random insertion sequence are (more or less) equivalent to random
trees. If your data are more or less
structured, setting one or the other is irrelevant. Note that randomization of the insertion sequence has no effect when
wagner trees are built under constraints (i.e. the insertion sequence is always
up/down in this case).

The two types of swapper are designed to work as well as possible under
different circumstances. The time to complete swapping on a tree (that is, on a tree which does
not lead to better trees by swapping) changes with about the square of the
number of taxa for both SPR and TBR (this is true even when TBR actually
“looks” at a number of rearrangements which increases with the cube of the
taxa). SPR is designed to work best at the initial stages of a search, when
better trees are being found frequently; it is not as fast as it could be for
trees which are already optimal or near-optimal. TBR is designed instead to work best for large numbers of taxa, when the
trees being swapped are optimal or near-optimal, that is, when better trees are
not being found frequently, and also to work best on well-structured data (i.e.
data where there are numerous relatively well supported groups). If the data are very poorly structured,
swapping through a tree takes longer.
For Källersjo et al.’s 2500-taxon matrix (1999), the time to complete
TBR swapping on a near-optimal tree (running on an 800 MHz PIII machine) is
about 12 secs.; PAUP* on the same machine takes over 18 minutes to complete TBR
swapping on the same tree (i.e. a 90 times speed difference; PAUP* takes about 6 minutes to complete SPR
on the same tree, so that completing TBR for 2500 taxa in TNT is 30 times
faster than completing SPR with PAUP*). Neither type of swapper is highly optimized for low numbers of
taxa. For small data sets, expect no
big differences (i.e. no more than 5-10 times) in running times between TNT and
other programs. These take no time
anyway, so it seemed best to preserve an implementation focused on large data
sets.

Heuristic searches.- There are three types of constraints: positive
tree-constraints, positive group constraints, and negative group
constraints. Tree constraints are
defined with the = option of the force command
(or, from the menus, as “positive constraints” with no floating taxa). The group constraints are defined with all the other
options of the force command (or,
from the menus, with positive constraints with floating taxa, or negative
constraints). The three types can be
used simultaneously (e.g. “search for the best trees where A+B+C form a
monophyletic group, but A+B do not”). The
tree constraints are implemented in such a way that the moves violating the
groups established are never attempted.
This type of constraint therefore speeds up the searches (if many groups
are defined, by a significant factor).
Note that the number of trees examined to SPR-swap or TBR-swap through a
single tree is the same under constraints (i.e. the moves are counted as if
effected and rejected). The
group constraints (positive and negative) are implemented internally by
creating a matrix of group membership variables (where taxa belonging to a
group have state 1, taxa outside the group have state 0, and floating taxa have
a missing entry). For group
constraints, all the same moves are evaluated for length. If the tree is within the current length
bound, then the group constraints are evaluated, and if the new tree does not
pass, it is rejected (even if shorter).
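The internal recoding described above is easy to picture. A Python sketch (an illustration, not TNT's actual data structures) building one such group-membership "character" per constrained group:

```python
def membership_column(taxa, members, floating=()):
    """One group-constraint variable: state 1 for taxa inside the
    group, 0 for taxa outside it, '?' (missing) for floating taxa."""
    return {
        t: "?" if t in floating else ("1" if t in members else "0")
        for t in taxa
    }

taxa = ["A", "B", "C", "D", "E"]
col = membership_column(taxa, members={"A", "B"}, floating={"E"})
# {'A': '1', 'B': '1', 'C': '0', 'D': '0', 'E': '?'}
```

A rearranged tree satisfies a positive (or negative) constraint when the corresponding membership character defines a group that is (or is not) monophyletic on it, which is why these constraints are checked only after the length test.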
Thus, using group constraints does not speed up searches at all. If the group constraints make the tree much
longer (i.e. if they forbid many groups present in the consensus of MPT’s),
they will actually slow down the search (i.e. many moves will produce better or
equally good trees, which have to be subsequently rejected because they violate
constraints). For wagner trees, the
program adds each taxon in a position such that no constraint is violated at
any point. For negative constraints,
this means that the first two taxa which are part of a group constrained for
non-monophyly must be placed as non-sister groups. Of course, a better tree might be found if the non-monophyly of
the group is delayed (to be broken later, for example, by adding another taxon
in between them), but once non-monophyly has been delayed, it may become
impossible to create the non-monophyly without at the same time violating some
positive constraints. This means that,
for negative constraints that increase tree length considerably, the wagner
trees may be quite far from the best length achievable under the constraints
(e.g. it is advisable not to use wagner trees to estimate consensuses or
jackknife frequencies under those circumstances). Additionally, the random addition sequence may change somewhat
when positive (tree or group) constraints are used in combination with negative
constraints. The first two members of a
group to be non-monophyletic must effectively be non-monophyletic when they are
both added to the tree, but this may be impossible if those two are in turn the
only two members of a group constrained for monophyly added so far. The addition sequence is then checked when
positive and negative groups are enforced, changing it to prevent this. The addition sequence is therefore not
completely random when there are positive and negative constraints in effect.

Exact searches.- Both positive tree constraints and positive group
constraints speed up searches under implicit enumeration (the more so if they
constrain groups which are anyway present in the optimal trees). Negative tree constraints slow them down, to
the extent that they make the best possible trees longer. TNT has a scripting language, which allows the user to automate complex
tasks, or "program" TNT to do things it normally doesn't do. The scripting language has flow control,
loops, and internal variables that can be accessed by the user. The scripting language is normally disabled;
it is enabled by entering "macro=;" at the command
prompt. The macro command also
controls the basic settings for the macro language, like total RAM for macros,
number of variables, number of nested loops, and protection level for accessing
variables/arrays (for details, see online help for macro). The variables and flow control use
numbers (actual values or internal variables) or expressions (compound
expressions must always be enclosed in parentheses to be properly parsed). Examples of the scripting language can be
found in the files contained in zipdruns.exe (a self-extracting
executable). Exactsol calculates
trees by implicit enumeration (using a complete down-pass optimization for each
tree: it is very slow; the addtax routine is used as a function from
within exactsol). Macswap
does SPR swapping on a tree, and prints out a message every time it finds a
better tree (it also uses down-pass optimization to evaluate trees). Sank does Sankoff optimization; it is
more elaborate than the other two (it can be used only on small trees; make
sure you have set up the number of variables to a large enough number). User variables.- User variables can be accessed in any context, by enclosing
the number of the variable to be accessed in single quotes. Thus, if variable number 0 has a value of
1200, inputting the string hold '0' is equivalent to inputting hold 1200. Variables can be given names, in which case their values are
accessed by enclosing their name in single quotes. Using double quotes instead of the single quotes is equivalent to
using single quotes, except in the case of parallel versions of TNT, where
double quotes passed as instructions to slaves are interpreted on the slave, and
single quotes are always interpreted locally. Using the quotes or double quotes
properly, the user variables can act as pointers. User variables can be defined (when declaring their name) as
arrays (maximum is 5-dimensional arrays).
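As a quick illustration of the quoting rules described above (a hypothetical sketch; the variable name numtrees is arbitrary):

```tnt
macro= ;                /* enable the scripting language */
set 0 1200 ;            /* variable 0 now holds 1200 */
hold '0' ;              /* same as typing: hold 1200 */
var = 1 numtrees ;      /* name variable 1 "numtrees" */
set numtrees 500 ;
hold 'numtrees' ;       /* same as typing: hold 500 */
```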
The indices are then enclosed in brackets, separated by a comma. Thus, 'n[j]' is the
j-th value for variable n (note that the arrays are zero-based), and 'p[x,y,z]'
is the z-th value for row x, column y, of variable p. Note that 'n'[j] will be
parsed as the number corresponding to variable n followed by the expression [j]
(which, according to context, may or may not be a legal expression). The indexing works by substituting values:
each time a new "dimension" is found, the cell accessed is
that of the base cell plus the value of the dimension; the user could set the
values up so that a "matrix" is properly organized, but declaring a
variable as multi-dimensional when it is given a name automatically sets up the
arrays. It is also possible to enclose in the
single or double quotes an expression (which must be enclosed in parentheses);
what is inside the parentheses may, in turn, contain references to user
variables. Thus, set 5 '(%1)' ; will set variable 5
as equal to the variable numbered as the first argument. Similarly,
if variable 0 equals 10, variable 1 equals 20, and variable 30 equals
100, then the expression set
5 '('0'+'1')' ; will assign the
value of 100 to variable 5 (i.e. assigns to variable 5 the value of the
variable whose number equals the sum of the values of variables 0 and 1). User variables can also be used to
store strings, accessed by preceding the number (or name) of the variable by a
dollar sign, $. Strings can be stored
as literal (in which case expressions like '0' within the string are replaced
by the value that variable 0 has at the time the string is expanded; variables
and arguments can be expanded only as numbers; no string expansion is
allowed within strings). The values of the user variables can
be changed only with the set command. Set followed by a variable number
(or name), followed by an expression, sets the value for the variable to the
value for the expression. The number of
the variable to be set can itself be an expression (but if so, remember, it
must always be enclosed in parentheses). The expression can be an expression as
in the C language; logical expressions are evaluated to either 0 (FALSE) or 1
(TRUE). The expression can be replaced
by either ++ , -- , += , -=, *=, /= (with meaning as in C; note that only
post-decrements and post-increments are allowed). Thus if ( length [ 0 ] > 'maxlen'
) set 1 '1' + 1 ; else set 0 '0' + 1 ; end will increase by one
the variable number 1 if the length of tree 0 is greater than the value stored
in the variable named by the user as maxlen, and otherwise increase
variable 0 by one. The same is
achieved with the more compact expression set ( length [ 0 ] > 'maxlen'
) ++ ; Expressions.- Expressions can be any combination of operations. Valid operators are + - * and /, as well as
bit-operators ( & | ^ ). Valid
logical comparisons are ==, > , < , >= , <= , ! , and !=. Since expressions are always evaluated from
left to right (there is no operator precedence), be careful to delimit subexpressions as desired: if (
'maxlen' <= 1000 + 500 ) is actually
equivalent to if ( (
'maxlen' <= 1000 ) + 500 ) so that it will
first evaluate whether maxlen is less than or equal to 1000, and then add 500 to the result (if
maxlen is less than or equal to 1000, the result is 501; otherwise it is 500; the expression
thus always evaluates as "true" ...
probably not what you wanted). Either
remember the left-to-right evaluation and invert the order of expressions: if ( 1000 + 500 >= 'maxlen'
) or use the original
order and parentheses : if ( 'maxlen' <= ( 1000 +
500 ) ) Note that
parentheses must always be balanced. Changing values of variables.- The values of variables are changed with the set
command: set N (expression) ; The
"expression" may in this context not be enclosed in parentheses (this
is the only exception to the rule given above). The expression in this context always must end with a semicolon. The example above will set variable number N
to the value of the expression. You can
set the i-th variable (where i itself is any expression) from N, as if N was an
array, by using: set N[i] ( expression) ; If some private
variables have been defined in previous files (see "Private variables"
below), the number N is set automatically to the number within that file (use
names to access variables from previous files). To set variables as strings, use: set
N $text ; text can be any
number of characters, with the proviso that the string starts at variable N,
and continues to subsequent variables; every 4 letters of the string occupy one
user variable (which is a long int), so that the actual number of user
variables needed to store a string is the length of the string divided by 4. The semicolon indicates the end of the
string (which can contain blanks or spaces; a period followed by a comma is
replaced by a semicolon, as in the quote command). If "text" contains expressions to be replaced (i.e.
single quotes or symbols for arguments), it is replaced on reading. It is possible to define the string as
"literal" by including an equal (=) sign before the number of
variable: set =N $text ; A literal string has
expressions expanded when it is itself expanded (so that it can be expanded to different
strings). No string expansion is
allowed within strings (although it may work in some cases, its results are not
predictable). A string can be copied from one
variable onto another easily, with the set command: set 10 $$0 ; copies string 0 onto
string 10 (make sure string 0 doesn't extend to string 10 before doing this). Private variables.- When input files are nested, all the variables named in
preceding files are "private" –that is, an internal numbering
starting from the last unnamed variable is used. Accessing variables by their number is then relative to a
file. If a name is given for a
variable, the variable names are checked from latest to earliest, so that if
variable names are duplicated in nested files, the most internal files will
refer to their own variable (for example, simple counters like "i" or
"a"); names defined in a preceding file will be recognized only if
not defined later. The number of
private variables is determined by the largest unnamed variable, so if you want
to refer to some variables by their number (for simple counters), it is wise to
reserve the lowest numbers for those, and start naming from variable 10 or more
(this keeps "memory management" easy and clean). You can
also keep more variables than the named ones as "private," simply by
issuing a private n; command (where n is the number of private variables)
at some point before calling the next input file. At the start of each file, you can also include a safe
vars n; statement (which checks whether n user variables are actually available to the user from that file,
and issues an error message otherwise). In this way, each input file can have its own internal
numbering of the variables; this makes it possible to define "functions" in
files that are safely called from other files. The only way in which a
file can access a variable named in a previously opened file is by referring to
it by its name (and only if the same name did not exist in the file itself: in
that case, the variable corresponding to the file itself is accessed). When an
input file is closed, all the variables that had been named in that file are
denamed; if you want to internally rename variables within a file, you have to
issue first a var-; command, which denames all non-private variables (var-N;
denames all variables above N, N included).
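For instance, a calling file and a nested "function" file might be organized as follows (a hypothetical sketch; the file names, variable names, and use of the run command to call the nested file are arbitrary assumptions):

```tnt
/* caller.run */
private 10 ;            /* keep variables 0-9 private to this file */
run library.run ;       /* the nested file gets its own numbering */
proc/ ;

/* library.run */
safe vars 20 ;          /* check that 20 variables are available */
var = 0 i 1 result ;    /* numbers 0 and 1 are relative to this file */
set result 0 ;
proc/ ;
```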
Changing values for arrays.- The setarray command sets arrays. This can be used to read in matrices (of any dimension):
setarray 5,4 matrix
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16 17 18 19 ;
this will set the values for a variable called matrix.
The numbers can be any valid expression; note that 0/1 matrices must
have a blank space between the numbers. Naming user variables.- User variables can be named, and accessed through their
name: var = 0 tree_number 1 result 2 other ; If private variables
have been declared in previous files, the numbers (which can themselves be
expressions) correspond to the internal numbering for current file. If any
number is replaced by a + this is interpreted as "the lowest unnamed
variable" (if there are private variables, the first time + is used within
a file, it refers to the lowest possible number that can be accessed within
that file). If you are going to store values starting from variable number N,
you can define the number of variables to store by using a comma and the number
of cells, after the number: var = N, (ntax+1) Taxon_list +, (ntrees+1) Tree_list +, (nchar+1) Char_list +,(ntax+1),(nchar+1) Matrix ; After defining the
name for variable 10 as "Taxon_list", it will consider that the next
variable to be named is variable number 10+ntax+1, so that the variables do not
overlap. If there are 12 taxa and 15
trees, then the name "Tree_list" is assigned to variable number 22,
and the name "Char_list" is assigned to variable 37. Since it is always possible to refer to
variables only by their names, the user does not need to worry to which
particular variable a name has been assigned.
An alternative (equivalent) format for
declaration of multidimensional arrays is: var = N Taxon_list [ ( ntax + 1 )
] + Tree_list
[ ( ntrees + 1 ) ] + Char_list [ ( nchar + 1 ) ] + Matrix
[ (ntax +1 ) , (nchar +1 ) ] ; When using the square-bracket notation
for declaration, the commas are optional.
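Once declared in either format, array cells can be written and read back with the bracket notation; a minimal sketch (the name Matrix and the cell numbers are arbitrary):

```tnt
var = 0 Matrix [ 10 , 5 ] ;   /* a 10 x 5 array, occupying 50 variables */
set Matrix[2,3] 99 ;          /* write the cell at row 2, column 3 */
set 50 'Matrix[2,3]' ;        /* read it back into another variable */
```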
Declaring the variables as multidimensional automatically
sets up the arrays in a proper way; changing the value for the arrays can
"disassemble" the matrix and make further accesses difficult or
impossible. To prevent this, it is
possible to have the macros check whether all the accesses to arrays are legal
(and then permitted only if the variable has been declared as such). Without protection, the user is free to
access and change any values (e.g. to set up his own arrays). The protection level is set to N with macro
prot N. Floating point printing.- By default, all the
numbers in user variables are stored as floating-point. When printing the values, you can use
different numbers of significant digits.
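For instance, using the macfloat command described just below (a hypothetical sketch):

```tnt
macfloat 3 ;          /* print three decimal digits */
set 0 10 / 3 ;
quote '0' ;           /* the value is printed with three digits */
```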
Note that many operations under scripting involve (internally) expanding
the numbers and then parsing the strings, so that changing the number of
significant digits will change the precision with which many routines are
executed. The command macfloat N; sets the number of
digits to be N (if 0, the printed
numbers look like integers). The option
macfloat
e; uses exponential notation (not readable by commands, useful only to
print out tables). Storing the values as floating point
uses 64 bits per value; if you want to use only 32 bits (integer calculations,
to save memory), issue a macfloat-; command before turning the
macros on. Internal variables.- The user can
access a number of internal variables (i.e. number of taxa, characters,
etc.). You can get a list of those
entering help+; at the command prompt.
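As a sketch of how the two kinds of variables combine (assuming, consistent with the numbering used elsewhere in this documentation, that the internal variables ntax and nchar hold the highest, zero-based, taxon and character numbers):

```tnt
macro= ;
set 0 ntax + 1 ;        /* internal variables usable inside set */
set 1 nchar + 1 ;
quote Data set: '0' taxa, '1' characters ;   /* user variables quoted anywhere */
```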
Internal variables can be accessed only in certain contexts in the macro
language. The general rule is that user
variables can be accessed from any command (by enclosing their number or name
in single quotes), while internal variables can be accessed only by the macro
commands (loop, if, set, var, etc.). Note.- The internal variables are a key component of the
TNT macro language; if accessing other internal variables, besides the ones
already included, could be useful to some users, get in touch with PAG and ask
him about the possibility of including those variables in TNT (given the design
of the macro language, it is really easy to add new internal variables). Flow control.- The flow control is established with the if
and loop
commands. There are two possible
syntaxes for the if command; one is: if ( condition ) action(s) end where
"condition" is any valid expression, and "action(s)" is any
set of valid TNT commands. If the
condition is "satisfied" (i.e. evaluates to non-zero), then action(s)
are executed; otherwise, all actions before the end are skipped (note
that TNT interprets its scripts as it reads them, so that the action(s) may
themselves contain syntax errors, which won't be discovered if the action(s)
are skipped; the errors will only be discovered when the actions are executed). The alternative is if ( condition ) action(s)
A ; else action(s)
B ; end which will execute
action(s) A if the condition is satisfied, action(s) B otherwise. Each if must always be matched by else
or end;
each else
must always be matched by end. Nested if commands are legal,
without restrictions as to the number of nestings. The other command for flow control is
loop. The syntax is loop i+j k action(s) stop where the numbers i, j,
and k can be numbers, variables, or expressions. This will repeatedly execute
action(s), ( k - i + j ) / j times. If the
symbol '#' followed by the number of the loop (see below) is included in action(s),
it is replaced by the value for the iteration being done. Thus, on the first iteration the symbol is
replaced by i, incremented by j on each subsequent iteration, until the value exceeds
k (if i is greater than k, the loop is decreasing, and j is taken to be
negative, even if preceded by +). Nested loops are possible (the maximum
nesting is determined with macro*). To refer to the value for each loop, use the '#' followed by the
corresponding number. Each loop
must always be matched by a stop. Within the loop, the commands endloop, setloop
n, and continue, are recognized (endloop ends the loop, setloop
resets the value of the loop to n,
and continue
skips all the action(s) between the current point and the stop that ends the loop). Essentially any action can be done from
within the loops, except denaming user variables. The loops use RAM from the macros, so if some long loop cannot be
executed under certain conditions, switch the macro option to off and increase
the RAM for macros. Return values.- When a file with instructions is closed, it
may be desirable for it to have some kind of return value (having to store
the exit value in a user variable may be cumbersome and make the file less
flexible). The command return
N; closes the input file and sets the internal variable exstatus
to the value N. Recursive calls.- Input files with
instructions can be called recursively.
To call a file from itself, use the recurse command (followed, if
desired, by arguments, just as with the run command). This (together with the declaration of
private variables) makes it possible to define quite complex routines. The number of times a file can call itself
with recurse
depends on the maximum number of input files (default = 10); the number of
input files allowed is changed with mxproc (the file
"exactsol.run" does this change automatically). An example is below,
in the files "exactsol.run" and "addtax.run." That finds an exact solution for any number
of taxa (it uses a complete down-pass optimization for each partial tree, so it
is rather inefficient). File
"exactsol.run" must contain: silent =all ; set 0 ntax + 2 ; mxproc '0' ; /* make sure you have enough input calls */ var = 0 attax 1 bound
2 wkdone ; /* attax , bound , and wkdone are private */ set bound 10000000; /* use a big
initial bound */ keep 0 ; tread ( 0 ( 1 2 ) ) ; /*
set an initial network */ set attax 3 ; /* attax = next taxon to add */ report- ; addtax ; /* this is the "recursive" function */ progress/ ; report= ; tc 1. ; silent - all ; proc/ ; File
"addtax.run" must contain: set 0 'attax' ; set 1 ntax + 1 ; private 2 ; loop 1 nnodes [ 0 ] if ( #1 == '1' || !isintree [
0 #1 ] ) continue ; end if ( 'attax' == 4 ) set wkdone ++ ; progress
(('wkdone'*100)/15) 100 Working ; /* progress
*/ end edit ] 0
'1' 'attax' #1 ; /* attach new taxon */ set 2 length [ 0 ] ; if ( '2' <= 'bound' ) if ( 'attax' < ntax )
set attax ++ ; recurse ; /*
call yourself !! */ set attax -- ; else if ( '2' <
'bound' ) /*
update bound */ keep 1 ; set bound '2' ;
end if ( '2' == 'bound'
&& ( ( ntrees + 1 ) < maxtrees ) ) copytree 0 ;
/* save
multiple trees */ end end end pruntax 0 / 'attax' ; /* deattach
taxon */ stop proc/; Note that the C
style comments in the two example files are perfectly valid. Goto.- The
instruction goto filename tag [arguments] parses file filename starting from the point labeled with tag. The file can be the same file, or a
different one. This can be used to
have related routines in the same file (a sort of "library"). A file can be defined as the default file to
parse, with goto = filename.
Subsequent calls to goto must specify only the tag (and arguments, if any). Branch-swapping.- In a way similar to the loop command, the
commands sprit and tbrit initiate a loop, with the
difference that in each iteration, the tree given as argument N is changed
(under spr or tbr branch-swapping): sprit N action(s) stop Action(s) can be any
valid TNT command, plus additional commands recognized only from within sprit
or tbrit:
continue: skip remaining actions and do next rearrangement
endswap: end the swapping (if never saved, tree N is unmodified)
resetswap: save current rearrangement and restart swapping
While within a sprit
or tbrit
loop, the internal variable percswap equals the percentage of
rearrangements done to complete swapping on current tree. Together with the progress command, this
can be used to report partial progress of the macro instructions. Sprit or tbrit can be nested with loops, but
no sprit
or tbrit
can be nested within a sprit or tbrit. Defining dialogs
(Windows only).- It is possible
to define templates for dialog boxes, with the command opendlg. Names for files can be written onto
variables by using the getfname command. See online help for
syntax; the file "dialog.run" (included in the self-extracting file
zipdruns.exe) contains an example, which emulates the dialog box for Analyze/TraditionalSearch. A more elaborate script,
"bremer.run", allows calculating bremer supports using various
options. This section provides citations for
the sources of some methods used in TNT.
If you use the method in a published paper, you should (besides citing
TNT itself) cite the proper source for the method. Bremer, K. 1990.
Combinable component consensus.
Cladistics 6:369-372 (as one might guess from the title, this paper
describes the combinable component consensus). Bremer, K. 1994.
Branch support and tree stability.
Cladistics 10:295-304 (a formal description of the "bremer
support"). Coddington, J., and N. Scharff. 1994. Problems with zero-length branches. Cladistics 10:415-423 (a discussion of the
problems associated with tree-collapsing, and proposal of "rule 4"). De Laet, J.
1997. A reconsideration of three-item analysis, the use of implied weights in
cladistics, and a practical application in Gentianaceae. Ph.D. Thesis, Katholieke Universiteit
Leuven, Faculteit der Wetenschappen, Leuven, Belgium, 214 pp. (this contains
discussion of implied weighting in general, and it discusses the idea of
recoding additive characters in binary form for implied weighting). Farris, J., V. Albert, M. Källersjö, D. Lipscomb, and
A. Kluge. 1996. Parsimony jackknifing
outperforms neighbor joining.
Cladistics 12:99-124 (this paper describes the idea of resampling characters
with independent probability of removal, to eliminate influence of uninformative
or irrelevant characters, as well as the idea of superficial searches for
consensus estimations). Farris, J. 1970.
Methods for computing Wagner trees.
Syst. Zool. 19:83-92 (the paper describing the first formal approach to
Wagner trees and optimization for additive and continuous characters). Farris, J. 1990.
Phenetics in camouflage. Cladistics 6:91-100 (a discussion of problems
associated with some methods to discretize continuous characters in
phylogenetic analysis). Felsenstein, J. 1985.
Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791 (the original paper
proposing to use the bootstrap on phylogenies). Fitch, W. 1971.
Toward defining the course of evolution: minimal change for a specific
tree topology. Syst. Zool. 20:406-416
(this paper describes optimization for non-additive characters). Goloboff, P. 1993.
Estimating character weights during tree search. Cladistics 9:83-91 (description of the
method of "implied weights"). Goloboff, P. 1994.
Character optimization and calculation of tree lengths. Cladistics 9:433-436 (description of the
basic algorithm for faster calculation of tree lengths during searches; instead
of making T-2 node comparisons for each tree of T taxa, the algorithm requires
making only one node comparison per tree).
Goloboff, P. 1996.
Methods for faster parsimony analysis. Cladistics 12:199-220 (this paper
describes the "two pass incremental optimization" which is the basis
for calculations in TNT. It also
describes the "qcollapse" algorithm for pre-collapsing of trees
during searches). Goloboff, P. 1998.
Tree searches under Sankoff parsimony.
Cladistics 14:229-237 (this paper describes calculation of tree lengths
for Sankoff characters; the method used in TNT is a combination of the exact
method described in this paper, with the incremental optimization described in
Goloboff, 1996). Goloboff, P. 1999. Analyzing large data sets in
reasonable times: solutions for composite optima. Cladistics 15:415-428
(description of tree-drifting, sectorial searches, tree-fusing, and the idea of
searching until the consensus stabilizes). Goloboff, P. 2002. Optimization of polytomies: state
set and parallel operations. Molecular Phylogenetics and Evolution 22:269-275
(a description of the method used in TNT to optimize polytomies for
non-additive characters). Goloboff, P. 2007.
Calculating SPR-distances between trees. Cladistics (in press). Goloboff, P., C. Mattoni, and S. Quinteros. 2005.
Continuous characters analyzed as such.
Cladistics 22: 589-601. (a description of the method used to optimize
continuous characters on polytomous trees, and a description of the
implementation of continuous characters in TNT; see also Farris, 1970). Goloboff, P., and D. Pol. 2002. Semi-strict
supertrees. Cladistics 18:514-525 (description of the supertree method used in
TNT, and some cautionary remarks about the use and interpretation of
supertrees). Goloboff, P., and J. Farris. 2001. Methods for
quick consensus estimation. Cladistics 17:S26-S34 (description of the methods
to estimate consensuses, the relative bremer support, and tree collapsing using
branch swapping). Goloboff, P., J. Farris, M. Källersjö, B.
Oxelman, M. Ramirez, and C. Szumik. 2003.
Improvements to resampling measures of group support. Cladistics
19:324-332 (description of symmetric resampling, frequency differences, and
frequency slopes). Maddison, W. 1989. Reconstructing character evolution on polytomous cladograms. Cladistics 5: 365-377. (the paper describing
the method used to calculate length of "soft" polytomies). Nixon, K. 1999.
The parsimony ratchet, a new method for rapid parsimony analysis.
Cladistics 15:407-414 (the original description of the ratchet, already a
citation classic). Hendy, M., and D. Penny. 1982. Branch and bound algorithms to determine minimal evolutionary
trees. Mathematical Biosciences 59:277-290 (description of the implicit
enumeration algorithm, used for solutions guaranteed to find all trees of best
score). Printing or saving to
a Metafile, output or trees (Windows only) (back to index)
Data Input (back to index)
Continuous characters (back to index)
Merging Data Files (back to index)
Basic Character Settings
(back to index)
Scopes, sets, and blocks
(back to index)
Character and block names (back to index)
Step-Matrices (back
to index)
Ancestral states (for step-matrix asymmetric
characters) (back to index)
Nexus files (back
to index)
The basic features of the Nexus format are supported in TNT. When parsing a file, if the first string in
the file is "#NEXUS" (and the macro language is turned off), then TNT
will parse the file as a Nexus file.
So, you simply open the nexus file in the same way you would open a
native TNT file. The only blocks that
will be read are those corresponding to data, trees, and assumptions (commands
ctype, usertype, include, exclude, and weights). The data can be read in the "dna," "protein"
or "standard" formats only (e.g. format datatype = standard;). Unrecognized commands or blocks are simply
ignored or skipped. Note that within
the NEXUS file, numbering for taxa or characters starts from 1 (one), not from
0 (zero). You can include TNT commands
in the nexus file, at the end of the file, preceded by "begin tnt;" (and with an end; at
the end of the block; the end; is ignored by TNT, but without it PAUP* complains that end of file
was found when reading a TNT block).
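A minimal Nexus file with an embedded TNT block might look as follows (a hypothetical sketch; the taxon names, matrix, and the particular TNT commands in the tnt block are arbitrary):

```nexus
#NEXUS
begin data;
  dimensions ntax=3 nchar=5;
  format datatype=standard;
  matrix
    TaxonA 00101
    TaxonB 01101
    TaxonC 11100
  ;
end;
begin tnt;
  hold 100 ;
  mult ;
end;
```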
TNT can also export data or trees in Nexus format. For this, choose
the Data/Export menu option (or the export command). The export command allows embedding
commands (for PAUP* or TNT) at the end of the file, using + after the file
name, followed by the commands (a semicolon terminates; to embed a semicolon
in the file, use a period followed by a comma, as in the quote command).
Implied Weighting (back to index)
Tree-view: Selecting Nodes and Editing Tree
(Windows only) (back to index)
Saving suboptimal trees (back to index)
Memory management (back to index)
Tree collapsing (back
to index)
Command-Line (back
to index)
Batch Menus (Windows only) (back to index)
Measures of character support (back to index)
A note on Linux-Mac OS X versions (back to index)
Running TNT in parallel
Consensus
estimation (back to index)
Tree-searching algorithms (back to index)
Determining best search parameters (back to index)
Implementation notes for tree-searching (back to index)
General
Wagner trees
Swappers
Effects of constraints on running
times
Scripting (for programmers or advanced users) (back to index)
Citations (back to index)