T. N. T.

Tree Analysis Using New Technology

Version 1.1 © P. Goloboff, J. S. Farris, and K. Nixon

License Agreement

 

   Introduction-Getting Started      Saving suboptimal trees

   What TNT doesn't do...            Memory Management

   Printing & Metafiles              Tree Collapsing

   Data Input                        Command-Line

   Continuous characters             Batch Menus

   Merging Data Files                Measures of support

   Basic Character Settings          Linux versions

   Scopes, sets, and blocks          Consensus estimation

   Character and block names         Search algorithms

   Step-Matrices                     Setting search parameters

   Ancestral states                  Implementation notes…

   Nexus files                       Scripting (advanced users)

   Implied Weighting                 Citations

 

                     

Overview

 


General:

 

 

Optimality criteria:

 

 

Search options:

 

 

Tree diagnosis:

 

 

Tree comparisons:

 

 

Group supports, randomization:

 

 

Scripting

 

 

 


 

Introduction - Getting Started  (back to index)

 

TNT is a program for phylogenetic analysis under parsimony.  It provides fast tree-searching algorithms, as well as extensive capabilities for tree diagnosis, consensus, and manipulation.  This documentation explains some important things that are not immediately apparent (or cannot be done easily) from the menus themselves. For details on how to use commands, users must refer to the on-line help of TNT (accessed by typing help <enter> at the command prompt, or in the command dialog opened with File/CommandLine).  This file, tnt.htm, provides a general description of the program; if the file is copied to your Windows directory, then when you select the Help menu option, the Explorer is called to read this file.  Throughout this help file, commands that can be typed at the command line (or included in a file) are marked with italics and bold; menu options are underlined.

The basic thing you need to do to run TNT is to prepare a data input file.  You can create this either from the program itself, with the File/New and Data/Edit menu options (Windows only), or in a text editor.  This file should contain the data matrix (format specified in the next section, Data Input); it may also contain specifications of character names and settings, like weights and additivities (these can be specified later from within the program, but it is advisable to have them in the file, so you do not forget to change something you need to change; see under Basic Character Settings, Step-Matrices, Ancestral states, and Character and block names).   Check the files zilla.tnt, example.tnt, and contin.tnt, three example data files provided with the program.  Note that all input files must be text-only files.

Once you have read the data in, you are ready to start doing searches, etc. Like most parsimony programs (e.g. Hennig86, PAUP*, or Pee-Wee/NONA), TNT has an internal (memory) buffer for trees, and many of the commands and options (optimization, consensus calculation, etc.) are available only when there are trees in memory, as these commands are performed on pre-existing trees.  Some search algorithms create trees de novo; others work on pre-existing trees, modifying them to try to find better trees (see under Tree-searching algorithms). The trees in the memory buffer can be saved to disk, so they are not lost when you exit the program; you can subsequently read the trees back into the program.  The text output produced by the program (e.g. results and timings of runs, tree-diagrams, etc.) is written to an internal text buffer, which can later be saved to disk.
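
As a quick illustration of this workflow, the following command sequence (which could be typed at the command prompt, or included in a file to be parsed) reads a data set, runs a search, computes a strict consensus, and saves the trees to disk.  This is only a sketch; check the on-line help of each command for the exact options:

proc zilla.tnt ;
mult ;
nelsen ;
tsave *mytrees.tre ;
save ;
tsave /;

Here proc reads the data file, mult runs the default multiple-addition-sequence search, nelsen computes the strict consensus of the trees in memory, and the tsave/save commands open a tree file, write the trees to it, and close the file.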

If you have a Windows version, make sure you install the font provided with the program, tred.ttf, in your system.  This font is required for TNT to draw trees properly.  If the tred font is not installed, TNT will attempt to use the Terminal font (which also draws trees, if not as nicely as tred).

 

What TNT doesn't do ...or does poorly... or doesn't do yet...  (back to index)

 

There are many things the program does not do.  First, the program is conceived as a tool for standard parsimony analysis.  It does not do maximum likelihood, it does not use specific models of evolution, and it does not do Bayesian analyses.  Although we anticipate incorporating some likelihood algorithms, they will be quite rudimentary.

         Second, the program does not do sequence alignment of any kind.  Sequence comparison is (computationally) quite a different problem from standard parsimony analysis. If you need sequence alignment, get a program that does it properly.

         Third, the program does not incorporate methods which are potentially useful but are somewhat experimental.  The "support weighting" method of Farris (implemented in Farris' program zac) is not included in TNT.  That method (which is based on mapping character changes on a tree in order to evaluate it) requires operations quite different from the ones used in TNT; we do not anticipate including that method in the program. 

         Fourth, the program uses exact algorithms for Sankoff characters.  These work fine for low to medium numbers of states, but they are not very efficient for very large numbers of states.  Thus, for step-matrix characters with many states, TNT is not very efficient.  Note that this applies only to Sankoff ("step-matrix") characters; additive and nonadditive characters with many states are treated efficiently.

         Fifth, the exact solution algorithms in TNT ("implicit enumeration") are not especially efficient: their speed is about the same as that of PAUP* or the 15-year-old, 16-bit Hennig86.  Since the program is conceived mostly as a tool for the analysis of large data sets, there seemed to be no point in investing effort in improving the implicit enumeration algorithms.

         Sixth, TNT does not implement any kind of multiple-cut swapping (commands mswap, swap, tswap, and altswap in Nona/Pee-Wee).  Those swappers were less effective than multiple addition sequences.  Likewise, the swap command of Nona/Pee-Wee (i.e. reporting fit differences when moving a specified clade around the tree) is not directly implemented in TNT (it would probably be easy to implement using the sprit or tbrit options of the scripting language).  These are probably the only options present in Nona/Pee-Wee but absent in TNT.

        

 

Printing or saving to a Metafile, output or trees (Windows only)  (back to index)

 

The Windows version of TNT has rudimentary printing capabilities.  You can print the display buffer as a text file, using File/Output/PrintDisplayBuffer.  Optionally, you can save the text in the display buffer to a file, and print it later with your favorite word processor: choose File/Output/OpenOutputFile, and all newly produced output will be written to the file; with File/Output/SaveDisplayBuffer, you can save previously produced output.

To print tree diagrams, you have to set the pre-view for trees as ON (under Format/PreviewTrees).  If you want to map (=optimize) characters in color, choose Format/MapCharactersInColor (set the colors you like with Format/ColorsForMapping).  Then, draw the trees and (when in pre-view screen) press "P."  When in pre-view screen, pressing "M" saves the tree diagram to a metafile (which can be later processed/printed with Corel Draw, or exported to Microsoft Word or Power Point).

 

 

Data Input   (back to index)

 

TNT reads matrices in Hennig86 format, with some refinements.  Normally, the data will be included in a file. The basic format for data input is:

 

xread

'optional title, starting and ending

 with quotes (ASCII 39)'

nchar ntax

Taxon0   0000000000

Taxon1   0010111000

Taxon2   1011110000

Taxon3   1111111000

.....

TaxonN   1111111000

;

 

The semicolon at the end is normally not required, but it is useful for detecting errors (e.g. if the number of taxa was set too high).

Polymorphisms can be read by enclosing the state(s) in square brackets.  By default, up to 16 states are allowed by xread, using symbols 0-9 to indicate states 0-9, and A-F to indicate states 10-15.  This can be changed with the nstates command, which defines the number of states (and data type) to be subsequently read; in Windows versions, this can be changed with Format/DataType.  If the data are defined as DNA (with Format/DataType under menus, or using the command nstates dna;), the IUPAC nucleotide codes (including polymorphisms) are recognized and used throughout.  The states are internally stored as A=0, G=1, C=2, T=3, and gap=4.  If the data are instead defined as proteins (with the menus, or using nstates prot;), the IUPAC codes are recognized and used throughout.  For 32-state, non-protein data, the symbols 0-9 and A-V indicate states 0-31.
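
The symbol-to-state correspondence for 32-state data can be made explicit with a small sketch (a hypothetical helper, not part of TNT, shown only to illustrate the convention):

```python
# Hypothetical helper illustrating TNT's 32-state symbol convention:
# '0'-'9' denote states 0-9, and 'A'-'V' denote states 10-31.
def symbol_to_state(sym):
    if sym.isdigit():
        return int(sym)                       # '0'-'9' -> 0-9
    state = 10 + ord(sym.upper()) - ord('A')  # 'A'-'V' -> 10-31
    if not 10 <= state <= 31:
        raise ValueError("invalid state symbol: %r" % sym)
    return state

print(symbol_to_state("F"))  # 15 (the largest 16-state symbol)
print(symbol_to_state("V"))  # 31 (the largest 32-state symbol)
```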

If the data are defined as smaller numbers of states, they occupy less memory.  This may be necessary for large data sets.

The data can also be read as interleaved.  For this, all that is required is that each block of data is preceded by either the & symbol or ASCII character number 4 (the latter is what PAUP* uses). The taxa in each block must have exactly the same names (including capitals).  The taxa in each block can be arranged in a different order; the taxon numbers correspond to the first occurrence of a name.  A taxon which is not included in a given block will have only missing entries for the characters in that block; the same number of characters must be specified for all the taxa within a given block.  Each row of characters must end with a carriage return.  The end of the data is indicated with a semicolon; if some characters were never specified, they are left as missing entries (and a warning is issued).   If some taxon names were never defined, a warning is issued.  The interleaved format allows mixing of different data types, by simply enclosing the corresponding string (prot, num, or dna) in square brackets right after the & symbol.

When the data are being read as interleaved, it is possible for different blocks of data to have different formats.  For this, specify between square brackets, immediately following the '&' that indicates a new block, the format of the data (numeric, dna, proteins, continuous).  To specify how to read a state in subsequent commands that refer to states (cost, change, or usminmax), use the '/' option of the nstates command.
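
An interleaved matrix mixing data types might look as follows (hypothetical taxa and characters, shown only to illustrate the format; the &[num] and &[dna] tags set the format of each block):

xread
10 3
&[num]
TaxA 00110
TaxB 01110
TaxC 11111
&[dna]
TaxA ACGTA
TaxB ACGT-
TaxC AC?TA
;

The first block contains five numeric characters and the second five DNA characters, for a total of 10 characters and 3 taxa.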

         

 

Continuous characters (back to index)

 

While other programs require that continuous characters be discretized, TNT can deal with continuous characters as such.  This avoids the use of the rather ad hoc methods that have been proposed to discretize continuous distributions in phylogenetic analysis (gap-coding, Thiele's method, etc.; see Farris, 1990 for discussion of some of these methods). If there is significant variability in one of the terminals, it will probably be best represented by a range of one (or two) standard deviations around its mean.  For normal distributions, this automatically takes care of non-significant differences.

Continuous characters in TNT can have states 0-65, with 3 decimals (thus, when not doing implied weighting, tree scores are always reported with 3 decimals).  They are always optimized as additive, i.e. using Farris optimization.  They cannot be made nonadditive or Sankoff.  The continuous characters must be the first ones in the matrix (i.e. the first block or blocks, followed by blocks with other data formats).  Each value must be separated by a blank or whitespace; ranges are indicated by two values separated by a dash (-).  Missing entries are indicated only as question marks (?).
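
A small matrix with a block of continuous characters might look as follows (hypothetical taxa and values, shown only to illustrate the format; note the range in the second character, and the continuous block coming first):

xread
5 3
&[continuous]
TaxA 0.350 1.20-1.45
TaxB 0.420 1.10-1.30
TaxC 0.500 ?
&[num]
TaxA 011
TaxB 001
TaxC 111
;

Here the total of 5 characters comprises 2 continuous and 3 numeric characters.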

The treatment for the data when there are continuous characters is rather transparent.  The commands/options that show reconstructions (recons command, or Optimize/Characters/Reconstructions) or count number of specific transformations (change command, Optimize/CountSpecificChanges) cannot be applied to continuous characters.  Continuous characters can be edited with the xread= command, and with the menu option Data/Edit/Taxon, but not with  Data/Edit/Character.  Continuous characters can be named with the cnames command, but their states cannot.

Standardization of continuous characters (i.e. rescaling so that the values are on a common scale) is left to the user.  Under pre-defined weights, standardization will normally be an issue.  Under  implied weighting, only the default formula can be applied (i.e. user-weighting functions are not allowed for continuous characters).  Since continuous characters tend to have a larger s-m difference (s: steps, m: minimum possible), implied weighting should normally penalize them rather heavily. Thus, standardization of continuous characters is probably not so important under implied weighting. 

 

Merging Data Files (back to index)

 

The dmerge command merges data files, and saves them as blocks; the character names and settings are also merged (in Windows version, File/MergeDataFiles, using Ctrl+Left Mouse Button to select multiple files).  The dmerge command also merges (by using * as argument) the characters with identical character names (in Windows version, with Data/MergeHomonymousCharacters).  Some basic rules of correspondence between states and state names must be satisfied.

 

 

Basic Character Settings  (back to index)

 

In the Windows version, characters settings can be determined with Data/CharacterSettings. Current settings can be displayed with Data/ShowStatus. Alternatively, settings can be included in the data file, with the ccode command.

The ccode command is similar to Hennig86's.  It uses specifiers to determine actions on subsequently selected characters:

 

+    make additive

-    make nonadditive

[    activate

]    deactivate

/N    give weight N (if no number immediately

    follows the slash, set weight to 1)

(    apply step-matrix costs

)    make additive or nonadditive again

 

The ccode command must always end with a semicolon.  The ccode command does not recognize character names or character groups (it does recognize the @ option to select from a particular block). By default, TNT considers all characters nonadditive (this is the opposite of NONA and Hennig86), active, and of weight 1.
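
For example (hypothetical character numbers), the following statement makes characters 0 through 3 additive, gives character 5 a weight of 2, and deactivates character 7:

ccode + 0.3 /2 5 ] 7 ;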

The costs of character-state transformations can be defined with the cost, smatrix, or cstree commands (see below); note that character-state trees are stored internally as transformation costs, so these are a type of "step-matrix" character.  Step-matrix characters for which the transformation costs fit a character-state tree (i.e. additive relationships between states) are (for implied weighting purposes) internally recoded as binary.

 

 

Scopes, sets, and blocks  (back to index)

 

For many commands, it is necessary to specify series of numbers (representing sets of trees, taxa, or characters).  Numbering for everything starts from 0 (as in Hennig86 and Pee-Wee/NONA).  The characters (if they have been named, see below) and taxa can be accessed using either their numbers or their names.  For any command that takes ranges, two numbers separated by a period (X.Y) indicate a range (from X to Y, inclusive).  A blank space before the period indicates that the range starts from the first possible value, and a blank after indicates that the range ends at the last possible value.  Note that names cannot be separated by a period to indicate ranges. Except for the ccode command (where they have a special meaning), the + (or -) symbols indicate inclusion in (or exclusion from) the list.  Also, up to 32 groups of trees, characters, and taxa can be defined.  After having defined a group, enclosing the group number (or name, if named) in curly braces ({}) is equivalent to specifying each of the members of the group.  For taxon specifications, using @X Y refers to all the terminals included in node Y of tree X; under Windows versions, the equivalent is accomplished by selecting nodes under tree-viewing mode (Trees/View; see below).  For character specifications, using @X Y Z refers to the Yth and Zth characters in block X (note that Y and Z can themselves be ranges; if the data actually contain blocks, the first one is numbered as 1, and block 0 refers to the entire set of characters).  The blocks can be referred to by their names (if defined, see below; "all" is always used as the name for the entire matrix). Within execution of a given command, the @X for characters remains in effect until deactivated explicitly with @0 or @all.  If the data are read as interleaved, the blocks of data are automatically preserved (although they can be changed with the blocks command).
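
A few hypothetical examples of these conventions (using ccode only as a carrier command):

ccode - 0.5 ;
ccode + 3. ;
ccode [ {mygroup} ;
ccode + @1 0.2 ;

The first makes characters 0 through 5 nonadditive; the second makes characters from 3 to the last one additive; the third activates all the characters in a previously defined group named "mygroup"; the last makes additive the first three characters of block 1.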

 

Character and block names   (back to index)

 

Character names can be defined with Data/EditData/CharacterNames, or alternatively, with the cnames command. The syntax for cnames is as follows:

 

cnames

[0 First_block ;

[1 Second_block ;

{0 zero state_zero state_one state_two ;

{1 one state_zero state_one state_two ;

{2 two state_zero state_one state_two ;

{+ three state_zero state_one state_two ;

{+ four state_zero state_one state_two ;

;

 

An additional semicolon is needed to end the cnames statement.  The names must all be defined together; character name definitions can be changed with cnames +N charname statename(s); (this changes character/state names, only one at a time). The names preceded by opening square brackets, [, are names for blocks, and the names preceded by curly braces, {, are character and state names.  Not all characters (or blocks) must be named in a cnames statement; name what you need.  The + symbol means "the character (or block) following the last one named" (if no character or block had been named before, this is the first character or block). The characters within a block can be named according to their in-block numbering sequence, by using the @ option:

 

cnames

[+ First_block ;

[+ Second_block ;

@First_block

{0 zero_of_first .... ;

{1 one_of_first .... ;

{2 two_of_first .... ;

{3 three_of_first .... ;

{4 four_of_first .... ;

@Second_block

{0 zero_of_second ... ;

{1 one_of_second ... ;

{+ two_of_second ... ;

{+ three_of_second ... ;

;

 

Changing the reading frame with @blockname resets to 0 the value of the + symbol for character names.

 

 

Step-Matrices (back to index)

 

         Step-matrices can be defined from the menus (with Data/CharacterSettings), or via the cost or smatrix commands.  The first defines transformation costs for particular sets of characters; the second defines and stores them, for later application to sets of characters. The cost command has the following syntax:

 

         cost N =   ...costs... ;

 

where "costs" (see below) are the step-matrix costs of character(s) N; note that the cost command does not make characters step-matrix, it only defines their costs (the ccode command is necessary for this, with "(" making the character step-matrix, and ")" non-step-matrix).  The smatrix command has the following syntax:

 

         smatrix =S (xxxx) ...costs... ;

 

where "costs" are saved in internal step-matrix number S (named xxxx). Once defined, the step-matrix can be applied to character(s) N, using

 

smatrix +S N ;

or

         smatrix +xxxx N ;

 

The syntax for the costs themselves is similar for both the cost and smatrix command:

 

X>Y Z U/V W ;

 

defines as Z the cost of transformation from (any state in) X to (any state in) Y, and as W the cost of transformation between (any states in) U or V.  X, Y, U, or V can be several states, enclosed in square brackets (as in the xread command); if the data have been defined as DNA, the IUPAC codes for polymorphisms apply; using ? indicates all possible states.  If Z or W is replaced by the letter i, the cost is set to 1000 (presumably "impossible," but this really depends on the other costs).  When the > symbol is replaced by a slash, /, this indicates symmetry in transformation costs.
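
Putting these pieces together, a hypothetical asymmetric step-matrix (gains costing twice as much as losses) could be defined, applied to characters 4 and 7, and activated like this:

smatrix =0 (gainloss) 0>1 2 1>0 1 ;
smatrix +gainloss 4 7 ;
ccode ( 4 7 ;

The first line defines and stores step-matrix 0 (named gainloss); the second applies it to characters 4 and 7; the third makes those characters step-matrix (remember that defining costs alone does not).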

         The minimum possible number of steps for step-matrix characters is calculated with approximate algorithms (in PAUP*, the minimum cannot be calculated at all for step-matrix characters, and therefore implied weighting cannot be applied).  Beware that for very complex transformation costs (e.g. random numbers between 1 and 30 for each cost), the algorithms used by TNT can sometimes overestimate the minimum possible length by a few steps.   These algorithms are more prone to error when there are polymorphic taxa.  If in doubt whether the program is accurately calculating the minimum possible steps for a Sankoff character, use the usminmax command to set a user-defined minimum.  If the user-minimum is set to 1, all Sankoff characters will be considered "informative."  Then, you can either optimize all characters, or do a search for each character, so as to make sure what the real minimum for each character is.

         The step calculations during searches are exact for any type of transformation.  The algorithms used consider as possible assignments for optimization all the states that occur between 0 and the largest state (states beyond the largest one occurring in the matrix are not considered; states between 0 and the largest state are considered even if they are not present in any terminal taxon).  This is more or less equivalent to the "allstates" option of PAUP*, with the difference that PAUP* tries (as I understand it) all the states defined in the step-matrix (instead of up to the largest state observed in the terminals).  It is possible (although unlikely) that in the case of complex transformation costs, if some state larger than the ones present in the terminals is intermediate between others, it is required to obtain the minimum length. For example, suppose you have some DNA characters where the only observed states are A, C, and G, where all transformations cost 10, except to or from T, which costs 6; whether or not T is included among the possible states determines whether the minimum cost is 20 or 18 (perhaps even influencing whether the character is considered informative or not).  For such cases, you can tell TNT that the maximum state to consider for all Sankoff characters is some given state (in the case of DNA data, the largest state is T = 3, and a gap is a fifth state = 4; in the case of protein data the largest state is U = 20, and in the case of 32 alphanumeric states the maximum state is V = 31).  This must be done prior to the definition of costs. This obviously slows down calculations, so TNT pays attention to this only in those cases where it does make a difference (i.e. only when length could change by including the larger state(s)).

         Note that the branch-swapping algorithms only consider trees where the currently designated outgroup is outside everything else.  If you have asymmetric costs for some characters, and you want the program to choose the rooting that minimizes the overall cost, you have to add an all-missing ancestor. This is different from the way PAUP* works: when asymmetric step-matrix characters are present, PAUP* also tries rerootings of the tree (i.e. it makes moves which put the outgroup "inside").  If you do not include an all-missing ancestor in TNT, or do not force monophyly of all but the outgroup taxon in PAUP*, the two programs may produce different results.

 

Ancestral states (for step-matrix asymmetric characters)  (back to index)

 

For characters with symmetric transformation costs, the root states will always include the states in the outgroup taxon among the possible states.  This may not be the case for step-matrix asymmetric characters; for example, suppose that going from 0 to 1 is very expensive and going from 1 to 0 is very cheap; if the outgroup has state 0, it may be better to assign state 1 to the root, and then postulate several independent changes from 1 to 0, than a single change from 0 to 1.  That may or may not be reasonable, depending on what the outgroup states represent.  If the outgroup represents a series of successive sister groups with the same state (example: a wingless ancestor for hexapods, with wing gain being more expensive than wing loss), assigning to the root a state different from the one in the outgroup taxon would imply a number of uncounted events.  In those cases, it is possible to consider the states in the outgroup as ancestral, that is, to choose possible state assignments for the root only from among the states in the outgroup taxon. This can be done by choosing Data/CharacterSettings, then clicking on the "ancestral" option, and selecting the characters you want to set. Alternatively, you can use the ancstates command, and include this in the data file itself.  Note that the designation of a character as "ancestral" has no effect if the character is not an asymmetric step-matrix character.

 

Nexus files (back to index)

 

The basic features of the Nexus format are supported in TNT.  When parsing a file, if the first string in the file is "#NEXUS" (and the macro language is turned off), TNT will parse the file as a Nexus file.  So, you simply open the Nexus file in the same way you would open a native TNT file.  The only blocks that will be read are those corresponding to data, trees, and assumptions (commands ctype, usertype, include, exclude, and weights).  The data can be read in the "dna," "protein," or "standard" formats only (e.g. format datatype = standard;).  Unrecognized commands or blocks are simply ignored or skipped.  Note that within the Nexus file, numbering for taxa or characters starts from 1 (one), not from 0 (zero).  You can include TNT commands in the Nexus file, at the end of the file, preceded by "begin tnt;" (with an "end;" at the end, which is ignored by TNT; otherwise, PAUP* complains that the end of the file was found while reading a TNT block).

TNT can also export data or trees in Nexus format. For this, choose the Data/Export menu option (or use the export command).  The export command allows embedding commands (for PAUP* or TNT) at the end of the file, using + after the file name, followed by the commands (a semicolon terminates; to embed a semicolon in the file, use a period followed by a comma, as in the quote command).

 

Implied Weighting (back to index)

 

TNT implements the implied weighting method of Goloboff (1993), using floating-point (=exact) fit calculations.  The fit for characters is (by default) calculated as f = k / ( e + k ) (where e = extra steps, and k = constant of concavity, changed with piwe=k).  During searches, the program reports the score as an increasing function of the homoplasy (or "distortion" to be minimized), which is 1 - f = e / ( e + k ).  If you prefer the score reported as fit, it can be calculated once the search is finished.
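
For checking scores by hand, the default fit and distortion can be sketched as follows (an illustration only; k = 3 is just an example value for the concavity constant):

```python
# Default implied-weighting fit (Goloboff, 1993): f = k / (e + k),
# where e = extra steps and k = constant of concavity.
def fit(e, k):
    return k / (e + k)

# The score reported during searches is the distortion, 1 - f.
def distortion(e, k):
    return e / (e + k)

print(fit(0, 3))         # 1.0 (no homoplasy: perfect fit)
print(fit(3, 3))         # 0.5
print(distortion(3, 3))  # 0.5 (fit and distortion always sum to 1)
```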

Since floating-point calculations are used, it is very important that you evaluate group support when doing implied weighting.  Since exact ties are very unlikely, this criterion may produce overresolved trees if poorly supported groups are not collapsed.

The fit for discrete additive characters (including those with character-state-trees) is calculated by decomposing the character into binary variables.  Note that this may produce fits different from those of Pee-Wee or PAUP* (which do not do this conversion); the fits will always be measured as higher in TNT; it has been suggested (De Laet, 1997) that the recoding produces a more meaningful evaluation of the relative weights. This affects only data sets with discrete additive characters. If the same characters are defined as continuous, the fit is calculated without decomposing into binary variables (thus producing the same fit values that Pee-Wee or PAUP* produce for the additive coding); treating the additive characters as continuous obviously slows calculations.

In the case of extra internal steps defined by the user, the same number of extra steps will be added for each of the variables into which the character is decomposed.  If you want some of the variables to have different numbers of steps, you have to recode the character as binary yourself.

The method is applied to Sankoff characters as well, in which case the fit for a character with e extra steps and a minimum cost among states of c is calculated as k / ( ( e/c ) + k ) (where k is the constant of concavity).  When the costs are defined as all unity, the formula produces results identical to those of the standard formula used by Goloboff (1993).  If the user has defined some characters as having extra internal steps, these are added to e.
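
The Sankoff variant of the formula can be sketched the same way (again just an illustration; e is the extra cost and c the minimum cost among state transformations):

```python
# Fit for a Sankoff character: k / ((e / c) + k), where the extra
# cost e is rescaled by the minimum transformation cost c.
def sankoff_fit(e, c, k):
    return k / ((e / c) + k)

print(sankoff_fit(3, 1, 3))  # 0.5: with unit costs this equals k/(e+k)
print(sankoff_fit(6, 2, 3))  # 0.5: extra cost 6 at minimum cost 2 = 3 "extra steps"
```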

         It is possible to define a user weighting function, by specifying the values of weights for different numbers of extra steps, relative to no extra steps.  For this, the data must be read with implied weighting ON.  The only requirement for user-defined weighting functions is that no (relative) weight is negative.  Weighting functions where the reliability increases with the first extra steps, and then decreases, are (from the numerical point of view) perfectly valid.  To define a weighting function, type (or include in a file to be parsed) "piwe[" followed by as many numbers as extra steps you can have in your data set.  The numbers are read as the relative costs of transforming from 0 to 1, 1 to 2, etc.  If fewer relative weights are defined than the maximum possible number of extra steps, the undefined transformations are given a weight equal to that of the last transformation defined.  If more transformations than the maximum possible extra steps are defined, they are simply ignored.  User-defined weighting functions are applicable only to discrete, non-Sankoff characters. When user-defined weights are in effect, the fit cannot be calculated (i.e. only the score as an increasing function of the homoplasy is reported). Note that user-defined weighting functions can easily be used for clique analysis (i.e. defining the function as piwe[1 0 ;).

 

 

Tree-view: Selecting Nodes and Editing Tree (Windows only) (back to index)

 

            The tree-view mode is toggled with Trees/View. Under tree-viewing, it is possible to select nodes by clicking on them (with the left mouse button).  The node selection toggles between red/green/unselected.  All the terminals included in nodes selected with red will be considered as "selected," unless they belong to a node selected in green which is a member of the red selected node (combinations of red/green selections can be used to easily define paraphyletic groups).  Note that a green selection serves no purpose unless the green node itself is within a red node.

         When the tree is locked against manual editing (and regardless of nodes selected with the left button), positioning the cursor at a node and clicking the right mouse button produces a list of synapomorphies of the node (using character/state names, if defined and in effect).

When the tree is unlocked, it can be edited by clicking on the right mouse button.  If a single node (red or green) is selected, and the right button is clicked while the cursor is on an unselected node, the selected node (=source) is clipped and moved to be sister of the unselected (=target) node.  If the target node is the ancestor of source node, the branch leading to the source node is collapsed.  If the right mouse button is clicked on a red selected node (regardless of whether other nodes are selected), all the terminals belonging to the node are pruned from the tree (unless they in turn belong to green selected nodes, in which case they are left in the tree). If the right mouse button is clicked on a green selected node, and the tree is incomplete, the user can select a terminal to add to the tree (as sister to the green selected node).

         When the tree is edited, the changes are irreversible.  If you are not sure whether you want to keep the changes, you should save the tree to a file before editing it.  To avoid mistakes, trees are by default locked against manual editing; they can be unlocked with Settings/Lock trees.

 

Saving suboptimal trees (back to index)

 

The program can save suboptimal trees during branch-swapping.  The suboptimality values are set with Analyze/Suboptimal, or with the suboptimal command.  Two values can be set: the absolute and the relative fit difference to accept a rearrangement; the relative fit difference (Goloboff & Farris, 2002) is calculated between the tree being swapped and the rearrangement.  These two values also take effect when collapsing trees for consensus calculation with temporary collapsing under SPR or TBR.  The relative fit difference between the tree being swapped and the rearrangement is calculated exactly (see Goloboff & Farris, 2002, for details on how this can be done quickly), except for TBR with asymmetric step-matrix characters, where the relative fit difference can be either over- or underestimated by the algorithms used.
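As a rough illustration of the quantity involved, a relative fit difference between two trees can be sketched from per-character step counts.  The formulation below, (F - C)/F, where F sums the step differences favoring one tree and C those favoring the other, is an assumption based on the cited Goloboff & Farris paper, not TNT's actual code:

```python
def relative_fit_difference(steps_a, steps_b):
    """Sketch of a relative fit difference between two trees.

    steps_a, steps_b -- per-character step counts on trees A and B.
    F sums the amounts by which characters favor A (fewer steps on A),
    C the amounts by which they favor B; RFD = (F - C) / F.
    This formulation is an assumption based on Goloboff & Farris (2002),
    not TNT's implementation.
    """
    favor = sum(b - a for a, b in zip(steps_a, steps_b) if b > a)
    contra = sum(a - b for a, b in zip(steps_a, steps_b) if a > b)
    if favor == 0:
        return 0.0
    return (favor - contra) / favor
```

A value of 1 means all differing characters favor tree A; a value near 0 means the evidence is balanced.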

Note that the new technology algorithms (sectorial searches, tree-fusing), since they are not designed primarily to save multiple trees, do not have a well-defined behaviour in terms of which suboptimal trees are saved; some trees will be saved, others won't.  Thus, if you want suboptimal trees, do branch-swapping after concluding a search with the new technology algorithms.

 

Memory management (back to index)

 

Except for the memory that Windows itself uses, memory management in TNT relies on special memory allocators written by Steve Farris.  This memory manager ensures that TNT has no memory leaks.  For example, reading a new data set is guaranteed to first set the whole system back to its original state, and then reallocate memory.  The same applies to macros, which use a different memory segment.  The only thing that can cause memory fragmentation when using TNT is repeating some of the following operations many times: (1) re-setting the maximum amount of "general RAM" the program can use, (2) re-setting the size of the display buffer, (3) re-setting the total amount of RAM to be used by the macro language, or (4) re-setting the number of nested input files.  These operations use normal memory allocators.  Other than this, TNT can be used for any period of time, or for any number of operations, without causing any memory fragmentation.

 

Tree collapsing (back to index)

 

TNT implements several rules for tree collapsing (see Coddington and Scharff, 1994, for discussion).  For rule 1, which may produce collapsed trees longer than the original resolution, the length-calculating commands (for example, Bremer support, or the tree filter when suboptimal trees are to be discarded) calculate the length before collapsing; note that the final collapsed trees may be longer.  The collapsing for consensuses (including Bremer supports) is only done temporarily.  Note that rule 4 cannot be used during consensus calculation (because it may multiply the number of trees), but the results for the (strict) consensus under rule 1 will by necessity be identical to those for rule 4.

The implementation of most rules requires no special comments.  The exception is rule 4, implemented in TNT as follows: 

 

1) take the tree to be collapsed

2) collapse all nodes for which ancestor and descendant state sets are identical

3) mark all nodes for which ancestor and descendant state sets share state(s).

4) for each combination of possible collapsings for the nodes marked in 3, if the length of the new tree equals that of the original tree, and no node in the new tree remains unsupported under rule 1, add the new tree to memory.

5) if any of the trees added to memory at point 4 is a duplicate of a previously collapsed tree, discard it.

 

The trees produced will thus not be further collapsed by applying rule 1 to them; when the trees are only condensed, existing duplicates are not eliminated automatically (they are when the trees are "filtered").  This implementation of rule 4 is almost always more time-consuming than the other rules (except SPR or TBR, naturally); in some cases, it is much more time-consuming.  For some input trees, this implementation of rule 4 may produce a larger number of output trees.  Coddington and Scharff (1994) did not prove that all rule 4 trees can be found by collapsing all rule 1 trees, but it is easily proven (De Laet, pers. comm.) that, if the trees are the set of all the trees distinct under rule 1, just deleting the trees that become longer when collapsed produces a set identical to the rule 4 set.  However, if the trees to be collapsed are only a subset of the rule 1 trees, discarding the trees that become longer may produce spurious results, and that is why rule 4 is implemented as it is.  If you are certain that your trees are all the distinct rule 1 trees, you can get the set of rule 4 trees by simply (a) collapsing the trees under rule 1, and then (b) getting rid of the trees that are longer (filtering without collapsing).
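The combinatorial part of this procedure (steps 3 to 5) can be sketched as follows, with trees reduced to sets of clades and the equal-length/rule-1 test abstracted into a callback.  All names are illustrative; this is not TNT's implementation:

```python
from itertools import chain, combinations

def collapse_combinations(marked):
    """Step 4: every combination of the nodes marked in step 3."""
    return chain.from_iterable(
        combinations(marked, r) for r in range(len(marked) + 1))

def rule4_trees(clades, marked, acceptable):
    """Schematic rule-4 collapsing (not TNT's actual code).

    clades     -- frozenset of frozensets; each inner set = tips of one node
    marked     -- nodes whose ancestor/descendant state sets share states
    acceptable -- callback standing in for the equal-length and
                  no-unsupported-node (rule 1) test of step 4
    Storing results in a set discards duplicates (step 5).
    """
    kept = set()
    for combo in collapse_combinations(marked):
        candidate = clades - frozenset(combo)
        if acceptable(candidate):
            kept.add(frozenset(candidate))
    return kept
```

Since the number of combinations grows exponentially with the number of marked nodes, this makes clear why rule 4 is almost always more time-consuming than the other rules.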

         The branch-swappers use only binary trees as input.  If some non-binary trees exist when a search is invoked, they are automatically resolved before the search proceeds.  This may cause some trees that were distinct to become identical when resolved.  Additionally, collapsing under rule 1 (i.e. collapsing all ambiguously supported branches) may make the trees longer than the initial binary trees.  When either of these two things happens, a search may deliver N optimal trees of minimum length, but only a reduced number will be used as input for subsequent searches (this happens very often in PAUP*, which always collapses the trees after a search).  To prevent this, TNT by default retains the trees as binary after the search (the trees are compared anyway, to make sure that they are distinct when collapsed under the criterion in effect).  Using these trees for consensus, Bremer support calculation, etc., requires that they be collapsed temporarily.  The temporary collapsing is set with collapse+; (or under Settings/ConsenseOptions).

         When trees are temporarily collapsed, eliminating taxa requires special considerations.  The characters of the eliminated taxa still influence the optimizations; only the taxon positions are ignored.  For SPR and TBR collapsing, the results produced are those that would be obtained by (1) saving all the binary trees that could be found by swapping on the tree to be collapsed (if a better tree is found, swapping continues nonetheless from the original tree), (2) pruning all the taxa to be eliminated, and (3) consensing the pruned trees.  For optimization-based collapsing, evaluation of support for a branch considers whether merging two branches into a single branch when clipping a taxon produces a supported branch (note this is not the same as first collapsing the trees under rule 1, and then clipping taxa).  Eliminating taxa when resampling, estimating consensuses, or searching until the consensus becomes stable, is done in the same way.  If you want to remove a taxon from the analysis (as opposed to removing it from the consensus), deactivate the taxon (with the taxcode command, or the Data/Taxa menu option).

         Beware that, although tree-collapsing involves no special theoretical considerations, it is of practical importance.  Misusing the collapsing possibilities provided by the program may produce absurd results. 

 

Command-Line (back to index)

 

With File/CommandLine, it is possible to enter commands for TNT.  <Tab> switches focus between the main window and the command line; <enter> always takes you to the command line.  When focus is on the command line, pressing - or + moves the screen one page up or down (respectively).  The keys / and \ move through the buffer that remembers the last commands executed.  A list of the commands (with their syntax) is obtained by entering help at the command line.

You can also give TNT (essentially) any commands you want when invoking it from the DOS prompt (in the Windows version, commands must be preceded by a semicolon; strings before a semicolon are interpreted as names of files to open).  The cost command takes the output redirection symbol, >, as a specifier; in the command line the > symbol can be replaced by a closing curly brace, }.

 

Batch Menus (Windows only)  (back to index)

 

TNT allows instructions to be stored (in memory or in a file) for later execution, under Settings/Batch.  Note that some purely interactive functions (like searching for text in the text buffer, or selecting tree nodes in tree-viewing mode) cannot be stored as batch instructions.  To store instructions, choose Set menus to batch, and then simply click on the options you want (you can later edit them).  When the instructions are subsequently executed, they are run in order.  Make sure warnings are disabled before running a set of instructions (otherwise the program may halt when it finds a warning).

 

Measures of character support  (back to index)

 

         TNT implements two types of character support measures: Bremer supports and resampling.  The Bremer supports are calculated from the trees in memory; it is up to the user to find the suboptimal trees.  The resampling can be of different types, some of which are preferable to others.  Standard bootstrapping is influenced by uninformative characters (and by characters irrelevant to the monophyly of a given group).  Bootstrapping (both standard and Poisson) and jackknifing (except with a resampling probability of 50%) are affected by character weights and transformation costs (e.g. additive characters).  Symmetric resampling is affected by none of these.  Outputting results as absolute frequencies may alter the apparent support for groups with very low support (and cannot measure the support for such groups).  Both GC and frequency slopes solve this problem (for slopes, supported groups have negative slopes; the closer to 0, the better supported the group).  The supports can be measured on the groups present in any given tree (normally, a most parsimonious tree, but using other trees provides a way to measure how strongly contradicted some groups are).  The slopes must be measured by reference to a pre-existing tree.
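The GC measure mentioned above is commonly described as the resampling frequency of a group minus the frequency of its most frequent contradictory group.  A sketch under that assumption (illustrative only, not TNT code):

```python
def gc_support(group_freq, contradicting_freqs):
    """Sketch of the GC measure: frequency of the group minus the
    frequency of its most frequent contradictory group.  This follows
    the usual published description, not TNT's internal code.

    group_freq          -- resampling frequency of the group (0..1)
    contradicting_freqs -- frequencies of groups contradicting it
    """
    worst_rival = max(contradicting_freqs, default=0.0)
    return group_freq - worst_rival
```

A positive value indicates support, a negative value indicates contradiction, so even poorly supported groups can be distinguished from strongly contradicted ones.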

         If some taxon has a very poorly defined position in the tree, it may strongly decrease the support for many groups.  If you wish to find out whether the rest of the tree is well supported, you can eliminate that taxon from the consensus when resampling.  To find out which taxon is floating around, you may have to first do a more superficial estimation, saving the trees, and trying to prune different taxa (note that in such a case you have to take the precaution of not collapsing the trees with SPR or TBR, as otherwise you may never find the floating taxon; once you have found it, you turn collapsing back to SPR or TBR (the most effective) and do a more careful estimation of the resampling frequencies, eliminating the evil terminals).

 

A note on Linux-Mac OS X versions (back to index)

 

Linux and Mac OS X versions of TNT have a couple of differences from the standard versions.  First, those versions can take a comma ( , ) instead of a semicolon ( ; ).  This is because the shells there use semicolons to separate commands, and giving arguments to TNT from the command line would otherwise become difficult.  The second difference is that those versions can run in the background.  If you want to start TNT in the background, use bground as the first argument, then give the normal arguments you would give the program (do not activate reports and do not change the silent status, but you can set report+ to report every certain number of trees, replications, or seconds, so that you can monitor progress by inspecting the output file; make sure you do have output files, otherwise the program will run and exit without doing anything!), and end with the usual '&'.  TNT will not write to the console, or expect console input (at least if no errors occur).  It will write its results to a file and exit smoothly (exit code 0; if the program runs out of commands while in the background it exits with code 1).  Should some error occur during the search, the program writes a message to the output file, and exits with code 255.  Try

 

     ./tnt bground p zilla.tnt, echo=, log peep, rep+1, mu10=ho3, le, ne, quit, &

 

and do "tail -f peep" to monitor search progress.

 

Running TNT in parallel

 

If you install the PVM system on your machine, you can run TNT in parallel.  If you have difficult data sets, this may be advisable even if you have a single machine with a single processor, as the parallelization interface of TNT allows a lot of control in defining search routines (e.g. you can re-analyze sectors of the tree using any search routine you want).  Unlike other phylogeny programs which can search in parallel (e.g. POY; Wheeler et al., 2003), TNT does not have pre-established routines for working in parallel.  Rather, TNT allows the user to parallelize routines, interacting with the scripting language; although harder to use, this is much more flexible and can be tailored to specific needs.  The command that handles most of the options for parallelization is ptnt.  With the command ptnt begin, TNT can launch jobs in parallel; each job can consist of one or several tasks.  Each task in the job receives a set of instructions; the instructions are determined by the user, and can be (essentially) any set of TNT commands (possibly including instructions to launch other jobs in parallel).  After launching a job, the master TNT process continues operation normally (i.e. asynchronously).  The results from the slaves are brought back to the master with the command ptnt get; the master does not receive results from the slaves until the master itself executes a ptnt get command; slaves that finished running remain dormant (i.e. not using processor time) until the master requests results from them (or until the master sends the slaves a new set of instructions).  The master may launch several jobs at the same time; up to 32 jobs can be running simultaneously, and the master may handle exchanges of trees/information between them.  When a job is launched, it automatically receives from its parent the data (and, optionally, starting trees), as well as its identification number (for t tasks, numbering goes from 0 to t-1).
Since a slave task knows its own number, this can be used to have some tasks perform differently (alternatively, the same thing can be accomplished by launching other job(s)).  A TNT process that launched some jobs can monitor (with ptnt wait or ptnt gwait) the progress of the jobs, giving the user access to the numbers of trees, replications, scores, etc., in the slave tasks; based on this information, the parent process can then recall jobs back (ptnt get), tell the slaves to skip some instructions and move to a specific part of the instructions (ptnt goto and ptnt skipto), or send new sets of instructions to the slaves (ptnt again).  The parent process can get trees from its slave tasks at any point (with ptnt spy).

         A very simple example shows how to parallelize resampling, in a job that receives the name psample ("parallel sampling"):

 

 

ptnt begin psample
10                                        /* run 10 slave tasks */
+. -myself =                              /* run on all hosts but the master node */
/* ...these are the slave instructions... */
collapse tbr ;
resample rep 10 savetrees ;               /* resample */
tchoose 1. ;                              /* choose individual trees */
return ;
ptnt wait ( tasksleft[psample] == 0 ) ;   /* wait for all tasks to end */
keep 0 ;
ptnt get psample ;                        /* get results from psample */
collapse notemp ;                         /* trees are already collapsed by the slaves */
freqdif ;                                 /* calculate freqdifs */

 

If the number of tasks (10 in this case) is not specified, as many tasks as hosts minus one are used (and no task is launched on the master node); if no host selection is specified, the default selection is used (i.e. the same one indicated in the example).

 

 

Consensus estimation (back to index)

 

TNT implements the consensus estimation method of Goloboff and Farris (2001).  This is based on making a certain number of independent searches, and checking whether the end results of all of them share groups.  A group which is present in all (or most) of the searches, even if the searches failed to produce optimal trees, is likely to be present in the set of optimal trees (and thus in the consensus).  This can be used to get a quick idea of the approximate results for your matrix, or to create constraints to speed up searches.  The consensus estimation is invoked with the qnelsen command, or under Analyze/ConsensusEstimation.
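The basic idea can be sketched as follows: keep the groups that appear in at least a given fraction of the independent searches.  Illustrative Python only, not the Goloboff-Farris algorithm as implemented in TNT:

```python
from collections import Counter

def estimated_consensus(search_results, cutoff=1.0):
    """Sketch of consensus estimation in the spirit of Goloboff &
    Farris (2001): keep the groups shared by all (or most) of several
    independent searches.  Illustrative only, not TNT's algorithm.

    search_results -- one set of groups (frozensets of taxa) per search
    cutoff         -- fraction of searches a group must appear in
    """
    counts = Counter(g for groups in search_results for g in groups)
    n = len(search_results)
    return {g for g, c in counts.items() if c / n >= cutoff}
```

Lowering the cutoff below 1.0 corresponds to accepting groups present in "most" rather than all of the searches.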

 

Tree-searching algorithms  (back to index)

 

The program implements several types of search algorithms; in Windows versions, all of them are accessed from the Analyze menu.  For really small data sets (i.e. below 20-30 taxa), exact solutions can be obtained via implicit enumeration (or "branch-and-bound").  For larger data sets, heuristic ("trial-and-error") methods are required.  The basic algorithms implemented in the program are wagner trees (with sequential addition of the taxa, in the best available position, to form a tree as short as possible), and different types of swappers and algorithms that make rearrangements on preexisting trees.  Branch-swapping can be of type SPR or TBR.

For relatively messy, but not very big, data sets, the best algorithm consists of multiple random addition sequences plus TBR (RAS+TBR); this is invoked with Analyze/TraditionalSearch, using wagner trees as "starting trees," or with the mult command.  Whether or not you are likely to have found the best trees, or all the tree islands, depends on the ease with which the algorithm converges to the minimum length found.  For this reason, the program reports the number of replications that reached the best score found.  The program also reports whether some replications had partial overflows of the tree file, in which case some most parsimonious trees may not have been found during the search; further branch-swapping starting from the trees in memory ("RAM") can be used to find the additional trees.

         For more difficult data sets, the special and combined algorithms will be required.  Note that instead of wasting search effort, it is often preferable to make one very aggressive search to establish what the best score for your data set is, and then produce additional (independent) hits to that score until the strict consensus has become stable (as discussed in Goloboff, 1999).  There are four basic types of special algorithms implemented in TNT: ratchet, drifting, sectorial searches, and fusing.  The ratchet in TNT is slightly different from the ratchet as originally described by Nixon (1999), and as implemented in Nona/Pee-Wee.  The ratchet consists of two phases, perturbation and search, which are sequentially repeated.  In either phase, the ratchet in TNT does not look for multiple trees (in Pee-Wee/Nona it was necessary to set the number of trees to save in each phase, but the best number is always just one); multiple trees can be found only as the search phase finishes.  The ratchet always uses TBR branch swapping, and it alternates the search phase (S) with three types of perturbation phase: original weights (O), upweighting (U), and deleting (D).  Thus, the cycles will be: O,S,U,S,D,S,O,S,U,S,D,S,O... etc.  Any of the O, U, or D phases can be skipped (from the Ratchet settings dialog, or with the ratchet command).  During the perturbation phase, rearrangements of score equal to or better than the tree being swapped are always accepted, until a certain number of rearrangements have been made (or until a certain percentage of the total swapping on the tree has been completed).  This seems to provide better results than the original ratchet.  The "autoconstrained" option calculates the (strict) consensus of the previous tree and the tree resulting from the perturbation phase, and during the subsequent search phase, only rearrangements that do not violate the monophyly of the shared groups are done.
The rationale for this is that improvements are likely to be found in areas that have changed during the perturbation phase.  Every certain number of constrained cycles, an unconstrained search phase is done (to ensure global optimality).  If this is used in combination with user-defined constraints, both are combined.
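The alternation of phases described above can be sketched as a simple generator of the cycle.  This is a schematic illustration of the phase sequence only, not TNT code:

```python
from itertools import cycle

def ratchet_phases(n, skip=()):
    """First n phases of the TNT-style ratchet cycle described above:
    perturbation with original weights (O), upweighting (U), or
    deleting (D), each followed by a TBR search phase (S).  Phases
    listed in 'skip' are omitted.  Schematic illustration, not TNT code."""
    perturbations = [p for p in "OUD" if p not in skip]
    out = []
    for p in cycle(perturbations):
        out.extend([p, "S"])        # each perturbation is followed by a search
        if len(out) >= n:
            break
    return out[:n]
```

For example, skipping the D phase yields the cycle O,S,U,S,O,S,U,S,... instead.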

Tree-drifting is quite similar to the ratchet, but during the perturbation phase, instead of reweighting the characters, it accepts the moves with a probability that depends on both the relative fit difference (Goloboff and Farris, 2001; see Goloboff, 1999) and the absolute fit difference between the tree being swapped and the new tree.  It has the advantage that memory management is easier, and for that reason it is the only perturbator that sectorial searches (see below) can use.  The perturbation phase also alternates between acceptance of optimal trees only (O), and suboptimal trees (U): O,S,U,S,O,S,U... etc.  Either the O or the U phase can be skipped (the O phase, with drift:noequal; or with the skip optimal-only drifting option in the tree-drift settings dialog box; the U phase effectively becomes an O by setting the score difference to a very low number).

Sectorial searches are of two basic types: constraint-based and random; a combination of both is the mixed sectorial search.  Sectorial searches take a portion of the tree, create a reduced data set, and produce a mini-analysis of that data set.  If a better solution for that portion of the tree is found, it is replaced back into the original tree.  Every certain (user-determined) number of rearrangements, a round of global TBR is done (to ensure global optimality).  If the sectors are below a certain size, the mini-analysis consists of three RAS+TBR (random addition sequence wagner trees plus TBR); if the three sequences end up in trees of the same score, the mini-analysis is interrupted (i.e. the reduced data set is "clean" and not worth further analysis); otherwise, three additional RAS+TBR are done.  If the sectors are above that size, a certain number of iterations of tree-drifting is applied to the reduced sector.  The three types of sectorial search differ in how the sectors are selected.  Constraint-based searches choose the sectors by reference to constraints: the nodes that are resolved as polytomies in the constraint tree, connecting to no less than N branches (where N is some number defined by the user; the default is 10).  A reduced data set corresponding to each of the polytomies in the constraint tree is analyzed, in turn; this is repeated a certain number of times (default = 3), because changing some sector may imply that other sectors should change as well.  Random sectors are chosen at random, to be no smaller and no bigger than a certain (user-defined) size.  The mixed sectorial searches use both constraint-based and random sectors; they can only be used with multiple addition sequences.
For the mixed sectorial search, a temporary constraint tree is created by consensing the tree found at some stage of the present addition sequence (user defined: this can be the wagner tree, which is the default, or optionally the tree produced by SPR or TBR), and this constraint tree is used to do a constraint-based sectorial search.  Once this is finished, a random sectorial search is done (unconstrained, or using only the user-defined constraints if they are in effect).

         Tree-fusing mixes trees, producing better trees.  If the trees come from different searches, the score is often considerably improved in a very short time.  Tree-fusing produces (so to speak) a synergistic effect for all those previous searches; this also means that the rate of score improvement is not necessarily linear (i.e. all the searches before fusing were not producing a score improvement beyond some bound, but were nonetheless accumulating "momentum" for the subsequent fusing).

 

Determining best search parameters (back to index)

 

For many data sets, the time to find the best scores during searches may depend dramatically on the aggressiveness and parameters used during the search.  For example, very superficial searches on difficult data sets will keep grinding for a very long time before the best score is actually found; more aggressive parameters will find the best score sooner.  The parameters to change are the size and number of sectors to run under random sectorial searches, the number of initial replications to submit to fusing, and the rounds of tree-drifting or ratchet.

         A careful consideration of how the data set responds to changes in parameters is the best way to proceed, but if you have no idea of how a data set behaves, or don't know how to set the parameters for a search, you may leave the choice of parameters to the program.  This is done with the set initial level option of the New Technology dialog box (or the level option of the xmult command).  If entered as a command, the level must be a number between 0 and 10 (0 is very superficial; 10 is extremely exhaustive); if entered in the dialog box for search level, it must be a number between 0 and 100 (again, 100 is the heaviest search).  If using a driven search (i.e. more than a single hit of the xmult command), you may have the program check whether the best score is being found easily or not (and then decrease or increase the search level accordingly).  This is set with the chklevel option of the xmult command.  Again, letting the program try to determine the best parameters during the search is a poor substitute for finding the best parameters yourself and leaving them fixed during the search; it is intended for cases where you lack the time or experience to determine the best parameters on your own.

 

Implementation notes for tree-searching (back to index)

 

General

This section describes some details on how the searches are implemented.  It is intended to give the user clues on how to best use the program, or at least understand some of the ways in which the program behaves under different circumstances.

When all the characters have equal weights, the swapper proceeds fastest.  If some characters are weighted differently, the search algorithms proceed somewhat more slowly (i.e. it takes longer to complete a wagner tree or to swap completely through one tree; note that finding a shortest tree might still be easier if the weighting increases the structure of the data set).  If the characters are weighted using very different weights (e.g. every character is assigned a weight from 1 to 50, at random), the search algorithms reach a floor (depending on the data set, about 50% slower than equal weights).

Searching under implied weights, since it uses floating-point calculations, is the slowest, being 2 or 3 times slower than equal weights (note that in PAUP*, using implied weighting slows down the searches by orders of magnitude).  The largest speed differences with PAUP* are in running large data sets (between 20 and 50 times, sometimes more than 100 times), and implied weighting (over 500 times, for large data sets).

The number of states also affects running times.  The more states the characters can have, the slower the program runs.  The program internally packs characters in groups with 2 states, 3 to 4 states, 5 to 8 states, 9 to 16 states, and 17 to 32 states.  Sankoff characters are always packed separately.  This implies that time differences will appear when the number of characters in each of those categories changes significantly.  For Sankoff-style characters, the time increases rapidly with the square of the number of states (in some portions of the code, the time actually increases with the cube).  Thus, the algorithms used react badly to increasing numbers of states; for Sankoff characters with 16 states or more, TNT is about as slow as PAUP* (or even slower).
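The packing categories just described can be summarized in a small helper.  The category labels are illustrative; this is not TNT's internal representation:

```python
def packing_class(nstates, sankoff=False):
    """State-count categories into which TNT is described as packing
    characters (Sankoff characters are always packed separately).
    Labels are illustrative, not TNT's internal representation."""
    if sankoff:
        return "sankoff"
    for limit in (2, 4, 8, 16, 32):   # 2, 3-4, 5-8, 9-16, 17-32 states
        if nstates <= limit:
            return "up to %d states" % limit
    raise ValueError("more than 32 states not supported")
```

This makes clear why, say, recoding a few 5-state characters as 4-state ones can change running times, while moving characters around within a category does not.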

The implementation of searches under implicit enumeration is similar to that in other programs, and there is nothing special about it.  As usual, exact searches take the longest for the most poorly structured data, and the times increase dramatically with the number of taxa.  Since TNT is conceived to work mostly with large data sets, there seemed to be not much point in investing effort to optimize the implicit enumeration algorithms.

The tree swappers can complete swapping on a tree faster if no additional equally parsimonious trees are being found.  This is because in such a case the length calculation for a rearrangement can be abandoned as soon as the program discovers that the tree is as long as the best one found so far; otherwise, additional characters have to be checked to make sure the tree is no longer than the best one found so far (and, to make things worse, the tree has to be optimized for collapsing and compared to all the pre-existing trees, to make sure it is not identical as collapsed).  This is used to speed up searches under multiple random addition sequences, where a reduced number of trees is retained for each replication (so that most of the time the search is being done with a full memory buffer).
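The early-abandon idea can be sketched as follows.  This is illustrative only; TNT's actual length calculations are heavily optimized and work on packed characters:

```python
def tree_length_with_bound(char_lengths, bound):
    """Sketch of the early-abandon idea described above: accumulate
    per-character lengths and give up as soon as the running total
    shows the rearrangement cannot improve on the best tree found
    so far.  Illustrative only, not TNT's optimized code.

    char_lengths -- per-character steps of the candidate tree
    bound        -- length of the best tree found so far
    Returns the total length, or None if the calculation was abandoned
    because the candidate is already at least as long as the bound.
    """
    total = 0
    for steps in char_lengths:
        total += steps
        if total >= bound:
            return None          # abandoned early
    return total
```

When equally parsimonious trees must also be collected, the abandonment test has to use a looser bound, which is why swapping is slower in that case.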

 

Wagner trees

Except under constraints, the addition sequence for wagner trees is totally randomized, save for the outgroup taxon (which is always the first taxon added).  The insertion sequence of new taxa into the growing tree is (by default) not totally random: the possible locations for the new taxon are tried from the root of the tree to the tips, or from the tips to the root (which option is chosen is determined at random, for each taxon to add; both are equiprobable).  In some extreme cases, this may be undesirable.  If the data are entirely uninformative (e.g. a matrix with no characters) and no tree collapsing is in effect, pectinate trees result from such an insertion sequence.  It is possible to randomize the insertion sequence for new taxa, so that all locations are tried in a random order.  This is set with rseed[; the alternative is rseed]; (the default, up/down sequence).  For large data sets, building a wagner tree with a randomized insertion sequence may actually take less time than with the up/down sequence (that is probably because the best placement is often found somewhere in the middle of the tree, not at the tips or the root; checking the tips or the root first provides worse bounds for giving up length calculations, for a greater portion of the time).  For data with no informative characters (and with collapsing turned off), wagner trees with a random insertion sequence are (more or less) equivalent to random trees.  If your data are more or less structured, setting one or the other is irrelevant.
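The two insertion-sequence policies can be sketched schematically.  The function below is illustrative only (not TNT code); it assumes the candidate locations are supplied in root-to-tips order:

```python
import random

def insertion_order(locations, randomized=False, rng=random):
    """Order in which candidate locations are tried when adding a taxon
    to a growing wagner tree, as described above: by default root-to-tips
    or tips-to-root (chosen at random, equiprobably); with rseed[-style
    randomization, all locations in a random order.  Schematic sketch,
    not TNT code; 'locations' is assumed to be in root-to-tips order."""
    locations = list(locations)
    if randomized:
        rng.shuffle(locations)           # rseed[ : fully random order
        return locations
    if rng.random() < 0.5:
        locations.reverse()              # tips-to-root
    return locations                     # root-to-tips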

Note that randomization of the insertion sequence has no effect when wagner trees are built under constraints (i.e. the insertion sequence is always up/down in this case).

 

Swappers

The two types of swapper are designed to work as well as possible under different circumstances.

The time to complete swapping on a tree (that is, on a tree which does not lead to better trees by swapping) changes with about the square of the number of taxa for both SPR and TBR (this is true even when TBR actually “looks” at a number of rearrangements which increases with the cube of the taxa).

SPR is designed to work best at the initial stages of a search, when better trees are being found frequently; it is not as fast as it could be for trees which are already optimal or near-optimal. 

TBR is designed instead to work best for large numbers of taxa, when the trees being swapped are optimal or near-optimal (that is, when better trees are not being found frequently), and also to work best on well-structured data (i.e. data with numerous relatively well supported groups).  If the data are very poorly structured, swapping through a tree takes longer.  For Källersjö et al.'s (1999) 2500-taxon matrix, the time to complete TBR swapping on a near-optimal tree (running on an 800 MHz PIII machine) is about 12 secs.; PAUP* on the same machine takes over 18 minutes to complete TBR swapping on the same tree (i.e. a 90-fold speed difference; PAUP* takes about 6 minutes to complete SPR on the same tree, so that completing TBR for 2500 taxa in TNT is 30 times faster than completing SPR with PAUP*).

Neither type of swapper is highly optimized for low numbers of taxa.  For small data sets, expect no big differences (i.e. no more than 5-10 times) in running times between TNT and other programs.  Such searches take little time anyway, so it seemed best to keep the implementation focused on large data sets.

 

Effects of constraints on running times

Heuristic searches.- There are three types of constraints: positive tree-constraints, positive group constraints, and negative group constraints.  Tree constraints are defined with the = option of the force command (or, from the menus, as “positive constraints” with no floating taxa).  Group constraints are defined with all the other options of the force command (or, from the menus, as positive constraints with floating taxa, or negative constraints).  The three types can be used simultaneously (e.g. “search for the best trees where A+B+C form a monophyletic group, but A+B do not”).

         The tree constraints are implemented in such a way that moves violating the established groups are never attempted.  This type of constraint therefore speeds up the searches (if many groups are defined, by a significant factor).  Note that the number of trees reported as examined during SPR or TBR swapping of a single tree is the same under constraints (i.e. the forbidden moves are counted as if effected and rejected).

         The group constraints (positive and negative) are implemented internally by creating a matrix of group-membership variables (where taxa belonging to a group have state 1, taxa outside the group have state 0, and floating taxa have a missing entry).  With group constraints, all the same moves are evaluated for length as in an unconstrained search.  If the tree is within the current length bound, the group constraints are then evaluated, and if the new tree does not pass, it is rejected (even if shorter).  Thus, using group constraints does not speed up searches at all.  If the group constraints make the trees much longer (i.e. if they forbid many groups present in the consensus of MPT’s), they will actually slow down the search (i.e. many moves will produce better or equally good trees, which must subsequently be rejected because they violate the constraints).

         For wagner trees, the program adds each taxon in a position such that no constraint is violated at any point.  For negative constraints, this means that the first two taxa belonging to a group constrained for non-monophyly must be placed as non-sister groups.  Of course, a better tree might be found if breaking the monophyly of the group were delayed (for example, by later adding another taxon in between them), but then it might become impossible to break the monophyly without at the same time violating some positive constraint.  This means that, for negative constraints that increase tree length considerably, the wagner trees may be quite far from the best length achievable under the constraints (so it is advisable not to use wagner trees alone to estimate consensuses or jackknife frequencies under those circumstances).  Additionally, the random addition sequence may change somewhat when positive (tree or group) constraints are used in combination with negative constraints.  The first two members of a group constrained to be non-monophyletic must effectively be non-monophyletic when both have been added to the tree, but this may be impossible if those two are, in turn, the only two members added so far of a group constrained for monophyly.  The addition sequence is therefore checked when positive and negative groups are enforced together, and changed to prevent this; the addition sequence is thus not completely random when both positive and negative constraints are in effect.

Exact searches.- Both positive tree constraints and positive group constraints speed up searches under implicit enumeration (the more so when they force groups that are present in the optimal trees anyway).  Negative constraints slow them down, to the extent that they make the best possible trees longer. 

 

Scripting (for programmers or advanced users) (back to index)

 

TNT has a scripting language, which allows the user to automate complex tasks, or "program" TNT to do things it normally doesn't do.  The scripting language has flow control, loops, and internal variables that can be accessed by the user.  The scripting language is normally disabled; it is enabled by entering "macro=;" at the command prompt.  The macro command also controls the basic settings for the macro language, like total RAM for macros, number of variables, number of nested loops, and protection level for accessing variables/arrays (for details, see online help for macro).

         The variables and flow control use numbers (actual values or internal variables) or expressions (compound expressions must always be enclosed in parentheses to be properly parsed).

         Examples of the scripting language can be found in the files contained in zipdruns.exe (a self-extracting executable).  Exactsol calculates trees by implicit enumeration (using a complete down-pass optimization for each tree, so it is very slow; the addtax routine is used as a function from within exactsol).  Macswap does SPR swapping on a tree, and prints out a message every time it finds a better tree (it also uses down-pass optimization to evaluate trees).  Sank does Sankoff optimization; it is more elaborate than the other two (it can be used only on small trees; make sure you have set the number of variables high enough).

 

User variables.- User variables can be accessed in any context, by enclosing the number of the variable to be accessed in single quotes.  Thus, if variable number 0 has a value of 1200, inputting the string hold '0' is equivalent to inputting hold 1200.  Variables can be given names, in which case their values are accessed by enclosing the name in single quotes.  Using double quotes is equivalent to using single quotes, except in parallel versions of TNT, where double quotes passed as instructions to slaves are interpreted on the slave, while single quotes are always interpreted locally.  Used appropriately, the quotes thus let user variables act as pointers.  User variables can be defined (when declaring their name) as arrays (of up to 5 dimensions).  The indices are then enclosed in square brackets, separated by commas.

Thus, 'n[j]' is the j-th value of variable n (note that arrays are zero-based), and 'p[x,y,z]' is the z-th value for row x, column y, of variable p.  Note that 'n'[j] will instead be parsed as the number corresponding to variable n, followed by the expression [j] (which, depending on context, may or may not be a legal expression).  The indexing works by substituting values: each time a new "dimension" is found, the cell accessed is the base cell plus the value of the dimension.  The user could set the values up so that a "matrix" is properly organized, but declaring a variable as multidimensional when it is named sets up the arrays automatically.
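As an illustration of the array behavior just described, consider the following minimal sketch (it assumes the macro language has been enabled with macro=;, and the variable name p is just an example):

         var = 0 p [ 3 , 4 ] ;         /* declare p, starting at variable 0, as a 3 x 4 array */
         set p[1,2] 99 ;               /* store 99 in row 1, column 2 (indices are zero-based) */
         if ( 'p[1,2]' == 99 )         /* read the cell back by quoting the name */
              quote cell set correctly ;
              end

Declaring p with its dimensions reserves the 12 cells automatically, so the user does not have to lay out the rows by hand.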

         It is also possible to enclose in the single or double quotes an expression (which must be enclosed in parentheses); what is inside the parentheses may, in turn, contain references to user variables.  Thus,

 

                 set 5 '(%1)' ;

 

will set variable 5 as equal to the variable numbered as the first argument.  Similarly,  if variable 0 equals 10, variable 1 equals 20, and variable 30 equals 100, then the expression

 

                  set 5 '('0'+'1')' ;

 

will assign the value of 100 to variable 5 (i.e. assigns to variable 5 the value of the variable whose number equals the sum of the values of variables 0 and 1).

         User variables can also be used to store strings, accessed by preceding the number (or name) of the variable with a dollar sign, $.  Strings can be stored as literal (in which case expressions like '0' within the string are replaced by the value variable 0 has at the time the string is expanded); variables can be expanded within strings only as numbers or as arguments; no string expansion is allowed within strings.

         The values of the user variables can be changed only with the set command.  Set, followed by a variable number (or name), followed by an expression, sets the value of the variable to the value of the expression.  The number of the variable to be set can itself be an expression (but if so, remember, it must be enclosed in parentheses).  The expression can be written much as in the C language; logical expressions evaluate to either 0 (FALSE) or 1 (TRUE).  The expression can also be replaced by ++ , -- , += , -= , *= , or /= (with the same meaning as in C; note that only post-increments and post-decrements are allowed).  Thus

 

             if ( length [ 0 ] > 'maxlen' )

                      set 1 '1' + 1 ;

             else   set 0 '0' + 1 ;

             end

 

will increase variable 1 by one if the length of tree 0 is greater than the value stored in the variable the user has named maxlen, and otherwise increase variable 0 by one.  The same is achieved with the more compact expression

 

              set  ( length [ 0 ] > 'maxlen'  ) ++ ;

 

Expressions.- Expressions can be any combination of operations.  Valid operators are + - * and /, as well as bit-operators ( & | ^ ).  Valid logical comparisons are ==, > , < , >= , <= , ! , and !=.  Since operators are always evaluated from left to right (there is no precedence), be careful to delimit subexpressions as desired:

 

if (  'maxlen' <= 1000 + 500 )

 

is actually equivalent to

 

if (  ( 'maxlen' <= 1000 ) + 500 )

 

so that it will first evaluate whether maxlen is less than or equal to 1000, and then add 500 to the result of the comparison (if maxlen is less than or equal to 1000, the result is 501; otherwise, the result is 500; either way, the expression evaluates as "true" ... probably not what you wanted).  Either remember the left-to-right evaluation and invert the order of the expressions:

 

               if ( 1000 + 500 >= 'maxlen' )

 

or use the original order and parentheses :

 

               if ( 'maxlen' <= ( 1000 + 500 ) )

 

Note that parentheses must always be balanced.

 

Changing values of variables.- The values of variables are changed with the set command: 

 

        set N (expression) ;

 

The "expression" may, in this context only, be given without enclosing parentheses (this is the only exception to the rule given above).  The expression in this context must always end with a semicolon.  The example above sets variable number N to the value of the expression.  You can set the i-th variable (where i itself is any expression) counting from N, as if N were an array, by using:

 

set N[i] ( expression) ;

 

If some private variables have been defined in previous files (see "Private variables" below), the number N automatically refers to the internal numbering within the current file (use names to access variables from previous files).

To set variables as strings, use:

 

         set N $text ;

 

text can be any number of characters, with the proviso that the string starts at variable N and continues into subsequent variables; every 4 characters of the string occupy one user variable (which is a long int), so that the number of user variables needed to store a string is the length of the string divided by 4 (rounded up).  The semicolon indicates the end of the string (which can contain blanks or spaces; a period followed by a comma is replaced by a semicolon, as in the quote command).  If "text" contains expressions to be replaced (i.e. single quotes or symbols for arguments), they are replaced on reading.  It is possible to define the string as "literal" by including an equal sign (=) before the number of the variable:

 

         set =N $text ;

 

A literal string has expressions expanded when it is itself expanded (so that it can be expanded to different strings).  No string expansion is allowed within strings (although it may work in some cases, its results are not predictable).

         A string can be copied from one variable onto another easily, with the set command:

 

         set 10 $$0 ;

 

copies string 0 onto string 10 (make sure string 0 doesn't extend to string 10 before doing this).
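The string facilities just described can be sketched as follows (the stored text is only an example; the variable numbers are chosen far enough apart that the strings do not overlap):

         set 0 5 ;
         set 1 $run number '0' ;       /* expanded on reading: stores "run number 5" */
         set =5 $run number '0' ;      /* literal: '0' is re-expanded whenever string 5 is expanded */
         set 10 $$1 ;                  /* copy string 1 onto string 10 */

Remember that long strings spill over into the following variables (every 4 characters take one variable), so leave enough unused variables after each string.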

 

Private variables.- When input files are nested, all the variables named in preceding files are "private"; that is, an internal numbering starting from the last unnamed variable is used.  Accessing variables by their number is then relative to a file.  If a name is given for a variable, the variable names are checked from latest to earliest, so that if variable names are duplicated in nested files, the innermost file will refer to its own variable (convenient for simple counters like "i" or "a"); names defined in a preceding file will be recognized only if not defined later.  The number of private variables is determined by the largest unnamed variable, so if you want to refer to some variables by their number (for simple counters), it is wise to reserve the lowest numbers for those, and start naming from variable 10 or so (this keeps "memory management" easy and clean).  You can also make more variables than the named ones "private," simply by issuing a private n; command (where n is the number of private variables) at some point before calling the next input file.  At the start of each file, you can also include a safe vars n; statement (which checks whether n user variables are actually available to that file, and produces an error message otherwise).

In this way, each input file can have its own internal numbering of the variables; this makes it possible to define "functions" in files that can be safely called from other files.  The only way a file can access a variable named in a previously opened file is by referring to it by name (and only if the same name was not also defined in the file itself; in that case, the file's own variable is accessed).  When an input file is closed, all the variables that had been named in that file are denamed; if you want to rename variables within a file, you must first issue a var-; command, which denames all non-private variables (var-N; denames all variables from N upward, N included). 
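A sketch of such a "function" (the file name square.run is only illustrative; %1 denotes the first argument, and return and safe vars are as described elsewhere in this section).  The file square.run contains:

         safe vars 1 ;                 /* make sure a variable is available to this file */
         var = 0 result ;              /* result is private to this file */
         set result ( %1 * %1 ) ;      /* square the first argument */
         return 'result' ;             /* close the file, setting exstatus */

Another file can then call it with run square.run 12; and read the result from the internal variable exstatus.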

 

Changing values for arrays.- The setarray command sets arrays.  This can be used to read in matrices (of any dimension):

 

  setarray 5,4 matrix
       0    1    2    3
       4    5    6    7
       8    9   10   11
      12   13   14   15
      16   17   18   19
      ;

 

this will set the values for a variable called matrix.  The numbers can be any valid expression; note that even 0/1 matrices must have a blank space between the numbers.

 

Naming user variables.- User variables can be named, and accessed through their name:

 

         var =

0 tree_number

1 result

2 other ;

 

If private variables have been declared in previous files, the numbers (which can themselves be expressions) correspond to the internal numbering for the current file.  If any number is replaced by a +, this is interpreted as "the lowest unnamed variable" (if there are private variables, the first time + is used within a file it refers to the lowest number that can be accessed from that file).  If you are going to store values starting from variable number N, you can define the number of variables to reserve by using a comma and the number of cells after the number:

 

var =
        N, (ntax+1)               Taxon_list
        +, (ntrees+1)             Tree_list
        +, (nchar+1)              Char_list
        +, (ntax+1), (nchar+1)    Matrix
        ;

 

Suppose N is 10.  After defining the name of variable 10 as "Taxon_list", the program considers that the next variable to be named is variable number 10+ntax+1, so that the arrays do not overlap (note that the internal variable ntax is the number of taxa minus one, so ntax+1 cells provide one value per taxon).  If there are 12 taxa and 15 trees, the name "Tree_list" is assigned to variable number 22, and the name "Char_list" to variable 37.  Since it is always possible to refer to variables by their names alone, the user does not need to worry about which particular variable number a name has been assigned to. 

         An alternative (equivalent) format for declaration of multidimensional arrays is:

 

var =
        N       Taxon_list [ (ntax+1) ]
        +       Tree_list [ (ntrees+1) ]
        +       Char_list [ (nchar+1) ]
        +       Matrix [ (ntax+1) , (nchar+1) ]
        ;

         When using the square-bracket notation for declaration, the commas are optional. 

Declaring the variables as multidimensional automatically sets up the arrays in a proper way; changing the values that organize the arrays can "disassemble" the matrix and make further accesses difficult or impossible.  To prevent this, the macros can check whether all accesses to arrays are legal (access is then permitted only if the variable has been declared as an array).  Without protection, the user is free to access and change any values (e.g. to set up arrays by hand).  The protection level is set to N with macro prot N.

 

Floating point printing.-  By default, all numbers in user variables are stored as floating point.  When printing values, you can use different numbers of significant digits.  Note that many scripting operations (internally) involve expanding the numbers and then parsing the resulting strings, so changing the number of significant digits changes the precision with which many routines are executed.

The command macfloat N; sets the number of digits to be N (if 0, the printed numbers look like integers).  The option macfloat e; uses exponential notation (not readable by commands, useful only to print out tables). 

         Storing the values as floating point uses 64 bits per value; if you want to use only 32 bits (integer calculations, to save memory), issue a macfloat-; command before turning the macros on.

 

Internal variables.-  The user can access a number of internal variables (i.e. number of taxa, characters, etc.).  You can get a list of these by entering help+; at the command prompt.  Internal variables can be accessed only in certain contexts in the macro language.  The general rule is that user variables can be accessed from any command (by enclosing their number or name in single quotes), while internal variables can be read only by the macro commands (loop, if, set, var, etc.).
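A two-line sketch of this distinction: internal variables are read within macro commands, and their values are passed to other commands through user variables:

         set 0 ( ntax + 1 ) ;          /* internal variable ntax, read by the macro command set */
         hold '0' ;                    /* user variable 0, legal in any command */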

Note.- The internal variables are a key component of the TNT macro language; if accessing other internal variables, besides the ones already included, could be useful to some users, get in touch with PAG and ask him about the possibility of including those variables in TNT (given the design of the macro language, it is really easy to add new internal variables).

 

Flow control.- The flow control is established with the if and loop commands.  There are two possible syntaxes for the if command; one is:

 

         if ( condition )

                action(s)

         end

 

where "condition" is any valid expression, and "action(s)" is any set of valid TNT commands.  If the condition is "satisfied" (i.e. evaluates to non-zero), then action(s) are executed; otherwise, all actions before the end are skipped (note that TNT interprets its scripts as it reads them, so that the action(s) may themselves contain syntax errors, which won't be discovered if the action(s) are skipped; the errors will only be discovered when the actions are executed).  The alternative is

 

         if ( condition )

                  action(s) A ;

         else

                  action(s) B ;

         end

 

which will execute action(s) A if the condition is satisfied, action(s) B otherwise.  Each if must always be matched by else or end; each else must always be matched by end. Nested if commands are legal, without restrictions as to the number of nestings.

         The other command for flow control is loop.  The syntax is,

 

         loop i+j  k  action(s)  stop

 

the numbers i, j, and k can be numbers, variables, or expressions.  This will repeatedly execute action(s), ( k - i + j ) / j times.  If the symbol '#' followed by the number of the loop (see below) is included in action(s), it is replaced by the value of the iteration being done.  Thus, the symbol will be replaced by i the first time, incrementing by j in subsequent iterations, until the value exceeds k (if i is greater than k, the loop is decreasing, and j is taken as negative, even if preceded by +).

         Nested loops are possible (the maximum nesting is determined with macro*).  To refer to the value for each loop, use the '#' followed by the corresponding number.  Each loop must always be matched by a stop.  Within the loop, the commands endloop, setloop n, and continue, are recognized (endloop ends the loop, setloop resets the value of the loop to n, and continue skips all the action(s) between the current point and the stop that ends the loop).

         Essentially any action can be done from within a loop, except denaming user variables.  The loops use RAM assigned to the macros, so if some long loop cannot be executed under certain conditions, switch the macro option off and increase the RAM for macros.
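As a sketch of a simple loop (it assumes some trees are in memory; the variable name total is ours), the following adds up the lengths of trees 0 through ntrees:

         var = 0 total ;
         set total 0 ;
         loop 0 ntrees                     /* #1 takes the values 0, 1, ... ntrees */
              set total 'total' + length [ #1 ] ;
              stop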

 

Return values.-  When a file with instructions is closed, it may be desirable for it to return some kind of value (having it store the exit value in a user variable may be cumbersome and make the file less flexible).  The command return N; closes the input file and sets the internal variable exstatus to the value N.

 

Recursive calls.-  Input files with instructions can be called recursively.  To call a file from itself, use the recurse command (followed, if desired, by arguments, just as for the run command).  This (together with the declaration of private variables) makes it possible to define quite complex routines.  The number of times a file can call itself with recurse depends on the maximum number of input files (default = 10); the number of input files allowed is changed with mxproc (the file "exactsol.run" makes this change automatically).  An example follows, in the files "exactsol.run" and "addtax.run."  It finds an exact solution for any number of taxa (using a complete down-pass optimization for each partial tree, so it is rather inefficient).  File "exactsol.run" must contain:

 

silent =all ;

set 0 ntax + 2 ;

mxproc '0' ;                       /*  make sure you have enough input calls */

var = 0  attax  

         1  bound  

         2  wkdone ;               /*  attax , bound , and wkdone are private */

set bound 10000000;  /* use a big initial bound */

keep 0 ;

tread ( 0 ( 1 2 ) ) ;                /* set an initial network */

set attax 3 ;                         /*  attax = next taxon to add */

report- ;

addtax ;                              /*  this is the "recursive" function */

progress/ ;

report= ;

tc 1. ;

silent - all ;

proc/ ;

 

File "addtax.run" must contain:

 

set 0 'attax' ;

set 1 ntax + 1 ;

private 2 ;

loop 1 nnodes [ 0 ]

    if ( #1 == '1' || !isintree [ 0 #1 ] )

          continue ;

          end

    if ( 'attax' == 4 )

         set wkdone ++ ;

         progress (('wkdone'*100)/15) 100 Working ;  /* progress */

         end

    edit ]  0  '1'  'attax'  #1 ;                                           /* attach new taxon */

    set 2 length [ 0 ] ;

    if ( '2' <= 'bound' )

         if ( 'attax' < ntax )

             set attax ++ ;

             recurse ;                                                             /* call yourself !! */

             set attax -- ;

         else 

              if ( '2' < 'bound' )                                              /* update bound */

                  keep 1 ;

                  set bound '2' ;

                  end

              if ( '2' == 'bound' && ( ( ntrees + 1 ) < maxtrees ) )

                     copytree 0 ;                                       /* save multiple trees */

                     end

              end

         end

    pruntax 0 / 'attax' ;                               /* detach taxon */

    stop

proc/;

 

Note that the C style comments in the two example files are perfectly valid.

 

Goto.- The instruction goto filename tag [arguments] parses file filename starting from the point labeled tag (with label tag).  The file can be the same file, or a different one.   This can be used to have related routines in the same file (a sort of "library").  A file can be defined as the default file to parse, with goto = filename.  Subsequent calls to goto must specify only the tag (and arguments, if any).
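As a sketch (the file name mylib.run and the tag name are only examples; tags are marked with the label command, as noted above):

         goto = mylib.run ;            /* make mylib.run the default file to parse */
         goto start ;                  /* parse mylib.run from the point labeled start */

where mylib.run contains, among other routines,

         label start
              quote starting... ;
              proc/ ;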

 

Branch-swapping.- In a way similar to the loop command, the commands sprit and tbrit initiate a loop, with the difference that in each iteration the tree given as argument N is changed (under SPR or TBR branch-swapping):

 

         sprit N    action(s)    stop

 

Action(s) can be any valid TNT commands, plus three additional commands recognized only from within sprit or tbrit:

        

     continue       skip remaining actions and do the next rearrangement

     endswap        end the swapping (if never saved, tree N is unmodified)

     resetswap      save the current rearrangement and restart swapping

 

While within a sprit or tbrit loop, the internal variable percswap equals the percentage of rearrangements done to complete swapping on current tree.  Together with the progress command, this can be used to report partial progress of the macro instructions.

 

Sprit or tbrit can be nested with loops, but no sprit or tbrit can be nested within a sprit or tbrit.
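The following sketch SPR-swaps tree 0, saving any shorter rearrangement found and reporting progress (the variable names best and pc are ours):

         var = 0 best 1 pc ;
         set best length [ 0 ] ;
         sprit 0
              if ( length [ 0 ] < 'best' )
                   set best length [ 0 ] ;
                   resetswap ;              /* save this rearrangement, restart swapping */
                   end
              set pc percswap ;
              progress 'pc' 100 Swapping ;  /* report percentage of rearrangements done */
              stop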

 

Defining dialogs (Windows only).- It is possible to define templates for dialog boxes, with the command opendlg.  Names of files can be written onto variables by using the getfname command.  See the online help for syntax; the file "dialog.run" (included in the self-extracting file zipdruns.exe) contains an example, which emulates the dialog box for Analyze/TraditionalSearch.  A more elaborate script, "bremer.run", allows calculating Bremer supports using various options.

 

 

Citations  (back to index)

 

This section provides citations for the sources of some methods used in TNT.  If you use the method in a published paper, you should (besides citing TNT itself) cite the proper source for the method.

 

Bremer, K. 1990.  Combinable component consensus.  Cladistics 6:369-372 (as one might guess from the title, this paper describes the combinable component consensus).

Bremer, K. 1994.  Branch support and tree stability.  Cladistics 10:295-304 (a formal description of the "bremer support").

Coddington, J., and N. Scharff. 1994.  Problems with zero-length branches.  Cladistics 10:415-423 (a discussion of the problems associated with tree-collapsing, and proposal of "rule 4").

De Laet, J. 1997.  A reconsideration of three-item analysis, the use of implied weights in cladistics, and a practical application in Gentianaceae.  Ph.D. Thesis, Katholieke Universiteit Leuven, Faculteit der Wetenschappen, Leuven, Belgium, 214 pp. (this contains discussion of implied weighting in general, and it discusses the idea of recoding additive characters in binary form for implied weighting).

Farris, J., V. Albert, M. Källersjö, D. Lipscomb, and A. Kluge. 1996.  Parsimony jackknifing outperforms neighbor joining.  Cladistics 12:99-124 (this paper describes the idea of resampling characters with independent probability of removal, to eliminate influence of uninformative or irrelevant characters, as well as the idea of superficial searches for consensus estimations).

Farris, J. 1970.  Methods for computing Wagner trees.  Syst. Zool. 19:83-92 (the paper describing the first formal approach to Wagner trees and optimization for additive and continuous characters).

Farris, J. 1990.  Phenetics in camouflage. Cladistics 6:91-100 (a discussion of problems associated with some methods to discretize continuous characters in phylogenetic analysis).

Felsenstein, J. 1985.  Confidence limits on phylogenies: an approach using the bootstrap.  Evolution 39:783-791 (the original paper proposing to use the bootstrap on phylogenies).

Fitch, W. 1971.  Toward defining the course of evolution: minimal change for a specific tree topology.  Syst. Zool. 20:406-416 (this paper describes optimization for non-additive characters).

Goloboff, P. 1993.  Estimating character weights during tree search.  Cladistics 9:83-91 (description of the method of "implied weights").

Goloboff, P. 1994.  Character optimization and calculation of tree lengths.  Cladistics 9:433-436 (description of the basic algorithm for faster calculation of tree lengths during searches; instead of making T-2 node comparisons for each tree of T taxa, the algorithm requires making only one node comparison per tree).  

Goloboff, P. 1996.  Methods for faster parsimony analysis. Cladistics 12:199-220 (this paper describes the "two pass incremental optimization" which is the basis for calculations in TNT.  It also describes the "qcollapse" algorithm for pre-collapsing of trees during searches).

Goloboff, P. 1998.  Tree searches under Sankoff parsimony.  Cladistics 14:229-237 (this paper describes calculation of tree lengths for Sankoff characters; the method used in TNT is a combination of the exact method described in this paper, with the incremental optimization described in Goloboff, 1996).

Goloboff, P. 1999. Analyzing large data sets in reasonable times: solutions for composite optima. Cladistics 15:415-428 (description of tree-drifting, sectorial searches, tree-fusing, and the idea of searching until the consensus stabilizes).

Goloboff, P. 2002. Optimization of polytomies: state set and parallel operations. Molecular Phylogenetics and Evolution 22:269-275 (a description of the method used in TNT to optimize polytomies for non-additive characters).

Goloboff, P. 2007.  Calculating SPR-distances between trees.  Cladistics (in press).

Goloboff, P., C. Mattoni, and S. Quinteros. 2005. Continuous characters analyzed as such.  Cladistics 22:589-601 (a description of the method used to optimize continuous characters on polytomous trees, and a description of the implementation of continuous characters in TNT; see also Farris, 1970).

Goloboff, P., and D. Pol. 2002. Semi-strict supertrees. Cladistics 18:514-525 (description of the supertree method used in TNT, and some cautionary remarks about the use and interpretation of supertrees).

Goloboff, P., and J. Farris. 2001. Methods for quick consensus estimation. Cladistics 17:S26-S34 (description of the methods to estimate consensuses, the relative bremer support, and tree collapsing using branch swapping).

Goloboff, P., J. Farris, M. Källersjö, B. Oxelman, M. Ramirez, and C. Szumik. 2003.  Improvements to resampling measures of group support. Cladistics 19:324-332 (description of symmetric resampling, frequency differences, and frequency slopes).

Maddison, W. 1989.  Reconstructing character evolution on polytomous cladograms.  Cladistics 5: 365-377. (the paper describing the method used to calculate length of "soft" polytomies).

Nixon, K. 1999.  The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics 15:407-414 (the original description of the ratchet, already a citation classic).

Hendy, M., and D. Penny. 1982.  Branch and bound algorithms to determine minimal evolutionary trees. Mathematical Biosciences 59:277-290 (description of the implicit enumeration (branch-and-bound) algorithm, used to guarantee finding all trees of best score).