Progressive algorithm for large-scale multiple sequence alignments.
-medoidtree switch) for ultra-scale alignments:
-gt nj option). NOTE:
Neighbour joining trees are calculated with a use of original
O(N3) algorithm, thus their applicability
on large sets is limited (unless they are used as subtrees with Medoid
Tree heuristic).-gz switch).-keep-duplicates switch.git clone https://github.com/refresh-bio/FAMSA --recursive
cd FAMSA && make
# align sequences with default parameters (single linkage tree)
./famsa ./test/adeno_fiber/adeno_fiber sl.aln
# align sequences using UPGMA tree with 8 computing threads, store the result in a gzip archive
./famsa -gt upgma -t 8 -gz ./test/adeno_fiber/adeno_fiber upgma.aln.gz
# export a neighbour joining guide tree to the Newick format
./famsa -gt nj -gt_export ./test/adeno_fiber/adeno_fiber nj.dnd
# align sequences with the previously generated guide tree
./famsa -gt import nj.dnd ./test/adeno_fiber/adeno_fiber nj.aln
# align sequences with an approximated medoid guide tree and UPGMA subtrees
./famsa -medoidtree -gt upgma ./test/hemopexin/hemopexin upgma.medoid.aln
# export a distance matrix to the CSV format (lower triangular)
./famsa -dist_export ./test/adeno_fiber/adeno_fiber dist.csv
# export a pairwise identity (PID) matrix to the CSV format (square)
./famsa -dist_export -pid -square_matrix ./test/adeno_fiber/adeno_fiber pid.csv
# profile-profile alignment without refining output
./famsa -refine_mode off ./test/adeno_fiber/upgma.no_refine.part1.fasta ./test/adeno_fiber/upgma.no_refine.part2.fasta pp.fastafamsa [options] <input_file> [<input_file_2>] <output_file>
Positional parameters: * input_file,
input_file_2 - input files in FASTA format (optionally
gzipped); first input can be replaced with STDIN string to read from
standard input; action depends on the number of input files: * one input
- multiple sequence alignment (input gaps, if present, are removed prior
the alignment), * two inputs - profile-profile aligment (gaps are
preserved). * output_file - output file (pass STDOUT when
writing to standard output); available outputs: * alignment in FASTA
format, * guide tree in Newick format (-gt_export option
specified), * distance matrix in CSV format (-dist_export
option specified).
Options: * -help - show advanced options *
-t <value> - no. of threads, NOTE: exceeding number
of physical (not logical) cores decreases performance, 0 indicates half
of all the logical cores (default: 0) * -v - verbose mode,
show timing information (default: disabled)
-gt <sl | upgma | nj | import <file>> - the
guide tree method (default: sl):
sl - single linkage,upgma - UPGMA,nj - neighbour joining,import <file> - import from a Newick file.-medoidtree - use MedoidTree heuristic for speeding up
tree construction (default: disabled)-medoid_threshold <n_seqs> - if specified, medoid
trees are used only for sets with n_seqs or more-gt_export - export a guide tree to output file in the
Newick format-dist_export - export a distance matrix to output file
in CSV format-square_matrix - generate a square distance matrix
instead of a default triangle-pid - calculate percent identity (the number of
matching residues divided by the shorter sequence length) instead of
distance-keep-duplicates - keep duplicated sequences during
alignment (default: disabled - duplicates are removed prior and restored
after the alignment)-gz - enable gzipped output (default: disabled)-gz-lev <value> - gzip compression level [0-9]
(default: 7)-refine_mode <on | off | auto> - refinement mode
(default: auto - the refinement is enabled for sets <=
1000 seq.)FAMSA has the ability to import/export alignment guide trees in Newick format. E.g., in order to generate a UPGMA tree from the input.fasta file and store it in the tree.dnd file, run:
famsa -gt upgma -gt_export input.fasta tree.dnd
To align the sequences from input.fasta using the tree from tree.dnd and store the result in out.fasta, run:
famsa -gt import tree.dnd input.fasta out.fasta
Below one can find example guide tree file for sequences A, B, and C:
(A:0.1,(B:0.2,C:0.3):0.4);
Note, that when importing the tree, the branch lengths are not taken into account, though they have to be specified in a file for successful parsing. When exporting the tree, all the branches are assigned with length 1, thus only the structure of the tree can be restored (we plan to output real lengths in the future release):
(A:1.0,(B:1.0,C:1.0):1.0);
The major algorithmic features in FAMSA are: * Pairwise distances
based on the longest common subsequence (LCS). Thanks to the bit-level
parallelism and utilization of SIMD extensions, LCS can be computed very
fast. * Single-linkage guide trees. While being very accurate,
single-linkage trees can be established without storing entire distance
matrix, which makes them suitable for large alignments. Although, the
alternative guide tree algorithms like UPGMA and neighbour joining are
also provided. * The new heuristic based on K-Medoid clustering for
generating fast guide trees. Medoid trees can be calculated in
O(N logN) time and work with all types of
subtrees (single linkage, UPGMA, NJ). The heuristic can be enabled with
-medoidtree switch and allow aligning millions of sequences
in minutes.
The analysis was performed on our extHomFam 2 benchmark produced by combining Homstrad (March 2020) references with Pfam 33.1 families (NCBI variant). The data set was deposited at Zenodo: https://zenodo.org/record/6524237. The following algorithms were investigated:
| Name | Version | Command line |
|---|---|---|
| ClustalΩ | 1.2.4 | clustalo --threads=32 -i <input> -o <output> |
| ClustalΩ iter2 | 1.2.4 | clustalo --threads=32 --iter 2 -i <input> -o <output> |
| MAFFT PartTree | 7.453 | mafft --thread 32 --anysymbol --quiet --parttree <input> -o <output> |
| MAFFT DPPartTree | 7.453 | mafft --thread 32 --anysymbol --quiet --dpparttree <input> -o <output> |
| Kalign3 | 3.3.2 | kalign -i <input> -o <output> |
| FAMSA | 1.6.2 | famsa -t 32 <input> <output> |
| FAMSA 2 | 2.0.1 | famsa -t 32 -gz <input> <output> |
| FAMSA 2 Medoid | 2.0.1 | famsa -t 32 -medoidtree -gt upgma -gz <input> <output> |
The tests were performed with 32 computing threads on a machine with
AMD Ryzen Threadripper 3990X CPU and 256 GB of RAM. For each extHomFam 2
subset we measured a fraction of properly aligned columns (TC score) as
well as a total running time and a maximum memory requirements. The
results are presented in the figure below. Notches at boxplots indicate
95% confidence interval for median, triangle represent means. The
missing series for some algorithm-set pairs indicate that the running
times exceeded a week. Kalign3 failed to process 10 families (5 in
second, 3 in fourth, and 2 in the largest subset). FAMSA 2 alignments
were stored in gzip format (-gz switch).
The most important observations are as follows: * FAMSA 2 was superior in terms of accuracy to all the competitors. Only on the smallest families (N < 10k) ClustalΩ kept up with our algorithm. * The advantage of FAMSA 2 increased with the number of sequences and reached 20-30 percent points for (100k, 250k] subset. * FAMSA 2 with medoid trees offered astonishing throughput (a familiy PF00005 of 3 million ABC transporters was aligned in 5 minutes) with accuracy only slightly inferior to that of the default single linkage trees. * None of the competing algorithms was able to complete all the families in the largest [250k, 3M) subset. * The memory requirements of FAMSA 2 allow ultra-scale analyzes at a desktop computer (24 GB for 3M sequences).
Benchmark data sets developed and used in the FAMSA study: * extHomFam: https://doi.org/10.7910/DVN/BO2SVW * extHomFam 2: https://zenodo.org/record/6524237