We will demonstrate an example run of DIAMOND on a small dataset here. For this we assume the tool has been installed according to the installation instructions.
First we will download the SCOPe database of structured domains in FASTA format, containing 14,323 sequences:
wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa
The next step is to setup a binary DIAMOND database file that can be used for subsequent searches against the database:
diamond makedb --in astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40
A binary file (astral40.dmnd
) containing the database sequences will be created in the current working directory. We will now conduct a search against this database using the same FASTA file of SCOPe domains as queries:
diamond blastp -q astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40 -o out.tsv --very-sensitive
Here we set the output file to be out.tsv
and also use the --very-sensitive
setting. DIAMOND has a number of sensitivity settings to accomodate different applications. The default mode is the fastest and tailored towards finding homologies of >70% sequence identity, the --sensitive
mode is tailored to hits of >40% identity, while the --very-sensitive
and --ultra-sensitive
modes provide sensitivity accross the whole range of pairwise alignments.
If the run completed successfully, at the end we will see this console output providing some statistics about the number of hits that were found:
Total time = 5.494s
Reported 69284 pairwise alignments, 69284 HSPs.
14323 queries aligned.
We will inspect the beginning of the output file like so:
head out.tsv
Here we will see the first ten lines of the file, showing 10 pairwise alignments of the first 3 query sequences:
d1dlwa_ d1dlwa_ 100.0 116 0 0 1 116 1 116 1.1e-58 220.7
d1dlwa_ d2gkma_ 35.4 113 73 0 1 113 13 125 1.4e-16 80.9
d1dlwa_ d4i0va_ 31.9 119 75 2 1 113 2 120 9.5e-10 58.2
d1dlwa_ d6bmea_ 32.5 114 73 1 1 110 2 115 2.3e-08 53.5
d2gkma_ d2gkma_ 100.0 127 0 0 1 127 1 127 5.4e-67 248.4
d2gkma_ d1dlwa_ 34.8 115 75 0 13 127 1 115 1.4e-17 84.3
d2gkma_ d4i0va_ 33.6 110 69 1 13 118 2 111 2.4e-14 73.6
d2gkma_ d6bmea_ 35.5 110 67 1 13 118 2 111 7.7e-13 68.6
d2gkma_ d2bkma_ 37.3 67 38 2 13 76 5 70 1.7e-04 40.8
d1ngka_ d1ngka_ 100.0 126 0 0 1 126 1 126 1.2e-69 257.3
The file is generated in tabular-separated (TSV) format composed of 12 fields, a format corresponding to the format generated by BLAST using the option -outfmt 6
. The 12 fields are:
>
character until the first blank.Note that this output format can be customized with a number of non-default fields that are available. It is generally advisable to customize the format and limit it to the information required by downstream processing, as this may substantially increase performance. If no fields are selected that require alignment traceback (such as coordinates, length, identity, gap openings and mismatches), performance will be gained by omitting traceback computations.