OGRI (Overall Genome Relatedness Indices; Chun & Rainey 2014) is a set of command line programs written in Bash to compute pairwise similarity measures between whole genome sequences. Every computed similarity is based on local sequence alignments:
▹ Average Nucleotide Identity (ANI; Goris et al. 2007),
▹ Percentage of Conserved DNA (cDNA; Goris et al. 2007),
▹ OrthoANI (oANI; Lee et al. 2016),
▹ Percentage Of Conserved Proteins (POCP; Qin et al. 2014),
▹ CDS-based ANI (cANI; Konstantinidis & Tiedje 2005a; gANI; Varghese et al. 2015),
▹ Alignment Fraction (AF; Varghese et al. 2015),
▹ (one-way) Average Amino-acid Identity (AAI; Konstantinidis & Tiedje 2005b),
▹ Proteome Coverage (ProCov; Kim et al. 2021),
▹ Reciprocal AAI (rAAI; Nicholson et al. 2020).
The key aim of OGRI is to provide a wide range of genome proximity metrics in an accurate way, i.e. implemented following the specific descriptions given by each associated article (see Methods). Consequently, OGRI is not expected to run very fast (e.g. OGRI_B requires up to one minute to deal with two 5 Mbp-long genomes on 12 threads), even though faster running times are expected with a larger number of threads.
Every OGRI tool runs on UNIX, Linux and most OS X operating systems.
You will need to install the required programs listed in the following table, or to verify that they are already installed with the required version.
OGRI tool | program | package | version | sources |
---|---|---|---|---|
OGRI_B | gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
OGRI_B | prodigal | - | ≥ 2.6.3 | github.com/hyattpd/Prodigal |
OGRI_B | makeblastdb blastn blastp tblastn | blast+ | ≥ 2.12.0 | ftp.ncbi.nlm.nih.gov/blast/executables/blast+ |
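Before running OGRI_B, it can be convenient to verify that every program listed above is reachable from your $PATH. The following is a minimal sketch, not part of OGRI: the function name `check_programs` is hypothetical, and it only tests for presence (it does not compare version numbers).

```shell
# Hypothetical helper (not part of OGRI): report which programs are on $PATH.
# Returns 0 when every program is found, 1 otherwise.
check_programs() {
  missing=0
  for prog in "$@"; do
    if command -v "$prog" >/dev/null 2>&1; then
      echo "found:   $prog"
    else
      echo "missing: $prog" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For instance, `check_programs gawk prodigal makeblastdb blastn blastp tblastn` lists each dependency of the table and returns a non-zero status if at least one is missing.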
Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/OGRI.git
Go to the directory OGRI/ and give the execute permission to the file OGRI_B.sh:
cd OGRI/
chmod +x OGRI_B.sh
and run it with the following command line model:
./OGRI_B.sh [options]
If at least one of the indicated programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), the OGRI tools will exit with an error message. To set a required program that is not available on your $PATH variable, edit the file and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS.
Run OGRI_B without option to read the following documentation:
USAGE: OGRI.sh [OPTIONS] <fasta1> <fasta2> [<fasta3> ...]
OPTIONS:
-x only OGRIs based on genome fragments (ANI, oANI)
-y only OGRIs based on CDS (POCP, gANI, AF, AAI, ProCov, rAAI)
-z only OGRIs based on reciprocal searches (oANI, gANI, AF, ProCov, rAAI)
-b <int> number of bootstrap replicates for confidence intervals (default: 200)
-r <int> tab-delimited raw output (default: detailed output)
-t <int> number of threads (default: 2)
-h prints this help and exits
Each input file should be in FASTA format and contain nucleotide sequences. At least two input files should be specified. If more than two files are specified, the pairwise similarities are computed between the genome in the first file and the genome in each of the other files.
By default, all OGRIs are computed (see Methods below). However, the number of computed OGRIs can be reduced using options -x, -y or -z.
The 95% confidence interval is estimated for most OGRIs using a bootstrap approach. The default number of bootstrap replicates (i.e. 200) can be modified with option -b.
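To illustrate what a bootstrap confidence interval looks like in practice, here is a minimal sketch of a percentile bootstrap of a mean, written with awk and sort. It is an assumption-laden illustration and not OGRI's actual procedure: the function name `bootstrap_ci`, the seed argument, and the way the 2.5% and 97.5% ranks are picked are all hypothetical choices.

```shell
# Hypothetical sketch (not OGRI's code): percentile bootstrap CI of the mean.
# usage: bootstrap_ci [replicates] [seed] < one_value_per_line
bootstrap_ci() {
  awk -v B="${1:-200}" -v seed="${2:-1}" '
    { x[NR] = $1 }                       # store the observed values
    END {
      if (NR == 0) exit 1
      srand(seed)
      for (b = 1; b <= B; b++) {         # one resampled mean per replicate
        sum = 0
        for (i = 1; i <= NR; i++) sum += x[int(rand() * NR) + 1]
        printf "%.4f\n", sum / NR
      }
    }' | LC_ALL=C sort -n | awk '
    { m[NR] = $1 }                       # sorted replicate means
    END {
      if (NR == 0) exit 1
      lo = int(0.025 * NR) + 1; hi = int(0.975 * NR)
      if (hi < lo) hi = lo
      printf "%.2f-%.2f\n", m[lo], m[hi]
    }'
}
```

Feeding it one identity percentage per line (e.g. one per selected local alignment) prints an interval in the same `low-high` layout as the bracketed values reported by OGRI_B; increasing the number of replicates makes the interval bounds more stable at the cost of running time.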
Faster running times can be observed when using a large number of threads (option -t).
By default, progress bars and detailed results are written to stderr and stdout, respectively. The progress bars can be suppressed by ending the command line with 2>/dev/null.
Raw tab-delimited results (stdout) can be obtained with option -r. Field names and numbers are summarized in the table below (see Methods for the meaning of each field).
field name | field number (default) | field number (option -x) | field number (option -y) | field number (option -z) |
---|---|---|---|---|
GENO1 | 1 | 1 | 1 | 1 |
GENO2 | 2 | 2 | 2 | 2 |
lgt1 | 3 | 3 | 3 | 3 |
lgt2 | 4 | 4 | 4 | 4 |
nFRA1 | 5 | 5 | | |
nFRA2 | 6 | 6 | | |
nFRA12 | 7 | 7 | | |
nFRA21 | 8 | 8 | | |
cDNA12 | 9 | 9 | | |
cDNA21 | 10 | 10 | | |
ANI12 [CI_ANI12] | 11 | 11 | | |
ANI21 [CI_ANI21] | 12 | 12 | | |
ANI [CI_ANI] | 13 | 13 | | |
nfRBH | 14 | 14 | | 5 |
oANI [CI_oANI] | 15 | 15 | | 6 |
nCDS1 | 16 | | 5 | 7 |
nCDS2 | 17 | | 6 | 8 |
nCDS12 | 18 | | 7 | |
nCDS21 | 19 | | 8 | |
POCP | 20 | | 9 | |
cCDS12 | 21 | | 10 | |
cCDS21 | 22 | | 11 | |
cANI12 [CI_cANI12] | 23 | | 12 | |
cANI21 [CI_cANI21] | 24 | | 13 | |
cANI [CI_cANI] | 25 | | 14 | |
ngRBH | 26 | | 15 | 9 |
gANI12 [CI_gANI12] | 27 | | 16 | 10 |
gANI21 [CI_gANI21] | 28 | | 17 | 11 |
gANI [CI_gANI] | 29 | | 18 | 12 |
AF12 [CI_AF12] | 30 | | 19 | 13 |
AF21 [CI_AF21] | 31 | | 20 | 14 |
AF [AF_CI] | 32 | | 21 | 15 |
mCDS12 | 33 | | 22 | |
mCDS21 | 34 | | 23 | |
AAI12 [CI_AAI12] | 35 | | 24 | |
AAI21 [CI_AAI21] | 36 | | 25 | |
AAI [CI_AAI] | 37 | | 26 | |
naRBH | 38 | | 27 | 16 |
ProCov | 39 | | 28 | 17 |
rAAI [CI_rAAI] | 40 | | 29 | 18 |
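Because field numbers change with options -x, -y and -z, selecting columns by name can be less error-prone than selecting them by number. The sketch below is a hypothetical helper (the name `ogri_select` is not part of OGRI); it assumes that the first line of the -r output is the header, that fields are tab-separated, and that each header starts with the field name (optionally followed by a bracketed CI label, as in the table above).

```shell
# Hypothetical helper (not part of OGRI): keep GENO2 plus the named fields.
# usage: OGRI_B.sh -r ... 2>/dev/null | ogri_select oANI,rAAI
ogri_select() {
  awk -F'\t' -v want="$1" '
    NR == 1 {                            # locate each requested name in the header
      n = split(want, names, ",")
      for (k = 1; k <= n; k++)
        for (i = 1; i <= NF; i++) {
          h = $i; sub(/^#/, "", h)       # drop the leading # of the first header
          split(h, parts, " ")           # "oANI [CI_oANI]" -> parts[1] = "oANI"
          if (parts[1] == names[k]) col[k] = i
        }
    }
    {
      line = $2                          # GENO2 is always field 2
      for (k = 1; k <= n; k++) if (k in col) line = line "\t" $(col[k])
      print line
    }'
}
```

The same call then works unchanged whether OGRI_B was run with default settings or with one of the options -x, -y or -z, provided the requested fields are present in the output.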
Each input genome nucleotide sequence GENOi is decomposed into three sets:
▹ FRAGi: a set of consecutive fragments, each of length (at most) 1020 bp (Goris et al. 2007, Lee et al. 2016); OGRI extracts fragments containing only the character states A, C, G and T (case insensitive), and discards all fragments shorter than 920 bp;
▹ CDSNi: a set of coding codon sequences; OGRI uses Prodigal (Hyatt et al. 2010) to build this set; every codon sequence shorter than 33 codons is discarded, as well as any sequence containing a character state other than those of the IUPAC set {A, C, G, T}; of note, all stop codons are kept;
▹ CDSAi: a set of coding amino acid sequences; OGRI creates this set by translating every codon sequence in CDSNi; of note, every non-translatable codon is discarded.
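The construction of the fragment set can be pictured with the following sketch. It is only an approximation written for illustration: the function name `fragment_fasta` is hypothetical, the input is assumed to hold a single sequence, and the sequence is simply cut into fixed consecutive windows, which may differ from the way OGRI handles ambiguous character states.

```shell
# Hypothetical sketch (not OGRI's code): cut a single-sequence FASTA file into
# consecutive windows, keeping only A/C/G/T windows that are long enough.
# usage: fragment_fasta <fasta> [max_len] [min_len]
fragment_fasta() {
  awk -v max="${2:-1020}" -v min="${3:-920}" '
    !/^>/ { seq = seq toupper($0) }      # concatenate sequence lines, case insensitive
    END {
      for (pos = 1; pos <= length(seq); pos += max) {
        frag = substr(seq, pos, max)
        if (length(frag) >= min && frag !~ /[^ACGT]/)
          print ">frag" (++n) "\n" frag
      }
    }' "$1"
}
```

With the default lengths, a 5498578 bp sequence made only of A, C, G and T yields 5390 windows of 1020 bp, the trailing 778 bp being discarded because it is shorter than 920 bp.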
Given two genomes 1 and 2, different local alignments (best hit of each sequence from a set against the sequences from another set) are obtained using different flavors of BLAST (Altschul et al. 1990; Camacho et al. 2008):
▹ FRAG1 against GENO2 (and reciprocally) using blastn (Altschul et al. 1990; Zhang et al. 2000) with tuned parameters, as described by Goris et al. (2007; see also Yoon et al. 2017);
▹ FRAG1 against FRAG2 (and reciprocally) using blastn with tuned parameters, as described by Lee et al. (2016; see also Yoon et al. 2017);
▹ CDSN1 against CDSN2 (and reciprocally) using blastn with tuned parameters (as described by Konstantinidis and Tiedje 2005a), as well as default parameters to approximate the NSimScan tool (Novichkov et al. 2016) used by Varghese et al. (2015);
▹ CDSA1 against GENO2 (and reciprocally) using tblastn (Gertz et al. 2006) with default parameters, as described by Konstantinidis and Tiedje (2005b);
▹ CDSA1 against CDSA2 (and reciprocally) using blastp (Altschul et al. 1997) with default parameters, as described by Qin et al. (2014); see also Nicholson et al. (2020), and Kim et al. (2021) for similar approaches.
These various local alignments are next specifically filtered, and the resulting sets of local similarities are used to derive different pairwise similarity measures:
▹ ANI: the local alignments of FRAG1 against GENO2 and FRAG2 against GENO1 are screened following the criteria of Goris et al. (2007), resulting in nFRA12 and nFRA21 remaining fragments and associated local alignments, respectively; these selected local alignments are used to derive the two percentages of conserved DNA cDNA12 and cDNA21, and the two pairwise similarity percentages ANI12 and ANI21 (as well as their average ANI), respectively, as described by Goris et al. (2007);
▹ OrthoANI: the local alignments of FRAG1 against FRAG2 and FRAG2 against FRAG1 are screened following the criteria of Lee et al. (2016), and next processed to identify reciprocal best hits (RBH), resulting in nfRBH remaining fragment pairs and associated local alignments; these selected local alignments are used to derive the similarity percentage oANI, as described by Lee et al. (2016);
▹ cANI: the local alignments of CDSN1 against CDSN2 and CDSN2 against CDSN1 are screened following the criteria of Konstantinidis and Tiedje (2005a), resulting in cCDS12 and cCDS21 remaining CDS and associated local alignments (at the nucleotide level), respectively; these selected local alignments are used to derive the two pairwise similarity percentages cANI12 and cANI21, respectively, as described by Konstantinidis and Tiedje (2005a), as well as their average cANI;
▹ gANI, AF: the local alignments (blastn, default parameters) of CDSN1 against CDSN2 and CDSN2 against CDSN1 are screened following the criteria of Varghese et al. (2015), and next processed to identify RBH, resulting in ngRBH remaining CDS pairs and associated local alignments (at the nucleotide level); these selected local alignments are used to derive the two pairwise similarity percentages gANI12 and gANI21, and the two alignment fractions AF12 and AF21, respectively, as described by Varghese et al. (2015); the final values gANI and AF are the average of gANI12 and gANI21, and of AF12 and AF21, respectively;
▹ (one-way) AAI: the local alignments of CDSA1 against GENO2 and CDSA2 against GENO1 are screened following the criteria of Konstantinidis and Tiedje (2005b), resulting in mCDS12 and mCDS21 remaining CDS and associated local alignments (at the amino acid level), respectively; these selected local alignments are used to derive the two pairwise similarity percentages AAI12 and AAI21, respectively, as described by Konstantinidis and Tiedje (2005b), as well as their average AAI; it is worth noting that this estimation of the AAI corresponds to the “AAI based on one-way BLAST” (sensu Konstantinidis and Tiedje 2005b), as opposed to the “AAI based on two-way BLAST” (sensu Konstantinidis and Tiedje 2005b);
▹ POCP: the local alignments of CDSA1 against CDSA2 and CDSA2 against CDSA1 are screened following the criteria of Qin et al. (2014; see also Nicholson et al. 2020, Kim et al. 2021), resulting in nCDS12 and nCDS21 remaining CDS and associated local alignments, respectively; these selected local alignments are used to derive the Percentage Of Conserved Proteins POCP, as described by Qin et al. (2014);
▹ ProCov, rAAI: the local alignments selected for computing POCP are processed to identify RBH, resulting in naRBH remaining CDS pairs and associated local alignments; these selected local alignments are used to derive the similarity percentage rAAI, as described by Nicholson et al. (2020; see also Kim et al. 2021), as well as the Proteome Coverage ProCov, as described by Kim et al. (2021); note that rAAI is quite comparable to the AAI based on two-way BLAST (sensu Konstantinidis and Tiedje 2005b).
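Several of the measures above rely on reciprocal best hits. The principle can be sketched as follows, assuming two tab-delimited hit tables (query in field 1, subject in field 2) in which the first line seen for a query is its best hit, as is usually the case with BLAST tabular output. The function name `rbh_pairs` is hypothetical and this is not OGRI's actual code; in particular, the article-specific screening criteria are not applied here.

```shell
# Hypothetical sketch (not OGRI's code): list reciprocal best hits (RBH).
# usage: rbh_pairs <hits_1_vs_2.tsv> <hits_2_vs_1.tsv>
rbh_pairs() {
  awk -F'\t' '
    FNR == 1 { file++ }                                  # track which table is being read
    file == 1 && !($1 in best12) { best12[$1] = $2 }     # best hit of each set-1 query
    file == 2 && !($1 in best21) { best21[$1] = $2 }     # best hit of each set-2 query
    END {
      for (q in best12) {
        s = best12[q]
        if ((s in best21) && best21[s] == q) print q "\t" s
      }
    }' "$1" "$2" | sort
}
```

A pair is reported only when each sequence is the best hit of the other, which is what prevents repeats and paralogs from contributing several alignments to the same region.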
The different estimated OGRIs can be used to assess taxonomic rank delineation.
Species delineation. Different species delineation cutoffs based on different OGRIs were proposed, e.g. cANI = 94% (Konstantinidis and Tiedje 2005a), (two-way) AAI = 95%-96% (Konstantinidis and Tiedje 2005b), ANI = 95% and cDNA12 = cDNA21 = 69% (Goris et al. 2007; see also Rodriguez-R and Konstantinidis 2014), ANI = 95%-96% (Richter and Rossello-Mora 2009), rAAI = 95% (Luo et al. 2014), AF = 0.6 and gANI = 96.5% (Varghese et al. 2015). Alternative implementations for estimating the Average Nucleotide Identity also led to comparable species delineation cutoffs, e.g. 95% using FastANI (Jain et al. 2018), a tool comparable to OrthoANI for closely-related genomes (e.g. ANI > 93%; Palmer et al. 2020). Of important note, OGRI values based on non-RBH sequence similarity searches (e.g. ANI, cANI, AAI) are often (incorrectly) smaller than those based on RBH approaches (e.g. OrthoANI, gANI, AF, rAAI), because of the occurrences of repeat regions or the presence of expanded families of paralogous genes (e.g. Konstantinidis and Tiedje 2005b, Palmer et al. 2020). Nevertheless, as the proposed cutoffs for both AF and gANI are based on a sequence similarity search tool (NSimScan) that is different from the one used by OGRI (blastn), the two implementations are not comparable. It is therefore recommended to use oANI = 95% and/or rAAI = 95% as species delineation cutoffs.
Genus delineation. Different genus delineation cutoffs based on different OGRIs were proposed, e.g. POCP = 50% (Qin et al. 2014), (two-way) AAI = 65% (Konstantinidis et al. 2017; see also Rodriguez-R and Konstantinidis 2014), rAAI = 60% (Luo et al. 2014). Consequently, two genomes leading to POCP < 50% and rAAI < 60% likely belong to (at least) distinct genera. However, different exceptions to this simplistic rule have been shown (especially for POCP, e.g. Surech et al. 2019). Among the recommended approaches to determine a genus delineation cutoff, one can (i) look for a natural cutoff in the distribution of a large set of pairwise rAAI values (see e.g. Nicholson et al. 2020), or (ii) estimate a genus inflexion point in the plot of rAAI vs. ProCov values (for example) estimated against a selected type species (for a similar approach, see Barco et al. 2020).
Family delineation. A family delineation cutoff of (two-way) AAI = 45% was suggested by Konstantinidis et al. (2017), but this cutoff was not assessed based on a large number of compared genomes.
Order delineation. Order delineation cutoffs of rAAI = 47%-50% were observed by Luo et al. (2014). However, as the distribution of the pairwise rAAI values between members of distinct bacterial orders overlaps with those related to the genus and the phylum, assessing orders based on rAAI is not recommended.
Class delineation. No class delineation for any OGRI was ever proposed.
Phylum delineation. A phylum delineation cutoff of rAAI = 40% was assessed by Luo et al. (2014). A comparable cutoff of rAAI = 40% can therefore possibly be considered when using OGRI.
Kingdom delineation. No kingdom delineation for any OGRI was ever proposed.
Domain delineation. A domain delineation cutoff of rAAI = 40% was observed by Luo et al. (2014). However, such a cutoff should be used with caution.
In order to illustrate the usefulness of OGRI, the following use case example describes its usage for estimating pairwise similarity measures between 13 Enterobacteriaceae chromosomes, as published by Konstantinidis and Tiedje (2005a), as well as Goris et al. (2007).
Download the 13 chromosome sequence files using the following Bash command lines:
URL="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA";
wget -q -O - $URL/000/008/865/GCA_000008865.2_ASM886v2/GCA_000008865.2_ASM886v2_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 01.Escherichia.coli.O157.H7.Sakai.fasta ;
wget -q -O - $URL/000/006/665/GCA_000006665.1_ASM666v1/GCA_000006665.1_ASM666v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 02.Escherichia.coli.O157.H7.EDL933.fasta ;
wget -q -O - $URL/000/273/425/GCA_000273425.1_Esch_coli_MG12655_V1/GCA_000273425.1_Esch_coli_MG12655_V1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 03.Escherichia.coli.K-12.MG1655.fasta ;
wget -q -O - $URL/000/007/445/GCA_000007445.1_ASM744v1/GCA_000007445.1_ASM744v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 04.Escherichia.coli.CFT073.fasta ;
wget -q -O - $URL/000/007/405/GCA_000007405.1_ASM740v1/GCA_000007405.1_ASM740v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 05.Shigella.flexneri.2a.2457T.fasta ;
wget -q -O - $URL/000/006/925/GCA_000006925.2_ASM692v2/GCA_000006925.2_ASM692v2_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 06.Shigella.flexneri.2a.301.fasta ;
wget -q -O - $URL/000/006/945/GCA_000006945.2_ASM694v2/GCA_000006945.2_ASM694v2_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 07.Salmonella.enterica.Typhimurium.LT2.fasta ;
wget -q -O - $URL/000/007/545/GCA_000007545.1_ASM754v1/GCA_000007545.1_ASM754v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 08.Salmonella.enterica.Typhi.Ty2.fasta ;
wget -q -O - $URL/001/302/605/GCA_001302605.1_ASM130260v1/GCA_001302605.1_ASM130260v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 09.Salmonella.enterica.Typhi.PM016.13.fasta ;
wget -q -O - $URL/000/970/105/GCA_000970105.1_ASM97010v1/GCA_000970105.1_ASM97010v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 11.Yersinia.pestis.KIM5.fasta ;
wget -q -O - $URL/000/009/065/GCA_000009065.1_ASM906v1/GCA_000009065.1_ASM906v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 12.Yersinia.pestis.CO92.fasta ;
wget -q -O - $URL/000/009/345/GCA_000009345.1_ASM934v1/GCA_000009345.1_ASM934v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 14.Yersinia.enterocolitica.8081.fasta ;
wget -q -O - $URL/000/294/535/GCA_000294535.1_ASM29453v1/GCA_000294535.1_ASM29453v1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 15.Erwinia.carotovora.PCC21.fasta ;
Note that each file is numbered according to the Enterics part of the Table S1 in Konstantinidis and Tiedje (2005a).
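In each download command above, the awk filter keeps only the first FASTA record of the assembly (i.e. the chromosome, when it is listed first) and stops reading as soon as a second header line is met. The toy example below isolates this filter; the function name `first_record` is only introduced here for illustration.

```shell
# Same awk filter as in the download commands, wrapped in a hypothetical function:
# print every line until a second header (line starting with ">") is reached.
first_record() {
  awk '/^>/{if(NR>1)exit}{print}'
}

# Toy input with two records: only the first one is printed.
printf '>chromosome\nACGT\nACGT\n>plasmid\nTTTT\n' | first_record
```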
Run the following command line to compare the first two Escherichia coli genomes (using 12 threads):
OGRI_B.sh -t 12 01.Escherichia.coli.O157.H7.Sakai.fasta 02.Escherichia.coli.O157.H7.EDL933.fasta
After ~1 minute of calculations, OGRI displays the following results:
[1/2] [0%]----------+----------+----------+----------+----------[100%]
[2/2] [0%]----------+----------+----------+----------+----------[100%]
Genome files
GENO1 01.Escherichia.coli.O157.H7.Sakai.fasta
GENO2 02.Escherichia.coli.O157.H7.EDL933.fasta
Average Nucleotide Identity (Goris et al. 2007)
nFRA1 (nFRA12) 5390 (5329)
nFRA2 (nFRA21) 4766 (4749)
cDNA12 (lgt1) 98.81 (5498578)
cDNA21 (lgt2) 87.64 (5521804)
ANI12 [95%CI] 99.90 [99.87-99.92]
ANI21 [95%CI] 99.98 [99.97-99.99]
ANI [95%CI] 99.94 [99.92-99.95]
OrthoANI (Lee et al. 2016)
nfRBH 4446
oANI [95%CI] 99.97 [99.96-99.98]
Percentage Of Conserved Proteins (Qin et al. 2014)
nCDS1 (nCDS12) 5286 (4445)
nCDS2 (nCDS21) 4246 (4231)
POCP 91.01
CDS-based ANI (Konstantinidis & Tiedje 2005)
nCDS1 (cCDS12) 5286 (4319)
nCDS2 (cCDS21) 4246 (4226)
cANI12 [95%CI] 99.54 [99.46-99.63]
cANI21 [95%CI] 99.97 [99.96-99.99]
cANI [95%CI] 99.75 [99.71-99.81]
Whole-genome based ANI & Alignment Fraction (Varghese et al. 2015)
ngRBH 4027
gANI12 [95%CI] 99.62 [99.37-99.85]
gANI21 [95%CI] 99.89 [99.83-99.95]
gANI [95%CI] 99.75 [99.60-99.90]
AF12 [95%CI] 0.697 [0.684-0.711]
AF21 [95%CI] 0.960 [0.942-0.979]
AF [95%CI] 0.828 [0.813-0.845]
Average Amino-acid Identity (one-way; Konstantinidis & Tiedje 2005)
nCDS1 (mCDS12) 5286 (5214)
nCDS2 (mCDS21) 4246 (4230)
AAI12 [95%CI] 99.74 [99.66-99.80]
AAI21 [95%CI] 99.93 [99.88-99.97]
AAI [95%CI] 99.83 [99.77-99.88]
Proteome Coverage (Kim et al. 2021) & rAAI (Nicholson et al. 2020)
naRBH 4009
ProCov 0.841
rAAI [95%CI] 99.97 [99.93-99.99]
It can be observed that the ANI of E. coli O157:H7 Sakai (genome 1) against E. coli O157:H7 EDL933 (genome 2) is ANI12 = 99.90%, with a percentage of conserved DNA of cDNA12 = 98.81%. These values can be compared to the ones reported by Goris et al. (2007; Table 2): 99.68% and 99.6%, respectively. One also observes ANI21 = 99.98% and cDNA21 = 87.64%, whereas Goris et al. (2007) reported 99.63% and 97.3%, respectively. Such differences can be explained by the way each genome sequence is decomposed into consecutive fragments.
The CDS-based ANI of E. coli O157:H7 EDL933 (genome 2) against E. coli O157:H7 Sakai (genome 1) is cANI21 = 99.97%, with 4226/4246=99.52% conserved genes. These values can be compared to the ones reported by Konstantinidis and Tiedje (2005a; Table S1): 99.7% and 98.6%, respectively. Such differences can be explained by the different numbers of predicted CDS: OGRI obtained nCDS1 = 5286 and nCDS2 = 4246, whereas Konstantinidis and Tiedje (2005a; Table S1) reported 5361 and 5324 CDS, respectively.
To obtain more results against E. coli O157:H7 Sakai (genome 1), OGRI can be run on the whole set of downloaded genomes to display all metrics in tab-delimited format (option -r), e.g.
OGRI_B.sh -t 48 -r *.fasta 2>/dev/null
This command line leads to the following output:
#GENO1 GENO2 lgt1 lgt2 nFRA1 nFRA2 nFRA12 nFRA21 cDNA12 cDNA21 ANI12 [CI_ANI12] ANI21 [CI_ANI21] ANI [CI_ANI] nfRBH oANI [CI_oANI] nCDS1 nCDS2 nCDS12 nCDS21 POCP cCDS12 cCDS21 cANI12 [CI_cANI12] cANI21 [CI_cANI21] cANI [CI_cANI] ngRBH gANI12 [CI_gANI12] gANI21 [CI_gANI21] gANI [CI_gANI] AF12 [CI_AF12] AF21 [CI_AF21] AF [AF_CI] mCDS12 mCDS21 AAI12 [CI_AAI12] AAI21 [CI_AAI21] AAI [CI_AAI] naRBH ProCov rAAI [CI_rAAI]
01.Escherichia.coli.O157.H7.Sakai.fasta 02.Escherichia.coli.O157.H7.EDL933.fasta 5498578 5521804 5390 4766 5329 4749 98.81 87.64 99.90 [99.87-99.92] 99.98 [99.97-99.99] 99.94 [99.92-99.95] 4446 99.97 [99.96-99.98] 5286 4246 4445 4231 91.01 4319 4226 99.54 [99.46-99.63] 99.97 [99.96-99.99] 99.75 [99.71-99.81] 4027 99.62 [99.37-99.85] 99.89 [99.83-99.95] 99.75 [99.60-99.90] 0.697 [0.684-0.711] 0.960 [0.942-0.979] 0.828 [0.813-0.845] 5214 4230 99.74 [99.66-99.80] 99.93 [99.88-99.97] 99.83 [99.77-99.88] 4009 0.841 99.97 [99.93-99.99]
01.Escherichia.coli.O157.H7.Sakai.fasta 03.Escherichia.coli.K-12.MG1655.fasta 5498578 4638970 5390 4548 4021 3988 74.55 88.16 97.81 [97.68-97.92] 98.02 [97.92-98.11] 97.91 [97.80-98.01] 4013 98.05 [97.95-98.13] 5286 4295 4019 3923 82.89 3864 3807 97.88 [97.74-98.00] 98.03 [97.91-98.12] 97.95 [97.82-98.06] 3765 96.94 [96.56-97.30] 97.10 [96.80-97.37] 97.02 [96.68-97.33] 0.759 [0.744-0.774] 0.898 [0.882-0.916] 0.828 [0.813-0.845] 4080 3968 96.19 [95.76-96.54] 97.25 [96.96-97.54] 96.72 [96.36-97.04] 3760 0.784 98.61 [98.42-98.72]
01.Escherichia.coli.O157.H7.Sakai.fasta 04.Escherichia.coli.CFT073.fasta 5498578 5231148 5390 5070 3946 3839 72.85 74.42 96.53 [96.39-96.68] 96.65 [96.50-96.77] 96.59 [96.44-96.72] 3877 96.77 [96.65-96.89] 5286 4799 4113 3976 80.20 3867 3765 96.72 [96.61-96.85] 96.78 [96.65-96.92] 96.75 [96.63-96.88] 3621 95.57 [95.11-96.01] 96.00 [95.71-96.28] 95.78 [95.41-96.14] 0.720 [0.705-0.736] 0.774 [0.758-0.790] 0.747 [0.731-0.763] 4255 4053 94.68 [94.26-95.05] 94.75 [94.26-95.12] 94.71 [94.26-95.08] 3644 0.722 97.67 [97.40-97.91]
01.Escherichia.coli.O157.H7.Sakai.fasta 05.Shigella.flexneri.2a.2457T.fasta 5498578 4599326 5390 4507 3753 3764 68.97 84.82 97.12 [96.93-97.28] 97.75 [97.64-97.85] 97.43 [97.28-97.56] 3710 97.85 [97.75-97.94] 5286 4702 3922 4185 81.16 3553 3973 97.33 [97.15-97.48] 97.60 [97.46-97.74] 97.46 [97.30-97.61] 3552 93.01 [92.26-93.73] 97.00 [96.74-97.21] 95.00 [94.50-95.47] 0.711 [0.695-0.722] 0.826 [0.809-0.841] 0.768 [0.752-0.781] 3964 4252 94.19 [93.79-94.67] 96.94 [96.65-97.20] 95.56 [95.22-95.93] 3486 0.698 98.40 [98.23-98.55]
01.Escherichia.coli.O157.H7.Sakai.fasta 06.Shigella.flexneri.2a.301.fasta 5498578 4607196 5390 4516 3735 3757 68.77 84.71 97.14 [96.98-97.31] 97.71 [97.57-97.82] 97.42 [97.27-97.56] 3715 97.83 [97.74-97.91] 5286 4715 3907 4207 81.13 3523 4000 97.32 [97.15-97.45] 97.58 [97.43-97.70] 97.45 [97.29-97.57] 3549 92.89 [92.28-93.50] 96.95 [96.75-97.18] 94.92 [94.51-95.34] 0.710 [0.695-0.727] 0.822 [0.805-0.840] 0.766 [0.750-0.783] 3951 4277 94.15 [93.73-94.59] 97.01 [96.75-97.24] 95.58 [95.24-95.91] 3485 0.696 98.39 [98.23-98.51]
01.Escherichia.coli.O157.H7.Sakai.fasta 07.Salmonella.enterica.Typhimurium.LT2.fasta 5498578 4857450 5390 4762 2961 2924 3.10 3.56 80.19 [79.96-80.42] 80.30 [80.08-80.55] 80.24 [80.02-80.48] 3102 80.68 [80.45-80.91] 5286 4504 3493 3361 70.01 3109 3030 80.81 [80.58-81.01] 80.98 [80.78-81.22] 80.89 [80.68-81.11] 2922 80.54 [80.29-80.81] 80.20 [79.93-80.51] 80.37 [80.11-80.66] 0.591 [0.579-0.605] 0.674 [0.661-0.691] 0.632 [0.620-0.648] 3653 3497 82.45 [81.87-82.98] 83.49 [82.98-84.05] 82.97 [82.42-83.51] 3162 0.645 87.08 [86.67-87.44]
01.Escherichia.coli.O157.H7.Sakai.fasta 08.Salmonella.enterica.Typhi.Ty2.fasta 5498578 4791950 5390 4695 2841 2864 3.02 3.44 80.42 [80.21-80.68] 80.32 [80.10-80.57] 80.37 [80.15-80.62] 3002 80.85 [80.62-81.10] 5286 4614 3326 3333 67.26 2941 3008 81.13 [80.91-81.37] 81.06 [80.86-81.28] 81.09 [80.88-81.32] 2872 79.75 [79.41-80.11] 80.29 [80.01-80.51] 80.02 [79.71-80.31] 0.581 [0.570-0.596] 0.669 [0.655-0.686] 0.625 [0.612-0.641] 3488 3462 82.79 [82.22-83.38] 83.73 [83.27-84.25] 83.26 [82.74-83.81] 3079 0.622 87.28 [86.84-87.72]
01.Escherichia.coli.O157.H7.Sakai.fasta 09.Salmonella.enterica.Typhi.PM016.13.fasta 5498578 4793553 5390 4699 2834 2816 3.03 3.71 80.45 [80.19-80.63] 80.40 [80.17-80.64] 80.42 [80.18-80.63] 3008 80.84 [80.60-81.04] 5286 4625 3314 3322 66.95 2932 2999 81.16 [80.97-81.34] 81.10 [80.87-81.31] 81.13 [80.92-81.32] 2867 79.79 [79.48-80.11] 80.33 [80.04-80.59] 80.06 [79.76-80.35] 0.581 [0.565-0.592] 0.667 [0.651-0.681] 0.624 [0.608-0.636] 3478 3452 82.79 [82.15-83.41] 83.74 [83.13-84.20] 83.26 [82.64-83.80] 3070 0.619 87.33 [86.89-87.73]
01.Escherichia.coli.O157.H7.Sakai.fasta 11.Yersinia.pestis.KIM5.fasta 5498578 4605437 5390 4515 1523 1507 0.59 0.75 71.92 [71.59-72.25] 71.91 [71.52-72.25] 71.91 [71.55-72.25] 2003 72.33 [72.07-72.60] 5286 4040 2542 2505 54.11 1721 1694 72.97 [72.65-73.24] 73.00 [72.70-73.24] 72.98 [72.67-73.24] 1328 71.78 [71.20-72.29] 71.85 [71.34-72.31] 71.81 [71.27-72.30] 0.284 [0.275-0.294] 0.355 [0.343-0.367] 0.319 [0.309-0.330] 2790 2630 68.72 [68.11-69.31] 70.01 [69.34-70.62] 69.36 [68.72-69.96] 2309 0.495 73.68 [73.17-74.12]
01.Escherichia.coli.O157.H7.Sakai.fasta 12.Yersinia.pestis.CO92.fasta 5498578 4653728 5390 4562 1528 1507 0.59 0.59 71.92 [71.58-72.25] 72.04 [71.66-72.38] 71.98 [71.62-72.31] 1979 72.25 [71.99-72.52] 5286 4090 2555 2519 54.11 1726 1697 72.95 [72.70-73.23] 73.00 [72.76-73.24] 72.97 [72.73-73.23] 1333 71.68 [71.20-72.15] 71.83 [71.31-72.26] 71.75 [71.25-72.20] 0.285 [0.276-0.294] 0.352 [0.341-0.363] 0.318 [0.308-0.328] 2802 2645 68.73 [67.98-69.32] 69.93 [69.31-70.50] 69.33 [68.64-69.91] 2318 0.494 73.64 [73.09-74.15]
01.Escherichia.coli.O157.H7.Sakai.fasta 14.Yersinia.enterocolitica.8081.fasta 5498578 4615899 5390 4525 1688 1654 0.60 0.73 71.80 [71.46-72.13] 71.79 [71.47-72.07] 71.79 [71.46-72.10] 2136 72.05 [71.74-72.30] 5286 4159 2738 2648 57.02 1870 1803 72.96 [72.72-73.19] 73.11 [72.84-73.36] 73.03 [72.78-73.27] 1425 72.21 [71.88-72.61] 71.87 [71.49-72.31] 72.04 [71.68-72.46] 0.306 [0.297-0.315] 0.377 [0.367-0.389] 0.341 [0.332-0.352] 2992 2811 68.86 [68.22-69.38] 70.10 [69.58-70.64] 69.48 [68.90-70.01] 2495 0.528 73.53 [73.04-74.02]
01.Escherichia.coli.O157.H7.Sakai.fasta 15.Erwinia.carotovora.PCC21.fasta 5498578 4842771 5390 4747 1652 1641 0.73 0.91 73.16 [72.81-73.51] 72.93 [72.55-73.30] 73.04 [72.68-73.40] 2068 73.31 [73.02-73.57] 5286 4258 2559 2511 53.12 1808 1751 73.91 [73.66-74.14] 74.11 [73.84-74.39] 74.01 [73.75-74.26] 1462 72.99 [72.48-73.39] 72.80 [72.26-73.17] 72.89 [72.37-73.28] 0.315 [0.305-0.324] 0.363 [0.352-0.374] 0.339 [0.328-0.349] 2795 2707 69.04 [68.42-69.53] 69.31 [68.63-69.84] 69.17 [68.52-69.68] 2326 0.487 73.70 [73.16-74.16]
The tab-separated format can be useful to restrict the output to some specific fields, e.g. cANI values:
OGRI_B.sh -t 48 -r *.fasta 2>/dev/null | cut -f2,23-25
The above command line leads to the following simplified output:
GENO2 cANI12 [CI_cANI12] cANI21 [CI_cANI21] cANI [CI_cANI]
02.Escherichia.coli.O157.H7.EDL933.fasta 99.54 [99.46-99.63] 99.97 [99.96-99.99] 99.75 [99.71-99.81]
03.Escherichia.coli.K-12.MG1655.fasta 97.88 [97.74-98.00] 98.03 [97.91-98.12] 97.95 [97.82-98.06]
04.Escherichia.coli.CFT073.fasta 96.72 [96.61-96.85] 96.78 [96.65-96.92] 96.75 [96.63-96.88]
05.Shigella.flexneri.2a.2457T.fasta 97.33 [97.15-97.48] 97.60 [97.46-97.74] 97.46 [97.30-97.61]
06.Shigella.flexneri.2a.301.fasta 97.32 [97.15-97.45] 97.58 [97.43-97.70] 97.45 [97.29-97.57]
07.Salmonella.enterica.Typhimurium.LT2.fasta 80.81 [80.58-81.01] 80.98 [80.78-81.22] 80.89 [80.68-81.11]
08.Salmonella.enterica.Typhi.Ty2.fasta 81.13 [80.91-81.37] 81.06 [80.86-81.28] 81.09 [80.88-81.32]
09.Salmonella.enterica.Typhi.PM016.13.fasta 81.16 [80.97-81.34] 81.10 [80.87-81.31] 81.13 [80.92-81.32]
11.Yersinia.pestis.KIM5.fasta 72.97 [72.65-73.24] 73.00 [72.70-73.24] 72.98 [72.67-73.24]
12.Yersinia.pestis.CO92.fasta 72.95 [72.70-73.23] 73.00 [72.76-73.24] 72.97 [72.73-73.23]
14.Yersinia.enterocolitica.8081.fasta 72.96 [72.72-73.19] 73.11 [72.84-73.36] 73.03 [72.78-73.27]
15.Erwinia.carotovora.PCC21.fasta 73.91 [73.66-74.14] 74.11 [73.84-74.39] 74.01 [73.75-74.26]
Each row can be compared to the values reported by Konstantinidis and Tiedje (2005a; Table S1, Enterics section): 99.7%, 97.2%, 95.9%, 96.5%, 96.4%, 79.9%, 80.2%, 80.2%, 71.5%, 71.5%, 82.1%, 72.1%, respectively. Therefore, it is likely that the penultimate reported value (i.e. 82.1%) is a typo in Table S1.
Restricting the computations to the fragment-based pairwise measures (option -x) can be useful to significantly reduce the overall running times, e.g.
OGRI_B.sh -t 6 -r -x 0[1-5].*.fasta 2>/dev/null | cut -f2,11-13
The above command line leads to the following simplified output (i.e. restricted to ANI values):
GENO2 ANI12 [CI_ANI12] ANI21 [CI_ANI21] ANI [CI_ANI]
02.Escherichia.coli.O157.H7.EDL933.fasta 99.90 [99.87-99.92] 99.98 [99.97-99.99] 99.94 [99.92-99.95]
03.Escherichia.coli.K-12.MG1655.fasta 97.81 [97.68-97.92] 98.02 [97.92-98.11] 97.91 [97.80-98.01]
04.Escherichia.coli.CFT073.fasta 96.53 [96.39-96.68] 96.65 [96.50-96.77] 96.59 [96.44-96.72]
05.Shigella.flexneri.2a.2457T.fasta 97.12 [96.93-97.28] 97.75 [97.64-97.85] 97.43 [97.28-97.56]
06.Shigella.flexneri.2a.301.fasta 97.14 [96.98-97.31] 97.71 [97.57-97.82] 97.42 [97.27-97.56]
07.Salmonella.enterica.Typhimurium.LT2.fasta 80.19 [79.96-80.42] 80.30 [80.08-80.55] 80.24 [80.02-80.48]
08.Salmonella.enterica.Typhi.Ty2.fasta 80.42 [80.21-80.68] 80.32 [80.10-80.57] 80.37 [80.15-80.62]
09.Salmonella.enterica.Typhi.PM016.13.fasta 80.45 [80.19-80.63] 80.40 [80.17-80.64] 80.42 [80.18-80.63]
11.Yersinia.pestis.KIM5.fasta 71.92 [71.59-72.25] 71.91 [71.52-72.25] 71.91 [71.55-72.25]
12.Yersinia.pestis.CO92.fasta 71.92 [71.58-72.25] 72.04 [71.66-72.38] 71.98 [71.62-72.31]
14.Yersinia.enterocolitica.8081.fasta 71.80 [71.46-72.13] 71.79 [71.47-72.07] 71.79 [71.46-72.10]
15.Erwinia.carotovora.PCC21.fasta 73.16 [72.81-73.51] 72.93 [72.55-73.30] 73.04 [72.68-73.40]
Each of the first four rows (genomes 2-5) can be compared to the values reported by Goris et al. (2007; Table 2, Escherichia/Shigella hybridization group), i.e. ANI12: 99.68%, 97.53%, 96.00%, 97.36%, respectively, and ANI21: 99.63%, 97.25%, 95.85%, 96.54%, respectively.
As the recommended OGRIs are oANI and rAAI (see Methods), the following command line can be useful in many cases:
OGRI_B.sh -t 48 -r -z *.fasta 2>/dev/null | cut -f2,5-8,16-18
The above command line leads to the following output:
GENO2 nfRBH oANI [CI_oANI] nCDS1 nCDS2 naRBH ProCov rAAI [CI_rAAI]
02.Escherichia.coli.O157.H7.EDL933.fasta 4446 99.97 [99.96-99.98] 5286 4246 4009 0.841 99.97 [99.93-99.99]
03.Escherichia.coli.K-12.MG1655.fasta 4013 98.05 [97.95-98.13] 5286 4295 3760 0.784 98.61 [98.42-98.72]
04.Escherichia.coli.CFT073.fasta 3877 96.77 [96.65-96.89] 5286 4799 3644 0.722 97.67 [97.40-97.91]
05.Shigella.flexneri.2a.2457T.fasta 3710 97.85 [97.75-97.94] 5286 4702 3486 0.698 98.40 [98.23-98.55]
06.Shigella.flexneri.2a.301.fasta 3715 97.83 [97.74-97.91] 5286 4715 3485 0.696 98.39 [98.23-98.51]
07.Salmonella.enterica.Typhimurium.LT2.fasta 3102 80.68 [80.45-80.91] 5286 4504 3162 0.645 87.08 [86.67-87.44]
08.Salmonella.enterica.Typhi.Ty2.fasta 3002 80.85 [80.62-81.10] 5286 4614 3079 0.622 87.28 [86.84-87.72]
09.Salmonella.enterica.Typhi.PM016.13.fasta 3008 80.84 [80.60-81.04] 5286 4625 3070 0.619 87.33 [86.89-87.73]
11.Yersinia.pestis.KIM5.fasta 2003 72.33 [72.07-72.60] 5286 4040 2309 0.495 73.68 [73.17-74.12]
12.Yersinia.pestis.CO92.fasta 1979 72.25 [71.99-72.52] 5286 4090 2318 0.494 73.64 [73.09-74.15]
14.Yersinia.enterocolitica.8081.fasta 2136 72.05 [71.74-72.30] 5286 4159 2495 0.528 73.53 [73.04-74.02]
15.Erwinia.carotovora.PCC21.fasta 2068 73.31 [73.02-73.57] 5286 4258 2326 0.487 73.70 [73.16-74.16]
These results suggest that genomes 2-6 belong to the same species as E. coli O157:H7 Sakai (genome 1; oANI > 95% and rAAI > 95%), unlike genomes 7-15 (oANI < 95% and rAAI < 95%). The rAAI values obtained for the two Yersinia pestis genomes (11-12) can be compared with the (two-way) AAI = 72% observed by Konstantinidis and Tiedje (2005b) between pairs of E. coli and Y. pestis genomes.
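The species-level reading above can be automated with a small filter. The following is only a minimal sketch, not part of OGRI: `same_species` is a hypothetical helper name, and it assumes that the input is tab-separated (as implied by the use of `cut -f`) with a header row in which the columns are named `oANI` and `rAAI`, as in the output shown above. The 95% cutoff is the one used in this example and can be changed through the first argument.

```shell
# Hypothetical helper (not provided by OGRI): print the GENO2 name of every
# row whose oANI and rAAI both reach the cutoff (default: 95).
# Assumption: tab-separated input whose header names the oANI and rAAI columns.
same_species() {
  awk -F'\t' -v cutoff="${1:-95}" '
    NR == 1 {
      # locate the two columns by name instead of relying on fixed positions
      for (i = 1; i <= NF; i++) {
        if ($i == "oANI") a = i
        if ($i == "rAAI") r = i
      }
      next
    }
    # print the first field only when both columns were found and both pass
    a && r && $a + 0 >= cutoff && $r + 0 >= cutoff { print $1 }
  '
}
```

Under these assumptions, the helper can be appended to the previous command line, e.g. `OGRI_B.sh -t 48 -r -z *.fasta 2>/dev/null | cut -f2,5-8,16-18 | same_species 95`. Looking up the columns by header name keeps the filter usable if a different set of columns is selected with `cut`.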
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410. doi:10.1016/S0022-2836(05)80360-2
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402. doi:10.1093/nar/25.17.3389
Barco RA, Garrity GM, Scott JJ, Amend JP, Nealson KH, Emerson D (2020) A Genus Definition for Bacteria and Archaea Based on a Standard Genome Relatedness Index. mBio, 11:e02475-19. doi:10.1128/mBio.02475-19
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2008) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421
Chun J, Rainey FA (2014) Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. International Journal of Systematic and Evolutionary Microbiology, 64(Pt 2):316-324. doi:10.1099/ijs.0.054171-0
Gertz EM, Yu Y-K, Agarwala R, Schäffer AA, Altschul SF (2006) Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST. BMC Biology, 4:41. doi:10.1186/1741-7007-4-41
Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM (2007) DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology, 57(1):81-91. doi:10.1099/ijs.0.64483-0
Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi:10.1186/1471-2105-11-119
Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9:5114. doi:10.1038/s41467-018-07641-9
Kim D, Park S, Chun J (2021) Introducing EzAAI: a pipeline for high throughput calculations of prokaryotic average amino acid identity. Journal of Microbiology, 59(5):476-480. doi:10.1007/s12275-021-1154-0
Konstantinidis KT, Tiedje JM (2005a) Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences of the United States of America, 102(7):2567-2572. doi:10.1073/pnas.0409727102
Konstantinidis KT, Tiedje JM (2005b) Towards a Genome-Based Taxonomy for Prokaryotes. Journal of Bacteriology, 187(18):6258-6264. doi:10.1128/JB.187.18.6258-6264.2005
Konstantinidis KT, Rossello-Mora R, Amann R (2017) Uncultivated microbes in need of their own taxonomy. The ISME Journal, 11:2399-2406. doi:10.1038/ismej.2017.113
Lee I, Kim YO, Park S-C, Chun J (2016) OrthoANI: An improved algorithm and software for calculating average nucleotide identity. International Journal of Systematic and Evolutionary Microbiology, 66(2):1100-1103. doi:10.1099/ijsem.0.000760
Luo C, Rodriguez-R LM, Konstantinidis KT (2014) MyTaxa: an advanced taxonomic classifier for genomic and metagenomic sequences. Nucleic Acids Research, 42(8):e73. doi:10.1093/nar/gku169
Nicholson AC, Gulvik CA, Whitney AM, Humrighouse BW, Bell ME, Holmes B, Steigerwalt AG, Villarma A, Sheth M, Batra D, Rowe LA, Burroughs M, Pryor JC, Bernardet J-F, Hugo C, Kämpfer P, Newman JD, McQuiston JR (2020) Division of the genus Chryseobacterium: Observation of discontinuities in amino acid identity values, a possible consequence of major extinction events, guides transfer of nine species to the genus Epilithonimonas, eleven species to the genus Kaistella, and three species to the genus Halpernia gen. nov., with description of Kaistella daneshvariae sp. nov. and Epilithonimonas vandammei sp. nov. derived from clinical specimens. International Journal of Systematic and Evolutionary Microbiology, 70:4432-4450. doi:10.1099/ijsem.0.003935
Novichkov V, Kaznadzey A, Alexandrova N, Kaznadzey D (2016) NSimScan: DNA comparison tool with increased speed, sensitivity and accuracy. Bioinformatics, 32(15):2380-2381. doi:10.1093/bioinformatics/btw126
Palmer M, Steenkamp ET, Blom J, Hedlund BP, Venter SN (2020) All ANIs are not created equal: implications for prokaryotic species boundaries and integration of ANIs into polyphasic taxonomy. International Journal of Systematic and Evolutionary Microbiology, 70(4):2937-2948. doi:10.1099/ijsem.0.004124
Qin Q-L, Xie B-B, Zhang X-Y, Chen X-L, Zhou B-C, Zhou J, Oren A, Zhang Y-Z (2014) A Proposed Genus Boundary for the Prokaryotes Based on Genomic Insights. Journal of Bacteriology, 196(12):2210-2215. doi:10.1128/JB.01688-14
Richter M, Rossello-Mora R (2009) Shifting the genomic gold standard for the prokaryotic species definition. Proceedings of the National Academy of Sciences of the United States of America, 106(45):19126-19131. doi:10.1073/pnas.0906412106
Rodriguez-R LM, Konstantinidis KT (2014) Bypassing cultivation to identify bacterial species. Microbe, 9(3):111-118.
Suresh G, Lodha TD, Indu B, Sasikala C, Ramana CV (2019) Taxogenomics Resolves Conflict in the Genus Rhodobacter: A Two and Half Decades Pending Thought to Reclassify the Genus Rhodobacter. Frontiers in Microbiology, 10:2480. doi:10.3389/fmicb.2019.02480
Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, Pati A (2015) Microbial species delineation using whole genome sequences. Nucleic Acids Research, 43(14):6761-6771. doi:10.1093/nar/gkv657
Yoon S-H, Ha S-M, Lim J, Kwon S, Chun J (2017) A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie van Leeuwenhoek, 110(10):1281-1286. doi:10.1007/s10482-017-0844-4
Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7(1-2):203-214. doi:10.1089/10665270050081478