GPLv3 license Bash

OGRI

OGRI (Overall Genome Relatedness Indices; Chun & Rainey 2014) is a set of command line programs written in Bash to compute pairwise similarity measures between whole genome sequences. Every computed similarity is based on local sequence alignments:

  ▹   Average Nucleotide Identity (ANI; Goris et al. 2007),

  ▹   Percentage of Conserved DNA (cDNA; Goris et al. 2007),

  ▹   OrthoANI (oANI; Lee et al. 2016),

  ▹   Percentage Of Conserved Proteins (POCP; Qin et al. 2014),

  ▹   CDS-based ANI (cANI; Konstantinidis & Tiedje 2005a; gANI; Varghese et al. 2015),

  ▹   Alignment Fraction (AF; Varghese et al. 2015),

  ▹   (one-way) Average Amino-acid Identity (AAI; Konstantinidis & Tiedje 2005b),

  ▹   Proteome Coverage (ProCov; Kim et al. 2021),

  ▹   Reciprocal AAI (rAAI; Nicholson et al. 2020).

The key aim of OGRI is to provide a wide range of genome proximity metrics in an accurate way, i.e. implemented following the specific descriptions given by each associated article (see Methods). Consequently, OGRI is not expected to run very fast (e.g. OGRI_B requires up to one minute to deal with two 5 Mbp-long genomes on 12 threads), even though faster running times are expected with a larger number of threads.

Every OGRI tool runs on UNIX, Linux and most OS X operating systems.

Dependencies

You will need to install the required programs listed in the following table, or to verify that they are already installed with the required version.

OGRI tool  program                               package  version    sources
OGRI_B     gawk                                  -        > 4.0.0    ftp.gnu.org/gnu/gawk
OGRI_B     prodigal                              -        ≥ 2.6.3    github.com/hyattpd/Prodigal
OGRI_B     makeblastdb, blastn, blastp, tblastn  blast+   ≥ 2.12.0   ftp.ncbi.nlm.nih.gov/blast/executables/blast+
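
The version requirements above can be checked by comparing dotted version strings. The following is a minimal sketch, not part of OGRI: the helper name version_ge is hypothetical, and the way each program reports its version (e.g. the output of blastn -version) is left to the reader.

```bash
#!/usr/bin/env bash
# version_ge A B: succeed (exit status 0) when dotted version A >= B.
# Hypothetical helper written for illustration; it is not provided by OGRI.
version_ge() {
  local IFS=.
  # Drop anything that is not a digit or a dot (e.g. the "+" of "2.12.0+"),
  # then split on dots.
  local -a a=(${1//[^0-9.]/}) b=(${2//[^0-9.]/})
  local i x y
  local n=$(( ${#a[@]} > ${#b[@]} ? ${#a[@]} : ${#b[@]} ))
  for (( i = 0; i < n; i++ )); do
    # 10# forces base 10, so that a component such as "08" is not read as octal.
    x=$(( 10#${a[i]:-0} ))
    y=$(( 10#${b[i]:-0} ))
    (( x > y )) && return 0
    (( x < y )) && return 1
  done
  return 0
}

# Example with the blast+ requirement from the table (>= 2.12.0).
if version_ge "2.12.0+" "2.12.0"; then echo "blast+ version is recent enough"; fi
```

A strict requirement such as gawk > 4.0.0 can be expressed by negating the reverse comparison, i.e. ! version_ge 4.0.0 "$v".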

Installation and execution

Clone this repository with the following command line:

git clone https://gitlab.pasteur.fr/GIPhy/OGRI.git

Go to the directory OGRI/ and give the execute permission to the file OGRI_B.sh:

cd OGRI/
chmod +x OGRI_B.sh

and run it with the following command line model:

./OGRI_B.sh [options]

If at least one of the indicated programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), the OGRI tools will exit with an error message. To set a required program that is not available on your $PATH variable, edit the file and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS.
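
Before running OGRI_B, it can help to see which of the required programs are actually reachable on $PATH. The sketch below is only an illustration under that assumption: the function name missing_programs is hypothetical, and the program names are the default ones from the Dependencies table (a binary installed under another name would be reported as missing).

```bash
#!/usr/bin/env bash
# missing_programs NAME...: print each NAME that cannot be found on $PATH.
# Hypothetical helper written for illustration; it is not provided by OGRI.
missing_programs() {
  local p
  for p in "$@"; do
    command -v "$p" >/dev/null 2>&1 || printf '%s\n' "$p"
  done
}

# Default program names, as listed in the Dependencies table.
missing_programs gawk prodigal makeblastdb blastn blastp tblastn
```

An empty output means that every listed program was found; any printed name is a candidate for the REQUIREMENTS block mentioned above.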

Usage

Run OGRI_B without any option to read the following documentation:

 USAGE:  OGRI.sh  [OPTIONS]  <fasta1>  <fasta2>  [<fasta3> ...]

 OPTIONS:
  -x          only OGRIs based on genome fragments (ANI, oANI)
  -y          only OGRIs based on CDS (POCP, gANI, AF, AAI, ProCov, rAAI)
  -z          only OGRIs based on reciprocal searches (oANI, gANI, AF, ProCov, rAAI)
  -b <int>    number of bootstrap replicates for confidence intervals (default: 200)
  -r          tab-delimited raw output (default: detailed output)
  -t <int>    number of threads (default: 2)
  -h          prints this help and exits

Notes

field name            field number  field number  field number  field number
                      (default)     (option -x)   (option -y)   (option -z)
GENO1                 1             1             1             1
GENO2                 2             2             2             2
lgt1                  3             3             3             3
lgt2                  4             4             4             4
nFRA1                 5             5             -             -
nFRA2                 6             6             -             -
nFRA12                7             7             -             -
nFRA21                8             8             -             -
cDNA12                9             9             -             -
cDNA21                10            10            -             -
ANI12 [CI_ANI12]      11            11            -             -
ANI21 [CI_ANI21]      12            12            -             -
ANI [CI_ANI]          13            13            -             -
nfRBH                 14            14            -             5
oANI [CI_oANI]        15            15            -             6
nCDS1                 16            -             5             7
nCDS2                 17            -             6             8
nCDS12                18            -             7             -
nCDS21                19            -             8             -
POCP                  20            -             9             -
cCDS12                21            -             10            -
cCDS21                22            -             11            -
cANI12 [CI_cANI12]    23            -             12            -
cANI21 [CI_cANI21]    24            -             13            -
cANI [CI_cANI]        25            -             14            -
ngRBH                 26            -             15            9
gANI12 [CI_gANI12]    27            -             16            10
gANI21 [CI_gANI21]    28            -             17            11
gANI [CI_gANI]        29            -             18            12
AF12 [CI_AF12]        30            -             19            13
AF21 [CI_AF21]        31            -             20            14
AF [AF_CI]            32            -             21            15
mCDS12                33            -             22            -
mCDS21                34            -             23            -
AAI12 [CI_AAI12]      35            -             24            -
AAI21 [CI_AAI21]      36            -             25            -
AAI [CI_AAI]          37            -             26            -
naRBH                 38            -             27            16
ProCov                39            -             28            17
rAAI [CI_rAAI]        40            -             29            18
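
Because the field numbers change with options -x, -y and -z, selecting columns by name can be less error-prone than selecting them by number. The following sketch assumes that the raw output (option -r) is tab-delimited with a single header line, and that each header starts with the field name (optionally preceded by "#" and followed by a bracketed CI label). The function name select_fields is hypothetical and not part of OGRI.

```bash
#!/usr/bin/env bash
# select_fields NAME...: read tab-delimited text with a header line on stdin
# and print only the columns whose header name is one of the given NAMEs.
# Columns are printed in their original order, not in the order requested.
# Hypothetical helper written for illustration; it is not provided by OGRI.
select_fields() {
  awk -F'\t' -v OFS='\t' -v names="$*" '
    BEGIN { n = split(names, want, " ") }
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        h = $i
        sub(/^#/, "", h)     # "#GENO1" becomes "GENO1"
        sub(/ .*/, "", h)    # "ANI [CI_ANI]" becomes "ANI"
        for (j = 1; j <= n; j++) if (h == want[j]) keep[++k] = i
      }
    }
    {
      line = ""
      for (c = 1; c <= k; c++) line = (c > 1 ? line OFS : "") $(keep[c])
      print line
    }'
}

# Toy input mimicking the assumed layout of the raw output.
printf '#GENO1\tGENO2\tANI [CI_ANI]\tPOCP\nA.fasta\tB.fasta\t99.94 [99.92-99.95]\t91.01\n' |
  select_fields GENO2 ANI
```

Under the same assumptions, the helper could be placed after a pipe wherever cut -f is used with field numbers.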

Methods

Each input genome nucleotide sequence GENOi is decomposed into three sets:

Given two genomes 1 and 2, different local alignments (best hit of each sequence from a set against the sequences from another set) are obtained using different flavors of BLAST (Altschul et al. 1990; Camacho et al. 2008):

These various local alignments are next specifically filtered, and the resulting sets of local similarities are used to derive different pairwise similarity measures:

The different estimated OGRIs can be used to assess taxonomic rank delineation.

Example

In order to illustrate the usefulness of OGRI, the following use case example describes its usage for estimating pairwise similarity measures between 13 Enterobacteriaceae chromosomes, as published by Konstantinidis and Tiedje (2005a), as well as Goris et al. (2007).

Downloading genome sequences

Download the 13 chromosome sequence files using the following Bash command lines:

URL="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA";
wget -q -O - $URL/000/008/865/GCA_000008865.2_ASM886v2/GCA_000008865.2_ASM886v2_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 01.Escherichia.coli.O157.H7.Sakai.fasta ;
wget -q -O - $URL/000/006/665/GCA_000006665.1_ASM666v1/GCA_000006665.1_ASM666v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 02.Escherichia.coli.O157.H7.EDL933.fasta ;
wget -q -O - $URL/000/273/425/GCA_000273425.1_Esch_coli_MG12655_V1/GCA_000273425.1_Esch_coli_MG12655_V1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 03.Escherichia.coli.K-12.MG1655.fasta ;
wget -q -O - $URL/000/007/445/GCA_000007445.1_ASM744v1/GCA_000007445.1_ASM744v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 04.Escherichia.coli.CFT073.fasta ;
wget -q -O - $URL/000/007/405/GCA_000007405.1_ASM740v1/GCA_000007405.1_ASM740v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 05.Shigella.flexneri.2a.2457T.fasta ;
wget -q -O - $URL/000/006/925/GCA_000006925.2_ASM692v2/GCA_000006925.2_ASM692v2_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 06.Shigella.flexneri.2a.301.fasta ;
wget -q -O - $URL/000/006/945/GCA_000006945.2_ASM694v2/GCA_000006945.2_ASM694v2_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 07.Salmonella.enterica.Typhimurium.LT2.fasta ;
wget -q -O - $URL/000/007/545/GCA_000007545.1_ASM754v1/GCA_000007545.1_ASM754v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 08.Salmonella.enterica.Typhi.Ty2.fasta ;
wget -q -O - $URL/001/302/605/GCA_001302605.1_ASM130260v1/GCA_001302605.1_ASM130260v1_genomic.fna.gz                   | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 09.Salmonella.enterica.Typhi.PM016.13.fasta ;
wget -q -O - $URL/000/970/105/GCA_000970105.1_ASM97010v1/GCA_000970105.1_ASM97010v1_genomic.fna.gz                     | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 11.Yersinia.pestis.KIM5.fasta ;
wget -q -O - $URL/000/009/065/GCA_000009065.1_ASM906v1/GCA_000009065.1_ASM906v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 12.Yersinia.pestis.CO92.fasta ;
wget -q -O - $URL/000/009/345/GCA_000009345.1_ASM934v1/GCA_000009345.1_ASM934v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 14.Yersinia.enterocolitica.8081.fasta ;
wget -q -O - $URL/000/294/535/GCA_000294535.1_ASM29453v1/GCA_000294535.1_ASM29453v1_genomic.fna.gz                     | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 15.Erwinia.carotovora.PCC21.fasta ;

Note that each file is numbered according to the Enterics part of the Table S1 in Konstantinidis and Tiedje (2005a).
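
In the download commands above, the awk filter keeps only the first FASTA record of each assembly file and stops at the second header line. The sketch below isolates that filter on a toy input; the function name first_record is hypothetical, and using it to obtain a chromosome assumes that the chromosome is the first record of the file.

```bash
#!/usr/bin/env bash
# first_record: print the first FASTA record read on stdin and drop the rest.
# Same awk program as in the download commands; the wrapper name is hypothetical.
first_record() {
  awk '/^>/{if(NR>1)exit}{print}'
}

# Toy input with two records: only ">chr" and its two sequence lines are kept.
printf '>chr\nACGT\nTTGA\n>plasmid\nGGCC\n' | first_record
```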

Running OGRI to compare two genomes

Run the following command line to compare the first two Escherichia coli genomes (using 12 threads):

OGRI_B.sh  -t 12  01.Escherichia.coli.O157.H7.Sakai.fasta  02.Escherichia.coli.O157.H7.EDL933.fasta

After ~1 minute of calculations, OGRI displays the following results:

[1/2]    [0%]----------+----------+----------+----------+----------[100%]
[2/2]    [0%]----------+----------+----------+----------+----------[100%]

 Genome files
   GENO1             01.Escherichia.coli.O157.H7.Sakai.fasta
   GENO2             02.Escherichia.coli.O157.H7.EDL933.fasta

 Average Nucleotide Identity (Goris et al. 2007)
   nFRA1  (nFRA12)   5390 (5329)
   nFRA2  (nFRA21)   4766 (4749)
   cDNA12 (lgt1)     98.81 (5498578)
   cDNA21 (lgt2)     87.64 (5521804)
   ANI12  [95%CI]    99.90 [99.87-99.92]
   ANI21  [95%CI]    99.98 [99.97-99.99]
   ANI    [95%CI]    99.94 [99.92-99.95]

 OrthoANI (Lee et al. 2016)
   nfRBH             4446
   oANI   [95%CI]    99.97 [99.96-99.98]

 Percentage Of Conserved Proteins (Qin et al. 2014)
   nCDS1  (nCDS12)   5286 (4445)
   nCDS2  (nCDS21)   4246 (4231)
   POCP              91.01

 CDS-based ANI (Konstantinidis & Tiedje 2005)
   nCDS1  (cCDS12)   5286 (4319)
   nCDS2  (cCDS21)   4246 (4226)
   cANI12 [95%CI]    99.54 [99.46-99.63]
   cANI21 [95%CI]    99.97 [99.96-99.99]
   cANI   [95%CI]    99.75 [99.71-99.81]

 Whole-genome based ANI & Alignment Fraction (Varghese et al. 2015)
   ngRBH             4027
   gANI12 [95%CI]    99.62 [99.37-99.85]
   gANI21 [95%CI]    99.89 [99.83-99.95]
   gANI   [95%CI]    99.75 [99.60-99.90]
   AF12   [95%CI]    0.697 [0.684-0.711]
   AF21   [95%CI]    0.960 [0.942-0.979]
   AF     [95%CI]    0.828 [0.813-0.845]

 Average Amino-acid Identity (one-way; Konstantinidis & Tiedje 2005)
   nCDS1  (mCDS12)   5286 (5214)
   nCDS2  (mCDS21)   4246 (4230)
   AAI12  [95%CI]    99.74 [99.66-99.80]
   AAI21  [95%CI]    99.93 [99.88-99.97]
   AAI    [95%CI]    99.83 [99.77-99.88]

 Proteome Coverage (Kim et al. 2021) & rAAI (Nicholson et al. 2020)
   naRBH             4009
   ProCov            0.841
   rAAI   [95%CI]    99.97 [99.93-99.99]

It can be observed that the ANI of E. coli O157:H7 Sakai (genome 1) against E. coli O157:H7 EDL933 (genome 2) is ANI12 = 99.90%, with a percentage of conserved DNA of cDNA12 = 98.81%. These values can be compared to the ones reported by Goris et al. (2007; Table 2): 99.68% and 99.6%, respectively. One also observes ANI21 = 99.98% and cDNA21 = 87.64%, whereas Goris et al. (2007) reported 99.63% and 97.3%, respectively. Such differences can be explained by the way each genome sequence is decomposed into consecutive fragments.

The CDS-based ANI of E. coli O157:H7 EDL933 (genome 2) against E. coli O157:H7 Sakai (genome 1) is cANI21 = 99.97%, with 4226/4246 = 99.53% conserved genes. These values can be compared to the ones reported by Konstantinidis and Tiedje (2005a; Table S1): 99.7% and 98.6%, respectively. Such differences can be explained by the different numbers of predicted CDS (i.e. nCDS1 = 5286 and nCDS2 = 4246), whereas Konstantinidis and Tiedje (2005a; Table S1) reported 5361 and 5324 CDS, respectively.

Running OGRI to compare one genome against several ones

To obtain more results against E. coli O157:H7 Sakai (genome 1), OGRI can be run on the whole set of downloaded genomes to display all metrics in tab-delimited format (option -r), e.g. 

OGRI_B.sh  -t 48  -r  *.fasta  2>/dev/null

This command line leads to the following output:

#GENO1                                   GENO2                                        lgt1    lgt2     nFRA1 nFRA2  nFRA12 nFRA21  cDNA12 cDNA21  ANI12 [CI_ANI12]     ANI21 [CI_ANI21]     ANI [CI_ANI]         nfRBH  oANI [CI_oANI]       nCDS1 nCDS2  nCDS12 nCDS21  POCP   cCDS12 cCDS21  cANI12 [CI_cANI12]   cANI21 [CI_cANI21]   cANI [CI_cANI]       ngRBH  gANI12 [CI_gANI12]   gANI21 [CI_gANI21]   gANI [CI_gANI]       AF12 [CI_AF12]       AF21 [CI_AF21]       AF [AF_CI]           mCDS12 mCDS21  AAI12 [CI_AAI12]     AAI21 [CI_AAI21]     AAI [CI_AAI]         naRBH  ProCov  rAAI [CI_rAAI]
01.Escherichia.coli.O157.H7.Sakai.fasta  02.Escherichia.coli.O157.H7.EDL933.fasta     5498578 5521804  5390  4766   5329   4749    98.81  87.64   99.90 [99.87-99.92]  99.98 [99.97-99.99]  99.94 [99.92-99.95]  4446   99.97 [99.96-99.98]  5286  4246   4445   4231    91.01  4319   4226    99.54 [99.46-99.63]  99.97 [99.96-99.99]  99.75 [99.71-99.81]  4027   99.62 [99.37-99.85]  99.89 [99.83-99.95]  99.75 [99.60-99.90]  0.697 [0.684-0.711]  0.960 [0.942-0.979]  0.828 [0.813-0.845]  5214   4230    99.74 [99.66-99.80]  99.93 [99.88-99.97]  99.83 [99.77-99.88]  4009   0.841   99.97 [99.93-99.99]
01.Escherichia.coli.O157.H7.Sakai.fasta  03.Escherichia.coli.K-12.MG1655.fasta        5498578 4638970  5390  4548   4021   3988    74.55  88.16   97.81 [97.68-97.92]  98.02 [97.92-98.11]  97.91 [97.80-98.01]  4013   98.05 [97.95-98.13]  5286  4295   4019   3923    82.89  3864   3807    97.88 [97.74-98.00]  98.03 [97.91-98.12]  97.95 [97.82-98.06]  3765   96.94 [96.56-97.30]  97.10 [96.80-97.37]  97.02 [96.68-97.33]  0.759 [0.744-0.774]  0.898 [0.882-0.916]  0.828 [0.813-0.845]  4080   3968    96.19 [95.76-96.54]  97.25 [96.96-97.54]  96.72 [96.36-97.04]  3760   0.784   98.61 [98.42-98.72]
01.Escherichia.coli.O157.H7.Sakai.fasta  04.Escherichia.coli.CFT073.fasta             5498578 5231148  5390  5070   3946   3839    72.85  74.42   96.53 [96.39-96.68]  96.65 [96.50-96.77]  96.59 [96.44-96.72]  3877   96.77 [96.65-96.89]  5286  4799   4113   3976    80.20  3867   3765    96.72 [96.61-96.85]  96.78 [96.65-96.92]  96.75 [96.63-96.88]  3621   95.57 [95.11-96.01]  96.00 [95.71-96.28]  95.78 [95.41-96.14]  0.720 [0.705-0.736]  0.774 [0.758-0.790]  0.747 [0.731-0.763]  4255   4053    94.68 [94.26-95.05]  94.75 [94.26-95.12]  94.71 [94.26-95.08]  3644   0.722   97.67 [97.40-97.91]
01.Escherichia.coli.O157.H7.Sakai.fasta  05.Shigella.flexneri.2a.2457T.fasta          5498578 4599326  5390  4507   3753   3764    68.97  84.82   97.12 [96.93-97.28]  97.75 [97.64-97.85]  97.43 [97.28-97.56]  3710   97.85 [97.75-97.94]  5286  4702   3922   4185    81.16  3553   3973    97.33 [97.15-97.48]  97.60 [97.46-97.74]  97.46 [97.30-97.61]  3552   93.01 [92.26-93.73]  97.00 [96.74-97.21]  95.00 [94.50-95.47]  0.711 [0.695-0.722]  0.826 [0.809-0.841]  0.768 [0.752-0.781]  3964   4252    94.19 [93.79-94.67]  96.94 [96.65-97.20]  95.56 [95.22-95.93]  3486   0.698   98.40 [98.23-98.55]
01.Escherichia.coli.O157.H7.Sakai.fasta  06.Shigella.flexneri.2a.301.fasta            5498578 4607196  5390  4516   3735   3757    68.77  84.71   97.14 [96.98-97.31]  97.71 [97.57-97.82]  97.42 [97.27-97.56]  3715   97.83 [97.74-97.91]  5286  4715   3907   4207    81.13  3523   4000    97.32 [97.15-97.45]  97.58 [97.43-97.70]  97.45 [97.29-97.57]  3549   92.89 [92.28-93.50]  96.95 [96.75-97.18]  94.92 [94.51-95.34]  0.710 [0.695-0.727]  0.822 [0.805-0.840]  0.766 [0.750-0.783]  3951   4277    94.15 [93.73-94.59]  97.01 [96.75-97.24]  95.58 [95.24-95.91]  3485   0.696   98.39 [98.23-98.51]
01.Escherichia.coli.O157.H7.Sakai.fasta  07.Salmonella.enterica.Typhimurium.LT2.fasta 5498578 4857450  5390  4762   2961   2924    3.10   3.56    80.19 [79.96-80.42]  80.30 [80.08-80.55]  80.24 [80.02-80.48]  3102   80.68 [80.45-80.91]  5286  4504   3493   3361    70.01  3109   3030    80.81 [80.58-81.01]  80.98 [80.78-81.22]  80.89 [80.68-81.11]  2922   80.54 [80.29-80.81]  80.20 [79.93-80.51]  80.37 [80.11-80.66]  0.591 [0.579-0.605]  0.674 [0.661-0.691]  0.632 [0.620-0.648]  3653   3497    82.45 [81.87-82.98]  83.49 [82.98-84.05]  82.97 [82.42-83.51]  3162   0.645   87.08 [86.67-87.44]
01.Escherichia.coli.O157.H7.Sakai.fasta  08.Salmonella.enterica.Typhi.Ty2.fasta       5498578 4791950  5390  4695   2841   2864    3.02   3.44    80.42 [80.21-80.68]  80.32 [80.10-80.57]  80.37 [80.15-80.62]  3002   80.85 [80.62-81.10]  5286  4614   3326   3333    67.26  2941   3008    81.13 [80.91-81.37]  81.06 [80.86-81.28]  81.09 [80.88-81.32]  2872   79.75 [79.41-80.11]  80.29 [80.01-80.51]  80.02 [79.71-80.31]  0.581 [0.570-0.596]  0.669 [0.655-0.686]  0.625 [0.612-0.641]  3488   3462    82.79 [82.22-83.38]  83.73 [83.27-84.25]  83.26 [82.74-83.81]  3079   0.622   87.28 [86.84-87.72]
01.Escherichia.coli.O157.H7.Sakai.fasta  09.Salmonella.enterica.Typhi.PM016.13.fasta  5498578 4793553  5390  4699   2834   2816    3.03   3.71    80.45 [80.19-80.63]  80.40 [80.17-80.64]  80.42 [80.18-80.63]  3008   80.84 [80.60-81.04]  5286  4625   3314   3322    66.95  2932   2999    81.16 [80.97-81.34]  81.10 [80.87-81.31]  81.13 [80.92-81.32]  2867   79.79 [79.48-80.11]  80.33 [80.04-80.59]  80.06 [79.76-80.35]  0.581 [0.565-0.592]  0.667 [0.651-0.681]  0.624 [0.608-0.636]  3478   3452    82.79 [82.15-83.41]  83.74 [83.13-84.20]  83.26 [82.64-83.80]  3070   0.619   87.33 [86.89-87.73]
01.Escherichia.coli.O157.H7.Sakai.fasta  11.Yersinia.pestis.KIM5.fasta                5498578 4605437  5390  4515   1523   1507    0.59   0.75    71.92 [71.59-72.25]  71.91 [71.52-72.25]  71.91 [71.55-72.25]  2003   72.33 [72.07-72.60]  5286  4040   2542   2505    54.11  1721   1694    72.97 [72.65-73.24]  73.00 [72.70-73.24]  72.98 [72.67-73.24]  1328   71.78 [71.20-72.29]  71.85 [71.34-72.31]  71.81 [71.27-72.30]  0.284 [0.275-0.294]  0.355 [0.343-0.367]  0.319 [0.309-0.330]  2790   2630    68.72 [68.11-69.31]  70.01 [69.34-70.62]  69.36 [68.72-69.96]  2309   0.495   73.68 [73.17-74.12]
01.Escherichia.coli.O157.H7.Sakai.fasta  12.Yersinia.pestis.CO92.fasta                5498578 4653728  5390  4562   1528   1507    0.59   0.59    71.92 [71.58-72.25]  72.04 [71.66-72.38]  71.98 [71.62-72.31]  1979   72.25 [71.99-72.52]  5286  4090   2555   2519    54.11  1726   1697    72.95 [72.70-73.23]  73.00 [72.76-73.24]  72.97 [72.73-73.23]  1333   71.68 [71.20-72.15]  71.83 [71.31-72.26]  71.75 [71.25-72.20]  0.285 [0.276-0.294]  0.352 [0.341-0.363]  0.318 [0.308-0.328]  2802   2645    68.73 [67.98-69.32]  69.93 [69.31-70.50]  69.33 [68.64-69.91]  2318   0.494   73.64 [73.09-74.15]
01.Escherichia.coli.O157.H7.Sakai.fasta  14.Yersinia.enterocolitica.8081.fasta        5498578 4615899  5390  4525   1688   1654    0.60   0.73    71.80 [71.46-72.13]  71.79 [71.47-72.07]  71.79 [71.46-72.10]  2136   72.05 [71.74-72.30]  5286  4159   2738   2648    57.02  1870   1803    72.96 [72.72-73.19]  73.11 [72.84-73.36]  73.03 [72.78-73.27]  1425   72.21 [71.88-72.61]  71.87 [71.49-72.31]  72.04 [71.68-72.46]  0.306 [0.297-0.315]  0.377 [0.367-0.389]  0.341 [0.332-0.352]  2992   2811    68.86 [68.22-69.38]  70.10 [69.58-70.64]  69.48 [68.90-70.01]  2495   0.528   73.53 [73.04-74.02]
01.Escherichia.coli.O157.H7.Sakai.fasta  15.Erwinia.carotovora.PCC21.fasta            5498578 4842771  5390  4747   1652   1641    0.73   0.91    73.16 [72.81-73.51]  72.93 [72.55-73.30]  73.04 [72.68-73.40]  2068   73.31 [73.02-73.57]  5286  4258   2559   2511    53.12  1808   1751    73.91 [73.66-74.14]  74.11 [73.84-74.39]  74.01 [73.75-74.26]  1462   72.99 [72.48-73.39]  72.80 [72.26-73.17]  72.89 [72.37-73.28]  0.315 [0.305-0.324]  0.363 [0.352-0.374]  0.339 [0.328-0.349]  2795   2707    69.04 [68.42-69.53]  69.31 [68.63-69.84]  69.17 [68.52-69.68]  2326   0.487   73.70 [73.16-74.16]

Restricting fields in tab-delimited output

The tab-separated format can be useful to restrict the output to some specific fields, e.g. cANI values:

OGRI_B.sh  -t 48  -r  *.fasta  2>/dev/null  |  cut -f2,23-25

The above command line leads to the following simplified output:

GENO2                                        cANI12 [CI_cANI12]   cANI21 [CI_cANI21]   cANI [CI_cANI]
02.Escherichia.coli.O157.H7.EDL933.fasta     99.54 [99.46-99.63]  99.97 [99.96-99.99]  99.75 [99.71-99.81]
03.Escherichia.coli.K-12.MG1655.fasta        97.88 [97.74-98.00]  98.03 [97.91-98.12]  97.95 [97.82-98.06]
04.Escherichia.coli.CFT073.fasta             96.72 [96.61-96.85]  96.78 [96.65-96.92]  96.75 [96.63-96.88]
05.Shigella.flexneri.2a.2457T.fasta          97.33 [97.15-97.48]  97.60 [97.46-97.74]  97.46 [97.30-97.61]
06.Shigella.flexneri.2a.301.fasta            97.32 [97.15-97.45]  97.58 [97.43-97.70]  97.45 [97.29-97.57]
07.Salmonella.enterica.Typhimurium.LT2.fasta 80.81 [80.58-81.01]  80.98 [80.78-81.22]  80.89 [80.68-81.11]
08.Salmonella.enterica.Typhi.Ty2.fasta       81.13 [80.91-81.37]  81.06 [80.86-81.28]  81.09 [80.88-81.32]
09.Salmonella.enterica.Typhi.PM016.13.fasta  81.16 [80.97-81.34]  81.10 [80.87-81.31]  81.13 [80.92-81.32]
11.Yersinia.pestis.KIM5.fasta                72.97 [72.65-73.24]  73.00 [72.70-73.24]  72.98 [72.67-73.24]
12.Yersinia.pestis.CO92.fasta                72.95 [72.70-73.23]  73.00 [72.76-73.24]  72.97 [72.73-73.23]
14.Yersinia.enterocolitica.8081.fasta        72.96 [72.72-73.19]  73.11 [72.84-73.36]  73.03 [72.78-73.27]
15.Erwinia.carotovora.PCC21.fasta            73.91 [73.66-74.14]  74.11 [73.84-74.39]  74.01 [73.75-74.26]

Each row can be compared to the values reported by Konstantinidis and Tiedje (2005a; Table S1, Enterics section): 99.7%, 97.2%, 95.9%, 96.5%, 96.4%, 79.9%, 80.2%, 80.2%, 71.5%, 71.5%, 82.1%, 72.1%, respectively. Therefore, it is likely that the penultimate reported value (i.e. 82.1%) is a typo in Table S1.
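
The comparison above can be made explicit by computing, for each row, the difference between the OGRI estimate and the published value. The sketch below uses the averaged cANI column of the output above and the Table S1 values as quoted in the text; pairing the published values with the averaged cANI column (rather than cANI12 or cANI21) is an assumption, and the function name largest_gap is hypothetical.

```bash
#!/usr/bin/env bash
# largest_gap: read "label ogri_value published_value" lines on stdin and print
# the label and the signed difference (ogri - published) of largest magnitude.
# Hypothetical helper written for illustration; it is not provided by OGRI.
largest_gap() {
  awk '{ d = $2 - $3; a = (d < 0 ? -d : d)
         if (a > best) { best = a; label = $1; diff = d } }
       END { printf "%s %.2f\n", label, diff }'
}

# Columns: genome number, cANI from the output above, value quoted from Table S1.
largest_gap <<'EOF'
02 99.75 99.7
03 97.95 97.2
04 96.75 95.9
05 97.46 96.5
06 97.45 96.4
07 80.89 79.9
08 81.09 80.2
09 81.13 80.2
11 72.98 71.5
12 72.97 71.5
14 73.03 82.1
15 74.01 72.1
EOF
# prints: 14 -9.07
```

With these numbers, every other row differs by less than 2 percentage points, which is what singles out the 82.1% entry for genome 14.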

Restricting computations

Restricting the computations to the fragment-based pairwise measures (option -x) can be useful to significantly reduce the overall running times, e.g.

OGRI_B.sh  -t 6  -r  -x  0[1-5].*.fasta  2>/dev/null  |  cut -f2,11-13

The above command line leads to the following simplified output (i.e. restricted to ANI values):

GENO2                                        ANI12 [CI_ANI12]     ANI21 [CI_ANI21]     ANI [CI_ANI]
02.Escherichia.coli.O157.H7.EDL933.fasta     99.90 [99.87-99.92]  99.98 [99.97-99.99]  99.94 [99.92-99.95]
03.Escherichia.coli.K-12.MG1655.fasta        97.81 [97.68-97.92]  98.02 [97.92-98.11]  97.91 [97.80-98.01]
04.Escherichia.coli.CFT073.fasta             96.53 [96.39-96.68]  96.65 [96.50-96.77]  96.59 [96.44-96.72]
05.Shigella.flexneri.2a.2457T.fasta          97.12 [96.93-97.28]  97.75 [97.64-97.85]  97.43 [97.28-97.56]
06.Shigella.flexneri.2a.301.fasta            97.14 [96.98-97.31]  97.71 [97.57-97.82]  97.42 [97.27-97.56]
07.Salmonella.enterica.Typhimurium.LT2.fasta 80.19 [79.96-80.42]  80.30 [80.08-80.55]  80.24 [80.02-80.48]
08.Salmonella.enterica.Typhi.Ty2.fasta       80.42 [80.21-80.68]  80.32 [80.10-80.57]  80.37 [80.15-80.62]
09.Salmonella.enterica.Typhi.PM016.13.fasta  80.45 [80.19-80.63]  80.40 [80.17-80.64]  80.42 [80.18-80.63]
11.Yersinia.pestis.KIM5.fasta                71.92 [71.59-72.25]  71.91 [71.52-72.25]  71.91 [71.55-72.25]
12.Yersinia.pestis.CO92.fasta                71.92 [71.58-72.25]  72.04 [71.66-72.38]  71.98 [71.62-72.31]
14.Yersinia.enterocolitica.8081.fasta        71.80 [71.46-72.13]  71.79 [71.47-72.07]  71.79 [71.46-72.10]
15.Erwinia.carotovora.PCC21.fasta            73.16 [72.81-73.51]  72.93 [72.55-73.30]  73.04 [72.68-73.40]

Each of the first four rows (genomes 2-5) can be compared to the values reported by Goris et al. (2007; Table 2, Escherichia/Shigella hybridization group), i.e. ANI12: 99.68%, 97.53%, 96.00%, 97.36%, respectively, and ANI21: 99.63%, 97.25%, 95.85%, 96.54%, respectively.

Practical usage

As the recommended OGRIs are oANI and rAAI (see Methods), the following command line can be useful in many cases:

OGRI_B.sh  -t 48  -r  -z  *.fasta  2>/dev/null   |  cut -f2,5-8,16-18

The above command line leads to the following output:

GENO2                                        nfRBH  oANI [CI_oANI]       nCDS1 nCDS2  naRBH  ProCov  rAAI [CI_rAAI]
02.Escherichia.coli.O157.H7.EDL933.fasta     4446   99.97 [99.96-99.98]  5286  4246   4009   0.841   99.97 [99.93-99.99]
03.Escherichia.coli.K-12.MG1655.fasta        4013   98.05 [97.95-98.13]  5286  4295   3760   0.784   98.61 [98.42-98.72]
04.Escherichia.coli.CFT073.fasta             3877   96.77 [96.65-96.89]  5286  4799   3644   0.722   97.67 [97.40-97.91]
05.Shigella.flexneri.2a.2457T.fasta          3710   97.85 [97.75-97.94]  5286  4702   3486   0.698   98.40 [98.23-98.55]
06.Shigella.flexneri.2a.301.fasta            3715   97.83 [97.74-97.91]  5286  4715   3485   0.696   98.39 [98.23-98.51]
07.Salmonella.enterica.Typhimurium.LT2.fasta 3102   80.68 [80.45-80.91]  5286  4504   3162   0.645   87.08 [86.67-87.44]
08.Salmonella.enterica.Typhi.Ty2.fasta       3002   80.85 [80.62-81.10]  5286  4614   3079   0.622   87.28 [86.84-87.72]
09.Salmonella.enterica.Typhi.PM016.13.fasta  3008   80.84 [80.60-81.04]  5286  4625   3070   0.619   87.33 [86.89-87.73]
11.Yersinia.pestis.KIM5.fasta                2003   72.33 [72.07-72.60]  5286  4040   2309   0.495   73.68 [73.17-74.12]
12.Yersinia.pestis.CO92.fasta                1979   72.25 [71.99-72.52]  5286  4090   2318   0.494   73.64 [73.09-74.15]
14.Yersinia.enterocolitica.8081.fasta        2136   72.05 [71.74-72.30]  5286  4159   2495   0.528   73.53 [73.04-74.02]
15.Erwinia.carotovora.PCC21.fasta            2068   73.31 [73.02-73.57]  5286  4258   2326   0.487   73.70 [73.16-74.16]

These results suggest that the genomes 2-6 belong to the same species as E. coli O157:H7 Sakai (genome 1), contrary to the genomes 7-15 (i.e. oANI < 95% and rAAI < 95%). The rAAI values for the two Yersinia pestis genomes (11-12) can be compared to (two-way) AAI = 72% observed by Konstantinidis and Tiedje (2005b) between pairs of E. coli and Y. pestis genomes.
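
The 95% reading used above can be applied automatically to the raw output of option -z. The following sketch assumes tab-delimited input in which oANI is field 6 and rAAI is field 18 (see Notes), each written as a value followed by its bracketed confidence interval. The function name same_species_candidates is hypothetical, and the 95% cut-off is simply the one quoted in the paragraph above, not a value enforced by OGRI.

```bash
#!/usr/bin/env bash
# same_species_candidates [CUTOFF]: read the tab-delimited output of option -z
# on stdin and print GENO2 for rows whose oANI and rAAI are both >= CUTOFF
# (default: 95). Hypothetical helper for illustration; not provided by OGRI.
same_species_candidates() {
  awk -F'\t' -v cutoff="${1:-95}" '
    /^#/ { next }                  # skip the header line
    {
      split($6,  o, " ")           # "99.97 [99.96-99.98]" gives o[1] = 99.97
      split($18, r, " ")
      if (o[1] + 0 >= cutoff && r[1] + 0 >= cutoff) print $2
    }'
}
```

Under these assumptions, a possible usage is OGRI_B.sh -t 48 -r -z *.fasta 2>/dev/null | same_species_candidates 95.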

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410. doi:10.1016/S0022-2836(05)80360-2

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402. doi:10.1093/nar/25.17.3389

Barco RA, Garrity GM, Scott JJ, Amend JP, Nealson KH, Emerson D (2020) A Genus Definition for Bacteria and Archaea Based on a Standard Genome Relatedness Index. mBio, 11:e02475-19. doi:10.1128/mBio.02475-19

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2008) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Chun J, Rainey FA (2014) Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. International Journal of Systematic and Evolutionary Microbiology, 64(Pt_2):316-324. doi:10.1099/ijs.0.054171-0

Gertz EM, Yu Y-K, Agarwala R, Schäffer AA, Altschul SF (2006) Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST. BMC Biology, 4:41. doi:10.1186/1741-7007-4-41

Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM (2007) DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology, 57(1):81-91. doi:10.1099/ijs.0.64483-0

Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi:10.1186/1471-2105-11-119

Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9:5114. doi:10.1038/s41467-018-07641-9

Kim D, Park S, Chun J (2021) Introducing EzAAI: a pipeline for high throughput calculations of prokaryotic average amino acid identity. Journal of Microbiology, 59(5):476-480. doi:10.1007/s12275-021-1154-0

Konstantinidis KT, Tiedje JM (2005a) Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences of the United States of America, 102(7):2567-2572. doi:10.1073/pnas.0409727102

Konstantinidis KT, Tiedje JM (2005b) Towards a Genome-Based Taxonomy for Prokaryotes. Journal of Bacteriology, 187(18):6258-6264. doi:10.1128/JB.187.18.6258-6264.2005

Konstantinidis KT, Rossello-Mora R, Amann R (2017) Uncultivated microbes in need of their own taxonomy. The ISME Journal, 11:2399-2406. doi:10.1038/ismej.2017.113

Lee I, Kim YO, Park S-C, Chun J (2016) OrthoANI: An improved algorithm and software for calculating average nucleotide identity. International Journal of Systematic and Evolutionary Microbiology, 66(2):1100-1103. doi:10.1099/ijsem.0.000760

Luo C, Rodriguez-R LM, Konstantinidis KT (2014) MyTaxa: an advanced taxonomic classifier for genomic and metagenomic sequences. Nucleic Acids Research, 42(8):e73. doi:10.1093/nar/gku169

Nicholson AC, Gulvik CA, Whitney AM, Humrighouse BW, Bell ME, Holmes B, Steigerwalt AG, Villarma A, Sheth M, Batra D, Rowe LA, Burroughs M, Pryor JC, Bernardet J-F, Hugo C, Kämpfer P, Newman JD, McQuiston JR (2020) Division of the genus Chryseobacterium: Observation of discontinuities in amino acid identity values, a possible consequence of major extinction events, guides transfer of nine species to the genus Epilithonimonas, eleven species to the genus Kaistella, and three species to the genus Halpernia gen. nov., with description of Kaistella daneshvariae sp. nov. and Epilithonimonas vandammei sp. nov. derived from clinical specimens. International Journal of Systematic and Evolutionary Microbiology, 70:4432-4450. doi:10.1099/ijsem.0.003935

Novichkov V, Kaznadzey A, Alexandrova N, Kaznadzey D (2016) NSimScan: DNA comparison tool with increased speed, sensitivity and accuracy. Bioinformatics, 32(15):2380-2381. doi:10.1093/bioinformatics/btw126

Palmer M, Steenkamp ET, Blom J, Hedlund BP, Venter SN (2020) All ANIs are not created equal: implications for prokaryotic species boundaries and integration of ANIs into polyphasic taxonomy. International Journal of Systematic and Evolutionary Microbiology, 70(4):2937-2948. doi:10.1099/ijsem.0.004124

Qin Q-L, Xie B-B, Zhang X-Y, Chen X-L, Zhou B-C, Zhou J, Oren A, Zhang Y-Z (2014) A Proposed Genus Boundary for the Prokaryotes Based on Genomic Insights. Journal of Bacteriology, 196(12):2210-2215. doi:10.1128/JB.01688-14

Richter M, Rossello-Mora R (2009) Shifting the genomic gold standard for the prokaryotic species definition. Proceedings of the National Academy of Sciences of the United States of America, 106(45):19126-19131. doi:10.1073/pnas.0906412106

Rodriguez-R LM, Konstantinidis KT (2014) Bypassing cultivation to identify bacterial species. Microbe, 9(3):111-118.

Suresh G, Lodha TD, Indu B, Sasikala C, Ramana CV (2019) Taxogenomics Resolves Conflict in the Genus Rhodobacter: A Two and Half Decades Pending Thought to Reclassify the Genus Rhodobacter. Frontiers in Microbiology, 10:2480. doi:10.3389/fmicb.2019.02480

Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, Pati A (2015) Microbial species delineation using whole genome sequences. Nucleic Acids Research, 43(14):6761-6771. doi:10.1093/nar/gkv657

Yoon S-H, Ha S-M, Lim J, Kwon S, Chun J (2017) A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie van Leeuwenhoek, 110(10):1281-1286. doi:10.1007/s10482-017-0844-4

Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7(1-2):203-214. doi:10.1089/10665270050081478