OGRI (Overall Genome Relatedness Indices; Chun & Rainey 2014) is a command line program written in Bash to compute pairwise similarity measures between whole genome sequences. Every computed similarity is based on local sequence alignments:

▹ CDS-based ANI (cANI; Konstantinidis & Tiedje 2005a; gANI; Varghese et al. 2015),

▹ (one-way) Average Amino-acid Identity (AAI; Konstantinidis & Tiedje 2005b),

The key aim of OGRI is to provide a wide range of genome proximity metrics in an accurate way, i.e. implemented following the specific descriptions given by each associated article (see Methods). Consequently, OGRI is not expected to run very fast (e.g. OGRI_B requires up to one minute to deal with two 5 Mbp-long genomes on 12 threads), even though faster running times are expected with a larger number of threads.

Dependencies

You will need to install the required programs listed in the following table, or to verify that they are already installed with the required version.

OGRI tool	program	package	version	sources
`OGRI_B`	gawk	-	> 4.0.0	ftp.gnu.org/gnu/gawk
`OGRI_B`	prodigal	-	≥ 2.6.3	github.com/hyattpd/Prodigal
`OGRI_B`	makeblastdb blastn blastp tblastn	blast+	≥ 2.12.0	ftp.ncbi.nlm.nih.gov/blast/executables/blast+

Installation and execution

git clone https://gitlab.pasteur.fr/GIPhy/OGRI.git

cd OGRI/
chmod +x OGRI_B.sh

./OGRI_B.sh [options]

If at least one of the indicated programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), the OGRI tools will exit with an error message. To set a required program that is not available on your $PATH variable, edit the file and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS.
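Before running OGRI, it may help to verify which of the programs listed under Dependencies are reachable from your $PATH. The following is a minimal sketch only: the function name `check_deps` is hypothetical, it is not part of OGRI, and it does not verify the required versions.

```shell
# Sketch (not part of OGRI): report every program missing from $PATH.
# check_deps is a hypothetical helper name.
check_deps() {
  local missing=0 prog
  for prog in "$@"; do
    # command -v succeeds only when the program can be resolved by the shell
    if ! command -v "$prog" >/dev/null 2>&1; then
      echo "missing: $prog" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Programs taken from the Dependencies table
check_deps gawk prodigal makeblastdb blastn blastp tblastn \
  || echo "install the missing programs, or set their paths in the REQUIREMENTS block" >&2
```

A program reported as missing here can still be used by indicating its local path within the code block REQUIREMENTS, as described above.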

Usage

Notes

field name	field number (default)	field number (option `-x`)	field number (option `-y`)	field number (option `-z`)
GENO1	1	1	1	1
GENO2	2	2	2	2
lgt1	3	3	3	3
lgt2	4	4	4	4
nFRA1	5	5
nFRA2	6	6
nFRA12	7	7
nFRA21	8	8
cDNA12	9	9
cDNA21	10	10
ANI12 [CI_ANI12]	11	11
ANI21 [CI_ANI21]	12	12
ANI [CI_ANI]	13	13
nfRBH	14	14		5
oANI [CI_oANI]	15	15		6
nCDS1	16		5	7
nCDS2	17		6	8
nCDS12	18		7
nCDS21	19		8
POCP	20		9
cCDS12	21		10
cCDS21	22		11
cANI12 [CI_cANI12]	23		12
cANI21 [CI_cANI21]	24		13
cANI [CI_cANI]	25		14
ngRBH	26		15	9
gANI12 [CI_gANI12]	27		16	10
gANI21 [CI_gANI21]	28		17	11
gANI [CI_gANI]	29		18	12
AF12 [CI_AF12]	30		19	13
AF21 [CI_AF21]	31		20	14
AF [CI_AF]	32		21	15
mCDS12	33		22
mCDS21	34		23
AAI12 [CI_AAI12]	35		24
AAI21 [CI_AAI21]	36		25
AAI [CI_AAI]	37		26
naRBH	38		27	16
ProCov	39		28	17
rAAI [CI_rAAI]	40		29	18

Methods

Given two genomes 1 and 2, different local alignments (best hit of each sequence from a set against the sequences from another set) are obtained using different flavors of BLAST (Altschul et al. 1990; Camacho et al. 2008):

These various local alignments are next specifically filtered, and the resulting sets of local similarities are used to derive different pairwise similarity measures:

Example

In order to illustrate the usefulness of OGRI, the following use case example describes its usage for estimating pairwise similarity measures between 13 Enterobacteriaceae chromosomes, as published by Konstantinidis and Tiedje (2005a), as well as Goris et al. (2007).

Downloading genome sequences

Download the 13 chromosome sequence files using the following Bash command lines:

URL="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA";
wget -q -O - $URL/000/008/865/GCA_000008865.2_ASM886v2/GCA_000008865.2_ASM886v2_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 01.Escherichia.coli.O157.H7.Sakai.fasta ;
wget -q -O - $URL/000/006/665/GCA_000006665.1_ASM666v1/GCA_000006665.1_ASM666v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 02.Escherichia.coli.O157.H7.EDL933.fasta ;
wget -q -O - $URL/000/273/425/GCA_000273425.1_Esch_coli_MG12655_V1/GCA_000273425.1_Esch_coli_MG12655_V1_genomic.fna.gz | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 03.Escherichia.coli.K-12.MG1655.fasta ;
wget -q -O - $URL/000/007/445/GCA_000007445.1_ASM744v1/GCA_000007445.1_ASM744v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 04.Escherichia.coli.CFT073.fasta ;
wget -q -O - $URL/000/007/405/GCA_000007405.1_ASM740v1/GCA_000007405.1_ASM740v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 05.Shigella.flexneri.2a.2457T.fasta ;
wget -q -O - $URL/000/006/925/GCA_000006925.2_ASM692v2/GCA_000006925.2_ASM692v2_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 06.Shigella.flexneri.2a.301.fasta ;
wget -q -O - $URL/000/006/945/GCA_000006945.2_ASM694v2/GCA_000006945.2_ASM694v2_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 07.Salmonella.enterica.Typhimurium.LT2.fasta ;
wget -q -O - $URL/000/007/545/GCA_000007545.1_ASM754v1/GCA_000007545.1_ASM754v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 08.Salmonella.enterica.Typhi.Ty2.fasta ;
wget -q -O - $URL/001/302/605/GCA_001302605.1_ASM130260v1/GCA_001302605.1_ASM130260v1_genomic.fna.gz                   | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 09.Salmonella.enterica.Typhi.PM016.13.fasta ;
wget -q -O - $URL/000/970/105/GCA_000970105.1_ASM97010v1/GCA_000970105.1_ASM97010v1_genomic.fna.gz                     | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 11.Yersinia.pestis.KIM5.fasta ;
wget -q -O - $URL/000/009/065/GCA_000009065.1_ASM906v1/GCA_000009065.1_ASM906v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 12.Yersinia.pestis.CO92.fasta ;
wget -q -O - $URL/000/009/345/GCA_000009345.1_ASM934v1/GCA_000009345.1_ASM934v1_genomic.fna.gz                         | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 14.Yersinia.enterocolitica.8081.fasta ;
wget -q -O - $URL/000/294/535/GCA_000294535.1_ASM29453v1/GCA_000294535.1_ASM29453v1_genomic.fna.gz                     | gunzip -c | awk '/^>/{if(NR>1)exit}{print}' > 15.Erwinia.carotovora.PCC21.fasta ;

Note that each file is numbered according to the Enterics part of the Table S1 in Konstantinidis and Tiedje (2005a).
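Because each download command keeps only the first sequence of the assembly, every resulting file is expected to contain exactly one FASTA record. A quick sanity check can be sketched as follows; the function name `count_fasta_records` is hypothetical and not part of OGRI.

```shell
# Sketch (not part of OGRI): print "<file><TAB><number of FASTA records>"
# for each file given as argument. count_fasta_records is a hypothetical name.
count_fasta_records() {
  local f
  for f in "$@"; do
    # a FASTA record starts with a header line beginning with '>'
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
  done
}
```

Running `count_fasta_records *.fasta` in the download directory should list the 13 files, each with a count of 1; a count of 0 suggests a failed or truncated download.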

Running OGRI to compare two genomes

Run the following command line to compare the first two Escherichia coli genomes (using 12 threads):

OGRI_B.sh  -t 12  01.Escherichia.coli.O157.H7.Sakai.fasta  02.Escherichia.coli.O157.H7.EDL933.fasta

It can be observed that the ANI of E. coli O157:H7 Sakai (genome 1) against E. coli O157:H7 EDL933 (genome 2) is ANI12 = 99.90%, with a percentage of conserved DNA of cDNA12 = 98.81%. These values can be compared to the ones reported by Goris et al. (2007; Table 2): 99.68% and 99.6%, respectively. One also observes ANI21 = 99.98% and cDNA21 = 87.64%, whereas Goris et al. (2007) reported 99.63% and 97.3%, respectively. Such differences can be explained by the way each genome sequence is decomposed into consecutive fragments.

The CDS-based ANI of E. coli O157:H7 EDL933 (genome 2) against E. coli O157:H7 Sakai (genome 1) is cANI21 = 99.97%, with 4226/4246=99.52% conserved genes. These values can be compared to the ones reported by Konstantinidis and Tiedje (2005a; Table S1): 99.7% and 98.6%, respectively. Such differences can be explained by the different numbers of predicted CDS (i.e. nCDS1 = 5286 and nCDS2 = 4246), whereas Konstantinidis and Tiedje (2005a; Table S1) reported 5361 and 5324 CDS, respectively.

Running OGRI to compare one genome against several others

To obtain more results against E. coli O157:H7 Sakai (genome 1), OGRI can be run on the whole set of downloaded genomes to display all metrics in tab-delimited format (option -r), e.g.

OGRI_B.sh  -t 48  -r  *.fasta  2>/dev/null

Restricting fields in tab-delimited output

The tab-separated format can be useful to restrict the output to some specific fields, e.g. cANI values:

OGRI_B.sh  -t 48  -r  *.fasta  2>/dev/null  |  cut -f2,23-25

Each row can be compared to the values reported by Konstantinidis and Tiedje (2005a; Table S1, Enterics section): 99.7%, 97.2%, 95.9%, 96.5%, 96.4%, 79.9%, 80.2%, 80.2%, 71.5%, 71.5%, 82.1%, 72.1%, respectively. Therefore, it is likely that the penultimate reported value (i.e. 82.1%) is a typo in Table S1.

Restricting computations

Restricting the computations to the fragment-based pairwise measures (option -x) can be useful to significantly reduce the overall running times, e.g.

OGRI_B.sh  -t 6  -r  -x  0[1-5].*.fasta  2>/dev/null  |  cut -f2,11-13

The above command line leads to the following simplified output (i.e. restricted to ANI values):

Each of the first four rows (genomes 2-5) can be compared to the values reported by Goris et al. (2007; Table 2, Escherichia/Shigella hybridization group), i.e. ANI12: 99.68%, 97.53%, 96.00%, 97.36%, respectively, and ANI21: 99.63%, 97.25%, 95.85%, 96.54%, respectively.

Practical usage

As the recommended OGRIs are oANI and rAAI (see Methods), the following command line can be useful in many cases:

OGRI_B.sh  -t 48  -r  -z  *.fasta  2>/dev/null   |  cut -f2,5-8,16-18

These results suggest that the genomes 2-6 belong to the same species as E. coli O157:H7 Sakai (genome 1), contrary to the genomes 7-15 (i.e. oANI < 95% and rAAI < 95%). The rAAI values for the two Yersinia pestis genomes (11-12) can be compared to (two-way) AAI = 72% observed by Konstantinidis and Tiedje (2005b) between pairs of E. coli and Y. pestis genomes.
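When many genomes are compared, the 95% oANI criterion used above can be applied directly to the tab-delimited output. The following is a minimal sketch under stated assumptions: the function name `filter_by_oani` is hypothetical, the input is the `-r -z` layout (where oANI is field 6, see Notes), and a confidence interval, if present, follows the value within the same field.

```shell
# Sketch (not part of OGRI): keep rows whose oANI is at least a threshold.
# Assumes the -z field layout, where oANI [CI_oANI] is field 6.
# filter_by_oani is a hypothetical helper name.
filter_by_oani() {
  local threshold="${1:-95}"
  # "$6 + 0" converts the leading number of the field, so any text that
  # follows the value (e.g. a bracketed interval) is ignored; a
  # non-numeric header line, if any, evaluates to 0 and is dropped
  awk -F'\t' -v t="$threshold" '($6 + 0) >= t'
}
```

For instance, `OGRI_B.sh  -t 48  -r  -z  *.fasta  2>/dev/null  |  filter_by_oani 95  |  cut -f2,6,18` would list only the genomes reaching the threshold, together with their oANI and rAAI fields.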

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3):403-410. doi:10.1016/S0022-2836(05)80360-2

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389-3402. doi:10.1093/nar/25.17.3389

Barco RA, Garrity GM, Scott JJ, Amend JP, Nealson KH, Emerson D (2020) A Genus Definition for Bacteria and Archaea Based on a Standard Genome Relatedness Index. mBio, 11:e02475-19. doi:10.1128/mBio.02475-19

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2008) BLAST+: architecture and applications. BMC Bioinformatics, 10:421. doi:10.1186/1471-2105-10-421

Chun J, Rainey FA (2014) Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. International Journal of Systematic and Evolutionary Microbiology, 64(Pt_2):316-324. doi:10.1099/ijs.0.054171-0

Gertz EM, Yu Y-K, Agarwala R, Schäffer AA, Altschul SF (2006) Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST. BMC Biology, 4:41. doi:10.1186/1741-7007-4-41

Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM (2007) DNA-DNA hybridization values and their relationship to whole-genome sequence similarities. International Journal of Systematic and Evolutionary Microbiology, 57(1):81-91. doi:10.1099/ijs.0.64483-0

Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi:10.1186/1471-2105-11-119

Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nature Communications, 9:5114. doi:10.1038/s41467-018-07641-9

Kim D, Park S, Chun J (2021) Introducing EzAAI: a pipeline for high throughput calculations of prokaryotic average amino acid identity. Journal of Microbiology, 59(5):476-480. doi:10.1007/s12275-021-1154-0

Konstantinidis KT, Tiedje JM (2005a) Genomic insights that advance the species definition for prokaryotes. Proceedings of the National Academy of Sciences of the United States of America, 102(7):2567-2572. doi:10.1073/pnas.0409727102

Konstantinidis KT, Tiedje JM (2005b) Towards a Genome-Based Taxonomy for Prokaryotes. Journal of Bacteriology, 187(18):6258-6264. doi:10.1128/JB.187.18.6258-6264.2005

Konstantinidis KT, Rossello-Mora R, Amann R (2017) Uncultivated microbes in need of their own taxonomy. The ISME Journal, 11:2399-2406. doi:10.1038/ismej.2017.113

Lee I, Kim YO, Park S-C, Chun J (2016) OrthoANI: An improved algorithm and software for calculating average nucleotide identity. International Journal of Systematic and Evolutionary Microbiology, 66(2):1100-1103. doi:10.1099/ijsem.0.000760

Luo C, Rodriguez-R LM, Konstantinidis KT (2014) MyTaxa: an advanced taxonomic classifier for genomic and metagenomic sequences. Nucleic Acids Research, 42(8):e73. doi:10.1093/nar/gku169

Nicholson AC, Gulvik CA, Whitney AM, Humrighouse BW, Bell ME, Holmes B, Steigerwalt AG, Villarma A, Sheth M, Batra D, Rowe LA, Burroughs M, Pryor JC, Bernardet J-F, Hugo C, Kämpfer P, Newman JD, McQuiston JR (2020) Division of the genus Chryseobacterium: Observation of discontinuities in amino acid identity values, a possible consequence of major extinction events, guides transfer of nine species to the genus Epilithonimonas, eleven species to the genus Kaistella, and three species to the genus Halpernia gen. nov., with description of Kaistella daneshvariae sp. nov. and Epilithonimonas vandammei sp. nov. derived from clinical specimens. International Journal of Systematic and Evolutionary Microbiology, 70:4432-4450. doi:10.1099/ijsem.0.003935

Novichkov V, Kaznadzey A, Alexandrova N, Kaznadzey D (2016) NSimScan: DNA comparison tool with increased speed, sensitivity and accuracy. Bioinformatics, 32(15):2380-2381. doi:10.1093/bioinformatics/btw126

Palmer M, Steenkamp ET, Blom J, Hedlund BP, Venter SN (2020) All ANIs are not created equal: implications for prokaryotic species boundaries and integration of ANIs into polyphasic taxonomy. International Journal of Systematic and Evolutionary Microbiology, 70(4):2937-2948. doi:10.1099/ijsem.0.004124

Qin Q-L, Xie B-B, Zhang X-Y, Chen X-L, Zhou B-C, Zhou J, Oren A, Zhang Y-Z (2014) A Proposed Genus Boundary for the Prokaryotes Based on Genomic Insights. Journal of Bacteriology, 196(12):2210-2215. doi:10.1128/JB.01688-14

Richter M, Rossello-Mora R (2009) Shifting the genomic gold standard for the prokaryotic species definition. Proceedings of the National Academy of Sciences of the United States of America, 106(45):19126-19131. doi:10.1073/pnas.0906412106

Rodriguez-R LM, Konstantinidis KT (2014) Bypassing cultivation to identify bacterial species. Microbe, 9(3):111-118.

Suresh G, Lodha TD, Indu B, Sasikala C, Ramana CV (2019) Taxogenomics Resolves Conflict in the Genus Rhodobacter: A Two and Half Decades Pending Thought to Reclassify the Genus Rhodobacter. Frontiers in Microbiology, 10:2480. doi:10.3389/fmicb.2019.02480

Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, Pati A (2015) Microbial species delineation using whole genome sequences. Nucleic Acids Research, 43(14):6761-6771. doi:10.1093/nar/gkv657

Yoon S-H, Ha S-M, Lim J, Kwon S, Chun J (2017) A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie van Leeuwenhoek, 110(10):1281-1286. doi:10.1007/s10482-017-0844-4

Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7(1-2):203-214. doi:10.1089/10665270050081478
