fq2dna (FASTQ files to de novo assembly) is a command line tool written in Bash to ease the de novo assembly of archaea, bacteria or virus genomes from raw high-throughput sequencing (HTS) paired-end (PE) reads.
Every data pre- and post-processing step is managed by fq2dna (e.g. HTS read filtering and enhancing, well-tuned de novo assemblies, scaffold sequence accuracy assessment). The main purpose of fq2dna is to efficiently use different methods, programs and tools to quickly infer accurate genome assemblies (e.g. from 5 to 15 minutes to deal with a bacteria HTS sample using 12 threads). This mini-workflow can therefore be very useful to deal with large batches of whole-genome shotgun sequencing data.
fq2dna runs on UNIX, Linux and most OS X operating systems.
You will need to install the required programs and tools listed in the following table, or to verify that they are already installed with the required version.
program | package | version | sources |
---|---|---|---|
contig_info | - | > 2.0 | gitlab.pasteur.fr/GIPhy/contig_info |
FASTA2AGP | - | ≥ 2.0 | gitlab.pasteur.fr/GIPhy/FASTA2AGP |
fqCleanER | - | 21.06 | gitlab.pasteur.fr/GIPhy/fqCleanER |
fqstats | fqtools | 1.1a | ftp.pasteur.fr/pub/gensoft/projects/fqtools |
gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
minimap 2 | - | > 2.17 | github.com/lh3/minimap2 |
ntCard | - | > 1.2 | github.com/bcgsc/ntCard |
Platon | - | > 1.5 | github.com/oschwengers/platon |
Prokka | - | ≥ 1.14.5 | github.com/tseemann/prokka |
SAM2MAP | SAM2MSA | ≥ 0.3.3.1 | gitlab.pasteur.fr/GIPhy/SAM2MSA |
SPAdes | - | > 3.15.1 | github.com/ablab/spades |
A. Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/fq2dna.git
B. Give the execute permission to the file
fq2dna.sh
:
chmod +x fq2dna.sh
C. Execute fq2dna with the following command line model:
./fq2dna.sh [options]
D. If at least one of the required program (see Dependencies) is not available on your
$PATH
variable (or if one compiled binary has a different
default name), fq2dna will exit with an error message. When
running fq2dna without option, a documentation should be
displayed; otherwise, the name of the missing program is displayed
before existing. In such a case, edit the file fq2dna.sh
and indicate the local path to the corresponding binary(ies) within the
code block REQUIREMENTS
(approximately lines 60-150). For
each required program, the table below reports the corresponding
variable assignment instruction to edit (if needed) within the code
block REQUIREMENTS
program | variable assignment | program | variable assignment | |
---|---|---|---|---|
contig_info | CONTIG_INFO_BIN=contig_info; |
ntcard | NTCARD_BIN=ntcard; |
|
FASTA2AGP | FASTA2AGP_BIN=FASTA2AGP; |
Platon | PLATON_BIN=platon; |
|
fqCleanER | FQCLEANER_BIN=fqCleanER; |
Prokka | PROKKA_BIN=prokka; |
|
fqstats | FQSTATS_BIN=fqstats; |
SAM2MAP | SAM2MAP_BIN=SAM2MAP; |
|
gawk | GAWK_BIN=gawk; |
SPAdes | SPADES_BIN=spades.py; |
Note that depending on the installation of some required programs,
the corresponding variable can be assigned with complex commands. For
example, as FASTA2AGP is a Java tool that can be run using a
Java virtual machine, the executable jar file FASTA2AGP.jar
can be used by fq2dna by editing the corresponding variable
assignment instruction as follows:
FASTA2AGP_BIN="java -jar FASTA2AGP.jar"
.
Run fq2dna without option to read the following documentation:
USAGE: fq2dna.sh [options]
Processing and assembling high-throughput sequencing (HTS) paired-end reads:
+ processing HTS reads using different steps: deduplicating [D], trimming/clipping [T], error
correction [E], contaminant removal [C], merging [M], and/or digital normalization [N]
+ de novo assembly [dna] of the whole genome from processed HTS reads
OPTIONS:
-1 <infile> fwd (R1) FASTQ-formatted input file name from paired-ends (PE) library 1; every
input file can be uncompressed (file extension: .fastq or .fq) or compressed
using either gzip (.gz), bzip2 (.bz or .bz2) or DSRC (.dsrc or .dsrc2)
-2 <infile> rev (R2) FASTQ-formatted input file name from PE library 1
-3 <infile> fwd (R1) FASTQ-formatted input file name from PE library 2
-4 <infile> rev (R2) FASTQ-formatted input file name from PE library 2
-5 <infile> fwd (R1) FASTQ-formatted input file name from PE library 3
-6 <infile> rev (R2) FASTQ-formatted input file name from PE library 3
-o <outdir> path and name of the output directory (mandatory option)
-b <string> base name for output files (mandatory option)
-s <char> to set a predefined strategy for processing HTS reads among the following ones:
A Archaea: DT(C)E + [N|MN] + dna + annotation
B Bacteria: DT(C)E + [N|MN] + dna + annotation
E Eukaryote: DT(C) + [N|MN] + dna
P Prokaryote: DT(C)E + [N|MN] + dna
S Standard: DT(C) + [N|MN] + dna
V Virus: DT(C)E + [N|MN] + dna + annotation
(mandatory option; default: S)
-L <int> minimum required length for a contig (default: 300)
-T <"G S I"> Genus (G), Species (S) and Isolate (I) names to be used during annotation step;
should be set between quotation marks and separated by a blank space; only with
options -s A, -s B or -s V (default: "Genus sp. STRAIN")
-q <int> quality score threshold; all bases with Phred score below this threshold are
considered as non-confident during step [T] (default: 15)
-l <int> minimum required length for a read (default: 50)
-p <int> maximum allowed percentage of non-confident bases (as ruled by option -q) per
read (default: 50)
-c <int> minimum allowed coverage depth during step [N] (default: 2)
-C <int> maximum allowed coverage depth during step [N] (default: 60)
-a <infile> to set a file containing every alien oligonucleotide sequence (one per line) to
be clipped during step [T] (see below)
-a <string> one or several key words (separated with commas), each corresponding to a set
of alien oligonucleotide sequences to be clipped during step [T]:
POLY nucleotide homopolymers
NEXTERA Illumina Nextera index Kits
IUDI Illumina Unique Dual index Kits
AMPLISEQ AmpliSeq for Illumina Panels
TRUSIGHT_PANCANCER Illumina TruSight RNA Pan-Cancer Kits
TRUSEQ_UD Illumina TruSeq Unique Dual index Kits
TRUSEQ_CD Illumina TruSeq Combinatorial Dual index Kits
TRUSEQ_SINGLE Illumina TruSeq Single index Kits
TRUSEQ_SMALLRNA Illumina TruSeq Small RNA Kits
Note that these sets of alien oligonucleotide sequences are not exhaustive and
will never replace the exact oligos used for library preparation (default:
"POLY")
-a AUTO to perform de novo inference of 3' alien oligonucleotide sequence(s) of at
least 20 nucleotide length for step [T]; selected sequences are completed with
those from "POLYS" (see above)
-A <infile> to set sequence or k-mer model file(s) to carry out contaminant read removal
during step [C]; several comma-separated file names can be specified; allowed
file extensions: .fa, .fasta, .fna, .kmr or .kmz
-t <int> number of threads (default: 12)
-w <dir> tmp directory (default: $TMPDIR, otherwise /tmp)
-x to not remove (after completing) the tmp directory inside the one set with
option -w (default: not set)
-h prints this help and exit
EXAMPLES:
fq2dna.sh -1 r1.fq -2 r2.fq -o out -b echk12 -s P -T "Escherichia coli K12" -a AUTO
fq2dna.sh -1 rA.1.fq -2 rA.2.fq -3 rB.1.fq.gz -4 rB.2.fq.gz -o out -b name -a NEXTERA
fq2dna is able to consider up to three paired-ends
libraries (PE; options -1
to -6
). Input files
should be in FASTQ format and can be compressed using gzip, bzip2 or DSRC
(Roguski and Deorowicz 2014).
In brief, fq2dna first performs initial basic HTS read
preprocessing (i.e. step I) using fqCleanER,
i.e. deduplication (D), trimming/clipping (T; as ruled by options
-a
, -q
, -l
and -p
)
and error correction (E). When specified (option -A
), a
contaminating HTS read removal step (C ) can also be performed.
Next, fq2dna creates two distinct datasets from the
preprocessed HTS reads: a first one (i.e. step N) obtained using a
digital normalization procedure (N; as ruled by options -C
and -c
), and a second one (i.e. step M) by merging the PE
HTS reads with short insert size (M) followed by a digital normalization
procedure. Each of these two HTS datasets is used to infer a de
novo genome assembly (dnaN and dnaM, respectively) using SPAdes
(Bankevich et al. 2012).
Among the two genome assemblies, the less
fragmented one is retained, and next used together with its generating
HTS reads to infer a genome coverage profile (e.g. Lindner et al. 2013).
Based on the coverage profile, sufficiently long and well-covered
scaffold sequences are finally selected.
Depending on the specified
strategy (option -s
), selected scaffold sequences are
classified as chromosome/plasmid/undetermined (i.e. CPU) using Platon
(Schwengers et al. 2020) and/or annotated using Prokka (Seemann
2014).
Output files are all defined by the same specified prefix
(mandatory option -b
) and written into a specified output
directory (mandatory option -o
). Each output file content
is determined by its file extension:
Temporary files are written into a dedicated directory created
into the $TMPDIR
directory (when defined, otherwise
tmp/
). When possible, it is highly recommended to set a
temp directory with large capacity (option -w
).
Default options lead to a de novo assembly inferred from
a subset of well-suited HTS read (digital normalization) corresponding
to 60x coverage depth (option -C
), which is a good tradeoff
to observe accurate results with fast running times. It is therefore
expected that (i) the genome coverage profile (output file
<prefix>._cov.info.txt_) is a unimodal distribution with
both mean and mode close to 60x, and (ii) most of the selected scaffold
sequences (<prefix>._scf.fasta_) are supported by 60x
coverage depth on average (see covr values in FASTA headers).
Strong deviation from the specified coverage depth (option
-C
) can be indicative of a sequencing problem, e.g. low
coverage depth (small mode value), locally insufficient coverage depth
(very negative skewness), sample contamination (bimodal
distribution).
Of important note is that the genome coverage profile is used to fit a theoretical distribution using SAM2MAP (i.e. Negative Binomial or Generalized Poisson, depending on the variance). Such a theoretical distribution enables to determine a coverage depth cutoff, under which any observed coverage depth is considered as significantly low (p-value ≤ 0.00001). Every assembled base with coverage depth lower than the estimated coverage cutoff (specified in FASTA headers) is written in lowercase (into both files <prefix>._all.fasta_ and <prefix>._scf.fasta_) and should be considered with caution. If the selected scaffold sequences (i.e. with average coverage depth greater than the coverage cutoff; output file <prefix>._scf.fasta_) do not seems to represent the whole genome, try to consider the entire set of assembled scaffolds (output file <prefix>._all.fasta_).
For more details about the different assembly and scaffold sequence statistics (output files <prefix>._dna.info.txt_ and <prefix>._scf.info.txt_, respectively), see the documentation of contig_info.
For more details about the chromosome/plasmid/undetermined
classification of each scaffold sequence (i.e. column CPU in
<prefix>._scf.info.txt_ when using option
-s B
), see the documentation of Platon.
For more details about the annotation procedure when using
strategies A, B or V (option -s
; output files
<prefix>._gbk_ and
<prefix>._gbk.info.txt_), see the documentation of Prokka.
In order to illustrate the usefulness of fq2dna and to better describe its output files, the following use case example describes its usage for (re)assembling the draft genome of Listeria monocytogenes 2HF33 (Duru et al. 2020).
All output files are available in the directory example/ (the four sequence files were compressed using gzip).
Downloading input files
Paired-end sequencing of this genome was performed using Illumina Miseq, and the resulting pair of (compressed) FASTQ files (112 Mb and 128 Mb, respectively) can be downloaded using the following command lines:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR403/006/ERR4032786/ERR4032786_2.fastq.gz
As phiX genome was used as spike-in during Illumina sequencing (Duru et al. 2020), this putative contaminating sequence (5.4 kb) can be downloded using the following command line:
wget -O phiX.fasta "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_001422.1"
Running fq2dna
Below is a typical command line to run fq2dna on PE FASTQ files to deal with such a bacteria HTS data:
./fq2dna.sh -1 ERR4032786_1.fastq.gz -2 ERR4032786_2.fastq.gz -a NEXTERA -A phiX.fasta \\
-s B -T "Listeria monocytogenes 2HF33" -o 2HF33 -b Lm.2HF33 -t 12 -w /tmp
Of note, option -a NEXTERA
was set to clip the Nextera
XT technical oligonucleotides used during library preparation (see Duru
et al. 2020), the putative contaminating sequence file
phiX.fasta was set using option -A
, and the
strategy ‘Bacteria’ was trivially set (option -s B
).
Different log outputs can be observed depending on the OS and/or installed program versions (see Dependencies). Below is the one observed when using fq2dna together with program and tool versions listed into example/program.versions.txt:
# fq2dna v21.06ac
# Copyright (C) 2016-2021 Institut Pasteur
# OS: linux-gnu
# Bash: 4.4.19(1)-release
# input file(s):
> PE lib 1
+ FQ11: [111.9 Mb] ERR4032786_1.fastq.gz
+ FQ12: [127.8 Mb] ERR4032786_2.fastq.gz
# output directory
+ OUTDIR=2hf33
# tmp directory
+ TMP_DIR=/tmp/fq2dna.Lm.2HF33.Pe6bnviTj
# STRATEGY B: Bacteria
[00:01] Deduplicating, Trimming, Clipping, Decontaminating and Correcting reads ... [ok]
[02:37] Merging and/or Normalizing reads ... [ok]
[03:34] Approximating genome size ... [ok]
> 3077907 bps (N=3078225 M=3077589)
[03:38] Assembling genome with/without PE read merging (dnaM/dnaN) ... [ok]
> [dnaN] covr=63 arl=239 k=21,33,55,77,99,121
> [dnaM] covr=60 arl=261 k=21,33,55,77,99,121
[05:41] Comparing genome assemblies ... [ok]
> [dnaN] Nseq=34 Nres=3099209 NG50=358956 auGN=325895
> [dnaM] Nseq=35 Nres=3099182 NG50=358475 auGN=325314
> selecting dnaN
[05:42] Aligning reads against scaffolds ..... [ok]
> min. cov. cutoff: 39
> avg. cov. depth: 61.5760
> bimodality coef.: 0.222822
[06:07] Annotating scaffold sequences ... [ok]
> taxon: Listeria monocytogenes 2HF33
> locus tag: LISMON2HF33
# output files:
+ de novo assembly (all scaffolds): 2hf33/Lm.2HF33.all.fasta
+ coverage profile summary: 2hf33/Lm.2HF33.cov.info.txt
+ de novo assembly (selected scaffolds): 2hf33/Lm.2HF33.scf.fasta
+ sequence stats (selected scaffolds): 2hf33/Lm.2HF33.scf.info.txt
+ de novo assembly (selected contigs): 2hf33/Lm.2HF33.agp.fasta
+ scaffolding info: 2hf33/Lm.2HF33.agp
+ descriptive statistics: 2hf33/Lm.2HF33.dna.info.txt
+ insert size statistics: 2hf33/Lm.2HF33.isd.txt
+ annotation (selected scaffolds): 2hf33/Lm.2HF33.scf.gbk
+ annotation info: 2hf33/Lm.2HF33.scf.gbk.info.txt
[08:31] exit
This log output shows that the complete analysis was performed on 12 threads in less than 9 minutes (HTS read processing in < 4 minutes, de novo assemblies in ~2 minutes, and scaffold sequence post-processing in < 3 minutes). One can also observe that the selected de novo assembly was dnaN (i.e. greater NG50 and auNG metrics), therefore showing that the PE HTS read merging step (M) does not always lead to better genome assemblies.
Output files
The genome coverage profile summary file
(Lm.2HF33.cov.info.txt; see excerpt below) shows an average
coverage depth of ~61x, corresponding to a symmetrical coverage depth
distribution with mode of 61x, as expected (default option
-C 60
). The bimodality coefficient is 0.223, far below the
bimodality cutoff (i.e. 0.55; see e.g. Ellison 1987, Pfister et
al. 2013), suggesting that no contamination occurred in the HTS reads
selected for assembly.
= observed coverage distribution: no.pos=3099409
avg=61.5760 skewness=-1.850311 excess.kurtosis=16.852882 bimodality.coef=0.222822
# Poisson(l) coverage tail distribution: l=0.26470588 w=0.00004388
* GP(l',r) coverage distribution: l'=91.38551745 r=-0.48001727 1-w=0.99995612
min.cov.cutoff=39 (p-value<=0.00001)
These different statistics therefore assess that the assembled sequences are accurate and well-supported by sufficient data. Finally, a generalized Poisson (GP) distribution (that fits the observed genome coverage profile) enables to determine a minimum coverage cutoff (i.e. 39x) under which any assembled base is considered as putatively not trustworthy.
The tab-delimited file Lm.2HF33.dna.info.txt shows that the de novo assembly procedure led to quite few scaffold and contig sequences:
#File Nseq Nres A C G T N %A %C %G %T %N %AT %GC Min Q25 Med Q75 Max Avg auN N50 N75 N90 L50 L75 L90
Lm.2HF33.all.fasta 32 3099409 965680 586493 584967 962069 200 31.15% 18.92% 18.87% 31.04% 0.00% 62.21% 37.79% 325 467 5071 103424 609762 96856.53 369361 483167 125907 97509 3 7 11
Lm.2HF33.scf.fasta 25 3096280 964821 585834 584214 961211 200 31.16% 18.92% 18.86% 31.04% 0.00% 62.21% 37.79% 325 2377 61185 125907 609762 123851.20 369734 483167 125907 97509 3 7 11
Lm.2HF33.agp.fasta 27 3096080 964821 585834 584214 961211 0 31.16% 18.92% 18.86% 31.04% 0.00% 62.21% 37.79% 325 2334 61185 126569 532745 114669.62 323982 358956 125907 97509 4 8 12
Of note, seven small and/or insufficiently covered sequences (cutoff = 39x; see Lm.2HF33.cov.info.txt) were not selected for the final scaffold sequence set (i.e. Lm.2HF33.scf.fasta), therefore leading to a final assembled genome of total size 3,096,080 bps (Nres), represented in 25 scaffolds or 27 contigs (Nseq), with 37.79% GC-content.
The tab-delimited file Lm.2HF33.scf.info.txt (displayed below) shows different residue statistics for each final scaffold sequence. Note that each FASTA header (info files Lm.2HF33.all.fasta and Lm.2HF33.scf.fasta) contains the k-mer coverage depth returned by SPAdes (covk), the base coverage depth estimated by fq2dna (covr), and the (low) coverage cutoff derived from the coverage profile analysis (cutoff).
#Seq Nres A C G T N %A %C %G %T %N %AT %GC Pval CPU
NODE_1_covk_30.406287_covr_61.662_cutoff_39 609762 203624 102734 125302 178002 100 33.39% 16.84% 20.54% 29.19% 0.01% 62.60% 37.40% 0.0034 C
NODE_2_covk_30.453294_covr_61.686_cutoff_39 532745 154855 110898 91560 175432 0 29.06% 20.81% 17.18% 32.92% 0.00% 62.00% 38.00% 0.6223 C
NODE_3_covk_30.272711_covr_61.550_cutoff_39 483167 152931 88007 93261 148968 0 31.65% 18.21% 19.30% 30.83% 0.00% 62.49% 37.51% 0.0385 C
NODE_4_covk_30.706364_covr_61.810_cutoff_39 358956 107395 74076 65004 112481 0 29.91% 20.63% 18.10% 31.33% 0.00% 61.26% 38.74% 0.0004 C
NODE_5_covk_30.542160_covr_61.815_cutoff_39 202873 66970 34564 41586 59753 0 33.01% 17.03% 20.49% 29.45% 0.00% 62.47% 37.53% 0.0148 C
NODE_6_covk_30.987260_covr_61.695_cutoff_39 126569 41312 21299 26286 37672 0 32.63% 16.82% 20.76% 29.76% 0.00% 62.41% 37.59% 0.0628 C
NODE_7_covk_30.597515_covr_61.821_cutoff_39 125907 36800 25733 22119 41255 0 29.22% 20.43% 17.56% 32.76% 0.00% 62.00% 38.00% 0.9722 C
NODE_8_covk_30.889345_covr_61.679_cutoff_39 109334 32660 22786 17973 35915 0 29.87% 20.84% 16.43% 32.84% 0.00% 62.73% 37.27% 0.0206 C
NODE_9_covk_30.877632_covr_61.422_cutoff_39 103424 29747 21597 16742 35338 0 28.76% 20.88% 16.18% 34.16% 0.00% 62.94% 37.06% 0.0079 C
NODE_10_covk_30.202183_covr_61.508_cutoff_39 99066 32740 17152 20508 28666 0 33.04% 17.31% 20.70% 28.93% 0.00% 61.99% 38.01% 0.9669 C
NODE_11_covk_30.532037_covr_61.653_cutoff_39 97509 28253 20285 17248 31723 0 28.97% 20.80% 17.68% 32.53% 0.00% 61.51% 38.49% 0.0206 C
NODE_12_covk_30.699550_covr_61.404_cutoff_39 64398 21741 11084 13359 18214 0 33.76% 17.21% 20.74% 28.28% 0.00% 62.05% 37.95% 0.7621 C
NODE_13_covk_29.544576_covr_60.693_cutoff_39 61185 21142 9717 12639 17687 0 34.55% 15.88% 20.65% 28.90% 0.00% 63.47% 36.53% 0.0000 P
NODE_14_covk_30.516466_covr_61.060_cutoff_39 46823 14085 9566 7968 15204 0 30.08% 20.43% 17.01% 32.47% 0.00% 62.56% 37.44% 0.1267 C
NODE_15_covk_30.758479_covr_61.756_cutoff_39 35062 10004 7708 6005 11345 0 28.53% 21.98% 17.12% 32.35% 0.00% 60.89% 39.11% 0.0190 C
NODE_16_covk_30.848137_covr_61.400_cutoff_39 22536 6000 4813 3643 8080 0 26.62% 21.35% 16.16% 35.85% 0.00% 62.48% 37.52% 0.1588 U
NODE_17_covk_35.036364_covr_60.303_cutoff_39 5071 1064 1467 1048 1392 100 20.98% 28.92% 20.66% 27.45% 1.97% 49.41% 50.59% 0.0000 U
NODE_18_covk_32.390317_covr_54.643_cutoff_39 2806 724 661 386 1035 0 25.80% 23.55% 13.75% 36.88% 0.00% 62.69% 37.31% 0.4884 U
NODE_19_covk_29.273936_covr_55.026_cutoff_39 2377 637 502 337 901 0 26.79% 21.11% 14.17% 37.90% 0.00% 64.71% 35.29% 0.0432 U
NODE_20_covk_30.051514_covr_56.107_cutoff_39 2334 775 370 416 773 0 33.20% 15.85% 17.82% 33.11% 0.00% 66.33% 33.67% 0.0042 U
NODE_21_covk_25.827869_covr_50.283_cutoff_39 2073 706 348 359 660 0 34.05% 16.78% 17.31% 31.83% 0.00% 65.90% 34.10% 0.0104 U
NODE_22_covk_34.048830_covr_54.113_cutoff_39 1104 395 198 256 255 0 35.77% 17.93% 23.18% 23.09% 0.00% 58.88% 41.12% 0.1347 U
NODE_25_covk_37.367052_covr_46.822_cutoff_39 467 133 88 61 185 0 28.47% 18.84% 13.06% 39.61% 0.00% 68.10% 31.90% 0.0461 U
NODE_29_covk_33.709790_covr_42.206_cutoff_39 407 80 99 70 158 0 19.65% 24.32% 17.19% 38.82% 0.00% 58.48% 41.52% 0.2084 C
NODE_32_covk_34.926471_covr_39.388_cutoff_39 325 48 82 78 117 0 14.76% 25.23% 24.00% 36.00% 0.00% 50.77% 49.23% 0.0171 U
The last column CPU also shows the classification of each scaffold
sequence into the category ‘Chromosome’, ‘Plasmid’ or ‘Undetermined’
(i.e. C, P, U, respectively). Such a classification shows that the 13rd
scaffold sequence (i.e. NODE_13
) is likely a plasmid
(subsequent BLAST searches indeed show that it is almost identical to pLM6179).
Bankevich A, Nurk S, Antipov D, Gurevich A, Dvorkin M, Kulikov AS, Lesin V, Nikolenko S, Pham S, Prjibelski A, Pyshkin A, Sirotkin A, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012) SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 19(5):455-477. doi:10.1089/cmb.2012.0021.
Duru IC, Andreevskaya M, Laine P, Rode TN, Ylinen A, Løvdal T, Bar N, Crauwels P, Riedel CU, Bucur FI, Nicolau AI, Auvinen P (2020) Genomic characterization of the most barotolerant Listeria monocytogenes RO15 strain compared to reference strains used to evaluate food high pressure processing. BMC Genomics, 21:455. doi:10.1186/s12864-020-06819-0.
Ellison AM (1987) Effect of seed dimorphism on the density-dependent dynamics of experimental populations of Atriplex triangularis (Chenopodiaceae). American Journal of Botany, 74(8):1280-1288. doi:10.2307/2444163.
Lindner MS, Kollock M, Zickmann F, Renard BY (2013) Analyzing genome coverage profiles with applications to quality control in metagenomics. Bioinformatics, 29(10):1260-1267. doi:10.1093/bioinformatics/btt147.
Pfister R, Schwarz KA, Janczyk M, Dale R, Freeman JB (2013) Good things peak in pairs: a note on the bimodality coefficient. Frontiers in Psychology, 4:700. doi:10.3389/fpsyg.2013.00700.
Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.
Schwengers O, Barth P, Falgenhauer L, Hain T, Chakraborty T, Goesmann A (2020) Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein sequence-based replicon distribution scores. Microbial Genomics, 6(10):mgen000398. doi:10.1099/mgen.0.000398.
Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068-2069. doi:10.1093/bioinformatics/btu153.