It is exciting that MetaBAT has gained some popularity for the last couple of years. One of design goals of original MetaBAT was to enable users to explore MetaBAT parameters efficiently so that they could find the best results out of their dataset. However, the task might have been challenging due to various reasons (e.g. too many parameters to optimize, lack of knowledge about parameters, lack of computing or time resources, etc.). And typical users were just stick to the default setting and the shortcuts (e.g. –sensitive, –specific, etc) at best. Even though it is amazing to get great bins without spending much effort to optimize parameters, sometimes it does not produce the best possible bins out of datasets. And we felt that was not users’ responsibility but our’s. So here we introduce MetaBAT 2:

NOTICE:

Since v2.12.1, metabat executable points to metabat2. To run MetaBAT 1, use metabat1 executable.

Check out a new tutorial Best Binning Practices.

Check out CAMI benchmark with MetaBAT 2 here.

Check out the MetaBAT paper here.

Be careful to have bams sorted first!

MetaBAT 2 USAGE: running on command line

Be careful to have bams sorted first!

The easy way: > runMetaBat.sh assembly.fasta sample1.bam [sample2.bam …]

The slightly less easy way:

  1. Generate a depth file from BAM files

jgi_summarize_bam_contig_depths –outputDepth depth.txt *.bam

  1. Run metabat

metabat2 -i assembly.fasta -a depth.txt -o bins_dir/bin

To check MetaBAT options:

metabat2 -h

Allowed options:
  -h [ --help ]                     produce help message
  -i [ --inFile ] arg               Contigs in (gzipped) fasta file format [Mandatory]
  -o [ --outFile ] arg              Base file name and path for each bin. The default output is fasta format.
                                    Use -l option to output only contig names [Mandatory].
  -a [ --abdFile ] arg              A file having mean and variance of base coverage depth (tab delimited; 
                                    the first column should be contig names, and the first row will be 
                                    considered as the header and be skipped) [Optional].
  -m [ --minContig ] arg (=2500)    Minimum size of a contig for binning (should be >=1500).
  --maxP arg (=95)                  Percentage of 'good' contigs considered for binning decided by connection
                                    among contigs. The greater, the more sensitive.
  --minS arg (=60)                  Minimum score of a edge for binning (should be between 1 and 99). The 
                                    greater, the more specific.
  --maxEdges arg (=200)             Maximum number of edges per node. The greater, the more sensitive.
  --pTNF arg (=0)                   TNF probability cutoff for building TNF graph. Use it to skip the 
                                    preparation step. (0: auto).
  --noAdd                           Turning off additional binning for lost or small contigs.
  --cvExt                           When a coverage file without variance (from third party tools) is used 
                                    instead of abdFile from jgi_summarize_bam_contig_depths.
  -x [ --minCV ] arg (=1)           Minimum mean coverage of a contig in each library for binning.
  --minCVSum arg (=1)               Minimum total effective mean coverage of a contig (sum of depth over 
                                    minCV) for binning.
  -s [ --minClsSize ] arg (=200000) Minimum size of a bin as the output.
  -t [ --numThreads ] arg (=0)      Number of threads to use (0: use all cores).
  -l [ --onlyLabel ]                Output only sequence labels as a list in a column without sequences.
  --saveCls                         Save cluster memberships as a matrix format
  --unbinned                        Generate [outFile].unbinned.fa file for unbinned contigs
  --noBinOut                        No bin output. Usually combined with --saveCls to check only contig 
                                    memberships
  --seed arg (=0)                   For exact reproducibility. (0: use random seed)
  -d [ --debug ]                    Debug output
  -v [ --verbose ]                  Verbose output

Choice of Options:

MetaBAT 1 USAGE: running on command line

Run metabat

metabat1 -i assembly.fasta -a depth.txt -o bins_dir/bin

To check MetaBAT options:

metabat1 -h

Allowed options:
  -h [ --help ]                     produce help message
  -i [ --inFile ] arg               Contigs in (gzipped) fasta file format [Mandatory]
  -o [ --outFile ] arg              Base file name for each bin. The default output is fasta format. Use -l
                                    option to output only contig names [Mandatory]
  -a [ --abdFile ] arg              A file having mean and variance of base coverage depth (tab delimited;
                                    the first column should be contig names, and the first row will be
                                    considered as the header and be skipped) [Optional]
  --cvExt                           When a coverage file without variance (from third party tools) is used
                                    instead of abdFile from jgi_summarize_bam_contig_depths
  -p [ --pairFile ] arg             A file having paired reads mapping information. Use it to increase
                                    sensitivity. (tab delimited; should have 3 columns of contig index
                                    (ordered by), its mate contig index, and supporting mean read coverage.
                                    The first row will be considered as the header and be skipped) [Optional]
  --p1 arg (=0)                     Probability cutoff for bin seeding. It mainly controls the number of
                                    potential bins and their specificity. The higher, the more (specific)
                                    bins would be. (Percentage; Should be between 0 and 100)
  --p2 arg (=0)                     Probability cutoff for secondary neighbors. It supports p1 and better be
                                    close to p1. (Percentage; Should be between 0 and 100)
  --minProb arg (=0)                Minimum probability for binning consideration. It controls sensitivity.
                                    Usually it should be >= 75. (Percentage; Should be between 0 and 100)
  --minBinned arg (=0)              Minimum proportion of already binned neighbors for one's membership
                                    inference. It contorls specificity. Usually it would be <= 50
                                    (Percentage; Should be between 0 and 100)
  --verysensitive                   For greater sensitivity, especially in a simple community. It is the
                                    shortcut for --p1 90 --p2 85 --pB 20 --minProb 75 --minBinned 20
                                    --minCorr 90
  --sensitive                       For better sensitivity [default]. It is the shortcut for --p1 90 --p2 90
                                    --pB 20 --minProb 80 --minBinned 40 --minCorr 92
  --specific                        For better specificity. Different from --sensitive when using correlation
                                    binning or ensemble binning. It is the shortcut for --p1 90 --p2 90 --pB
                                    30 --minProb 80 --minBinned 40 --minCorr 96
  --veryspecific                    For greater specificity. No correlation binning for short contig
                                    recruiting. It is the shortcut for --p1 90 --p2 90 --pB 40 --minProb 80
                                    --minBinned 40
  --superspecific                   For the best specificity. It is the shortcut for --p1 95 --p2 90 --pB 50
                                    --minProb 80 --minBinned 20
  --minCorr arg (=0)                Minimum pearson correlation coefficient for binning missed contigs to
                                    increase sensitivity (Helpful when there are many samples). Should be
                                    very high (>=90) to reduce contamination. (Percentage; Should be between
                                    0 and 100; 0 disables)
  --minSamples arg (=10)            Minimum number of sample sizes for considering correlation based
                                    recruiting
  -x [ --minCV ] arg (=1)           Minimum mean coverage of a contig to consider for abundance distance
                                    calculation in each library
  --minCVSum arg (=2)               Minimum total mean coverage of a contig (sum of all libraries) to
                                    consider for abundance distance calculation
  -s [ --minClsSize ] arg (=200000) Minimum size of a bin to be considered as the output
  -m [ --minContig ] arg (=2500)    Minimum size of a contig to be considered for binning (should be >=1500;
                                    ideally >=2500). If # of samples >= minSamples, small contigs (>=1000)
                                    will be given a chance to be recruited to existing bins by default.
  --minContigByCorr arg (=1000)     Minimum size of a contig to be considered for recruiting by pearson
                                    correlation coefficients (activated only if # of samples >= minSamples;
                                    disabled when minContigByCorr > minContig)
  -t [ --numThreads ] arg (=0)      Number of threads to use (0: use all cores)
  --minShared arg (=50)             Percentage cutoff for merging fuzzy contigs
  --fuzzy                           Binning with fuzziness which assigns multiple memberships of a contig to
                                    bins (activated only with --pairFile at the moment)
  -l [ --onlyLabel ]                Output only sequence labels as a list in a column without sequences
  -S [ --sumLowCV ]                 If set, then every sample that falls below the minCV will be used in an
                                    aggregate sample
  -V [ --maxVarRatio ] arg (=0)     Ignore any contigs where variance / mean exceeds this ratio (0 disables)
  --saveTNF arg                     File to save (or load if exists) TNF matrix for each contig in input
  --saveDistance arg                File to save (or load if exists) distance graph at lowest probability
                                    cutoff
  --saveCls                         Save cluster memberships as a matrix format
  --unbinned                        Generate [outFile].unbinned.fa file for unbinned contigs
  --noBinOut                        No bin output. Usually combined with --saveCls to check only contig
                                    memberships
  -B [ --B ] arg (=20)               Number of bootstrapping for ensemble binning (Recommended to be >=20)
  --pB arg (=50)                    Proportion of shared membership in bootstrapping. Major control for
                                    sensitivity/specificity. The higher, the specific. (Percentage; Should be
                                    between 0 and 100)
  --seed arg (=0)                   For reproducibility in ensemble binning, though it might produce slightly
                                    different results. (0: use random seed)
  --keep                            Keep the intermediate files for later usage
  -d [ --debug ]                    Debug output
  -v [ --verbose ]                  Verbose output

Choice of Options:

OUTPUT:

Each discovered bin will be saved as a fasta format

NOTES:

The proper settings on the read aligner should be set to evenly distribute ambiguously mapping reads (the default option for bowtie2, bwa, and bbmap).

jgi_summarize_bam_contig_depths USAGE:

jgi_summarize_bam_contig_depths

Usage: jgi_summarize_bam_contig_depths <options> sortedBam1 [ sortedBam2 ...]
where options include:
    --outputDepth       arg  The file to put the contig by bam depth matrix (default: STDOUT)
    --percentIdentity   arg  The minimum end-to-end % identity of qualifying reads (default: 97)
    --pairedContigs     arg  The file to output the sparse matrix of contigs which paired reads span (default: none)
    --unmappedFastq     arg  The prefix to output unmapped reads from each bam file suffixed by 'bamfile.bam.fastq.gz'
    --noIntraDepthVariance   Do not include variance from mean depth along the contig
    --showDepth              Output a .depth file per bam for each contig base
    --includeEdgeBases       When calculating depth & variance, include the 1-readlength edges (off by default)
    --maxEdgeBases           When calculating depth & variance, and not --includeEdgeBases, the maximum length (default:75)

Options to control shredding contigs that are under represented by the reads
    --referenceFasta    arg  The reference file.  (It must be the same fasta that bams used)
    --shredLength       arg  The maximum length of the shreds
    --shredDepth        arg  The depth to generate overlapping shreds
    --minContigLength   arg  The mimimum length of contig to include for mapping and shredding
    --minContigDepth    arg  The minimum depth along contig at which to break the contig

MetaBAT Adjusted coverage depths

The algorithm that jgi_summarize_bam_contig_depths uses for calculating coverage depth for each sequence in the assembly is adjusted by a few factors to improve the fidelity of the metrics when correlating abundance coverage in the binning stage.

By default the following adjustments are applied:

  1. Edge Bases are ignored > Edge bases are not counted as coverage, by the lesser of 1 AverageReadLength or (–maxEdgeBases=75).

This is because most mappers can not reliably place a read that would extend off the edge of a sequence, and coverage depth tends to drop towards 0 at the edge of a contig or scaffold.

Use –includeEdgeBases to include the coverage in this region.

  1. Reads with high mapping errors are skipped

Reads that map imperfectly are excluded when the %ID of the mapping drops below a threshold (–percentIdentity=97).

MetaBAT is designed to resolve strain variation and mapping reads with low %ID indicate that the read actually came from a different strain/species.

%ID is calculated from the CIGAR string and/or NM/MD fields and == 100 * MatchedBases / (MatchedBases + Substituions + Insertions + Deletions)

This ensures that clips, insertions, deletions and mismatches are excluded from the coverage count. Only the read bases that exactly match the reference are counted as coverage. This generally has a small effect, except in the case of long reads from PacBio and Nanopore.

Example with real data

Description of Data:

After downloading the assembly file and two bam files, run the following command line:

runMetaBat.sh assembly.fa *.bam

MetaBAT forms about 28 bins. In this example, since we know the true membership of each contig, we can calculate the completeness and precision of a bin correctly.

#The following is R commands (tested on Linux)
library(plyr)
library(foreach)
options(width=150)

refs <- read.table("membership.txt", sep="\t", header=T)

files <- list.files(".", pattern="*.[0-9]+.fa$", full.name=F)
bins <- foreach(f=files) %do% {
    system(sprintf("grep '>' %s | sed 's/>//'", f), intern=TRUE)
}

res <- foreach(b=bins, .combine=cbind) %do% {
    ddply(refs[match(b, refs$contig),], .(reference), function(x) sum(x$size), .drop=F)
}
rownames(res) <- res[,1]; res <- res[, seq(2,ncol(res),2)]
colnames(res) <- sapply(strsplit(files, "\\."), function(x) x[length(x)-1])

genome.size <- ddply(refs, .(reference), function(x) c(size=sum(x$size)))
genome.size <- genome.size[match(rownames(res), genome.size$reference),] 

res.prec <- apply(res, 2, function(x) max(x)/sum(x))
res.comp <- apply(res, 2, function(x) max(x)/genome.size$size[which.max(x)])
res.ref <- apply(res, 2, function(x) rownames(res)[which.max(x)])
res2 <- cbind.data.frame(Bin=names(res.prec), Size=colSums(res), 'Compl.'=res.comp, 'Prec.'=res.prec, 'Ref.'=res.ref)
res2
summary(res2[,2:4]) #Median bin size was 3.72Mb with 97% median completeness (or 80% by mean) and 100% median precision (or 96% by mean).
length(unique(res2$Ref.)) # 24 out of 25 genomes were binned.

The output will look like the following:

>res2
   Bin    Size     Compl.     Prec.                                           Ref.
1    1 7860794 0.97635160 0.4832147        Natronobacterium gregoryi SP2, DSM 3393
10  10 4115670 0.81602285 1.0000000              Spirochaeta smaragdinae DSM 11293
11  11 3921774 0.86124825 1.0000000           Natronococcus occultus SP4, DSM 3396
12  12 3874303 0.98891653 1.0000000                    Hirschia baltica ATCC 49814
13  13 3839986 0.99090122 1.0000000         Coraliomargarita akajimensis DSM 45221
14  14 3730710 0.98314815 1.0000000            Clostridium thermocellum ATCC 27405
15  15 3704412 0.87120817 1.0000000        Thermobacillus composti KWC4, DSM 18247
16  16 3684573 0.86379200 1.0000000          Frateuria aurantia Kondo 67, DSM 6220
17  17 3546737 0.98625152 1.0000000                  Meiothermus silvanus DSM 9946
18  18 3236335 0.99686126 1.0000000             Clostridium perfringens ATCC 13124
19  19 3173259 0.99905801 1.0000000                Segniliparus rotundus DSM 44985
2    2 5619150 0.99822442 1.0000000    Echinicola vietnamensis KMM 6221, DSM 17526
20  20 2924873 0.89827162 1.0000000          Corynebacterium glutamicum ATCC 13032
21  21 2147757 0.98982458 1.0000000    Fervidobacterium pennivorans Ven5, DSM 9078
22  22 2125683 0.98312892 1.0000000                         Olsenella uli DSM 7084
23  23 1805558 0.96656297 1.0000000                  Streptococcus pyogenes M1 GAS
24  24  703485 0.23799984 0.7072745                  Salmonella bongori NCTC 12419
25  25  576231 0.11425057 1.0000000              Spirochaeta smaragdinae DSM 11293
26  26  351671 0.06972657 1.0000000              Spirochaeta smaragdinae DSM 11293
27  27  292949 0.06889611 1.0000000        Thermobacillus composti KWC4, DSM 18247
28  28  274427 0.08428058 1.0000000          Corynebacterium glutamicum ATCC 13032
3    3 5272697 0.92686085 0.7153358      Escherichia coli str. K-12 substr. MG1655
4    4 5090248 0.97611998 1.0000000     Desulfotomaculum gibsoniae Groll, DSM 7213
5    5 4940344 0.99375427 1.0000000      Desulfosporosinus meridiei S10, DSM 13257
6    6 4829038 0.96953761 1.0000000   Desulfosporosinus acidophilus SJ4, DSM 22704
7    7 4766169 0.98037872 1.0000000           Terriglobus roseus KBS 63, DSM 18391
8    8 4753789 0.91043046 1.0000000 Salmonella enterica subsp. arizonae serovar 62
9    9 4735197 0.98375082 1.0000000                      Pseudomonas stutzeri RCH2

>summary(res2[,2:4])
      Size             Compl.           Prec.       
 Min.   : 274427   Min.   :0.0689   Min.   :0.4832  
 1st Qu.:2142238   1st Qu.:0.8632   1st Qu.:1.0000  
 Median :3717561   Median :0.9728   Median :1.0000  
 Mean   :3424922   Mean   :0.8031   Mean   :0.9609  
 3rd Qu.:4756884   3rd Qu.:0.9869   3rd Qu.:1.0000  
 Max.   :7860794   Max.   :0.9991   Max.   :1.0000  
MetaBAT Copyright (c) 2014, The Regents of the University of California, 
through Lawrence Berkeley National Laboratory (subject to receipt of any 
required approvals from the U.S. Dept. of Energy).  All rights reserved.

If you have questions about your rights to use or distribute this software, 
please contact Berkeley Lab's technology transfer department at  TTD@lbl.gov 
referring to "MetaBAT (2014-075)."

NOTICE.  This software was developed under funding from the U.S. Department of 
Energy.  As such, the U.S. Government has been granted for itself and others 
acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide license in 
the Software to reproduce, prepare derivative works, and perform publicly and 
display publicly.  Beginning five (5) years after the date permission to assert 
copyright is obtained from the U.S. Department of Energy, and subject to any 
subsequent five (5) year renewals, the U.S. Government is granted for itself 
and others acting on its behalf a paid-up, nonexclusive, irrevocable, worldwide 
license in the Software to reproduce, prepare derivative works, distribute 
copies to the public, perform publicly and display publicly, and to permit 
others to do so.