Supporting Data
===============

Mash: fast genome and metagenome distance estimation using MinHash
------------------------------------------------------------------

`RefSeqSketches.msh.gz `_: Mash sketch database (k=16, s=400) for RefSeq release 70 (48MB)

`RefSeqSketchesDefaults.msh.gz `_: Mash sketch database (k=21, s=1000) for RefSeq release 70 (255MB)

`Escherichia.tar.gz `_: Names and accessions for 500 selected Escherichia genomes, pairwise ANI, and pairwise Jaccard indexes for various k-mer and sketch sizes (24MB)

`mash-1.0.tar.gz `_: Mash version 1.0 codebase (93KB)

`SRR2671867.BaAmes.poretools.fastq.gz `_: Nanopore 1D + 2D sequences generated by poretools (157MB)

`SRR2671868.Bc10987.poretools.fastq.gz `_: Nanopore 1D + 2D sequences generated by poretools (250MB)

Mash Screen: High-throughput sequence containment estimation for genome discovery
---------------------------------------------------------------------------------

Custom scripts and intermediate data:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`MashScreen_supp.tar.gz `_

Data files:
~~~~~~~~~~~

Mash Sketch databases for RefSeq release 88:

* `RefSeq88n.msh.gz `_: Genomes (k=21, s=1000), 1.2Gb uncompressed
* `RefSeq88p.msh.gz `_: Proteomes (k=9, s=1000), 1.1Gb uncompressed

`art.fastq.gz `_: Simulated reads for Shakya experiment

Figure 5:

* `fig5.html `_: Interactive version
* `fig5.tsv `_: Source data

Screen of SRA metagenomes vs. RefSeq
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* `sra_meta_nucl_95idy.tsv.gz `_ (2.3Gb uncompressed)
* `sra_meta_nucl_80idy_3x.tsv.gz `_ (6.7Gb uncompressed)
* `sra_meta_prot_95idy.tsv.gz `_ (2.1Gb uncompressed)
* `sra_meta_prot_80idy_3x.tsv.gz `_ (8.3Gb uncompressed)

These files have a line for each RefSeq genome listing all metagenomic SRA runs (as of August 2018) with Mash Containment Scores above the specified threshold.
They are provided for two screen modes:

* ``nucl``: Genomic RefSeq sequences
* ``prot``: Proteomic RefSeq sequences (combined amino acid sequences per organism).
  **NOTE:** Protein tables above are not p-value filtered and thus large (> ~50Gb) runs may have spurious hits.
  They also do not contain plasmids. Updates coming soon!

...and at two thresholds:

* ``95idy``: 95% Mash Containment Score, any coverage. Useful for finding runs containing a specific genome.
* ``80idy_3x``: 80% Mash Containment Score, at least 3x median k-mer multiplicity. Useful for finding related, but novel, sequences.

The files are tab separated, with each line beginning with a RefSeq assembly accession, followed by SRA accessions, for example:

::

  GCF_000001215.4 SRR3401361 SRR3540373
  GCF_000001405.36 SRR5127794 ERR1539652 SRR413753 ERR206081
  GCF_000001405.38 SRR5127794 ERR1539652 ERR1711677 SRR413753 ERR206081

We also provide simple scripts for searching these files: `search.tar `_

Public data sources
~~~~~~~~~~~~~~~~~~~

The BLAST ``nr`` database was downloaded from ``ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*``.

HMP data were downloaded from ``ftp://public-ftp.ihmpdcc.org/``, with reads taken from the ``Ilumina/`` directory and coding sequences from the ``HMGI/`` directory. Within these folders, sample SRS015937 resides in ``tongue_dorsum/`` and SRS020263 in ``right_retroauricular_crease/``.

SRA runs were downloaded with the `SRA Toolkit `_.

RefSeq genomes were downloaded from the ``genomes/refseq/`` directory of ``ftp.ncbi.nlm.nih.gov``.

Public data products
~~~~~~~~~~~~~~~~~~~~

Quebec Polyomavirus is submitted to GenBank as BK010702.
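
As a rough illustration of the tab-separated layout described under "Screen of SRA metagenomes vs. RefSeq", the following Python sketch reads such a table and looks up accessions in both directions. It is not one of the provided search scripts; the function names (``iter_screen_table``, ``runs_for_genome``, ``genomes_for_run``, ``open_screen_table``) are hypothetical, and the sketch assumes only what is stated above: one RefSeq assembly accession per line, followed by tab-separated SRA run accessions.

.. code-block:: python

    import gzip


    def iter_screen_table(lines):
        """Yield (refseq_accession, [sra_runs]) for each non-empty line."""
        for line in lines:
            fields = line.rstrip("\r\n").split("\t")
            if not fields[0]:
                continue  # skip blank lines
            yield fields[0], fields[1:]


    def runs_for_genome(lines, accession):
        """Return the SRA runs listed for one RefSeq assembly accession."""
        for refseq, runs in iter_screen_table(lines):
            if refseq == accession:
                return runs
        return []


    def genomes_for_run(lines, run):
        """Return every RefSeq accession whose line lists the given SRA run."""
        return [refseq for refseq, runs in iter_screen_table(lines) if run in runs]


    def open_screen_table(path):
        """Open a gzipped table as text (hypothetical helper; path is user-supplied)."""
        return gzip.open(path, "rt")


    # The three example lines shown above, with explicit tab separators.
    example_lines = [
        "GCF_000001215.4\tSRR3401361\tSRR3540373\n",
        "GCF_000001405.36\tSRR5127794\tERR1539652\tSRR413753\tERR206081\n",
        "GCF_000001405.38\tSRR5127794\tERR1539652\tERR1711677\tSRR413753\tERR206081\n",
    ]

    print(runs_for_genome(example_lines, "GCF_000001215.4"))
    print(genomes_for_run(example_lines, "SRR5127794"))

Because the functions iterate line by line, a file handle returned by ``open_screen_table`` can be passed in place of ``example_lines``, which avoids holding a multi-gigabyte table in memory.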