Sketches ======== For sequences to be compared with :code:`mash`, they must first be `sketched`, which creates vastly reduced representations of them. This will happen automatically if :code:`mash dist` is given raw sequences. However, if multiple comparisons will be performed, it is more efficient to create sketches with :code:`mash sketch` first and provide them to :code:`mash dist` in place of the raw sequences. Sketching parameters can be provided to either tool via command line options. Reduced representations with MinHash tables ------------------------------------------- Sketches are used by the `MinHash` algorithm to allow fast distance estimations with low storage and memory requirements. To make a sketch, each k-mer in a sequence is `hashed`, which creates a pseudo-random identifier. By sorting these identifiers (`hashes`), a small subset from the top of the sorted list can represent the entire sequence (these are `min-hashes`). The more similar another sequence is, the more min-hashes it is likely to share. k-mer size '''''''''' As in any k-mer based method, larger k-mers will provide more specificity, while smaller k-mers will provide more sensitivity. Larger genomes will also require larger k-mers to avoid k-mers that are shared by chance. K-mer size is specified with :code:`-k`, and sketch files must have the same k-mer size to be compared with :code:`mash dist`. When :code:`mash sketch` is run, it automatically assesses the specified k-mer size against the sizes of input genomes by estimating the probability of a random match as: .. math:: p = \frac 1 {\frac {\left(\overline\Sigma\right)^k} g + 1} ...where :math:`g` is the genome size and :math:`\Sigma` is the alphabet (ACGT by default). If this probability exceeds a threshold (specified by :code:`-w`; 0.01 by default) for any input genomes, a warning will be given with the minimum k-mer size needed to get within the threshold. For large collections of sketches, memory and storage may also be a consideration when choosing a k-mer size. Mash will use 32-bit hashes, rather than 64-bit, if they can encompass the full k-mer space for the alphabet in use. This will (roughly) halve the size of the size of the sketch file on disk and the memory it uses when loaded for :code:`mash dist`. The criterion for using a 32-bit hash is: .. math:: \left({\overline\Sigma}\right)^k \leq 2^{32} ...which becomes :math:`k \leq 16` for nucleotides (the default) and :math:`k \leq 7` for amino acids. sketch size ''''''''''' Sketch size corresponds to the number of (non-redundant) min-hashes that are kept. Larger sketches will better represent the sequence, but at the cost of larger sketch files and longer comparison times. The error bound of a distance estimation for a given sketch size :math:`s` is formulated as: .. math:: \sqrt{\frac{1}{s}} Sketch size is specified with :code:`-s`. Sketches of different sizes can be compared with :code:`mash dist`, although the comparison will be restricted to the smaller of the two sizes. Strand and alphabet ------------------- By default, :code:`mash` uses a nucleotide alphabet (ACGT), is case-insensitive, and will ignore strandedness by using canonical k-mers, as done in `Jellyfish`_. This works by using the reverse complement of a k-mer if it comes before the original k-mer alphabetically. Strandedness can be preserved with :code:`-n` (non-canonical) and case can be preserved with :code:`-Z`. Note that the default nucleotide alphabet does not include lowercase and thus will filter out k-mers with lowercase nucleotides if :code:`-Z` is specified. The amino acid alphabet can be specified with :code:`-a`, which also changes the default k-mer size to reflect the denser information. A completely custom alphabet can also be specified with :code:`-z`. Note that alphabet size affects p-value calculation and hash size (see `Assessing significance with p-values `_ and `k-mer size`_). Sketching read sets ------------------- When sketching reads instead of complete genomes or assemblies, :code:`-r` should be specified, which will estimate genome size from k-mer content rather than total sequence length, allowing more accurate p-vlaues. Genome size can also be specified directly with :code:`-g`. Additionally, Since MinHash is a k-mer based method, removing unique or low-copy k-mers usually improves results for read sets, since these k-mers are likely to represent sequencing error. The minimum copies of each k-mer required can be specified with :code:`-m` (e.g. :code:`-m 2` to filter unique). However, this could lead to high memory usage if genome size is high and coverage is low, such as in metagenomic read sets. In these cases a Bloom filter can be used (:code:`-b`) to filter out most unique k-mers with constant memory. If coverage is high (e.g. >100x), it can be helpful to limit it to save time and to avoid repeat errors appearing as legitimate k-mers. This can be done with :code:`-c`, which stops sketching reads once the estimated average coverage (based on k-mer multiplicity) reaches the target. Working with sketch files ------------------------- The sketch or sketches stored in a sketch file, and their parameters, can be inspected with :code:`mash info`. If sketch files have matching k-mer sizes, their sketches can be combined into a single file with :code:`mash paste`. This allows simple pairwise comparisons with :code:`mash dist`, and allows sketching of multiple files to be parallelized. .. _Jellyfish: http://www.cbcb.umd.edu/software/jellyfish/