Distance Estimation =================== MinHash Jaccard estimation -------------------------- Given :math:`k`-mer sets :math:`A` and :math:`B`, the MinHash algorithm provides an estimation of the Jaccard index: .. math:: j(A_s,B_s) = \frac {\lvert A_s \cap B_s \rvert} s where :math:`A_s` and :math:`B_s` are subsets such that :math:`\lvert A_s \cup B_s \rvert` is equal to the sketch size, :math:`s`, allowing for a known error bound as suggested by Broder [#f1]_. This is done by using a merge-sort algorithm to find common values between the two sorted sketches and terminating when the total number of hashes seen reaches the sketch size (or all hashes in both sketches have been seen). Mash distance formulation ------------------------- For mutating a sequence with :math:`t` total :math:`k`-mers and a conserved :math:`k`-mer count :math:`w`, an approximate mutation rate :math:`d` can be estimated using a Poisson model of mutations occurring in :math:`k`-mers, as suggested by Fan et al. [#f2]_: .. math:: d = \frac {-1} k \ln \frac w t In order to use a Jaccard estimate :math:`j` between two :math:`k`-mer sets of arbitrary sizes, the Jaccard estimate can be framed in terms of the conserved :math:`k`-mer count :math:`w` and the average set size :math:`n`: .. math:: j \approx \frac w {2n - w} To substitute :math:`n` for the total :math:`k`-mer count :math:`t` in the mutation estimation, this approximation can be reformulated as: .. math:: \frac w n \approx \frac {2j} {1 + j} Substituting :math:`\frac w n` for :math:`\frac w t` thus yields the Mash distance: .. math:: D(k,j)=\Bigg\{\begin{split} &1&,\ j=0\\ &\frac {-1} k \ln \frac {2j} {1 + j}&,\ 0`_ .. [#f2] `Fan, H., Ives, A.R., Surget-Groba, Y. & Cannon, C.H. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC genomics 16, 522 (2015). `_