Inspecting base modification probabilities

For details on how base modification probabilities are calculated, see the FAQ page

For most use cases the automatic filtering enabled in modkit will produce nearly ideal results. However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds. The modkit sample-probs command is designed for this task. There are two ways to use this command, first by simply running modkit sample-probs $mod_bam to get a tab-separated file of threshold values for each modified base. This can save time in downstream steps where you wish to re-use the threshold value by passing --filter-threshold and skip re-estimating the value. To generate more advanced output, add --hist --out-dir $output_dir to the command and generate per-modification histograms of the output probabilities. Using the command this way produces 3 files in the $output_dir:

  1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads.
  2. Another HTML document containing the proportions of each probability emitted.
  3. A tab-separated table with the same information as the histograms and the percentile rank of each probability value.

The schema of the table is as follows:

columnnamedescriptiontype
1codemodification code or '-' for canonicalstring
2primary basethe primary DNA base for which the code appliesstring
3range_startthe inclusive start probability of the binfloat
4range_endthe exclusive end probability of the binfloat
5countthe total count of probabilities falling in this binint
6fracthe fraction of the total calls for this code/primary base in this binfloat
7percentile_rankthe percentile rank of this probability binfloat

From these plots and tables you can decide on a pass threshold per-modification code and use --mod-threshold/--filter-threshold accordingly.