<h1 id="table-of-contents">Table of Contents</h1>
<ul>
<li><a href="#synopsis">Synopsis</a></li>
<li><a href="#population-reference-graphs">Population Reference
Graphs</a></li>
<li><a href="#build-index">Build index</a></li>
<li><a href="#map-reads-to-index">Map reads to index</a></li>
<li><a href="#compare-reads-from-several-samples">Compare reads from
several samples</a></li>
<li><a href="#discover-novel-variants-in-several-samples">Discover novel
variants in several samples</a></li>
</ul>
<h1 id="synopsis">Synopsis</h1>
<pre><code>$ pandora --help
Pandora: Pan-genome inference and genotyping with long noisy or short accurate reads.
Usage: pandora [OPTIONS] SUBCOMMAND

Options:
  -h,--help                   Print this help message and exit
  -V,--version

Subcommands:
  index                       Index population reference graph (PRG) sequences.
  map                         Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and optionally genotype variants.
  compare                     Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.
  discover                    Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
  walk                        Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path
  seq2path                    For each sequence, return the path through the PRG
  get_vcf_ref                 Outputs a fasta suitable for use as the VCF reference using input sequences
  random                      Outputs a fasta of random paths through the PRGs
  merge_index                 Allows multiple indices to be merged (no compatibility check)</code></pre>
<h1 id="population-reference-graphs">Population Reference Graphs</h1>
<p>Pandora assumes you have already constructed a fasta-like file of
graphs, one entry for each gene or genome region of interest. If you
haven’t, you will need a multiple sequence alignment (MSA) for each graph.
Precompiled collections of MSAs representing orthologous gene clusters for
a number of species can be downloaded from <a
href="http://pangenome.de/">pangenome.de</a> and converted to graphs using <a
href="https://github.com/leoisl/make_prg">make_prg</a>.</p>
<h1 id="build-index">Build index</h1>
<p><code>pandora index</code> takes a fasta-like file of PanRG sequences
and constructs an index and a directory of GFA files, which are used by
<code>pandora map</code> or <code>pandora compare</code>. By default, both
are written to the same directory as the PanRG file.</p>
<pre><code>$ pandora index --help
Index population reference graph (PRG) sequences.
Usage: pandora index [OPTIONS] &lt;PRG&gt;

Positionals:
  &lt;PRG&gt; FILE [required]       PRG to index (in fasta format)

Options:
  -h,--help                   Print this help message and exit
  -w INT                      Window size for (w,k)-minimizers (must be &lt;=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  -o,--outfile FILE           Filename for the index [default: &lt;PRG&gt;.kXX.wXX.idx]
  -v                          Verbosity of logging. Repeat for increased verbosity</code></pre>
<p>The index stores (w,k)-minimizers for each PanRG path found. The window
size and k-mer size can be set with <code>-w</code> and <code>-k</code>
(w must not exceed k) and default to w=14, k=15.</p>
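<p>To make the idea of a (w,k)-minimizer concrete, here is a minimal
sketch on a plain linear sequence. It is an illustration only, not
Pandora's implementation: it assumes lexicographic ordering of k-mers
in place of a hash, ignores reverse complements, and works on a string
rather than on paths through a graph. The function name
<code>minimizers</code> is hypothetical.</p>

```python
def minimizers(seq, w=14, k=15):
    """Return sorted (position, kmer) pairs for a linear sequence.

    A k-mer is kept if it is the smallest one in at least one window
    of w consecutive k-mers. Assumption: "smallest" means
    lexicographically smallest, with ties going to the leftmost k-mer.
    """
    if w > k:
        raise ValueError("w must not exceed k")
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add((best, kmers[best]))
    return sorted(chosen)


# Small w and k so the result can be checked by hand.
for pos, kmer in minimizers("ACGTTGCATGCA", w=3, k=4):
    print(pos, kmer)
```

<p>Because neighbouring windows often share their smallest k-mer, the
set of minimizers is usually smaller than the set of all k-mers, which
is what keeps an index of this kind compact.</p>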
<h1 id="map-reads-to-index">Map reads to index</h1>
<p>This takes a fasta/q file of Nanopore or Illumina reads and compares
the reads to the index. It infers which of the PanRG genes/elements are
present and, for those that are present, outputs the inferred sequence and
a genotyped VCF.</p>
<pre><code>$ pandora map --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and optionally genotype variants.
Usage: ./pandora map [OPTIONS] &lt;TARGET&gt; &lt;QUERY&gt;

Positionals:
  &lt;TARGET&gt; FILE [required]    An indexed PRG file (in fasta format)
  &lt;QUERY&gt; FILE [required]     Fast{a,q} file containing reads to quasi-map

Options:
  -h,--help                   Print this help message and exit
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be &lt;=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: pandora]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --vcf-refs FILE             Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in &lt;TARGET&gt; and the same name
  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci
  --loci-vcf                  Save a VCF file for each found loci
  -C,--comparison-paths       Save a fasta file for a random selection of paths through loci
  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --max-covg INT              Maximum coverage of reads to accept [default: 300]

Consensus/Variant Calling:
  --genotype                  Add extra step to carefully genotype sites.
  --snps                      When genotyping, only include SNP sites
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

Genotyping:
  --local Needs: --genotype   (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.
  -a INT                      Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]
  -s INT                      The minimum required total coverage for a site when genotyping [default: 0]
  -D INT                      Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]
  -F INT                      Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]
  -E,--gt-error-rate FLOAT    When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]
  -G,--gt-conf INT            Minimum genotype confidence (GT_CONF) required to make a call [default: 1]</code></pre>
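<p>The two mapping options, <code>--max-diff</code> and
<code>--min-cluster-size</code>, can be pictured with a small sketch.
It assumes the hits between one read and one locus are given as read
positions, that a gap larger than <code>max_diff</code> starts a new
cluster, and that a cluster needs at least
<code>min_cluster_size</code> hits to count. Whether Pandora treats
these bounds as inclusive is an assumption here, and
<code>cluster_hits</code> is a hypothetical name.</p>

```python
def cluster_hits(positions, max_diff=250, min_cluster_size=10):
    """Group hit positions into clusters and keep the large ones."""
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_diff closes the current cluster.
        if current and pos - current[-1] > max_diff:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)
    # Small clusters are treated as spurious and dropped.
    return [c for c in clusters if len(c) >= min_cluster_size]


# Toy hit positions (bp along a read); min_cluster_size lowered for the demo.
hits = [10, 40, 90, 130, 900, 950, 2000]
print(cluster_hits(hits, max_diff=250, min_cluster_size=3))
```

<p>In this sketch, raising <code>max_diff</code> joins more distant
hits into one cluster, while raising <code>min_cluster_size</code>
demands more evidence before a locus is considered present.</p>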
<h1 id="compare-reads-from-several-samples">Compare reads from several
samples</h1>
<p>This takes Nanopore or Illumina reads (fasta/q) for a number of
samples and maps each sample to the index. It infers which of the PanRG
genes/elements are present in each sample, and outputs a presence/absence
pangenome matrix, the inferred sequences for each sample, and a genotyped
multisample pangenome VCF.</p>
<pre><code>$ pandora compare --help
Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.
Usage: ./pandora compare [OPTIONS] &lt;TARGET&gt; &lt;QUERY_IDX&gt;

Positionals:
  &lt;TARGET&gt; FILE [required]    An indexed PRG file (in fasta format)
  &lt;QUERY_IDX&gt; FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample

Options:
  -h,--help                   Print this help message and exit
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be &lt;=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: pandora]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --vcf-refs FILE             Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in &lt;TARGET&gt; and the same name
  --loci-vcf                  Save a VCF file for each found loci

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --max-covg INT              Maximum coverage of reads to accept [default: 300]

Consensus/Variant Calling:
  --genotype                  Add extra step to carefully genotype sites.
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]
  
Genotyping:
  --local Needs: --genotype   (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.
  -a INT                      Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]
  -s INT                      The minimum required total coverage for a site when genotyping [default: 0]
  -D INT                      Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]
  -F INT                      Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]
  -E,--gt-error-rate FLOAT    When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]
  -G,--gt-conf INT            Minimum genotype confidence (GT_CONF) required to make a call [default: 1]</code></pre>
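<p>The <code>&lt;QUERY_IDX&gt;</code> file is a plain tab-delimited
list of samples, so it is easy to generate from a script. The sketch
below writes such a file and assembles a matching
<code>pandora compare</code> argument list using only options shown in
the help text above. The sample names, read paths, and
<code>prg.fa</code> are placeholders, and both helper functions are
hypothetical.</p>

```python
import tempfile
from pathlib import Path


def write_query_idx(samples, path):
    """Write one line per sample: identifier, a tab, then the reads path."""
    with open(path, "w") as handle:
        for sample_id, reads in samples.items():
            handle.write(f"{sample_id}\t{reads}\n")


def compare_command(target, query_idx, outdir="pandora", threads=1, genotype=False):
    """Build an argument list for pandora compare; nothing is executed here."""
    cmd = ["pandora", "compare", "-o", str(outdir), "-t", str(threads)]
    if genotype:
        cmd.append("--genotype")
    return cmd + [str(target), str(query_idx)]


# Placeholder sample names and read files.
samples = {"sample1": "reads/sample1.fastq", "sample2": "reads/sample2.fastq"}
query_idx = Path(tempfile.mkdtemp()) / "samples.tsv"
write_query_idx(samples, query_idx)
print(query_idx.read_text(), end="")
print(" ".join(compare_command("prg.fa", query_idx, threads=4, genotype=True)))
```

<p>The returned list can be passed to <code>subprocess.run</code> once
Pandora is installed and the PRG has been indexed.</p>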
<h1 id="discover-novel-variants-in-several-samples">Discover novel
variants in several samples</h1>
<p>This looks for regions in the pangraph where reads do not map and
attempts to locally assemble those regions to find novel variants.</p>
<pre><code>$ pandora discover --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
Usage: pandora discover [OPTIONS] &lt;TARGET&gt; &lt;QUERY_IDX&gt;

Positionals:
  &lt;TARGET&gt; FILE [required]    An indexed PRG file (in fasta format)
  &lt;QUERY_IDX&gt; FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample

Options:
  -h,--help                   Print this help message and exit
  --discover-k INT:[0-32)     K-mer size to use when discovering novel variants [default: 15]
  --max-ins INT               Max. insertion size for novel variants. Warning: setting too long may impair performance [default: 15]
  --covg-threshold INT        Positions with coverage less than this will be tagged for variant discovery [default: 3]
  -l INT                      Min. length of consecutive positions below coverage threshold to trigger variant discovery [default: 1]
  -L INT                      Max. length of consecutive positions below coverage threshold to trigger variant discovery [default: 30]
  -d,--merge INT              Merge candidate variant intervals within distance [default: 15]
  -N INT                      Maximum number of candidate variants allowed for a candidate region [default: 25]
  --min-dbg-dp INT            Minimum node/kmer depth in the de Bruijn graph used for discovering variants [default: 2]
  -v                          Verbosity of logging. Repeat for increased verbosity

Indexing:
  -w INT                      Window size for (w,k)-minimizers (must be &lt;=k) [default: 14]
  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

Input/Output:
  -o,--outdir DIR             Directory to write output files to [default: &quot;pandora_discover&quot;]
  -t,--threads INT            Maximum number of threads to use [default: 1]
  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci
  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

Parameter Estimation:
  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]
  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
  --bin                       Use binomial model for kmer coverages [default: negative binomial]

Mapping:
  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]
  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

Preset:
  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

Filtering:
  --clean                     Add a step to clean and detangle the pangraph
  --clean-dbg                 Clean the local assembly de Bruijn graph
  --max-covg INT              Maximum coverage of reads to accept [default: 600]

Consensus/Variant Calling:
  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]</code></pre>
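<p>The way <code>--covg-threshold</code>, <code>-l</code>,
<code>-L</code> and <code>--merge</code> interact can be sketched on a
list of per-position coverage values. This is a rough illustration
under stated assumptions, not Pandora's code: it assumes low-coverage
runs are merged first and filtered by length afterwards, that the
merge distance is inclusive, and that intervals are half-open. The
function name <code>candidate_intervals</code> is hypothetical.</p>

```python
def candidate_intervals(coverage, covg_threshold=3, min_len=1, max_len=30, merge_dist=15):
    """Return half-open (start, end) intervals of low coverage."""
    # 1. Collect runs of positions whose coverage is below the threshold.
    runs = []
    start = None
    for pos, depth in enumerate(coverage):
        low = covg_threshold > depth
        if low and start is None:
            start = pos
        elif not low and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(coverage)))

    # 2. Merge runs separated by a gap of at most merge_dist positions.
    merged = []
    for run_start, run_end in runs:
        if merged and merge_dist >= run_start - merged[-1][1]:
            merged[-1] = (merged[-1][0], run_end)
        else:
            merged.append((run_start, run_end))

    # 3. Keep intervals whose length lies between min_len and max_len.
    return [(s, e) for s, e in merged if max_len >= e - s >= min_len]


# Toy coverage profile; merge_dist is lowered so the merge step is visible.
coverage = [9, 9, 1, 0, 9, 2, 9, 9, 9, 9, 0, 9]
print(candidate_intervals(coverage, merge_dist=1))
```

<p>In this sketch, a larger merge distance produces fewer but longer
candidate regions, and <code>max_len</code> then discards regions that
have grown too long to assemble locally.</p>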