Genrich: detecting sites of genomic enrichment

Table of Contents

Introduction

Genrich is a peak-caller for genomic enrichment assays (e.g. ChIP-seq, ATAC-seq). It analyzes alignment files generated following the assay and produces a file detailing peaks of significant enrichment.

Quick start

Given: * sample.bam (alignment file, sorted by queryname) * Genrich (compiled as described below)

To produce a file listing regions of genomic enrichment:

$ ./Genrich  -t sample.bam  -o sample.narrowPeak  -v

Software compilation

The software can be downloaded from GitHub.

A Makefile is provided for compilation with GCC, and zlib is also required. The program has been tested after compilation with GCC 5.4.0 and zlib 1.2.8.

To compile, run make in the folder in which the software was downloaded. The executable Genrich should be produced.

Usage message

Usage: ./Genrich  -t <file>  -o <file>  [optional arguments]
Required arguments:
  -t  <file>       Input SAM/BAM file(s) for experimental sample(s)
  -o  <file>       Output peak file (in ENCODE narrowPeak format)
Optional I/O arguments:
  -c  <file>       Input SAM/BAM file(s) for control sample(s)
  -f  <file>       Output bedgraph-ish file for p/q values
  -k  <file>       Output bedgraph-ish file for pileups and p-values
  -b  <file>       Output BED file for reads/fragments/intervals
  -R  <file>       Output file for PCR duplicates (only with -r)
Filtering options:
  -r               Remove PCR duplicates
  -e  <arg>        Comma-separated list of chromosomes to exclude
  -E  <file>       Input BED file(s) of genomic regions to exclude
  -m  <int>        Minimum MAPQ to keep an alignment (def. 0)
  -s  <float>      Keep sec alns with AS >= bestAS - <float> (def. 0)
  -y               Keep unpaired alignments (def. false)
  -w  <int>        Keep unpaired alns, lengths changed to <int>
  -x               Keep unpaired alns, lengths changed to paired avg
Options for ATAC-seq:
  -j               Use ATAC-seq mode (def. false)
  -d  <int>        Expand cut sites to <int> bp (def. 100)
  -D               Skip Tn5 adjustments of cut sites (def. false)
Options for peak-calling:
  -p  <float>      Maximum p-value (def. 0.01)
  -q  <float>      Maximum q-value (FDR-adjusted p-value; def. 1)
  -a  <float>      Minimum AUC for a peak (def. 200.0)
  -l  <int>        Minimum length of a peak (def. 0)
  -g  <int>        Maximum distance between signif. sites (def. 100)
Other options:
  -X               Skip peak-calling
  -P               Call peaks directly from a log file (-f)
  -z               Option to gzip-compress output(s)
  -v               Option to print status updates/counts to stderr

Attributes

Peak-calling method

Here is an overview of the method used by Genrich to identify peaks (Fig. 1): * Parse alignments for the experimental sample and create an experimental “pileup” by counting the DNA fragments that cover each position of the genome (additional information about alignment parsing can be found here). * Create a control pileup using the control sample (if available) and background level (additional information about control/background pileup calculation can be found here). * Calculate p-values for each genomic position, as described here. * (Optional) Convert p-values to q-values, as described here. * Calculate the “area under the curve” (AUC) for all regions reaching statistical significance (e.g., q < 0.05 ⇒ -log(q) > 1.301). * Combine nearby regions and call peaks whose total AUC is above a threshold (details of peak-calling parameters can be found here).

Peak-calling by Genrich
Figure 1. Peak-calling by Genrich. Information about the sample and the Genrich command can be found here. Visualization by IGV.



Alignment parsing

Genrich analyzes paired-end reads aligned to a reference genome. It correctly infers full fragments as spanning between the 5’ ends of two properly paired alignments. By default, it does not consider unpaired alignments (including those from single-end reads), although there are three options for keeping such alignments, as described here.

An alternative analysis mode for ATAC-seq is also provided by Genrich, as described here.

Multimapping reads

Genrich analyzes reads/fragments that map to multiple locations in the genome by adding a fractional count to each location. This allows for peak detection in regions of the genome that are otherwise inaccessible to the assay. The input SAM/BAM file(s) must list secondary alignments for multimapping reads/fragments, such as those reported by the short read aligner Bowtie2 in -k <int> mode or -a mode. Additional information about the processing of secondary alignments by Genrich can be found in the description of the -s parameter.

PCR duplicate removal

Genrich provides an option for removing PCR duplicates (-r). In this process, it analyzes reads/fragments based on their alignments, in three separate groups (proper pairs, discordant pairs, and singletons), and removes those identified as duplicates from further analysis. One novel feature is that this evaluation takes into account reads/fragments with multiple alignments. Additional information on the duplicate identification procedure can be found here.

Genome length calculation

Genrich computes the genome length as the sum of the lengths of the chromosomes (reference sequences) in the header of the SAM/BAM file. The length is reduced if the user specifies chromosomes (-e) or genomic regions (-E) to be excluded, as described below. The calculated length is used for computing a background pileup value and q-values.

Control/background pileup calculation

The background pileup value is calculated by dividing the total sequence information (sum of read/fragment/interval lengths) in the experimental sample by the calculated genome length. The net control pileup value at a particular genomic position is the maximum of the background pileup value and the pileup of the control sample at that position (if a control sample is specified). Note that control pileups are scaled to match the experimental, based on the total sequence information in each.

p-value calculation

The p-value for each base of the genome is calculated assuming a null model with a log-normal distribution. The control/background pileup value is used as the parameter μ, and the σ parameter is 1.2×μ if μ ≤ 7, and 10×log10(μ) if μ > 7. These formulas, as well as the choice of the log-normal distribution, were determined from a comprehensive review of control samples from various ChIP-seq experiments (human, mouse, and worm) downloaded from SRA. * Because the log-normal is a continuous probability distribution, fractional experimental pileup values can be considered. Such values need to be evaluated by Genrich due to reads/fragments with multiple alignments. * Earlier versions of Genrich (prior to v0.5) used the exponential distribution for the null model. Although this distribution has some convenient properties (continuous, one-parameter, -log10(p) ∝ x / β), the study described above showed that the exponential distribution was a good fit to the control pileup distribution only occasionally. However, this was still better than the Poisson distribution, which is frequently used as a null model in genomics software but was a good fit to the control pileup distribution never.

q-value calculation

The q-value for each base of the genome is calculated from the p-value using the Benjamini-Hochberg procedure. The calculated genome length is used as the number of hypothesis tests (m).

Multiple replicates

Genrich calls peaks for multiple replicates collectively. First, it analyzes the replicates separately, with p-values calculated for each. At each genomic position, the multiple replicates’ p-values are then combined by Fisher’s method. The combined p-values are converted to q-values, and peaks are called. This obviates the need for IDR (you’re welcome!).

I/O files and options

Required files

  -t  <file>       Input SAM/BAM file(s) for experimental sample(s)
  -o  <file>       Output peak file (in ENCODE narrowPeak format)
chr1    894446    894988    peak_10    402    .    217.824936    4.344683    1.946031    317
chr1    895834    896167    peak_11    343    .    114.331093    4.344683    1.946031    90

Optional files

  -c  <file>       Input SAM/BAM file(s) for control sample(s)
  -f  <file>       Output bedgraph-ish file for p/q values
chr1    894435    894436    33.000000    2.477916    3.183460    1.208321
chr1    894436    894442    34.000000    2.477916    3.231466    1.241843
chr1    894442    894446    35.000000    2.477916    3.278469    1.274561
chr1    894446    894447    36.000000    2.477916    3.324516    1.306471    *
chr1    894447    894450    39.000000    2.477916    3.457329    1.398035    *
chr1    894450    894451    40.000000    2.477916    3.499948    1.427253    *
chr1    894451    894460    41.000000    2.477916    3.541798    1.455938    *
  -k  <file>       Output bedgraph-ish file for pileups and p-values
  -b  <file>       Output BED file for reads/fragments/intervals
  -R  <file>       Output file for PCR duplicates (only with -r)
SRR5427886.5958     chr4:185201876-185201975            SRR5427886.4688    paired
SRR5427886.1826     chr12:34372610,+;chr1:91852878,-    SRR5427886.2040    discordant
SRR5427886.10866    chr14:53438632,+                    SRR5427886.4746    single

Filtering options

  -r               Remove PCR duplicates

  -e  <arg>        Comma-separated list of chromosomes to exclude
  -E  <file>       Input BED file(s) of genomic regions to exclude
  -m  <int>        Minimum MAPQ to keep an alignment (def. 0)

  -s  <float>      Keep sec alns with AS >= bestAS - <float> (def. 0)

Unpaired alignments

By default, Genrich analyzes only properly paired alignments and infers the full fragments as spanning between the 5’ ends of the two alignments (Fig. 2). It does not analyze unpaired alignments unless one of three options is selected:

  -y               Keep unpaired alignments (def. false)
  -w  <int>        Keep unpaired alns, lengths changed to <int>
  -x               Keep unpaired alns, lengths changed to paired avg
Alignment analysis
Figure 2. Analysis of alignments by Genrich. The alignment file example.bam has both properly paired alignments (top left) and unpaired alignments (top right). By default, Genrich infers the full fragments from the paired alignments and discards the unpaired alignments. Unpaired alignments can be kept via -y, -w <int>, or -x, as described above.



ATAC-seq mode

ATAC-seq is a method for assessing genomic regions of open chromatin. Since only the ends of the DNA fragments indicate where the Tn5 transposase enzyme was able to insert into the chromatin, it may not be optimal to interpret alignments as shown above (Fig. 2). Genrich has an alternative analysis mode for ATAC-seq in which it creates intervals centered on transposase cut sites (Fig. 3).

  -j               Use ATAC-seq mode (def. false)
  -d  <int>        Expand cut sites to <int> bp (def. 100)
  -D               Skip Tn5 adjustments of cut sites (def. false)
ATAC-seq mode
Figure 3. ATAC-seq mode of Genrich. Genrich analyzes intervals centered on cut sites (both ends of full fragments, as well as the 5’ ends of unpaired alignments if -y is set, adjusted forward by 5bp). The lengths of the intervals can be changed from the default of -d 100.



Unpaired alignments can be analyzed with -y, though only one interval, centered on the read’s 5’ end, is inferred. Both -w <int> and -x are equivalent to -y in ATAC-seq mode.

By default, Genrich centers the intervals at the ends of the reads/fragments, adjusted forward by 5bp to account for the Tn5 transposase occupancy. That is, for the 5’ ends of fragments (or for reads aligning in a normal orientation), the position is increased by +5, and for the 3’ ends of fragments (or for reads aligning in a reverse-complement orientation), the position is adjusted by -5. To avoid this position adjustment (e.g. for DNase-seq), one can use -D.

For full fragments, when the two cut site intervals overlap, they are merged into a single interval. To get a BED file of cut sites, one can run -d 1 -b <file>.

The remainder of the peak-calling process (calculating pileups and significance values) is identical to the default analysis mode. Note that the interval lengths (not the fragment lengths) are used to sum the total sequence information for the calculation of control/background pileup values.

Peak-calling parameters

  -p  <float>      Maximum p-value (def. 0.01)
  -q  <float>      Maximum q-value (FDR-adjusted p-value; def. 1)
  -a  <float>      Minimum AUC for a peak (def. 200.0)
  -l  <int>        Minimum length of a peak (def. 0)
  -g  <int>        Maximum distance between signif. sites (def. 100)

Miscellaneous

  -X               Skip peak-calling

  -P               Call peaks directly from a log file (-f)
  -z               Option to gzip-compress output(s)
  -S               Option to skip sort order check

Other options:

  -v/--verbose     Option to print status updates/counts to stderr
  -h/--help        Print the usage message and exit
  -V/--version     Print the version and exit

Full analysis example

A sequencing run was downloaded from SRA. Its reads were adapter-trimmed by NGmerge and aligned to the human genome (hg19) by Bowtie2 with -k 20. The SAM alignments from bowtie2 were piped through SAMtools to sort them by queryname and convert the file to the BAM format. The resulting alignment file SRR5427886.bam was analyzed by Genrich with the options to remove PCR duplicates (-r) and to keep unpaired alignments, extended to the average fragment length (-x). All alignments to chrM and chrY were discarded (-e chrM,chrY), as well as alignments to two sets of excluded intervals: regions of ‘N’ homopolymers in the hg19 genome (produced by findNs.py) and high mappability regions (-E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz). There was no control sample. Peaks were called using a maximum q-value of 0.05 (-q 0.05) and a minimum AUC of 20.0 (-a 20.0).

$ ./Genrich  -t SRR5427886.bam  -o SRR5427886.narrowPeak  \
  -f SRR5427886.log  -r  -x  -q 0.05  -a 20.0  -v  \
  -e chrM,chrY  -E hg19_Ns.bed,wgEncodeDukeMapabilityRegionsExcludable.bed.gz 
Processing experimental file #0: SRR5427886.bam
  BAM records analyzed:  146277509
    Unmapped:              2877550
    To skipped refs:       2560688
      (chrM,chrY)
    Paired alignments:   137462998
      secondary alns:     67724920
    Unpaired alignments:   3376273
      secondary alns:      2232056
  PCR duplicates --
    Paired aln sets:      35574220
      duplicates:          4660723 (13.1%)
    Discordant aln sets:     64679
      duplicates:             7736 (12.0%)
    Singleton aln sets:    1036778
      duplicates:           475201 (45.8%)
  Fragments analyzed:     31286234
    Full fragments:       30616993
      (avg. length: 226.4bp)
    Half fragments:         669241
      (from unpaired alns, extended to 226bp)
- control file #0 not provided -
  Background pileup value: 2.477916
Peak-calling parameters:
  Genome length: 2826865605bp
  Significance threshold: -log(q) > 1.301
  Min. AUC: 20.000
  Max. gap between sites: 100bp
Peaks identified: 35114 (27918264bp)

The total time to execute the above command (analyzing a BAM of 146.3 million alignments and calling peaks) was 10.5min. It required 17.1GB of memory.

Once a log file (SRR5427886.log) is produced, calling peaks with an alternative set of parameters is accomplished most easily with the -P option. For example, with -p 0.01 -a 200:

$ ./Genrich  -P  -f SRR5427886.log  -o SRR5427886_p01_a200.narrowPeak  -p 0.01  -a 200  -v
Peak-calling from log file: SRR5427886.log
Peak-calling parameters:
  Genome length: 2826865605bp
  Significance threshold: -log(p) > 2.000
  Min. AUC: 200.000
  Max. gap between sites: 100bp
Peaks identified: 47329 (62194983bp)

This required just 36sec and less than 0.1MB of memory.

However, if one wanted the alignments to be interpreted differently, such as ATAC-seq mode (-j), the original command (with -t SRR5427886.bam) would need to be rerun.

Warning messages

In verbose mode, Genrich may print one or more warnings to stderr: * Read N prevented from extending below 0 on <chrom>: This may occur due to extending unpaired alignments (-w <int>, -x) or in ATAC-seq mode (-j). * Read N prevented from extending past <int> on <chrom>: This also may occur due to extending unpaired alignments (-w <int>, -x) or in ATAC-seq mode (-j). A maximum of 128 warning messages of these types (Read N prevented...) are printed per SAM/BAM. * Large scaling may mask true signal: This is printed if the scaling factor for the control pileup is greater than 5. * BED interval ignored - located off end of reference: An excluded BED interval (-E) whose start coordinate is past the end of the reference sequence is ignored. One should ensure that the genome version that produced the BED intervals matches that of the SAM/BAM. * BED interval extends past end of ref. - edited to <loc>: An excluded BED interval (-E) whose end coordinate is past the end of the reference sequence is adjusted as indicated. Again, one should ensure that the genome version that produced the BED intervals matches that of the SAM/BAM. * No paired alignments to calculate avg frag length -- Printing unpaired alignments "as is": When there are no properly paired alignments and the -x average extension option is selected, the unpaired alignments are printed as they appear in the SAM/BAM. * Read N, alignment at <loc> skipped due to overflow: The maximum difference in pileup values from one genomic position to the next is +32767, and additional read alignments are skipped due to this limitation. Removing PCR duplicates (-r) may help reduce this issue. * Read N, alignment at <loc> skipped due to underflow: The minimum difference in pileup values from one genomic position to the next is -32768, and additional read alignments are skipped due to this limitation. Removing PCR duplicates (-r) may help reduce this issue. * Read N has more than 128 alignments: If a read has more than 128 alignments in the SAM/BAM, only the first 128 are considered. As described above, the best 10 alignments (at most) are ultimately analyzed by Genrich. * All q-values are 1: When no p-values are smaller than the reciprocal of the calculated genome length, all q-values are calculated to be 1. If peaks are expected to be called, one should consider using a p-value threshold instead (-p). * Skipping chromosome N -- Reads aligning to it were used...: This occurs when peak-calling directly from a log file (-P), and the log file indicates that reads/fragments aligned to a chromosome to be skipped (-e). When Genrich created the log file, the chromosome was not skipped. Therefore, the reads/fragments that aligned to it were used in the background pileup calculation, and its length was included in the genome length calculation. Those calculations affected the statistics present in the log file, so the called peaks may be slightly different than those called when running the full analysis. * Skipping given BED regions -- Reads aligning to them were used...: This occurs when peak-calling directly from a log file (-P), and the log file indicates that reads/fragments aligned to one or more genomic regions to be skipped (-E). When Genrich created the log file, the regions were not skipped. Therefore, the reads/fragments that aligned to them were used in the background pileup calculation, and their lengths were included in the genome length calculation. Those calculations affected the statistics present in the log file, so the called peaks may be slightly different than those called when running the full analysis. * "orphan" alns: An “orphan” alignment is one that the SAM/BAM indicates is properly paired, but its pair could not be found. This could be due to a poorly formatted SAM/BAM, or possibly (but unlikely) a bug in alignment parsing by Genrich. This warning appears in the accounting of alignments analyzed.

Computational requirements

Genrich runs very quickly but uses a considerable amount of memory. For starters, it requires 3 bytes for every base-pair of the reference genome, i.e. ~9GB for a human sample. The number of input files has little effect on memory, but certain analysis options (especially the option to remove PCR duplicates) can greatly increase the memory usage, particularly with large SAM/BAM input files. See above for an example.

Genrich is not multi-threaded.

Contact

Genrich

Copyright © 2018 John M. Gaspar (jsh58@wildcats.unh.edu)

Any question, concern, or bug report about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see if your issue has been addressed already. Also, please follow these good GitHub practices.