.. _api:

deepTools API example
=====================

The following is a short overview of the most useful methods and classes
from deepTools. Complete information can be found in the following links:
:ref:`genindex` and :ref:`modindex`

Finding read coverage over a region
-----------------------------------

With deepTools, the read coverage over multiple genomic regions and multiple
files can be computed quite quickly using multiple processors. First, we
start with a simple example that is later expanded upon to demonstrate the
use of multiple processors. In this example we compute the coverage of reads
over a small region for bins of 50 bp. For this we need the
:class:`deeptools.countReadsPerBin.CountReadsPerBin` class.

.. code:: python

    import deeptools.countReadsPerBin as crpb

We also need a BAM file containing the aligned reads. The BAM file must be
indexed to allow quick access to the reads falling into the regions of
interest.

.. code:: python

    bam_file = "file.bam"

Now, the ``CountReadsPerBin`` object can be initialized. The first argument
to the constructor is a list of BAM files, which in this case is just one
file. We are going to use a ``binLength`` of 50 bases, with subsequent bins
adjacent (i.e., the ``stepSize`` between bins is also 50 bases). Overlapping
bin coverages can be obtained by setting a ``stepSize`` smaller than
``binLength``.

.. code:: python

    cr = crpb.CountReadsPerBin([bam_file], binLength=50, stepSize=50)

Now, we can compute the coverage over a region in chromosome 2 from
position 0 to 1000.

.. code:: python

    cr.count_reads_in_region('chr2L', 0, 1000)

.. parsed-literal::

    (array([[ 2.],
            [ 3.],
            [ 1.],
            [ 2.],
            [ 3.],
            [ 2.],
            [ 4.],
            [ 3.],
            [ 2.],
            [ 3.],
            [ 4.],
            [ 6.],
            [ 4.],
            [ 2.],
            [ 2.],
            [ 1.]]), '')

The result is a tuple whose first element is a numpy array with one row per
bin and one column per BAM file. Since only one BAM file was used, there is
only one column. If a file name for saving the raw data had been specified,
the temporary file name used for this would appear as the second item of the
tuple.

Filtering reads
---------------

If reads should be filtered, the relevant options simply need to be passed
to the constructor. In the following code, the reads are filtered such that
only those with a mapping quality of at least 20 that are not aligned to the
reverse strand are kept (``samFlag_exclude=16``, where 16 is the flag value
for reverse reads; see the `SAM Flag Calculator
<http://broadinstitute.github.io/picard/explain-flags.html>`__ for more
information). Furthermore, duplicated reads are ignored.

.. code:: python

    cr = crpb.CountReadsPerBin([bam_file], binLength=50, stepSize=50,
                               minMappingQuality=20,
                               samFlag_exclude=16,
                               ignoreDuplicates=True)
    cr.count_reads_in_region('chr2L', 1000000, 1001000)

.. parsed-literal::

    (array([[ 1.],
            [ 1.],
            [ 0.],
            [ 0.],
            [ 0.],
            [ 0.],
            [ 2.],
            [ 3.],
            [ 1.],
            [ 0.],
            [ 1.],
            [ 2.],
            [ 0.],
            [ 0.],
            [ 1.],
            [ 2.],
            [ 1.],
            [ 0.],
            [ 0.],
            [ 0.]]), '')
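To gauge how much signal these filters remove, the filtered counts can be
compared to an unfiltered run over the same region. The following is a
minimal sketch that reuses the ``bam_file`` placeholder and the filtered
``cr`` object from above; the resulting fraction depends entirely on the
data.

.. code:: python

    # Unfiltered counts over the same region, for comparison.
    cr_all = crpb.CountReadsPerBin([bam_file], binLength=50, stepSize=50)
    counts_all, _ = cr_all.count_reads_in_region('chr2L', 1000000, 1001000)

    # Counts with the mapping-quality, strand, and duplicate filters
    # applied, using the filtered object created above.
    counts_filtered, _ = cr.count_reads_in_region('chr2L', 1000000, 1001000)

    # Fraction of the unfiltered signal that survives filtering.
    print(counts_filtered.sum() / counts_all.sum())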
Sampling the genome
-------------------

Instead of adjacent bins, as in the previous cases, the genome can simply be
sampled. This is useful to estimate some values, like the depth of
sequencing, without having to look at the complete genome. In the following
example, 10,000 positions of size 1 base are going to be queried from three
BAM files to compute the average depth of sequencing. For this, we set the
``numberOfSamples`` parameter in the object constructor. The ``run()``
method is used instead of ``count_reads_in_region`` to provide efficient
sampling over the entire genome.

.. code:: python

    cr = crpb.CountReadsPerBin([bam_file1, bam_file2, bam_file3],
                               binLength=1, numberOfSamples=10000,
                               numberOfProcessors=10)
    sequencing_depth = cr.run()
    print(sequencing_depth.mean(axis=0))

.. parsed-literal::

    [  1.98923924   2.43743744  22.90102603]

The ``run()`` method splits the computation over 10 processors and collates
the results. When the parameter ``numberOfSamples`` is used, the regions
selected for the computation of the coverage are not random. Instead, the
genome is split into 'number-of-samples' equal parts, and the start of each
part is queried for its coverage. You can also compute the coverage over
selected regions by providing a BED file.

Now it is possible to make some diagnostic plots from the results:

.. code:: python

    import numpy as np
    import matplotlib.pyplot as plt

    res = sequencing_depth        # sampled coverages from cr.run()
    total_sites = res.shape[0]    # number of sampled positions

    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    for col in res.T:
        # plot coverage: fraction of sampled bases at each coverage value
        axs[0].plot(np.bincount(col.astype(int)).astype(float) / total_sites)
        # plot cumulative coverage: fraction of sampled bases with
        # coverage greater than or equal to a given value
        csum = np.bincount(col.astype(int))[::-1].cumsum()
        axs[1].plot(csum.astype(float)[::-1] / csum.max())
    axs[0].set_xlabel('coverage')
    axs[0].set_ylabel('fraction of bases sampled')
    axs[1].set_xlabel('coverage')
    axs[1].set_ylabel('fraction of bases sampled >= coverage')

.. image:: ../images/plot_coverage.png

Computing the FRiP score
------------------------

The FRiP score is defined as the fraction of reads that fall into a peak and
is often used as a measure of ChIP-seq quality. For this example, we need a
BED file containing the peak regions. Such files are usually computed using
a peak caller. Also, two BAM files are going to be used, corresponding to
two biological replicates.

.. code:: python

    bed_files = ["peaks.bed"]
    cr = crpb.CountReadsPerBin([bam_file1, bam_file2],
                               bedFile=bed_files,
                               numberOfProcessors=10)
    reads_at_peaks = cr.run()
    print(reads_at_peaks)

.. parsed-literal::

    array([[ 322.,  248.],
           [ 231.,  182.],
           [ 112.,  422.],
           ...,
           [ 120.,   76.],
           [ 235.,  341.],
           [ 246.,  265.]])

The result is a numpy array with a row for each peak region and a column for
each BAM file.

.. code:: python

    reads_at_peaks.shape

.. parsed-literal::

    (6295, 2)

Now, the total number of reads per peak per BAM file is computed:

.. code:: python

    total = reads_at_peaks.sum(axis=0)

Next, we need to find the total number of mapped reads in each of the BAM
files. For this we use the pysam module.

.. code:: python

    import pysam

    bam1 = pysam.AlignmentFile(bam_file1)
    bam2 = pysam.AlignmentFile(bam_file2)

Now, ``bam1.mapped`` and ``bam2.mapped`` contain the total number of mapped
reads in each of the BAM files, respectively. Finally, we can compute the
FRiP score:

.. code:: python

    frip1 = float(total[0]) / bam1.mapped
    frip2 = float(total[1]) / bam2.mapped
    print(frip1, frip2)

.. parsed-literal::

    0.170030741997 0.216740390353
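The steps above generalize readily to any number of BAM files. The following
is a minimal, hypothetical helper (``compute_frip`` is not part of
deepTools; it simply bundles the calls shown above into one function):

.. code:: python

    import pysam
    import deeptools.countReadsPerBin as crpb

    def compute_frip(bam_files, peaks_bed, n_processors=10):
        """Return one FRiP score per BAM file, in input order."""
        cr = crpb.CountReadsPerBin(bam_files,
                                   bedFile=[peaks_bed],
                                   numberOfProcessors=n_processors)
        # one row per peak region, one column per BAM file
        reads_at_peaks = cr.run()
        total = reads_at_peaks.sum(axis=0)
        # divide the reads in peaks by the mapped reads of each file
        return [float(t) / pysam.AlignmentFile(b).mapped
                for t, b in zip(total, bam_files)]

    print(compute_frip([bam_file1, bam_file2], "peaks.bed"))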
Using mapReduce to sample paired-end fragment lengths
-----------------------------------------------------

deepTools internally uses a map-reduce strategy, in which a computation is
split into smaller parts that are sent to different processors, and the
output from the different processors is subsequently collated. The following
example is based on the code available for ``bamPEFragmentSize.py``.

Here, we retrieve the reads from a BAM file and collect their fragment
lengths. Reads are retrieved using pysam, and the ``read`` object returned
contains the ``template_length`` attribute, which is the number of bases
from the leftmost to the rightmost mapped base in the read pair.

First, we will create a function that can collect fragment lengths over a
genomic position from a BAM file. As we will later call this function using
mapReduce, it accepts only one argument, namely a tuple with the parameters:
chromosome name, start position, end position, and BAM file name.

.. code:: python

    import pysam
    import numpy as np

    def get_fragment_length(args):
        chrom, start, end, bam_file_name = args
        bam = pysam.AlignmentFile(bam_file_name)
        f_lens_list = []
        for fetch_start in range(start, end, 1000000):
            # sample the reads in the first 10,000 bases of each
            # one-million-base window
            fetch_end = min(end, fetch_start + 10000)
            f_lens_list.append(
                np.array([abs(read.template_length)
                          for read in bam.fetch(chrom, fetch_start, fetch_end)
                          if read.is_proper_pair and read.is_read1]))
        # concatenate the results from all windows
        return np.concatenate(f_lens_list)

Now, we can use ``mapReduce`` to call this function and compute fragment
lengths over the whole genome. mapReduce needs to know the chromosome sizes,
which can easily be retrieved from the BAM file. Furthermore, it needs to
know the size of the region(s) sent to each processor. For this example, a
region of 10 million bases is sent to each processor using the
``genomeChunkLength`` parameter. In other words, each processor executes the
same ``get_fragment_length`` function to collect data over a different
10-million-base region.

The first argument to mapReduce is the list of arguments sent to the
function, besides the obligatory first three (chrom, start, end). In this
case, only one extra argument is passed to the function: the BAM file name.
The next two positional arguments are the function to call
(``get_fragment_length``) and the chromosome sizes.

.. code:: python

    from deeptools import mapReduce

    bam = pysam.AlignmentFile(bam_file_name)
    chrom_sizes = list(zip(bam.references, bam.lengths))

    result = mapReduce.mapReduce([bam_file_name], get_fragment_length,
                                 chrom_sizes,
                                 genomeChunkLength=10000000,
                                 numberOfProcessors=20,
                                 verbose=True)
    fragment_lengths = np.concatenate(result)
    print("mean fragment length {}".format(fragment_lengths.mean()))
    print("median fragment length {}".format(np.median(fragment_lengths)))

Indices and tables
------------------

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`