ShapeMapper file format descriptions

Input files

FASTA file(s)

This file specifies the reference (target) sequences. The file extension should be .fa or .fasta. The file should contain one or more target DNA sequences (use 'T', not 'U'). Sequences must not contain spaces or tabs, but may be split across multiple lines. Each sequence must be preceded by a line of the form >RNA_name, where RNA_name is replaced with the name of the RNA of interest. Lowercase positions are excluded from reactivity profiles; use lowercase to mark primer-binding sites when amplicon primers bind at either end of the sequence. If multiple primer pairs were used, or the primer sites are not at the ends of the sequence, provide the primer sequences in a separate file with --primers (see below).

Example:

>TPP_riboswitch
ggccttcgggccaaggaCTCGGGGTGCCCTTCTGCGTGAAGGCTGAGAAATACCCGT
ATCACCTGATCTGGATAATGCCAGCGTAGGGAAGTTCTCGATCCGGTTCGCCGGATC
CAAATcgggcttcggtccggttc
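
The rules above (name line starting with '>', sequence optionally wrapped over several lines, lowercase marking excluded positions) can be sketched in a few lines of Python. This is only an illustration of the format, not ShapeMapper's own parser; the function names read_fasta and masked_positions are made up for this example.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def masked_positions(seq):
    """Return 1-based positions of lowercase (excluded) nucleotides."""
    return [i for i, base in enumerate(seq, start=1) if base.islower()]


example = """>TPP_riboswitch
ggccttcgggccaaggaCTCGGGGTGCCCTTCTGCGTGAAGGCTGAGAAATACCCGT
ATCACCTGATCTGGATAATGCCAGCGTAGGGAAGTTCTCGATCCGGTTCGCCGGATC
CAAATcgggcttcggtccggttc
""".splitlines()

seqs = read_fasta(example)
seq = seqs["TPP_riboswitch"]
masked = masked_positions(seq)  # lowercase primer-binding sites at both ends
```

Here masked covers the lowercase stretch at the 5′ end and the lowercase stretch at the 3′ end, which are the positions that would be left out of the reactivity profile.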

Primers file

Primer pairs should be listed below a header line giving the name of the RNA, preceded by '>'. RNA names in this file must match the RNA names in the provided target sequence FASTA files.

>RNA_name
forward_primer_sequence reverse_primer_sequence

Example:

>U1_snRNA
ATACTTACCTGGCA CAGGGGAAAGCGCGAA

Each primer pair should be on its own line. Multiple pairs can be provided, as can multiple RNAs.
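
As a rough sketch of how such a file could be read, the snippet below collects the primer pairs for each RNA. It assumes the forward and reverse primers are separated by whitespace, as in the example above; read_primers is a hypothetical helper, not part of ShapeMapper.

```python
def read_primers(lines):
    """Parse a primers file into {rna_name: [(forward, reverse), ...]}."""
    primers = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            primers.setdefault(name, [])
            continue
        if name is None:
            raise ValueError("primer pair found before any >RNA_name line")
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected two primers per line, got: {line!r}")
        primers[name].append((fields[0], fields[1]))
    return primers


example = """>U1_snRNA
ATACTTACCTGGCA CAGGGGAAAGCGCGAA
""".splitlines()

primers = read_primers(example)
```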

FASTQ file(s)

These files must have the extension .fastq or .fq, or .fastq.gz if they are compressed, and must be FASTQ formatted.

If --folder is used to pass a folder of FASTQ files to ShapeMapper, ShapeMapper will attempt to identify and match up files corresponding to paired reads by finding 'R1'|'r1' and 'R2'|'r2' in the filenames, separated by '.' or '_' characters.
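
To check in advance whether a folder's filenames are likely to pair up, something like the following can help. It is one plausible reading of the rule described above (split the filename on '.' and '_', then look for an R1/r1 or R2/r2 token) and is not ShapeMapper's actual matching code; pair_fastq_names is a hypothetical helper.

```python
import re


def pair_fastq_names(filenames):
    """Group filenames into (R1, R2) pairs based on an R1/R2 token."""
    groups = {}
    for fname in filenames:
        tokens = re.split(r"[._]", fname)
        for i, tok in enumerate(tokens):
            if tok in ("R1", "r1", "R2", "r2"):
                # Files belong together when everything except the
                # R1/R2 token is identical.
                key = tuple(tokens[:i] + tokens[i + 1:])
                groups.setdefault(key, {})[tok[1]] = fname
                break
    return sorted(
        (g["1"], g["2"]) for g in groups.values() if "1" in g and "2" in g
    )


names = [
    "sampleA_R1_001.fastq.gz",
    "sampleA_R2_001.fastq.gz",
    "sampleB.r1.fq",
    "sampleB.r2.fq",
    "orphan_R1.fastq",
]
pairs = pair_fastq_names(names)
```

In this example orphan_R1.fastq has no mate and is left out of pairs.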

Reads from separate instrument barcode indices must be in separate files, and should not contain index sequences.

    

Output files

<name>_shapemapper_log.txt

Run progress and summary outputs. Includes mate pair merging stats, read alignment stats, reactivity profile quality control checks, and amplicon primer pair read depths.

    

<name>_<RNA>_profile.txt

Tab-delimited text columns. First line is column names.

Column name                               Content
Nucleotide                                Nucleotide number (1-based)
Sequence                                  Nucleotide (AUGCaugc)
<Sample>_mutations                        Mutation counts
<Sample>_read_depth                       Read depth
<Sample>_effective_depth                  See Effective read depth
<Sample>_rate                             Effective mutation rate, calculated as
                                          (mutation count / effective read depth)
<Sample>_off_target_mapped_depth          Simple mapped read depths for reads
                                          not meeting --amplicon/--primer location
                                          requirements
<Sample>_low_mapq_mapped_depth            Simple mapped read depths for reads
                                          not meeting --min-mapq
<Sample>_mapped_depth                     Simple mapped read depths for included
or <Sample>_primer_pair_<n>_mapped_depth  reads, broken down by primer pair if
                                          applicable
Reactivity_profile                        Calculated reactivity profile
Std_err                                   Standard error
HQ_profile                                Reactivity profile with high-background and
                                          low-depth positions excluded (set to nan)
HQ_stderr                                 Standard error with high-background and
                                          low-depth positions excluded
Norm_profile                              Reactivity profile after normalization
                                          (scaling)
Norm_stderr                               Standard error after normalization
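
A profile table can be loaded with any tab-delimited reader. The sketch below uses Python's csv module on a tiny in-memory table and recomputes the rate column as mutation count / effective read depth. The rows are made up for illustration, "Modified" is an assumed sample label standing in for <Sample>, and returning nan at zero depth is this example's choice rather than documented behavior.

```python
import csv
import io
import math

# Made-up rows for illustration only; "Modified" is an assumed sample label.
example = (
    "Nucleotide\tSequence\tModified_mutations\t"
    "Modified_effective_depth\tModified_rate\tNorm_profile\n"
    "1\tg\t0\t0\tnan\tnan\n"
    "2\tA\t5\t1000\t0.005\t0.42\n"
    "3\tU\t12\t1200\t0.01\t1.1\n"
)


def read_profile(handle):
    """Return the rows of a profile table as dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def recompute_rate(row, sample="Modified"):
    """Mutation count / effective read depth (nan when depth is zero)."""
    mutations = float(row[f"{sample}_mutations"])
    depth = float(row[f"{sample}_effective_depth"])
    return mutations / depth if depth > 0 else math.nan


rows = read_profile(io.StringIO(example))
rates = [recompute_rate(row) for row in rows]
```

For a real file, replace io.StringIO(example) with an open file handle.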

    

<name>_<RNA>.shape

Reactivity profile in format expected by RNAstructure software. Tab-delimited text. Two columns. First column is 1-based nucleotide position. Second is normalized reactivity, with excluded positions set to -999.

    

<name>_<RNA>.map

Same as .shape file, but with an additional two columns. Third column is stderr, fourth is nucleotide sequence.
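
Both layouts can be read with the same small routine, sketched below. It treats -999 as "excluded" and returns None for those positions. The sample lines are invented for illustration, and read_reactivities is a hypothetical helper name.

```python
EXCLUDED = -999.0  # marker used for excluded positions


def read_reactivities(lines):
    """Parse .shape (2 columns) or .map (4 columns) lines.

    Returns a list of (position, reactivity, stderr, nucleotide) tuples;
    reactivity is None for excluded positions, and stderr/nucleotide are
    None when the columns are absent (.shape files).
    """
    rows = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines
        position = int(fields[0])
        value = float(fields[1])
        reactivity = None if value == EXCLUDED else value
        stderr = float(fields[2]) if len(fields) > 2 else None
        nucleotide = fields[3] if len(fields) > 3 else None
        rows.append((position, reactivity, stderr, nucleotide))
    return rows


# Invented example lines, not real data.
shape_rows = read_reactivities(["1\t-999\n", "2\t0.42\n"])
map_rows = read_reactivities(["1\t-999\t0\tG\n", "2\t0.42\t0.05\tA\n"])
```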

    

<name>_<RNA>_varna_colors.txt and <name>_<RNA>_ribosketch_colors.txt

Simplified reactivity profile suitable for import into VARNA or Ribosketch. Single column of normalized reactivity values. Reactivities above 0.85 are set to 0.85, reactivities below 0 are set to 0, and missing data positions are set to 0.
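
The clamping rule described above amounts to the following, shown here as a small Python sketch. Representing missing data as None or nan is an assumption of this example; to_color_value is a hypothetical name.

```python
import math


def to_color_value(reactivity, cap=0.85):
    """Clamp a normalized reactivity to [0, cap]; missing data becomes 0."""
    if reactivity is None or math.isnan(reactivity):
        return 0.0
    return min(max(reactivity, 0.0), cap)


values = [to_color_value(x) for x in [1.2, -0.1, 0.3, None, math.nan]]
```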

    

<name>_<RNA>_profiles.pdf

Figures showing read depths, mutation rates, and reactivity profile.

    

<name>_<RNA>_histograms.pdf

Figures showing read depth, mutation rate and reactivity histograms.

    

<name>_<RNA>_mapped_depths.pdf

Figures showing simple mapped read depths. Shows reads excluded due to low aligner-reported MAPQ (mapping quality score), and shows off-target reads excluded due to not aligning near expected amplicon primer pair locations. Reads included in analysis are further broken down by primer pair. See example plots.

    

<name>_<RNA>_per-amplicon_abundance.txt

Nucleotide locations and maximum mapped read depths associated with each amplicon primer pair.

    

Optional intermediate output files

Processed reads

Commandline option: --output-processed-reads

These files contain reads after the initial quality trimming and paired read merging steps.

    

Aligned reads

Commandline option: --output-aligned-reads

Filename: *_aligned.sam if bowtie2 was used, or *_aligned_paired.sam and *_aligned_unpaired.sam if STAR was used.

Format: SAM

    

Parsed mutations

Commandline option: --output-parsed-mutations

Filename: <name>_<sample>_<RNA>_parsed.mut

Format:

Text, one line per mapped read. Major fields are tab-delimited. Final field contains internal space-delimited fields.

Field  Content
1      read type (see below)
2      read name
3      0-based leftmost mapping position (inclusive)
4      0-based rightmost mapping position (inclusive)
5      read mapping category (see below)
6      primer pair index (0-based),
       or -999 if no associated primers
7      mapped depth array (see below)
8      effective depth array (see below)
9      mutation count array (see below)
10     mutations (see below)

Read type:

Read type is one of

  • PAIRED_R1
  • PAIRED_R2
  • UNPAIRED_R1
  • UNPAIRED_R2
  • UNPAIRED
  • MERGED
  • PAIRED

Read mapping category:

Read mapping category is one of

  • INCLUDED
  • LOW_MAPQ
  • OFF_TARGET

Off-target and low-MAPQ reads are excluded from analysis (with the exception of showing up as dashed lines in mapped depth plots).

Arrays:

Arrays are right - left + 1 characters long, where left and right are the mapping positions in fields 3 and 4 (that is, each array is the same length as the reference sequence over the aligned region). Arrays contain only '0' and '1' characters.

Mapped depths: Simple read coverage over the aligned region. For unpaired reads or paired reads without a mapping mate pair, this array is full of '1's. For paired reads that do not overlap, this array will contain a region of interior '0's that indicate non-covered sequence between the mate pairs.

Effective read depths: Read coverage excluding primer regions, low-quality basecalls, and the covered region of multinucleotide mutations excepting the inferred adduct site. These arrays are summed to give the denominator used in calculating mutation rate.

Mutation counts: '1' indicates inferred adduct sites for a single read. These arrays are summed to give the numerator used in calculating mutation rate.
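
The summation described above can be sketched as follows: each read contributes its effective depth array to the denominator and its mutation count array to the numerator, offset by the read's leftmost mapping position. The reads here are toy values, and reporting nan where the summed depth is zero is a choice made for this example.

```python
import math


def mutation_rates(reads, target_length):
    """Sum per-read arrays and return (counts, depth, rates) per position.

    Each read is (left, effective_depth_array, mutation_count_array),
    with left 0-based and both arrays given as strings of '0'/'1'.
    """
    depth = [0] * target_length
    counts = [0] * target_length
    for left, effective, mutated in reads:
        for offset, (e, m) in enumerate(zip(effective, mutated)):
            depth[left + offset] += int(e)
            counts[left + offset] += int(m)
    rates = [c / d if d else math.nan for c, d in zip(counts, depth)]
    return counts, depth, rates


# Toy reads for illustration only.
reads = [
    (0, "11111", "00100"),
    (2, "1101", "1000"),
]
counts, depth, rates = mutation_rates(reads, target_length=7)
```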

Mutations:

Mutation fields are space-delimited in groups of five. Each group represents a mutation from the target (reference) sequence.

Field  Content
1      0-based nearest unchanged target sequence position on the left
2      0-based nearest unchanged target sequence position on the right
3      Double-quoted read sequence that replaces the target sequence
       between the two positions.
4      Double-quoted basecall quality scores for each read position
       in the previous field.
5      Double-quoted mutation classification (see below)

Mutation classifications:

"A-","T-","G-","C-" (single-nucleotide deletions)

"-A","-T","-G","-C","-N" (single-nucleotide insertions)

"AT", "AG", "AC", "TA", "TG", "TC", "GA", "GT", "GC", "CA", "CT", "CG" (single-nucleotide mismatches)

"multinuc_deletion"

"multinuc_insertion"

"multinuc_mismatch"

"complex_deletion"

"complex_insertion"

"N_match" (not a real mutation, just an ambiguous basecall)

All classifications may also have _ambig appended if a mutation involves any ambiguously aligned nucleotides.
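
Putting the field descriptions together, a line of a parsed mutations file could be read roughly as below. This is a sketch based only on the layout described in this section; the example line is invented, and parse_read and parse_mutations are hypothetical helper names. Quoted values are unwrapped by dropping the first and last character rather than with a shell-style tokenizer, since quality strings may themselves contain a double-quote character.

```python
from collections import namedtuple

Mutation = namedtuple("Mutation", "left right seq qual tag")


def parse_mutations(field):
    """Parse the space-delimited final field into Mutation tuples."""
    tokens = field.split()
    if len(tokens) % 5 != 0:
        raise ValueError("mutation field is not in groups of five")
    mutations = []
    for i in range(0, len(tokens), 5):
        left, right, seq, qual, tag = tokens[i:i + 5]
        # Strip the surrounding double quotes from the last three values.
        mutations.append(
            Mutation(int(left), int(right), seq[1:-1], qual[1:-1], tag[1:-1])
        )
    return mutations


def parse_read(line):
    """Parse one tab-delimited line into a dict of named fields."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (10 - len(fields))  # tolerate a missing final field
    left, right = int(fields[2]), int(fields[3])
    for array in fields[6:9]:
        if len(array) != right - left + 1:
            raise ValueError("array length does not match mapped region")
    return {
        "read_type": fields[0],
        "read_name": fields[1],
        "left": left,
        "right": right,
        "category": fields[4],
        "primer_pair": int(fields[5]),
        "mapped_depth": fields[6],
        "effective_depth": fields[7],
        "mutation_counts": fields[8],
        "mutations": parse_mutations(fields[9]),
    }


# Invented example line: an A-to-T mismatch at 0-based position 4.
line = ('MERGED\tread_1\t0\t9\tINCLUDED\t-999\t1111111111\t1111111111\t'
        '0000100000\t3 5 "T" "F" "AT"')
read = parse_read(line)
```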

    

Mutation counts

Commandline option: --output-counted-mutations

Filename: <name>_<sample>_<RNA>_mutation_counts.txt

Format:

Tab-delimited text columns. First line is column headers. One column for each mutation classification listed in the previous section, with the exception of N_match. These columns contain mutation counts listed 5′ to 3′.

Columns read_depth and effective_depth contain sequencing depths (see Effective read depth).

Columns off_target_mapped_depth and low_mapq_mapped_depth indicate simple mapped read depths for reads excluded due to failing to align near expected amplicon primer sites or due to low aligner-reported MAPQ, respectively.

Columns mapped_depth or primer_pair_<n>_mapped_depth indicate simple mapped read depths for all reads included in analysis, broken down by associated amplicon primer pair if applicable.
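
As an illustration of working with this table, the sketch below totals the per-classification counts at each position. It assumes the classification column headers match the classification strings listed in the previous section and that every other column name ends in "_depth", as in the descriptions above. The rows are made up, and this simple total is not necessarily how ShapeMapper itself aggregates mutations.

```python
import csv
import io

# Made-up table with a subset of columns, for illustration only.
example = (
    "A-\tAG\tmultinuc_deletion\tread_depth\teffective_depth\n"
    "0\t2\t1\t500\t480\n"
    "1\t0\t0\t510\t500\n"
)


def total_mutations(handle):
    """Sum all non-depth columns for each row (one row per nucleotide)."""
    totals = []
    for row in csv.DictReader(handle, delimiter="\t"):
        totals.append(
            sum(int(v) for k, v in row.items() if not k.endswith("_depth"))
        )
    return totals


totals = total_mutations(io.StringIO(example))
```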

    
