ShapeMapper file format descriptions

Input files

FASTA file(s)

This file specifies the reference (target) sequences. The file extension should be .fa or .fasta. The file should contain one or more target DNA sequences (use 'T', not 'U'). Sequences must not contain spaces or tabs, but may be split across multiple lines. Each sequence must be preceded by a line of the form >RNA_name, where RNA_name is replaced with the name of the RNA of interest. Lowercase positions are excluded from reactivity profiles; use lowercase to mark primer-binding sites when amplicon primers sit at either end of the sequence. If multiple primer pairs were used, or the primer sites are not at the ends of the sequence, provide the primer sequences in a separate file with --primers (see below).

Example:

>TPP_riboswitch
ggccttcgggccaaggaCTCGGGGTGCCCTTCTGCGTGAAGGCTGAGAAATACCCGT
ATCACCTGATCTGGATAATGCCAGCGTAGGGAAGTTCTCGATCCGGTTCGCCGGATC
CAAATcgggcttcggtccggttc
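
As a rough illustration of the rules above, the following Python sketch reads FASTA text, joins wrapped sequence lines, and lists the lowercase positions that would be excluded from reactivity profiles. The helper names `read_fasta` and `masked_positions` are invented for this example and are not part of ShapeMapper.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def masked_positions(seq):
    """Return the 1-based positions written in lowercase."""
    return [i for i, base in enumerate(seq, start=1) if base.islower()]


example = """>TPP_riboswitch
ggccttcgggccaaggaCTCGGGGTGCCCTTCTGCGTGAAGGCTGAGAAATACCCGT
ATCACCTGATCTGGATAATGCCAGCGTAGGGAAGTTCTCGATCCGGTTCGCCGGATC
CAAATcgggcttcggtccggttc
"""
seqs = read_fasta(example)
masked = masked_positions(seqs["TPP_riboswitch"])
print(len(masked), "lowercase positions")
```

Keeping the original case in the parsed sequence, rather than uppercasing on load, is what lets the primer-binding sites be recovered later.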

Primers file

List the primer pairs for each RNA on the lines following its name, which must be preceded by '>'. RNA names in this file must match the RNA names in the provided target sequence .fa files.

>RNA_name
forward_primer_sequence reverse_primer_sequence

Example:

>U1_snRNA
ATACTTACCTGGCA CAGGGGAAAGCGCGAA

Each primer pair should be on its own line. Multiple pairs can be provided, as can multiple RNAs.
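
The layout described above can be read with a few lines of Python. This is only a sketch: the function name `read_primers` is hypothetical, and it assumes the two primers on a line are separated by whitespace, as in the example.

```python
def read_primers(text):
    """Parse a primers file into {RNA_name: [(forward, reverse), ...]}."""
    primers = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            primers.setdefault(name, [])
            continue
        if name is None:
            raise ValueError("primer pair found before any '>RNA_name' line")
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected two primers on one line, got: {line!r}")
        primers[name].append((fields[0], fields[1]))
    return primers


example = """>U1_snRNA
ATACTTACCTGGCA CAGGGGAAAGCGCGAA
"""
print(read_primers(example))
```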

FASTQ file(s)

These files must have the extension .fastq or .fq, or .fastq.gz if they are compressed, and must be FASTQ formatted.

If --folder is used to pass a folder of FASTQ files to ShapeMapper, ShapeMapper will attempt to identify and match up files corresponding to paired reads by finding 'R1'|'r1' and 'R2'|'r2' in the filenames, separated by '.' or '_' characters.
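
To check ahead of time whether a folder's filenames are likely to pair up, something like the following Python sketch can help. It is an approximation written from the description above, not ShapeMapper's actual matching code; the function name `pair_fastq` and the grouping strategy are assumptions.

```python
import re
from collections import defaultdict


def pair_fastq(filenames):
    """Group filenames that differ only in an R1/R2 token delimited by '.' or '_'."""
    groups = defaultdict(dict)
    for fname in filenames:
        # Split on '.' and '_' but keep the separators so the name can be rebuilt.
        tokens = re.split(r"([._])", fname)
        for i, tok in enumerate(tokens):
            if tok.lower() in ("r1", "r2"):
                # Replace the mate token with a placeholder to form a shared key.
                key = "".join(tokens[:i] + ["R?"] + tokens[i + 1:])
                groups[key][tok.upper()] = fname
                break
    return [(g["R1"], g["R2"]) for g in groups.values() if "R1" in g and "R2" in g]


print(pair_fastq(["sample_R1.fastq.gz", "sample_R2.fastq.gz", "other.fq"]))
```

Files without a recognizable R1 or R2 token, or without a matching mate, are simply left out of the result.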

Reads from separate instrument barcode indices must be in separate files, and should not contain index sequences.

Output files

`<name>_shapemapper_log.txt`

Run progress and summary outputs. Includes mate pair merging stats, read alignment stats, reactivity profile quality control checks, and amplicon primer pair read depths.

`<name>_<RNA>_profile.txt`

Tab-delimited text columns. First line is column names.

Column name	Content
`Nucleotide`	Nucleotide number (1-based)
`Sequence`	Nucleotide (AUGCaugc)
`<Sample>_mutations`	Mutation counts
`<Sample>_read_depth`	Read depth
`<Sample>_effective_depth`	see Effective read depth
`<Sample>_rate`	Effective mutation rate calculated as `(mutation count / effective read depth)`
`<Sample>_off_target_mapped_depth`	Simple mapped read depths for reads not meeting `--amplicon`/`--primer` location requirements
`<Sample>_low_mapq_mapped_depth`	Simple mapped read depths for reads not meeting `--min-mapq`
`<Sample>_mapped_depth` or `<Sample>_primer_pair_<n>_mapped_depth`	Simple mapped read depths for included reads, broken down by primer pair if applicable
`Reactivity_profile`	Calculated reactivity profile
`Std_err`	Standard error
`HQ_profile`	Reactivity profile with high-background and low-depth positions excluded (set to nan)
`HQ_stderr`	Standard error with high-background and low-depth positions excluded
`Norm_profile`	Reactivity profile after normalization (scaling)
`Norm_stderr`	Standard error after normalization
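
A profile file can be loaded with Python's standard `csv` module. The sketch below uses a made-up, abbreviated table containing only four of the columns listed above, and assumes that excluded positions in `Norm_profile` are written as nan, as described for `HQ_profile`. The function name `read_norm_profile` is hypothetical.

```python
import csv
import io
import math

# Made-up, abbreviated example; a real file has all of the columns listed above.
SAMPLE = (
    "Nucleotide\tSequence\tNorm_profile\tNorm_stderr\n"
    "1\tg\tnan\tnan\n"
    "2\tC\t0.12\t0.03\n"
    "3\tU\t1.40\t0.10\n"
)


def read_norm_profile(handle):
    """Return a list of (position, nucleotide, reactivity, stderr) tuples."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append((
            int(row["Nucleotide"]),
            row["Sequence"],
            float(row["Norm_profile"]),  # the string "nan" parses to NaN
            float(row["Norm_stderr"]),
        ))
    return rows


profile = read_norm_profile(io.StringIO(SAMPLE))
# Keep only positions with a defined normalized reactivity.
usable = [(pos, nt, r) for pos, nt, r, _ in profile if not math.isnan(r)]
print(usable)
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.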

`<name>_<RNA>.shape`

Reactivity profile in format expected by RNAstructure software. Tab-delimited text. Two columns. First column is 1-based nucleotide position. Second is normalized reactivity, with excluded positions set to -999.

`<name>_<RNA>.map`

Same as .shape file, but with an additional two columns. Third column is stderr, fourth is nucleotide sequence.
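
Because the two formats share their first two columns, one reader can handle both. The following Python sketch is an assumption-laden example: the function name `read_shape_or_map` is invented, the sample lines are made up, and only the reactivity column is checked for the -999 marker.

```python
def read_shape_or_map(text):
    """Parse .shape (2 columns) or .map (4 columns) text into a list of dicts."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        value = float(fields[1])
        record = {
            "position": int(fields[0]),  # 1-based nucleotide position
            # -999 marks an excluded position; represent it as None.
            "reactivity": None if value == -999 else value,
        }
        if len(fields) >= 4:  # .map adds stderr and nucleotide
            record["stderr"] = float(fields[2])
            record["nucleotide"] = fields[3]
        records.append(record)
    return records


# Made-up .map content for illustration.
sample_map = "1\t-999\t0\tG\n2\t0.35\t0.04\tA\n"
print(read_shape_or_map(sample_map))
```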

`<name>_<RNA>_varna_colors.txt` and `<name>_<RNA>_ribosketch_colors.txt`

Simplified reactivity profile suitable for import into VARNA or Ribosketch. Single column of normalized reactivity values. Reactivities above 0.85 are set to 0.85, reactivities below 0 are set to 0, and missing data positions are set to 0.
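
The clamping rule above is small enough to state as a function. This Python sketch is illustrative only; `to_color_value` is a hypothetical name, and missing data is assumed to arrive as `None` or NaN.

```python
import math


def to_color_value(reactivity):
    """Clamp a normalized reactivity to [0, 0.85]; missing data becomes 0."""
    if reactivity is None or math.isnan(reactivity):
        return 0.0
    return min(max(reactivity, 0.0), 0.85)


values = [1.2, -0.3, 0.4, float("nan"), None]
print([to_color_value(v) for v in values])
```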

`<name>_<RNA>_profiles.pdf`

Figures showing read depths, mutation rates, and reactivity profile.

`<name>_<RNA>_histograms.pdf`

Figures showing read depth, mutation rate and reactivity histograms.

`<name>_<RNA>_mapped_depths.pdf`

Figures showing simple mapped read depths. Shows reads excluded due to low aligner-reported MAPQ (mapping quality score), and shows off-target reads excluded due to not aligning near expected amplicon primer pair locations. Reads included in analysis are further broken down by primer pair. See example plots.

`<name>_<RNA>_per-amplicon_abundance.txt`

Nucleotide locations and maximum mapped read depths associated with each amplicon primer pair.

Optional intermediate output files

Processed reads

Commandline option: --output-processed-reads

These files contain reads after initial quality trimming and paired read merging steps are performed.

Aligned reads

Commandline option: --output-aligned-reads

Filename: *_aligned.sam if bowtie2 used or *_aligned_paired.sam and *_aligned_unpaired.sam if STAR used.

Format: SAM

Parsed mutations

Commandline option: --output-parsed-mutations

Filename: <name>_<sample>_<RNA>_parsed.mut

Format:

Text, one line per mapped read. Major fields are tab-delimited. Final field contains internal space-delimited fields.

Field	Content
1	read type (see below)
2	read name
3	0-based leftmost mapping position (inclusive)
4	0-based rightmost mapping position (inclusive)
5	read mapping category (see below)
6	primer pair index (0-based), or -999 if no associated primers
7	mapped depth array (see below)
8	effective depth array (see below)
9	mutation count array (see below)
10	mutations (see below)

Read type:

Read type is one of

PAIRED_R1
PAIRED_R2
UNPAIRED_R1
UNPAIRED_R2
UNPAIRED
MERGED
PAIRED

Read mapping category:

Read mapping category is one of

INCLUDED
LOW_MAPQ
OFF_TARGET

Off-target and low-MAPQ reads are excluded from analysis; they appear only as dashed lines in mapped depth plots.

Arrays:

Each array is (rightmost position - leftmost position + 1) characters long, that is, the same length as the reference sequence over the aligned region, and contains only '0' and '1' characters.

Mapped depths: Simple read coverage over the aligned region. For unpaired reads, or paired reads whose mate did not map, this array is all '1's. For paired reads that do not overlap, this array contains an interior run of '0's marking the uncovered sequence between the mates.

Effective read depths: Read coverage excluding primer regions, low-quality basecalls, and the covered region of multinucleotide mutations excepting the inferred adduct site. These arrays are summed to give the denominator used in calculating mutation rate.

Mutation counts: '1' indicates inferred adduct sites for a single read. These arrays are summed to give the numerator used in calculating mutation rate.
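
The summation described above can be sketched in Python as follows. The function name `mutation_rates` and the two example reads are made up; the sketch assumes each array is placed at the read's 0-based leftmost mapping position and that the rate is the mutation count divided by the effective read depth.

```python
import math


def mutation_rates(reads, seq_len):
    """Sum per-read arrays and return (mutation counts, effective depths, rates).

    `reads` is an iterable of (left, effective_depth_array, mutation_count_array),
    where `left` is the 0-based leftmost mapping position and the arrays are
    strings of '0' and '1' characters.
    """
    depth = [0] * seq_len
    count = [0] * seq_len
    for left, depth_array, count_array in reads:
        for offset, (d, c) in enumerate(zip(depth_array, count_array)):
            depth[left + offset] += int(d)
            count[left + offset] += int(c)
    # Positions with zero effective depth have no defined rate.
    rates = [c / d if d > 0 else math.nan for c, d in zip(count, depth)]
    return count, depth, rates


# Two made-up reads over a 6-nt target.
counts, depths, rates = mutation_rates(
    [(0, "1111", "0100"), (2, "1101", "0001")], seq_len=6
)
print(counts, depths, rates)
```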

Mutations:

Mutation fields are space-delimited in groups of five. Each group represents a mutation from the target (reference) sequence.

Field	Content
1	0-based nearest unchanged target sequence position on the left
2	0-based nearest unchanged target sequence position on the right
3	Double-quoted read sequence that replaces the target sequence between the two positions.
4	Double-quoted basecall quality scores for each read position in the previous field.
5	Double-quoted mutation classification (see below)

Mutation classifications:

"A-", "T-", "G-", "C-" (single-nucleotide deletions)

"-A", "-T", "-G", "-C", "-N" (single-nucleotide insertions)

"AT", "AG", "AC", "TA", "TG", "TC", "GA", "GT", "GC", "CA", "CT", "CG" (single-nucleotide mismatches)

"multinuc_deletion"

"multinuc_insertion"

"multinuc_mismatch"

"complex_deletion"

"complex_insertion"

"N_match" (not a real mutation, just an ambiguous basecall)

All classifications may also have _ambig appended if a mutation involves any ambiguously aligned nucleotides.
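
Putting the field descriptions together, one line of a parsed mutations file could be read roughly as shown below. This is a sketch under stated assumptions: the names `Mutation` and `parse_mut_line` are invented, the example line is made up, and the quoted fields are assumed to contain no spaces. Quotes are stripped by position rather than by a quote-aware parser, so a double-quote character inside a quality string is left alone.

```python
from dataclasses import dataclass


@dataclass
class Mutation:
    left: int            # 0-based nearest unchanged position on the left
    right: int           # 0-based nearest unchanged position on the right
    read_seq: str        # read sequence replacing the target between them
    quals: str           # basecall quality characters for read_seq
    classification: str


def parse_mut_line(line):
    """Split one tab-delimited line into its major fields and mutation groups."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 tab-delimited fields, got {len(fields)}")
    # Assumption: a read with no mutations may have an empty or absent final field.
    tokens = fields[9].split() if len(fields) > 9 else []
    if len(tokens) % 5 != 0:
        raise ValueError("mutation fields are not in groups of five")
    mutations = [
        Mutation(int(t[0]), int(t[1]), t[2][1:-1], t[3][1:-1], t[4][1:-1])
        for t in (tokens[i:i + 5] for i in range(0, len(tokens), 5))
    ]
    return {
        "read_type": fields[0],
        "read_name": fields[1],
        "left": int(fields[2]),
        "right": int(fields[3]),
        "category": fields[4],
        "primer_pair": int(fields[5]),
        "mapped_depth": fields[6],
        "effective_depth": fields[7],
        "mutation_count": fields[8],
        "mutations": mutations,
    }


# Made-up line: a merged read covering positions 0-9 with one A-to-T mismatch.
example_line = (
    "MERGED\tread_1\t0\t9\tINCLUDED\t-999\t"
    "1111111111\t1111111111\t0000100000\t"
    '3 5 "T" "I" "AT"'
)
print(parse_mut_line(example_line))
```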

Mutation counts

Commandline option: --output-counted-mutations

Filename: <name>_<sample>_<RNA>_mutation_counts.txt

Format:

Tab-delimited text columns. First line is column headers. One column for each mutation classification listed in the previous section, with the exception of N_match. These columns contain mutation counts listed 5′ to 3′.

Columns read_depth and effective_depth contain sequencing depths (see Effective read depth).

Columns off_target_mapped_depth and low_mapq_mapped_depth indicate simple mapped read depths for reads excluded because they failed to align near expected amplicon primer sites or because of low aligner-reported MAPQ, respectively.

Columns mapped_depth or primer_pair_<n>_mapped_depth indicate simple mapped read depths for all reads included in analysis, broken down by associated amplicon primer pair if applicable.
