ART_SOLiD (version 1.3.3) README (last updated on 03/19/2015) Weichun Huang at whduke@gmail.com DESCRIPTION: ART_SOLiD is a simulation program to generate sequence read data of SOLiD sequencing reads. ART generates reads according to a SOLiD read error profile. The built-in error profile is an empiricial error profile summarized from large SOLiD sequencing data. ART has been using for testing or benchmarking a variety of method or tools for next-generation sequencing data analysis, including read alignment, de novo assembly, detection of SNP, CNV, or other structure variation. art_SOLiD can generate both single-end, matepair, and paired-end of SOLiD sequencing platform. art_SOLiD also support amplicon sequencing simulation with RNA references. Its outputs include FASTQ read, MAP alignment, and optional SAM alignment files. The map2bed.pl tool included can convert ART MAP alignment files to a UCSC BED file EXAMPLES In the "examples" subdirectory, the shell script "run_test_examples_SOLiD.sh" gives examples of using art_SOLiD for read simulation. To test these two examples, just run the script "run_test_examples_SOLiD.sh" USAGES SINGLE-END (F3 READ) SIMULATION art_SOLiD [ options ] MATE-PAIR READS (F3-R3 PAIR) SIMULATION art_SOLiD [ options ] PAIRED-END READS (F3-F5 PAIR) SIMULATION art_SOLiD [ options ] AMPLICON SEQUENCING SIMULATION art_SOLiD [ options ] -A s art_SOLiD [ options ] -A m art_SOLiD [ options ] -A p ===== PARAMETERS ===== INPUT_SEQ_FILE - filename of DNA/RNA reference sequences in FASTA format OUTPUT_FILE_PREFIX - prefix or directory for all output data files FOLD_COVERAGE - fold of read coverage over the reference sequences LEN_READ - length of F3/R3 reads LEN_READ_F3 - length of F3 reads for paired-end read simulation LEN_READ_F5 - length of F5 reads for paired-end read simulation READS_PER_AMPLICON - number of reads per amplicon READ_PAIRS_PER_AMPLICON - number of read pairs per amplicon MEAN_FRAG_LEN - mean DNA/RNA fragment size for matepair/paired-end read simulation STD_DEV - standard deviation of the DNA/RNA fragment sizes for matepair/paired-end read simulation ===== OPTIONAL PARAMETERS ===== -A specify the read type for amplicon sequencing simulation (s:single-end, m: matepair, p: paired-end) -M indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch -s indicate to generate a SAM alignment file -r specify the random seed for the simulation -f specify the scale factor adjusting error rate (e.g., -f 0 for zero-error rate simulation) -p specify user's own read profile for simulation ===== EXAMPLES ===== 1) singl-end 25bp reads simulation at 10X coverage art_SOLiD -s seq_reference.fa ./outdir/single_dat 25 10 2) singl-end 75bp reads simulation at 20X coverage with users' error profile art_SOLiD -s -p ../SOLiD_profiles/pseudo_profile ./seq_reference.fa ./dat_userProfile 75 20 3) matepair 35bp (F3-R3) reads simulation at 20X coverage with DNA MEAN fragment size 2000bp and STD 50 art_SOLiD -s seq_reference.fa ./outdir/matepair_dat 35 20 2000 50 4) matepair reads simulation with a fixed random seed art_SOLiD -r 777 -s seq_reference.fa ./outdir/matepair_fs 50 10 1500 50 5) paired-end reads (75bp F3, 35bp F5) simulation with the MEAN fragment size 250 and STD 10 at 20X coverage art_SOLiD -s seq_reference.fa ./outdir/paired_dat 75 35 50 250 10 6) amplicon sequencing with 25bp single-end reads at 100 reads per amplicon art_SOLiD -A s -s amp_reference.fa ./outdir/amp_single 25 100 7) amplicon sequencing with 50bp matepair reads at 80 read pairs per amplicon art_SOLiD -A m -s amp_reference.fa ./outdir/amp_matepair 50 80 8) amplicon sequencing with paired-end reads (35bp F3, 25bp F5 reads) at 50 pairs per amplicon art_SOLiD -A p -s amp_reference.fa ./outdir/amp_pair 35 25 50 OUTPUT DATA FILES *.fq - FASTQ read data files. For matepair/paired-end read simulation, *_R3.fq/*_F5.fq contains data of the first reads, and *_F3.fq for the second reads. *.map - read alignment files. For matepair read simulation, *_R3/*_F5.map has read alignments for the first reads and *_F3.map for the second reads. *.sam - SAM read alignment files FASTQ file format A FASTQ file contains both sequence bases and quality scores of sequencing reads and is in the following format: @read_id sequence_read + base_quality_scores A base quality score is coded by the ASCII code of a single character, where the quality score is equal to ASCII code of the character minus 33. Example: @1_1_1_F3 T10121011322000311302213102311132 + 163=+48./48<347//,=/84-4)77(''-) @1_1_2_F3 T01213000012110232021321303011332 + 1>7<-653?01:55./.%3+'61-+40(*)#' MAP file format A MAP file has a Header and main Body parts. The header part includes the command used to generate this file and reference sequence id and length. The header @CM tag for command line, and @SQ for reference sequence. A header always starts with "##ART" and ends with "##Header End". ART MAP alignment files can be converted to a UCSC BED file by the map2bed.pl tool. HEADER EXAMPLE ##ART_SOLiD read_length 32 @CM ../../bin/MacOS64/art_SOLiD -s ./testSeq.fa ./single_SOLiD_test 32 10 @SQ seq1 7207 @SQ seq2 3056 ##Header End The body part of a MAP file is a tab-delimited text alignment data where each line is the mapping information record of a read. The format of each line is below: ref_seq_id read_id ref_map_pos ref_seq_strand num_sequencing_errors err_pos_1 WrongCorrectColor_1 err_pos_2 WrongCorrectColor_2 ... ref_map_pos - the alignment start position of reference sequence. ref_map_pos is always relative to the strand of reference sequence. That is, ref_map_pos 10 in the plus (+) strand is different from ref_map_pos 10 in the minus (‐) stand. num_sequencing_errors - number of sequencing call errors err_pos_n - the zero-indexed read position of the n_th call error WrongCorrectColor_n - the n_th call error representing by two digits number with the 1st number the wrong called color and the 2nd the correct color Example: seq1 1_1_1_F3 6028 + 4 18 31 22 03 25 32 27 23 SAM format SAM is a standard format for next-gen sequencing read alignments. The details of the format and examples are available at the links below: 1) http://samtools.sourceforge.net/SAM1.pdf 2) http://genome.sph.umich.edu/wiki/SAM Read sequences in a SAM file are in the regular DNA base space. As one change in SOLiD reads in the color space can result multiple changes in base space, so not all mismatches in base space between reads and its reference sequence are of SOLiD sequencing errors. BED format See the format at UCSC http://genome.ucsc.edu/FAQ/FAQformat.html#format1 NOTE: both MAP and BED format files use 0-based coordinate system while SAM format uses 1-based coordinate system. ERROR PROFILE FILE FORMAT An ART SOLiD read error profile has the following five columns: 1) read position 2) correct base 3) wrong base called 4) probability of calling the wrong base in the 1st read 5) probability of calling the wrong base in the 2nd read and an error profile end with "*end_of_profile". Please see "profile_default" in the folder SOLiD_profiles for a real example