README for PRISM: Pair Read informed Split Mapper, version 1.1.4 http://compbio.cs.toronto.edu/prism PRISM is brought to you by: Yue Jiang Michael Brudno The authors may be contacted at prism at cs dot toronto dot edu. -------------------------------------------------------------------------------- Table of Contents -------------------------------------------------------------------------------- 1. Overview 2. Minimum Requirements 2.1 RAM Requirement 2.2 Other Requirements 3. Sample Usage Scenarios 3.1 Installing PRISM 3.2 Compiling PRISM from source 3.3 Run sample 4. Usage of scripts 5. Usage of PRISM 6. Contact 7. Acknowledgements -------------------------------------------------------------------------------- 1. Overview -------------------------------------------------------------------------------- PRISM is a software for split read (reads which span across a structrual variant -- SV ) mapping of paired end reads and SV calling from the mapping result. The main features of PRISM include: - Works on paired-end (High Throughput Sequencing) HTS data; - SAM input format; - SAM output format; - Currently support BWA (http://bio-bwa.sourceforge.net/) LETTER SPACE output sam file and SHRiMP (http://compbio.cs.toronto.edu/shrimp/) COLOR SPACE out put sam file. Support for other HTS short reads aligner such as Bowtie etc. will come soon; PRISM tries to make full use of infomation contained in paired-end data. PRISM works on following fact: 1. If a unmapped read is a split read, and most of the reads nearby its mate are concordantly mapped, this unmapped read should be split aligned to a position about mean insert size away ( the region around this position is called concordant region) from the mapped mate. 2. If a unmapped read is a split read, and there is discordant pair cluster nearby its mapped mate, this unmapped read should be split aligned with one half to concordant region and the other half to a region which is fixed by the distal reads in the discordant cluster(this region is called discordant region). For a pair in which only one read is aligned, PRISM will try several split mapping for the unmapped mate with a modified Needleman-Wunsch algorithm, including: 1. try split mapping for the unmapped read in concordant region.(There is a simple explaination of concordant, discordant and hanging in Section 3.3.2) 2. try split mapping for the unmapped read in discordant region when discordant pairs found aroud its mapped mate. 3. the alignment with best score will be kept,and if there are discordant pairs, PRISM will prefer to keep the alignment from discordant region. After the alignment, PRISM has several tools to call SVs from the output sam files.PRISM will parse the potential location and type of a SV supported by each read from its CIGAR and mapping position. Then the parsed locations and types of SVs are gathered and sorted. Finally users can filter them by mapping scores, support read numbers and other kinds of threshhold. -------------------------------------------------------------------------------- 2. Minimum Requirements -------------------------------------------------------------------------------- 2.1 RAM Requirement ------------------- PRISM only deals with hanging pairs, the memory cost mainily depends on the size of reference genome. It took PRISM 5h on a single CPU and 200M memeory to get through the split mapping of 11.4 million 100bp paired-end illumina reads on chr1. However PRISM works MUCH slower on ColorSpace data due to the different split mapping algorithm. 2.2 Other Requirements ---------------------- PRISM uses the cluster algorithm of CNVer (Medevdev et al. 2010), the cluster code of cnver-0.7.2 is already integrated. If users want to use other tools for the cluster step please follow section 6. Python is required. If the input files are gziped Python 2.5 or above is requed. -------------------------------------------------------------------------------- 3. Sample Usage Scenarios -------------------------------------------------------------------------------- 3.1 Installing PRISM ---------------------- Assume we downloaded the linux binary package in ./PRISM.x86_64.tar.gz $ tar xvzf PRISM.x86_64.tar.gz $ cd PRISM $ file bin/smapper $ export PRISM_PATH=$PWD bin/smapper: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.6.9, statically linked, for GNU/Linux 2.6.9, not stripped Done! At this point the binaries are in bin/, scripts are in toolkit/ and sample data is in sample/. 3.2 Compiling PRISM from source --------------------------------- Assume we downloaded the source package in ./PRISM.src.tar.gz $ tar xvzf PRISM.src.tar.gz $ cd PRISM $ make [...warnings, hopefully no errors...] $ export PRISM_PATH=$PWD 3.3 Run sample ------------------------------------------------------------- Running PRISM contains a few steps. We have prepaired a sample dataset for users to get better understanding of how PRISM works. 3.3.1 Sample Files --------------------------- Assume PRISM folder is "{PRISM_PATH}/".Sample files are under "{PRISM_PATH}/sample" chr100.sam is the original alignment file generated by bwa-0.5.9rc1.The reference is in "{PRISM_PATH}/sample/reference".sv_locations_of_chr100.fa records the locations of structure variants in the donor. The insert size of the sample data is 500 and the standard deviation is 30. 3.3.2 Workflow --------------------------- There are a few scripts under "{PRISM_PATH}/toolkit".toolkit/run_PRISM.sh is a script contains all the steps for running PRISM incluing: a. preprocessing input files Before running PRISM main program chr100.sam needs to be divided into several sam files containing hanging pairs and discordant pairs respectively. We give a simple explaination for these pairs here. Users can find the details in the PRISM paper. Assume "->" represents a read mapped to + strand, "<-" a read mapped to - strand. "->...<-" represents a concordant pairs in which "..." represents normal mapping distance."......" represents longer mapping distance than "..." and "...?..." represents arbitary mapping distance. "?" represents a read not mapped. hanging pairs: ->...? or ?...<- discordant pairs for deletions: ->........<- discordant pairs for inversions: ->...?...-> or <-...?...<- discordant pairs for duplications: <-...?...-> This step is conducted by toolkit/bwasam2prism.py. The usage is given in section 4. **Please note the sam file used in this step MUST be SORTED BY READ NAMES which means the two reads from the same pair are in successive order. b. clustering discordant pairs: The discordant pairs will be clustered in this step.A series of cluster files will be generated under sample/PRISM_output.These files are chr100.sam.links.del.fw, chr100.sam.links.del.bw,chr100.sam.links.inver.fw,chr100.sam.links.inver.bw, chr100.sam.links.dup.fw and chr100.sam.links.dup.bw. These cluster files are part of input files of PRISM main program. This step is conducted by toolkit/cnver_cluster.sh. The usage is given in section 4. c. running PRISM main program PRISM main program runs in this step. The input files are the sam file containing hanging pairs generated in step a and the cluster files generated in step b. The hanging pair sam file should be sorted by mapping positions.PRISM will try split mapping for the unmapped read in hanging pairs and output a series of split mapping sam files. For the default settings in run_PRISM.sh these files will be like "split_xxx.sam". split_all.sam is a file containing split mapping for all structure variants types. This step is conducted by "{PRISM_PATH}/bin/sampper". The usage is given in section 4. d. calling structure variants from PRISM output sam files. The input file of this step is split_all.sam generated in step c. In sam files there is a "CIGAR" field recording how the read is aligned. This CIGAR is parsed in this step and converted to structure variant loci. These loci will be filterd by the read number support each sv and a few other factors. Users can set these values. This step is conducted by "toolkit/detect_and_filter.sh". The usage is given in section 4. These 4 steps can be run separately as soon as the previous step(s) are finished. 3.3.3 Parameters --------------------------- run_PRISM.sh has a few parameters: -m mean of insert size. -e standard deviation of insert size. -p minmal support read number of a cluster.[default:5] -I input folder of PRISM, the cluster files and input sam files will be generated in this folder.[default:"./PRISM_input"] -O output folder of PRISM, the output sam files will be generated in this folder.[default:"./PRISM_output"] -r reference file (PRISM works on ONE chromosome at one time, DO NOT use whole genome reference e.g. hg19.fa here, use chr1.fa chr2.fa ... instead) -i the original input sam file generated by bwa. -a the read aligner used to align the original reads. [BWA] for BWA letter space and [SHR] for SHRiMP color space.[default:BWA] -S the memory buffer size for unix sort.[default:1g] -l minimal read length, used to descide kmer length [default:100] In this sample users can run run_PRISM.sh as following: {PRISM_PATH}/toolkit/run_PRISM.sh -m500 -e30 -r {PRISM_PATH}/sample/reference/chr100.fa -i {PRISM_PATH}/sample/chr100.sam 3.3.4 Description of Result ---------------------------- After the run_PRISM.sh is done, users will get several result files in the output folder. 1. split_header.sam this file contains header for SAM files. 2. split_indel.sam small indel file. These indels are aligned without discordant pairs. split_cluster_del.sam deletion file. The deletions are supported by discordant pairs. split_inv.sam inversion file. split_dup.sam duplication file. 3. split_all.sam this file is a simple "cat" of the four split SAM files, the final SV loci are called from this file. 4. split_all.sam_cigar_sorted_sv this file is SV loci file, the columns from left to right are: or split_all.sam_rmmul_cigar_sorted_sv chr: chromsome gap_start: start location of the SV gap_end: end location of the SV gap_size: size of the SV type: SV type (DELetion, INSertion, DUPlication, INVersion) support_read_num: support read number of the SV contig_match: for each sv build a contig by all the split reads and compare this contig to the reference. This is the num of matched base contig_snp: the num of mismatched base of the contig left_match: the contig length on the left side of the sv right_match: the contig length on the right side of the sv mismatch_1~3: the mismatch num of the best 3 split reads score1~3: the alignment score of the best 3 split reads others: not used currently, will be considered in the future Note: if a field is not suitable for a sv, the values of the field will be "999" or "-999" 5. del_xx_xx these files are the sv loci file. "xx" is the size range. ins_xx_xx inv dup 6. del_xx_xxi_RO Sometimes the sv positions in each split read are not aligned due to mismatches in the border bases of ins_xx_xx_RO the sv. This can be fixed by realignment which PRISM doesn't do. Here we simply keep only the sv supported inv_RO by most reads and remove all the other svs overlaping this one to avoid reporting multi calls for the same sv. dup_RO These files are generated by removing overlaping calls in the foregoing files in "5". -------------------------------------------------------------------------------- 4. Usage of scripts -------------------------------------------------------------------------------- Script: bwasam2prism.py convert bwa letter space sam output to prism input Usage: bwasam2prism.py Example: bwasam2prism.py {PRISM_PATH}/sample/chr100.sam {PRISM_PATH}/sample/PRISM_input 500 30 --------------------------------------------------------------- Script: cnver_cluster.sh cnver cluster script, the discordant pairs will be clustered by this script. Usage: cnver_cluster.sh Example: cnver_cluster.sh 500 30 5 {PRISM_PATH}/sample/PRISM_input/chr100.sam --------------------------------------------------------------- Script: detect_and_filter.sh calling SVs from PRISM out and filter the results by support read number and alignment score. Usage: filter_and_detect.sh [Parameters] Parameters: -i PRISM output sam file. -r reference file (fasta) -S the memory buffer size for unix sort.[default:1g] Example: detect_and_filter.sh -i PRISM_output/split_all.sam -r sample/reference/chr100.fa --------------------------------------------------------------- Script: filter.sh filter SVs. Usage: filter.sh [Parameters] Parameters: -i sv loci file. -s support read num [default:5] -m Max mismatch for sv supported by >= supp_num reads [default:3] -n Max mismatch for sv supported by < supp_num reads [default:1] -l Flanking region length for sv supported by >= supp_num reads [default:15] -r Flanking region length for sv supported by < supp_num reads [default:30] -p Max snp num in flaning region [default:3] -o output folder [default:.] Example: filter.sh -i PRISM_output/split_all.sam_cigar_sorted_sv Note:the parameters are optimized for 100bp long reads. Users need to modify the values if the read length is different. -------------------------------------------------------------------------------- 5. Usage of PRISM -------------------------------------------------------------------------------- Command: mapsplit letter space split alignment Usage: smapper mapsplit [options] Parameters: -T split penalty for non-cluster-supported deletion(default:-74) -R split penalty for non-cluster-supported insertion(default:-50) -Q split penalty for cluster-supported deletion,duplication and inversion(default:-46) -m mapping distance mean -e mapping distance std -S singleton read file (A read pair that only one read is aligned.) -o output file prefix (A group of files with the same prefix will be generated.) -x max read length(default:100) -L forward Deletion cluster file name -l backward Deletion cluster file name -I forward Inversion cluster file name -i backward Inversion cluster file name -D forward Duplication cluster file name -d backward Dduplication cluster file name -r minimal read length, used to descide kmer length (default:100) Options: -c only align reads around cluster points -C The R2 sequence is in color-space (for SHRiMP color-space output) --------------------------------------------------------------- Command: screenout screen out reads which support a SV Usage: smapper screenout -------------------------------------------------------------------------------- 6. Using other tools for cluster step -------------------------------------------------------------------------------- It is possible fot users to use other tools's results as cluster files instead of CNVer's. Users can convert their files into BED format (http://genome.ucsc.edu/FAQ/FAQformat#format1) and pass them to smapper. Following is a description of CNVer's cluster file: The files are usually named as xxx.links... are discordant types of deletion/inversion/duplication. means the file is sorted by start/end position. This means there are 6 files if the users have all kinds of discordant pair clusters (2 for each type). Other tools may has all kinds of discordant analysis results in one file. Users need to separate the results into different files as described above. Then sort them by start position (colum #2) and end position (colum #3) separately. -------------------------------------------------------------------------------- 7. Contact -------------------------------------------------------------------------------- The program website is http://compbio.cs.toronto.edu/prism The authors of this software may be contacted at the following e-mail address: prism at cs dot toronto dot edu -------------------------------------------------------------------------------- 8. Acknowledgements -------------------------------------------------------------------------------- Development was performed at the University of Toronto's Computational Biology lab.