README for PRISM: Pair Read informed Split Mapper, version 1.1.4
http://compbio.cs.toronto.edu/prism

PRISM is brought to you by:
	Yue Jiang
	Michael Brudno

The authors may be contacted at prism at cs dot toronto dot edu.

--------------------------------------------------------------------------------
Table of Contents
--------------------------------------------------------------------------------

  1. Overview

  2. Minimum Requirements

     2.1 RAM Requirement
     2.2 Other Requirements

  3. Sample Usage Scenarios

     3.1 Installing PRISM
     3.2 Compiling PRISM from source
     3.3 Run sample
     
  4. Usage of scripts

  5. Usage of PRISM
  
  6. Contact

  7. Acknowledgements


--------------------------------------------------------------------------------
1. Overview
--------------------------------------------------------------------------------
PRISM is a software for split read (reads which span across a structrual 
variant -- SV ) mapping  of  paired  end reads and SV calling from the mapping result.   
The main features of PRISM include:

  - Works on paired-end (High Throughput Sequencing) HTS data;
  
  - SAM input format;

  - SAM output format;
  
  - Currently support BWA (http://bio-bwa.sourceforge.net/) LETTER SPACE output sam file and SHRiMP 
    (http://compbio.cs.toronto.edu/shrimp/) COLOR SPACE out put sam file. Support for other HTS short
    reads aligner such as Bowtie etc. will come soon; 

PRISM tries to make full use of infomation contained in paired-end data. PRISM works on following fact:

  1. If a unmapped read is a split read, and most of the reads nearby its mate are
  	 concordantly mapped, this unmapped read should be split aligned to a position 
  	 about mean insert size away ( the region around this position is called concordant region)
	 from the mapped mate.
  	 
  2. If a unmapped read is a split read, and there is discordant pair cluster nearby its mapped mate,
  	 this unmapped read should be split aligned with one half to concordant region and the other half
  	 to a region which is fixed by the distal reads in the discordant cluster(this region is called 
  	 discordant region).
  	 
For a pair in which only one read is aligned, PRISM will try several split mapping for the unmapped
 mate with a modified Needleman-Wunsch algorithm, including:

	1. try  split  mapping for the unmapped read in concordant region.(There is a simple explaination
	   of concordant, discordant and hanging in Section 3.3.2)
	   
	2. try  split  mapping  for  the  unmapped read in  discordant  region  when
	   discordant pairs found aroud its mapped mate.
	   
	3. the alignment with best  score  will  be kept,and if there are discordant
	   pairs, PRISM will prefer to keep the alignment from discordant region.
	     
After the alignment, PRISM has several  tools  to  call  SVs from the output sam
files.PRISM will parse the potential location and type of a SV supported by each
read from its CIGAR and mapping position. Then the parsed locations and types of
SVs are gathered and  sorted. Finally  users  can filter them by mapping scores,
support read numbers and other kinds of threshhold. 


--------------------------------------------------------------------------------
2. Minimum Requirements
--------------------------------------------------------------------------------

2.1 RAM Requirement
-------------------

PRISM only deals with hanging pairs, the memory cost mainily depends on the size
of reference genome. It took PRISM 5h on a single CPU and 200M memeory to get
through the split mapping of 11.4 million 100bp paired-end illumina reads on chr1.
However PRISM works MUCH slower on ColorSpace data due to the different split mapping
algorithm.

2.2 Other Requirements
----------------------

PRISM uses the cluster algorithm of CNVer (Medevdev et al. 2010), the cluster code of cnver-0.7.2
is already integrated. If users want to use other tools for the cluster step please follow section 6.

Python is required. If the input files are gziped Python 2.5 or above is requed.

--------------------------------------------------------------------------------
3. Sample Usage Scenarios
--------------------------------------------------------------------------------

3.1 Installing PRISM
----------------------

Assume we downloaded the linux binary package in
./PRISM.x86_64.tar.gz

  $ tar xvzf PRISM.x86_64.tar.gz
  $ cd PRISM
  $ file bin/smapper
  $ export PRISM_PATH=$PWD
  bin/smapper: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for
  GNU/Linux 2.6.9, statically linked, for GNU/Linux 2.6.9, not stripped
  
Done! At this point the binaries are in bin/, scripts are in toolkit/ and sample 
data is in sample/.

3.2 Compiling PRISM from source
---------------------------------

Assume we downloaded the source package in ./PRISM.src.tar.gz

  $ tar xvzf PRISM.src.tar.gz
  $ cd PRISM
  $ make
  [...warnings, hopefully no errors...]
  $ export PRISM_PATH=$PWD 


3.3 Run sample
-------------------------------------------------------------

Running PRISM contains a few steps. We have prepaired a sample dataset for users 
to get better understanding of how PRISM works.

3.3.1 Sample Files
---------------------------
Assume PRISM folder is "{PRISM_PATH}/".Sample files are under "{PRISM_PATH}/sample"
chr100.sam is the original alignment file generated by bwa-0.5.9rc1.The reference  
is in "{PRISM_PATH}/sample/reference".sv_locations_of_chr100.fa records the locations
of structure variants in the donor. The insert size of the sample data is 500 and
the standard deviation is 30.

3.3.2 Workflow
---------------------------
There are a few scripts under "{PRISM_PATH}/toolkit".toolkit/run_PRISM.sh is a script
contains all the steps for running PRISM incluing:

a. preprocessing input files

   Before running PRISM main program chr100.sam needs to be divided into several sam
   files containing hanging pairs and discordant pairs respectively.
   We give a simple explaination for these pairs here. Users can find the details in
   the PRISM paper. Assume "->" represents a read mapped to + strand, "<-" a read 
   mapped to - strand. "->...<-" represents a concordant pairs in which "..." 
   represents normal mapping distance."......" represents longer mapping distance than
   "..." and "...?..." represents arbitary mapping distance. "?" represents a read not
   mapped.

   hanging pairs: 			->...? or ?...<-
   discordant pairs for deletions:	->........<-
   discordant pairs for inversions:	->...?...-> or <-...?...<-
   discordant pairs for duplications:	<-...?...-> 

   This step is conducted by toolkit/bwasam2prism.py. The usage is given in section 4.
   
   **Please note the sam file used in this step MUST be SORTED BY READ NAMES which means
   the two reads from the same pair are in successive order. 

b. clustering discordant pairs:	

   The discordant pairs will be clustered in this step.A series of cluster files will
   be generated under sample/PRISM_output.These files are chr100.sam.links.del.fw,
   chr100.sam.links.del.bw,chr100.sam.links.inver.fw,chr100.sam.links.inver.bw,
   chr100.sam.links.dup.fw and chr100.sam.links.dup.bw. These cluster files are part
   of input files of PRISM main program.	 

   This step is conducted by toolkit/cnver_cluster.sh. The usage is given in section 4.

c. running PRISM main program

   PRISM main program runs in this step. The input files are the sam file containing 
   hanging pairs generated in step a and the cluster files generated in step b. The
   hanging pair sam file should be sorted by mapping positions.PRISM will try split 
   mapping for the unmapped read in hanging pairs and output a series of split mapping
   sam files. For the default settings in run_PRISM.sh these files will be like
   "split_xxx.sam". split_all.sam is a file containing split mapping for all structure
   variants types.

   This step is conducted by "{PRISM_PATH}/bin/sampper". The usage is given in section 4.
   
d. calling structure variants from PRISM output sam files.
   
   The input file of this step is split_all.sam generated in step c. In sam files there 
   is a "CIGAR" field recording how the read is aligned. This CIGAR is parsed in this
   step and converted to structure variant loci. These loci will be filterd by the read
   number support each sv and a few other factors. Users can set these values. 

   This step is conducted by "toolkit/detect_and_filter.sh". The usage is given in 
   section 4.

These 4 steps can be run separately as soon as the previous step(s) are finished. 

3.3.3 Parameters
---------------------------
  
  run_PRISM.sh has a few parameters:
  
	-m mean of insert size.
	-e standard deviation of insert size.
	-p minmal support read number of a cluster.[default:5]
	-I input folder of PRISM, the cluster files and input sam files will be generated in this folder.[default:"./PRISM_input"]
	-O output folder of PRISM, the output sam files will be generated in this folder.[default:"./PRISM_output"]
	-r reference file (PRISM works on ONE chromosome at one time, DO NOT use whole genome reference e.g. hg19.fa here, use chr1.fa chr2.fa ... instead)
	-i the original input sam file generated by bwa.
	-a the read aligner used to align the original reads. [BWA] for BWA letter space and [SHR] for SHRiMP color space.[default:BWA]
	-S the memory buffer size for unix sort.[default:1g]
	-l minimal read length, used to descide kmer length [default:100]


  In this sample users can run run_PRISM.sh as following:
	
	{PRISM_PATH}/toolkit/run_PRISM.sh -m500 -e30 -r {PRISM_PATH}/sample/reference/chr100.fa -i {PRISM_PATH}/sample/chr100.sam

3.3.4 Description of Result
----------------------------

  After the run_PRISM.sh is done, users will get several result files in the output folder.
  
  1. split_header.sam			this file contains header for SAM files.
  2. split_indel.sam			small indel file. These indels are aligned without discordant pairs.
     split_cluster_del.sam		deletion file. The deletions are supported by discordant pairs.
     split_inv.sam			inversion file.
     split_dup.sam			duplication file.
  3. split_all.sam			this file is a simple "cat" of the four split SAM files, the final SV loci are called from this file.  
  4. split_all.sam_cigar_sorted_sv	this file is SV loci file, the columns from left to right are:
                or
     split_all.sam_rmmul_cigar_sorted_sv
  		
	chr: 			chromsome
        gap_start: 		start location of the SV
        gap_end: 		end location of the SV
        gap_size: 		size of the SV
	type: 			SV type (DELetion, INSertion, DUPlication, INVersion)
	support_read_num: 	support read number of the SV
	contig_match: 		for each sv build a contig by all the split reads and compare this contig to the reference. This is the num of 
				matched base
	contig_snp: 		the num of mismatched base of the contig
	left_match:		the contig length on the left side of the sv
	right_match:		the contig length on the right side of the sv
	mismatch_1~3:		the mismatch num of the best 3 split reads
	score1~3:		the alignment score of the best 3 split reads
	others:			not used currently, will be considered in the future
	
	Note: if a field is not suitable for a sv, the values of the field will be "999" or "-999"			 

  5. del_xx_xx				these files are the sv loci file. "xx" is the size range.
     ins_xx_xx
     inv
     dup			
  6. del_xx_xxi_RO			Sometimes the sv positions in each split read are not aligned due to mismatches in the border bases of 
     ins_xx_xx_RO			the sv. This can be fixed by realignment which PRISM doesn't do. Here we simply keep only the sv supported 
     inv_RO				by most reads and remove all the other svs overlaping this one to avoid reporting multi calls for the same sv.
     dup_RO				These files are generated by removing overlaping calls in the foregoing files in "5".

--------------------------------------------------------------------------------
4. Usage of scripts
--------------------------------------------------------------------------------

Script:		bwasam2prism.py		convert bwa letter space sam output to prism input

Usage: 		bwasam2prism.py <bwa sam file> <output folder> <mean of insert size> <std of insert size>

Example:	bwasam2prism.py {PRISM_PATH}/sample/chr100.sam {PRISM_PATH}/sample/PRISM_input 500 30

---------------------------------------------------------------

Script:		cnver_cluster.sh 	cnver cluster script, the discordant pairs will be clustered by this script.

Usage:		cnver_cluster.sh <mean of insert size> <std of insert size> <min pair number for a cluster> <input sam file prefix> <mem for sort>	

Example: cnver_cluster.sh 500 30 5 {PRISM_PATH}/sample/PRISM_input/chr100.sam

---------------------------------------------------------------

Script:		detect_and_filter.sh 	calling SVs from PRISM out and filter the results by support read number and alignment score.

Usage:		filter_and_detect.sh [Parameters] 

Parameters:		 -i PRISM output sam file.
        		 -r reference file (fasta) 
			 -S the memory buffer size for unix sort.[default:1g]

Example: detect_and_filter.sh -i PRISM_output/split_all.sam -r sample/reference/chr100.fa

---------------------------------------------------------------

Script:         filter.sh    filter SVs.

Usage:          filter.sh [Parameters] 

Parameters:              -i sv loci file.
			 -s support read num [default:5]
			 -m Max mismatch for sv supported by >= supp_num reads [default:3]
			 -n Max mismatch for sv supported by < supp_num reads [default:1]
			 -l Flanking region length for  sv supported by >= supp_num reads [default:15]
			 -r Flanking region length for  sv supported by < supp_num reads [default:30]
			 -p Max snp num in flaning region [default:3]
			 -o output folder [default:.]

Example: filter.sh -i PRISM_output/split_all.sam_cigar_sorted_sv 

Note:the parameters are optimized for 100bp long reads. Users need to modify the values if the read length is different.

--------------------------------------------------------------------------------
5. Usage of PRISM
--------------------------------------------------------------------------------

Command:	mapsplit     letter space split alignment

Usage:		smapper mapsplit [options] <reference.fa>
			
Parameters:		-T split penalty for non-cluster-supported deletion(default:-74)
			-R split penalty for non-cluster-supported insertion(default:-50)
			-Q split penalty for cluster-supported deletion,duplication and inversion(default:-46)
			-m mapping distance mean
			-e mapping distance std
				
			-S singleton read file (A read pair that only one read is aligned.)
			-o output file prefix (A group of files with the same prefix will be generated.)
			-x max read length(default:100)
				
			-L forward Deletion cluster file name
			-l backward Deletion cluster file name
			-I forward Inversion cluster file name
			-i backward Inversion cluster file name
			-D forward Duplication cluster file name
			-d backward Dduplication cluster file name
			-r minimal read length, used to descide kmer length (default:100)
				
Options:		-c only align reads around cluster points
			-C The R2 sequence is in color-space (for SHRiMP color-space output)
				
---------------------------------------------------------------
				
Command:	screenout screen out reads which support a SV

Usage:		smapper screenout <split_mapping_sam_file> <sv_loci_file>

--------------------------------------------------------------------------------
6. Using other tools for cluster step
--------------------------------------------------------------------------------

It is possible fot users to use other tools's results as cluster files instead of CNVer's. Users 
can convert their files into BED format (http://genome.ucsc.edu/FAQ/FAQformat#format1) and pass them to smapper.

Following is a description of CNVer's cluster file:

The files are usually named as xxx.links.<del/inver/dup>.<fw/bw>. <del/inver/dup> are discordant
types of deletion/inversion/duplication. <fw/bw> means the file is sorted by start/end position.
This means there are 6 files if the users have all kinds of discordant pair clusters (2 for each
type).

Other tools may has all kinds of discordant analysis results in one file. Users need to separate
the results into different files as described above. Then sort them by start position (colum #2) 
and end position (colum #3) separately.

--------------------------------------------------------------------------------
7. Contact
--------------------------------------------------------------------------------

The program website is 
    	http://compbio.cs.toronto.edu/prism

The authors of this software may be contacted at the following e-mail address:
	prism at cs dot toronto dot edu


--------------------------------------------------------------------------------
8. Acknowledgements
--------------------------------------------------------------------------------

Development was performed at the University of Toronto's Computational Biology
lab.