switch

{: .no_toc .text-center }

{: .no_toc .text-delta }

TOC {:toc}

Description

Program to compute switch error rate (SER) and genotyping error rate (GER) given validation data. Validation data, in which haplotypes are known, can be obtained in multiple ways: - Simulated data, using msprime for instance. - Trio/duo data. Estimate haplotypes for offspring excluding the parents from the dataset. Use the parent and the offsprings in the validation set. - Chromosome X data. Pair male chromosomes X to make females in which phase in known.

Usage1: Benchmark haplotypes using simulated data

First, run a phasing run:

phase_common --input array/target.unrelated.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

The data in array/target.unrelated.bcf is phased simulated data, so we can also use it as validation set. Open the BCF file to check:

bcftools view -H array/target.unrelated.bcf | cut -f1-20 | head

Then, run thw switch program to compare the true haplotypes to those estimated by phase_common:

switch --validation array/target.unrelated.bcf --estimation target.phased.bcf --region 1 --output target.phased --thread 8

This command will produce a bunch of files: - target.phased.block.switch.txt.gzhas 2 columns (sample ID, position). This gives the coordinates of blocks of data coorectly phased. Can be used to produced a visual representation of the phasing . See supplemetary figure 1 of SHAPEIT4 paper. - target.phased.frequency.switch.txt.gz has 4 columns (MAC, #errors, #hets, SER). This gives SER stratified by MAC. - target.phased.sample.switch.txt.gz has 4 columns (sample ID, #errors, #hets, SER). This gives the SER per sample. This is the most important information given by the switch program. - target.phased.sample.typing.txt.gz has 4 columns (sample ID, #errors, #genotypes, GER). This gives GER per sample. - target.phased.variant.switch.txt.gz has 5 columns (rsid, position, #errors, #hets, SER). This gives the SER per variant relative to previous hets. - target.phased.variant.typing.txt.gz has 5 columns (rsid, position, #errors, #genotypes, GER). This gives the GER per variant.

To compute the SER across the entire dataset, we recommend to sum the numbers of errors and hets across all samples first:

zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'

Usage2: Benchmark haplotypes using family data

First, run build a benchmark dataset by removing parental genomes:

cat info/target.family.fam | cut -f2- | tr "\t" "\n" > parents.txt
bcftools view -Ob -o benchmark.data.bcf -S ^parents.txt array/target.family.bcf
bcftools index benchmark.data.bcf

Second, phase the benchmark dataset:

phase_common --input benchmark.data.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

Third, validate it using family data (original BCF + FAM file):

switch --validation array/target.family.bcf --estimation target.phased.bcf --pedigree info/target.family.fam --region 1 --output target.phased --thread 8

Fourth, compute SER:

zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'

Usage3: Benchmark haplotypes using haploid chromosome X data

To come!

Command line options

Basic options

Option name	Argument	Default	Description
--help	NA	NA	Produces help message
-T [ --thread ]	INT	1	Number of thread used

Input files

Option name	Argument	Default	Description
-V [--validation ]	STRING	NA	Validation dataset in VCF/BCF format
-E [--estimation ]	STRING	NA	Phased dataset in VCF/BCF format
-F [--frequency ]	STRING	NA	Variant frequency in VCF/BCF format, to exaclude variants and/or stratify SER by MAC
-P [--pedigree ]	STRING	NA	Pedigree information (offspring father mother)
-R [--region ]	STRING	NA	Target region
--nbins	INT	20	Number of bins used for calibration (for PP field)
--min-pp	FLOAT	0	Minimal PP value for entering computations
--singleton	STRING	NA	Singleton phase
--dupid	STRING	NA	Duplicate ID for UKB matching IDs

Output files

Option name	Argument	Default	Description
-O [--output ]	STRING	NA	Phased haplotypes in VCF/BCF format
--log	STRING	NA	Log file