switch

{: .no_toc .text-center }

Table of contents

{: .no_toc .text-delta }

  1. TOC {:toc}

Description

Program to compute switch error rate (SER) and genotyping error rate (GER) given validation data. Validation data, in which haplotypes are known, can be obtained in multiple ways: - Simulated data, using msprime for instance. - Trio/duo data. Estimate haplotypes for offspring excluding the parents from the dataset. Use the parent and the offsprings in the validation set. - Chromosome X data. Pair male chromosomes X to make females in which phase in known.

Usage1: Benchmark haplotypes using simulated data

First, run a phasing run:

phase_common --input array/target.unrelated.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8
The data in array/target.unrelated.bcf is phased simulated data, so we can also use it as validation set. Open the BCF file to check:
bcftools view -H array/target.unrelated.bcf | cut -f1-20 | head

Then, run thw switch program to compare the true haplotypes to those estimated by phase_common:

switch --validation array/target.unrelated.bcf --estimation target.phased.bcf --region 1 --output target.phased --thread 8

This command will produce a bunch of files: - target.phased.block.switch.txt.gzhas 2 columns (sample ID, position). This gives the coordinates of blocks of data coorectly phased. Can be used to produced a visual representation of the phasing . See supplemetary figure 1 of SHAPEIT4 paper. - target.phased.frequency.switch.txt.gz has 4 columns (MAC, #errors, #hets, SER). This gives SER stratified by MAC. - target.phased.sample.switch.txt.gz has 4 columns (sample ID, #errors, #hets, SER). This gives the SER per sample. This is the most important information given by the switch program. - target.phased.sample.typing.txt.gz has 4 columns (sample ID, #errors, #genotypes, GER). This gives GER per sample. - target.phased.variant.switch.txt.gz has 5 columns (rsid, position, #errors, #hets, SER). This gives the SER per variant relative to previous hets. - target.phased.variant.typing.txt.gz has 5 columns (rsid, position, #errors, #genotypes, GER). This gives the GER per variant.

To compute the SER across the entire dataset, we recommend to sum the numbers of errors and hets across all samples first:
zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'

Usage2: Benchmark haplotypes using family data

First, run build a benchmark dataset by removing parental genomes:
cat info/target.family.fam | cut -f2- | tr "\t" "\n" > parents.txt
bcftools view -Ob -o benchmark.data.bcf -S ^parents.txt array/target.family.bcf
bcftools index benchmark.data.bcf
Second, phase the benchmark dataset:
phase_common --input benchmark.data.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8
Third, validate it using family data (original BCF + FAM file):
switch --validation array/target.family.bcf --estimation target.phased.bcf --pedigree info/target.family.fam --region 1 --output target.phased --thread 8
Fourth, compute SER:
zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'

Usage3: Benchmark haplotypes using haploid chromosome X data

To come!


Command line options

Basic options

Option name Argument Default Description
--help NA NA Produces help message
-T [ --thread ] INT 1 Number of thread used

Input files

Option name Argument Default Description
-V [--validation ] STRING NA Validation dataset in VCF/BCF format
-E [--estimation ] STRING NA Phased dataset in VCF/BCF format
-F [--frequency ] STRING NA Variant frequency in VCF/BCF format, to exaclude variants and/or stratify SER by MAC
-P [--pedigree ] STRING NA Pedigree information (offspring father mother)
-R [--region ] STRING NA Target region
--nbins INT 20 Number of bins used for calibration (for PP field)
--min-pp FLOAT 0 Minimal PP value for entering computations
--singleton STRING NA Singleton phase
--dupid STRING NA Duplicate ID for UKB matching IDs

Output files

Option name Argument Default Description
-O [--output ] STRING NA Phased haplotypes in VCF/BCF format
--log STRING NA Log file