{: .no_toc .text-center }
{: .no_toc .text-delta }
Program to compute switch error rate (SER) and genotyping error rate (GER) given validation data. Validation data, in which haplotypes are known, can be obtained in multiple ways: - Simulated data, using msprime for instance. - Trio/duo data. Estimate haplotypes for offspring excluding the parents from the dataset. Use the parent and the offsprings in the validation set. - Chromosome X data. Pair male chromosomes X to make females in which phase in known.
First, run a phasing run:
phase_common --input array/target.unrelated.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8
array/target.unrelated.bcf
is phased simulated
data, so we can also use it as validation set. Open the BCF file to
check:
bcftools view -H array/target.unrelated.bcf | cut -f1-20 | head
Then, run thw switch program to compare the true haplotypes to those estimated by phase_common:
switch --validation array/target.unrelated.bcf --estimation target.phased.bcf --region 1 --output target.phased --thread 8
This command will produce a bunch of files: -
target.phased.block.switch.txt.gz
has 2 columns (sample ID,
position). This gives the coordinates of blocks of data coorectly
phased. Can be used to produced a visual representation of the phasing .
See supplemetary figure 1 of SHAPEIT4
paper. - target.phased.frequency.switch.txt.gz
has 4
columns (MAC, #errors, #hets, SER). This gives SER stratified by MAC. -
target.phased.sample.switch.txt.gz
has 4 columns (sample
ID, #errors, #hets, SER). This gives the SER per sample. This is the
most important information given by the switch program. -
target.phased.sample.typing.txt.gz
has 4 columns (sample
ID, #errors, #genotypes, GER). This gives GER per sample. -
target.phased.variant.switch.txt.gz
has 5 columns (rsid,
position, #errors, #hets, SER). This gives the SER per variant relative
to previous hets. - target.phased.variant.typing.txt.gz
has
5 columns (rsid, position, #errors, #genotypes, GER). This gives the GER
per variant.
zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'
cat info/target.family.fam | cut -f2- | tr "\t" "\n" > parents.txt
bcftools view -Ob -o benchmark.data.bcf -S ^parents.txt array/target.family.bcf
bcftools index benchmark.data.bcf
phase_common --input benchmark.data.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8
switch --validation array/target.family.bcf --estimation target.phased.bcf --pedigree info/target.family.fam --region 1 --output target.phased --thread 8
zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'
To come!
Option name | Argument | Default | Description |
---|---|---|---|
--help | NA | NA | Produces help message |
-T [ --thread ] | INT | 1 | Number of thread used |
Option name | Argument | Default | Description |
---|---|---|---|
-V [--validation ] | STRING | NA | Validation dataset in VCF/BCF format |
-E [--estimation ] | STRING | NA | Phased dataset in VCF/BCF format |
-F [--frequency ] | STRING | NA | Variant frequency in VCF/BCF format, to exaclude variants and/or stratify SER by MAC |
-P [--pedigree ] | STRING | NA | Pedigree information (offspring father mother) |
-R [--region ] | STRING | NA | Target region |
--nbins | INT | 20 | Number of bins used for calibration (for PP field) |
--min-pp | FLOAT | 0 | Minimal PP value for entering computations |
--singleton | STRING | NA | Singleton phase |
--dupid | STRING | NA | Duplicate ID for UKB matching IDs |
Option name | Argument | Default | Description |
---|---|---|---|
-O [--output ] | STRING | NA | Phased haplotypes in VCF/BCF format |
--log | STRING | NA | Log file |