See ../README for high-level documentation of the entire EIGENSOFT package. This file contains documentation of EIGENSTRAT programs: smartpca.perl: run PCA on input genotype data (calls smartpca) smarteigenstrat.perl: run EIGENSTRAT stratification correction. This program supports all 5 file formats, and supports quantitative phenotypes. gc.perl: apply Genomic Control (Devlin and Roeder, 1999) to the association statistics computed by EIGENSTRAT. We note that the programs eigenstrat and eigenstratQTL of EIGENSOFT version 2.0 have been replaced by smarteigenstrat.perl. However, we have retained the old programs for backwards compatibility (see below). See ./example.perl and ./exampleQTL.perl for toy examples using our programs. Please note that answers to the following questions can be found at http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm -- How do I compute principal components using only a subset of populations and project other populations onto those principal components? -- Should regions of long-range LD in the genome be removed prior to PCA? -- How do I compute principal components using only a subset of SNPs but then run EIGENSTRAT using the full set of SNPs? ------------------------------------------------------------------------ DOCUMENTATION of smartpca.perl program: This program calls the smartpca program (see ../POPGEN/README). For this to work, the bin directory containing smartpca MUST be in your path. See ./example.perl for a toy example. ../bin/smartpca.perl -i example.geno : genotype file in any format (see ../CONVERTF/README) -a example.snp : snp file in any format (see ../CONVERTF/README) -b example.ind : indiv file in any format (see ../CONVERTF/README) -k k : (Default is 10) number of principal components to output -o example.pca : output file of principal components. Individuals removed as outliers will have all values set to 0.0 in this file. -p example.plot : prefix of output plot files of top 2 principal components. (labeling individuals according to labels in indiv file) -e example.eval : output file of all eigenvalues -l example.log : output logfile -m maxiter : (Default is 5) maximum number of outlier removal iterations. To turn off outlier removal, set -m 0. -t topk : (Default is 10) number of principal components along which to remove outliers during each outlier removal iteration. -s sigma : (Default is 6.0) number of standard deviations which an individual must exceed, along one of topk top principal components, in order to be removed as an outlier. OPTIONAL FLAGS: -w poplist : compute eigenvectors using populations in poplist only, where poplist is an ASCII file with one population per line -y plotlist : output plot will include populations in plotlist only, where plotlist is an ASCII file with one population per line -z badsnpname : list of SNPs which should be excluded from the analysis -q YES/NO : If set to YES, assume that there is a single population and the population field contains real-valued phenotypes. (Corresponds to qtmode parameter in smartpca program.) The default value for this parameter is NO. Estimated running time of the smartpca program is 2.5e-12 * nSNP * NSAMPLES^2 hours if not removing outliers. 2.5e-12 * nSNP * NSAMPLES^2 hours * (1+m) if m outlier removal iterations. Thus, under the default of up to 5 outlier removal iterations, running time is up to 1.5e-11 * nSNP * NSAMPLES^2 hours. Recommendation: we advise after running pca to check for large correlations between each principal component and all variables of interest. For example, large correlations with % missing data (per sample) could imply assay issues large correlations with plate membership could imply assay issues large correlations with phenotype indicate highly mismatched cases vs. controls which will lead to a loss of power upon applying eigenstrat correction. If input indiv file contains Case and Control labels only, then correlations between each principal component and Case/Control status will be listed at end of output logfile (-l flag). ------------------------------------------------------------------------ DOCUMENTATION of smarteigenstrat.perl program: [run smartpca.perl program first] This program is a PERL wrapper which calls the C program smarteigenstrat. Note: the bin directory containing smarteigenstrat MUST be in your path. See ./example.perl for a toy example. We recommend smarteigenstrat.perl for users who prefer command-line flags. However, users who prefer parameter files can run smarteigenstrat instead. ../bin/smarteigenstrat.perl -i example.geno : genotype file in any format (see ../CONVERTF/README) -a example.snp : snp file in any format (see ../CONVERTF/README) -b example.ind : individual file in any format (see ../CONVERTF/README). We note that phenotype information will be contained in example.ind, either as Case/Control labels or quantitative phenotypes if -q set to YES. -q YES/NO : If set to YES, use quantitative phenotypes in example.ind. If -q is set to YES, the third column of the input individual file in EIGENSTRAT format (or sixth column of input individual file in PED format) should be real numbers. The value -100.0 signifies "missing data". If -q is set to NO, these values should be "Case" or "Control". The default value for the -q parameter is NO. -p example.pca : input file of principal components (output of smartpca.perl) -k 1 : (Default is 10) number of principal components along which to correct for stratification. Note that l must be less than or equal to the number of principal components reported in the file example.pca. -o example.chisq : chisq association statistics. File contains log of flags to eigenstrat program, followed by one line per SNP: The first entry of each line is Armitage chisq statistic (Armitage, 1955) defined as NSAMPLES x (correlation between genotype and phenotype)^2. If the set of individuals with genotype and phenotype both valid is monomorphic for either genotype or phenotype, then NA is reported. The second entry of each line is the EIGENSTRAT chisq statistic, defined as (NSAMPLES-l-1) x (corr between adjusted_genotype and adjusted_phenotype)^2. If the set of individuals with genotype and phenotype both valid is monomorphic for either genotype or phenotype, then NA is reported. Note: even if l=0, there is a tiny difference between the two statistics because Armitage uses NSAMPLES while we use NSAMPLES-1, which we consider to be appropriate. -l example.log : standard output file The running time of smarteigenstrat.perl is very fast compared to the running time of smartpca.perl. ------------------------------------------------------------------------ DOCUMENTATION of smarteigenstrat program: Users who prefer parameter files to command-line flags can run the C program smarteigenstrat instead of the PERL wrapper smarteigenstrat.perl. The syntax of smarteigenstrat is "../bin/smarteigenstrat -p parfile" DESCRIPTION OF EACH PARAMETER in parameter file for smarteigenstrat: genotypename: input genotype file snpname: input snp file indivname: input indiv file pcaname: input pca file outputname: name of output file of chisq association statistics numeigs: number of principal components to correct for qtmode: YES for quantitative phenotype, NO (default) otherwise For details, see documentation of smarteigenstrat.perl above. OPTIONAL PARAMETERS: hashcheck: If set to YES and the input genotype file is in PACKEDANCESTRYMAP format, check the hash stored inside the file to make sure that individual and SNP files have not changed since the file was made. If they have, then exit in error. The default value for this parameter is YES. ------------------------------------------------------------------------ DOCUMENTATION of gc.perl: [run smartpca.perl & smarteigenstrat.perl first] ../bin/gc.perl infile outfile infile is input file of chisq statistics produced by eigenstrat program. It contains both uncorrected and EIGENSTRAT statistics for each SNP. outfile is output file. It lists lambda inflation values (for both uncorrected and EIGENSTRAT) chisq statistics after scaling by lambda (uncorrected and EIGENSTRAT) Computation of lambda is as described in Devlin and Roeder 1999. A lambda above 1 indicates inflation in chisq statistics. By definition, lambda is not allowed to be less than 1. Running time of the gc.perl program is very fast. ------------------------------------------------------------------------ STATISTICAL SIGNFICANCE of each principal component: twstats program See ../POPGEN/README for documentation of twstats program to compute Tracy-Widom statistics (statistical significance of each principal component). ------------------------------------------------------------------------ BACKWARDS COMPATIBILITY with EIGENSTRAT version 2.0: eigenstrat program For backwards compatibility with EIGENSTRAT version 2.0, we have also included our old program eigenstrat (for Case/Control phenotypes), as well as our old program eigenstratQTL (for quantitative phenotypes) from that release, which have now been replaced by our new program smarteigenstratQTL.perl. See ./example.oldstyle.perl for an example involving the eigenstrat program. Most users will want to use our new program smarteigenstrat.perl, which has added functionality. However, users wishing to understand or modify our source code may find it advantageous to instead work with the simpler eigenstrat programs. ------------------------------------------------------------------------ BACKWARDS COMPATIBILITY with 07/23/06 EIGENSTRAT release: pca program For backwards compatibility with the 07/23/06 EIGENSTRAT release, we have also included our old program pca used in that release, which has now been replaced by our new program smartpca.perl. See ./example.oldstyle.perl for an example involving the pca program. Most users will want to use our new program smartpca.perl, which calls the smartpca program and has added functionality. However, users wishing to understand or modify our source code may find it advantageous to instead work with the simpler pca program. ------------------------------------------------------------------------ TREATMENT OF X CHROMOSOME We recommend excluding X chr genotypes when computing principal components, as the different variance for males complicates proper theoretical treatment. Excluding X chr genotypes is the smartpca default (see ../POPGEN/README). We comment that including or excluding X chromosome genotypes has little effect on top principal components in dense genome-scan data sets we have analyzed. When computing association statistics using the eigenstrat program, the default is to include X chr genotypes. Assuming the usual data convention of reporting male X chr genotypes as homozygotes (instead of hemizygotes), this corresponds to a model in which male hemizygotes have the same disease risk as female homozygotes. Investigators should consider carefully whether this is their desired X chr risk model. (When correcting for stratification, we believe it is reasonable to use principal components inferred using autosomal data, as autosomal ancestry and X chr ancestry are likely to be very strongly correlated, even if percentages are different in absolute terms). ------------------------------------------------------------------------ RUNNING EIGENSTRAT WITH LESS THAN 100,000 MARKERS The EIGENSTRAT method is primarily aimed at genome-scan data sets with at least 100,000 markers. If running EIGENSTRAT on much smaller data sets, the inclusion of a candidate marker in the set of markers used to infer principal components may lead to a loss in power (Price et al. 2006 Supp Note). Thus, if running on much smaller data sets, it is necessary to exclude a candidate marker from the set of markers used to infer principal components used to correct for stratification at the candidate marker. In the case of a data set which uses ancestry-informative markers to infer ancestry, a good way to do this is to run smartpca.perl to infer principal components *only* using ancestry-informative markers, excluding the candidate markers (and excluding any ancestry-informative marker in LD with a candidate marker), and then run eigenstrat on candidate markers. ------------------------------------------------------------------------ Questions? See http://www.hsph.harvard.edu/faculty/alkes-price/files/eigensoftfaq.htm or email Samuela Pollack, spollack@hsph.harvard.edu SOFTWARE COPYRIGHT NOTICE AGREEMENT This software and its documentation are copyright (2010) by Harvard University and The Broad Institute. All rights are reserved. This software is supplied without any warranty or guaranteed support whatsoever. Neither Harvard University nor The Broad Institute can be responsible for its use, misuse, or functionality. The software may be freely copied for non-commercial purposes, provided this copyright notice is retained.