Before going into the details, note that while single-sample QC is an essential component of any genotyping analysis, it is not the only type of sample QC that should be applied. At best the single-sample QC metric will be well correlated with final genotyping performance, but the correlation will never be perfect: some poor samples will always be accepted and some good samples rejected. The former scenario is often of greater concern to the user, so to improve the chance of identifying poor-quality samples the user should also look carefully at sample attributes after performing genotype clustering. In particular, samples that display outlier values for the clustering call rate and/or heterozygosity may be poor performers that slipped past the single-sample QC metric and may need to be excluded. There are additional checks that apply only to certain kinds of arrays. For example, if a mix of restriction enzymes is used in the target prep (such as Nsp and Sty), it would be prudent to compare performance on the Nsp-only and Sty-only SNPs to look for any samples with unusual ratios of performance (which might indicate a failure in one enzyme prep and not the other).
The currently supported method for single-chip QC analysis is to run the Dynamic Model (DM) algorithm (Di et al., 2005). The DM method requires both perfect-match (PM) and mis-match (MM) probes, so on some PM-only chip designs a special subset of SNPs is tiled with both PM and MM probes specifically for the purpose of single-chip QC with the DM algorithm.
apt-geno-qc \
--cdf-file GenomeWideSNP_6.cdf \
--qcc-file GenomeWideSNP_6.qcc \
--qca-file GenomeWideSNP_6.qca \
--cel-files cel.txt \
--out-file qc.txt
The CDF file identifies which probes belong to which SNPs and is supplied by Affymetrix for all catalog and custom WGSA chip designs. The other input & output files are explained below.
As an example, here are the first ten lines from the QCC file released with the Genome Wide SNP Array 6.0:
#%format_version=1.0
#%content_version=1.0
#%primary_key=probeset_name
group_name         probeset_name
all_qc nsp_qc      AFFX-SNP_10000979
all_qc nsp_qc      AFFX-SNP_10009702
all_qc sty_qc      AFFX-SNP_10015773
all_qc sty_qc      AFFX-SNP_10021569
all_qc nsp_sty_qc  AFFX-SNP_10026879
all_qc sty_qc      AFFX-SNP_10029725
For example, the first data line (the line after the column headers) indicates that the SNP named AFFX-SNP_10000979 is a member of two groups, one named all_qc and one named nsp_qc. The QCA file (see next section) describes how each group of SNPs should be analyzed.
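The group membership described above can also be read programmatically. The following is a minimal Python sketch, not part of APT. It assumes that the two columns are tab-separated and that multiple group names inside the group_name field are separated by whitespace, as the example suggests; the function name read_qcc_groups is hypothetical.

```python
from collections import defaultdict
import io


def read_qcc_groups(handle):
    """Map each group name to the list of probesets assigned to it.

    Assumptions (not confirmed by the APT documentation): columns are
    tab-separated, and the group_name field holds one or more group
    names separated by whitespace.
    """
    groups = defaultdict(list)
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#%' metadata lines
        fields = line.split("\t")
        if header is None:
            header = fields  # column names: group_name, probeset_name
            continue
        row = dict(zip(header, fields))
        for group in row["group_name"].split():
            groups[group].append(row["probeset_name"])
    return dict(groups)


# Small excerpt modeled on the QCC example above.
example = io.StringIO(
    "#%format_version=1.0\n"
    "#%content_version=1.0\n"
    "#%primary_key=probeset_name\n"
    "group_name\tprobeset_name\n"
    "all_qc nsp_qc\tAFFX-SNP_10000979\n"
    "all_qc sty_qc\tAFFX-SNP_10015773\n"
)
qcc_groups = read_qcc_groups(example)
```

With this excerpt, all_qc contains both SNPs while nsp_qc and sty_qc contain one each.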
As an example, here are the contents of the QCA file for the Genome Wide SNP Array 6.0 (with some minor edits for clarity).
#%format_version=1.0
#%content_version=1.0
analysis_name                 group_name  analysis  options
# Standard analyses
QC_call_rate_all              all_qc      dm        dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
QC_call_rate_Nsp              nsp_qc      dm        dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
QC_call_rate_Sty              sty_qc      dm        dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
QC_call_rate_Nsp_Sty_overlap  nsp_sty_qc  dm        dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
Gender                        chrX        gender    em.cutoff=0.5,em.thresh=0.05,gender.cutoff=0.1,as.text=1
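The options column packs several key=value settings into one comma-separated string. As an illustration only, the Python sketch below splits such a string into a dictionary. The helper name parse_qca_options is hypothetical, and the numeric conversion is a convenience of this sketch, not something APT is documented to do.

```python
def parse_qca_options(options):
    """Split a 'key=value,key=value' string into a dict.

    Values that look numeric are converted to int or float; anything
    else is kept as a string.
    """
    parsed = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        try:
            parsed[key] = float(value) if "." in value else int(value)
        except ValueError:
            parsed[key] = value  # non-numeric setting, keep the raw text
    return parsed


opts = parse_qca_options(
    "dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4"
)
```

Here opts["dm.cutoff"] holds the value 0.33 given in the QCA example, which makes it easy to record the exact settings alongside downstream results.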
cel_files
cel/GenomeWideSNP_6/hapmap/NA06985_GW6_C.CEL
cel/GenomeWideSNP_6/hapmap/NA06991_GW6_C.CEL
cel/GenomeWideSNP_6/hapmap/NA06993_GW6_C.CEL
cel/GenomeWideSNP_6/hapmap/NA06994_GW6_C.CEL
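A list file of this shape (a cel_files header followed by one path per line) is easy to generate rather than type by hand. The Python sketch below is one way to do so; the function names and the idea of globbing a directory for *.CEL files are assumptions for illustration, not part of APT.

```python
from pathlib import Path


def format_cel_list(cel_paths):
    """Return the text of a CEL list file: header line, then one path per line."""
    return "\n".join(["cel_files", *cel_paths]) + "\n"


def write_cel_list(cel_dir, out_path):
    """Write a list of every *.CEL file found under cel_dir (hypothetical helper)."""
    cel_paths = sorted(str(p) for p in Path(cel_dir).rglob("*.CEL"))
    Path(out_path).write_text(format_cel_list(cel_paths))
    return cel_paths


cel_text = format_cel_list([
    "cel/GenomeWideSNP_6/hapmap/NA06985_GW6_C.CEL",
    "cel/GenomeWideSNP_6/hapmap/NA06991_GW6_C.CEL",
])
```

Calling write_cel_list("cel", "cel.txt") would then produce a file suitable for the --cel-files option shown earlier.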
An example from the Genome Wide SNP Array 6.0 is provided below.
#%format_version=1
#%content_version=1
#%default_analysis_name=QC call rate (all)
#%library file=lib/GenomeWideSNP_6/GenomeWideSNP_6.cdf
#gender information: male=0,female=1,unknown=-1
#method1:analysis_name=QC call rate (all);analysis=dm;group_name=all_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method2:analysis_name=QC call rate (Nsp);analysis=dm;group_name=nsp_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method3:analysis_name=QC call rate (Sty);analysis=dm;group_name=sty_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method4:analysis_name=QC call rate (Nsp/Sty overlap);analysis=dm;group_name=nsp_sty_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method5:analysis_name=Gender;analysis=gender;group_name=chrX;as.text=1;em.cutoff=0.5;em.thresh=0.05;gender.cutoff=0.1
Chip               Gender  QC call rate (Nsp)  QC call rate (Nsp/Sty overlap)  QC call rate (Sty)  QC call rate (all)
NA06985_GW6_C.CEL  female  98.07               98.46                           96.14               97.88
NA06991_GW6_C.CEL  female  98.2                98.77                           96.78               98.21
NA06993_GW6_C.CEL  male    94.6                95.75                           88.57               93.98
NA06994_GW6_C.CEL  male    99.1                98.52                           96.62               98.28
NA07000_GW6_C.CEL  female  97.69               98.52                           95.81               97.75
NA07019_GW6_C.CEL  female  97.3                98.58                           94.36               97.39
NA07022_GW6_C.CEL  male    97.04               96.61                           92.59               95.9
NA07029_GW6_C.CEL  male    96.02               96.98                           93.4                96
NA07034_GW6_C.CEL  male    96.4                97.54                           93.56               96.43
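A report like the one above can be screened with a few lines of code. The Python sketch below reads the table and lists samples whose overall QC call rate falls below a cutoff. It assumes the columns are tab-separated; the function names are hypothetical, and the cutoff of 95.0 is only a placeholder for illustration, not a recommended threshold.

```python
import csv
import io


def read_qc_report(handle):
    """Read the QC report into a list of dicts, skipping '#' header lines.

    Assumes tab-separated columns (an assumption of this sketch).
    """
    data_lines = (
        line for line in handle
        if line.strip() and not line.startswith("#")
    )
    return list(csv.DictReader(data_lines, delimiter="\t"))


def flag_low_qc(rows, column="QC call rate (all)", threshold=95.0):
    """Return the chips whose value in `column` is below `threshold`."""
    return [row["Chip"] for row in rows if float(row[column]) < threshold]


# Three rows taken from the example report above.
report = io.StringIO(
    "#%format_version=1\n"
    "#%default_analysis_name=QC call rate (all)\n"
    "Chip\tGender\tQC call rate (Nsp)\tQC call rate (Nsp/Sty overlap)\t"
    "QC call rate (Sty)\tQC call rate (all)\n"
    "NA06985_GW6_C.CEL\tfemale\t98.07\t98.46\t96.14\t97.88\n"
    "NA06993_GW6_C.CEL\tmale\t94.6\t95.75\t88.57\t93.98\n"
    "NA06994_GW6_C.CEL\tmale\t99.1\t98.52\t96.62\t98.28\n"
)
rows = read_qc_report(report)
flagged = flag_low_qc(rows)
```

With the placeholder cutoff, only NA06993_GW6_C.CEL (93.98) is flagged in this excerpt.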
After running a few dozen or more distinct samples it should be possible to assess which of the SNPs tiled with MM probes appear to be working well with the assay, at which point the set of SNPs used for deriving the DM call rate should be trimmed back to just those that appear to be working well. This will generally improve the correlation between the DM call rate and the genotyping performance, and (less importantly) will result in higher DM call rates.
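The text does not prescribe a criterion for "working well". One simple possibility, shown here purely as an assumption, is to keep the SNPs whose per-SNP call rate across the samples run so far meets a minimum. The Python sketch below uses a hypothetical input (a mapping from SNP name to a list of per-sample calls, with None for a no-call) and an arbitrary placeholder cutoff.

```python
def select_working_snps(calls_by_snp, min_call_rate=0.9):
    """Return SNPs whose fraction of non-missing calls is at least min_call_rate.

    calls_by_snp: {snp_name: [call_or_None, ...]}, one entry per sample.
    The criterion and cutoff are assumptions of this sketch.
    """
    keep = []
    for snp, calls in calls_by_snp.items():
        if not calls:
            continue  # no samples observed for this SNP
        called = sum(1 for call in calls if call is not None)
        if called / len(calls) >= min_call_rate:
            keep.append(snp)
    return keep


# Toy data for illustration only; not real DM calls.
toy_calls = {
    "AFFX-SNP_10000979": ["AA", "AB", "BB", "AA"],
    "AFFX-SNP_10009702": ["AA", None, None, "AB"],
}
working = select_working_snps(toy_calls, min_call_rate=0.75)
```

The retained SNP names could then be used to build a trimmed QCC file for subsequent runs.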
In multiple-enzyme applications of WGSA (for example, the simultaneous hybridization of Nsp- and Sty-generated target), it may also be useful to define enzyme-specific SNP subsets so that performance can be monitored separately for each enzyme in addition to overall performance.
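Once per-enzyme call rates are available, samples with an unusual balance between the two enzymes can be picked out by comparing each sample's Nsp/Sty ratio with the batch median. The Python sketch below is one hedged way to do this; the function name and the allowed deviation of 0.03 are assumptions for illustration, and the call rates are taken from the example report for the Genome Wide SNP Array 6.0.

```python
from statistics import median


def enzyme_ratio_outliers(call_rates, max_deviation=0.03):
    """Return samples whose Nsp/Sty call-rate ratio is far from the batch median.

    call_rates: {sample: (nsp_call_rate, sty_call_rate)}
    max_deviation is an arbitrary placeholder, not a recommended value.
    """
    ratios = {s: nsp / sty for s, (nsp, sty) in call_rates.items()}
    center = median(ratios.values())
    return sorted(s for s, r in ratios.items() if abs(r - center) > max_deviation)


# (Nsp, Sty) QC call rates from five samples in the example report.
rates = {
    "NA06985_GW6_C.CEL": (98.07, 96.14),
    "NA06991_GW6_C.CEL": (98.2, 96.78),
    "NA06993_GW6_C.CEL": (94.6, 88.57),
    "NA06994_GW6_C.CEL": (99.1, 96.62),
    "NA07000_GW6_C.CEL": (97.69, 95.81),
}
outliers = enzyme_ratio_outliers(rates)
```

In this excerpt NA06993_GW6_C.CEL stands out because its Sty call rate is much lower than its Nsp call rate relative to the other samples.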
Once the set of SNPs to be used for computing QC has been finalized the user may wish to come up with a threshold or a range of QC values for ruling experiments in or out of subsequent clustering analysis. Determination of such a threshold or range is somewhat subjective. Studying the relationship between QC call rate and clustering call rate for a large batch of samples can be helpful in this regard. Additionally, where possible the relationship between QC call rate and accuracy should be studied. Useful proxies for accuracy include evaluation of reproducibility, Mendelian consistency for related samples, and concordance to some other gold standard where available.
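As an illustration of one such proxy, the Python sketch below computes the concordance between two sets of genotype calls, for example replicate hybridizations of the same sample or a comparison against a reference call set. The function name, the input layout (a mapping from SNP name to genotype call, with None for a no-call), and the toy data are all assumptions of this sketch.

```python
def concordance(calls_a, calls_b, no_call=None):
    """Fraction of SNPs called in both sets that have the same genotype.

    Returns None when no SNP is called in both sets.
    """
    shared = [
        snp for snp in calls_a
        if snp in calls_b
        and calls_a[snp] != no_call
        and calls_b[snp] != no_call
    ]
    if not shared:
        return None
    matches = sum(1 for snp in shared if calls_a[snp] == calls_b[snp])
    return matches / len(shared)


# Toy replicate calls for illustration only.
replicate_1 = {"SNP_A": "AA", "SNP_B": "AB", "SNP_C": "BB", "SNP_D": None}
replicate_2 = {"SNP_A": "AA", "SNP_B": "BB", "SNP_C": "BB", "SNP_D": "AB"}
rate = concordance(replicate_1, replicate_2)
```

Plotting such concordance values against the QC call rate for a batch of samples is one way to see where a candidate threshold starts to separate reliable from unreliable experiments.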
Affymetrix Power Tools (APT) Release apt-1.10.1