VIGNETTES: Single-Sample QC Analysis for the Whole Genome Sampling Assay

Date:
2007-11-20

Introduction

Sample QC is an important part of genotyping analysis. Including a substantial proportion of low-quality samples in a genotype clustering run can degrade the results for the high-quality samples in the same run, so it is particularly useful to perform single-sample QC analysis before clustering. Single-sample QC also fits well with most laboratory workflows: low-quality experiments can be identified at the time of processing, when rehybridization or reprocessing is most practical. This vignette provides guidelines for using the application apt-geno-qc to develop a single-sample QC procedure for custom WGSA arrays.

Before going into the details, note that while single-sample QC is an essential component of any genotyping analysis, it is not the only type of sample QC that should be applied. At best, the single-sample QC metric will be well correlated with final genotyping performance, but the correlation will never be perfect: some poor samples will always be accepted and some good samples rejected. The former scenario is often of greater concern, so to improve the chance of identifying poor-quality samples the user should also look carefully at sample attributes after performing genotype clustering. In particular, samples that display outlier values for the clustering call rate and/or heterozygosity may be poor performers that slipped past the single-sample QC metric and may need to be excluded. There are additional checks that apply only to certain kinds of arrays. For example, if a mix of restriction enzymes is used in the target prep (such as Nsp and Sty), it would be prudent to compare performance on the Nsp-only and Sty-only SNPs and look for samples with unusual ratios (which might indicate a failure in one enzyme prep and not the other).
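
One possible way to screen a post-clustering metric for outlier samples is sketched below. It is an illustration only, not part of APT: the function name, the median/MAD rule, the cutoff of three scaled MADs, and the sample values are all assumptions made for this example.

```python
from statistics import median

def flag_outliers(values, n_mads=3.0):
    """Return sample names whose value is more than n_mads scaled MADs from the median.

    values: dict mapping sample name -> metric (e.g. clustering call rate).
    The 1.4826 factor scales the MAD to be comparable to a standard deviation
    for normally distributed data. n_mads=3.0 is an arbitrary illustrative choice.
    """
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return []  # no spread at all, so nothing can be called an outlier
    scale = 1.4826 * mad
    return sorted(name for name, v in values.items() if abs(v - med) > n_mads * scale)

# Hypothetical clustering call rates (percent), invented for illustration
call_rates = {"s1": 99.1, "s2": 98.9, "s3": 99.3, "s4": 99.0, "s5": 94.2, "s6": 99.2}
print(flag_outliers(call_rates))
```

The same function could be applied to heterozygosity values. Flagged samples are candidates for review, not automatic exclusions.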

The currently supported method for single-chip QC analysis is to run the Dynamic Model (DM) algorithm (Di et al., 2005). The DM method requires both perfect-match (PM) and mismatch (MM) probes, so on some PM-only chip designs a special subset of SNPs is tiled with both PM and MM probes specifically for single-chip QC with the DM algorithm.

Running a single-chip QC analysis

We start with an example of how to run a QC analysis on the catalog Genome Wide SNP Array 6.0:

  apt-geno-qc \
    --cdf-file  GenomeWideSNP_6.cdf \
    --qcc-file  GenomeWideSNP_6.qcc \
    --qca-file  GenomeWideSNP_6.qca \
    --cel-files cel.txt \
    --out-file  qc.txt

The CDF file identifies which probes belong to which SNPs and is supplied by Affymetrix for all catalog and custom WGSA chip designs. The other input and output files are explained below.
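
When the QC run is driven from a script, the same command line can be assembled programmatically. The sketch below uses only the options shown in the example above; the helper names build_geno_qc_command and run_geno_qc are hypothetical, and the file names are those of the example.

```python
import shutil
import subprocess

def build_geno_qc_command(cdf_file, qcc_file, qca_file, cel_list, out_file):
    """Assemble the apt-geno-qc argument list using the options from the example."""
    return [
        "apt-geno-qc",
        "--cdf-file", cdf_file,
        "--qcc-file", qcc_file,
        "--qca-file", qca_file,
        "--cel-files", cel_list,
        "--out-file", out_file,
    ]

def run_geno_qc(command):
    """Run the command, failing early if apt-geno-qc is not on the PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(command[0] + " was not found on the PATH")
    subprocess.run(command, check=True)

cmd = build_geno_qc_command(
    "GenomeWideSNP_6.cdf", "GenomeWideSNP_6.qcc", "GenomeWideSNP_6.qca",
    "cel.txt", "qc.txt",
)
print(" ".join(cmd))
# run_geno_qc(cmd)  # call this where APT and the library files are installed
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.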

The QCC file: defining sets of SNPs on which to perform single-chip analysis

The QCC (Quality Control Classes) file specifies classes of SNPs to be analyzed for QC purposes. The file format is a specialization of TSV. There are two required columns, group_name and probeset_name, both shown in the example below. As with the TSV format in general, the order of the columns and the possible presence of additional columns do not matter.

As an example, here are the first ten lines from the QCC file released with the Genome Wide SNP Array 6.0:

#%format_version=1.0
#%content_version=1.0
#%primary_key=probeset_name
group_name      probeset_name
all_qc nsp_qc   AFFX-SNP_10000979
all_qc nsp_qc   AFFX-SNP_10009702
all_qc sty_qc   AFFX-SNP_10015773
all_qc sty_qc   AFFX-SNP_10021569
all_qc nsp_sty_qc       AFFX-SNP_10026879
all_qc sty_qc   AFFX-SNP_10029725

For example, the first data row indicates that the SNP named AFFX-SNP_10000979 is a member of two groups, one named all_qc and one named nsp_qc. The QCA file (see next section) describes how each group of SNPs should be analyzed.
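
To inspect which SNPs fall into each group, a QCC file can be read with a few lines of Python. This is a minimal sketch, not APT code. It assumes that columns are tab-separated, that lines beginning with "#" can be skipped, and that multiple group names within the group_name field are separated by spaces, as the example above suggests. The function name read_qcc is hypothetical.

```python
import csv
import io
from collections import defaultdict

def read_qcc(handle):
    """Return a dict mapping each group name to the set of probeset names in it.

    Columns are located by header name, so column order and extra columns
    do not matter.
    """
    rows = (line for line in handle if not line.startswith("#"))
    groups = defaultdict(set)
    for row in csv.DictReader(rows, delimiter="\t"):
        # Assumption: the group_name field is a space-separated list of groups
        for group in row["group_name"].split():
            groups[group].add(row["probeset_name"])
    return dict(groups)

# A few rows taken from the example above, with explicit tab separators
sample = (
    "#%format_version=1.0\n"
    "#%primary_key=probeset_name\n"
    "group_name\tprobeset_name\n"
    "all_qc nsp_qc\tAFFX-SNP_10000979\n"
    "all_qc sty_qc\tAFFX-SNP_10015773\n"
    "all_qc nsp_sty_qc\tAFFX-SNP_10026879\n"
)
groups = read_qcc(io.StringIO(sample))
for name in sorted(groups):
    print(name, len(groups[name]))
```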

The QCA file: specifying how each set of SNPs should be analyzed

The QCA (Quality Control Analysis) file is a companion to the QCC file: the QCC file specifies groups of SNPs, and the QCA file specifies how they should be analyzed. The file format is a specialization of TSV. There are four required columns: analysis_name, group_name, analysis, and options, all shown in the example below. As with the TSV format in general, the order of the columns and the possible presence of additional columns do not matter.

As an example, here are the contents of the QCA file for the Genome Wide SNP Array 6.0 (with some minor edits for clarity):

#%format_version=1.0
#%content_version=1.0
analysis_name   group_name      analysis        options
# Standard analyses
QC_call_rate_all      all_qc  dm      dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
QC_call_rate_Nsp      nsp_qc  dm      dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
QC_call_rate_Sty      sty_qc  dm      dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
QC_call_rate_Nsp_Sty_overlap  nsp_sty_qc      dm      dm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4
Gender  chrX    gender  em.cutoff=0.5,em.thresh=0.05,gender.cutoff=0.1,as.text=1
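
The options field packs several key=value settings into one comma-separated string. The sketch below shows one way to unpack it when reviewing or comparing QCA files. It is an illustration only: the function names are hypothetical, values are kept as strings rather than interpreted, and lines beginning with "#" are assumed to be skippable.

```python
import csv
import io

def parse_options(text):
    """Split 'key=value,key=value' into a dict; values are kept as strings."""
    options = {}
    for item in text.split(","):
        key, _, value = item.partition("=")
        options[key.strip()] = value.strip()
    return options

def read_qca(handle):
    """Read QCA rows, skipping '#' lines and locating columns by header name."""
    rows = (line for line in handle if not line.startswith("#"))
    analyses = []
    for row in csv.DictReader(rows, delimiter="\t"):
        row["options"] = parse_options(row["options"])
        analyses.append(row)
    return analyses

# Two rows taken from the example above, with explicit tab separators
sample = (
    "#%format_version=1.0\n"
    "analysis_name\tgroup_name\tanalysis\toptions\n"
    "# Standard analyses\n"
    "QC_call_rate_all\tall_qc\tdm\tdm.cutoff=0.33,dm.hetMult=1.25,as.percentage=1,precision=4\n"
    "Gender\tchrX\tgender\tem.cutoff=0.5,em.thresh=0.05,gender.cutoff=0.1,as.text=1\n"
)
analyses = read_qca(io.StringIO(sample))
for a in analyses:
    print(a["analysis_name"], a["group_name"], a["analysis"], a["options"])
```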

The list of CEL files to analyze

The file specified with the --cel-files option identifies the CEL files to be analyzed. The file format is TSV. Only one field, cel_files, is required. An example is provided below.

cel_files
cel/GenomeWideSNP_6/hapmap/NA06985_GW6_C.CEL
cel/GenomeWideSNP_6/hapmap/NA06991_GW6_C.CEL
cel/GenomeWideSNP_6/hapmap/NA06993_GW6_C.CEL
cel/GenomeWideSNP_6/hapmap/NA06994_GW6_C.CEL
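
For a directory containing many CEL files, the list file can be generated rather than typed by hand. The following sketch assumes that every file with a .CEL extension in a single directory should be included. The function name write_cel_list is hypothetical, and the temporary directory exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path

def write_cel_list(cel_dir, list_path):
    """Write a one-column TSV (header 'cel_files') listing the CEL files in cel_dir."""
    cel_paths = sorted(
        p for p in Path(cel_dir).iterdir() if p.suffix.upper() == ".CEL"
    )
    lines = ["cel_files"] + [p.as_posix() for p in cel_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return len(cel_paths)

# Self-contained demonstration using empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("NA06985_GW6_C.CEL", "NA06991_GW6_C.CEL", "notes.txt"):
        (Path(tmp) / name).touch()
    out = Path(tmp) / "cel.txt"
    count = write_cel_list(tmp, out)
    listing = out.read_text().splitlines()

print(count)
print(listing[0])
```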

The output report

The file specified by --out-file is where the results will be written. The file format is TSV with a row for each CEL file analyzed. There will be a field called Chip with the CEL file name, followed by a field for each of the analyses specified in the QCA file.

An example from the Genome Wide SNP Array 6.0 is provided below.

#%format_version=1
#%content_version=1
#%default_analysis_name=QC call rate (all)
#%library file=lib/GenomeWideSNP_6/GenomeWideSNP_6.cdf
#gender information: male=0,female=1,unknown=-1
#method1:analysis_name=QC call rate (all);analysis=dm;group_name=all_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method2:analysis_name=QC call rate (Nsp);analysis=dm;group_name=nsp_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method3:analysis_name=QC call rate (Sty);analysis=dm;group_name=sty_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method4:analysis_name=QC call rate (Nsp/Sty overlap);analysis=dm;group_name=nsp_sty_qc;as.percentage=1;dm.cutoff=0.33;dm.hetMult=1.25;precision=4
#method5:analysis_name=Gender;analysis=gender;group_name=chrX;as.text=1;em.cutoff=0.5;em.thresh=0.05;gender.cutoff=0.1
Chip    Gender  QC call rate (Nsp)      QC call rate (Nsp/Sty overlap)  QC call rate (Sty)      QC call rate (all)
NA06985_GW6_C.CEL       female  98.07   98.46   96.14   97.88
NA06991_GW6_C.CEL       female  98.2    98.77   96.78   98.21
NA06993_GW6_C.CEL       male    94.6    95.75   88.57   93.98
NA06994_GW6_C.CEL       male    99.1    98.52   96.62   98.28
NA07000_GW6_C.CEL       female  97.69   98.52   95.81   97.75
NA07019_GW6_C.CEL       female  97.3    98.58   94.36   97.39
NA07022_GW6_C.CEL       male    97.04   96.61   92.59   95.9
NA07029_GW6_C.CEL       male    96.02   96.98   93.4    96
NA07034_GW6_C.CEL       male    96.4    97.54   93.56   96.43
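
A report in this format can be post-processed to list the chips that fall below a chosen QC call rate. The sketch below is illustrative: the function names are hypothetical, lines beginning with "#" are assumed to be skippable, and the threshold of 95.0 is an arbitrary value picked for the example, not a recommended cutoff.

```python
import csv
import io

def read_qc_report(handle):
    """Read the report rows as dicts keyed by column name, skipping '#' lines."""
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t"))

def failing_chips(rows, column, threshold):
    """Return the Chip names whose value in `column` is below `threshold`."""
    return [row["Chip"] for row in rows if float(row[column]) < threshold]

# Three rows taken from the example above, with explicit tab separators
sample = (
    "#%format_version=1\n"
    "Chip\tGender\tQC call rate (Nsp)\tQC call rate (Nsp/Sty overlap)\t"
    "QC call rate (Sty)\tQC call rate (all)\n"
    "NA06985_GW6_C.CEL\tfemale\t98.07\t98.46\t96.14\t97.88\n"
    "NA06993_GW6_C.CEL\tmale\t94.6\t95.75\t88.57\t93.98\n"
    "NA06994_GW6_C.CEL\tmale\t99.1\t98.52\t96.62\t98.28\n"
)
report = read_qc_report(io.StringIO(sample))
# 95.0 is a hypothetical threshold used only for illustration
print(failing_chips(report, "QC call rate (all)", 95.0))
```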

Guidelines on derivation of sets of SNPs and thresholds for use as a QC metric

When working with a new WGSA chip design for the first time, it is often not known in advance which SNPs will work well. A reasonable first pass is to use the full set of SNPs tiled with MMs, analyzing them with the same parameter settings as used for the Genome Wide SNP Array 6.0 (in the example above). Call rates will probably be low (due to the inclusion of SNPs that aren't working well) but should still be correlated with genotyping performance.

After running a few dozen or more distinct samples, it should be possible to assess which of the SNPs tiled with MM probes appear to be working well with the assay. At that point the set of SNPs used for deriving the DM call rate should be trimmed back to just those that appear to be working well. This will generally improve the correlation between the DM call rate and the genotyping performance, and (less importantly) will result in higher DM call rates.
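
The text does not prescribe how to decide that a SNP is "working well". One simple possibility is to keep SNPs whose per-SNP call rate across the samples run so far exceeds a cutoff. The sketch below assumes per-SNP calls are available from some other step. The data layout, the "NC" no-call symbol, the SNP names, and the 0.9 cutoff are all hypothetical.

```python
def well_performing_snps(calls_by_snp, min_call_rate=0.9, no_call="NC"):
    """Return the SNPs whose fraction of non-no-call results is >= min_call_rate.

    calls_by_snp: dict mapping SNP name -> list of calls, one per sample.
    """
    keep = []
    for snp, calls in calls_by_snp.items():
        if not calls:
            continue  # no data for this SNP, so it cannot be assessed
        rate = sum(call != no_call for call in calls) / len(calls)
        if rate >= min_call_rate:
            keep.append(snp)
    return sorted(keep)

# Invented calls for three SNPs across ten samples
calls = {
    "SNP_1": ["AA", "AB", "BB", "AA", "AA", "AB", "AA", "BB", "AB", "AA"],
    "SNP_2": ["AA", "NC", "NC", "AB", "NC", "AA", "NC", "BB", "AB", "NC"],
    "SNP_3": ["AB", "AB", "NC", "AA", "BB", "AA", "AA", "BB", "AB", "AA"],
}
print(well_performing_snps(calls))
```

The retained SNP names would then be listed under a group in a revised QCC file.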

In multiple-enzyme applications of WGSA (for example, the simultaneous hybridization of Nsp- and Sty-generated target), it may also be useful to define per-enzyme subsets of SNPs so that performance on each subset can be monitored separately, in addition to monitoring overall performance.
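
With per-enzyme call rates in hand, one way to look for an enzyme-specific failure is to compare each sample's Sty/Nsp call-rate ratio with the median ratio of the batch. The sketch below uses Nsp and Sty call rates from the example report above. The function name and the tolerance of 0.03 are assumptions chosen for illustration.

```python
from statistics import median

def unusual_enzyme_ratio(rates, tolerance=0.03):
    """Return chips whose Sty/Nsp call-rate ratio differs from the batch median
    ratio by more than `tolerance`.

    rates: dict mapping chip name -> (nsp_call_rate, sty_call_rate).
    """
    ratios = {chip: sty / nsp for chip, (nsp, sty) in rates.items()}
    mid = median(ratios.values())
    return sorted(chip for chip, r in ratios.items() if abs(r - mid) > tolerance)

# (Nsp, Sty) QC call rates taken from the example report
rates = {
    "NA06985_GW6_C.CEL": (98.07, 96.14),
    "NA06991_GW6_C.CEL": (98.2, 96.78),
    "NA06993_GW6_C.CEL": (94.6, 88.57),
    "NA06994_GW6_C.CEL": (99.1, 96.62),
    "NA07000_GW6_C.CEL": (97.69, 95.81),
}
# 0.03 is a hypothetical tolerance, not a recommended value
print(unusual_enzyme_ratio(rates))
```

A flagged chip is only a candidate for closer inspection. Whether it reflects a failed enzyme prep has to be judged from the experiment itself.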

Once the set of SNPs to be used for computing QC has been finalized, the user may wish to choose a threshold or a range of QC values for ruling experiments in or out of subsequent clustering analysis. Determining such a threshold or range is somewhat subjective. Studying the relationship between QC call rate and clustering call rate for a large batch of samples can be helpful in this regard. Additionally, where possible, the relationship between QC call rate and accuracy should be studied. Useful proxies for accuracy include evaluation of reproducibility, Mendelian consistency for related samples, and concordance to some other gold standard where available.
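
A simple way to study the relationship between QC call rate and clustering call rate is to tabulate, for several candidate thresholds, how many samples would be rejected and how poor the worst accepted sample would be. The sketch below is illustrative only: the function name, the candidate thresholds, and the paired values are invented for the example.

```python
def sweep_thresholds(samples, thresholds):
    """For each candidate QC threshold, report (threshold, number rejected,
    lowest clustering call rate among the accepted samples).

    samples: list of (qc_call_rate, clustering_call_rate) pairs.
    The third element is None when no sample passes the threshold.
    """
    table = []
    for threshold in thresholds:
        kept = [clustering for qc, clustering in samples if qc >= threshold]
        rejected = len(samples) - len(kept)
        worst_kept = min(kept) if kept else None
        table.append((threshold, rejected, worst_kept))
    return table

# Hypothetical (QC call rate, clustering call rate) pairs, in percent
samples = [(98.0, 99.5), (97.0, 99.0), (93.0, 96.0), (90.0, 92.0)]
for row in sweep_thresholds(samples, [92, 95]):
    print(row)
```

The same table could be built with an accuracy proxy, such as concordance or Mendelian consistency, in place of the clustering call rate.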

Affymetrix Power Tools (APT) Release apt-1.10.1