VIGNETTES: Axiom ™ Single-Sample QC Analysis

Date:
2009-10-12

Contents

Introduction

Sample QC is an important part of genotyping analysis. Inclusion of a substantial proportion of low-quality samples in a genotype clustering run can negatively impact the quality of results for the other high-quality samples. So it is particularly useful to be able to perform single-sample QC analysis prior to clustering. Doing QC analysis on a single-sample basis also fits in well with most laboratory workflows - low-quality experiments can be identified at the time of processing, making rehybridization or reprocessing most practical. In this vignette guidelines are provided for using the application apt-geno-qc to perform single-sample QC for Axiom ™ arrays.

Axiom ™ arrays use a two-color system and the associated sample QC metrics are based on the principle that the two channels must be separable to achieve good genotyping results. The signal to background ratio (SBR) is the simplest form of QC, and by constructing probes based on known non-polymorphic sequences, we created indicator signal and background probes. In addition to SBR, we have developed a novel metric DishQC which takes into account both inter-channel and intra-channel signal separation and spread. Our standard practice to qualify samples based on their DishQC value.

It should be noted that while single-sample QC is an essential component of any genotyping analysis, it is not the only type of sample QC that should be applied. At best the single-sample QC metric will be well-correlated with final genotyping performance, but the correlation will never be perfect and there will always be some poor samples accepted and some good samples rejected. Often the former scenario is of greater concern to the user, and to increase the opportunity to identify poor quality samples the user should also look carefully at sample attributes after performing genotype clustering. In particular, samples that display outlier values for the clustering call rate and/or heterozygosity may be poor performers that slipped by the single-sample QC metric and may need to be excluded. An overview of how to perform multiple-sample clustering can be found in the vignette on genotype clustering for Axiom ™ arrays.

This vignette provides an overview of sample QC for Axiom ™ arrays using apt-geno-qc. For a more complete description of apt-geno-qc refer to the apt-geno-qc manual.

All library files referenced in this vignette can be found at the Axiom ™ product page.

Running a single-chip QC analysis

To run the QC procedure on each sample in a set:

  apt-geno-qc \
    --analysis-file-path ${AXIOM_LIB_PATH}
    --xml-file Axiom_GW_Hu_SNP.r2.apt-geno-qc.AxiomQC1.xml \
    --cel-files cel.txt \
    --out-file qc.txt

apt-geno-qc supports declaring command-line options via an XML parameter file by using the --xml-file option, the above example follows this procedure. The particular xml options file mentioned above is available as part of the library file package distributed at the Axiom ™ product page. The XML parameter file is a useful tool for defining and storing settings for data analysis.

The --analysis-files-path is the path to the folder where the library files (including the XML file and the other library files referenced therein) are stored. For example, if the Axiom ™ library files are stored in /mypath/ then the command line string would be:

      apt-geno-qc \
    --analysis-files-path /mypath/ \
    --xml-file Axiom_GW_Hu_SNP.r2.apt-geno-qc.AxiomQC1.xml \
    --cel-files cel_file_list.txt \
    --out-file qc.txt

At this point most users will probably want to look at the DQC values for each sample to determine pass/fail status. These values can be obtained from the qc.txt output file. The format is a specialization of TSV and the column with the DQC values is named "axiom_dishqc_DQC". A DQC of 0.82 or better is considered a pass for Axiom Human arrays. For Axiom Bovine array, a DQC of 0.95 or better is considered as passing criteria.

The list of CEL files to analyze

The file specified with the --cel-files option identifies the CEL files to be analyzed. The file format is TSV. There is only one field required, cel_files. An example is provided below.

cel_files
cel/hapmap/NA10851.CEL
cel/hapmap/NA10854.CEL
cel/hapmap/NA10855.CEL
cel/hapmap/NA10856.CEL
cel/hapmap/NA10857.CEL
cel/hapmap/NA10859.CEL

The XML Input Parameter file

In this vignette, all of the QC settings are stored in the XML input parameter file distributed as part of the library file package located at Axiom ™ product page. It is recommended that you use the official XML file included in the library file package versus creating your own from the example below.

<?xml version="1.0"?>
<ParameterSet subgroupType="Analysis" subgroupName="Axiom_GW_Hu_SNP.apt-geno-qc.AxiomQC1" executableName="apt-geno-qc">
    <Parameters>
        <Parameter name="cdf-file"           currentValue="Axiom_GW_Hu_SNP.r2.cdf" />
        <Parameter name="chrX-probes"        currentValue="Axiom_GW_Hu_SNP.r2.chrXprobes" />
        <Parameter name="chrY-probes"        currentValue="Axiom_GW_Hu_SNP.r2.chrYprobes" />
        <Parameter name="target-sketch"      currentValue="Axiom_GW_Hu_SNP.r2.AxiomGT1.sketch" />
        <Parameter name="qcc-file"           currentValue="Axiom_GW_Hu_SNP.r2.qcc" />
        <Parameter name="qca-file"           currentValue="Axiom_GW_Hu_SNP.r2.qca" />
        <Parameter name="female-thresh"      currentValue="0.54" />
        <Parameter name="male-thresh"        currentValue="1.0" />
    </Parameters>
</ParameterSet>

The CDF and sketch files are required to perform proper sample QC on the Axiom ™ platform. The QCC and QCA files are described below are can be modified for user needs.

The QCC file: defining sets of probesets on which to perform single-chip analysis

The QCC (Quality Control Classes) file specifies classes of probesets to be analyzed for QC purposes. The file format is a specialization of TSV. There are two required columns, described below. As with the TSV format in general, the order of the columns and the possible presence of additional columns do not matter.

The QCC file is distributed as part of the library file package located at Axiom ™ product page. It is recommended that you use the official XML file included in the library file package versus creating your own from the example below.

As an example, here are the first ten lines of Axiom_GW_Hu_SNP.r2.qcc:

#%guid=4fc8a14a-cff5-41f2-90a0-17f42cee894f
#%create-date=Wed Sep 30 00:00:00 PDT 2009
#%content_version=2.0
#%format_version=1.0
#%primary_key=probeset_name
#%chip_type=Axiom_GW_Hu_SNP
#%chip_type=Axiom_GW_Hu_SNP.r2
#%lib_set_name=Axiom_GW_Hu_SNP
#%lib_set_version=r2
group_name	probeset_name	ligation_base
Mono_30	AFFX-NP-11078525	C
Mono_30	AFFX-NP-11078526	G
Mono_30	AFFX-NP-11078527	C
Mono_30	AFFX-NP-11078528	G
Mono_30	AFFX-NP-11078529	C
Mono_30	AFFX-NP-11078530	G
Mono_30	AFFX-NP-11078531	A
Mono_30	AFFX-NP-11078532	T
Mono_30	AFFX-NP-11078533	C
Mono_30	AFFX-NP-11078534	G

So for example the first data line implies that the probeset named AFFX-NP-11083037 is a member of the group Mono_sample_30 and interrogates the A allele. The QCA file (see next section) describes how each group of probesets should be analyzed.

The QCA file: specifying how each collection of probesets should be analyzed

The QCA (Quality Control TSV. There are four required columns, described below. As with the TSV format in general, the order of the columns and the possible presence of additional columns do not matter.

The QCA file is distributed as part of the library file package located at Axiom product page. It is recommended that you use the official XML file included in the library file package versus creating your own from the example below.

Below are the example contents of Axiom_GW_Hu_SNP.r2.qca. axiom_dishqc is the primary indicator of quality and should be used to identify failing samples (those with a axiom_dishqc_DQC value less than 0.82).

#%guid=2489a0ab-12c3-40d6-80e1-06df5617fe77
#%create-date=Wed Sep 30 00:00:00 PDT 2009
#%format_version=1.0			
#%content_version=1.0
#%default_analysis_name=axiom_dishqc
#%chip_type=Axiom_GW_Hu_SNP
#%chip_type=Axiom_GW_Hu_SNP.r2
#%lib_set_name=Axiom_GW_Hu_SNP
#%lib_set_version=r2
#
# There is a row for each analysis method on each SNP list.
#   analysis name is the name that will be associated with the analysis & snp list combination
#   snp_list is the identifier for the list of SNPs
#   analysis is the name of the algorithm to run
#   options is a field (may be empty) to pass options to the algorithm.
#
analysis_name	group_name	analysis	options
## signal contrast qc
axiom_signal_contrast	Mono_sample_30	mc-signalbackground

## DISH QC
axiom_dishqc	Mono_sample_30	mc-dishqc

## varibility score
axiom_varscore	Mono_sample_30	mc-variability

For each analysis type, multiple metrics will be written, each labeled with the analysis_name. Here are the metrics that are printed out in the output qc.txt file.

Metric name

Description

axiom_signal_contrast_AT_B_IQR Interquartile range of control GC probe raw intensities (background intensities) in the AT channel
axiom_signal_contrast_AT_B Mean of control GC probe raw intensities (background intensities) in the AT channel
axiom_signal_contrast_AT_FLD Linear Discriminant for signal and background in the AT channel, defined as (median_of_GC_probe_intensities - median_of_AT_probe_intensities)^2 / [0.5 * (Axiom_signal_contrast_AT_B_IQR^2 + Axiom_signal_contrast_AT_S_IQR^2)]
axiom_signal_contrast_AT_SBR Signal to background ratio in the AT channel, defined as Axiom_signal_contrast_AT_S / Axiom_signal_contrast_AT_B
axiom_signal_contrast_AT_S_IQR The interquartile range of control AT probe raw intensities (signal intensities) in the AT channel
axiom_signal_contrast_AT_S Mean of control AT probe raw intensities (signal intensities) in the AT channel
axiom_signal_contrast_A_signal_mean Mean of control A probe raw intensities in the AT channel
axiom_signal_contrast_C_signal_mean Mean of control C probe raw intensities in the GC channel
axiom_signal_contrast_GC_B_IQR The interquartile range of control AT probe raw intensities (background intensities) in the GC channel
axiom_signal_contrast_GC_B Mean of control AT probe raw intensities (background intensities) in the GC channel
axiom_signal_contrast_GC_FLD Linear Discriminant for signal and background in the GC channel, defined as (median_of_GC_probe_intensities - median_of_AT_probe_intensities)^2 / [0.5 * (Axiom_signal_contrast_GC_B_IQR^2 + Axiom_signal_contrast_GC_S_IQR^2)]
axiom_signal_contrast_GC_SBR Signal to background ratio in the GC channel, defined as Axiom_signal_contrast_GC_S / Axiom_signal_contrast_GC_B
axiom_signal_contrast_GC_S_IQR Interquartile range of control GC probe raw intensities (signal intensities) in the GC channel
axiom_signal_contrast_GC_S Mean of control GC probe raw intensities (signal intensities) in the GC channel
axiom_signal_contrast_G_signal_mean Mean of control G probe raw intensities in the GC channel
axiom_signal_contrast_T_signal_mean Mean of control T probe raw intensities in the AT channel
axiom_dishqc_DQC QC metric that evaluates the overlap between the two homozygous peaks (AT versus GC) in contrast space using normalized intensities of control non-polymorphic probes from both channels. It is defined as the fraction of AT probes not within 2 standard deviations of the GC probes in the contrast space.
axiom_dishqc_log_diff_qc Another cross channel QC metric, defined as mean(log(AT_SBR))/std(log(AT_SBR)) + mean(log(GC_SBR))/std(log(GC_SBR)), where signal and background are calculated for control non-polymorphic probes after intensity normalization
axiom_varscore_CV_GC Median of the coefficient of variation for each control GC probeset in the GC channel
axiom_varscore_CV_AT Median of the coefficient of variation for each control AT probeset in the AT channel
cn-probe-chrXY-ratio_gender_meanX The average probe intensity (raw, untransformed) of X chromosome nonpolymorphic probes.
cn-probe-chrXY-ratio_gender_meanY The average probe intensity (raw, untransformed) of Y chromosome nonpolymorphic probes.
cn-probe-chrXY-ratio_gender_ratio Gender ratio Y/X = cn probe chrXY-ratio_gender_meanY/ cn probe chrXY ratio_gender_meanX.
cn-probe-chrXY-ratio_gender Gender calls made by the cn-probe-chrXY-ratio_gender method. If the cn-probe-chrXY-ratio_gender_ratio is less than the lower cutoff the gender call is female. If the cn-probe-chrXY-ratio_gender_ratio is greater than the upper cutoff, then the gender call is male. If the cn-probe-chrXY-ratio_gender_ratio is between the lower and upper cutoffs, then the gender call is unknown.

The output report

The file specified by --out-file is where the results will be written. The file format is TSV with a row for each CEL file analyzed. There will be a column called "cel_files" with the CEL file name, followed by columns hosting QC metrics.

The primary field to inspect for experiment quality is axiom_dishqc_DQC, with values less than 0.82 indicating a failing sample.

Affymetrix Power Tools (APT) Release 1.14.3