VIGNETTES: Empirical Performance Ranking of Probes for Genotyping with the Whole Genome Sampling Assay

Date:
2007-01-08

Introduction

Empirical probe ranking for genotyping arrays is sometimes referred to as 'selecting the optimal probe'. For well-behaved probe sets, there is often little functional difference between the high-performing probes, so the ranking among them is likely to be largely a matter of chance in the data. In such circumstances the procedure is perhaps better described as 'avoiding poor probes'.

Although the Genome Wide SNP 5.0 Array selected probes that were 'unpaired', i.e. the A-allele and B-allele probes consisted of different sequences (beyond the SNP position itself, where they naturally always differ), we have found this strategy to be less effective than always selecting paired probes, as was done for the Genome Wide SNP 6.0 Array. We have therefore written the software to select paired probes only. Pairing increases manufacturing efficiency, improves resistance to spatial gradients and sequence-based responses to assay conditions, and empirically leads to superior performance. Please bear this condition in mind if any ranking of probes is hand-tuned.

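With that warning aside, empirical probe ranking consists of four basic steps, each described under Methodology: calling genotypes from the full probeset, computing a contrast value for each matched probe pair, fitting a logistic regression that predicts genotype from contrast, and writing a summary of each fit to an output file.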

At the user level, this is done by using BRLMM-P to perform the genotype calling and adding the option '--select-probes' to the apt-probeset-genotype command line. This sets the flag indicating that the user wishes to regress individual probes against the results from the full probeset, and all the above steps are then executed automatically. Some complexities in obtaining the best results are described below.

Note that this functionality has not been extensively tested and should be considered experimental. Use of this functionality should be accompanied by a judicious amount of spot-checking of the results to make sure they look reasonable.

This methodology has been extended slightly (still experimental) to function with 'apt-summary-genotype', for selection of individual probe-sets that work well. In this case, the method analyzes the full-set contrast values as though they were a single probe and evaluates the logistic regression model fit. Caution: Model fits obtained from different called genotypes are not necessarily comparable across probe-sets that evaluate the same underlying marker. In such a case, use the 'override' command to ensure that the same reference genotypes are used for all probe-sets being compared.

Methodology

First, for each probeset, we call genotypes using the analysis string provided to brlmm-p. This should be tuned for the best performance possible on the array, because these internally generated genotypes will be treated as the reference information. One of the most useful tunings is to construct a 'hints' file that explicitly labels data points as one genotype or another, for those experiments where such reference information is available. This generally resolves ambiguities in cluster location, even if only a few hints are available. See the vignette on clustering without priors for guidance on how to use the hints functionality.

Second, for each matched pair of A-allele and B-allele probes (unmatched pairs are suboptimal for many reasons), we generate a one-dimensional contrast value for each cel file. The contrast value is a monotonic transformation of the quantity (A-B)/(A+B), where A and B are the normalized probe intensities. It directly measures the difference between the alleles while ignoring overall intensity effects, which concentrates the regression on finding probe pairs that discriminate strongly between genotypes.
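A minimal Python sketch of such a contrast value follows. The text above specifies only that the transformation of (A-B)/(A+B) is monotonic; the asinh form and the stretch parameter k used here are assumptions chosen for illustration and may not match what the software applies.

```python
import math

def contrast(a, b, k=2.0):
    """Map normalized intensities A and B to a contrast value in [-1, 1].

    The raw ratio r = (A-B)/(A+B) is passed through asinh(k*r)/asinh(k),
    which is monotonic in r. Both the transform and k are assumptions of
    this sketch, not values taken from the software.
    """
    total = a + b
    if total <= 0:
        raise ValueError("A + B must be positive")
    r = (a - b) / total
    return math.asinh(k * r) / math.asinh(k)

# Doubling both intensities leaves the contrast unchanged, which is the
# sense in which overall intensity effects are ignored.
print(contrast(1200.0, 300.0))
print(contrast(2400.0, 600.0))
```

Because the transform is monotonic, it changes the spacing of the clusters along the contrast axis but not their order.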

Once the contrast value has been obtained, it is used as the input variable in a logistic regression that predicts genotype from contrast. Each genotype is treated as two allele observations containing 0, 1 or 2 copies of the B allele. The regression fits the model Pr(allele=B|contrast=X) = 1/(1+exp(a*X+b)) so that it predicts genotype from contrast as well as possible. Note that this differs from the clustering model used to actually call genotypes. The model can certainly be improved, but it is useful for detecting suboptimal probe pairs.
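To make the model concrete, the sketch below evaluates Pr(allele=B|contrast=X) and the corresponding binomial log-likelihood for fixed coefficients a and b. The function names and the example data are hypothetical, and the constant binomial coefficient is omitted because it does not depend on a or b.

```python
import math

def prob_b(x, a, b):
    """Pr(allele = B | contrast = x) = 1 / (1 + exp(a*x + b))."""
    return 1.0 / (1.0 + math.exp(a * x + b))

def log_likelihood(contrasts, b_copies, a, b):
    """Binomial log-likelihood in which each sample contributes 2 alleles.

    b_copies holds 0, 1 or 2 copies of the B allele per sample.
    """
    total = 0.0
    for x, n_b in zip(contrasts, b_copies):
        p = prob_b(x, a, b)
        total += n_b * math.log(p) + (2 - n_b) * math.log(1.0 - p)
    return total

# Hypothetical data: AA samples have positive contrast, BB samples negative.
contrasts = [0.8, 0.7, 0.05, -0.1, -0.75, -0.9]
b_copies = [0, 0, 1, 1, 2, 2]

print(log_likelihood(contrasts, b_copies, a=5.0, b=0.0))   # slope agrees with data
print(log_likelihood(contrasts, b_copies, a=-5.0, b=0.0))  # slope has wrong sign
```

With contrast defined from (A-B)/(A+B), a larger contrast means less B allele, so a well-fitting model of this form has a positive a.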

There are a number of important caveats to this model. First, any regression model will fit well when a SNP has only one genotype present in the input data, and therefore offers no discrimination among probes. Second, logistic regression breaks down when confronted with perfectly separated clusters. We handle both cases by including a Bayesian prior on the model that prevents breakdown (by allowing for the possibility of an error in the genotypes provided for regression). The monomorphic case, however, still provides no useful data for ranking.
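One way to see why allowing for genotype error prevents breakdown is sketched below. Each reference genotype is softened into an expected B-allele count, as if every allele were mislabeled with a small probability, and the model is then fitted by damped Newton steps. The softening scheme, the error_rate value, and the function name are assumptions of this sketch; the prior actually used by the software is not specified here.

```python
import math

def fit_logistic(contrasts, b_copies, error_rate=0.01, max_iter=50):
    """Fit Pr(B | x) = 1 / (1 + exp(a*x + b)); returns (a, b, log-likelihood)."""
    # Soften each genotype into an expected B-allele count out of 2 (assumption).
    targets = [n * (1 - error_rate) + (2 - n) * error_rate for n in b_copies]

    def loglik(a, b):
        total = 0.0
        for x, t in zip(contrasts, targets):
            z = a * x + b
            log1pexp = max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # log(1 + e^z)
            total += t * (-log1pexp) + (2 - t) * (z - log1pexp)
        return total

    a = b = 0.0
    for _ in range(max_iter):
        ga = gb = haa = hab = hbb = 0.0
        for x, t in zip(contrasts, targets):
            p = 1.0 / (1.0 + math.exp(a * x + b))
            r = 2 * p - t          # derivative of the log-likelihood w.r.t. z
            w = 2 * p * (1 - p)    # curvature weight
            ga += r * x
            gb += r
            haa += w * x * x
            hab += w * x
            hbb += w
        det = haa * hbb - hab * hab
        if det <= 1e-12:
            break  # contrasts have no usable spread
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        step, current = 1.0, loglik(a, b)
        # Halve the step until the log-likelihood no longer decreases.
        while step > 1e-6 and loglik(a + step * da, b + step * db) < current:
            step /= 2
        a, b = a + step * da, b + step * db
        if abs(step * da) + abs(step * db) < 1e-9:
            break
    return a, b, loglik(a, b)

# Perfectly separated clusters: the softened targets keep a and b finite.
separated_x = [0.8, 0.7, 0.9, 0.05, -0.1, 0.0, -0.75, -0.9, -0.8]
separated_g = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(fit_logistic(separated_x, separated_g))
```

In the monomorphic case the same fit converges to a slope near zero, which illustrates why such SNPs carry no information for ranking probe pairs.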

Once the model is fit for a probe pair, we output a single line summarizing the results to a file. This output file is tab-delimited text with the following fields and a header line containing names for each column:

See the FAQ item on probe IDs for more info.

The AIC, which turns out to be the most useful measure of fit, consists of the standard -2*log-likelihood+2*parameters value from the regression. The smallest AIC is the 'best' probe pair. This was the measure used for selection in the development of the Genome Wide SNP 6.0 Array - the single best probe pair was identified on a screening array and was then tiled in replicate (3 or 4 copies) on the final product. In general, the AIC is appropriately correlated with other quality measures, and combines all the data sufficiently well.
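The AIC calculation and the resulting ranking can be sketched as follows. The probe-pair names and log-likelihood values are hypothetical, and the parameter count of 2 assumes the slope a and intercept b of the logistic model are the only fitted parameters.

```python
def aic(log_likelihood, n_parameters=2):
    """Standard AIC: -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_parameters

# Hypothetical fitted log-likelihoods for three probe pairs of one SNP.
fits = {"pair_1": -3.2, "pair_2": -11.7, "pair_3": -4.0}

# Smallest AIC first, so the first entry is the 'best' probe pair.
ranked = sorted(fits, key=lambda name: aic(fits[name]))
print(ranked)
```

Since every probe pair is fitted with the same number of parameters, ranking by AIC here is equivalent to ranking by log-likelihood.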

The 'concordance' is measured between the genotypes predicted by the logistic regression and the full-set genotype calls. The predictions from the logistic regression are turned into discrete calls using a t-like statistic, (p-x)/sqrt(variance), that compares p = Pr(B allele) against each of the values x = 0, 0.5 and 1.0. The statistic for hets is divided by 2, because the slope is steepest near p=0.5. The genotype whose statistic is smallest in absolute value is 'called' as the genotype from the logistic regression. These 'calls' are then compared with the full-set genotype calls, and summarized separately for heterozygous and homozygous calls. This predicted concordance is of limited utility for distinguishing among generally well-performing probes: there are generally not many data points, the clustering method differs from that used in actual analysis, and it miscalls more often than clustering methods do.
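The calling rule and the concordance summary might be sketched as below. How the variance is estimated is not stated above, so the sketch takes it as a positive input supplied by the caller. Splitting the summary by the full-set (reference) genotype is likewise an assumption, and genotypes are coded as 0, 1 or 2 copies of the B allele.

```python
import math

def call_from_probability(p, variance):
    """Return the B-allele copy number (0, 1 or 2) with the smallest |statistic|.

    variance must be positive; its estimation is outside this sketch.
    """
    sd = math.sqrt(variance)
    stats = {
        0: abs(p - 0.0) / sd,
        1: abs(p - 0.5) / sd / 2.0,  # the het statistic is divided by 2
        2: abs(p - 1.0) / sd,
    }
    return min(stats, key=stats.get)

def concordance(predicted, reference):
    """Fraction of matching calls for homozygous and heterozygous references."""
    result = {}
    groups = (("hom", lambda g: g != 1), ("het", lambda g: g == 1))
    for label, in_group in groups:
        pairs = [(p, r) for p, r in zip(predicted, reference) if in_group(r)]
        if pairs:
            result[label] = sum(p == r for p, r in pairs) / len(pairs)
        else:
            result[label] = None  # no reference calls of this kind
    return result

# Hypothetical predicted probabilities and full-set calls.
probs = [0.05, 0.48, 0.97, 0.20]
full_set = [0, 1, 2, 0]
calls = [call_from_probability(p, variance=0.01) for p in probs]
print(calls, concordance(calls, full_set))
```

With a common variance for the three comparisons, halving the het statistic widens the range of p that is called heterozygous.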

The FLD is computed directly from the genotypes and the data, and consists of the distance between cluster centers divided by the standard deviation within clusters. This measures the separation between clusters directly in terms of the within-cluster variation. However, nice as this metric is, it is difficult to interpret when genotypes are missing for one or more clusters (in which case the cluster is filled in from prior information). There are of course 3 FLD values, one for each cluster pair. Note that the largest FLD is the 'best', which is the opposite of the AIC.
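A minimal sketch of the three FLD values follows. It assumes the within-cluster standard deviation is pooled over the two clusters being compared, and it returns None for a cluster pair with too few points instead of filling in from prior information; both choices are assumptions of the sketch. The contrast values are hypothetical.

```python
import math
from itertools import combinations

def fld_values(contrasts, genotypes):
    """Return {(g1, g2): FLD} for the cluster pairs among AA, AB and BB."""
    groups = {}
    for x, g in zip(contrasts, genotypes):
        groups.setdefault(g, []).append(x)

    out = {}
    for g1, g2 in combinations(("AA", "AB", "BB"), 2):
        x1, x2 = groups.get(g1, []), groups.get(g2, [])
        if len(x1) < 2 or len(x2) < 2:
            out[(g1, g2)] = None  # cluster missing or too small to estimate
            continue
        m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
        ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
        sd = math.sqrt(ss / (len(x1) + len(x2) - 2))  # pooled within-cluster SD
        out[(g1, g2)] = abs(m1 - m2) / sd if sd > 0 else None
    return out

# Hypothetical contrast values for six samples.
x = [0.8, 0.9, 0.0, 0.1, -0.8, -0.9]
g = ["AA", "AA", "AB", "AB", "BB", "BB"]
print(fld_values(x, g))
```

Larger values indicate better separated clusters, so a probe pair is ranked higher on FLD when these values are larger.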

All these summary values are output to a standard file [method].select-probes.txt, similar to other analysis pathways using apt-probeset-genotype. As mentioned above, AIC is sensitive to all deviations from ideal performance, and was found to be most useful in selecting probes.

Example usage

This probe ranking pathway is experimental software and not heavily optimized. Currently it cannot correctly handle chrY SNPs in a run that includes female samples (see the list of known bugs below). As a result, if the user wishes to perform selection on chrY as well as other kinds of SNPs, two separate runs of apt-probeset-genotype will be required.

An example run of probe ranking (based on the Genome Wide SNP 6.0 Array) to empirically rank probes for an arbitrary mix of autosomal, chrX and mitochondrial SNPs:

apt-probeset-genotype \
  --select-probes \
  --cdf-file             lib/GenomeWideSNP_6.cdf \
  --special-snps         lib/GenomeWideSNP_6.specialSNPs \
  --read-models-brlmmp   lib/GenomeWideSNP_6.brlmm-p.models \
  --analysis             quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05.hints=1.CP=16 \
  --genotypes            myKnownGenotypes.txt \
  --set-gender-method    user-supplied \
  --read-genders         myCels.genders.txt \
  --cel-files            myCels.txt \
  --probeset-ids         myAutoMitoChrX_SNPs.txt \
  --out-dir              out.AutoMitoChrX

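We now present an example run of probe ranking (based on the Genome Wide SNP 6.0 Array) to empirically rank probes for chrY SNPs. The key differences relative to the run above are that only male samples are supplied (--cel-files myCels.male.txt), the probeset list is restricted to chrY SNPs (--probeset-ids myChrY_SNPs.txt), and the results are written to a separate output directory (--out-dir out.ChrY):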

apt-probeset-genotype \
  --select-probes \
  --cdf-file             lib/GenomeWideSNP_6.cdf \
  --special-snps         lib/GenomeWideSNP_6.specialSNPs \
  --read-models-brlmmp   lib/GenomeWideSNP_6.brlmm-p.models \
  --analysis             quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05.hints=1.CP=16 \
  --genotypes            myKnownGenotypes.txt \
  --set-gender-method    user-supplied \
  --read-genders         myCels.genders.txt \
  --cel-files            myCels.male.txt \
  --probeset-ids         myChrY_SNPs.txt \
  --out-dir              out.ChrY

Review of empirical probe ranking

Special Cases:

Known Bugs

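There is one known bug: chrY SNPs are not handled correctly in a run that includes female samples (see Example usage for the two-run workaround).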

Affymetrix Power Tools (APT) Release apt-1.10.1
