VIGNETTES: Genotype clustering without SNP-specific priors (and how to generate SNP-specific priors for future use)

Date:
2007-11-20

Introduction

The Whole Genome Sampling Assay (WGSA) SNP-genotyping technology tends to provide very reproducible signals from one sample to the next, but the characteristics of the probe intensities vary somewhat from one SNP to the next. As a result, clustering to generate genotype calls generally gains an extra level of performance and robustness from taking this SNP-specific behavior into account. This section describes how to use the BRLMM-P clustering algorithm when no SNP-specific priors are available to start with.

With BRLMM-P there are two mechanisms for training SNP-specific information. One is to take advantage of the Bayesian nature of BRLMM-P, starting off with a weak prior and allowing the observed data to be the primary contributor to the SNP-specific posteriors. The resulting posteriors can be saved and used as priors for future runs. The other is to use BRLMM-P's ability to factor in known genotypes when they are available for some of the SNPs and samples being studied. These two methods can be applied at the same time. BRLMM-P is a likelihood-based clustering model. When it searches the parameter space with known genotypes supplied, it applies a penalty to the likelihood of any solution that violates the supplied information, guiding the result towards the correct answer.

It is important to consider the number and type of samples used with either of these approaches. With both approaches it is hard to derive useful information for a genotype that is never observed in the samples analyzed, though BRLMM-P will still make a guess at the location of an unobserved genotype. Moreover, BRLMM-P tends to be more successful in clustering SNPs for which all three clusters are observed. With that in mind, it is generally best to use as large and diverse a collection of samples as possible. As a reference, the SNP-specific models supplied with the Human Genome Wide SNP Array 6.0 were built by clustering 270 HapMap samples (taking advantage of their known genotypes) together with 200 samples from a diversity panel.
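
Because the number of samples matters, it can help to count the CEL files before starting a run. The following is a minimal sketch and not part of APT: count_cel_files and its directory argument are hypothetical names introduced here for illustration only.

```shell
# Hypothetical helper (not part of APT): count the *.CEL files in a directory.
count_cel_files() {
  dir="${1:-.}"
  n=0
  for f in "$dir"/*.CEL; do
    # When nothing matches, the glob stays literal; -e filters that case out.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

For example, running count_cel_files with "." as its argument prints the number of CEL files in the current directory, which can be compared with the 270 HapMap and 200 diversity-panel samples mentioned above.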

Using empirical data to guide clustering

This run uses empirical data alone to estimate SNP-specific models; the --write-models option saves the resulting models.

  apt-probeset-genotype \
    --cdf-file mychiptype.cdf \
    --analysis brlmm-p-plus.force \
    --write-models \
    --no-gender-force \
    --out-dir out \
    *.CEL

Using empirical data and assumed-known genotypes to guide clustering

This run uses known genotypes (supplied through --genotypes) in conjunction with empirical data to guide BRLMM-P in clustering and to estimate SNP-specific models.

  apt-probeset-genotype \
    --cdf-file mychiptype.cdf \
    --genotypes hints.txt \
    --analysis quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100
               .mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6
               .KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06
               .ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1.hints=1.CP=16 \
    --no-gender-force \
    --write-models \
    --out-dir out \
    *.CEL 

Note: the very long analysis string has been broken across lines in the example above for display purposes only (to fit on the page). In practice, enter the string as one continuous sequence with no intervening whitespace characters.
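
One way to keep a script readable while still passing a continuous string is to build the string up in a shell variable. This is only a sketch: ANALYSIS is a hypothetical variable name, and the fragments are copied unchanged from the example above.

```shell
# Build the analysis string from the fragments shown in the example above.
# ANALYSIS is a hypothetical variable name; each append adds no whitespace.
ANALYSIS="quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100"
ANALYSIS="${ANALYSIS}.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6"
ANALYSIS="${ANALYSIS}.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06"
ANALYSIS="${ANALYSIS}.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1.hints=1.CP=16"

# Warn if any whitespace slipped into the string while editing.
case "$ANALYSIS" in
  *[[:space:]]*) echo "warning: analysis string contains whitespace" >&2 ;;
esac

# The string can then be passed as a single quoted argument, for example:
#   apt-probeset-genotype ... --analysis "$ANALYSIS" ... *.CEL
```

Quoting "$ANALYSIS" when it is used keeps the shell from splitting or altering the string before APT sees it.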

Affymetrix Power Tools (APT) Release apt-1.10.1
