- Date:
- 2007-11-20
The Whole Genome Sampling Assay (WGSA) SNP-genotyping technology tends to provide very reproducible signals from one samaple to the next, but the characteristics of the probe intensities vary somewhat from one SNP to the next. As a result when clustering to generate genotype calls there is generally an extra level of performance and robustness available by taking into account this SNP-specific behavior. This section describes how to use the BRLMM-P clustering algorithm in the context of having no SNP-specific priors to start with.
With BRLMM-P there are two mechanisms by which to train SNP-specific information. One way is to take advantage of the Bayesian nature of BRLMM-P, starting off with a weak prior and allowing the observed data be the primary contributor to the SNP-specific posteriors. The resulting posteriors can be saved and used as priors for future runs. The other way is to use BRLMM-P's ability to factor in known genotypes when they are available for some of the SNPs and samples being studied. These two methods can be applied at the same time. BRLMM-P is a likelihood-based clustering model and when searching through the parameter space in the context of supplied known genotypes it applies a penalty to the likelihood of any solution that violates the supplied information, guiding the result towards the correct answer.
It is important to consider the number and type of samples used with either of these approaches. For both approaches it will be hard to derive useful information for a genotype that is never observed in the samples analyzed, though BRLMM-P will still make a guess on the location of an unobserved genotype. Moreover, BRLMM-P tends to be more successful in clustering SNPs for which there is an observation of all three clusters. With that in mind, it is generally best to use as large and diverse a collection of samples as possible. As a reference, the SNP-specific models supplied with the Human Genome Wide SNP Array 6.0 were built based on clustering 270 HapMap samples (taking advantage of their known genotypes) with 200 samaples from a diversity panel.
Using empirical data to estimate SNP-specific models
apt-probeset-genotype \
--cdf-file mychiptype.cdf \
--analysis brlmm-p-plus.force \
--write-models \
--no-gender-force \
--out-dir out \
*.CEL
- The '--no-gender-force' option forces apt-probeset-genotype to treat all SNPs in the same manner, with no special treatment for sex-chromosome or mitochondrial SNPs. For details on how to get optimal treatment on sex-chromosome and mitochondrial SNPs see the documenation on handling special SNPs.
- The "brlmm-p-plus.force" analysis string is an alias for a long analysis specification: "quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1". See the BRLMM-P whitepaper and the '--explain' option for apt-probeset-genotype ("apt-probeset-genotype --explain brlmm-p") for detailed understanding of what the string means. The quick summary is that it runs BRLMM-P in MvA (transform=MVA) and calls are always forced (MS=1).
- The '--write-models' option causes the posterior estimates to be written out to a file named brlmm-p-plus.force.snp-posteriors.txt in the output directory. The format of the file is tab- delimited text with a header line and one row per SNP. The first column is the probeset-id. If special treatment for sex-chromosome and mitochondrial SNPs is being used some of the probeset-ids will have a ":1" suffix to indicate a haploid model. The next three columns are for the BB, AB and AA genotypes respectively. Each entry consists of four comma-separated values: the posterior mean and variance for the SNP cluster, and the number of pseudo-observations underlying the mean and variance estimates respecitvely. The last column is a comma-separated list of the three covariances.
Using known genotypes in conjunction with empirical data to guide BRLMM-P in clustering and to estimate SNP-specific models.
apt-probeset-genotype \
--cdf-file mychiptype.cdf \
--genotypes hints.txt \
--analysis quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100
.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6
.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06
.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1.hints=1.CP=16 \
--no-gender-force \
--write-models \
--out-dir out \
*.CEL
Note - the very long analysis string has been broken up in the example above for display purposes only (to fit on the page). In practice you would enter the string as a continuous sequence, with no intervening whitespace characters.
- The '--genotypes' option identifies a file to be used to read the assumed-known genotypes. The file format is tab-delimited text with the following requirements:
- There should be a header line
- There should be one column named 'probeset-id' and one column named after each CEL file being analyzed.
- There should be one row per SNP being analyzed - this means for all SNPs for which genotype calls are being reported, along with any SNPs analyzed along the way for gender calling. So for example if using the --probest-ids option to analyze just a subset of SNPs and estimating gender from the het rate on chrX SNPs then there should be a row in the hints table for all the chrX SNPs used in gender calling, even if they aren't actually being directly requested or reported.
- Each entry in the table should be one of {-1,0,1,2} corresponding to {N,AA,AB,BB}
- The --analysis option in this example specifies a string that is based on the string encoded by the 'brlmm-p-plus' alias, suffixed with ".hints=1.CP=16".
- The "hints=1" modifier tells BRLMM-P that hints (known genotypes) are being provided.
- The "CP=16" sets the "contradiction penalty" - the amount by which the likelihood is penalized when a considered solution violates the provided hints. The scale of this penalty is such that a penalty of N has equivalent impact to a sqrt(N) sigma outlier on the likelihood - so setting it to 16 states that violations of the hints will act like a 4-sigma outlier on the model. Reducing the CP decreases the degree to which the supplied hints dominate, and in contexts where there are many high-quality samples to empirically guide the analysis and/or the quality of the supplied hints is not perfect users may wish to explore the effect on the posteriors of different settings of CP.
Affymetrix Power Tools (APT) Release apt-1.10.1
Generated on Mon Nov 3 12:21:41 2008 for Affymetrix Power Tools by
1.5.3