APT implements the Birdseed v1 algorithm, developed in collaboration with the Broad Institute, which Affymetrix has validated and supports for use with the SNP 6.0 array. APT also implements the BRLMM-P algorithm, which Affymetrix has validated and supports for the SNP 5.0 array.
Additionally, APT implements a newer version of Birdseed, accessed with the --analysis birdseed-v2 option, and the latest development edition of the Birdseed algorithm, accessed with the --analysis birdseed-dev option. Unlike the supported methods described above, the newer birdseed-v2 and birdseed-dev methods are not currently supported by Affymetrix. Moreover, the SNP priors and lists of "qualified" SNPs for Birdseed-dev are not currently available from Affymetrix for either SNP 5.0 or 6.0. Further information and support files for Birdseed-dev and Birdseed-v2 are available from the Broad Institute (http://www.broad.mit.edu/mpg/birdsuite/).
Future APT updates are expected to migrate improvements currently available via the birdseed-dev option into methods supported in the same manner as the rest of the APT methods.
As birdseed, brlmm and brlmm-p are model-based algorithms, they need to be run on multiple CEL files at once to estimate probe effect and SNP cluster parameters. For Mapping 500K data, it is advisable to run on at least 50 distinct samples (excluding replicates) and ideally on about 100. For Genome-Wide Human SNP 5.0 and 6.0 arrays, it is advisable to cluster with at least 44 genetically distinct samples, though adding more will continue to help, particularly in correctly calling rare genotypes.
The basic requirements for a run of apt-probeset-genotype are:
WARNING: apt-probeset-genotype will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name.
WARNING: Model files are algorithm specific. Birdseed model files must be used with the Birdseed analysis method and BRLMM-P model files with the brlmm-p method.
NOTE: On Windows, the DOS prompt does not support wildcard expansion. The preferred method is to supply a text file listing the paths to the CEL files via the '--cel-files' option (see below for details of the file format).
NOTE: Unlike Unix shells, the Windows DOS prompt also does not allow a command to be continued with the '\' character. In the examples below, omit the '\' character and enter everything on a single line.
apt-probeset-genotype \
  -o results_dir \
  -c Mapping250K_Sty.cdf \
  --chrX-snps Mapping250K_Sty.chrx \
  *.CEL
The output will consist of a report file with some summary statistics about each chip analyzed and a pair of tab-delimited text files with suffixes .calls.txt and .confidences.txt containing the genotype calls and their associated confidences.
On Windows, a command equivalent to the Mapping 500K example above would look like:
apt-probeset-genotype -c Mapping250K_Sty.cdf --chrX-snps Mapping250K_Sty.chrx -o results_dir --cel-files cel_file_list.txt
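The list file passed to --cel-files is plain text: the first line is the literal header cel_files, followed by one CEL file path per line. On a Unix-like shell (including bash under cygwin) it could be generated along the following lines. This is a minimal sketch; the function name write_cel_list and the directory argument are assumptions made for illustration and are not part of APT.

```shell
# Hypothetical helper (not part of APT): write a list file for --cel-files.
# Usage: write_cel_list CEL_DIR OUT_FILE
write_cel_list() {
  cel_dir=$1
  out_file=$2
  {
    # Required header line, as described for the --cel-files option.
    printf 'cel_files\n'
    # One CEL path per line. Only the upper-case .CEL extension is matched,
    # as in the examples in this manual.
    for f in "$cel_dir"/*.CEL; do
      if [ -f "$f" ]; then
        printf '%s\n' "$f"
      fi
    done
  } > "$out_file"
}
```

For example, write_cel_list /path/to/cels cel_file_list.txt produces a file that can then be supplied as --cel-files cel_file_list.txt.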
For Mapping 500K chips, apt-probeset-genotype processes 100 CEL files in 1-2 hours on a 3 GHz machine with 2 GB of RAM using local disk.
apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_5.cdf \
  --chrX-snps GenomeWideSNP_5.chrx \
  --read-models-brlmmp GenomeWideSNP_5.models \
  -a brlmm-p \
  *.CEL
Note in particular the use of the option "-a brlmm-p" which specifies that the BRLMM-P calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 chips).
apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_6.cdf \
  --set-gender-method cn-probe-chrXY-ratio \
  --chrX-probes GenomeWideSNP_6.chrXprobes \
  --chrY-probes GenomeWideSNP_6.chrYprobes \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --read-models-birdseed GenomeWideSNP_6.birdseed-v2.models \
  -a birdseed-v2 \
  *.CEL
Note in particular the use of the option "-a birdseed-v2" which specifies that the Birdseed calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 and 6.0 chips).
Also see the important notes regarding birdseed-v1, birdseed-v2, and birdseed-dev in the introduction above.
The following will give you the older birdseed (v1) behavior:
apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_6.cdf \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --read-models-birdseed GenomeWideSNP_6.birdseed.models \
  -a birdseed \
  *.CEL
apt-probeset-genotype \
  -s subset_sty.txt \
  -c Mapping250K_Sty.cdf \
  --chrX-snps Mapping250K_Sty.chrx \
  -o results_dir \
  *.CEL
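The file given to -s (--probeset-ids) is described in the usage text as a tab-delimited file with a 'probeset_id' column. A minimal sketch of producing such a file with a single column is shown here; the helper name write_probeset_ids is hypothetical, and the probeset ID in the usage note is only a placeholder taken from an error message quoted in the FAQ, not a statement about which SNPs are on a given chip.

```shell
# Hypothetical helper (not part of APT): write a --probeset-ids file
# from probeset IDs given as arguments.
# Usage: write_probeset_ids OUT_FILE ID [ID ...]
write_probeset_ids() {
  out_file=$1
  shift
  {
    # Column header expected by -s / --probeset-ids.
    printf 'probeset_id\n'
    # One probeset ID per line.
    for id in "$@"; do
      printf '%s\n' "$id"
    done
  } > "$out_file"
}
```

For example, write_probeset_ids subset_sty.txt SNP_A-1780432 writes a two-line file that could be passed as -s subset_sty.txt.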
See the apt-probeset-summarize manual for a more complete example of running an analysis on a compute farm.
The XDA CHP format is discouraged for the GenomeWide SNP 5.0 chips because it doesn't contain entries for SNP IDs; the identity of a SNP is inferred from its order in the file. In the case of the GenomeWide SNP 5.0 chips there are some SNPs that are not part of the default library file which some advanced users may choose to explore. This leads to the possibility of generating CHP files containing different SNP lists, something not well supported by the XDA CHP format. The AGCC CHP format has a slot for SNP IDs and thus is safer to use with chips for which users may be looking at different SNP lists.
For SNP 6.0, XDA CHP output is not allowed at all.
Details on the contents of the CHP files for various calling algorithms can be found below, and a full description of the XDA and AGCC CHP formats can be found in a local copy of the Affymetrix Developer's Network file format documentation.
APT is not supported through the Affymetrix call center, Field Application Specialists, or the standard Affymetrix Technical support channels.
If you encounter an issue please make sure to collect the following information and report the problem to devnet@affymetrix.com
apt-probeset-genotype - program for determining genotype calls
from Affymetrix SNP microarrays. The model based algorithms for
making calls (brlmm/brlmm-p/birdseed) require multiple cel files
to be analyzed at once to learn the parameters for each SNP.
usage:
BRLMM (500K arrays):
apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
-o out-dir/ *.cel
BRLMM-P (GenomeWide SNP 5.0 arrays):
apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
-o out-dir/ -a brlmm-p --read-models-brlmmp chip.models \
*.cel
Birdseed (GenomeWide SNP 6.0 arrays):
apt-probeset-genotype -c chip.cdf --special-snps chip.specialSNPs \
-o out-dir/ -a birdseed --read-models-birdseed chip.birdseed.models \
*.cel
See the apt-probeset-genotype manual for more information about
birdseed including the latest improvements from The Broad.
options:
Basic Info and Control Options
-h, --help This message. [default 'false']
--explain Explain a particular operation (i.e.
--explain brlmm or --explain brlmm-p).
[default '']
-v, --verbose How verbose to be with status messages 0 -
quiet, 1 - usual messages, 2 - more
messages. [default '1']
--version Output program version and quit. [default
'false']
-f, --force Disable various checks including chip
types. Consider using --chip-type option
rather than --force. [default 'false']
Input Options
--cel-files Text file specifying cel files to process,
one per line with the first line being
'cel_files'. [default '']
-c, --cdf-file File defining probe sets. Use either
--cdf-file or --spf-file. [default '']
--spf-file File defining probe sets in spf (simple
probe format) which is like a text cdf
file. [default '']
--chrX-snps File containing snps on chrX
(non-pseudoautosomal region). [default '']
--special-snps File containing all snps of unusual copy
(chrX,mito,Y) [default '']
--chrX-probes File containing probe_id (1-based) of
probes on chrX. Used for copy number probe
chrX/Y ratio gender calling. [Experimental]
[default '']
--chrY-probes File containing probe_id (1-based) of
probes on chrY. Used for copy number probe
chrX/Y ratio gender calling. [Experimental]
[default '']
-s, --probeset-ids Tab delimited file with column
'probeset_id' specifying probesets to
genotype. [default '']
--probeset-ids-reported Tab delimited file with column
'probeset_id' specifying probesets to
report. This should be a subset of those
specified with --probeset-ids if that
option is used. [default '']
--probe-class-file File containing probe_id (1-based) of
probes and a 'class' designation. Used to
compute mean probe intensity by class for
report file. [default '']
--chip-type Chip types to check library and CEL files
against. Can be specified multiple times.
The first one is propigated as the chip
type in the output files. Warning, use of
this option will override the usual check
between chip types found in the library
files and cel files. You should use this
option instead of --force when possible.
[default '']
--snp-annotation-file SNP Annotation file. [default '']
--cn-annotation-file CN Annotation file. [default '']
--genotype-markers-cn-file Tab delimited file with copy number calls
for genotype probesets within copy number
regions [default '']
Output Options
-o, --out-dir Directory to write result files into. Any
previous results in directory will be
overwritten. [default '.']
--table-output Output matching matrices of tab delimited
genotype calls and confidences. [default
'true']
--output-forced-calls Output a separate file with forced calls.
[default 'false']
--output-context Output a separate file with the allele
context used. This is only relevant for
marker type probesets which have multiple
groups of probes for each allele based on
the context of nearby SNPs. [default
'false']
--cc-chp-output Output resulting calls in directory called
'cc-chp' under out-dir. This makes one AGCC
Multi Data CHP file per cel file analyzed.
[default 'false']
--xda-chp-output Output resulting calls in directory called
'chp' under out-dir. This makes one GCOS
XDA CHP file per cel file analyzed. Note
that this format is not supported beyond
the Mapping500K chips, for subsequent chips
look at the CC CHP format instead. [default
'false']
--cc-chp-out-dir Over-ride the default location for chp
output. [default '']
--xda-chp-out-dir Over-ride the default location for chp
output. [default '']
--summaries Output the summary values from the
quantifcation method for each allele. For
brlmm-p this will also write a file of
transformed summary values in contrast
space used in the clustering. [default
'false']
Analysis Options
-a, --analysis String representing analysis pathway
desired. For example:
'quant-norm.sketch=50000,pm-only,brlmm'.
[default 'brlmm']
--qmethod-spec Quantification Method to use for
summarizing alleles. [default
'plier.optmethod=1']
--read-models-brlmm File to read precomputed BRLMM snp specific
models from. [default '']
--read-models-brlmmp File to read precomputed BRLMM-P snp
specific models from. [default '']
--read-models-birdseed File to read precomputed birdseed snp
specific models from. [default '']
--write-models Should we write snp specific models out for
analysis? [experimental] [default 'false']
--feat-effects Output feature effects when available.
[default 'false']
--use-feat-eff File defining a plier feature effect for
each probe. Note that precomputed effects
should only be used for an appropriately
similar analysis (i.e. feature effects for
pm-only may be different than for pm-mm).
[default '']
--residuals Output the residuals from the
quantification method if available.
[default 'false']
--target-sketch File specifying a target distribution to
use for quantile normalization. [default
'']
--write-sketch Write the quantile normalization
distribution (or sketch) to a file for
reuse with target-sketch option. [default
'false']
--dm-thresh Minimum DM p-value to seed clusters with.
[default '.17']
--dm-hetmult DM hetmultiplier to balance het/hom calls,
additive to log likelihood. [default '0']
--prior-size How many probesets to use for determining
prior. [default '0']
--list-sample Only sample for prior from list specified
via --probeset-ids, not entire chip.
[default 'false']
--read-priors-brlmm File to load BRLMM priors from. Prior
format is tab separated id, center, var,
and center.var. [default '']
--write-prior Write prior out to file in output-dir.
[default 'false']
--norm-size Do contrast normalization using a sample of
this many snps (brlmm-p) [default '0']
--write-norm Write covariate norm fcns to file [default
'false']
--set-analysis-name Explicitly set the analysis name. This
affects output file names (ie prefix) and
various meta info. [default '']
Gender Options
--set-gender-method Explicitly force the use of a particular
gender method for genotype calling. Valid
values include: cn-probe-chrXY-ratio,
dm-chrX-het-rate,
em-cluster-chrX-het-contrast,
user-supplied, and none. If you are
supplied seed genotype calls, you can also
use supplied-genotypes-chrX-het-rate. When
not set, the default behavior depends on
the analysis. [default '']
--read-genders Explicitly read genders from a file.
[default '']
--no-gender-force Perform analysis even without a suitable
gender method for genotype calling.
[default 'false']
--em-gender Enable EM Gender calling if special-snps or
chrX-snp file is provided. [default 'true']
--female-thresh Threshold for calling females when using
cn-probe-chrXY-ratio method. [default
'0.48']
--male-thresh Threshold for calling females when using
cn-probe-chrXY-ratio method. [default
'0.71']
Advanced Options
--kill-list Do not use the probes specified in file for
computing results. [experimental] [default
'']
--dm-out Output any initial seed calls used by BRLMM
(seed default is DM calls). Only relevant
for BRLMM. [default 'false']
--all-types Try and analyze all probeset types rather
than just genotyping. [Experimental]
[default 'false']
--genotypes File to read seed genotypes from instead of
using DM to generate. [experimental]
[default '']
--select-probes Output estimates of which probes are most
accurate [default 'false']
--call-coder-max-alleles For encoding/decoding calls, the max number
of alleles per marker to allow. [default
'6']
--call-coder-type The data size used to encode the call.
[default 'UCHAR']
--call-coder-version The version of the encoder/decoder to use
[default '1.0']
Execution Control Options
--mem-usage How much memory (RAM) to use for this job
in megabytes. Only relevant when
--use-disk=false. [default '0']
--block-size How many probesets to process at once,
useful when memory is limited. If set to 0
program attempts to guess available RAM and
set appropriately. Only relevant if
--use-disk is false. [default '0']
--max-block-size This sets a cap on how high the
blockSize*celFiles can go. When set to 0
there is no cap. Only relevant if
--use-disk is false. [default '0']
--use-disk Use disk based representation to avoid
excessive RAM use. [default 'true']
--disk-dir Directory for temporary files when working
off disk. Using network mounted drives is
not advised. When not set, the output
folder will be used. [default '']
--disk-cache Size of memory cache when working off disk
in megabytes. [default '50']
A5 output options
--a5-global-file Filename for the A5 global output file.
[Experimental] [default '']
--a5-global-file-no-replace Append or create rather than replace.
[Experimental] [default 'false']
--a5-group Group name where to put results in the A5
output files. [Experimental] [default '']
--a5-calls Output the genotype calls and confidences
in A5 format. [Experimental] [default
'false']
--a5-calls-use-global Use the global A5 file for calls and
confidences.[Experimental] [default
'false']
--a5-summaries Output the summary values from the
quantifcation method for each allele in A5
format. [Experimental] [default 'false']
--a5-summaries-use-global Use the global A5 file for summaries.
[Experimental] [default 'false']
--a5-feature-effects Output feature effects in A5 format.
[Experimental] [default 'false']
--a5-feature-effects-use-global Use the global A5 file for feature
effects.[Experimental] [default 'false']
--a5-residuals Output feature level residuals in A5
format. [Experimental] [default 'false']
--a5-residuals-use-global Use the global A5 file for residuals.
[Experimental] [default 'false']
--a5-sketch Output normalization sketch in A5 format.
--write-sketch option will override this
option. [Experimental] [default 'false']
--a5-sketch-use-global Put the sketch in the global A5 output
file. [Experimental] [default 'false']
--a5-write-models Output genotype models/posteriors in A5
format. --write-models option will override
this option. [Experimental] [default
'false']
--a5-write-models-use-global Put the models in the global A5 output
file. [Experimental] [default 'false']
A5 input options
--a5-global-input-file Filename for the group in the global input
file.[Experimental] [default '']
--a5-input-group Group name for input. Defaults to
--a5-group or if that is not set, then '/'.
[Experimental] [default '']
--a5-sketch-input-global Read the sketch from the global A5 input
file. [Experimental] [default 'false']
--a5-sketch-input-file Read the sketch from the an A5 input file.
[Experimental] [default '']
--a5-sketch-input-group Group name to read the sketch from.
Defaults to --a5-input-group.
[Experimental] [default '']
--a5-sketch-input-name The name of the data section. Defaults to
'target-sketch'. [Experimental] [default
'']
--a5-feature-effects-input-global Read the feature effects global A5 input
file. [Experimental] [default 'false']
--a5-feature-effects-input-file Read the feature effects from the an A5
input file. [Experimental] [default '']
--a5-feature-effects-input-group Group name to read the feature effects
from. Defaults to --a5-input-group.
[Experimental] [default '']
--a5-feature-effects-input-name The name of the data section. Defaults to
XXX.feature-response where XXX is the
analysis name and quant method. IE
'brlmm-p.plier'. [Experimental] [default
'']
--a5-models-input-global Read the Models from the global A5 input
file. The tsv5 name must be
'XXX.snp-posteriors'. [Experimental]
[default 'false']
--a5-models-input-file Read the models from the an A5 input file.
[Experimental] [default '']
--a5-models-input-group The group name where the models are
located. Defaults to the analysis name.
[Experimental] [default '']
--a5-models-input-name The name of the data section. Defaults to
XXX.snp-posteriors where XXX is the
analysis name. IE 'brlmm-p'. [Experimental]
[default '']
Standard Methods:
'birdseed' quant-norm.sketch=50000,pm-only,birdseed
'birdseed-dev' quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev
'birdseed-dev.force' quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev.conf-threshold=1
'birdseed-v1' quant-norm.sketch=50000,pm-only,birdseed-v1
'birdseed-v1.force' quant-norm.sketch=50000,pm-only,birdseed-v1.conf-threshold=1
'birdseed-v2' quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2
'birdseed-v2.force' quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2.conf-threshold=1
'birdseed.force' quant-norm.sketch=50000,pm-only,birdseed.conf-threshold=1
'brlmm' quant-norm.sketch=50000,pm-only,brlmm.transform=ccs.K=4
'brlmm-p' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
'brlmm-p-plus' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05
'brlmm-p-plus.force' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1
'brlmm-p.force' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=1
Data transformations:
rma-bg Performs an RMA style background adjustment as
described in Irizarry et al 2003.
quant-norm Class for doing quantile normalization. Can do
sketch and full quantile (just set sketch to
chip size or zero) and supports bioconductor
compatibility.
med-norm Class for doing median normalization. Adjust
intensities such that all chips have the same
median (or average).
adapter-type-norm Class for doing adapter type normalization.
Adjust intensities by adapter type.
mas5-bg Performs a MAS 5 background adjustment as
described in Liu et al, Bioinformatics (2002).
gc-bg Subtract bacground based on median intensity
of probes with similar GC content.
Pm Intensity Adjustments:
pm-only No adjustment. Just uses unmodified PM intensity values.
pm-mm Use mismatch probe as adjustment for perfect match. Has
strength of being unbiased, but often the mismatch probe
binds the match target.
pm-gcbg Do an adjustment based on the median intensity of probes
with similar GC content.
pm-sum Add itensity of PM probe for other allele to PM probes.
Quantification Methods:
plier The PLIER (Probe Logarithmic Error Intensity
Estimate) method produces an improved signal by
accounting for experimentally observed patterns in
feature behavior and handling error at the
appropriately at low and high signal values. This
version of PLIER differs from the previous version
by the addition of a SafteyZero, NumericalTolerance,
and FixPrecomputed. These options are intended to
improve the stability of PLIER results when using
precomputed feature reponse values. To get the older
PLIER behavior set SafetyZero to 0.0,
NumericalTolerance to 0.0, and FixPrecomputed to
false.
sea The SEA (Simplified Expression Analysis) method
provides a simple signal estimate, using the
initialization algorithm from the PLIER (Probe
Logarithmic Error Intensity Estimate) method and
omitting the PLIER parameter fitting. SEA is useful
for single chip signal estimation. The version of
PLIER used by SEA differs from the previous version
by the addition of a SafteyZero, NumericalTolerance,
and FixPrecomputed. These options are intended to
improve the stability of PLIER results when using
precomputed feature reponse values. To get the older
PLIER behavior set SafetyZero to 0.0,
NumericalTolerance to 0.0, and FixPrecomputed to
false.
iter-plier Do probe set quantification estimate by iteratively
calling PLIER with the probes that best correlate
with signal estimate. The version of PLIER used by
IterPLIER differs from the previous version by the
addition of a SafteyZero, NumericalTolerance, and
FixPrecomputed. These options are intended to
improve the stability of PLIER results when using
precomputed feature reponse values. To get the older
PLIER behavior set SafetyZero to 0.0,
NumericalTolerance to 0.0, and FixPrecomputed to
false.
med-polish Performs a median polish to estimate target and
probe effects. Resulting summaries are in log2 space
by default. Used in summary step of RMA as described
in Irizarry et al 2003.
dabg Calculates the p-value that the intensities in a
probeset could have been observed by chance in a
background distribution. Used as a substitute for
standard absent/present calls when mismatch probes
are not available.
mas5-detect Calculates the p-value for detection of an expressed
gene using the MAS 5.0 algorithm. This is a
rank-based algorithm, using discrimination scores,
described in Liu et al., Bioinformatics (2002)
18:1593 and the Statistical Algorithms Reference
Guide.
mas5-signal Calculates the average measurement for a probeset
using the MAS 5.0 algorithm. This is based on a
robust estimator, Tukey's biweight, described in
Hubbell et al., Bioinformatics (2002) 18:1585 and
the Statistical Algorithms Reference Guide. WARNING:
The implementation in APT does not allow for signal
level normalization across the chip. See the FAQ
item in the manual.
avgdiff Calculates the average measurement for a probeset
using the MAS 4 average difference algorithm, namely
the average difference between the pm and mm probe
signal.
median Use the median of probes for a particular chip as
the summary.
Analysis Streams:
expr Does expression summarization on probesets.
pca-select Determines PCA for probes and picks probes that are
near the principal component as the probes to use
for downstream analysis.
spect-select Picks probes that are similar to each other based
on spectral cluster and normalized cut.
version: apt-1.10.1 $Id: apt-probeset-genotype.cpp,v 1.236 2008/10/25 06:08:55 awilli Exp $
A. See the FAQ item on probe IDs for more info.
Q. What do I do when I don't have enough memory to process all the data? (when --use-disk=false)
A. Starting with release 1.10.0, apt-probeset-genotype defaults to using temporary files rather than trying to keep everything in memory. The option "--use-disk=false" forces the older in-memory mode, which is sensitive to how much memory you have. To control how much memory is used in the in-memory mode, set the --block-size option to specify how many probesets are processed at once; the program then reduces memory use by loading only those probesets into RAM. If --block-size is unset, the program attempts to determine how much RAM is available and run within it. To fit in memory, the program will often need to read the original CEL files multiple times. Also, if you are doing a quantile normalization, try using a sketch (or subset) of the chip for the normalization. Sketch normalization is the default, so this applies only if you are using non-default options.
Q. How can I make apt-probeset-genotype run more probesets per iteration? (when --use-disk=false)
A. Starting with release 1.10.0, apt-probeset-genotype defaults to using temporary files and a single iteration. If you use --use-disk=false to override this behavior, you can set the --block-size option manually to prevent apt-probeset-genotype from guessing the number of probesets to run per iteration. It will use the supplied value instead.
Q. The program died with an error message like "Assertion failed: A->probes.size() == 2, file ...cpp." What does this mean?
A. This is symptomatic of trying to run BRLMM for a SNP with no MM probes. In its typical mode of running, BRLMM relies on DM to generate initial seed calls, and the DM algorithm requires MM probes.
Q. The program died with an error message like "DmListenergetGenoCall() - Can't find genotypes for name: SNP_A-1780432". What does this mean?
A. This is symptomatic of having specified the wrong chrX file for the analysis. In order to reduce the likelihood of accidentally using the wrong chrX file apt-probeset-genotype checks to make sure that all the SNPs specified in the chrX file are present on the chips being analyzed. If it finds a SNP present in the chrX file that is not identified in the CDF file it will die with the above message. Note that if you want to bypass the requirement of a chrX file you can use the --no-gender-force option.
Q. The program died and I got an error message saying "Killed". What does this mean and what can I do?
A. Linux has a "feature" that it will promise more memory than it actually has in the hope that many programs won't actually be using all their memory at once. However, if Linux does run short of memory it will start killing programs arbitrarily. You can read more about Linux's OOM (out of memory) killer at LWN.net.
Q. Why does apt-probeset-genotype require information regarding SNPs on chromosomes X/Y/Mito?
A. The SNPs on chromosome X are evaluated separately for XX (female) and XY (male) individuals, as the intensity estimates for the males will generally be lower on X due to one missing chromosome. The prior is also adjusted to remove the het center, as XY individuals should only have hom calls on the X chromosome. For BRLMM analyses, gender is estimated using the method employed in the GTYPE software: individuals are called male if less than 7.5% of the SNPs on X are called as hets by the initial DM calls using a .33 confidence threshold. For BRLMM-P, gender is estimated by using an Expectation Maximization (EM) algorithm on the PM probes for chrX SNPs to estimate the het rate.
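The BRLMM het-rate rule described above reduces to a single comparison. The sketch below only illustrates that comparison and is not APT's implementation: the function name call_gender_by_het_rate is hypothetical, the het and total counts are assumed to have been tallied already from the initial DM calls, and the "unknown" result for an empty chrX list is an assumption added for safety.

```shell
# Illustration only (not APT's implementation): apply the 7.5% chrX
# het-rate rule to counts that are assumed to be already available.
# Usage: call_gender_by_het_rate HET_CALLS TOTAL_CHRX_SNPS
call_gender_by_het_rate() {
  awk -v het="$1" -v total="$2" 'BEGIN {
    # Assumption: report "unknown" when there are no chrX SNPs to inspect.
    if (total <= 0) { print "unknown"; exit }
    # Male if less than 7.5% of the chrX SNPs were called het.
    if (het / total < 0.075) print "male"; else print "female"
  }'
}
```

With 10 het calls out of 1000 chrX SNPs the het rate is 1%, so the rule gives male; with 300 out of 1000 it gives female.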
Q. How is the mask section in the CEL file used?
A. It is not. The contents of this section of the CEL file are ignored.
Q. How can I find out more information about the analysis string:
quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
A. Start with the included manual for apt-probeset-genotype. Usage information is also provided if you run apt-probeset-genotype without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:
apt-probeset-genotype --explain brlmm-p
Q. Wild cards do not work on windows. For example:
apt-probeset-genotype ... *.CEL
A. APT relies on the command shell to do the wildcard expansion (e.g. the bash shell on Unix-like systems). The Windows shell does not do wildcard expansion, so no expansion takes place when APT is run from the Windows shell. You may want to try a different Windows shell, or bash via cygwin. Alternatively, use the --cel-files option to specify the CEL files for analysis.
The following parameters are saved in the CCHPFileHeader object:
A complete explanation of the XDA CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.
Use of the clustering_space_x_value and clustering_space_y_value fields allows for plotting the data in the space that was used to perform the clustering. For BRLMM and BRLMM-P the x-value is 'transformed contrast' and the y-value is 'signal strength' - see the BRLMM and BRLMM-P whitepapers for more detail. For Birdseed (see http://www.broad.mit.edu/mpg/birdsuite/) this is A-signal vs. B-signal (linear scale, post quantile normalization and allele-specific median-polish).
A complete explanation of the AGCC CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.
The most common challenge people have running apt-probeset-genotype is with RAM. apt-probeset-genotype will attempt to split up jobs into the amount of RAM that appears free on your computer. The job is split by subdividing the analysis into blocks of probesets (or SNPs). Small block sizes will subdivide the job more and require less memory at the expense of having to read the CEL files more frequently. On the other hand, using a smaller number of large blocks will require more memory but will place minimal load on reading CEL files.
The default behaviour (when forcing in-memory mode with --use-disk=false) is for apt-probeset-genotype to estimate the optimal block size based upon the amount of memory that appears to be free at the time the job begins. This default behaviour can be overridden with the --block-size option, which specifies the number of probesets to be processed at one time. For example, specifying --block-size=20000 will analyze the data in batches of 20,000 probesets at a time.
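To get a feel for the trade-off, the number of batches implied by a given --block-size is a ceiling division of the probeset count by the block size. The sketch below is arithmetic only; the function name count_blocks and the probeset count used in the example are hypothetical, and treating each batch as roughly one extra pass over the CEL files is an assumption rather than documented behaviour.

```shell
# Illustration only: number of batches for a given --block-size,
# computed as a ceiling division. The counts used are hypothetical.
# Usage: count_blocks NUM_PROBESETS BLOCK_SIZE
count_blocks() {
  echo $(( ($1 + $2 - 1) / $2 ))
}
```

For a hypothetical 250000 probesets and --block-size=20000, count_blocks 250000 20000 prints 13: twelve full batches plus one partial batch.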
A problem that may be encountered (especially on a multi-user or multi-processor system) is running out of memory when a run of apt-probeset-genotype is initiated and then another big-memory process is started afterwards. In this circumstance the first instance of apt-probeset-genotype sees substantial free memory and chooses a large block size, but then the second process grabs more of the memory and the first run of apt-probeset-genotype runs out of memory. This problem can be addressed by planning the work load on your machine and/or using an appropriately small block size with the --block-size option.
RAM usage (in bytes) for Mapping 500K data can be estimated by the following equation:
Below are some guidelines about how many probesets to run at once (i.e. the --block-size) in 1.9 Gig of RAM as a function of the number of CEL files:
Note that the above recommendations won't use all of the 1.9 Gig of RAM. In addition to needing a relatively large amount of memory, the program also needs relatively large blocks of contiguous memory, and as RAM usage approaches the maximum available these get harder and harder to find. If you have memory to spare, the amount of RAM needed to run all the data at once, as a function of the number of chips, is:
Note that on most 32 bit (i.e. Pentium, Xeon, Windows) systems you can't use more than ~2 Gig of RAM with a single process, even if there is more available.
The current full default brlmm analysis is: 'quant-norm.sketch=50000,pm-only,brlmm' where there can be multiple chipstream modules (in this case a single quant-norm) separated by commas and the last two entries are the pm adjuster (pm-only) and quantification method (brlmm). Parameters to a particular step in the analysis are supplied in key=value pairs and separated by periods. For example 'quant-norm.sketch=50000' indicates that the chips should be quantile normalized and that a sketch (subset of total data) of size 50000 should be used to do the normalization. Using a sketch can significantly reduce the amount of memory needed with minimal impact on normalization values. To do quantile normalization with just the PM probes and resolve ties in the same manner as bioconductor's RMA version of quantile normalization you would specify 'quant-norm.sketch=50000.bioc=true.usepm=true'. All of the parameters possible can be seen by using the --explain option in conjunction with the name of the module (i.e. apt-probeset-genotype --explain quant-norm).
A few examples of custom analyses:
'pm-only,brlmm.transform=rvt' - No normalization, use rvt space for clustering in brlmm.
'med-norm,pm-mm,brlmm.het-mult=.9' - Do a median normalization, use a PM-MM adjustment for probes and a het multiplier of .9 to try and balance hom/het calls.
'rma-bg,quant-norm.sketch=50000.usepm=true.bioc=true,pm-only,brlmm.K=4.transform=CCS' - Do an RMA style background subtraction followed by an RMA style quantile normalization using a subset of 50000 data points, followed by brlmm in CCS (contrast centers space) with K = 4.
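The comma-separated structure of an analysis string can be made visible with a small helper. The sketch below is an illustration, not APT's parser: the function name describe_analysis and the role labels it prints are made up here. It splits on commas only and deliberately leaves the period-separated key=value parameters intact, since values such as SB=0.003 themselves contain periods.

```shell
# Illustration only (not APT's parser): split an analysis string into
# its comma-separated stages. The last stage is the quantification or
# calling method, the one before it is the PM adjuster, and anything
# earlier is a chipstream module.
# Usage: describe_analysis 'quant-norm.sketch=50000,pm-only,brlmm'
describe_analysis() {
  printf '%s\n' "$1" | awk -F',' '{
    for (i = 1; i <= NF; i++) {
      if (i == NF)          role = "method"
      else if (i == NF - 1) role = "pm-adjust"
      else                  role = "chipstream"
      print role ": " $i
    }
  }'
}
```

Running describe_analysis 'quant-norm.sketch=50000,pm-only,brlmm' prints three lines: "chipstream: quant-norm.sketch=50000", "pm-adjust: pm-only" and "method: brlmm". For the authoritative list of parameters accepted by each stage, the --explain option remains the reference.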
Use the --explain option to get more information on what parameters are available for the various methods. For example, "--explain brlmm", "--explain brlmm-p", and "--explain birdseed".
The two input values denote the intensity of the A and B alleles, respectively, as estimated by the quantification method (such as plier or RMA).
The two output values denote the new coordinates that the A and B allele intensities will be transformed into.