apt-probeset-genotype is a program for making genotype calls from Affymetrix SNP microarrays. It currently implements three different genotype calling algorithms:
APT implements the Birdseed v1 algorithm, developed in collaboration with the Broad Institute, which Affymetrix has validated and supports for use with the SNP 6.0 array. APT also implements the BRLMM-P algorithm, which Affymetrix has validated and supports for the SNP 5.0 array.
Additionally, APT implements a newer version of Birdseed, selected with the --analysis birdseed-v2 option, and the latest development edition of the Birdseed algorithm, selected with the --analysis birdseed-dev option. Affymetrix does not currently support the birdseed-v2 and birdseed-dev methods, in contrast with the supported methods described above. Moreover, the SNP priors and lists of "qualified" SNPs for Birdseed-dev are not currently available from Affymetrix for either SNP 5.0 or 6.0. Further information and support files for Birdseed-dev and Birdseed-v2 are available from the Broad Institute (http://www.broad.mit.edu/mpg/birdsuite/).
Future APT updates are expected to migrate improvements currently available via the birdseed-dev option into methods supported in the same manner as the rest of the APT methods.
As birdseed, brlmm and brlmm-p are model-based algorithms, they need to be run on multiple CEL files at once to estimate probe effects and SNP cluster parameters. For Mapping 500K data it is advisable to run on at least 50 distinct samples (excluding replicates) and ideally on about 100. For Genome-Wide Human SNP 5.0 and 6.0 arrays it is advisable to cluster with at least 44 genetically distinct samples, though adding more will continue to be of benefit, in particular for correctly calling rare genotypes.
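As a rough pre-flight check for the sample-size guidance above, the CEL files in a directory can be counted before a run is started. This is only a sketch: the helper name count_cels and the CEL_DIR variable are illustrative and not part of APT, and a file count is only a proxy, because replicates of the same sample do not count as distinct samples.

```shell
# Count the *.CEL files in a directory (illustrative helper, not part of APT).
count_cels() {
    n=0
    for f in "$1"/*.CEL; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

min_samples=44                 # minimum advised above for SNP 5.0 / 6.0 arrays
cel_dir="${CEL_DIR:-.}"        # hypothetical: directory holding the CEL files
found=$(count_cels "$cel_dir")
if [ "$found" -lt "$min_samples" ]; then
    echo "warning: $found CEL files found; at least $min_samples distinct samples are advised" >&2
fi
```

For Mapping 500K data the threshold would be raised to 50, in line with the text above.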
We illustrate the most basic way to run apt-probeset-genotype with some examples.
The basic requirements for a run of apt-probeset-genotype are:
WARNING: apt-probeset-genotype will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name.
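One way to guard against accidental overwriting is to pick an unused output directory name before each run. The sketch below assumes a POSIX shell; the helper name pick_out_dir and the numbered-suffix scheme are illustrative choices, not APT behavior.

```shell
# Print the first of base, base_1, base_2, ... that is absent or empty
# (illustrative helper, not part of APT).
pick_out_dir() {
    base="$1"
    candidate="$base"
    i=1
    # Keep looking while the candidate exists and has something in it.
    while [ -e "$candidate" ] && [ -n "$(ls -A "$candidate" 2>/dev/null)" ]; do
        candidate="${base}_${i}"
        i=$((i + 1))
    done
    echo "$candidate"
}

out_dir=$(pick_out_dir results_dir)
echo "using output directory: $out_dir"
```

The chosen name would then be passed to apt-probeset-genotype with the -o (--out-dir) option.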
WARNING: Model files are algorithm specific. Birdseed model files must be used with the Birdseed analysis method and BRLMM-P model files with the brlmm-p method.
NOTE: On Windows the DOS prompt does not support wildcard expansion, and the preferred method is to supply a text file with the paths to the CEL files via the '--cel-files' option (see below for details of the file format).
NOTE: The Windows DOS prompt also does not allow a command to be continued with the '\' character, unlike Unix. So in the examples below the '\' character should be omitted and everything entered on a single line.
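The file expected by '--cel-files' is plain text with the first line 'cel_files' followed by one CEL path per line, as described in the option listing. On a Unix-like shell (including bash via cygwin) it could be generated roughly as follows; the helper name write_cel_list and the CEL_DIR variable are illustrative and not part of APT, and only files ending in upper-case .CEL are matched.

```shell
# Write a --cel-files list: a 'cel_files' header line, then one CEL path per line.
write_cel_list() {
    cel_dir="$1"
    list_file="$2"
    printf 'cel_files\n' > "$list_file"
    for f in "$cel_dir"/*.CEL; do
        # An unmatched glob stays literal, so check that the file exists.
        if [ -e "$f" ]; then
            printf '%s\n' "$f" >> "$list_file"
        fi
    done
}

# CEL_DIR is a hypothetical variable; the current directory is used if it is unset.
write_cel_list "${CEL_DIR:-.}" cel_file_list.txt
```

On Windows without such a shell the same file can simply be written by hand in a text editor.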
The command below runs the Axiom™ GT1 algorithm on Axiom™ arrays. For full details on the use of Axiom™ GT1 in apt-probeset-genotype refer to the vignette on genotype clustering for Axiom™ arrays.
apt-probeset-genotype \
--analysis-files-path /library/file/path \
--xml-file Axiom_GW_Hu_SNP.r2.apt-probeset-genotype.AxiomGT1.xml \
--out-dir out \
--cel-files cel_file_list.txt
On Unix systems a basic command using the default parameters to do a run on SNP 6.0 data using birdseed (v2) would look like:
apt-probeset-genotype \
-o results_dir \
-c GenomeWideSNP_6.cdf \
--set-gender-method cn-probe-chrXY-ratio \
--chrX-probes GenomeWideSNP_6.chrXprobes \
--chrY-probes GenomeWideSNP_6.chrYprobes \
--special-snps GenomeWideSNP_6.specialSNPs \
--read-models-birdseed GenomeWideSNP_6.birdseed-v2.models \
-a birdseed-v2 \
*.CEL
Note in particular the use of the option "-a birdseed-v2" which specifies that the Birdseed calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 and 6.0 chips).
Also see the important notes regarding birdseed-v1, birdseed-v2, and birdseed-dev in the introduction above.
The following will give you the older birdseed (v1) behavior:
apt-probeset-genotype \
-o results_dir \
-c GenomeWideSNP_6.cdf \
--special-snps GenomeWideSNP_6.specialSNPs \
--read-models-birdseed GenomeWideSNP_6.birdseed.models \
-a birdseed \
*.CEL
On Unix systems a basic command using the default parameters to do a run on SNP 5.0 data would look like:
apt-probeset-genotype \
-o results_dir \
-c GenomeWideSNP_5.cdf \
--chrX-snps GenomeWideSNP_5.chrx \
--read-models-brlmmp GenomeWideSNP_5.models \
-a brlmm-p \
*.CEL
Note in particular the use of the option "-a brlmm-p" which specifies that the BRLMM-P calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 chips).
On Unix systems a basic command using the default parameters to do a run on Mapping 500K data would look like:
apt-probeset-genotype \
-o results_dir \
-c Mapping250K_Sty.cdf \
--chrX-snps Mapping250K_Sty.chrx \
*.CEL
The output will consist of a report file with some summary statistics about each chip analyzed and a pair of tab-delimited text files with suffixes .calls.txt and .confidences.txt containing the genotype calls and their associated confidences.
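As an illustration of working with the tab-delimited calls table, the sketch below counts no-calls per CEL file with awk. The layout is an assumption rather than something stated in this manual: comment lines starting with '#', a header row with 'probeset_id' followed by one column per CEL file, and -1 used as the no-call code. The toy file, its IDs and the helper name nocall_counts are made up for the example.

```shell
# Build a toy calls table (assumed layout; hypothetical IDs and values).
printf '# hypothetical comment line\n' > toy.calls.txt
printf 'probeset_id\ta.CEL\tb.CEL\n' >> toy.calls.txt
printf 'SNP_X1\t0\t1\nSNP_X2\t-1\t2\nSNP_X3\t2\t-1\nSNP_X4\t1\t-1\n' >> toy.calls.txt

# Print: sample name, number of no-calls (-1, assumed), number of probesets.
nocall_counts() {
    awk -F'\t' '
        /^#/ { next }                                   # skip comment lines
        !seen_header {                                  # first non-comment line is the header
            for (i = 2; i <= NF; i++) name[i] = $i
            n = NF; seen_header = 1; next
        }
        {
            for (i = 2; i <= n; i++) if ($i == -1) nocall[i]++
            total++
        }
        END {
            for (i = 2; i <= n; i++) printf "%s\t%d\t%d\n", name[i], nocall[i], total
        }
    ' "$1"
}

nocall_counts toy.calls.txt
```

For the toy data this reports 1 no-call out of 4 for a.CEL and 2 out of 4 for b.CEL. The real file should be inspected before relying on the assumed layout.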
On Windows a command equivalent to the example above for Mapping 500K would look like:
apt-probeset-genotype -c Mapping250K_Sty.cdf --chrX-snps Mapping250K_Sty.chrx -o results_dir --cel-files cel_file_list.txt
For Mapping 500K chips apt-probeset-genotype runs 100 CELs in 1-2 hours on a 3GHz 2Mb RAM machine using local disk.
Building upon the examples above, here is an example in which only a subset of SNPs are analyzed and the results are written to a text table of genotype calls and a text table of call confidences. The subset of SNPs to be analyzed is specified in a tab-delimited text file called subset_sty.txt, which must contain a column named 'probeset_id'.
apt-probeset-genotype \
-s subset_sty.txt \
-c Mapping250K_Sty.cdf \
--chrX-snps Mapping250K_Sty.chrx \
-o results_dir \
--list-sample \
*.CEL
Note: The --list-sample option is required when subsets are used with the (default) BRLMM analyses and should be omitted for other analysis types. For BRLMM, the number of probes used for generating priors is 10,000 by default. If a subset of fewer than 10,000 probes is used, use the --prior-size option to specify a number less than or equal to the subset size.
See the apt-probeset-summarize manual for a more complete example of running an analysis on a compute farm.
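The subset-size rule in the note can be scripted. The sketch below writes a small subset file with the required 'probeset_id' header, counts its entries, and adds --prior-size only when the subset is smaller than the 10,000-probe default. The file name subset_example.txt and the second probeset ID are placeholders, and the command is only printed, not run.

```shell
subset_file="subset_example.txt"   # hypothetical file name
# Header 'probeset_id', then one probeset ID per line (IDs are placeholders).
printf 'probeset_id\nSNP_A-1780432\nSNP_A-1780433\n' > "$subset_file"

# Number of probesets = number of lines minus the header line.
n_probesets=$(( $(wc -l < "$subset_file") - 1 ))
default_prior_size=10000

extra_args=""
if [ "$n_probesets" -lt "$default_prior_size" ]; then
    # Use the whole subset for the prior; any value <= n_probesets would do.
    extra_args="--prior-size $n_probesets"
fi

# Print the resulting BRLMM command line instead of executing it.
echo apt-probeset-genotype -s "$subset_file" -c Mapping250K_Sty.cdf \
    --chrX-snps Mapping250K_Sty.chrx -o results_dir --list-sample \
    $extra_args --cel-files cel_file_list.txt
```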
In previous versions of apt-probeset-genotype the default output format for genotype calls was the XDA CHP format (also known as GCOS CHP format). For the GenomeWide SNP 5.0, SNP 6.0 and subsequent WGSA products the use of the XDA CHP format is strongly discouraged; we recommend the newer AGCC CHP format instead. To help avoid accidental use of the XDA CHP format, the default output format has been changed to tab-delimited text tables of calls and confidences. The creation of the text table output can be suppressed with the --no-table-output option, and the two CHP output formats can be selected with the --xda-chp-output and --cc-chp-output options.
The reason that the XDA CHP format is discouraged for the GenomeWide SNP 5.0 chips is that it doesn't contain entries for SNP IDs; the identity of a SNP is inferred from its order in the file. In the case of the GenomeWide SNP 5.0 chips there are some SNPs that are not part of the default library file which some advanced users may choose to explore. This leads to the possibility of generating CHP files containing different SNP lists, something not well supported by the XDA CHP format. The AGCC CHP format has a slot for SNP IDs and thus is safer to use with chips for which users may be looking at different SNP lists.
For SNP 6.0 XDA CHP file format output is not allowed at all.
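The format guidance above could be captured in a small wrapper that picks the CHP flag from the chip type. This is a sketch under assumptions: the helper name chp_flag_for is invented, the chip type strings are taken from the library file names used in the examples, and choosing XDA for Mapping chips is for illustration only.

```shell
# Pick a CHP output flag from the chip type (illustrative helper, not part of APT).
chp_flag_for() {
    case "$1" in
        Mapping*) echo "--xda-chp-output" ;;  # XDA CHP is limited to the Mapping chips
        *)        echo "--cc-chp-output" ;;   # AGCC CHP for SNP 5.0, SNP 6.0 and later
    esac
}

chp_flag_for Mapping250K_Sty    # prints --xda-chp-output
chp_flag_for GenomeWideSNP_6    # prints --cc-chp-output
```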
Details on the contents of the CHP files for various calling algorithms can be found below, and a full description of the XDA and AGCC CHP formats can be found in a local copy of the Affymetrix Developer's Network file format documentation.
Support for APT is handled through the Affymetrix Developer Network. Specifically, questions, problems, feature requests, and other inquiries should be made through either the APT User Forum or the Developer Network email address, devnet@affymetrix.com. (If you get an Internal Server Error when accessing the forum, try clearing your cookies for affymetrix.com.) To get email updates about APT or to view previous APT announcements see the APT User Forum.
APT is not supported through the Affymetrix call center, Field Application Specialists, or the standard Affymetrix Technical support channels.
If you encounter an issue, please make sure to collect the following information and report the problem to devnet@affymetrix.com.
apt-probeset-genotype creates a summary report file in the output directory with file name extension '.report.txt'. The report file contains some summary information about each chip analyzed and is useful in getting a quick overview of the CELs analyzed. The format of the file is tab-delimited text with a header line followed by a line for each CEL file analyzed. The columns are all explained below; most users will be mainly interested in the first few entries. The additional entries are provided as potentially useful metrics to track and identify outlier chips and are expected to be mainly of interest to advanced users. The column entries are:
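Because the report is tab-delimited with a header line, its columns can be listed with a short awk command. The sketch below runs on a toy file; the column names and values are placeholders, and the skipping of leading '#' comment lines is an assumption about the file rather than something stated here. The helper name list_columns is illustrative.

```shell
# Toy stand-in for a '.report.txt' file (hypothetical columns and values).
printf '# hypothetical comment line\n' > toy.report.txt
printf 'cel_files\tcomputed_gender\tcall_rate\n' >> toy.report.txt
printf 'a.CEL\tfemale\t98.5\nb.CEL\tmale\t97.9\n' >> toy.report.txt

# Print each column name of the header line with its 1-based index.
list_columns() {
    awk -F'\t' '
        /^#/ { next }                                   # skip comment lines (assumed)
        { for (i = 1; i <= NF; i++) print i ": " $i; exit }
    ' "$1"
}

list_columns toy.report.txt
```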
apt-probeset-genotype - program for determining genotype calls
from Affymetrix SNP microarrays. The model based algorithms for
making calls (brlmm/brlmm-p/birdseed) require multiple cel files
to be analyzed at once to learn the parameters for each SNP.
usage:
BRLMM (500K arrays):
apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
-o out-dir/ *.cel
BRLMM-P (GenomeWide SNP 5.0 arrays):
apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
-o out-dir/ -a brlmm-p --read-models-brlmmp chip.models \
*.cel
Birdseed (GenomeWide SNP 6.0 arrays):
apt-probeset-genotype -c chip.cdf --special-snps chip.specialSNPs \
-o out-dir/ -a birdseed --read-models-birdseed chip.birdseed.models \
*.cel
See the apt-probeset-genotype manual for more information about
birdseed including the latest improvements from The Broad.
options:
Common Options (not used by all programs)
-h, --help Display program options and extra
documentation about possible analyses. See
--explain for information about a specific
operation. [default 'false']
-v, --verbose How verbose to be with status messages 0 -
quiet, 1 - usual messages, 2 - more
messages. [default '1']
--console-off Turn off the default messages to the
console but not logging or sockets.
[default 'false']
--use-socket Host and port to print messages over in
localhost:port format [default '']
--version Display version information. [default
'false']
-f, --force Disable various checks including chip
types. Consider using --chip-type option
rather than --force. [default 'false']
--throw-exception Throw an exception rather than calling
exit() on error. Useful for debugging. This
option is intended for command line use
only. If you are wrapping an Engine and
want exceptions thrown, then you should
call Err::setThrowStatus(true) to ensure
that all Err::errAbort() calls result in an
exception. [default 'false']
--analysis-files-path Search path for analysis library files.
Will override AFFX_ANALYSIS_FILES_PATH
environment variable. [default '']
--xml-file Input parameters in XML format (Will
override command line settings). [default
'']
--temp-dir Directory for temporary files when working
off disk. Using network mounted drives is
not advised. When not set, the output
folder will be used. The default is
typically the output directory or the
current working directory. [default '']
-o, --out-dir Directory for output files. Defaults to
current working directory. [default '.']
--log-file The name of the log file. Generally
defaults to the program name in the out-dir
folder. [default '']
Engine Options (Not used on command line)
--command-line The command line executed. [default '']
--exec-guid The GUID for the process. [default '']
--program-name The name of the program [default '']
--program-company The company providing the program [default
'']
--program-version The version of the program [default '']
--program-cvs-id The CVS version of the program [default '']
--version-to-report The version to report in the output files.
[default '']
--free-mem-at-start How much physical memory was available when
the engine run started. [default '0']
--meta-data-info Meta data in key=value pair that will be
output in headers. [default '']
Input Options
--cel-files Text file specifying cel files to process,
one per line with the first line being
'cel_files'. [default '']
-c, --cdf-file File defining probe sets. Use either
--cdf-file or --spf-file. [default '']
--spf-file File defining probe sets in spf (simple
probe format) which is like a text cdf
file. [default '']
--chrX-snps File containing snps on chrX
(non-pseudoautosomal region). [default '']
--special-snps File containing all snps of unusual copy
(chrX,mito,Y) [default '']
--chrX-probes File containing probe_id (1-based) of
probes on chrX. Used for copy number probe
chrX/Y ratio gender calling. [Experimental]
[default '']
--chrY-probes File containing probe_id (1-based) of
probes on chrY. Used for copy number probe
chrX/Y ratio gender calling. [Experimental]
[default '']
--chrZ-probes File containing probe_id (1-based) of
probes on chrZ. Used for copy number probe
chrW/Z ratio avian gender calling.
[Experimental] [default '']
--chrW-probes File containing probe_id (1-based) of
probes on chrW. Used for copy number probe
chrW/Z ratio avian gender calling.
[Experimental] [default '']
-s, --probeset-ids Tab delimited file with column
'probeset_id' specifying probesets to
genotype. [default '']
--probeset-ids-reported Tab delimited file with column
'probeset_id' specifying probesets to
report. This should be a subset of those
specified with --probeset-ids if that
option is used. [default '']
--probe-class-file File containing probe_id (1-based) of
probes and a 'class' designation. Used to
compute mean probe intensity by class for
report file. [default '']
--chip-type Chip types to check library and CEL files
against. Can be specified multiple times.
The first one is propagated as the chip
type in the output files. Warning, use of
this option will override the usual check
between chip types found in the library
files and cel files. You should use this
option instead of --force when possible.
[default '']
--annotation-file Annotation file. [default '']
--genotype-markers-cn-file Tab delimited file with copy number calls
for genotype probesets within copy number
regions [default '']
--file5-compact Should we output results in a compact file5
output. [default 'false']
--sqlite-output Should we output some results in sqlite3
format? [default 'false']
Output Options
--table-output Output matching matrices of tab delimited
genotype calls and confidences. [default
'true']
--output-forced-calls Output a separate file with forced calls.
[default 'false']
--output-context Output a separate file with the allele
context used. This is only relevant for
marker type probesets which have multiple
groups of probes for each allele based on
the context of nearby SNPs. [default
'false']
--cc-chp-output Output resulting calls in directory called
'cc-chp' under out-dir. This makes one AGCC
Multi Data CHP file per cel file analyzed.
[default 'false']
--xda-chp-output Output resulting calls in directory called
'chp' under out-dir. This makes one GCOS
XDA CHP file per cel file analyzed. Note
that this format is not supported beyond
the Mapping500K chips, for subsequent chips
look at the CC CHP format instead. [default
'false']
--cc-chp-out-dir Over-ride the default location for chp
output. [default '']
--xda-chp-out-dir Over-ride the default location for chp
output. [default '']
--summaries Output the summary values from the
quantification method for each allele. For
brlmm-p this will also write a file of
transformed summary values in contrast
space used in the clustering. [default
'false']
--report-file Over-ride the default report file name.
[default '']
Analysis Options
-a, --analysis String representing analysis pathway
desired. For example:
'quant-norm.sketch=50000,pm-only,brlmm'.
[default 'brlmm']
--qmethod-spec Quantification Method to use for
summarizing alleles. [default
'plier.optmethod=1']
--read-models-brlmm File to read precomputed BRLMM snp specific
models from. [default '']
--read-models-brlmmp File to read precomputed BRLMM-P snp
specific models from. [default '']
--read-models-birdseed File to read precomputed birdseed snp
specific models from. [default '']
--write-models Should we write snp specific models out for
analysis? [experimental] [default 'false']
--db-from-prior-models File to write prior snp models to for
random access. [default '']
--db-from-posterior-models File to write posterior snp models to for
random access. [default '']
--feat-effects Output feature effects when available. By
convention med-polish feature effects have
total probeset median added to them, see
RMA module for details [default 'false']
--writeOldStyleFeatureEffectsFile Boolean value to determine whether or not
old style feature effects files are
written. [default 'false']
--feat-eff-remove-allele-suffix Remove the -A and -B suffix from probeset
name added during genotype process [default
'false']
--use-feat-eff File defining a plier feature effect for
each probe. Note that precomputed effects
should only be used for an appropriately
similar analysis (i.e. feature effects for
pm-only may be different than for pm-mm).
[default '']
--feat-details Output the feature details (usually
residuals) from the quantification method
if available. [default 'false']
--target-sketch File specifying a target distribution to
use for quantile normalization. [default
'']
--write-sketch Write the quantile normalization
distribution (or sketch) to a file for
reuse with target-sketch option. [default
'false']
--dm-thresh Minimum DM p-value to seed clusters with.
[default '.17']
--reference-profile File specifying reference chip profile.
[default '']
--write-profile Write the reference chip profile to a file
for reuse. [default 'false']
--dm-hetmult DM hetmultiplier to balance het/hom calls,
additive to log likelihood. [default '0']
--prior-size How many probesets to use for determining
prior. [default '0']
--list-sample Only sample for prior from list specified
via --probeset-ids, not entire chip.
[default 'false']
--read-priors-brlmm File to load BRLMM priors from. Prior
format is tab separated id, center, var,
and center.var. [default '']
--write-prior Write prior out to file in output-dir.
[default 'false']
--norm-size Do contrast normalization using a sample of
this many snps (brlmm-p) [default '0']
--write-norm Write covariate norm fcns to file [default
'false']
--set-analysis-name Explicitly set the analysis name. This
affects output file names (ie prefix) and
various meta info. [default '']
--include-quant-in-report-file-name Include the quant method name in the
expression report files. [default 'false']
Gender Options
--set-gender-method Explicitly force the use of a particular
gender method for genotype calling. Valid
values include: cn-probe-chrXY-ratio,
cn-probe-chrZW-ratio, dm-chrX-het-rate,
em-cluster-chrX-het-contrast,
user-supplied, and none. If you have
supplied seed genotype calls, you can also
use supplied-genotypes-chrX-het-rate. When
not set, the default behavior depends on
the analysis. [default '']
--read-genders Explicitly read genders from a file.
[default '']
--read-inbred Read penalty for hets by level of
inbreeding per sample. [default '']
--no-gender-force Perform analysis even without a suitable
gender method for genotype calling.
[default 'false']
--em-gender Enable EM Gender calling if special-snps or
chrX-snp file is provided. [default 'true']
--female-thresh Threshold for calling females when using
cn-probe-chrXY-ratio or
cn-probe-chrZW-ratio method. [default
'0.48']
--male-thresh Threshold for calling males when using
cn-probe-chrXY-ratio or
cn-probe-chrZW-ratio method. [default
'0.71']
--zw-gender-calling Handles case in which ZZ is male and ZW is
female. If unset, then internally set to
true when cn-probe-chrZW-ratio
gender-method is used. [default '']
Misc Options
--explain Explain a particular operation (i.e.
--explain brlmm or --explain brlmm-p).
[default '']
Advanced Options
--kill-list Do not use the probes specified in file for
computing results. [experimental] [default
'']
--dm-out Output any initial seed calls used by BRLMM
(seed default is DM calls). Only relevant
for BRLMM. [default 'false']
--all-types Try and analyze all probeset types rather
than just genotyping. [Experimental]
[default 'false']
--genotypes File to read seed genotypes from instead of
using DM to generate. [experimental]
[default '']
--select-probes Output estimates of which probes are most
accurate [default 'false']
--call-coder-max-alleles For encoding/decoding calls, the max number
of alleles per marker to allow. [default
'6']
--call-coder-type The data size used to encode the call.
[default 'UCHAR']
--call-coder-version The version of the encoder/decoder to use
[default '1.0']
Execution Control Options
--use-disk Store CEL intensities to be analyzed on
disk. [default 'true']
--disk-cache Size of memory cache when working off disk
in megabytes. [default '50']
A5 output options
--a5-global-file Filename for the A5 global output file.
[Experimental] [default '']
--a5-global-file-no-replace Append or create rather than replace.
[Experimental] [default 'false']
--a5-group Group name where to put results in the A5
output files. [Experimental] [default '']
--a5-calls Output the genotype calls and confidences
in A5 format. [Experimental] [default
'false']
--a5-calls-use-global Use the global A5 file for calls and
confidences.[Experimental] [default
'false']
--a5-summaries Output the summary values from the
quantification method for each allele in A5
format. [Experimental] [default 'false']
--a5-summaries-use-global Use the global A5 file for summaries.
[Experimental] [default 'false']
--a5-feature-effects Output feature effects in A5 format.
[Experimental] [default 'false']
--a5-feature-effects-use-global Use the global A5 file for feature
effects.[Experimental] [default 'false']
--a5-feature-details Output feature level residuals in A5
format. [Experimental] [default 'false']
--a5-feature-details-use-global Use the global A5 file for residuals.
[Experimental] [default 'false']
--a5-sketch Output normalization sketch in A5 format.
--write-sketch option will override this
option. [Experimental] [default 'false']
--a5-sketch-use-global Put the sketch in the global A5 output
file. [Experimental] [default 'false']
--a5-write-models Output genotype models/posteriors in A5
format. --write-models option will override
this option. [Experimental] [default
'false']
--a5-write-models-use-global Put the models in the global A5 output
file. [Experimental] [default 'false']
A5 input options
--a5-global-input-file Filename for the group in the global input
file.[Experimental] [default '']
--a5-input-group Group name for input. Defaults to
--a5-group or if that is not set, then '/'.
[Experimental] [default '']
--a5-sketch-input-global Read the sketch from the global A5 input
file. [Experimental] [default 'false']
--a5-sketch-input-file Read the sketch from an A5 input file.
[Experimental] [default '']
--a5-sketch-input-group Group name to read the sketch from.
Defaults to --a5-input-group.
[Experimental] [default '']
--a5-sketch-input-name The name of the data section. Defaults to
'target-sketch'. [Experimental] [default
'']
--a5-feature-effects-input-global Read the feature effects from the global A5
input file. [Experimental] [default 'false']
--a5-feature-effects-input-file Read the feature effects from an A5
input file. [Experimental] [default '']
--a5-feature-effects-input-group Group name to read the feature effects
from. Defaults to --a5-input-group.
[Experimental] [default '']
--a5-feature-effects-input-name The name of the data section. Defaults to
XXX.feature-response where XXX is the
analysis name and quant method. IE
'brlmm-p.plier'. [Experimental] [default
'']
--a5-models-input-global Read the Models from the global A5 input
file. The tsv5 name must be
'XXX.snp-posteriors'. [Experimental]
[default 'false']
--a5-models-input-file Read the models from an A5 input file.
[Experimental] [default '']
--a5-models-input-group The group name where the models are
located. Defaults to the analysis name.
[Experimental] [default '']
--a5-models-input-name The name of the data section. Defaults to
XXX.snp-posteriors where XXX is the
analysis name. IE 'brlmm-p'. [Experimental]
[default '']
SNPQC Options
--snpqc-probesets Filename of probesets to calculate
snpqc-call-rate, snpqc-hom-rate and
snpqc-het-rate for. [default '']
Engine Options (Not used on command line)
--cels Cel files to process. [default '']
--result-files CHP file names to output. Must be paired
with cels. [default '']
--time-start The time the engine run was started
[default '']
--time-end The time the engine run ended [default '']
--time-run-minutes The run time in minutes. [default '']
--analysis-guid The GUID for the analysis run. [default '']
Standard Methods:
'birdseed' quant-norm.sketch=50000,pm-only,birdseed
'birdseed-dev' quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev
'birdseed-dev.force' quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev.conf-threshold=1
'birdseed-v1' quant-norm.sketch=50000,pm-only,birdseed-v1
'birdseed-v1.force' quant-norm.sketch=50000,pm-only,birdseed-v1.conf-threshold=1
'birdseed-v2' quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2
'birdseed-v2.force' quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2.conf-threshold=1
'birdseed.force' quant-norm.sketch=50000,pm-only,birdseed.conf-threshold=1
'brlmm' quant-norm.sketch=50000,pm-only,brlmm.transform=ccs.K=4
'brlmm-p' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
'brlmm-p-plus' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05
'brlmm-p-plus.force' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1
'brlmm-p.force' quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=1
Data transformations:
rma-bg Performs an RMA style background adjustment
as described in Irizarry et al 2003.
quant-norm Class for doing quantile normalization. Can
do sketch and full quantile (just set sketch
to chip size or zero) and supports
bioconductor compatibility.
artifact-reduction Class for artifact reduction.
med-norm Class for doing median normalization. Adjust
intensities such that all chips have the same
median (or average).
adapter-type-norm Class for doing adapter type normalization.
Adjust intensities by adapter type.
gc-bg Subtract background based on median intensity
of probes with similar GC content.
intensity-reporter Class for dumping intensity values to a file.
no-trans Placeholder chipstream that does no
transformation
Pm Intensity Adjustments:
pm-only No adjustment. Just uses unmodified PM intensity values.
pm-mm Use mismatch probe as adjustment for perfect match. Has
strength of being unbiased, but often the mismatch probe
binds the match target.
pm-gcbg Do an adjustment based on the median intensity of probes
with similar GC content.
pm-sum Add intensity of PM probe for other allele to PM probes.
Quantification Methods:
plier The PLIER (Probe Logarithmic Error Intensity
Estimate) method produces an improved signal by
accounting for experimentally observed patterns in
feature behavior and handling error
appropriately at low and high signal values. This
version of PLIER differs from the previous version by
the addition of a SafetyZero, NumericalTolerance, and
FixPrecomputed. These options are intended to improve
the stability of PLIER results when using precomputed
feature response values. To get the older PLIER
behavior set SafetyZero to 0.0, NumericalTolerance to
0.0, and FixPrecomputed to false.
sea The SEA (Simplified Expression Analysis) method
provides a simple signal estimate, using the
initialization algorithm from the PLIER (Probe
Logarithmic Error Intensity Estimate) method and
omitting the PLIER parameter fitting. SEA is useful
for single chip signal estimation. The version of
PLIER used by SEA differs from the previous version
by the addition of a SafetyZero, NumericalTolerance,
and FixPrecomputed. These options are intended to
improve the stability of PLIER results when using
precomputed feature response values. To get the older
PLIER behavior set SafetyZero to 0.0,
NumericalTolerance to 0.0, and FixPrecomputed to
false.
iter-plier Do probe set quantification estimate by iteratively
calling PLIER with the probes that best correlate
with signal estimate. The version of PLIER used by
IterPLIER differs from the previous version by the
addition of a SafetyZero, NumericalTolerance, and
FixPrecomputed. These options are intended to improve
the stability of PLIER results when using precomputed
feature response values. To get the older PLIER
behavior set SafetyZero to 0.0, NumericalTolerance to
0.0, and FixPrecomputed to false.
med-polish Performs a median polish to estimate target and probe
effects. Resulting summaries are in log2 space by
default. Used in summary step of RMA as described in
Irizarry et al 2003.
dabg Calculates the p-value that the intensities in a
probeset could have been observed by chance in a
background distribution. Used as a substitute for
standard absent/present calls when mismatch probes
are not available.
avgdiff Calculates the average measurement for a probeset
using the MAS 4 average difference algorithm, namely
the average difference between the pm and mm probe
signal.
median Use the median of probes for a particular chip as the
summary.
Analysis Streams:
expr Does expression summarization on probesets.
pca-select Determines PCA for probes and picks probes that are
near the principal component as the probes to use
for downstream analysis.
spect-select Picks probes that are similar to each other based
on spectral cluster and normalized cut.
Q. What is a probe_id?
A. See the FAQ item on probe IDs for more info.
Q. The program died with an error message like "Assertion failed: A->probes.size() == 2, file ../DmListener.cpp." What does this mean?
A. This is symptomatic of trying to run BRLMM for a SNP with no MM probes. In its typical mode of running, BRLMM relies on DM to generate initial seed calls, and the DM algorithm requires MM probes.
Q. The program died with an error message like "DmListener::getGenoCall() - Can't find genotypes for name: SNP_A-1780432". What does this mean?
A. This is symptomatic of having specified the wrong chrX file for the analysis. In order to reduce the likelihood of accidentally using the wrong chrX file apt-probeset-genotype checks to make sure that all the SNPs specified in the chrX file are present on the chips being analyzed. If it finds a SNP present in the chrX file that is not identified in the CDF file it will die with the above message. Note that if you want to bypass the requirement of a chrX file you can use the --no-gender-force option.
Q. The program died and I got an error message saying "Killed". What does this mean and what can I do?
A. Linux has a "feature" whereby it will promise more memory than it actually has, in the hope that many programs won't actually be using all their memory at once. However, if Linux does run short of memory it will start killing programs arbitrarily. You can read more about Linux's OOM (out of memory) killer at LWN.net.
Q. Why does apt-probeset-genotype require information regarding SNPs on chromosomes X/Y/Mito?
A. The SNPs on chromosome X are evaluated separately for XX (female) and XY (male) individuals, as the intensity estimates for the males will generally be lower on X due to one missing chromosome. The prior is also adjusted to remove the het center, as XY individuals should only have hom calls on the X chromosome. For BRLMM analyses gender is estimated using the method employed in the GTYPE software: individuals are called male if less than 7.5% of the SNPs on X are called as hets by the initial DM calls using a 0.33 confidence threshold. For BRLMM-P gender is estimated by use of an Expectation Maximization (EM) algorithm on the PM probes for chrX SNPs to estimate the het rate.
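The BRLMM gender rule can be sketched in Python as below. This is a sketch under stated assumptions rather than the GTYPE or APT implementation: it assumes each initial DM call is a (genotype, confidence) pair, that a call counts only when its confidence score is at or below the threshold, that the het rate is taken over all chrX SNPs supplied, and that any sample not called male is labeled female.

```python
HET_RATE_CUTOFF = 0.075    # "less than 7.5%" from the text
DM_CONF_THRESHOLD = 0.33   # "0.33 confidence threshold" from the text

def call_gender(chrx_calls):
    """chrx_calls: list of (genotype, confidence) pairs for one sample's chrX SNPs."""
    if not chrx_calls:
        raise ValueError("no chrX SNPs supplied")
    # Assumption: a het counts only if its score passes the confidence threshold.
    hets = sum(1 for genotype, conf in chrx_calls
               if genotype == "AB" and conf <= DM_CONF_THRESHOLD)
    het_rate = hets / len(chrx_calls)
    return "male" if het_rate < HET_RATE_CUTOFF else "female"

# Made-up samples with 100 chrX SNPs each.
low_het_sample = [("AB", 0.1)] * 5 + [("AA", 0.1)] * 95
high_het_sample = [("AB", 0.1)] * 30 + [("BB", 0.1)] * 70
print(call_gender(low_het_sample), call_gender(high_het_sample))
```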
Q. How is the mask section in the CEL file used?
A. It is not. The contents of this section of the CEL file are ignored.
Q. How can I find out more information about the analysis string:
quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
A. Start with the included manual for apt-probeset-genotype. Usage information is also provided if you run apt-probeset-genotype without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:
apt-probeset-genotype --explain brlmm-p
Q. Wild cards do not work on Windows. For example:
apt-probeset-genotype ... *.CEL
A. APT relies on the command shell to do the wild card expansion (e.g. the bash shell on *NIX systems). The Windows shell does not do wild card expansion, so no expansion takes place when APT is run from the Windows shell. You may want to try a different Windows shell, or perhaps bash via Cygwin. Alternatively, see the --cel-files option for another way to specify the CEL files for analysis.
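One way to avoid depending on shell expansion is to generate the list file for --cel-files with a short script. The Python sketch below assumes the list is a plain text file whose first line is the header cel_files, followed by one CEL path per line; confirm the expected layout against the program's usage output. The function name and file names are hypothetical.

```python
import glob
import os

def write_cel_list(cel_dir, list_path):
    """Write a CEL list file for every *.CEL file in cel_dir and return the paths."""
    # Note: on case-sensitive file systems this pattern matches only upper-case ".CEL".
    paths = sorted(glob.glob(os.path.join(cel_dir, "*.CEL")))
    with open(list_path, "w", newline="\n") as out:
        out.write("cel_files\n")      # assumed header line
        for path in paths:
            out.write(path + "\n")    # one CEL file per line
    return paths
```

The resulting file would then be passed on the command line in place of the wild card, for example `apt-probeset-genotype ... --cel-files cel_list.txt`.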
This section explains the contents of CHP files for the various algorithms. For details on the formats or for an explanation of why the XDA CHP format is not supported for some chip types, see above.
The XDA CHP file format is only supported for the BRLMM algorithm applied to the 100K or 500K arrays. Historically the genotyping XDA CHP file is closely tied to the DM model and while BRLMM uses the same format for backward compatibility it is important to note that the interpretation of some fields is different. Below are the names of the fields and corresponding BRLMM values that are stored in them.
The following parameters are saved in the CCHPFileHeader object:
A complete explanation of the XDA CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.
The AGCC CHP format consists of a header followed by a data section. The header section contains a large amount of information including the software version and the full set of parameters used in the clustering analysis. The data section consists of a matrix with a row for each SNP. The columns are:
Use of the clustering_space_x_value and clustering_space_y_value fields allows for plotting the data in the space that was used to perform the clustering. For BRLMM and BRLMM-P the x-value is 'transformed contrast' and the y-value is 'signal strength' - see the BRLMM and BRLMM-P whitepapers for more detail. For Birdseed (see http://www.broad.mit.edu/mpg/birdsuite/) this is A-signal vs. B-signal (linear scale, post quantile normalization and allele-specific median-polish).
A complete explanation of the AGCC CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.
While aliases for common analyses, such as brlmm with default parameters, are provided, it is possible to construct custom analyses on the command line. There are both program options and analysis parameters that can be set to affect the results. Most people are familiar with the standard method for setting program options, but the specification of the analysis method and its parameters in apt-probeset-genotype works a little differently. Custom parameters are set by supplying a text representation of the desired analysis and its parameters. This enables flexibility, as each piece of an analysis is self-contained and the pieces can be (almost) arbitrarily combined. Note that when using a custom analysis rather than an alias it is necessary to specify the entire analysis; it is not acceptable to pass custom parameters to the alias. For example, if you wanted to change the number of iterations brlmm performs you would have to specify 'quant-norm.sketch=50000,pm-only,brlmm.iterations=1' rather than just typing 'brlmm.iterations=1'.
The current full default brlmm analysis is 'quant-norm.sketch=50000,pm-only,brlmm', where there can be multiple chipstream modules (in this case a single quant-norm) separated by commas and the last two entries are the pm adjuster (pm-only) and quantification method (brlmm). Parameters to a particular step in the analysis are supplied as key=value pairs separated by periods. For example, 'quant-norm.sketch=50000' indicates that the chips should be quantile normalized and that a sketch (subset of total data) of size 50000 should be used to do the normalization. Using a sketch can significantly reduce the amount of memory needed with minimal impact on normalization values. To do quantile normalization with just the PM probes and resolve ties in the same manner as Bioconductor's RMA version of quantile normalization you would specify 'quant-norm.sketch=50000.bioc=true.usepm=true'. All of the possible parameters can be seen by using the --explain option in conjunction with the name of the module (e.g. apt-probeset-genotype --explain quant-norm).
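To make the structure of an analysis string concrete, here is a small Python sketch that splits one into its modules and key=value parameters. It is not APT's parser; the function name is hypothetical, and the rule that a period only separates parameters when it is followed by "key=" is an assumption made so that decimal values such as SB=0.003 stay intact.

```python
import re

# A period starts a new parameter only when it is followed by "key=".
# This keeps decimal values such as SB=0.003 or het-mult=.9 in one piece.
_PARAM_BREAK = re.compile(r"\.(?=[A-Za-z_][\w-]*=)")

def parse_analysis(spec):
    """Split an analysis string into [(module_name, {key: value}), ...]."""
    stages = []
    for stage in spec.split(","):
        name, *params = _PARAM_BREAK.split(stage.strip())
        stages.append((name, dict(p.split("=", 1) for p in params)))
    return stages

spec = "quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05"
for name, params in parse_analysis(spec):
    print(name, params)
```

Parameter values are kept as strings; a real tool would convert and validate them per module.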
A few examples of custom analyses would be:
'pm-only,brlmm.transform=rvt' - No normalization, use rvt space for clustering in brlmm.
'med-norm,pm-mm,brlmm.het-mult=.9' - Do a median normalization, use a PM-MM adjustment for probes and a het multiplier of .9 to try to balance hom/het calls.
'rma-bg,quant-norm.sketch=50000.usepm=true.bioc=true,pm-only,brlmm.K=4.transform=CCS' - Do RMA-style background correction and quantile normalization using a subset of 50000 data points, followed by brlmm in CCS (contrast centers space) with K = 4.
Use the --explain option to get more information on what parameters are available for the various methods. For example, "--explain brlmm", "--explain brlmm-p", and "--explain birdseed".
There are a number of different transformations, implemented for different spaces, which can be specified via the transform parameter to brlmm and are detailed below. In all of these transformations, the inputs are the intensities of the A and B alleles as estimated by the quantification method (such as plier or RMA), and the outputs are the new coordinates into which those two intensities are transformed.