apt-probeset-genotype is a program for making genotype calls from Affymetrix SNP microarrays. It currently implements three different genotype calling algorithms: BRLMM, BRLMM-P, and Birdseed.
APT implements the Birdseed v1 algorithm, developed in collaboration with the Broad Institute, which Affymetrix has validated and supports for use with the SNP 6.0 array. APT also implements the BRLMM-P algorithm, which Affymetrix has validated and supports for the SNP 5.0 array.
Additionally, APT implements a newer version of Birdseed, accessed with the --analysis birdseed-v2 option, and the latest development edition of the Birdseed algorithm, accessed with the --analysis birdseed-dev option. Unlike the supported methods described above, the newer birdseed-v2 and birdseed-dev methods are not currently supported by Affymetrix. Moreover, the SNP priors and lists of "qualified" SNPs for Birdseed-dev are not currently available from Affymetrix for either SNP 5.0 or 6.0. Further information and support files for Birdseed-dev and Birdseed-v2 are available from the Broad Institute (http://www.broad.mit.edu/mpg/birdsuite/).
Future APT updates are expected to migrate improvements currently available via the birdseed-dev option into methods supported in the same manner as the rest of the APT methods.
Because birdseed, brlmm and brlmm-p are model-based algorithms, they need to be run on multiple CEL files at once to estimate probe effects and SNP cluster parameters. For Mapping 500K data it is advisable to run on at least 50 distinct samples (excluding replicates), and ideally on about 100. For Genome-Wide Human SNP 5.0 and 6.0 arrays it is advisable to cluster at least 44 genetically distinct samples; adding more continues to help, in particular for correctly calling rare genotypes.
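The sample-count advice above can be checked before starting a long run. The following is a minimal sketch, assuming the CEL files are listed in a --cel-files style text file (first line 'cel_files', then one path per line). The function names and the array labels 500k, snp5 and snp6 are illustrative and not part of APT. The check only counts listed files; it cannot tell whether the samples are genetically distinct or replicates.

```shell
# recommended_min_samples ARRAY
# Prints the minimum number of distinct samples advised in this manual.
# ARRAY labels (500k, snp5, snp6) are hypothetical names chosen for this sketch.
recommended_min_samples() {
    case $1 in
        500k)      echo 50 ;;
        snp5|snp6) echo 44 ;;
        *)         return 1 ;;
    esac
}

# check_sample_count ARRAY LISTFILE
# Returns 0 if LISTFILE names at least the advised number of CEL files,
# 1 if it names fewer, 2 if ARRAY is not recognised.
check_sample_count() {
    min=$(recommended_min_samples "$1") || {
        echo "unknown array type: $1" >&2
        return 2
    }
    # Count entries, skipping the 'cel_files' header line and blank lines.
    n=$(awk 'NF > 0 && $0 != "cel_files" { c++ } END { print c + 0 }' "$2")
    if [ "$n" -lt "$min" ]; then
        echo "warning: $n CEL files listed; at least $min distinct samples are advised for $1"
        return 1
    fi
    echo "ok: $n CEL files listed (advised minimum for $1: $min)"
}
```

For example, `check_sample_count snp6 cel_file_list.txt` prints a warning and returns a non-zero status when fewer than 44 files are listed.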
We illustrate the most basic way to run apt-probeset-genotype with some examples.
The basic requirements for a run of apt-probeset-genotype are:
WARNING: apt-probeset-genotype will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name.
WARNING: Model files are algorithm specific. Birdseed model files must be used with the Birdseed analysis method and BRLMM-P model files with the brlmm-p method.
NOTE: On Windows the DOS prompt does not support wildcard expansion. The preferred method is to supply a text file listing the paths to the CEL files via the '--cel-files' option (see below for details of the file format).
NOTE: Unlike unix shells, the Windows DOS prompt also does not allow a command to be continued with the '\' character. When using the examples below on Windows, omit the '\' characters and enter everything on a single line.
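As a minimal sketch of the list file that '--cel-files' expects (per the option help: one CEL file per line, with the first line being 'cel_files'), the shell function below writes such a file for a directory of CEL files. The function name make_cel_list and the file names are illustrative assumptions and not part of APT. It needs a unix-style shell, for example bash under cygwin on Windows.

```shell
# make_cel_list DIR OUTFILE
# Writes a --cel-files list: the header line 'cel_files', then one CEL path per line.
# (make_cel_list is a hypothetical helper, not an APT command.)
make_cel_list() {
    dir=$1
    out=$2
    printf 'cel_files\n' > "$out"
    for f in "$dir"/*; do
        case $f in
            # Accept both upper- and lower-case extensions; ignore everything else.
            *.CEL|*.cel) printf '%s\n' "$f" >> "$out" ;;
        esac
    done
}
```

The resulting file can then be passed as `--cel-files cel_file_list.txt` in place of `*.CEL`.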
The command below runs the Axiom™ GT1 algorithm on Axiom™ arrays. For full details on the use of Axiom™ GT1 in apt-probeset-genotype refer to the vignette on genotype clustering for Axiom™ arrays.
apt-probeset-genotype \
    --analysis-files-path /library/file/path \
    --xml-file Axiom_GW_Hu_SNP.r2.apt-probeset-genotype.AxiomGT1.xml \
    --out-dir out \
    --cel-files cel_file_list.txt
On unix systems a basic command using the default parameters to do a run on SNP6.0 data using birdseed (v2) would look like:
apt-probeset-genotype \
    -o results_dir \
    -c GenomeWideSNP_6.cdf \
    --set-gender-method cn-probe-chrXY-ratio \
    --chrX-probes GenomeWideSNP_6.chrXprobes \
    --chrY-probes GenomeWideSNP_6.chrYprobes \
    --special-snps GenomeWideSNP_6.specialSNPs \
    --read-models-birdseed GenomeWideSNP_6.birdseed-v2.models \
    -a birdseed-v2 \
    *.CEL
Note in particular the use of the option "-a birdseed-v2" which specifies that the Birdseed calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 and 6.0 chips).
Also see the important notes regarding birdseed-v1, birdseed-v2, and birdseed-dev in the introduction above.
The following will give you the older birdseed (v1) behavior:
apt-probeset-genotype \
    -o results_dir \
    -c GenomeWideSNP_6.cdf \
    --special-snps GenomeWideSNP_6.specialSNPs \
    --read-models-birdseed GenomeWideSNP_6.birdseed.models \
    -a birdseed \
    *.CEL
On unix systems a basic command using the default parameters to do a run on SNP5.0 data would look like:
apt-probeset-genotype \
    -o results_dir \
    -c GenomeWideSNP_5.cdf \
    --chrX-snps GenomeWideSNP_5.chrx \
    --read-models-brlmmp GenomeWideSNP_5.models \
    -a brlmm-p \
    *.CEL
Note in particular the use of the option "-a brlmm-p" which specifies that the BRLMM-P calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 chips).
On unix systems a basic command using the default parameters to do a run on Mapping 500K data would look like:
apt-probeset-genotype \
    -o results_dir \
    -c Mapping250K_Sty.cdf \
    --chrX-snps Mapping250K_Sty.chrx \
    *.CEL
The output will consist of a report file with some summary statistics about each chip analyzed and a pair of tab-delimited text files with suffixes .calls.txt and .confidences.txt containing the genotype calls and their associated confidences.
On Windows a command equivalent to the example above for Mapping 500K would look like:
apt-probeset-genotype -c Mapping250K_Sty.cdf --chrX-snps Mapping250K_Sty.chrx -o results_dir --cel-files cel_file_list.txt
For Mapping 500K chips, apt-probeset-genotype runs 100 CELs in 1-2 hours on a 3 GHz machine with 2 GB of RAM using local disk.
Building upon the examples above, here is an example in which only a subset of SNPs are analyzed and the results are written to a text table of genotype calls and a text table of call confidences. The subset of SNPs to be analyzed is specified in a tab-delimited text file called subset_sty.txt, which must contain a column named 'probeset_id'.
apt-probeset-genotype \
    -s subset_sty.txt \
    -c Mapping250K_Sty.cdf \
    --chrX-snps Mapping250K_Sty.chrx \
    -o results_dir \
    --list-sample \
    *.CEL
Note: The --list-sample option is required when subsets are used with the (default) BRLMM analyses and should be omitted for other analysis types. For BRLMM, the number of probes used for generating priors is 10,000 by default. If a subset of fewer than 10,000 probes is used, use the --prior-size option to specify a number less than or equal to the subset size.
See the apt-probeset-summarize manual for a more complete example of running an analysis on a compute farm.
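The --prior-size rule in the note can be derived from the subset file itself. The sketch below is a minimal example, assuming a single-column (or tab-delimited) subset file whose header is 'probeset_id' as described above. The helper name prior_size_for_subset is hypothetical and not part of APT. It prints the number of probeset IDs in the file, capped at the BRLMM default of 10,000.

```shell
# prior_size_for_subset FILE
# Prints a value usable with --prior-size: the number of probeset IDs in FILE,
# capped at 10000 (the BRLMM default number of probes used for priors).
prior_size_for_subset() {
    # Count data rows: skip blank lines and the 'probeset_id' header line.
    n=$(awk -F '\t' 'NF > 0 && $1 != "probeset_id" { c++ } END { print c + 0 }' "$1")
    if [ "$n" -lt 10000 ]; then
        echo "$n"
    else
        echo 10000
    fi
}

# Possible use with the subset example (not executed here):
#   apt-probeset-genotype -s subset_sty.txt --list-sample \
#       --prior-size "$(prior_size_for_subset subset_sty.txt)" ...
```

Using the full subset size is only one valid choice; any value less than or equal to the subset size satisfies the note.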
In previous versions of apt-probeset-genotype the default output format for genotype calls was the XDA CHP format (also known as GCOS CHP format). For the GenomeWide SNP 5.0, SNP 6.0 and subsequent WGSA products the use of the XDA CHP format is strongly discouraged; we recommend the newer AGCC CHP format instead. To help avoid accidental use of the XDA CHP format, the default output format has been changed to tab-delimited text tables of calls and confidences. The creation of the text table output can be suppressed with the --no-table-output option, and the two CHP output formats can be selected with the --xda-chp-output and --cc-chp-output options.
The XDA CHP format is discouraged for the GenomeWide SNP 5.0 chips because it does not contain entries for SNP IDs; the identity of a SNP is inferred from its order in the file. The GenomeWide SNP 5.0 chips include some SNPs that are not part of the default library file and that some advanced users may choose to explore. This makes it possible to generate CHP files containing different SNP lists, which the XDA CHP format does not support well. The AGCC CHP format has a slot for SNP IDs and is therefore safer to use with chips for which users may be looking at different SNP lists.
For SNP 6.0 XDA CHP file format output is not allowed at all.
Details on the contents of the CHP files for various calling algorithms can be found below, and a full description of the XDA and AGCC CHP formats can be found in a local copy of the Affymetrix Developer's Network file format documentation.
Support for APT is handled through the Affymetrix Developer Network. Specifically, questions, problems, feature requests, and other inquiries should be made through either the APT User Forum or the Developer Network email address, firstname.lastname@example.org. (If you get an Internal Server Error when accessing the forum, try clearing your cookies for affymetrix.com.) To get email updates about APT or to view previous APT announcements, see the APT User Forum.
APT is not supported through the Affymetrix call center, Field Application Specialists, or the standard Affymetrix Technical support channels.
If you encounter an issue please make sure to collect the following information and report the problem to email@example.com
apt-probeset-genotype creates a summary report file in the output directory with the file name extension '.report.txt'. The report file contains summary information about each chip analyzed and is useful for getting a quick overview of the CELs analyzed. The format of the file is tab-delimited text with a header line followed by a line for each CEL file analyzed. All columns are explained below; most users will mainly be interested in the first few entries. The additional entries are provided as potentially useful metrics for tracking and identifying outlier chips and are expected to be of interest mainly to advanced users. The column entries are:
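Because the report is tab-delimited with a header line, the column names present in a particular report file can be listed directly. The sketch below is a minimal example. The helper name list_report_columns is hypothetical, and skipping lines that begin with '#' is an assumption about possible comment or meta lines before the header, not something stated in this manual.

```shell
# list_report_columns FILE
# Prints "index<TAB>name" for each column in the header line of a report file.
list_report_columns() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }   # assumption: ignore comment/meta lines and blank lines
        {
            # The first remaining line is taken to be the header line.
            for (i = 1; i <= NF; i++) printf "%d\t%s\n", i, $i
            exit
        }
    ' "$1"
}
```

The printed indices can be used with standard tools such as cut, for example `cut -f1-3` on the report file to view only the first few columns.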
apt-probeset-genotype - program for determining genotype calls from Affymetrix SNP microarrays. The model based algorithms for making calls (brlmm/brlmm-p/birdseed) require multiple cel files to be analyzed at once to learn the parameters for each SNP.

usage:
  BRLMM (500K arrays):
    apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
        -o out-dir/ *.cel

  BRLMM-P (GenomeWide SNP 5.0 arrays):
    apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
        -o out-dir/ -a brlmm-p --read-models-brlmmp chip.models \
        *.cel

  Birdseed (GenomeWide SNP 6.0 arrays):
    apt-probeset-genotype -c chip.cdf --special-snps chip.specialSNPs \
        -o out-dir/ -a birdseed --read-models-birdseed chip.birdseed.models \
        *.cel

  See the apt-probeset-genotype manual for more information about birdseed including the latest improvements from The Broad.

options:

  Common Options (not used by all programs)
    -h, --help  Display program options and extra documentation about possible analyses. See -explain for information about a specific operation. [default 'false']
    -v, --verbose  How verbose to be with status messages 0 - quiet, 1 - usual messages, 2 - more messages. [default '1']
    --console-off  Turn off the default messages to the console but not logging or sockets. [default 'false']
    --use-socket  Host and port to print messages over in localhost:port format [default '']
    --version  Display version information. [default 'false']
    -f, --force  Disable various checks including chip types. Consider using --chip-type option rather than --force. [default 'false']
    --throw-exception  Throw an exception rather than calling exit() on error. Useful for debugging. This option is intended for command line use only. If you are wrapping an Engine and want exceptions thrown, then you should call Err::setThrowStatus(true) to ensure that all Err::errAbort() calls result in an exception. [default 'false']
    --analysis-files-path  Search path for analysis library files. Will override AFFX_ANALYSIS_FILES_PATH environment variable. [default '']
    --xml-file  Input parameters in XML format (Will override command line settings). [default '']
    --temp-dir  Directory for temporary files when working off disk. Using network mounted drives is not advised. When not set, the output folder will be used. The defaut is typically the output directory or the current working directory. [default '']
    -o, --out-dir  Directory for output files. Defaults to current working directory. [default '.']
    --log-file  The name of the log file. Generally defaults to the program name in the out-dir folder. [default '']

  Engine Options (Not used on command line)
    --command-line  The command line executed. [default '']
    --exec-guid  The GUID for the process. [default '']
    --program-name  The name of the program [default '']
    --program-company  The company providing the program [default '']
    --program-version  The version of the program [default '']
    --program-cvs-id  The CVS version of the program [default '']
    --version-to-report  The version to report in the output files. [default '']
    --free-mem-at-start  How much physical memory was available when the engine run started. [default '0']
    --meta-data-info  Meta data in key=value pair that will be output in headers. [default '']

  Input Options
    --cel-files  Text file specifying cel files to process, one per line with the first line being 'cel_files'. [default '']
    -c, --cdf-file  File defining probe sets. Use either --cdf-file or --spf-file. [default '']
    --spf-file  File defining probe sets in spf (simple probe format) which is like a text cdf file. [default '']
    --chrX-snps  File containing snps on chrX (non-pseudoautosomal region). [default '']
    --special-snps  File containing all snps of unusual copy (chrX,mito,Y) [default '']
    --chrX-probes  File containing probe_id (1-based) of probes on chrX. Used for copy number probe chrX/Y ratio gender calling. [Experimental] [default '']
    --chrY-probes  File containing probe_id (1-based) of probes on chrY. Used for copy number probe chrX/Y ratio gender calling. [Experimental] [default '']
    --chrZ-probes  File containing probe_id (1-based) of probes on chrZ. Used for copy number probe chrW/Z ratio avian gender calling. [Experimental] [default '']
    --chrW-probes  File containing probe_id (1-based) of probes on chrW. Used for copy number probe chrW/Z ratio avian gender calling. [Experimental] [default '']
    -s, --probeset-ids  Tab delimited file with column 'probeset_id' specifying probesets to genotype. [default '']
    --probeset-ids-reported  Tab delimited file with column 'probeset_id' specifying probesets to report. This should be a subset of those specified with --probeset-ids if that option is used. [default '']
    --probe-class-file  File containing probe_id (1-based) of probes and a 'class' designation. Used to compute mean probe intensity by class for report file. [default '']
    --chip-type  Chip types to check library and CEL files against. Can be specified multiple times. The first one is propigated as the chip type in the output files. Warning, use of this option will override the usual check between chip types found in the library files and cel files. You should use this option instead of --force when possible. [default '']
    --annotation-file  Annotation file. [default '']
    --genotype-markers-cn-file  Tab delimited file with copy number calls for genotype probesets within copy number regions [default '']
    --file5-compact  Should we output results in a compact file5 output. [default 'false']
    --sqlite-output  Shoul output some results in sqlite3 format? [default 'false']

  Output Options
    --table-output  Output matching matrices of tab delimited genotype calls and confidences. [default 'true']
    --output-forced-calls  Output a separate file with forced calls. [default 'false']
    --output-context  Output a separate file with the allele context used. This is only relevant for marker type probesets which have multiple groups of probes for each allele based on the context of nearby SNPs. [default 'false']
    --cc-chp-output  Output resulting calls in directory called 'cc-chp' under out-dir. This makes one AGCC Multi Data CHP file per cel file analyzed. [default 'false']
    --xda-chp-output  Output resulting calls in directory called 'chp' under out-dir. This makes one GCOS XDA CHP file per cel file analyzed. Note that this format is not supported beyond the Mapping500K chips, for subsequent chips look at the CC CHP format instead. [default 'false']
    --cc-chp-out-dir  Over-ride the default location for chp output. [default '']
    --xda-chp-out-dir  Over-ride the default location for chp output. [default '']
    --summaries  Output the summary values from the quantifcation method for each allele. For brlmm-p this will also write a file of transformed summary values in contrast space used in the clustering. [default 'false']
    --report-file  Over-ride the default report file name. [default '']

  Analysis Options
    -a, --analysis  String representing analysis pathway desired. For example: 'quant-norm.sketch=50000,pm-only,brlmm'. [default 'brlmm']
    --qmethod-spec  Quantification Method to use for summarizing alleles. [default 'plier.optmethod=1']
    --read-models-brlmm  File to read precomputed BRLMM snp specific models from. [default '']
    --read-models-brlmmp  File to read precomputed BRLMM-P snp specific models from. [default '']
    --read-models-birdseed  File to read precomputed birdseed snp specific models from. [default '']
    --write-models  Should we write snp specific models out for analysis? [experimental] [default 'false']
    --db-from-prior-models  File to write prior snp models to for random access. [default '']
    --db-from-posterior-models  File to write posterior snp models to for random access. [default '']
    --feat-effects  Output feature effects when available. By convention med-polish feature effects have total probeset median added to them, see RMA module for details [default 'false']
    --writeOldStyleFeatureEffectsFile  Boolean value to determine whether or not old style feature effects files are written. [default 'false']
    --feat-eff-remove-allele-suffix  Remove the -A and -B suffix from probeset name added during genotype process [default 'false']
    --use-feat-eff  File defining a plier feature effect for each probe. Note that precomputed effects should only be used for an appropriately similar analysis (i.e. feature effects for pm-only may be different than for pm-mm). [default '']
    --feat-details  Output the feature details (usually residuals) from the quantification method if available. [default 'false']
    --target-sketch  File specifying a target distribution to use for quantile normalization. [default '']
    --write-sketch  Write the quantile normalization distribution (or sketch) to a file for reuse with target-sketch option. [default 'false']
    --dm-thresh  Minimum DM p-value to seed clusters with. [default '.17']
    --reference-profile  File specifying reference chip profile. [default '']
    --write-profile  Write the reference chip profile to a file for reuse. [default 'false']
    --dm-hetmult  DM hetmultiplier to balance het/hom calls, additive to log likelihood. [default '0']
    --prior-size  How many probesets to use for determining prior. [default '0']
    --list-sample  Only sample for prior from list specified via --probeset-ids, not entire chip. [default 'false']
    --read-priors-brlmm  File to load BRLMM priors from. Prior format is tab separated id, center, var, and center.var. [default '']
    --write-prior  Write prior out to file in output-dir. [default 'false']
    --norm-size  Do contrast normalization using a sample of this many snps (brlmm-p) [default '0']
    --write-norm  Write covariate norm fcns to file [default 'false']
    --set-analysis-name  Explicitly set the analysis name. This affects output file names (ie prefix) and various meta info. [default '']
    --include-quant-in-report-file-name  Include the quant method name in the expression report files. [default 'false']

  Gender Options
    --set-gender-method  Explicitly force the use of a particular gender method for genotype calling. Valid values include: cn-probe-chrXY-ratio, cn-probe-chrZW-ratio, dm-chrX-het-rate, em-cluster-chrX-het-contrast, user-supplied, and none. If you are supplied seed genotype calls, you can also use supplied-genotypes-chrX-het-rate. When not set, the default behavior depends on the analysis. [default '']
    --read-genders  Explicitly read genders from a file. [default '']
    --read-inbred  Read penalty for hets by level of inbreeding per sample. [default '']
    --no-gender-force  Perform analysis even without a suitable gender method for genotype calling. [default 'false']
    --em-gender  Enable EM Gender calling if special-snps or chrX-snp file is provided. [default 'true']
    --female-thresh  Threshold for calling females when using cn-probe-chrXY-ratio or cn-probe-chrZW-ratio method. [default '0.48']
    --male-thresh  Threshold for calling females when using cn-probe-chrXY-ratio or cn-probe-chrZW-ratiomethod. [default '0.71']
    --zw-gender-calling  Handles case in which ZZ is male and ZW is female. If unset, then internally set to true when cn-probe-chrZW-ratio gender-method is used. [default '']

  Misc Options
    --explain  Explain a particular operation (i.e. --explain brlmm or --explain brlmm-p). [default '']

  Advanced Options
    --kill-list  Do not use the probes specified in file for computing results. [experimental] [default '']
    --dm-out  Output any initial seed calls used by BRLMM (seed default is DM calls). Only relevant for BRLMM. [default 'false']
    --all-types  Try and analyze all probeset types rather than just genotyping. [Experimental] [default 'false']
    --genotypes  File to read seed genotypes from instead of using DM to generate. [experimental] [default '']
    --select-probes  Output estimates of which probes are most accurate [default 'false']
    --call-coder-max-alleles  For encoding/decoding calls, the max number of alleles per marker to allow. [default '6']
    --call-coder-type  The data size used to encode the call. [default 'UCHAR']
    --call-coder-version  The version of the encoder/decoder to use [default '1.0']

  Execution Control Options
    --use-disk  Store CEL intensities to be analyzed on disk. [default 'true']
    --disk-cache  Size of memory cache when working off disk in megabytes. [default '50']

  A5 output options
    --a5-global-file  Filename for the A5 global output file. [Experimental] [default '']
    --a5-global-file-no-replace  Append or create rather than replace. [Experimental] [default 'false']
    --a5-group  Group name where to put results in the A5 output files. [Experimental] [default '']
    --a5-calls  Output the genotype calls and confidences in A5 format. [Experimental] [default 'false']
    --a5-calls-use-global  Use the global A5 file for calls and confidences.[Experimental] [default 'false']
    --a5-summaries  Output the summary values from the quantifcation method for each allele in A5 format. [Experimental] [default 'false']
    --a5-summaries-use-global  Use the global A5 file for summaries. [Experimental] [default 'false']
    --a5-feature-effects  Output feature effects in A5 format. [Experimental] [default 'false']
    --a5-feature-effects-use-global  Use the global A5 file for feature effects.[Experimental] [default 'false']
    --a5-feature-details  Output feature level residuals in A5 format. [Experimental] [default 'false']
    --a5-feature-details-use-global  Use the global A5 file for residuals. [Experimental] [default 'false']
    --a5-sketch  Output normalization sketch in A5 format. --write-sketch option will override this option. [Experimental] [default 'false']
    --a5-sketch-use-global  Put the sketch in the global A5 output file. [Experimental] [default 'false']
    --a5-write-models  Output genotype models/posteriors in A5 format. --write-models option will override this option. [Experimental] [default 'false']
    --a5-write-models-use-global  Put the models in the global A5 output file. [Experimental] [default 'false']

  A5 input options
    --a5-global-input-file  Filename for the group in the global input file.[Experimental] [default '']
    --a5-input-group  Group name for input. Defaults to --a5-group or if that is not set, then '/'. [Experimental] [default '']
    --a5-sketch-input-global  Read the sketch from the global A5 input file. [Experimental] [default 'false']
    --a5-sketch-input-file  Read the sketch from the an A5 input file. [Experimental] [default '']
    --a5-sketch-input-group  Group name to read the sketch from. Defaults to --a5-input-group. [Experimental] [default '']
    --a5-sketch-input-name  The name of the data section. Defaults to 'target-sketch'. [Experimental] [default '']
    --a5-feature-effects-input-global  Read the feature effects global A5 input file. [Experimental] [default 'false']
    --a5-feature-effects-input-file  Read the feature effects from the an A5 input file. [Experimental] [default '']
    --a5-feature-effects-input-group  Group name to read the feature effects from. Defaults to --a5-input-group. [Experimental] [default '']
    --a5-feature-effects-input-name  The name of the data section. Defaults to XXX.feature-response where XXX is the analysis name and quant method. IE 'brlmm-p.plier'. [Experimental] [default '']
    --a5-models-input-global  Read the Models from the global A5 input file. The tsv5 name must be 'XXX.snp-posteriors'. [Experimental] [default 'false']
    --a5-models-input-file  Read the models from the an A5 input file. [Experimental] [default '']
    --a5-models-input-group  The group name where the models are located. Defaults to the analysis name. [Experimental] [default '']
    --a5-models-input-name  The name of the data section. Defaults to XXX.snp-posteriors where XXX is the analysis name. IE 'brlmm-p'. [Experimental] [default '']

  SNPQC Options
    --snpqc-probesets  Filename of probesets to calculate snpqc-call-rate, snpqc-hom-rate and snpqc-het-rate for. [default '']

  Engine Options (Not used on command line)
    --cels  Cel files to process. [default '']
    --result-files  CHP file names to output. Must be paired with cels. [default '']
    --time-start  The time the engine run was started [default '']
    --time-end  The time the engine run ended [default '']
    --time-run-minutes  The run time in minutes. [default '']
    --analysis-guid  The GUID for the analysis run. [default '']

Standard Methods:
  'birdseed'  quant-norm.sketch=50000,pm-only,birdseed
  'birdseed-dev'  quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev
  'birdseed-dev.force'  quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev.conf-threshold=1
  'birdseed-v1'  quant-norm.sketch=50000,pm-only,birdseed-v1
  'birdseed-v1.force'  quant-norm.sketch=50000,pm-only,birdseed-v1.conf-threshold=1
  'birdseed-v2'  quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2
  'birdseed-v2.force'  quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2.conf-threshold=1
  'birdseed.force'  quant-norm.sketch=50000,pm-only,birdseed.conf-threshold=1
  'brlmm'  quant-norm.sketch=50000,pm-only,brlmm.transform=ccs.K=4
  'brlmm-p'  quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
  'brlmm-p-plus'  quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05
  'brlmm-p-plus.force'  quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1
  'brlmm-p.force'  quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=1

Data transformations:
  rma-bg  Performs an RMA style background adjustment as described in Irizarry et al 2003.
  quant-norm  Class for doing quantile normalization. Can do sketch and full quantile (just set sketch to chip size or zero) and supports bioconductor compatibility.
  artifact-reduction  Class for artifact reduction.
  med-norm  Class for doing median normalization. Adjust intensities such that all chips have the same median (or average).
  adapter-type-norm  Class for doing adapter type normalization. Adjust intensities by adapter type.
  gc-bg  Subtract bacground based on median intensity of probes with similar GC content.
  intensity-reporter  Class for dumping intensity values to a file.
  no-trans  Placeholder chipstream that does no transformation

Pm Intensity Adjustments:
  pm-only  No adjustment. Just uses unmodified PM intensity values.
  pm-mm  Use mismatch probe as adjustment for perfect match. Has strength of being unbiased, but often the mismatch probe binds the match target.
  pm-gcbg  Do an adjustment based on the median intensity of probes with similar GC content.
  pm-sum  Add itensity of PM probe for other allele to PM probes.

Quantification Methods:
  plier  The PLIER (Probe Logarithmic Error Intensity Estimate) method produces an improved signal by accounting for experimentally observed patterns in feature behavior and handling error at the appropriately at low and high signal values. This version of PLIER differs from the previous version by the addition of a SafteyZero, NumericalTolerance, and FixPrecomputed. These options are intended to improve the stability of PLIER results when using precomputed feature reponse values. To get the older PLIER behavior set SafetyZero to 0.0, NumericalTolerance to 0.0, and FixPrecomputed to false.
  sea  The SEA (Simplified Expression Analysis) method provides a simple signal estimate, using the initialization algorithm from the PLIER (Probe Logarithmic Error Intensity Estimate) method and omitting the PLIER parameter fitting. SEA is useful for single chip signal estimation. The version of PLIER used by SEA differs from the previous version by the addition of a SafteyZero, NumericalTolerance, and FixPrecomputed. These options are intended to improve the stability of PLIER results when using precomputed feature reponse values. To get the older PLIER behavior set SafetyZero to 0.0, NumericalTolerance to 0.0, and FixPrecomputed to false.
  iter-plier  Do probe set quantification estimate by iteratively calling PLIER with the probes that best correlate with signal estimate. The version of PLIER used by IterPLIER differs from the previous version by the addition of a SafteyZero, NumericalTolerance, and FixPrecomputed. These options are intended to improve the stability of PLIER results when using precomputed feature reponse values. To get the older PLIER behavior set SafetyZero to 0.0, NumericalTolerance to 0.0, and FixPrecomputed to false.
  med-polish  Performs a median polish to estimate target and probe effects. Resulting summaries are in log2 space by default. Used in summary step of RMA as described in Irizarry et al 2003.
  dabg  Calculates the p-value that the intensities in a probeset could have been observed by chance in a background distribution. Used as a substitute for standard absent/present calls when mismatch probes are not available.
  avgdiff  Calculates the average measurement for a probeset using the MAS 4 average difference algorithm, namely the average difference between the pm and mm probe signal.
  median  Use the median of probes for a particular chip as the summary.

Analysis Streams:
  expr  Does expression summarization on probesets.
  pca-select  Determines PCA for probes and picks probes that are near the principal component as the probes to use for downstream analysis.
  spect-select  Picks probes that are similar to each other based on spectral cluster and normalized cut.
Q. What is a probe_id?
A. See the FAQ item on probe IDs for more info.
Q. The program died with an error message like "Assertion failed: A->probes.size() == 2, file ../DmListener.cpp." What does this mean?
A. This is symptomatic of trying to run BRLMM for a SNP with no MM probes. In its typical mode of running, BRLMM relies on DM to generate initial seed calls, and the DM algorithm requires MM probes.
Q. The program died with an error message like "DmListener::getGenoCall() - Can't find genotypes for name: SNP_A-1780432". What does this mean?
A. This is symptomatic of having specified the wrong chrX file for the analysis. In order to reduce the likelihood of accidentally using the wrong chrX file apt-probeset-genotype checks to make sure that all the SNPs specified in the chrX file are present on the chips being analyzed. If it finds a SNP present in the chrX file that is not identified in the CDF file it will die with the above message. Note that if you want to bypass the requirement of a chrX file you can use the --no-gender-force option.
Q. The program died and I got an error message saying "Killed". What does this mean and what can I do?
A. Linux has a "feature" whereby it will promise more memory than it actually has, in the hope that many programs won't actually be using all their memory at once. However, if Linux does run short of memory it will start killing programs arbitrarily. You can read more about Linux's OOM (out of memory) killer at LWN.net.
Q. Why does apt-probeset-genotype require information regarding SNPs on chromosomes X/Y/Mito?
A. The SNPs on chromosome X are evaluated separately for XX (female) and XY (male) individuals, as the intensity estimates for the males will generally be lower on X due to one missing chromosome. The prior is also adjusted to remove the het center, as XY individuals should only have hom calls on the X chromosome. For BRLMM analyses gender is estimated using the method employed in the GTYPE software: individuals are called male if less than 7.5% of the SNPs on X are called as hets by the initial DM calls using a .33 confidence threshold. For BRLMM-P gender is estimated by use of an Expectation Maximization (EM) algorithm on the PM probes for chrX SNPs to estimate the het rate.
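The BRLMM gender rule described above can be sketched as follows. This is only an illustration of the stated rule (male if less than 7.5% of chrX SNPs are called het at a .33 confidence threshold), not the GTYPE or APT code. The function name and the input layout are hypothetical, and two details are assumptions: that a lower DM confidence score means a more confident call, so a het counts only when its score is at or below the threshold, and that the denominator is all chrX SNPs supplied.

```python
def call_gender_from_dm(chrx_calls, het_cutoff=0.075, conf_threshold=0.33):
    """Estimate gender from initial DM calls on chrX SNPs.

    chrx_calls: iterable of (genotype, confidence) pairs, where genotype is
    'AA', 'AB' or 'BB'. Assumes lower confidence scores are better, so a
    call counts as het only when genotype == 'AB' and the score is at or
    below conf_threshold. Returns (gender, het_rate).
    """
    calls = list(chrx_calls)
    if not calls:
        raise ValueError("no chrX SNP calls supplied")
    het_count = sum(
        1 for genotype, conf in calls
        if genotype == "AB" and conf <= conf_threshold
    )
    het_rate = het_count / len(calls)
    gender = "male" if het_rate < het_cutoff else "female"
    return gender, het_rate
```

For example, a sample with 1 confident het call out of 20 chrX SNPs has a het rate of 5% and would be called male under this rule, while 5 confident het calls out of 20 (25%) would give a female call.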
Q. How is the mask section in the CEL file used?
A. It is not. The contents of this section of the CEL file are ignored.
Q. How can I find out more information about the analysis string?
A. Start with the included manual for apt-probeset-genotype. Usage information is also provided if you run apt-probeset-genotype without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:
apt-probeset-genotype --explain brlmm-p
Q. Wild cards do not work on windows. For example:
apt-probeset-genotype ... *.CEL
A. APT relies on the command shell to do the wild card expansion (e.g. the bash shell on *NIX systems). The Windows shell does not do wild card expansion, so there is no wild card expansion for APT when run from the Windows shell. You may want to try a different Windows shell or perhaps bash via Cygwin. Alternatively, use the --cel-files option to specify the CEL files for analysis.
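One way to work around the missing wild card expansion is to generate the file list for --cel-files with a small script. The sketch below assumes the list file is plain text with a header line reading cel_files followed by one CEL file path per line; check the program's usage output for the exact format your version expects. The function name, the file names, and the default pattern are hypothetical.

```python
import glob
import os


def write_cel_list(cel_dir, list_path, pattern="*.CEL"):
    """Write a CEL file list suitable for passing to --cel-files.

    Assumed format: first line 'cel_files', then one path per line.
    Returns the list of paths written.
    """
    paths = sorted(glob.glob(os.path.join(cel_dir, pattern)))
    with open(list_path, "w") as handle:
        handle.write("cel_files\n")
        for path in paths:
            handle.write(path + "\n")
    return paths
```

The resulting file would then be passed on the command line in place of the wild card, for example: apt-probeset-genotype ... --cel-files cel_list.txt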
This section explains the contents of CHP files for the various algorithms. For details on the formats or for an explanation of why the XDA CHP format is not supported for some chip types, see above.
The XDA CHP file format is only supported for the BRLMM algorithm applied to the 100K or 500K arrays. Historically the genotyping XDA CHP file is closely tied to the DM model and while BRLMM uses the same format for backward compatibility it is important to note that the interpretation of some fields is different. Below are the names of the fields and corresponding BRLMM values that are stored in them.
The following parameters are saved in the CCHPFileHeader object:
A complete explanation of the XDA CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.
The AGCC CHP format consists of a header followed by a data section. The header section contains a large amount of information including the software version and the full set of parameters used in the clustering analysis. The data section consists of a matrix with a row for each SNP. The columns are:
Use of the clustering_space_x_value and clustering_space_y_value fields allows for plotting the data in the space that was used to perform the clustering. For BRLMM and BRLMM-P the x-value is 'transformed contrast' and the y-value is 'signal strength' - see the BRLMM and BRLMM-P whitepapers for more detail. For Birdseed (see http://www.broad.mit.edu/mpg/birdsuite/) this is A-signal vs. B-signal (linear scale, post quantile normalization and allele-specific median-polish).
A complete explanation of the AGCC CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.
While aliases for common analyses, such as brlmm with default parameters, are provided, it is also possible to construct custom analyses on the command line. There are both program options and analysis parameters that can be set to affect the results. Most people are familiar with the standard method for setting program options, but the specification of the analysis method and its parameters in apt-probeset-genotype works a little differently. Custom parameters are set by supplying a text representation of the desired analysis and its parameters. This enables flexibility, as each piece of an analysis is self-contained and the pieces can be (almost) arbitrarily combined. Note that when using a custom analysis rather than an alias it is necessary to specify the entire analysis; custom parameters cannot be passed to the alias. For example, if you wanted to change the number of iterations brlmm performs you would have to specify 'quant-norm.sketch=50000,pm-only,brlmm.iterations=1' rather than just typing 'brlmm.iterations=1'.
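Because parameters cannot be attached to an alias, a helper script can expand the alias to its full form before adding them. The sketch below is a hypothetical convenience function, not part of APT. The only alias it knows is brlmm, whose full default form is taken from this manual, and it assumes the extra parameters belong to the last step of the analysis (the quantification method).

```python
# Full form of the default brlmm alias, as given in this manual.
ALIASES = {"brlmm": "quant-norm.sketch=50000,pm-only,brlmm"}


def customize_alias(alias, params):
    """Expand an alias and append key=value parameters to its final step.

    params: dict of parameter name -> value for the quantification method.
    Returns the full analysis string to pass on the command line.
    """
    full = ALIASES[alias]
    if not params:
        return full
    suffix = ".".join(f"{key}={value}" for key, value in params.items())
    return f"{full}.{suffix}"
```

Calling customize_alias("brlmm", {"iterations": 1}) yields the full string from the example above, which can then be supplied as the analysis specification.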
The current full default brlmm analysis is 'quant-norm.sketch=50000,pm-only,brlmm'. An analysis can contain multiple chipstream modules (in this case a single quant-norm) separated by commas, and the last two entries are the pm adjuster (pm-only) and the quantification method (brlmm). Parameters to a particular step in the analysis are supplied as key=value pairs separated by periods. For example, 'quant-norm.sketch=50000' indicates that the chips should be quantile normalized and that a sketch (subset of the total data) of size 50000 should be used to do the normalization. Using a sketch can significantly reduce the amount of memory needed with minimal impact on normalization values. To do quantile normalization with just the PM probes and resolve ties in the same manner as Bioconductor's RMA version of quantile normalization you would specify 'quant-norm.sketch=50000.bioc=true.usepm=true'. All of the possible parameters can be seen by using the --explain option in conjunction with the name of the module (e.g. apt-probeset-genotype --explain quant-norm).
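The structure described above (steps separated by commas, key=value parameters separated by periods) can be made concrete with a small parser. This is an illustrative sketch, not APT's own parser. The function name is hypothetical, and the rule of splitting on a period only when it is followed by a key and an equals sign is an assumption added so that values such as .9 are kept intact.

```python
import re

# Split on a period only when it starts a new "key=" parameter, so that a
# value such as ".9" is not broken apart. This rule is an assumption.
_PARAM_SPLIT = re.compile(r"\.(?=[A-Za-z][\w-]*=)")


def parse_analysis(spec):
    """Parse an analysis string into a list of (step_name, params) tuples."""
    steps = []
    for chunk in spec.split(","):
        parts = _PARAM_SPLIT.split(chunk.strip())
        name = parts[0]
        params = {}
        for part in parts[1:]:
            key, value = part.split("=", 1)
            params[key] = value
        steps.append((name, params))
    return steps


if __name__ == "__main__":
    for name, params in parse_analysis("quant-norm.sketch=50000,pm-only,brlmm"):
        print(name, params)
```

By the convention described above, the last two entries of the returned list are the pm adjuster and the quantification method, and any entries before them are chipstream modules.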
A few examples of custom analyses:
'pm-only,brlmm.transform=rvt' - No normalization, use rvt space for clustering in brlmm.
'med-norm,pm-mm,brlmm.het-mult=.9' - Do a median normalization, use a PM-MM adjustment for probes and a het multiplier of .9 to try and balance hom/het calls.
'rma-bg,quant-norm.sketch=50000.usepm=true.bioc=true,pm-only,brlmm.K=4.transform=CCS' - Do an RMA-style background correction and quantile normalization using a subset of 50000 data points, followed by brlmm in CCS (contrast centers space) with K = 4.
Use the --explain option to get more information on what parameters are available for the various methods. For example, "--explain brlmm", "--explain brlmm-p", and "--explain birdseed".
There are a number of different transformations implemented for different spaces, which can be specified via the transform parameter to brlmm and are detailed below. For all of these transformations, A and B denote the intensities of the A and B alleles respectively, as estimated by the quantification method (such as plier or RMA). X and Y denote the new coordinates that A and B will be transformed into.