MANUAL: apt-probeset-genotype (1.17.0)

Introduction

apt-probeset-genotype is a program for making genotype calls from Affymetrix SNP microarrays. It currently implements three different genotype calling algorithms: BRLMM, BRLMM-P, and Birdseed.

APT implements the Birdseed v1 algorithm, developed in collaboration with the Broad Institute, which Affymetrix has validated and supports for use with the SNP 6.0 array. APT also implements the BRLMM-P algorithm, which Affymetrix has validated and supports for the SNP 5.0 array.

Additionally, APT implements a newer version of Birdseed, accessed with the --analysis birdseed-v2 option, and the latest development edition of the Birdseed algorithm, accessed with the --analysis birdseed-dev option. Unlike the supported methods described above, the birdseed-v2 and birdseed-dev methods are not currently supported by Affymetrix. Moreover, the SNP priors and lists of "qualified" SNPs for Birdseed-dev are not currently available from Affymetrix for either SNP 5.0 or 6.0. Further information and support files for Birdseed-dev and Birdseed-v2 are available from the Broad Institute (http://www.broad.mit.edu/mpg/birdsuite/).

Future APT updates are expected to migrate improvements currently available via the birdseed-dev option into methods supported in the same manner as the rest of the APT methods.

Because birdseed, brlmm, and brlmm-p are model-based algorithms, they need to be run on multiple CEL files at once to estimate probe effect and SNP cluster parameters. For Mapping 500K data it is advisable to run on at least 50 distinct samples (excluding replicates) and ideally on about 100. For Genome-Wide Human SNP 5.0 and 6.0 arrays it is advisable to cluster with at least 44 genetically distinct samples, though adding more samples continues to help, in particular for correctly calling rare genotypes.
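The sample-count guidance above can be turned into a simple pre-flight check before a long run. The following is a minimal sketch, not part of APT: the helper names (min_samples_for, check_cel_count) and the array labels (500k, snp5, snp6) are invented for illustration, and the demo uses empty placeholder files rather than real CEL data. Counting files cannot tell replicates from genetically distinct samples, so a passing check is only a lower bound.

```shell
#!/usr/bin/env bash
# Pre-flight check: compare the number of CEL files in a directory with the
# minimum advised above (50 for Mapping 500K, 44 for SNP 5.0 and 6.0).

# Hypothetical helper: minimum advised sample count per array label.
min_samples_for() {
  case "$1" in
    500k)      echo 50 ;;
    snp5|snp6) echo 44 ;;
    *)         echo "unknown array type: $1" >&2; return 1 ;;
  esac
}

# check_cel_count <array-label> <cel-dir>
check_cel_count() {
  local min n
  min=$(min_samples_for "$1") || return 2
  n=$(find "$2" -maxdepth 1 -type f -iname '*.cel' | wc -l)
  n=$((n))   # normalise whitespace that some wc implementations add
  if [ "$n" -lt "$min" ]; then
    echo "WARNING: $n CEL files found; at least $min distinct samples are advised for $1"
    return 1
  fi
  echo "OK: $n CEL files found (minimum advised for $1: $min)"
}

# Demo on 45 empty placeholder files (not real CEL data).
demo_dir=$(mktemp -d)
for i in $(seq 1 45); do : > "$demo_dir/sample_$i.CEL"; done
check_cel_count snp6 "$demo_dir"
check_cel_count 500k "$demo_dir" || true
```

In this demo the 45 placeholder files pass the SNP 6.0 check and trigger the warning for Mapping 500K.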

Quick Start

We illustrate the most basic way to run apt-probeset-genotype with some examples.

The basic requirements for a run of apt-probeset-genotype are:

WARNING: apt-probeset-genotype will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name.

WARNING: Model files are algorithm specific. Birdseed model files must be used with the Birdseed analysis method and BRLMM-P model files with the brlmm-p method.

NOTE: On Windows the DOS prompt does not support wildcard expansion, so the preferred method is to supply a text file listing the paths to the CEL files via the '--cel-files' option (see below for details of the file format).

NOTE: Unlike Unix shells, the Windows DOS prompt does not allow a command to be continued on the next line with the '\' character. On Windows, omit the '\' characters in the examples below and enter everything on a single line.
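On a Unix system, a file for the '--cel-files' option can be generated rather than typed by hand. The sketch below follows the format given in the options listing (first line 'cel_files', then one CEL file per line). The helper name write_cel_list is invented here, and the demo uses empty placeholder files in a temporary directory.

```shell
#!/usr/bin/env bash
# Build a --cel-files list: header line 'cel_files', then one CEL path per line.
# write_cel_list is a hypothetical helper name, not an APT command.
write_cel_list() {
  local cel_dir=$1 out=$2
  {
    echo "cel_files"
    find "$cel_dir" -maxdepth 1 -type f -iname '*.cel' | sort
  } > "$out"
}

# Demo with two empty placeholder files (not real CEL data).
work=$(mktemp -d)
: > "$work/NA0001.CEL"
: > "$work/NA0002.CEL"
write_cel_list "$work" "$work/cel_file_list.txt"
cat "$work/cel_file_list.txt"
```

The resulting file can then be passed as '--cel-files cel_file_list.txt' in place of the '*.CEL' wildcard.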

Axiom™ GT1 (Axiom™ chips)

The command below runs the Axiom™ GT1 algorithm on Axiom™ arrays. For full details on the use of Axiom™ GT1 in apt-probeset-genotype refer to the vignette on genotype clustering for Axiom™ arrays.

  apt-probeset-genotype \
    --analysis-files-path /library/file/path \
    --xml-file Axiom_GW_Hu_SNP.r2.apt-probeset-genotype.AxiomGT1.xml \
    --out-dir out \
    --cel-files cel_file_list.txt

Running Birdseed (GenomeWide SNP 6.0 chips)

On unix systems a basic command using the default parameters to do a run on SNP6.0 data using birdseed (v2) would look like:

apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_6.cdf \
  --set-gender-method cn-probe-chrXY-ratio \
  --chrX-probes GenomeWideSNP_6.chrXprobes \
  --chrY-probes GenomeWideSNP_6.chrYprobes \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --read-models-birdseed GenomeWideSNP_6.birdseed-v2.models \
  -a birdseed-v2 \
  *.CEL

Note in particular the use of the option "-a birdseed-v2" which specifies that the Birdseed calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 and 6.0 chips).

Also see the important notes regarding birdseed-v1, birdseed-v2, and birdseed-dev in the introduction above.

The following will give you the older birdseed (v1) behavior:

apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_6.cdf \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --read-models-birdseed GenomeWideSNP_6.birdseed.models \
  -a birdseed \
  *.CEL

Running BRLMM-P (GenomeWide SNP 5.0 chips)

On unix systems a basic command using the default parameters to do a run on SNP5.0 data would look like:

apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_5.cdf \
  --chrX-snps GenomeWideSNP_5.chrx \
  --read-models-brlmmp GenomeWideSNP_5.models \
  -a brlmm-p \
  *.CEL

Note in particular the use of the option "-a brlmm-p" which specifies that the BRLMM-P calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 chips).

Running BRLMM (Mapping 500K and preceding chips)

On unix systems a basic command using the default parameters to do a run on Mapping 500K data would look like:

apt-probeset-genotype \
  -o results_dir \
  -c Mapping250K_Sty.cdf \
  --chrX-snps Mapping250K_Sty.chrx \
  *.CEL

The output will consist of a report file with some summary statistics about each chip analyzed and a pair of tab-delimited text files with suffixes .calls.txt and .confidences.txt containing the genotype calls and their associated confidences.

On Windows a command equivalent to the example above for Mapping 500K would look like:

apt-probeset-genotype -c Mapping250K_Sty.cdf --chrX-snps Mapping250K_Sty.chrx  -o results_dir  --cel-files cel_file_list.txt

For Mapping 500K chips apt-probeset-genotype runs 100 CELs in 1-2 hours on a 3 GHz machine with 2 GB of RAM using local disk.

Analyzing a subset of the SNPs

Building upon the examples above, here is an example in which only a subset of SNPs are analyzed and the results are written to a text table of genotype calls and a text table of call confidences. The subset of SNPs to be analyzed is specified in a tab-delimited text file called subset_sty.txt, which must contain a column named 'probeset_id'.

apt-probeset-genotype \
  -s subset_sty.txt \
  -c Mapping250K_Sty.cdf \
  --chrX-snps Mapping250K_Sty.chrx \
  -o results_dir \
  --list-sample \
  *.CEL

Note: The --list-sample option is required when subsets are used with the (default) BRLMM analyses and should be omitted for other analysis types. For BRLMM, by default the number of probes used for generating priors is 10,000. If a subset of fewer than 10,000 probes is used, use the --prior-size option to specify a number less than or equal to the subset size.

See the apt-probeset-summarize manual for a more complete example of running an analysis on a compute farm.
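As an illustration of the note above, the following sketch writes a small subset file with the required 'probeset_id' header and works out a --prior-size value that does not exceed the subset size. The SNP IDs are made-up placeholders, the file is written to a temporary directory, and the script only prints the extra options; it does not run apt-probeset-genotype.

```shell
#!/usr/bin/env bash
# Write a subset file with the required 'probeset_id' header line.
# The probeset IDs below are placeholders invented for this example.
subset_dir=$(mktemp -d)
subset_file="$subset_dir/subset_sty.txt"
printf '%s\n' probeset_id SNP_A-0000001 SNP_A-0000002 SNP_A-0000003 > "$subset_file"

# Count probesets, excluding the header line.
n_probesets=$(( $(wc -l < "$subset_file") - 1 ))

# 10,000 is the BRLMM default number of probes used for priors (per the note above).
default_prior=10000
if [ "$n_probesets" -lt "$default_prior" ]; then
  prior_args="--prior-size $n_probesets"
else
  prior_args=""
fi

echo "subset has $n_probesets probesets; extra options: --list-sample $prior_args"
```

For a BRLMM run, the printed options would be added to the command line alongside '-s subset_sty.txt'.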

A note about the CHP file format

In previous versions of apt-probeset-genotype the default output format for genotype calls was the XDA CHP format (also known as GCOS CHP format). For the GenomeWide SNP 5.0, SNP 6.0 and subsequent WGSA products the use of the XDA CHP format is strongly discouraged; we recommend the newer AGCC CHP format instead. To help avoid accidental use of the XDA CHP format, the defaults for output format have been changed to produce tab-delimited text tables of calls and confidences. The creation of the text table output can be suppressed with the --no-table-output option, and the two CHP output formats can be selected with the --xda-chp-output and --cc-chp-output options.

The XDA CHP format is discouraged for the GenomeWide SNP 5.0 chips because it doesn't contain entries for SNP IDs; the identity of a SNP is inferred from its order in the file. In the case of the GenomeWide SNP 5.0 chips there are some SNPs that are not part of the default library file which some advanced users may choose to explore. This leads to the possibility of generating CHP files containing different SNP lists, something not well supported by the XDA CHP format. The AGCC CHP format has a slot for SNP IDs and thus is safer to use with chips for which users may be looking at different SNP lists.

For SNP 6.0 XDA CHP file format output is not allowed at all.

Details on the contents of the CHP files for various calling algorithms can be found below, and a full description of the XDA and AGCC CHP formats can be found in a local copy of the Affymetrix Developer's Network file format documentation.
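The CHP format guidance in this section can be summarized as a small wrapper function for scripts that launch runs on several array types. This is a sketch only: chp_output_flag and the array labels (500k, snp5, snp6) are names invented for this example. The two options it prints, --cc-chp-output and --xda-chp-output, are the ones described above.

```shell
#!/usr/bin/env bash
# chp_output_flag <array-label> <requested-format>
# Hypothetical helper: prints the CHP output option to pass for a given array
# label and requested format (agcc or xda), following the guidance above.
chp_output_flag() {
  local family=$1 format=$2
  case "$format" in
    agcc) echo "--cc-chp-output" ;;
    xda)
      case "$family" in
        500k) echo "--xda-chp-output" ;;
        snp5) echo "note: XDA CHP is strongly discouraged for SNP 5.0" >&2
              echo "--xda-chp-output" ;;
        *)    echo "error: XDA CHP output is not available for $family" >&2
              return 1 ;;
      esac ;;
    *) echo "error: unknown format $format" >&2; return 1 ;;
  esac
}

chp_output_flag snp6 agcc
chp_output_flag 500k xda
# XDA is refused for SNP 6.0, so fall back to the AGCC format.
chp_output_flag snp6 xda || echo "falling back to: $(chp_output_flag snp6 agcc)"
```

Refusing XDA for any label other than 500k and snp5 is a conservative choice made for this sketch, in line with the statement that the format is not supported beyond the Mapping 500K chips.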

Support

Support for APT is handled through the Affymetrix Developer Network. Specifically, questions, problems, feature requests, and other inquiries should be made through either the APT User Forum or the Developer Network email address, devnet@affymetrix.com. (If you get an Internal Server Error when accessing the forum, try clearing your cookies for affymetrix.com.) To get email updates about APT or to view previous APT announcements, see the APT User Forum.

APT is not supported through the Affymetrix call center, Field Application Specialists, or the standard Affymetrix Technical support channels.

If you encounter an issue please make sure to collect the following information and report the problem to devnet@affymetrix.com

The Report File

apt-probeset-genotype creates a summary report file in the output directory with file name extension '.report.txt'. The report file contains some summary information about each chip analyzed and is useful for getting a quick overview of the CELs analyzed. The format of the file is tab-delimited text with a header line followed by a line for each CEL file analyzed. All columns are explained below; most users will be mainly interested in the first few entries. The additional entries are provided as potentially useful metrics to track and identify outlier chips and are expected to be mainly of interest to advanced users. The column entries are:

  1. cel_files: CEL file name.
  2. computed_gender: Estimated gender.
  3. call_rate: BRLMM/BRLMM-P/Birdseed call rate at the default or user-specified threshold.
  4. het_rate: Percentage of SNPs called AB (i.e. the heterozygosity).
  5. hom_rate: Percentage of SNPs called AA or BB (i.e. the homozygosity).
  6. cluster_distance_mean: Average distance to the cluster center for the called genotype.
  7. cluster_distance_stdev: Standard deviation of the distance to the cluster center for the called genotype.
  8. raw_intensity_mean: Average of the raw PM probe intensities.
  9. raw_intensity_stdev: Standard deviation of the raw PM probe intensities.
  10. allele_summarization_mean: Average of the allele signal estimates (log2 scale).
  11. allele_summarization_stdev: Standard deviation of the allele signal estimates (log2 scale).
  12. allele_deviation_mean: Average of the absolute difference between the log2 allele signal estimate and its median across all chips.
  13. allele_deviation_stdev: Standard deviation of the absolute difference between the log2 allele signal estimate and its median across all chips.
  14. allele_mad_residuals_mean: Average of the median absolute deviation (MAD) between observed probe intensities and probe intensities fitted by the model.
  15. allele_mad_residuals_stdev: Standard deviation of the median absolute deviation (MAD) between observed probe intensities and probe intensities fitted by the model.
  16. em-cluster-chrX-het-contrast_gender: Gender estimate based on estimated heterozygosity on chrX.
  17. em-cluster-chrX-het-contrast_gender_chrX_het_rate: Estimated heterozygosity on chrX.
  18. cn-probe-chrXY-ratio_gender_meanX: Average intensity of chrX CN probes.
  19. cn-probe-chrXY-ratio_gender_meanY: Average intensity of chrY CN probes.
  20. cn-probe-chrXY-ratio_gender_ratio: Ratio of average chrY CN probe intensity to average chrX CN probe intensity.
  21. cn-probe-chrXY-ratio_gender: Gender estimate based on ratio of chrY to chrX average CN probe intensities.
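Because the report file is tab-delimited with a header line, the first few columns can be pulled out with standard Unix tools. The sketch below looks columns up by header name and flags chips whose call rate falls below a cutoff. Several things here are assumptions: the sample rows are invented for illustration, the cutoff of 95 is arbitrary, call_rate is assumed to be a percentage, and lines starting with '#' are assumed to be header comments that can be skipped.

```shell
#!/usr/bin/env bash
# Build a tiny stand-in report file (invented values, not real results).
report=$(mktemp)
printf 'cel_files\tcomputed_gender\tcall_rate\thet_rate\thom_rate\n' >  "$report"
printf 'a.CEL\tfemale\t98.7\t27.1\t71.6\n'                       >> "$report"
printf 'b.CEL\tmale\t91.2\t25.0\t66.2\n'                         >> "$report"

# summarize_report <report-file> <min-call-rate>
# Prints cel_files, computed_gender, call_rate and a LOW/ok flag per chip.
summarize_report() {
  awk -F'\t' -v min="$2" '
    /^#/ { next }                                   # skip comment lines
    !seen { for (i = 1; i <= NF; i++) col[$i] = i   # first data line is the header
            seen = 1; next }
    { flag = ($col["call_rate"] + 0 < min) ? "LOW" : "ok"
      print $col["cel_files"], $col["computed_gender"], $col["call_rate"], flag }
  ' OFS='\t' "$1"
}

summarize_report "$report" 95
```

Looking columns up by name rather than by position keeps the script working if additional columns appear in the report file.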

Options:

apt-probeset-genotype - program for determining genotype calls
from Affymetrix SNP microarrays. The model based algorithms for
making calls (brlmm/brlmm-p/birdseed) require multiple cel files
to be analyzed at once to learn the parameters for each SNP. 

usage:
  BRLMM (500K arrays):
    apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
         -o out-dir/ *.cel

  BRLMM-P (GenomeWide SNP 5.0 arrays):
    apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
         -o out-dir/ -a brlmm-p --read-models-brlmmp chip.models \
          *.cel

  Birdseed (GenomeWide SNP 6.0 arrays):
    apt-probeset-genotype -c chip.cdf --special-snps chip.specialSNPs \
         -o out-dir/ -a birdseed --read-models-birdseed chip.birdseed.models \
          *.cel

  See the apt-probeset-genotype manual for more information about
  birdseed including the latest improvements from The Broad.


options:
 Common Options (not used by all programs)
   -h, --help                           Display program options and extra
                          documentation about possible analyses. See
                          -explain for information about a specific
                          operation. [default 'false'] 
   -v, --verbose How verbose to be with status messages 0 -
                          quiet, 1 - usual messages, 2 - more
                          messages. [default '1'] 
     --console-off Turn off the default messages to the 
                          console but not logging or sockets. 
                          [default 'false'] 
     --use-socket Host and port to print messages over in
                          localhost:port format [default ''] 
     --version Display version information. [default
                          'false'] 
   -f, --force Disable various checks including chip 
                          types. Consider using --chip-type option
                          rather than --force. [default 'false'] 
     --throw-exception Throw an exception rather than calling
                          exit() on error. Useful for debugging. This
                          option is intended for command line use
                          only. If you are wrapping an Engine and 
                          want exceptions thrown, then you should 
                          call Err::setThrowStatus(true) to ensure
                          that all Err::errAbort() calls result in an
                          exception. [default 'false'] 
     --analysis-files-path Search path for analysis library files. 
                          Will override AFFX_ANALYSIS_FILES_PATH
                          environment variable. [default ''] 
     --xml-file Input parameters in XML format (Will
                          override command line settings). [default
                          ''] 
     --temp-dir Directory for temporary files when working
                          off disk. Using network mounted drives is
                          not advised. When not set, the output 
                          folder will be used. The default is 
                          typically the output directory or the
                          current working directory. [default ''] 
   -o, --out-dir Directory for output files. Defaults to
                          current working directory. [default '.'] 
     --log-file The name of the log file. Generally 
                          defaults to the program name in the out-dir
                          folder. [default ''] 
 Engine Options (Not used on command line)
     --command-line The command line executed. [default ''] 
     --exec-guid The GUID for the process. [default ''] 
     --program-name The name of the program [default ''] 
     --program-company The company providing the program [default
                          ''] 
     --program-version The version of the program [default ''] 
     --program-cvs-id The CVS version of the program [default ''] 
     --version-to-report The version to report in the output files.
                          [default ''] 
     --free-mem-at-start How much physical memory was available when
                          the engine run started. [default '0'] 
     --meta-data-info Meta data in key=value pair that will be
                          output in headers. [default ''] 
 Input Options
     --cel-files Text file specifying cel files to process,
                          one per line with the first line being
                          'cel_files'. [default ''] 
   -c, --cdf-file File defining probe sets. Use either
                          --cdf-file or --spf-file. [default ''] 
     --spf-file File defining probe sets in spf (simple
                          probe format) which is like a text cdf 
                          file. [default ''] 
     --chrX-snps File containing snps on chrX
                          (non-pseudoautosomal region). [default ''] 
     --special-snps File containing all snps of unusual copy
                          (chrX,mito,Y) [default ''] 
     --chrX-probes File containing probe_id (1-based) of 
                          probes on chrX. Used for copy number probe
                          chrX/Y ratio gender calling. [Experimental]
                          [default ''] 
     --chrY-probes File containing probe_id (1-based) of 
                          probes on chrY. Used for copy number probe
                          chrX/Y ratio gender calling. [Experimental]
                          [default ''] 
     --chrZ-probes File containing probe_id (1-based) of 
                          probes on chrZ. Used for copy number probe
                          chrW/Z ratio avian gender calling.
                          [Experimental] [default ''] 
     --chrW-probes File containing probe_id (1-based) of 
                          probes on chrW. Used for copy number probe
                          chrW/Z ratio avian gender calling.
                          [Experimental] [default ''] 
   -s, --probeset-ids Tab delimited file with column 
                          'probeset_id' specifying probesets to
                          genotype. [default ''] 
     --probeset-ids-reported Tab delimited file with column 
                          'probeset_id' specifying probesets to
                          report. This should be a subset of those
                          specified with --probeset-ids if that 
                          option is used. [default ''] 
     --probe-class-file File containing probe_id (1-based) of 
                          probes and a 'class' designation. Used to
                          compute mean probe intensity by class for
                          report file. [default ''] 
     --chip-type Chip types to check library and CEL files
                          against. Can be specified multiple times.
                          The first one is propagated as the chip 
                          type in the output files. Warning, use of
                          this option will override the usual check
                          between chip types found in the library
                          files and cel files. You should use this
                          option instead of --force when possible.
                          [default ''] 
     --annotation-file Annotation file. [default ''] 
     --genotype-markers-cn-file Tab delimited file with copy number calls
                          for genotype probesets within copy number
                          regions [default ''] 
     --file5-compact Should we output results in a compact file5
                          output. [default 'false'] 
     --sqlite-output Should output some results in sqlite3
                          format? (Linux/OS X only) [default 'false'] 
 Output Options
     --table-output Output matching matrices of tab delimited
                          genotype calls and confidences. [default
                          'true'] 
     --output-forced-calls Output a separate file with forced calls.
                          [default 'false'] 
     --output-context Output a separate file with the allele
                          context used. This is only relevant for
                          marker type probesets which have multiple
                          groups of probes for each allele based on
                          the context of nearby SNPs. [default
                          'false'] 
     --output-probabilities Output a separate file with comma-separated
                          probabilities for BB,AB,AA,Ocean for each
                          probeset and sample [default 'false'] 
     --prob-file-sample-count Number of samples per probability file.
                          These files can be large, especially when
                          running 1000's of samples. [default '500'] 
     --cc-chp-output Output resulting calls in directory called
                          'cc-chp' under out-dir. This makes one AGCC
                          Multi Data CHP file per cel file analyzed.
                          [default 'false'] 
     --xda-chp-output Output resulting calls in directory called
                          'chp' under out-dir. This makes one GCOS 
                          XDA CHP file per cel file analyzed. Note
                          that this format is not supported beyond 
                          the Mapping500K chips, for subsequent chips
                          look at the CC CHP format instead. [default
                          'false'] 
     --cc-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --xda-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --summaries Output the summary values from the
                          quantification method for each allele. For
                          brlmm-p this will also write a file of
                          transformed summary values in contrast 
                          space used in the clustering. [default
                          'false'] 
     --summaries-only This is a hack to get summary files for
                          Axiom. It has the same effect as the
                          --summaries option, but it does no
                          genotyping. Output of CHP, calls,
                          confidences, normalized-summary, and report
                          files is suppressed with this option.
                          [default 'false'] 
     --report-file Over-ride the default report file name.
                          [default ''] 
 Analysis Options
   -a, --analysis String representing analysis pathway
                          desired. For example:
                          'quant-norm.sketch=50000,pm-only,brlmm'.
                          [default 'brlmm'] 
     --qmethod-spec Quantification Method to use for 
                          summarizing alleles. [default
                          'plier.optmethod=1'] 
     --read-models-brlmm File to read precomputed BRLMM snp specific
                          models from. [default ''] 
     --read-models-brlmmp File to read precomputed BRLMM-P snp
                          specific models from. [default ''] 
     --read-models-birdseed File to read precomputed birdseed snp
                          specific models from. [default ''] 
     --write-models Should we write snp specific models out for
                          analysis? [experimental] [default 'false'] 
     --db-from-prior-models File to write prior snp models to for 
                          random access. [default ''] 
     --db-from-posterior-models File to write posterior snp models to for
                          random access. [default ''] 
     --feat-effects Output feature effects when available. By
                          convention med-polish feature effects have
                          total probeset median added to them, see 
                          RMA module for details [default 'false'] 
     --writeOldStyleFeatureEffectsFile Boolean value to determine whether or not
                          old style feature effects files are 
                          written. [default 'false'] 
     --feat-eff-remove-allele-suffix Remove the -A and -B suffix from probeset
                          name added during genotype process [default
                          'false'] 
     --use-feat-eff File defining a plier feature effect for
                          each probe. Note that precomputed effects
                          should only be used for an appropriately
                          similar analysis (i.e. feature effects for
                          pm-only may be different than for pm-mm).
                          [default ''] 
     --feat-details Output the feature details (usually
                          residuals) from the quantification method 
                          if available. [default 'false'] 
     --target-sketch File specifying a target distribution to 
                          use for quantile normalization. [default 
                          ''] 
     --write-sketch Write the quantile normalization
                          distribution (or sketch) to a file for 
                          reuse with target-sketch option. [default
                          'false'] 
     --dm-thresh Minimum DM p-value to seed clusters with.
                          [default '.17'] 
     --reference-profile File specifying reference chip profile.
                          [default ''] 
     --write-profile Write the reference chip profile to a file
                          for reuse. [default 'false'] 
     --dm-hetmult DM hetmultiplier to balance het/hom calls,
                          additive to log likelihood. [default '0'] 
     --prior-size How many probesets to use for determining
                          prior. [default '0'] 
     --list-sample Only sample for prior from list specified
                          via --probeset-ids, not entire chip.
                          [default 'false'] 
     --read-priors-brlmm File to load BRLMM priors from. Prior 
                          format is tab separated id, center, var, 
                          and center.var. [default ''] 
     --write-prior Write prior out to file in output-dir.
                          [default 'false'] 
     --norm-size Do contrast normalization using a sample of
                          this many snps (brlmm-p) [default '0'] 
     --write-norm Write covariate norm fcns to file [default
                          'false'] 
     --set-analysis-name Explicitly set the analysis name. This
                          affects output file names (ie prefix) and
                          various meta info. [default ''] 
     --include-quant-in-report-file-name Include the quant method name in the
                          expression report files. [default 'false'] 
 Gender Options
     --set-gender-method Explicitly force the use of a particular
                          gender method for genotype calling. Valid
                          values include: cn-probe-chrXY-ratio,
                          cn-probe-chrZW-ratio, dm-chrX-het-rate,
                          em-cluster-chrX-het-contrast, 
                          user-supplied, and none. If you are 
                          supplying seed genotype calls, you can also
                          use supplied-genotypes-chrX-het-rate. When
                          not set, the default behavior depends on 
                          the analysis. [default ''] 
     --read-genders Explicitly read genders from a file.
                          [default ''] 
     --read-inbred Read penalty for hets by level of 
                          inbreeding per sample. [default ''] 
     --no-gender-force Perform analysis even without a suitable
                          gender method for genotype calling. 
                          [default 'false'] 
     --em-gender Enable EM Gender calling if special-snps or
                          chrX-snp file is provided. [default 'true'] 
     --female-thresh Threshold for calling females when using
                          cn-probe-chrXY-ratio or 
                          cn-probe-chrZW-ratio method. [default
                          '0.48'] 
     --male-thresh Threshold for calling males when using
                          cn-probe-chrXY-ratio or
                          cn-probe-chrZW-ratio method. [default 
                          '0.71'] 
     --zw-gender-calling Handles case in which ZZ is male and ZW is
                          female. If unset, then internally set to
                          true when cn-probe-chrZW-ratio 
                          gender-method is used. [default 'false'] 
 Misc Options
     --explain Explain a particular operation (i.e.
                          --explain brlmm or --explain brlmm-p).
                          [default ''] 
 Advanced Options
     --kill-list Do not use the probes specified in file for
                          computing results. [experimental] [default
                          ''] 
     --dm-out Output any initial seed calls used by BRLMM
                          (seed default is DM calls). Only relevant
                          for BRLMM. [default 'false'] 
     --all-types Try and analyze all probeset types rather
                          than just genotyping. [Experimental]
                          [default 'false'] 
     --genotypes File to read seed genotypes from instead of
                          using DM to generate. [experimental]
                          [default ''] 
     --select-probes Output estimates of which probes are most
                          accurate [default 'false'] 
     --call-coder-max-alleles For encoding/decoding calls, the max number
                          of alleles per marker to allow. [default
                          '6'] 
     --call-coder-type The data size used to encode the call.
                          [default 'UCHAR'] 
     --call-coder-version The version of the encoder/decoder to use
                          [default '1.0'] 
 Execution Control Options
     --use-disk Store CEL intensities to be analyzed on
                          disk. [default 'true'] 
     --disk-cache Size of memory cache when working off disk
                          in megabytes. [default '50'] 
 A5 output options
     --a5-global-file Filename for the A5 global output file.
                          [Experimental] [default ''] 
     --a5-global-file-no-replace Append or create rather than replace.
                          [Experimental] [default 'false'] 
     --a5-group Group name where to put results in the A5
                          output files. [Experimental] [default ''] 
     --a5-calls Output the genotype calls and confidences 
                          in A5 format. [Experimental] [default
                          'false'] 
     --a5-calls-use-global Use the global A5 file for calls and
                          confidences.[Experimental] [default 
                          'false'] 
     --a5-summaries Output the summary values from the
                          quantification method for each allele in A5
                          format. [Experimental] [default 'false'] 
     --a5-summaries-use-global Use the global A5 file for summaries.
                          [Experimental] [default 'false'] 
     --a5-feature-effects Output feature effects in A5 format.
                          [Experimental] [default 'false'] 
     --a5-feature-effects-use-global Use the global A5 file for feature
                          effects.[Experimental] [default 'false'] 
     --a5-feature-details Output feature level residuals in A5 
                          format. [Experimental] [default 'false'] 
     --a5-feature-details-use-global Use the global A5 file for residuals.
                          [Experimental] [default 'false'] 
     --a5-sketch Output normalization sketch in A5 format.
                          --write-sketch option will override this
                          option. [Experimental] [default 'false'] 
     --a5-sketch-use-global Put the sketch in the global A5 output 
                          file. [Experimental] [default 'false'] 
     --a5-write-models Output genotype models/posteriors in A5
                          format. --write-models option will override
                          this option. [Experimental] [default
                          'false'] 
     --a5-write-models-use-global Put the models in the global A5 output 
                          file. [Experimental] [default 'false'] 
 A5 input options
     --a5-global-input-file Filename for the group in the global input
                          file. [Experimental] [default '']
     --a5-input-group Group name for input. Defaults to 
                          --a5-group or if that is not set, then '/'.
                          [Experimental] [default ''] 
     --a5-sketch-input-global Read the sketch from the global A5 input
                          file. [Experimental] [default 'false'] 
     --a5-sketch-input-file Read the sketch from an A5 input file.
                          [Experimental] [default ''] 
     --a5-sketch-input-group Group name to read the sketch from. 
                          Defaults to --a5-input-group. 
                          [Experimental] [default ''] 
     --a5-sketch-input-name The name of the data section. Defaults to
                          'target-sketch'. [Experimental] [default 
                          ''] 
     --a5-feature-effects-input-global Read the feature effects from the global
                          A5 input file. [Experimental] [default 'false']
     --a5-feature-effects-input-file Read the feature effects from an A5
                          input file. [Experimental] [default ''] 
     --a5-feature-effects-input-group Group name to read the feature effects 
                          from. Defaults to --a5-input-group.
                          [Experimental] [default ''] 
     --a5-feature-effects-input-name The name of the data section. Defaults to
                          XXX.feature-response where XXX is the
                          analysis name and quant method, e.g.
                          'brlmm-p.plier'. [Experimental] [default 
                          ''] 
     --a5-models-input-global Read the models from the global A5 input
                          file. The tsv5 name must be
                          'XXX.snp-posteriors'. [Experimental]
                          [default 'false'] 
     --a5-models-input-file Read the models from an A5 input file.
                          [Experimental] [default ''] 
     --a5-models-input-group The group name where the models are 
                          located. Defaults to the analysis name.
                          [Experimental] [default ''] 
     --a5-models-input-name The name of the data section. Defaults to
                          XXX.snp-posteriors where XXX is the 
                          analysis name, e.g. 'brlmm-p'. [Experimental]
                          [default ''] 
 SNPQC Options
     --snpqc-probesets Filename of probesets to calculate
                          snpqc-call-rate, snpqc-hom-rate and
                          snpqc-het-rate for. [default ''] 
 Engine Options (Not used on command line)
     --cels CEL files to process. [default '']
     --result-files CHP file names to output. Must be paired
                          with cels. [default ''] 
     --time-start The time the engine run was started.
                          [default '']
     --time-end The time the engine run ended. [default '']
     --time-run-minutes The run time in minutes. [default ''] 
     --analysis-guid The GUID for the analysis run. [default ''] 

Standard Methods:
 'birdseed'            quant-norm.sketch=50000,pm-only,birdseed
 'birdseed-dev'        quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev
 'birdseed-dev.force'  quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev.conf-threshold=1
 'birdseed-v1'         quant-norm.sketch=50000,pm-only,birdseed-v1
 'birdseed-v1.force'   quant-norm.sketch=50000,pm-only,birdseed-v1.conf-threshold=1
 'birdseed-v2'         quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2
 'birdseed-v2.force'   quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2.conf-threshold=1
 'birdseed.force'      quant-norm.sketch=50000,pm-only,birdseed.conf-threshold=1
 'brlmm'               quant-norm.sketch=50000,pm-only,brlmm.transform=ccs.K=4
 'brlmm-p'             quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
 'brlmm-p-plus'        quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05
 'brlmm-p-plus.force'  quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1
 'brlmm-p.force'       quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=1

Data transformations:
   rma-bg               Performs an RMA style background adjustment 
                        as described in Irizarry et al 2003. 
   quant-norm           Class for doing quantile normalization. Can 
                        do sketch and full quantile (just set sketch
                        to chip size or zero) and supports
                        bioconductor compatibility. 
   artifact-reduction   Class for artifact reduction. 
   med-norm             Class for doing median normalization. Adjust
                        intensities such that all chips have the same
                        median (or average). 
   adapter-type-norm    Class for doing adapter type normalization.
                        Adjust intensities by adapter type. 
   gc-bg                Subtract background based on median intensity
                        of probes with similar GC content. 
   gc-correction        Correct feature intensity for variations in
                        gc_count. 
   scale-intensities    Scale cel intensities. 
   intensity-reporter   Class for dumping intensity values to a file. 
   no-trans             Placeholder chipstream that does no
                        transformation 

Pm Intensity Adjustments:
   pm-only   No adjustment. Just uses unmodified PM intensity values. 
   pm-mm     Use mismatch probe as adjustment for perfect match. Has
             strength of being unbiased, but often the mismatch probe
             binds the match target. 
   pm-gcbg   Do an adjustment based on the median intensity of probes
             with similar GC content. 
   pm-sum    Add intensity of PM probe for other allele to PM probes.

Quantification Methods:
   plier        The PLIER (Probe Logarithmic Error Intensity 
                Estimate) method produces an improved signal by
                accounting for experimentally observed patterns in
                feature behavior and handling error appropriately at
                low and high signal values. This version of PLIER
                differs from the previous version by the addition of
                SafetyZero, NumericalTolerance, and FixPrecomputed.
                These options are intended to improve the stability
                of PLIER results when using precomputed feature
                response values. To get the older PLIER
                behavior set SafetyZero to 0.0, NumericalTolerance to
                0.0, and FixPrecomputed to false. 
   sea          The SEA (Simplified Expression Analysis) method
                provides a simple signal estimate, using the
                initialization algorithm from the PLIER (Probe
                Logarithmic Error Intensity Estimate) method and
                omitting the PLIER parameter fitting. SEA is useful
                for single chip signal estimation. The version of
                PLIER used by SEA differs from the previous version 
                by the addition of SafetyZero, NumericalTolerance,
                and FixPrecomputed. These options are intended to
                improve the stability of PLIER results when using
                precomputed feature response values. To get the older
                PLIER behavior set SafetyZero to 0.0,
                NumericalTolerance to 0.0, and FixPrecomputed to
                false. 
   iter-plier   Do probe set quantification estimate by iteratively
                calling PLIER with the probes that best correlate 
                with signal estimate. The version of PLIER used by
                IterPLIER differs from the previous version by the
                addition of SafetyZero, NumericalTolerance, and
                FixPrecomputed. These options are intended to improve
                the stability of PLIER results when using precomputed
                feature response values. To get the older PLIER
                behavior set SafetyZero to 0.0, NumericalTolerance to
                0.0, and FixPrecomputed to false. 
   med-polish   Performs a median polish to estimate target and probe
                effects. Resulting summaries are in log2 space by
                default. Used in summary step of RMA as described in
                Irizarry et al 2003. 
   dabg         Calculates the p-value that the intensities in a
                probeset could have been observed by chance in a
                background distribution. Used as a substitute for
                standard absent/present calls when mismatch probes 
                are not available. 
   avgdiff      Calculates the average measurement for a probeset
                using the MAS 4 average difference algorithm, namely
                the average difference between the pm and mm probe
                signal. 
   median       Use the median of probes for a particular chip as the
                summary. 

Analysis Streams:
   expr           Does expression summarization on probesets. 
   pca-select     Determines PCA for probes and picks probes that are
                  near the principal component as the probes to use
                  for downstream analysis. 
   spect-select   Picks probes that are similar to each other based 
                  on spectral cluster and normalized cut. 

Frequently Asked Questions

Q. What is a probe_id?

A. See the FAQ item on probe IDs for more info.

Q. The program died with an error message like "Assertion failed: A->probes.size() == 2, file ../DmListener.cpp." What does this mean?

A. This is symptomatic of trying to run BRLMM for a SNP with no MM probes. In its typical mode of running, BRLMM relies on DM to generate initial seed calls, and the DM algorithm requires MM probes.

Q. The program died with an error message like "DmListener::getGenoCall() - Can't find genotypes for name: SNP_A-1780432". What does this mean?

A. This is symptomatic of having specified the wrong chrX file for the analysis. To reduce the likelihood of accidentally using the wrong chrX file, apt-probeset-genotype checks that all the SNPs specified in the chrX file are present on the chips being analyzed. If it finds a SNP in the chrX file that is not identified in the CDF file, it will die with the above message. Note that if you want to bypass the requirement of a chrX file, you can use the --no-gender-force option.

Q. The program died and I got an error message saying "Killed". What does this mean and what can I do?

A. Linux has a "feature" whereby it will promise more memory than it actually has, in the hope that many programs won't actually be using all their memory at once. However, if Linux does run short of memory, it will start killing programs arbitrarily. You can read more about Linux's OOM (out of memory) killer at LWN.net.

Q. Why does apt-probeset-genotype require information regarding SNPs on chromosomes X/Y/Mito?

A. The SNPs on chromosome X are evaluated separately for XX (female) and XY (male) individuals, as the intensity estimates for the males will generally be lower on X due to one missing chromosome. The prior is also adjusted to remove the het center, as XY individuals should only have hom calls on the X chromosome. For BRLMM analyses, gender is estimated using the method employed in the GTYPE software: individuals are called male if less than 7.5% of the SNPs on X are called as hets by the initial DM calls using a .33 confidence threshold. For BRLMM-P, gender is estimated by use of an Expectation Maximization (EM) algorithm on the PM probes for chrX SNPs to estimate the het rate.
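
The BRLMM gender rule above amounts to a threshold on the chrX het rate. The following Python sketch illustrates that rule only; it is not APT's implementation. The function name, the call encoding ('AA', 'AB', 'BB', or None for a no-call), and the choice to divide by the total number of chrX SNPs rather than only those with a call are assumptions. The input is taken to be DM calls that have already been filtered at the .33 confidence threshold.

```python
def call_gender_from_chrx(calls, het_cutoff=0.075):
    """Return 'male' if the chrX het rate is below het_cutoff, else 'female'.

    calls: DM genotype calls for chrX SNPs, already thresholded; each entry
    is 'AA', 'AB', 'BB', or None (no-call). Hypothetical encoding.
    """
    if not calls:
        raise ValueError("no chrX SNP calls supplied")
    # Assumption: the het rate is taken over all chrX SNPs, including no-calls.
    het_rate = sum(1 for c in calls if c == "AB") / len(calls)
    return "male" if het_rate < het_cutoff else "female"
```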

Q. How is the mask section in the CEL file used?

A. It is not. The contents of this section of the CEL file are ignored.

Q. How can I find out more information about the analysis string:

    quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05

A. Start with the included manual for apt-probeset-genotype. Usage information is also provided if you run apt-probeset-genotype without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:

    apt-probeset-genotype --explain brlmm-p

Q. Wild cards do not work on Windows. For example:

    apt-probeset-genotype ... *.CEL

A. APT relies on the command shell to do the wild card expansion (e.g. the bash shell on Unix-like systems). The Windows shell does not do wild card expansion, so there is no wild card expansion for APT when run from the Windows shell. You may want to try a different Windows shell or perhaps bash via Cygwin. Alternatively, see the --cel-files option for another way to specify CEL files for analysis.
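
One way around the missing wild card expansion is to build the CEL file list with a script and pass it via --cel-files. The Python sketch below is an illustration, not an APT tool. The function name write_cel_list and the output filename are hypothetical, and the sketch assumes the list is a text file whose first line is the header 'cel_files' followed by one CEL path per line; check the --cel-files description in the option listing for your version.

```python
import glob
import os

def write_cel_list(cel_dir, list_path):
    """Write the paths of all CEL files in cel_dir to list_path; return them."""
    # Match the extension case-insensitively (.CEL and .cel both occur).
    paths = sorted(
        p for p in glob.glob(os.path.join(cel_dir, "*"))
        if p.lower().endswith(".cel")
    )
    with open(list_path, "w", newline="\n") as out:
        out.write("cel_files\n")  # assumed header line
        for p in paths:
            out.write(p + "\n")
    return paths
```

Calling write_cel_list(".", "cel_list.txt") in the directory holding the CEL files would produce a list that can then be supplied as:

    apt-probeset-genotype ... --cel-files cel_list.txt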

Advanced Topics

CHP File format.

This section explains the contents of CHP files for the various algorithms. For details on the formats or for an explanation of why the XDA CHP format is not supported for some chip types, see above.

XDA CHP file format

The XDA CHP file format is only supported for the BRLMM algorithm applied to the 100K or 500K arrays. Historically the genotyping XDA CHP file is closely tied to the DM model and while BRLMM uses the same format for backward compatibility it is important to note that the interpretation of some fields is different. Below are the names of the fields and corresponding BRLMM values that are stored in them.

The following parameters are saved in the CCHPFileHeader object:

A complete explanation of the XDA CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.

AGCC CHP file format

The AGCC CHP format consists of a header followed by a data section. The header section contains a large amount of information including the software version and the full set of parameters used in the clustering analysis. The data section consists of a matrix with a row for each SNP. The columns are:

Use of the clustering_space_x_value and clustering_space_y_value fields allows for plotting the data in the space that was used to perform the clustering. For BRLMM and BRLMM-P the x-value is 'transformed contrast' and the y-value is 'signal strength'; see the BRLMM and BRLMM-P whitepapers for more detail. For Birdseed (see http://www.broad.mit.edu/mpg/birdsuite/) this is A-signal vs. B-signal (linear scale, post quantile normalization and allele-specific median-polish).

A complete explanation of the AGCC CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.

Custom Analyses:

While aliases for common analyses, such as brlmm with default parameters, are provided, it is also possible to construct custom analyses on the command line. There are both program options and analysis parameters that can be set to affect the results. Most people are familiar with the standard method for setting program options, but the specification of the analysis method and its parameters in apt-probeset-genotype works a little differently. Custom parameters are set by supplying a text representation of the desired analysis and its parameters. This enables flexibility, as each piece of an analysis is self-contained and the pieces can be (almost) arbitrarily combined. Note that when using a custom analysis rather than an alias, it is necessary to specify the entire analysis; passing custom parameters to the alias is not accepted. For example, if you wanted to change the number of iterations brlmm performs, you would have to specify 'quant-norm.sketch=50000,pm-only,brlmm.iterations=1' rather than just typing 'brlmm.iterations=1'.

The current full default brlmm analysis is 'quant-norm.sketch=50000,pm-only,brlmm'. An analysis string can contain multiple chipstream modules (in this case a single quant-norm) separated by commas, and the last two entries are the PM adjuster (pm-only) and the quantification method (brlmm). Parameters to a particular step in the analysis are supplied as key=value pairs separated by periods. For example, 'quant-norm.sketch=50000' indicates that the chips should be quantile normalized and that a sketch (subset of total data) of size 50000 should be used to do the normalization. Using a sketch can significantly reduce the amount of memory needed with minimal impact on normalization values. To do quantile normalization with just the PM probes and resolve ties in the same manner as bioconductor's RMA version of quantile normalization, you would specify 'quant-norm.sketch=50000.bioc=true.usepm=true'. All of the possible parameters can be seen by using the --explain option in conjunction with the name of the module (e.g. apt-probeset-genotype --explain quant-norm).
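
The structure of an analysis string can be made concrete with a small parser. The following is a minimal Python sketch and is not part of APT. The function name parse_analysis is hypothetical, and the rule used to tell a parameter separator from a decimal point (a period counts as a separator only when it is followed by a key and an equals sign) is an assumption made for illustration, not a description of APT's own parser.

```python
import re

# Hypothetical helper, not part of APT: split an analysis string into stages.
# Assumption: a '.' separates parameters only when followed by "key=";
# this keeps decimal values such as SB=0.003 intact.
_PARAM_SEP = re.compile(r"\.(?=[A-Za-z][\w-]*=)")

def parse_analysis(spec):
    """Return a list of (module_name, {param: value}) tuples, one per stage."""
    stages = []
    for stage in spec.split(","):
        name, *pairs = _PARAM_SEP.split(stage.strip())
        params = dict(pair.split("=", 1) for pair in pairs)
        stages.append((name, params))
    return stages

if __name__ == "__main__":
    spec = "quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05"
    for name, params in parse_analysis(spec):
        print(name, params)
```

Values are kept as strings; the last two stages returned correspond to the PM adjuster and the quantification method, and any earlier stages are chipstream modules.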

A few example custom analyses:

'pm-only,brlmm.transform=rvt' - No normalization, use rvt space for clustering in brlmm.

'med-norm,pm-mm,brlmm.het-mult=.9' - Do a median normalization, use a PM-MM adjustment for probes and a het multiplier of .9 to try and balance hom/het calls.

'rma-bg,quant-norm.sketch=50000.usepm=true.bioc=true,pm-only,brlmm.K=4.transform=CCS' - Do an RMA-style background adjustment, then quantile normalization using a sketch of 50000 data points, followed by brlmm in CCS (contrast centers stretch) space with K = 4.

Use the --explain option to get more information on what parameters are available for the various methods. For example, "--explain brlmm", "--explain brlmm-p", and "--explain birdseed".

Clustering Space Transformations:

There are a number of different transformations that are implemented for different spaces which can be specified via the transform parameter to brlmm and are detailed below. For all of these transformations $A$ and $B$ denote the intensity of the A and B alleles respectively as estimated by the quantification method (such as plier or RMA). $X$ and $Y$ denote the new coordinates that $A$ and $B$ will be transformed into.