MANUAL: apt-probeset-genotype (apt-1.10.1)

Introduction

apt-probeset-genotype is a program for making genotype calls from Affymetrix SNP microarrays. It currently implements three genotype calling algorithms: BRLMM, BRLMM-P, and Birdseed.

APT implements the Birdseed v1 algorithm, developed in collaboration with the Broad Institute, which Affymetrix has validated and supports for use with the SNP 6.0 array. APT also implements the BRLMM-P algorithm, which Affymetrix has validated and supports for the SNP 5.0 array.

Additionally, APT implements a newer version of Birdseed, accessed with the --analysis birdseed-v2 option, and the latest development edition of the Birdseed algorithm, accessed with the --analysis birdseed-dev option. Unlike the supported methods described above, Affymetrix does not currently support the birdseed-v2 and birdseed-dev methods. Moreover, the SNP priors and lists of "qualified" SNPs for Birdseed-dev are not currently available from Affymetrix for either SNP 5.0 or 6.0. Further information and support files for Birdseed-dev and Birdseed-v2 are available from the Broad Institute (http://www.broad.mit.edu/mpg/birdsuite/).

Future APT updates are expected to migrate the improvements currently available via the birdseed-dev option into methods that are supported in the same manner as the other APT methods.

Because birdseed, brlmm, and brlmm-p are model-based algorithms, they must be run on multiple CEL files at once to estimate probe effects and SNP cluster parameters. For Mapping 500K data it is advisable to run on at least 50 distinct samples (excluding replicates) and ideally on about 100. For Genome-Wide Human SNP 5.0 and 6.0 arrays it is advisable to cluster with at least 44 genetically distinct samples, though adding more continues to help, in particular for correctly calling rare genotypes.

Quick Start

We illustrate the most basic way to run apt-probeset-genotype with some examples.

The basic requirements for a run of apt-probeset-genotype are:

WARNING: apt-probeset-genotype will overwrite any existing output files it finds. If you wish to keep existing results make sure to specify a different output directory name.

WARNING: Model files are algorithm specific. Birdseed model files must be used with the Birdseed analysis method and BRLMM-P model files with the brlmm-p method.

NOTE: On Windows the DOS prompt does not support wildcard expansion, so the preferred method is to supply a text file listing the paths to the CEL files via the '--cel-files' option (see below for details of the file format).

NOTE: Unlike unix shells, the Windows DOS prompt also does not allow a command to be continued onto the next line with the '\' character. When using the examples below on Windows, omit the '\' characters and enter everything on a single line.

Running BRLMM (Mapping 500K and preceding chips)

On unix systems a basic command using the default parameters to do a run on Mapping 500K data would look like:

apt-probeset-genotype \
  -o results_dir \
  -c Mapping250K_Sty.cdf \
  --chrX-snps Mapping250K_Sty.chrx \
  *.CEL

The output will consist of a report file with some summary statistics about each chip analyzed and a pair of tab-delimited text files with suffixes .calls.txt and .confidences.txt containing the genotype calls and their associated confidences.

On Windows a command equivalent to the example above for Mapping 500K would look like:

apt-probeset-genotype -c Mapping250K_Sty.cdf --chrX-snps Mapping250K_Sty.chrx  -o results_dir  --cel-files cel_file_list.txt
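
A file list such as cel_file_list.txt can be generated rather than typed by hand. The following is a minimal sketch for a unix-like shell, based on the format given in the --cel-files option description (first line 'cel_files', then one CEL file per line). The helper name make_cel_list is illustrative and not part of APT, and the sketch assumes a find that supports -maxdepth and -iname (GNU and BSD find do).

```shell
# Write a --cel-files list: header line 'cel_files', then one CEL path per line.
# make_cel_list is an illustrative helper, not an APT command.
make_cel_list() {
  cel_dir="$1"   # directory holding the CEL files
  out="$2"       # list file to create
  {
    printf 'cel_files\n'
    # -iname matches both .CEL and .cel; sort gives a stable order
    find "$cel_dir" -maxdepth 1 -type f -iname '*.cel' | sort
  } > "$out"
}

# List the CEL files in the current directory
make_cel_list . cel_file_list.txt
```

The resulting file can then be passed with --cel-files cel_file_list.txt in place of the *.CEL wildcard.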

For Mapping 500K chips, apt-probeset-genotype processes 100 CEL files in 1-2 hours on a 3 GHz machine with 2 GB of RAM using local disk.

Running BRLMM-P (GenomeWide SNP 5.0 chips)

On unix systems a basic command using the default parameters to do a run on SNP5.0 data would look like:

apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_5.cdf \
  --chrX-snps GenomeWideSNP_5.chrx \
  --read-models-brlmmp GenomeWideSNP_5.models \
  -a brlmm-p \
  *.CEL

Note in particular the use of the option "-a brlmm-p" which specifies that the BRLMM-P calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 chips).

Running Birdseed (GenomeWide SNP 6.0 chips)

On unix systems a basic command using the default parameters to do a run on SNP6.0 data using birdseed (v2) would look like:

apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_6.cdf \
  --set-gender-method cn-probe-chrXY-ratio \
  --chrX-probes GenomeWideSNP_6.chrXprobes \
  --chrY-probes GenomeWideSNP_6.chrYprobes \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --read-models-birdseed GenomeWideSNP_6.birdseed-v2.models \
  -a birdseed-v2 \
  *.CEL

Note in particular the use of the option "-a birdseed-v2" which specifies that the Birdseed calling algorithm should be used (the default is brlmm, which won't work on a chip without MM probes such as the 5.0 and 6.0 chips).

Also see the important notes regarding birdseed-v1, birdseed-v2, and birdseed-dev in the introduction above.

The following will give you the older birdseed (v1) behavior:

apt-probeset-genotype \
  -o results_dir \
  -c GenomeWideSNP_6.cdf \
  --special-snps GenomeWideSNP_6.specialSNPs \
  --read-models-birdseed GenomeWideSNP_6.birdseed.models \
  -a birdseed \
  *.CEL

Analyzing a subset of the SNPs

Building upon the examples above, here is an example in which only a subset of SNPs are analyzed and the results are written to a text table of genotype calls and a text table of call confidences. The subset of SNPs to be analyzed is specified in a tab-delimited text file called subset_sty.txt, which must contain a column named 'probeset_id'.

apt-probeset-genotype \
  -s subset_sty.txt \
  -c Mapping250K_Sty.cdf \
  --chrX-snps Mapping250K_Sty.chrx \
  -o results_dir \
  *.CEL
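
A subset file in the format described above can be produced from a plain list of probeset IDs. This is a minimal sketch under the assumption that the input has one ID per line; the helper name make_subset_file and the input file name my_ids.txt are hypothetical and not part of APT.

```shell
# Turn a plain list of probeset IDs (one per line) into a --probeset-ids file.
# make_subset_file is an illustrative helper, not an APT command.
make_subset_file() {
  ids="$1"   # input: one probeset ID per line
  out="$2"   # output: file with a 'probeset_id' header line
  {
    printf 'probeset_id\n'
    # drop blank lines and duplicate IDs
    grep -v '^[[:space:]]*$' "$ids" | sort -u
  } > "$out"
}
```

For example, make_subset_file my_ids.txt subset_sty.txt would write a file suitable for the -s option used above.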

See the apt-probeset-summarize manual for a more complete example of running an analysis on a compute farm.

A note about the CHP file format

In previous versions of apt-probeset-genotype the default output format for genotype calls was the XDA CHP format (also known as GCOS CHP format). For the GenomeWide SNP 5.0, SNP 6.0 and subsequent WGSA products the use of the XDA CHP format is strongly discouraged; we recommend the newer AGCC CHP format instead. To help avoid accidental use of the XDA CHP format, the default output format has been changed to tab-delimited text tables of calls and confidences. The creation of the text table output can be suppressed with the --no-table-output option, and the two CHP output formats can be selected with the --xda-chp-output and --cc-chp-output options.

The XDA CHP format is discouraged for the GenomeWide SNP 5.0 chips because it doesn't contain entries for SNP IDs; the identity of a SNP is inferred from its order in the file. In the case of the GenomeWide SNP 5.0 chips there are some SNPs that are not part of the default library file which some advanced users may choose to explore. This leads to the possibility of generating CHP files containing different SNP lists, something not well supported by the XDA CHP format. The AGCC CHP format has a slot for SNP IDs and thus is safer to use with chips for which users may be looking at different SNP lists.

For SNP 6.0, XDA CHP file format output is not allowed at all.

Details on the contents of the CHP files for various calling algorithms can be found below, and a full description of the XDA and AGCC CHP formats can be found in a local copy of the Affymetrix Developer's Network file format documentation.

Support

Support for APT is handled through the Affymetrix Developer Network. Specifically, questions, problems, feature requests, and other inquiries should be made through either the APT User Forum or the Developer Network email address, devnet@affymetrix.com. (If you get an Internal Server Error when accessing the forum, try clearing your cookies for affymetrix.com.) To get email updates about APT or to view previous APT announcements, see the APT User Forum.

APT is not supported through the Affymetrix call center, Field Application Specialists, or the standard Affymetrix Technical support channels.

If you encounter an issue please make sure to collect the following information and report the problem to devnet@affymetrix.com

The Report File

apt-probeset-genotype creates a summary report file in the output directory with the file name extension '.report.txt'. The report file contains summary information about each chip analyzed and is useful for getting a quick overview of the CELs analyzed. The format of the file is tab-delimited text with a header line followed by a line for each CEL file analyzed. The columns are all explained below; most users will be mainly interested in the first few entries. The additional entries are provided as potentially useful metrics to track and identify outlier chips and are expected to be mainly of interest to advanced users. The column entries are:

  1. cel_files: CEL file name.
  2. computed_gender: Estimated gender.
  3. call_rate: BRLMM/BRLMM-P/Birdseed call rate at the default or user-specified threshold.
  4. het_rate: Percentage of SNPs called AB (i.e. the heterozygosity).
  5. hom_rate: Percentage of SNPs called AA or BB (i.e. the homozygosity).
  6. cluster_distance_mean: Average distance to the cluster center for the called genotype.
  7. cluster_distance_stdev: Standard deviation of the distance to the cluster center for the called genotype.
  8. raw_intensity_mean: Average of the raw PM probe intensities.
  9. raw_intensity_stdev: Standard deviation of the raw PM probe intensities.
  10. allele_summarization_mean: Average of the allele signal estimates (log2 scale).
  11. allele_summarization_stdev: Standard deviation of the allele signal estimates (log2 scale).
  12. allele_deviation_mean: Average of the absolute difference between the log2 allele signal estimate and its median across all chips.
  13. allele_deviation_stdev: Standard deviation of the absolute difference between the log2 allele signal estimate and its median across all chips.
  14. allele_mad_residuals_mean: Average of the median absolute deviation (MAD) between observed probe intensities and probe intensities fitted by the model.
  15. allele_mad_residuals_stdev: Standard deviation of the median absolute deviation (MAD) between observed probe intensities and probe intensities fitted by the model.
  16. em-cluster-chrX-het-contrast_gender: Gender estimate based on estimated heterozygosity on chrX.
  17. em-cluster-chrX-het-contrast_gender_chrX_het_rate: Estimated heterozygosity on chrX.
  18. cn-probe-chrXY-ratio_gender_meanX: Average intensity of chrX CN probes.
  19. cn-probe-chrXY-ratio_gender_meanY: Average intensity of chrY CN probes.
  20. cn-probe-chrXY-ratio_gender_ratio: Ratio of average chrY CN probe intensity to average chrX CN probe intensity.
  21. cn-probe-chrXY-ratio_gender: Gender estimate based on ratio of chrY to chrX average CN probe intensities.
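
Because the report is tab-delimited with named columns, outlier chips can be picked out with standard text tools. The sketch below lists CEL files whose call_rate falls under a cutoff, looking columns up by header name rather than position. The helper name low_call_rate and the cutoff value are hypothetical; the sketch assumes the cutoff is given in the same units as the call_rate column, and that any lines starting with '#' are comments to be skipped.

```shell
# Print CEL files whose call_rate is below a cutoff.
# low_call_rate is an illustrative helper, not an APT command.
low_call_rate() {
  # $1: report file, $2: minimum acceptable call_rate
  awk -F '\t' -v min="$2" '
    /^#/ { next }                          # skip comment lines, if any (assumption)
    !have_header {                         # first remaining line is the header
      for (i = 1; i <= NF; i++) col[$i] = i
      have_header = 1
      next
    }
    ($(col["call_rate"]) + 0) < (min + 0) {
      print $(col["cel_files"]) "\t" $(col["call_rate"])
    }
  ' "$1"
}
```

For example, low_call_rate results_dir/brlmm.report.txt 95 would print one line per flagged chip; both the report file name and the cutoff of 95 are placeholders here.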

Options:

apt-probeset-genotype - program for determining genotype calls
from Affymetrix SNP microarrays. The model based algorithms for
making calls (brlmm/brlmm-p/birdseed) require multiple cel files
to be analyzed at once to learn the parameters for each SNP. 

usage:
  BRLMM (500K arrays):
    apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
         -o out-dir/ *.cel

  BRLMM-P (GenomeWide SNP 5.0 arrays):
    apt-probeset-genotype -c chip.cdf --chrX-snps chip.chrx \
         -o out-dir/ -a brlmm-p --read-models-brlmmp chip.models \
          *.cel

  Birdseed (GenomeWide SNP 6.0 arrays):
    apt-probeset-genotype -c chip.cdf --special-snps chip.specialSNPs \
         -o out-dir/ -a birdseed --read-models-birdseed chip.birdseed.models \
          *.cel

  See the apt-probeset-genotype manual for more information about
  birdseed including the latest improvements from The Broad.


options:
 Basic Info and Control Options
   -h, --help                           This message. [default 'false'] 
     --explain Explain a particular operation (i.e.
                          --explain brlmm or --explain brlmm-p).
                          [default ''] 
   -v, --verbose How verbose to be with status messages 0 -
                          quiet, 1 - usual messages, 2 - more
                          messages. [default '1'] 
     --version Output program version and quit. [default
                          'false'] 
   -f, --force Disable various checks including chip 
                          types. Consider using --chip-type option
                          rather than --force. [default 'false'] 
 Input Options
     --cel-files Text file specifying cel files to process,
                          one per line with the first line being
                          'cel_files'. [default ''] 
   -c, --cdf-file File defining probe sets. Use either
                          --cdf-file or --spf-file. [default ''] 
     --spf-file File defining probe sets in spf (simple
                          probe format) which is like a text cdf 
                          file. [default ''] 
     --chrX-snps File containing snps on chrX
                          (non-pseudoautosomal region). [default ''] 
     --special-snps File containing all snps of unusual copy
                          (chrX,mito,Y) [default ''] 
     --chrX-probes File containing probe_id (1-based) of 
                          probes on chrX. Used for copy number probe
                          chrX/Y ratio gender calling. [Experimental]
                          [default ''] 
     --chrY-probes File containing probe_id (1-based) of 
                          probes on chrY. Used for copy number probe
                          chrX/Y ratio gender calling. [Experimental]
                          [default ''] 
   -s, --probeset-ids Tab delimited file with column 
                          'probeset_id' specifying probesets to
                          genotype. [default ''] 
     --probeset-ids-reported Tab delimited file with column 
                          'probeset_id' specifying probesets to
                          report. This should be a subset of those
                          specified with --probeset-ids if that 
                          option is used. [default ''] 
     --probe-class-file File containing probe_id (1-based) of 
                          probes and a 'class' designation. Used to
                          compute mean probe intensity by class for
                          report file. [default ''] 
     --chip-type Chip types to check library and CEL files
                          against. Can be specified multiple times.
                          The first one is propagated as the chip
                          type in the output files. Warning, use of
                          this option will override the usual check
                          between chip types found in the library
                          files and cel files. You should use this
                          option instead of --force when possible.
                          [default ''] 
     --snp-annotation-file SNP Annotation file. [default ''] 
     --cn-annotation-file CN Annotation file. [default ''] 
     --genotype-markers-cn-file Tab delimited file with copy number calls
                          for genotype probesets within copy number
                          regions [default ''] 
 Output Options
   -o, --out-dir Directory to write result files into. Any
                          previous results in directory will be
                          overwritten. [default '.'] 
     --table-output Output matching matrices of tab delimited
                          genotype calls and confidences. [default
                          'true'] 
     --output-forced-calls Output a separate file with forced calls.
                          [default 'false'] 
     --output-context Output a separate file with the allele
                          context used. This is only relevant for
                          marker type probesets which have multiple
                          groups of probes for each allele based on
                          the context of nearby SNPs. [default
                          'false'] 
     --cc-chp-output Output resulting calls in directory called
                          'cc-chp' under out-dir. This makes one AGCC
                          Multi Data CHP file per cel file analyzed.
                          [default 'false'] 
     --xda-chp-output Output resulting calls in directory called
                          'chp' under out-dir. This makes one GCOS 
                          XDA CHP file per cel file analyzed. Note
                          that this format is not supported beyond 
                          the Mapping500K chips, for subsequent chips
                          look at the CC CHP format instead. [default
                          'false'] 
     --cc-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --xda-chp-out-dir Over-ride the default location for chp
                          output. [default ''] 
     --summaries Output the summary values from the
                          quantification method for each allele. For
                          brlmm-p this will also write a file of
                          transformed summary values in contrast 
                          space used in the clustering. [default
                          'false'] 
 Analysis Options
   -a, --analysis String representing analysis pathway
                          desired. For example:
                          'quant-norm.sketch=50000,pm-only,brlmm'.
                          [default 'brlmm'] 
     --qmethod-spec Quantification Method to use for 
                          summarizing alleles. [default
                          'plier.optmethod=1'] 
     --read-models-brlmm File to read precomputed BRLMM snp specific
                          models from. [default ''] 
     --read-models-brlmmp File to read precomputed BRLMM-P snp
                          specific models from. [default ''] 
     --read-models-birdseed File to read precomputed birdseed snp
                          specific models from. [default ''] 
     --write-models Should we write snp specific models out for
                          analysis? [experimental] [default 'false'] 
     --feat-effects Output feature effects when available.
                          [default 'false'] 
     --use-feat-eff File defining a plier feature effect for
                          each probe. Note that precomputed effects
                          should only be used for an appropriately
                          similar analysis (i.e. feature effects for
                          pm-only may be different than for pm-mm).
                          [default ''] 
     --residuals Output the residuals from the 
                          quantification method if available. 
                          [default 'false'] 
     --target-sketch File specifying a target distribution to 
                          use for quantile normalization. [default 
                          ''] 
     --write-sketch Write the quantile normalization
                          distribution (or sketch) to a file for 
                          reuse with target-sketch option. [default
                          'false'] 
     --dm-thresh Minimum DM p-value to seed clusters with.
                          [default '.17'] 
     --dm-hetmult DM hetmultiplier to balance het/hom calls,
                          additive to log likelihood. [default '0'] 
     --prior-size How many probesets to use for determining
                          prior. [default '0'] 
     --list-sample Only sample for prior from list specified
                          via --probeset-ids, not entire chip.
                          [default 'false'] 
     --read-priors-brlmm File to load BRLMM priors from. Prior 
                          format is tab separated id, center, var, 
                          and center.var. [default ''] 
     --write-prior Write prior out to file in output-dir.
                          [default 'false'] 
     --norm-size Do contrast normalization using a sample of
                          this many snps (brlmm-p) [default '0'] 
     --write-norm Write covariate norm fcns to file [default
                          'false'] 
     --set-analysis-name Explicitly set the analysis name. This
                          affects output file names (ie prefix) and
                          various meta info. [default ''] 
 Gender Options
     --set-gender-method Explicitly force the use of a particular
                          gender method for genotype calling. Valid
                          values include: cn-probe-chrXY-ratio,
                          dm-chrX-het-rate,
                          em-cluster-chrX-het-contrast, 
                          user-supplied, and none. If you have
                          supplied seed genotype calls, you can also
                          use supplied-genotypes-chrX-het-rate. When
                          not set, the default behavior depends on 
                          the analysis. [default ''] 
     --read-genders Explicitly read genders from a file.
                          [default ''] 
     --no-gender-force Perform analysis even without a suitable
                          gender method for genotype calling. 
                          [default 'false'] 
     --em-gender Enable EM Gender calling if special-snps or
                          chrX-snp file is provided. [default 'true'] 
     --female-thresh Threshold for calling females when using
                          cn-probe-chrXY-ratio method. [default
                          '0.48'] 
     --male-thresh Threshold for calling males when using
                          cn-probe-chrXY-ratio method. [default
                          '0.71'] 
 Advanced Options
     --kill-list Do not use the probes specified in file for
                          computing results. [experimental] [default
                          ''] 
     --dm-out Output any initial seed calls used by BRLMM
                          (seed default is DM calls). Only relevant
                          for BRLMM. [default 'false'] 
     --all-types Try and analyze all probeset types rather
                          than just genotyping. [Experimental]
                          [default 'false'] 
     --genotypes File to read seed genotypes from instead of
                          using DM to generate. [experimental]
                          [default ''] 
     --select-probes Output estimates of which probes are most
                          accurate [default 'false'] 
     --call-coder-max-alleles For encoding/decoding calls, the max number
                          of alleles per marker to allow. [default
                          '6'] 
     --call-coder-type The data size used to encode the call.
                          [default 'UCHAR'] 
     --call-coder-version The version of the encoder/decoder to use
                          [default '1.0'] 
 Execution Control Options
     --mem-usage How much memory (RAM) to use for this job 
                          in megabytes. Only relevant when
                          --use-disk=false. [default '0'] 
     --block-size How many probesets to process at once,
                          useful when memory is limited. If set to 0
                          program attempts to guess available RAM and
                          set appropriately. Only relevant if
                          --use-disk is false. [default '0'] 
     --max-block-size This sets a cap on how high the
                          blockSize*celFiles can go. When set to 0
                          there is no cap. Only relevant if 
                          --use-disk is false. [default '0'] 
     --use-disk Use disk based representation to avoid
                          excessive RAM use. [default 'true'] 
     --disk-dir Directory for temporary files when working
                          off disk. Using network mounted drives is
                          not advised. When not set, the output 
                          folder will be used. [default ''] 
     --disk-cache Size of memory cache when working off disk
                          in megabytes. [default '50'] 
 A5 output options
     --a5-global-file Filename for the A5 global output file.
                          [Experimental] [default ''] 
     --a5-global-file-no-replace Append or create rather than replace.
                          [Experimental] [default 'false'] 
     --a5-group Group name where to put results in the A5
                          output files. [Experimental] [default ''] 
     --a5-calls Output the genotype calls and confidences 
                          in A5 format. [Experimental] [default
                          'false'] 
     --a5-calls-use-global Use the global A5 file for calls and
                          confidences.[Experimental] [default 
                          'false'] 
     --a5-summaries Output the summary values from the
                          quantification method for each allele in A5
                          format. [Experimental] [default 'false'] 
     --a5-summaries-use-global Use the global A5 file for summaries.
                          [Experimental] [default 'false'] 
     --a5-feature-effects Output feature effects in A5 format.
                          [Experimental] [default 'false'] 
     --a5-feature-effects-use-global Use the global A5 file for feature
                          effects.[Experimental] [default 'false'] 
     --a5-residuals Output feature level residuals in A5 
                          format. [Experimental] [default 'false'] 
     --a5-residuals-use-global Use the global A5 file for residuals.
                          [Experimental] [default 'false'] 
     --a5-sketch Output normalization sketch in A5 format.
                          --write-sketch option will override this
                          option. [Experimental] [default 'false'] 
     --a5-sketch-use-global Put the sketch in the global A5 output 
                          file. [Experimental] [default 'false'] 
     --a5-write-models Output genotype models/posteriors in A5
                          format. --write-models option will override
                          this option. [Experimental] [default
                          'false'] 
     --a5-write-models-use-global Put the models in the global A5 output 
                          file. [Experimental] [default 'false'] 
 A5 input options
     --a5-global-input-file Filename for the group in the global input
                          file.[Experimental] [default ''] 
     --a5-input-group Group name for input. Defaults to 
                          --a5-group or if that is not set, then '/'.
                          [Experimental] [default ''] 
     --a5-sketch-input-global Read the sketch from the global A5 input
                          file. [Experimental] [default 'false'] 
     --a5-sketch-input-file Read the sketch from an A5 input file.
                          [Experimental] [default ''] 
     --a5-sketch-input-group Group name to read the sketch from. 
                          Defaults to --a5-input-group. 
                          [Experimental] [default ''] 
     --a5-sketch-input-name The name of the data section. Defaults to
                          'target-sketch'. [Experimental] [default 
                          ''] 
     --a5-feature-effects-input-global Read the feature effects from the global A5
                          input file. [Experimental] [default 'false'] 
     --a5-feature-effects-input-file Read the feature effects from an A5
                          input file. [Experimental] [default ''] 
     --a5-feature-effects-input-group Group name to read the feature effects 
                          from. Defaults to --a5-input-group.
                          [Experimental] [default ''] 
     --a5-feature-effects-input-name The name of the data section. Defaults to
                          XXX.feature-response where XXX is the
                          analysis name and quant method. IE
                          'brlmm-p.plier'. [Experimental] [default 
                          ''] 
     --a5-models-input-global Read the Models from the global A5 input
                          file. The tsv5 name must be
                          'XXX.snp-posteriors'. [Experimental]
                          [default 'false'] 
     --a5-models-input-file Read the models from an A5 input file.
                          [Experimental] [default ''] 
     --a5-models-input-group The group name where the models are 
                          located. Defaults to the analysis name.
                          [Experimental] [default ''] 
     --a5-models-input-name The name of the data section. Defaults to
                          XXX.snp-posteriors where XXX is the 
                          analysis name. IE 'brlmm-p'. [Experimental]
                          [default ''] 

Standard Methods:
 'birdseed'            quant-norm.sketch=50000,pm-only,birdseed
 'birdseed-dev'        quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev
 'birdseed-dev.force'  quant-norm.sketch=50000.target=1000,pm-only,birdseed-dev.conf-threshold=1
 'birdseed-v1'         quant-norm.sketch=50000,pm-only,birdseed-v1
 'birdseed-v1.force'   quant-norm.sketch=50000,pm-only,birdseed-v1.conf-threshold=1
 'birdseed-v2'         quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2
 'birdseed-v2.force'   quant-norm.sketch=50000.target=1000,pm-only,birdseed-v2.conf-threshold=1
 'birdseed.force'      quant-norm.sketch=50000,pm-only,birdseed.conf-threshold=1
 'brlmm'               quant-norm.sketch=50000,pm-only,brlmm.transform=ccs.K=4
 'brlmm-p'             quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05
 'brlmm-p-plus'        quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=0.05
 'brlmm-p-plus.force'  quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.mix=1.bic=2.HARD=3.SB=0.45.KX=1.KH=1.5.KXX=0.5.KAH=-0.6.KHB=-0.6.transform=MVA.AAM=2.0.BBM=-2.0.AAV=0.06.BBV=0.06.ABV=0.06.copyqc=0.000001.wobble=0.05.MS=1
 'brlmm-p.force'       quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=1

Data transformations:
   rma-bg              Performs an RMA style background adjustment as
                       described in Irizarry et al 2003. 
   quant-norm          Class for doing quantile normalization. Can do
                       sketch and full quantile (just set sketch to
                       chip size or zero) and supports bioconductor
                       compatibility. 
   med-norm            Class for doing median normalization. Adjust
                       intensities such that all chips have the same
                       median (or average). 
   adapter-type-norm   Class for doing adapter type normalization.
                       Adjust intensities by adapter type. 
   mas5-bg             Performs a MAS 5 background adjustment as
                       described in Liu et al, Bioinformatics (2002). 
   gc-bg               Subtract background based on median intensity 
                       of probes with similar GC content. 

Pm Intensity Adjustments:
   pm-only   No adjustment. Just uses unmodified PM intensity values. 
   pm-mm     Use mismatch probe as adjustment for perfect match. Has
             strength of being unbiased, but often the mismatch probe
             binds the match target. 
   pm-gcbg   Do an adjustment based on the median intensity of probes
             with similar GC content. 
   pm-sum    Add intensity of PM probe for other allele to PM probes. 

Quantification Methods:
   plier         The PLIER (Probe Logarithmic Error Intensity
                 Estimate) method produces an improved signal by
                 accounting for experimentally observed patterns in
                 feature behavior and handling error
                 appropriately at low and high signal values. This
                 version of PLIER differs from the previous version 
                 by the addition of a SafetyZero, NumericalTolerance,
                 and FixPrecomputed. These options are intended to
                 improve the stability of PLIER results when using
                 precomputed feature response values. To get the older
                 PLIER behavior set SafetyZero to 0.0,
                 NumericalTolerance to 0.0, and FixPrecomputed to
                 false. 
   sea           The SEA (Simplified Expression Analysis) method
                 provides a simple signal estimate, using the
                 initialization algorithm from the PLIER (Probe
                 Logarithmic Error Intensity Estimate) method and
                 omitting the PLIER parameter fitting. SEA is useful
                 for single chip signal estimation. The version of
                 PLIER used by SEA differs from the previous version
                 by the addition of a SafetyZero, NumericalTolerance,
                 and FixPrecomputed. These options are intended to
                 improve the stability of PLIER results when using
                 precomputed feature response values. To get the older
                 PLIER behavior set SafetyZero to 0.0,
                 NumericalTolerance to 0.0, and FixPrecomputed to
                 false. 
   iter-plier    Do probe set quantification estimate by iteratively
                 calling PLIER with the probes that best correlate
                 with signal estimate. The version of PLIER used by
                 IterPLIER differs from the previous version by the
                 addition of a SafetyZero, NumericalTolerance, and
                 FixPrecomputed. These options are intended to 
                 improve the stability of PLIER results when using
                 precomputed feature response values. To get the older
                 PLIER behavior set SafetyZero to 0.0,
                 NumericalTolerance to 0.0, and FixPrecomputed to
                 false. 
   med-polish    Performs a median polish to estimate target and 
                 probe effects. Resulting summaries are in log2 space
                 by default. Used in summary step of RMA as described
                 in Irizarry et al 2003. 
   dabg          Calculates the p-value that the intensities in a
                 probeset could have been observed by chance in a
                 background distribution. Used as a substitute for
                 standard absent/present calls when mismatch probes
                 are not available. 
   mas5-detect   Calculates the p-value for detection of an expressed
                 gene using the MAS 5.0 algorithm. This is a
                 rank-based algorithm, using discrimination scores,
                 described in Liu et al., Bioinformatics (2002)
                 18:1593 and the Statistical Algorithms Reference
                 Guide. 
   mas5-signal   Calculates the average measurement for a probeset
                 using the MAS 5.0 algorithm. This is based on a
                 robust estimator, Tukey's biweight, described in
                 Hubbell et al., Bioinformatics (2002) 18:1585 and 
                 the Statistical Algorithms Reference Guide. WARNING:
                 The implementation in APT does not allow for signal
                 level normalization across the chip. See the FAQ 
                 item in the manual. 
   avgdiff       Calculates the average measurement for a probeset
                 using the MAS 4 average difference algorithm, namely
                 the average difference between the pm and mm probe
                 signal. 
   median        Use the median of probes for a particular chip as 
                 the summary. 

Analysis Streams:
   expr           Does expression summarization on probesets. 
   pca-select     Determines PCA for probes and picks probes that are
                  near the principal component as the probes to use
                  for downstream analysis. 
   spect-select   Picks probes that are similar to each other based 
                  on spectral cluster and normalized cut. 

version: apt-1.10.1 $Id: apt-probeset-genotype.cpp,v 1.236 2008/10/25 06:08:55 awilli Exp $

Frequently Asked Questions

Q. What is a probe_id?

A. See the FAQ item on probe IDs for more info.

Q. What do I do when I don't have enough memory to process all the data? (when --use-disk=false)

A. Starting with release 1.10.0, apt-probeset-genotype defaults to using temporary files rather than trying to keep everything in memory. The option "--use-disk=false" forces the older in-memory mode, which is sensitive to how much memory you have. To control how much memory is used in the in-memory mode you can manually set the --block-size option to specify how many probesets will be run at once. The program then reduces memory use by loading only those probesets into RAM. If the --block-size option is unset the program will attempt to determine how much RAM is available and size its blocks to fit. To fit in memory the program will often need to read the original CEL files multiple times. Also, if doing a quantile normalization, try using a sketch (or subset) of the chip for the normalization. Sketch normalization is the default, so this only applies if you are using non-default options.

Q. How can I make apt-probeset-genotype run more probesets per iteration? (when --use-disk=false)

A. Starting with release 1.10.0, apt-probeset-genotype defaults to using temporary files and a single iteration. If you use --use-disk=false to override this behavior, you can set the --block-size option manually; apt-probeset-genotype will then use the supplied value instead of guessing the number of probesets to run per iteration.

Q. The program died with an error message like "Assertion failed: A->probes.size() == 2, file ...cpp." What does this mean?

A. This is symptomatic of trying to run BRLMM for a SNP with no MM probes. In its typical mode of running, BRLMM relies on DM to generate initial seed calls, and the DM algorithm requires MM probes.

Q. The program died with an error message like "DmListenergetGenoCall() - Can't find genotypes for name: SNP_A-1780432". What does this mean?

A. This is symptomatic of having specified the wrong chrX file for the analysis. In order to reduce the likelihood of accidentally using the wrong chrX file apt-probeset-genotype checks to make sure that all the SNPs specified in the chrX file are present on the chips being analyzed. If it finds a SNP present in the chrX file that is not identified in the CDF file it will die with the above message. Note that if you want to bypass the requirement of a chrX file you can use the --no-gender-force option.

Q. The program died and I got an error message saying "Killed". What does this mean and what can I do?

A. Linux has a "feature" whereby it will promise more memory than it actually has, in the hope that programs won't all be using their memory at once. However, if Linux does run short of memory it will start killing programs arbitrarily. You can read more about Linux's OOM (out of memory) killer at LWN.net.

Q. Why does apt-probeset-genotype require information regarding SNPs on chromosomes X/Y/Mito?

A. The SNPs on chromosome X are evaluated separately for XX (female) and XY (male) individuals, as the intensity estimates for the males will generally be lower on X due to one missing chromosome. The prior is also adjusted to remove the het center, as XY individuals should only have hom calls on the X chromosome. For BRLMM analyses gender is estimated using the method employed in the GTYPE software: individuals are called male if less than 7.5% of the SNPs on X are called as hets by the initial DM calls using a 0.33 confidence threshold. For BRLMM-P gender is estimated by use of an Expectation Maximization (EM) algorithm on the PM probes for chrX SNPs to estimate the het rate.
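The BRLMM rule above can be sketched in a few lines of Python. This is only an illustration, not the code used by apt-probeset-genotype or GTYPE. The function name and the input layout are hypothetical, and two details are assumptions that the text does not spell out: a DM call counts only when its confidence score is at or below the threshold (smaller meaning more confident), and the het rate is taken over all chrX SNPs supplied.

```python
def estimate_gender(chrx_calls, het_cutoff=0.075, conf_threshold=0.33):
    """Call 'male' or 'female' from initial DM calls on chrX SNPs.

    chrx_calls: iterable of (genotype, confidence) pairs, where genotype is
    'AA', 'AB' or 'BB'. Assumption: a smaller confidence score means a more
    confident call, and calls above conf_threshold are treated as no-calls.
    """
    calls = list(chrx_calls)
    if not calls:
        raise ValueError("no chrX SNP calls supplied")
    # Count heterozygous calls that pass the confidence threshold.
    hets = sum(1 for genotype, conf in calls
               if genotype == "AB" and conf <= conf_threshold)
    # Assumption: the het rate is computed over all chrX SNPs supplied.
    het_rate = hets / len(calls)
    return "male" if het_rate < het_cutoff else "female"
```

A sample with 5 confident het calls out of 100 chrX SNPs has a het rate of 5%, which is below 7.5%, so it would be called male under these assumptions.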

Q. How is the mask section in the CEL file used?

A. It is not. The contents of this section of the CEL file are ignored.

Q. How can I find out more information about the analysis string:

    quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05

A. Start with the included manual for apt-probeset-genotype. Usage information is also provided if you run apt-probeset-genotype without any arguments. Lastly, use the "--explain" option for more information about particular analysis methods. For example:

    apt-probeset-genotype --explain brlmm-p

Q. Wild cards do not work on Windows. For example:

    apt-probeset-genotype ... *.CEL

A. APT relies on the command shell to do the wild card expansion (e.g. the bash shell on Unix-like systems). The Windows shell does not do wild card expansion, so there is no wild card expansion for APT when run from the Windows shell. You may want to try a different Windows shell or perhaps bash via Cygwin. Alternatively, see the --cel-files option for another way to specify CEL files for analysis.
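One way to work around the missing wild card expansion is to expand the pattern yourself and write the result to a file for the --cel-files option. The Python sketch below does this with the standard glob module. The function name and the output file name are hypothetical, and the "cel_files" header line is an assumption about the list format expected by --cel-files; check the description of that option before relying on it.

```python
import glob


def write_cel_list(pattern, out_path):
    """Expand a wild card pattern and write the matches to a text file.

    Assumption: the file read by --cel-files has a 'cel_files' header line
    followed by one CEL file path per line.
    """
    cel_paths = sorted(glob.glob(pattern))
    with open(out_path, "w", newline="\n") as handle:
        handle.write("cel_files\n")
        for path in cel_paths:
            handle.write(path + "\n")
    return cel_paths
```

Calling write_cel_list("*.CEL", "cel_list.txt") in the directory holding the CEL files would produce a list that can then be passed as --cel-files cel_list.txt.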

Advanced Topics

CHP File format.

This section explains the contents of CHP files for the various algorithms. For details on the formats or for an explanation of why the XDA CHP format is not supported for some chip types, see above.

XDA CHP file format

The XDA CHP file format is only supported for the BRLMM algorithm applied to the 100K or 500K arrays. Historically the genotyping XDA CHP file is closely tied to the DM model and while BRLMM uses the same format for backward compatibility it is important to note that the interpretation of some fields is different. Below are the names of the fields and corresponding BRLMM values that are stored in them.

The following parameters are saved in the CCHPFileHeader object:

A complete explanation of the XDA CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.

AGCC CHP file format

The AGCC CHP format consists of a header followed by a data section. The header section contains a large amount of information including the software version and the full set of parameters used in the clustering analysis. The data section consists of a matrix with a row for each SNP. The columns are:

Use of the clustering_space_x_value and clustering_space_y_value fields allows for plotting the data in the space that was used to perform the clustering. For BRLMM and BRLMM-P the x-value is 'transformed contrast' and the y-value is 'signal strength' - see the BRLMM and BRLMM-P whitepapers for more detail. For Birdseed (see http://www.broad.mit.edu/mpg/birdsuite/) this is A-signal vs. B-signal (linear scale, post quantile normalization and allele-specific median-polish).

A complete explanation of the AGCC CHP file format can be found in a local copy of the Affymetrix Developer's Network file format documentation.

A word about memory (RAM)

This section is only relevant if you specify --use-disk=false. Starting with release 1.10.0, the default behavior of apt-probeset-genotype is to use temporary files and a single iteration. As such, you should not see memory issues when using the default settings. If you decide to force the in-memory mode (--use-disk=false) then read on...

The most common challenge people have running apt-probeset-genotype is with RAM. apt-probeset-genotype will attempt to split up jobs into the amount of RAM that appears free on your computer. The job is split by subdividing the analysis into blocks of probesets (or SNPs). Small block sizes will subdivide the job more and require less memory at the expense of having to read the CEL files more frequently. On the other hand, using a smaller number of large blocks will require more memory but will place minimal load on reading CEL files.

The default behaviour (when forcing in-memory mode with --use-disk=false) is for apt-probeset-genotype to estimate the optimal block size based upon the amount of memory that appears to be free at the time the job begins. This default behaviour can be overridden by the user with the --block-size option, which specifies the number of probesets to be processed at one time. For example, specifying --block-size=20000 will analyze the data in batches of 20,000 probesets at a time.
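To make the trade-off concrete, the following Python sketch shows how a fixed block size splits a job into batches of probesets. It is a simplified illustration rather than the scheduling code in apt-probeset-genotype. The function name is hypothetical, and the idea that each block corresponds to one pass over the CEL files is an assumption based on the description above.

```python
def plan_blocks(n_probesets, block_size):
    """Return (start, end) index pairs, one per block of probesets.

    Assumption: each block is processed with one pass over the CEL files,
    so the number of blocks approximates the number of passes.
    """
    if n_probesets < 0:
        raise ValueError("n_probesets must not be negative")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [(start, min(start + block_size, n_probesets))
            for start in range(0, n_probesets, block_size)]
```

For instance, 50,000 probesets with a block size of 20,000 would be split into three blocks (two full blocks and one of 10,000 probesets). Halving the block size to 10,000 gives five blocks, each needing less memory.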

A problem that may be encountered (especially on a multi-user or multi-processor system) is running out of memory when a run of apt-probeset-genotype is initiated and then another big-memory process is started afterwards. In this circumstance the first instance of apt-probeset-genotype sees substantial free memory and chooses a large block size, but then the second process grabs more of the memory and the first run of apt-probeset-genotype runs out of memory. This problem can be addressed by planning the work load on your machine and/or using an appropriately small block size with the --block-size option.

RAM usage (in bytes) for Mapping 500K data can be estimated by the following equation:

\[RAM \approx C \times \big[ (B \times P \times (F + 1)) + (S \times F)\big] + (B \times N) + (B \times D) + K\]

Below are some guidelines about how many probesets to run at once (i.e. the --block-size) in 1.9 GB of RAM as a function of the number of CEL files:

Note that the above recommendations won't use all of the 1.9 GB of RAM. In addition to needing a relatively large amount of memory, the program also needs relatively large blocks of contiguous memory, and as RAM usage approaches the maximum available these get harder and harder to find. If you have memory to spare, the amount of RAM needed to run all the data at once, as a function of the number of chips, is:

Note that on most 32-bit (e.g. Pentium, Xeon, Windows) systems you can't use more than ~2 GB of RAM with a single process, even if there is more available.

Custom Analyses:

While aliases for common analyses such as brlmm with default parameters are provided, it is possible to construct custom analyses on the command line. There are both program options and analysis parameters that can be set to affect the results. Most people are familiar with the standard method for setting program options, but the specification of the analysis method and its parameters in apt-probeset-genotype works a little differently. Custom parameters are set by supplying a text representation of the analysis and the parameters desired. This enables flexibility, as each piece of an analysis is self-contained and the pieces can be (almost) arbitrarily combined. Note that when using a custom analysis rather than an alias it is necessary to specify the entire analysis; it is not acceptable to pass custom parameters to the alias. For example, if you wanted to change the number of iterations brlmm performs you would have to specify 'quant-norm.sketch=50000,pm-only,brlmm.iterations=1' rather than just typing 'brlmm.iterations=1'.

The current full default brlmm analysis is 'quant-norm.sketch=50000,pm-only,brlmm'. The entries are separated by commas: there can be multiple chipstream modules (in this case a single quant-norm), and the last two entries are the pm adjuster (pm-only) and the quantification method (brlmm). Parameters to a particular step in the analysis are supplied in key=value pairs and separated by periods. For example, 'quant-norm.sketch=50000' indicates that the chips should be quantile normalized and that a sketch (subset of total data) of size 50000 should be used to do the normalization. Using a sketch can significantly reduce the amount of memory needed with minimal impact on normalization values. To do quantile normalization with just the PM probes and resolve ties in the same manner as bioconductor's RMA version of quantile normalization you would specify 'quant-norm.sketch=50000.bioc=true.usepm=true'. All of the possible parameters can be seen by using the --explain option in conjunction with the name of the module (e.g. apt-probeset-genotype --explain quant-norm).
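The structure of an analysis string can be illustrated with a small Python sketch that splits a string into its modules and key=value parameters. This is not the parser used by apt-probeset-genotype, and the function names are hypothetical. Because periods separate parameters but also appear inside decimal values such as SB=0.003, the sketch assumes that any period-separated token without an "=" belongs to the preceding value.

```python
def parse_module(spec):
    """Split one module such as 'quant-norm.sketch=50000' into (name, params)."""
    tokens = spec.split(".")
    name = tokens[0]
    merged = []
    for token in tokens[1:]:
        if "=" in token:
            merged.append(token)
        elif merged:
            # Assumption: a token without '=' is the fractional part of the
            # previous value (e.g. 'SB=0' followed by '003' is 'SB=0.003').
            merged[-1] += "." + token
        else:
            raise ValueError("unexpected token %r in %r" % (token, spec))
    params = dict(item.split("=", 1) for item in merged)
    return name, params


def parse_analysis(analysis):
    """Split an analysis string into chipstream, pm adjuster and quantification."""
    modules = [parse_module(part) for part in analysis.split(",")]
    if len(modules) < 2:
        raise ValueError("need at least a pm adjuster and a quantification method")
    # The last two entries are the pm adjuster and the quantification method.
    *chipstream, pm_adjust, quant = modules
    return {"chipstream": chipstream, "pm_adjust": pm_adjust, "quant": quant}
```

Applied to 'quant-norm.sketch=50000,pm-only,brlmm-p.CM=1.bins=100.K=2.SB=0.003.MS=0.05', this yields one chipstream module (quant-norm with sketch set to 50000), the pm-only adjuster with no parameters, and brlmm-p with its five parameters. All values are kept as strings.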

A few example custom analyses:

'pm-only,brlmm.transform=rvt' - No normalization, use rvt space for clustering in brlmm.

'med-norm,pm-mm,brlmm.het-mult=.9' - Do a median normalization, use a PM-MM adjustment for probes and a het multiplier of .9 to try and balance hom/het calls.

'rma-bg,quant-norm.sketch=50000.usepm=true.bioc=true,pm-only,brlmm.K=4.transform=CCS' - Do an RMA style background subtraction, followed by an RMA style quantile normalization using a subset of 50000 data points, followed by brlmm in CCS (contrast centers space) with K = 4.
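When analysis strings are generated by a script, it can help to assemble them from their parts so that the entire analysis is always specified. The Python sketch below shows one way to do this. The helper names are hypothetical and are not part of APT, and no check is made that the modules or parameters actually exist.

```python
def module_spec(name, params=None):
    """Join a module name and its parameters with periods."""
    pairs = ["%s=%s" % (key, value) for key, value in (params or {}).items()]
    return ".".join([name] + pairs)


def analysis_spec(chipstream, pm_adjust, quant):
    """Join chipstream modules, pm adjuster and quantification with commas."""
    return ",".join(list(chipstream) + [pm_adjust, quant])


# Chipstream modules first, then the pm adjuster, then the quantification method.
custom = analysis_spec(
    [module_spec("quant-norm", {"sketch": 50000})],
    "pm-only",
    module_spec("brlmm", {"iterations": 1}),
)
```

Here custom holds 'quant-norm.sketch=50000,pm-only,brlmm.iterations=1', which can be passed as the analysis string. An empty chipstream list gives a string with no normalization step, such as 'pm-only,brlmm.transform=rvt'.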

Use the --explain option to get more information on what parameters are available for the various methods. For example, "--explain brlmm", "--explain brlmm-p", and "--explain birdseed".

Clustering Space Transformations:

There are a number of different transformations that are implemented for different spaces which can be specified via the transform parameter to brlmm and are detailed below. For all of these transformations $A$ and $B$ denote the intensity of the A and B alleles respectively as estimated by the quantification method (such as plier or RMA). $X$ and $Y$ denote the new coordinates that $A$ and $B$ will be transformed into.


Generated on Mon Nov 3 12:21:42 2008 for Affymetrix Power Tools by  doxygen 1.5.3